name: detecting-tips-zones
description: Text-prompted image zone detection using TIPSv2 B/14 on CPU. Produces focus_targets / focus_edges bbox lists from natural-language labels, ready to feed into svg-portrait-mode. Use when you want automatic foreground/background separation from prompts like "dog face" + "wooden floor" instead of hand-annotating bboxes.
metadata:
version: 0.1.0
Detecting TIPS Zones
Zero-shot zone detection: text prompts → patch-grid cosine heatmaps → bboxes.
Companion to svg-portrait-mode — replaces manual focus_targets / focus_edges
annotation with a TIPSv2 B/14 forward pass.
Quick Start
```python
from tips_zones import detect_zones
from portrait_mode import portrait_mode

# Detect zones from text prompts...
focus_targets, focus_edges = detect_zones(
    "photo.jpg",
    targets=["dog face"],
    edges=["dog paws", "dog ears", "dog body"],
    distractors=["wooden floor", "carpet rug", "shoes", "wall"],
    ckpt_dir="/path/to/tips/checkpoints",
    tips_root="/path/to/tips",
)

# ...then hand the bboxes straight to svg-portrait-mode.
svg, stats = portrait_mode(
    "photo.jpg",
    focus_targets=focus_targets,
    focus_edges=focus_edges,
    style_transforms={"background": "desaturate:0.7"},
)
```
Amortise model load across multiple images:
```python
from tips_zones import load_models, detect_zones

# Load the vision/text encoders and tokenizer once, reuse them for every image.
models = load_models(ckpt_dir, tips_root, device="cpu")
for img in images:
    ft, fe = detect_zones(img, targets=[...], edges=[...], distractors=[...],
                          ckpt_dir=ckpt_dir, tips_root=tips_root, models=models)
    ...
```
How It Works
```
image → B/14 vision encoder (MaskCLIP values trick on last block)
      → (32×32 patch grid at 448, or 64×64 at 896) × 768-d patch features

text labels → prompt ensemble (9 TCL templates) → B/14 text encoder
            → per-label mean feature → L2-normalise

per-label heatmap = cos(patch feature, label feature)   # raw, no softmax
bbox = top-k% patches → largest connected component → scaled + padded to image coords
```
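The last line is the only step with real machinery behind it. A minimal sketch of the heatmap → bbox conversion, assuming a `(grid, grid)` array of raw per-patch cosines; the helper name and the exact padding/clamping are illustrative, not the skill's actual internals:

```python
import numpy as np
from scipy import ndimage

def heatmap_to_bbox(heatmap, img_w, img_h, top_frac=0.04, pad_frac=0.02):
    """Top-k% patches -> largest connected component -> padded pixel bbox."""
    grid_h, grid_w = heatmap.shape
    k = max(1, int(round(top_frac * heatmap.size)))
    thresh = np.sort(heatmap, axis=None)[-k]       # value of the k-th highest patch
    mask = heatmap >= thresh                       # keep only the top-k% patches

    labelled, n = ndimage.label(mask)              # connected components
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labelled, range(1, n + 1))
    comp = labelled == (np.argmax(sizes) + 1)      # largest component only

    ys, xs = np.nonzero(comp)
    pad_x, pad_y = pad_frac * img_w, pad_frac * img_h
    # patch indices -> pixel coords, then pad and clamp to the image bounds
    x1 = max(0, int(xs.min() / grid_w * img_w - pad_x))
    y1 = max(0, int(ys.min() / grid_h * img_h - pad_y))
    x2 = min(img_w, int((xs.max() + 1) / grid_w * img_w + pad_x))
    y2 = min(img_h, int((ys.max() + 1) / grid_h * img_h + pad_y))
    return (x1, y1, x2, y2)
```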
Why no softmax over labels
Naïve softmax assumes labels are mutually exclusive. dog face, dog ears,
and dog body are all true of the same pixels, so softmax collapses to
near-uniform and every heatmap covers the whole subject. Raw cosines +
per-label top-k threshold works much better — at the cost of requiring
distractor labels to anchor the relative scale. Always pass some
distractors (floor, wall, props — whatever is in the scene but not the
subject).
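A toy illustration of the collapse, with made-up cosine scores for a single patch sitting on the dog's face:

```python
import numpy as np

labels = ["dog face", "dog ears", "dog body", "wooden floor"]
cos = np.array([0.31, 0.29, 0.30, 0.12])        # illustrative raw cosines for ONE patch

softmax = np.exp(cos) / np.exp(cos).sum()
for lab, s, p in zip(labels, cos, softmax):
    print(f"{lab:>13}  cos={s:.2f}  softmax={p:.3f}")
# softmax: 0.263 / 0.258 / 0.261 / 0.218 -> near-uniform, because the overlapping
# "dog ..." labels dilute each other and the distractor barely drops out.
# The raw cosines keep the subject/floor gap, which per-label top-k uses directly.
```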
Parameters
```python
detect_zones(
    image,                  # path | PIL Image
    targets,                # ["main subject label", ...]
    edges=(),               # ["sub-region label", ...]
    distractors=(),         # scene elements to anchor against — pass these!
    *,
    ckpt_dir,               # has tips_v2_oss_b14_{vision,text}.npz + tokenizer.model
    tips_root,              # local clone of google-deepmind/tips
    input_size=448,         # 448 → 32×32 grid, 896 → 64×64 (~12× slower on CPU)
    target_top_frac=0.04,   # fraction of patches kept per target label
    edge_top_frac=0.06,     # fraction of patches kept per edge label
    pad_frac=0.02,          # bbox padding as fraction of image dim
    device="cpu",
    models=None,            # optional pre-loaded (img_model, text_model, tokenizer)
)
```
Returns (focus_targets, focus_edges) — both lists of {'bbox': (x1,y1,x2,y2), 'label': str}.
Performance (CPU, 16 cores)
| Step | Time |
|---|---|
| load_models (warm) | ~3.5s |
| load_models (cold, over 9p) | ~50s |
| Text encoding (9 templates × N labels) | ~0.1s |
| Vision forward @ 448 | 0.3–0.6s |
| Vision forward @ 896 | ~6–7s |
Inference is negligible next to portrait_mode() on large images.
Capability Notes
Subject / background split: strong. B/14 separates subject from scene reliably — typical split ~30/70 subject:background on single-subject photos.
Sub-part discrimination: weak at B/14 + 448. "dog face" vs "dog paws" vs "dog ears" tend to fire on the same region. The 32×32 patch grid is not the bottleneck (64×64 at 896 barely helps); B/14's patch features just don't encode fine sub-part semantics strongly. If you need per-part zones:
- Sharpen prompts — "close-up of dog's furry face" > "dog face" (try first)
- L/14 or SO/14 model (richer features, larger download)
- Sliding-window inference (tile crops, stitch heatmaps; see the sketch below)
For coarse target/edge zoning (the portrait_mode use case), B/14 at 448 is
enough.
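For the sliding-window option, a rough sketch of the tiling and stitching loop. Here `score_crop` is a hypothetical callable that runs the vision encoder on one crop and returns its `(grid, grid)` cosine map for one label as a float32 array; none of this is part of `tips_zones`:

```python
import numpy as np
from PIL import Image

def sliding_window_heatmap(pil_img, score_crop, crop=448, stride=224):
    """Tile the image into overlapping crops, score each crop, and stitch the
    per-crop cosine maps into one full-size heatmap by averaging overlaps."""
    w, h = pil_img.size
    heat = np.zeros((h, w), dtype=np.float32)
    count = np.zeros((h, w), dtype=np.float32)
    for y in range(0, max(1, h - crop + 1), stride):
        for x in range(0, max(1, w - crop + 1), stride):
            tile = pil_img.crop((x, y, x + crop, y + crop))
            patch_map = score_crop(tile)                         # (grid, grid) float32 cosines
            up = np.array(Image.fromarray(patch_map).resize((crop, crop)))
            heat[y:y + crop, x:x + crop] += up[: h - y, : w - x]
            count[y:y + crop, x:x + crop] += 1.0
    # a complete version would add a final row/column of tiles flush with the
    # right and bottom edges so every pixel is covered
    return heat / np.maximum(count, 1.0)
```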
Requirements
Python deps:
```bash
pip install torch torchvision tensorflow tensorflow-text scipy pillow numpy --break-system-packages -q
```
Upstream TIPS repo (for the tips.pytorch image/text encoder modules):
```bash
git clone https://github.com/google-deepmind/tips /path/to/tips
```
B/14 checkpoints (~500MB total) go in a directory passed as ckpt_dir:
- `tips_v2_oss_b14_vision.npz`
- `tips_v2_oss_b14_text.npz`
- `tokenizer.model`
Download links are in the TIPS repo README.
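A quick sanity check that `ckpt_dir` has everything before the first (slow) `load_models` call; the path is a placeholder:

```python
from pathlib import Path

ckpt_dir = Path("/path/to/tips/checkpoints")
expected = ["tips_v2_oss_b14_vision.npz", "tips_v2_oss_b14_text.npz", "tokenizer.model"]
missing = [f for f in expected if not (ckpt_dir / f).is_file()]
if missing:
    raise FileNotFoundError(f"ckpt_dir is missing: {missing}")
```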
Prompt Engineering Tips
- Always include distractors. Without them, top-k thresholding has no relative scale. 3–7 distractors covering scene elements (floor, wall, background objects) is the sweet spot.
- Use concrete nouns over abstract ones. "carpet rug" > "textured floor".
- Top_frac tuning. If a target bbox is too small, raise `target_top_frac` (0.04 → 0.08). Too big / bleeding into the scene: lower it.
- Pad modestly. `pad_frac=0.02` works for most photos; raise to 0.05 for subjects near frame edges.
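Putting the last two knobs together, a hypothetical second pass after the first bbox came back clipping the subject (the values are a starting point, not a recommendation):

```python
focus_targets, focus_edges = detect_zones(
    "photo.jpg",
    targets=["dog face"],
    edges=["dog ears", "dog body"],
    distractors=["wooden floor", "carpet rug", "wall"],
    ckpt_dir=ckpt_dir,
    tips_root=tips_root,
    target_top_frac=0.08,   # was 0.04: the face bbox was too tight
    pad_frac=0.05,          # subject sits close to the frame edge
)
```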
EXIF Caveat
portrait_mode (via OpenCV) honours EXIF rotation. PIL (this skill's
preprocessing) does not. For correctly-oriented source images they agree; for
EXIF-rotated phone photos the detected bboxes will be in the raw pixel
orientation. Either:
- Re-save the source with the EXIF rotation baked into the pixels: `ImageOps.exif_transpose(Image.open(p)).save(p)`
- Or call `ImageOps.exif_transpose(pil)` before passing to `detect_zones`.
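The second option end to end (passing a PIL image is allowed, per the Parameters section):

```python
from PIL import Image, ImageOps
from tips_zones import detect_zones

pil = ImageOps.exif_transpose(Image.open("phone_photo.jpg"))   # apply EXIF rotation to pixels
focus_targets, focus_edges = detect_zones(
    pil,
    targets=["dog face"],
    distractors=["wooden floor", "wall"],
    ckpt_dir=ckpt_dir,
    tips_root=tips_root,
)
```

The file on disk is untouched; since `portrait_mode` already honours the EXIF tag when it reads the path, both ends stay in the same display orientation.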