name: detecting-tips-zones
description: Text-prompted image zone detection using TIPSv2 B/14 on CPU. Produces focus_targets / focus_edges bbox lists from natural-language labels, ready to feed into svg-portrait-mode. Use when you want automatic foreground/background separation from prompts like "dog face" + "wooden floor" instead of hand-annotating bboxes.
metadata:
version: 0.1.0
Detecting TIPS Zones
Zero-shot zone detection: text prompts → patch-grid cosine heatmaps → bboxes.
Companion to svg-portrait-mode — replaces manual focus_targets / focus_edges
annotation with a TIPSv2 B/14 forward pass.
Quick Start
```python
from tips_zones import detect_zones
from portrait_mode import portrait_mode

# Detect zones from text prompts...
focus_targets, focus_edges = detect_zones(
    "photo.jpg",
    targets=["dog face"],
    edges=["dog paws", "dog ears", "dog body"],
    distractors=["wooden floor", "carpet rug", "shoes", "wall"],
    ckpt_dir="/path/to/tips/checkpoints",
    tips_root="/path/to/tips",
)

# ...then hand the bboxes straight to svg-portrait-mode.
svg, stats = portrait_mode(
    "photo.jpg",
    focus_targets=focus_targets,
    focus_edges=focus_edges,
    style_transforms={"background": "desaturate:0.7"},
)
```
Amortise model load across multiple images:
```python
from tips_zones import load_models, detect_zones

# Load the vision/text encoders and tokenizer once, reuse them for every image.
models = load_models(ckpt_dir, tips_root, device="cpu")
for img in images:
    ft, fe = detect_zones(img, targets=[...], edges=[...], distractors=[...],
                          ckpt_dir=ckpt_dir, tips_root=tips_root, models=models)
    ...
```
How It Works
```
image → B/14 vision encoder (MaskCLIP values trick on last block)
      → (32×32 patch grid at 448, or 64×64 at 896) × 768-d patch features

text labels → prompt ensemble (9 TCL templates) → B/14 text encoder
            → per-label mean feature → L2-normalise

per-label heatmap = cos(patch feature, label feature)   # raw, no softmax
bbox = top-k% patches → largest connected component → scaled + padded to image coords
```
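The last line is the only step with real machinery behind it. A minimal sketch of the heatmap → bbox conversion, assuming a `(grid, grid)` array of raw per-patch cosines; the helper name and the exact padding/clamping are illustrative, not the skill's actual internals:

```python
import numpy as np
from scipy import ndimage

def heatmap_to_bbox(heatmap, img_w, img_h, top_frac=0.04, pad_frac=0.02):
    """Top-k% patches -> largest connected component -> padded pixel bbox."""
    grid_h, grid_w = heatmap.shape
    k = max(1, int(round(top_frac * heatmap.size)))
    thresh = np.sort(heatmap, axis=None)[-k]       # value of the k-th highest patch
    mask = heatmap >= thresh                       # keep only the top-k% patches

    labelled, n = ndimage.label(mask)              # connected components
    if n == 0:
        return None
    sizes = ndimage.sum(mask, labelled, range(1, n + 1))
    comp = labelled == (np.argmax(sizes) + 1)      # largest component only

    ys, xs = np.nonzero(comp)
    pad_x, pad_y = pad_frac * img_w, pad_frac * img_h
    # patch indices -> pixel coords, then pad and clamp to the image bounds
    x1 = max(0, int(xs.min() / grid_w * img_w - pad_x))
    y1 = max(0, int(ys.min() / grid_h * img_h - pad_y))
    x2 = min(img_w, int((xs.max() + 1) / grid_w * img_w + pad_x))
    y2 = min(img_h, int((ys.max() + 1) / grid_h * img_h + pad_y))
    return (x1, y1, x2, y2)
```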
Why no softmax over labels
Naïve softmax assumes labels are mutually exclusive. dog face, dog ears,
and dog body are all true of the same pixels, so softmax collapses to
near-uniform and every heatmap covers the whole subject. Raw cosines +
per-label top-k threshold works much better — at the cost of requiring
distractor labels to anchor the relative scale. Always pass some
distractors (floor, wall, props — whatever is in the scene but not the
subject).
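A toy illustration of the collapse, with made-up cosine scores for a single patch sitting on the dog's face:

```python
import numpy as np

labels = ["dog face", "dog ears", "dog body", "wooden floor"]
cos = np.array([0.31, 0.29, 0.30, 0.12])        # illustrative raw cosines for ONE patch

softmax = np.exp(cos) / np.exp(cos).sum()
for lab, s, p in zip(labels, cos, softmax):
    print(f"{lab:>13}  cos={s:.2f}  softmax={p:.3f}")
# softmax: 0.263 / 0.258 / 0.261 / 0.218 -> near-uniform, because the overlapping
# "dog ..." labels dilute each other and the distractor barely drops out.
# The raw cosines keep the subject/floor gap, which per-label top-k uses directly.
```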
Parameters
```python
detect_zones(
    image,                  # path | PIL Image
    targets,                # ["main subject label", ...]
    edges=(),               # ["sub-region label", ...]
    distractors=(),         # scene elements to anchor against — pass these!
    *,
    ckpt_dir,               # has tips_v2_oss_b14_{vision,text}.npz + tokenizer.model
    tips_root,              # local clone of google-deepmind/tips
    input_size=448,         # 448 → 32×32 grid, 896 → 64×64 (~12× slower on CPU)
    target_top_frac=0.04,   # fraction of patches kept per target label
    edge_top_frac=0.06,     # fraction of patches kept per edge label
    pad_frac=0.02,          # bbox padding as fraction of image dim
    device="cpu",
    models=None,            # optional pre-loaded (img_model, text_model, tokenizer)
)
```
Returns (focus_targets, focus_edges) — both lists of {'bbox': (x1,y1,x2,y2), 'label': str}.
Performance (CPU, 16 cores)
| Step | Time |
|---|---|
| load_models (warm) | ~3.5s |
| load_models (cold, over 9p) | ~50s |
| Text encoding (9 templates × N labels) | ~0.1s |
| Vision forward @ 448 | 0.3–0.6s |
| Vision forward @ 896 | ~6–7s |
Inference is negligible next to portrait_mode() on large images.
Capability Notes
Subject / background split: strong. B/14 separates subject from scene reliably — typical split ~30/70 subject:background on single-subject photos.
Sub-part discrimination: weak at B/14 + 448. "dog face" vs "dog paws" vs "dog ears" tend to fire on the same region. The 32×32 patch grid is not the bottleneck (64×64 at 896 barely helps); B/14's patch features just don't encode fine sub-part semantics strongly. If you need per-part zones:
- Sharpen prompts — "close-up of dog's furry face" > "dog face" (try first)
- L/14 or SO/14 model (richer features, larger download)
- Sliding-window inference (tile crops, stitch heatmaps; see the sketch below)
For coarse target/edge zoning (the portrait_mode use case), B/14 at 448 is
enough.
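For the sliding-window option, a rough sketch of the tiling and stitching loop. Here `score_crop` is a hypothetical callable that runs the vision encoder on one crop and returns its `(grid, grid)` cosine map for one label as a float32 array; none of this is part of `tips_zones`:

```python
import numpy as np
from PIL import Image

def sliding_window_heatmap(pil_img, score_crop, crop=448, stride=224):
    """Tile the image into overlapping crops, score each crop, and stitch the
    per-crop cosine maps into one full-size heatmap by averaging overlaps."""
    w, h = pil_img.size
    heat = np.zeros((h, w), dtype=np.float32)
    count = np.zeros((h, w), dtype=np.float32)
    for y in range(0, max(1, h - crop + 1), stride):
        for x in range(0, max(1, w - crop + 1), stride):
            tile = pil_img.crop((x, y, x + crop, y + crop))
            patch_map = score_crop(tile)                         # (grid, grid) float32 cosines
            up = np.array(Image.fromarray(patch_map).resize((crop, crop)))
            heat[y:y + crop, x:x + crop] += up[: h - y, : w - x]
            count[y:y + crop, x:x + crop] += 1.0
    # a complete version would add a final row/column of tiles flush with the
    # right and bottom edges so every pixel is covered
    return heat / np.maximum(count, 1.0)
```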
Requirements
Python deps:
```bash
pip install torch torchvision tensorflow tensorflow-text scipy pillow numpy --break-system-packages -q
```
Upstream TIPS repo (for the tips.pytorch image/text encoder modules):
```bash
git clone https://github.com/google-deepmind/tips /path/to/tips
```
B/14 checkpoints (~500MB total) go in a directory passed as ckpt_dir:
- `tips_v2_oss_b14_vision.npz`
- `tips_v2_oss_b14_text.npz`
- `tokenizer.model`
Download links are in the TIPS repo README.
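A quick sanity check that `ckpt_dir` has everything before the first (slow) `load_models` call; the path is a placeholder:

```python
from pathlib import Path

ckpt_dir = Path("/path/to/tips/checkpoints")
expected = ["tips_v2_oss_b14_vision.npz", "tips_v2_oss_b14_text.npz", "tokenizer.model"]
missing = [f for f in expected if not (ckpt_dir / f).is_file()]
if missing:
    raise FileNotFoundError(f"ckpt_dir is missing: {missing}")
```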
Prompt Engineering Tips
- Always include distractors. Without them, top-k thresholding has no relative scale. 3–7 distractors covering scene elements (floor, wall, background objects) is the sweet spot.
- Use concrete nouns over abstract ones. "carpet rug" > "textured floor".
- Top_frac tuning. If a target bbox is too small, raise `target_top_frac` (0.04 → 0.08). Too big / bleeding into the scene: lower it.
- Pad modestly. `pad_frac=0.02` works for most photos; raise to 0.05 for subjects near frame edges.
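Putting the last two knobs together, a hypothetical second pass after the first bbox came back clipping the subject (the values are a starting point, not a recommendation):

```python
focus_targets, focus_edges = detect_zones(
    "photo.jpg",
    targets=["dog face"],
    edges=["dog ears", "dog body"],
    distractors=["wooden floor", "carpet rug", "wall"],
    ckpt_dir=ckpt_dir,
    tips_root=tips_root,
    target_top_frac=0.08,   # was 0.04: the face bbox was too tight
    pad_frac=0.05,          # subject sits close to the frame edge
)
```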
EXIF Caveat
portrait_mode (via OpenCV) honours EXIF rotation. PIL (this skill's
preprocessing) does not. For correctly-oriented source images they agree; for
EXIF-rotated phone photos the detected bboxes will be in the raw pixel
orientation. Either:
- Re-save the source with the EXIF rotation baked into the pixels: `ImageOps.exif_transpose(Image.open(p)).save(p)`
- Or call `ImageOps.exif_transpose(pil)` before passing to `detect_zones`.
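The second option end to end (passing a PIL image is allowed, per the Parameters section):

```python
from PIL import Image, ImageOps
from tips_zones import detect_zones

pil = ImageOps.exif_transpose(Image.open("phone_photo.jpg"))   # apply EXIF rotation to pixels
focus_targets, focus_edges = detect_zones(
    pil,
    targets=["dog face"],
    distractors=["wooden floor", "wall"],
    ckpt_dir=ckpt_dir,
    tips_root=tips_root,
)
```

The file on disk is untouched; since `portrait_mode` already honours the EXIF tag when it reads the path, both ends stay in the same display orientation.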