Ring Sizer 2.0, Foundation-Model Segmentation for Real-World Photos

2026/04/16

This is a follow-up to the previous ring sizer report. It documents the new upgrade, in which the two classical CV stages most responsible for real-world failures — a standard-size card detection and hand segmentation — were replaced by Meta’s Segment Anything, while keeping the entire pipeline local, CPU-only, and under 5 s of model compute per image.


1. Background

The earlier reports’ ±0.5 mm accuracy numbers were measured under lab conditions: a tripod, a ring flash, a clean white sheet of paper, and a standard-sized card placed coplanar to the hand on a uniform background. Real users do none of this. A photo from the wild looks like:

Both card and finger edge detection failures gated the project’s goal of a robust tool for real-world application. The premise of 2.0: replace both with a foundation segmentation model, keep everything else.


2. SAM and Model Size Choice

ConstraintChoiceReason
LicenseSAM 2.1 (Apache 2.0)Safe for public Spaces hosting. SAM 3 is research-only.
SizeHiera Tiny → Hiera Small (~150 MB)Fits HF free-tier CPU. Tiny and Small are visually indistinguishable on this task (IoU 0.987 vs 0.982 on the comparison image), so the upgrade was free in memory but slightly cleaner around nails.
Inference modePrompt-based, not AMGA single decoder pass on a point prompt is ~30× cheaper than the 256-prompt automatic mask grid we initially used for card detection.
TrainingZero-shotNo fine-tuning; the model is used as-is.

SAM 2.1 Small inference on CPU (Apple Silicon M5): ~0.5–0.7 s per prompt-based call.


3. Hand Mask Generation

For the hand, we have a strong prior — MediaPipe gives us 21 landmarks, including five (wrist, MCP of each finger) that lie firmly inside the palm. So instead of AMG, we feed SAM a single positive point at the palm centre (mean of landmarks 0, 5, 9, 13, 17).

A side-by-side experiment on a representative test image:

BackendTimeMask quality
MediaPipe (current)0.2 sPoor — convex polygon, doesn’t trace fingers, nails, or inter-finger gaps
SAM 2.1 Tiny (point prompt)0.6 sPixel-perfect. Every finger, nail, and the ring on the middle finger traced cleanly. IoU 0.987
SAM 2.1 Small (point prompt)0.7 sVisually indistinguishable from Tiny. IoU 0.982

4. Prompt-based Card Detection

Once hand segmentation had proved that prompt-based SAM was both fast and accurate, the same pattern was applied to card detection. The trick was finding seed points without knowing where the card was.

Solution: use the hand mask as a negative-space prior. Run hand segmentation first (~0.5 s), then sample a 5×5 grid of points on the background (outside the hand mask) — the card is somewhere on that background. For each seed:

SAM returns a handful of candidate masks per seed; the same rectangularity / aspect-ratio / area filters pick the card.


5. Architecture After the Upgrade

Image 
	→ MediaPipe landmarks
	→ SAM hand mask (palm-centre prompt)
	→ SAM card mask (background-seed prompts)
	→ Per-finger isolation against the SAM mask
	→ Landmark-based axis (MCP→PIP, proximal phalanx)
	→ Ring-zone localisation (anatomical)
	→ Width measurement: SAM mask boundary
	→ Confidence (4-component) → JSON + overlay PNG

What changed structurally


6. Performance and Footprint

StageCPU time (Apple Silicon M5)
Image quality + MediaPipe~0.2 s
SAM hand mask~0.5 s
SAM card mask (prompt)~1.3 s
Finger isolation + axis + zone<0.1 s
mask edge measurement<0.05 s
Visualisation + JSON~0.1 s
Total end-to-end~2.2 s

Memory: shared SAM 2.1 Small encoder ≈ 150 MB resident. First run downloads weights from HuggingFace (facebook/sam2.1-hiera-small); subsequent runs hit the local cache.

The end-to-end CPU budget on commodity Apple Silicon is comfortably under 3 s. HF free-tier CPU is slower; the prompt-based path should still fit inside the 5 s public budget, but final latency on that hardware is the next thing to benchmark.


7. Takeaways

1. Foundation models are not always the answer — but when two stages of classical CV both stop working in the same kinds of photos, they often are. The earlier reports’ accuracy work was real and valuable. 2.0’s premise was different: the failures we still saw in real-world photos were not measurement-precision failures but detection failures, where the input to the precise measurement was simply wrong. SAM directly addresses that class.

2. Prompt-based inference changes the deployment story. A 30× speedup over AMG (0.6 s vs 18–20 s) is the difference between an academic regression script and a Hugging Face Space. It also makes it cheap to run SAM twice per image — once for the hand, once for the card — using one as a prior for the other.


Net result: a measurement pipeline that no longer needs the user to find a white sheet of paper. The lab-condition accuracy from the earlier reports still holds when the conditions hold; what’s new is how far the conditions can drift before the pipeline gives up. Recalibrating the linear bias correction on top of the cleaner SAM-derived measurement is the next obvious piece of work.