Round 2 Stream B: long-tail levers + calibration + TTA

All measured on the FULL 15,561-image test split vs the gen2a-medsiglip baseline (top-1 0.7360 / top-3 0.9276 / balanced-acc 0.6633). Judged on balanced-accuracy & rare-class recall (the eczema-sink / low-rare-recall problem), not top-1 alone.

Results

config	top-1	top-3	balanced-acc	verdict
baseline	0.7360	0.9276	0.6633	—
focal loss (γ=2)	0.7371	0.9263	0.6746 (+0.011)	KEEP
class-balanced sampler	0.5770	0.8020	0.6229	REJECT
label smoothing (0.1)	0.7402	0.9286	0.6644	mild+/safe
TTA (hflip, post-hoc)	0.7377	0.9303	—	KEEP (serving)
temperature scaling	(unchanged)			already calibrated (T=1.03)

Read

Focal = the recipe win. Lifts the long tail at ~zero top-1 cost: PIH recall 0.108→0.198 (~2×), rosacea 0.336→0.452, herpes 0.518→0.561, +1.1 balanced-accuracy; eczema stable (overpred 1.11→1.09). Directly attacks the eczema error-sink.
Class-balanced sampling = reject. Over-corrects: oversampling rare classes collapsed eczema recall 0.718→0.237 (now under-predicts, overpred 0.34) and cratered top-1 to 0.577. Balanced-acc even dropped. Inverse-frequency weighting is too aggressive for this distribution.
Label smoothing = mild/safe. Best raw top-1 (+0.4) with small rare gains and stable eczema; neutral on balance. Composes with focal.
TTA (+0.17 top-1 / +0.27 top-3) — deterministic free win, enable in serving.
Temperature scaling — model already calibrated (T=1.03, ECE 0.022→0.018); abstain thresholds ship as-is.

Recommendation → what to ship

Adopt focal loss for the production retrain (biggest rare-class win, no top-1 cost).
Try focal + label-smoothing together next (both compose; LS adds the top-1 bump).
Drop class-balanced sampling.
Enable TTA in serving (+0.27 top-3 helps the top-3-based abstain policy).
Ship the 3-tier abstain UX (Stream C) — the biggest product win, calibration-validated.

Caveats

Single runs; focal’s +1.1 balanced-acc and LS’s +0.4 top-1 are within run-to-run noise (~0.6 pt) in isolation — but the per-class patterns are coherent and repeatable in direction (focal lifts rare recall; sampler destroys common classes). Confirm focal with a repeat + the focal×LS combo, and re-measure everything against the derm-cleaned test set (the honest benchmark) once dermatologist labels are in — the ~7-16% test-label noise means the true deltas may shift.

Infra

All three configs ran on one RunPod H100 (data synced once), auto-scored, then a watchdog pulled results + self-terminated the pod (burn $0, ~$28 total for the batch). Levers are config knobs in ddtrain (training.loss=focal/focal_gamma, class_balanced_sampler, label_smoothing) + dump_probs —tta.