Round 2 Stream B: long-tail levers + calibration + TTA
All measured on the FULL 15,561-image test split vs the gen2a-medsiglip baseline (top-1 0.7360 / top-3 0.9276 / balanced-acc 0.6633). Judged on balanced-accuracy & rare-class recall (the eczema-sink / low-rare-recall problem), not top-1 alone.
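For readers less familiar with the headline metric, here is a minimal sketch of balanced accuracy as the unweighted mean of per-class recall. It assumes the scorer uses the standard macro-recall definition, which the report does not spell out; the function names and toy labels are illustrative only, not taken from the test split.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Return {class: recall} for every class that appears in y_true."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return {c: hits[c] / totals[c] for c in totals}


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (macro recall)."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def top1_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels (hypothetical): a degenerate model that always answers "eczema".
toy_true = ["eczema"] * 8 + ["rosacea", "PIH"]
toy_pred = ["eczema"] * 10

print("top-1:", top1_accuracy(toy_true, toy_pred))
print("balanced-acc:", balanced_accuracy(toy_true, toy_pred))
print("recall:", per_class_recall(toy_true, toy_pred))
```

On the toy data, top-1 looks healthy while balanced accuracy exposes the zero recall on the two rare classes. That is the sink behaviour these metrics are meant to catch.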
Results
| config | top-1 | top-3 | balanced-acc | verdict |
|---|---|---|---|---|
| baseline | 0.7360 | 0.9276 | 0.6633 | - |
| focal loss (γ=2) | 0.7371 | 0.9263 | 0.6746 (+0.011) | KEEP |
| class-balanced sampler | 0.5770 | 0.8020 | 0.6229 | REJECT |
| label smoothing (0.1) | 0.7402 | 0.9286 | 0.6644 | mild+/safe |
| TTA (hflip, post-hoc) | 0.7377 | 0.9303 | - | KEEP (serving) |
| temperature scaling | (unchanged) | (unchanged) | (unchanged) | already calibrated (T=1.03) |
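
For reference, a minimal sketch of what the focal loss (γ=2) lever computes per sample, assuming the standard multi-class form -(1 - p_t)^γ · log(p_t) with no per-class alpha weights. The actual ddtrain implementation may differ; the helper names and toy logits below are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Plain cross-entropy for one sample: -log(p_t)."""
    p_t = max(softmax(logits)[target], 1e-12)  # clamp to avoid log(0)
    return -math.log(p_t)


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)**gamma * log(p_t)."""
    p_t = max(softmax(logits)[target], 1e-12)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Toy logits (hypothetical): one confidently correct sample, one badly missed sample.
easy = [4.0, 0.0, 0.0]  # true class 0 already gets most of the probability
hard = [0.0, 2.0, 0.0]  # true class 0 gets little probability

for name, logits in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(logits, 0)
    fl = focal_loss(logits, 0, gamma=2.0)
    print(f"{name}: ce={ce:.4f} focal={fl:.4f} kept={fl / ce:.4f}")
```

The `kept` ratio is the modulating factor (1 - p_t)^γ: close to zero for the easy sample and close to one for the hard sample, so the gradient budget shifts toward examples the model still gets wrong. With `gamma=0.0` the function reduces to plain cross-entropy.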
Read
- Focal = the recipe win. Lifts the long tail at ~zero top-1 cost: PIH recall 0.108→0.198 (~2×), rosacea 0.336→0.452, herpes 0.518→0.561, +1.1 balanced-accuracy; eczema stable (overpred 1.11→1.09). Directly attacks the eczema error-sink.
- Class-balanced sampling = reject. Over-corrects: oversampling rare classes collapsed eczema recall 0.718→0.237 (now under-predicts, overpred 0.34) and cratered top-1 to 0.577. Balanced-acc even dropped. Inverse-frequency weighting is too aggressive for this distribution.
- Label smoothing = mild/safe. Best raw top-1 (+0.4 pt) with small rare gains and stable eczema; neutral on balanced-acc (0.6644 vs 0.6633). Composes with focal.
- TTA (+0.17 top-1 / +0.27 top-3): deterministic free win, enable in serving.
- Temperature scaling: model already calibrated (T=1.03, ECE 0.022→0.018); abstain thresholds ship as-is.
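The temperature-scaling check can be sketched as follows: fit a single scalar T on held-out logits by minimising NLL, then compare ECE before and after. This is a minimal sketch assuming a grid search and 10 equal-width confidence bins; the report does not state how T or ECE were actually computed, and the toy logits are deliberately overconfident placeholders.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature (numerically stable)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood at a given temperature."""
    losses = [-math.log(max(softmax(row, temperature)[y], 1e-12))
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels, lo=0.5, hi=3.0, steps=251):
    """Grid-search the scalar T in [lo, hi] that minimises held-out NLL."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


def expected_calibration_error(logit_rows, labels, temperature=1.0, n_bins=10):
    """ECE: bin by top-1 confidence, size-weighted mean of |accuracy - confidence|."""
    bins = [[0, 0.0, 0] for _ in range(n_bins)]  # [count, sum of confidence, correct]
    for row, y in zip(logit_rows, labels):
        probs = softmax(row, temperature)
        conf = max(probs)
        pred = probs.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += int(pred == y)
    n = len(labels)
    return sum(abs(correct / count - conf_sum / count) * count / n
               for count, conf_sum, correct in bins if count)


# Toy held-out set (hypothetical): ~79% confidence but only 60% accuracy.
toy_logits = [[2.0, 0.0, 0.0]] * 10
toy_labels = [0] * 6 + [1] * 4

t_fit = fit_temperature(toy_logits, toy_labels)
print("fitted T:", round(t_fit, 2))
print("ECE before:", round(expected_calibration_error(toy_logits, toy_labels, 1.0), 4))
print("ECE after:", round(expected_calibration_error(toy_logits, toy_labels, t_fit), 4))
```

Dividing logits by a positive T never changes the argmax, so top-1 and top-3 are untouched by construction. A fitted T close to 1, as reported here, means the raw probabilities were already close to calibrated.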
Recommendation: what to ship
- Adopt focal loss for the production retrain (biggest rare-class win, no top-1 cost).
- Try focal + label-smoothing together next (both compose; LS adds the top-1 bump).
- Drop class-balanced sampling.
- Enable TTA in serving (+0.27 top-3 helps the top-3-based abstain policy).
- Ship the 3-tier abstain UX (Stream C): the biggest product win, calibration-validated.
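A minimal sketch of the hflip TTA recommended for serving: run the model on the image and its mirror, average the class probabilities, then take the top-3. It assumes probabilities (not logits) are averaged with equal weight, which the report does not specify. `toy_predict_probs` is a hypothetical stand-in for the real model, and images are plain nested lists to keep the example dependency-free.

```python
def hflip(image):
    """Horizontally flip an image stored as rows of pixels (H x W nested lists)."""
    return [row[::-1] for row in image]


def predict_with_hflip_tta(predict_probs, image):
    """Average class probabilities over the image and its horizontal mirror."""
    p_orig = predict_probs(image)
    p_flip = predict_probs(hflip(image))
    return [(a + b) / 2.0 for a, b in zip(p_orig, p_flip)]


def top_k(probs, k=3):
    """Indices of the k most probable classes, most probable first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]


def toy_predict_probs(image):
    """Hypothetical model stub whose output depends on which side is brighter."""
    left = sum(row[0] for row in image)
    right = sum(row[-1] for row in image)
    return [0.7, 0.2, 0.1] if left > right else [0.3, 0.4, 0.3]


toy_image = [[1, 0], [1, 0]]  # bright left column
tta_probs = predict_with_hflip_tta(toy_predict_probs, toy_image)
print("TTA probs:", tta_probs)
print("top-3:", top_k(tta_probs, k=3))
```

There is no random augmentation involved, so the same image always yields the same output. The cost is one extra forward pass per request.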
Caveats
Single runs; focal's +1.1 balanced-acc and LS's +0.4 top-1 are within run-to-run noise (~0.6 pt) in isolation, but the per-class patterns are coherent and repeatable in direction (focal lifts rare recall; sampler destroys common classes). Confirm focal with a repeat + the focal×LS combo, and re-measure everything against the derm-cleaned test set (the honest benchmark) once dermatologist labels are in: the ~7-16% test-label noise means the true deltas may shift.
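One way to run the repeat check, as a rough sketch: collect the same metric from several seeds per arm and compare the mean delta with its standard error. The function name and the run values below are placeholders for illustration, not measured results, and the 2x threshold is a common rule of thumb rather than anything this report prescribes.

```python
from math import sqrt
from statistics import mean, stdev


def delta_vs_noise(baseline_runs, candidate_runs):
    """Return (delta, standard_error, ratio) for candidate minus baseline.

    Each argument is a list of one metric (e.g. balanced-acc) from repeat
    runs that differ only in seed. Needs at least 2 runs per arm.
    """
    delta = mean(candidate_runs) - mean(baseline_runs)
    se = sqrt(stdev(baseline_runs) ** 2 / len(baseline_runs)
              + stdev(candidate_runs) ** 2 / len(candidate_runs))
    ratio = delta / se if se > 0 else float("inf")
    return delta, se, ratio


# Placeholder values (hypothetical, NOT measured): three seeds per arm.
placeholder_baseline = [0.660, 0.666, 0.664]
placeholder_focal = [0.672, 0.678, 0.675]

delta, se, ratio = delta_vs_noise(placeholder_baseline, placeholder_focal)
print(f"delta={delta:.4f} se={se:.4f} ratio={ratio:.1f}")
print("clears 2x standard error:", ratio > 2.0)
```

This only covers seed-to-seed variance. It says nothing about the test-label noise, which needs the derm-cleaned set.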
Infra
All three configs ran on one RunPod H100 (data synced once) and were auto-scored; a watchdog then pulled results and self-terminated the pod (burn $0, ~$28 total for the batch). Levers are config knobs in ddtrain (training.loss=focal/focal_gamma, class_balanced_sampler, label_smoothing) + dump_probs --tta.