Round 2 Stream B: recipe/calibration results (focal wins, sampler busts)

Round 2 Stream B: long-tail levers + calibration + TTA

All measured on the FULL 15,561-image test split vs the gen2a-medsiglip baseline (top-1 0.7360 / top-3 0.9276 / balanced-acc 0.6633). Judged on balanced-accuracy & rare-class recall (the eczema-sink / low-rare-recall problem), not top-1 alone.
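
To make the judging criterion concrete, here is a minimal sketch of balanced accuracy, assuming the standard definition (the unweighted mean of per-class recall). The toy labels are made up for illustration and are not from the test split; the function names are hypothetical, not ddtrain's.

```python
from collections import defaultdict

def per_class_recall(y_true, y_pred):
    """Recall for every class that appears in y_true."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {c: correct[c] / total[c] for c in total}

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall (assumed definition)."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

# Toy data (made up): 8 common-class images, 2 rare-class images,
# and a model that always predicts the common class.
y_true = ["eczema"] * 8 + ["PIH"] * 2
y_pred = ["eczema"] * 10

top1 = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred)
```

In this toy case top-1 looks healthy while balanced accuracy is dragged down by the rare class that is never predicted, which is the sink pattern this round is judged on.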

Results

| config | top-1 | top-3 | balanced-acc | verdict |
|---|---|---|---|---|
| baseline | 0.7360 | 0.9276 | 0.6633 | - |
| focal loss (γ=2) | 0.7371 | 0.9263 | 0.6746 (+0.011) | KEEP |
| class-balanced sampler | 0.5770 | 0.8020 | 0.6229 | REJECT |
| label smoothing (0.1) | 0.7402 | 0.9286 | 0.6644 | mild+/safe |
| TTA (hflip, post-hoc) | 0.7377 | 0.9303 | - | KEEP (serving) |
| temperature scaling | (unchanged) | (unchanged) | (unchanged) | already calibrated (T=1.03) |
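
The temperature-scaling row reports accuracy as unchanged because dividing logits by a positive temperature never changes their ranking; only the confidences move. A minimal sketch of that, plus a common equal-width-bin ECE estimate, follows. The bin count and the toy logits are assumptions; the source does not say how its ECE was computed.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature (numerically stabilised)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += len(b) / n * abs(acc - avg_conf)
    return ece

# Toy logits (made up). T=1.03 is the fitted value reported in the table.
logits = [2.0, 1.0, 0.1]
p_raw = softmax(logits)
p_scaled = softmax(logits, temperature=1.03)

# Toy calibration check: four predictions at 0.9 confidence, three correct.
toy_ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False])
```

With T this close to 1, the scaled probabilities are only slightly softer than the raw ones, consistent with the small ECE change reported.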

Read

  • Focal = the recipe win. Lifts the long tail at ~zero top-1 cost: PIH recall 0.108→0.198 (~1.8×), rosacea 0.336→0.452, herpes 0.518→0.561, +1.1 pt balanced-accuracy; eczema stable (overpred 1.11→1.09). Directly attacks the eczema error-sink.
  • Class-balanced sampling = reject. Over-corrects: oversampling rare classes collapsed eczema recall 0.718→0.237 (now under-predicted, overpred 0.34) and cratered top-1 to 0.577. Balanced-acc dropped too (0.6633→0.6229). Inverse-frequency weighting is too aggressive for this distribution.
  • Label smoothing = mild/safe. Best raw top-1 (+0.4) with small rare gains and stable eczema; neutral on balance. Composes with focal.
  • TTA = free win (+0.17 top-1 / +0.27 top-3). Deterministic; enable in serving.
  • Temperature scaling = no-op. Model already calibrated (T=1.03, ECE 0.022→0.018); abstain thresholds ship as-is.
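
A minimal sketch of why focal loss with γ=2 shifts training effort toward hard (often rare-class) examples. It assumes the standard form FL = -(1 - p_t)^γ · log(p_t) with no per-class α weighting; the source does not show how ddtrain implements it, so the function names here are hypothetical.

```python
import math

def cross_entropy(p_true):
    """Cross-entropy for one sample, given the probability of the true class."""
    return -math.log(p_true)

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one sample: cross-entropy scaled by (1 - p_true)^gamma."""
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)

# Toy probabilities (made up): an easy example the model already gets right
# and a hard one it mostly gets wrong.
easy_p, hard_p = 0.9, 0.1

# Fraction of the cross-entropy that survives the focal modulation.
easy_keep = focal_loss(easy_p) / cross_entropy(easy_p)   # (1 - 0.9)^2
hard_keep = focal_loss(hard_p) / cross_entropy(hard_p)   # (1 - 0.1)^2

# gamma=0 recovers plain cross-entropy.
plain = focal_loss(hard_p, gamma=0.0)
```

Easy examples keep roughly 1% of their loss while hard ones keep roughly 81%, so the gradient is dominated by samples the model still gets wrong. Unlike class-balanced sampling, this leaves the class distribution seen during training untouched, which may explain why eczema stays stable.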

Recommendation → what to ship

  1. Adopt focal loss for the production retrain (biggest rare-class win, no top-1 cost).
  2. Try focal + label-smoothing together next (both compose; LS adds the top-1 bump).
  3. Drop class-balanced sampling.
  4. Enable TTA in serving (+0.27 top-3 helps the top-3-based abstain policy).
  5. Ship the 3-tier abstain UX (Stream C): the biggest product win, calibration-validated.
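
For the serving recommendation, horizontal-flip TTA can be sketched as below. This assumes class probabilities are averaged over the original and mirrored views; the source does not say whether the scoring tool averages probabilities or logits. `toy_model` is a hypothetical stand-in for the real classifier, and images are plain lists of pixel rows to keep the example self-contained.

```python
def hflip(image):
    """Mirror an image (a list of pixel rows) left-to-right."""
    return [row[::-1] for row in image]

def predict_with_hflip_tta(predict_probs, image):
    """Average class probabilities over the original and flipped views."""
    p_orig = predict_probs(image)
    p_flip = predict_probs(hflip(image))
    return [(a + b) / 2.0 for a, b in zip(p_orig, p_flip)]

def top_k(probs, k=3):
    """Indices of the k most probable classes, best first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return order[:k]

def toy_model(image):
    # Hypothetical classifier whose output depends on the top-left pixel,
    # so the two views disagree and averaging has a visible effect.
    if image[0][0] > 0:
        return [0.7, 0.2, 0.1]
    return [0.2, 0.5, 0.3]

image = [[1, 0], [1, 0]]
probs = predict_with_hflip_tta(toy_model, image)
best_two = top_k(probs, k=2)
```

Because the flip is fixed rather than sampled, the same image always yields the same averaged probabilities, at the price of two forward passes per request.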

Caveats

Single runs. LS's +0.4 top-1 is within run-to-run noise (~0.6 pt) and focal's +1.1 balanced-acc is less than twice that, so neither delta is conclusive in isolation. The per-class patterns, however, are coherent and consistent in direction (focal lifts rare recall; sampler destroys common classes). Confirm focal with a repeat run plus the focal×LS combo, and re-measure everything against the derm-cleaned test set (the honest benchmark) once dermatologist labels are in: the ~7-16% test-label noise means the true deltas may shift.

Infra

All three training configs ran on one RunPod H100 (data synced once) and were auto-scored; a watchdog then pulled the results and self-terminated the pod (ongoing burn $0, ~$28 total for the batch). The levers are config knobs in ddtrain (training.loss=focal/focal_gamma, class_balanced_sampler, label_smoothing) plus dump_probs --tta.
