Round 2 Stream A: retrain on the cleanlab-cleaned manifest
Result: dropping the cleanlab-flagged rows did NOT help — it slightly hurt, and the rare classes it was meant to fix got worse. Naive “remove the flagged 7%” cleaning is not the win.
Setup
Dropped the 8,670 cleanlab-flagged training rows (train 25-class 121,514 → 112,846),
retrained gen2a-medsiglip with the same recipe (MedSigLIP-448, trimmed
questionnaire, AdamW/cosine, bf16, 1 frozen + 6 finetune epochs, seed 17) on a rented
RunPod H100, then scored the full 15,561-image test split. Trained from a minimal
manifest (full cleaned 25-class train/val/test + a bounded “other” pool) to fit the pod’s
100 GB local disk; the “other” sample counts match the full-pool recipe (5,939 train / 796
val at other_ratio 0.05), so the comparison is like-for-like apart from the dropped rows.
Result (full test split)
| run | top-1 | top-3 |
|---|---|---|
| baseline (73.6% run) | 0.7360 | 0.9276 |
| cleaned retrain | 0.7320 | 0.9259 |
| delta | −0.0040 | −0.0017 |
The −0.4 pt top-1 drop is within run-to-run noise (σ≈0.6 pt for this recipe), so at best cleaning is neutral. But the per-class picture is worse and directional:
| class (flagged issue rate) | cleaned F1 | Stream C baseline F1 |
|---|---|---|
| post-inflammatory hyperpigmentation (28.7%) | 0.131 (recall 0.072) | 0.181 |
| psoriasis (17.2%) | 0.509 | 0.529 |
| viral exanthem (12.2%) | 0.495 | 0.525 |
| herpes simplex (18.0%) | 0.539 | 0.585 |
| rosacea | 0.407 | 0.409 |
Every rare/flagged class regressed, though rosacea only marginally (−0.002 F1). The eczema overprediction ratio barely moved (1.11 → 1.08).
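The figures above can be re-derived with a few lines of arithmetic. This is a minimal sketch: all numbers are copied from the two tables and the quoted σ, and the √2 factor assumes the baseline and the retrain are independent runs with the same σ, which is an assumption and not something measured here.

```python
import math

# Headline top-1 accuracies, copied from the results table.
baseline_top1 = 0.7360
cleaned_top1 = 0.7320
sigma_pt = 0.6  # run-to-run std in points, as quoted for this recipe

delta_pt = (cleaned_top1 - baseline_top1) * 100

# Assumption: both runs are independent with the same sigma, so the
# std of their difference is sigma * sqrt(2).
diff_sigma_pt = sigma_pt * math.sqrt(2)
z = delta_pt / diff_sigma_pt

# Per-class F1 as (cleaned, baseline), copied from the per-class table.
f1 = {
    "post-inflammatory hyperpigmentation": (0.131, 0.181),
    "psoriasis": (0.509, 0.529),
    "viral exanthem": (0.495, 0.525),
    "herpes simplex": (0.539, 0.585),
    "rosacea": (0.407, 0.409),
}
f1_delta = {name: round(c - b, 3) for name, (c, b) in f1.items()}

print(f"top-1 delta: {delta_pt:+.2f} pt, z = {z:+.2f}")
# Largest regressions first.
for name, d in sorted(f1_delta.items(), key=lambda kv: kv[1]):
    print(f"{name:40s} {d:+.3f}")
```

Under that independence assumption the headline delta sits well inside one standard deviation of the difference, while all five per-class deltas share the same sign, which is what makes the per-class picture directional.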
Why (interpretation)
The in-sample cleanlab caveat we flagged upfront bit us: on imbalanced rare classes, confident learning conflates hard/rare with mislabeled. Dropping 12–29% of PIH (post-inflammatory hyperpigmentation) / psoriasis / viral-exanthem training rows (the flagged rates in the table above) removed genuinely useful hard examples from classes that were already data-starved, so their recall fell further (PIH recall 0.108 → 0.072). The label noise that is real (the scanned document, wrong-body-part onychomycosis) is a small absolute count and doesn’t outweigh the lost signal.
Recommendations (redirects the plan)
- Do NOT blind-drop cleanlab flags. For rare classes, more data / augmentation / class-balancing beats deletion.
- Targeted relabel, not wholesale drop: fix the genuine garbage (non-skin images, wrong-body-part) surgically; leave hard-but-correct examples in.
- The real levers are Stream B + data: class-balanced sampling / focal loss to stop the eczema error-sink and lift rare-class recall (this is now the top Round 2 priority, over label cleanup), and acquiring more rare-class data.
- If pursuing confident learning again, use k-fold OOF probs (not in-sample) so hard examples aren’t systematically over-flagged.
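The out-of-fold (OOF) idea in the last recommendation can be sketched as follows. This is a minimal illustration, not the pipeline used here: `kfold_indices`, `oof_pred_probs`, the `fit_predict` callback interface and the toy `prior_model` are all hypothetical names introduced for the example. The point is only that every row's probabilities come from a model that never trained on that row.

```python
import random
from typing import Callable, List

# Hypothetical interface: fit_predict(train_idx, held_out_idx) trains on
# train_idx only and returns one probability row per held-out index.
FitPredict = Callable[[List[int], List[int]], List[List[float]]]


def kfold_indices(n: int, k: int, seed: int = 17) -> List[List[int]]:
    """Shuffle 0..n-1 and split into k disjoint, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def oof_pred_probs(n: int, k: int, fit_predict: FitPredict,
                   seed: int = 17) -> List[List[float]]:
    """Return per-row probabilities predicted by a model that did not see the row."""
    probs: List[List[float]] = [[] for _ in range(n)]
    for held_out in kfold_indices(n, k, seed):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        for i, row in zip(held_out, fit_predict(train, held_out)):
            probs[i] = list(row)
    return probs


# Toy stand-in for a real model: predicts the class prior of its training fold.
labels = [0] * 8 + [1] * 2


def prior_model(train: List[int], held_out: List[int]) -> List[List[float]]:
    counts = [sum(1 for i in train if labels[i] == c) for c in range(2)]
    prior = [c / len(train) for c in counts]
    return [prior for _ in held_out]


probs = oof_pred_probs(len(labels), k=5, fit_predict=prior_model)
```

The resulting matrix is what would be passed as `pred_probs` (alongside `labels`) to cleanlab's `find_label_issues`. The sketch uses plain shuffled folds for brevity; for the rare classes discussed above, stratified folds (for example scikit-learn's `StratifiedKFold`) would likely be the safer choice so that each fold still contains some rare-class rows.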
Also fixed
drop_last=True on the train DataLoader (commit cf50210) — a trailing 1-sample batch
(118,785 % 64 == 1) crashed BatchNorm; real bug, unrelated to the cleaning outcome.
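The batch arithmetic behind this fix can be checked with a small sketch. `batch_sizes` is a hypothetical helper written for illustration; it mirrors what the `drop_last` flag on `torch.utils.data.DataLoader` does to the final partial batch. The batch size of 64 is inferred from the modulus above, and reading 118,785 as 112,846 cleaned rows plus 5,939 "other" rows is also an inference from the Setup numbers.

```python
from typing import List


def batch_sizes(n_samples: int, batch_size: int, drop_last: bool) -> List[int]:
    """Sizes of the batches one epoch yields, with or without the partial tail."""
    full, rem = divmod(n_samples, batch_size)
    sizes = [batch_size] * full
    if rem and not drop_last:
        sizes.append(rem)  # trailing partial batch is kept
    return sizes


# Assumption: cleaned 25-class train rows + "other" train rows.
n_train = 112_846 + 5_939

kept_tail = batch_sizes(n_train, 64, drop_last=False)
dropped_tail = batch_sizes(n_train, 64, drop_last=True)

print(n_train, n_train % 64)             # sample count and remainder
print(len(kept_tail), kept_tail[-1])     # last batch holds a single sample
print(len(dropped_tail), dropped_tail[-1])
```

A single-sample batch is a problem because BatchNorm in training mode needs more than one value per channel to compute batch statistics. Dropping the tail costs one sample per epoch, which is negligible at this dataset size.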
Infra notes (for next RunPod run)
- Pod container had no FUSE → mountpoint-s3 failed; fell back to parallel aws s3 cp of the needed image subset to local disk (the 104 GB full set doesn’t fit 100 GB).
- Devbox→pod uplink is only ~6 MB/s, so pushing data from the devbox is not viable; pull from S3 on the pod instead.
- H100 epoch times: frozen 512 s, each finetune epoch ~1735 s (~29 min) → ~3 h for 1+6. Budget ~$16 / ~5 h including setup at ~$3.3/hr; check RunPod balance before launching.
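For planning the next run, the time and cost estimate in the last bullet reduces to the arithmetic below. This is a rough sketch using only the figures quoted above; the variable names are illustrative, and the 5 h wall-clock figure (training plus setup and data pull) is taken as given.

```python
# Epoch timings and rate as quoted in the notes above.
frozen_epoch_s = 512
finetune_epoch_s = 1735
n_frozen, n_finetune = 1, 6
rate_usd_per_h = 3.3
budget_h = 5  # wall-clock allowance including setup, as quoted

train_s = n_frozen * frozen_epoch_s + n_finetune * finetune_epoch_s
train_h = train_s / 3600
cost_usd = budget_h * rate_usd_per_h

print(f"training: {train_s} s = {train_h:.2f} h")
print(f"budget:   {budget_h} h x ${rate_usd_per_h}/h = ${cost_usd:.2f}")
```

Changing `n_finetune` or the hourly rate gives a quick estimate for a different schedule or pod type before checking the account balance.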