Round 2 Stream A: retrain on the cleanlab-cleaned manifest

Result: dropping the cleanlab-flagged rows did NOT help. Top-1 fell 0.4 pt (within noise), and the rare classes it was meant to fix got worse. Naive “remove the flagged 7%” cleaning is not the win.

Setup

Dropped the 8,670 cleanlab-flagged training rows (train 25-class 121,514 → 112,846), retrained gen2a-medsiglip with the same recipe (MedSigLIP-448, trimmed questionnaire, AdamW/cosine, bf16, 1 frozen + 6 finetune epochs, seed 17) on a rented RunPod H100, then scored the full 15,561-image test split. Trained from a minimal manifest (full cleaned 25-class train/val/test + a bounded “other” pool) to fit the pod’s 100 GB local disk; the “other” sample counts match the full-pool recipe (5,939 train / 796 val at other_ratio 0.05), so the comparison is like-for-like apart from the dropped rows.

Result (full test split)

run	top-1	top-3
baseline (73.6% run)	0.7360	0.9276
cleaned retrain	0.7320	0.9259
delta	−0.0040	−0.0017

The −0.4 pt top-1 delta is within run-to-run noise (σ≈0.6 pt for this recipe), so at best cleaning is neutral. But the per-class picture is worse and directional:

class (flagged issue rate)	cleaned F1	Stream C baseline F1
post-inflammatory hyperpigmentation (28.7%)	0.131 (recall 0.072)	0.181
psoriasis (17.2%)	0.509	0.529
viral exanthem (12.2%)	0.495	0.525
herpes simplex (18.0%)	0.539	0.585
rosacea	0.407	0.409

Every rare/flagged class regressed (rosacea only marginally, 0.409 → 0.407). Eczema overprediction ratio barely moved (1.11 → 1.08).
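
The deltas behind the two tables above can be recomputed directly. This is a minimal sketch that only re-uses the figures quoted in this report; the variable and function names are illustrative, and expressing the top-1 delta in units of σ is a rough single-run comparison, not a significance test.

```python
# Figures copied from the two result tables above, as (cleaned, baseline).
overall = {"top-1": (0.7320, 0.7360), "top-3": (0.9259, 0.9276)}
per_class_f1 = {
    "post-inflammatory hyperpigmentation": (0.131, 0.181),
    "psoriasis": (0.509, 0.529),
    "viral exanthem": (0.495, 0.525),
    "herpes simplex": (0.539, 0.585),
    "rosacea": (0.407, 0.409),
}
SIGMA_TOP1 = 0.006  # run-to-run noise quoted above (about 0.6 pt)


def deltas(table, ndigits):
    """Return cleaned minus baseline for every row, rounded for display."""
    return {
        name: round(cleaned - baseline, ndigits)
        for name, (cleaned, baseline) in table.items()
    }


overall_delta = deltas(overall, 4)
f1_delta = deltas(per_class_f1, 3)

# Size of the top-1 drop relative to the quoted run-to-run noise.
top1_in_sigmas = round(abs(overall_delta["top-1"]) / SIGMA_TOP1, 2)

# True only if every listed class has a lower F1 after cleaning.
all_regressed = all(d < 0 for d in f1_delta.values())

for name, d in sorted(f1_delta.items(), key=lambda kv: kv[1]):
    print(f"{name}: {d:+.3f}")
print(f"top-1 delta = {overall_delta['top-1']:+.4f} ({top1_in_sigmas} sigma)")
```

Sorting by delta puts the largest F1 losses first, which makes the directional pattern easier to see than the aggregate number does.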

Why (interpretation)

The in-sample cleanlab caveat we flagged upfront bit us: on imbalanced rare classes, confident-learning conflates hard/rare with mislabeled. Dropping 12–29% of PIH / psoriasis / viral-exanthem training rows (the flagged rates in the table above) removed genuinely useful hard examples from classes that were already data-starved, so their recall fell further (PIH recall 0.108 → 0.072). The label noise that is real (the scanned document, wrong-body-part onychomycosis) is a small absolute count and doesn’t outweigh the lost signal.
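
The in-sample problem can be made concrete with an out-of-fold scheme, where each row is scored by a model that never trained on it. The sketch below is an illustration only, not the pipeline used in this run: `stratified_folds`, `out_of_fold_probs`, and the `fit_predict` callback are hypothetical names, and `prior_model` is a toy stand-in for a real classifier.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=17):
    """Assign each row a fold id in [0, k), spreading every class across folds."""
    rng = random.Random(seed)  # seed value is arbitrary
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    fold_of = [0] * len(labels)
    for rows in by_class.values():
        rng.shuffle(rows)
        for pos, idx in enumerate(rows):
            fold_of[idx] = pos % k
    return fold_of


def out_of_fold_probs(labels, k, fit_predict):
    """Return one probability vector per row, each from a model that never saw it.

    fit_predict(train_idx, held_idx) is a hypothetical callback: train on
    train_idx, then return probability vectors for held_idx, in order.
    """
    fold_of = stratified_folds(labels, k)
    probs = [None] * len(labels)
    for fold in range(k):
        held = [i for i, f in enumerate(fold_of) if f == fold]
        train = [i for i, f in enumerate(fold_of) if f != fold]
        for i, p in zip(held, fit_predict(train, held)):
            probs[i] = p
    return probs


# Toy data: one common class and one rare class.
labels = ["eczema"] * 20 + ["pih"] * 5
classes = sorted(set(labels))


def prior_model(train_idx, held_idx):
    # Stand-in for a real classifier: predicts training-set class frequencies.
    counts = [sum(labels[i] == c for i in train_idx) for c in classes]
    total = sum(counts)
    return [[c / total for c in counts] for _ in held_idx]


oof = out_of_fold_probs(labels, k=5, fit_predict=prior_model)
```

The resulting `oof` list would replace the in-sample probabilities fed to the confident-learning step. Stratifying the folds matters here because a rare class can otherwise be missing from a fold's training set entirely.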

Recommendations (redirects the plan)

Do NOT blind-drop cleanlab flags. For rare classes, more data / augmentation / class-balancing beats deletion.
Targeted relabel, not wholesale drop: fix the genuine garbage (non-skin images, wrong-body-part) surgically; leave hard-but-correct examples in.
The real levers are Stream B + data: class-balanced sampling / focal loss to stop the eczema error-sink and lift rare-class recall (this is now the top Round 2 priority, over label cleanup), and acquiring more rare-class data.
If pursuing confident learning again, use k-fold out-of-fold (OOF) probabilities, not in-sample ones, so hard examples aren’t systematically over-flagged.
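
The two Stream B levers named above, class-balanced sampling and focal loss, can be sketched in a few lines. This is a minimal illustration in plain Python under stated assumptions, not the planned implementation: the function names are hypothetical, the weighting is simple inverse class frequency, and `gamma=2.0` is a commonly used default rather than a tuned value.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Per-row sampling weight 1 / count(class), so each class gets equal total weight."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


def focal_loss(p_true, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t).

    p_true is the predicted probability of the true class. With gamma = 0
    this reduces to ordinary cross-entropy.
    """
    return -((1.0 - p_true) ** gamma) * math.log(p_true)


# Toy imbalance: 90 rows of a common class, 10 of a rare one.
labels = ["eczema"] * 90 + ["pih"] * 10
weights = inverse_frequency_weights(labels)

# Total sampling mass per class is equal despite the 9:1 row imbalance.
class_mass = {
    c: sum(w for w, y in zip(weights, labels) if y == c) for c in set(labels)
}

# Easy examples (high p_true) are down-weighted far more than hard ones.
easy_ratio = focal_loss(0.9) / -math.log(0.9)  # (1 - 0.9)^2, about 0.01
hard_ratio = focal_loss(0.3) / -math.log(0.3)  # (1 - 0.3)^2, about 0.49
```

In a PyTorch training loop, per-row weights like these are the kind of input `torch.utils.data.WeightedRandomSampler` takes. Whether sampling, the loss, or both should change is an open question for Stream B.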

Also fixed

Set drop_last=True on the train DataLoader (commit cf50210): a trailing 1-sample batch (118,785 % 64 == 1) crashed BatchNorm. This is a real bug, unrelated to the cleaning outcome.
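
The batch arithmetic behind this bug is easy to check. The sketch below is plain Python that mimics how a loader splits a dataset into batches; `batch_sizes` is a hypothetical helper, not a PyTorch call. Treating 118,785 as the 112,846 cleaned rows plus the 5,939 “other” rows is an inference from the Setup figures, which happen to sum to that number.

```python
def batch_sizes(n_samples, batch_size, drop_last):
    """Sizes of the batches a sequential loader would yield."""
    full, remainder = divmod(n_samples, batch_size)
    sizes = [batch_size] * full
    if remainder and not drop_last:
        sizes.append(remainder)  # the short trailing batch
    return sizes


N_TRAIN = 112_846 + 5_939  # cleaned 25-class rows + "other" rows (inferred)
BATCH = 64

with_tail = batch_sizes(N_TRAIN, BATCH, drop_last=False)
without_tail = batch_sizes(N_TRAIN, BATCH, drop_last=True)

print(len(with_tail), with_tail[-1])        # last batch holds a single sample
print(len(without_tail), without_tail[-1])  # every batch is full
```

A single-sample batch gives BatchNorm nothing to compute batch statistics from during training, which is why dropping that last batch avoids the crash at the cost of one sample per epoch.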

Infra notes (for next RunPod run)

Pod container had no FUSE → mountpoint-s3 failed; fell back to parallel aws s3 cp of the needed image subset to local disk (the 104 GB full set doesn’t fit on the pod’s 100 GB disk).
Devbox→pod uplink is only ~6 MB/s, so pushing data from the devbox is not viable; pull from S3 on the pod instead.
H100 epoch times: frozen 512 s, each finetune epoch ~1735 s (~29 min) → ~3 h for 1+6. Budget ~$16 / ~5 h including setup at ~$3.3/hr; check RunPod balance before launching.
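
For planning the next run, the timing and cost figures above reduce to simple arithmetic. This is a rough sketch using only the numbers in these notes. The constant names are illustrative, 1 GB is taken as 1000 MB, and the push time assumes the full 104 GB set, so it is an upper bound for a subset.

```python
FROZEN_EPOCH_S = 512       # one frozen epoch on the H100
FINETUNE_EPOCH_S = 1735    # each finetune epoch
N_FINETUNE = 6
RATE_USD_PER_H = 3.3       # approximate pod rate
BUDGET_H = 5               # training plus setup

# Pure training time for the 1 + 6 epoch recipe.
train_h = (FROZEN_EPOCH_S + N_FINETUNE * FINETUNE_EPOCH_S) / 3600

# Cost of the full budgeted window.
budget_usd = BUDGET_H * RATE_USD_PER_H

# Why pushing from the devbox is not viable: hours to upload the full set.
DATASET_GB = 104
UPLINK_MB_S = 6
push_h = DATASET_GB * 1000 / UPLINK_MB_S / 3600

print(f"training: {train_h:.2f} h")
print(f"budget:   ${budget_usd:.2f} for {BUDGET_H} h")
print(f"devbox push of full set: {push_h:.1f} h")
```

At these rates the upload alone would take about as long as the whole budgeted pod session, which supports pulling from S3 on the pod.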