Round 2 Stream A: confident-learning (cleanlab) mislabel surfacing
Second Stream A deliverable (after the leakage audit). Goal: find likely label errors in the training set so we can clean the manifest before a re-train.
Method
Scored the full 25-class train split (121,514 images) with the deployed gen2a-medsiglip
checkpoint on the two devbox GPUs (dump_probs.py; in-sample probs, a deliberately cheaper
choice than k-fold out-of-fold (OOF) scoring). Then ran cleanlab_mislabels.py, which
L1-normalizes the sigmoid probs into a per-row distribution, runs
cleanlab.filter.find_label_issues and computes label-quality scores, aggregates
per-class issue rates and given->predicted confusions, and renders a vetting montage.
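A minimal sketch of the normalize-and-score step, on made-up toy data. It is not the contents of cleanlab_mislabels.py: the helper names `l1_normalize` and `self_confidence` are hypothetical, and the score shown is the plain "probability assigned to the given label", which is what cleanlab calls self-confidence. With cleanlab installed, the same `labels` and `pred_probs` arrays are what `cleanlab.filter.find_label_issues(labels=..., pred_probs=...)` expects.

```python
import numpy as np

def l1_normalize(sigmoid_probs: np.ndarray) -> np.ndarray:
    """Rescale each row of independent sigmoid outputs so it sums to 1."""
    row_sums = sigmoid_probs.sum(axis=1, keepdims=True)
    n_classes = sigmoid_probs.shape[1]
    # Avoid dividing by zero; all-zero rows fall back to a uniform distribution.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    out = sigmoid_probs / safe_sums
    out[row_sums[:, 0] <= 0] = 1.0 / n_classes
    return out

def self_confidence(labels: np.ndarray, pred_probs: np.ndarray) -> np.ndarray:
    """Probability the model gives to the given label (lower = more suspect)."""
    return pred_probs[np.arange(len(labels)), labels]

# Toy sigmoid outputs for 3 rows and 3 classes (illustrative values only).
sigmoid_probs = np.array([
    [0.90, 0.10, 0.20],  # given label 0, model agrees
    [0.05, 0.80, 0.15],  # given label 0, model prefers class 1
    [0.30, 0.30, 0.60],  # given label 2, model agrees
])
labels = np.array([0, 0, 2])

pred_probs = l1_normalize(sigmoid_probs)
scores = self_confidence(labels, pred_probs)
ranked = np.argsort(scores)  # most suspect rows first
print(ranked.tolist())
```

Row 1 ranks first because the model gives its given label almost no probability mass after normalization.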
Caveat: because the probs are in-sample, this is candidate generation for human vetting, not ground truth. The model was trained on these rows, so a prediction that still confidently disagrees with the given label is a strong mislabel signal. In-sample scoring also hides some errors, since the model can memorize a wrong label, so the 7.13% flag rate reported below is a noisy lower bound. Vet the top candidates before dropping any rows.
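For contrast with the in-sample scoring used here, out-of-fold scoring can be sketched as below. This is an illustration only: `out_of_fold_probs` and `centroid_fit_predict` are hypothetical names, the nearest-centroid model is a stand-in for a real retrain per fold, and the data is synthetic. The point is that every row is scored by a model that never saw it, so a memorized wrong label cannot hide.

```python
import numpy as np

def out_of_fold_probs(features, labels, n_classes, fit_predict, k=5, seed=0):
    """Score every row with a model trained on the other k-1 folds."""
    n = len(labels)
    rng = np.random.default_rng(seed)
    fold_of_row = rng.permutation(n) % k  # balanced random fold assignment
    oof = np.zeros((n, n_classes))
    for fold in range(k):
        held_out = fold_of_row == fold
        oof[held_out] = fit_predict(
            features[~held_out], labels[~held_out], features[held_out]
        )
    return oof

def centroid_fit_predict(x_train, y_train, x_test, n_classes=2):
    """Stand-in model: softmax over negative distance to each class centroid."""
    centroids = np.stack(
        [x_train[y_train == c].mean(axis=0) for c in range(n_classes)]
    )
    dists = np.linalg.norm(x_test[:, None, :] - centroids[None, :, :], axis=2)
    logits = -dists
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Two well-separated synthetic clusters with one deliberately wrong label.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.3, (10, 2)), rng.normal(3.0, 0.3, (10, 2))])
y = np.array([0] * 10 + [1] * 10)
y[3] = 1  # row 3 sits in cluster 0 but is labeled 1

oof = out_of_fold_probs(x, y, 2, centroid_fit_predict, k=5)
suspect = int(np.argmin(oof[np.arange(len(y)), y]))
print(suspect)
```

The cost is k training runs instead of one, which is the trade-off behind the in-sample choice above.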
Result: 8,670 / 121,514 train rows (7.13%) flagged as likely label errors
Label-issue rate is heavily concentrated in exactly the classes that underperform on the test set (see Stream C):
| class | train issue rate | Stream C test F1 |
|---|---|---|
| post-inflammatory hyperpigmentation | 28.7% | 0.18 |
| vitiligo | 21.3% | — |
| herpes simplex | 18.0% | 0.585 |
| psoriasis | 17.2% | 0.529 |
| viral exanthem | 12.2% | 0.525 |
| folliculitis | 9.6% | 0.613 |
| insect bite | 8.9% | — |
| eczema uns | 8.9% | — |
Key insight: the low recall on post-inflammatory hyperpigmentation (PIH), psoriasis, viral exanthem, and herpes simplex is substantially a labeling problem, not just visual difficulty: up to ~29% of the training labels in these classes are suspect. Cleaning them should lift the exact classes that drag balanced accuracy (0.663) below top-1 (0.736).
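Why weak classes pull balanced accuracy below top-1 can be shown with a small sketch. Balanced accuracy is the unweighted mean of per-class recall, so a small class with poor recall counts as much as a large one. The function name and the toy labels are made up for illustration and are not Stream C data.

```python
from collections import Counter

def top1_and_balanced_accuracy(y_true, y_pred):
    """Return (top-1 accuracy, balanced accuracy, per-class recall)."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    top1 = sum(correct.values()) / len(y_true)
    recalls = {c: correct[c] / n for c, n in total.items()}
    balanced = sum(recalls.values()) / len(recalls)
    return top1, balanced, recalls

# Toy split: a large, easy class and a small, hard one.
y_true = ["eczema"] * 8 + ["pih"] * 2
y_pred = ["eczema"] * 8 + ["eczema", "pih"]

top1, balanced, recalls = top1_and_balanced_accuracy(y_true, y_pred)
print(top1, balanced, recalls)
```

Here top-1 is 0.9 while balanced accuracy is 0.75, because the small class's recall of 0.5 carries half the weight.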
Top given->predicted confusions among flagged rows mirror the eczema error-sink: eczema↔psoriasis (478 / 359), folliculitis↔insect-bite (424 / 333), insect-bite->eczema (320), seborrheic-dermatitis->eczema (161), intertrigo->eczema (130).
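The per-class issue rates and the given->predicted confusion counts can be aggregated roughly as follows. This is a sketch under assumptions: `summarize_issues` is a hypothetical helper, the inputs are three parallel lists (given label, predicted label, issue flag), and the toy rows are invented rather than taken from label_issues.csv.

```python
from collections import Counter

def summarize_issues(given, predicted, is_issue):
    """Per-class flagged fraction and counts of (given, predicted) pairs."""
    per_class_total = Counter(given)
    per_class_flagged = Counter(g for g, f in zip(given, is_issue) if f)
    issue_rate = {c: per_class_flagged[c] / n for c, n in per_class_total.items()}
    # Only flagged rows where the model disagrees with the given label.
    confusions = Counter(
        (g, p) for g, p, f in zip(given, predicted, is_issue) if f and g != p
    )
    return issue_rate, confusions

given = ["eczema", "eczema", "eczema", "eczema", "psoriasis", "psoriasis"]
predicted = ["eczema", "psoriasis", "psoriasis", "eczema", "eczema", "psoriasis"]
is_issue = [False, True, True, False, True, False]

issue_rate, confusions = summarize_issues(given, predicted, is_issue)
print(issue_rate)
print(confusions.most_common(2))
```

Counting directed pairs keeps eczema->psoriasis and psoriasis->eczema separate, which is how the two-number entries such as (478 / 359) above read.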
Visual verification (eval_output/cleanlab/mislabel_candidates.png)
The top-60 (lowest quality score) contains three kinds of issue, all real:
- Non-image / garbage contamination: e.g. a scanned text document labeled eczema uns. These should be dropped outright.
- Systematic wrong-label / wrong-body-part: several onychomycosis (nail fungus) labels are actually armpit / hairy-skin photos (predicted intertrigo). A targeted onychomycosis body-part/label review is warranted.
- Plausible single-image mislabels + genuine look-alikes: acne vs keratosis pilaris, rosacea vs pityriasis, etc. These need per-image human judgement.
Recommendation -> cleaned-manifest retrain
- Programmatic prune of clear garbage (non-skin images; the flagged CSV + image_uuid list is the starting set).
- First retrain experiment: drop the cleanlab-flagged rows (8,670) from training, retrain gen2a-medsiglip (RunPod H100, same recipe), and measure the full-test delta, especially per-class recall on PIH / psoriasis / viral-exanthem. If test top-1 and those recalls rise, cleaning helped; if not, the flags were mostly hard-but-correct examples.
- Targeted human vetting of the systematic clusters (onychomycosis body-part; PIH) via the montage + label_issues.csv before committing relabels.
- For a fully rigorous pass later, k-fold OOF probs remove the in-sample caveat.
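The drop-and-compare experiment could be wired up along these lines. This is a sketch, not the project's tooling: `prune_manifest` and `recall_delta` are hypothetical helpers, the manifest is assumed to be a list of rows keyed by an `image_uuid` field, and the recall numbers are invented purely to show the comparison.

```python
def prune_manifest(rows, flagged_uuids):
    """Return manifest rows whose image_uuid is not in the flagged set."""
    flagged = set(flagged_uuids)
    return [row for row in rows if row["image_uuid"] not in flagged]

def recall_delta(baseline, retrained):
    """Per-class recall change (retrained minus baseline) for shared classes."""
    return {
        c: round(retrained[c] - baseline[c], 4)
        for c in baseline
        if c in retrained
    }

# Toy manifest; the real one would be loaded from the training manifest file.
manifest = [
    {"image_uuid": "a1", "label": "psoriasis"},
    {"image_uuid": "b2", "label": "eczema uns"},
    {"image_uuid": "c3", "label": "vitiligo"},
]
cleaned = prune_manifest(manifest, flagged_uuids=["b2"])

# Made-up per-class recalls before and after the cleaned-manifest retrain.
baseline = {"psoriasis": 0.50, "vitiligo": 0.60}
retrained = {"psoriasis": 0.58, "vitiligo": 0.59}
delta = recall_delta(baseline, retrained)
print(len(cleaned), delta)
```

Reading the delta per class, rather than only as a top-1 change, is what separates "cleaning helped the weak classes" from a shift that only moves the large classes.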
Outputs: eval_output/cleanlab/{label_issues.csv, issues_per_class.csv, issue_confusions.csv, mislabel_candidates.png}.
Reproduce:
cd src/training
uv run --with cleanlab python scripts/eval/cleanlab_mislabels.py \
--probs ~/.cache/dermadetect/round2/train_probs.parquet