Round 2 Stream A: confident-learning (cleanlab) mislabel surfacing

Second Stream A deliverable (after the leakage audit). Goal: find likely label errors in the training set so we can clean the manifest before a re-train.

Method

Scored the full 25-class train split (121,514 images) with the deployed gen2a-medsiglip checkpoint on the two devbox GPUs (dump_probs.py). These are in-sample probs, a deliberately cheaper choice than k-fold OOF. Then ran cleanlab_mislabels.py, which L1-normalizes the sigmoid probs to a per-row distribution, runs cleanlab.filter.find_label_issues and computes label-quality scores, aggregates per-class issue rates and given->predicted confusions, and renders a vetting montage.
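The normalization and scoring steps can be sketched roughly as follows. This is a minimal illustration, not the contents of cleanlab_mislabels.py: the helper names and toy data are hypothetical, and plain Python lists stand in for the arrays the real script presumably uses. The self-confidence score shown here (probability assigned to the given label) is one of the label-quality scores cleanlab offers. The cleanlab call is wrapped in a function that is not executed in this sketch.

```python
def l1_normalize(rows):
    """Rescale each row of non-negative scores so it sums to 1."""
    out = []
    for row in rows:
        total = sum(row)
        if total <= 0:
            # Degenerate row: fall back to a uniform distribution.
            out.append([1.0 / len(row)] * len(row))
        else:
            out.append([p / total for p in row])
    return out


def self_confidence(labels, pred_probs):
    """Label-quality score: probability the model assigns to the given label."""
    return [probs[y] for y, probs in zip(labels, pred_probs)]


def find_issues_with_cleanlab(labels, pred_probs):
    """Not run in this sketch: requires cleanlab and NumPy to be installed."""
    import numpy as np
    from cleanlab.filter import find_label_issues

    return find_label_issues(
        labels=np.asarray(labels),
        pred_probs=np.asarray(pred_probs),
        return_indices_ranked_by="self_confidence",
    )


# Toy 3-class example with independent sigmoid outputs (rows need not sum to 1).
sigmoid_probs = [
    [0.90, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.30, 0.40],
]
given_labels = [0, 1, 2]  # row 1 is labeled class 1 but the model prefers class 2

pred_probs = l1_normalize(sigmoid_probs)
scores = self_confidence(given_labels, pred_probs)
# Lowest score first: these are the rows to vet first.
ranked = sorted(range(len(scores)), key=scores.__getitem__)
```

In this toy data the second row ranks first, because the model gives its labeled class only 0.15 of the normalized mass.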

Caveat: in-sample probs mean this is candidate-generation for human vetting, not ground truth. Because the model was trained on these rows, a prediction that still confidently disagrees with the given label is a strong mislabel signal. In-sample scoring also masks some errors, since the model can memorize wrong labels, so 7.13% is best read as a noisy lower bound. Vet the top candidates before dropping.
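For contrast, out-of-fold scoring means every row is scored by a model that never trained on it. The sketch below shows the bookkeeping only, under loud assumptions: the `fit` and `predict` callables are hypothetical placeholders, and the toy "model" just returns class frequencies of its training fold. A real run would retrain gen2a-medsiglip once per fold, which is the cost the in-sample shortcut avoids.

```python
def kfold_indices(n_rows, k):
    """Assign row i to fold i % k; return one held-out index list per fold."""
    return [list(range(fold, n_rows, k)) for fold in range(k)]


def out_of_fold_probs(features, labels, k, fit, predict):
    """Score each row with a model fitted on the other k-1 folds."""
    oof = [None] * len(labels)
    for held_out in kfold_indices(len(labels), k):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        for i in held_out:
            oof[i] = predict(model, features[i])
    return oof


# Toy stand-in model (hypothetical): class frequencies of the training fold.
def fit_prior(_features, train_labels, n_classes=2):
    counts = [train_labels.count(c) for c in range(n_classes)]
    return [c / len(train_labels) for c in counts]


def predict_prior(model, _feature):
    return model


features = list(range(6))
labels = [0, 0, 0, 1, 1, 0]
oof = out_of_fold_probs(features, labels, k=3, fit=fit_prior, predict=predict_prior)
```

With 25 imbalanced classes, a stratified split would likely be preferable to the simple round-robin assignment used here.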

Result: 8,670 / 121,514 train rows (7.13%) flagged as likely label errors

Label-issue rate is heavily concentrated in exactly the classes that underperform on the test set (see Stream C):

class	train issue rate	(Stream C test F1)
post-inflammatory hyperpigmentation	28.7%	0.18
vitiligo	21.3%	—
herpes simplex	18.0%	0.585
psoriasis	17.2%	0.529
viral exanthem	12.2%	0.525
folliculitis	9.6%	0.613
insect bite	8.9%	—
eczema uns	8.9%	—
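A per-class issue rate like the one tabulated above is the number of flagged rows with a given label divided by the total rows with that label. A minimal sketch, with a hypothetical function name and made-up toy data (not the real counts):

```python
from collections import Counter


def issue_rate_per_class(given_labels, is_issue):
    """Fraction of rows flagged per given label, sorted highest first."""
    totals = Counter(given_labels)
    flagged = Counter(lbl for lbl, bad in zip(given_labels, is_issue) if bad)
    rates = {cls: flagged[cls] / totals[cls] for cls in totals}
    return dict(sorted(rates.items(), key=lambda kv: kv[1], reverse=True))


# Toy data only: 4 psoriasis rows, 5 eczema uns rows, 1 vitiligo row.
given = ["psoriasis"] * 4 + ["eczema uns"] * 5 + ["vitiligo"]
flags = [True, False, False, False] + [True, False, False, False, False] + [False]

rates = issue_rate_per_class(given, flags)
```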

Key insight: the low recall on PIH / psoriasis / viral-exanthem / herpes-simplex is substantially a labeling problem, not just visual difficulty: roughly 12% to 29% of their training labels are flagged as suspect. Cleaning these should lift the exact classes that drag balanced accuracy (0.663) below top-1 (0.736).
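The gap between the two metrics follows from their definitions: top-1 weights every image equally, while balanced accuracy is the unweighted mean of per-class recall, so a small class with poor recall costs much more in the latter. A toy illustration with made-up predictions (the function name is hypothetical):

```python
from collections import defaultdict


def top1_and_balanced_accuracy(y_true, y_pred):
    """Return (top-1 accuracy, balanced accuracy, per-class recall)."""
    correct = defaultdict(int)
    support = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        support[t] += 1
        correct[t] += int(t == p)
    top1 = sum(correct.values()) / len(y_true)
    recalls = {c: correct[c] / support[c] for c in support}
    balanced = sum(recalls.values()) / len(recalls)
    return top1, balanced, recalls


# Toy data: a large class predicted perfectly, a small class half missed.
y_true = ["eczema"] * 8 + ["PIH"] * 2
y_pred = ["eczema"] * 8 + ["eczema", "PIH"]

top1, balanced, recalls = top1_and_balanced_accuracy(y_true, y_pred)
```

Here one missed PIH image leaves top-1 at 0.9 but pulls balanced accuracy down to 0.75.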

Top given->predicted confusions among flagged rows mirror the eczema error-sink: eczema↔psoriasis (478 / 359), folliculitis↔insect-bite (424 / 333), insect-bite->eczema (320), seborrheic-dermatitis->eczema (161), intertrigo->eczema (130).
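Confusion counts of this kind can be obtained by tallying (given, predicted) pairs over the flagged rows only. A minimal sketch with a hypothetical function name and toy data; the direction matters, so eczema->psoriasis and psoriasis->eczema are counted separately:

```python
from collections import Counter


def top_confusions(given, predicted, is_issue, n=5):
    """Most common (given, predicted) label pairs among flagged rows."""
    pairs = Counter(
        (g, p)
        for g, p, bad in zip(given, predicted, is_issue)
        if bad and g != p
    )
    return pairs.most_common(n)


# Toy data only; the last two rows are not flagged and are ignored.
given = ["eczema", "eczema", "psoriasis", "insect bite", "eczema"]
predicted = ["psoriasis", "psoriasis", "eczema", "eczema", "eczema"]
flags = [True, True, True, False, False]

confusions = top_confusions(given, predicted, flags)
```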

Visual verification (eval_output/cleanlab/mislabel_candidates.png)

The top-60 (lowest quality score) contains three kinds of issue, all real:

Non-image / garbage contamination - e.g. a scanned text document labeled eczema uns. These should be dropped outright.
Systematic wrong-label / wrong-body-part - several onychomycosis (nail fungus) labels are actually armpit / hairy-skin photos (predicted intertrigo). A targeted onychomycosis body-part/label review is warranted.
Plausible single-image mislabels + genuine look-alikes - acne vs keratosis pilaris, rosacea vs pityriasis, etc. These need per-image human judgement.

Recommendation -> cleaned-manifest retrain

Programmatic prune of clear garbage (non-skin images; the flagged CSV + image_uuid list is the starting set).
First retrain experiment: drop the cleanlab-flagged rows (8,670) from training, retrain gen2a-medsiglip (RunPod H100, same recipe), and measure the full-test delta - especially per-class recall on PIH / psoriasis / viral-exanthem. If test top-1 and those recalls rise, cleaning helped; if not, the flags were mostly hard-but-correct examples.
Targeted human vetting of the systematic clusters (onychomycosis body-part; PIH) via the montage + label_issues.csv before committing relabels.
For a fully rigorous pass later, k-fold OOF probs remove the in-sample caveat.
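The drop-and-compare experiment described above could be scripted along these lines. This is a sketch under assumptions: the manifest is taken to be a CSV keyed by image_uuid with a hypothetical label column, and all recall values are made-up placeholders rather than results.

```python
import csv
import io

# Toy manifest and flagged list; the `label` column name is an assumption.
MANIFEST_CSV = """image_uuid,label
a1,eczema uns
b2,psoriasis
c3,onychomycosis
d4,psoriasis
"""
FLAGGED_CSV = """image_uuid
b2
c3
"""


def read_rows(text):
    """Parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def drop_flagged(manifest_rows, flagged_rows):
    """Keep only manifest rows whose image_uuid is not in the flagged set."""
    flagged = {row["image_uuid"] for row in flagged_rows}
    return [row for row in manifest_rows if row["image_uuid"] not in flagged]


def recall_delta(baseline, cleaned):
    """Per-class recall change (cleaned minus baseline) for shared classes."""
    return {
        cls: round(cleaned[cls] - baseline[cls], 4)
        for cls in baseline
        if cls in cleaned
    }


kept = drop_flagged(read_rows(MANIFEST_CSV), read_rows(FLAGGED_CSV))

# Placeholder recalls for illustration only, not measured values.
toy_delta = recall_delta(
    {"psoriasis": 0.50, "PIH": 0.25},
    {"psoriasis": 0.625, "PIH": 0.25},
)
```

Evaluating both checkpoints on the same unchanged test set keeps the per-class deltas attributable to the training-set change alone.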

Outputs: eval_output/cleanlab/{label_issues.csv, issues_per_class.csv, issue_confusions.csv, mislabel_candidates.png}. Reproduce:


cd src/training
uv run --with cleanlab python scripts/eval/cleanlab_mislabels.py \
    --probs ~/.cache/dermadetect/round2/train_probs.parquet