Round 2 Stream C: clinical evaluation of test-set probabilities
What this stream delivered
A CPU-only analysis tool that turns a file of pre-computed test-set probabilities into the clinical read-outs the Round 2 model review asks for. It runs no CV model and needs no GPU - it consumes sigmoid multilabel probabilities produced by another Round 2 stream and joins them back to the curated v1 manifest for subgroup analysis.
New files (all under src/training/scripts/eval/):
- clinical_eval.py - the deliverable.
- _synth_probs_for_test.py - a synthetic probabilities generator used only to self-test the harness end-to-end before real probabilities exist. Its output is never committed (writes to /tmp by default).
matplotlib>=3.8 was added to the eval optional-dependency group in
src/training/pyproject.toml (the confusion-matrix and abstain-curve PNGs need it).
Input contract
clinical_eval.py reads exactly what a peer stream will produce:
- ~/.cache/dermadetect/round2/test_probs.parquet with columns image_uuid (str), patient_id (i64), diagnosis (str, true label), true_idx (i32, index into the ordered target-label list), and prob_0..prob_24 (f32 sigmoid probabilities for the 25 target classes).
- ~/.cache/dermadetect/round2/target_labels.json - JSON list of the 25 label strings in prob-column order.
Probabilities are treated as sigmoid multilabel outputs: they need not sum to 1 and are
used as-is for argmax / topk (top-1, top-3) and as the confidence score (max class prob).
This matches the torch.sigmoid(...)[:, target_idx] convention in score_full_test.py.
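A minimal pure-Python sketch of this scoring convention, for illustration only. The helper names topk_indices and score_rows and the toy rows are assumptions, not taken from clinical_eval.py, which presumably uses vectorized array code instead.

```python
def topk_indices(probs, k):
    """Indices of the k largest probabilities, highest first (ties: lower index wins)."""
    return sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]


def score_rows(rows, k=3):
    """rows: list of (true_idx, probs). Returns (top-1 acc, top-k acc, confidences)."""
    top1_hits = topk_hits = 0
    confidences = []
    for true_idx, probs in rows:
        ranked = topk_indices(probs, k)
        top1_hits += ranked[0] == true_idx
        topk_hits += true_idx in ranked
        # Confidence is the raw max class probability; no renormalization.
        confidences.append(max(probs))
    n = len(rows)
    return top1_hits / n, topk_hits / n, confidences


# Toy rows with 4 classes (hypothetical data); sigmoid outputs need not sum to 1.
rows = [
    (0, [0.90, 0.40, 0.10, 0.05]),  # top-1 correct
    (2, [0.70, 0.20, 0.60, 0.10]),  # top-1 wrong, truth ranked 2nd
    (3, [0.30, 0.25, 0.20, 0.10]),  # truth ranked 4th, missed even by top-3
]
top1, top3, conf = score_rows(rows)
print(f"top-1={top1:.3f} top-3={top3:.3f} confidences={conf}")
```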
What it produces
Outputs go to src/training/eval_output/clinical/ (gitignored):
- Per-class precision / recall / F1 / support using the top-1 argmax over the 25 classes, as per_class_metrics.csv plus a printed table sorted by F1. Also reports overall top-1, top-3, macro-F1, balanced accuracy (mean per-class recall), and n.
- 25x25 row-normalized confusion matrix as confusion_matrix.csv and a confusion_matrix.png heatmap, plus the top ~15 most-confused ordered class pairs (true -> pred, off-diagonal) - the input to the taxonomy / “other”-class review.
- Abstain / refer-to-specialist policy using max class prob as the confidence score. Sweeps the abstain threshold to produce an accuracy@coverage curve (abstain_curve.csv + .png), and reports the coverage achievable while holding covered-set top-1 (and top-3) at >= 90% and >= 95%, with the threshold for each.
- Overprediction / error-sink analysis (overprediction.csv + a donor CSV): per class, predicted-count vs true-count and the overprediction ratio; classes ranked by the share of ALL top-1 false positives they absorb; and, for the #1 sink, the breakdown of which true classes are misfiled into it (as % of each donor class’s test images).
- Tiered policy analysis (tiered_policy.csv): a 3-tier Confident / Possible / No-report policy split at the --tier-splits coverage percentiles (default 0.50,0.87), printing per-tier n / % / top-1 / top-3 and the single-decision outcome mix (confident-correct, confident-wrong, possible-in-top3, possible-miss, no-report).
- Fairness / subgroup stratification by gender, an age bucket, and body site (location_primary), joining back to manifest.parquet on image_uuid. Subgroups below 50 support are shown but never flagged; a subgroup whose top-1 trails overall by >= 5 points is flagged LOW.
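As an illustration of the confusion-matrix read-out listed above, the sketch below builds the count matrix, row-normalizes it, and ranks off-diagonal (true -> pred) pairs. Ranking pairs by raw count is an assumption (the tool may rank by row-normalized rate), and the function names and toy labels are hypothetical.

```python
def confusion_counts(true_idx, pred_idx, n_classes):
    """counts[t][p] = number of images with true class t predicted as class p."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_idx, pred_idx):
        counts[t][p] += 1
    return counts


def row_normalize(counts):
    """Divide each row by its total so every row with support sums to 1."""
    out = []
    for row in counts:
        total = sum(row)
        out.append([c / total if total else 0.0 for c in row])
    return out


def top_confused_pairs(counts, labels, k=15):
    """Off-diagonal (count, true, pred) triples, most frequent first.

    Assumption: ranked by raw count, not by normalized rate.
    """
    n = len(labels)
    pairs = [
        (counts[t][p], labels[t], labels[p])
        for t in range(n)
        for p in range(n)
        if t != p and counts[t][p] > 0
    ]
    pairs.sort(key=lambda item: -item[0])
    return pairs[:k]


# Toy example with 3 classes (hypothetical data).
labels = ["eczema", "psoriasis", "rosacea"]
true_idx = [0, 0, 0, 1, 1, 1, 1, 2, 2]
pred_idx = [0, 0, 1, 0, 0, 0, 1, 2, 0]
counts = confusion_counts(true_idx, pred_idx, len(labels))
norm = row_normalize(counts)
pairs = top_confused_pairs(counts, labels)
for n_err, true_label, pred_label in pairs:
    print(f"{true_label} -> {pred_label}: {n_err}")
```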
Fitzpatrick data gap (key finding)
Round 2 asks for Fitzpatrick-stratified accuracy, but this cannot be computed: the v1 manifest has no Fitzpatrick / skin-tone column (verified by listing all 45 manifest columns - none match fitzpatrick / skin_tone / ITA / phototype). The tool detects this automatically and prints a loud data-gap warning instead of silently skipping it.
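One way such a check could look, as a hedged sketch: the regular expression, function names, and warning text below are illustrative assumptions, not the tool's actual implementation.

```python
import re

# Assumed name patterns; "ita" must be a whole token so that a column
# such as "digital" does not match by accident.
SKIN_TONE_PATTERN = re.compile(
    r"fitzpatrick|skin[_ ]?tone|(^|_)ita($|_)|phototype", re.IGNORECASE
)


def find_skin_tone_columns(columns):
    """Return the manifest columns whose names look like skin-tone fields."""
    return [c for c in columns if SKIN_TONE_PATTERN.search(c)]


def check_fitzpatrick(columns):
    """Print a loud warning instead of silently skipping the stratification."""
    hits = find_skin_tone_columns(columns)
    if not hits:
        print(
            "WARNING: DATA GAP - no Fitzpatrick / skin-tone column in manifest; "
            "Fitzpatrick-stratified accuracy cannot be computed."
        )
    return hits


# Hypothetical subset of manifest columns.
found = check_fitzpatrick(["image_uuid", "gender", "age", "location_primary"])
```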
Recommended remediation:
- Capture Fitzpatrick at intake (patient self-report or clinician assessment) so future data supports the stratification directly; and/or
- Derive an ITA-degree (Individual Typology Angle) skin-tone proxy from image pixels as a follow-up, on healthy perilesional skin, as an interim stand-in for Fitzpatrick.
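For reference, the ITA is conventionally defined from CIELAB values as ITA = arctan((L* - 50) / b*) x 180 / pi. The sketch below assumes L* and b* have already been measured on perilesional skin; the RGB-to-CIELAB conversion and patch selection are not shown, and the function name is hypothetical.

```python
import math


def ita_degrees(l_star, b_star):
    """Individual Typology Angle in degrees from CIELAB L* and b*.

    atan2 equals arctan((L* - 50) / b*) for b* > 0, the usual case for skin,
    and avoids a division by zero when b* is 0.
    """
    return math.degrees(math.atan2(l_star - 50.0, b_star))


# Hypothetical patch averages: lighter skin gives a larger angle.
print(ita_degrees(70.0, 20.0))
print(ita_degrees(50.0, 15.0))
```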
As a partial substitute the tool stratifies top-1 accuracy by the demographics that do
exist (gender, age bucket, body site). Note that the manifest age column contains
mixed / implausible units for a subset of rows (values well above 120), so ages outside
0-120 are bucketed as unknown rather than trusted.
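A sketch of the plausibility guard described above. Only the 65+ and unknown buckets are named in this report; the other bucket edges here are placeholder assumptions.

```python
def age_bucket(age):
    """Map a raw manifest age to a bucket; implausible values become 'unknown'."""
    if age is None or age != age:  # None or NaN
        return "unknown"
    if not 0 <= age <= 120:  # mixed-unit / garbage values are not trusted
        return "unknown"
    if age < 18:
        return "0-17"   # hypothetical bucket edge
    if age < 40:
        return "18-39"  # hypothetical bucket edge
    if age < 65:
        return "40-64"  # hypothetical bucket edge
    return "65+"


for raw in (7, 30, 70, 700, None, float("nan")):
    print(raw, "->", age_bucket(raw))
```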
How to run it
cd src/training
uv run --package ddtrain --extra eval python scripts/eval/clinical_eval.py \
--probs ~/.cache/dermadetect/round2/test_probs.parquet \
--labels-json ~/.cache/dermadetect/round2/target_labels.json \
--manifest ~/.cache/dermadetect/v1/manifest.parquet \
--out src/training/eval_output/clinical

All flags default to the contract paths, so ... clinical_eval.py with no arguments works.
Real results (full 15,561-image test split, gen2a-medsiglip)
The harness was run on the real scored probabilities for the current best model over the full test split (n=15,561):
=== Overall ===
top-1 accuracy : 0.7360
top-3 accuracy : 0.9276
macro-F1 : 0.6813
balanced accuracy : 0.6633
n images : 15561

Per-class (F1)
- Best: onychomycosis 0.943, tinea versicolor 0.884, tinea pedis 0.864, acne vulgaris 0.857, molluscum contagiosum 0.827.
- Worst: post-inflammatory hyperpigmentation (PIH) 0.181 (recall 0.108 - the model almost never predicts it), rosacea 0.409, viral exanthem 0.525, psoriasis 0.529, herpes simplex 0.585, herpes zoster 0.592, folliculitis 0.613.
Imbalance signal: balanced accuracy (0.663) sits well below top-1 (0.736). The gap is driven by rare inflammatory / pigmented classes (PIH, rosacea, viral exanthem) whose low recall drags the per-class mean down even though the frequent classes score well.
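The toy example below (hypothetical data, illustrative function name) shows the mechanism: one frequent class with good recall and one rare class with poor recall give a top-1 accuracy well above the balanced accuracy, because balanced accuracy and macro-F1 weight every class equally regardless of support.

```python
def per_class_metrics(true_idx, pred_idx, n_classes):
    """Precision / recall / F1 / support per class from top-1 predictions."""
    rows = []
    pairs = list(zip(true_idx, pred_idx))
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append({"class": c, "precision": precision, "recall": recall,
                     "f1": f1, "support": tp + fn})
    return rows


# Class 0 is frequent (8 images, 7 correct); class 1 is rare (2 images, 1 correct).
true_idx = [0] * 8 + [1] * 2
pred_idx = [0] * 7 + [1] + [0, 1]
metrics = per_class_metrics(true_idx, pred_idx, n_classes=2)

top1 = sum(t == p for t, p in zip(true_idx, pred_idx)) / len(true_idx)
balanced_acc = sum(m["recall"] for m in metrics) / len(metrics)  # mean per-class recall
macro_f1 = sum(m["f1"] for m in metrics) / len(metrics)
print(f"top-1={top1:.4f} balanced={balanced_acc:.4f} macro-F1={macro_f1:.4f}")
```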
Abstain / refer-to-specialist (confidence = max class prob)
- top-1: reaching 90% covered-set top-1 requires abstaining on 44.5% of cases (coverage 0.555, threshold 0.774); 95% top-1 pushes coverage down to 0.369. Abstaining on nearly half the caseload is a weak operating point.
- top-3: already 0.928 at full coverage. Reaching 95% covered-set top-3 needs abstaining on only ~13% (coverage 0.871). This is the strong clinical story: present the top-3 as suggestions and refer only the least-confident ~13% to a specialist.
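The coverage figures above come from an accuracy@coverage sweep. A minimal sketch of such a sweep follows, assuming distinct confidence values (ties would need to be grouped under one threshold); the function names and toy data are illustrative, not the tool's own.

```python
def accuracy_at_coverage(confidences, correct):
    """Sweep the abstain threshold from strict to lenient.

    Keeping the k most confident cases gives coverage k/n and
    covered-set accuracy hits/k. Returns (threshold, coverage, accuracy).
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    curve, hits = [], 0
    for k, i in enumerate(order, start=1):
        hits += correct[i]
        curve.append((confidences[i], k / len(order), hits / k))
    return curve


def max_coverage_at(curve, target):
    """Largest-coverage point whose covered-set accuracy meets the target."""
    ok = [point for point in curve if point[2] >= target]
    return max(ok, key=lambda point: point[1]) if ok else None


# Toy data (hypothetical): 1 = top-1 (or top-3) hit, 0 = miss.
conf = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
hit = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
curve = accuracy_at_coverage(conf, hit)
point_85 = max_coverage_at(curve, 0.85)
point_90 = max_coverage_at(curve, 0.90)
print("85% target:", point_85)
print("90% target:", point_90)
```

Passing top-3 hits instead of top-1 hits as the correct vector yields the top-3 curve from the same confidence ranking.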
Subgroups
- Gender: small gap - female 0.747 vs male 0.722.
- Age: 65+ is flagged LOW at 0.679 (support 330); the unknown age bucket is 0.714 (this absorbs the ~20% of rows with garbage / mixed-unit ages - see the age caveat above).
- Body site: groin area flagged LOW at 0.569 (support 355) and back LOW at 0.678 - the two most notable weak spots.
- Fitzpatrick / skin tone: still absent (see data gap above); the flag and remediation recommendation stand.
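A sketch of the flagging rule behind the LOW markers, assuming the thresholds stated earlier (minimum support 50, gap of at least 5 points below overall top-1). The record layout, constant names, and function name are hypothetical.

```python
MIN_SUPPORT = 50   # subgroups below this are shown but never flagged
LOW_GAP = 0.05     # flag when top-1 trails overall by at least 5 points


def flag_subgroups(records, key, overall_top1):
    """Return (subgroup, n, top-1, flag) rows for one stratification column."""
    groups = {}
    for rec in records:
        stats = groups.setdefault(rec[key], [0, 0])
        stats[0] += 1
        stats[1] += rec["correct"]
    rows = []
    for name, (n, hits) in sorted(groups.items()):
        acc = hits / n
        low = n >= MIN_SUPPORT and (overall_top1 - acc) >= LOW_GAP
        rows.append((name, n, acc, "LOW" if low else ""))
    return rows


# Toy records (hypothetical): sites a and b have enough support, c does not.
records = (
    [{"site": "a", "correct": 1}] * 60 + [{"site": "a", "correct": 0}] * 40
    + [{"site": "b", "correct": 1}] * 80 + [{"site": "b", "correct": 0}] * 20
    + [{"site": "c", "correct": 1}] * 2 + [{"site": "c", "correct": 0}] * 8
)
overall = sum(r["correct"] for r in records) / len(records)
table = flag_subgroups(records, "site", overall)
for row in table:
    print(row)
```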
Feeds Stream A taxonomy review
PIH, rosacea, and viral exanthem are the prime candidates for label / taxonomy scrutiny
given their low recall (PIH especially: at recall 0.108 it is effectively never predicted).
The confusion-pair output (confusion_matrix.csv + printed top pairs) shows where their
true cases are being routed instead, which Stream A can use to decide whether these are
labeling-quality issues, genuinely hard visual overlaps, or candidates for merging /
“other”-class handling.
Analysis & recommendations
Eczema is the #1 error sink
eczema uns is only mildly overpredicted by volume (true=2567, pred=2854, ratio 1.11,
precision 0.645, recall 0.718), but it absorbs 1012 false positives = 24.6% of all top-1
errors - by far the largest error sink. It is where the model dumps cases it can’t place.
The donor breakdown (share of each donor class’s own test images that land in eczema):
| Donor true class | into eczema (% of donor) |
|---|---|
| psoriasis | 22.2% |
| viral exanthem | 22.1% |
| pityriasis rosea | 21.1% |
| intertrigo | 12.9% |
| seborrheic dermatitis | 11.4% |
| insect bite | 10.4% |
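The two quantities above (share of all top-1 errors absorbed by a class, and the donor breakdown as a share of each donor's own images) can be computed as in this sketch. Function names and the toy labels are illustrative assumptions.

```python
from collections import Counter


def error_sinks(true_labels, pred_labels):
    """(class, false positives, share of ALL top-1 errors), largest sink first."""
    fp = Counter(p for t, p in zip(true_labels, pred_labels) if t != p)
    total_errors = sum(fp.values())
    return [(cls, n, n / total_errors) for cls, n in fp.most_common()]


def donors_into(sink, true_labels, pred_labels):
    """(donor class, share of that donor's own images misfiled into sink)."""
    support = Counter(true_labels)
    into = Counter(
        t for t, p in zip(true_labels, pred_labels) if p == sink and t != sink
    )
    rows = [(donor, n / support[donor]) for donor, n in into.items()]
    return sorted(rows, key=lambda row: -row[1])


# Toy data (hypothetical): eczema absorbs 3 of the 5 errors.
true_labels = ["psoriasis"] * 4 + ["rosacea"] * 5 + ["eczema"] * 3
pred_labels = (
    ["eczema", "psoriasis", "psoriasis", "psoriasis"]
    + ["eczema", "eczema", "rosacea", "rosacea", "psoriasis"]
    + ["eczema", "eczema", "rosacea"]
)
sinks = error_sinks(true_labels, pred_labels)
donors = donors_into(sinks[0][0], true_labels, pred_labels)
print(sinks[0], donors)
```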
Fix priority:
1. Stop the sink (Stream B): class-balanced or focal loss so eczema stops acting as the default catch-all; it is the single highest-leverage training change.
2. Clean the donors (Stream A): run cleanlab over the top donor classes - psoriasis, viral exanthem, pityriasis rosea - to find mislabeled/ambiguous images that are effectively teaching “when unsure, say eczema.”
3. Temperature-scale the logits so the confidence score is calibrated (see caveat below).
4. Until (1)-(3) land, the tiered policy below already contains the risk by routing the low-confidence eczema dumping-ground into the Possible/No-report tiers.
Recommended 3-tier policy: Confident / Possible / No-report
Confidence = max class prob. Splitting at the 50% / 87% coverage percentiles (cases ranked from most to least confident):
| Tier | Confidence | Share | top-1 | top-3 |
|---|---|---|---|---|
| Confident | >= 0.815 | 50.0% | 0.917 | 0.978 |
| Possible | 0.433 - 0.815 | 37.0% | 0.624 | 0.912 |
| No-report | < 0.433 | 13.0% | 0.358 | 0.777 |
Single-decision outcome mix: confident-correct 45.9%, confident-wrong 4.1%, possible-in-top3 33.8%, possible-miss 3.2%, no-report 13.0%. In words: auto-answer half the caseload at 91.7% top-1, offer top-3 suggestions for the next 37% (truth is in the list 91% of the time), and refer only the bottom 13%.
Tuning knob: tightening the Confident tier to the top ~37% of confidence raises its top-1
to ~95% and drops confident-wrong to ~1.9% - trading auto-answer volume for safety. Exposed
via --tier-splits (e.g. --tier-splits 0.37,0.87).
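A sketch of how coverage-percentile splits translate into confidence cut points. It assumes the splits are cumulative coverage shares counted from the most confident case downward, which matches the 50.0% / 37.0% / 13.0% shares in the table; the function names are hypothetical.

```python
from collections import Counter


def tier_thresholds(confidences, splits=(0.50, 0.87)):
    """Confidence cut points for the given cumulative coverage shares.

    Assumption: the top splits[0] share becomes Confident and everything
    below the top splits[1] share becomes No-report.
    """
    ranked = sorted(confidences, reverse=True)
    n = len(ranked)
    hi = ranked[max(round(splits[0] * n) - 1, 0)]
    lo = ranked[max(round(splits[1] * n) - 1, 0)]
    return hi, lo


def assign_tier(conf, hi, lo):
    if conf >= hi:
        return "Confident"
    if conf >= lo:
        return "Possible"
    return "No-report"


# Toy confidences 0.01 .. 1.00 (hypothetical, all distinct).
confidences = [i / 100 for i in range(1, 101)]
hi, lo = tier_thresholds(confidences)
shares = Counter(assign_tier(c, hi, lo) for c in confidences)
print(hi, lo, dict(shares))
```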
Caveat (important): these thresholds are derived from uncalibrated confidences and were fit on the test set. They are indicative, not deployable as-is. After temperature scaling, the tier cut points must be re-fit on a held-out calibration split (not test) before they drive any production routing.
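For orientation, temperature scaling of sigmoid outputs could look like the sketch below: recover logits from the stored probabilities, then pick the single temperature T that minimizes binary cross-entropy on a held-out calibration split. The grid search, the shared-across-classes temperature, and all names here are assumptions made for illustration; an optimizer-based fit on the raw logits would be the more usual choice.

```python
import math


def logit(p, eps=1e-7):
    """Inverse sigmoid, clipped so p = 0 or 1 stays finite."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def bce_at_temperature(logits, targets, t):
    """Mean binary cross-entropy of sigmoid(logit / t) against 0/1 targets."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = min(max(sigmoid(z / t), 1e-12), 1 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(logits)


def fit_temperature(logits, targets, grid=None):
    """Grid-search T on a calibration split (never on the test split)."""
    grid = grid or [0.5 + 0.05 * i for i in range(71)]  # 0.5 .. 4.0
    return min(grid, key=lambda t: bce_at_temperature(logits, targets, t))


def calibrate(p, t):
    """Apply a fitted temperature to one stored probability."""
    return sigmoid(logit(p) / t)


# Toy calibration split (hypothetical): the model says ~0.88 but is right 70% of the time.
cal_logits = [2.0] * 10
cal_targets = [1] * 7 + [0] * 3
T = fit_temperature(cal_logits, cal_targets)
print(T, calibrate(sigmoid(2.0), T))
```

A fitted T above 1 means the raw confidences were overconfident; the tier cut points would then be re-derived from the calibrated confidences on that same held-out split.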
Self-test (synthetic, harness validation)
Before the real probabilities existed, the harness was validated end-to-end with
_synth_probs_for_test.py, which samples real test-split rows and fabricates probabilities
correct ~73% of the time. It produced every output (per-class + curve CSVs, confusion CSV,
3 subgroup CSVs, 2 PNGs, summary.json) without error and correctly emitted the Fitzpatrick
data-gap warning. To reproduce:
cd src/training
uv run --package ddtrain --extra eval python scripts/eval/_synth_probs_for_test.py \
--n 2000 --accuracy 0.73 --out /tmp/round2_synth
uv run --package ddtrain --extra eval python scripts/eval/clinical_eval.py \
--probs /tmp/round2_synth/test_probs.parquet \
--labels-json /tmp/round2_synth/target_labels.json \
--out /tmp/round2_synth/clinical_out
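
For readers without access to the repository, the idea behind the synthetic generator can be sketched as follows. This is not the code of _synth_probs_for_test.py: the function name, the probability ranges, and the seeding are assumptions chosen so that the winning class is always the unique argmax.

```python
import random


def synth_probs(true_idx, n_classes, accuracy=0.73, seed=0):
    """Fabricate sigmoid-like rows whose argmax is correct ~accuracy of the time."""
    rng = random.Random(seed)
    rows = []
    for t in true_idx:
        # Background probabilities stay at or below 0.4 ...
        probs = [rng.uniform(0.0, 0.4) for _ in range(n_classes)]
        if rng.random() < accuracy:
            winner = t
        else:
            winner = rng.choice([c for c in range(n_classes) if c != t])
        # ... so the winner (at least 0.5) is always the top-1 prediction.
        probs[winner] = rng.uniform(0.5, 1.0)
        rows.append(probs)
    return rows


# 2000 synthetic rows over 25 classes, mirroring --n 2000 --accuracy 0.73.
labels = [i % 25 for i in range(2000)]
probs = synth_probs(labels, n_classes=25, accuracy=0.73, seed=0)
acc = sum(
    max(range(25), key=row.__getitem__) == t for row, t in zip(probs, labels)
) / len(labels)
print(f"synthetic top-1 = {acc:.3f}")
```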