Round 2 Stream C: clinical evaluation of test-set probabilities

What this stream delivered

A CPU-only analysis tool that turns a file of pre-computed test-set probabilities into the clinical read-outs the Round 2 model review asks for. It runs no CV model and needs no GPU - it consumes sigmoid multilabel probabilities produced by another Round 2 stream and joins them back to the curated v1 manifest for subgroup analysis.

New files (all under src/training/scripts/eval/):

clinical_eval.py - the deliverable.
_synth_probs_for_test.py - a synthetic probabilities generator used only to self-test the harness end-to-end before real probabilities exist. Its output is never committed (writes to /tmp by default).

matplotlib>=3.8 was added to the eval optional-dependency group in src/training/pyproject.toml (the confusion-matrix and abstain-curve PNGs need it).

Input contract

clinical_eval.py reads exactly what a peer stream will produce:

~/.cache/dermadetect/round2/test_probs.parquet with columns image_uuid (str), patient_id (i64), diagnosis (str, true label), true_idx (i32, index into the ordered target-label list), and prob_0..prob_24 (f32 sigmoid probabilities for the 25 target classes).
~/.cache/dermadetect/round2/target_labels.json - JSON list of the 25 label strings in prob-column order.

Probabilities are treated as sigmoid multilabel outputs: they need not sum to 1 and are used as-is for argmax / topk (top-1, top-3) and as the confidence score (max class prob). This matches the torch.sigmoid(...)[:, target_idx] convention in score_full_test.py.
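The scoring rule above can be sketched per row as follows. This is a dependency-free illustration, not the code in clinical_eval.py: the function names and the tie-breaking rule (lower index wins) are assumptions, and the toy row has 5 classes rather than 25.

```python
def top_k_indices(probs, k):
    """Indices of the k largest probabilities, highest first.

    Ties are broken by lower class index (an assumption for this sketch).
    """
    return sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]


def score_row(probs, true_idx):
    """Top-1 / top-3 hit flags and the confidence score for one image."""
    top3 = top_k_indices(probs, 3)
    return {
        "pred_idx": top3[0],
        # Confidence is the max class prob; sigmoid rows need not sum to 1.
        "confidence": probs[top3[0]],
        "top1_hit": top3[0] == true_idx,
        "top3_hit": true_idx in top3,
    }


# Toy 5-class sigmoid row (the real contract has prob_0..prob_24).
row = [0.10, 0.85, 0.40, 0.70, 0.05]
result = score_row(row, true_idx=3)
print(result)
```

In this toy row the true class is ranked second, so it misses top-1 but counts as a top-3 hit.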

What it produces

Outputs go to src/training/eval_output/clinical/ (gitignored):

Per-class precision / recall / F1 / support using the top-1 argmax over the 25 classes, as per_class_metrics.csv plus a printed table sorted by F1. Also reports overall top-1, top-3, macro-F1, balanced accuracy (mean per-class recall), and n.
25x25 row-normalized confusion matrix as confusion_matrix.csv and a confusion_matrix.png heatmap, plus the top ~15 most-confused ordered class pairs (true -> pred, off-diagonal) - the input to the taxonomy / “other”-class review.
Abstain / refer-to-specialist policy using max class prob as the confidence score. Sweeps the abstain threshold to produce an accuracy@coverage curve (abstain_curve.csv + .png), and reports the coverage achievable while holding covered-set top-1 (and top-3) at >= 90% and >= 95%, with the threshold for each.
Overprediction / error-sink analysis (overprediction.csv + a donor CSV): per class, predicted-count vs true-count and the overprediction ratio; classes ranked by the share of ALL top-1 false positives they absorb; and, for the #1 sink, the breakdown of which true classes are misfiled into it (as % of each donor class’s test images).
Tiered policy analysis (tiered_policy.csv): a 3-tier Confident / Possible / No-report policy split at the --tier-splits coverage percentiles (default 0.50,0.87), printing per-tier n / % / top-1 / top-3 and the single-decision outcome mix (confident-correct, confident-wrong, possible-in-top3, possible-miss, no-report).
Fairness / subgroup stratification by gender, an age bucket, and body site (location_primary), joining back to manifest.parquet on image_uuid. Subgroups below 50 support are shown but never flagged; a subgroup whose top-1 trails overall by >= 5 points is flagged LOW.
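For reference, the per-class metrics in the first output can be computed from the true and top-1 predicted indices alone. The sketch below is illustrative only; the function name is hypothetical, and returning 0.0 when a class has no predictions or no support is an assumed convention, not a confirmed behavior of the tool.

```python
from collections import Counter


def per_class_metrics(true_idx, pred_idx, n_classes):
    """Precision / recall / F1 / support per class from top-1 predictions."""
    tp = Counter(t for t, p in zip(true_idx, pred_idx) if t == p)
    support = Counter(true_idx)    # number of true images per class
    predicted = Counter(pred_idx)  # number of top-1 predictions per class
    rows = []
    for c in range(n_classes):
        # Assumed convention: undefined ratios are reported as 0.0.
        precision = tp[c] / predicted[c] if predicted[c] else 0.0
        recall = tp[c] / support[c] if support[c] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append({"class": c, "precision": precision, "recall": recall,
                     "f1": f1, "support": support[c]})
    return rows


# Toy 3-class example.
true_idx = [0, 0, 0, 1, 1, 2]
pred_idx = [0, 0, 1, 1, 0, 2]
rows = per_class_metrics(true_idx, pred_idx, n_classes=3)
macro_f1 = sum(r["f1"] for r in rows) / len(rows)  # unweighted mean over classes

# Print sorted by F1, best first, as in the report's table.
for r in sorted(rows, key=lambda r: r["f1"], reverse=True):
    print(r)
print("macro-F1:", round(macro_f1, 4))
```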

Fitzpatrick data gap (key finding)

Round 2 asks for Fitzpatrick-stratified accuracy, but this cannot be computed: the v1 manifest has no Fitzpatrick / skin-tone column (verified by listing all 45 manifest columns - none match fitzpatrick / skin_tone / ITA / phototype). The tool detects this automatically and prints a loud data-gap warning instead of silently skipping it.

Recommended remediation:

Capture Fitzpatrick at intake (patient self-report or clinician assessment) so future data supports the stratification directly; and/or
Derive an ITA-degree (Individual Typology Angle) skin-tone proxy from image pixels as a follow-up, on healthy perilesional skin, as an interim stand-in for Fitzpatrick.
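The ITA proxy in the second option is conventionally defined from CIELAB values as ITA = arctan((L* - 50) / b*) x 180 / pi. A minimal sketch, assuming the healthy perilesional patch has already been selected and converted to CIELAB (neither step is shown here); the function names and the use of a median over pixels are illustrative assumptions:

```python
import math
from statistics import median


def ita_degrees(l_star, b_star):
    """Individual Typology Angle in degrees from CIELAB L* and b*.

    atan2 equals arctan((L* - 50) / b*) for b* > 0, the usual case for
    skin, and avoids a division by zero when b* == 0.
    """
    return math.degrees(math.atan2(l_star - 50.0, b_star))


def patch_ita(lab_pixels):
    """Median ITA over an iterable of (L*, a*, b*) pixels (assumed aggregation)."""
    return median(ita_degrees(l, b) for l, _a, b in lab_pixels)


# Hypothetical CIELAB pixels from a healthy-skin patch.
patch = [(70.0, 10.0, 20.0), (60.0, 12.0, 10.0), (50.0, 8.0, 15.0)]
patch_value = patch_ita(patch)
print(round(patch_value, 1))
```

Higher ITA corresponds to lighter skin. Any mapping from ITA ranges to Fitzpatrick types would be an additional approximation and is left out of this sketch.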

As a partial substitute the tool stratifies top-1 accuracy by the demographics that do exist (gender, age bucket, body site). Note that the manifest age column contains mixed / implausible units for a subset of rows (values well above 120), so ages outside 0-120 are bucketed as unknown rather than trusted.
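The age guard can be sketched as below. Only the 0-120 plausibility range, the unknown bucket, and the 65+ bucket come from this report; the other bucket edges and the function name are assumptions made for illustration.

```python
import math


def age_bucket(age):
    """Map a raw manifest age to a bucket; implausible values become 'unknown'."""
    try:
        a = float(age)
    except (TypeError, ValueError):
        return "unknown"          # missing or non-numeric
    if math.isnan(a) or not 0 <= a <= 120:
        return "unknown"          # mixed / implausible units are not trusted
    # Bucket edges below 65 are hypothetical.
    if a < 18:
        return "0-17"
    if a < 40:
        return "18-39"
    if a < 65:
        return "40-64"
    return "65+"


for raw in (34, 70, 365, -1, None, float("nan")):
    print(repr(raw), "->", age_bucket(raw))
```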

How to run it


cd src/training
uv run --package ddtrain --extra eval python scripts/eval/clinical_eval.py \
    --probs ~/.cache/dermadetect/round2/test_probs.parquet \
    --labels-json ~/.cache/dermadetect/round2/target_labels.json \
    --manifest ~/.cache/dermadetect/v1/manifest.parquet \
    --out eval_output/clinical

All flags default to the contract paths, so running ... clinical_eval.py with no arguments also works.

Real results (full 15,561-image test split, gen2a-medsiglip)

The harness was run on the real scored probabilities for the current best model over the full test split (n=15,561):


=== Overall ===
  top-1 accuracy    : 0.7360
  top-3 accuracy    : 0.9276
  macro-F1          : 0.6813
  balanced accuracy : 0.6633
  n images          : 15561

Per-class (F1)

Best: onychomycosis 0.943, tinea versicolor 0.884, tinea pedis 0.864, acne vulgaris 0.857, molluscum contagiosum 0.827.
Worst: post-inflammatory hyperpigmentation (PIH) 0.181 (recall 0.108 - the model almost never predicts it), rosacea 0.409, viral exanthem 0.525, psoriasis 0.529, herpes simplex 0.585, herpes zoster 0.592, folliculitis 0.613.

Imbalance signal: balanced accuracy (0.663) sits well below top-1 (0.736). The gap is driven by rare inflammatory / pigmented classes (PIH, rosacea, viral exanthem) whose low recall drags the per-class mean down even though the frequent classes score well.
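A toy illustration of why the two numbers diverge under class imbalance. The data below is synthetic and far more extreme than the real split; the function name is hypothetical.

```python
from collections import Counter


def top1_and_balanced_accuracy(true_idx, pred_idx):
    """Overall top-1 accuracy and balanced accuracy (mean per-class recall)."""
    support = Counter(true_idx)
    hits = Counter(t for t, p in zip(true_idx, pred_idx) if t == p)
    top1 = sum(hits.values()) / len(true_idx)
    balanced = sum(hits[c] / n for c, n in support.items()) / len(support)
    return top1, balanced


# 90 images of a frequent class (recall 0.9) and 10 of a rare class (recall 0.1).
true_idx = [0] * 90 + [1] * 10
pred_idx = [0] * 81 + [1] * 9 + [1] * 1 + [0] * 9
top1, balanced = top1_and_balanced_accuracy(true_idx, pred_idx)
print(f"top-1 {top1:.2f}, balanced {balanced:.2f}")
```

Top-1 is dominated by the frequent class, while balanced accuracy weights both classes equally, so the rare class's low recall pulls it down.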

Abstain / refer-to-specialist (confidence = max class prob)

top-1: reaching 90% covered-set top-1 requires abstaining on 44.5% of cases (coverage 0.555, threshold 0.774); 95% top-1 pushes coverage down to 0.369. Abstaining on nearly half the caseload is a weak operating point.
top-3: already 0.928 at full coverage. Reaching 95% covered-set top-3 needs abstaining on only ~13% (coverage 0.871). This is the strong clinical story: present the top-3 as suggestions and refer only the least-confident ~13% to a specialist.
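The accuracy@coverage sweep behind these numbers can be sketched as follows. This is an illustration under stated assumptions, not the tool's code: it assumes distinct confidence values, and it reports the largest coverage that meets the target, which may differ from how clinical_eval.py selects its operating point.

```python
def accuracy_at_coverage(confidences, correct):
    """Curve of (coverage, threshold, covered-set accuracy).

    Cases are ranked by descending confidence; point k covers the k most
    confident cases, with the threshold equal to the k-th confidence.
    """
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    curve, hits = [], 0
    for rank, i in enumerate(order, start=1):
        hits += correct[i]
        curve.append((rank / len(order), confidences[i], hits / rank))
    return curve


def max_coverage_at(curve, target):
    """Largest-coverage point whose covered-set accuracy is >= target, or None."""
    ok = [pt for pt in curve if pt[2] >= target]
    return max(ok) if ok else None  # tuples compare on coverage first


# Toy data: 10 cases, 1 = top-1 correct.
confidences = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
correct = [1, 1, 1, 1, 0, 1, 1, 0, 0, 1]
curve = accuracy_at_coverage(confidences, correct)
print(max_coverage_at(curve, 0.90))
print(max_coverage_at(curve, 0.85))
```

Covered-set accuracy is not monotone in coverage, so the sweep has to scan the whole curve rather than stop at the first point that falls below the target.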

Subgroups

Gender: small gap - female 0.747 vs male 0.722.
Age: 65+ is flagged LOW at 0.679 (support 330); the unknown age bucket is 0.714 (this absorbs the ~20% of rows with garbage / mixed-unit ages - see the age caveat above).
Body site: groin area flagged LOW at 0.569 (support 355) and back LOW at 0.678 - the two most notable weak spots.
Fitzpatrick / skin tone: still absent (see data gap above); the flag and remediation recommendation stand.
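The flagging rule (at least 50 support, top-1 trailing overall by >= 5 points) can be written as a small predicate. The constant and function names are illustrative; the first two example rows use figures reported above, and the last two are hypothetical.

```python
MIN_SUPPORT = 50   # subgroups below this are shown but never flagged
GAP_POINTS = 5.0   # flag when top-1 trails overall by >= 5 percentage points


def flag_subgroup(subgroup_top1, support, overall_top1):
    """Return 'LOW' for a sufficiently large subgroup that trails overall top-1."""
    if support < MIN_SUPPORT:
        return ""
    gap = (overall_top1 - subgroup_top1) * 100.0
    # Small epsilon so a gap of exactly 5 points is not lost to float rounding.
    return "LOW" if gap >= GAP_POINTS - 1e-9 else ""


OVERALL_TOP1 = 0.736
examples = [
    ("age 65+", 0.679, 330),                       # reported above
    ("groin area", 0.569, 355),                    # reported above
    ("hypothetical small site", 0.500, 30),        # below min support
    ("hypothetical near-parity group", 0.720, 400),
]
flags = {name: flag_subgroup(acc, n, OVERALL_TOP1) for name, acc, n in examples}
for name, flag in flags.items():
    print(f"{name:32s} {flag or '-'}")
```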

Feeds Stream A taxonomy review

PIH, rosacea, and viral exanthem are the prime candidates for label / taxonomy scrutiny given their low recall (PIH especially: at recall 0.108 it is effectively never predicted). The confusion-pair output (confusion_matrix.csv + printed top pairs) shows where their true cases are being routed instead, which Stream A can use to decide whether these are labeling-quality issues, genuinely hard visual overlaps, or candidates for merging / “other”-class handling.
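A sketch of how ordered confusion pairs can be extracted for this kind of review. Ranking by the row-normalized rate (share of the true class's images sent to the predicted class) is an assumption; the tool may rank by raw count instead. Labels and data are toy placeholders.

```python
from collections import Counter


def most_confused_pairs(true_labels, pred_labels, top_n=15):
    """Off-diagonal (true -> pred) pairs ranked by row-normalized rate."""
    support = Counter(true_labels)
    pair_counts = Counter(
        (t, p) for t, p in zip(true_labels, pred_labels) if t != p
    )
    ranked = sorted(
        pair_counts.items(),
        key=lambda kv: (-kv[1] / support[kv[0][0]], kv[0]),
    )
    # Each entry: (true class, predicted class, count, share of true class).
    return [(t, p, n, n / support[t]) for (t, p), n in ranked[:top_n]]


# Toy labels: class A is mostly misrouted to B.
true_labels = ["A"] * 4 + ["B"] * 5 + ["C"]
pred_labels = ["A", "B", "B", "B"] + ["B", "B", "B", "B", "C"] + ["C"]
pairs = most_confused_pairs(true_labels, pred_labels)
for t, p, n, rate in pairs:
    print(f"{t} -> {p}: {n} images ({rate:.0%} of true {t})")
```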

Analysis & recommendations

Eczema is the #1 error sink

eczema uns is only mildly overpredicted by volume (true=2567, pred=2854, ratio 1.11, precision 0.645, recall 0.718), but it absorbs 1012 false positives = 24.6% of all top-1 errors - by far the largest error sink. It is where the model dumps cases it can’t place. The donor breakdown (share of each donor class’s own test images that land in eczema):

Donor true class	into eczema (% of donor)
psoriasis	22.2%
viral exanthem	22.1%
pityriasis rosea	21.1%
intertrigo	12.9%
seborrheic dermatitis	11.4%
insect bite	10.4%
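The two quantities in this analysis, the share of all top-1 errors a class absorbs and the donor breakdown into a given sink, can be sketched as below. Function names and the toy labels are hypothetical.

```python
from collections import Counter


def error_sinks(true_labels, pred_labels):
    """Classes ranked by false positives: (class, fp_count, share of all errors)."""
    errors = [(t, p) for t, p in zip(true_labels, pred_labels) if t != p]
    fp_by_class = Counter(p for _t, p in errors)
    return [(c, n, n / len(errors)) for c, n in fp_by_class.most_common()]


def donors_into(sink, true_labels, pred_labels):
    """For one sink class: share of each donor class's images misfiled into it."""
    support = Counter(true_labels)
    misfiled = Counter(
        t for t, p in zip(true_labels, pred_labels) if p == sink and t != sink
    )
    return sorted(((t, n / support[t]) for t, n in misfiled.items()),
                  key=lambda x: -x[1])


# Toy data: 4 images each of donor_a, sink, donor_b.
true_labels = ["donor_a"] * 4 + ["sink"] * 4 + ["donor_b"] * 4
pred_labels = (["donor_a", "sink", "sink", "donor_a"]
               + ["sink", "sink", "sink", "donor_a"]
               + ["sink", "donor_b", "donor_b", "donor_b"])
sinks = error_sinks(true_labels, pred_labels)
donors = donors_into("sink", true_labels, pred_labels)
print(sinks)
print(donors)

# The overprediction ratio reported above is predicted count / true count.
eczema_ratio = 2854 / 2567
print(round(eczema_ratio, 2))
```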

Fix priority:

1. Stop the sink (Stream B): class-balanced or focal loss so eczema stops acting as the default catch-all; it is the single highest-leverage training change.
2. Clean the donors (Stream A): run cleanlab over the top donor classes - psoriasis, viral exanthem, pityriasis rosea - to find mislabeled / ambiguous images that are effectively teaching “when unsure, say eczema.”
3. Temperature-scale the logits so the confidence score is calibrated (see caveat below).
4. Until (1)-(3) land, the tiered policy below already contains the risk by routing the low-confidence eczema dumping-ground into the Possible/No-report tiers.

Recommended 3-tier policy: Confident / Possible / No-report

Confidence = max class prob. Cases are ranked by descending confidence and split at the 50% / 87% coverage percentiles:

Tier	Confidence	Share	top-1	top-3
Confident	>= 0.815	50.0%	0.917	0.978
Possible	0.433 - 0.815	37.0%	0.624	0.912
No-report	< 0.433	13.0%	0.358	0.777

Single-decision outcome mix: confident-correct 45.9%, confident-wrong 4.1%, possible-in-top3 33.8%, possible-miss 3.2%, no-report 13.0%. In words: auto-answer half the caseload at 91.7% top-1, offer top-3 suggestions for the next 37% (truth is in the list 91% of the time), and refer only the bottom 13%.
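A sketch of the tier assignment and the single-decision outcome labels, assuming the splits are cumulative coverage fractions over cases ranked by descending confidence (so 0.50,0.87 yields 50% / 37% / 13%). The function names are hypothetical and the confidences are synthetic.

```python
from collections import Counter


def assign_tiers(confidences, splits=(0.50, 0.87)):
    """Assign Confident / Possible / No-report by coverage rank."""
    n = len(confidences)
    order = sorted(range(n), key=lambda i: -confidences[i])
    tiers = [None] * n
    for rank, i in enumerate(order):
        frac = rank / n  # fraction of cases ranked strictly above this one
        if frac < splits[0]:
            tiers[i] = "Confident"
        elif frac < splits[1]:
            tiers[i] = "Possible"
        else:
            tiers[i] = "No-report"
    return tiers


def outcome(tier, top1_hit, top3_hit):
    """Single-decision outcome label for one case."""
    if tier == "Confident":
        return "confident-correct" if top1_hit else "confident-wrong"
    if tier == "Possible":
        return "possible-in-top3" if top3_hit else "possible-miss"
    return "no-report"


# 100 synthetic, distinct confidence values.
confidences = [i / 100 for i in range(100)]
tiers = assign_tiers(confidences)
tier_counts = Counter(tiers)
print(tier_counts)
print(outcome("Possible", top1_hit=False, top3_hit=True))
```

The reported mix is consistent with the tier table: 50.0% x 0.917 gives the 45.9% confident-correct share, and 50.0% x (1 - 0.917) gives roughly the 4.1% confident-wrong share.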

Tuning knob: tightening the Confident tier to the top ~37% of confidence raises its top-1 to ~95% and drops confident-wrong to ~1.9% - trading auto-answer volume for safety. Exposed via --tier-splits (e.g. --tier-splits 0.37,0.87, using the same coverage-percentile convention as the 0.50,0.87 default).

Caveat (important): these thresholds are derived from uncalibrated confidences and were fit on the test set. They are indicative, not deployable as-is. After temperature scaling, the tier cut points must be re-fit on a held-out calibration split (not test) before they drive any production routing.
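One way the calibration step could look, as a hedged sketch rather than the planned implementation: fit a single temperature T on a held-out calibration split by minimizing one-vs-rest binary cross-entropy of sigmoid(logit / T). The grid search, the loss choice, and all names below are assumptions, and the logits are synthetic.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(logit_rows, true_idx, temperature):
    """Mean one-vs-rest binary cross-entropy of sigmoid(logit / T)."""
    total, count = 0.0, 0
    for logits, t in zip(logit_rows, true_idx):
        for c, z in enumerate(logits):
            # Clamp to avoid log(0) on saturated outputs.
            p = min(max(sigmoid(z / temperature), 1e-12), 1.0 - 1e-12)
            total -= math.log(p) if c == t else math.log(1.0 - p)
            count += 1
    return total / count


def fit_temperature(logit_rows, true_idx, grid=None):
    """Pick the temperature on a grid (0.5 .. 5.0) that minimizes BCE."""
    grid = grid or [0.5 + 0.05 * k for k in range(91)]
    return min(grid, key=lambda temp: bce(logit_rows, true_idx, temp))


# Synthetic calibration split: every row is ~98% confident, but only 3 of 4
# are right, so the fitted temperature should come out well above 1.
calib_logits = [[4.0, -4.0], [4.0, -4.0], [4.0, -4.0], [-4.0, 4.0]]
calib_true = [0, 0, 0, 0]
fitted_t = fit_temperature(calib_logits, calib_true)
print(round(fitted_t, 2))
```

A single scalar temperature does not change the argmax, so top-1 and top-3 are unaffected; only the confidence values move, which is why the tier cut points need re-fitting afterwards.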

Self-test (synthetic, harness validation)

Before the real probabilities existed, the harness was validated end-to-end with _synth_probs_for_test.py, which samples real test-split rows and fabricates probabilities correct ~73% of the time. It produced every output (per-class + curve CSVs, confusion CSV, 3 subgroup CSVs, 2 PNGs, summary.json) without error and correctly emitted the Fitzpatrick data-gap warning. To reproduce:


cd src/training
uv run --package ddtrain --extra eval python scripts/eval/_synth_probs_for_test.py \
    --n 2000 --accuracy 0.73 --out /tmp/round2_synth
uv run --package ddtrain --extra eval python scripts/eval/clinical_eval.py \
    --probs /tmp/round2_synth/test_probs.parquet \
    --labels-json /tmp/round2_synth/target_labels.json \
    --out /tmp/round2_synth/clinical_out