Re-fit the tier thresholds (done — results below)

This started as a work brief for the GPU box. The eval has since been run, and the results and config change are committed on this branch. Headline: the production-shaped confidence distribution is nearly identical to the benchmark one. tier_confident_threshold stays at 0.815 (the refit lands on 0.8145), tier_refer_threshold moves from 0.433 to 0.444, and possible_item_threshold is validated at 0.25. Labels are still the noisy split (the dermatologist blind review has not come back), so all numbers remain provisional, as the originals were.

Why this needs doing now

The Round-2 abstain/referral UX assigns a tier from the model’s max class probability:

tier	rule (config value)	default
confident	maxprob ≥ `tier_confident_threshold`	0.815
possible	`tier_refer_threshold` ≤ maxprob < confident	0.433
refer	maxprob < `tier_refer_threshold`	—

In addition, within the possible tier, an item is highlighted as a match if its probability exceeds possible_item_threshold (default 0.25); at most 3 items are highlighted.
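The tier rule and the highlight rule above can be sketched in a few lines of Python. This is a minimal illustration, not the serving code: the function names `assign_tier` and `highlighted_items` are hypothetical, and the "at least one highlight" floor is an assumption taken from the "floor 1 / cap 3" wording of the sweep later in this doc.

```python
from typing import List, Sequence

# Defaults copied from the table above.
TIER_CONFIDENT_THRESHOLD = 0.815
TIER_REFER_THRESHOLD = 0.433
POSSIBLE_ITEM_THRESHOLD = 0.25


def assign_tier(probs: Sequence[float],
                confident: float = TIER_CONFIDENT_THRESHOLD,
                refer: float = TIER_REFER_THRESHOLD) -> str:
    """Map a class-probability vector to a tier via its max probability."""
    maxprob = max(probs)
    if maxprob >= confident:
        return "confident"
    if maxprob >= refer:
        return "possible"
    return "refer"


def highlighted_items(probs: Sequence[float],
                      item_threshold: float = POSSIBLE_ITEM_THRESHOLD,
                      floor: int = 1,
                      cap: int = 3) -> List[int]:
    """Class indices to highlight in the possible tier, best first.

    Keeps classes whose probability exceeds the cut, then clamps the
    count to [floor, cap] (the floor is an assumption, see lead-in).
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    n_above = sum(1 for i in ranked if probs[i] > item_threshold)
    n_keep = max(floor, min(cap, n_above))
    return ranked[:n_keep]


example = [0.50, 0.30, 0.15, 0.05]
tier = assign_tier(example)
matches = highlighted_items(example) if tier == "possible" else []
```

With the defaults, `example` falls in the possible tier and its two classes above 0.25 are highlighted.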

These three defaults were fit on the 15,561-image test split with full, correctly-named metadata (Round-2 Stream C accuracy@coverage curve). But two things have since changed what production actually feeds the model:

The metadata serving-skew fix (PR #113). Before it, production sent the model almost no metadata (2 of 19 fields). After it, the app anamnesis is normalized (ddmodels.anamnesis.normalize_anamnesis) and 14 of 19 fields reach the model. The confidence distribution production sees therefore differs from the pre-fix state. It also differs from the benchmark, because the current app does not collect widespread_face, widespread_palm_feet, location_side, location_coverage, location_nail_coverage (they stay “unknown”), while it now collects cough (a newly added question).
The original spec always flagged the thresholds as provisional pending dermatologist-cleaned test labels (the current split has ~7–16% label noise).

So: re-fit the thresholds on inputs shaped like what production now sends.

Where the numbers live

Config: src/ai_service/src/config.py (env-overridable):

tier_confident_threshold (env TIER_CONFIDENT_THRESHOLD, default 0.815)
tier_refer_threshold (env TIER_REFER_THRESHOLD, default 0.433)
possible_item_threshold (env POSSIBLE_ITEM_THRESHOLD, default 0.25)
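As an illustration of what "env-overridable" means for these three values, here is a minimal sketch using only the standard library. The real config.py is not shown in this doc and may use a different mechanism (for example a settings library); the class name `TierThresholds`, the function `load_thresholds`, and the ordering check are all hypothetical.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class TierThresholds:
    # Defaults as listed above.
    tier_confident_threshold: float = 0.815
    tier_refer_threshold: float = 0.433
    possible_item_threshold: float = 0.25


def load_thresholds(env: Optional[Mapping[str, str]] = None) -> TierThresholds:
    """Read each threshold from its upper-cased env var, else use the default."""
    env = os.environ if env is None else env
    values = {}
    for f in fields(TierThresholds):
        raw = env.get(f.name.upper())
        values[f.name] = float(raw) if raw is not None else f.default
    thresholds = TierThresholds(**values)
    # Sanity check (an added assumption): the refer cut must not exceed
    # the confident cut, otherwise the possible tier would be empty.
    if not (0.0 <= thresholds.tier_refer_threshold
            <= thresholds.tier_confident_threshold <= 1.0):
        raise ValueError(f"inconsistent tier thresholds: {thresholds}")
    return thresholds


# Example: override only the refer threshold, as the refit proposes.
refit = load_thresholds({"TIER_REFER_THRESHOLD": "0.444"})
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the loader easy to test.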

Note: these config values live in the merged/soon-merged serving PRs (#110 TTA+tier, #112 highlight_count). If they aren’t on main yet when you start, rebase this branch on those, or just edit config.py to add/adjust them.

Eval tooling: the tiered-policy / accuracy@coverage analysis in the clinical eval harness — src/training/scripts/eval/ (the “tiered-policy analysis” added to clinical_eval; grep for coverage / tier there). The deployed model artifact is s3://dermadetect-models/models/gen2a_medsiglip/ (best_model.pt, config.json, feature_schema.json = the trimmed 19-field schema, labels.txt).

What was run (reproducible)

Model: the deployed artifact, byte-for-byte — aws s3 sync s3://dermadetect-models/models/gen2a_medsiglip/ (MedSigLIP-448, trimmed 19-field feature_schema.json, best_model.pt = epoch 3).
Split: the full 15,561-image 25-class test split (patient-level split from ~/.cache/dermadetect/v1 manifest + splits, seed-free deterministic join — same rows as the 73.6% benchmark and the original threshold fit).
TTA on (image + h-flip sigmoid average), matching the TTA=true deploy setting from PR #110.
Production-shaped metadata: each test row’s ETL-normalized metadata with the 5 fields the app never sends nulled to “unknown” — new --null-fields flag on dump_probs.py. This is byte-equivalent to what normalize_anamnesis (PR #113) produces for an app payload: 14 of 19 fields populated, widespread_face, widespread_palm_feet, location_side, location_coverage, location_nail_coverage unknown.
Labels: the noisy split. The dermatologist blind-review packet exists but no responses are in yet (derm_review_BLIND.csv is empty), so dermatologist-cleaned labels were NOT available; re-fit on cleaned labels when they land.
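The production shaping step amounts to overwriting five metadata fields with “unknown” before the model sees the row. The sketch below shows the idea only; it is not the code behind the --null-fields flag in dump_probs.py, and the helper names `parse_null_fields` and `null_fields` and the example row values are hypothetical.

```python
from typing import Dict, Iterable, List

UNKNOWN = "unknown"


def parse_null_fields(arg: str) -> List[str]:
    """Split a comma-separated flag value into field names."""
    return [name.strip() for name in arg.split(",") if name.strip()]


def null_fields(metadata: Dict[str, str],
                fields: Iterable[str],
                unknown: str = UNKNOWN) -> Dict[str, str]:
    """Return a copy of the row with the given fields set to 'unknown'."""
    shaped = dict(metadata)  # copy so the original test row is untouched
    for name in fields:
        shaped[name] = unknown
    return shaped


flag_value = ("widespread_face,widespread_palm_feet,location_side,"
              "location_coverage,location_nail_coverage")
to_null = parse_null_fields(flag_value)

# Hypothetical test row; only a few of the 19 fields are shown.
row = {"cough": "yes", "location_side": "left", "widespread_face": "no"}
shaped_row = null_fields(row, to_null)
```

Fields the app does collect, such as cough, pass through unchanged.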


```bash
# Run from src/training. Dump probabilities, one shard per GPU
# (TITAN X + 1080 Ti, ~90 min each); replace N with the shard index (0 or 1).
uv run --package ddtrain python scripts/eval/dump_probs.py \
  --model-dir ~/.cache/dermadetect/deployed/gen2a_medsiglip \
  --split test --dataset-dir ~/.cache/dermadetect/v1 \
  --out ~/.cache/dermadetect/round2/prod_shape/test_probs.shardN.parquet \
  --batch-size 48 --num-workers 8 --num-shards 2 --shard N --tta \
  --null-fields "widespread_face,widespread_palm_feet,location_side,location_coverage,location_nail_coverage"

# Refit the thresholds from both shards; --baseline-probs points at the Round-2 TTA probs.
uv run --package ddtrain --extra eval python scripts/eval/refit_tier_thresholds.py \
  --probs ~/.cache/dermadetect/round2/prod_shape/test_probs.shard0.parquet \
          ~/.cache/dermadetect/round2/prod_shape/test_probs.shard1.parquet \
  --baseline-probs ~/.cache/dermadetect/round2/tta_test_probs.parquet
```

refit_tier_thresholds.py (new, this branch) was validated by running it on the original full-metadata TTA probs: it reproduces the shipped fit (confident 0.812≈0.815, refer 0.428≈0.433, same 50/37/13 shares), so the numbers below are apples-to-apples with the Round-2 Stream C fit.

Results

Full-coverage accuracy barely moves under production shaping: top-1 0.7362 / top-3 0.9274 (vs 0.7377 / 0.9303 with full metadata). The five missing fields are worth about 0.15 top-1 points and 0.29 top-3 points. The confidence distribution shifts slightly toward the middle (less refer mass, fatter possible tier).

Chosen thresholds

config value	old	new	why
`tier_confident_threshold`	0.815	0.815 (refit: 0.8145)	design point unchanged: coverage 49.4%, covered top-1 91.7%, confident-wrong 4.1% of all patients
`tier_refer_threshold`	0.433	0.444	restores the 13.0% refer design share (old value refers only 12.1% on the new distribution)
`possible_item_threshold`	0.25	0.25	validated, see sweep below
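For readers who want to see how cuts like these can be read off a max-probability distribution, here is a rough sketch on synthetic data. It is not refit_tier_thresholds.py: the function names are hypothetical, the data is a toy calibrated sample, and it assumes the refer cut is chosen as the quantile that hits a target refer share while the confident cut is evaluated as one point on an accuracy@coverage curve.

```python
import numpy as np


def refer_threshold_for_share(maxprob: np.ndarray, refer_share: float) -> float:
    """Cut below which roughly `refer_share` of cases fall."""
    return float(np.quantile(maxprob, refer_share))


def accuracy_at_coverage(maxprob: np.ndarray, correct: np.ndarray,
                         threshold: float):
    """Coverage and top-1 accuracy of the cases at or above `threshold`."""
    covered = maxprob >= threshold
    coverage = float(covered.mean())
    accuracy = float(correct[covered].mean()) if covered.any() else float("nan")
    return coverage, accuracy


# Synthetic stand-in for the dumped probabilities (NOT the real distribution):
# uniform max-probabilities, with correctness drawn at the stated confidence.
rng = np.random.default_rng(0)
maxprob = rng.uniform(0.2, 1.0, size=10_000)
correct = rng.uniform(size=10_000) < maxprob

# Refer cut that sends about 13% of cases to the refer tier.
refer_cut = refer_threshold_for_share(maxprob, 0.13)
refer_share = float((maxprob < refer_cut).mean())

# One point of the accuracy@coverage curve, at the shipped confident cut.
coverage, covered_accuracy = accuracy_at_coverage(maxprob, correct, 0.815)
```

Sweeping `threshold` over a grid and recording each `(coverage, accuracy)` pair gives a curve of the kind the refit script writes out.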

Env vars for the terraform deploy: TIER_CONFIDENT_THRESHOLD=0.815, TIER_REFER_THRESHOLD=0.444, POSSIBLE_ITEM_THRESHOLD=0.25 (and TTA=true, which this fit assumes).

Tier report at the chosen thresholds (production-shaped, TTA)

tier	n	share	top-1	top-3
confident	7,696	49.5%	0.917	0.980
possible	5,837	37.5%	0.629	0.909
refer	2,028	13.0%	0.357	0.783

Outcome mix per patient: 45.4% confident-correct, 4.1% confident-wrong (the clinical-risk number), 34.1% possible-with-truth-in-top-3, 3.4% possible-miss, 13.0% refer. This is effectively the original design point: the metadata skew fix (PR #113) restores enough signal that the old calibration holds.
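The outcome mix follows directly from the tier report: each outcome is the tier's share of all patients times the relevant accuracy. The short check below recomputes it from the table values only; the dictionary layout is just for illustration.

```python
TOTAL = 15_561

# Values copied from the tier report table above.
tiers = {
    "confident": {"n": 7_696, "top1": 0.917, "top3": 0.980},
    "possible": {"n": 5_837, "top1": 0.629, "top3": 0.909},
    "refer": {"n": 2_028, "top1": 0.357, "top3": 0.783},
}
assert sum(t["n"] for t in tiers.values()) == TOTAL

share = {name: t["n"] / TOTAL for name, t in tiers.items()}

outcome_mix = {
    "confident_correct": share["confident"] * tiers["confident"]["top1"],
    "confident_wrong": share["confident"] * (1 - tiers["confident"]["top1"]),
    "possible_truth_in_top3": share["possible"] * tiers["possible"]["top3"],
    "possible_miss": share["possible"] * (1 - tiers["possible"]["top3"]),
    "refer": share["refer"],
}

# Share-weighted accuracies should land near the full-coverage numbers.
overall_top1 = sum(share[n] * tiers[n]["top1"] for n in tiers)
overall_top3 = sum(share[n] * tiers[n]["top3"] for n in tiers)

for name, value in outcome_mix.items():
    print(f"{name}: {value:.1%}")
```

The share-weighted top-1 and top-3 also agree with the full-coverage 0.7362 / 0.9274 to within rounding of the per-tier accuracies.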

Stricter option (unchanged from the spec, if product wants it): TIER_CONFIDENT_THRESHOLD=0.885 → 95.1% covered top-1, 1.85% confident-wrong, at 37.5% coverage.

Refer-threshold candidates for reference: 0.352 → 6.5% refer, 0.444 → 13.0%, 0.521 → 19.5% (kept-set top-3 0.940 / 0.949 / 0.955).

`possible_item_threshold` sweep (within the possible tier, floor 1 / cap 3)

cut	1 match	2 matches	3 matches	mean	truth in highlighted
0.20	46.6%	49.6%	3.8%	1.57	0.789
0.25	59.1%	39.6%	1.3%	1.42	0.761
0.30	69.6%	30.1%	0.3%	1.31	0.729

0.25 keeps the shortlist tight (mostly 1–2 highlighted) while capturing the truth in 76% of possible-tier cases. The full top-3 holds the truth in 90.9%; the gap is cases where the truth sits in the top-3 but below the cut. The client still shows the ranked list, so only the highlight is affected. Dropping to 0.20 buys +2.8 pts of truth capture at the cost of double-highlighting half the tier, so the value stays at 0.25.
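The sweep statistics can be reproduced in outline with a few NumPy operations. The sketch below is an assumption about how such a sweep might be computed, not an excerpt from refit_tier_thresholds.py; `sweep_item_threshold` is a hypothetical name and the three rows of probabilities are made-up toy data.

```python
import numpy as np


def sweep_item_threshold(probs, true_idx, cuts, floor=1, cap=3):
    """For each cut, report mean highlight count and truth-in-highlighted rate.

    probs: (n_cases, n_classes) probabilities for possible-tier cases.
    true_idx: (n_cases,) index of the labelled class for each case.
    """
    # Class indices of the top-`cap` probabilities per case, best first.
    order = np.argsort(-probs, axis=1)[:, :cap]
    top_probs = np.take_along_axis(probs, order, axis=1)
    rows = []
    for cut in cuts:
        # Number highlighted: classes above the cut, clamped to [floor, cap].
        n_high = np.clip((top_probs > cut).sum(axis=1), floor, cap)
        # Only the first n_high ranked classes count as highlighted.
        rank_mask = np.arange(cap)[None, :] < n_high[:, None]
        hit = ((order == true_idx[:, None]) & rank_mask).any(axis=1)
        rows.append({
            "cut": cut,
            "mean_highlighted": float(n_high.mean()),
            "truth_in_highlighted": float(hit.mean()),
        })
    return rows


# Toy possible-tier cases (illustrative only).
probs = np.array([
    [0.50, 0.29, 0.16, 0.05],
    [0.60, 0.22, 0.10, 0.08],
    [0.45, 0.28, 0.26, 0.01],
])
true_idx = np.array([1, 1, 2])
rows = sweep_item_threshold(probs, true_idx, cuts=[0.20, 0.25, 0.30])
```

On real data, raising the cut shortens the highlighted list and lowers truth capture, which is the trade-off the table above quantifies.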

Curve + artifacts

scripts/eval/refit_tier_thresholds.py writes coverage_curve.csv/.png, refer_candidates.csv, possible_item_sweep.csv, tier_reports.json to src/training/eval_output/refit_thresholds/ (gitignored; regenerate with the commands above). Raw probs: ~/.cache/dermadetect/round2/prod_shape/ on the GPU box.

Follow-ups

Re-fit on dermatologist-cleaned labels when the blind review returns — that was half the reason for this task and is still open; the tooling above makes it a 5-minute CPU job (refit_tier_thresholds.py on new true_idx).
possible_item_threshold also lives in PR #112’s config change — trivial merge conflict with this branch, resolve to 0.25 either way.

Guardrails

Do not change model weights — this is threshold calibration on the current checkpoint only.
Do not deploy ad hoc; threshold changes ship via terraform + CI like any other config.
Keep the eval reproducible: record the exact command, split, seed, and TTA setting used.

Skew fix: 2026-07-03-metadata-serving-skew-fix.mdx (PR #113).
Serving impl: 2026-07-03-abstain-ux-serving-impl.mdx (PR #110/#112).
Original spec: 2026-07-03-abstain-ux-serving-spec.mdx.
Questionnaire value / trimmed schema: 2026-06-21-questionnaire-value.md.