Re-fit the tier thresholds (done — results below)
This started as a work brief for the GPU box; the eval has been run and the
results + config change are committed on this branch. Headline: the
production-shaped confidence distribution is nearly identical to the benchmark
one — tier_confident_threshold stays 0.815 (the refit lands on 0.8145),
tier_refer_threshold nudges 0.433 → 0.444, and possible_item_threshold
0.25 validates. Labels are still the noisy split (the dermatologist blind
review has not come back), so all numbers remain provisional in the same way
the originals were.
Why this needs doing now
The Round-2 abstain/referral UX assigns a tier from the model’s max class probability:
| tier | rule (config value) | default |
|---|---|---|
| confident | maxprob ≥ tier_confident_threshold | 0.815 |
| possible | tier_refer_threshold ≤ maxprob < tier_confident_threshold | 0.433 |
| refer | maxprob < tier_refer_threshold | — |
plus, within the possible tier, an item is highlighted as a match if its
probability exceeds possible_item_threshold (default 0.25); at least 1 and at
most 3 items are highlighted.
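The tier rule in the table above can be written out as a small function. This is a minimal sketch for illustration only: the names `TierThresholds` and `assign_tier` are hypothetical and are not taken from the serving code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierThresholds:
    # Defaults copied from the table above.
    confident: float = 0.815
    refer: float = 0.433


def assign_tier(maxprob: float, t: TierThresholds = TierThresholds()) -> str:
    """Map the model's max class probability to a UX tier."""
    if maxprob >= t.confident:
        return "confident"
    if maxprob >= t.refer:
        return "possible"
    return "refer"


if __name__ == "__main__":
    for p in (0.90, 0.815, 0.50, 0.433, 0.20):
        print(p, assign_tier(p))
```

Both boundaries are inclusive on the upper tier, matching the ≥ / ≤ in the table: a max probability of exactly 0.815 is confident, and exactly 0.433 is possible.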
These three defaults were fit on the 15,561-image test split with full, correctly-named metadata (Round-2 Stream C accuracy@coverage curve). Two things make a re-fit due now:
- The metadata serving-skew fix (PR #113) changed what production feeds the model. Before it, production sent the model almost no metadata (2 of 19 fields). After it, the app anamnesis is normalized (ddmodels.anamnesis.normalize_anamnesis) and 14 of 19 fields reach the model. The confidence distribution production sees therefore differs from the pre-fix state, and also from the benchmark: the current app does not collect widespread_face, widespread_palm_feet, location_side, location_coverage, or location_nail_coverage (they stay “unknown”), while it does now collect cough (a newly added question).
- The original spec always flagged the thresholds as provisional pending dermatologist-cleaned test labels (the current split has ~7–16% label noise).
So: re-fit the thresholds on inputs shaped like what production now sends.
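"Shaped like what production now sends" amounts to forcing the five never-collected fields to the unknown value on every test row. A minimal sketch of that idea follows; the helper name `null_fields`, the dict row representation, and the literal string `"unknown"` as the sentinel are assumptions for illustration, and the real `--null-fields` handling in dump_probs.py may encode this differently.

```python
# The five fields the current app does not collect (from the text above).
NEVER_SENT_FIELDS = (
    "widespread_face",
    "widespread_palm_feet",
    "location_side",
    "location_coverage",
    "location_nail_coverage",
)


def null_fields(row: dict, fields=NEVER_SENT_FIELDS, unknown: str = "unknown") -> dict:
    """Return a copy of `row` with the given metadata fields forced to `unknown`."""
    shaped = dict(row)  # do not mutate the caller's row
    for name in fields:
        shaped[name] = unknown
    return shaped


if __name__ == "__main__":
    # Hypothetical metadata values, for illustration only.
    row = {"location_side": "left", "cough": "yes"}
    print(null_fields(row))
```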
Where the numbers live
Config: src/ai_service/src/config.py (env-overridable):
- tier_confident_threshold (env TIER_CONFIDENT_THRESHOLD, default 0.815)
- tier_refer_threshold (env TIER_REFER_THRESHOLD, default 0.433)
- possible_item_threshold (env POSSIBLE_ITEM_THRESHOLD, default 0.25)
Note: these config values live in the merged/soon-merged serving PRs (#110 TTA+tier,
#112 highlight_count). If they aren’t on main yet when you start, rebase this
branch on those, or just edit config.py to add/adjust them.
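For orientation, "env-overridable" means each value falls back to its default unless the matching environment variable is set. The sketch below shows one way to express that with the standard library. It is not the actual config.py, which may well use a settings library; `load_thresholds` and the ordering check are hypothetical additions.

```python
import os
from typing import Mapping

# Env var names and defaults as listed above.
DEFAULTS = {
    "TIER_CONFIDENT_THRESHOLD": 0.815,
    "TIER_REFER_THRESHOLD": 0.433,
    "POSSIBLE_ITEM_THRESHOLD": 0.25,
}


def load_thresholds(env: Mapping[str, str] = os.environ) -> dict:
    """Read the three thresholds from `env`, falling back to the defaults."""
    values = {}
    for name, default in DEFAULTS.items():
        raw = env.get(name, "").strip()
        values[name.lower()] = float(raw) if raw else default
    refer = values["tier_refer_threshold"]
    confident = values["tier_confident_threshold"]
    # Sanity check (an assumption, not from the source): tiers must not overlap.
    if not 0.0 <= refer < confident <= 1.0:
        raise ValueError(f"need 0 <= refer < confident <= 1, got {refer}, {confident}")
    return values


if __name__ == "__main__":
    print(load_thresholds({"TIER_REFER_THRESHOLD": "0.444"}))
```

Passing the environment as a mapping keeps the function easy to exercise without touching the real process environment.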
Eval tooling: the tiered-policy / accuracy@coverage analysis in the clinical eval
harness — src/training/scripts/eval/ (the “tiered-policy analysis” added to
clinical_eval; grep for coverage / tier there). The deployed model artifact
is s3://dermadetect-models/models/gen2a_medsiglip/ (best_model.pt, config.json,
feature_schema.json = the trimmed 19-field schema, labels.txt).
What was run (reproducible)
- Model: the deployed artifact, byte-for-byte: aws s3 sync s3://dermadetect-models/models/gen2a_medsiglip/ (MedSigLIP-448, trimmed 19-field feature_schema.json, best_model.pt = epoch 3).
- Split: the full 15,561-image 25-class test split (patient-level split from ~/.cache/dermadetect/v1 manifest + splits, seed-free deterministic join; same rows as the 73.6% benchmark and the original threshold fit).
- TTA on (image + h-flip sigmoid average), matching the TTA=true deploy setting from PR #110.
- Production-shaped metadata: each test row’s ETL-normalized metadata with the 5 fields the app never sends nulled to “unknown”, via the new --null-fields flag on dump_probs.py. This is byte-equivalent to what normalize_anamnesis (PR #113) produces for an app payload: 14 of 19 fields populated; widespread_face, widespread_palm_feet, location_side, location_coverage, location_nail_coverage unknown.
- Labels: the noisy split. The dermatologist blind-review packet exists but no responses are in yet (derm_review_BLIND.csv is empty), so dermatologist-cleaned labels were NOT available; re-fit on cleaned labels when they land.
```bash
# from src/training, one shard per GPU (TITAN X + 1080 Ti, ~90 min each)
# run once per shard, with N = 0 and N = 1
uv run --package ddtrain python scripts/eval/dump_probs.py \
  --model-dir ~/.cache/dermadetect/deployed/gen2a_medsiglip \
  --split test --dataset-dir ~/.cache/dermadetect/v1 \
  --out ~/.cache/dermadetect/round2/prod_shape/test_probs.shardN.parquet \
  --batch-size 48 --num-workers 8 --num-shards 2 --shard N --tta \
  --null-fields "widespread_face,widespread_palm_feet,location_side,location_coverage,location_nail_coverage"

uv run --package ddtrain --extra eval python scripts/eval/refit_tier_thresholds.py \
  --probs ~/.cache/dermadetect/round2/prod_shape/test_probs.shard0.parquet \
          ~/.cache/dermadetect/round2/prod_shape/test_probs.shard1.parquet \
  --baseline-probs ~/.cache/dermadetect/round2/tta_test_probs.parquet
```

refit_tier_thresholds.py (new, this branch) was validated by running it on the
original full-metadata TTA probs: it reproduces the shipped fit (confident
0.812≈0.815, refer 0.428≈0.433, same 50/37/13 shares), so the numbers below are
apples-to-apples with the Round-2 Stream C fit.
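The TTA setting used in this run (image plus horizontal flip, sigmoid outputs averaged) can be illustrated with a toy example. This is a sketch under stated assumptions, not the dump_probs.py implementation: images are plain nested lists, `model` is any callable returning per-class logits, and `toy_model` is invented purely to make the effect visible.

```python
import math
from typing import Callable, List, Sequence

Image = List[List[float]]  # toy single-channel image: rows of pixel values


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def hflip(image: Image) -> Image:
    """Mirror the image left-to-right."""
    return [list(reversed(row)) for row in image]


def tta_probs(model: Callable[[Image], Sequence[float]], image: Image) -> List[float]:
    """Average per-class sigmoid probabilities over the image and its h-flip."""
    views = (image, hflip(image))
    per_view = [[sigmoid(z) for z in model(view)] for view in views]
    return [sum(ps) / len(ps) for ps in zip(*per_view)]


def toy_model(image: Image) -> List[float]:
    """Hypothetical 2-class model: logits are the left and right column sums."""
    return [sum(row[0] for row in image), sum(row[-1] for row in image)]


if __name__ == "__main__":
    img = [[1.0, 0.0], [1.0, 0.0]]
    # The toy model is maximally flip-sensitive, so averaging equalizes the classes.
    print(tta_probs(toy_model, img))
```

The tier is then assigned from the maximum of the averaged probabilities, which is why the thresholds are only valid with TTA=true.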
Results
Full-coverage accuracy barely moves under production shaping: top-1 0.7362 / top-3 0.9274 (vs 0.7377 / 0.9303 with full metadata). The five missing fields cost ~0.15 top-1 points and ~0.29 top-3 points. The confidence distribution shifts slightly toward the middle (less refer mass, fatter possible tier).
Chosen thresholds
| config value | old | new | why |
|---|---|---|---|
| tier_confident_threshold | 0.815 | 0.815 (refit: 0.8145) | design point unchanged: coverage 49.4%, covered top-1 91.7%, confident-wrong 4.1% of all patients |
| tier_refer_threshold | 0.433 | 0.444 | restores the 13.0% refer design share (old value refers only 12.1% on the new distribution) |
| possible_item_threshold | 0.25 | 0.25 | validated, see sweep below |
Env vars for the terraform deploy: TIER_CONFIDENT_THRESHOLD=0.815,
TIER_REFER_THRESHOLD=0.444, POSSIBLE_ITEM_THRESHOLD=0.25 (and TTA=true,
which this fit assumes).
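The three numbers quoted for the confident design point (coverage, covered top-1, confident-wrong share) are all functions of a single cut over per-row max probabilities. A minimal sketch of that computation is below; `coverage_stats` and the toy data are illustrative assumptions and are not lifted from refit_tier_thresholds.py.

```python
from typing import Sequence, Tuple


def coverage_stats(
    maxprob: Sequence[float], correct: Sequence[bool], threshold: float
) -> Tuple[float, float, float]:
    """Return (coverage, covered top-1 accuracy, confident-wrong share of all rows)."""
    n = len(maxprob)
    covered = [c for p, c in zip(maxprob, correct) if p >= threshold]
    if n == 0 or not covered:
        return 0.0, float("nan"), 0.0
    n_right = sum(covered)
    coverage = len(covered) / n
    covered_top1 = n_right / len(covered)
    confident_wrong = (len(covered) - n_right) / n
    return coverage, covered_top1, confident_wrong


if __name__ == "__main__":
    # Toy data, not from the eval: five rows, three above the 0.815 cut.
    maxprob = [0.95, 0.90, 0.85, 0.60, 0.40]
    correct = [True, True, False, True, False]
    print(coverage_stats(maxprob, correct, 0.815))
```

Sweeping `threshold` over a grid and recording the first two values gives an accuracy@coverage curve; note that confident-wrong is normalized by all rows, not by covered rows, which is how the 4.1% figure above is stated.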
Tier report at the chosen thresholds (production-shaped, TTA)
| tier | n | share | top-1 | top-3 |
|---|---|---|---|---|
| confident | 7,696 | 49.5% | 0.917 | 0.980 |
| possible | 5,837 | 37.5% | 0.629 | 0.909 |
| refer | 2,028 | 13.0% | 0.357 | 0.783 |
Outcome mix per patient: 45.4% confident-correct, 4.1% confident-wrong (the clinical-risk number), 34.1% possible-with-truth-in-top-3, 3.4% possible-miss, 13.0% refer. This is practically indistinguishable from the original design point: the metadata skew fix (PR #113) restores enough signal that the old calibration holds.
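The outcome mix follows arithmetically from the tier report above: each tier's row count times its accuracy (or miss rate), divided by the 15,561 total. The short check below redoes that arithmetic from the rounded table values, so it reproduces the quoted percentages only to the one decimal shown.

```python
# (n, top-1, top-3) per tier, copied from the tier report above.
TIERS = {
    "confident": (7696, 0.917, 0.980),
    "possible": (5837, 0.629, 0.909),
    "refer": (2028, 0.357, 0.783),
}

total = sum(n for n, _, _ in TIERS.values())  # 15,561 rows


def pct(count: float) -> float:
    """Share of all rows, in percent, rounded to one decimal."""
    return round(100 * count / total, 1)


n_conf, top1_conf, _ = TIERS["confident"]
n_poss, _, top3_poss = TIERS["possible"]
n_refer = TIERS["refer"][0]

mix = {
    "confident_correct": pct(n_conf * top1_conf),
    "confident_wrong": pct(n_conf * (1 - top1_conf)),
    "possible_truth_in_top3": pct(n_poss * top3_poss),
    "possible_miss": pct(n_poss * (1 - top3_poss)),
    "refer": pct(n_refer),
}

if __name__ == "__main__":
    for name, value in mix.items():
        print(f"{name}: {value}%")
```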
Stricter option (unchanged from the spec, if product wants it):
TIER_CONFIDENT_THRESHOLD=0.885 → 95.1% covered top-1, 1.85% confident-wrong,
at 37.5% coverage.
Refer-threshold candidates for reference: 0.352 → 6.5% refer, 0.444 → 13.0%, 0.521 → 19.5% (kept-set top-3 0.940 / 0.949 / 0.955).
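Each refer candidate pairs a threshold with the share of rows falling below it, so choosing a threshold for a target refer share is a quantile lookup on the max-probability distribution. The sketch below shows that lookup on toy data. It assumes an order-statistic rule with no interpolation; the rule actually used by refit_tier_thresholds.py is not shown in this document, and both function names are hypothetical.

```python
from typing import Sequence


def refer_threshold_for_share(maxprob: Sequence[float], share: float) -> float:
    """Pick a threshold t so that roughly `share` of rows have maxprob < t."""
    if not maxprob:
        raise ValueError("need at least one probability")
    if not 0.0 <= share < 1.0:
        raise ValueError("share must be in [0, 1)")
    ranked = sorted(maxprob)
    # The k-th smallest value has exactly k rows strictly below it (absent ties).
    k = min(round(share * len(ranked)), len(ranked) - 1)
    return ranked[k]


def refer_share(maxprob: Sequence[float], threshold: float) -> float:
    """Fraction of rows that would land in the refer tier at `threshold`."""
    return sum(p < threshold for p in maxprob) / len(maxprob)


if __name__ == "__main__":
    probs = [i / 10 for i in range(1, 11)]  # toy data: 0.1, 0.2, ..., 1.0
    t = refer_threshold_for_share(probs, 0.3)
    print(t, refer_share(probs, t))
```

The same lookup explains the 0.433 → 0.444 nudge: with less mass at the low end of the new distribution, the 13.0% quantile sits slightly higher.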
possible_item_threshold sweep (within the possible tier, floor 1 / cap 3)
| cut | 1 match | 2 matches | 3 matches | mean | truth in highlighted |
|---|---|---|---|---|---|
| 0.20 | 46.6% | 49.6% | 3.8% | 1.57 | 0.789 |
| 0.25 | 59.1% | 39.6% | 1.3% | 1.42 | 0.761 |
| 0.30 | 69.6% | 30.1% | 0.3% | 1.31 | 0.729 |
0.25 keeps the shortlist tight (mostly 1–2 highlighted) while capturing the truth in 76% of possible-tier cases (the full top-3 holds it in 90.9%; the gap is cases where the truth sits in top-3 below the cut — the client still shows the ranked list, only the highlight is affected). Dropping to 0.20 buys +2.8pts truth capture at the cost of double-highlighting half the tier. Kept at 0.25.
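The highlight rule swept above can be sketched as follows. The reading of "floor 1 / cap 3" is an assumption: the top-ranked item is always highlighted even when nothing clears the cut, and never more than three are. The function name and example probabilities are illustrative only.

```python
from typing import List, Sequence


def highlighted(
    probs: Sequence[float], cut: float = 0.25, floor: int = 1, cap: int = 3
) -> List[int]:
    """Return class indices to highlight, best first.

    Items whose probability exceeds `cut` are highlighted, but always at
    least `floor` and at most `cap` of the top-ranked items.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    n_above = sum(1 for p in probs if p > cut)
    k = min(max(n_above, floor), cap)
    return ranked[:k]


if __name__ == "__main__":
    print(highlighted([0.40, 0.30, 0.20, 0.10]))        # two items clear 0.25
    print(highlighted([0.24, 0.22, 0.20]))              # none clear it: floor applies
    print(highlighted([0.3, 0.25, 0.2, 0.15], cut=0.1)) # four clear it: cap applies
```

Lowering `cut` can only grow the highlighted set, which is why the sweep trades a tighter shortlist against truth capture.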
Curve + artifacts
scripts/eval/refit_tier_thresholds.py writes coverage_curve.csv/.png,
refer_candidates.csv, possible_item_sweep.csv, tier_reports.json to
src/training/eval_output/refit_thresholds/ (gitignored; regenerate with the
commands above). Raw probs: ~/.cache/dermadetect/round2/prod_shape/ on the
GPU box.
Follow-ups
- Re-fit on dermatologist-cleaned labels when the blind review returns. That was half the reason for this task and is still open; the tooling above makes it a 5-minute CPU job (refit_tier_thresholds.py on new true_idx).
- possible_item_threshold also lives in PR #112’s config change: a trivial merge conflict with this branch, resolve to 0.25 either way.
Guardrails
- Do not change model weights — this is threshold calibration on the current checkpoint only.
- Do not deploy ad hoc; threshold changes ship via terraform + CI like any other config.
- Keep the eval reproducible: record the exact command, split, seed, and TTA setting used.
Context / related
- Skew fix: 2026-07-03-metadata-serving-skew-fix.mdx (PR #113).
- Serving impl: 2026-07-03-abstain-ux-serving-impl.mdx (PR #110/#112).
- Original spec: 2026-07-03-abstain-ux-serving-spec.mdx.
- Questionnaire value / trimmed schema: 2026-06-21-questionnaire-value.md.