Handoff spec: TTA + Confident/Possible/Refer UX
Implementation spec for the two Round-2 wins that ship on the current
gen2a-medsiglip model (no new checkpoint): TTA (free accuracy) and the
3-tier abstain/referral UX (the main product win). Coding to be done on the
emulator machine; this doc is the contract. All numbers from the full 15,561-image
test split (see 2026-07-01-round2-streamC-clinical-eval and
2026-07-03-round2-streamB-results).
1. Serving: TTA in the predictor (ai-service)
In the ai-service inference path (src/ai_service/src/... predictor that builds the
MedSigLIP forward pass), run each image twice — original + horizontal flip — and
average the sigmoid probabilities before ranking. Measured effect: top-1
0.7360→0.7377, top-3 0.9276→0.9303. Cost: ~2× the vision-tower forward (fine on the
T4). Gate behind a config flag (tta: true) so it can be toggled.
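The flip-and-average step can be sketched as below. This is a minimal illustration, not the actual predictor: `predict_with_tta`, `hflip`, and the `predict` callable are hypothetical names, and the image is a plain nested list standing in for whatever tensor type the real forward pass uses.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]  # H x W grid of pixel values (toy stand-in for a tensor)
Predictor = Callable[[Image], Sequence[float]]  # returns per-class sigmoid probabilities


def hflip(image: Image) -> Image:
    """Mirror the image left-to-right by reversing every row."""
    return [row[::-1] for row in image]


def predict_with_tta(image: Image, predict: Predictor, tta: bool = True) -> List[float]:
    """Per-class probabilities, averaged over original + flipped views when tta is on."""
    probs = list(predict(image))
    if not tta:
        return probs  # flag off: single forward pass, unchanged behavior
    flipped_probs = list(predict(hflip(image)))
    # Average in probability space (after the sigmoid); ranking happens downstream.
    return [(a + b) / 2.0 for a, b in zip(probs, flipped_probs)]
```

Passing the `tta` config value straight through keeps the flag-off path identical to the current single-pass behavior, which makes an A/B comparison straightforward.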
2. Serving: tier decision in the diagnosis response
Confidence = max class probability over the 25 target classes. The model is already well calibrated (temperature T≈1.03), so use the raw max probability without rescaling. Assign a tier using two thresholds taken from the accuracy@coverage curve:
| tier | rule | coverage | quality on covered set | client shows |
|---|---|---|---|---|
| Confident | maxprob ≥ 0.815 | ~50% of cases | top-1 91.7% | single diagnosis + confidence |
| Possible | 0.433 ≤ maxprob < 0.815 | ~37% | top-3 91.2% | ranked top-3 shortlist |
| Refer | maxprob < 0.433 | ~13% | (uncertain) | “refer to a specialist” |
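The table's rules reduce to two comparisons. A minimal sketch, assuming a hypothetical `TierThresholds` config object and `assign_tier` helper (neither name comes from the ai-service codebase); the defaults are the two thresholds from the table:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierThresholds:
    """Hypothetical config holder; defaults are the thresholds from the tier table."""
    confident: float = 0.815
    possible: float = 0.433


def assign_tier(maxprob: float, thresholds: TierThresholds = TierThresholds()) -> str:
    """Map the max class probability to one of the three tiers."""
    if maxprob >= thresholds.confident:
        return "confident"
    if maxprob >= thresholds.possible:
        return "possible"
    return "refer"
```

Both boundaries are inclusive on the lower side, matching the `≥` and `≤` signs in the rule column.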
Outcome over all patients: 45.9% confident-correct, 4.1% confident-wrong, 33.8% possible-with-truth-in-top-3, 3.2% possible-miss, 13% refer. The 4.1% confident-wrong is the key clinical-risk number. If a stricter bar is wanted, the Confident threshold can be raised (higher maxprob cutoff) so that only the top ~37% of cases qualify, which gives 95% top-1 on the Confident set and ~1.9% confident-wrong overall. For that reason, make the two thresholds config values, not hardcoded.
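The outcome shares follow from coverage × quality on the covered set. The check below uses only the rounded figures from the tier table, so it should land within about 0.1 percentage points of the quoted outcomes rather than match them exactly; the variable names are illustrative.

```python
# Coverage and covered-set quality, copied from the tier table (rounded values).
coverage = {"confident": 0.50, "possible": 0.37, "refer": 0.13}
top1_on_confident = 0.917
top3_on_possible = 0.912

outcomes = {
    "confident_correct": coverage["confident"] * top1_on_confident,
    "confident_wrong": coverage["confident"] * (1 - top1_on_confident),
    "possible_hit": coverage["possible"] * top3_on_possible,
    "possible_miss": coverage["possible"] * (1 - top3_on_possible),
    "refer": coverage["refer"],
}

# Stricter operating point quoted above: ~37% coverage at 95% top-1.
strict_confident_wrong = 0.37 * (1 - 0.95)

for name, share in outcomes.items():
    print(f"{name}: {share:.1%}")
print(f"strict confident_wrong: {strict_confident_wrong:.1%}")
```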
Proposed API contract (diagnosis endpoint, src/ai_service/src/api/v1/diagnosis.py)
Add to the response:
{
  "tier": "confident|possible|refer",
  "predictions": [ {"diagnosis": "...", "probability": 0.0} ],  // 1 item if confident, 3 if possible, [] if refer
  "confidence": 0.0,  // max class prob
  "refer_to_specialist": false
}

Keep the existing full ranked list too (for debugging / physician view). predictions
is the tiered view the patient app renders.
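One way the new response fields could be derived from the existing ranked list is sketched below. `build_tiered_fields` is a hypothetical helper, not part of diagnosis.py, and setting `refer_to_specialist` to true exactly when the tier is `refer` is an assumption; the contract above only shows the field's default.

```python
from typing import Dict, List, Tuple


def build_tiered_fields(
    ranked: List[Tuple[str, float]],
    confident_thr: float = 0.815,
    possible_thr: float = 0.433,
) -> Dict[str, object]:
    """ranked: (diagnosis, probability) pairs sorted by probability, highest first."""
    confidence = ranked[0][1] if ranked else 0.0
    if confidence >= confident_thr:
        tier, keep = "confident", 1
    elif confidence >= possible_thr:
        tier, keep = "possible", 3
    else:
        tier, keep = "refer", 0
    return {
        "tier": tier,
        "predictions": [{"diagnosis": d, "probability": p} for d, p in ranked[:keep]],
        "confidence": confidence,
        # Assumption: the flag simply mirrors the refer tier.
        "refer_to_specialist": tier == "refer",
    }
```

The full ranked list stays in the response untouched; this helper only adds the tiered view on top of it.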
3. Client (rn_dermadetect): three result states
In the assessment result screen, branch on tier:
- confident → show the single diagnosis prominently + a confidence indicator.
- possible → show the top-3 as a “possible matches” shortlist (not one answer).
- refer → show a “we recommend seeing a dermatologist” message, no diagnosis.
This replaces any current “always show one/N results” behavior. Copy/design is a product
decision; the logic is driven entirely by the tier field.
4. Testing on the emulator (no GPU needed)
The model runs in the ai-service (dev T4 / deployed); the emulator only calls the API, so no local GPU is needed. Point the app at the dev ai-service, submit assessments whose confidence falls in each of the three ranges, and verify that each tier renders correctly.
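While exercising the tiers, it can help to sanity-check each API response against the contract before looking at the UI. The checker below is a sketch: `contract_violations` is a hypothetical helper that takes an already-parsed response dict (no endpoint URL or HTTP client is assumed), and the `refer_to_specialist` rule is an assumption about how that flag relates to the tier.

```python
from typing import Any, Dict, List

# Prediction count per tier, as stated in the contract comment.
EXPECTED_COUNT = {"confident": 1, "possible": 3, "refer": 0}


def contract_violations(resp: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in one diagnosis response (empty list = OK)."""
    tier = resp.get("tier")
    if tier not in EXPECTED_COUNT:
        return [f"unknown tier: {tier!r}"]
    problems: List[str] = []
    preds = resp.get("predictions", [])
    if len(preds) != EXPECTED_COUNT[tier]:
        problems.append(f"{tier}: expected {EXPECTED_COUNT[tier]} predictions, got {len(preds)}")
    probs = [p["probability"] for p in preds]
    if probs != sorted(probs, reverse=True):
        problems.append("predictions are not ranked by probability")
    confidence = resp.get("confidence")
    if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
        problems.append(f"confidence out of range: {confidence!r}")
    # Assumption: the flag mirrors the tier; the spec only shows its default value.
    if resp.get("refer_to_specialist") != (tier == "refer"):
        problems.append("refer_to_specialist does not match tier")
    return problems
```

An empty list for at least one response in each tier is a reasonable minimum before moving on to visual checks in the emulator.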
5. Caveats / do-after
- Re-fit the two thresholds on the dermatologist-cleaned test set once those labels are in — the current thresholds are on a benchmark with ~7-16% label noise.
- These are serving/client changes on the current checkpoint. A model-weights update (focal recipe) is a separate decision pending the focal confirmation run + the derm benchmark.
- Infra/deploy changes go through terraform + the normal CI, not ad hoc.
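For the threshold re-fit in the first caveat, one possible procedure is sketched below. It assumes thresholds are picked by target coverage on the accuracy@coverage curve (for example 50% for Confident and 87% cumulative for Possible); the procedure actually used for the current values is not described here, and both function names are hypothetical.

```python
from typing import Sequence


def threshold_at_coverage(confidences: Sequence[float], coverage: float) -> float:
    """Confidence cutoff such that roughly `coverage` of cases score at or above it."""
    ordered = sorted(confidences, reverse=True)
    # Number of cases to keep, clamped to [1, N].
    k = max(1, min(len(ordered), round(coverage * len(ordered))))
    return ordered[k - 1]


def accuracy_at_or_above(
    confidences: Sequence[float], correct: Sequence[bool], threshold: float
) -> float:
    """Fraction of correct cases among those with confidence >= threshold."""
    kept = [bool(ok) for conf, ok in zip(confidences, correct) if conf >= threshold]
    return sum(kept) / len(kept) if kept else float("nan")
```

For the Possible tier, `correct` would hold top-3 hits rather than top-1 hits. Tied confidence values can push the realized coverage slightly above the target.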