Handoff spec: TTA + Confident/Possible/Refer UX

Implementation spec for the two Round-2 wins that ship on the current gen2a-medsiglip model (no new checkpoint): test-time augmentation (TTA), a small accuracy gain with no retraining, and the 3-tier abstain/referral UX (the main product win). Coding will be done on the emulator machine; this doc is the contract. All numbers come from the full 15,561-image test split (see 2026-07-01-round2-streamC-clinical-eval and 2026-07-03-round2-streamB-results).

1. Serving: TTA in the predictor (ai-service)

In the ai-service inference path (src/ai_service/src/... predictor that builds the MedSigLIP forward pass), run each image twice, once as-is and once horizontally flipped, and average the two sets of sigmoid probabilities before ranking. Measured effect: top-1 0.7360→0.7377 (+0.17 pp), top-3 0.9276→0.9303 (+0.27 pp). Cost: ~2× the vision-tower forward pass per image (fine on the T4). Gate it behind a config flag (tta: true) so it can be toggled.
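
A minimal sketch of the flip-and-average step, written with plain Python lists so it runs anywhere. The names `predict_probs`, `hflip`, and `forward_logits` are hypothetical stand-ins, not the real predictor API; in the actual service the same logic would presumably be a couple of tensor operations around the existing MedSigLIP forward pass.

```python
import math
from typing import Callable, List

Image = List[List[float]]  # rows of pixel values; stand-in for the real image tensor


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def hflip(image: Image) -> Image:
    """Horizontal flip: reverse every row (the width axis)."""
    return [row[::-1] for row in image]


def predict_probs(
    image: Image,
    forward_logits: Callable[[Image], List[float]],  # hypothetical: image -> per-class logits
    tta: bool = True,                                # mirrors the proposed `tta` config flag
) -> List[float]:
    """Per-class sigmoid probabilities, optionally averaged over original + flipped image."""
    probs = [sigmoid(z) for z in forward_logits(image)]
    if not tta:
        return probs
    probs_flipped = [sigmoid(z) for z in forward_logits(hflip(image))]
    # Average in probability space (after the sigmoid), as the spec describes.
    return [(p + q) / 2.0 for p, q in zip(probs, probs_flipped)]
```

With `tta=False` the function reduces to the single forward pass, so the flag can be flipped without touching the ranking code that consumes the probabilities.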

2. Serving: tier decision in the diagnosis response

Confidence = max class probability over the 25 target classes (model is already well-calibrated, temperature T≈1.03, so use raw max prob). Assign a tier by two thresholds (from the accuracy@coverage curve):

tier	rule	coverage	quality on covered set	client shows
Confident	maxprob ≥ 0.815	~50% of cases	top-1 91.7%	single diagnosis + confidence
Possible	0.433 ≤ maxprob < 0.815	~37%	top-3 91.2%	ranked top-3 shortlist
Refer	maxprob < 0.433	~13%	(uncertain)	“refer to a specialist”
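
The tier rule in the table can be sketched as below. `TierThresholds` and `assign_tier` are hypothetical names; the default values are the two thresholds from the table, held in a config object rather than inlined in the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierThresholds:
    """Hypothetical config object; defaults are the thresholds from the table."""
    confident: float = 0.815
    possible: float = 0.433


def assign_tier(maxprob: float, thresholds: TierThresholds = TierThresholds()) -> str:
    """Map the max class probability to 'confident', 'possible', or 'refer'."""
    if maxprob >= thresholds.confident:
        return "confident"
    if maxprob >= thresholds.possible:
        return "possible"
    return "refer"
```

Both comparisons are inclusive on the lower bound, matching the `≥` and `≤` in the table, so a case with maxprob exactly 0.815 is Confident and one at exactly 0.433 is Possible.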

Outcome over all patients: 45.9% confident-correct, 4.1% confident-wrong, 33.8% possible-with-truth-in-top-3, 3.2% possible-miss, 13% refer. The 4.1% confident-wrong is the key clinical-risk number. If a stricter bar is wanted, the Confident threshold can be raised so that the tier covers only the ~37% most confident cases, which gives 95% top-1 and ~1.9% confident-wrong. Make the two thresholds config values, not hardcoded.
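
The outcome shares follow from coverage × accuracy per tier. The sketch below redoes that arithmetic from the rounded table figures, so it agrees with the numbers above only to within about 0.1 pp; `outcome_shares` is a hypothetical helper name.

```python
from typing import Tuple


def outcome_shares(coverage: float, accuracy: float) -> Tuple[float, float]:
    """Split a tier's coverage into (correct, wrong) shares of all patients."""
    return coverage * accuracy, coverage * (1.0 - accuracy)


# Rounded coverage and accuracy figures from the tier table and the text above.
for name, coverage, accuracy in [
    ("confident", 0.50, 0.917),           # top-1 accuracy on the Confident set
    ("possible", 0.37, 0.912),            # top-3 accuracy on the Possible set
    ("stricter confident", 0.37, 0.95),   # raised Confident threshold
]:
    correct, wrong = outcome_shares(coverage, accuracy)
    print(f"{name}: {correct:.1%} correct, {wrong:.1%} wrong")
```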

Proposed API contract (diagnosis endpoint, `src/ai_service/src/api/v1/diagnosis.py`)

Add to the response:


{
  "tier": "confident|possible|refer",
  "predictions": [ {"diagnosis": "...", "probability": 0.0} ],   // 1 item if confident, 3 if possible, [] if refer
  "confidence": 0.0,                                              // max class prob
  "refer_to_specialist": false
}

Keep the existing full ranked list in the response too (for debugging / physician view); `predictions` is the tiered view the patient app renders.
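
One way the added fields could be assembled, as a sketch only. `build_tiered_response` is a hypothetical helper, the input is assumed to be the existing ranked list as (diagnosis, probability) pairs sorted highest first, and setting `refer_to_specialist` to true exactly when the tier is `refer` is an assumption the contract above implies but does not state.

```python
from typing import Any, Dict, List, Tuple

# Number of items in `predictions` per tier, per the contract above.
ITEMS_PER_TIER = {"confident": 1, "possible": 3, "refer": 0}


def build_tiered_response(ranked: List[Tuple[str, float]], tier: str) -> Dict[str, Any]:
    """Build the tiered fields from the full ranked list and an already-decided tier.

    `ranked` is assumed sorted by probability, highest first.
    """
    if tier not in ITEMS_PER_TIER:
        raise ValueError(f"unknown tier: {tier!r}")
    shown = ranked[: ITEMS_PER_TIER[tier]]
    return {
        "tier": tier,
        "predictions": [{"diagnosis": d, "probability": p} for d, p in shown],
        "confidence": ranked[0][1] if ranked else 0.0,  # max class prob
        "refer_to_specialist": tier == "refer",         # assumption, see lead-in
    }
```

The full ranked list is left untouched, so it can still be returned alongside these fields for the debugging / physician view.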

3. Client (rn_dermadetect): three result states

In the assessment result screen, branch on tier:

confident → show the single diagnosis prominently + a confidence indicator.
possible → show the top-3 as a “possible matches” shortlist (not one answer).
refer → show a “we recommend seeing a dermatologist” message, no diagnosis.

This replaces any current “always show one/N results” behavior. Copy/design is a product decision; the logic is driven entirely by the tier field.

4. Testing on the emulator (no GPU needed)

The model runs in the ai-service (dev T4 / deployed); the emulator just hits the API. Point the app at the dev ai-service, submit assessments, and verify that all three tiers render correctly by exercising at least one case in each confidence band (maxprob ≥ 0.815, 0.433 to 0.815, and < 0.433).
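
While exercising cases, a small consistency check on each response body can catch contract drift before it reaches the UI. This is a sketch under assumptions: `check_tiered_response` is a hypothetical helper, it takes an already-parsed response dict (how the response is fetched is left out), it assumes `confidence` is numeric, and it assumes `refer_to_specialist` is true exactly for the `refer` tier. Pass the thresholds the service is actually configured with.

```python
from typing import Any, Dict, List

EXPECTED_ITEMS = {"confident": 1, "possible": 3, "refer": 0}


def check_tiered_response(
    resp: Dict[str, Any],
    confident_thr: float = 0.815,
    possible_thr: float = 0.433,
) -> List[str]:
    """Return a list of problems found in a parsed response; empty means consistent."""
    tier = resp.get("tier")
    if tier not in EXPECTED_ITEMS:
        return [f"unknown tier: {tier!r}"]

    problems: List[str] = []
    n_items = len(resp.get("predictions", []))
    if n_items != EXPECTED_ITEMS[tier]:
        problems.append(f"{tier}: expected {EXPECTED_ITEMS[tier]} predictions, got {n_items}")

    # The tier should match the band that the reported confidence falls into.
    confidence = resp.get("confidence", -1.0)
    if confidence >= confident_thr:
        band = "confident"
    elif confidence >= possible_thr:
        band = "possible"
    else:
        band = "refer"
    if band != tier:
        problems.append(f"confidence {confidence} falls in band {band!r}, tier is {tier!r}")

    if resp.get("refer_to_specialist") != (tier == "refer"):
        problems.append("refer_to_specialist does not match tier")
    return problems
```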

5. Caveats / do-after

Re-fit the two thresholds on the dermatologist-cleaned test set once those labels are in; the current thresholds were fit on a benchmark with ~7–16% label noise.
These are serving/client changes on the current checkpoint. A model-weights update (focal recipe) is a separate decision pending the focal confirmation run + the derm benchmark.
Infra/deploy changes go through terraform + the normal CI, not ad hoc.