
Handoff spec: TTA + Confident/Possible/Refer UX

Implementation spec for the two Round-2 wins that ship on the current gen2a-medsiglip model (no new checkpoint): test-time augmentation (TTA), an accuracy gain that needs no retraining, and the 3-tier abstain/referral UX (the main product win). Coding will be done on the emulator machine; this doc is the contract. All numbers are from the full 15,561-image test split (see 2026-07-01-round2-streamC-clinical-eval and 2026-07-03-round2-streamB-results).

1. Serving: TTA in the predictor (ai-service)

In the ai-service inference path (the predictor under src/ai_service/src/... that builds the MedSigLIP forward pass), run each image twice, once as-is and once horizontally flipped, and average the two sets of sigmoid probabilities before ranking. Measured effect: top-1 0.7360→0.7377, top-3 0.9276→0.9303. Cost: ~2× the vision-tower forward pass (fine on the T4). Gate it behind a config flag (tta: true) so it can be toggled.
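A minimal sketch of the flip-and-average step, assuming a hypothetical `predict_probs` callable that returns per-class sigmoid probabilities for one image. Images are modeled as plain nested lists to keep the example dependency-free; the real predictor would flip the image tensor instead. All names here are illustrative, not the ai-service API.

```python
from typing import Callable, List

Image = List[List[float]]  # H x W grid; stand-in for the real image tensor


def hflip(image: Image) -> Image:
    """Mirror the image left-to-right by reversing each row."""
    return [list(reversed(row)) for row in image]


def predict_with_tta(
    image: Image,
    predict_probs: Callable[[Image], List[float]],  # hypothetical model wrapper
    tta: bool = True,  # mirrors the `tta: true` config flag
) -> List[float]:
    """Return per-class probabilities, averaged over original + flipped views."""
    probs = list(predict_probs(image))
    if not tta:
        return probs
    flipped_probs = predict_probs(hflip(image))
    # Average in probability space (after the sigmoid), as the spec describes.
    return [(a + b) / 2.0 for a, b in zip(probs, flipped_probs)]
```

With `tta=False` the function is a plain single forward pass, so the flag can be flipped without touching the ranking code downstream.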

2. Serving: tier decision in the diagnosis response

Confidence = max class probability over the 25 target classes (model is already well-calibrated, temperature T≈1.03, so use raw max prob). Assign a tier by two thresholds (from the accuracy@coverage curve):

| tier | rule | coverage | quality on covered set | client shows |
|---|---|---|---|---|
| Confident | maxprob ≥ 0.815 | ~50% of cases | top-1 91.7% | single diagnosis + confidence |
| Possible | 0.433 ≤ maxprob < 0.815 | ~37% | top-3 91.2% | ranked top-3 shortlist |
| Refer | maxprob < 0.433 | ~13% | (uncertain) | “refer to a specialist” |
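The tier rule can be sketched as a small pure function. The function and constant names are hypothetical; the default thresholds are the two values from the table and are exposed as parameters so they can be fed from config.

```python
from typing import Sequence

# Defaults taken from the spec; intended to be overridden from config.
CONFIDENT_THRESHOLD = 0.815
POSSIBLE_THRESHOLD = 0.433


def assign_tier(
    class_probs: Sequence[float],
    confident_threshold: float = CONFIDENT_THRESHOLD,
    possible_threshold: float = POSSIBLE_THRESHOLD,
) -> str:
    """Map the max class probability to 'confident', 'possible' or 'refer'."""
    if not class_probs:
        raise ValueError("class_probs must not be empty")
    maxprob = max(class_probs)
    if maxprob >= confident_threshold:
        return "confident"
    if maxprob >= possible_threshold:
        return "possible"
    return "refer"
```

Both comparisons are inclusive on the lower bound, matching the ≥ and ≤ signs in the table.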

Outcome over all patients: 45.9% confident-correct, 4.1% confident-wrong, 33.8% possible-with-truth-in-top-3, 3.2% possible-miss, 13% refer. The 4.1% confident-wrong is the key clinical-risk number. If a stricter bar is wanted, the Confident threshold can be raised so that only the top ~37% of cases by maxprob qualify, which gives 95% top-1 and ~1.9% confident-wrong. Make the two thresholds config values, not hardcoded.
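As a cross-check, the outcome shares follow from coverage × quality for each tier. The sketch below uses the rounded coverages from the table (50 / 37 / 13), so it should agree with the quoted figures only to within about 0.1 percentage points; the names are illustrative.

```python
from typing import Dict

# Rounded inputs from the tier table (coverage in percent of all cases).
COVERAGE_PCT = {"confident": 50.0, "possible": 37.0, "refer": 13.0}
QUALITY = {"confident": 0.917, "possible": 0.912}  # top-1 and top-3 accuracy


def outcome_shares(
    coverage_pct: Dict[str, float], quality: Dict[str, float]
) -> Dict[str, float]:
    """Split each covered tier into hit/miss shares of all patients."""
    shares: Dict[str, float] = {}
    for tier in ("confident", "possible"):
        hit = coverage_pct[tier] * quality[tier]
        shares[f"{tier}_hit"] = hit
        shares[f"{tier}_miss"] = coverage_pct[tier] - hit
    shares["refer"] = coverage_pct["refer"]
    return shares


if __name__ == "__main__":
    for name, pct in outcome_shares(COVERAGE_PCT, QUALITY).items():
        print(f"{name}: {pct:.1f}%")
```

The same arithmetic explains the stricter option: ~37% coverage at 95% top-1 leaves roughly 37 × 0.05 ≈ 1.9% confident-wrong.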

Proposed API contract (diagnosis endpoint, src/ai_service/src/api/v1/diagnosis.py)

Add to the response:

    {
      "tier": "confident|possible|refer",
      "predictions": [
        {"diagnosis": "...", "probability": 0.0}
      ],                                // 1 item if confident, 3 if possible, [] if refer
      "confidence": 0.0,                // max class prob
      "refer_to_specialist": false
    }

Keep the existing full ranked list too (for debugging / physician view). predictions is the tiered view the patient app renders.
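A sketch of how the tiered fields might be assembled from the existing ranked list. `build_tiered_fields` is a hypothetical helper, not existing code in diagnosis.py, and it assumes `refer_to_specialist` is true exactly when the tier is refer, which the contract does not state explicitly.

```python
from typing import Dict, List, Tuple

# Number of predictions the patient app receives per tier (from the contract).
PREDICTIONS_PER_TIER = {"confident": 1, "possible": 3, "refer": 0}


def build_tiered_fields(
    ranked: List[Tuple[str, float]], tier: str
) -> Dict[str, object]:
    """Build the new response fields.

    `ranked` is the full list of (diagnosis, probability) pairs, already
    sorted by descending probability.
    """
    if tier not in PREDICTIONS_PER_TIER:
        raise ValueError(f"unknown tier: {tier!r}")
    top = ranked[: PREDICTIONS_PER_TIER[tier]]
    return {
        "tier": tier,
        "predictions": [{"diagnosis": d, "probability": p} for d, p in top],
        "confidence": ranked[0][1] if ranked else 0.0,  # max class prob
        # Assumption: the flag simply mirrors the refer tier.
        "refer_to_specialist": tier == "refer",
    }
```

The full ranked list stays untouched, so the debugging / physician view can keep using it alongside these fields.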

3. Client (rn_dermadetect): three result states

In the assessment result screen, branch on tier:

  • confident → show the single diagnosis prominently + a confidence indicator.
  • possible → show the top-3 as a “possible matches” shortlist (not one answer).
  • refer → show a “we recommend seeing a dermatologist” message, no diagnosis.

This replaces any current “always show one/N results” behavior. Copy/design is a product decision; the logic is driven entirely by the tier field.

4. Testing on the emulator (no GPU needed)

The model runs in the ai-service (dev T4 / deployed); the emulator just hits the API. Point the app at the dev ai-service, submit an assessment, verify the three tiers render correctly by exercising cases across the confidence range.
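When exercising cases across the confidence range, responses captured from the dev ai-service can be checked against the contract with a small validator. This is a sketch: `check_tiered_response` is a hypothetical helper, and the `refer_to_specialist` check assumes the flag is true only for the refer tier.

```python
from typing import Any, Dict, List

EXPECTED_COUNT = {"confident": 1, "possible": 3, "refer": 0}


def check_tiered_response(resp: Dict[str, Any]) -> List[str]:
    """Return a list of contract violations; an empty list means OK."""
    tier = resp.get("tier")
    if tier not in EXPECTED_COUNT:
        return [f"unknown tier: {tier!r}"]

    problems: List[str] = []
    predictions = resp.get("predictions", [])
    if len(predictions) != EXPECTED_COUNT[tier]:
        problems.append(
            f"{tier}: expected {EXPECTED_COUNT[tier]} predictions, "
            f"got {len(predictions)}"
        )
    probs = [p["probability"] for p in predictions]
    if probs != sorted(probs, reverse=True):
        problems.append("predictions are not ranked by descending probability")
    # Assumption: refer_to_specialist mirrors the refer tier.
    if resp.get("refer_to_specialist") != (tier == "refer"):
        problems.append("refer_to_specialist does not match tier")
    return problems
```

Running it over one saved response per tier gives a quick pass/fail before checking how each state renders in the app.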

5. Caveats / do-after

  • Re-fit the two thresholds on the dermatologist-cleaned test set once those labels are in; the current thresholds were fit on a benchmark with ~7-16% label noise.
  • These are serving/client changes on the current checkpoint. A model-weights update (focal recipe) is a separate decision pending the focal confirmation run + the derm benchmark.
  • Infra/deploy changes go through terraform + the normal CI, not ad hoc.