Round 2 serving implementation: TTA + Confident/Possible/Refer tier

Implements the serving half of the handoff spec (2026-07-03-abstain-ux-serving-spec). Serving/client only — no weights change; runs on the deployed gen2a-medsiglip checkpoint. Branch: round2-abstain-serving.

What shipped

1. TTA in the predictor (`src/ai_service/src/core/models/predictor.py`)

Gen2Predictor.predict() gained a tta: bool = False argument. When on, each image is run through the model twice — original and horizontal flip (torch.flip(image, dims=[-1])) — and the sigmoid probabilities of every view are averaged before ranking. When off, behaviour is unchanged (the existing logit-space averaging across images).

Probability-space averaging (not logit-space) is used for TTA because that is the setting in which the gain was measured: top-1 0.7360→0.7377, top-3 0.9276→0.9303.
StubPredictor.predict() accepts the same tta kwarg (ignored) so signatures stay compatible.
get_confidence_level (MC-dropout stars) is left untouched — separate mechanism.

Gated by config flag tta (env TTA, default false).
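
The two averaging modes described above can be sketched roughly as follows. This is a minimal illustration, not the code in predictor.py: images are stood in for by plain nested lists instead of tensors, and `predict_probs`, `hflip`, and `mean_columns` are hypothetical names introduced only for this example.

```python
import math
from typing import Callable, List, Sequence

Image = List[List[float]]               # H x W stand-in for an image tensor (assumption)
Model = Callable[[Image], List[float]]  # returns one logit per class (assumption)


def hflip(image: Image) -> Image:
    """Mirror left-to-right along the last axis, like torch.flip(image, dims=[-1])."""
    return [row[::-1] for row in image]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mean_columns(rows: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean across equally sized vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def predict_probs(model: Model, images: Sequence[Image], tta: bool = False) -> List[float]:
    if tta:
        # TTA on: each image contributes two views (original and flip);
        # the per-view sigmoid probabilities are averaged.
        views = [v for img in images for v in (img, hflip(img))]
        return mean_columns([[sigmoid(z) for z in model(v)] for v in views])
    # TTA off: average logits across images, then apply sigmoid once.
    mean_logits = mean_columns([model(img) for img in images])
    return [sigmoid(z) for z in mean_logits]
```

With a toy model whose two logits read the left and right pixel of a one-row image, flipping swaps the logits, so the TTA average of the two views comes out symmetric while the non-TTA path returns the plain sigmoid of the original logits.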

2. Tier decision in the diagnosis response

src/ai_service/src/api/v1/diagnosis.py now computes a tier from the max class probability (the model is well-calibrated, so raw max prob is the confidence):

| tier | rule |
| --- | --- |
| confident | maxprob ≥ `tier_confident_threshold` (0.815) |
| possible | `tier_refer_threshold` (0.433) ≤ maxprob < `tier_confident_threshold` |
| refer | maxprob < `tier_refer_threshold` |

decide_tier() is a pure helper (unit-tested at the boundaries). Thresholds are config values (tier_confident_threshold, tier_refer_threshold; env TIER_CONFIDENT_THRESHOLD / TIER_REFER_THRESHOLD), not hardcoded — so they can be re-fit later on the dermatologist-cleaned test set.
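
As an illustration of the rule, a pure helper of this kind could look like the sketch below. The signature, the default arguments, and the `load_thresholds` helper are assumptions made for the example; only the three rules, the two default values, and the env variable names come from the description above.

```python
import os
from typing import Mapping, Tuple


def decide_tier(
    max_prob: float,
    confident_threshold: float = 0.815,
    refer_threshold: float = 0.433,
) -> str:
    """Map the max class probability to a tier; boundaries are inclusive on the upper tier."""
    if max_prob >= confident_threshold:
        return "confident"
    if max_prob >= refer_threshold:
        return "possible"
    return "refer"


def load_thresholds(env: Mapping[str, str] = os.environ) -> Tuple[float, float]:
    """Hypothetical config loader: read both thresholds from the environment."""
    confident = float(env.get("TIER_CONFIDENT_THRESHOLD", "0.815"))
    refer = float(env.get("TIER_REFER_THRESHOLD", "0.433"))
    if refer > confident:
        raise ValueError("refer threshold must not exceed confident threshold")
    return confident, refer
```

Keeping the function free of I/O is what makes the boundary cases (exactly 0.815, exactly 0.433) cheap to test without a model.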

3. Response contract

DiagnosisResponse (src/ai_service/src/schemas/diagnosis.py) adds: tier: "confident"|"possible"|"refer", confidence: float, refer_to_specialist: bool, highlight_count: int.
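
For orientation, the new fields could be modelled along these lines. This is a standalone dataclass sketch, not the actual schema class: the `Prediction` shape, the class name `DiagnosisResponseSketch`, and the validation rules (in particular that `refer_to_specialist` is true exactly when the tier is `refer`) are assumptions.

```python
from dataclasses import dataclass
from typing import List, Literal

Tier = Literal["confident", "possible", "refer"]


@dataclass
class Prediction:
    # Hypothetical shape; the real prediction items may carry other fields.
    label: str
    probability: float


@dataclass
class DiagnosisResponseSketch:
    predictions: List[Prediction]  # full ranked list, highest probability first
    tier: Tier
    confidence: float              # max class probability
    refer_to_specialist: bool
    highlight_count: int           # leading predictions the client highlights

    def __post_init__(self) -> None:
        if self.tier not in ("confident", "possible", "refer"):
            raise ValueError(f"unknown tier: {self.tier!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be a probability")
        if not 0 <= self.highlight_count <= min(3, len(self.predictions)):
            raise ValueError("highlight_count out of range")
        # Assumed consistency rule: the refer flag mirrors the refer tier.
        if self.refer_to_specialist != (self.tier == "refer"):
            raise ValueError("refer_to_specialist must match tier")
```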

highlight_count is how many leading predictions the client highlights (the list is sorted, so these are the highest-prob ones). The client labels the first highlight_count as Recommended (confident tier) or Possible (possible tier) and the rest as Alternate:

confident → 1
possible → number of leading predictions with prob > possible_item_threshold (0.25), capped at 3 (top-1 always qualifies). For example, probabilities 0.70/0.29/0.01 give 2.
refer → 0
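
The counting rule above can be sketched as a small function. The name `compute_highlight_count` and its parameters are hypothetical; the input is assumed to be the probabilities of the ranked predictions, sorted from highest to lowest.

```python
from typing import Sequence


def compute_highlight_count(
    tier: str,
    sorted_probs: Sequence[float],
    possible_item_threshold: float = 0.25,
    cap: int = 3,
) -> int:
    """Number of leading predictions the client should highlight."""
    if tier == "confident":
        return 1
    if tier == "refer":
        return 0
    # Possible tier: count the leading run above the threshold, at most `cap` items.
    count = 0
    for prob in sorted_probs[:cap]:
        if prob <= possible_item_threshold:
            break
        count += 1
    # Top-1 always qualifies, even if thresholds are later re-fit unusually.
    return max(1, count)
```

Because the list is sorted, stopping at the first probability at or below the threshold is equivalent to counting all qualifying items within the cap.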

A count (rather than a per-prediction label) keeps the change minimal and backwards-compatible, and keeps all the calibratable numbers (tier_confident_threshold, tier_refer_threshold, possible_item_threshold) in the ai-service config so they can be re-fit together on the GPU box.

Deviation from the spec, deliberate: the spec proposed that predictions be the truncated tiered list (1 / 3 / []). We instead keep predictions as the full ranked list and let the client derive the tiered view from tier + highlight_count. Rationale: (a) both existing frontends already consume predictions as the full list, and (b) the confirmed UX is dermatologist-facing and shows all suggestions relabeled (top = “Recommended”/“Possible”, rest = “Alternate”), so the client needs the full list regardless. Truncating server-side would have broken that.
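
On the client side, deriving the relabeled view from the full list might look like the sketch below. It is written in Python for consistency with the service code even though the real clients are separate frontends; `label_predictions` is a hypothetical name, and treating every item as "Alternate" in the refer tier is an assumption, since the refer presentation is not specified here.

```python
from typing import List, Sequence, Tuple

# Label used for the highlighted items in each tier (refer highlights nothing).
HIGHLIGHT_LABEL = {"confident": "Recommended", "possible": "Possible"}


def label_predictions(
    names: Sequence[str], tier: str, highlight_count: int
) -> List[Tuple[str, str]]:
    """Pair each ranked prediction with its display label."""
    highlight = HIGHLIGHT_LABEL.get(tier)
    labelled = []
    for index, name in enumerate(names):
        if highlight is not None and index < highlight_count:
            labelled.append((name, highlight))
        else:
            labelled.append((name, "Alternate"))
    return labelled
```

The client needs no thresholds of its own for this: the tier picks the label and the count picks how many items receive it.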

4. Gateway passthrough (`src/api_gateway/src/api/diagnosis.py`)

The /api/diagnosis handler previously extracted only predictions from the ai-service response. It now also forwards tier, confidence, and refer_to_specialist into the results dict that is persisted to RnRequest.response and returned to the clients.
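
A passthrough of this shape could be sketched as follows. The function name `build_results` and the choice to skip fields that are absent (for example when talking to an older ai-service) are assumptions, not a description of the gateway code.

```python
from typing import Any, Dict

# Fields forwarded alongside predictions, as listed above.
FORWARDED_FIELDS = ("tier", "confidence", "refer_to_specialist")


def build_results(ai_response: Dict[str, Any]) -> Dict[str, Any]:
    """Build the results dict that is persisted and returned to clients."""
    results: Dict[str, Any] = {"predictions": ai_response["predictions"]}
    for key in FORWARDED_FIELDS:
        # Assumed behaviour: tolerate a response that lacks the new fields.
        if key in ai_response:
            results[key] = ai_response[key]
    return results
```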

Tests

decide_tier boundary tests (pure, always run).
Response-shape assertions extended for the new fields + tier/refer consistency.
TTA predictor test (model-gated) + a stub test asserting the tta kwarg is accepted.
uv run pytest: 12 passed, 17 skipped (model-gated without /models/gen2a). Ruff lint + format clean.

Do-after

Re-fit tier_confident_threshold / tier_refer_threshold on the dermatologist-cleaned test labels (current split has ~7–16% label noise).
Enable TTA=true in the deployed ai-service env via Terraform (not ad hoc).
Client tier-based UI is a separate branch (rn_dermadetect + physician_portal), pending design review.