Round 2 serving implementation: TTA + Confident/Possible/Refer tier
Implements the serving half of the handoff spec
(2026-07-03-abstain-ux-serving-spec). Serving/client only — no weights change;
runs on the deployed gen2a-medsiglip checkpoint. Branch: round2-abstain-serving.
What shipped
1. TTA in the predictor (src/ai_service/src/core/models/predictor.py)
Gen2Predictor.predict() gained a tta: bool = False argument. When on, each
image is run through the model twice — original and horizontal flip
(torch.flip(image, dims=[-1])) — and the sigmoid probabilities of every
view are averaged before ranking. When off, behaviour is unchanged (the existing
logit-space averaging across images).
- Probability-space averaging is used for TTA (not logit-space) to match the measured gain: top-1 0.7360→0.7377, top-3 0.9276→0.9303.
- StubPredictor.predict() accepts the same tta kwarg (ignored) so signatures stay compatible.
- get_confidence_level (MC-dropout stars) is left untouched; it is a separate mechanism.
Gated by config flag tta (env TTA, default false).
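
The difference between the two averaging orders can be sketched in plain Python. This is an illustration only, not the Gen2Predictor code: `model` is a hypothetical callable returning per-class logits, images are nested lists rather than tensors, and `hflip` stands in for torch.flip(image, dims=[-1]).

```python
import math
from typing import Callable, List, Sequence

Image = List[List[float]]                    # toy stand-in for an image tensor
Model = Callable[[Image], Sequence[float]]   # hypothetical: image -> per-class logits


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def hflip(image: Image) -> Image:
    """Reverse every row; stands in for torch.flip(image, dims=[-1])."""
    return [list(reversed(row)) for row in image]


def mean_columns(rows: Sequence[Sequence[float]]) -> List[float]:
    """Average a list of per-class vectors, class by class."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def predict_probs(model: Model, images: Sequence[Image], tta: bool = False) -> List[float]:
    if not tta:
        # Existing path: average logits across images, then apply sigmoid.
        return [sigmoid(z) for z in mean_columns([model(img) for img in images])]
    # TTA path: two views per image, sigmoid each view, then average probabilities.
    views = [view for img in images for view in (img, hflip(img))]
    return mean_columns([[sigmoid(z) for z in model(v)] for v in views])
```

With tta=False the function reduces to the logit-space path, so the flag can be toggled without changing the default output.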
2. Tier decision in the diagnosis response
src/ai_service/src/api/v1/diagnosis.py now computes a tier from the max class
probability (the model is well-calibrated, so raw max prob is the confidence):
| tier | rule |
|---|---|
| confident | maxprob ≥ tier_confident_threshold (0.815) |
| possible | tier_refer_threshold (0.433) ≤ maxprob < tier_confident_threshold |
| refer | maxprob < tier_refer_threshold |
decide_tier() is a pure helper (unit-tested at the boundaries). Thresholds are
config values (tier_confident_threshold, tier_refer_threshold; env
TIER_CONFIDENT_THRESHOLD / TIER_REFER_THRESHOLD), not hardcoded — so they
can be re-fit later on the dermatologist-cleaned test set.
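
The rule in the table above could look roughly like the sketch below. The signature of decide_tier, the `load_thresholds` helper, and the ordering check are assumptions for illustration; only the env names and the 0.815 / 0.433 values come from this description.

```python
import os
from typing import Mapping, Tuple


def load_thresholds(env: Mapping[str, str] = os.environ) -> Tuple[float, float]:
    """Hypothetical helper: read (confident, refer) thresholds from the environment."""
    confident = float(env.get("TIER_CONFIDENT_THRESHOLD", "0.815"))
    refer = float(env.get("TIER_REFER_THRESHOLD", "0.433"))
    # Assumed sanity check: the refer cut must not sit above the confident cut.
    if not 0.0 <= refer <= confident <= 1.0:
        raise ValueError("expected 0 <= refer threshold <= confident threshold <= 1")
    return confident, refer


def decide_tier(
    max_prob: float,
    confident_threshold: float = 0.815,
    refer_threshold: float = 0.433,
) -> str:
    """Pure mapping from max class probability to a tier name."""
    if max_prob >= confident_threshold:
        return "confident"
    if max_prob >= refer_threshold:
        return "possible"
    return "refer"
```

Both boundaries are inclusive on the upper tier: a max probability of exactly 0.815 is confident and exactly 0.433 is possible.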
3. Response contract
DiagnosisResponse (src/ai_service/src/schemas/diagnosis.py) adds:
tier: "confident"|"possible"|"refer", confidence: float,
refer_to_specialist: bool, highlight_count: int.
highlight_count is how many leading predictions the client highlights (the list
is sorted, so these are the highest-prob ones). The client labels the first
highlight_count as Recommended (confident tier) or Possible (possible
tier) and the rest as Alternate:
- confident → 1
- possible → leading predictions with prob > possible_item_threshold (0.25), capped at 3 (top-1 always qualifies), e.g. 0.70/0.29/0.01 → 2.
- refer → 0
A count (rather than a per-prediction label) keeps the change minimal and
backwards-compatible, and keeps all the calibratable numbers
(tier_confident_threshold, tier_refer_threshold, possible_item_threshold) in
the ai-service config so they can be re-fit together on the GPU box.
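
The highlight_count rules above can be sketched as a small pure function. The function name, argument names, and the `cap` parameter are hypothetical; the input is assumed to be the per-class probabilities already sorted in descending order.

```python
from typing import Sequence


def highlight_count(
    tier: str,
    sorted_probs: Sequence[float],
    possible_item_threshold: float = 0.25,
    cap: int = 3,
) -> int:
    """How many leading predictions the client should highlight."""
    if tier == "confident":
        return 1
    if tier == "refer":
        return 0
    if tier != "possible":
        raise ValueError(f"unknown tier: {tier!r}")
    # Count leading entries strictly above the threshold, up to the cap.
    count = 0
    for prob in sorted_probs[:cap]:
        if prob <= possible_item_threshold:
            break
        count += 1
    # Top-1 always qualifies in the possible tier.
    return max(count, 1)
```

For the example in the list, `highlight_count("possible", [0.70, 0.29, 0.01])` returns 2.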
Deviation from the spec, deliberate: the spec proposed predictions be the
truncated tiered list (1 / 3 / []). We instead keep predictions as the full
ranked list and let the client derive the tiered view from tier + confidence.
Rationale: (a) both existing frontends already consume predictions as the full
list, and (b) the confirmed UX is dermatologist-facing and shows all suggestions
relabeled (top = “Recommended”/“Possible”, rest = “Alternate”), so the client needs
the full list regardless. Truncating server-side would have broken that.
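
As an illustration of how a client could derive the relabeled view from the full ranked list, here is a Python sketch (the real clients are separate codebases, and `label_predictions` is a hypothetical name). How the refer tier is presented is not specified here; the sketch simply labels every entry Alternate when highlight_count is 0.

```python
from typing import List, Sequence

# Label for the highlighted (leading) entries, per tier.
LEAD_LABEL = {"confident": "Recommended", "possible": "Possible"}


def label_predictions(predictions: Sequence[object], tier: str, highlight_count: int) -> List[str]:
    """Return one display label per prediction, in ranked order."""
    lead = LEAD_LABEL.get(tier)
    labels = []
    for index in range(len(predictions)):
        if lead is not None and index < highlight_count:
            labels.append(lead)
        else:
            labels.append("Alternate")
    return labels
```

Because predictions stays the full ranked list, the same response serves both the highlighted entries and the alternates.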
4. Gateway passthrough (src/api_gateway/src/api/diagnosis.py)
The /api/diagnosis handler extracted only predictions from the ai-service
response. It now also forwards tier, confidence, and refer_to_specialist
into the results dict that is persisted to RnRequest.response and returned to
the clients.
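
A minimal sketch of the passthrough, assuming the ai-service response is already parsed into a dict. `build_results` and `FORWARDED_FIELDS` are hypothetical names, and skipping fields that are absent (for example from an older ai-service build) is an assumption rather than confirmed handler behaviour.

```python
from typing import Any, Dict

# Fields forwarded alongside predictions, per the description above.
FORWARDED_FIELDS = ("tier", "confidence", "refer_to_specialist")


def build_results(ai_response: Dict[str, Any]) -> Dict[str, Any]:
    """Build the results dict that is persisted and returned to clients."""
    results: Dict[str, Any] = {"predictions": ai_response["predictions"]}
    for field in FORWARDED_FIELDS:
        # Assumption: tolerate a response that lacks the new fields.
        if field in ai_response:
            results[field] = ai_response[field]
    return results
```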
Tests
- decide_tier boundary tests (pure, always run).
- Response-shape assertions extended for the new fields + tier/refer consistency.
- TTA predictor test (model-gated) + a stub test asserting the tta kwarg is accepted.

uv run pytest: 12 passed, 17 skipped (model-gated without /models/gen2a). Ruff lint + format clean.
Do-after
- Re-fit tier_confident_threshold / tier_refer_threshold on the dermatologist-cleaned test labels (current split has ~7–16% label noise).
- Enable TTA=true in the deployed ai-service env via terraform (not ad hoc).
- Client tier-based UI is a separate branch (rn_dermadetect + physician_portal), pending design review.