Train/serve metadata skew (and the fix)

What was wrong

The deployed gen2a-medsiglip model consumes a flat 19-field metadata schema (quantity, location_primary, itch (bool), pain_type, cough, …). The current apps (rn_dermadetect + physician_portal) submit the anamnesis under the raw question keys (distribution, location1.primary, itch.scale_0_to_5, pain.pain_type, …), the gateway forwards that JSON untransformed, and MetadataEncoder.encode() reads each field by exact name with no aliasing. As a result, every renamed field falls through to the encoder’s “unknown” encoding.
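
To make the failure mode concrete, here is a minimal sketch of an exact-name lookup with no aliasing. encode_field and the UNKNOWN sentinel are hypothetical stand-ins, not the real MetadataEncoder API, and the answer values are made up; only the key names come from the payloads described above.

```python
# Hypothetical stand-in for an exact-name field lookup; not the real MetadataEncoder.
UNKNOWN = "unknown"


def encode_field(payload: dict, field: str) -> object:
    """Return the payload value for `field`, or UNKNOWN if that exact key is absent."""
    return payload.get(field, UNKNOWN)


# Raw app payload: question keys, not the schema's flat names (values are made up).
raw = {"location1.primary": "arm", "pain.pain_type": "burning", "vesicle": "yes"}

print(encode_field(raw, "location_primary"))  # unknown: the key is "location1.primary"
print(encode_field(raw, "pain_type"))         # unknown: the key is "pain.pain_type"
print(encode_field(raw, "vesicle"))           # yes: the name happens to match
```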

The training ETL (run_etl.py) does the name/shape normalization at training time, but it was written for the old app’s request shape (nested location array, pain object, boolean itch) — it never knew the new dotted keys, and nothing in the serving path replicated it.

Measured against the deployed feature_schema.json (pulled from s3://dermadetect-models/models/gen2a_medsiglip/, byte-identical to the checked-in trimmed_feature_schema.json), a real app payload reaches only vesicle + texture (2 of 19 fields). Everything else was silently encoded as “unknown”, including location, the model’s single strongest signal (~11 top-1 pts per the Stream-3 ablation). In production the model was therefore running close to image-only (~52% top-1) despite the 73.6% offline benchmark, which was measured on properly named ETL data and so never surfaced the skew.
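
The coverage count above can be reproduced with a check along these lines. This is a sketch only: reached_fields is a hypothetical helper, schema_subset is an illustrative handful of field names taken from this description (not the full 19-field feature_schema.json), and the payload values are made up.

```python
def reached_fields(schema_fields: list[str], payload: dict) -> list[str]:
    """Schema fields whose exact name is present in the payload with a non-None value."""
    return [f for f in schema_fields if payload.get(f) is not None]


# Illustrative subset of schema field names, NOT the full deployed schema.
schema_subset = ["quantity", "location_primary", "itch", "pain_type", "vesicle", "texture"]

# Made-up values; only the key names follow the raw app payload.
raw_payload = {
    "distribution": "single",
    "location1.primary": "arm",
    "itch.scale_0_to_5": 3,
    "pain.pain_type": "burning",
    "vesicle": "no",
    "texture": "smooth",
}

hit = reached_fields(schema_subset, raw_payload)
print(f"{len(hit)} of {len(schema_subset)} fields reached: {hit}")
# 2 of 6 fields reached: ['vesicle', 'texture']
```

Running the same helper on the real schema and a captured payload, before and after normalization, gives a quick regression check for this kind of skew.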

Corollary: the Round-2 abstain/referral tier thresholds (0.815 / 0.433) were fit on the metadata-fed distribution, so they are miscalibrated for the metadata-starved production distribution and should be re-fit after this lands.

The fix

A pure-Python normalize_anamnesis(raw) -> dict in ddmodels (src/common/ddmodels/src/ddmodels/anamnesis.py), applied in the ai-service predictor’s preprocess_metadata just before encoding. It renames keys and maps values/types:

rename-only (values already match): distribution→quantity, elevation→topography, pain.pain_type→pain_type
value vocab: size buckets (pea_or_smaller→pea, …)
scale→bool: itch.scale_0_to_5→itch, pain.scale_0_to_5→pain_is_pain
rename / passthrough: swelling→location_swelling; vesicle, cough, texture keep their names
locations: locationN.primary/.secondary → location_primary/location_secondary (scalars) + primary_locations/secondary_locations (lists)
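
The mapping rules listed above can be sketched roughly as follows. This is not the code in anamnesis.py. It is an illustration under stated assumptions: the function name normalize_anamnesis_sketch is hypothetical, a scale answer is treated as true when it is greater than 0, the size answer is assumed to live under a "size" key on both sides, the scalar location is assumed to be the lowest-numbered one, and unrecognized keys are dropped. Only the one size bucket named above is included.

```python
import re

# Key names below come from the mapping list; everything marked "assumption" is a guess.
RENAMES = {
    "distribution": "quantity",
    "elevation": "topography",
    "pain.pain_type": "pain_type",
    "swelling": "location_swelling",
}
SCALE_TO_BOOL = {"itch.scale_0_to_5": "itch", "pain.scale_0_to_5": "pain_is_pain"}
PASSTHROUGH = ("vesicle", "cough", "texture")
SIZE_VOCAB = {"pea_or_smaller": "pea"}  # only the bucket named in the description
LOCATION_KEY = re.compile(r"^location(\d+)\.(primary|secondary)$")


def normalize_anamnesis_sketch(raw: dict) -> dict:
    out: dict = {}
    locations: dict = {"primary": [], "secondary": []}
    for key, value in raw.items():
        if value is None:
            continue  # missing answers stay absent, i.e. "unknown" downstream
        match = LOCATION_KEY.match(key)
        if match:
            locations[match.group(2)].append((int(match.group(1)), value))
        elif key in RENAMES:
            out[RENAMES[key]] = value
        elif key in SCALE_TO_BOOL:
            out[SCALE_TO_BOOL[key]] = value > 0  # assumption: numeric scale, any score above 0 is True
        elif key == "size":  # assumption: raw and schema key are both "size"
            out["size"] = SIZE_VOCAB.get(value, value)
        elif key in PASSTHROUGH:
            out[key] = value
        # assumption: any other key is dropped
    for kind, found in locations.items():
        if found:
            ordered = [v for _, v in sorted(found, key=lambda item: item[0])]
            out[f"location_{kind}"] = ordered[0]  # assumption: scalar = lowest-numbered location
            out[f"{kind}_locations"] = ordered
    return out


example = normalize_anamnesis_sketch({
    "distribution": "single",          # made-up values throughout
    "location2.primary": "leg",
    "location1.primary": "arm",
    "itch.scale_0_to_5": 0,
    "pain.scale_0_to_5": 2,
    "vesicle": "no",
})
print(example)
```

Keeping the rules in plain dictionaries makes the per-field mapping easy to unit test and keeps the module free of any torch import.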

Fields the app doesn’t collect (location_side, location_coverage, widespread_face/widespread_palm_feet) are left absent (unknown) — deriving the widespread flags reliably needs per-region context the app doesn’t capture.

Result against the deployed schema: the same app payload now reaches 14 of 19 fields (from 2). Placed in ddmodels (torch-free module) so the ai-service uses it now and the ETL can reuse it later; the gateway is untouched (it has no ddmodels/torch dependency).

Verification

normalize_anamnesis unit tests (per-field mapping, dropped fields, missing values) + an end-to-end MetadataEncoder round-trip test proving normalized input activates fields the raw payload leaves unknown.
uv run pytest (ai-service): 25 passed, 17 skipped (model-gated). Ruff clean.
Deployed feature_schema.json confirmed identical to trimmed_feature_schema.json.

Do-after

Deploy the ai-service (terraform + CI, not ad hoc) — the fix only takes effect once the updated service is live.
Re-fit the tier thresholds on true production-shaped inputs (GPU box).
ETL / future retrains: have run_etl.py consume normalize_anamnesis (or store normalized metadata) so new-app training data uses correct field names; otherwise new data is skewed the same way. Tracked separately.
Consider retraining once metadata actually flows (accuracy should recover toward the benchmark).