Train/serve metadata skew (and the fix)
What was wrong
The deployed gen2a-medsiglip model consumes a flat 19-field metadata schema
(quantity, location_primary, itch (bool), pain_type, cough, …). The
current apps (rn_dermadetect + physician_portal) submit the anamnesis under the
raw question keys (distribution, location1.primary, itch.scale_0_to_5,
pain.pain_type, …), the gateway forwards that JSON untransformed, and
MetadataEncoder.encode() reads each field by exact name with no aliasing. As a
result, every renamed field fell through to the encoder’s “unknown” encoding.
The training ETL (run_etl.py) does the name/shape normalization at training
time, but it was written for the old app’s request shape (nested location
array, pain object, boolean itch) — it never knew the new dotted keys, and
nothing in the serving path replicated it.
Measured against the deployed feature_schema.json (pulled from
s3://dermadetect-models/models/gen2a_medsiglip/, byte-identical to the
checked-in trimmed_feature_schema.json): a real app payload reaches only
vesicle + texture (2 of 19 fields). Everything else — including
location, the model’s single strongest signal (~11 top-1 pts per the Stream-3
ablation) — was silently “unknown”. In production the model was running close to
image-only (~52% top-1) despite the 73.6% offline benchmark (which was measured on
properly-named ETL data, so it never surfaced this).
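
The coverage measurement above can be sketched as follows. This is a toy illustration, not the real encoder: SCHEMA_FIELDS is a hypothetical six-field subset of the 19-field schema, the payload values are made up, and encode_exact only imitates the exact-name lookup described for MetadataEncoder.encode().

```python
# Hypothetical subset of the flat schema; the deployed feature_schema.json has 19 fields.
SCHEMA_FIELDS = ["quantity", "location_primary", "itch", "pain_type", "vesicle", "texture"]

# Illustrative app payload using the raw question keys; the values are invented.
raw_payload = {
    "distribution": "single",
    "location1.primary": "arm",
    "itch.scale_0_to_5": 3,
    "pain.pain_type": "burning",
    "vesicle": True,
    "texture": "rough",
}


def reached_fields(payload, schema_fields):
    """Return the schema fields the payload supplies under their exact name."""
    return [name for name in schema_fields if name in payload]


def encode_exact(payload, schema_fields, unknown="unknown"):
    """Toy stand-in for exact-name lookup: a missing field becomes `unknown`."""
    return {name: payload.get(name, unknown) for name in schema_fields}


reached = reached_fields(raw_payload, SCHEMA_FIELDS)
print(f"{len(reached)} of {len(SCHEMA_FIELDS)} fields reached: {reached}")
print(encode_exact(raw_payload, SCHEMA_FIELDS))
```

With this toy payload only vesicle and texture match by name; the location, itch, and pain answers are present in the request but invisible to an exact-name lookup.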
Corollary: the Round-2 abstain/referral tier thresholds (0.815 / 0.433) were fit on the metadata-fed distribution, so they are miscalibrated for the metadata-starved production distribution and should be re-fit after this lands.
The fix
A pure-Python normalize_anamnesis(raw) -> dict in ddmodels
(src/common/ddmodels/src/ddmodels/anamnesis.py), applied in the ai-service
predictor’s preprocess_metadata just before encoding. It renames keys and maps
values/types:
- rename-only (values already match): distribution→quantity, elevation→topography, pain.pain_type→pain_type
- value vocab: size buckets (pea_or_smaller→pea, …)
- scale→bool: itch.scale_0_to_5→itch, pain.scale_0_to_5→pain_is_pain
- rename passthrough: swelling→location_swelling, vesicle, cough, texture
- locations: locationN.primary/.secondary→location_primary/location_secondary (scalars) + primary_locations/secondary_locations (lists)
Fields the app doesn’t collect (location_side, location_coverage,
widespread_face/widespread_palm_feet) are left absent (unknown) — deriving the
widespread flags reliably needs per-region context the app doesn’t capture.
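
A minimal sketch of what such a normalizer might look like, built only from the mapping listed above. It is not the code in anamnesis.py. Several details are assumptions made for illustration: the raw and target key for the size bucket are both taken to be size, unmapped size buckets pass through unchanged, a scale value above 0 counts as True, and the scalar location is the lowest-numbered locationN entry.

```python
import re

# Keys whose values pass through unchanged (taken from the mapping above).
RENAMES = {
    "distribution": "quantity",
    "elevation": "topography",
    "pain.pain_type": "pain_type",
    "swelling": "location_swelling",
    "vesicle": "vesicle",
    "cough": "cough",
    "texture": "texture",
}
# Only pea_or_smaller -> pea is stated in the description; other buckets are elided.
SIZE_VOCAB = {"pea_or_smaller": "pea"}
SCALE_TO_BOOL = {"itch.scale_0_to_5": "itch", "pain.scale_0_to_5": "pain_is_pain"}
LOCATION_KEY = re.compile(r"^location(\d+)\.(primary|secondary)$")


def normalize_anamnesis(raw: dict) -> dict:
    """Map raw question keys to flat schema names; unmapped keys are dropped."""
    out = {}
    locations = {"primary": [], "secondary": []}
    for key, value in raw.items():
        if value is None:
            continue  # missing answers stay absent, i.e. unknown to the encoder
        if key in RENAMES:
            out[RENAMES[key]] = value
        elif key == "size":  # assumption: raw key and schema field are both "size"
            out["size"] = SIZE_VOCAB.get(value, value)
        elif key in SCALE_TO_BOOL:
            out[SCALE_TO_BOOL[key]] = value > 0  # assumption: any score above 0 is True
        else:
            match = LOCATION_KEY.match(key)
            if match:
                locations[match.group(2)].append((int(match.group(1)), value))
    for kind, items in locations.items():
        if items:
            ordered = [v for _, v in sorted(items, key=lambda item: item[0])]
            out[f"location_{kind}"] = ordered[0]  # assumption: scalar = first location
            out[f"{kind}_locations"] = ordered
    return out


raw = {
    "distribution": "single",
    "location2.primary": "leg",
    "location1.primary": "arm",
    "itch.scale_0_to_5": 0,
    "size": "pea_or_smaller",
    "vesicle": True,
    "cough": None,
}
normalized = normalize_anamnesis(raw)
print(normalized)
```

Keeping the function free of torch and of any encoder import matches the stated reason for placing it in ddmodels: both the serving path and the ETL could call it.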
Result against the deployed schema: the same app payload now reaches 14 of
19 fields (from 2). Placed in ddmodels (torch-free module) so the ai-service
uses it now and the ETL can reuse it later; the gateway is untouched (it has no
ddmodels/torch dependency).
Verification
- normalize_anamnesis unit tests (per-field mapping, dropped fields, missing values) + an end-to-end MetadataEncoder round-trip test proving normalized input activates fields the raw payload leaves unknown.
- uv run pytest (ai-service): 25 passed, 17 skipped (model-gated). Ruff clean.
- Deployed feature_schema.json confirmed identical to trimmed_feature_schema.json.
Do-after
- Deploy the ai-service (terraform + CI, not ad hoc) — the fix only takes effect once the updated service is live.
- Re-fit the tier thresholds on true production-shaped inputs (GPU box).
- ETL / future retrains: have run_etl.py consume normalize_anamnesis (or store normalized metadata) so new-app training data uses correct field names; otherwise new data is skewed the same way. Tracked separately.
- Consider retraining once metadata actually flows (accuracy should recover toward the benchmark).