Converged deployable CV model — MedSigLIP + reduced questionnaire (Round 1 close-out)

Date: 2026-07-01

Closes out the Round-1 CV-model improvement effort by converging the three parallel streams into one config-driven, deployable model and training it to a final checkpoint.

Result

New deployable model: gen2a-medsiglip, at top-1 73.6% / top-3 92.8% on the full 25-class test split (15,561 images), vs 67.0% / 89.0% for the ResNet50 baseline. That is +6.6 points top-1 and +3.8 points top-3.

Model (full test split)	top-1	top-3	notes
Baseline `gen2a_port` (ResNet50 + metadata)	67.0%	89.0%	prior production
Stream-1 best (ConvNeXt-V2-Base)	70.4%	—	generic backbone bake-off
MedSigLIP frozen + head	66.5%	88.8%	Round-1 probe
MedSigLIP standalone fine-tune (3ep, full survey)	72.9%	92.4%	Stream-2 win
`gen2a-medsiglip` (converged, trimmed survey, 6ep)	73.6%	92.8%	deployable
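For readers comparing the rows above, here is a minimal sketch of how top-1 and top-3 figures of this kind are typically computed. The function name and the toy scores are illustrative assumptions, not the project's evaluation code or its data.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy example: 3 classes, 4 samples (made-up numbers, not model output).
toy_scores = [
    [0.7, 0.2, 0.1],  # true class 0: ranked first
    [0.1, 0.3, 0.6],  # true class 1: ranked second
    [0.5, 0.4, 0.1],  # true class 2: ranked third
    [0.2, 0.5, 0.3],  # true class 1: ranked first
]
toy_labels = [0, 1, 2, 1]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
print(f"top-1 {top1:.2f}, top-2 {top2:.2f}")
```

Top-k can only grow with k, which is why the top-3 column sits well above top-1 for every row that reports both.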

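Notably, the converged model edges out the standalone fine-tune (73.6% vs 72.9% top-1) while using the reduced questionnaire (159 metadata dims instead of the full survey): fewer inputs to collect at same-or-better accuracy. This is consistent with the Stream 3 finding that trimmed ≈ full, although the two runs also differ in epoch count (6 vs 3).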

What converged (one trainable config)

Stream 2 (winner): MedSigLIP-448 as a first-class backbone in ddmodels/models/backbones.py (HF SigLIP vision tower → [B,1152] pooled; HAI-DEF licensed; gradient checkpointing).
Stream 3: reduced questionnaire via dataset.feature_schema_path (configs/trimmed_feature_schema.json, drops ~12 removable survey fields, keeps location + morphology).
Stream 1: backbone factory + AdamW/cosine recipe.
Plumbing: per-backbone normalization (image_norm: siglip), amp_dtype: bf16, gradient_checkpointing, all in configs/gen2a_medsiglip.yaml.
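The per-backbone normalization item can be illustrated with a small lookup keyed by the image_norm config value. This is a sketch, not the ddmodels implementation: the "imagenet" key and the helper name are assumptions, and the 0.5/0.5 values are the ones used by the Hugging Face SigLIP image processors, assumed here to carry over to MedSigLIP (worth checking against the checkpoint's own preprocessor config).

```python
from typing import Dict, Tuple

Triple = Tuple[float, float, float]

# Per-channel (mean, std) presets, selected by a config value such as `image_norm`.
# "siglip" comes from the report's config; "imagenet" is an assumed name for the
# standard ImageNet statistics that ResNet-style backbones usually expect.
NORM_PRESETS: Dict[str, Tuple[Triple, Triple]] = {
    "imagenet": ((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    "siglip": ((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
}


def normalize_pixel(rgb: Triple, image_norm: str) -> Triple:
    """Normalize one RGB pixel (values in [0, 1]) with the named preset."""
    if image_norm not in NORM_PRESETS:
        raise KeyError(f"unknown image_norm preset: {image_norm!r}")
    mean, std = NORM_PRESETS[image_norm]
    r, g, b = ((c - m) / s for c, m, s in zip(rgb, mean, std))
    return (r, g, b)


# With the siglip preset, [0, 1] pixel values map onto [-1, 1].
siglip_pixel = normalize_pixel((1.0, 0.5, 0.0), "siglip")
print(siglip_pixel)
```

Keeping the preset name in the config means training and serving can both derive the same statistics from one place. Feeding ImageNet-normalized pixels to a SigLIP tower, or the reverse, tends to degrade accuracy without raising any error.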

Recipe: 25 target classes plus a sampled "other" class (the baseline setup; it gives the model an abstain option), 1 frozen epoch followed by 6 fine-tune epochs, AdamW, cosine schedule, bf16, batch size 64. Val top-1 peaked at 72.8% at epoch 3 (saved as best_model.pt) and then plateaued, so the recipe saturates at ~73%.
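The shape of this recipe (one frozen epoch, then six fine-tune epochs under a cosine schedule) can be sketched per epoch as below. Several details are assumptions made for illustration only: that "frozen" means the backbone is frozen while the head trains, that the cosine decay starts when fine-tuning begins, and the base learning rate, which the report does not state.

```python
import math
from typing import List, Tuple


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate: base_lr at step 0, min_lr at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


FROZEN_EPOCHS = 1    # from the recipe
FINETUNE_EPOCHS = 6  # from the recipe
BASE_LR = 1e-4       # hypothetical value, not stated in the report


def epoch_plan() -> List[Tuple[int, bool, float]]:
    """Return (epoch, backbone_trainable, lr_at_epoch_start) for the whole run."""
    plan = []
    for epoch in range(FROZEN_EPOCHS + FINETUNE_EPOCHS):
        backbone_trainable = epoch >= FROZEN_EPOCHS
        # Assumption: the frozen epoch runs at base LR and decay starts afterwards.
        finetune_epoch = max(epoch - FROZEN_EPOCHS, 0)
        lr = cosine_lr(finetune_epoch, FINETUNE_EPOCHS, BASE_LR)
        plan.append((epoch, backbone_trainable, lr))
    return plan


for epoch, trainable, lr in epoch_plan():
    print(f"epoch {epoch}: backbone_trainable={trainable} lr={lr:.2e}")
```

A real training loop would normally step the schedule per batch rather than per epoch; the per-epoch view is only meant to show where the frozen phase ends and how the learning rate falls off.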

Training run (rented GPU)

Ran on a rented RunPod H100 80GB (Terraform: ../terraform/env-runpod-training), with the dataset pulled from S3 (dermadetect-ml-datasets), bf16, ~33 min/epoch. The Pascal dev box can’t train ViTs at 448px (fp16 runs at 1/64 of fp32 throughput on Pascal). The pod was destroyed afterwards; total GPU spend was ~$25 of the $50 budget.

Model artifacts (S3)

Uploaded to the models bucket, same layout as gen2a_pytorch/:

s3://dermadetect-models/models/gen2a_medsiglip/ best_model.pt (1.7 GB), config.json, labels.txt, feature_schema.json (trimmed), metrics.json.
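Because serving depends on all five files being present, a pre-deploy check of a local copy of the artifact directory may be useful. The sketch below uses only the file names listed above; the helper name and the idea of running it against a downloaded copy are assumptions, not part of the existing tooling.

```python
import tempfile
from pathlib import Path
from typing import List, Union

# File names taken from the artifact listing in this report.
EXPECTED_ARTIFACTS = (
    "best_model.pt",
    "config.json",
    "labels.txt",
    "feature_schema.json",
    "metrics.json",
)


def missing_artifacts(model_dir: Union[str, Path]) -> List[str]:
    """Return the expected artifact names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_ARTIFACTS if not (root / name).is_file()]


# Demo on a throwaway directory that holds only two of the five files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "labels.txt"):
        (Path(tmp) / name).write_text("")
    demo_missing = missing_artifacts(tmp)

print(demo_missing)
```

An empty list means the directory has the full layout. The check looks at presence only and does not validate file contents.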

Deployment

Stays on the current T4 (g4dn.xlarge), with no GPU upgrade. Inference is forward-only (~2–2.5 GB of the 16 GB VRAM), at ~150–250 ms p95 at 448px (accepted). The ai-service Gen2Predictor was updated to build the backbone and normalization from config.json, so pointing MODEL_DIRECTORY at the gen2a_medsiglip artifacts serves the new model with no further service changes. transformers was added to the ddmodels deps (needed to load MedSigLIP). Legal: the HAI-DEF clinical-use clause must be reviewed before production.
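To re-check the p95 latency figure after a deployment change, a measurement along these lines could be used. It is a generic sketch: the function names are hypothetical, the callable stands in for a single prediction, and on a GPU the callable would need to block until the result is ready (for example by synchronizing the device) for the timings to mean anything.

```python
import statistics
import time
from typing import Callable, List, Sequence


def p95(samples: Sequence[float]) -> float:
    """95th percentile of the samples (needs at least two values)."""
    # quantiles(n=20) returns 19 cut points at 5% steps; the last one is p95.
    return statistics.quantiles(samples, n=20)[-1]


def measure_latencies_ms(
    predict: Callable[[], object], warmup: int = 5, runs: int = 100
) -> List[float]:
    """Time `predict` `runs` times after `warmup` untimed calls; results in ms."""
    for _ in range(warmup):
        predict()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        predict()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


# Stand-in workload; a real check would call the predictor on a 448px image.
latencies = measure_latencies_ms(lambda: sum(range(10_000)), warmup=2, runs=50)
print(f"p95 = {p95(latencies):.3f} ms over {len(latencies)} runs")
```

Warm-up calls are excluded because the first few requests often include one-off costs such as lazy initialization, which would inflate the tail.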

Next (Round 2)

The model/recipe is ~maxed (~73%, plateaued). Highest-value next levers, in order:

Label cleanup (Round-2 #4): high AUC combined with a hard plateau suggests that label noise is capping accuracy, so the measured 73.6% likely understates true accuracy. Vetting tooling already exists.
Cheap add-ons: test-time augmentation (+0.5–1.5%), weight EMA, layer-wise LR decay.
More labeled training data (Stream-1 finding: the dominant lever).
Class-imbalance handling (focal loss) plus Eczema de-biasing, and a calibrated abstain option (Round-2 #5). Ensembling (with ConvNeXt-V2) would add ~1–2% but double serving cost, which is not worth it on the T4.
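Two of the cheap add-ons listed above, weight EMA and layer-wise LR decay, reduce to short formulas. The sketch below shows them on plain Python values; the decay constants, parameter names and layer count are illustrative assumptions, and a real implementation would operate on the model's tensors and optimizer parameter groups.

```python
from typing import Dict, List


def ema_update(
    ema: Dict[str, float], current: Dict[str, float], decay: float = 0.999
) -> Dict[str, float]:
    """Update EMA weights in place: ema = decay * ema + (1 - decay) * current."""
    for name, value in current.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value
    return ema


def layerwise_lrs(num_layers: int, base_lr: float, decay: float = 0.75) -> List[float]:
    """Per-layer learning rates: the last layer gets base_lr, and each earlier
    layer gets the rate of the layer after it multiplied by `decay`."""
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# EMA: the averaged copy trails the live weights and is what gets evaluated.
ema_weights = {"w": 1.0}
ema_update(ema_weights, {"w": 0.0}, decay=0.5)

# LLRD: early layers (generic features) move less than late layers.
lrs = layerwise_lrs(num_layers=3, base_lr=1.0, decay=0.5)
print(ema_weights, lrs)
```

Both techniques leave the architecture and the serving path untouched, which is what makes them cheap: EMA changes which weights are saved, and layer-wise decay changes only the optimizer's per-group learning rates.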