Converged deployable CV model: MedSigLIP + reduced questionnaire (Round 1 close-out)
Date: 2026-07-01
Closes out the Round-1 CV-model improvement effort by converging the three parallel streams into one config-driven, deployable model and training it to a final checkpoint.
Result
New deployable model: gen2a-medsiglip, with top-1 73.6% / top-3 92.8% on the full 25-class
test split (15,561 images), vs the 67.0% / 89.0% ResNet50 baseline: +6.6 points top-1 and +3.8 points top-3.
| Model (full test split) | top-1 | top-3 | notes |
|---|---|---|---|
| Baseline gen2a_port (ResNet50 + metadata) | 67.0% | 89.0% | prior production |
| Stream-1 best (ConvNeXt-V2-Base) | 70.4% | n/a | generic backbone bake-off |
| MedSigLIP frozen + head | 66.5% | 88.8% | Round-1 probe |
| MedSigLIP standalone fine-tune (3ep, full survey) | 72.9% | 92.4% | Stream-2 win |
| gen2a-medsiglip (converged, trimmed survey, 6ep) | 73.6% | 92.8% | deployable |
Notably, it edges out the standalone fine-tune while using the reduced questionnaire (159 metadata dims vs the full survey): fewer inputs to collect, same-or-better accuracy (Stream 3 confirmed trimmed ≈ full).
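
For readers less familiar with the metrics in the table, the following is a minimal sketch of how top-1 and top-3 accuracy are typically computed from per-class scores. The function name and the toy data are illustrative assumptions, not the project's evaluation code.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; ties keep index order.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy data (made up): 3 samples, 4 classes.
toy_scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1: counted at top-1 and top-3
    [0.4, 0.3, 0.2, 0.1],  # true class 2: counted at top-3 only
    [0.7, 0.1, 0.1, 0.1],  # true class 3: missed at both
]
toy_labels = [1, 2, 3]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top3 = top_k_accuracy(toy_scores, toy_labels, k=3)
```

Top-3 is always at least as high as top-1, which is why it is the more forgiving number when the model's output is shown as a short differential rather than a single answer.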
What converged (one trainable config)
- Stream 2 (winner): MedSigLIP-448 as a first-class backbone in `ddmodels/models/backbones.py` (HF SigLIP vision tower → `[B,1152]` pooled; HAI-DEF licensed; gradient checkpointing).
- Stream 3: reduced questionnaire via `dataset.feature_schema_path` (`configs/trimmed_feature_schema.json`, drops ~12 removable survey fields, keeps location + morphology).
- Stream 1: backbone factory + AdamW/cosine recipe.
- Plumbing: per-backbone normalization (`image_norm: siglip`), `amp_dtype: bf16`, `gradient_checkpointing`, all in `configs/gen2a_medsiglip.yaml`.
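
To illustrate the per-backbone normalization item, here is a hedged sketch of how an `image_norm` setting could be resolved into channel statistics. The lookup table, the function names, the top-level placement of the key, and the `imagenet` default are assumptions made for illustration; only the `image_norm: siglip` key comes from the config described above. SigLIP image processors commonly use mean 0.5 / std 0.5, while ResNet-style backbones usually expect ImageNet statistics.

```python
from typing import Dict, Sequence, Tuple

Stats = Tuple[Tuple[float, float, float], Tuple[float, float, float]]

# (mean, std) per RGB channel, for pixels already scaled to [0, 1].
IMAGE_NORMS: Dict[str, Stats] = {
    # Standard ImageNet statistics (typical for ResNet-style backbones).
    "imagenet": ((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    # SigLIP-style preprocessing: maps [0, 1] to [-1, 1].
    "siglip": ((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
}


def resolve_image_norm(config: dict) -> Stats:
    """Pick normalization stats from a config dict (hypothetical layout)."""
    name = config.get("image_norm", "imagenet")  # default is an assumption
    if name not in IMAGE_NORMS:
        raise ValueError(
            f"unknown image_norm {name!r}; expected one of {sorted(IMAGE_NORMS)}"
        )
    return IMAGE_NORMS[name]


def normalize_pixel(rgb: Sequence[float], stats: Stats) -> Tuple[float, ...]:
    """Apply (value - mean) / std to each channel of one pixel."""
    mean, std = stats
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))


siglip_stats = resolve_image_norm({"image_norm": "siglip"})
example = normalize_pixel((0.0, 0.5, 1.0), siglip_stats)
```

Keeping the choice in config rather than in code matters because serving a fine-tuned backbone with the wrong statistics degrades accuracy without raising any error.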
Recipe: 25 target classes + a sampled "other" class (baseline setup; gives an abstain option),
1 frozen + 6 fine-tune epochs, AdamW, cosine schedule, bf16, batch 64. Val top-1 peaked at 72.8% at
epoch 3 (= best_model.pt), then plateaued; the recipe saturates at ~73%.
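
The shape of the "1 frozen + 6 fine-tune epochs, cosine" schedule can be sketched as below. This is an illustration under stated assumptions: the base learning rate of `1e-4` is a placeholder (the actual value is not given here), the frozen epoch is assumed to run at a constant rate, the cosine decay is assumed to span only the fine-tune phase, and the rate is sampled once per epoch, whereas real schedulers usually step per batch.

```python
import math
from typing import Dict, List


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine annealing: base_lr at step 0, min_lr at step == total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def epoch_plan(
    frozen_epochs: int = 1,
    finetune_epochs: int = 6,
    base_lr: float = 1e-4,  # placeholder value, not the recipe's actual LR
) -> List[Dict[str, object]]:
    """List what happens in each epoch: backbone state and learning rate."""
    plan: List[Dict[str, object]] = []
    for epoch in range(frozen_epochs):
        # Backbone frozen: only the head (and metadata branch) would train.
        plan.append({"epoch": epoch, "backbone_trainable": False, "lr": base_lr})
    for i in range(finetune_epochs):
        plan.append(
            {
                "epoch": frozen_epochs + i,
                "backbone_trainable": True,
                "lr": cosine_lr(i, finetune_epochs, base_lr),
            }
        )
    return plan


for row in epoch_plan():
    print(row)
```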
Training run (rented GPU)
Ran on a rented RunPod H100 80GB (Terraform ../terraform/env-runpod-training); dataset
pulled from S3 (dermadetect-ml-datasets), bf16, ~33 min/epoch. The Pascal dev box can't
train 448px ViTs (fp16 throughput is 1/64 of fp32 on Pascal). Pod destroyed afterwards; total GPU spend ~$25 of the $50.
Model artifacts (S3)
Uploaded to the models bucket, same layout as gen2a_pytorch/:
s3://dermadetect-models/models/gen2a_medsiglip/
best_model.pt (1.7 GB), config.json, labels.txt, feature_schema.json (trimmed), metrics.json.
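
Before pointing a service at a downloaded copy of this directory, it can help to confirm that all five files are present. The check below is a small sketch; the function name is hypothetical, and the file list is taken from the artifact listing above.

```python
from pathlib import Path
from typing import List, Union

# File names as listed for the gen2a_medsiglip artifact directory.
REQUIRED_ARTIFACTS = (
    "best_model.pt",
    "config.json",
    "labels.txt",
    "feature_schema.json",
    "metrics.json",
)


def missing_artifacts(model_dir: Union[str, Path]) -> List[str]:
    """Return the required files that are absent from a local model directory."""
    root = Path(model_dir)
    return [name for name in REQUIRED_ARTIFACTS if not (root / name).is_file()]
```

An empty return value means the local copy is complete; anything else names the files to re-sync before starting the service.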
Deployment
Stays on the current T4 (g4dn.xlarge); no GPU upgrade is needed. Inference is forward-only
(~2 to 2.5 GB of 16 GB VRAM), ~150 to 250 ms p95 at 448px (accepted). The ai-service Gen2Predictor
was updated to build the backbone + normalization from config.json, so pointing
MODEL_DIRECTORY at the gen2a_medsiglip artifacts serves it with no further code changes. transformers was added
to the ddmodels deps (needed to load MedSigLIP). Legal: the HAI-DEF clinical-use clause needs review before
production.
Next (Round 2)
The model/recipe is ~maxed (~73%, plateaued). Highest-value next levers, in order:
- Label cleanup (Round-2 #4): a high AUC combined with a hard plateau signals that label noise is capping us; the measured 73.6% likely understates true accuracy. Vetting tooling already exists.
- Cheap add-ons: test-time augmentation (+0.5 to 1.5%), weight EMA, layer-wise LR decay.
- More labeled training data (Stream-1 finding: the dominant lever).
- Class-imbalance / focal loss + Eczema de-bias, and a calibrated abstain option (Round-2 #5).

Ensembling (with ConvNeXt-V2) would add ~1 to 2% but doubles serving cost, so it is not worth it on the T4.
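
Of the cheap add-ons listed above, test-time augmentation is the simplest to picture: run the model on several views of the same image and average the predicted probabilities. The sketch below uses plain Python lists and a horizontal flip as the only augmentation; the function names and the toy image representation are assumptions for illustration, not the ddmodels API. Note that each extra view adds one more forward pass to inference latency.

```python
import math
from typing import Callable, List, Sequence

Image = List[List[float]]  # toy H x W single-channel image


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hflip(image: Image) -> Image:
    """Mirror the image left-to-right."""
    return [list(reversed(row)) for row in image]


def predict_with_tta(
    model: Callable[[Image], Sequence[float]],
    image: Image,
    augmentations: Sequence[Callable[[Image], Image]],
) -> List[float]:
    """Average class probabilities over the original image and each augmented view."""
    views = [image] + [aug(image) for aug in augmentations]
    probs = [softmax(model(view)) for view in views]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
```

Usage would look like `predict_with_tta(model, image, [hflip])`, where `model` is any callable returning per-class logits.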