Converged deployable CV model: MedSigLIP + reduced questionnaire (Round 1 close-out)

Date: 2026-07-01

Closes out the Round-1 CV-model improvement effort by converging the three parallel streams into one config-driven, deployable model and training it to a final checkpoint.

Result

New deployable model: gen2a-medsiglip, top-1 73.6% / top-3 92.8% on the full 25-class test split (15,561 images), vs the 67.0% / 89.0% ResNet50 baseline → +6.6 points top-1.

| Model (full test split) | top-1 | top-3 | notes |
|---|---|---|---|
| Baseline gen2a_port (ResNet50 + metadata) | 67.0% | 89.0% | prior production |
| Stream-1 best (ConvNeXt-V2-Base) | 70.4% | n/a | generic backbone bake-off |
| MedSigLIP frozen + head | 66.5% | 88.8% | Round-1 probe |
| MedSigLIP standalone fine-tune (3ep, full survey) | 72.9% | 92.4% | Stream-2 win |
| gen2a-medsiglip (converged, trimmed survey, 6ep) | 73.6% | 92.8% | deployable |
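
For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how top-k accuracy is typically computed. It is not the project's evaluation code; the function name `top_k_accuracy` and the toy scores are illustrative assumptions.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy example: 3 classes, 4 samples (made-up scores, not project data).
toy_scores = [
    [0.7, 0.2, 0.1],  # true class 0: hit at k=1
    [0.1, 0.3, 0.6],  # true class 1: hit only at k=2
    [0.5, 0.4, 0.1],  # true class 2: miss at k=1 and k=2
    [0.2, 0.5, 0.3],  # true class 1: hit at k=1
]
toy_labels = [0, 1, 2, 1]

print(top_k_accuracy(toy_scores, toy_labels, k=1))  # 0.5
print(top_k_accuracy(toy_scores, toy_labels, k=2))  # 0.75
```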

Notably, it edges out the standalone fine-tune while using the reduced questionnaire (159 metadata dims vs the full survey): fewer inputs to collect, same-or-better accuracy (Stream 3 confirmed trimmed ≈ full).

What converged (one trainable config)

  • Stream 2 (winner): MedSigLIP-448 as a first-class backbone in ddmodels/models/backbones.py (HF SigLIP vision tower → [B,1152] pooled; HAI-DEF licensed; gradient checkpointing).
  • Stream 3: reduced questionnaire via dataset.feature_schema_path (configs/trimmed_feature_schema.json, drops ~12 removable survey fields, keeps location + morphology).
  • Stream 1: backbone factory + AdamW/cosine recipe.
  • Plumbing: per-backbone normalization (image_norm: siglip), amp_dtype: bf16, gradient_checkpointing, all in configs/gen2a_medsiglip.yaml.
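
The per-backbone normalization switch could look roughly like the sketch below. It assumes `image_norm: siglip` maps to the standard SigLIP processor values (mean 0.5, std 0.5, i.e. pixels rescaled to [-1, 1]); the `imagenet` key, the `NORM_PRESETS` table, and the `normalize_pixel` helper are hypothetical names for illustration, not the ddmodels API.

```python
from typing import Dict, List, Sequence, Tuple

# Per-channel (mean, std) presets keyed by the config's image_norm value.
# "siglip" appears in the log; "imagenet" is a hypothetical key for the
# ResNet-style default. Values assume inputs already scaled to [0, 1].
NORM_PRESETS: Dict[str, Tuple[List[float], List[float]]] = {
    "imagenet": ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    "siglip": ([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
}


def normalize_pixel(rgb: Sequence[float], image_norm: str) -> List[float]:
    """Apply (x - mean) / std per channel using the preset named in the config."""
    if image_norm not in NORM_PRESETS:
        raise KeyError(f"unknown image_norm preset: {image_norm!r}")
    mean, std = NORM_PRESETS[image_norm]
    return [(c - m) / s for c, m, s in zip(rgb, mean, std)]


# With the SigLIP preset, [0, 1] inputs land in [-1, 1].
print(normalize_pixel([0.0, 0.5, 1.0], "siglip"))  # [-1.0, 0.0, 1.0]
```

Feeding a SigLIP-family backbone ImageNet-normalized pixels (or the reverse) shifts the input distribution away from what the backbone was pretrained on, which is presumably why the setting is tied to the backbone in the config.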

Recipe: 25 target classes + sampled other (baseline setup; gives an abstain option), 1 frozen + 6 fine-tune epochs, AdamW, cosine, bf16, batch 64. Val top-1 peaked at 72.8% at epoch 3 (= best_model.pt), then plateaued; the recipe saturates at ~73%.
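
As a rough illustration of the cosine part of the recipe, the sketch below computes a cosine-annealed learning rate per fine-tune epoch. The base learning rate of 1e-4, the zero floor, and per-epoch (rather than per-step) stepping are assumptions; the log does not state them, nor whether the schedule also covers the frozen epoch.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine annealing: base_lr at step 0, decaying to min_lr at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so out-of-range steps stay at the schedule's endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


FINE_TUNE_EPOCHS = 6  # from the recipe
BASE_LR = 1e-4        # assumed value, not from the log

for epoch in range(FINE_TUNE_EPOCHS):
    print(f"fine-tune epoch {epoch + 1}: lr = {cosine_lr(epoch, FINE_TUNE_EPOCHS, BASE_LR):.2e}")
```

In a PyTorch training loop the same curve is usually obtained from `torch.optim.lr_scheduler.CosineAnnealingLR` rather than computed by hand.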

Training run (rented GPU)

Ran on a rented RunPod H100 80GB (Terraform ../terraform/env-runpod-training); dataset pulled from S3 (dermadetect-ml-datasets), bf16, ~33 min/epoch. The Pascal dev box can't train 448px ViTs (fp16 throughput on Pascal is 1/64 rate). Pod destroyed afterwards; total GPU spend ~$25 of the $50.

Model artifacts (S3)

Uploaded to the models bucket, same layout as gen2a_pytorch/:

s3://dermadetect-models/models/gen2a_medsiglip/ best_model.pt (1.7 GB), config.json, labels.txt, feature_schema.json (trimmed), metrics.json.

Deployment

Stays on the current T4 (g4dn.xlarge), with no GPU upgrade. Inference is forward-only (~2–2.5 GB of 16 GB VRAM), ~150–250 ms p95 at 448px (accepted). The ai-service Gen2Predictor was updated to build the backbone + normalization from config.json, so pointing MODEL_DIRECTORY at the gen2a_medsiglip artifacts serves it unchanged. transformers was added to the ddmodels deps (needed to load MedSigLIP). Legal: the HAI-DEF clinical-use clause needs review before production.
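
To make the config-driven idea concrete, here is a minimal sketch of reading the serving-relevant settings from an artifact directory. It is not the actual Gen2Predictor code: the `load_model_spec` helper, the `backbone` and `image_norm` keys, the `medsiglip_448` value, and the label names are all hypothetical. Only the file names config.json and labels.txt come from the artifact list above.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict


def load_model_spec(model_dir: str) -> Dict[str, Any]:
    """Read backbone, normalization, and labels from a model artifact directory."""
    root = Path(model_dir)
    config = json.loads((root / "config.json").read_text(encoding="utf-8"))
    labels = [
        line.strip()
        for line in (root / "labels.txt").read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    return {
        # Hypothetical keys: the real config.json layout is not shown in this log.
        "backbone": config["backbone"],
        "image_norm": config.get("image_norm", "imagenet"),
        "num_classes": len(labels),
        "labels": labels,
    }


# Usage with a throwaway directory standing in for MODEL_DIRECTORY.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "config.json").write_text(
        json.dumps({"backbone": "medsiglip_448", "image_norm": "siglip"}),
        encoding="utf-8",
    )
    # Placeholder label names, not the project's label set.
    Path(tmp, "labels.txt").write_text("eczema\npsoriasis\nother\n", encoding="utf-8")
    spec = load_model_spec(tmp)

print(spec["backbone"], spec["image_norm"], spec["num_classes"])
```

Keeping the backbone and normalization choice in the artifact's own config means the predictor needs no per-model code path; swapping models is a matter of changing the directory it points at.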

Next (Round 2)

The model/recipe is ~maxed (~73%, plateaued). Highest-value next levers, in order:

  1. Label cleanup (Round-2 #4): high AUC + hard plateau signals label noise capping us; the measured 73.6% likely understates true accuracy. Vetting tooling already exists.
  2. Cheap add-ons: test-time augmentation (+0.5–1.5%), weight EMA, layer-wise LR decay.
  3. More labeled training data (Stream-1 finding: the dominant lever).
  4. Class-imbalance / focal loss + Eczema de-bias, and a calibrated abstain option (Round-2 #5). Ensembling (with ConvNeXt-V2) would add ~1–2% but doubles serving cost, so it is not worth it on the T4.
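
Two of the cheap add-ons from item 2 reduce to simple arithmetic, sketched below on plain floats rather than tensors. The function names, the decay values, and the convention that the last layer receives the base learning rate are assumptions for illustration; none of this has been run against the model.

```python
from typing import Dict, List


def layerwise_lrs(num_layers: int, base_lr: float, decay: float) -> List[float]:
    """Layer-wise LR decay: the last layer gets base_lr, and each earlier
    layer gets the next layer's rate multiplied by `decay`."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


def ema_update(ema: Dict[str, float], weights: Dict[str, float], decay: float) -> None:
    """Weight EMA, in place: ema = decay * ema + (1 - decay) * current weight."""
    for name, value in weights.items():
        ema[name] = decay * ema[name] + (1.0 - decay) * value


# Earlier (more generic) layers move more slowly than later ones.
print(layerwise_lrs(num_layers=3, base_lr=1.0, decay=0.5))  # [0.25, 0.5, 1.0]

# The EMA copy trails the live weights and smooths out step-to-step noise.
ema_weights = {"w": 0.0}
ema_update(ema_weights, {"w": 1.0}, decay=0.5)
print(ema_weights)  # {'w': 0.5}
```

In practice the EMA decay is set much closer to 1 than in this toy call, and the averaged weights are the ones evaluated and exported.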