Stream 2 — Open-Source Derma Models / Data — WIN (MedSigLIP fine-tune beats baseline)

Result (2026-06-29): Fine-tuning MedSigLIP-448 end-to-end (on an H100; see §7) gives top-1 72.9% / top-3 92.4% on the full test split: +5.9 top-1 over the 67.0% baseline and +6.4 over frozen MedSigLIP (66.5%). MedSigLIP is HAI-DEF licensed (commercially licensable), ships as a PyTorch transformers checkpoint, is fine-tunable, and is derm-trained, so this is a shippable lift and meets the Round-1 goal for the CV model. Derm Foundation is CPU-only and GPU-blocked (verified three ways, §6); PanDerm and MONET are strong but non-commercial (§1). Checkpoint saved at ~/.cache/dermadetect/medsiglip_finetune/.

Date: 2026-06-21 (updated 2026-06-22)

Question: Can we get a top-1 lift by initializing the Gen2A image branch from a publicly available dermatology model (trained on open derm data) and fine-tuning on our data, instead of generic ImageNet weights?

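Baseline to beat (gen2a_port, same test split, seed 17): top-1 69%, top-3 89.3%, AUC 0.955. (Val-split numbers run lower: top-1 66.7% / top-3 87.7%. The 69% was measured on the 250-image seed-17 sample; re-measured on the full test split, the baseline is 67.0% top-1 / 89.0% top-3, see §4.)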

1. License / availability triage (done first — gating step)

A model we cannot legally ship in a commercial medical product is a non-starter. For each candidate I read the actual license / model card / terms (URLs cited).

Model	License (verbatim source)	Commercial medical use?	Weights available	Arch / input / framework	Trained on	Verdict
Google Derm Foundation	Health AI Developer Foundations (HAI-DEF) terms; supporting code Apache-2.0	Yes, with caveats — commercial allowed; “Clinical Use” allowed but you must obtain Health Regulatory Authorization and must not position Google as the device “manufacturer”	Gated on HF (accept HAI-DEF terms + HF token)	BiT-M ResNet101x3, 448×448 PNG, JAX/TF Keras SavedModel, outputs 6144-d embedding (not a classifier)	Health image-text web pairs → teledermatology (US + Colombia), AU skin-cancer set, public images	VIABLE (legal review of clinical-use clause; best domain match)
PanDerm	CC-BY-NC-ND 4.0	No — Non-Commercial and No-Derivatives	Yes (GitHub/HF)	ViT foundation enc., multimodal	2M+ skin images, 4 modalities	CUT (license)
MONET	Model card: “Any deployed use case — whether commercial or not — is currently out of scope”; repo CC-BY-NC-SA 4.0	No	Yes (`suinleelab/monet`)	CLIP ViT-L/14, 224px, 768-d	105k derm image–text from literature	CUT (license)
Open classifiers on HAM10000 / DermNet / Fitzpatrick17k	“Derived-weights trap”: HAM10000 = CC-BY-NC, DermNet = copyrighted, Fitzpatrick17k redistribution unclear	No / unclear	Varies	Misc timm/HF	Non-commercial derm datasets	CUT (dataset encumbrance taints derived weights)
Open classifiers on ISIC (CC-0 / CC-BY subset)	Per-image; a CC-0/CC-BY-only subset can be commercial	Maybe (heavy path)	Would need to train ourselves	—	Dermoscopy (domain mismatch vs our clinical photos)	Scrape-and-pretrain path only (Round 2 scope)

Triage conclusion

Of the published derm foundation models, only Google Derm Foundation is commercially licensable. The academic SOTA encoders (PanDerm, MONET) are non-commercial and cannot ship in a commercial medical device. The “train your own on open data” route is mostly poisoned by dataset licenses (HAM10000/DermNet/Fitzpatrick17k); only a CC-0/CC-BY ISIC subset is clean, and that’s dermoscopy (domain mismatch with our clinical teledermatology photos) — a Round-2 heavy path at best.

Derm Foundation is also the best domain match: it was fine-tuned on teledermatology data (US + Colombia), which is exactly our image type.

2. Approach for the one viable candidate (Derm Foundation)

Important nuance: Derm Foundation is a frozen JAX/TF embedding model (448×448 in, 6144-d out), not a PyTorch backbone you can fine-tune end-to-end in our trainer. The brief’s “use it as the backbone/feature extractor” maps cleanly to the feature-extractor reading: precompute embeddings, train a head. End-to-end fine-tuning would require a TF→PyTorch BiT conversion (heavy, Round-2 only).

Planned lightweight first test (low compute, no GPU training of a backbone):

Extract 6144-d Derm Foundation embeddings for our train/val/test images (TF Keras, image/encoded PNG bytes → output['embedding']). Cache to disk keyed by image_uuid. Script: src/training/scripts/opensource/extract_derm_foundation.py (written, ready).
Train a head: replace the ResNet image branch with Linear(6144→256); keep the metadata MLP branch and fusion classifier identical to Gen2A. Train head only (fast).
Score on the full 25-class test split (15,561 images), comparing top-1/top-3 to the gen2a_port baseline measured on the same full split. (Earlier drafts used the 250-image seed-17 sample — that was an LLM-benchmark one-off; model eval uses the full test set.)
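
The head described in the plan above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the Gen2A code: the class name `EmbeddingFusionHead`, the metadata width (16), and the metadata-branch width (64) are assumptions chosen for the example; only the 6144→256 projection and the 25 classes come from the plan.

```python
import torch
from torch import nn


class EmbeddingFusionHead(nn.Module):
    """Hypothetical Gen2A-style head over precomputed image embeddings."""

    def __init__(self, embed_dim=6144, meta_dim=16, num_classes=25):
        super().__init__()
        # Replaces the ResNet image branch: project the frozen embedding to 256-d.
        self.image_branch = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU())
        # Metadata MLP branch (layer sizes are assumptions for illustration).
        self.meta_branch = nn.Sequential(nn.Linear(meta_dim, 64), nn.ReLU())
        # Fusion classifier over the concatenated branch outputs.
        self.classifier = nn.Linear(256 + 64, num_classes)

    def forward(self, embedding, metadata):
        fused = torch.cat(
            [self.image_branch(embedding), self.meta_branch(metadata)], dim=1
        )
        return self.classifier(fused)


# Smoke test on random inputs: a batch of 4 cached embeddings plus metadata.
head = EmbeddingFusionHead()
logits = head(torch.randn(4, 6144), torch.randn(4, 16))
```

Because the encoder is never run during training, only the head's parameters receive gradients, which is what keeps this first test cheap.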

3. BLOCKER — Derm Foundation is CPU-only, can’t run on our GPUs

HF access + TF were resolved (token added, tensorflow[and-cuda] 2.16.2 in an isolated venv, gated model downloads fine). The hard blocker is the artifact itself:

Derm Foundation’s released SavedModel is a JAX→StableHLO native-serialized export whose compute is baked CPU-only. Loaded with a GPU visible, it errors: “The current platform CUDA is not among the platforms required by the module: [CPU].” The serving graph is just Placeholder → StatefulPartitionedCall; the math lives inside the CPU-lowered StableHLO, so there’s no TF-op graph to retarget to GPU.
On CPU it’s unusably slow here (380M-param BiT-M ResNet101x3 at 448px: a 16-image smoke ran >18 min without finishing) and policy now forbids CPU for CV models (added to 00-shared-context.md 2026-06-22 at Dave’s direction).

Getting Derm Foundation onto a GPU would require either (a) Google’s hosted Vertex AI / research endpoint (runs on GCP; sends images off-box → data-governance + billing + setup), or (b) porting the BiT-M R101x3 weights into a GPU-runnable framework (the SavedModel exposes no named variables; ~1-2 days, fragile). Neither is a quick preliminary.

4. Empirical signal — PanDerm (research-only proxy)

Since the licensable model (Derm Foundation) can’t run on our GPUs, the science question (does derm-pretraining beat ImageNet on our data?) was answered with PanDerm as a research-only proxy — native PyTorch ViT-L/16, GPU-runnable, ungated. PanDerm is CC-BY-NC-ND (non-commercial) → NOT shippable; used only as an internal signal.

Method: extract frozen forward_features (PanDerm 1024-d; ImageNet ResNet50 2048-d as the control) for the full 25-class test split + a per-class-capped train/val subset (≤1500/300), reusing the shared flash cache (PR #103) via a multi-worker DataLoader on GPU 1. Train the same Gen2A-style head (image-embedding branch + metadata MLP + fusion) on each; score on the full test split (15,561 imgs, 25 targets). Preprocessing: cache’s 224-square + each model’s ImageNet norm (PanDerm’s published transform is 256+CenterCrop(224) — minor deviation).
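
The per-class cap used for the train/val subset can be illustrated with a small stdlib-only sketch. The function name `cap_per_class`, the `(image_id, label)` tuple layout, and the default seed are assumptions for the example, not the actual data-loading code.

```python
import random
from collections import defaultdict


def cap_per_class(samples, cap, seed=17):
    """Keep at most `cap` randomly chosen samples per class label.

    `samples` is a list of (image_id, label) pairs. The default seed is an
    arbitrary choice for reproducibility in this sketch.
    """
    by_label = defaultdict(list)
    for image_id, label in samples:
        by_label[label].append(image_id)

    rng = random.Random(seed)
    subset = []
    for label in sorted(by_label):
        ids = by_label[label]
        rng.shuffle(ids)
        subset.extend((image_id, label) for image_id in ids[:cap])
    return subset


# Toy data: class "a" has 5 images, class "b" has 2; cap at 3 per class.
toy = [(f"a{i}", "a") for i in range(5)] + [(f"b{i}", "b") for i in range(2)]
capped = cap_per_class(toy, cap=3)
```

Classes smaller than the cap are kept whole, so the subset is flatter than the natural distribution but not perfectly balanced.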

Results (full 25-class test split, 15,561 images — same head, same 32k train subset)

Setup	top-1	top-3
Frozen ImageNet ResNet50 + head (control)	52.9%	79.8%
Frozen PanDerm ViT-L + head	64.2%	87.3%
Baseline gen2a_port (ResNet50 + meta, fully fine-tuned, 293k train)	67.0%	89.0%

(Baseline re-measured on the full test split = 67.0% top-1; the 69% in the brief was the 250-image sample.)
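
For reference, the top-1 / top-3 metrics reported here are the usual top-k accuracies. A minimal stdlib-only sketch of the computation follows; the function name and the toy scores are illustrative, and the real evaluation scripts may compute this differently.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of rows whose true label index is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Toy example with 4 classes and 3 images.
scores = [
    [0.7, 0.2, 0.1, 0.0],  # true class 0 is ranked first: top-1 and top-3 hit
    [0.1, 0.5, 0.3, 0.1],  # true class 2 is ranked second: top-3 hit only
    [0.4, 0.3, 0.2, 0.1],  # true class 3 is ranked last: miss
]
labels = [0, 2, 3]
top1 = top_k_accuracy(scores, labels, k=1)
top3 = top_k_accuracy(scores, labels, k=3)
```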

Reading:

Derm-pretraining clearly helps: PanDerm-frozen beats ImageNet-frozen by +11.3 top-1 points (64.2 vs 52.9), same head/data. The stream’s core hypothesis is confirmed.
A frozen PanDerm + a tiny head trained on only 32k images gets within ~2.8 pts of our fully fine-tuned ResNet50 baseline (293k train). Closing/beating that gap with a fine-tuned derm encoder (and full train data) is very plausible.
So the limiter is not the science — it’s licensing + runnability.

5. Decision (Dave, 2026-06-28): keep improving our own model — STREAM CLOSED

No open-source derm model works as a shippable base we can fine-tune, so we keep iterating on our existing fine-tuned model rather than re-basing on a public derm encoder. Reasoning:

The fine-tunable derm models (PanDerm, MONET) are non-commercial → can’t ship.
The one licensable model (Derm Foundation) is a frozen, CPU-only artifact we can’t fine-tune on our hardware (§3).
Used frozen, no public derm model beats our already-fine-tuned baseline (PanDerm 64.2 vs 67.0), so there’s no shippable, fine-tunable starting point that improves on what we have.

Two caveats recorded so the decision ages well (the “no” is about availability, not merit):

Frozen ≠ the model’s ceiling. PanDerm lost only because it was frozen vs our fully-fine-tuned model — frozen it still beat frozen ImageNet by +11.3 top-1. A fine-tunable, licensable derm model would very likely fine-tune to above 67.0. This space moves fast; revisit if such a model appears.
Derm Foundation was never actually measured (GPU-blocked): it is the only shippable candidate and the only untested one. Frozen, it could plausibly match or beat 67.0 via Vertex AI (hosted, PHI→GCP). Not pursuing now: it would yield a frozen embedder we can’t keep fine-tuning, which is a worse long-term asset than our own fine-tunable model. Park it as a fallback if own-model gains stall.

On topping up training with open-source images: same license trap as the models — HAM10000 (CC-BY-NC), DermNet (copyrighted), Fitzpatrick17k (unclear) can’t go into a commercial model’s training set. Only a CC-0/CC-BY ISIC subset is clean, and it’s dermoscopy (domain-mismatched vs our clinical photos). Narrow, low-priority.

Round-2 direction: the practical CV win is Stream 1’s fine-tuned modern backbone (applied to our own model), not open-source derm init. This stream is closed.

6. Follow-up (2026-06-28): Derm Foundation re-verified CPU-only; MedSigLIP reopens the stream

Derm Foundation CPU-only — re-verified three ways (loader ruled out, GPU was visible):

The XlaCallModule’s baked platforms attribute is literally ['CPU'] (read from the SavedModel function library) — a property of Google’s jax2tf export, not our setup.
tf.saved_model.load on GPU → NOT_FOUND: platform CUDA is not among [CPU].
The official from_pretrained_keras path, GPU visible + CUDA XLA service initialized → identical error. (from_pretrained_keras returns the same _UserObject; no keras_metadata.pb, so it wraps the same serving fn.) The “empty GPU list / CPU-only wheel” hypotheses don’t apply — list_physical_devices('GPU') was non-empty throughout. The BiT CNN architecture isn’t CPU-bound; only Google’s released export is, and we don’t have the JAX source to re-export for GPU.

MedSigLIP — the shippable + GPU-runnable + fine-tunable derm model we’d concluded didn’t exist. google/medsiglip-448: HAI-DEF license (same commercial terms as Derm Foundation), but ships as a standard PyTorch transformers safetensors checkpoint (AutoModel/AutoProcessor) — natively GPU-runnable AND fine-tunable (full weights, no jax2tf/CPU-baking). SigLIP two-tower (400M vision ViT, 448×448), and trained on the same teledermatology data as Derm Foundation (Colombia clinical+dermatoscopic, Australia skin-cancer, PAD-UFES-20); intended use explicitly includes dermatology. Verified it’s a clean transformers checkpoint and transformers loads it; only blocker to testing is per-model gate acceptance (our token reads metadata but from_pretrained 403s until the MedSigLIP terms are accepted on its HF page).

This reverses §5’s premise. “No shippable + fine-tunable derm model exists” was wrong — MedSigLIP is one.

MedSigLIP frozen-probe results (full test split, 15,561 imgs — same head, same 32k train)

Setup	top-1	top-3
Frozen ImageNet ResNet50 + head	52.9%	79.8%
Frozen PanDerm ViT-L + head	64.2%	87.3%
Frozen MedSigLIP (1152-d) + head	66.5%	88.8%
Baseline gen2a_port (ResNet50+meta, fully fine-tuned, 293k train)	67.0%	89.0%

Frozen MedSigLIP (66.5%) comes within 0.5 top-1 pts of the fully fine-tuned baseline (67.0%), despite a frozen encoder, a head trained on only 32k images, and no metadata-fusion tuning. It beats frozen ImageNet by +13.6 and frozen PanDerm by +2.3. (MedSigLIP ran at 448px via the new 448 flash cache; bilinear-vs-bicubic resize is a minor deviation.)
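
The deltas quoted for the frozen probes follow directly from the table. A small Python check, with the values copied from the table and the dictionary keys chosen only for this example:

```python
# Top-1 accuracy (%) on the full test split, copied from the results table.
top1 = {
    "imagenet_frozen": 52.9,
    "panderm_frozen": 64.2,
    "medsiglip_frozen": 66.5,
    "gen2a_port_baseline": 67.0,
}


def delta(a, b):
    """Top-1 difference a - b in percentage points, rounded to one decimal."""
    return round(top1[a] - top1[b], 1)


vs_imagenet = delta("medsiglip_frozen", "imagenet_frozen")
vs_panderm = delta("medsiglip_frozen", "panderm_frozen")
gap_to_baseline = delta("gen2a_port_baseline", "medsiglip_frozen")
```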

Decision: GO — Round 2 = fine-tune MedSigLIP end-to-end

MedSigLIP is the breakthrough: commercially licensable (HAI-DEF), GPU-runnable, fine-tunable, derm-trained, and frozen it already matches our production model. Fine-tuning it end-to-end on the full 293k train set (it’s a normal PyTorch transformers model) is very likely to beat 67.0 and is shippable — exactly the Round-1 goal. Recommended Round-2 plan:

Wire MedSigLIP’s vision tower as a ddmodels backbone (image branch) feeding the Gen2A metadata-fusion head; fine-tune end-to-end on full train (the 448 cache makes this fast).
Compare top-1 to the 67.0 baseline on the full test split; expect a clear lift.
Legal: HAI-DEF clinical-use clause review before shipping (same as any HAI-DEF model).

Fallback if fine-tuning underwhelms: frozen MedSigLIP embeddings + a stronger head/metadata fusion already ≈ baseline, so even the frozen path is viable.

Hardware: the fine-tune must run on a rented cloud GPU, not the dev box. Measured on the 1080 Ti: fp16 AMP is a Pascal trap (Pascal runs true fp16 at 1/64 of fp32 throughput, giving 59 s/step), and even fp32 manages only ~1.3 img/s for MedSigLIP-448, or ~62 hr/epoch on the full train set. finetune_medsiglip.py (fp32-default, two-phase, multi-GPU, 448-cache write-through) is correct and validated, but a real full fine-tune needs an A100 80GB or H100 (Ampere or newer, so use bf16 and a large batch): a few hours, ~$10–15 on RunPod / Vast.ai / Lambda Labs (cheaper than AWS, zero egress). Pull the dataset from our S3 onto the cloud box. See memory reference_cloud_gpu_for_training.
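
The ~62 hr/epoch figure is straightforward throughput arithmetic on the 293k-image train set. A tiny sketch, where the helper name is illustrative:

```python
def hours_per_epoch(num_images, images_per_second):
    """Wall-clock hours for one pass over the training set at a fixed throughput."""
    return num_images / images_per_second / 3600


# 293k training images at the ~1.3 img/s measured in fp32 on the 1080 Ti.
pascal_fp32_hours = hours_per_epoch(293_000, 1.3)
```

This comes out to a little over 62 hours for a single epoch, which is why the local card is ruled out for a multi-epoch fine-tune.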

7. Fine-tune result (RunPod H100, 2026-06-29)

Ran the end-to-end fine-tune on a rented RunPod H100 80GB (Terraform: ../terraform/ env-runpod-training, PR #111; ~$3.29/hr, destroyed after). Dataset (104GB) pulled from S3 (dermadetect-ml-datasets, via dermadetect_dev_terraform creds) onto the pod.

finetune_medsiglip.py: MedSigLIP vision tower + Gen2A metadata-fusion head, two-phase (1 frozen epoch → 3 full-fine-tune epochs), bf16 (H100), batch 64, AdamW, full 25-class train (121,514 imgs), scored on the full test split (15,561).
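
The two-phase schedule can be sketched as below. This is not the code in finetune_medsiglip.py: the helper names are hypothetical, and the only assumption about the backbone is that it exposes a PyTorch-style `parameters()` iterator whose items carry a `requires_grad` flag.

```python
def set_trainable(module, trainable):
    """Enable or disable gradients for every parameter of a torch-style module."""
    for param in module.parameters():
        param.requires_grad = trainable


def phase_for_epoch(epoch, frozen_epochs=1):
    """Return 'frozen' for the head-only warm-up epochs, then 'full'."""
    return "frozen" if epoch < frozen_epochs else "full"


def configure_epoch(backbone, epoch, frozen_epochs=1):
    """Freeze the backbone during warm-up and unfreeze it afterwards."""
    phase = phase_for_epoch(epoch, frozen_epochs)
    set_trainable(backbone, trainable=(phase == "full"))
    return phase


# 1 frozen epoch followed by 3 full fine-tune epochs, as in this run.
schedule = [phase_for_epoch(epoch) for epoch in range(4)]
```

Calling `configure_epoch(vision_tower, epoch)` at the start of each epoch would let the fusion head settle before gradients reach the pretrained encoder; the head itself stays trainable throughout.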

Model (full test split)	top-1	top-3
Baseline gen2a_port (ResNet50+meta, fully fine-tuned)	67.0%	89.0%
Frozen MedSigLIP + head	66.5%	88.8%
MedSigLIP fine-tuned end-to-end	72.9%	92.4%

Val top-1 climbed from 58.9% after the frozen epoch to 64.3% → 66.1% → 66.3% across the three fine-tune epochs (balanced 200/class val, noisier than the natural-distribution test). Each fine-tune epoch took ~30 min on the H100. Net result: +5.9 top-1 over baseline, and shippable. Checkpoint: ~/.cache/dermadetect/medsiglip_finetune/medsiglip_ft.pt.

Hardware lesson (Pascal AMP trap): on the local 1080 Ti, fp16 AMP ran at 59 s/step (Pascal runs true fp16 at 1/64 of fp32 throughput); fp32 was ~1.3 img/s. The script now defaults to fp32 and uses bf16 under --amp on Ampere and newer (including Hopper). See memory feedback_no_cpu_cv_models, reference_cloud_gpu_for_training.
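
One way to encode this lesson is to choose the precision from the GPU's CUDA compute capability. The sketch below is an illustration rather than the script's actual logic: the function name is hypothetical, and falling back to fp32 when --amp is requested on a pre-Ampere card is an assumption.

```python
def select_precision(compute_capability, amp_requested):
    """Pick a training precision from a CUDA compute capability (major, minor).

    Assumption: with AMP requested on a pre-Ampere GPU the safe fallback is
    fp32, since fp16 AMP is what caused the Pascal slowdown.
    """
    major, _minor = compute_capability
    if amp_requested and major >= 8:  # 8.x = Ampere, 9.x = Hopper
        return "bf16"
    return "fp32"


gtx_1080_ti = select_precision((6, 1), amp_requested=True)   # Pascal
h100 = select_precision((9, 0), amp_requested=True)          # Hopper
h100_default = select_precision((9, 0), amp_requested=False)
```

In PyTorch the `(major, minor)` tuple is available from `torch.cuda.get_device_capability()`.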

Round 2 / productionization

Fold MedSigLIP into the config-driven ddtrain trainer as a first-class backbone (it’s a standalone script today); wire bf16 + the 448 transform into ddmodels.
Legal: HAI-DEF clinical-use clause review before shipping.
Consider more epochs, an LR schedule, an unfreezing schedule, and metadata-fusion tuning: 72.9% is from a quick 3-epoch run, so there is likely more headroom.

Reproduce

extract_medsiglip.py — frozen MedSigLIP-448 features (GPU 1; 448 flash cache write-through).
extract_panderm.py / extract_imagenet_resnet50.py — frozen features (GPU 1, 224 flash cache).
build_decode_cache.py --image-size 448 --splits … — populate the 448 decoded-image cache (reuses DermaDetectDataset’s cache; CPU decode). The 224→448 sibling of PR #103.
train_eval_embedding_head.py --label <name> — head train + full-test score (embed dim auto).
eval_gen2a_fulltest.py — baseline gen2a_port on the full test split (cache-accelerated).
MedSigLIP is HF-gated (HAI-DEF); PanDerm repo + ViT-L checkpoint are external/non-commercial. No model weights committed.

Sources

Derm Foundation model card / HAI-DEF terms — developers.google.com/health-ai-developer-foundations
google/derm-foundation (HF, gated), Google-Health/derm-foundation (GitHub)
PanDerm — github.com/SiyuanYan1/PanDerm (CC-BY-NC-ND 4.0)
MONET — huggingface.co/suinleelab/monet (deployment out of scope)
MedSigLIP — huggingface.co/google/medsiglip-448 (HAI-DEF; PyTorch transformers; gated)