Stream 1 — Modern backbone vs ResNet50

Round 1 preliminary investigation: does a 2026-era image backbone beat the ResNet50 image branch of Gen2A on the same data? Primary metric: top-1.

What changed in the code

The model.backbone config field existed but was dead — Gen2AModel hardcoded torchvision.models.resnet50. This stream made the backbone swappable, config-only:

ddmodels/models/backbones.py (new): build_backbone(name, pretrained) -> (module, out_dim). resnet50 is built from torchvision and is byte-identical to the original hardcoded path (baseline reproduces exactly). All other backbones go through timm (num_classes=0 → pooled [B, num_features] features), imported lazily so the production ai_service image is unaffected unless a timm backbone is requested. Friendly aliases map to pinned timm checkpoints; DINOv2 gets dynamic_img_size=True to accept 224px.
Gen2AModel / ImageOnlyModel (ddmodels/models/gen2a.py): accept a backbone: str = "resnet50" arg and use the factory. The metadata MLP branch, fusion, and classifier head are unchanged (held fixed per charter).
ddtrain.config: added training.max_train_samples / max_val_samples (seeded subsampling for fast bake-offs; None = full split).
trainer.py: pass config.model.backbone into the model; apply seeded Subset when the subsample knobs are set.
scripts/eval/model_benchmark.py: read backbone from the trained model’s config.json so non-ResNet checkpoints score correctly.
ddtrain dep: added timm>=1.0.
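
The factory described above could look roughly like the sketch below. It is not the contents of ddmodels/models/backbones.py: the alias table, the specific timm checkpoint tags, the pass-through for unknown names, and the way the ResNet50 classifier is removed are all assumptions made for illustration. Only the signature build_backbone(name, pretrained) -> (module, out_dim), num_classes=0, the lazy timm import, and dynamic_img_size=True for DINOv2 come from the description.

```python
from typing import Any, Dict, Tuple

# Hypothetical alias table: these checkpoint tags are illustrative pins,
# not necessarily the ones used in ddmodels/models/backbones.py.
TIMM_ALIASES: Dict[str, Tuple[str, Dict[str, Any]]] = {
    "convnextv2_tiny": ("convnextv2_tiny.fcmae_ft_in22k_in1k", {}),
    "efficientnetv2_s": ("tf_efficientnetv2_s.in21k_ft_in1k", {}),
    # dynamic_img_size lets timm interpolate position embeddings so that
    # 224px inputs are accepted by the DINOv2 ViT.
    "dinov2_vits14": ("vit_small_patch14_dinov2.lvd142m", {"dynamic_img_size": True}),
}


def resolve_backbone(name: str) -> Tuple[str, str, Dict[str, Any]]:
    """Map a config name to (library, model name, extra kwargs); no heavy imports."""
    if name == "resnet50":
        return ("torchvision", "resnet50", {})
    if name in TIMM_ALIASES:
        timm_name, kwargs = TIMM_ALIASES[name]
        return ("timm", timm_name, dict(kwargs))
    # Assumption of this sketch: unknown names are passed straight to timm.
    return ("timm", name, {})


def build_backbone(name: str, pretrained: bool = True):
    """Return (module, out_dim), where module maps [B, 3, H, W] -> [B, out_dim]."""
    library, model_name, kwargs = resolve_backbone(name)
    if library == "torchvision":
        import torch.nn as nn
        from torchvision.models import ResNet50_Weights, resnet50

        weights = ResNet50_Weights.IMAGENET1K_V2 if pretrained else None
        module = resnet50(weights=weights)
        out_dim = module.fc.in_features  # 2048 for ResNet50
        module.fc = nn.Identity()  # drop the ImageNet classifier, keep pooled features
        return module, out_dim

    import timm  # lazy: only imported when a timm backbone is requested

    # num_classes=0 removes the classifier so the model returns pooled features.
    module = timm.create_model(model_name, pretrained=pretrained, num_classes=0, **kwargs)
    return module, module.num_features
```

Keeping name resolution separate from construction means the alias table can be checked without torch or timm installed.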

Method

Backbones: convnextv2_tiny (timm, ImageNet-22k→1k), efficientnetv2_s (timm, in21k→1k), dinov2_vits14 (timm, LVD-142M SSL), and resnet50 (torchvision IMAGENET1K_V2) re-trained at the same short budget as the honest anchor.
Held fixed: seed 17, image_size 224, metadata branch, optimizer rmsprop, base_lr 0.002 / finetune_lr 0.001, warmup 200.
Budget: 1 frozen epoch + 4 finetune epochs on a seeded 24k-train / 4k-val subset (the full 128k-image epochs are ~75 min each on the Pascal GPU — too slow for a 4-backbone bake-off in the time window). All backbones use the same subset, so the relative ranking is apples-to-apples.
Scoring (primary): scripts/eval/test_split_eval.py runs the trainer’s evaluate() over the full test split (16,380 imgs, 26-class). The model_benchmark.py 250-image sample was used during screening but is too noisy for architecture comparisons; see the primary results table below.
Hardware: GPU 0 (TITAN X Pascal, 12GB), CUDA_VISIBLE_DEVICES=0.
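
The seeded subsampling in the Budget line can be sketched as follows. This is a minimal illustration, not the trainer's code: the function name is hypothetical, 128,000 stands in for the approximate "128k" training-set size, and the real implementation may draw its indices differently.

```python
import random
from typing import List, Optional


def seeded_subset_indices(dataset_len: int, max_samples: Optional[int], seed: int) -> List[int]:
    """Pick a reproducible subset of indices; None (or a cap >= dataset size) keeps the full split."""
    if max_samples is None or max_samples >= dataset_len:
        return list(range(dataset_len))
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    return sorted(rng.sample(range(dataset_len), max_samples))


# 24k of roughly 128k training images with seed 17 (128_000 is an approximation).
train_idx = seeded_subset_indices(128_000, 24_000, seed=17)
again = seeded_subset_indices(128_000, 24_000, seed=17)
print(len(train_idx), train_idx == again)  # 24000 True
```

The resulting list can be passed to torch.utils.data.Subset(dataset, train_idx). Because the indices depend only on the seed and the split size, every backbone sees the same images.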

⭐ PRIMARY RESULTS — full `test` split (16,380 imgs, 26-class, trainer `evaluate()`)

This is the real test set (the per-patient test partition), scored with the same methodology as the per-epoch val metrics and the production baseline. These supersede the 250-image LLM-sample numbers further down (that sample was only for the LLM comparison; with 250 imgs the ±~3pt sampling error swamps the architecture gap and produced a spurious ResNet50/ConvNeXt “tie”).
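
The ±~3pt figure is consistent with a plain binomial standard error. The sketch below assumes each image is an independent trial and uses 0.6 as a round accuracy near the screening values. Neither assumption is stated in the runs themselves, and images from the same patient may be correlated, which would widen the real interval.

```python
import math


def top1_standard_error(accuracy: float, n_images: int) -> float:
    """One-sigma binomial standard error of an accuracy estimated on n_images."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_images)


# 0.6 is a round number near the top-1 values in the screening tables.
for n in (250, 16_380):
    se_pt = 100.0 * top1_standard_error(0.6, n)
    print(f"n={n}: +/-{se_pt:.2f} pt (1 sigma)")
```

Under those assumptions the 250-image sample carries about ±3.1 pt of noise per model, against about ±0.4 pt on the 16,380-image test split.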

Full-budget decisive runs (all 128k imgs, flash decode cache, AdamW + cosine)

Model	recipe	top-1	top-3	AUC
convnextv2_base + strong reg	adamw 1e-4 + cosine, strong aug	0.704	0.900	0.957
convnextv2_tiny	adamw 1e-4 + cosine	0.688	0.893	0.954
resnet50	adamw 1e-4 + cosine	0.680	0.888	0.951
gen2a_port (production baseline)	rmsprop 1e-3	0.657	0.866	0.945

ConvNeXt-V2-Base + strong regularization (rotation/colour-jitter/RandAugment/random-erasing, dropout 0.3/0.5, wd 0.1) reaches 0.704 top-1 (+1.6 pt over Tiny, +4.7 pt over baseline). Unlike Tiny (overfit after ft-epoch 2), Base with strong aug improved monotonically to a clean plateau (val top1 0.670→0.680→0.703→0.707→0.708, best at ft-epoch 4), so the capacity does help once memorisation is curbed. Cost: 88M params (3× Tiny), ~15 h to train on the Pascal card (strong aug is CPU-heavy: ~2.9 h/finetune-epoch). For reference, Stream 2’s fine-tuned MedSigLIP-448 reportedly reaches ~0.729 on the full test split (cloud H100).

Decomposition of ConvNeXt-V2-Tiny's +3.1 pt total top-1 gain over the production baseline (0.657 → 0.688):

Recipe (RMSprop@1e-3 → AdamW@1e-4 + cosine): +2.3 pt — ResNet50 alone goes 0.657 → 0.680. This is the dominant lever and is architecture-agnostic.
Architecture (ResNet50 → ConvNeXt-V2-Tiny, recipe-matched): +0.77 pt — 0.680 → 0.688, plus +0.47 top-3 and +0.24 AUC. Real and consistent with the short-budget screen (+1.1), but modest.
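
The recipe change in the first bullet can be sketched as AdamW with a warmup-then-cosine learning-rate multiplier. This is one common way to build such a schedule, not necessarily what trainer.py does. Linear warmup, decay to zero, and per-step scheduling are assumptions. Treating the Method section's "warmup 200" as steps is also an assumption, and wd 0.05 is taken from the Round B screening recipe rather than confirmed for the full-budget run.

```python
import math


def warmup_cosine_factor(step: int, total_steps: int, warmup_steps: int = 200) -> float:
    """Multiplier on the base LR: linear warmup, then cosine decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        return (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return 0.5 * (1.0 + math.cos(math.pi * progress))


def build_optimizer(model, total_steps: int, lr: float = 1e-4, weight_decay: float = 0.05):
    """AdamW plus a per-step LambdaLR using the factor above (torch imported lazily)."""
    import torch

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: warmup_cosine_factor(step, total_steps)
    )
    return optimizer, scheduler
```

With this setup scheduler.step() is called once per optimizer step, so total_steps is the number of batches per epoch times the number of finetune epochs.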

Both backbones overfit after ~2-3 finetune epochs on this data (best val_loss at ft-epoch 2 for ConvNeXt, ft-epoch 3 for ResNet50; val_loss rises monotonically after). The efficient recipe is short — early-stop ~epoch 3, not the 18 epochs first scheduled. The decode cache (PR #103) was essential: it moved training from IO-bound (~140 ms/img decode, GPU near-idle) to GPU-bound (~58 min/epoch at 72-99% util on the TITAN X).
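
A minimal early-stopping rule matching the behaviour described here is sketched below. The class, its patience default, and the val_loss values are hypothetical. The curve is invented only to have the shape reported for ConvNeXt (best at ft-epoch 2, rising afterwards).

```python
from typing import Optional


class EarlyStopper:
    """Stop when val_loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience: int = 1, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch: Optional[int] = None
        self.bad_epochs = 0

    def update(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss, self.best_epoch, self.bad_epochs = val_loss, epoch, 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up val_loss curve: best at ft-epoch 2, rising afterwards.
hypothetical_val_loss = {1: 1.10, 2: 1.02, 3: 1.05, 4: 1.11, 5: 1.20}
stopper = EarlyStopper(patience=1)
for epoch, loss in hypothetical_val_loss.items():
    if stopper.update(epoch, loss):
        print(f"stop after ft-epoch {epoch}; best was ft-epoch {stopper.best_epoch}")
        break
```

With patience=1 the run ends after ft-epoch 3 and keeps the ft-epoch 2 checkpoint. A larger patience trades extra epochs for robustness against a single noisy validation pass.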

Screening runs (equal short 24k/5ep budget — what pointed here)

Model	recipe	top-1	top-3	AUC
convnextv2_tiny	adamw 1e-4	0.621	0.843	0.931
resnet50	adamw 1e-4	0.610	0.835	0.925

Everything below uses the 250-image LLM-comparison sample and is retained only as the screening record that led here. Treat the test-split tables above as the source of truth.

Results — Round A (baseline recipe: RMSprop, finetune_lr 1e-3)

top-1 / top-3 on the 250-img test sample, seed 17:

Backbone	top-1	top-3	Δtop-1 vs short RN50	finetune stability
resnet50 (full-budget ref)	0.644	0.888	—	production baseline
resnet50 (short-budget anchor)	0.560	0.856	—	✅ stable (val top1 0.515→0.619)
efficientnetv2_s	0.564	0.848	+0.4	✅ stable (val top1 0.555→0.622)
convnextv2_tiny	0.568	0.824	+0.8	❌ diverged in finetune
dinov2_vits14	0.588	0.820	+2.8	❌ diverged in finetune

The decisive observation is in the val curves, not the headline numbers. RMSprop @ finetune_lr=1e-3 destabilises the modern backbones:

convnextv2_tiny: frozen-phase val top1 0.566, then finetune val_loss exploded 0.092 → 10.6 → 6.9 → 32.7 → 4.0 and val top1 collapsed to ~0.09.
dinov2_vits14: frozen val top1 0.555, finetune oscillated (0.149 → 0.534 → 0.520 → 0.254), val_loss unstable.
efficientnetv2_s and resnet50: finetuned cleanly, val top1 rising monotonically to ~0.62.

Because checkpointing keeps the best val_loss, the saved best_model.pt for convnextv2/dinov2 is effectively the frozen-backbone model (best val_loss = the frozen epoch). So their test scores (0.568 / 0.588) reflect frozen ImageNet/SSL features + a trained head — and even so they match or beat the fully-finetuned ResNet50 anchor (0.560). That’s a strong hint the architecture is not the limiter; the finetune recipe is.
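
The effect of the keep-best-val_loss rule can be checked against the convnextv2_tiny curve quoted above. The sketch reads the five values as the frozen epoch followed by four finetune epochs, which is an interpretation of that list. The helper name is hypothetical, and the strict comparison (ties keep the earlier checkpoint) is an assumption about the trainer.

```python
from typing import List, Tuple


def best_checkpoint_epoch(val_losses: List[float]) -> Tuple[int, float]:
    """Return (epoch index, val_loss) of the checkpoint a keep-best-val_loss rule would save."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:  # strict: ties keep the earlier checkpoint
            best_epoch, best_loss = epoch, loss
    return best_epoch, best_loss


# convnextv2_tiny val_loss from Round A, read as epoch 0 = frozen, 1-4 = finetune.
convnext_val_loss = [0.092, 10.6, 6.9, 32.7, 4.0]
print(best_checkpoint_epoch(convnext_val_loss))  # (0, 0.092)
```

Epoch 0 wins, so the saved best_model.pt holds the frozen-backbone weights and none of the diverged finetune epochs.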

Results — Round B (corrected recipe: AdamW, finetune_lr 1e-4, wd 0.05)

Re-ran the diverging backbones changing only the optimizer/LR (charter permits per-backbone LR tuning on obvious divergence). Same 24k/4k subset, seed, epochs. ResNet50 re-run under AdamW too, as a control for the optimizer change.

Backbone (AdamW 1e-4)	top-1	top-3	vs RMSprop run	finetune stability
convnextv2_tiny-adamw	0.600	0.888	0.568 → 0.600	✅ stable, val top1 →0.62
dinov2_vits14-adamw	0.512	0.784	0.588 → 0.512	✅ stable but weak (ViT, data-hungry)
resnet50-adamw (control)	0.600	0.844	0.560 → 0.600	✅ stable

Consolidated comparison (test sample, 250 imgs, seed 17)

Model	Recipe	Budget	top-1	top-3
resnet50 (production)	rmsprop 1e-3	full (128k, 18ep)	0.644	0.888
resnet50 (anchor)	rmsprop 1e-3	short (24k, 5ep)	0.560	0.856
efficientnetv2_s	rmsprop 1e-3	short	0.564	0.848
dinov2_vits14	adamw 1e-4	short	0.512	0.784
resnet50 (control)	adamw 1e-4	short	0.600	0.844
convnextv2_tiny	adamw 1e-4	short	0.600	0.888

Findings & recommendation

Confirmed at full budget on the real test split. Two levers, both real, with the recipe being the larger:

1. The finetune recipe is the dominant lever: adopt it now (low risk). Switching RMSprop@1e-3 → AdamW@1e-4 + cosine lifts ResNet50 itself by +2.3 pt top-1 (0.657 → 0.680) with no architecture change. It is also a hard prerequisite for modern backbones (ConvNeXt-V2/DINOv2 diverge under RMSprop@1e-3). The flash decode cache (PR #103) is what made full-data runs practical (IO-bound → GPU-bound).

2. ConvNeXt-V2-Tiny is the best recipe-matched backbone: GO, modest extra gain. Recipe-matched at full budget it beats ResNet50 by +0.77 pt top-1 (0.688 vs 0.680), +0.47 top-3, +0.24 AUC, and beats the production baseline by +3.1 / +2.6 / +0.9. The architecture edge is real and consistent (matches the short-budget screen) but small next to the recipe gain, so ConvNeXt-V2-Tiny is the recommended production model, while ResNet50+AdamW is a near-equal, zero-risk fallback if the backbone swap is undesirable.

3. Train short — both backbones overfit after ~2-3 finetune epochs. Best checkpoints: ConvNeXt ft-epoch 2, ResNet50 ft-epoch 3; val_loss rises monotonically after. 18 epochs was wasteful. Use early stopping (~3 finetune epochs).

4. NO-GO: DINOv2 ViT-S/14 (finetuning hurt it; data-hungry — prefer Swin-V2 if a transformer is ever wanted); EfficientNetV2-S (no edge over ResNet50). Derm-pretrained backbones are Stream 2’s scope.

Recommendation for production / next steps

Ship ConvNeXt-V2-Tiny + AdamW@1e-4 + cosine, ~3 finetune epochs as the new Gen2A backbone: 0.688 top-1 / 0.893 top-3 / 0.954 AUC (+3.1 top-1 over the current production model). Backbone is config-only (backbone: convnextv2_tiny).
Cheap follow-ups worth trying: layer-wise LR decay + light augmentation/regularisation to push the overfit ceiling past epoch 2-3; combine with Stream 3’s questionnaire findings and Stream 2’s MedSigLIP for an ensemble/foundation comparison. ConvNeXt-V2-Base (more capacity) with strong regularization is already in the primary results table at 0.704 top-1, but costs ~15 h to train on the Pascal card.
Infra note: ~58 min/epoch on the TITAN X (Pascal, GPU-bound after the cache). A tensor-core GPU would cut iteration time several-fold.

Reproduce


CUDA_VISIBLE_DEVICES=0 uv run --package ddtrain python -m ddtrain.training.trainer \
    --config configs/arch_convnextv2_tiny_full.yaml   # train (early-stop ~ft-epoch 3)
CUDA_VISIBLE_DEVICES=0 uv run --package ddtrain python scripts/eval/test_split_eval.py \
    --model-dir trained_models/arch-convnextv2_tiny-full   # score on full test split

Blockers / surprises

ddtrain.train (charter’s example module path) does not exist; the entry point is ddtrain.training.trainer.
The trainer auto-wraps in DataParallel whenever it sees >1 GPU, so an unpinned run grabs both cards. CUDA_VISIBLE_DEVICES=0 keeps it to one.
Data loading (CPU JPEG decode) was the throughput bottleneck on the Pascal GPUs before the decode cache (PR #103); full-dataset epochs were ~75 min, so subsampling was needed for the bake-off.
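
For scripts that cannot rely on the shell setting CUDA_VISIBLE_DEVICES, the same pinning can be done in Python, as long as it happens before anything initialises CUDA. The helper name is hypothetical, and the DataParallel condition mentioned in the comment is a guess at what the trainer checks, based only on the description above.

```python
import os


def pin_single_gpu(index: int = 0) -> str:
    """Restrict this process to one GPU; must run before CUDA is initialised."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


visible = pin_single_gpu(0)
# With one visible device, a check such as `torch.cuda.device_count() > 1` is False,
# so a trainer that wraps in DataParallel on that condition stays on a single GPU.
print(f"CUDA_VISIBLE_DEVICES={visible}")
```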