Stream 1: Model Architecture Bake-off (Round 1)

Stream 1: Modern backbone vs ResNet50

Round 1 preliminary investigation: does a 2026-era image backbone beat the ResNet50 image branch of Gen2A on the same data? Primary metric: top-1.

What changed in the code

The model.backbone config field existed but was dead: Gen2AModel hardcoded torchvision.models.resnet50. This stream made the backbone swappable, config-only:

  • ddmodels/models/backbones.py (new): build_backbone(name, pretrained) -> (module, out_dim). resnet50 is built from torchvision and is byte-identical to the original hardcoded path (baseline reproduces exactly). All other backbones go through timm (num_classes=0 → pooled [B, num_features] features), imported lazily so the production ai_service image is unaffected unless a timm backbone is requested. Friendly aliases map to pinned timm checkpoints; DINOv2 gets dynamic_img_size=True to accept 224px.
  • Gen2AModel / ImageOnlyModel (ddmodels/models/gen2a.py): accept a backbone: str = "resnet50" arg and use the factory. The metadata MLP branch, fusion, and classifier head are unchanged (held fixed per charter).
  • ddtrain.config: added training.max_train_samples / max_val_samples (seeded subsampling for fast bake-offs; None = full split).
  • trainer.py: pass config.model.backbone into the model; apply seeded Subset when the subsample knobs are set.
  • scripts/eval/model_benchmark.py: read backbone from the trained model’s config.json so non-ResNet checkpoints score correctly.
  • ddtrain dep: added timm>=1.0.
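
The factory described above could look roughly like the sketch below. It is not the contents of ddmodels/models/backbones.py: the alias table, the resolve_backbone helper, and the choice to replace the ResNet50 fc layer with nn.Identity are all assumptions made for illustration, since the log does not list the pinned checkpoint tags or show the original hardcoded path.

```python
from typing import Tuple

# Hypothetical alias table. The log says aliases map to pinned timm
# checkpoints but does not list them; these tags are illustrative guesses.
TIMM_ALIASES = {
    "convnextv2_tiny": "convnextv2_tiny.fcmae_ft_in22k_in1k",
    "efficientnetv2_s": "tf_efficientnetv2_s.in21k_ft_in1k",
    "dinov2_vits14": "vit_small_patch14_dinov2.lvd142m",
}


def resolve_backbone(name: str) -> Tuple[str, str]:
    """Return (library, checkpoint id) for a config-level backbone name."""
    if name == "resnet50":
        return "torchvision", "resnet50"
    if name in TIMM_ALIASES:
        return "timm", TIMM_ALIASES[name]
    raise ValueError(f"unknown backbone: {name!r}")


def build_backbone(name: str, pretrained: bool = True):
    """Return (module, out_dim). Heavy imports are deferred until needed."""
    library, checkpoint = resolve_backbone(name)
    if library == "torchvision":
        import torch.nn as nn
        from torchvision.models import ResNet50_Weights, resnet50

        weights = ResNet50_Weights.IMAGENET1K_V2 if pretrained else None
        module = resnet50(weights=weights)
        out_dim = module.fc.in_features  # 2048 for ResNet50
        module.fc = nn.Identity()  # assumption: expose pooled features, not logits
        return module, out_dim

    import timm  # lazy: only imported when a timm backbone is requested

    # dynamic_img_size lets the DINOv2 ViT accept 224px inputs.
    extra = {"dynamic_img_size": True} if name.startswith("dinov2") else {}
    module = timm.create_model(
        checkpoint, pretrained=pretrained, num_classes=0, **extra
    )
    return module, module.num_features
```

Keeping the name resolution separate from the imports means a config typo fails fast with a ValueError, and a resnet50 config never touches timm.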

Method

  • Backbones: convnextv2_tiny (timm, ImageNet-22k→1k), efficientnetv2_s (timm, in21k→1k), dinov2_vits14 (timm, LVD-142M SSL), and resnet50 (torchvision IMAGENET1K_V2) re-trained at the same short budget as the honest anchor.
  • Held fixed: seed 17, image_size 224, metadata branch, optimizer rmsprop, base_lr 0.002 / finetune_lr 0.001, warmup 200.
  • Budget: 1 frozen epoch + 4 finetune epochs on a seeded 24k-train / 4k-val subset (the full 128k-image epochs are ~75 min each on the Pascal GPU, too slow for a 4-backbone bake-off in the time window). All backbones use the same subset, so the relative ranking is apples-to-apples.
  • Scoring (primary): scripts/eval/test_split_eval.py runs the trainer’s evaluate() over the full test split (16,380 imgs, 26-class). The model_benchmark.py 250-image sample was used during screening but is too noisy for architecture comparisons; see the primary results table below.
  • Hardware: GPU 0 (TITAN X Pascal, 12GB), CUDA_VISIBLE_DEVICES=0.
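
The seeded subsampling in the Budget bullet can be illustrated with a small index-selection sketch. The function name is hypothetical, and the real trainer presumably wraps the chosen indices in a torch Subset; this version uses only the standard library to show why every backbone sees the same images.

```python
import random
from typing import List, Optional


def seeded_subset_indices(
    n_total: int, max_samples: Optional[int], seed: int
) -> List[int]:
    """Pick a reproducible subset of dataset indices.

    max_samples=None mirrors the config default and keeps the full split.
    """
    if max_samples is None or max_samples >= n_total:
        return list(range(n_total))
    rng = random.Random(seed)  # private RNG: does not disturb global state
    return sorted(rng.sample(range(n_total), max_samples))


# Same seed and size give the same indices on every run and every backbone.
# 128_000 is the approximate train-set size quoted in the log.
train_idx = seeded_subset_indices(128_000, 24_000, seed=17)
```

Because the selection depends only on the dataset size, the cap, and the seed, swapping the backbone cannot change which samples are used.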

⭐ PRIMARY RESULTS: full test split (16,380 imgs, 26-class, trainer evaluate())

This is the real test set (the per-patient test partition), scored with the same methodology as the per-epoch val metrics and the production baseline. These supersede the 250-image LLM-sample numbers further down (that sample was only for the LLM comparison; with 250 imgs the ±~3 pt sampling error swamps the architecture gap and produced a spurious ResNet50/ConvNeXt “tie”).
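
The ±~3 pt figure is roughly what a binomial standard error gives at this accuracy level. The sketch below is a back-of-envelope check, assuming independent test images and a top-1 of about 0.60; it is not how the log itself derived the number.

```python
import math


def top1_standard_error(accuracy: float, n_images: int) -> float:
    """One binomial standard error of a top-1 estimate from n_images samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_images)


for n in (250, 16_380):
    se = top1_standard_error(0.60, n)
    print(f"n={n}: +/-{100 * se:.2f} pt (1 s.e.)")
# n=250: +/-3.10 pt (1 s.e.)
# n=16380: +/-0.38 pt (1 s.e.)
```

At 250 images one standard error is already larger than the architecture gaps being measured, while the full test split shrinks it to well under half a point.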

Full-budget decisive runs (all 128k imgs, flash decode cache, AdamW + cosine)

| Model | recipe | top-1 | top-3 | AUC |
| --- | --- | --- | --- | --- |
| convnextv2_base + strong reg | adamw 1e-4 + cosine, strong aug | 0.704 | 0.900 | 0.957 |
| convnextv2_tiny | adamw 1e-4 + cosine | 0.688 | 0.893 | 0.954 |
| resnet50 | adamw 1e-4 + cosine | 0.680 | 0.888 | 0.951 |
| gen2a_port (production baseline) | rmsprop 1e-3 | 0.657 | 0.866 | 0.945 |

ConvNeXt-V2-Base + strong regularization (rotation/colour-jitter/RandAugment/random-erasing, dropout 0.3/0.5, wd 0.1) reaches 0.704 top-1 (+1.6 over Tiny, +4.7 over baseline). Unlike Tiny (overfit after ft-epoch 2), Base with strong aug improved monotonically to a clean plateau (val top1 0.670 → 0.680 → 0.703 → 0.707 → 0.708, best at ft-epoch 4), so the capacity does help once memorisation is curbed. Cost: 88M params (3× Tiny), ~15 h to train on the Pascal card (strong aug is CPU-heavy: ~2.9 h/finetune-epoch). For reference, Stream 2’s fine-tuned MedSigLIP-448 reportedly reaches ~0.729 on the full test split (cloud H100).

Decomposition of the +3.1 pt total top-1 gain over the production baseline:

  • Recipe (RMSprop@1e-3 → AdamW@1e-4 + cosine): +2.3 pt. ResNet50 alone goes 0.657 → 0.680. This is the dominant lever and is architecture-agnostic.
  • Architecture (ResNet50 → ConvNeXt-V2-Tiny, recipe-matched): +0.77 pt, 0.680 → 0.688, plus +0.47 top-3 and +0.24 AUC. Real and consistent with the short-budget screen (+1.1), but modest.
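
The decomposition can be re-derived from the rounded top-1 values in the full-budget table. This is only an arithmetic check; the dictionary keys are labels chosen here, and because the table is rounded to three decimals the architecture step comes out as +0.8 rather than the +0.77 quoted from unrounded metrics.

```python
# Rounded top-1 values copied from the full-budget results table.
top1 = {
    "gen2a_port_rmsprop": 0.657,
    "resnet50_adamw_cosine": 0.680,
    "convnextv2_tiny_adamw_cosine": 0.688,
}

recipe_gain = top1["resnet50_adamw_cosine"] - top1["gen2a_port_rmsprop"]
arch_gain = top1["convnextv2_tiny_adamw_cosine"] - top1["resnet50_adamw_cosine"]
total_gain = top1["convnextv2_tiny_adamw_cosine"] - top1["gen2a_port_rmsprop"]

print(f"recipe:       +{100 * recipe_gain:.1f} pt")  # +2.3 pt
print(f"architecture: +{100 * arch_gain:.1f} pt")    # +0.8 pt
print(f"total:        +{100 * total_gain:.1f} pt")   # +3.1 pt
```

The two steps share the recipe-matched ResNet50 run as their midpoint, which is what makes the split additive.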

Both backbones overfit after ~2-3 finetune epochs on this data (best val_loss at ft-epoch 2 for ConvNeXt, ft-epoch 3 for ResNet50; val_loss rises monotonically after). The efficient recipe is short: early-stop at ~epoch 3, not the 18 epochs first scheduled. The decode cache (PR #103) was essential: it moved training from IO-bound (~140 ms/img decode, GPU near-idle) to GPU-bound (~58 min/epoch at 72-99% util on the TITAN X).
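
An early-stopping rule of the kind suggested here can be sketched in a few lines. The function name, the patience parameter, and the val_loss values in the usage line are all hypothetical; the log does not say how the trainer would implement it.

```python
from typing import Sequence, Tuple


def early_stop_epoch(
    val_losses: Sequence[float], patience: int = 1
) -> Tuple[int, int]:
    """Return (best_epoch, stop_epoch) for a patience-based early stop.

    Training stops once val_loss has failed to improve for `patience`
    consecutive epochs; the checkpoint from best_epoch is the one to keep.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1


# Made-up curve that bottoms out at index 2 and then rises.
best, stopped = early_stop_epoch([0.95, 0.90, 0.88, 0.91, 0.97], patience=1)
```

With patience=1 the example stops one epoch after the minimum instead of running out the schedule.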

Screening runs (equal short 24k/5ep budget; what pointed here)

| Model | recipe | top-1 | top-3 | AUC |
| --- | --- | --- | --- | --- |
| convnextv2_tiny | adamw 1e-4 | 0.621 | 0.843 | 0.931 |
| resnet50 | adamw 1e-4 | 0.610 | 0.835 | 0.925 |

Everything below uses the 250-image LLM-comparison sample and is retained only as the screening record that led here. Treat the test-split tables above as the source of truth.

Results: Round A (baseline recipe: RMSprop, finetune_lr 1e-3)

top-1 / top-3 on the 250-img test sample, seed 17:

| Backbone | top-1 | top-3 | Δ top-1 vs short RN50 | finetune stability |
| --- | --- | --- | --- | --- |
| resnet50 (full-budget ref) | 0.644 | 0.888 | n/a | production baseline |
| resnet50 (short-budget anchor) | 0.560 | 0.856 | n/a | ✅ stable (val top1 0.515 → 0.619) |
| efficientnetv2_s | 0.564 | 0.848 | +0.4 | ✅ stable (val top1 0.555 → 0.622) |
| convnextv2_tiny | 0.568 | 0.824 | +0.8 | ❌ diverged in finetune |
| dinov2_vits14 | 0.588 | 0.820 | +2.8 | ❌ diverged in finetune |

The decisive observation is in the val curves, not the headline numbers. RMSprop @ finetune_lr=1e-3 destabilises the modern backbones:

  • convnextv2_tiny: frozen-phase val top1 0.566, then finetune val_loss exploded 0.092 → 10.6 → 6.9 → 32.7 → 4.0 and val top1 collapsed to ~0.09.
  • dinov2_vits14: frozen val top1 0.555, finetune oscillated (0.149 → 0.534 → 0.520 → 0.254), val_loss unstable.
  • efficientnetv2_s and resnet50: finetuned cleanly, val top1 rising monotonically to ~0.62.

Because checkpointing keeps the best val_loss, the saved best_model.pt for convnextv2/dinov2 is effectively the frozen-backbone model (best val_loss = the frozen epoch). So their test scores (0.568 / 0.588) reflect frozen ImageNet/SSL features + a trained head, and even so they match or beat the fully-finetuned ResNet50 anchor (0.560). That’s a strong hint the architecture is not the limiter; the finetune recipe is.
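
The effect of keeping the best val_loss can be replayed on the convnextv2_tiny curve quoted above. This sketch assumes the first value (0.092) belongs to the frozen epoch and the remaining four to the finetune epochs, which matches the 1 + 4 budget but is an interpretation of the log, not something it states outright.

```python
# val_loss sequence for convnextv2_tiny under RMSprop, as quoted in the log.
val_loss = [0.092, 10.6, 6.9, 32.7, 4.0]
phases = ["frozen", "ft-1", "ft-2", "ft-3", "ft-4"]  # assumed epoch labels

# Best-val_loss checkpointing keeps the epoch with the minimum loss.
best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
print(f"best_model.pt comes from: {phases[best_index]}")
# best_model.pt comes from: frozen
```

None of the diverged finetune epochs ever beats the frozen epoch, so the saved checkpoint never includes finetuned backbone weights.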

Results: Round B (corrected recipe: AdamW, finetune_lr 1e-4, wd 0.05)

Re-ran the diverging backbones changing only the optimizer/LR (charter permits per-backbone LR tuning on obvious divergence). Same 24k/4k subset, seed, epochs. ResNet50 re-run under AdamW too, as a control for the optimizer change.

| Backbone (AdamW 1e-4) | top-1 | top-3 | vs RMSprop run | finetune stability |
| --- | --- | --- | --- | --- |
| convnextv2_tiny-adamw | 0.600 | 0.888 | 0.568 → 0.600 | ✅ stable, val top1 → 0.62 |
| dinov2_vits14-adamw | 0.512 | 0.784 | 0.588 → 0.512 | ✅ stable but weak (ViT, data-hungry) |
| resnet50-adamw (control) | 0.600 | 0.844 | 0.560 → 0.600 | ✅ stable |

Consolidated comparison (test sample, 250 imgs, seed 17)

| Model | Recipe | Budget | top-1 | top-3 |
| --- | --- | --- | --- | --- |
| resnet50 (production) | rmsprop 1e-3 | full (128k, 18ep) | 0.644 | 0.888 |
| resnet50 (anchor) | rmsprop 1e-3 | short (24k, 5ep) | 0.560 | 0.856 |
| efficientnetv2_s | rmsprop 1e-3 | short | 0.564 | 0.848 |
| dinov2_vits14 | adamw 1e-4 | short | 0.512 | 0.784 |
| resnet50 (control) | adamw 1e-4 | short | 0.600 | 0.844 |
| convnextv2_tiny | adamw 1e-4 | short | 0.600 | 0.888 |

Findings & recommendation

Confirmed at full budget on the real test split. Two levers, both real, with the recipe being the larger:

1. The finetune recipe is the dominant lever: adopt it now (low risk). Switching RMSprop@1e-3 → AdamW@1e-4 + cosine lifts ResNet50 itself +2.3 pt top-1 (0.657 → 0.680) with no architecture change. It is also a hard prerequisite for modern backbones (ConvNeXt-V2/DINOv2 diverge under RMSprop@1e-3). The flash decode cache (PR #103) is what made full-data runs practical (IO-bound → GPU-bound).

2. ConvNeXt-V2-Tiny is the best model: GO, modest extra gain. Recipe-matched at full budget it beats ResNet50 +0.77 pt top-1 (0.688 vs 0.680), +0.47 top-3, +0.24 AUC, and beats the production baseline +3.1 / +2.6 / +0.9. The architecture edge is real and consistent (matches the short-budget screen) but small next to the recipe gain, so ConvNeXt-V2 is the recommended production model, while ResNet50+AdamW is a near-equal, zero-risk fallback if the backbone swap is undesirable.

3. Train short: both backbones overfit after ~2-3 finetune epochs. Best checkpoints: ConvNeXt ft-epoch 2, ResNet50 ft-epoch 3; val_loss rises monotonically after. 18 epochs was wasteful. Use early stopping (~3 finetune epochs).

4. NO-GO: DINOv2 ViT-S/14 (finetuning hurt it; data-hungry, so prefer Swin-V2 if a transformer is ever wanted); EfficientNetV2-S (no edge over ResNet50). Derm-pretrained backbones are Stream 2’s scope.
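
For finding 1, the shape of an AdamW@1e-4 + cosine learning-rate schedule can be sketched without any framework. This is a generic linear-warmup-plus-cosine curve, not the trainer's actual scheduler: the function name is hypothetical, and reusing the 200-step warmup from the Round A settings and decaying to zero are assumptions.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    peak_lr: float = 1e-4,
    warmup_steps: int = 200,
    min_lr: float = 0.0,
) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at min_lr if stepped past the end
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Sample the curve at a few points of a hypothetical 1000-step run.
schedule = [lr_at_step(s, total_steps=1000) for s in (0, 199, 600, 1000)]
```

The rate peaks at the end of warmup, is at half the peak midway through the decay, and reaches min_lr on the final step.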

Recommendation for production / next steps

  • Ship ConvNeXt-V2-Tiny + AdamW@1e-4 + cosine, ~3 finetune epochs as the new Gen2A backbone: 0.688 top-1 / 0.893 top-3 / 0.954 AUC (+3.1 top-1 over the current production model). Backbone is config-only (backbone: convnextv2_tiny).
  • Cheap follow-ups worth trying: ConvNeXt-V2-Base (more capacity); layer-wise LR decay + light augmentation/regularisation to push the overfit ceiling past epoch 2-3; combine with Stream 3’s questionnaire findings and Stream 2’s MedSigLIP for an ensemble/foundation comparison.
  • Infra note: ~58 min/epoch on the TITAN X (Pascal, GPU-bound after the cache). A tensor-core GPU would cut iteration time several-fold.

Reproduce

# train (early-stop ~ft-epoch 3)
CUDA_VISIBLE_DEVICES=0 uv run --package ddtrain python -m ddtrain.training.trainer \
    --config configs/arch_convnextv2_tiny_full.yaml

# score on full test split
CUDA_VISIBLE_DEVICES=0 uv run --package ddtrain python scripts/eval/test_split_eval.py \
    --model-dir trained_models/arch-convnextv2_tiny-full

Blockers / surprises

  • ddtrain.train (charter’s example module path) does not exist; the entry point is ddtrain.training.trainer.
  • The trainer auto-wraps in DataParallel whenever it sees >1 GPU, so an unpinned run grabs both cards. CUDA_VISIBLE_DEVICES=0 keeps it to one.
  • Data loading (CPU JPEG decode) is the throughput bottleneck on the Pascal GPUs; full-dataset epochs are ~75 min. Subsampling was needed for the bake-off.
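
The DataParallel surprise above can also be avoided from inside a Python launcher, as an alternative to exporting the variable in the shell. The helper name is hypothetical; the one firm requirement is that the variable is set before CUDA is initialised, so doing it before importing torch is the safe order.

```python
import os


def pin_single_gpu(index: int = 0) -> None:
    """Restrict this process to one GPU by setting CUDA_VISIBLE_DEVICES."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


pin_single_gpu(0)
# Import torch only after pinning. With one visible device, a trainer that
# wraps in DataParallel when it sees more than one GPU stays on a single card.
```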