Stream 1: Modern backbone vs ResNet50
Round 1 preliminary investigation: does a 2026-era image backbone beat the ResNet50 image branch of Gen2A on the same data? Primary metric: top-1.
What changed in the code
The `model.backbone` config field existed but was dead: `Gen2AModel`
hardcoded `torchvision.models.resnet50`. This stream made the backbone
swappable, config-only:
- `ddmodels/models/backbones.py` (new): `build_backbone(name, pretrained) -> (module, out_dim)`. `resnet50` is built from torchvision and is byte-identical to the original hardcoded path (baseline reproduces exactly). All other backbones go through `timm` (`num_classes=0` → pooled `[B, num_features]` features), imported lazily so the production `ai_service` image is unaffected unless a timm backbone is requested. Friendly aliases map to pinned timm checkpoints; DINOv2 gets `dynamic_img_size=True` to accept 224px.
- `Gen2AModel` / `ImageOnlyModel` (`ddmodels/models/gen2a.py`): accept a `backbone: str = "resnet50"` arg and use the factory. The metadata MLP branch, fusion, and classifier head are unchanged (held fixed per charter).
- `ddtrain.config`: added `training.max_train_samples` / `max_val_samples` (seeded subsampling for fast bake-offs; `None` = full split).
- `trainer.py`: pass `config.model.backbone` into the model; apply seeded `Subset` when the subsample knobs are set.
- `scripts/eval/model_benchmark.py`: read `backbone` from the trained model's `config.json` so non-ResNet checkpoints score correctly.
- `ddtrain` dep: added `timm>=1.0`.
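A rough sketch of how such a factory could be laid out, for orientation only. The alias table, the checkpoint tags in it, and the helper `resolve_backbone` are assumptions made for illustration; the real pins live in `ddmodels/models/backbones.py` and are not reproduced in this report. Replacing `fc` with `nn.Identity()` is one common way to get pooled features from torchvision's ResNet50 and may differ from the original hardcoded path.

```python
from importlib import import_module

# Hypothetical alias table: friendly name -> (timm checkpoint, extra kwargs).
# The checkpoint tags are illustrative assumptions, not the repo's actual pins.
TIMM_ALIASES = {
    "convnextv2_tiny": ("convnextv2_tiny.fcmae_ft_in22k_in1k", {}),
    "efficientnetv2_s": ("tf_efficientnetv2_s.in21k_ft_in1k", {}),
    "dinov2_vits14": ("vit_small_patch14_dinov2.lvd142m", {"dynamic_img_size": True}),
}


def resolve_backbone(name):
    """Return (timm_model_name, kwargs); unknown names pass through unchanged."""
    model_name, kwargs = TIMM_ALIASES.get(name, (name, {}))
    return model_name, dict(kwargs)  # copy so callers cannot mutate the table


def build_backbone(name="resnet50", pretrained=True):
    """Return (module, out_dim); module maps images to pooled [B, out_dim] features."""
    if name == "resnet50":
        # torchvision path, kept separate so the baseline never touches timm.
        nn = import_module("torch.nn")
        tv = import_module("torchvision.models")
        weights = tv.ResNet50_Weights.IMAGENET1K_V2 if pretrained else None
        net = tv.resnet50(weights=weights)
        out_dim = net.fc.in_features
        net.fc = nn.Identity()  # drop the ImageNet classifier, keep pooled features
        return net, out_dim

    # Lazy import: timm is only loaded when a non-ResNet backbone is requested.
    timm = import_module("timm")
    model_name, kwargs = resolve_backbone(name)
    net = timm.create_model(model_name, pretrained=pretrained, num_classes=0, **kwargs)
    return net, net.num_features
```

Keeping the `timm` import inside the function is what lets a ResNet50-only deployment run without the extra dependency installed.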
Method
- Backbones: `convnextv2_tiny` (timm, ImageNet-22k→1k), `efficientnetv2_s` (timm, in21k→1k), `dinov2_vits14` (timm, LVD-142M SSL), and `resnet50` (torchvision IMAGENET1K_V2) re-trained at the same short budget as the honest anchor.
- Held fixed: seed 17, image_size 224, metadata branch, optimizer rmsprop, base_lr 0.002 / finetune_lr 0.001, warmup 200.
- Budget: 1 frozen epoch + 4 finetune epochs on a seeded 24k-train / 4k-val subset. The full 128k-image epochs take ~75 min each on the Pascal GPU, which is too slow for a 4-backbone bake-off in the time window. All backbones use the same subset, so the relative ranking is apples-to-apples.
- Scoring (primary): `scripts/eval/test_split_eval.py` runs the trainer's `evaluate()` over the full `test` split (16,380 imgs, 26-class). The `model_benchmark.py` 250-image sample was used during screening but is too noisy for architecture comparisons; see the primary results table below.
- Hardware: GPU 0 (TITAN X Pascal, 12GB), `CUDA_VISIBLE_DEVICES=0`.
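The seeded subset in the budget item can be sketched as follows. This is a minimal illustration, not the trainer's code: the function name `seeded_subset_indices` is hypothetical, and the real implementation may draw indices differently (for example with a torch generator).

```python
import random


def seeded_subset_indices(dataset_len, max_samples, seed=17):
    """Pick a reproducible subset of indices; None or an oversized cap means the full split."""
    if max_samples is None or max_samples >= dataset_len:
        return list(range(dataset_len))
    rng = random.Random(seed)  # local RNG, leaves the global random state alone
    return sorted(rng.sample(range(dataset_len), max_samples))


# 24k-image training subset out of the 128k-image training split.
train_idx = seeded_subset_indices(128_000, 24_000)
```

With PyTorch, the resulting list can be passed to `torch.utils.data.Subset(dataset, train_idx)`. Because the indices depend only on the seed and the split size, every backbone sees exactly the same images.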
PRIMARY RESULTS: full test split (16,380 imgs, 26-class, trainer `evaluate()`)
This is the real test set (the per-patient test partition), scored with the
same methodology as the per-epoch val metrics and the production baseline.
These supersede the 250-image LLM-sample numbers further down (that sample
was only for the LLM comparison; with 250 imgs the ±~3pt sampling error swamps
the architecture gap and produced a spurious ResNet50/ConvNeXt "tie").
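The size of that sampling error can be checked with a simple binomial estimate. The sketch below assumes a top-1 around 0.60 and independent test images; both are simplifications (images from the same patient are likely correlated, so the true uncertainty may be somewhat larger).

```python
import math


def top1_standard_error(p, n):
    """Binomial standard error of an accuracy estimate p measured on n images."""
    return math.sqrt(p * (1.0 - p) / n)


for n in (250, 16_380):
    se = top1_standard_error(0.60, n)
    print(f"n={n:>6}: about +/-{100 * se:.1f} pt (one standard error)")
```

At 250 images one standard error is roughly 3 pt, larger than the architecture gap being measured; at 16,380 images it drops to roughly 0.4 pt.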
Full-budget decisive runs (all 128k imgs, flash decode cache, AdamW + cosine)
| Model | recipe | top-1 | top-3 | AUC |
|---|---|---|---|---|
| convnextv2_base + strong reg | adamw 1e-4 + cosine, strong aug | 0.704 | 0.900 | 0.957 |
| convnextv2_tiny | adamw 1e-4 + cosine | 0.688 | 0.893 | 0.954 |
| resnet50 | adamw 1e-4 + cosine | 0.680 | 0.888 | 0.951 |
| gen2a_port (production baseline) | rmsprop 1e-3 | 0.657 | 0.866 | 0.945 |
ConvNeXt-V2-Base + strong regularization (rotation/colour-jitter/RandAugment/random-erasing, dropout 0.3/0.5, wd 0.1) reaches 0.704 top-1 (+1.6 pt over Tiny, +4.7 pt over baseline). Unlike Tiny (overfit after ft-epoch 2), Base with strong aug improved monotonically to a clean plateau (val top1 0.670→0.680→0.703→0.707→0.708, best at ft-epoch 4), so the capacity does help once memorisation is curbed. Cost: 88M params (3× Tiny), ~15 h to train on the Pascal card (strong aug is CPU-heavy: ~2.9 h/finetune-epoch). For reference, Stream 2's fine-tuned MedSigLIP-448 reportedly reaches ~0.729 on the full test split (cloud H100).
Decomposition of the +3.1 pt total top-1 gain over the production baseline:
- Recipe (RMSprop@1e-3 → AdamW@1e-4 + cosine): +2.3 pt. ResNet50 alone goes 0.657 → 0.680. This is the dominant lever and is architecture-agnostic.
- Architecture (ResNet50 → ConvNeXt-V2-Tiny, recipe-matched): +0.77 pt (0.680 → 0.688), plus +0.47 top-3 and +0.24 AUC. Real and consistent with the short-budget screen (+1.1), but modest.
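The decomposition can be reproduced from the full-budget table. The snippet below uses the three-decimal table values, so the architecture term comes out at about +0.8 pt rather than the +0.77 pt quoted, which presumably comes from unrounded scores. The dictionary keys are labels invented for this example.

```python
# top-1 on the full test split, copied from the full-budget table (rounded to 3 decimals).
top1 = {
    "gen2a_port_rmsprop": 0.657,     # production baseline (ResNet50, RMSprop 1e-3)
    "resnet50_adamw": 0.680,         # same architecture, new recipe
    "convnextv2_tiny_adamw": 0.688,  # new architecture, new recipe
}

recipe_gain = top1["resnet50_adamw"] - top1["gen2a_port_rmsprop"]
arch_gain = top1["convnextv2_tiny_adamw"] - top1["resnet50_adamw"]
total_gain = top1["convnextv2_tiny_adamw"] - top1["gen2a_port_rmsprop"]

for label, gain in [("recipe", recipe_gain), ("architecture", arch_gain), ("total", total_gain)]:
    print(f"{label:>12}: {100 * gain:+.1f} pt")
```

The two terms sum to the total by construction, because the ResNet50 + AdamW run is the shared intermediate point.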
Both backbones overfit after ~2-3 finetune epochs on this data (best val_loss at ft-epoch 2 for ConvNeXt, ft-epoch 3 for ResNet50; val_loss rises monotonically after). The efficient recipe is short: early-stop at ~epoch 3, not the 18 epochs first scheduled. The decode cache (PR #103) was essential: it moved training from IO-bound (~140 ms/img decode, GPU near-idle) to GPU-bound (~58 min/epoch at 72-99% util on the TITAN X).
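A minimal early-stopping sketch for the short recipe described above. The class and the loss values are illustrative only (the numbers are made up to have their minimum at ft-epoch 2); the trainer's actual stopping logic is not shown in this report.

```python
class EarlyStopping:
    """Stop when val_loss has not improved for `patience` consecutive epochs."""

    def __init__(self, patience=1, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True if training should stop now."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up val_loss curve with its minimum at ft-epoch 2.
stopper = EarlyStopping(patience=1)
for epoch, loss in enumerate([0.95, 0.90, 0.93, 0.97, 1.02], start=1):
    if stopper.step(epoch, loss):
        break
```

With `patience=1` the loop stops at epoch 3 and keeps epoch 2 as the best checkpoint, which matches the ConvNeXt pattern reported here.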
Screening runs (equal short 24k/5ep budget; what pointed here)
| Model | recipe | top-1 | top-3 | AUC |
|---|---|---|---|---|
| convnextv2_tiny | adamw 1e-4 | 0.621 | 0.843 | 0.931 |
| resnet50 | adamw 1e-4 | 0.610 | 0.835 | 0.925 |
Everything below uses the 250-image LLM-comparison sample and is retained only as the screening record that led here. Treat the test-split tables above as the source of truth.
Results: Round A (baseline recipe: RMSprop, finetune_lr 1e-3)
top-1 / top-3 on the 250-img test sample, seed 17:
| Backbone | top-1 | top-3 | Δtop-1 vs short RN50 (pt) | finetune stability |
|---|---|---|---|---|
| resnet50 (full-budget ref) | 0.644 | 0.888 | n/a | production baseline |
| resnet50 (short-budget anchor) | 0.560 | 0.856 | n/a | stable (val top1 0.515→0.619) |
| efficientnetv2_s | 0.564 | 0.848 | +0.4 | stable (val top1 0.555→0.622) |
| convnextv2_tiny | 0.568 | 0.824 | +0.8 | diverged in finetune |
| dinov2_vits14 | 0.588 | 0.820 | +2.8 | diverged in finetune |
The decisive observation is in the val curves, not the headline numbers. RMSprop @ finetune_lr=1e-3 destabilises the modern backbones:
- `convnextv2_tiny`: frozen-phase val top1 0.566, then finetune val_loss exploded (0.092 → 10.6 → 6.9 → 32.7 → 4.0) and val top1 collapsed to ~0.09.
- `dinov2_vits14`: frozen val top1 0.555, finetune oscillated (0.149 → 0.534 → 0.520 → 0.254), val_loss unstable.
- `efficientnetv2_s` and `resnet50`: finetuned cleanly, val top1 rising monotonically to ~0.62.
Because checkpointing keeps the best val_loss, the saved best_model.pt for
convnextv2/dinov2 is effectively the frozen-backbone model (best val_loss =
the frozen epoch). So their test scores (0.568 / 0.588) reflect frozen
ImageNet/SSL features + a trained head, and even so they match or beat the
fully-finetuned ResNet50 anchor (0.560). That's a strong hint the architecture
is not the limiter; the finetune recipe is.
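The checkpoint-selection effect is easy to see with the ConvNeXt val_loss sequence quoted above. The sketch assumes the first value belongs to the frozen epoch and the remaining four to the finetune epochs, which fits the 1 + 4 epoch budget but is not stated explicitly.

```python
# val_loss per epoch for convnextv2_tiny under RMSprop, as quoted in this section.
# Assumption: index 0 is the frozen epoch, indices 1-4 are the finetune epochs.
val_loss = [0.092, 10.6, 6.9, 32.7, 4.0]

# "Keep the best val_loss" is an argmin over epochs.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
phase = "frozen" if best_epoch == 0 else f"finetune epoch {best_epoch}"
print(f"best_model.pt comes from: {phase} (val_loss={val_loss[best_epoch]})")
```

Since no finetune epoch ever beats the frozen epoch's loss, the saved model never includes finetuned backbone weights.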
Results: Round B (corrected recipe: AdamW, finetune_lr 1e-4, wd 0.05)
Re-ran the diverging backbones changing only the optimizer/LR (charter permits per-backbone LR tuning on obvious divergence). Same 24k/4k subset, seed, epochs. ResNet50 re-run under AdamW too, as a control for the optimizer change.
| Backbone (AdamW 1e-4) | top-1 | top-3 | vs RMSprop run | finetune stability |
|---|---|---|---|---|
| convnextv2_tiny-adamw | 0.600 | 0.888 | 0.568 → 0.600 | stable, val top1 ~0.62 |
| dinov2_vits14-adamw | 0.512 | 0.784 | 0.588 → 0.512 | stable but weak (ViT, data-hungry) |
| resnet50-adamw (control) | 0.600 | 0.844 | 0.560 → 0.600 | stable |
Consolidated comparison (test sample, 250 imgs, seed 17)
| Model | Recipe | Budget | top-1 | top-3 |
|---|---|---|---|---|
| resnet50 (production) | rmsprop 1e-3 | full (128k, 18ep) | 0.644 | 0.888 |
| resnet50 (anchor) | rmsprop 1e-3 | short (24k, 5ep) | 0.560 | 0.856 |
| efficientnetv2_s | rmsprop 1e-3 | short | 0.564 | 0.848 |
| dinov2_vits14 | adamw 1e-4 | short | 0.512 | 0.784 |
| resnet50 (control) | adamw 1e-4 | short | 0.600 | 0.844 |
| convnextv2_tiny | adamw 1e-4 | short | 0.600 | 0.888 |
Findings & recommendation
Confirmed at full budget on the real test split. Two levers, both real, with the recipe being the larger:
1. The finetune recipe is the dominant lever: adopt it now (low risk). Switching RMSprop@1e-3 → AdamW@1e-4 + cosine lifts ResNet50 itself +2.3 pt top-1 (0.657 → 0.680) with no architecture change. It is also a hard prerequisite for modern backbones (ConvNeXt-V2/DINOv2 diverge under RMSprop@1e-3). The flash decode cache (PR #103) is what made full-data runs practical (IO-bound → GPU-bound).
2. ConvNeXt-V2-Tiny is the best model: GO, modest extra gain. Recipe-matched at full budget it beats ResNet50 by +0.77 pt top-1 (0.688 vs 0.680), +0.47 top-3, +0.24 AUC, and beats the production baseline by +3.1 / +2.6 / +0.9 (top-1 / top-3 / AUC). The architecture edge is real and consistent (matches the short-budget screen) but small next to the recipe gain, so ConvNeXt-V2 is the recommended production model, while ResNet50+AdamW is a near-equal, zero-risk fallback if the backbone swap is undesirable.
3. Train short: both backbones overfit after ~2-3 finetune epochs. Best checkpoints: ConvNeXt ft-epoch 2, ResNet50 ft-epoch 3; val_loss rises monotonically after. 18 epochs was wasteful. Use early stopping (~3 finetune epochs).
4. NO-GO: DINOv2 ViT-S/14 (finetuning hurt it; data-hungry, so prefer Swin-V2 if a transformer is ever wanted); EfficientNetV2-S (no edge over ResNet50). Derm-pretrained backbones are Stream 2's scope.
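For reference, the shape of a warmup-plus-cosine learning-rate schedule at the recommended base LR can be sketched in plain Python. The linear warmup shape, the per-step granularity, interpreting "warmup 200" as 200 steps, and decaying to zero are all assumptions; the trainer's exact schedule is not described in this report.

```python
import math


def cosine_lr(step, total_steps, base_lr=1e-4, warmup_steps=200, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr over the remaining steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # clamp if training runs past total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at a few points of a hypothetical 1000-step run.
for step in (0, 199, 600, 1000):
    print(step, f"{cosine_lr(step, 1000):.2e}")
```

In PyTorch the same idea is usually expressed with `torch.optim.AdamW` plus a scheduler such as `torch.optim.lr_scheduler.CosineAnnealingLR`.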
Recommendation for production / next steps
- Ship ConvNeXt-V2-Tiny + AdamW@1e-4 + cosine, ~3 finetune epochs as the new Gen2A backbone: 0.688 top-1 / 0.893 top-3 / 0.954 AUC (+3.1 top-1 over the current production model). Backbone is config-only (`backbone: convnextv2_tiny`).
- Cheap follow-ups worth trying: ConvNeXt-V2-Base (more capacity); layer-wise LR decay + light augmentation/regularisation to push the overfit ceiling past epoch 2-3; combine with Stream 3's questionnaire findings and Stream 2's MedSigLIP for an ensemble/foundation comparison.
- Infra note: ~58 min/epoch on the TITAN X (Pascal, GPU-bound after the cache). A tensor-core GPU would cut iteration time several-fold.
Reproduce
CUDA_VISIBLE_DEVICES=0 uv run --package ddtrain python -m ddtrain.training.trainer \
  --config configs/arch_convnextv2_tiny_full.yaml  # train (early-stop ~ft-epoch 3)
CUDA_VISIBLE_DEVICES=0 uv run --package ddtrain python scripts/eval/test_split_eval.py \
  --model-dir trained_models/arch-convnextv2_tiny-full  # score on full test split
Blockers / surprises
- `ddtrain.train` (charter's example module path) does not exist; the entry point is `ddtrain.training.trainer`.
- The trainer auto-wraps in `DataParallel` whenever it sees >1 GPU, so an unpinned run grabs both cards. `CUDA_VISIBLE_DEVICES=0` keeps it to one.
- Data loading (CPU JPEG decode) is the throughput bottleneck on the Pascal GPUs; full-dataset epochs are ~75 min. Subsampling was needed for the bake-off.
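When a run is launched from a Python wrapper rather than the shell, the same GPU pinning can be done in-process. This is a generic sketch, not part of the repo; `visible_gpu_ids` is a hypothetical helper. The variable has to be set before CUDA is initialised (in practice, before `torch` is imported).

```python
import os

# Pin to GPU 0 unless the caller already chose a device list.
# Must run before torch initialises CUDA, otherwise it has no effect.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")


def visible_gpu_ids():
    """Parse CUDA_VISIBLE_DEVICES into a list of device id strings."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


if len(visible_gpu_ids()) > 1:
    print("warning: more than one GPU visible; the trainer would use DataParallel")
```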