Decoded-image cache for `DermaDetectDataset`

Why

Fine-tuning in ddtrain is bound by CPU-side JPEG decoding, not by the GPU. During the Round-1 ablations the GPU sat at ~0% utilization while dataloader workers spent ~140 ms per image on Image.open(...).convert("RGB") plus a downscale from the original resolution. The ResNet50 forward/backward pass finishes quickly relative to data loading and then waits for the next batch, so an expensive GPU idles behind the CPU decoder. All three Round-1 streams are affected equally, because they share the same dataset and trainer over the same ~370k-image corpus.

What

A lazy, self-populating decoded-image cache keyed by dataset version + image size. The expensive decode + resize happens once per image; the result (a uint8 [image_size, image_size, 3] array) is written to flash and read back on every later epoch (and by every other stream).

Path: cache_dir/<version>/<image_size>/<uuid[:2]>/<uuid>.npy (sharded by uuid prefix so no directory holds 370k files).
Lazy: no build step. First epoch decodes + writes; later epochs read. Populates only images actually used (so it composes with subsampling).
Atomic writes (temp file + os.replace) → safe for concurrent dataloader workers and multiple streams writing the same flash cache.
Augmentation-safe: only the deterministic decode+resize is cached; the random train augments (flip / affine / normalize) still run per-epoch on the cached base image, so augmentation diversity is unchanged.
Fails open: a corrupt entry is rebuilt; a full/unwritable cache silently falls back to live decode — the cache can never break training.
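The properties listed above can be sketched in a few dozen lines. This is a minimal illustration, not the code in ddtrain/datasets/dataset.py: the function names `cache_path`, `write_atomic`, and `load_image`, and the injected `decode` callable, are hypothetical stand-ins for the real `_load_image` / `_decode_image` / `_write_cache`.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional

import numpy as np


def cache_path(cache_dir: str, version: str, image_size: int, uuid: str) -> Path:
    # Layout from the description: cache_dir/<version>/<image_size>/<uuid[:2]>/<uuid>.npy
    return Path(cache_dir) / version / str(image_size) / uuid[:2] / f"{uuid}.npy"


def write_atomic(path: Path, arr: np.ndarray) -> None:
    # Write to a temp file in the same directory, then rename it into place.
    # os.replace is atomic on one filesystem, so readers never see a partial file.
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            np.save(f, arr)
        os.replace(tmp, path)
    except BaseException:
        try:
            os.unlink(tmp)  # do not leave temp files behind on failure
        except OSError:
            pass
        raise


def load_image(
    uuid: str,
    decode: Callable[[str], np.ndarray],  # hypothetical: the slow decode + resize
    cache_dir: Optional[str],
    version: str,
    image_size: int,
) -> np.ndarray:
    if cache_dir is None:
        return decode(uuid)  # cache disabled: old live-decode behaviour
    path = cache_path(cache_dir, version, image_size, uuid)
    try:
        arr = np.load(path, allow_pickle=False)
        if arr.dtype == np.uint8 and arr.shape == (image_size, image_size, 3):
            return arr  # cache hit
    except (OSError, ValueError, EOFError):
        pass  # missing or corrupt entry: fall through and rebuild it
    arr = decode(uuid)
    try:
        write_atomic(path, arr)
    except OSError:
        pass  # full or unwritable cache: fail open, keep training
    return arr
```

In this sketch, random augmentations would be applied by the caller to the returned array, so only the deterministic part is ever stored.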

Storing at the target size means the transform pipeline’s Resize is a no-op on a hit, so cached output is bit-identical to the live-decode path (verified: max|Δ| = 0).

How to enable

Add one line to any training config’s dataset: block (works for every model type and both streams’ configs):


dataset:
  cache_dir: /mnt/flash/dermadetect_cache   # flash SSD, ~55 GB for the full v1 corpus

/mnt/flash is an SSD-backed LVM volume (414 GB free) on the dev box. Leave cache_dir unset to keep the old live-decode behaviour.
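As a sanity check on the ~55 GB figure, the cache footprint can be estimated from the array shape. The sketch below assumes image_size = 224 (the PR does not state the value) and an .npy header of about 128 bytes per file, and it ignores filesystem block overhead; `estimated_cache_gb` is a hypothetical helper, not part of ddtrain.

```python
def estimated_cache_gb(
    n_images: int,
    image_size: int,
    channels: int = 3,
    header_bytes: int = 128,  # assumed .npy header size per file
) -> float:
    # One uint8 per channel per pixel, plus the per-file header.
    payload_bytes = image_size * image_size * channels
    return n_images * (payload_bytes + header_bytes) / 1e9


# ~370k images at an assumed 224x224x3 uint8
print(f"{estimated_cache_gb(370_000, 224):.1f} GB")  # 55.7 GB
```

Under those assumptions the full corpus fits comfortably in the 414 GB free on /mnt/flash, with room for a second image size or dataset version.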

Measured effect

Per-image loading cost on the v1 val split, with eval transforms:

| Load path | Time per image (ms) |
| --- | --- |
| live JPEG decode + resize | ~140 |
| cached `.npy` read | ~1–5 |

Image loading is ≈ 25–120× faster; the range depends on how much of the cache is already in the OS page cache. Correctness: max|nocache − cache| = 0.000 over the sampled images, on both the miss path and the hit path. The first epoch runs at the old speed because it populates the cache; every subsequent epoch should be GPU-bound instead of decode-bound.
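A check of this kind can be sketched as follows. This is an illustration only, not the script used for the measurement above: `max_abs_diff` is a hypothetical helper, and a random array round-tripped through an in-memory .npy stands in for the live and cached loads.

```python
import io

import numpy as np


def max_abs_diff(a: np.ndarray, b: np.ndarray) -> float:
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    # Cast before subtracting: uint8 arithmetic wraps around (0 - 255 == 1).
    return float(np.abs(a.astype(np.float64) - b.astype(np.float64)).max())


# Stand-in for a live-decoded image (assumed 224x224x3 uint8).
rng = np.random.default_rng(0)
live = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Round-trip through the .npy format, as a cache hit would.
buf = io.BytesIO()
np.save(buf, live)
buf.seek(0)
cached = np.load(buf, allow_pickle=False)

print(max_abs_diff(live, cached))  # 0.0
```

The .npy format stores raw uint8 bytes, so the round trip itself is lossless; any nonzero difference would have to come from the decode or resize step.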

Changes

ddtrain/datasets/dataset.py — _load_image / _decode_image / _write_cache; new image_size, cache_dir, dataset_version params on DermaDetectDataset.
ddtrain/config.py — DatasetConfig.cache_dir.
ddtrain/training/trainer.py — passes cache_dir / image_size / version to both datasets.

No model, loss, or training-loop changes. Branch: feat/decode-tensor-cache off main, so Streams 1 & 2 can git merge it independently.
