Decoded-image cache for DermaDetectDataset
Why
Fine-tuning in ddtrain is CPU/JPEG-decode bound, not GPU bound — during the
Round-1 ablations the GPU sat at ~0% utilization while dataloader workers spent ~140 ms
per image doing Image.open(...).convert("RGB") + downscale-from-original. The
ResNet50 forward/backward finishes almost instantly and then waits on the next batch,
so an expensive GPU idles behind the CPU decoder. This hits all three Round-1 streams
equally (they share the same dataset/trainer over the same ~370k-image corpus).
What
A lazy, self-populating decoded-image cache keyed by dataset version + image size.
The expensive decode + resize happens once per image; the result (a uint8
[image_size, image_size, 3] array) is written to flash and read back on every later
epoch (and by every other stream).
- Path: cache_dir/<version>/<image_size>/<uuid[:2]>/<uuid>.npy (sharded by uuid prefix so no directory holds 370k files).
- Lazy: no build step. First epoch decodes + writes; later epochs read. Populates only images actually used (so it composes with subsampling).
- Atomic writes (temp file + os.replace) → safe for concurrent dataloader workers and multiple streams writing the same flash cache.
- Augmentation-safe: only the deterministic decode+resize is cached; the random train augments (flip / affine / normalize) still run per-epoch on the cached base image, so augmentation diversity is unchanged.
- Fails open: a corrupt entry is rebuilt; a full/unwritable cache silently falls back to live decode — the cache can never break training.
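
The properties listed above (sharded path, lazy population, atomic writes, fail-open reads) can be sketched roughly as follows. This is a minimal standard-library illustration, not the code in ddtrain/datasets/dataset.py: the names cache_path, write_atomic, and load_or_build are hypothetical, and the dumps/loads codec is a stand-in for whatever serialises the uint8 array to .npy bytes.

```python
import contextlib
import os
import tempfile
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def cache_path(cache_dir: str, version: str, image_size: int, uuid: str) -> Path:
    """Build cache_dir/<version>/<image_size>/<uuid[:2]>/<uuid>.npy."""
    return Path(cache_dir) / version / str(image_size) / uuid[:2] / f"{uuid}.npy"


def write_atomic(path: Path, payload: bytes) -> bool:
    """Write to a temp file in the target directory, then os.replace it into place.

    Returns False (instead of raising) when the cache is full or unwritable.
    """
    try:
        path.parent.mkdir(parents=True, exist_ok=True)
        fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(payload)
            # Readers see either the old entry or the complete new one.
            os.replace(tmp, path)
        except OSError:
            with contextlib.suppress(OSError):
                os.unlink(tmp)  # do not leave partial temp files behind
            raise
        return True
    except OSError:
        return False


def load_or_build(
    path: Path,
    build: Callable[[], T],
    dumps: Callable[[T], bytes],
    loads: Callable[[bytes], T],
) -> T:
    """Return the cached value on a hit; otherwise build it and try to cache it."""
    try:
        return loads(path.read_bytes())  # hit
    except (OSError, ValueError, EOFError):
        pass  # miss, unreadable, or corrupt entry: fall through and rebuild
    value = build()  # the expensive decode + resize
    write_atomic(path, dumps(value))  # best effort; a failed write is ignored
    return value
```

With NumPy, dumps and loads could wrap np.save and np.load on an io.BytesIO buffer. Two workers that miss on the same image both decode it and both call os.replace; whichever finishes last wins, and the contents are the same either way because the decode is deterministic.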
Storing at the target size means the transform pipeline’s Resize is a no-op on a hit,
so cached output is bit-identical to the live-decode path (verified: max|Δ| = 0).
How to enable
Add one line to any training config’s dataset: block (works for every model type and
both streams’ configs):

    dataset:
      cache_dir: /mnt/flash/dermadetect_cache  # flash SSD, ~55 GB for the full v1 corpus

/mnt/flash is an SSD-backed LVM volume (414 GB free) on the dev box. Leave cache_dir
unset to keep the old live-decode behaviour.
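
As a rough sanity check on the ~55 GB figure, the footprint can be estimated from the array shape. The sketch below assumes image_size = 224 (not stated in this description) and a 128-byte .npy header per file (also an assumption). It ignores filesystem block overhead. cache_footprint_bytes is a hypothetical helper, not part of ddtrain.

```python
def cache_footprint_bytes(n_images: int, image_size: int, header_bytes: int = 128) -> int:
    """Bytes needed for n_images uint8 arrays of shape [image_size, image_size, 3].

    header_bytes is the assumed per-file .npy header size.
    """
    payload = image_size * image_size * 3  # one byte per uint8 channel value
    return n_images * (payload + header_bytes)


# 370k images is the corpus size quoted above; 224 is an assumed image_size.
total = cache_footprint_bytes(n_images=370_000, image_size=224)
print(f"{total / 1e9:.1f} GB")  # 55.7 GB
```

Because the cache is keyed by image size, each additional image_size adds its own directory tree of this order. Footprint grows with the square of image_size.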
Measured effect
Per-image image-loading cost on the v1 val split, eval transforms:
| path | ms/image |
|---|---|
| live JPEG decode + resize | ~140 |
| cached .npy read | ~1–5 |
≈ 25–120× faster image loading (varies with OS page cache). Correctness:
max|nocache − cache| = 0.000 over the sampled images (miss and hit). First epoch is
unchanged (it populates); every subsequent epoch should be GPU-bound instead of
decode-bound.
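
To put the per-image numbers in per-epoch terms, here is a back-of-the-envelope estimate. It assumes an epoch touches all ~370k images, that loading scales perfectly across dataloader workers, and a hypothetical worker count of 8. The timings are the val-split, eval-transform measurements from the table, so train-time augmentation cost is not included.

```python
def epoch_load_seconds(n_images: int, ms_per_image: float, workers: int = 1) -> float:
    """Wall-clock seconds spent loading images, assuming perfect scaling across workers."""
    return n_images * ms_per_image / 1000 / workers


N_IMAGES = 370_000  # corpus size quoted in the Why section
WORKERS = 8         # hypothetical dataloader worker count

live = epoch_load_seconds(N_IMAGES, 140, WORKERS)       # live JPEG decode + resize
cached_slow = epoch_load_seconds(N_IMAGES, 5, WORKERS)  # cached read, cold page cache
cached_fast = epoch_load_seconds(N_IMAGES, 1, WORKERS)  # cached read, warm page cache

print(f"live decode: {live / 60:.0f} min per epoch")                   # 108 min
print(f"cached read: {cached_fast:.0f}-{cached_slow:.0f} s per epoch")  # 46-231 s
```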
Changes
- ddtrain/datasets/dataset.py: _load_image / _decode_image / _write_cache; new image_size, cache_dir, dataset_version params on DermaDetectDataset.
- ddtrain/config.py: DatasetConfig.cache_dir.
- ddtrain/training/trainer.py: passes cache_dir / image_size / version to both datasets.
No model, loss, or training-loop changes. Branch: feat/decode-tensor-cache off main,
so Streams 1 & 2 can git merge it independently.