Round 2 · Stream A — Leakage & near-duplicate audit

Question this gates: is the gen2a-medsiglip headline of 73.6% top-1 / 92.8% top-3 on the full 25-class test split (15,561 images) real, or inflated by train↔test contamination? Everything else in Round 2 is measured against that number, so we confirm it first.

Verdict: the 73.6% is real. Genuine train/test leakage is one duplicated patient = 6 images = 0.04% of the test set — far too small to move top-1 by even 0.1 pt. No test-split rebuild needed.
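The size of that effect can be bounded with a few lines of arithmetic. The sketch below is illustrative only: the function name is made up for this note, and it assumes the cleanup would simply drop the 6 leaked test images and recompute top-1 on the rest.

```python
def top1_bounds_after_removal(top1: float, n_test: int, n_leaked: int) -> tuple[float, float]:
    """Range of top-1 accuracy if n_leaked test images were dropped.

    Hypothetical helper, not part of leakage_audit.py.
    """
    n_correct = top1 * n_test
    n_kept = n_test - n_leaked
    # Lowest value: every leaked image had been classified correctly.
    lowest = (n_correct - n_leaked) / n_kept
    # Highest value: every leaked image had been misclassified.
    highest = min(n_correct, n_kept) / n_kept
    return lowest, highest


# Numbers from this report: 73.6% top-1, 15,561 test images, 6 leaked images.
lowest, highest = top1_bounds_after_removal(0.736, 15_561, 6)
max_shift_pts = 100 * max(0.736 - lowest, highest - 0.736)
print(f"top-1 after removal: between {100 * lowest:.3f}% and {100 * highest:.3f}%")
print(f"largest possible shift: {max_shift_pts:.3f} pt")
```

Whichever way the 6 images were classified, the shift stays well under the 0.1 pt quoted above.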

What was checked

Tool: src/training/scripts/eval/leakage_audit.py (CPU-only image hashing — no CV model runs, so it respects the no-CPU-CV rule). Per-image hashes are cached to ~/.cache/dermadetect/v1/leakage_hashes.parquet (~370k rows) so re-runs are instant.

  1. Split integrity. Splits are assigned per patient_id (eval_common.load_test_manifest). Confirmed 0 patients span >1 split.
  2. Manifest cross-split “tells” (no image reads): no case_uuid (143,449 distinct) spans more than one split; no source_gcs_path (all 369,786 unique) spans more than one split; no case_uuid maps to more than one patient_id. Single vendor (maccabi).
  3. Exact-content duplicates: sha256 of image bytes, test vs train+val.
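  4. Near-duplicates: perceptual pHash + dHash (64-bit), test vs train+val, matched within a Hamming radius K via (K+1)-band LSH. Two hashes that differ in at most K bits must agree exactly on at least one of K+1 disjoint bands, so band lookups miss no true match.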
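The (K+1)-band matching in step 4 can be sketched as follows. This is a minimal illustration of the pigeonhole idea, not the code in leakage_audit.py: the function names, the contiguous band layout, and the example hashes are all assumptions made for this note.

```python
from collections import defaultdict

HASH_BITS = 64


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def band_slices(k: int, bits: int = HASH_BITS) -> list[tuple[int, int]]:
    """Split `bits` into k+1 contiguous bands; return (shift, width) per band.

    Assumes k + 1 <= bits so that every band has at least one bit.
    """
    base, extra = divmod(bits, k + 1)
    slices, shift = [], 0
    for i in range(k + 1):
        width = base + (1 if i < extra else 0)
        slices.append((shift, width))
        shift += width
    return slices


def band_keys(h: int, slices: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """(band index, band value) keys for one hash."""
    return [(i, (h >> shift) & ((1 << width) - 1))
            for i, (shift, width) in enumerate(slices)]


def near_duplicates(test_hashes: dict, ref_hashes: dict, k: int) -> list:
    """All (test_id, ref_id, distance) with Hamming distance <= k."""
    slices = band_slices(k)
    index = defaultdict(set)
    for ref_id, h in ref_hashes.items():
        for key in band_keys(h, slices):
            index[key].add(ref_id)
    found = []
    for test_id, h in test_hashes.items():
        # At most k differing bits cannot touch all k+1 bands,
        # so every true match shares at least one band key.
        candidates = set()
        for key in band_keys(h, slices):
            candidates |= index.get(key, set())
        for ref_id in candidates:
            d = hamming(h, ref_hashes[ref_id])
            if d <= k:  # exact check removes band-only coincidences
                found.append((test_id, ref_id, d))
    return sorted(found)


# Made-up hashes: test_x is train_a with two bits flipped.
ref = {"train_a": 0x0F0F0F0F0F0F0F0F, "train_b": 0xFFFFFFFF00000000}
test = {"test_x": 0x0F0F0F0F0F0F0F0F ^ 0b101, "test_y": 0x123456789ABCDEF0}
matches = near_duplicates(test, ref, k=2)
print(matches)
```

The band index only proposes candidates; the exact Hamming check decides. That keeps the result identical to a brute-force comparison while avoiding the full test × train scan.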

Results (15,561-image 25-class test set vs 330,493 train+val)

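Cells are test images with at least one match in train+val (% of the 15,561 test images).

threshold         pHash           dHash
exact sha256      6 (0.04%)       n/a
perceptual ≤0     6 (0.04%)       51 (0.33%)
≤2                10 (0.06%)      264 (1.70%)
≤5                17 (0.11%)      1,339 (8.61%)
≤8                373 (2.40%)     4,046 (26.0%)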

The two hashes disagree wildly. dHash is a false-positive machine on this data and must be ignored; pHash is the trustworthy signal. Three independent lines of evidence:

  • Label agreement. Real duplicates share a diagnosis; coincidental matches don’t. Random same-diagnosis chance here is 7.8%. pHash≤0 pairs are 100% same-diagnosis (real dups); every dHash bucket is 4–16%, i.e. close to chance and nowhere near 100%, which is what random image pairs look like.
  • Mechanism. dHash encodes coarse gradients; dermatology photos are mostly smooth skin (and some solid-color frames), so their dHashes collapse and collide by coincidence. Visual check (eval_output/leakage_pairs_montage.png) shows a solid orange frame (test/psoriasis) matched to a skin photo (val/eczema nummular) at dHash d=0 — a pure collision.
  • Visual. All 6 pHash=0 pairs are the same woman’s near-identical selfies (same seatbelt, same top), labeled seborrheic dermatitis in both splits — unmistakable duplicates.
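The mechanism argument above can be reproduced on toy data. The sketch below is a simplified pure-Python difference hash on an already-downsized 8×9 grayscale grid; it follows the usual dHash idea (compare each pixel with its left neighbour) but is not the imagehash implementation, and the three grids are synthetic stand-ins, not images from the dataset.

```python
def dhash_bits(gray: list[list[int]]) -> int:
    """Difference hash of an 8-row x 9-column grayscale grid (values 0-255).

    A bit is 1 where a pixel is brighter than its left neighbour,
    giving 8 comparisons per row and 64 bits in total.
    """
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | int(right > left)
    return bits


# Synthetic stand-ins (assumptions for illustration only):
# a solid-colour frame, a smooth surface that darkens gently to the right,
# and a strongly textured patch.
solid = [[200] * 9 for _ in range(8)]
smooth = [[180 + r - 2 * c for c in range(9)] for r in range(8)]
textured = [[(37 * (r * 9 + c)) % 256 for c in range(9)] for r in range(8)]

for name, grid in [("solid", solid), ("smooth", smooth), ("textured", textured)]:
    print(f"{name:9s} {dhash_bits(grid):016x}")

distance = bin(dhash_bits(solid) ^ dhash_bits(smooth)).count("1")
print("Hamming distance solid vs smooth:", distance)
```

The solid frame and the smoothly shaded grid contain no pixel brighter than its left neighbour, so both hash to all zeros and match at d=0 even though they are different images. Only the textured grid produces a distinctive hash.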

The actual leak

All 6 pHash-identical pairs are between exactly two records: patient_id 1602866 (test) and 2243477 (train). One real person enrolled twice under different patient IDs, one copy in each split. Patient-level splitting is correct in principle but cannot catch a person re-enrolled under a new record. Footprint: 6 images, 0.04%. Even the most generous credible near-dup line (pHash≤5 = 17 imgs, 0.11%) is negligible.
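Grouping matched pairs by patient is what turns "6 duplicate images" into "one person enrolled twice". A minimal sketch, assuming the matched pairs are available as tuples of (test image, test patient_id, train image, train patient_id); the image names below are placeholders, only the two patient IDs and the counts come from this report.

```python
from collections import Counter

# Placeholder image names; patient IDs are the two records named above.
pairs = [
    (f"test_img_{i}", "1602866", f"train_img_{i}", "2243477")
    for i in range(6)
]


def patient_pair_footprint(pairs: list[tuple[str, str, str, str]], n_test: int):
    """Count matches per (test patient, train patient) and the test-set share."""
    by_patients = Counter((test_pid, train_pid) for _, test_pid, _, train_pid in pairs)
    leaked_test_images = {test_img for test_img, _, _, _ in pairs}
    share_pct = 100 * len(leaked_test_images) / n_test
    return by_patients, len(leaked_test_images), share_pct


by_patients, n_leaked, share_pct = patient_pair_footprint(pairs, n_test=15_561)
for (test_pid, train_pid), n in by_patients.items():
    print(f"test patient {test_pid} <-> train patient {train_pid}: {n} pairs")
print(f"{n_leaked} leaked test images = {share_pct:.2f}% of the test set")
```

A single dominant patient pair points to an identity-resolution problem upstream, whereas matches scattered across many patient pairs would point to image-level contamination.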

Recommendations / follow-ups

  • Trust 73.6%. No rebuild. Optionally drop the ~6–17 pHash≤5 train duplicates for hygiene (won’t move the metric).
  • Data-collection note (product/infra): de-duplicate patients at ingest by perceptual hash + demographics, not just patient_id. source_gcs_path and case_uuid are clean per-image, so the collision is upstream identity resolution.
  • Where Stream A’s real value is: not dedup, but the next steps — cleanlab confident-learning to surface mislabels, and taxonomy / “other”-class review via the confusion matrix. Those need GPU (out-of-fold MedSigLIP probabilities) and are the gated next task.

Reproduce

cd src/training
uv run --with imagehash python scripts/eval/leakage_audit.py --max-hamming 8
# -> eval_output/leakage_audit.json, leakage_flagged_test_uuids.json