Round 2 · Stream A — Leakage & near-duplicate audit
Question this gates: is the gen2a-medsiglip headline of 73.6% top-1 / 92.8%
top-3 on the full 25-class test split (15,561 images) real, or inflated by
train↔test contamination? Everything else in Round 2 is measured against that number,
so we confirm it first.
Verdict: the 73.6% is real. Genuine train/test leakage is one duplicated patient = 6 images = 0.04% of the test set — far too small to move top-1 by even 0.1 pt. No test-split rebuild needed.
What was checked
Tool: src/training/scripts/eval/leakage_audit.py (CPU-only image hashing — no CV
model runs, so it respects the no-CPU-CV rule). Per-image hashes are cached to
~/.cache/dermadetect/v1/leakage_hashes.parquet (~370k rows) so re-runs are instant.
- Split integrity. Splits are assigned per patient_id (eval_common.load_test_manifest). Confirmed 0 patients span >1 split.
- Manifest cross-split “tells” (no image reads): no case_uuid (143,449 distinct) appears in more than one split; no source_gcs_path (all 369,786 unique) appears in more than one split; no case_uuid crosses a patient_id. Single vendor (maccabi).
- Exact-content duplicates: sha256 of image bytes, test vs train+val.
- Near-duplicates: perceptual pHash + dHash (64-bit), test vs train+val, matched within a Hamming radius via (K+1)-band LSH.
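
The banded matching in the last bullet rests on a pigeonhole argument: if two 64-bit hashes differ in at most K bits and each hash is cut into K+1 bands, at least one band must be identical, so only hashes sharing a band need a full Hamming comparison. A minimal sketch of that idea follows; it is not taken from leakage_audit.py, and the function names and the dict-of-ints input format are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

HASH_BITS = 64  # pHash / dHash width used in the audit


def band_keys(h: int, n_bands: int) -> List[Tuple[int, int]]:
    """Cut a 64-bit hash into n_bands contiguous bit slices, tagged by band index."""
    base, extra = divmod(HASH_BITS, n_bands)
    keys, shift = [], 0
    for b in range(n_bands):
        width = base + (1 if b < extra else 0)  # spread leftover bits over the first bands
        keys.append((b, (h >> shift) & ((1 << width) - 1)))
        shift += width
    return keys


def near_duplicates(
    test: Dict[str, int], ref: Dict[str, int], max_hamming: int
) -> List[Tuple[str, str, int]]:
    """Return (test_id, ref_id, distance) for every pair within max_hamming bits."""
    if not 0 <= max_hamming < HASH_BITS:
        raise ValueError("max_hamming must be in [0, 63]")
    n_bands = max_hamming + 1  # pigeonhole: <= K differing bits leave >= 1 of K+1 bands intact

    # Index every reference hash under each of its band values.
    index = defaultdict(set)
    for ref_id, h in ref.items():
        for key in band_keys(h, n_bands):
            index[key].add(ref_id)

    matches = []
    for test_id, h in test.items():
        candidates = set()
        for key in band_keys(h, n_bands):
            candidates |= index.get(key, set())
        for ref_id in candidates:
            d = bin(h ^ ref[ref_id]).count("1")  # exact Hamming distance on candidates only
            if d <= max_hamming:
                matches.append((test_id, ref_id, d))
    return sorted(matches)
```

Because every true match shares at least one band, the candidate step loses no pairs within the radius; it only skips comparisons that could not have matched.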
Results (15,561-image 25-class test set vs 330,493 train+val)
| threshold | pHash (test images matched) | dHash (test images matched) |
|---|---|---|
| exact sha256 | 6 (0.04%) | — |
| perceptual ≤0 | 6 (0.04%) | 51 (0.33%) |
| ≤2 | 10 (0.06%) | 264 (1.70%) |
| ≤5 | 17 (0.11%) | 1,339 (8.61%) |
| ≤8 | 373 (2.40%) | 4,046 (26.0%) |
The two hashes disagree wildly. dHash is a false-positive machine on this data and must be ignored; pHash is the trustworthy signal. Three independent proofs:
- Label agreement. Real duplicates share a diagnosis; coincidental matches don’t. The chance that two random images share a diagnosis here is 7.8%. pHash≤0 pairs are 100% same-diagnosis (real dups); every dHash bucket sits at 4–16%, i.e. in the neighborhood of chance and nowhere near 100%, which is what random image pairs look like.
- Mechanism. dHash encodes coarse gradients; dermatology photos are mostly smooth skin (and some solid-color frames), so their dHashes collapse and collide by coincidence. Visual check (eval_output/leakage_pairs_montage.png) shows a solid orange frame (test/psoriasis) matched to a skin photo (val/eczema nummular) at dHash d=0: a pure collision.
- Visual. All 6 pHash=0 pairs are the same woman’s near-identical selfies (same seatbelt, same top), labeled seborrheic dermatitis in both splits: unmistakable duplicates.
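
The label-agreement test can be sketched as a comparison of two rates. The audit does not say how the 7.8% chance figure was computed; the sketch below assumes one plausible definition, the probability that a random test image and a random train+val image share a label. Function names and the input format are hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def chance_agreement(test_labels: Sequence[str], ref_labels: Sequence[str]) -> float:
    """P(random test image and random reference image share a label).

    Assumed definition: sum over classes c of p_test(c) * p_ref(c).
    """
    test_counts, ref_counts = Counter(test_labels), Counter(ref_labels)
    n_test, n_ref = len(test_labels), len(ref_labels)
    return sum(
        (test_counts[c] / n_test) * (ref_counts[c] / n_ref) for c in test_counts
    )


def observed_agreement(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of matched (test_label, ref_label) pairs whose labels are equal."""
    pairs = list(pairs)
    return sum(a == b for a, b in pairs) / len(pairs)
```

A hash bucket whose observed agreement is close to the chance rate behaves like random pairing; a bucket near 100% behaves like true duplicates.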
The actual leak
All 6 pHash-identical pairs are between exactly two records: patient_id 1602866
(test) and 2243477 (train). One real person enrolled twice under different patient
IDs, one copy in each split. Patient-level splitting is correct in principle but
cannot catch a person re-enrolled under a new record. Footprint: 6 images, 0.04%.
Even the most generous credible near-dup line (pHash≤5 = 17 imgs, 0.11%) is negligible.
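
The "negligible" claim can be checked with worst-case arithmetic. The sketch below bounds top-1 under two assumptions that are mine, not the audit's: either the suspect images are dropped from the test set, or they stay but every one of them flips from correct to wrong. It treats the rounded headline of 73.6% as exact.

```python
from typing import Tuple


def top1_after_removal(top1: float, n_test: int, n_suspect: int) -> Tuple[float, float]:
    """Bounds on top-1 after dropping n_suspect test images.

    Lowest: every dropped image had been classified correctly.
    Highest: every dropped image had been classified incorrectly.
    """
    correct = top1 * n_test
    remaining = n_test - n_suspect
    return (correct - n_suspect) / remaining, correct / remaining


def top1_if_flipped(top1: float, n_test: int, n_suspect: int) -> float:
    """Top-1 if all suspect images stay in the test set but turn from correct to wrong."""
    return top1 - n_suspect / n_test


if __name__ == "__main__":
    for n in (6, 17):  # exact duplicates, and the pHash<=5 line
        low, high = top1_after_removal(0.736, 15_561, n)
        flipped = top1_if_flipped(0.736, 15_561, n)
        print(n, f"{(low - 0.736) * 100:+.3f}", f"{(high - 0.736) * 100:+.3f}",
              f"{(flipped - 0.736) * 100:+.3f}")
```

For the 6 exact duplicates, removal moves top-1 by roughly -0.01 to +0.03 pt and flipping costs about 0.04 pt. For the 17 images at pHash≤5, removal moves it by roughly -0.03 to +0.08 pt and flipping costs about 0.11 pt.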
Recommendations / follow-ups
- Trust 73.6%. No rebuild. Optionally drop the ~6–17 pHash≤5 train duplicates for hygiene (won’t move the metric).
- Data-collection note (product/infra): de-duplicate patients at ingest by perceptual hash + demographics, not just patient_id. source_gcs_path and case_uuid are clean per-image, so the collision originates in upstream identity resolution.
- Where Stream A’s real value lies: not in dedup but in the next steps, namely cleanlab confident learning to surface mislabels, and taxonomy / “other”-class review via the confusion matrix. Those need GPU (out-of-fold MedSigLIP probabilities) and are the gated next task.
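
One way the ingest-time check from the data-collection note could look is sketched below. It flags sets of distinct patient IDs that share an identical perceptual hash and identical demographics. The record fields (phash, sex, birth_year) are hypothetical, since the audit does not describe the ingest schema, and exact-hash matching is a deliberate simplification.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def suspected_reenrollments(records: Iterable[Dict]) -> List[Tuple[str, ...]]:
    """Group images by (phash, sex, birth_year); report groups spanning >1 patient_id.

    Each record is a dict with hypothetical keys:
    patient_id, phash (64-bit int), sex, birth_year.
    """
    patients_by_key = defaultdict(set)
    for r in records:
        key = (r["phash"], r["sex"], r["birth_year"])
        patients_by_key[key].add(r["patient_id"])

    # Several images can flag the same group of patients; report each group once.
    flagged = {
        tuple(sorted(ids)) for ids in patients_by_key.values() if len(ids) > 1
    }
    return sorted(flagged)
```

Flagged groups would still need human review before records are merged, since identical hashes plus matching demographics are evidence of re-enrollment, not proof.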
Reproduce
cd src/training
uv run --with imagehash python scripts/eval/leakage_audit.py --max-hamming 8
# -> eval_output/leakage_audit.json, leakage_flagged_test_uuids.json