2026-06-14: Test-Set Evaluation Tooling (Label Vetting + LLM Benchmark)
Two new evaluation tools under src/training/scripts/eval/, both drawing from the
curated v1 test split (~/.cache/dermadetect/v1) restricted to the model’s 25-class
label set (labels/common25.txt). Train/val/test splits are assigned per patient, so
the join is on patient_id to avoid leakage. All sampling is seeded (default 17) for
reproducibility. Outputs land in src/training/eval_output/ (gitignored).
Shared module: eval_common.py
Loads the label list, joins manifest.parquet with the test rows of
splits.parquet, resolves image UUIDs to images/<uuid>.jpg on disk, and provides
two seeded samplers: sample_per_class (skips rows whose image is missing so the PDF
never shows a broken image) and sample_random.
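A minimal sketch of how a seeded per-class sampler with the missing-image skip could look. This is not the code in eval_common.py: the row keys ("label", "image_path") and the image_exists hook are hypothetical, and rows are plain dicts here rather than the joined parquet tables.

```python
import random
from collections import defaultdict
from pathlib import Path


def sample_per_class(rows, per_class, seed=17, image_exists=None):
    """Pick up to `per_class` rows per label, skipping rows with no image.

    `rows` is a list of dicts with hypothetical keys "label" and "image_path".
    """
    if image_exists is None:
        image_exists = lambda p: Path(p).is_file()
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    by_label = defaultdict(list)
    for row in rows:
        if image_exists(row["image_path"]):  # drop rows whose image is missing
            by_label[row["label"]].append(row)
    picked = []
    for label in sorted(by_label):  # sorted so the draw order is stable
        candidates = by_label[label]
        picked.extend(rng.sample(candidates, min(per_class, len(candidates))))
    return picked
```

Filtering before sampling (rather than after) keeps each class at its full quota whenever enough valid images exist.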
Deliverable 1: dermatologist label-vetting packet — build_vetting_pdf.py
Produces two files for a dermatologist to vet our ground-truth labels:
- label_vetting.pdf: one section per diagnosis (25), with up to 5 sample test images each in a 2-up grid. Every image is captioned with a unique identifier (V0001…) plus a short clinical-metadata line (age in years derived from the days-encoded age field, gender, primary location). 125 images total.
- label_vetting.xlsx: one pre-filled row per image with the columns Diagnosis, Identifier, Notes. The reviewer adds a note in Notes for any image whose listed diagnosis they disagree with. Identifiers match the PDF 1:1.
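The identifier and caption line could be built along these lines. This is a sketch only: the function names, the caption layout, and the 365.25 days-per-year conversion are assumptions, not taken from build_vetting_pdf.py.

```python
def age_years(age_days):
    """Convert a days-encoded age to whole years (assumes 365.25 days/year)."""
    return int(age_days // 365.25)


def make_caption(index, age_days, gender, location):
    """Return (identifier, caption) for the index-th sampled image, 1-based."""
    identifier = f"V{index:04d}"  # zero-padded, e.g. V0001
    caption = f"{identifier} | {age_years(age_days)} y, {gender}, {location}"
    return identifier, caption
```

Using the same identifier string for the PDF caption and the spreadsheet row is what keeps the two files aligned 1:1.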
Run: uv run --package ddtrain python src/training/scripts/eval/build_vetting_pdf.py
Deliverable 2: vision-LLM benchmark — llm_benchmark.py
Samples 250 random test images (across the 25 labels, so it follows the natural class
distribution) and sends each to Claude Opus 4.8 (claude-opus-4-8) as a base64
image with a prompt constraining the answer to the 25-label set. Uses adaptive
thinking and structured outputs (output_config.format with a json_schema whose
diagnosis field is an enum of the 25 labels) to get a ranked top-3 with confidence
percentages that sum to ≤100. Requests run concurrently via a thread pool (SDK handles
429/5xx retry). Scores top-1 and top-3 accuracy and prints them next to the
gen2a_port baseline (top-1 66.7%, top-3 87.7% on val).
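As an illustration of the enum-constrained schema, the sketch below builds a JSON Schema dict and checks a parsed response on the client side. Only the diagnosis enum comes from the description above; the "predictions" and "confidence" field names and both function names are hypothetical, and the API call itself is deliberately left out.

```python
def build_top3_schema(labels):
    """JSON Schema whose diagnosis field may only take one of `labels`."""
    return {
        "type": "object",
        "properties": {
            "predictions": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "diagnosis": {"type": "string", "enum": list(labels)},
                        "confidence": {"type": "number"},
                    },
                    "required": ["diagnosis", "confidence"],
                    "additionalProperties": False,
                },
            }
        },
        "required": ["predictions"],
        "additionalProperties": False,
    }


def ranked_diagnoses(payload, labels):
    """Check the constraints the schema cannot express; return ranked labels."""
    preds = payload["predictions"]
    if not 1 <= len(preds) <= 3:
        raise ValueError("expected 1 to 3 predictions")
    names = [p["diagnosis"] for p in preds]
    if len(set(names)) != len(names) or any(n not in labels for n in names):
        raise ValueError("diagnoses must be distinct members of the label set")
    if sum(p["confidence"] for p in preds) > 100:
        raise ValueError("confidences must sum to at most 100")
    return names
```

The sum-to-at-most-100 rule is checked in code here because a plain JSON Schema has no way to constrain a sum across array items.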
Constraining to our label set (rather than open-ended + fuzzy matching) makes scoring
directly comparable to the trained model. Per-image results are written to
llm_benchmark_results.jsonl and aggregate metrics to llm_benchmark_metrics.json.
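Top-1 and top-3 accuracy over the per-image records might be computed roughly as follows. The record keys ("truth", "predictions") are hypothetical, as is the choice to drop failed images from the denominator; that choice is assumed here because the results are reported on 249 of 250 images.

```python
import json


def load_results(lines):
    """Parse JSONL lines into a list of records, ignoring blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def topk_accuracy(records, k):
    """Return (accuracy, n_scored) over records that have predictions."""
    scored = [r for r in records if r.get("predictions")]
    if not scored:
        return 0.0, 0
    # A hit means the true label appears in the first k ranked predictions.
    hits = sum(r["truth"] in r["predictions"][:k] for r in scored)
    return hits / len(scored), len(scored)
```

Because every prediction is drawn from the same label set as the ground truth, the hit test is an exact string comparison with no fuzzy matching.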
Requires ANTHROPIC_API_KEY (auto-loaded from a repo .env if present, otherwise read
from the process environment). Preview without calling the API via --dry-run.
Run: uv run --package ddtrain python src/training/scripts/eval/llm_benchmark.py
Rate limiting: the org limit on claude-opus-4-8 is 50 requests/minute. A first
unthrottled run (8 concurrent workers) got 110/250 rate-limit errors. The script now
has a RateLimiter (default --rpm 45) that spaces request starts under the ceiling,
plus max_retries=8 as a safety net. A clean 250-image run takes ~6 min.
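A limiter that spaces request starts can be sketched as below. This is an illustration of the idea, not the RateLimiter in llm_benchmark.py: the constructor arguments and the injectable clock and sleep hooks are assumptions, added so the behavior can be checked without real waiting.

```python
import threading
import time


class RateLimiter:
    """Space request starts at least 60/rpm seconds apart across threads."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self._interval = 60.0 / rpm
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_start = clock()

    def acquire(self):
        """Block until this caller's reserved start slot arrives."""
        with self._lock:
            now = self._clock()
            start = max(now, self._next_start)
            # Reserve the following slot before releasing the lock.
            self._next_start = start + self._interval
        delay = start - now
        if delay > 0:
            self._sleep(delay)  # sleep outside the lock so others can queue
```

Each worker would call acquire() just before sending its request. At 45 requests/minute the slots are 60/45 ≈ 1.33 s apart, so 250 requests occupy about 333 s, in line with the ~6 min run time.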
Image robustness: a few files carry a .jpg extension but non-JPEG bytes, which
the API rejects. _classify_one re-encodes through PIL to valid JPEG (also matching
the RGB load the local model does).
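To find such files ahead of a run, the leading bytes can be compared against the JPEG start-of-image marker. The sketch below only detects mislabeled files and is not part of the script, which re-encodes through PIL instead; both function names are hypothetical.

```python
from pathlib import Path

# Every JPEG stream starts with the SOI marker FF D8, followed by FF.
JPEG_SOI = b"\xff\xd8\xff"


def looks_like_jpeg(data):
    """True if the byte string starts with the JPEG start-of-image marker."""
    return data[:3] == JPEG_SOI


def find_mislabeled_jpegs(image_dir):
    """Return names of *.jpg files whose leading bytes are not JPEG."""
    bad = []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        with path.open("rb") as fh:
            if not looks_like_jpeg(fh.read(3)):
                bad.append(path.name)
    return bad
```

A header check like this is cheap but shallow: it flags a PNG saved as .jpg, yet it will not catch a truncated or otherwise corrupt JPEG.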
Deliverable 3: same-images Gen2A scorer — model_benchmark.py
Scores trained_models/gen2a_port/best_model.pt on the identical 250 images
(same seed + sampler), so the two systems are compared like-for-like instead of
against the stored val metrics. Loads the architecture from the saved config.json, the
26-label set (labels.txt, 25 targets + “other”) and feature_schema.json, runs batched
GPU inference, and reports top-1/top-3 two ways: ranked among the 25 targets (headline,
matches the LLM task) and ranked among all 26 (secondary).
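The difference between the two rankings can be shown with a small sketch. It assumes the model emits one score per label and that the 25-target ranking simply ignores the "other" score; whether model_benchmark.py masks or renormalizes is not stated above, and topk_labels is a hypothetical helper.

```python
def topk_labels(scores, labels, k, exclude=()):
    """Return the k highest-scoring label names, skipping any in `exclude`."""
    ranked = sorted(
        (i for i, name in enumerate(labels) if name not in exclude),
        key=lambda i: scores[i],
        reverse=True,
    )
    return [labels[i] for i in ranked[:k]]


# Toy example with 3 targets plus "other" (the real model has 25 + 1).
labels = ["acne", "eczema", "psoriasis", "other"]
scores = [0.1, 0.3, 0.2, 0.4]
among_all = topk_labels(scores, labels, k=1)
among_targets = topk_labels(scores, labels, k=1, exclude=("other",))
```

In this toy case the top-1 among all labels is "other", while the top-1 among targets is "eczema", which is why the two rankings can report different accuracies on the same images.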
Run: uv run --package ddtrain python src/training/scripts/eval/model_benchmark.py
Results (250 random test images, seed 17, natural class frequency)
| System | top-1 | top-3 |
|---|---|---|
| Gen2A (gen2a_port, ranked among 25) | 64.4% | 88.8% |
| Claude Opus 4.8 (effort=high, 25-label constrained) | 49.0% | 72.7% |
(The LLM was scored on 249/250 images; one image had non-JPEG bytes in the run that produced these numbers, which was fixed in the script afterward.) Gen2A’s same-images numbers (64.4 / 88.8) track its stored val baseline (66.7 / 87.7), which supports the harness scoring correctly. On this in-distribution test set the trained model clearly outperforms the general-purpose vision LLM; the two systems picked the same top-1 label on 52% of images.
Dependencies
Added an eval optional-dependency group to src/training/pyproject.toml:
reportlab, openpyxl, anthropic. Install with uv sync --all-packages --extra eval.