2026-06-14: Test-Set Evaluation Tooling (Label Vetting + LLM Benchmark)

Three new evaluation tools under src/training/scripts/eval/, all drawing from the curated v1 test split (~/.cache/dermadetect/v1) restricted to the model’s 25-class label set (labels/common25.txt). Train/val/test splits are assigned per patient, so the join is on patient_id to avoid leakage. All sampling is seeded (default 17) for reproducibility. Outputs land in src/training/eval_output/ (gitignored).

Shared module: `eval_common.py`

Loads the label list, joins manifest.parquet with the test rows of splits.parquet, resolves image UUIDs to images/<uuid>.jpg on disk, and provides two seeded samplers: sample_per_class (skips rows whose image is missing so the PDF never shows a broken image) and sample_random.
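A minimal sketch of what a seeded per-class sampler along these lines could look like. The row shape (dicts with a "label" key), the `image_exists` callback, and the parameter names are assumptions for illustration, not the actual signature in eval_common.py.

```python
import random
from collections import defaultdict


def sample_per_class(rows, labels, per_class=5, seed=17, image_exists=lambda row: True):
    """Pick up to `per_class` rows for each label, skipping rows with no image.

    `rows` is assumed to be an iterable of dicts with a "label" key;
    `image_exists` is a hypothetical callback that checks the file on disk.
    """
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    by_label = defaultdict(list)
    for row in rows:
        if row["label"] in labels and image_exists(row):
            by_label[row["label"]].append(row)

    picked = {}
    for label in labels:  # iterate in label-file order for stable output
        pool = by_label[label]
        picked[label] = rng.sample(pool, min(per_class, len(pool)))
    return picked
```

Filtering out missing images before sampling, rather than after, means a class with enough usable images still fills its quota.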

Deliverable 1: dermatologist label-vetting packet — `build_vetting_pdf.py`

Produces two files for a dermatologist to vet our ground-truth labels:

label_vetting.pdf — one section per diagnosis (25), up to 5 sample test images each in a 2-up grid. Every image is captioned with a unique identifier (V0001…) plus a short clinical-metadata line (age in years derived from the days-encoded age field, gender, primary location). 125 images total (25 × 5) when every class has five usable test images.
label_vetting.xlsx — one pre-filled row per image: Diagnosis, Identifier, Notes. The reviewer adds a note in Notes for any image whose listed diagnosis they disagree with. Identifiers match the PDF 1:1.
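The caption logic can be sketched roughly as follows. The 365.25-day year, the separator, and the function names are assumptions; only the V0001-style identifier and the three metadata fields come from the description above.

```python
def age_years(age_days):
    """Convert a days-encoded age to whole years (assumes 365.25-day years)."""
    return int(age_days // 365.25)


def caption(index, age_days, gender, location):
    """Build a caption such as 'V0001 | 10 y, F, forearm' (layout is illustrative)."""
    return f"V{index:04d} | {age_years(age_days)} y, {gender}, {location}"
```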

Run: uv run --package ddtrain python src/training/scripts/eval/build_vetting_pdf.py

Deliverable 2: vision-LLM benchmark — `llm_benchmark.py`

Samples 250 random test images across the 25 labels, so the sample follows the natural class distribution, and sends each to Claude Opus 4.8 (claude-opus-4-8) as a base64 image with a prompt constraining the answer to the 25-label set. Uses adaptive thinking and structured outputs (output_config.format with a json_schema whose diagnosis field is an enum of the 25 labels) to get a ranked top-3 with confidence percentages that sum to at most 100. Requests run concurrently via a thread pool; the SDK retries 429 and 5xx responses. Scores top-1 and top-3 accuracy and prints them next to the gen2a_port baseline (top-1 66.7%, top-3 87.7% on val).
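As an illustration of the enum-constrained schema and the checks on the reply, here is a sketch that makes no API call. The `predictions` and `confidence` field names and both function names are hypothetical; only the `diagnosis` enum is taken from the description. The count and sum checks are done in Python here, which is a choice of this sketch rather than a statement about what the script does.

```python
import json


def build_schema(labels):
    """JSON Schema for a ranked list of predictions; `diagnosis` is limited to `labels`."""
    item = {
        "type": "object",
        "properties": {
            "diagnosis": {"type": "string", "enum": list(labels)},
            "confidence": {"type": "number"},  # percentage, 0 to 100
        },
        "required": ["diagnosis", "confidence"],
        "additionalProperties": False,
    }
    return {
        "type": "object",
        "properties": {"predictions": {"type": "array", "items": item}},
        "required": ["predictions"],
        "additionalProperties": False,
    }


def parse_top3(raw_json, labels):
    """Return up to three diagnoses, highest confidence first, or raise ValueError."""
    preds = json.loads(raw_json)["predictions"]
    names = [p["diagnosis"] for p in preds]
    if not 1 <= len(preds) <= 3:
        raise ValueError("expected 1 to 3 predictions")
    if len(set(names)) != len(names) or any(n not in labels for n in names):
        raise ValueError("diagnoses must be distinct members of the label set")
    if sum(p["confidence"] for p in preds) > 100 + 1e-6:
        raise ValueError("confidences must sum to at most 100")
    ranked = sorted(preds, key=lambda p: p["confidence"], reverse=True)
    return [p["diagnosis"] for p in ranked]
```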

Constraining to our label set (rather than open-ended + fuzzy matching) makes scoring directly comparable to the trained model. Per-image results are written to llm_benchmark_results.jsonl and aggregate metrics to llm_benchmark_metrics.json.
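Because both systems answer within the same label set, scoring reduces to an exact string match. A sketch under assumed record fields ("truth", "ranked"), which are hypothetical and need not match the columns in llm_benchmark_results.jsonl:

```python
def top_k_accuracy(results, k):
    """Return (accuracy, n_scored) for top-k exact-match scoring.

    Each result is assumed to be a dict with "truth" (the label) and
    "ranked" (a best-first list of predicted labels, or None if the
    request failed). Failed images are left out of the denominator.
    """
    scored = [r for r in results if r["ranked"]]
    if not scored:
        return 0.0, 0
    hits = sum(1 for r in scored if r["truth"] in r["ranked"][:k])
    return hits / len(scored), len(scored)
```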

Requires ANTHROPIC_API_KEY, which is auto-loaded from a repo .env if present and otherwise read from the environment. Preview without calling the API via --dry-run. Run: uv run --package ddtrain python src/training/scripts/eval/llm_benchmark.py

Rate limiting: the org limit on claude-opus-4-8 is 50 requests/minute. A first unthrottled run (8 concurrent workers) hit rate-limit errors on 110 of 250 requests. The script now has a RateLimiter (default --rpm 45) that spaces request starts to stay under the ceiling, plus max_retries=8 as a safety net. A clean 250-image run takes ~6 min.
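One way to space request starts across worker threads is to hand out start slots under a lock and sleep outside it. This is a sketch of that idea, not the script's actual RateLimiter; the `acquire` method and the injectable `clock` and `sleep` parameters are assumptions added to keep it testable.

```python
import threading
import time


class RateLimiter:
    """Spaces request starts at least 60/rpm seconds apart across threads."""

    def __init__(self, rpm, clock=time.monotonic, sleep=time.sleep):
        self._interval = 60.0 / rpm
        self._lock = threading.Lock()
        self._next_start = None  # earliest time the next request may start
        self._clock = clock
        self._sleep = sleep

    def acquire(self):
        """Block until this caller's start slot arrives."""
        with self._lock:
            now = self._clock()
            start = now if self._next_start is None else max(now, self._next_start)
            self._next_start = start + self._interval  # reserve the following slot
        delay = start - now
        if delay > 0:
            self._sleep(delay)  # sleep outside the lock so other threads can reserve
```

Each worker would call `acquire()` immediately before sending its request. At 45 rpm the interval is about 1.33 s, so 250 request starts span roughly 5.5 minutes, consistent with the ~6 min run time.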

Image robustness: a few files carry a .jpg extension but non-JPEG bytes, which the API rejects. _classify_one re-encodes through PIL to valid JPEG (also matching the RGB load the local model does).
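The re-encoding step can be sketched as below, assuming Pillow is installed. The function names are hypothetical and the real _classify_one may differ, for example in JPEG quality settings.

```python
import base64
import io

from PIL import Image


def to_jpeg_bytes(data):
    """Decode image bytes in any format Pillow reads and re-encode as RGB JPEG."""
    with Image.open(io.BytesIO(data)) as img:
        buf = io.BytesIO()
        # convert("RGB") drops alpha and palette modes, which JPEG cannot store
        img.convert("RGB").save(buf, format="JPEG")
    return buf.getvalue()


def to_base64_jpeg(data):
    """Return the re-encoded JPEG as an ASCII base64 string for the request body."""
    return base64.b64encode(to_jpeg_bytes(data)).decode("ascii")
```

Re-encoding every file, rather than only the ones that fail a format check, keeps the code path uniform at the cost of one extra decode per image.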

Deliverable 3: same-images Gen2A scorer — `model_benchmark.py`

Scores trained_models/gen2a_port/best_model.pt on the identical 250 images (same seed + sampler), so the comparison is like-for-like on the same images rather than against the stored val metrics. Loads the architecture from the saved config.json, the 26-label set (labels.txt, 25 targets + “other”) and feature_schema.json, runs batched GPU inference, and reports top-1/top-3 both ranked among the 25 targets (headline, matches the LLM task) and among all 26 (secondary).
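Ranking among the 25 targets amounts to dropping the "other" class before taking the top-k. A framework-free sketch, assuming one score per label in labels.txt order; the function name and `exclude` parameter are hypothetical:

```python
def top_k_among_targets(scores, labels, k=3, exclude=("other",)):
    """Return the k highest-scoring labels, ignoring any label in `exclude`.

    `scores` is assumed to hold one number per entry of `labels`
    (logits or probabilities; only their order matters here).
    """
    candidates = [i for i, name in enumerate(labels) if name not in exclude]
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in ranked[:k]]
```

Passing `exclude=()` gives the secondary ranking over all 26 labels.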

Run: uv run --package ddtrain python src/training/scripts/eval/model_benchmark.py

Results (250 random test images, seed 17, natural class frequency)

| System | top-1 | top-3 |
|---|---|---|
| Gen2A (`gen2a_port`, ranked among 25) | 64.4% | 88.8% |
| Claude Opus 4.8 (effort=high, 25-label constrained) | 49.0% | 72.7% |

(LLM scored 249/250; one image had non-JPEG bytes in the run that produced these numbers, which was fixed in the script afterward.) Gen2A’s same-images numbers (64.4 / 88.8) track its stored val baseline (66.7 / 87.7), which suggests the harness reproduces the model's expected behavior. On this in-distribution test set the trained model clearly outperforms the general-purpose vision LLM; top-1 agreement between the two systems was 52%.

Dependencies

Added an eval optional-dependency group to src/training/pyproject.toml: reportlab, openpyxl, anthropic. Install with uv sync --all-packages --extra eval.
