Stream 3 — How much does the questionnaire contribute?

Round 1 preliminary investigation. Question: how much does the questionnaire (metadata) branch of the Gen2A model actually contribute to predictions? Keep it, drop it, or improve it?

All numbers are measured against the trained baseline on the same test split (per-patient test split of ~/.cache/dermadetect/v1, 25 target classes + other), using the same deterministic sampler as model_benchmark.py (seed 17). Backbone is held fixed at ResNet50 so the metadata effect is isolated from any architecture change (that’s Stream 1).

TL;DR / recommendation

KEEP the questionnaire, but PRUNE it to ~19 fields. Four experiments agree that the metadata branch is highly load-bearing, with its value concentrated almost entirely in body location + lesion morphology (texture). Exp 3 and Exp 4 are scored on the full 15,561-image test split; Exp 1 and Exp 2 use 2,000- and 1,000-image samples of that split.

Value of the questionnaire ≈ +12 top-1 pts — image-only retrains to 52.6%, full image+metadata to 64.7% (3-seed mean). Dropping it would be a major regression.
Reliance / robustness risk: shuffling metadata (wrong answers) drops top-1 65.0% → 31.3% (−33.6 pts) — the model trusts the questionnaire (esp. location) heavily, so wrong/missing answers are costly. This is the Round-2 robustness item.
Prunable for free: a 3-seed paired retrain shows dropping 12 low-value elements (age, gender, color, duration, temperature, pregnancy, hair_loss, shape, bleeding, crater, pus, swelling) costs +0.45 ± 0.70 top-1 pts, i.e. statistically indistinguishable from zero. The kept 19 fields retain essentially all the value.
Direct answers: color → no effect (drop). size → negligible (~0.2 pt). location → critical, keep.

Go for Round 2: yes — prune the dead survey elements (free UX win) and harden against missing/wrong location (metadata-dropout in training). Details below.

Experiment 1 — Permutation importance (trained baseline, no retraining)

Script: src/training/scripts/eval/permutation_importance.py. Loads the trained gen2a_port/best_model.pt, caches the ResNet50 image features once (the metadata is the only thing that changes across permutations, so the image branch runs a single pass — the whole study is GPU-light), then re-scores under shuffled metadata. N = 2000 test images, 20 shuffle seeds.


uv run --package ddtrain python scripts/eval/permutation_importance.py \
    --n 2000 --shuffle-seeds 20

Condition	top-1	top-3
Baseline (paired metadata)	65.0%	88.2%
Full metadata shuffle	31.3% ± 0.8	56.7% ± 0.6
Drop	−33.6 pts	−31.5 pts

Caveat (important): permutation importance breaks the pairing by giving each image a random other patient’s questionnaire. That is out-of-distribution and actively misleading (a foot photo paired with “face / widespread” metadata is pushed away from the right answer), so −33.6 pts overstates the value of metadata versus simply not having it. The ablation (Exp 2) gives the cleaner lower-bound number. The truth is bracketed between the two.
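The full-shuffle procedure can be sketched as follows. This is a minimal illustration, not the contents of permutation_importance.py: the toy scoring function, the array shapes, and every name in the block are assumptions made for the example. It shows the two ideas from the text: image features are computed once and reused, and only the metadata rows are permuted across samples for each shuffle seed.

```python
import numpy as np

def top1(logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    return float((logits.argmax(axis=1) == labels).mean())

def full_shuffle_importance(score_fn, img_feats, metadata, labels, n_seeds=20):
    """Score with paired metadata, then with metadata rows permuted across samples.

    img_feats is computed once and reused, so only the metadata changes per seed.
    """
    baseline = top1(score_fn(img_feats, metadata), labels)
    shuffled = []
    for seed in range(n_seeds):
        perm = np.random.default_rng(seed).permutation(len(metadata))
        shuffled.append(top1(score_fn(img_feats, metadata[perm]), labels))
    return baseline, float(np.mean(shuffled)), float(np.std(shuffled))

# Toy stand-in for the trained model (hypothetical, not Gen2A): 4 classes,
# noisy image evidence plus metadata that points at the true class.
rng = np.random.default_rng(17)
n, k = 500, 4
labels = rng.integers(0, k, size=n)
one_hot = np.eye(k)[labels]
img_feats = 0.5 * one_hot + rng.normal(size=(n, k))
metadata = one_hot.copy()

def toy_score(img, meta):
    # The toy model weights metadata more heavily than the image evidence.
    return img + 3.0 * meta

baseline, shuffled_mean, shuffled_std = full_shuffle_importance(
    toy_score, img_feats, metadata, labels
)
print(f"paired {baseline:.3f}  shuffled {shuffled_mean:.3f} +/- {shuffled_std:.3f}")
```

Because the toy model weights metadata heavily, the shuffled score falls well below what the image evidence alone would give, which is the same overstatement effect the caveat describes.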

Per-field importance (shuffle one field at a time)

No single field is critical — the top per-field top-1 drop is only ~2.9 pts:

Field	top-1 drop
primary_locations	+2.90
texture	+2.90
secondary_locations	+1.20
topography	+1.04
itch	+1.03
size	+0.76
quantity	+0.57
location_secondary	+0.57
(symptom booleans: cough, pain, swelling, vesicle…)	< 0.5

This is the classic redundant/correlated-feature signature: body-location information is spread across ~5 overlapping fields (primary_locations, secondary_locations, location_primary, location_secondary, topography), so shuffling any one of them barely hurts because the others still carry the signal. Only shuffling all metadata together (the full-shuffle row above) collapses accuracy. The signal lives in location + lesion morphology (texture); the symptom booleans contribute almost nothing individually.
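The redundancy effect is easy to reproduce on synthetic data. The sketch below is an assumption-laden toy, not the project's code: the three fields loc_a, loc_b, loc_c are hypothetical duplicates of the same location signal, and the linear scorer is invented for the example. Shuffling one copy barely moves top-1, while shuffling all three together collapses it.

```python
import numpy as np

def shuffle_fields(metadata, field_slices, fields, rng):
    """Copy metadata and permute the named fields' columns across rows.

    One permutation is shared by all named fields, so each row receives the
    chosen fields from a single other row.
    """
    out = metadata.copy()
    perm = rng.permutation(len(metadata))
    for name in fields:
        cols = field_slices[name]
        out[:, cols] = metadata[perm, cols]
    return out

rng = np.random.default_rng(17)
n, k = 1000, 4
labels = rng.integers(0, k, size=n)
one_hot = np.eye(k)[labels]
img = 0.5 * one_hot + rng.normal(size=(n, k))  # weak, noisy image evidence

# Three hypothetical fields that all encode the same location signal.
field_slices = {"loc_a": slice(0, 4), "loc_b": slice(4, 8), "loc_c": slice(8, 12)}
metadata = np.hstack([one_hot, one_hot, one_hot])

def top1(meta):
    """Top-1 accuracy of a toy linear scorer that sums the three fields."""
    logits = img + 3.0 * (meta[:, 0:4] + meta[:, 4:8] + meta[:, 8:12])
    return float((logits.argmax(axis=1) == labels).mean())

base = top1(metadata)
drop_one = base - top1(shuffle_fields(metadata, field_slices, ["loc_a"], rng))
drop_all = base - top1(shuffle_fields(metadata, field_slices, list(field_slices), rng))
print(f"drop (one field) {drop_one:.3f}   drop (all fields) {drop_all:.3f}")
```

This is why a small per-field drop should not be read as "the field is unimportant" when correlated fields exist.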

Per-class breakdown (paired vs fully-shuffled top-1)

The benefit from metadata is very uneven across classes. Classes with the biggest reliance (n ≥ 20):

Class	n	paired	shuffled	drop
tinea pedis	130	88%	18%	−70
onychomycosis	124	87%	18%	−69
tinea versicolor	70	81%	32%	−49
seborrheic dermatitis	92	64%	16%	−48
verruca vulgaris	50	68%	26%	−42
acne vulgaris	196	89%	48%	−41
intertrigo	73	59%	18%	−41
urticaria	75	60%	22%	−38
insect bite	173	69%	37%	−32
eczema uns	356	69%	44%	−25
molluscum contagiosum	141	84%	65%	−18
folliculitis	189	44%	30%	−15
psoriasis	66	17%	8%	−8

The pattern is dermatologically sensible: location is near-diagnostic for foot/nail fungal disease (tinea pedis, onychomycosis), scalp/face disease (seborrheic dermatitis), and body-fold disease (intertrigo). Classes the model is already weak at regardless (psoriasis, folliculitis, viral exanthem) lean on metadata least.

Experiment 2 — Ablation: ImageOnly vs Gen2A (equal budget)

ImageOnlyModel (already in ddmodels, identical ResNet50 backbone, no metadata branch) vs Gen2AModel, trained on the same split/seed/budget. Configs: configs/quest_full.yaml and configs/quest_imageonly.yaml — identical except model.type. Equal short budget (2 frozen + 5 finetune epochs, 40k-image subsample, seed 17) so the comparison is fair; the data subsample was needed because full epochs are CPU/decode-bound on this 12-core box. Both scored with scripts/eval/ablation_eval.py on the identical test sample.


uv run --package ddtrain python scripts/eval/ablation_eval.py \
    --full-dir trained_models/quest-full \
    --imageonly-dir trained_models/quest-imageonly --n 1000 --seed 17

Test split, identical 1000-image sample, ranked among the 25 targets (other masked):

Model	top-1	top-3
Full (image + metadata)	62.3%	87.2%
Image-only	51.9%	78.0%
Gap (value of questionnaire)	+10.4 pts	+9.2 pts

Corroborated by the trainers’ own val metrics (4k val subset, 26-class ranking): full final top-1 64.4% vs image-only 49.9% (≈14.5 pt gap). The ablation models are short-budget (40k-image subsample, 2+5 epochs), so their absolute accuracy is below the production gen2a_port baseline (69% top-1); the gap is the deliverable, and it is measured on an equal footing. It may shift a little at full convergence, but the sign and rough magnitude are solid.
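For readers unfamiliar with the "ranked among the 25 targets (other masked)" convention, here is a minimal sketch of top-k accuracy with a masked class. It is not taken from ablation_eval.py; the function name, the choice of class index 3 for "other", and the example logits are all hypothetical. It assumes every scored sample has a label among the target classes.

```python
import numpy as np

def topk_accuracy(logits, labels, k, masked_classes=()):
    """Top-k accuracy after excluding the masked classes from the ranking."""
    scores = np.array(logits, dtype=float)  # copy so the caller's array is untouched
    if len(masked_classes) > 0:
        scores[:, list(masked_classes)] = -np.inf  # masked classes can never rank
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == np.asarray(labels)[:, None]).any(axis=1)
    return float(hits.mean())

# Tiny example: 3 target classes plus a hypothetical "other" class at index 3.
logits = [
    [0.10, 0.20, 0.30, 0.90],
    [0.80, 0.10, 0.05, 0.05],
    [0.20, 0.50, 0.10, 0.60],
]
labels = [2, 0, 0]

top1_unmasked = topk_accuracy(logits, labels, k=1)
top1_masked = topk_accuracy(logits, labels, k=1, masked_classes=[3])
top2_masked = topk_accuracy(logits, labels, k=2, masked_classes=[3])
print(top1_unmasked, top1_masked, top2_masked)
```

Masking matters when comparing against numbers ranked over all 26 classes, such as the trainers' val metrics quoted above, since the two rankings are not directly comparable.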

Per-class top-1 gap (full − image-only), n ≥ 15

Class	n	full	image-only	gap
keratosis pilaris	15	60%	20%	+40
intertrigo	38	76%	47%	+29
acne vulgaris	89	89%	65%	+24
verruca vulgaris	30	53%	30%	+23
eczema uns	186	61%	44%	+18
tinea versicolor	32	81%	66%	+16
tinea pedis	66	83%	70%	+14
insect bite	91	71%	59%	+12
onychomycosis	56	93%	86%	+7
psoriasis	30	20%	13%	+7
seborrheic dermatitis	35	71%	77%	−6
folliculitis	94	43%	50%	−7
urticaria	34	41%	50%	−9
molluscum contagiosum	74	70%	80%	−9

Reconciling the two experiments: foot/nail conditions had the largest permutation drops (onychomycosis −69, tinea pedis −70) but only modest ablation gaps (+7, +14). The image alone already nails their morphology (image-only onychomycosis = 86%), so metadata’s marginal contribution there is small — yet because the model trusts location heavily, feeding it wrong location is catastrophic. Permutation measures the misleading risk; ablation measures the marginal value. Both matter. A handful of classes (urticaria, molluscum, folliculitis, seborrheic dermatitis) had slightly negative ablation gaps — the metadata branch mildly hurts them at this budget, a small de-biasing opportunity.

Experiment 3 — Per-survey-element ablation (FULL test set)

Script: src/training/scripts/eval/field_ablation.py, run on the entire 25-class test split (15,561 images) — not the 250-image LLM sample — with the production gen2a_port model. For each survey element (location, size, color, …) we re-encode every test row as if that question were left blank (the encoder maps a missing field to its “unknown” encoding, which the model has seen in training because real patients skip fields) and re-score. The top-1 drop is the marginal value of asking the question. Image features are extracted once (multi-worker), so all ablations are fast.

Method check: zeroing all metadata lands at 51.28% top-1, essentially identical to the independently retrained image-only model (51.9%) — so “blank the field” is a faithful stand-in for “remove the question,” and the per-element numbers are trustworthy.
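The "blank the field and re-score" step can be sketched like this. Everything in the block is illustrative: the two-field vocabulary, the all-zeros encoding for a missing answer, and the toy predictor are assumptions, since the real feature schema and encoder are not shown here.

```python
import numpy as np

# Hypothetical two-field schema; the real feature schema is not shown in this report.
VOCAB = {"location": ["foot", "face", "trunk"], "color": ["red", "brown"]}

def encode_row(row):
    """One-hot encode each field; a missing or unrecognized answer encodes as zeros."""
    parts = []
    for field, values in VOCAB.items():
        vec = np.zeros(len(values))
        answer = row.get(field)
        if answer in values:
            vec[values.index(answer)] = 1.0
        parts.append(vec)
    return np.concatenate(parts)

def blank_field(rows, field):
    """Return copies of the rows with one question treated as unanswered."""
    return [{**row, field: None} for row in rows]

def field_drop(predict_fn, rows, labels, field):
    """Top-1 drop (in points) when `field` is blanked for every row."""
    def top1(rs):
        x = np.stack([encode_row(r) for r in rs])
        return 100.0 * float((predict_fn(x) == np.asarray(labels)).mean())
    return top1(rows) - top1(blank_field(rows, field))

# Toy predictor that reads only the three location columns.
def toy_predict(x):
    return x[:, :3].argmax(axis=1)

rows = [
    {"location": "foot", "color": "red"},
    {"location": "face", "color": "brown"},
]
labels = [0, 1]
color_drop = field_drop(toy_predict, rows, labels, "color")
location_drop = field_drop(toy_predict, rows, labels, "location")
print(color_drop, location_drop)
```

In the toy, blanking color costs nothing because the predictor never reads it, while blanking location removes the only signal the predictor has.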

Baseline (full questionnaire): top-1 67.08%, top-3 88.63%.

Drop when each element is removed alone (others kept), safest-to-remove first

Element	top-1 drop (pts)	verdict
age	−0.06	remove
color	+0.00	remove — no effect
duration	+0.00	remove
pregnancy	+0.01	remove
temperature	+0.02	remove
widespread	+0.04	remove
crater	+0.05	remove
hair_loss	+0.05	remove
bleeding	+0.06	remove
gender	+0.09	remove
shape	+0.10	remove
pus	+0.15	remove
quantity	+0.20	borderline
size	+0.24	borderline — barely matters
topography	+0.30	keep-ish
swelling	+0.37	keep-ish
pain	+0.42	keep-ish
vesicle	+0.60	keep
cough	+0.61	keep
itch	+0.68	keep
texture	+1.98	keep
location	+10.68	keep — essential

Greedy backward elimination (drop the cheapest remaining element each step)

Cumulative top-1 stays on a flat plateau through the first ~11 elements (≈0 cost), then falls as the cumulative drop climbs:

#	dropped	cum top-1	cum drop (pts)
1–11	age, color, duration, temperature, pregnancy, gender, hair_loss, shape, bleeding, crater, pus	66.99%	+0.09
12	swelling	66.77%	+0.31
13	vesicle	66.54%	+0.55 ← threshold
18	size	65.18%	+1.91
20	topography	62.97%	+4.12
21	texture	60.87%	+6.21
22	location	51.28%	+15.80

Verdict — safe to remove 12 elements for ~0.3 top-1 pts total: age, gender, color, duration, temperature, pregnancy, hair_loss, shape, bleeding, crater, pus, swelling. Keep: location (essential, ~11 pts alone), texture (~2 pts), then topography, itch, cough, vesicle, pain, quantity, size (each small but additive).
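The greedy backward elimination loop is simple enough to sketch. This is not field_ablation.py: the function and its evaluate callback are hypothetical, and the toy evaluator assumes drops are purely additive, which the real model does not guarantee. The four single-element drops and the 67.08% baseline are taken from the Exp 3 table above purely as example inputs.

```python
def greedy_backward_elimination(evaluate, elements):
    """Repeatedly remove the element whose removal leaves accuracy highest.

    evaluate(removed) must return top-1 accuracy with the `removed` set blanked.
    Returns a list of (element, cumulative top-1, cumulative drop) per step.
    """
    removed = frozenset()
    baseline = evaluate(removed)
    remaining = list(elements)
    history = []
    while remaining:
        cheapest = max(remaining, key=lambda e: evaluate(removed | {e}))
        remaining.remove(cheapest)
        removed = removed | {cheapest}
        acc = evaluate(removed)
        history.append((cheapest, acc, baseline - acc))
    return history

# Toy evaluator: single-element drops from the table, ASSUMED additive.
SINGLE_DROP = {"location": 10.68, "texture": 1.98, "size": 0.24, "color": 0.00}

def toy_evaluate(removed):
    return 67.08 - sum(SINGLE_DROP[e] for e in removed)

history = greedy_backward_elimination(toy_evaluate, SINGLE_DROP)
order = [name for name, _, _ in history]
for name, acc, drop in history:
    print(f"{name:10s} cum top-1 {acc:6.2f}%  cum drop {drop:+.2f}")
```

With a real evaluator each step re-scores the model, so interactions between elements are captured; the additive toy cannot show them.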

Direct answers: color → no effect, safe to drop. size → barely matters (~0.2 pt), borderline. location → critical, keep.

Caveat: “the model doesn’t use it” ≠ “clinically useless” — age/gender may matter for triage, safety, or future classes, so dropping from the model input is safe but dropping from the patient survey is a product call.

Experiment 4 — Retrain confirmation (trimmed survey, FULL test set)

The Exp 3 numbers are eval-time ablations on a model that was trained with all fields. The decision-relevant number is what happens if we stop collecting the 12 elements and retrain. To measure it, I trained quest-trimmed on a trimmed feature schema (19 fields, encoder 159-d vs 193-d) at the identical budget to quest-full (40k subsample, 2+5 epochs, seed 17; configs quest_trimmed.yaml + trimmed_feature_schema.json, new dataset.feature_schema_path knob). Both were scored with score_full_test.py on the full 15,561-image test split.

A first single-seed pair (seed 17) showed a +1.39 pt gap, which looked like a real cost. But a 3-seed paired repeat (seeds 17/18/19, full + trimmed each, score_full_test.py on the full test split) shows that was run-to-run noise:

seed	full top-1	trimmed top-1	gap
17	64.55%	63.16%	+1.39
18	64.78%	64.52%	+0.26
19	64.76%	65.05%	−0.29
mean	64.70 ± 0.11	64.24 ± 0.80	+0.45 ± 0.70

Gap = +0.45 ± 0.70 pts top-1: statistically indistinguishable from zero, and trimmed beat full at seed 19. The seed-17 trimmed run was simply unlucky (a low outlier; its val loss plateaued an epoch early), which is why the first pair looked like a 1.4 pt cost. The full runs are tight (±0.11); the trimmed runs are noisier (±0.80) but centered only 0.45 pts below.
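The paired-gap arithmetic can be checked directly from the table. The snippet below uses only the six rounded top-1 values shown above; the variable names are illustrative. With those inputs, the population standard deviation (NumPy's default, ddof=0) reproduces the quoted ±0.70 for the gap and ±0.80 for the trimmed runs. The sample standard deviation (ddof=1) is larger, which is worth keeping in mind with only three seeds.

```python
import numpy as np

# Top-1 (%) per seed, copied from the table above (seeds 17, 18, 19).
full = np.array([64.55, 64.78, 64.76])
trimmed = np.array([63.16, 64.52, 65.05])

gap = full - trimmed                # paired difference per seed
gap_mean = gap.mean()
gap_std_pop = gap.std()             # ddof=0, population standard deviation
gap_std_sample = gap.std(ddof=1)    # ddof=1, sample standard deviation
trimmed_std_pop = trimmed.std()

print(f"gap mean {gap_mean:+.2f}")
print(f"gap std  {gap_std_pop:.2f} (ddof=0), {gap_std_sample:.2f} (ddof=1)")
print(f"trimmed std {trimmed_std_pop:.2f} (ddof=0)")
```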

Conclusion: dropping the 12 elements is effectively free — the retrain now confirms the Exp-3 eval-ablation rather than contradicting it. (Image-only reference: 52.55% top-1, so the kept 19 fields retain essentially all ~12 pts of the questionnaire’s value.) Lesson: never trust a single training pair for a sub-1-pt effect — the run-to-run σ here is ~0.6 pt.

Recommendation for Round 2

Keep the questionnaire. The equal-budget ablation (Exp 2) puts its value at ~+10 top-1 pts (and ~+9 top-3), and the full-test-set retrains (Exp 4) at ~+12 top-1 pts, against a 69% production baseline: a large fraction of total model performance. Dropping it “to simplify the app” would be a major accuracy regression. Not a candidate for removal.
Prune the 12 dead elements — it’s effectively free (confirmed, Exp 4). A 3-seed retrain puts the cost of dropping all 12 at +0.45 ± 0.70 pt (indistinguishable from zero); the kept 19 fields retain essentially all the questionnaire’s value, which is concentrated in body location + lesion morphology (texture). Safe to drop from the survey UX to cut patient burden: age, gender, color, duration, temperature, pregnancy, hair_loss, shape, bleeding, crater, pus, swelling. (Clinical caveat unchanged: removing from the model is free, but age/gender may matter for triage/safety/future classes.)
Robustness: the biggest risk. The model trusts location so much that shuffled (wrong) metadata costs ~34 top-1 pts (Exp 1). In production, location is often missing/unknown (encoder maps unknown → zeros) or mis-entered by the user. Quantify the missing-/wrong-metadata penalty and consider metadata-dropout during training so the image branch stays strong when the questionnaire is absent. ~1 day; no new architecture.
Feeds the Eczema-overprediction work (Round 2 #5). The negative-gap classes (urticaria, molluscum, folliculitis) show the metadata branch can mildly mislead; eczema itself gains +18 from metadata, so metadata is part of the eczema-bias story.
Effort: all of the above is measurement + light training, ~2–3 days total, no backbone change (that’s Stream 1). The permutation_importance.py / ablation_eval.py tooling built here is reusable for the per-field-prune and missing-metadata studies.
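One possible shape for the metadata-dropout idea in the robustness item, as a hedged sketch rather than a design: with probability p, a training sample's whole metadata vector is replaced by zeros, matching the "unknown → zeros" encoding mentioned above. The function name, the rate p = 0.3, and the whole-vector granularity are all assumptions; per-field dropout would be an equally reasonable variant.

```python
import numpy as np

def metadata_dropout(metadata, p, rng):
    """Zero the entire metadata vector for a random fraction p of samples.

    Returns the augmented batch and the boolean mask of dropped rows.
    The input batch is not modified.
    """
    drop = rng.random(len(metadata)) < p
    out = metadata.copy()
    out[drop] = 0.0
    return out, drop

# Example batch: 1000 samples with an 8-dimensional encoded questionnaire.
rng = np.random.default_rng(17)
batch = np.ones((1000, 8))
augmented, dropped = metadata_dropout(batch, p=0.3, rng=rng)
print(f"dropped {dropped.mean():.1%} of questionnaires in this batch")
```

Applied per batch during training only, this would expose the model to image-only inputs regularly, so the image branch cannot lean entirely on location.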

Notes / blockers

Infra: training on this box is CPU/JPEG-decode bound (GPU sits at ~0% util), not GPU bound. Full 128k-image epochs are ~50 min each with 4 workers. For faster, fair iteration I subsampled to 40k train images and used 10 dataloader workers. A persistent decoded-tensor cache or an NVDEC/DALI pipeline would massively speed up all three streams — worth flagging to Dave.
Eval tooling base: branched from main, which did not yet contain the PR #102 eval tooling (it lives on derm-label-vetting-and-llm-benchmark); merged that commit into this worktree branch to build on model_benchmark.py/eval_common.py.
Small reusable additions: permutation_importance.py, ablation_eval.py, and a dataset.max_train_samples/max_val_samples config knob for fast fair ablations.