Dataset

Source Data

The training dataset was extracted from the original DermaDetect PostgreSQL database and GCS image storage, both backed up to S3.

Raw Sources

Source	Location	Contents
SQL dump	`/git/consulting/dermadetect/sql_dumps/dd_api-*.sql`	225K patient requests, 178K annotations, disease codes
Image backup	`s3://dermadetect-gcp-backup/dermadetect-20180424.appspot.com/000RN/`	543K patient photos (~150GB)

Curated Dataset

Location	Contents
`s3://dermadetect-ml-datasets/v1/manifest.parquet`	One row per image with metadata + diagnosis
`s3://dermadetect-ml-datasets/v1/splits.parquet`	Patient → train/val/test assignment
`s3://dermadetect-ml-datasets/v1/feature_schema.json`	Field types, valid values, ranges
`s3://dermadetect-ml-datasets/v1/disease_taxonomy.json`	ICD-9 mappings, synonym dictionary
`s3://dermadetect-ml-datasets/v1/stats.json`	Summary statistics
`s3://dermadetect-ml-datasets/v1/images/*.jpg`	~370K deduplicated images

ETL Pipeline

The ETL script (scripts/run_etl.py) parses the PostgreSQL dump file directly, so no database restore is needed.


uv run --isolated python scripts/run_etl.py \
    --sql-dump "/path/to/dd_api-2024_05_05_14_19_32 (3).sql" \
    --algo-python /path/to/algo_python \
    --output-dir /tmp/dermadetect-etl-output

What it does:

Parses rn_requests and dermatologist_annotation tables from the SQL dump
Joins the two tables and filters out test accounts and non-diagnosis entries
Flattens the request JSONB into per-image rows with structured metadata
Normalizes diagnosis names using the legacy DiseaseCorrecter synonym dictionary
Validates continuous fields (age capped at 0-120 years, temperature at 30-45 °C)
Computes patient-level train/val/test splits (80/10/10, stratified by diagnosis)
Writes Parquet files + JSON metadata

--algo-python is optional. If it is provided, the script imports the DiseaseCorrecter class from that path and uses it for diagnosis synonym normalization. Without it, diagnosis names are used as-is from the annotations.
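
The range validation step listed above can be illustrated with a small sketch. The ranges come from the step list; the function name and the choice to treat out-of-range values as missing are assumptions for illustration (the word "capped" could equally mean that run_etl.py clamps values to the nearest bound).

```python
from typing import Optional

# Ranges quoted in the step list above. Mapping out-of-range values to None
# is an assumption, not confirmed behavior of scripts/run_etl.py.
VALID_RANGES = {
    "age": (0.0, 120.0),
    "temperature": (30.0, 45.0),
}


def validate_continuous(field: str, value: Optional[float]) -> Optional[float]:
    """Return the value if it lies inside the field's range, else None."""
    if value is None:
        return None
    bounds = VALID_RANGES.get(field)
    if bounds is None:
        return value  # no range defined for this field, pass through
    low, high = bounds
    return value if low <= value <= high else None


print(validate_continuous("age", 34))            # in range, kept
print(validate_continuous("age", 250))           # out of range, becomes None
print(validate_continuous("temperature", 37.2))  # in range, kept
```

Keeping invalid values as missing rather than clamping avoids creating artificial spikes at the range boundaries, at the cost of more nulls downstream.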

Image Copy

After running the ETL, copy the referenced images to the curated bucket:


# Dry run first
uv run --isolated python scripts/copy_images.py \
    --manifest /tmp/dermadetect-etl-output/manifest.parquet \
    --profile dermadetect_superadmin \
    --dry-run
 
# Real copy (~1 hour for 370K images)
uv run --isolated python scripts/copy_images.py \
    --manifest /tmp/dermadetect-etl-output/manifest.parquet \
    --profile dermadetect_superadmin \
    --workers 20

The copy is idempotent: it skips images that already exist in the destination, so an interrupted run can be restarted safely.
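
To make the idempotence property concrete, here is a minimal sketch that uses in-memory dictionaries as stand-ins for the source and destination buckets. The function name and signature are hypothetical; the internals of copy_images.py are not shown in this document and may differ.

```python
from concurrent.futures import ThreadPoolExecutor


def copy_missing(keys, source, dest, workers=4, dry_run=False):
    """Copy keys that are absent from dest.

    Returns the list of keys that were copied (or would be, in a dry run).
    """
    to_copy = [key for key in keys if key not in dest]
    if dry_run:
        return to_copy  # report only, leave dest untouched

    def _copy(key):
        dest[key] = source[key]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(_copy, to_copy))  # drain the iterator so errors surface
    return to_copy


source = {"a.jpg": b"1", "b.jpg": b"2", "c.jpg": b"3"}
dest = {"a.jpg": b"1"}
keys = ["a.jpg", "b.jpg", "c.jpg"]

planned = copy_missing(keys, source, dest, dry_run=True)
first = copy_missing(keys, source, dest)
second = copy_missing(keys, source, dest)  # nothing left to copy
```

Running the function twice leaves the destination unchanged after the first pass, which is the behavior the section describes.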

Manifest Schema

Each row in manifest.parquet represents one image:

Identity

image_uuid — filename UUID from the original GCS path
case_uuid — patient request UUID (groups images from the same submission)
patient_id — numeric patient identifier (groups cases from the same person)

Diagnosis

diagnosis — primary diagnosis (normalized), e.g. “acne vulgaris”
diagnosis_icd9 — ICD-9 code
diagnoses_all — list of all diagnoses for multi-label

Metadata (raw values, not encoded)

Continuous: age, temperature, duration_days, pregnancy_week
Categorical: gender, size, shape, quantity, topography, location_primary, location_secondary, location_side, location_coverage, pain_type, hair_loss_type
Boolean (3-state): itch, pus, cough, crater, vesicle, bleeding, swelling, pregnancy, widespread_face, duration_from_birth, widespread_palm_feet, pregnancy_birth_control_pills, pain_is_pain, hair_loss_exist, location_swelling
Multi-select: color_condition, texture, color_pus, primary_locations, secondary_locations

Provenance

vendor_id — “maccabi” or “yeledoctor”
modified_at — annotation timestamp
source_gcs_path — original gs:// URL

Splits

Splits are stored separately in splits.parquet with columns patient_id and split.

Key design decision: Splits are by patient, not by case or image. A patient who submitted multiple cases will have all their data in the same split. This prevents data leakage where the model sees the same patient’s skin in both train and test.

Split	Patients	Approximate ratio
train	~71K	80%
val	~9K	10%
test	~10K	10%
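
The patient-level assignment described above can be sketched as follows. This is an illustration under assumptions: each patient is represented by a single diagnosis for stratification, and the function name, seed, and rounding rule are hypothetical rather than taken from the ETL script.

```python
import random
from collections import defaultdict


def split_patients(patient_diagnosis, ratios=(0.8, 0.1, 0.1), seed=0):
    """Map patient_id -> 'train' | 'val' | 'test', stratified by diagnosis."""
    by_diagnosis = defaultdict(list)
    for patient_id, diagnosis in sorted(patient_diagnosis.items()):
        by_diagnosis[diagnosis].append(patient_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    assignment = {}
    for diagnosis in sorted(by_diagnosis):
        patients = by_diagnosis[diagnosis]
        rng.shuffle(patients)
        n_train = int(len(patients) * ratios[0])
        n_val = int(len(patients) * ratios[1])
        # Whatever remains after train and val goes to test, so very small
        # strata may deviate from the nominal ratios.
        for i, patient_id in enumerate(patients):
            if i < n_train:
                assignment[patient_id] = "train"
            elif i < n_train + n_val:
                assignment[patient_id] = "val"
            else:
                assignment[patient_id] = "test"
    return assignment


# 20 hypothetical patients, 10 per diagnosis.
patients = {i: ("psoriasis" if i < 10 else "rosacea") for i in range(20)}
assignment = split_patients(patients)

# Images inherit the split of their patient, so one patient never
# appears in two splits.
images = [
    {"image_uuid": "a", "patient_id": 0},
    {"image_uuid": "b", "patient_id": 0},
    {"image_uuid": "c", "patient_id": 11},
]
image_split = {img["image_uuid"]: assignment[img["patient_id"]] for img in images}
assert image_split["a"] == image_split["b"]
```

Because the split is keyed on patient_id, joining it onto the manifest is a plain lookup and cannot leak a patient across splits.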

Disease Label Pipeline

From Raw Annotation to Training Label

The labeling pipeline transforms dermatologist annotations into model training labels:

Raw annotation: Dermatologists diagnose cases in the app. Each annotation contains a JSON array with disease text and an ICD-9 code (Maccabi coding system).
Normalization (ETL): The ETL script normalizes diagnosis text using a synonym dictionary (166 mappings from the legacy DiseaseCorrecter). Examples: “plaque psoriasis” → “psoriasis”, “cutaneous insect bite reactions” → “insect bite”. All text is lowercased.
Exclusion filtering: 32 non-diagnosis entries are removed during ETL (e.g. “undefined diagnosis”, “unclear image”, “test”, “diagnosis deferred”).
Label file selection: A label file (labels/common25.txt) defines which diseases the model classifies. The production model uses 25 diseases selected from the most common diagnoses in the dataset.
Multi-hot encoding: Each sample gets a label vector of size num_classes. For each diagnosis that matches a disease in the label file, the corresponding entry is set to 1.0; all other entries stay at 0.0.
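
The steps above (normalize, exclude, encode) can be sketched end to end. The dictionaries below contain only the examples quoted in this section, and LABELS is a shortened stand-in for labels/common25.txt; the function names are hypothetical and this is not the project's actual implementation.

```python
# Two of the 166 synonym mappings quoted above.
SYNONYMS = {
    "plaque psoriasis": "psoriasis",
    "cutaneous insect bite reactions": "insect bite",
}
# Four of the 32 excluded non-diagnosis entries quoted above.
EXCLUDED = {"undefined diagnosis", "unclear image", "test", "diagnosis deferred"}
# Shortened stand-in for the 25-entry label file.
LABELS = ["acne vulgaris", "insect bite", "psoriasis"]


def normalize(name):
    """Lowercase the diagnosis text, then apply the synonym dictionary."""
    name = name.lower()
    return SYNONYMS.get(name, name)


def encode(diagnoses, labels=LABELS):
    """Return a multi-hot vector with one float entry per label."""
    index = {label: i for i, label in enumerate(labels)}
    vector = [0.0] * len(labels)
    for raw in diagnoses:
        name = normalize(raw)
        if name in EXCLUDED:
            continue  # non-diagnosis entries never become labels
        if name in index:
            vector[index[name]] = 1.0
    return vector


print(encode(["Plaque Psoriasis", "Acne Vulgaris"]))  # [1.0, 0.0, 1.0]
print(encode(["unclear image"]))                      # [0.0, 0.0, 0.0]
```

A diagnosis that is valid but absent from the label list, such as "rosacea" with this shortened list, also yields an all-zeros vector.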

Handling Unmapped Diagnoses

The dataset contains 1,151 unique diseases, but models typically train on 25-40. Samples whose diagnoses all fall outside the label list are handled in one of two ways, depending on dataset.other_class:

Without “other” class (dataset.other_class: false): Unmapped samples get an all-zeros label vector. Under BCEWithLogitsLoss they act as implicit negative examples for every class.
With “other” class (dataset.other_class: true): An “other” class is appended to the label list. Unmapped samples get label["other"] = 1.0, and are downsampled to dataset.other_ratio (default 10%) of the principal sample count. This teaches the model to affirmatively distinguish “none of the above” from low confidence.
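
The “other” class mode can be sketched as below. This sketch assumes that “principal sample count” means the number of samples with at least one diagnosis in the label list; that reading, the function name, and the seeded random sampling are assumptions rather than confirmed behavior.

```python
import random


def build_with_other(samples, labels, other_ratio=0.10, seed=0):
    """Encode samples with an extra trailing "other" slot.

    samples: list of lists of already-normalized diagnosis names.
    Returns label vectors of length len(labels) + 1.
    """
    index = {label: i for i, label in enumerate(labels)}
    principal, unmapped = [], []
    for diagnoses in samples:
        vector = [0.0] * (len(labels) + 1)  # final slot is "other"
        hits = [index[d] for d in diagnoses if d in index]
        if hits:
            for i in hits:
                vector[i] = 1.0
            principal.append(vector)
        else:
            vector[-1] = 1.0
            unmapped.append(vector)

    # Keep at most other_ratio * (number of principal samples) unmapped samples.
    keep = min(len(unmapped), round(other_ratio * len(principal)))
    rng = random.Random(seed)
    return principal + rng.sample(unmapped, keep)


labels = ["psoriasis", "rosacea"]
samples = [["psoriasis"]] * 20 + [["lichen planus"]] * 10
vectors = build_with_other(samples, labels)
n_other = sum(1 for v in vectors if v[-1] == 1.0)
print(len(vectors), n_other)  # 22 2
```

With 20 in-list samples and a ratio of 10%, only 2 of the 10 unmapped samples are kept, which prevents the “other” class from dominating training.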

ICD-9 Codes

The diagnosis_icd9 field stores Maccabi-system ICD-9 codes from the original annotations. These codes are not used for training — the model trains on normalized disease name text. The codes are preserved in the manifest for reference and potential future use (e.g. mapping to ICD-10 for regulatory purposes).

Production Disease List (common25)

The 25 diseases used in production (labels/common25.txt):

acne vulgaris, eczema uns, folliculitis, hematoma uns, herpes simplex, herpes zoster, insect bite, intertrigo, keratosis pilaris, melasma, molluscum contagiosum, onychomycosis, paronychia finger, pityriasis rosea, post inflammatory hyperpigmentation, psoriasis, rosacea, seborrheic dermatitis, skin tag, tinea pedis, tinea versicolor, urticaria, verruca vulgaris, viral exanthem uns, vitiligo.

PII/PHI Notes

The manifest deliberately excludes:

HTTP headers (contained JWTs, doctor emails, IP addresses)
modified_by (doctor email)
free_text (could contain patient-identifying notes)
user_crypt (encrypted patient data)

The patient_id is an opaque numeric ID, not a national ID or name.
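
A guard like the following could enforce the exclusions listed above before the manifest is written. It is a sketch only: the function names are hypothetical, and "http_headers" is an assumed key name, since this document does not name the column that held the HTTP headers.

```python
# Field names from the exclusion list above; "http_headers" is a
# hypothetical key name, not confirmed by the source schema.
EXCLUDED_FIELDS = {"http_headers", "modified_by", "free_text", "user_crypt"}


def strip_sensitive(record):
    """Return a copy of the record without the excluded fields."""
    return {k: v for k, v in record.items() if k not in EXCLUDED_FIELDS}


def assert_no_sensitive_columns(columns):
    """Raise ValueError if any excluded field appears among the columns."""
    leaked = EXCLUDED_FIELDS.intersection(columns)
    if leaked:
        raise ValueError(f"sensitive columns present: {sorted(leaked)}")


row = {"patient_id": 42, "diagnosis": "psoriasis", "free_text": "note"}
clean = strip_sensitive(row)
assert_no_sensitive_columns(clean.keys())  # passes silently
```

Checking column names at write time is cheap and catches a sensitive field that slips in through a later schema change.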