Dataset
Source Data
The training dataset was extracted from the original DermaDetect PostgreSQL database and GCS image storage, both backed up to S3.
Raw Sources
| Source | Location | Contents |
|---|---|---|
| SQL dump | /git/consulting/dermadetect/sql_dumps/dd_api-*.sql | 225K patient requests, 178K annotations, disease codes |
| Image backup | s3://dermadetect-gcp-backup/dermadetect-20180424.appspot.com/000RN/ | 543K patient photos (~150GB) |
Curated Dataset
| Location | Contents |
|---|---|
| s3://dermadetect-ml-datasets/v1/manifest.parquet | One row per image with metadata + diagnosis |
| s3://dermadetect-ml-datasets/v1/splits.parquet | Patient → train/val/test assignment |
| s3://dermadetect-ml-datasets/v1/feature_schema.json | Field types, valid values, ranges |
| s3://dermadetect-ml-datasets/v1/disease_taxonomy.json | ICD-9 mappings, synonym dictionary |
| s3://dermadetect-ml-datasets/v1/stats.json | Summary statistics |
| s3://dermadetect-ml-datasets/v1/images/*.jpg | ~370K deduplicated images |
ETL Pipeline
The ETL script (scripts/run_etl.py) reads the PostgreSQL dump file directly — no database restore needed.
    uv run --isolated python scripts/run_etl.py \
      --sql-dump "/path/to/dd_api-2024_05_05_14_19_32 (3).sql" \
      --algo-python /path/to/algo_python \
      --output-dir /tmp/dermadetect-etl-output

What it does:
- Parses rn_requests and dermatologist_annotation tables from the SQL dump
- Joins them, filters out test accounts and non-diagnoses
- Flattens the request JSONB into per-image rows with structured metadata
- Normalizes diagnosis names using the legacy DiseaseCorrecter synonym dictionary
- Validates continuous fields (age capped at 0-120, temperature at 30-45)
- Computes patient-level train/val/test splits (80/10/10, stratified by diagnosis)
- Writes Parquet files + JSON metadata
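The range validation step could look roughly like the sketch below. It assumes that "capped" means clamping a value into the documented range; the real scripts/run_etl.py may instead null or drop out-of-range values. The names VALID_RANGES and cap_continuous are hypothetical and do not come from the ETL script.

```python
from typing import Optional

# Ranges quoted in the ETL description above; the dict name is hypothetical.
VALID_RANGES = {
    "age": (0.0, 120.0),
    "temperature": (30.0, 45.0),
}


def cap_continuous(field: str, value: Optional[float]) -> Optional[float]:
    """Clamp `value` into the documented range for `field`.

    Missing values (None) pass through unchanged. Clamping is an
    assumption about what "capped" means in the pipeline description.
    """
    if value is None:
        return None
    low, high = VALID_RANGES[field]
    return max(low, min(high, float(value)))
```

For example, cap_continuous("age", 150) yields 120.0, while an in-range temperature such as 36.6 is returned unchanged.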
--algo-python is optional. If provided, the ETL script imports the DiseaseCorrecter class and uses it for diagnosis synonym normalization. Without it, diagnosis names are used as-is from the annotations.
Image Copy
After running the ETL, copy the referenced images to the curated bucket:
    # Dry run first
    uv run --isolated python scripts/copy_images.py \
      --manifest /tmp/dermadetect-etl-output/manifest.parquet \
      --profile dermadetect_superadmin \
      --dry-run

    # Real copy (~1 hour for 370K images)
    uv run --isolated python scripts/copy_images.py \
      --manifest /tmp/dermadetect-etl-output/manifest.parquet \
      --profile dermadetect_superadmin \
      --workers 20

The copy is idempotent: it skips images already in the destination.
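The skip-if-present behaviour that makes the copy idempotent can be sketched as follows. This is not the code in scripts/copy_images.py: the function copy_missing and its callback parameters are hypothetical, and the real script presumably talks to S3 and fans the work out over the --workers threads, which this sketch leaves out.

```python
from typing import Callable, Iterable, Tuple


def copy_missing(
    keys: Iterable[str],
    exists_in_dest: Callable[[str], bool],
    copy_one: Callable[[str], None],
    dry_run: bool = False,
) -> Tuple[int, int]:
    """Copy every key that is not yet in the destination.

    Returns (to_copy, skipped). In a dry run nothing is copied and
    `to_copy` counts the objects that would have been copied.
    """
    to_copy = skipped = 0
    for key in keys:
        if exists_in_dest(key):
            skipped += 1  # already present, so a re-run does no extra work
            continue
        if not dry_run:
            copy_one(key)
        to_copy += 1
    return to_copy, skipped
```

Running the function twice over the same keys copies everything on the first pass and skips everything on the second, which is the property that makes an interrupted copy safe to restart.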
Manifest Schema
Each row in manifest.parquet represents one image:
Identity
- image_uuid: filename UUID from the original GCS path
- case_uuid: patient request UUID (groups images from the same submission)
- patient_id: numeric patient identifier (groups cases from the same person)
Diagnosis
- diagnosis: primary diagnosis (normalized), e.g. “acne vulgaris”
- diagnosis_icd9: ICD-9 code
- diagnoses_all: list of all diagnoses for multi-label
Metadata (raw values, not encoded)
- Continuous: age, temperature, duration_days, pregnancy_week
- Categorical: gender, size, shape, quantity, topography, location_primary, location_secondary, location_side, location_coverage, pain_type, hair_loss_type
- Boolean (3-state): itch, pus, cough, crater, vesicle, bleeding, swelling, pregnancy, widespread_face, duration_from_birth, widespread_palm_feet, pregnancy_birth_control_pills, pain_is_pain, hair_loss_exist, location_swelling
- Multi-select: color_condition, texture, color_pus, primary_locations, secondary_locations
Provenance
- vendor_id: “maccabi” or “yeledoctor”
- modified_at: annotation timestamp
- source_gcs_path: original gs:// URL
Splits
Splits are stored separately in splits.parquet with columns patient_id and split.
Key design decision: Splits are by patient, not by case or image. A patient who submitted multiple cases will have all their data in the same split. This prevents data leakage where the model sees the same patient’s skin in both train and test.
| Split | Patients | Approximate ratio |
|---|---|---|
| train | ~71K | 80% |
| val | ~9K | 10% |
| test | ~10K | 10% |
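A consumer of the two files might attach splits to images and verify the no-leakage property with something like the sketch below. It works on plain dict rows so it stays independent of any Parquet reader; the function names build_assignment and images_by_split are hypothetical, and only the column names patient_id, split and image_uuid come from the schema described above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def build_assignment(split_rows: Iterable[Mapping]) -> Dict[int, str]:
    """Map patient_id -> split, failing if a patient has two different splits."""
    assignment: Dict[int, str] = {}
    for row in split_rows:
        pid, split = row["patient_id"], row["split"]
        # setdefault returns the earlier split if this patient was already seen.
        if assignment.setdefault(pid, split) != split:
            raise ValueError(f"patient {pid} appears in more than one split")
    return assignment


def images_by_split(
    manifest_rows: Iterable[Mapping], assignment: Mapping[int, str]
) -> Dict[str, List[str]]:
    """Group image_uuid values by the split of their patient."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in manifest_rows:
        groups[assignment[row["patient_id"]]].append(row["image_uuid"])
    return dict(groups)
```

Because the lookup key is patient_id rather than case_uuid or image_uuid, every image of a given patient necessarily lands in the same group.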
Disease Label Pipeline
From Raw Annotation to Training Label
The labeling pipeline transforms dermatologist annotations into model training labels:
- Raw annotation: Dermatologists diagnose cases in the app. Each annotation contains a JSON array with disease text and an ICD-9 code (Maccabi coding system).
- Normalization (ETL): The ETL script normalizes diagnosis text using a synonym dictionary (166 mappings from the legacy DiseaseCorrecter). Examples: “plaque psoriasis” → “psoriasis”, “cutaneous insect bite reactions” → “insect bite”. All text is lowercased.
- Exclusion filtering: 32 non-diagnosis entries are removed during ETL (e.g. “undefined diagnosis”, “unclear image”, “test”, “diagnosis deferred”).
- Label file selection: A label file (labels/common25.txt) defines which diseases the model classifies. The production model uses 25 diseases selected from the most common diagnoses in the dataset.
- Multi-hot encoding: Each sample gets a label vector of size num_classes. For each diagnosis that matches a disease in the label file, the corresponding bit is set to 1.0.
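The normalization, exclusion and encoding steps can be illustrated with a minimal sketch. It is not the DiseaseCorrecter implementation: SYNONYMS and EXCLUDED hold only the examples quoted above (not the full 166 mappings or 32 exclusions), the function names are hypothetical, and lowercasing before the synonym lookup is an assumed ordering.

```python
from typing import Iterable, List

# Only the two synonym examples quoted in the text; the full dictionary is larger.
SYNONYMS = {
    "plaque psoriasis": "psoriasis",
    "cutaneous insect bite reactions": "insect bite",
}
# Only the four exclusion examples quoted in the text.
EXCLUDED = {"undefined diagnosis", "unclear image", "test", "diagnosis deferred"}


def normalize(raw: str) -> str:
    """Lowercase the diagnosis text, then apply the synonym mapping."""
    text = raw.lower()
    return SYNONYMS.get(text, text)


def clean_diagnoses(raw_diagnoses: Iterable[str]) -> List[str]:
    """Normalize every diagnosis and drop non-diagnosis entries."""
    names = [normalize(d) for d in raw_diagnoses]
    return [n for n in names if n not in EXCLUDED]


def multi_hot(diagnoses: Iterable[str], labels: List[str]) -> List[float]:
    """Return a vector of len(labels) with 1.0 at each matching disease."""
    index = {name: i for i, name in enumerate(labels)}
    vector = [0.0] * len(labels)
    for name in diagnoses:
        if name in index:  # diagnoses outside the label file are ignored
            vector[index[name]] = 1.0
    return vector
```

With labels ["acne vulgaris", "insect bite", "psoriasis"], the raw annotation ["Plaque Psoriasis", "test"] cleans to ["psoriasis"] and encodes to [0.0, 0.0, 1.0].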
Handling Unmapped Diagnoses
The dataset contains 1,151 unique diseases, but models typically train on 25-40. Samples with diagnoses outside the label list are handled in one of two ways, depending on dataset.other_class:
- Without “other” class (dataset.other_class: false): Unmapped samples get an all-zeros label vector. They act as implicit negative examples via BCEWithLogitsLoss.
- With “other” class (dataset.other_class: true): An “other” class is appended to the label list. Unmapped samples get label["other"] = 1.0, and are downsampled to dataset.other_ratio (default 10%) of the principal sample count. This teaches the model to affirmatively distinguish “none of the above” from low confidence.
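The “other” class variant can be sketched as below. The sketch makes several assumptions that the text does not confirm: a "principal" sample is taken to be any sample with at least one diagnosis in the label list, the target count is rounded down, and the downsampling is a seeded uniform random sample. The function name add_other_class is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def add_other_class(
    samples: Sequence[Sequence[str]],
    labels: List[str],
    other_ratio: float = 0.10,
    seed: int = 0,
) -> Tuple[List[str], List[List[float]]]:
    """Append an "other" class and downsample the unmapped samples.

    Each sample is a list of normalized diagnosis names. Returns the
    extended label list and the label vectors that were kept.
    """
    label_index = {name: i for i, name in enumerate(labels)}
    other_index = len(labels)
    principal: List[List[float]] = []
    unmapped: List[List[float]] = []
    for diagnoses in samples:
        vector = [0.0] * (len(labels) + 1)
        matched = False
        for name in diagnoses:
            if name in label_index:
                vector[label_index[name]] = 1.0
                matched = True
        if matched:
            principal.append(vector)
        else:
            vector[other_index] = 1.0  # no diagnosis in the label list
            unmapped.append(vector)
    # Keep at most other_ratio * (principal count) unmapped samples.
    keep = min(len(unmapped), int(other_ratio * len(principal)))
    kept = random.Random(seed).sample(unmapped, keep)
    return labels + ["other"], principal + kept
```

With dataset.other_class set to false, the equivalent behaviour is simply the plain multi-hot vector, which is all zeros for an unmapped sample.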
ICD-9 Codes
The diagnosis_icd9 field stores Maccabi-system ICD-9 codes from the original annotations. These codes are not used for training — the model trains on normalized disease name text. The codes are preserved in the manifest for reference and potential future use (e.g. mapping to ICD-10 for regulatory purposes).
Production Disease List (common25)
The 25 diseases used in production (labels/common25.txt):
acne vulgaris, eczema uns, folliculitis, hematoma uns, herpes simplex, herpes zoster, insect bite, intertrigo, keratosis pilaris, melasma, molluscum contagiosum, onychomycosis, paronychia finger, pityriasis rosea, post inflammatory hyperpigmentation, psoriasis, rosacea, seborrheic dermatitis, skin tag, tinea pedis, tinea versicolor, urticaria, verruca vulgaris, viral exanthem uns, vitiligo.
PII/PHI Notes
The manifest deliberately excludes:
- HTTP headers (contained JWTs, doctor emails, IP addresses)
- modified_by (doctor email)
- free_text (could contain patient-identifying notes)
- user_crypt (encrypted patient data)
The patient_id is an opaque numeric ID, not a national ID or name.
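A guard of the following shape could be used to check that none of the excluded fields reach the manifest. It is an illustrative sketch only: the key name "headers" for the HTTP headers is an assumption (the source does not give the field name), and EXCLUDED_FIELDS, strip_sensitive and assert_no_sensitive_columns are hypothetical names.

```python
from typing import Dict, Iterable

# "headers" is an assumed key name; the other three come from the list above.
EXCLUDED_FIELDS = frozenset({"headers", "modified_by", "free_text", "user_crypt"})


def strip_sensitive(record: Dict[str, object]) -> Dict[str, object]:
    """Return a copy of `record` without the excluded PII/PHI fields."""
    return {k: v for k, v in record.items() if k not in EXCLUDED_FIELDS}


def assert_no_sensitive_columns(columns: Iterable[str]) -> None:
    """Raise if any excluded field name appears among the output columns."""
    leaked = EXCLUDED_FIELDS.intersection(columns)
    if leaked:
        raise ValueError(f"sensitive columns present: {sorted(leaked)}")
```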