
Dataset

Source Data

The training dataset was extracted from the original DermaDetect PostgreSQL database and GCS image storage, both backed up to S3.

Raw Sources

| Source | Location | Contents |
| --- | --- | --- |
| SQL dump | /git/consulting/dermadetect/sql_dumps/dd_api-*.sql | 225K patient requests, 178K annotations, disease codes |
| Image backup | s3://dermadetect-gcp-backup/dermadetect-20180424.appspot.com/000RN/ | 543K patient photos (~150GB) |

Curated Dataset

| Location | Contents |
| --- | --- |
| s3://dermadetect-ml-datasets/v1/manifest.parquet | One row per image with metadata + diagnosis |
| s3://dermadetect-ml-datasets/v1/splits.parquet | Patient → train/val/test assignment |
| s3://dermadetect-ml-datasets/v1/feature_schema.json | Field types, valid values, ranges |
| s3://dermadetect-ml-datasets/v1/disease_taxonomy.json | ICD-9 mappings, synonym dictionary |
| s3://dermadetect-ml-datasets/v1/stats.json | Summary statistics |
| s3://dermadetect-ml-datasets/v1/images/*.jpg | ~370K deduplicated images |

ETL Pipeline

The ETL script (scripts/run_etl.py) reads the PostgreSQL dump file directly — no database restore needed.

    uv run --isolated python scripts/run_etl.py \
      --sql-dump "/path/to/dd_api-2024_05_05_14_19_32 (3).sql" \
      --algo-python /path/to/algo_python \
      --output-dir /tmp/dermadetect-etl-output

What it does:

  1. Parses rn_requests and dermatologist_annotation tables from the SQL dump
  2. Joins them, filters out test accounts and non-diagnoses
  3. Flattens the request JSONB into per-image rows with structured metadata
  4. Normalizes diagnosis names using the legacy DiseaseCorrecter synonym dictionary
  5. Validates continuous fields (age capped at 0-120, temperature at 30-45)
  6. Computes patient-level train/val/test splits (80/10/10, stratified by diagnosis)
  7. Writes Parquet files + JSON metadata
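Step 5 can be illustrated with a minimal sketch. The text does not say whether out-of-range values are clipped or dropped, so this version assumes they are treated as missing. The function name, the RANGES table, and the units (years, degrees Celsius) are assumptions for illustration, not the ETL's actual code.

```python
from typing import Optional

# Valid ranges from step 5; units (years, degrees Celsius) are an assumption.
RANGES = {"age": (0.0, 120.0), "temperature": (30.0, 45.0)}


def validate_continuous(field: str, value: object) -> Optional[float]:
    """Return the value as a float if it lies inside the field's range, else None."""
    try:
        number = float(value)  # raw values may arrive as strings
    except (TypeError, ValueError):
        return None
    low, high = RANGES[field]
    # Assumption: out-of-range values become missing rather than being clipped.
    # NaN fails the comparison and is therefore also mapped to None.
    return number if low <= number <= high else None
```

Returning None keeps implausible values out of the manifest without inventing a replacement value; clipping to the nearest bound would be the alternative design.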

--algo-python is optional. If provided, the ETL script imports the DiseaseCorrecter class from that path and uses it for diagnosis synonym normalization. Without it, diagnosis names are used as-is from the annotations.
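The optional normalization can be sketched as below. The real DiseaseCorrecter interface is not shown here, so a plain dictionary stands in for it; normalize_diagnosis and EXAMPLE_SYNONYMS are hypothetical names. Whether lowercasing still applies when no dictionary is available is not stated, so this sketch returns the text untouched in that case.

```python
from typing import Mapping, Optional

# Two synonym pairs taken from the examples in this document; the full
# dictionary lives in the legacy DiseaseCorrecter class.
EXAMPLE_SYNONYMS = {
    "plaque psoriasis": "psoriasis",
    "cutaneous insect bite reactions": "insect bite",
}


def normalize_diagnosis(raw: str, synonyms: Optional[Mapping[str, str]] = None) -> str:
    """Map a diagnosis through the synonym table, or return it unchanged."""
    if synonyms is None:
        # No --algo-python given: the annotation text is used as-is.
        return raw
    key = raw.strip().lower()
    # Unknown names pass through lowercased; known synonyms are replaced.
    return synonyms.get(key, key)
```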

Image Copy

After running the ETL, copy the referenced images to the curated bucket:

    # Dry run first
    uv run --isolated python scripts/copy_images.py \
      --manifest /tmp/dermadetect-etl-output/manifest.parquet \
      --profile dermadetect_superadmin \
      --dry-run

    # Real copy (~1 hour for 370K images)
    uv run --isolated python scripts/copy_images.py \
      --manifest /tmp/dermadetect-etl-output/manifest.parquet \
      --profile dermadetect_superadmin \
      --workers 20

The copy is idempotent — it skips images already in the destination.
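The skip-if-present behavior can be sketched with in-memory dictionaries standing in for the source and destination buckets. This is an illustration of the idea only; copy_missing is a hypothetical function and says nothing about how copy_images.py talks to S3.

```python
from typing import Dict, Iterable


def copy_missing(
    keys: Iterable[str],
    source: Dict[str, bytes],
    dest: Dict[str, bytes],
    dry_run: bool = False,
) -> int:
    """Copy keys that are absent from dest; return how many were (or would be) copied."""
    copied = 0
    for key in dict.fromkeys(keys):  # de-duplicate while keeping order
        if key in dest:
            continue  # already present from an earlier run
        if not dry_run:
            dest[key] = source[key]
        copied += 1
    return copied
```

Running the function twice with the same arguments copies nothing the second time, which is what makes an interrupted copy safe to restart.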

Manifest Schema

Each row in manifest.parquet represents one image:

Identity

  • image_uuid — filename UUID from the original GCS path
  • case_uuid — patient request UUID (groups images from the same submission)
  • patient_id — numeric patient identifier (groups cases from the same person)

Diagnosis

  • diagnosis — primary diagnosis (normalized), e.g. “acne vulgaris”
  • diagnosis_icd9 — ICD-9 code
  • diagnoses_all — list of all diagnoses for multi-label

Metadata (raw values, not encoded)

  • Continuous: age, temperature, duration_days, pregnancy_week
  • Categorical: gender, size, shape, quantity, topography, location_primary, location_secondary, location_side, location_coverage, pain_type, hair_loss_type
  • Boolean (3-state): itch, pus, cough, crater, vesicle, bleeding, swelling, pregnancy, widespread_face, duration_from_birth, widespread_palm_feet, pregnancy_birth_control_pills, pain_is_pain, hair_loss_exist, location_swelling
  • Multi-select: color_condition, texture, color_pus, primary_locations, secondary_locations

Provenance

  • vendor_id — “maccabi” or “yeledoctor”
  • modified_at — annotation timestamp
  • source_gcs_path — original gs:// URL

Splits

Splits are stored separately in splits.parquet with columns patient_id and split.

Key design decision: Splits are by patient, not by case or image. A patient who submitted multiple cases will have all their data in the same split. This prevents data leakage where the model sees the same patient’s skin in both train and test.

| Split | Patients | Approximate ratio |
| --- | --- | --- |
| train | ~71K | 80% |
| val | ~9K | 10% |
| test | ~10K | 10% |
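The patient-level guarantee can be sketched as follows. Note the swap: this uses a deterministic hash bucket per patient, not the diagnosis-stratified assignment the ETL performs, so it shows only the leakage-prevention property. assign_split and the sample rows are hypothetical.

```python
import hashlib


def assign_split(patient_id: int, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map a patient to train/val/test via a hash bucket."""
    digest = hashlib.sha256(str(patient_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"


# Hypothetical manifest rows: (image_uuid, case_uuid, patient_id)
rows = [
    ("img-1", "case-a", 1001),
    ("img-2", "case-a", 1001),
    ("img-3", "case-b", 1001),
    ("img-4", "case-c", 2002),
]

# The split depends only on patient_id, so every image and case from one
# patient lands in the same split.
splits = {image: assign_split(patient) for image, _case, patient in rows}
```

Because the assignment is keyed on patient_id alone, adding new cases for an existing patient never moves that patient between splits.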

Disease Label Pipeline

From Raw Annotation to Training Label

The labeling pipeline transforms dermatologist annotations into model training labels:

  1. Raw annotation: Dermatologists diagnose cases in the app. Each annotation contains a JSON array with disease text and an ICD-9 code (Maccabi coding system).

  2. Normalization (ETL): The ETL script normalizes diagnosis text using a synonym dictionary (166 mappings from the legacy DiseaseCorrecter). Examples: “plaque psoriasis” → “psoriasis”, “cutaneous insect bite reactions” → “insect bite”. All text is lowercased.

  3. Exclusion filtering: 32 non-diagnosis entries are removed during ETL (e.g. “undefined diagnosis”, “unclear image”, “test”, “diagnosis deferred”).

  4. Label file selection: A label file (labels/common25.txt) defines which diseases the model classifies. The production model uses 25 diseases selected from the most common diagnoses in the dataset.

  5. Multi-hot encoding: Each sample gets a label vector of size num_classes. For each diagnosis that matches a disease in the label file, the corresponding bit is set to 1.0.
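Step 5 can be sketched in a few lines. The multi_hot function is a hypothetical helper, and the three-name label list is a stand-in for the contents of labels/common25.txt.

```python
from typing import List, Sequence


def multi_hot(diagnoses: Sequence[str], label_names: Sequence[str]) -> List[float]:
    """Return a vector of len(label_names) with 1.0 for each matching diagnosis."""
    index = {name: i for i, name in enumerate(label_names)}
    vector = [0.0] * len(label_names)
    for diagnosis in diagnoses:
        position = index.get(diagnosis)
        if position is not None:  # diagnoses outside the label file are ignored
            vector[position] = 1.0
    return vector


# Three names from the common25 list, standing in for the full label file.
labels = ["acne vulgaris", "psoriasis", "rosacea"]
```

For example, a sample diagnosed with both psoriasis and rosacea gets 1.0 in two positions, while a sample whose diagnoses are all outside the label file gets an all-zeros vector.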

Handling Unmapped Diagnoses

The dataset contains 1,151 unique diseases, but models typically train on 25-40. Samples with diagnoses outside the label list are handled in two ways:

  • Without “other” class (dataset.other_class: false): Unmapped samples get an all-zeros label vector. They act as implicit negative examples via BCEWithLogitsLoss.

  • With “other” class (dataset.other_class: true): An “other” class is appended to the label list. Unmapped samples get label["other"] = 1.0, and are downsampled to dataset.other_ratio (default 10%) of the principal sample count. This teaches the model to affirmatively distinguish “none of the above” from low confidence.
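The "other" class option can be sketched as below. It assumes that the "principal sample count" means the number of samples with at least one diagnosis in the label list, and that downsampling is a seeded random sample; both are assumptions, and add_other_class is a hypothetical function.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[str, List[str]]  # (image_uuid, diagnoses_all)
Labeled = Tuple[str, List[float]]  # (image_uuid, label vector)


def add_other_class(
    samples: Sequence[Sample],
    label_names: Sequence[str],
    other_ratio: float = 0.10,
    seed: int = 0,
) -> Tuple[List[str], List[Labeled]]:
    """Append an "other" class and keep at most other_ratio * mapped unmapped samples."""
    names = list(label_names) + ["other"]
    known = set(label_names)
    mapped: List[Labeled] = []
    unmapped: List[Labeled] = []
    for image, diagnoses in samples:
        vector = [1.0 if name in diagnoses else 0.0 for name in label_names]
        if known.intersection(diagnoses):
            mapped.append((image, vector + [0.0]))
        else:
            unmapped.append((image, vector + [1.0]))  # "other" bit set
    # Assumption: the cap is relative to the number of mapped samples.
    keep = min(len(unmapped), int(other_ratio * len(mapped)))
    rng = random.Random(seed)  # seeded so the subset is reproducible
    return names, mapped + rng.sample(unmapped, keep)
```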

ICD-9 Codes

The diagnosis_icd9 field stores Maccabi-system ICD-9 codes from the original annotations. These codes are not used for training — the model trains on normalized disease name text. The codes are preserved in the manifest for reference and potential future use (e.g. mapping to ICD-10 for regulatory purposes).

Production Disease List (common25)

The 25 diseases used in production (labels/common25.txt):

acne vulgaris, eczema uns, folliculitis, hematoma uns, herpes simplex, herpes zoster, insect bite, intertrigo, keratosis pilaris, melasma, molluscum contagiosum, onychomycosis, paronychia finger, pityriasis rosea, post inflammatory hyperpigmentation, psoriasis, rosacea, seborrheic dermatitis, skin tag, tinea pedis, tinea versicolor, urticaria, verruca vulgaris, viral exanthem uns, vitiligo.

PII/PHI Notes

The manifest deliberately excludes:

  • HTTP headers (contained JWTs, doctor emails, IP addresses)
  • modified_by (doctor email)
  • free_text (could contain patient-identifying notes)
  • user_crypt (encrypted patient data)

The patient_id is an opaque numeric ID, not a national ID or name.
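The exclusion can be sketched as a simple field filter applied before a record is written. The key name "headers" is a guess for where the HTTP headers are stored, and strip_sensitive is a hypothetical helper; the other three names come from the list above.

```python
from typing import Any, Dict

# "headers" is an assumed key name; the other three are listed in this section.
EXCLUDED_FIELDS = {"headers", "modified_by", "free_text", "user_crypt"}


def strip_sensitive(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record without the excluded PII/PHI fields."""
    return {key: value for key, value in record.items() if key not in EXCLUDED_FIELDS}
```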
