Openly available datasets

SIIM Hackathon Dataset

A small but comprehensive dataset featuring fictional, but realistic, patient personas with their clinical data in FHIR and images in DICOM that corroborate each patient's story.

Note: If you are using the Hackathon Server, this data is already loaded on it.


The Cancer Imaging Archive (TCIA)

TCIA is a service that de-identifies and hosts a large archive of medical images of cancer, available for public download. The data are organized as “Collections”, typically patients related by a common disease (e.g. lung cancer), image modality (MRI, CT, etc.), or research focus. DICOM is the primary file format used by TCIA for image storage. Supporting data related to the images, such as patient outcomes, treatment details, genomics, pathology, and expert analyses, are also provided when available.


Imaging Data Commons (IDC)

The National Cancer Institute (NCI) Cancer Research Data Commons (CRDC) aims to establish a national cloud-based data science infrastructure. Imaging Data Commons (IDC) provides a radiology and pathology DICOM image repository within CRDC. CRDC connects researchers with key cancer data types (imaging, genomics, and proteomics) co-located with cloud-based computational resources and big data analysis tools provided in the Google and AWS clouds.


SIIM-ISIC Melanoma Classification Challenge

This dermatology dataset contains 33,126 dermoscopic training images of unique benign and malignant skin lesions from over 2,000 patients. Each image is associated with one of these individuals using a unique patient identifier. All malignant diagnoses have been confirmed via histopathology, and benign diagnoses have been confirmed using expert agreement, longitudinal follow-up, or histopathology. A detailed description of the dataset is available as a pre-print that has not yet undergone peer review.

The dataset was generated by the International Skin Imaging Collaboration (ISIC) and images are from the following sources: Hospital Clínic de Barcelona, Medical University of Vienna, Memorial Sloan Kettering Cancer Center, Melanoma Institute Australia, University of Queensland, and the University of Athens Medical School.

The dataset was curated for the SIIM-ISIC Melanoma Classification Challenge, hosted on Kaggle during the summer of 2020.
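Because several images can share one patient identifier, a random image-level split can place the same patient in both the training and validation sets. The following is a minimal sketch of a patient-level split using only the Python standard library. The record layout, the identifiers, and the function name `split_by_patient` are hypothetical and chosen for illustration; they are not taken from the challenge files.

```python
import random
from collections import defaultdict


def split_by_patient(records, val_fraction=0.2, seed=0):
    """Split (image_id, patient_id) pairs so that no patient spans both sets."""
    # Group image identifiers under the patient they belong to.
    by_patient = defaultdict(list)
    for image_id, patient_id in records:
        by_patient[patient_id].append(image_id)

    # Sort before shuffling so the result depends only on the seed.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    # Hold out at least one patient for validation.
    n_val = max(1, round(len(patients) * val_fraction))
    val_patients = set(patients[:n_val])

    train = [img for p in patients[n_val:] for img in by_patient[p]]
    val = [img for p in patients[:n_val] for img in by_patient[p]]
    return train, val, val_patients


# Hypothetical records: (image identifier, patient identifier).
records = [
    ("img_001", "patient_A"),
    ("img_002", "patient_A"),
    ("img_003", "patient_B"),
    ("img_004", "patient_C"),
    ("img_005", "patient_C"),
    ("img_006", "patient_D"),
]

train_ids, val_ids, val_patients = split_by_patient(records)
print(len(train_ids), "training images,", len(val_ids), "validation images")
```

The split is made over patients rather than images, so the validation fraction is approximate when patients contribute different numbers of images.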


The Coherent Dataset

The “Coherent Data Set” is a novel synthetic data set that leverages structured data from Synthea™ to create a longitudinal, “coherent” patient-level electronic health record (EHR). Composed entirely of synthetic patients, the Coherent Data Set is publicly available, reproducible using Synthea™, and free of the privacy risks that arise from using real patient data. The Coherent Data Set provides complex and representative health records that can be leveraged by health IT professionals without the risks associated with de-identified patient data. It includes familial genomes that were created through a simulation of the genetic reproduction process; magnetic resonance imaging (MRI) DICOM files created with a voxel-based computational model; clinical notes in the style of traditional subjective, objective, assessment, and plan (SOAP) notes; and physiological data that leverage existing Systems Biology Markup Language (SBML) models to capture non-linear changes in patient health metrics. HL7 Fast Healthcare Interoperability Resources (FHIR®) links the data together. The models can generate clinically logical health data, but ensuring clinical validity remains a challenge without comparable data to substantiate results. We believe this data set is the first of its kind and a novel contribution to practical health interoperability efforts.
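One common way for FHIR to link clinical records to DICOM images is the ImagingStudy resource, which references a patient through `subject` and can carry the DICOM Study Instance UID as an identifier with the system `urn:dicom:uid`. The sketch below assumes that FHIR R4 convention and uses a small hand-written bundle; the resource ids and the UID are hypothetical, and the actual files in the data set may be structured differently.

```python
import json

# Hypothetical, hand-written FHIR bundle for illustration only.
bundle_json = """
{
  "resourceType": "Bundle",
  "type": "collection",
  "entry": [
    {"resource": {"resourceType": "Patient", "id": "example-patient"}},
    {"resource": {
      "resourceType": "ImagingStudy",
      "id": "example-study",
      "status": "available",
      "subject": {"reference": "Patient/example-patient"},
      "identifier": [
        {"system": "urn:dicom:uid", "value": "urn:oid:1.2.3.4.5"}
      ]
    }}
  ]
}
"""

OID_PREFIX = "urn:oid:"


def study_uids_by_patient(bundle):
    """Map each patient reference to the DICOM Study Instance UIDs linked to it."""
    result = {}
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "ImagingStudy":
            continue
        patient_ref = resource.get("subject", {}).get("reference")
        for ident in resource.get("identifier", []):
            if ident.get("system") != "urn:dicom:uid":
                continue
            value = ident.get("value", "")
            # Strip the URN prefix to get the bare DICOM UID.
            if value.startswith(OID_PREFIX):
                value = value[len(OID_PREFIX):]
            result.setdefault(patient_ref, []).append(value)
    return result


links = study_uids_by_patient(json.loads(bundle_json))
print(links)  # {'Patient/example-patient': ['1.2.3.4.5']}
```

The resulting UIDs can then be matched against the Study Instance UID stored in the DICOM files to find the images that belong to each patient record.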


MIDRC COVID-19 Imaging Data Collection

RSNA collects and publishes data through the Medical Imaging and Data Resource Center (MIDRC). It has established key partnerships and assembled an international task force of scientists and radiologists to support this effort.

RSNA has developed data selection criteria, data sharing agreements, and tools to help sites organize, de-identify, and transfer data. The RSNA-MIDRC data collection pathway will enable radiology organizations to contribute data to MIDRC safely and conveniently.

Assembling broad and diverse datasets is essential to supporting quality research that leads to improvements in diagnosis and care.