Research Data

A guide for locating, managing, and sharing research data.

Datasets by Field or Type

General and Public Health
  • WHO: Provides datasets based on global health priorities. The organization includes easy search and provides insights for topics along with the datasets.
  • CDC: Use this for US-specific public health data. The CDC maintains WONDER (Wide-ranging Online Data for Epidemiologic Research), whose datasets are searchable by topic, state, and other factors.
  • data.gov: US focused healthcare data searchable by several different factors. Datasets are intended to improve the lives of people living in the US, but the information could be valuable for other training sets in research or other public health areas.
  • DataONE: DataONE (Data Observation Network for Earth) is a searchable repository of environmental and climate data.
  • Inter-university Consortium for Political and Social Research: Browse available data by topic.
 
Scientific Research
  • Re3Data: Contains data from over 2000 research subjects defined across several broad categories. While not all datasets available are free, the structures are clearly marked and easily searchable based on fees, membership requirements, and copyright restrictions.
  • CHDS: Child Health and Development Studies datasets are intended to research how disease and health pass down through generations. It contains datasets for research into not just genomic expression but how social, environmental, and cultural factors play into disease and health.
  • Kent Ridge Biomedical Datasets: High-dimensional datasets in the biomedical field. It focuses on journal-published data (Nature, Science, and others).
  • Merck Molecular Health Activity Challenge: Datasets designed to foster the machine learning pursuit of drug discovery by simulating how molecule combinations could interact with each other.
  • SEER: Datasets arranged by demographic groups and provided by the US government. You can search based on age, race, and gender.
  • 1000 Genomes Project: Sequencing from 2500 individuals and 26 different populations. It’s one of the biggest genome repositories you can access and is an international collaboration. It’s accessed through AWS. (Note: grants are available for genome projects.)
  • DBMI Data Portal: The i2b2 NLP data sets previously released on i2b2.org are now hosted here under their new moniker, n2c2 (National NLP Clinical Challenges): n2c2 NLP Research Data Sets. These data sets are the result of annual NLP challenges dating back to 2006, originally organized as part of the i2b2 project (Informatics for Integrating Biology and the Bedside).
 
Healthcare Services
  • Medicare: Provides datasets based on services provided by Medicare accepting institutions. Datasets are well scrubbed for the most part and offer exciting insights into the service side of hospital care.
  • HCUP: Datasets from US hospitals. It includes emergency room stays, in-patient stays, and ambulance stats. The data are clean and shed light on the services side of US healthcare.
 
Images
  • OASIS: The Open Access Series of Imaging Studies makes neuroimages of the brain freely available, hoping to foster research and new advances in both basic health and clinical neuroscience.
  • OpenfMRI: Other imaging data sets from MRI machines to foster research, better diagnostics, and training. It includes 95 datasets from 3372 subjects with new material being added as researchers make their own data open to the public.
  • OpenNEURO: A free and open platform for sharing MRI, MEG, EEG, iEEG, and ECoG data.
  • CT Medical Images: This one is a small dataset, but it’s specifically cancer-related. It contains labeled images with age, modality, and contrast tags. Again, high-quality images associated with training data may help speed breakthroughs.
  • Deep Lesion: One of the largest image sets currently available. CT images released from the NIH to help with better accuracy of lesion documentation and diagnosis. It includes over 32,000 lesions from 4000 unique patients.
  • The Cancer Imaging Archive (TCIA): The data are organized as “Collections”, typically patients related by a common disease (e.g. lung cancer), image modality (MRI, CT, etc) or research focus. DICOM is the primary file format used by TCIA for image storage. Supporting data related to the images such as patient outcomes, treatment details, genomics, pathology, and expert analyses are also provided when available.
  • Cell Image Library: The Cell Image Library accepts image data sets that are too large for publishers to store and provides access to the biomedical community. There are 10,000 datasets in 20TB of uploaded data as of mid-2018. The library inherits data from the Cell Centered Database at UCSD.
 
Domain Specific
 
  • Metabolomics Workbench (MetWB): MetWB is a national and international repository for metabolomics data and metadata and provides analysis tools and access to metabolite standards, protocols, tutorials, training, and more.
 
National Cancer Institute (NCI)
  • Cancer Nanotechnology Laboratory (caNanoLab): caNanoLab provides support for the annotation of nanomaterials with characterizations resulting from physico-chemical, in vitro, and in vivo assays and the sharing of these characterizations and associated nanotechnology protocols in a secure fashion.
  • Genomic Data Commons (GDC): The GDC contains clinical, biospecimen, and molecular data from several cancer research programs.
  • The Network Data Exchange (NDEx): NDEx is an online commons where scientists can upload, share, and publicly distribute biological networks and pathway models. The NDEx Project maintains a web-accessible public server, a documentation website, provides seamless connectivity to Cytoscape as well as programmatic access using a variety of languages including Python and Java.
  • The Pediatric Genomic Data Inventory (PGDI): PGDI is an open-access resource for identifying and locating genomic datasets that can be used to further the understanding of childhood cancers and develop better treatment protocols for sick children. This resource lists ongoing and completed molecular characterization projects of pediatric cancer cohorts from the United States and other countries, along with some basic details and reference metadata.
 
National Eye Institute (NEI)
  • NEI Data Commons: The Commons portal provides a platform for querying and accessing vision research data and tools for data processing and analysis. It is the central location for NEI generated clinical and basic research data available to the public.
 
National Human Genome Research Institute (NHGRI)
  • FlyBase: A Drosophila genomic and genetic database that includes proteomics data, microarrays, and tiling BACs.
  • The NHGRI Genomic Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL): AnVIL is a scalable and interoperable resource for the genomic scientific community that leverages a cloud-based infrastructure for democratizing genomic data access, sharing, and computing across large genomic and genomic-related data sets.
  • The Zebrafish Model Organism Database (ZFIN): Aims to: a) be the community database resource for the laboratory use of zebrafish, b) develop and support integrated zebrafish genetic, genomic and developmental information, c) maintain the definitive reference data sets of zebrafish research information, d) link this information extensively to corresponding data in other model organism and human databases, e) facilitate the use of zebrafish as a model for human biology, and f) serve the needs of the research community.
  • WormBase: An international consortium of biologists and computer scientists dedicated to providing the research community with accurate, current, accessible information concerning the genetics, genomics and biology of C. elegans and related nematodes.
  • Mouse Genome Informatics (MGI): MGI is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease.
  • The Universal Protein Resource (UniProt): The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data.
 
National Heart, Lung, and Blood Institute
  • Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC): The goal of BioLINCC is to facilitate and coordinate the existing activities of the NHLBI Biorepository and the Data Repository and to expand their scope and usability to the scientific community through a single web-based user interface.
  • National Sleep Research Resource: The NSRR web platform enables sharing of physiological signals and clinical data elements from well-characterized, de-identified research cohorts and clinical trials.
  • Rat Genome Database (RGD): The Rat Genome Database (RGD) was established in 1999 and is the premier site for genetic, genomic, phenotype, and disease data generated from rat research. In addition, it provides easy access to corresponding data for human and mouse, as well as multiple other models such as chinchilla and 13-lined ground squirrel, facilitating cross-species comparisons.
 
National Institute on Aging
  • AMP-AD Knowledge Portal: An NIH-designated repository and the distribution site for multi-omic data from human samples, cell-based and animal models, analysis results, analytical methodology and research tools generated by multiple National Institute on Aging-supported Alzheimer's disease research programs and consortia.
  • National Archive of Computerized Data on Aging (NACDA): NACDA acquires and preserves data relevant to gerontological research, processing as needed to promote effective research use, disseminates them to researchers, and facilitates their use.
  • NIDUS Delirium Research Hub: A database of completed or ongoing studies that include delirium as an outcome or predictor. The Hub includes study meta-data such as study design, sample characteristics, collected biospecimens, neuroimaging tests, neuropsychological testing and pharmacologic intervention.
  • The National Institute on Aging Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS): A national genetics data repository facilitating access to genotypic and phenotypic data for Alzheimer's disease (AD). Data include GWAS, whole genome (WGS) and whole exome (WES), expression, RNA-Seq, and ChIP-Seq analyses.
 
National Institute of Allergy and Infectious Diseases
  • Eukaryotic Pathogen Database Resources (EuPathDB): EuPathDB Bioinformatics Resource Center for Biodefense and Emerging/Re-emerging Infectious Diseases is a portal for accessing genomic-scale datasets associated with the eukaryotic pathogens.
  • Immune Epitope Database and Analysis Resource (IEDB): This repository contains antibody/B cell and T cell epitope information and epitope prediction and analysis tools for use by the research community worldwide.
  • Influenza Research Database (IRD): The Influenza Research Database (IRD) serves as a public repository and analysis platform for flu sequence, experiment, surveillance and related data.
  • ITN TrialShare: TrialShare provides clinical trial investigators the unprecedented ability to access de-identified study data, review published analysis methods, and perform real-time, interactive graphical analyses in collaboration with other researchers. 
  • Pathosystems Resource Integration Center (PATRIC): PATRIC, the Bacterial Bioinformatics Resource Center, supports research on bacterial infectious diseases by serving as a repository of genomic and other data with associated metadata for over 100,000 bacterial genomes. PATRIC provides an integrated suite of computational services and visualizations for users to analyze and compare their own data in a private workspace with the public data in PATRIC.
  • The Immunology Database and Analysis Portal (ImmPort): Data sources are primarily DAIT-funded clinical trials, associated mechanistic studies, and other basic and applied immunology research programs.
  • VectorBase: VectorBase is a Bioinformatics Resource Center for invertebrate vectors. It is one of four Bioinformatics Resource Centers funded by NIAID to provide web-based resources to the scientific community conducting basic and applied research on organisms considered potential agents of biowarfare or bioterrorism or causing emerging or re-emerging diseases.
  • Virus Pathogen Resource (ViPR): ViPR provides a searchable public repository of genomic, proteomic and other important research data for more than 500,000 strains of pathogenic viruses along with a suite of tools for analyzing these data.
 
National Institute of Biomedical Imaging and Bioengineering
  • LONI Database: The LONI Image Data Archive (IDA) is a user-friendly environment for archiving, searching, sharing, tracking and disseminating neuroimaging and related clinical data. The IDA is utilized for dozens of neuroimaging research projects across North America and Europe and accommodates MRI, PET, MRA, DTI and other imaging modalities.
  • Medical Information Mart for Intensive Care-III (MIMIC-III): MIMIC-III is a large, freely-available database comprising deidentified health-related data associated with over forty thousand patients who stayed in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The database includes information such as demographics, vital sign measurements made at the bedside (~1 data point per hour), laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality (both in and out of hospital). MIMIC-III supports a diverse range of analytic studies spanning epidemiology, clinical decision-rule improvement, and electronic tool development.
  • NeuroImaging Tools and Resources Collaboratory (NITRC): The NeuroImaging Tools and Resources Collaboratory (NITRC) provides free access to data and enables pay-per-use cloud-based access to unlimited computing power, enabling worldwide scientific collaboration with minimal startup and cost.
 
Eunice Kennedy Shriver National Institute of Child Health and Human Development
  • Child Language Data Exchange System (CHILDES): CHILDES is a system for sharing and analyzing conversational interactions.
  • Data and Specimen Hub (DASH): NICHD DASH is a centralized resource for researchers to store and access de-identified data from NICHD funded research studies for the purposes of secondary research use. 
  • Data Sharing for Demographic Research (DSDR): DSDR is a data sharing project providing curation and archiving services for the demographic and population sciences community.
  • National Children’s Study (NCS) Archive: The NCS Archive, a data and sample repository for the National Children’s Study, provides access to data and samples collected from over 5,600 U.S. birth families to study environmental influences on child health and development.
  • PhonBank: PhonBank is an open database for the study of early phonological development using the Phon program.
  • Xenbase: Xenbase is a Xenopus laevis and Xenopus tropicalis biology and genomics resource.
 
National Institute on Drug Abuse
  • Mouse Phenome Database (MPD): The Mouse Phenome Database (MPD) enables the integration of genomic and phenomic data by providing access to primary experimental data, well-documented data collection protocols and analysis tools. Data are contributed by investigators from around the world and represent a broad scope of behavioral, morphological and physiological disease-related characteristics in naive mice and those exposed to drugs, environmental agents or other treatments.
  • National Addiction & HIV Data Archive Program (NAHDAP): NAHDAP acquires, preserves and disseminates data relevant to drug addiction and HIV research. By preserving and making available an easily accessible library of electronic data on drug addiction and HIV infection in the United States, NAHDAP offers scholars the opportunity to conduct secondary analysis on major issues of social and behavioral sciences and public policy.
  • Neuroscience Information Framework (NIF): NIF maintains the largest searchable collection of neuroscience data, the largest catalog of biomedical resources, and the largest ontology for neuroscience on the web.
 
National Institute on Deafness and Other Communication Disorders
  • AphasiaBank: A shared database of multimedia interactions for the study of communication in aphasia. Access to the data in AphasiaBank is password protected and restricted to members of the AphasiaBank consortium group.
  • FluencyBank: FluencyBank is a shared database for the study of the development of fluency in both normal and disordered populations. Participants include normally-developing monolingual and bilingual children, children with disfluencies (CWD), adults with disfluencies (AWSD), and second language learners.
 
National Institute of Dental and Craniofacial Research
  • FaceBase: FaceBase is a NIDCR-funded data hub that hosts a variety of data generated through dental, oral, and craniofacial research using model organisms and humans. The data on offer spotlight high-throughput genetic, molecular, biological, imaging and computational techniques, as well as the database of 3D Facial Norms, developmental atlases and the Ontology of Craniofacial Development and Malformation (OCDM), Human Genome Analysis Interface (HGAI) and other resources.
 
National Institute of Diabetes and Digestive and Kidney Diseases
  • NIDDK Central Repository: The NIDDK Central Repository stores biosamples, genetic and other data collected in designated NIDDK-funded clinical studies.
  • NIDDK Information Network (DKnet): The NIDDK Information Network serves the needs of basic and clinical investigators by providing seamless access to large pools of data relevant to the mission of NIDDK. 
  • Nuclear Receptor Signaling Atlas (NURSA): The Nuclear Receptor Signaling Atlas (NURSA) is designed to foster the development of a comprehensive understanding of the structure, function, and role in disease of nuclear receptors (NRs) and coregulators. NURSA seeks to elucidate the roles played by NRs and coregulators in metabolism and the development of metabolic disorders (including type 2 diabetes, obesity, osteoporosis, and lipid dysregulation), as well as in cardiovascular disease, oncology, regenerative medicine and the effects of environmental agents on their actions.
 
National Institute of Environmental Health Sciences
  • Chemical Effects in Biological Systems (CEBS): The CEBS database houses data of interest to environmental health scientists. CEBS is a public resource, and has received depositions of data from academic, industrial, and governmental laboratories. 
 
National Institute of General Medical Sciences
  • Database of Interacting Proteins (DIP): The DIP database, a founding member of the International Molecular Exchange Consortium (IMEx: https://www.imexconsortium.org), catalogs experimentally determined interactions between proteins. It combines information from a variety of sources to create a single, consistent set of protein-protein interactions.
  • PhysioNet: The PhysioNet Resource is intended to stimulate current research and new investigations in the study of complex biomedical and physiologic signals.