A Curated List of Computational Biology Datasets Suitable for Machine Learning

LengerichLab/CompBioDatasetsForMachineLearning

Computational Biology Datasets Suitable For Machine Learning

This is a curated list of computational biology datasets that have been pre-processed for machine learning. The list is a work in progress; please submit a pull request for any dataset you would like to advertise!

Genotyping

| Name | Description | Comments |
| --- | --- | --- |
| The Cancer Genome Atlas | Variety of Cancer Data | most cancer types have 100-1000 samples |
| NIH GDC | Cancer, many types of genomic data | |
| UK Biobank | | |
| European Genome-Phenome Archive | | |
| METABRIC | The genomic profiles (somatic mutations [targeted sequencing], copy number alterations, and gene expression) of 2509 breast cancers. | |
| HapMap | | |
| 23andMe | 2280 Public Domain Curated Genotypes | |
| Mice | SNPs, 2000+ samples | 4 generations. It might be possible to learn a family structure out of the data. |
| Arabidopsis | SNPs, 100+ phenotypes | |
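
The Mice entry notes that it might be possible to learn a family structure from the data. As a minimal sketch of one way to start, assuming genotypes are coded as minor-allele counts (0, 1, 2) per SNP, the snippet below computes pairwise identity-by-state (IBS) similarity; closely related animals tend to score higher. The sample names and genotypes are made up for illustration and are not drawn from the dataset.

```python
from itertools import combinations

def ibs_similarity(g1, g2):
    """Mean identity-by-state between two genotype vectors coded 0/1/2.

    Each SNP contributes 1 - |a - b| / 2, so identical genotypes score 1
    and opposite homozygotes score 0. SNPs missing (None) in either
    sample are skipped.
    """
    shared = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not shared:
        raise ValueError("no SNPs genotyped in both samples")
    return sum(1 - abs(a - b) / 2 for a, b in shared) / len(shared)

# Hypothetical toy genotypes (one list per sample, one entry per SNP).
genotypes = {
    "parent": [0, 1, 2, 2, 0, 1],
    "child": [0, 1, 2, 1, 0, 1],
    "unrelated": [2, 0, 0, 0, 2, None],
}

# Similarity for every unordered pair of samples.
similarity = {
    (a, b): ibs_similarity(genotypes[a], genotypes[b])
    for a, b in combinations(genotypes, 2)
}

# Print pairs from most to least similar.
for pair, score in sorted(similarity.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

A similarity matrix like this could then be fed to a clustering method to propose family groups; real pedigree inference usually needs more careful relatedness estimators than raw IBS.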

Promoter-Enhancer Pairs

| Name | Description | Comments |
| --- | --- | --- |
| TargetFinder | ~100,000 DNA-DNA interaction pairs | |

Gene/Protein Expression

| Name | Description | Comments |
| --- | --- | --- |
| GEO | Main place for NCBI data | |
| ENCODE | Variety of assays to identify functional elements | |
| ArrayExpress | DNA sequencing, gene/protein expression, epigenetics | |
| Cytometry | Continuous flow cytometry data of 11 proteins and phospholipids; discretized and cleaned data available offline | Classical benchmark dataset for learning graphical models; contains known errors |
| Transcription factor binding | ChIP-Seq data on 12 TFs | |
| GTEx | Landmark study for eQTL analysis | |
| PharmacoGenomics DB | | |
| ProteomeXChange | | |
| BeatAML | Whole-exome sequencing, RNA sequencing, and analyses of ex vivo drug sensitivity | 672 tumour specimens collected from 562 patients |

Single-cell Data

| Name | Description | Comments |
| --- | --- | --- |
| Single-cell expression atlas | | |
| scPerturb | Single-cell perturbation-response datasets | Harmonized and preprocessed across 44 original datasets |

Regulatory Networks

| Name | Description | Comments |
| --- | --- | --- |
| TRRUST | Manually curated database of human transcriptional regulatory networks | |
| Yeast Network | 23-million yeast 2-hybrid experiments to investigate genetic interactions | |
| Perturb-Seq | Integrated model of perturbations, single-cell phenotypes, and epistatic interactions | |
| KEGG Metabolic Regulatory Network (Undirected) | 65554 instances, 29 attributes each | |
| KEGG Metabolic Regulatory Network (Directed) | 53414 instances, 24 attributes each | |
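
The two KEGG entries differ in whether edges are directed. A minimal sketch of that distinction, assuming a network is available as (source, target) pairs; the node names and edges below are hypothetical placeholders, not records from KEGG or TRRUST, and `build_adjacency` is an illustrative helper rather than part of any dataset's tooling.

```python
def build_adjacency(edges, directed=True):
    """Build an adjacency map from (source, target) pairs.

    With directed=False every edge is stored in both directions.
    """
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        # Make sure nodes with no outgoing edges still appear as keys.
        adjacency.setdefault(target, set())
        if not directed:
            adjacency[target].add(source)
    return adjacency

# Hypothetical regulator -> target edges.
edges = [("TF_A", "GENE_1"), ("TF_A", "GENE_2"), ("TF_B", "GENE_2")]

directed = build_adjacency(edges, directed=True)
undirected = build_adjacency(edges, directed=False)

print(sorted(directed["GENE_2"]))    # outgoing neighbours only
print(sorted(undirected["GENE_2"]))  # all neighbours
```

In the directed view a target gene has no outgoing edges, while the undirected view treats regulators and targets symmetrically, which matters for any model that relies on edge orientation.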

Images

| Name | Description | Comments |
| --- | --- | --- |
| The Cancer Imaging Archive | Extracts the images from the TCGA data | |
| Multiple Myeloma DREAM Challenge | Challenge to identify Multiple Myeloma Patients | |
| Breast Cancer Wisconsin (Diagnostic) Data Set | Predict whether the cancer is benign or malignant | |
| DDSM | Mammogram Database | |
| Kaggle Soft Tissue Sarcomas | Preprocessed subset of the TCIA study "Soft Tissue Sarcoma" | Segmentation task |
| Kaggle Cervical Cancer Screening | Classify cervix type from images | |
| CAMELYON17 | Pathology challenge - automated detection and classification of breast cancer metastases in whole-slide images of histological lymph node sections | |
| Grand Challenges | Datasets from biomedical image analysis competitions | |
| Breast Cancer MRI Dataset | Demographic, clinical, pathology, treatment, outcomes, and genomic data + MRI images | |

fMRI

| Name | Description | Comments |
| --- | --- | --- |
| ENIGMA Cerebellum | Goal: Examine the relationships between regional atrophy and motor and cognitive dysfunction | |
| Seizure Prediction | Goal: Classify EEG time series into pre-seizure vs. interictal (i.e., not preceding a seizure). | |
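
The Seizure Prediction task classifies EEG time series as pre-seizure or interictal. A common first step for this kind of problem is to cut each recording into fixed-length windows and compute a summary feature per window. The sketch below uses line length (the sum of absolute sample-to-sample differences) as one such feature; the window size, the synthetic signal, and the function names are illustrative assumptions, not part of the dataset or the challenge.

```python
import math

def sliding_windows(signal, size, step):
    """Yield consecutive windows of `size` samples, advancing by `step`."""
    for start in range(0, len(signal) - size + 1, step):
        yield signal[start:start + size]

def line_length(window):
    """Sum of absolute differences between neighbouring samples."""
    return sum(abs(b - a) for a, b in zip(window, window[1:]))

# Synthetic single-channel signal: a slow oscillation followed by a faster,
# larger one, standing in for a change in EEG activity.
slow = [math.sin(2 * math.pi * t / 50) for t in range(200)]
fast = [3 * math.sin(2 * math.pi * t / 10) for t in range(200)]
signal = slow + fast

# One feature value per non-overlapping 100-sample window.
features = [line_length(w) for w in sliding_windows(signal, size=100, step=100)]
print([round(f, 2) for f in features])
```

Per-window features like these can be stacked into a feature matrix and passed to any standard classifier; real pipelines typically use several channels and several features per window.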

Electronic Medical Records

| Name | Description | Comments |
| --- | --- | --- |
| MIMIC | 59,000 EHRs | |
| UCI Diabetes | Data from 130 US hospitals, 1999-2008 | |
| i2b2 | Clinical notes only, designed for NLP tasks | |
| PhysioNet | | |
| Metadata Acquired from Clinical Case Reports (MACCRs) | 3,100 curated clinical case reports spanning 15 disease groups and more than 750 reports of rare diseases | |
| eICU | 200k EHRs | |
| All of Us | >250k EHRs, some genomic data | |
| PMC-Patients | 167k patient summaries with 3.1M patient-article relevance annotations and 293k patient-patient similarity annotations | |

Radiographs

| Name | Description | Comments |
| --- | --- | --- |
| CheXpert | 200k chest radiographs | Competition and leaderboard associated |
| MIMIC-CXR | ~400k chest x-rays, 14 labels | Data on PhysioNet |
| PadChest | 160k chest x-rays, 174 different findings | |

Protein-Protein Interactions

| Name | Description | Comments |
| --- | --- | --- |
| HINT (High-quality INTeractomes) | Curated compilation of high-quality protein-protein interactions from 8 interactome resources | |

Longitudinal Studies

| Name | Description | Comments |
| --- | --- | --- |
| National Population Health Survey | Longitudinal survey that collects health information every two years. | |

Protein Structure

| Name | Description | Comments |
| --- | --- | --- |
| ProteinNet | Standardized dataset for learning protein structure. | Includes sequences, structures, alignments, PSSMs, and standardized train/test/valid splits. |

Natural Language Data

| Name | Description | Comments |
| --- | --- | --- |
| BioASQ | Abstracts of medical articles (from PubMed); ontologies of medical concepts. | Tasks: MLC, QA. |
| Cases | Articles from medical case studies. | |
| UPMC Pathology | UPMC Pathology case studies. | |
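
BioASQ lists multi-label classification (MLC) among its tasks, where each abstract can carry several concept labels at once. A minimal sketch of micro-averaged F1, one commonly used way to score such predictions, is shown below. The label sets are invented examples, and this is not the official BioASQ evaluation code.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over documents with sets of gold and predicted labels."""
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        tp += len(gold_labels & pred_labels)   # labels predicted and correct
        fp += len(pred_labels - gold_labels)   # labels predicted but wrong
        fn += len(gold_labels - pred_labels)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold and predicted label sets for three abstracts.
gold = [{"Neoplasms", "Humans"}, {"Mice"}, {"Humans", "Diabetes Mellitus"}]
predicted = [{"Neoplasms"}, {"Mice", "Humans"}, {"Humans", "Diabetes Mellitus"}]

score = micro_f1(gold, predicted)
print(round(score, 3))
```

Micro-averaging pools counts across all documents, so frequent labels dominate the score; macro-averaging per label is an alternative when rare labels matter.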

Therapeutics

| Name | Description | Comments |
| --- | --- | --- |
| Therapeutic Data Commons | Many preprocessed datasets for therapeutic discovery, including target discovery, activity modeling, efficacy and safety, and manufacturing. | Available as Python modules. |
| Cancer Omics Drug Experiment Response Dataset | Molecular datasets paired with corresponding drug sensitivity data | Seeks to standardize datasets of cancer drug responses into a standard schema |
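
To illustrate what standardizing drug response data into a common schema can involve, here is a minimal sketch that maps records from two differently shaped sources into one record type. Everything in it is hypothetical: the `DrugResponse` fields, the source layouts, the column names, and the values are invented for illustration and do not reflect the schema of any dataset listed above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DrugResponse:
    """Hypothetical common record; not the schema of any real dataset."""
    sample_id: str
    drug: str
    ic50_um: float  # half-maximal inhibitory concentration, micromolar

def from_source_a(row):
    """Hypothetical source A already reports IC50 in micromolar."""
    return DrugResponse(row["cell_line"], row["compound"].lower(), float(row["ic50_uM"]))

def from_source_b(row):
    """Hypothetical source B reports IC50 in nanomolar; convert to micromolar."""
    return DrugResponse(row["sample"], row["drug_name"].lower(), float(row["ic50_nM"]) / 1000)

# Two invented rows describing the same drug in different layouts and units.
records = [
    from_source_a({"cell_line": "CL-1", "compound": "DrugX", "ic50_uM": "0.5"}),
    from_source_b({"sample": "CL-2", "drug_name": "drugx", "ic50_nM": "250"}),
]
for record in records:
    print(record)
```

Once every source is mapped to the same fields and units, records from different studies can be concatenated and compared directly.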
