Welcome to the repository of HYFA (Hypergraph Factorisation for Multi-Tissue Gene Expression Imputation).
## Overview of HYFA
HYFA processes gene expression from a number of collected tissues (e.g. accessible tissues) and infers the transcriptomes of uncollected tissues.
## HYFA Workflow
- The model receives as input a variable number of gene expression samples $x^{(k)}_i$ corresponding to the collected tissues $k \in \mathcal{T}(i)$ of a given individual $i$. The samples $x^{(k)}_i$ are fed through an encoder that computes low-dimensional representations $e^{(k)}_{ij}$ for each metagene $j \in 1..M$. A metagene is a latent, low-dimensional representation that captures certain gene expression patterns of the high-dimensional input sample.
- These representations are then used as hyperedge features in a message passing neural network that operates on a hypergraph. In the hypergraph representation, each hyperedge labelled with $e^{(k)}_{ij}$ connects an individual $i$ with metagene $j$ and tissue $k$ if tissue $k$ was collected for individual $i$, i.e. $k \in \mathcal{T}(i)$. Through message passing, HYFA learns factorised representations of individual, tissue, and metagene nodes.
- To infer the gene expression of an uncollected tissue $u$ of individual $i$, the corresponding factorised representations are fed through a multilayer perceptron (MLP) that predicts low-dimensional features $e^{(u)}_{ij}$ for each metagene $j \in 1..M$. HYFA finally processes these latent representations through a decoder that recovers the uncollected gene expression sample $\hat{x}^{(u)}_i$.
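The workflow above can be sketched end-to-end with toy, untrained components. This is a minimal NumPy illustration of the data flow only (encoder → hyperedge features → aggregation → MLP → decoder); all dimensions, weights, and function names here are illustrative and are not HYFA's actual API or architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
G, M, D = 50, 4, 8  # genes, metagenes, latent dimension per metagene

# Toy encoder: maps a G-dim expression sample to M metagene embeddings of size D.
W_enc = rng.normal(size=(G, M * D))
def encode(x):
    return (x @ W_enc).reshape(M, D)  # row j = embedding e_j of metagene j

# Collected tissues T(i) for one individual: tissue id -> expression sample.
collected = {0: rng.normal(size=G), 2: rng.normal(size=G)}

# Hyperedge features e^{(k)}_{ij}: one embedding per (tissue k, metagene j).
hyperedges = {k: encode(x) for k, x in collected.items()}

# One round of message passing, reduced here to mean aggregation over the
# individual's hyperedges (HYFA also maintains tissue and metagene node states).
individual_repr = np.mean([e for e in hyperedges.values()], axis=0)  # (M, D)

# Toy MLP predicting the latent features of an uncollected tissue u.
W1, W2 = rng.normal(size=(D, D)), rng.normal(size=(D, D))
def mlp(e):
    return np.maximum(e @ W1, 0.0) @ W2  # ReLU hidden layer

e_u = mlp(individual_repr)  # (M, D): predicted metagene features for tissue u

# Toy decoder: recover the G-dim expression sample of tissue u.
W_dec = rng.normal(size=(M * D, G))
x_u_hat = e_u.reshape(-1) @ W_dec  # imputed expression, shape (G,)
```

In the real model the aggregation step is a learned hypergraph message passing layer and the encoder/decoder are trained jointly; the sketch only shows how the shapes fit together.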
## Installation

- Clone this repository:

  ```bash
  git clone https://github.com/rvinas/HYFA.git
  ```

- Install the dependencies via the following command:

  ```bash
  pip install -r requirements.txt
  ```

The installation typically takes a few minutes.
## Data

To download the processed GTEx data, please follow these steps:

```bash
wget -O data/GTEx_data.csv.zip https://figshare.com/ndownloader/files/40208074
wget -O data/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt https://storage.googleapis.com/gtex_analysis_v8/annotations/GTEx_Analysis_v8_Annotations_SubjectPhenotypesDS.txt
unzip data/GTEx_data.csv.zip -d data
```

To download the pre-trained model, please run this command:

```bash
wget -O data/normalised_model_default.pth https://figshare.com/ndownloader/files/40208551
```
- Prepare your dataset:
  - By default, the script `train_gtex.py` loads a dataset from a CSV file (`GTEX_FILE`) with the following format:
    - Columns are genes and rows are samples.
    - Entries correspond to normalised gene expression values.
    - The first row contains gene identifiers.
    - The first column contains donor identifiers. The file might contain multiple rows per donor.
    - An extra column `tissue` denotes the tissue from which the sample was collected. The combination of donor and tissue identifiers is unique.
  - The metadata is loaded from a separate CSV file (`METADATA_FILE`; see function `GTEx_metadata` in `train_gtex.py`). Rows correspond to donors and columns to covariates. By default, the script expects at least two columns: `AGE` (integer) and `SEX` (integer).

  Example of gene expression CSV file:

  ```
  , GENE1, GENE2, GENE3, tissue
  INDIVIDUAL1, 0.0, 0.1, 0.2, heart
  INDIVIDUAL1, 0.0, 0.1, 0.2, lung
  INDIVIDUAL1, 0.0, 0.1, 0.2, breast
  INDIVIDUAL2, 0.0, 0.1, 0.2, kidney
  INDIVIDUAL3, 0.0, 0.1, 0.2, kidney
  ```

  Example of metadata CSV file:

  ```
  , AGE, SEX
  INDIVIDUAL1, 34, 0
  INDIVIDUAL2, 55, 1
  INDIVIDUAL3, 49, 1
  ```

  See the notebook `hyfa_tutorial.ipynb` for an overview of the data format and main features of HYFA.
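Before training, it can help to check that your CSV matches the expected layout. Below is a minimal stdlib-only sketch of such a check; the inline data mirrors the example above, and the check itself is illustrative rather than part of HYFA.

```python
import csv
import io

# In practice, read from your GTEX_FILE path; a small inline example is used here.
gtex_csv = """\
,GENE1,GENE2,GENE3,tissue
INDIVIDUAL1,0.0,0.1,0.2,heart
INDIVIDUAL1,0.0,0.1,0.2,lung
INDIVIDUAL2,0.0,0.1,0.2,kidney
"""

rows = list(csv.reader(io.StringIO(gtex_csv)))
header, body = rows[0], rows[1:]

# The last column must be the tissue label; the first column holds donor IDs.
assert header[-1] == "tissue"

# Each (donor, tissue) combination must be unique.
pairs = [(r[0], r[-1]) for r in body]
assert len(pairs) == len(set(pairs)), "duplicate (donor, tissue) pair"

# All expression entries must parse as floats (normalised values).
for r in body:
    for value in r[1:-1]:
        float(value)

print("format OK:", len(body), "samples")
```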
- Run the script `train_gtex.py` to train HYFA. This uses the default hyperparameters from `config/default.yaml`. After training, the model will be stored in your current working directory. We recommend training the model on a GPU machine (training takes between 15 and 30 minutes on an NVIDIA TITAN Xp).
- Once the model is trained, evaluate your results via the notebook `evaluate_GTEx_v8_normalised.ipynb`.
## Repository structure

- `hyfa_tutorial.ipynb`: Tutorial of the main features of HYFA.
- `train_gtex.py`: Main script to train the multi-tissue imputation model on normalised GTEx data.
- `evaluate_GTEx_v8_normalised.ipynb`: Analysis of multi-tissue imputation quality on normalised data (i.e. model trained via `train_gtex.py`).
- `evaluate_GTEx_v9_signatures_normalised.ipynb`: Analysis of cell-type signature imputation (i.e. fine-tunes model on GTEx v9).
- `src/data.py`: Data object encapsulating multi-tissue gene expression.
- `src/dataset.py`: Dataset that takes care of processing the data.
- `src/data_utils.py`: Data utilities.
- `src/hnn.py`: Hypergraph neural network.
- `src/hypergraph_layer.py`: Message passing on hypergraphs.
- `src/hnn_utils.py`: Hypergraph model utilities.
- `src/metagene_encoders.py`: Model transforming gene expression to metagene values.
- `src/metagene_decoders.py`: Model transforming metagene values to gene expression.
- `src/train_utils.py`: Train/eval loops.
- `src/distribions.py`: Count data distributions.
- `src/losses.py`: Loss functions for different data likelihoods.
- `src/pathway_utils.py`: Utilities to retrieve KEGG pathways.
- `src/ct_signature_utils.py`: Utilities for inferring cell-type signatures.
## Citation

If you use this code for your research, please cite our paper:
```bibtex
@article{vinas2023hypergraph,
  title={Hypergraph factorization for multi-tissue gene expression imputation},
  author={Vi{\~n}as, Ramon and Joshi, Chaitanya K and Georgiev, Dobrik and Lin, Phillip and Dumitrascu, Bianca and Gamazon, Eric R and Li{\`o}, Pietro},
  journal={Nature Machine Intelligence},
  pages={1--15},
  year={2023},
  publisher={Nature Publishing Group UK London}
}
```