All required packages are listed in package-list.txt and can be installed via conda. Model training requires a GPU.

Pretrained models using the Human Protein Atlas:

./checkpoints contains the model parameters for all 22 epochs.
All results in the paper are based on the model parameters in “splice_isoform_dataset_cell_line_and_gene_split_full-epoch=01-val_combined_loss=0.18.ckpt”.
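
The checkpoint filename encodes the epoch and the validation loss, so a specific epoch can be located programmatically. The following is a minimal sketch, assuming every file in ./checkpoints follows the same `epoch=NN-val_combined_loss=X.XX.ckpt` pattern as the file named above. The helper names `parse_checkpoint_name` and `find_checkpoint` are hypothetical and not part of the repository.

```python
import re
from pathlib import Path

# Pattern taken from the checkpoint filename quoted above; other files in
# ./checkpoints are assumed (not verified) to follow the same convention.
CKPT_PATTERN = re.compile(r"epoch=(\d+)-val_combined_loss=(\d+(?:\.\d+)?)\.ckpt$")


def parse_checkpoint_name(name):
    """Return (epoch, val_combined_loss) parsed from a filename, or None."""
    match = CKPT_PATTERN.search(name)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


def find_checkpoint(directory, epoch):
    """Return the first checkpoint path in `directory` saved at `epoch`, or None."""
    for path in sorted(Path(directory).glob("*.ckpt")):
        parsed = parse_checkpoint_name(path.name)
        if parsed is not None and parsed[0] == epoch:
            return path
    return None


if __name__ == "__main__":
    paper_ckpt = ("splice_isoform_dataset_cell_line_and_gene_split_full-"
                  "epoch=01-val_combined_loss=0.18.ckpt")
    print(parse_checkpoint_name(paper_ckpt))  # (1, 0.18)
```

Calling `find_checkpoint("./checkpoints", 1)` would then return the path of the epoch-01 file if it is present.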

Compiling & accessing the dataset:

The repository is configured to use the free MongoDB service to store and serve all data for training and validation. MongoDB scales well to the hundreds of thousands of immunofluorescence images and ESM-2 feature binaries, and it natively supports hierarchical data structures, which is crucial for clean data handling.

Whenever the dataset is updated or accessed, the MongoDB server must be up and running. To start it, run the command sudo mongod --dbpath <MONGO_PATH>
It is recommended to also download MongoDB Compass to visualize the datasets directly from a GUI (screenshots below show the hierarchical structure of the dataset as seen from Compass).
From the Compass UI it is possible to create a database directly (https://www.mongodb.com/docs/compass/current/databases/). Before any data can be downloaded or accessed, you must first create a database named ‘hpa’.
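
Before running the download or training scripts, it can help to confirm that the server is reachable and that the ‘hpa’ database exists. The following is a minimal sketch, assuming the `pymongo` driver is installed and the server listens on the default local address `mongodb://localhost:27017`; neither assumption is stated in this README, and the function names are hypothetical. Note that MongoDB only lists a database once it contains at least one collection.

```python
def has_database(database_names, name="hpa"):
    """Return True if `name` is among the database names reported by the server."""
    return name in set(database_names)


def check_hpa_database(uri="mongodb://localhost:27017", name="hpa"):
    """Ping the MongoDB server at `uri` and report whether `name` exists.

    The URI is an assumed default; adjust it to match your mongod setup.
    """
    # Imported here so the pure helper above works without pymongo installed.
    from pymongo import MongoClient
    from pymongo.errors import ConnectionFailure

    client = MongoClient(uri, serverSelectionTimeoutMS=2000)
    try:
        client.admin.command("ping")  # raises if no server answers in time
    except ConnectionFailure as exc:
        raise RuntimeError(
            "MongoDB is not reachable; start it with mongod --dbpath <MONGO_PATH>"
        ) from exc
    return has_database(client.list_database_names(), name)


if __name__ == "__main__":
    print(has_database(["admin", "config", "local", "hpa"]))  # True
```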

Dataset preparation:

Download the public tabular subcellular_localization dataset from HPA (https://www.proteinatlas.org/download/subcellular_location.tsv.zip). Note: the tabular dataset is used only as an index of genes for the rest of the pipeline to reference. The rest of its contents, including the localization labels, are not used because they describe only coarse gene-level information.
To populate the proteoform-level information (localization labels, cell images, amino acid sequence, ESM-2 representation), refer to dataset/download_data.py. The Python script is ready to run but may take a few days to finish pulling data from the web.
To create dataset splits for training and evaluation, refer to dataset/create_datasets.py. A number of different training/evaluation splits are offered.
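
The gene index mentioned in the first step can be read from the TSV with the standard library. The following is a minimal sketch, assuming the gene identifier column is named `Gene`; check the header of your downloaded file, since the column name is an assumption here. The function `read_gene_index` is hypothetical and is not the code used in dataset/download_data.py, and the identifiers in the demo are placeholders, not real data.

```python
import csv
import io


def read_gene_index(tsv_file, id_column="Gene"):
    """Return the unique gene identifiers from a tab-separated file, in file order.

    `id_column` is an assumed column name; all other columns are ignored,
    mirroring the note above that only the gene index is used.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t")
    if reader.fieldnames is None or id_column not in reader.fieldnames:
        raise KeyError(f"column {id_column!r} not found in TSV header")
    seen = set()
    genes = []
    for row in reader:
        gene = row[id_column]
        if gene and gene not in seen:
            seen.add(gene)
            genes.append(gene)
    return genes


if __name__ == "__main__":
    # Placeholder rows for illustration only.
    demo = io.StringIO(
        "Gene\tGene name\nEXAMPLE_ID_1\tNAME_1\nEXAMPLE_ID_2\tNAME_2\n"
    )
    print(read_gene_index(demo))  # ['EXAMPLE_ID_1', 'EXAMPLE_ID_2']
```

For the real file, pass an open handle instead, for example `open("subcellular_location.tsv", newline="", encoding="utf-8")` after unzipping the download.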

Visualization of train/test data

datasplits_matrix_visualization.ipynb (Figure 2a)
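
As a rough illustration of what a split over both genes and cell lines can look like (the kind of split suggested by the checkpoint name `cell_line_and_gene_split`), the sketch below holds out a fraction of genes and a fraction of cell lines and keeps for training only samples that involve neither. This is an assumption-laden toy: the actual splits are defined in dataset/create_datasets.py and may differ, and the function name, the 20% fraction, and the seed are all hypothetical.

```python
import random


def split_by_gene_and_cell_line(samples, holdout_fraction=0.2, seed=0):
    """Split (gene, cell_line) pairs so held-out genes and cell lines never
    appear in the training set.

    Returns (train, heldout, held_genes, held_cell_lines).
    """
    rng = random.Random(seed)
    genes = sorted({gene for gene, _ in samples})
    cell_lines = sorted({line for _, line in samples})
    n_genes = max(1, round(len(genes) * holdout_fraction))
    n_lines = max(1, round(len(cell_lines) * holdout_fraction))
    held_genes = set(rng.sample(genes, n_genes))
    held_lines = set(rng.sample(cell_lines, n_lines))
    train = [s for s in samples
             if s[0] not in held_genes and s[1] not in held_lines]
    heldout = [s for s in samples
               if s[0] in held_genes or s[1] in held_lines]
    return train, heldout, held_genes, held_lines


if __name__ == "__main__":
    # Placeholder gene and cell-line names for illustration only.
    samples = [(f"gene{i}", f"line{j}") for i in range(5) for j in range(5)]
    train, heldout, _, _ = split_by_gene_and_cell_line(samples)
    print(len(train), len(heldout))  # 16 9
```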

Model training:

The model is defined in src/model:

Image inpainting: src/model/nn_unet.py
Localization prediction using the sequence representation: src/model/nn_multilabel_mlp.py
Learning protein sequence representation: src/model/nn_light_attention.py
Full model: src/model/full_model.py

For model training, run ‘python train.py’. Training progress can be monitored with TensorBoard.
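
The localization predictor listed above is described as multilabel, meaning a protein can be assigned to several compartments at once. The sketch below shows the usual way such outputs are turned into labels: an independent sigmoid per compartment followed by a threshold. It is a generic illustration, not the code in src/model/nn_multilabel_mlp.py; the compartment names, logit values, and the 0.5 threshold are assumptions.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_localizations(logits, threshold=0.5):
    """Return every label whose sigmoid probability reaches `threshold`.

    Each label is scored independently, so zero, one, or many labels
    may be returned for the same protein.
    """
    return [label for label, logit in logits.items()
            if sigmoid(logit) >= threshold]


if __name__ == "__main__":
    # Hypothetical logits for one protein.
    example = {"nucleoplasm": 2.0, "cytosol": 0.5, "mitochondria": -3.0}
    print(predict_localizations(example))  # ['nucleoplasm', 'cytosol']
```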

Visualize model performance and protein localization variability

The following plots are generated using plotLoss.ipynb and plotLoss_holdout2.ipynb (the same procedure is applied separately to training/Holdout 1 and to Holdout 2):
- Computing protein image prediction losses for all held-out proteins in the test set of Holdout 1 and in Holdout 2 (Figure 2b)
- Plotting examples of protein image predictions (Figure 2c)
- Plotting predicted intra-nuclear proportions (Figure 3a)
- Computing the variability of intra-nuclear proportions across cell lines (Figure 3)
- Computing the variability of intra-nuclear proportions across single cells of the same cell lines (Figure 4)
- Gene ontology of the most variable proteins: go_variableProteins.ipynb (Training and Holdout 1); go_variableProteins_holdout2.ipynb (Holdout 2)
spectral_bleed_through.ipynb - guided backprop attribution of model attention and a visualization of model predictions across different proteoforms and cell lines (Supplementary Figure 4b)
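
To make the quantities in the list above concrete, the sketch below computes an intra-nuclear proportion as the fraction of protein signal that falls inside a nuclear mask, and summarizes variability across cell lines as the standard deviation of per-cell-line means. Both definitions are assumptions chosen for illustration; the notebooks may define and compute these quantities differently, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def intranuclear_proportion(protein, nucleus_mask):
    """Fraction of total protein intensity located inside the nucleus mask.

    `protein` and `nucleus_mask` are same-shaped 2D lists; mask entries
    are truthy inside the nucleus.
    """
    total = 0.0
    inside = 0.0
    for protein_row, mask_row in zip(protein, nucleus_mask):
        for intensity, in_nucleus in zip(protein_row, mask_row):
            total += intensity
            if in_nucleus:
                inside += intensity
    if total == 0:
        raise ValueError("protein image contains no signal")
    return inside / total


def variability_across_cell_lines(proportions_by_cell_line):
    """Population standard deviation of the per-cell-line mean proportions."""
    per_line_means = [mean(values) for values in proportions_by_cell_line.values()]
    return pstdev(per_line_means)


if __name__ == "__main__":
    protein = [[1, 1], [1, 1]]
    mask = [[1, 0], [0, 0]]
    print(intranuclear_proportion(protein, mask))  # 0.25
```

The same `variability_across_cell_lines` helper could be applied to single cells of one cell line by keying the dictionary on cell identifiers instead.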

Evaluations of the experimental validation

plotLoss_experiment.ipynb (Figure 5)

Visualization of cell and protein representations (Figure 6)

latent_proteoform_representation_visualization.ipynb - PCA of protein sequence representations (Figure 6a left panel, Supplementary Figure 12)
latent_proteoform_nucCytosol.ipynb - PCA of proteins in nucleoplasm, cytosol, or both (Figure 6a right panel)
shap.ipynb - Shapley analysis (Figure 6b)
jointEmbedding.ipynb - Image representation (Figure 6c) and joint representation (Figure 6d)
latent_cell_representation_visualization.ipynb - PCA of image representations for 36 cell lines (Supplementary Figure 13)
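
For readers unfamiliar with the Shapley analysis listed above, the sketch below computes exact Shapley values for a toy value function by averaging each feature's marginal contribution over all orderings. It only illustrates the concept and is not the procedure in shap.ipynb; exact enumeration is feasible only for a handful of features, and the feature names and weights are hypothetical.

```python
from itertools import permutations


def exact_shapley_values(features, value):
    """Exact Shapley values for `features` under the set function `value`.

    `value` takes a frozenset of features and returns a number. The cost
    grows factorially with the number of features.
    """
    totals = {feature: 0.0 for feature in features}
    orderings = list(permutations(features))
    for ordering in orderings:
        coalition = frozenset()
        previous = value(coalition)
        for feature in ordering:
            coalition = coalition | {feature}
            current = value(coalition)
            totals[feature] += current - previous  # marginal contribution
            previous = current
    return {feature: total / len(orderings) for feature, total in totals.items()}


if __name__ == "__main__":
    # Toy additive model: each feature contributes a fixed weight.
    weights = {"a": 1.0, "b": 2.0, "c": 3.0}
    additive = lambda subset: sum(weights[f] for f in subset)
    print(exact_shapley_values(list(weights), additive))
    # {'a': 1.0, 'b': 2.0, 'c': 3.0}
```

For an additive value function, each feature's Shapley value equals its own weight, which makes this toy case easy to check by hand.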
