Protein Set Enrichment Analysis for Ligand Discovery

Perform and explore enrichment analyses based on Ligand Discovery primary screening data

Usage

The repository contains a large amount of orthogonal data collected from the public domain. The data corresponds to protein annotations of multiple types, including (but not limited to):

Sequence

  • Domain, family, phylogeny groups
  • Active site, binding site

Functions

  • Protein class
  • Molecular function

Processes and pathways

  • Biological processes
  • Pathways
  • Complexes

Localization

  • Subcellular localization
  • Cellular component

Drugs and diseases

  • Drug target classes
  • Target druggability
  • Disease category

All annotations are available here.

Primitive plots

For each fragment, we ranked proteins by their Log2FC (z-normalized) and performed a ranksum (GSEA-like) enrichment test across all annotations. The figure below shows some annotations found to be enriched for fragment C001.

primitive-plots

Conventional Ranksum enrichment analysis.
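The rank-based test described above can be illustrated with a minimal sketch. It assumes a Mann-Whitney U statistic with a normal approximation and no tie correction, which may differ from the exact procedure used in this repository. The function names and the toy protein identifiers (`P1` to `P10`) are hypothetical.

```python
import math
from statistics import mean, pstdev


def z_normalize(values):
    """Center values to mean 0 and scale them to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def ranksum_enrichment(scores, annotated):
    """Rank-sum (Mann-Whitney U) test of annotated vs. non-annotated proteins.

    scores: dict mapping protein -> z-normalized Log2FC
    annotated: set of proteins carrying the annotation
    Returns (z, two-sided P-value) from the normal approximation.
    Assumption: scores contain no ties (tied scores get arbitrary consecutive ranks).
    """
    # Rank proteins from lowest to highest score (rank 1 = lowest).
    ordered = sorted(scores, key=scores.get)
    ranks = {protein: i + 1 for i, protein in enumerate(ordered)}
    in_set = [p for p in ordered if p in annotated]
    n1 = len(in_set)
    n2 = len(ordered) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups need at least one protein")
    # U statistic of the annotated group, derived from its rank sum.
    u = sum(ranks[p] for p in in_set) - n1 * (n1 + 1) / 2
    mu_u = n1 * n2 / 2
    sigma_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu_u) / sigma_u
    # Two-sided P-value of a standard normal variable.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Toy data (made up): ten proteins with increasing Log2FC.
names = [f"P{i}" for i in range(1, 11)]
scores = dict(zip(names, z_normalize([float(i) for i in range(1, 11)])))
z, p_value = ranksum_enrichment(scores, {"P8", "P9", "P10"})
print(f"z = {z:.2f}, P-value = {p_value:.3f}")
```

A positive `z` indicates that annotated proteins tend to sit at the high-Log2FC end of the ranking.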

In addition, we performed hypergeometric tests based on binarized data, as well as on the top-25, 50, 100, 250 and 500 proteins. The protein universe used was the basal proteome of HEK293T. We designed a primitive version of the Streamlit App to navigate the large number of enrichment results. Two limitations became apparent:

  • A panel displaying a large number of top enrichment results was necessary in order to extract biological insights.
  • Promiscuity of proteins propagates to promiscuity of enrichment results, resulting in frequently occurring annotation terms.
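For reference, the hypergeometric test on a top-N cut mentioned above can be sketched as follows. This is an illustrative, standard-library-only version and not necessarily the implementation used in the repository. The function names are hypothetical, and `universe` stands in for the HEK293T basal proteome.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favorable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favorable / total


def top_n_enrichment(ranked_proteins, annotated, universe, top_n):
    """Over-representation of an annotation among the top-N ranked proteins.

    ranked_proteins: proteins sorted from highest to lowest Log2FC
    annotated: set of proteins carrying the annotation
    universe: set of all proteins considered detectable (assumed background)
    Returns (overlap, P-value).
    """
    # Only proteins inside the universe are counted on either side.
    top = [p for p in ranked_proteins if p in universe][:top_n]
    annotated_in_universe = annotated & universe
    overlap = sum(1 for p in top if p in annotated_in_universe)
    p_value = hypergeom_sf(
        overlap, len(universe), len(annotated_in_universe), len(top)
    )
    return overlap, p_value


# Toy data (made up): 20 proteins, 4 annotated, 3 of them in the top 5.
ranked = [f"P{i}" for i in range(1, 21)]
overlap, p_value = top_n_enrichment(
    ranked, {"P1", "P2", "P4", "P15"}, set(ranked), top_n=5
)
print(overlap, round(p_value, 4))
```

Repeating the call with `top_n` set to 25, 50, 100, 250 and 500 would give one P-value per cut.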

Advanced plots

To address the above limitations, we provide the following two plot types.

Leaderboard

The leaderboard below corresponds to fragment C170. Vacuolar proteins (a GO Cellular Component term) are enriched for this fragment. In the leading edge of this enrichment result, we find TMEM59, TPP1, etc. The normalized enrichment score (NES) is 5.95, and the P-value is 2.9e-09. High Log2FC values for the vacuolar proteins are shown in red, and lower Log2FCs in blue. The dot on the right is colored by category (in this case, localization).

leaderboard

Leaderboard plot. The leaderboard can have an arbitrary length (10, 50, 100...).
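As a rough illustration of how a leaderboard of arbitrary length could be assembled from a table of enrichment results, consider the sketch below. The `EnrichmentResult` fields, the `leaderboard` function, the P-value cutoff and the toy rows are all assumptions made for this example and do not reflect the app's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnrichmentResult:
    term: str        # annotation term
    category: str    # e.g. localization, functions, sequence
    nes: float       # normalized enrichment score
    p_value: float


def leaderboard(results, length=10, max_p_value=0.05):
    """Return the `length` significant results with the highest NES."""
    # Assumption: non-significant results are dropped before ranking.
    significant = [r for r in results if r.p_value <= max_p_value]
    return sorted(significant, key=lambda r: r.nes, reverse=True)[:length]


# Toy rows (made up) for a single fragment.
results = [
    EnrichmentResult("term A", "localization", 5.9, 3e-9),
    EnrichmentResult("term B", "functions", 2.1, 0.01),
    EnrichmentResult("term C", "sequence", 4.0, 0.2),
    EnrichmentResult("term D", "localization", 3.2, 0.001),
]
for row in leaderboard(results, length=2):
    print(row.term, row.category, row.nes)
```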

In-depth plots

Here we focus on one particular annotation (Vacuolar Lumen) and fragment (C175).

indepth

In-depth plot. (Left) The promiscuity plot highlights proteins in the annotation (colored). Filled circles correspond to the leading edge. Color denotes promiscuity (blue) or specificity (red). (Center) On top, the ranksum plot, including circles denoting the result of a hypergeometric test at top-25, top-50, top-250 and top-500. Empty dots denote non-significant results (P-value > 0.05). At the bottom, the top-10 proteins in the leading edge, colored and located by promiscuity. (Right) In the upper-left panel, the expected normalized enrichment score (NES) of this annotation across other fragments is shown (mean and standard deviation), along with fragments of the same pull down (in black). In the upper-right panel, a griddified projection of annotations is shown (colored by promiscuity), in order to locate the annotation with respect to the rest of the annotations. In the lower-left panel, the number of proteins in the leading edge at different degrees of promiscuity is shown. In the lower-right panel, proteins are projected (and griddified) by sequence similarity, and the leading edge proteins are highlighted (colored by promiscuity).

The current Streamlit App capitalizes on these two display items to provide informative navigation of the enrichment results. The following is a mockup of the Streamlit App:

mockup

Mockup of the Streamlit Protein Set Enrichment Analysis App. The two main pages are highlighted. On the left, we sketch the leaderboard page, focused on a given fragment. On the right, we sketch the focus page, specific to a fragment-category pair.

The case of C310

Below we use the case of fragment C310 to illustrate the pages of the protein enrichment app.

Overview page

Table view, filtering for SQSTM1:

screenshot-1

Leaderboard page, table view, where SQSTM1 is used as a filtering gene in the leading edge.

Plot view:

screenshot-2

Plot view of the leaderboard page.

Detailed page

Table view, focused on localization terms:

screenshot-3

Enriched terms, in a table view. At the bottom, there is the possibility to explore proteins.

Basic plots:

screenshot-4

Basic enrichment plots. Fill color of the curve indicates strength of enrichment signal.

Advanced plots:

screenshot-5

Advanced enrichment plots. Please see above for interpretation.

Installation and running the app

Download data

First, you have to download a few big data files. These files need to be unzipped in the protein-set-enrichment-analysis/ folder.
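If you prefer to script the unzipping step, a small helper along these lines may be convenient. It is only a sketch: the `unzip_all` name and the idea of a separate download directory are assumptions, and the archive names are not specified here, so every `.zip` file found in the directory is extracted.

```python
import zipfile
from pathlib import Path


def unzip_all(download_dir, target_dir):
    """Extract every .zip archive found in download_dir into target_dir.

    Returns the names of the archives that were extracted.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(download_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(archive.name)
    return extracted
```

For example, `unzip_all("downloads", "protein-set-enrichment-analysis")` would extract the archives from a hypothetical `downloads/` directory into the repository folder.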

App-only installation and run

This app has very few dependencies. You can install them as follows:

pip install -r requirements.txt

Then, you can simply run the app as follows:

streamlit run app/app.py

Non-cached installation and run

In case you don't want to use the cached data, we have a much more complete version of the app that dynamically creates plots, etc. We do not recommend using this version of the app unless you are a developer of the Ligand Discovery project.

To install the dynamic version of the app, we recommend using Conda. Make sure a C++ compiler is installed:

conda install -c conda-forge cxx-compiler

Install the necessary dependencies:

pip install -r requirements_dynamic.txt

Finally, you can run the app as follows:

streamlit run app/app_dynamic.py

Other

You can also run an app for quick exploration, such as identifying good enrichment signals for further inspection.

streamlit run app/explore.py

About

This project was carried out at the Georg Winter Lab, based at CeMM, Vienna.