October 2021
This repo contains the code and data used to generate the figures for the 2022 OpenCell paper (preprint here). It was developed by Keith Cheveralls and Kibeom Kim in the Leonetti group at the Chan Zuckerberg Biohub.
This directory contains various external and processed datasets required to make the figures. Datasets from external sources are found in the data/external/ subdirectory; the remaining datasets are all original datasets generated by or derived from the OpenCell project. Note that some datasets, including the full IP-MS interaction dataset and the matrix of target localization encodings, are too large to host on GitHub. These datasets are available on FigShare.
These are Jupyter notebooks that document how the figures were generated using the Python modules in scripts/. The notebooks used for each figure panel are specified below. These notebooks are the primary documentation for how the scripts in scripts/ were used for analysis and figure generation.
These are Python modules that contain the bulk of the code used for data analysis and figure generation. They are used directly by the Jupyter notebooks discussed above. Please note that these scripts are explicitly written for, and specific to, the OpenCell project. They are not intended to form a stand-alone or general-purpose Python package.
- scripts/annotation_comparisons/: These modules are used to compare manual localization annotations from OpenCell to those from the HPA and from a yeast dataset.
- scripts/biophysical_properties/: This module calculates or retrieves various protein biophysical properties for all OpenCell targets and interactors, including hydrophobicity and disorder scores.
- scripts/cytoself_analysis/: These modules analyze the encodings of protein localization patterns generated by the cytoself model from the OpenCell microscopy dataset.
- scripts/interactome_markov_clustering/: This module documents how Markov clustering is used to delineate the mass-spec 'communities.'
- scripts/interactome_paris_clustering/: This module documents how the interaction communities are themselves clustered using the Paris hierarchical clustering algorithm to yield a hierarchical representation of the interactome.
- scripts/interactome_precision_recall/: This module documents how estimates of precision and recall are obtained for our mass-spec interactions.
- scripts/external/: These are external dependencies that were either modified by us for a specific purpose or are not available as pip-installable packages.
- scripts/pyseus/: This module is technically an external dependency; it is a package of analysis and visualization methods that we developed for analyzing our mass-spec interaction data. It is not yet pip-installable, so it is included here.
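For readers unfamiliar with Markov clustering (MCL), the following is a minimal, dependency-free sketch of the generic algorithm: alternate "expansion" (matrix squaring) and "inflation" (element-wise powering followed by column normalization) until the matrix stops changing. It is an illustration only and is not the code in scripts/interactome_markov_clustering/; the function names (`markov_cluster`, `normalize_columns`, `matmul`), the default parameters, and the toy graph are all assumptions made for this example.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain-Python product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(adjacency, inflation=2.0, n_iter=50, tol=1e-9):
    """Toy MCL on a symmetric adjacency matrix (hypothetical helper)."""
    n = len(adjacency)
    # Add self-loops, then convert to a column-stochastic matrix.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(n_iter):
        expanded = matmul(m, m)  # expansion: two-step random walk
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expanded]
        )  # inflation: strengthen strong flows, weaken weak ones
        delta = max(abs(inflated[i][j] - m[i][j])
                    for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each row with nonzero mass defines a cluster: the columns it attracts.
    clusters = set()
    for i in range(n):
        members = frozenset(j for j in range(n) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


# Toy graph: two disconnected triangles (nodes 0-2 and 3-5).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
graph = [[0] * 6 for _ in range(6)]
for a, b in edges:
    graph[a][b] = graph[b][a] = 1

print(markov_cluster(graph))  # [[0, 1, 2], [3, 4, 5]]
```

Higher inflation values generally yield smaller, more numerous clusters, which is why the clustering quality is examined as a function of MCL inflation in Figure S4I.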
Here we provide links to the notebook sections or Python scripts that were used to generate the data and/or the graphics underlying each figure panel. Note that, for figure panels that are direct visualizations (e.g., bar or scatterplots) of data found in a supplementary table, we refer directly to the relevant supplementary table itself. Also, in some cases, the same data or graphic is used in slightly different forms in multiple figures; when this occurs, we try to indicate this transparently without replicating the same links.
2C-D: Example of an interactome community and its core clusters, and an overview of all interactome communities. These graphics were generated using Cytoscape directly from the cluster memberships in Supp. Table 5. Core cluster membership is generated by a second-step Markov clustering, documented here.
2E: PubMed citation count vs protein expression level. Refer to Supp. Table 2 for expression levels.
2F: Interaction network for SCAR/WAVE. The network visualization was generated using Cytoscape directly from the protein-protein interactions in Supp. Table 4.
2G: Heatmaps and volcano plots for RAVE complex. Generated using Plotly directly from the protein-protein interactions in Supp. Table 4.
3B: Sankey diagram comparing OpenCell and HPA localization annotations. This is generated here.
3D: UMAP of target localization encodings. This is generated here. Please note that this same UMAP is used in many subsequent figures, with different colormaps to indicate localization annotations, cluster memberships, etc.
4A: ARI curves for localization-based Leiden clustering. These are calculated here.
4B: Sankey diagram of low-resolution Leiden clusters vs manual annotations. This is generated here.
4C: UMAP of localization encodings colored by high-resolution localization clusters. For the UMAP, see Figure 3D above. The Leiden clustering is performed here.
4D-E: 2D histograms of interaction stoichiometry vs localization similarity. The interaction stoichiometries are found in Supp. Table 4 and the matrix of localization similarities is generated here.
4G: Proportion of interacting target pairs by localization similarity. This uses the same matrices of localization similarities and protein-protein interactions as in Figure 4D-E above.
4H: FAM241A de-orphaning. The ranked similarities are plotted directly from the matrix of localization similarities (see Figure 4D-E above) and the heatmap of interactions was generated using Plotly from Supp. Table 4.
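As background for the ARI curves in Figure 4A, the adjusted Rand index (ARI) measures the agreement between two partitions of the same items, corrected for chance. The sketch below is the textbook pair-counting formula in plain Python. It is not the code used in the notebooks; the function name and the toy labels are assumptions for illustration, and which reference labels the curves are computed against is documented in the linked notebook, not here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: partitions agree trivially

    # Pairs of items that fall together in both, in A only, and in B only.
    together_both = sum(comb(c, 2)
                        for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / total_pairs
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both
    return (together_both - expected) / (maximum - expected)


# Hypothetical cluster labels and reference annotations for four targets.
clusters = [0, 0, 1, 1]
reference = ["nuclear", "nuclear", "cytoplasmic", "cytoplasmic"]
print(adjusted_rand_index(clusters, reference))  # 1.0
```

The ARI is invariant to how clusters are named, so it can be evaluated at each clustering resolution without matching cluster labels to annotation labels.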
5A: Interactome hierarchy. The interaction communities are clustered using the Paris hierarchical clustering algorithm here.
5B: Composition of interactome hierarchy branches. Refer to Supp. Table 2 for protein annotations and Supp. Table 5 for protein membership in branches.
5C: Box-whisker plots of biophysical properties by branch. The biophysical properties are calculated here.
5D-E: Within-spatial-cluster mean disorder and percent RNA-BPs. The disorder scores are retrieved from the IUPred API and the within-cluster means are calculated here.
S1B: Choice of tag terminus. Generated directly from Supp. Table 3.
S1C: Number of detected interactors vs input material.
S1D: Distribution of GO annotations.
S2A: Target success rate. Generated directly from Supp. Table 3.
S2B: RNA vs protein abundance. Generated directly from Supp. Table 2.
S2C-E: Properties of successful tags. Generated directly from Supp. Table 3.
S4B: Precision-recall curve for interactome clustering. This is calculated here.
S4C-D: CORUM-based recall and co-localization-based precision for various datasets. These are calculated here.
S4E: Precision-recall for interactions in both BioPlex 3.0 and OpenCell. See Figure S4B.
S4F: Interaction network compression rates. The compression rates are calculated according to Royer et al.
S4G: Number of interactions unique to OpenCell. Refer to Supp. Table 4.
S4H: Overlapping GO annotations between interactors in high-stoichiometry vs low-stoichiometry interactions. Calculated using Supp. Table 4.
S4I: Clustering F1 score vs MCL inflation. This is calculated here.
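As a reminder of the quantities behind the precision-recall and F1 panels above, here is a minimal sketch that scores a set of predicted protein pairs against a single reference set. This is a simplification: as noted for S4C-D, the paper estimates recall and precision from different references (CORUM and co-localization, respectively), and the actual procedure is in scripts/interactome_precision_recall/. The function name and the example pairs are hypothetical.

```python
def precision_recall_f1(predicted_pairs, reference_pairs):
    """Precision, recall, and F1 of undirected pairs against one reference."""
    # frozenset makes (A, B) and (B, A) the same undirected pair.
    predicted = {frozenset(p) for p in predicted_pairs}
    reference = {frozenset(p) for p in reference_pairs}
    true_positives = len(predicted & reference)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical pairs; the names are placeholders, not real proteins.
predicted = [("A", "B"), ("B", "C"), ("C", "D")]
reference = [("B", "A"), ("C", "D"), ("E", "F"), ("G", "H")]
precision, recall, f1 = precision_recall_f1(predicted, reference)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.5 0.571
```

Sweeping a parameter such as the MCL inflation and recomputing these values at each setting gives curves of the kind shown in S4B and S4I.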
S7B: Heatmap of multi-localizing targets. This is generated here.
S7C: Sankey diagram of OC-HPA discrepancies. This is generated here.
S8A: Cluster size vs clustering resolution. The localization cluster sizes are calculated along with the ARI curves here.
S8B-C: Additional examples of high-resolution localization clusters. See Figure 4C.
S9A: Localization-based hierarchical clustering. The hierarchy is obtained by clustering the high-resolution localization clusters using the Paris algorithm here.
S9B: Interactome hierarchy. See Figure 5A.
S10A-E: GO enrichment in hierarchy branches. The enrichment analysis is performed using the Panther API. An example of how this API is used (for the localization clusters) is here.
S10F-G: Additional biophysical properties by interactome hierarchy branch. See Figure 5C.
S11A-C: Protein abundance, disorder scores, and number of interactors for RNA-binding proteins. Refer to Supp. Table 2 and Supp. Table 4.
S11D: Protein abundance vs number of interactors. Refer to Supp. Table 2 and Supp. Table 4.
S11E: Within-spatial-cluster mean hydrophobicity. See Figure 5D-E.
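The within-spatial-cluster means referenced in Figures 5D-E and S11E amount to grouping a per-protein score by cluster and averaging within each group. A minimal sketch, assuming one cluster ID and one score per protein and skipping proteins with no score; the function name and the example values are hypothetical and are not taken from the notebooks:

```python
from collections import defaultdict


def within_cluster_means(cluster_ids, scores):
    """Mean score per cluster, ignoring proteins whose score is None."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for cluster_id, score in zip(cluster_ids, scores):
        if score is None:
            continue  # e.g. no disorder score could be retrieved
        totals[cluster_id] += score
        counts[cluster_id] += 1
    return {cluster_id: totals[cluster_id] / counts[cluster_id]
            for cluster_id in totals}


# Hypothetical cluster assignments and disorder scores for five proteins.
cluster_ids = ["c1", "c1", "c2", "c2", "c2"]
disorder = [0.2, 0.4, 0.9, None, 0.7]
means = within_cluster_means(cluster_ids, disorder)
print({k: round(v, 2) for k, v in means.items()})  # {'c1': 0.3, 'c2': 0.8}
```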
Chan Zuckerberg Biohub Software License
This software license is the 2-clause BSD license plus a third clause that prohibits redistribution and use for commercial purposes without further permission.
Copyright © 2021. Chan Zuckerberg Biohub. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Redistributions and use for commercial purposes are not permitted without the Chan Zuckerberg Biohub's written permission. For purposes of this license, commercial purposes are the incorporation of the Chan Zuckerberg Biohub's software into anything for which you will charge fees or other compensation or use of the software to perform a commercial service for a third party. Contact [email protected] for commercial licensing opportunities.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.