Repository for thresholding code, based on dissertation research. We use a number of methods to analyse a weighted graph in order to suggest an optimal threshold for the graph.
The components of this repository are:
- C++ threshold analysis code
- C++ threshold code
- Bash threshold script
- script to output a histogram of the edge weights of a graph
- Python3 Jupyter notebook to analyse the results of the thresholding analysis code
Compiling also installs igraph-0.8.0, so you will need to have igraph dependency (libxml2) installed beforehand (see below).
git clone [email protected]:carissableker/thresholding.git
cd thresholding
To compile:
make
Executables will be in bin
folder.
The bash script absolute_global_threshold
uses awk, and should be executable.
The bash script edge_weight_histogram
also uses awk to calculate the
bin counts, and uses Python3 (including seaborn, matplotlib, and pandas) to produce the visualization.
The notebook uses the module combine_analysis_results.py
(supplied) as well as seaborn, matplotlib, and pandas.
Graphs need to be in .ncol
format to be read correctly
igraph_read_graph_ncol.
The format is defined by
Large Graph Layout.
In this application, it is a simple weighted, whitespace separated edge list. Vertex names cannot contain whitespace.
vertex1⇥vertex2⇥weight1,2
vertex1⇥vertex3⇥weight1,3
·
·
·
It is recommended to first view a histogram of the edge weights:
$ ./bin/edge_weight_histogram <GRAPH FILE PATH> <OUTPUT FILE PATH> <BIN WIDTH>
This will create a tab separated file with bin counts in
<OUTPUT FILE PATH>.tsv.
and a SVG figure of the histogram at <OUTPUT FILE PATH>.svg
.
The histogram can be used to decide on lower and upper bounds on the thresholds.
For analysis of the graph, run the analysis
program.
$ ./bin/analysis --help
Usage:
./bin/analysis [-OPTIONS]... <GRAPH FILE PATH> <OUTPUT FILE PATH>
Graph has to be in .ncol format.
Output file path is the prefix to the results files, which will be of the form:
<OUTPUT FILE PATH>.pid.<method_name>.txt
Options:
-l --lower <value> lower bound on thresholds to test (default 0.5)
-u --upper <value> upper bound on thresholds to test (default 0.99)
-i --increment <value> threshold increment (default 0.01)
-w --windowsize <value> sliding window size for spectral method (default 5)
-p --minimumpartitionsize <value> minimum size of graph or subgraph after threshold (default 10)
-n --num_samples <value> number of samples for significance and power calculations (default NULL)
-b --bonferroni_correction switch to perform bonferroni corrections in significance and power calculations (default FALSE)
-c --minimum_cliquesize <value> minimum size of maximal cliques in maximal clique count (default 5)
-m --methods <value> comma separated list of methods (defaults to none)
0 - all
1 - significance and power calculations (only valid for Pearson CC)
2 - local-global
3 - scale free
4 - maximal cliques
5 - spectral methods
6 - random matrix theory
7 - clustering coefficient
8 - percolation
-h --help print this help and exit
The methods (-m
) correspond to calculations required for the following approaches:
Approach | methods (-m ) |
Description |
---|---|---|
apeltsin | 8 | Implementation of the algorithm described in \cite{apeltsin2011} based on the number of edges and the number of vertices. |
cc-inflection | 8 | The point at which the size of largest connected component sharply inflects downwards. |
density-min | 8 | The threshold that minimizes the density from \cite{aoki2007}. |
elo-clustering | 7 | Smallest threshold at which the difference between the clustering coefficient of the graph and the clustering coefficient of a random counterpart graph is larger than the same difference at the next threshold \citep{elo2007} |
gupta-clustering | 7 | Threshold at which the clustering coefficient sharply increases \cite{gupta2006} |
mcr-2 | 4 | Maximal clique 2 described in \cite{borate2009}. |
mcr-3 | 4 | Maximal clique 3 described in \cite{borate2009}. |
mcr-max | 4 | Generalization of maximal clique 2 and maximal clique 3. |
scale-free | 3 | Based on WCGNA but this method uses maximum likelihood to determine |
single-component | 8 | The highest threshold that guarantees a connected graph. |
spectral-methods | 5 | Implementation of the algorithm described in \cite{perkins2009}. |
rmt | 6 | Implementation of the algorithm described in \cite{gibson2013,luo2007} |
whole-graph | 8 | The highest threshold that guarantees all vertices remain in the graph. |
local-global | 2 | Method of \cite{guzzi2014} but adapted for general weighted graphs instead of semantic similarity graphs. |
TypeI |
1 | Given any value of |
Power |
1 | Given any value of |
The output of graph analysis is a number of files containing statistics and metrics of the graph. A Python3 module is supplied to interactively analyse these files. Usage can be seen in the notebook found at ./example/HumanCellCycleSubset-ThresholdNotebook.ipynb.
For simple thresholding, the fastest is probably to use
the ./bin/absolute_global_threshold
script.
$ ./bin/absolute_global_threshold <input.ncol> <t> <output.ncol>
This thresholds the graph at |t|, i.e. removes all edges with absolute value smaller or equal to t.
Alternatively, threshold
can also be used to threshold the graph
$ ./bin/threshold --help
Usage:
./bin/threshold [-OPTIONS]... <GRAPH FILE PATH> <OUTPUT FILE PATH>
Graph has to be in .ncol format.
One of the following options have to be given:
Options:
-a --absolute <value> Threshold graph at absolute of <value>
-l --local-global <value> Use local-global method to threshold with alpha = <value>
-r --rank <value> Use top <value> ranked edges per vertex to threshold graph
-h --help print this help and exit
This example uses the human cell cycle data from Stanford, available at Transcriptional regulation and function in the human cell cycle. The graph is the probe-wise Pearson correlation of the normalized expression data, across the 13 time points (n=13). Since the original graph has over 23 million edges, the version used here is a random subset of 250 000 edges.
The complete graph is avalible on Zenodo: .
First we can take a look at the edge weight distribution:
$ ../bin/edge_weight_histogram HumanCellCycleSubset.ncol HumanCellCycleSubset-edgehist 0.01
Edge distribution at HumanCellCycleSubset-edgehist.tsv and HumanCellCycleSubset-edgehist.svg
$ head HumanCellCycleSubset-edgehist.tsv
bin_start bin_end bin_count
-0.96000 -0.95000 1
-0.94000 -0.93000 1
-0.93000 -0.92000 2
-0.92000 -0.91000 8
-0.91000 -0.90000 6
Now lets do the analysis of our graph. To make it time efficient, lets do percolation, significance and power, and scale-free. By default it will also calculate density at each threshold. We'll start at a lower threshold limit of 0.6, and leave the increment at the default of 0.01. Remember to give the number of samples using -n
.
../bin/analysis HumanCellCycleSubset.ncol HumanCellCycleSubset-result -l0.6 -m1,3,8 -n13
------------------------------------------------
input graph file: HumanCellCycleSubset.ncol
output file prefix: HumanCellCycleSubset-result.3932.
lower threshold: 0.6
upper threshold: 0.99
threshold increment: 0.01
------------------------------------------------
Loading graph ... done.
Number vertices: 7077
Number edges: 250000 (maximum possible number edges 25038426)
------------------------------------------------
Iterative thresholding
Number steps: 40
Step: 1, Threshold: 0.6 Vertices: 6639 Edges: 15905
Step: 2, Threshold: 0.61 Vertices: 6514 Edges: 14532
Step: 3, Threshold: 0.62 Vertices: 6365 Edges: 13228
Step: 4, Threshold: 0.63 Vertices: 6212 Edges: 12037
Step: 5, Threshold: 0.64 Vertices: 6046 Edges: 10852
.
.
.
Step: 31, Threshold: 0.9 Vertices: 81 Edges: 41
Step: 32, Threshold: 0.91 Vertices: 47 Edges: 24
Step: 33, Threshold: 0.92 Vertices: 20 Edges: 10
Step: 34, Threshold: 0.93 Graph too small, finished.
Done.
Lets do spectral methds as well:
../bin/analysis HumanCellCycleSubset.ncol HumanCellCycleSubset-result -l0.6 -m5
Now lets see what files were created:
$ ls
HumanCellCycleSubset-result.3932.iterative.txt
HumanCellCycleSubset-result.3932.statistical_errors.txt
HumanCellCycleSubset-result.4152.iterative.txt
Each of the *.iteractive.txt
files correspond to an anlysis run (we ran analysis
twice here). Additionally, a file *.statistical_errors.txt
file was created with the results of the stattical analysis (-m 1
, Power and TypeI thresholds), since the format of the output is different to the methods that are calculated using interactive thresholding. Note, the statistical methods will only work if the number of samples is also provided.
You can peruse these files yourself, or use the notebook to take a look. Open the notebook at ./example//HumanCellCycleSubset-ThresholdNotebook.ipynb and follow the instructions there.
Once you have decided on a threshold, you can use either absolute_global_threshold
or threshold
. Suppose we decide on a global threshold of 0.79 for our graph, then use:
$ ../bin/absolute_global_threshold HumanCellCycleSubset.ncol 0.79 HumanCellCycleSubset-0.79.ncol
- Change Chi2 test of GOE to KS test for continuous distributions
- Add Spearman significance
- Consider positive and negative values separately
- Test cases
- Improve memory allocation and deallocation
- Make increment loop optional
- Consider support for Windows/Mac
- Generally improve C++ code (with help from a C++ developer)
- Fix igraph dependency in Makefile
- ?Other input/output formats
If libxml2 is not already installed, the following is one way of installing it.
Another (preferable and easier) option is apt-get
.
wget ftp://xmlsoft.org/libxml2/libxml2-<version>.tar.gz
tar -xzvf libxml2-<version>.tar.gz
./configure --without-python --prefix=/my_optional path/
make
make install
And if you used a prefix:
echo 'export PATH="/my_optional_path/bin:$PATH"
export LD_LIBRARY_PATH="/my_optional_path/lib:$LD_LIBRARY_PATH"
export LD_RUN_PATH="/my_optional_path/lib:$LD_RUN_PATH"' >> ~/.bashrc
source ~/.bashrc