Magda Markowska, Tomasz Cąkała, Błażej Miasojedow, Bogac Aybey, Dilafruz Juraeva, Johanna Mazur, Edith Ross, Eike Staub and Ewa Szczurek
Tomasz Cąkała, [email protected]
Magda Markowska, [email protected]
Ewa Szczurek, [email protected]
- C++ compiler that supports C++14 standard
- Python 3.6 or higher
- GNU Make
- R 4.0 or higher (for output plotting)
We advise new users to start with the notebooks in the python/notebooks directory.
This project consists of 3 main components:
- CONET C++ sources, which define the CONET executable
- conet-py Python package (python/conet-py), through which the CONET executable should be run rather than calling it directly
- R script for advanced plots of inference results
Use image tc360950/conet_py_sing:latest. At minimum, the entrypoint expects 3 arguments:
- data_dir - directory on the container where CONET data will be stored (must end with /)
- output_dir - directory on the container where CONET output will be saved (must end with /)
- corrected_counts_file - path to corrected counts file
An example call may look as follows:
docker run -v /home/user/Desktop/CONET/:/data tc360950/conet_py_sing:latest --data_dir /data/ --output_dir /data/out/ --corrected_counts_file /data/SA501X3F_filtered_corrected_counts_chr_17_18_20_23.csv
Here SA501X3F_filtered_corrected_counts_chr_17_18_20_23.csv has been saved in /home/user/Desktop/CONET/ on the host, and that directory is mounted into the container at /data.
Other available parameters are described in the Usage Details section.
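The call above can also be assembled programmatically, which makes it easier to enforce the trailing-slash requirement on data_dir and output_dir. The following is a minimal sketch: the helper name `build_docker_command` and the `out/` subdirectory convention are illustrative assumptions taken from the example call, not part of CONET.

```python
import shlex

IMAGE = "tc360950/conet_py_sing:latest"


def build_docker_command(host_dir, counts_filename, container_dir="/data/"):
    """Assemble the docker invocation from the example above as an argument list.

    Hypothetical helper: it only mirrors the example call and adds no new flags.
    """
    # data_dir and output_dir must end with "/".
    if not container_dir.endswith("/"):
        container_dir += "/"
    output_dir = container_dir + "out/"
    return [
        "docker", "run",
        # Mount the host directory at the container path (without trailing slash).
        "-v", f"{host_dir}:{container_dir.rstrip('/')}",
        IMAGE,
        "--data_dir", container_dir,
        "--output_dir", output_dir,
        "--corrected_counts_file", container_dir + counts_filename,
    ]


cmd = build_docker_command(
    "/home/user/Desktop/CONET/",
    "SA501X3F_filtered_corrected_counts_chr_17_18_20_23.csv",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Docker is installed.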
Use the image defined in CONET.Dockerfile. It installs conet-py and compiles the C++ CONET sources into the executable ~/conet-py/CONET. To install CONET locally, follow the same steps that the image executes.
Basic input data should be provided as a corrected counts matrix, with consecutive bins in rows and cells in columns. The matrix should contain 5 additional columns (placed at positions 1, 2, 3, 4, 5 in the matrix):
- Bin's chromosome number: should always be an integer (please change X to 23 and Y to 24).
- Bin's start locus
- Bin's end locus
- Bin's width
- Binary breakpoint indicator: 1 if the start locus of the bin is a candidate breakpoint, 0 otherwise.
An example input matrix for the SA501X3F xenograft breast cancer data is contained in CONET/python/scicone_on_conet/biological_data/data/SA501X3F_filtered_corrected_counts.csv
and one for the TN2 breast cancer data in CONET/R/TN2_corrected_counts_with_indices_50cells.csv
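Before running inference it can help to check that a matrix follows the layout described above. The sketch below uses only the standard library and toy values. The function names are hypothetical, and it assumes the rows are passed without a header line, since the text above does not say whether the file has one.

```python
import csv
import io

# Recoding suggested above: X becomes 23, Y becomes 24.
CHROMOSOME_CODES = {"X": 23, "Y": 24}


def recode_chromosome(label):
    """Return the chromosome as an integer, recoding X and Y."""
    if label in CHROMOSOME_CODES:
        return CHROMOSOME_CODES[label]
    value = float(label)  # raises ValueError for non-numeric labels
    if not value.is_integer():
        raise ValueError(f"chromosome must be an integer, got {label!r}")
    return int(value)


def check_rows(rows):
    """Validate the 5 annotation columns; return the number of candidate breakpoints."""
    n_breakpoints = 0
    for i, row in enumerate(rows):
        if len(row) < 6:
            raise ValueError(f"row {i}: expected 5 annotation columns plus at least one cell")
        recode_chromosome(row[0])
        indicator = float(row[4])
        if indicator not in (0.0, 1.0):
            raise ValueError(f"row {i}: breakpoint indicator must be 0 or 1")
        n_breakpoints += int(indicator)
    return n_breakpoints


# Toy data (made up): chromosome, start, end, width, indicator, then two cells.
sample = (
    "17,0,500000,500000,1,1.9,2.1\n"
    "17,500000,1000000,500000,0,2.0,2.2\n"
    "X,0,500000,500000,1,3.1,2.9\n"
)
rows = list(csv.reader(io.StringIO(sample)))
n_candidates = check_rows(rows)
print(n_candidates)
```

The check deliberately does not compare the width column with the start and end loci, because the text does not state whether the end locus is inclusive.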
CONET should be used through the provided Python scripts; calling the C++ code directly is not supported.
Examples and details are provided in three notebooks:
Applies CONET to SA501X3F xenograft breast cancer data (DLP sequencing)
- python/notebooks/biological_data/biological_data.ipynb
Notebooks for synthetic data generation, inference and result scoring.
- python/notebooks/per_bin_generative_model/generative_model.ipynb
- python/notebooks/per_breakpoint_generative_model/generative_model.ipynb
CONET depends on a number of user-defined parameters which are represented by objects of class CONETParameters.
Parameter name | Description | Default value |
---|---|---|
data_dir | Path to directory containing input file. | "./" |
output_dir | Path to output directory. Inference results will be saved there. | "./output" |
param_inf_iters | Number of MCMC iterations for joint tree and model parameters inference. | 100000 |
pt_inf_iters | Number of MCMC iterations for tree inference. | 100000 |
counts_penalty_s1 | Constant controlling impact of penalty for large discrepancies between inferred and real count matrices. | 0.0 |
counts_penalty_s2 | Constant controlling impact of penalty for inferring clusters with changed copy number equal to basal ploidy. | 0.0 |
event_length_penalty_k0 | Constant controlling impact of penalty for long inferred events. | 1.0 |
tree_structure_prior_k1 | Constant controlling impact of data size part of tree structure prior. | 1.0 |
use_event_lengths_in_attachment | If True cell attachment probability will depend on average event length in the history, otherwise it will be uniform. | True |
seed | Seed for C++ RNG | 12312 |
mixture_size | Initial number of components in difference distribution for breakpoint loci. This value may be decreased in the course of inference but will never be increased. | 4 |
num_replicas | Number of tempered chain replicas in MAP event tree search. | 5 |
threads_likelihood | Number of threads which will be used for the most demanding likelihood calculations. | 4 |
neutral_cn | Neutral copy number. | 2.0 |
verbose | True if CONET should print messages during inference. | True |
For more details, please refer to Additional File 1, Section S7: "A recommended procedure for setting CONET regularization parameters".
Parameter name | Recommendation | Initial value |
---|---|---|
param_inf_iters | Start with initial value, save and plot likelihood to check convergence. Depends on input size - number of cells and candidate breakpoint loci. | 250000 |
pt_inf_iters | Start with initial value, save and plot likelihood to check convergence. Depends on input size - number of cells and candidate breakpoint loci. | 500000 |
event_length_penalty_k0 | Start with initial value and increase if you want to penalize trees inferring long events. | 1.0 |
tree_structure_prior_k1 | Start with initial value and try increasing/decreasing if quality measures are not satisfactory. | 0.0 |
counts_penalty_s1 | Start with initial value and try increasing/decreasing if quality measures are not satisfactory. | 100000.0 |
counts_penalty_s2 | Start with initial value and try increasing/decreasing if quality measures are not satisfactory. | 100000.0 |
seed | Try different seeds to make sure inference does not get stuck in a local optimum. | 12312 |
We recommend leaving all other parameter values at their defaults.
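The recommended initial values and the advice to try several seeds can be combined into a small sweep. This is a sketch with plain dictionaries; `seed_sweep` is a hypothetical helper, and passing these dictionaries to CONETParameters as keyword arguments is an assumption that should be checked against the notebooks.

```python
# Initial values copied from the recommendation table above.
RECOMMENDED_START = {
    "param_inf_iters": 250000,
    "pt_inf_iters": 500000,
    "event_length_penalty_k0": 1.0,
    "tree_structure_prior_k1": 0.0,
    "counts_penalty_s1": 100000.0,
    "counts_penalty_s2": 100000.0,
    "seed": 12312,
}


def seed_sweep(base, seeds):
    """Return one parameter dict per seed, leaving `base` untouched."""
    return [{**base, "seed": seed} for seed in seeds]


# The seeds other than 12312 are arbitrary illustrative choices.
runs = seed_sweep(RECOMMENDED_START, [12312, 1, 2])
for params in runs:
    print(params["seed"], params["pt_inf_iters"])
```

Comparing the final likelihood across such runs is one way to notice a chain that settled in a poor local optimum.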
Structure of the inferred CONET in a format readable by the readTree.R script, which can also save it in Newick format.
Attachment of cells to CONET nodes. Readable by the readTree.R script.
Binary matrix with inferred breakpoints per genomic locus and cell (corresponding to the input file).
Model parameters inferred by CONET. The first line contains the mean and variance of the no_breakpoint distribution (R+ truncated normal). The following lines contain weight; mean; variance of the components of the breakpoint distribution (R+ truncated normal mixture), one line per component.
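The layout of the model parameters file described above can be read with a few lines of Python. This is a sketch under assumptions: the field separator is taken to be `;` (as the description suggests) and is exposed as a parameter, the names `Component` and `parse_distribution` are hypothetical, and the numbers are toy values rather than real CONET output.

```python
from typing import NamedTuple


class Component(NamedTuple):
    """One component of the breakpoint mixture distribution."""
    weight: float
    mean: float
    variance: float


def parse_distribution(text, sep=";"):
    """Parse the parameters text into (no_breakpoint (mean, variance), components)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]

    def fields(line):
        # Ignore empty fields, e.g. those produced by a trailing separator.
        return [float(x) for x in line.split(sep) if x.strip()]

    mean, variance = fields(lines[0])
    components = [Component(*fields(line)) for line in lines[1:]]
    return (mean, variance), components


# Toy content (made up): one no_breakpoint line, then two mixture components.
toy = "0.0;0.1\n0.6;1.0;0.2\n0.4;2.0;0.5\n"
no_breakpoint, components = parse_distribution(toy)
print(no_breakpoint, components)
```

If the actual file uses a different separator, pass it through the `sep` argument.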
Final CN matrix can be inferred with provided R script. It also allows the user to visualize results of CONET model.
CONET input, output plus all dependencies provided at CONET/R directory.
All details in CONET/R/readTree.R
CONET/R/TN2example: illustration of the results from CONET applied to 50 cells from TN2 breast cancer data (ACT sequencing).
Plot of CONET with chr, start and end breakpoint loci, cancer genes and number of attached cells.
CONET in the Newick format
Final inferred CN matrix in the same format as input corrected counts matrix
Quality measures calculated for the inferred CONET and CN matrix (described in the manuscript, Additional File 1: Section S6.1)
Heatmap of inferred CNs (genomic loci on X axis, cells on Y axis)
Heatmap of corrected counts (genomic loci on X axis, cells on Y axis), plotted in the same cell order as the CN heatmap.