dcHiC is a tool for differential compartment analysis of Hi-C datasets. It features many capabilities, including:
- Optimized PCA calculations (faster + capable of analysis up to 5kb resolution)
- Comprehensive identification of significant compartment changes between any number of cell lines (with replicates), including with pseudo-bulk single cell data
- Beautiful standalone HTML files for visualization of results
- Identification of differential loops anchored in significant differential compartments (using Fit-Hi-C)
- And much more!
If you want to see examples of dcHiC in action or cite our tool, please see our paper in Nature Communications! Web-hosted visualization examples of the case scenarios in the paper are available here.
To see how to run dcHiC, read our docs and try our demo (below)! Information about data pre-processing and running single-cell data is available in the wiki.
This README contains the key information you will need to use this application. However, some users may find a demo helpful; ours includes a script to run package installation as well as detailed guides for different options of dcHiC. All of these resources are available in the demo directory, with relevant instructions inside!
The latest version of dcHiC runs predominantly from R (3+) and Python (3+). The necessary packages may be installed via conda or manually (those transitioning environments should have most, if not all, of the packages already installed). For the core application, the following packages are necessary:
We recommend using Conda to install all dependencies in a virtual environment. The suggested path is using the appropriate Miniconda distribution.
If you face any issues, make sure your "conda" command specifically calls the executable under the Miniconda distribution (e.g., ~/miniconda3/condabin/conda). If the "conda activate" command gives an error the first time you run it, run "conda init bash" once.
To install, go to the directory of your choice and run:
git clone https://github.com/ay-lab/dcHiC
conda env create -f ./packages/dchic.yml
conda activate dchic
Afterward, activate the environment and install some purpose-built processing functions with R CMD INSTALL functionsdchic_1.0.tar.gz (functions file under 'packages'). M1 Mac users may face some issues, as some Bioconductor packages have not yet been updated for native ARM64 support; we recommend using an x86-64 based OS for the cleanest experience.
To install the dependencies manually, ensure that you have the following packages installed:
- Rcpp
- optparse
- bench
- bigstatsr
- bigreadr
- robust
- data.table
- networkD3
- depmixS4
- rjson
- limma (bioconductor)
- IHW (bioconductor)
- lpsymphony (bioconductor, in case you face an error while installing the IHW package)
- ggplot2
- R.utils
- hashmap (.tar.gz file under 'packages')
- igv-reports
- dcHiC requires bedtools. Please install the program as directed—it should be accessible via $PATH.
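Since bedtools must be reachable through $PATH, a quick pre-flight check can save a failed run. The snippet below is a minimal sketch and not part of dcHiC; the helper name `missing_tools` is hypothetical, and it only tests whether an executable with the given name can be found.

```python
import shutil

def missing_tools(tools=("bedtools",)):
    """Return the command-line tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

# An empty list means every listed tool was found.
print(missing_tools())
```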
Those who wish to perform differential loop analysis should also download the latest Python version of FitHiC, which requires a set of Python libraries: numpy, scipy, scikit-learn, sortedcontainers, and matplotlib. You may also need to install 'cooler' if you wish to use .cool files. See documentation on how to do so.
After a manual installation, also install the purpose-built processing functions with R CMD INSTALL functionsdchic_1.0.tar.gz (functions file under 'packages'). To check whether all required R packages are installed, run:
Rscript -e 'plist <- c("functionsdchic","hashmap","R.utils","Rcpp","RcppEigen","BH","optparse","bench","bigstatsr","bigreadr","robust","data.table","networkD3","depmixS4","rjson","limma","ggplot2","lpsymphony","IHW"); setdiff(plist,basename(find.package(plist)))'
If you get character(0), you're all set; otherwise, install the packages shown in the output.
Create an input file for dcHiC with the format below. The matrix and bed columns are for input data (see next section), whereas the replicate_prefix and experiment_prefix columns describe the hierarchy of data.
Note: Do not use dashes ("-") or dots (".") in the replicate or experiment prefix names.
<mat> <bed> <replicate_prefix> <experiment_prefix>
For instance, consider this sample file which describes two replicates for two Hi-C profiles:
matr1_e1.txt matr1_e1.bed exp1_R1_100kb exp1
matr2_e1.txt matr2_e1.bed exp1_R2_100kb exp1
matr1_e2.txt matr1_e2.bed exp2_R1_100kb exp2
matr2_e2.txt matr2_e2.bed exp2_R2_100kb exp2
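The four-column layout and the naming rule above are easy to check before starting a run. The following is a minimal sketch, not part of dcHiC: `validate_input_rows` is a hypothetical helper, and it assumes columns are whitespace-separated and that file paths contain no spaces.

```python
import re

# Replicate and experiment prefixes must not contain dashes or dots.
FORBIDDEN = re.compile(r"[-.]")

def validate_input_rows(text):
    """Return a list of problems found in a dcHiC-style input table.

    Each non-empty line is expected to hold four whitespace-separated columns:
    <mat> <bed> <replicate_prefix> <experiment_prefix>.
    """
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()
        if len(fields) != 4:
            problems.append(f"line {lineno}: expected 4 columns, found {len(fields)}")
            continue
        _mat, _bed, replicate, experiment = fields
        for label, value in (("replicate_prefix", replicate),
                             ("experiment_prefix", experiment)):
            if FORBIDDEN.search(value):
                problems.append(f"line {lineno}: {label} '{value}' contains '-' or '.'")
    return problems

sample = (
    "matr1_e1.txt matr1_e1.bed exp1_R1_100kb exp1\n"
    "matr1_e2.txt matr1_e2.bed exp2-R1.100kb exp2\n"  # deliberately invalid prefix
)
for problem in validate_input_rows(sample):
    print(problem)
```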
dcHiC accepts sparse matrices as its input (Hi-C Pro style). If you have .cool or .hic files, see how to convert their format here.
To see the full list of options, run Rscript dchicf.r --help or view dchicdoc.txt here.
The matrix file should look like this:
<indexA> <indexB> <count>
1 1 300
1 2 30
1 3 10
2 2 200
2 3 20
3 3 200
....
... And the corresponding bed file like this:
<chr> <start> <end> <index>
chr1 0 40000 1
chr1 40000 80000 2
chr1 80000 120000 3
....
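Because the matrix refers to bins only by index, every index it uses should have a row in the bed file. The sketch below illustrates one way to check this; the function names are hypothetical, and it assumes the real files carry no header row (the `<...>` lines above describe the columns only).

```python
def read_bed_indices(bed_text):
    """Map bin index -> (chrom, start, end) from a Hi-C Pro style bed table."""
    bins = {}
    for line in bed_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        chrom, start, end, index = fields[:4]
        bins[int(index)] = (chrom, int(start), int(end))
    return bins

def unknown_matrix_indices(matrix_text, bins):
    """Return sorted bin indices used in the sparse matrix but absent from the bed file."""
    missing = set()
    for line in matrix_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or incomplete lines
        for idx in (int(fields[0]), int(fields[1])):
            if idx not in bins:
                missing.add(idx)
    return sorted(missing)

bed = "chr1 0 40000 1\nchr1 40000 80000 2\nchr1 80000 120000 3\n"
matrix = "1 1 300\n1 2 30\n1 3 10\n2 2 200\n2 3 20\n3 4 5\n"
bins = read_bed_indices(bed)
print(unknown_matrix_indices(matrix, bins))  # bin 4 has no bed entry -> [4]
```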
Many high-throughput genomics studies "blacklist" problematic mapping regions (see the study here). If you wish to blacklist regions from your data, you may do so by adding a fifth column to your bed file containing 1's in rows that should be blacklisted and 0's elsewhere:
<chr> <start> <end> <index> <blacklisted>
chr1 0 40000 1 0
chr1 40000 80000 2 1
....
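One way to produce that fifth column is to flag every bin that overlaps a blacklisted interval. This is a minimal sketch under stated assumptions: the helper `add_blacklist_column` and the example interval are hypothetical, any overlap at all marks the bin, and intervals are treated as half-open, as in BED files.

```python
def add_blacklist_column(bed_rows, blacklist):
    """Append a 0/1 flag to each (chrom, start, end, index) row.

    A bin is flagged 1 if it overlaps any (chrom, start, end) blacklist interval.
    """
    flagged = []
    for chrom, start, end, index in bed_rows:
        hit = any(
            b_chrom == chrom and b_start < end and start < b_end
            for b_chrom, b_start, b_end in blacklist
        )
        flagged.append((chrom, start, end, index, 1 if hit else 0))
    return flagged

bed_rows = [("chr1", 0, 40000, 1), ("chr1", 40000, 80000, 2)]
blacklist = [("chr1", 50000, 60000)]  # hypothetical blacklisted interval

for row in add_blacklist_column(bed_rows, blacklist):
    print("\t".join(str(value) for value in row))
```

A stricter rule (for example, requiring a minimum overlap fraction) would only change the `hit` expression.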
To see the full list of run options with examples of run code for each one, run Rscript dchicf.r --help. The most high-level option is --pcatype, which allows users to perform different types of step-wise analysis. Each of these run options will require other input information.
--pcatype option | Meaning |
---|---|
cis | Find compartments on a cis interaction matrix |
trans | Find compartments on a trans interaction matrix |
select | Selection of best PC for downstream analysis [Must be after cis or trans step] |
analyze | Perform differential analysis on selected PC's [Must be after select step] |
subcomp | Optional: Assigning sub-compartments based on PC magnitude values using HMM segmentation |
fithic | Run Fit-Hi-C to identify loops before running dloop (Optional) |
dloop | Find differential loops anchored in at least one of the differential compartments across the samples (Optional) |
viz | Generate IGV visualization HTML file. Must have performed other steps in order (optional ones not strictly necessary) before this one. |
enrich | Perform gene enrichment analysis (GSEA) of genes in differential compartments/loops |
Here is a sample full run using the traditional cis matrix for compartment analysis:
Required steps -
Rscript dchicf.r --file input.ES_NPC.txt --pcatype cis --dirovwt T --cthread 2 --pthread 4
Rscript dchicf.r --file input.ES_NPC.txt --pcatype select --dirovwt T --genome mm10
Rscript dchicf.r --file input.ES_NPC.txt --pcatype analyze --dirovwt T --diffdir ES_vs_NPC_100Kb
Rscript dchicf.r --file input.ES_NPC.txt --pcatype viz --diffdir ES_vs_NPC_100Kb --genome mm10
Optional steps -
Rscript dchicf.r --file input.ES_NPC.txt --pcatype subcomp --dirovwt T --diffdir ES_vs_NPC_100Kb
Rscript dchicf.r --file input.ES_NPC.txt --pcatype fithic --dirovwt T --diffdir ES_vs_NPC_100Kb --fithicpath "/path/to/fithic.py" --pythonpath "/path/to/python"
Rscript dchicf.r --file input.ES_NPC.txt --pcatype dloop --dirovwt T --diffdir ES_vs_NPC_100Kb
Rscript dchicf.r --file input.ES_NPC.txt --pcatype viz --diffdir ES_vs_NPC_100Kb --genome mm10
Rscript dchicf.r --file input.txt --pcatype enrich --genome mm10 --diffdir conditionA_vs_conditionB --exclA F --region both --pcgroup pcQnm --interaction intra --pcscore F --compare F
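Because the required steps must run in a fixed order (cis or trans, then select, analyze, and viz), it can help to generate the commands from one place. The sketch below only builds and prints the command lines, using the flags shown in the sample run; the function `required_steps` is a hypothetical wrapper and not part of dcHiC.

```python
import shlex

def required_steps(input_file, diffdir, genome, cthread=2, pthread=4):
    """Build the four required dcHiC commands, in order, as argument lists."""
    base = ["Rscript", "dchicf.r", "--file", input_file]
    return [
        base + ["--pcatype", "cis", "--dirovwt", "T",
                "--cthread", str(cthread), "--pthread", str(pthread)],
        base + ["--pcatype", "select", "--dirovwt", "T", "--genome", genome],
        base + ["--pcatype", "analyze", "--dirovwt", "T", "--diffdir", diffdir],
        base + ["--pcatype", "viz", "--diffdir", diffdir, "--genome", genome],
    ]

steps = required_steps("input.ES_NPC.txt", "ES_vs_NPC_100Kb", "mm10")
for argv in steps:
    print(shlex.join(argv))  # shlex.join requires Python 3.8+
```

To execute the steps rather than print them, each list could be passed to `subprocess.run(argv, check=True)`, which stops the sequence at the first failing step.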
As output, dcHiC creates two types of directories. The first type holds raw PCA results: one directory is created for each input Hi-C profile, named after the third column (replicate_prefix) of the input file. Inside each, there will be directories "intra_pca" or "inter_pca", depending on whether the user specified compartment calculations based on intra- or inter-chromosomal interactions, and each of these contains the raw PC values for each chromosome.
The second overarching directory is called DifferentialResult, which contains directories for differential results (on any number of parameter settings). Each directory name is set with the --diffdir option of the analyze step (--pcatype analyze, which performs differential calling), where users denote where they want the analysis to be done. Multiple directories, with different analysis parameters, can be stored under the global DifferentialResult directory.
Inside each diffdir, there are raw compartment results ("expXX_data") and two PC output directories, pcOri and pcQnm, with combined and quantile-normalized compartment results. Finally, there will be a directory fdr_result containing differential compartment, loop, and subcompartment results. Inside fdr_result, the sample_combined files contain complete bedGraphs with average PC values across replicates for all XX cell lines, as well as a final adjusted p-value denoting the significance of changes between Hi-C experiments for that compartment bin. The sample_combined.Filtered files contain the same information, filtered by a p-value cutoff.
Other subcompartments and compartmentLoops files may be there depending on whether the user opted to run those options. The differential loop files list significant loop interactions and their associated differential compartment anchors, whereas the subcompartment files illustrate HMM-segmented subcompartments based on the magnitude of the PC values.
Below is a diagram of the overarching results structure, containing two different runs:
dcHiC_dir
    exp1_rep1_100kb_pca
        intra_pca
            [files]
        inter_pca
            [files]
    exp1_rep2_100kb_pca
    exp2_rep1_100kb_pca
    exp2_rep2_100kb_pca
    DifferentialResult
        inter_100kb_diff
            [files]
        intra_100kb_diff
            exp1_data
            exp2_data
            fdr_result
            fithic_run
            geneEnrichment
            pcOri
            pcQnm
            viz
There are a few technical implementation items to note:
Chromosomes: If you run into issues while running dcHiC, removing chrM, chrY, and other non-standard chromosomes will help. There have been many issues raised about this; we highly recommend you search for the label "user questions" or "not a bug" under Issues if you encounter an error related to this. Also make sure that the chromosome labels in the matrices match the goldenPath files; see this issue.
Chromosome Name: The chromosome names should have a 'chr' prefix with them. Please do not use a numeric vector (e.g. 1, 2, 3 ...) to represent chromosome names.
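The two chromosome notes above can be combined into a simple pre-check of the names in a bed file. This is a minimal sketch: the helper `split_chromosomes` is hypothetical, and treating only autosomes plus chrX as "standard" is an assumption made for illustration, so adjust the pattern to your genome.

```python
import re

# Assumption: "standard" means chr<number> or chrX; everything else is set aside.
STANDARD = re.compile(r"^chr(\d+|X)$")

def split_chromosomes(names):
    """Split chromosome names into (keep, drop) lists, preserving input order."""
    keep, drop = [], []
    for name in names:
        (keep if STANDARD.match(str(name)) else drop).append(str(name))
    return keep, drop

keep, drop = split_chromosomes(["chr1", "chr2", "chrX", "chrY", "chrM", "chr1_random", "1"])
print("keep:", keep)
print("drop:", drop)  # includes the bare numeric name "1", which lacks the chr prefix
```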
fithic/dloop: If running dloop, dcHiC will first run Fit-Hi-C on the data. You will need to follow the Fit-Hi-C running procedure to do this, which will require generating a bias file. See "FitHiC2 bias file format" here.
Support for other genomes: While it has only been extensively tested for human and mouse genomes, dcHiC supports most other commonly-used genomes that are under the UCSC genome page. To utilize this, create a folder {genome}_{resolution}_goldenpathData (e.g. hg38_100000_goldenpathData). Within that folder put three files:
- {genome}.fa (e.g. hg38.fa)
- {genome}.tss.bed (e.g. hg38.tss.bed, the TSS file. Please make sure the TSS position is selected based on the strand direction!) Note that this may be named .refGene.gtf.gz.
- {genome}.chrom.sizes (e.g. hg38.chrom.sizes)
These files can be found under the UCSC bigZips page for the specified genome. When running dcHiC, use the --gfolder option in the select step to provide the folder path, and dcHiC will create the necessary files.
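The folder naming convention and the three expected files can be checked before running the select step. The snippet below is a minimal sketch, not part of dcHiC; `goldenpath_folder` and `missing_genome_files` are hypothetical helpers, and the demonstration uses a temporary directory with an empty placeholder file.

```python
import tempfile
from pathlib import Path

def goldenpath_folder(genome, resolution):
    """Return the folder name in the {genome}_{resolution}_goldenpathData pattern."""
    return f"{genome}_{resolution}_goldenpathData"

def missing_genome_files(folder, genome):
    """List which of the three expected files are absent from `folder`."""
    expected = [f"{genome}.fa", f"{genome}.tss.bed", f"{genome}.chrom.sizes"]
    return [name for name in expected if not (Path(folder) / name).is_file()]

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / goldenpath_folder("hg38", 100000)
    folder.mkdir()
    (folder / "hg38.fa").touch()  # empty placeholder, for demonstration only
    print(folder.name)
    print(missing_genome_files(folder, "hg38"))
```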
Compartment clustering: Due to statistical noise, edge cases, and other factors, lone differential compartments occasionally crop up (e.g., one bin is "significant" but all of its neighbors are not). These may be significant if analyzing at coarse resolution, but can also be misleading, especially if analyzing at very fine resolution. By default, dcHiC does not filter any of these lone compartments; however, there are two parameters to do so. The first, distclust, is the distance threshold for close differential regions to be a "cluster." If it's 0, only adjacent differential compartments form a cluster. If it's 1, differential compartments separated by up to 1 bin are a cluster. The other parameter is numberclust, which is a filter for the minimum number of significant bins within a cluster.
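To make the interaction of the two parameters concrete, here is a minimal sketch of the grouping logic as described above. It is an illustration only and not dcHiC's implementation: the function `filter_clusters` is hypothetical, it works on bin indices from a single chromosome, and its default values are chosen so that nothing is filtered.

```python
def filter_clusters(significant_bins, distclust=0, numberclust=1):
    """Group significant bin indices into clusters, then drop small clusters.

    Two significant bins join the same cluster when at most `distclust`
    non-significant bins lie between them. Clusters with fewer than
    `numberclust` significant bins are removed.
    """
    clusters = []
    for bin_index in sorted(set(significant_bins)):
        if clusters and bin_index - clusters[-1][-1] - 1 <= distclust:
            clusters[-1].append(bin_index)
        else:
            clusters.append([bin_index])
    return [cluster for cluster in clusters if len(cluster) >= numberclust]

bins = [3, 4, 6, 10, 20, 21, 22]
print(filter_clusters(bins, distclust=0, numberclust=2))  # [[3, 4], [20, 21, 22]]
print(filter_clusters(bins, distclust=1, numberclust=3))  # [[3, 4, 6], [20, 21, 22]]
```

With distclust=1, bin 6 joins bins 3 and 4 because only one bin separates them, while the lone bin 10 is removed in both settings.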
Chromosome-arm wise PCA calculation: In order to perform p and q-arm wise PCA calculations, please check the run_dcHiC_chrArms_pca_step1.pl and run_dcHiC_chrArms_combine_step2.pl scripts provided under the utility/Chromosome_ArmWise_PCA/ folder.
We previously released a different version of dcHiC (under the branch "dcHiC-v1") based on Python & R. While we hope that all users try the latest version of dcHiC, all code and documentation for the first version remains and we will continue offering support for it into the future.
- If you receive an error while installing the hashmap package (R CMD INSTALL hashmap_0.2.2.tar.gz), please install the following BH version, https://cran.r-project.org/src/contrib/Archive/BH/BH_1.72.0-3.tar.gz, and try again.
For help with installation, technical issues, interpretation, or other details, feel free to raise an issue or contact us:
Abhijit Chakraborty ([email protected]), Jeffrey Wang ([email protected]), Ferhat Ay ([email protected])