Merge branch 'v3.0' of https://github.com/Genome3D/codes3d into v3.0

Genome3d · May 19, 2022 · 806d6b9 · 806d6b9
2 parents b784d25 + e8fe185
commit 806d6b9
Show file tree

Hide file tree

Showing 2 changed files with 260 additions and 26 deletions.
diff --git a/README.md b/README.md
@@ -44,29 +44,71 @@ You will also need to install Hi-C libraries and data necessary to calculate eQT
 The CoDeS3D interface is heavily inspired by the Qiime interface (J Gregory Caporaso *et al*., Nature Methods, 2010; doi:10.1038/nmeth.f.303). Running the CoDeS3D script in the codes3d directory will drop the user into the CoDeS3D shell, in which all CoDeS3D scripts are accessible from anywhere in the system, e.g.
 
 ```
-/home/alcamerone/Documents/codes3d$ conda activate envs/codes3d_env
-(codes3d_env) /home/alcamerone/Documents/codes3d$ ./CoDeS3D
- Setting up CoDeS3D environment.
-
- Type 'exit' or press Ctrl+D at any time to leave.
-/home/alcamerone/Documents/codes3d$ CoDeS3D> cd ../project
-/home/alcamerone/Documents/project$ CoDeS3D> codes3d.py -h
-usage: codes3d.py [-h] [-s SNP_INPUT [SNP_INPUT ...]]                    
-                 [-g GENE_INPUT [GENE_INPUT ...]] [-o OUTPUT_DIR]    
-                 [--multi-test MULTI_TEST] [--pval-threshold PVAL_THRESHOLD]
-                 [-f FDR_THRESHOLD] [--maf-threshold MAF_THRESHOLD]       
-                 [-p NUM_PROCESSES] [--no-afc]                       
-                 [--afc-bootstrap AFC_BOOTSTRAP]                    
-                 [-n INCLUDE_CELL_LINES [INCLUDE_CELL_LINES ...]]        
-                 [-x EXCLUDE_CELL_LINES [EXCLUDE_CELL_LINES ...]]
-                 [--list-hic-libraries] [--match-tissues ...]                                                      
-                 [--list-tissue-tags] [-t TISSUES [TISSUES ...]]          
-                 [--eqtl-project EQTL_PROJECT] [--list-eqtl-db]            
-                 [-r RESTRICTION_ENZYMES [RESTRICTION_ENZYMES ...]]    
-                 [--list-enzymes] [--list-eqtl-tissues] [-c CONFIG]      
-                 [--output-format OUTPUT_FORMAT] [--do-not-produce-summary]
-                 [--suppress-intermediate-files] [--non-spatial]        
- [...]
+(/mnt/projects/codes3d/codes3d_env) /m/p/codes3d$ python codes3d/codes3d.py -h
+usage: codes3d.py [-h] [-s SNP_INPUT [SNP_INPUT ...]] [-g GENE_INPUT [GENE_INPUT ...]] [--snps-within-gene SNPS_WITHIN_GENE [SNPS_WITHIN_GENE ...]] [-o OUTPUT_DIR] [--multi-test MULTI_TEST] [--pval-threshold PVAL_THRESHOLD]
+                  [-f FDR_THRESHOLD] [--maf-threshold MAF_THRESHOLD] [-p NUM_PROCESSES] [--no-afc] [--afc-bootstrap AFC_BOOTSTRAP] [-n INCLUDE_CELL_LINES [INCLUDE_CELL_LINES ...]] [-x EXCLUDE_CELL_LINES [EXCLUDE_CELL_LINES ...]]
+                  [--list-hic-libraries] [--match-tissues ...] [--list-tissue-tags] [-t TISSUES [TISSUES ...]] [--eqtl-project EQTL_PROJECT] [--list-eqtl-db] [-r RESTRICTION_ENZYMES [RESTRICTION_ENZYMES ...]] [--list-enzymes]
+                  [--list-eqtl-tissues] [-c CONFIG] [--output-format OUTPUT_FORMAT] [--do-not-produce-summary] [--suppress-intermediate-files] [--non-spatial] [--gene-list GENE_LIST [GENE_LIST ...]] [--gtex-cis]
+
+CoDeS3D maps gene regulatory landscape using chromatin
+        interaction and eQTL data.
+
+optional arguments:
+  -h, --help            show this help message and exit
+  -s SNP_INPUT [SNP_INPUT ...], --snp-input SNP_INPUT [SNP_INPUT ...]
+                        The dbSNP IDs or loci of SNPs of interest in the format 'chr<x>:<locus>'. Use this flag to identify eGenes associated with the SNPs of interest.
+  -g GENE_INPUT [GENE_INPUT ...], --gene-input GENE_INPUT [GENE_INPUT ...]
+                        The symbols, Ensembl IDs or loci of genes interest in the format 'chr<x>:<start>-<end>'. Use this flag to identify eQTLs associated with the gene of interest.
+  --snps-within-gene SNPS_WITHIN_GENE [SNPS_WITHIN_GENE ...]
+                        A gene symbol, Ensembl ID or location in the format 'chr<x>:<start>-<end>'. Use this flag to identify eGenes associated with the SNPs located within the gene of interest.
+  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
+                        The directory in which to output results.
+  --multi-test MULTI_TEST
+                        Options for BH multiple-testing: ['snp', 'tissue', 'multi']. 'snp': corrects for genes associated with a given SNP in a given tissue. 'tissue': corrects for all associations in a given tissue. 'multi': corrects
+                        for all associations across all tissues tested.
+  --pval-threshold PVAL_THRESHOLD
+                        Maximum p value for mapping eQTLs. Default: 1.
+  -f FDR_THRESHOLD, --fdr-threshold FDR_THRESHOLD
+                        The FDR threshold to consider an eQTL statistically significant (default: 0.05).
+  --maf-threshold MAF_THRESHOLD
+                        Minimum MAF for variants to include. Default: 0.1.
+  -p NUM_PROCESSES, --num-processes NUM_PROCESSES
+                        Number of CPUs to use (default: half the number of CPUs).
+  --no-afc              Do not calculate allelic fold change (aFC). If true, eQTL beta (normalised effect size) is calculated instead. (default: False).
+  --afc-bootstrap AFC_BOOTSTRAP
+                        Number of bootstrap for aFC calculation (default: 1000).
+  -n INCLUDE_CELL_LINES [INCLUDE_CELL_LINES ...], --include-cell-lines INCLUDE_CELL_LINES [INCLUDE_CELL_LINES ...]
+                        Space-separated list of cell lines to include (others will be ignored). NOTE: Mutually exclusive with EXCLUDE_CELL_LINES. --match-tissues takes precedence.
+  -x EXCLUDE_CELL_LINES [EXCLUDE_CELL_LINES ...], --exclude-cell-lines EXCLUDE_CELL_LINES [EXCLUDE_CELL_LINES ...]
+                        Space-separated list of cell lines to exclude (others will be included). NOTE: Mutually exclusive with INCLUDE_CELL_LINES. --match-tissues and -n take precedence.
+  --list-hic-libraries  List available Hi-C libraries.
+  --match-tissues ...   Try to match eQTL and Hi-C tissue types using space-separated tags. When using this, make sure that it is the last tag. Note that tags are combined with the AND logic. Prepend "-" to a tag if you want it to be
+                        excluded. Use `--list-tissues-tags' for possible tags.
+  --list-tissue-tags    List tags to be used with `--match-tissues'.
+  -t TISSUES [TISSUES ...], --tissues TISSUES [TISSUES ...]
+                        Space-separated list of eQTL tissues to query. Note that tissues are case-sensitive and must be from the same eQTL projects. Default is all tissues from the GTEx project. Use 'codes3d.py --list-eqtl-tissues'
+                        for a list of installed tissues.
+  --eqtl-project EQTL_PROJECT
+                        The eQTL project to query. Default: GTEx. 'use 'codes3d.py --list-eqtl-db' to list available databases.
+  --list-eqtl-db        List available eQTL projects to query.
+  -r RESTRICTION_ENZYMES [RESTRICTION_ENZYMES ...], --restriction-enzymes RESTRICTION_ENZYMES [RESTRICTION_ENZYMES ...]
+                        Space-separated list of restriction enzymes used in Hi-C data.' Use 'list-enzymes' to see available enzymes.
+  --list-enzymes        List restriction enzymes used to prepare installed Hi-C libraries.
+  --list-eqtl-tissues   List available eQTL tissues to query.
+  -c CONFIG, --config CONFIG
+                        The configuration file to use (default: docs/codes3d.conf).
+  --output-format OUTPUT_FORMAT
+                        Determines columns of output. 'short': snp|gencode_id|gene|tissue|adj_pval| [log2_aFC|log2_aFC_lower|log2_aFC_upper] or [beta|beta_se]| maf|interaction_type|hic_score. 'medium' adds:
+                        eqtl_pval|snp_chr|snp_locus|ref|alt|gene_chr| gene_start|gene_end|distance. 'long' adds:cell_lines|cell_line_hic_scores| expression|max_expressed_tissue|max_expression| min_expressed_tissue|min_expression.
+                        (default: long).
+  --do-not-produce-summary
+                        Do not produce final summary file, stop process after mapping eQTLs. Helpful if running batches (default: False).
+  --suppress-intermediate-files
+                        Do not produce intermediate files. These can be used to run the pipeline from an intermediate stage in the event of interruption (default: False).
+  --non-spatial         Map non-spatial eQTLs.
+  --gene-list GENE_LIST [GENE_LIST ...]
+                        List of genes for non-spatial eQTL mapping.
+  --gtex-cis            Retrieve spatially unconstrained cis-eQTLs as calculated in GTEx. To be used in combination with 'non-spatial'.
 ```
 
 If you want to make CoDeS3D accessible from anywhere in your system, you can add the codes3d directory to your PATH by appending this line to the bottom of your `.bash_profile`/`.bashrc` file in your home directory:

diff --git a/data_preparation/README.md b/data_preparation/README.md
@@ -1,12 +1,204 @@
 # CoDeS3D: Contextualising Developmental SNPs in Three Dimensions
 
-## Data preparation and installation
+## Database setup
+CoDeS3D requires a PostgreSql database to run. This is because it integrates different types and sources of data, and using a database improves efficiency.
+- Install [PostgreSql](https://www.postgresql.org) if you haven't already.
+- If you plan to use more than a few Hi-C or eQTL datasets, you may want to create a tablespace for your database. To do that,
+    1. Create a directory in location with adequate disk space
+    ```
+    mkdir /path/to/tablespace
+    chown -R postgres /path/to/tablespace
+    chrgrp -R postgres /path/to/tablespace
+    ```
 
+    2. Start psql as postgres and create tablespace
+    ```
+    psql -U postgres -h localhost
+    CREATE TABLESPACE tblspace_codes3d LOCATION '</path/to/tablespace>';
+    ```
 
-### Hi-C data
+    3. Confirm that a tablespace is created
+    ```
+    \db+
+    ```
+- Create the `codes3d` user,
+```
+CREATE USER codes3d WITH PASSWORD 'your password' CREATEDB;
+CREATE DATABASE codes3d OWNER codes3d;
+```
+
+- Then create the `codes3d_commons` database,
+```
+CREATE DATABASE codes3d_commons OWNER codes3d tablespace tblspace_codes3d;
+\q
+```
+
+
+## Hi-C data pre-processing and installation
+- First process your raw Hi-C data with the [Juicer pipeline](https://github.com/aidenlab/juicer). Ensure that you use the GRCh38 reference human genome. The final outputs of the Hi-C processing should have the following columns: name of the paired-ends reads, strand (`str`), chromosome (`chr`), position (`pos`), restriction site fragment (`frag`), and mapping quality score (`score`) for both read pairs. (Read pairs with mapping quality score < 30 will be removed during installation.)
+
+- Create a sub-directory for the processed Hi-C libraries in `codes3d/lib/hic/<restriction enzyme>`. After executing the subsequent steps, your Hi-C library directory should have a structure like the following:
+```
+codes3d/lib/hic/DpnII/
+├── GCA_000001405.15_GRCh38_no_alt_analysis_set.DpnII.fragments.bed
+├── GCA_000001405.15_GRCh38_no_alt_analysis_set.DpnII.fragments.db
+├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
+├── dna.fragments.db
+└── hic_data
+    ├── A549-ACE2_0h_Ho2020
+    │   ├── A549_ACE2_0h_Ho2020_GSM4955378_merged_nodups.db
+    │   └── A549_ACE2_0h_Ho2020_GSM4955379_merged_nodups.db
+    |
+    └── TeloHAEC_Lalonde2019
+        ├── TeloHAEC_GSM3593256_merged_nodups.db
+        └── TeloHAEC_GSM3593257_merged_nodups.db
+```
+- Digest the human genome with the restriction enzyme that was used to create your Hi-C library. See `python data_preparation/digest_genome.py -h` for help.
+
+- Load the Hi-C libraries into sqlite3 or PostgreSql databases with the appropriate helper scripts: `data_preparation/hic_db_sqlite.py` for sqlite3 or `data_preparation/hic_db_postgres.py` for PostgreSql.
+
+- Update the `lib/meta_info/meta_hic.txt` with information about the Hi-C cells. This file should have the following columns: `library`, `name`, `cell_type`, `tissue`, `disease` `sex`, `ethnicity`, `karyotype`, `age`, `enzyme`, and `tags`.
+
+- Finally, load the Hi-C meta-info to the `codes3d_commons` PostgreSql database with `data_preparation/init_hic_meta.py`.
 
 
 ### eQTL data
+- CoDeS3D requires genotype and expression datasets that are prepared using GTEx's [genotype](https://github.com/broadinstitute/gtex-pipeline/tree/master/genotype) and [RNA-seq](https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq) pipelines. In addition, follow steps 1, 2, and 3 of the GTEx's [QTL pipeline](https://github.com/broadinstitute/gtex-pipeline/tree/master/qtl) to produce your normalised expression and combined covariates files.
+
+- Place all the required file in a sub-directory in `codes3d/lib/eqtls/`. Your file structure should look like this:
+```
+codes3d/lib/eqtls/GEUVADIS/
+├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.lookup_table.txt.gz
+├── GEUVADIS.445_samples.rnaseqc_tpm.gct.gz
+├── GEUVADIS.445_samples.rnaseqc_tpm_median.gct.gz
+├── GEUVADIS_covariates
+│   └── GEUVADIS.445_samples.covariates.txt
+├── GEUVADIS_expression_matrices
+│   ├── GEUVADIS.445_samples.normalized_expression.bed.gz
+│   └── GEUVADIS.445_samples.normalized_expression.bed.gz.tbi
+└── GEUVADIS_genotypes
+    ├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.bed
+    ├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.bim
+    ├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.fam
+    ├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.log
+    ├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.vcf.gz
+    └── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.vcf.gz.tbi
+```
+
+- Update the `codes3d/lib/meta_info/meta_eqtls.txt` file to include the `name`,	`tissue`,	`project`, and `tags` of your eQTL project.
+
+- Load the updated meta-info to the `codes3d_commons` database using the `data_preparation/init_eqtl_meta.py` helper script.
+
+- Create a database for your eQTL project. The name should have the syntax `eqtls_<eQTL project name>`
+```
+CREATE DATABASE eqtls_<eQTL project name> owner codes3d tablespace <tablespace>;
+```
+
+- Create a `variants` table in your new eQTL database using
+```
+data_preparation/init_gene_variant_db.py
+```
+
+- Create `variant_lookup_<restriction fragment>` tables for the restriction enzymes that used to generate your Hi-C libraries
+```
+data_preparation/init_fragment_db.py
+```
+
+- Your eQTL database should have the following tables:
+
+```
+eqtls_geuvadis=> \d
+                 List of relations
+ Schema |          Name          | Type  |  Owner  
+--------+------------------------+-------+---------
+ public | variant_lookup_hindiii | table | codes3d
+ public | variant_lookup_mboi    | table | codes3d
+ public | variants               | table | codes3d
+(3 rows)
+```
+
+- Finally, update `docs/codes3d.conf` to point CoDeS3D scripts to your eQTL datasets e.g.
+```
+.
+.
+.
+[GEUVADIS]
+eqtl_dir=../lib/eqtls/GEUVADIS/
+GENE_FP: %(eqtl_dir)sGEUVADIS.445_samples.rnaseqc_tpm_median.gct.gz
+GENOTYPES_DIR: %(eqtl_dir)sGEUVADIS_genotypes
+GENOTYPES_FP: %(GENOTYPES_DIR)s/GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.vcf.gz
+COVARIATES_DIR: %(eqtl_dir)sGEUVADIS_covariates
+EXPRESSION_DIR: %(eqtl_dir)sGEUVADIS_expression_matrices
+.
+.
+```
+
+
+## Reference tables
+CoDeS3D needs genes and SNP reference files to run.
+The gene reference file should be in `.BED` format: `chrom`, `start`, `end`,
+`gene name`, `gene id (Ensembl)`.
+Similarly, you will SNP reference files (e.g. human_9606_b151_GRCh38p7). Lastly,
+CoDeS3D uses the [RsMergeArch.bcp.gz](https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/database/organism_data/RsMergeArch.bcp.gz) reference to resolve SNPs with multiple rsIDs. Your reference files should have the following structure.
+```
+codes3d/lib/reference_files/
+├── genes
+│   ├── gene_reference.bed
+│   ├── gene_reference.db
+│   └── gene_reference_sorted.txt
+└── snps
+    ├── human_9606_b151_GRCh38p7
+    │   ├── bed_chr_1.bed
+    │   ├── bed_chr_10.bed
+    │   ├── bed_chr_11.bed
+    │   ├── bed_chr_12.bed
+    │   ├── bed_chr_13.bed
+    │   ├── bed_chr_14.bed
+    │   ├── bed_chr_15.bed
+    │   ├── bed_chr_16.bed
+    │   ├── bed_chr_17.bed
+    │   ├── bed_chr_18.bed
+    │   ├── bed_chr_19.bed
+    │   ├── bed_chr_2.bed
+    │   ├── bed_chr_20.bed
+    │   ├── bed_chr_21.bed
+    │   ├── bed_chr_22.bed
+    │   ├── bed_chr_3.bed
+    │   ├── bed_chr_4.bed
+    │   ├── bed_chr_5.bed
+    │   ├── bed_chr_6.bed
+    │   ├── bed_chr_7.bed
+    │   ├── bed_chr_8.bed
+    │   ├── bed_chr_9.bed
+    │   ├── bed_chr_MT.bed
+    │   ├── bed_chr_X.bed
+    │   └── bed_chr_Y.bed
+    ├── human_9606_b151_GRCh38p7_RsMergeArch.bcp.gz
+    └── human_9606_b151_GRCh38p7_RsMergeArch.pairs.bcp.gz
+```
+
+- Create a genes table in the `codes3d_commons` database. Make sure the `.BED`
+file has gene coordinates in GrCh38.
+```
+data_preparation/init_gene_variant_db.py
+```
 
+- Create `gene_lookup_<restriction enzyme>` lookup tables in the `codes3d_commons`
+database.
+```
+data_preparation/init_fragment_db.py
+```
 
-### Reference files
+Your `codes3d_commons` database should look like the following:
+```
+codes3d_commons=> \d
+               List of relations
+ Schema |        Name         | Type  |  Owner  
+--------+---------------------+-------+---------
+ public | gene_lookup_hindiii | table | codes3d
+ public | gene_lookup_mboi    | table | codes3d
+ public | genes               | table | codes3d
+ public | meta_eqtls          | table | codes3d
+ public | meta_hic            | table | codes3d
+(5 rows)
+```