Skip to content

Commit

Permalink
Merge branch 'v3.0' of https://github.com/Genome3D/codes3d into v3.0
Browse files Browse the repository at this point in the history
  • Loading branch information
tayaza committed May 19, 2022
2 parents b784d25 + e8fe185 commit 806d6b9
Show file tree
Hide file tree
Showing 2 changed files with 260 additions and 26 deletions.
88 changes: 65 additions & 23 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,29 +44,71 @@ You will also need to install Hi-C libraries and data necessary to calculate eQT
The CoDeS3D interface is heavily inspired by the Qiime interface (J Gregory Caporaso *et al*., Nature Methods, 2010; doi:10.1038/nmeth.f.303). Running the CoDeS3D script in the codes3d directory will drop the user into the CoDeS3D shell, in which all CoDeS3D scripts are accessible from anywhere in the system, e.g.

```
/home/alcamerone/Documents/codes3d$ conda activate envs/codes3d_env
(codes3d_env) /home/alcamerone/Documents/codes3d$ ./CoDeS3D
Setting up CoDeS3D environment.
Type 'exit' or press Ctrl+D at any time to leave.
/home/alcamerone/Documents/codes3d$ CoDeS3D> cd ../project
/home/alcamerone/Documents/project$ CoDeS3D> codes3d.py -h
usage: codes3d.py [-h] [-s SNP_INPUT [SNP_INPUT ...]]
[-g GENE_INPUT [GENE_INPUT ...]] [-o OUTPUT_DIR]
[--multi-test MULTI_TEST] [--pval-threshold PVAL_THRESHOLD]
[-f FDR_THRESHOLD] [--maf-threshold MAF_THRESHOLD]
[-p NUM_PROCESSES] [--no-afc]
[--afc-bootstrap AFC_BOOTSTRAP]
[-n INCLUDE_CELL_LINES [INCLUDE_CELL_LINES ...]]
[-x EXCLUDE_CELL_LINES [EXCLUDE_CELL_LINES ...]]
[--list-hic-libraries] [--match-tissues ...]
[--list-tissue-tags] [-t TISSUES [TISSUES ...]]
[--eqtl-project EQTL_PROJECT] [--list-eqtl-db]
[-r RESTRICTION_ENZYMES [RESTRICTION_ENZYMES ...]]
[--list-enzymes] [--list-eqtl-tissues] [-c CONFIG]
[--output-format OUTPUT_FORMAT] [--do-not-produce-summary]
[--suppress-intermediate-files] [--non-spatial]
[...]
(/mnt/projects/codes3d/codes3d_env) /m/p/codes3d$ python codes3d/codes3d.py -h
usage: codes3d.py [-h] [-s SNP_INPUT [SNP_INPUT ...]] [-g GENE_INPUT [GENE_INPUT ...]] [--snps-within-gene SNPS_WITHIN_GENE [SNPS_WITHIN_GENE ...]] [-o OUTPUT_DIR] [--multi-test MULTI_TEST] [--pval-threshold PVAL_THRESHOLD]
[-f FDR_THRESHOLD] [--maf-threshold MAF_THRESHOLD] [-p NUM_PROCESSES] [--no-afc] [--afc-bootstrap AFC_BOOTSTRAP] [-n INCLUDE_CELL_LINES [INCLUDE_CELL_LINES ...]] [-x EXCLUDE_CELL_LINES [EXCLUDE_CELL_LINES ...]]
[--list-hic-libraries] [--match-tissues ...] [--list-tissue-tags] [-t TISSUES [TISSUES ...]] [--eqtl-project EQTL_PROJECT] [--list-eqtl-db] [-r RESTRICTION_ENZYMES [RESTRICTION_ENZYMES ...]] [--list-enzymes]
[--list-eqtl-tissues] [-c CONFIG] [--output-format OUTPUT_FORMAT] [--do-not-produce-summary] [--suppress-intermediate-files] [--non-spatial] [--gene-list GENE_LIST [GENE_LIST ...]] [--gtex-cis]
CoDeS3D maps gene regulatory landscape using chromatin
interaction and eQTL data.
optional arguments:
-h, --help show this help message and exit
-s SNP_INPUT [SNP_INPUT ...], --snp-input SNP_INPUT [SNP_INPUT ...]
The dbSNP IDs or loci of SNPs of interest in the format 'chr<x>:<locus>'. Use this flag to identify eGenes associated with the SNPs of interest.
-g GENE_INPUT [GENE_INPUT ...], --gene-input GENE_INPUT [GENE_INPUT ...]
The symbols, Ensembl IDs or loci of genes interest in the format 'chr<x>:<start>-<end>'. Use this flag to identify eQTLs associated with the gene of interest.
--snps-within-gene SNPS_WITHIN_GENE [SNPS_WITHIN_GENE ...]
A gene symbol, Ensembl ID or location in the format 'chr<x>:<start>-<end>'. Use this flag to identify eGenes associated with the SNPs located within the gene of interest.
-o OUTPUT_DIR, --output-dir OUTPUT_DIR
The directory in which to output results.
--multi-test MULTI_TEST
Options for BH multiple-testing: ['snp', 'tissue', 'multi']. 'snp': corrects for genes associated with a given SNP in a given tissue. 'tissue': corrects for all associations in a given tissue. 'multi': corrects
for all associations across all tissues tested.
--pval-threshold PVAL_THRESHOLD
Maximum p value for mapping eQTLs. Default: 1.
-f FDR_THRESHOLD, --fdr-threshold FDR_THRESHOLD
The FDR threshold to consider an eQTL statistically significant (default: 0.05).
--maf-threshold MAF_THRESHOLD
Minimum MAF for variants to include. Default: 0.1.
-p NUM_PROCESSES, --num-processes NUM_PROCESSES
Number of CPUs to use (default: half the number of CPUs).
--no-afc Do not calculate allelic fold change (aFC). If true, eQTL beta (normalised effect size) is calculated instead. (default: False).
--afc-bootstrap AFC_BOOTSTRAP
Number of bootstrap for aFC calculation (default: 1000).
-n INCLUDE_CELL_LINES [INCLUDE_CELL_LINES ...], --include-cell-lines INCLUDE_CELL_LINES [INCLUDE_CELL_LINES ...]
Space-separated list of cell lines to include (others will be ignored). NOTE: Mutually exclusive with EXCLUDE_CELL_LINES. --match-tissues takes precedence.
-x EXCLUDE_CELL_LINES [EXCLUDE_CELL_LINES ...], --exclude-cell-lines EXCLUDE_CELL_LINES [EXCLUDE_CELL_LINES ...]
Space-separated list of cell lines to exclude (others will be included). NOTE: Mutually exclusive with INCLUDE_CELL_LINES. --match-tissues and -n take precedence.
--list-hic-libraries List available Hi-C libraries.
--match-tissues ... Try to match eQTL and Hi-C tissue types using space-separated tags. When using this, make sure that it is the last tag. Note that tags are combined with the AND logic. Prepend "-" to a tag if you want it to be
excluded. Use `--list-tissues-tags' for possible tags.
--list-tissue-tags List tags to be used with `--match-tissues'.
-t TISSUES [TISSUES ...], --tissues TISSUES [TISSUES ...]
Space-separated list of eQTL tissues to query. Note that tissues are case-sensitive and must be from the same eQTL projects. Default is all tissues from the GTEx project. Use 'codes3d.py --list-eqtl-tissues'
for a list of installed tissues.
--eqtl-project EQTL_PROJECT
The eQTL project to query. Default: GTEx. 'use 'codes3d.py --list-eqtl-db' to list available databases.
--list-eqtl-db List available eQTL projects to query.
-r RESTRICTION_ENZYMES [RESTRICTION_ENZYMES ...], --restriction-enzymes RESTRICTION_ENZYMES [RESTRICTION_ENZYMES ...]
Space-separated list of restriction enzymes used in Hi-C data.' Use 'list-enzymes' to see available enzymes.
--list-enzymes List restriction enzymes used to prepare installed Hi-C libraries.
--list-eqtl-tissues List available eQTL tissues to query.
-c CONFIG, --config CONFIG
The configuration file to use (default: docs/codes3d.conf).
--output-format OUTPUT_FORMAT
Determines columns of output. 'short': snp|gencode_id|gene|tissue|adj_pval| [log2_aFC|log2_aFC_lower|log2_aFC_upper] or [beta|beta_se]| maf|interaction_type|hic_score. 'medium' adds:
eqtl_pval|snp_chr|snp_locus|ref|alt|gene_chr| gene_start|gene_end|distance. 'long' adds:cell_lines|cell_line_hic_scores| expression|max_expressed_tissue|max_expression| min_expressed_tissue|min_expression.
(default: long).
--do-not-produce-summary
Do not produce final summary file, stop process after mapping eQTLs. Helpful if running batches (default: False).
--suppress-intermediate-files
Do not produce intermediate files. These can be used to run the pipeline from an intermediate stage in the event of interruption (default: False).
--non-spatial Map non-spatial eQTLs.
--gene-list GENE_LIST [GENE_LIST ...]
List of genes for non-spatial eQTL mapping.
--gtex-cis Retrieve spatially unconstrained cis-eQTLs as calculated in GTEx. To be used in combination with 'non-spatial'.
```

If you want to make CoDeS3D accessible from anywhere in your system, you can add the codes3d directory to your PATH by appending this line to the bottom of your `.bash_profile`/`.bashrc` file in your home directory:
Expand Down
198 changes: 195 additions & 3 deletions data_preparation/README.md
Original file line number Diff line number Diff line change
@@ -1,12 +1,204 @@
# CoDeS3D: Contextualising Developmental SNPs in Three Dimensions

## Data preparation and installation
## Database setup
CoDeS3D requires a PostgreSql database to run. This is because it integrates different types and sources of data, and using a database improves efficiency.
- Install [PostgreSql](https://www.postgresql.org) if you haven't already.
- If you plan to use more than a few Hi-C or eQTL datasets, you may want to create a tablespace for your database. To do that,
1. Create a directory in location with adequate disk space
```
mkdir /path/to/tablespace
chown -R postgres /path/to/tablespace
chrgrp -R postgres /path/to/tablespace
```

2. Start psql as postgres and create tablespace
```
psql -U postgres -h localhost
CREATE TABLESPACE tblspace_codes3d LOCATION '</path/to/tablespace>';
```

### Hi-C data
3. Confirm that a tablespace is created
```
\db+
```
- Create the `codes3d` user,
```
CREATE USER codes3d WITH PASSWORD 'your password' CREATEDB;
CREATE DATABASE codes3d OWNER codes3d;
```

- Then create the `codes3d_commons` database,
```
CREATE DATABASE codes3d_commons OWNER codes3d tablespace tblspace_codes3d;
\q
```


## Hi-C data pre-processing and installation
- First process your raw Hi-C data with the [Juicer pipeline](https://github.com/aidenlab/juicer). Ensure that you use the GRCh38 reference human genome. The final outputs of the Hi-C processing should have the following columns: name of the paired-ends reads, strand (`str`), chromosome (`chr`), position (`pos`), restriction site fragment (`frag`), and mapping quality score (`score`) for both read pairs. (Read pairs with mapping quality score < 30 will be removed during installation.)

- Create a sub-directory for the processed Hi-C libraries in `codes3d/lib/hic/<restriction enzyme>`. After executing the subsequent steps, your Hi-C library directory should have a structure like the following:
```
codes3d/lib/hic/DpnII/
├── GCA_000001405.15_GRCh38_no_alt_analysis_set.DpnII.fragments.bed
├── GCA_000001405.15_GRCh38_no_alt_analysis_set.DpnII.fragments.db
├── GCA_000001405.15_GRCh38_no_alt_analysis_set.fa
├── dna.fragments.db
└── hic_data
├── A549-ACE2_0h_Ho2020
│   ├── A549_ACE2_0h_Ho2020_GSM4955378_merged_nodups.db
│   └── A549_ACE2_0h_Ho2020_GSM4955379_merged_nodups.db
|
└── TeloHAEC_Lalonde2019
├── TeloHAEC_GSM3593256_merged_nodups.db
└── TeloHAEC_GSM3593257_merged_nodups.db
```
- Digest the human genome with the restriction enzyme that was used to create your Hi-C library. See `python data_preparation/digest_genome.py -h` for help.

- Load the Hi-C libraries into sqlite3 or PostgreSql databases with the appropriate helper scripts: `data_preparation/hic_db_sqlite.py` for sqlite3 or `data_preparation/hic_db_postgres.py` for PostgreSql.

- Update the `lib/meta_info/meta_hic.txt` with information about the Hi-C cells. This file should have the following columns: `library`, `name`, `cell_type`, `tissue`, `disease` `sex`, `ethnicity`, `karyotype`, `age`, `enzyme`, and `tags`.

- Finally, load the Hi-C meta-info to the `codes3d_commons` PostgreSql database with `data_preparation/init_hic_meta.py`.


### eQTL data
- CoDeS3D requires genotype and expression datasets that are prepared using GTEx's [genotype](https://github.com/broadinstitute/gtex-pipeline/tree/master/genotype) and [RNA-seq](https://github.com/broadinstitute/gtex-pipeline/tree/master/rnaseq) pipelines. In addition, follow steps 1, 2, and 3 of the GTEx's [QTL pipeline](https://github.com/broadinstitute/gtex-pipeline/tree/master/qtl) to produce your normalised expression and combined covariates files.

- Place all the required file in a sub-directory in `codes3d/lib/eqtls/`. Your file structure should look like this:
```
codes3d/lib/eqtls/GEUVADIS/
├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.lookup_table.txt.gz
├── GEUVADIS.445_samples.rnaseqc_tpm.gct.gz
├── GEUVADIS.445_samples.rnaseqc_tpm_median.gct.gz
├── GEUVADIS_covariates
│   └── GEUVADIS.445_samples.covariates.txt
├── GEUVADIS_expression_matrices
│   ├── GEUVADIS.445_samples.normalized_expression.bed.gz
│   └── GEUVADIS.445_samples.normalized_expression.bed.gz.tbi
└── GEUVADIS_genotypes
├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.bed
├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.bim
├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.fam
├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.log
├── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.vcf.gz
└── GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.vcf.gz.tbi
```

- Update the `codes3d/lib/meta_info/meta_eqtls.txt` file to include the `name`, `tissue`, `project`, and `tags` of your eQTL project.

- Load the updated meta-info to the `codes3d_commons` database using the `data_preparation/init_eqtl_meta.py` helper script.

- Create a database for your eQTL project. The name should have the syntax `eqtls_<eQTL project name>`
```
CREATE DATABASE eqtls_<eQTL project name> owner codes3d tablespace <tablespace>;
```

- Create a `variants` table in your new eQTL database using
```
data_preparation/init_gene_variant_db.py
```

- Create `variant_lookup_<restriction fragment>` tables for the restriction enzymes that used to generate your Hi-C libraries
```
data_preparation/init_fragment_db.py
```

- Your eQTL database should have the following tables:

```
eqtls_geuvadis=> \d
List of relations
Schema | Name | Type | Owner
--------+------------------------+-------+---------
public | variant_lookup_hindiii | table | codes3d
public | variant_lookup_mboi | table | codes3d
public | variants | table | codes3d
(3 rows)
```

- Finally, update `docs/codes3d.conf` to point CoDeS3D scripts to your eQTL datasets e.g.
```
.
.
.
[GEUVADIS]
eqtl_dir=../lib/eqtls/GEUVADIS/
GENE_FP: %(eqtl_dir)sGEUVADIS.445_samples.rnaseqc_tpm_median.gct.gz
GENOTYPES_DIR: %(eqtl_dir)sGEUVADIS_genotypes
GENOTYPES_FP: %(GENOTYPES_DIR)s/GEUVADIS.445_samples.GRCh38.20170504.maf01.filtered.nodup.vcf.gz
COVARIATES_DIR: %(eqtl_dir)sGEUVADIS_covariates
EXPRESSION_DIR: %(eqtl_dir)sGEUVADIS_expression_matrices
.
.
```


## Reference tables
CoDeS3D needs genes and SNP reference files to run.
The gene reference file should be in `.BED` format: `chrom`, `start`, `end`,
`gene name`, `gene id (Ensembl)`.
Similarly, you will SNP reference files (e.g. human_9606_b151_GRCh38p7). Lastly,
CoDeS3D uses the [RsMergeArch.bcp.gz](https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/database/organism_data/RsMergeArch.bcp.gz) reference to resolve SNPs with multiple rsIDs. Your reference files should have the following structure.
```
codes3d/lib/reference_files/
├── genes
│   ├── gene_reference.bed
│   ├── gene_reference.db
│   └── gene_reference_sorted.txt
└── snps
├── human_9606_b151_GRCh38p7
│   ├── bed_chr_1.bed
│   ├── bed_chr_10.bed
│   ├── bed_chr_11.bed
│   ├── bed_chr_12.bed
│   ├── bed_chr_13.bed
│   ├── bed_chr_14.bed
│   ├── bed_chr_15.bed
│   ├── bed_chr_16.bed
│   ├── bed_chr_17.bed
│   ├── bed_chr_18.bed
│   ├── bed_chr_19.bed
│   ├── bed_chr_2.bed
│   ├── bed_chr_20.bed
│   ├── bed_chr_21.bed
│   ├── bed_chr_22.bed
│   ├── bed_chr_3.bed
│   ├── bed_chr_4.bed
│   ├── bed_chr_5.bed
│   ├── bed_chr_6.bed
│   ├── bed_chr_7.bed
│   ├── bed_chr_8.bed
│   ├── bed_chr_9.bed
│   ├── bed_chr_MT.bed
│   ├── bed_chr_X.bed
│   └── bed_chr_Y.bed
├── human_9606_b151_GRCh38p7_RsMergeArch.bcp.gz
└── human_9606_b151_GRCh38p7_RsMergeArch.pairs.bcp.gz
```

- Create a genes table in the `codes3d_commons` database. Make sure the `.BED`
file has gene coordinates in GrCh38.
```
data_preparation/init_gene_variant_db.py
```

- Create `gene_lookup_<restriction enzyme>` lookup tables in the `codes3d_commons`
database.
```
data_preparation/init_fragment_db.py
```

### Reference files
Your `codes3d_commons` database should look like the following:
```
codes3d_commons=> \d
List of relations
Schema | Name | Type | Owner
--------+---------------------+-------+---------
public | gene_lookup_hindiii | table | codes3d
public | gene_lookup_mboi | table | codes3d
public | genes | table | codes3d
public | meta_eqtls | table | codes3d
public | meta_hic | table | codes3d
(5 rows)
```

0 comments on commit 806d6b9

Please sign in to comment.