frag_align_rDNA workflow

This is the repository for the snakemake workflow frag_align_rDNA. In a nutshell, the workflow creates artificial reads from the canonical human and mouse rDNA sequences U13369 and BK000964,respectively, and aligns reads to the corresponding reference genome to help identify rDNA-like regions across the genome.

Summary

snakemake rule DAG

The workflow consists of the following steps:

Download human and mouse reference genome files from Ensembl
Concatenate individual genome reference files and create a single reference file for bowtie2
Index the single reference files with bowtie2-build
Download the rDNA reference sequences from GenBank: U13369 for human, BK000964 for mouse.
Fragment rDNA sequence using a non-overlapping sliding-window approach using a read length of L = 100, 500 and 1000 bp. This is done using a custom Python script workflow/scripts/fragment_seq.py. The script also allows for overlapping sliding-windows as well as sampling uniformly from the reference sequence. Details on how to call fragment_seq.py can be found in the snakemake rule.
Align all rDNA subsequences to the corresponding reference genome with bowtie2, using default parameters except for switch --all to report all alignments.
Index resulting BAM file with samtools index.
Create BigWig coverage track from BAM file using deeptools bamCoverage.
Create BED output frin BAM file using bedtools bamtobed.
Create Circos plots to visualise the mapping of rDNA fragments to the corresponding reference genome.

Workflow deployment and requirements

The snakemake workflow should be entirely self-contained and reproducible.

Provided snakemake is installed, the workflow can be deployed to a new system by following these steps

# Clone repository
git clone https://github.com/mevers/frag_align_rDNA
cd frag_align_rDNA

# Execute workflow
./run_workflow.sh
# Or explicitly:
# snakemake --use-conda --cores

External dependencies include bowtie2, samtools, deeptools and R (along with the additional R/Bioconductor libraries tidyverse, Biostrings, Rsamtools, circlize), and will be met automatically through rule-specific conda environments. There is no need to manually install bowtie2 etc., although it doesn't affect the workflow if these tools are already installed.

How to make changes to the workflow

To change (parts of) the workflow, a minimal familiarity with snakemake is recommended. For example, to fragment the rDNA with overlapping (instead of non-overlapping) windows the input targets of rule all need to be changed, because the window and step size are inferred from the output file names. Details about specific rules are given in the corresponding snakemake rules files in folder workflow/rules.

More extensive modifications may require new rules and/or changes to the existing rules.

Details & comments

Performance

Running bowtie2 with -all to report all alignments takes a very long time. Be aware, when changing parameters of the rDNA fragment process and increasing the number of rDNA fragments.

Location of rDNA repeats across the genome

Human rDNA tandem repeats are located on the acrocentric chromosomes 13, 14, 15, 21, 22 (, Y).

References (by no means complete):

Mouse rDNA tandem repeats are located on chromosomes (11,) 12, 15, 16, 18 and 19.

References (by no means complete):

Results

The Circos plots visualise where in the genome specific (non-overlapping) fragments of the rDNA map to. The width of every arc corresponds to the length of the fragment. The darker the shading the more regions a specific rDNA fragment maps to.

Human

Fragment length = 100 bp, non-overlapping	Fragment length = 500 bp, non-overlapping	Fragment length = 1000 bp, non-overlapping

Mouse

Fragment length = 100 bp, non-overlapping	Fragment length = 500 bp, non-overlapping	Fragment length = 1000 bp, non-overlapping

Figures are included as PNGs in folders 05_circos_plots/mouse/GRCm38 and 05_circos_plots/mouse/GRCm38. Scalable PDF versions of the figures are generated when executing the full workflow.

In summary:

Most 100 bp long chunks from the human and mouse rDNA have matches across the genome; these matches are often one-to-many mappings, meaning that one 100 bp long rDNA region maps to multiple loci in the genome.
Genomic matches are not necessarily on accrocentric chromosomes but are located throughout the entire genome.
In mouse, there are regions in the canonical rDNA sequence that have no matches elsewhere in the genome. This may be indicative of (1) a poor rDNA reference sequence, and/or (2) an incomplete genome reference assembly.
Genomic regions that match rDNA regions overlap with repetitive elements, mainly AluY elements in human, and B1/B2 SINE elements in mouse.

Further selected references:

Bugs

None. Yet.

Contact

Maurits Evers ([email protected])

Please raise any questions, concerns, bugs as an Issue on the GitHub project site.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
04_output_data		04_output_data
05_circos_plots		05_circos_plots
06_overlap_analysis		06_overlap_analysis
workflow		workflow
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
report.html		report.html
run_workflow.sh		run_workflow.sh
workflow_dag.png		workflow_dag.png
workflow_rulegraph.png		workflow_rulegraph.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

frag_align_rDNA workflow

Summary