Draft genome assembly track

Håkon Kaspersen edited this page Sep 26, 2024 · 8 revisions

Information

This track uses regular paired-end Illumina reads to generate draft assemblies for prokaryotic organisms. First, the reads are quality-controlled with FastQC and MultiQC, followed by trimming with Trim-Galore. The reads are also checked for contamination with Kraken2. Next, the reads are downsampled to a user-defined coverage level using Rasusa, followed by a second round of FastQC and MultiQC. The trimmed and subsampled reads are then assembled using Unicycler. Coverage is calculated by mapping the reads to their respective assemblies, and Quast provides general quality metrics for each assembly. Lastly, an HTML report is generated that summarizes the various statistics.

How to run

Generate input samplesheet

The input samplesheet can be generated with the generate_input.R script located in the bin directory. The script assumes that all the reads to be analysed are located in the same directory.

Rscript Assemblage/bin/generate_input.R illumina /path/to/read/data pattern r1_suffix r2_suffix
  • illumina: Specifies that the samplesheet is for Illumina reads only
  • pattern: Specifies what should be removed from the file name to obtain the sample name shared by R1 and R2
  • r1_suffix / r2_suffix: The part of the file name that distinguishes the R1 file from the R2 file

Pattern and suffix example: If the filenames are sample1_L001_R1_001.fastq.gz and sample1_L001_R2_001.fastq.gz, the pattern would be everything after sample1, i.e. _L001_R._001.fastq.gz, with a period in place of the 1 or 2 so that the same pattern matches both the R1 and the R2 file. Here, the suffixes would be R1_001.fastq.gz and R2_001.fastq.gz.

Usage example: Given the reads:

sample1_R1.fastq.gz
sample1_R2.fastq.gz
sample2_R1.fastq.gz
sample2_R2.fastq.gz
...

Use the following command:

Rscript Assemblage/bin/generate_input.R illumina /path/to/read/data _R..fastq.gz R1.fastq.gz R2.fastq.gz

This will generate the following samplesheet.csv in your current directory:

sample,R1,R2
sample1,/path/to/read/data/sample1_R1.fastq.gz,/path/to/read/data/sample1_R2.fastq.gz
sample2,/path/to/read/data/sample2_R1.fastq.gz,/path/to/read/data/sample2_R2.fastq.gz
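Before starting a run, it can be worth sanity-checking the generated samplesheet. The Python sketch below checks the header and a few basic row properties against the layout shown above. `check_samplesheet` is a hypothetical helper and is not part of Assemblage; the pipeline may perform its own, different validation.

```python
import csv
import io

# Column layout as shown in the example samplesheet.
EXPECTED_HEADER = ["sample", "R1", "R2"]


def check_samplesheet(text: str) -> list[str]:
    """Return a list of problems found in the samplesheet text (empty if none)."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_HEADER:
        return [f"unexpected header: {reader.fieldnames}"]

    problems = []
    seen = set()
    # Data rows start on line 2, right after the header.
    for line_no, row in enumerate(reader, start=2):
        sample = row["sample"]
        if sample in seen:
            problems.append(f"line {line_no}: duplicate sample {sample!r}")
        seen.add(sample)
        if not row["R1"] or not row["R2"]:
            problems.append(f"line {line_no}: missing R1 or R2 path")
        elif row["R1"] == row["R2"]:
            problems.append(f"line {line_no}: R1 and R2 point to the same file")
    return problems


sheet = """sample,R1,R2
sample1,/path/to/read/data/sample1_R1.fastq.gz,/path/to/read/data/sample1_R2.fastq.gz
sample2,/path/to/read/data/sample2_R1.fastq.gz,/path/to/read/data/sample2_R2.fastq.gz
"""

print(check_samplesheet(sheet))
```

To check a file on disk, read it first, for example with `pathlib.Path("samplesheet.csv").read_text()`, and pass the result to the function.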

Pipeline execution

Run the pipeline by passing the main.nf file in the Assemblage directory to nextflow run.

nextflow run Assemblage/main.nf --track draft --input samplesheet.csv --genome_size 5500000 --out_dir assemblage --kraken_db kraken_db_path -profile apptainer -work-dir $USERWORK/assemblage -c path_to_config

Parameters

Input and output
--input:                      Input csv file
--out_dir:                    Output directory name

Tool-specific parameters
--phred_score:                Minimum phred score value for trimming, default: 15
--error_rate:                 Maximum allowed error rate, default: 0.1
--minlength:                  Minimum read length after trimming, default: 20
--kraken_db:                  Path to a kraken2 database directory, mandatory option
--genome_size:                Estimated genome size of the organism in base pairs (e.g. 5500000), used to calculate relative coverage
--coverage:                   Target coverage for the subsampling
--unicycler_mode:             Specify the mode used in Unicycler, default: normal
--min_contig_len:             Minimum contig length in Unicycler, default: 500
--depth_filter:               Depth filter used in Unicycler, default: 0.25
--output_bam:                 Use to output BAM files from mapping

Optional output parameters
--output_trimmed_reads:       Use to output the trimmed reads in the output directory
--output_kraken_reports:      Use to output the kraken2 reports
--output_subsampled_reads:    Use to output the subsampled reads
--output_coverage_reports:    Use to output coverage reports
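The relationship between --genome_size and --coverage can be illustrated with a small worked example in Python. Rasusa derives the number of bases to keep as genome size times target coverage; how Assemblage itself computes relative coverage is assumed here to be total bases divided by genome size. The genome size comes from the execution example above, while the target coverage, read count, and read length are made-up values for illustration.

```python
def estimated_coverage(total_bases: int, genome_size: int) -> float:
    """Average depth if all bases were spread evenly across the genome."""
    return total_bases / genome_size


def target_bases(genome_size: int, coverage: float) -> int:
    """Number of bases to keep when subsampling to the target coverage."""
    return int(genome_size * coverage)


def fraction_to_keep(total_bases: int, genome_size: int, coverage: float) -> float:
    """Share of the input bases retained; capped at 1.0 when the data are already below target."""
    return min(1.0, target_bases(genome_size, coverage) / total_bases)


genome_size = 5_500_000   # value used in the execution example
coverage = 100            # hypothetical target coverage
read_pairs = 2_000_000    # hypothetical number of read pairs
read_length = 150         # hypothetical read length in bases

total = 2 * read_pairs * read_length  # two reads per pair

print(f"input coverage: {estimated_coverage(total, genome_size):.1f}x")
print(f"bases to keep:  {target_bases(genome_size, coverage)}")
print(f"fraction kept:  {fraction_to_keep(total, genome_size, coverage):.2f}")
```

An inaccurate genome size shifts both numbers: an overestimate makes the input look shallower than it is and causes more reads to be kept than the target coverage actually requires.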