Full list of commands for the 2020 assembly of 221 S. aureus isolates from the DFU100 cohort (after removal of non-S. aureus isolates), sequenced on the Illumina HiSeq in late 2019, together with the subsequent coverage & gene content analyses on the assembled genomes and the selection of a subset of these isolates for Oxford Nanopore long-read sequencing. Some 'assembly' portions of this README are shared with that of EAGenomeAssembly. Steps 1-4 were performed on the 223 (presumed*) S. aureus isolates at the same time as on some other submitted isolates.
* One of the 223 DFU genomes turned out to be S. xylosus, and another turned out to be S. simulans
Conda environment configuration files for this project are located here:
- EAGenomeEnv
- BlastEnv
- BowtieEnv
- QuastEnv
- R_envir
- TreeEnv
- TrimmingEnvironment
- updated_annotation_env
Here, we run FastQC on the raw, demultiplexed reads to assess read quality prior to adapter or length trimming. Outputs a quality report
The shell script runs FastQC in paired-end mode with default settings, from EAGenomeEnv.
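As a sketch, each shell's FastQC step likely reduces to a call of this shape (the sample file names and output directory are placeholders, not the actual paths used):

```python
# Sketch of a FastQC call over one isolate's paired reads.
# File names and the output directory are hypothetical placeholders.
def fastqc_cmd(r1, r2, outdir):
    """Build the argv for a FastQC run over one read pair."""
    return ["fastqc", "--outdir", outdir, r1, r2]

cmd = fastqc_cmd("DORN001_R1.fastq.gz", "DORN001_R2.fastq.gz", "FastQC_Reports")
print(" ".join(cmd))
```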
Here, we run TrimGalore to remove adapter sequences from the raw Illumina reads and trim the noisy ends observed in the FastQC step. Make_TrimGalore_shells.py builds shell scripts which contain commands to run TrimGalore (from TrimmingEnvironment) in paired-end mode, set to a stringency of 10, with a length cutoff of 70bp, and set to clip 10bp off the 5' end of each read. These shells also make use of the --nextera flag to specifically remove the Nextera transposase sequence. Outputs trimmed reads
From TrimmingEnvironment , make shell scripts.
OutPutShells="/home/acampbe/EAGenomeAssembly/TrimGaloreShells"
OutPutTrimGalore="/project/grice/storage/HiSeq/WGS/HiSeq_19/TrimmedFastqs_TrimGalore"
CondaPath="/home/acampbe/software/miniconda3/bin/activate"

python3 Make_TrimGalore_shells.py --inputdirname $InputDirectory --outputdirshells $OutPutShells --outputdir_trimgalore $OutPutTrimGalore --conda_activatepath $CondaPath --nshells 10
The following shell scripts run TrimGalore on each isolate:
Run_TrimGalore_0.sh
Run_TrimGalore_1.sh
Run_TrimGalore_2.sh
Run_TrimGalore_3.sh
Run_TrimGalore_4.sh
Run_TrimGalore_5.sh
Run_TrimGalore_6.sh
Run_TrimGalore_7.sh
Run_TrimGalore_8.sh
Run_TrimGalore_9.sh
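Putting the parameters above together, each generated shell is expected to wrap a TrimGalore call along these lines (the sample name, paths, and helper function are illustrative, not copied from the actual scripts):

```python
# Sketch of the per-isolate TrimGalore call the generated shells wrap:
# paired-end mode, stringency 10, 70bp length cutoff, 10bp 5' clip,
# Nextera adapter removal. Sample name and paths are placeholders.
def trim_galore_cmd(r1, r2, outdir):
    return [
        "trim_galore", "--paired", "--nextera",
        "--stringency", "10",
        "--length", "70",
        "--clip_R1", "10", "--clip_R2", "10",
        "--output_dir", outdir,
        r1, r2,
    ]

cmd = trim_galore_cmd("DORN001_R1.fastq.gz", "DORN001_R2.fastq.gz",
                      "TrimmedFastqs_TrimGalore")
print(" ".join(cmd))
```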
After running the trimming shells, run FastQC again to confirm that TrimGalore actually improved the quality of our reads going into the assembly step. Outputs an updated quality report
Run SPAdes, via Unicycler's wrapper for it, on the Illumina short reads only. Outputs assembled contigs
A python script generates 20 shell scripts which, from EAGenomeEnv, call unicycler on the paired-end short reads with default settings (running SPAdes over a range of k-mers)
python3 Make_UnicyclerIlluminaShells.py --inputdirname /project/grice/storage/HiSeq/WGS/HiSeq_19/TrimmedFastqs_TrimGalore --outputdirshells UnicyclerIlluminaShells --outputdir_unicycler /project/grice/storage/HiSeq/WGS/HiSeq_19/UnicyclerAssemble --conda_activatepath /home/acampbe/software/miniconda3/bin/activate --nshells 20
For each i in 1...20, run:
Run_UnicyclerSpades_<i>.sh
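Each generated shell reduces to a short-read-only Unicycler call of roughly this shape (the sample name and directories are placeholders):

```python
# Sketch of one Unicycler short-read-only call with default settings.
# Sample name and paths are hypothetical placeholders.
def unicycler_cmd(r1, r2, outdir):
    return ["unicycler", "-1", r1, "-2", r2, "-o", outdir]

cmd = unicycler_cmd("DORN001_val_1.fq.gz", "DORN001_val_2.fq.gz",
                    "UnicyclerAssemble/DORN001")
print(" ".join(cmd))
```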
5. Align trimmed reads directly to the SPAdes assembly for each isolate to evaluate coverage and depth.
Using BowtieEnv, the following shell scripts perform a Bowtie2 alignment of the trimmed reads to the SPAdes-generated assemblies from step 4, then use samtools to extract depth-by-contig information from these alignments.
Outputs bowtie alignments of trimmed reads against each isolate's assembly, plus the depth and length of each contig in each isolate (DFU_Covg_stats.tsv)
Run Bowtie2, Samtools, and Galaxy tools
Make output .csv file of depths and lengths by contig of each isolate
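The depth-by-contig extraction can be sketched as follows, assuming the three-column contig/position/depth text that samtools depth emits (note that samtools depth only reports covered positions by default, so the per-contig position count here is covered length rather than total length):

```python
# Sketch: summarize samtools depth output into per-contig mean depth
# and position count, as in DFU_Covg_stats.tsv. Input lines are
# assumed to be tab-separated "contig  position  depth".
from collections import defaultdict

def depth_by_contig(lines):
    totals = defaultdict(int)   # summed depth per contig
    lengths = defaultdict(int)  # number of positions seen per contig
    for line in lines:
        contig, _pos, depth = line.split("\t")
        totals[contig] += int(depth)
        lengths[contig] += 1
    return {c: (totals[c] / lengths[c], lengths[c]) for c in totals}

example = ["contig1\t1\t10", "contig1\t2\t20", "contig2\t1\t5"]
print(depth_by_contig(example))  # contig1: mean 15.0 over 2 positions
```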
After the initial contig assembly, we want to make sure that conclusions about gene content and SNP-level relationships between isolates are based on a real representation of each genome, not on artifacts of contamination or misassembly. While the latter is more difficult to detect in the absence of long reads against which we can scaffold, we should at least filter out contigs which are too short, of too low coverage, or which BLAST to an organism other than S. aureus. Outputs a record of retained versus removed contigs, plus cleaned contigs for core genome analysis and tree building
We first sort the contigs into three categories:
- Remove: Contigs which are < 500bp in length or <10X depth
- Follow up: Contigs whose depth falls outside the range (Median Contig Depth for the Isolate) ± 1.5 x (Depth IQR)
- Keep: Contigs which meet neither of the above criteria
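The three-way triage above can be sketched like this (the cutoffs come from the text; the function name and toy values are illustrative, and statistics.quantiles stands in for however the R script computes the depth IQR):

```python
# Sketch of the contig triage: remove short/shallow contigs, flag
# depth outliers (median ± 1.5 x IQR) for follow-up, keep the rest.
from statistics import median, quantiles

def classify_contigs(contigs):
    """contigs: list of (name, length_bp, depth) for one isolate."""
    depths = [d for _, _, d in contigs]
    med = median(depths)
    q1, _, q3 = quantiles(depths, n=4)
    lo, hi = med - 1.5 * (q3 - q1), med + 1.5 * (q3 - q1)
    calls = {}
    for name, length, depth in contigs:
        if length < 500 or depth < 10:
            calls[name] = "remove"
        elif depth < lo or depth > hi:
            calls[name] = "followup"
        else:
            calls[name] = "keep"
    return calls

demo = [("c1", 200000, 40), ("c2", 300, 35), ("c3", 5000, 400),
        ("c4", 100000, 42), ("c5", 80000, 39)]
print(classify_contigs(demo))
```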
Then, we search the contigs labeled for "follow up" against the BLAST nt database. We note any follow-up contig (high- or low-coverage outlier) whose best hit (by bitscore) is not Staphylococcus aureus (this turned out to be only one low-coverage outlier contig, which was circular and mapped closely to known S. aureus plasmids, so we did not remove it).
Run R script to classify contigs' keep vs. followup vs. drop status based on depth and length
Run a shell calling the python script to write .fastas based on the depth-based contig classification
Blastn query for every contig classified for 'followup'
Add back the "follow-up" contigs that don't need to be removed (in fact, none did). This shell script uses Add_SA_contigs_Back.R to parse the blast output and write a fasta of the 'added back' contigs
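The best-hit check can be sketched as below, assuming tabular BLAST output with the 12 standard -outfmt 6 columns plus an appended subject title; the parser and demo lines are illustrative:

```python
# Sketch: pick each follow-up contig's best BLAST hit by bitscore and
# flag hits that are not S. aureus. Assumes tab-separated lines with
# the 12 default outfmt-6 fields plus a trailing subject title.
def best_hits(blast_lines):
    best = {}  # query -> (bitscore, subject title)
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        query, title, bitscore = fields[0], fields[-1], float(fields[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, title)
    return best

demo = [
    "c3\ts1\t99\t100\t0\t0\t1\t100\t1\t100\t1e-50\t200\tStaphylococcus aureus plasmid",
    "c3\ts2\t90\t100\t0\t0\t1\t100\t1\t100\t1e-40\t150\tStaphylococcus epidermidis",
]
hits = best_hits(demo)
flagged = [q for q, (_, t) in hits.items() if "Staphylococcus aureus" not in t]
print(hits, flagged)
```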
From the cleaned assemblies, we build a SNP-based, maximum likelihood phylogeny of the 221 S. aureus isolates with the following goals:
- Visualize and interpret the evolutionary relationships between the different isolates observed in the DFU100 cohort.
- Identify subject- and healing outcome-specific clades of S. aureus
- Interpret the isolate genomes in the context of known clinical strains of S. aureus by estimating the positions of each reference genome a posteriori
Shell scripts contained here call Prokka with default settings to predict gene coordinates and protein sequences. Outputs .gff annotation files, gene coordinate maps, and .faa protein translations for each genome
For each i in 1...9, run:
Run_Prokka<i>.sh
Run Prokka on the selected S. aureus reference genomes and outgroup S. epidermidis
Run_Prokka_references.sh
2. Run Roary
Define core & accessory genomes for the set of isolates and build a multiple alignment between core genomes. The pplacer algorithm, which will place the reference genomes on the maximum likelihood phylogeny of isolates, requires an alignment between all isolates and the references as input. Therefore, we also run roary (and its wrapper for PRANK) on the isolates in aggregate with the reference genomes. Outputs core and accessory genomes for each isolate, a probabilistic multiple alignment of core genomes for the set with and without the references/outgroup included, and a SNP distance matrix for each alignment
Run RAxML to build a maximum likelihood, SNP-based tree from the core genome alignment of the 221 isolates, with a random seed set, using the GTR-gamma nucleotide substitution model. Outputs a maximum likelihood tree in .newick format, as well as a text file containing the parameters of the final ML model.
Pplacer estimates the maximum posterior probability location of each reference genome on the maximum likelihood tree output by RAxML. Note: a silly versioning issue (pplacer was developed against a different version of RAxML than I used) meant I had to manually remove the "Partition: 0 with name: No Name Provided" string from the RAxML parameter file (input to pplacer with the '-s' flag). Outputs a .jplace file containing the mappings of each reference genome to the tree, and a .newick representation of this mapping thanks to tog
This R script visualizes the ML tree of isolates with references placed on it, collapsing clades of closely related genomes, recording the descendants contained in each collapsed clade. outputs plots for tree visualization as well as a mapping of the closest reference genome by SNP distance to each of the isolates' core genomes
- Get sequencing depth and breadth of coverage estimates for each isolate
- Summarize confirmed circular contigs (e.g. plasmids) present in each isolate's genome
- Identify regions of each genome which differ from reference genomes most closely related by core SNP distance
- Select a subset of the 221 S. aureus isolates to perform Oxford Nanopore long read sequencing on.
Here, we calculate depth as the average # of bases mapped to each position on an assembly, which we extract from Bowtie2 alignments of each isolate's trimmed reads to its assembly. We calculate breadth of coverage as the proportion of positions along a reference genome to which an isolate's reads map with at least 10X depth. We extract this information from Bowtie2 alignments of each isolate's trimmed reads to its closest (by core SNP distance) reference genome. The "closest" reference on the tree to each isolate was output in step 2 of the previous section.
This script runs the bowtie alignments for both cases (an isolate's reads against its own assembly, and against its closest reference genome). Outputs one text file with the average depth per isolate, and another with the base coverage count for each isolate against its reference, as well as the reference length
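The two summaries reduce to arithmetic like this (toy per-position depths; in practice they come from the Bowtie2 alignments):

```python
# Sketch of the two coverage summaries: mean depth over positions, and
# breadth as the fraction of reference positions with >= 10X depth.
def mean_depth(depths):
    return sum(depths) / len(depths)

def breadth_at_10x(depths, ref_length):
    covered = sum(1 for d in depths if d >= 10)
    return covered / ref_length

depths = [0, 5, 12, 30, 15, 8]
print(mean_depth(depths))         # 70/6 = 11.666...
print(breadth_at_10x(depths, 6))  # 3 of 6 positions at >=10X = 0.5
```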
Calculate statistics such as total # contigs, N50, L50 on the cleaned assemblies using QUAST.
Run MiniMap2 pairwise alignments between each isolate's assembly and the reference genome on the tree it's closest to (by SNP distance). This enables us to identify areas of an isolate that aren't represented in its closest reference genome. Outputs tabular summary of pairwise alignment
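A sketch of how "unrepresented" sequence can be totalled from the PAF output: merge each query's alignment intervals and subtract the covered length from the query length. Column handling assumes standard PAF (columns 1-4 are query name, length, start, end); the demo lines are illustrative.

```python
# Sketch: total up isolate-assembly sequence not covered by any
# alignment block to the closest reference, from minimap2 PAF lines.
def unmapped_length(paf_lines):
    intervals, qlen = {}, {}
    for line in paf_lines:
        f = line.split("\t")
        name, length, start, end = f[0], int(f[1]), int(f[2]), int(f[3])
        qlen[name] = length
        intervals.setdefault(name, []).append((start, end))
    total = 0
    for name, ivs in intervals.items():
        ivs.sort()
        covered, right = 0, 0
        for s, e in ivs:
            s = max(s, right)          # skip already-counted overlap
            if e > s:
                covered += e - s
            right = max(right, e)
        total += qlen[name] - covered
    return total

demo = ["c1\t1000\t0\t400\t+\tref\t2800\t0\t400\t380\t400\t60",
        "c1\t1000\t300\t700\t+\tref\t2800\t300\t700\t390\t400\t60"]
print(unmapped_length(demo))  # 700 bp covered of 1000 -> 300 unmapped
```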
4. Aggregate statistics from steps 1-3 to select representative isolates from each cluster in step 1 for ONP sequencing
Rscript AggregateGenomeStats.R
5. Identify "clusters" of isolate genomes which share >99.99% core genome identity. Select representative isolates for each cluster based on the output of Step 4.
The SubsettingCleanedAssemblies_forONP.R script constructs groups of isolates from the same subject that all differ by fewer than 118 SNPs in their core genomes (0.01% of the length of the core genome alignment between all isolates). It selects 87 representative isolates (to fill a 96-well plate) from the set based on the following criteria:
- All subjects represented in the set of 221 isolates should have at least one isolate sequenced with ONP
- Any isolate which differs by more than 118 SNPs from all other isolates should be sequenced with ONP (44 isolates total)
- At least one isolate should be sequenced from each core genome-based cluster. The representative should be chosen with the following "prioritized" criteria:
- High enough depth so we have at least one really good hybrid assembly from that ONP scaffold
- Imperfect coverage to the closest reference genome (step 1)
- Prioritize isolates with excess unmapped sequence to their closest reference by pairwise alignment (step 3)
- Fewer plasmids which have already been circularized using short read sequencing/ plasmidSPAdes
- Fragmented assemblies with many shorter contigs
- While representatives for the 11 largest clusters (>7 isolates each) were chosen manually based on the above criteria and plots of breadth vs. coverage, representatives for the 12th through 42nd clusters were chosen in an automated fashion with the following nested criteria:
- Maximum total unmapped sequence to closest reference genome.
- Most fragmented assembly, based on minimum N50 score, if unmapped lengths equal
- Minimum number of circularized plasmids if N50 equal
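The nested tie-breaking above amounts to a lexicographic sort key; a sketch with illustrative field names:

```python
# Sketch of the automated representative choice for clusters 12-42:
# maximize unmapped sequence to the closest reference, then minimize
# N50 (most fragmented), then minimize circularized-plasmid count.
def pick_representative(isolates):
    """isolates: list of dicts with 'name', 'unmapped_bp', 'n50',
    and 'n_circular_plasmids' (hypothetical field names)."""
    return min(
        isolates,
        key=lambda i: (-i["unmapped_bp"], i["n50"], i["n_circular_plasmids"]),
    )["name"]

cluster = [
    {"name": "DORN10", "unmapped_bp": 5000, "n50": 90000, "n_circular_plasmids": 2},
    {"name": "DORN11", "unmapped_bp": 5000, "n50": 60000, "n_circular_plasmids": 3},
    {"name": "DORN12", "unmapped_bp": 2000, "n50": 40000, "n_circular_plasmids": 0},
]
print(pick_representative(cluster))  # DORN11: tied on unmapped, lower N50
```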
- Another QC step I added in early 2021 (I will subsequently update the previous steps listed here to reflect repeated analyses with the now-filtered set of genomes)
- Test the sequences we have for evidence of within-species contamination, based mostly on this paper
- Basically, map trimmed reads back to the cleaned contigs using Bowtie again
- This time, perform SNP calling on those alignments
- Filter to SNPs in 10X or higher depth positions of the contigs
- Based on the Raven et al. paper's findings, identify and remove genomes in which there are >30 SNPs that don't fall within 'hot spots' of SNPs <50bp of one another
- This identified 14 genomes in our set
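A sketch of that filter's logic (thresholds are from the text; single-contig toy positions are used here, whereas the real scripts work per genome across contigs):

```python
# Sketch of the Raven et al.-style purity test: keep >=10X SNPs,
# discard SNPs within 50 bp of a neighbour (hot spots), and flag a
# genome if more than 30 isolated SNPs remain.
def is_contaminated(snps, min_depth=10, hotspot_bp=50, max_isolated=30):
    """snps: list of (position, depth) on one genome (single contig
    assumed for simplicity)."""
    pos = sorted(p for p, d in snps if d >= min_depth)
    isolated = 0
    for i, p in enumerate(pos):
        near_prev = i > 0 and p - pos[i - 1] < hotspot_bp
        near_next = i < len(pos) - 1 and pos[i + 1] - p < hotspot_bp
        if not (near_prev or near_next):
            isolated += 1
    return isolated > max_isolated

# 40 well-spaced high-depth SNPs -> flagged
spaced = [(i * 1000, 25) for i in range(40)]
# the same positions at 5X depth -> filtered out, not flagged
shallow = [(i * 1000, 5) for i in range(40)]
print(is_contaminated(spaced), is_contaminated(shallow))  # True False
```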
Run Run_Bowtie_DORN_PostCleaning.sh to align trimmed raw reads against cleaned contigs for each genome.
Run TestAdmixture.sh. This script uses:
- samtools faidx, view, sort, and index to make a sorted bam representation of the bowtie alignment and samtools indexed representation of the contigs file
- samtools mpileup and bcftools call to call SNPs on the bowtie alignments. It outputs a .bcf file for each of the 221 genomes
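In outline, the shell chains commands of roughly this shape (file names are placeholders, and the exact flags are assumptions based on typical samtools/bcftools usage rather than read from the script itself; the mpileup output is piped into bcftools call):

```python
# Sketch of the SNP-calling pipeline the shell wires together.
# Genome name, file names, and flags are illustrative assumptions;
# older samtools versions expose mpileup directly as described above.
genome = "DORN001"
steps = [
    ["samtools", "faidx", f"{genome}_cleaned.fasta"],
    ["samtools", "view", "-b", f"{genome}.sam", "-o", f"{genome}.bam"],
    ["samtools", "sort", f"{genome}.bam", "-o", f"{genome}_sorted.bam"],
    ["samtools", "index", f"{genome}_sorted.bam"],
    # mpileup output is piped into bcftools call in the real shell
    ["samtools", "mpileup", "-g", "-f", f"{genome}_cleaned.fasta",
     f"{genome}_sorted.bam"],
    ["bcftools", "call", "-m", "-v", "-O", "b", "-o", f"{genome}.bcf"],
]
for s in steps:
    print(" ".join(s))
```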
getDP4Counts.py parses the .bcf output and writes every called variant from every genome into BCF_output_stats.csv, a .csv with the following fields:
- Genome (DORN ID for the genome)
- Position (Position on the genome)
- Depth (Depth of read alignments to the position)
- VariantFrequency (from the DP4 column of the BCF: (# forward reads mapped to the variant allele + # reverse reads mapped to the variant allele) / (total reads mapped at the position))
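As a sketch of that arithmetic (DP4 is ordered ref-forward, ref-reverse, alt-forward, alt-reverse; the DP4 sum stands in here for the total reads at the position):

```python
# Sketch of the VariantFrequency computation from a DP4 tag:
# (forward ALT + reverse ALT) / sum of all four DP4 counts.
def variant_frequency(dp4):
    ref_f, ref_r, alt_f, alt_r = dp4
    return (alt_f + alt_r) / (ref_f + ref_r + alt_f + alt_r)

print(variant_frequency((10, 10, 5, 5)))  # 10 alt reads / 30 total ≈ 0.333
```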
4. Perform contamination test based on Raven et al.
AdmixtureOutput.R filters the table generated in step 3 to positions with 10X or greater depth, then counts the number of SNPs per genome that lie >50 bp from any other SNP. It identifies genomes with >30 such SNPs and writes them to ContaminatedIsolates2-12-21.csv
Note: the DBGWAS-related steps were performed initially on the full set of 221 isolates in Dec 2020-Jan 2021 (as reflected in commit history), but I added this documentation when I'd done them instead on the set of 207 isolates (having performed the purity tests above and subsequently removed 14 isolates).