- A survey of different genome randomisation methods.
- Fetch sequences for testing different tools. Selected sequences cover a range of lengths and C+G contents. These include single gene sequences (both ncRNA and protein-coding), viroid, viral, bacterial, archaeal and eukaryotic chromosomes and genomes.
-- All sequences are sourced from the EMBL nucleotide archive (ENA), selected to uniformly cover a range of lengths and C+G contents (docs/figures/seqStats.pdf)
-- See data/sequenceList.tsv for a list of sequence accessions and descriptions
-- To fetch the sequences from ENA, run:
#Attempted to use simuG to randomise sequences while accounting for coding regions -- it fails to read the GFF:
#embl2gff.pl ./data/sequences/AB000109.1.embl > ./data/sequences/AB000109.1.gff
Check for and strip non-ACGT characters, which cause phastSim to fail.
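
The clean-up step above could be sketched as follows. This is a minimal illustration, not the code used in this repository: the function name `strip_non_acgt` is hypothetical, and two choices are assumptions: the sequence is upper-cased first, and offending characters are deleted rather than replaced.

```python
import re

# Anything that is not an unambiguous DNA base (assumes upper-cased input).
NON_ACGT = re.compile(r"[^ACGT]")


def strip_non_acgt(seq):
    """Upper-case `seq` and delete every character that is not A, C, G or T.

    Returns the cleaned sequence and the number of characters removed, so
    that heavily ambiguous inputs can be flagged before randomisation.
    """
    upper = seq.upper()
    cleaned = NON_ACGT.sub("", upper)
    return cleaned, len(upper) - len(cleaned)


if __name__ == "__main__":
    cleaned, removed = strip_non_acgt("acgtNNRYacg-t")
    print(cleaned, removed)
```

Reporting the number of removed characters makes it easy to spot sequences where deletion would change the length substantially.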
- Run & time different randomisation methods. Selected methods include:
shuffle methods
EASEL esl-shuffle
EMBOSS shuffleseq
Markov methods & simulated evolutionary methods
EASEL esl-shuffle -0|1 (-w)
Shuffler (-m)
Add random mutations or sequencing errors (SNPs, INDELs, structural variation etc.)
Monte-Carlo methods
Quick run:
./bin/generateNullSeqs.pl -n 10 -d ./data/sequences -o ./data/sequences-null -v && cat ./data/sequences-null/times.txt
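
To make the difference between the shuffle and Markov categories above concrete, here is a small sketch. It is only an illustration of the two ideas and is not how esl-shuffle, shuffleseq or Shuffler are implemented; the function names are hypothetical. A shuffle permutes the input, so base composition is preserved exactly; a first-order Markov generator samples a new sequence from observed dinucleotide transition counts, so composition is preserved only approximately.

```python
import random
from collections import Counter, defaultdict


def shuffle_sequence(seq, rng):
    """Return a random permutation of `seq` (exact base composition kept)."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def markov1_sequence(seq, rng, length=None):
    """Sample a sequence from the first-order transition counts of `seq`."""
    if not seq:
        return ""
    length = len(seq) if length is None else length
    # transitions[a][b] = number of times base b follows base a in the input
    transitions = defaultdict(Counter)
    for a, b in zip(seq, seq[1:]):
        transitions[a][b] += 1
    composition = Counter(seq)
    out = [seq[0]]  # start from the first input base (a simplification)
    while len(out) < length:
        # Fall back to overall composition if the last base has no successor.
        counts = transitions.get(out[-1]) or composition
        symbols = list(counts)
        weights = [counts[s] for s in symbols]
        out.append(rng.choices(symbols, weights=weights, k=1)[0])
    return "".join(out[:length])


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed for reproducible null sequences
    original = "ACGTACGGTTAACCGGATAT"
    print(shuffle_sequence(original, rng))
    print(markov1_sequence(original, rng))
```

Passing an explicit `random.Random` instance keeps each null sequence reproducible from its seed.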
- Evaluate shuffled sequences
similarity with input sequence
C+G content and shared k-mers
Combine statistics (https://www.nature.com/articles/s41598-021-86465-y). Fisher's method: -2 \sum_{i=1}^{k} \ln(p_i) ~ \chi^2_{2k} for k independent p-values
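
The evaluation measures above can be sketched as follows. This is a hedged illustration under stated assumptions, not the analysis code of this project: all function names are hypothetical, "shared k-mers" is measured here as a Jaccard index over k-mer sets (one of several reasonable choices), and Fisher's method assumes the p-values are independent. The chi-squared tail probability uses the closed form that holds for an even number of degrees of freedom (2k), so no external library is needed.

```python
import math


def gc_content(seq):
    """Fraction of bases that are C or G (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def kmer_set(seq, k):
    """Set of all overlapping k-mers in `seq`."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_fraction(a, b, k):
    """Jaccard index of the k-mer sets of sequences `a` and `b`."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def fisher_combine(p_values):
    """Combine independent p-values; return (statistic, combined p-value).

    The statistic is -2 * sum(ln p_i), chi-squared with 2k degrees of
    freedom under the null. For even degrees of freedom the upper tail is
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, min(1.0, math.exp(-half) * total)


if __name__ == "__main__":
    print(gc_content("ACGGTC"))
    print(shared_kmer_fraction("ACGTACGT", "ACGTTGCA", 3))
    print(fisher_combine([0.05, 0.20, 0.01]))
```

A Jaccard index near 1 at larger k suggests the null sequence still closely resembles its input; k is left as a free parameter here.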
Generate figure(s)
Write manuscript