genome-randomisation-survey

A survey of different genome randomisation methods.

Files:

README.md: this readme file
LICENSE: software licence
bin: directory of software scripts
data: data directory
docs: documents, including draft manuscript

Work flow:

Fetch sequences for testing different tools. Selected sequences cover a range of lengths and C+G contents. These include single gene sequences (both ncRNA and protein-coding), viroid, viral, bacterial, archaeal and eukaryotic chromosomes and genomes.

-- All sequences are sourced from the EMBL nucleotide archive (ENA), selected to uniformally cover a range of lengths and C+G contents (docs/figures/seqStats.pdf)

-- See data/sequenceList.tsv for a list of sequence accessions and descriptions

-- To fetch the sequences from ENA, run:

./bin/fetchSeqs.pl

#Attempted to use for simuG to randomise sequence, accounting for coding regions -- fails to read GFF:

#embl2gff.pl ./data/sequences/AB000109.1.embl > ./data/sequences/AB000109.1.gff

#Check and strip off non-ACGT characters -- causes phastSim to fail... ls ./data/sequences/*fasta | awk '{print "echo "$1" && esl-seqstat -c -a "$1}' | sh > ./data/sequences/esl-seqstat.out egrep 'fasta|^residue|^Total' ./data/sequences/esl-seqstat.out | egrep -v 'residue: A|residue: C|residue: G|residue: T' | grep -B 2 ^resi egrep 'fasta|^residue|^Total' ./data/sequences/esl-seqstat.out | egrep -v 'residue: A|residue: C|residue: G|residue: T' | grep -B 2 ^resi | grep fasta$ | awk '{print "./bin/iupac24char.pl -i "$1"> blah && mv blah "$1}' | sh

Run & time different randomisation methods. Selected methods include:

shuffle methods
EASEL esl-shuffle
EMBOSS shuffleseq
Shuffler
USHUFFLE
Markov methods & simulated evolutionary methods
EASEL esl-shuffle -0|1 (-w)
Shuffler (-m)
ROSE
PhastSim
Add random mutations or sequencing errors (SNPs, INDELs, structural variation etc.)
simuG
Monte-Carlo methods
?????????
Quick run:

./bin/generateNullSeqs.pl -n 10 -d ./data/sequences  -o ./data/sequences-null -v  && cat ./data/sequences-null/times.txt

Evaluate shuffled sequences

similarity with input sequence
C+G content and shared k-mers
ORFs, ncRNAs (Rfam), domains (Pfam), ...

Combine statistics: https://www.nature.com/articles/s41598-021-86465-y Fisher's method: -2*\sum log(p) ~ \chi^2

Generate figure(s)
Write manuscript

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

genome-randomisation-survey

Files:

Work flow:

ORFs, ncRNAs (Rfam), domains (Pfam), ...

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
bin		bin
data		data
docs		docs
LICENSE		LICENSE
README.md		README.md

License

Gardner-BinfLab/genome-randomisation-survey

Folders and files

Latest commit

History

Repository files navigation

genome-randomisation-survey

Files:

Work flow:

ORFs, ncRNAs (Rfam), domains (Pfam), ...

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages