
Major updates for LTR_retriever and LAI

@oushujun oushujun released this 19 Jun 16:12
· 147 commits to master since this release

LTR_retriever

New features

  1. Add LTR_digest support
    • The *pass.list.gff3 becomes readable to LTR_digest.
    • You can also use /LTR_retriever/database/TEfam.hmm to feed LTR_digest.
  2. Improved gff3 output for intact LTR-RTs
    • Add strand info for each element.
    • The ones with '?' (unknown direction) in the *pass.list will remain '?'s in the *pass.list.gff3 file.
  3. Improve multi-threading efficiency
    • Use the Thread::Queue module to replace the Thread::Semaphore module
    • At least 100% more efficient
  4. Add macOS support (High Sierra v10.13.3 tested)
  5. Add a script to summarize the genome percentage of each TE family using RepeatMasker .out files
    • Usage: perl ./LTR_retriever/bin/fam_coverage.pl TE_lib RM_output genome_size_bp > TE_fam.size.list
    • Works not only with LTRs but also with other TEs in the RM.out file.
  6. Add a script to summarize the genome percentage of each TE superfamily (TE summary table for genome publications)
    • Usage: perl ./LTR_retriever/bin/fam_summary.pl TE_fam.size.list genome_size_bp > TE_fam.sum.txt
    • Summary tables for LTR families and superfamilies are added to the output of LTR_retriever
  7. Add a script to calculate the distribution of LTRs (Copia, Gypsy, and unknown) along chromosomes.
    • Usage: perl ./LTR_retriever/bin/LTR_sum.pl -genome genome.fa -all genome.fa.RM.out [options]
    • Options:
      -window [int] bp size of the sliding window, default 3,000,000
      -step [int] bp size of the moving step, default 300,000
      -intact indicate the -all file is an LTR_retriever .pass.list instead of a RepeatMasker .out file
    • The .out.LTR.distribution.txt file is generated by default.
  8. Add a script for whole-genome forward simulation (randomly add mutations on the genome)
    • Usage: perl ./LTR_retriever/bin/simulate_mutation.pl -g genome.fasta -u [0-1] > genome.mutated.fasta
    • -u specifies the mutation rate; e.g., -u 0.01 will randomly mutate 1% of the entire genome.
  9. Replace annotate_gff.pl with make_gff3_with_RMout.pl for better whole-genome LTR-RT annotation
    • Usage: perl ./LTR_retriever/bin/make_gff3.pl genome.fa.RepeatMasker.out > genome.fa.RepeatMasker.gff
    • Applied basic hit filtering: SW_score>=300, alignment length >= 80 bp
  10. Add more usage information to -h
  11. Update README
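
The numeric rules in items 7 and 9 above can be illustrated with a short Python sketch. This is not the code of LTR_sum.pl or of the gff3 script; the function names are hypothetical, the handling of the last (shorter) window is an assumption, and the alignment length is assumed to be measured on the query coordinates of the RepeatMasker .out line.

```python
def sliding_windows(seq_len, window=3_000_000, step=300_000):
    """Return 1-based inclusive (start, end) windows covering seq_len bp.

    Defaults mirror the -window and -step defaults listed above.
    Assumption: the final window is truncated at the sequence end.
    """
    if seq_len <= 0 or window <= 0 or step <= 0:
        raise ValueError("seq_len, window and step must be positive")
    windows = []
    start = 1
    while start <= seq_len:
        end = min(start + window - 1, seq_len)
        windows.append((start, end))
        if end == seq_len:
            break  # the sequence end has been reached
        start += step
    return windows


def passes_basic_filter(rm_line, min_score=300, min_len=80):
    """Apply SW_score >= 300 and alignment length >= 80 bp to one .out line.

    RepeatMasker .out columns used here: 1 = SW score,
    6 = query begin, 7 = query end (1-based, inclusive).
    Header and blank lines return False.
    """
    fields = rm_line.split()
    if len(fields) < 7 or not fields[0].isdigit():
        return False
    score = int(fields[0])
    begin, end = int(fields[5]), int(fields[6])
    return score >= min_score and (end - begin + 1) >= min_len
```

With a 10 bp toy sequence, `sliding_windows(10, window=4, step=2)` yields four overlapping windows, which makes it easy to check how -window and -step interact before running on a real genome.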

Bugs fixed

  1. The program halted when nothing was masked in truncated candidates.
  2. The program halted when multiple LTR_retriever tasks simultaneously checked RepeatMasker in the same directory.
  3. substr went out of range when self-corrected reads were used as input.

LAI Version b2

  1. Rewrite LTR_calc.pl with more accurate and efficient algorithms.
    • Add the -step parameter for overlapping-sliding window scheme to estimate LAI
    • Output the size of the genome for genome LAI
    • Memory consumption of this script is approx. 2X the size of the input genome
  2. To control for the boom-and-bust dynamics of LTR-RTs, adjust the raw LAI based on LTR identity.
    • Estimate mean identity of LTR sequences in a genome using all-versus-all blastn search
    • Add a quick estimation (-q) of genomic LTR identity based on a log-linear model with the slope estimated from three small subsets of LTRs
    • To avoid abnormal adjustment, if the estimated LTR identity is <= 92% or >= 96%, it is set to 92% or 96%, respectively
    • Use the -unlock parameter to release the restriction of LTR identity ([92, 96]) for good genomes with extreme LTR activities
    • Set LAI_adj=0 if raw LAI==0
    • The alignment identity cutoff (-iden) excludes hits with identity higher than this value from the LTR identity calculation. Default: 100 (%)
  3. Change the output naming of LAI to raw_LAI and LAI_adj to LAI for easier description.
  4. Add polyploid support.
    • If the input genome is a polyploid (a diploidized ancient polyploid does not count), only one set of chromosomes (1x, a monoploid) should be used to estimate LAI. Otherwise, the LTR identity will be erroneously estimated to be very high, which substantially decreases the LAI.
    • Use the -mono parameter to provide a list of chromosome names of a monoploid, LAI will be calculated only on these sequences.
    • Users can run LAI multiple times with different monoploids specified to obtain the whole genome LAI estimation.
  5. Set prerequisites of LAI estimation
    • Set intact LTR-RT limit >= 0.01%
    • Set total LTR limit >= 5%
  6. Add the -totLTR parameter for customized total LTR content
  7. Add the -window parameter to control window size
  8. Add the rush mode (-qq) to quickly estimate raw LAI for version comparison. Raw LAI should not be used to compare between different species because the LTR dynamic is not controlled.
  9. Add status output of the LAI program. LAI is a default output of LTR_retriever. You should rerun LAI with the -mono parameter if the target genome is a polyploid.
  10. Add macOS support (High Sierra v10.13.3 tested).
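
The identity restriction in item 2 and the prerequisites in item 5 can be sketched in Python as below. This is only an illustration of the thresholds stated above, not the LAI implementation: the function names are hypothetical, both limits are assumed to be percentages of the genome, and the formula that converts raw LAI and LTR identity into the adjusted LAI is not given in these notes, so it is left out.

```python
def clamp_ltr_identity(identity_pct, unlock=False, low=92.0, high=96.0):
    """Restrict the estimated mean LTR identity (%) to [low, high].

    With unlock=True (mirroring the -unlock parameter) the estimate
    is returned unchanged.
    """
    if unlock:
        return identity_pct
    return max(low, min(high, identity_pct))


def meets_lai_prerequisites(intact_ltr_pct, total_ltr_pct,
                            min_intact=0.01, min_total=5.0):
    """Check the two content limits listed in item 5.

    Assumption: both inputs are percentages of the genome.
    """
    return intact_ltr_pct >= min_intact and total_ltr_pct >= min_total


if __name__ == "__main__":
    for estimate in (90.0, 94.2, 97.5):
        print(estimate, "->", clamp_ltr_identity(estimate))
    print(meets_lai_prerequisites(0.02, 6.0))
```

Estimates inside the interval pass through unchanged, so the restriction only affects genomes with unusually low or high LTR identity.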