From 3785127d3c9728202b875e6e1acff7bb383ec8b4 Mon Sep 17 00:00:00 2001
From: Per Unneberg
Date: Thu, 21 Dec 2023 14:02:27 +0100
Subject: [PATCH] Documentation updates

---
 DESCRIPTION             |  2 +-
 NEWS.md                 | 31 ++++++++++++-------
 README.Rmd              |  4 +++
 vignettes/empirical.Rmd | 66 +++++++++++++++++++++++++++++++++++++++++
 vignettes/genecovr.Rmd  | 22 +++++++-------
 5 files changed, 104 insertions(+), 21 deletions(-)
 create mode 100644 vignettes/empirical.Rmd

diff --git a/DESCRIPTION b/DESCRIPTION
index d608929..078c2dd 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: genecovr
 Title: Gene body coverage analysis to evaluate genome assemblies
-Version: 0.1.0
+Version: 0.1.1
 Authors@R:
     person(given = "Per",
            family = "Unneberg",
diff --git a/NEWS.md b/NEWS.md
index 3c5811d..e4f53c4 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,22 +1,33 @@
-# Release 0.0.0.9013
+
+
+# genecovr 0.1.1
+
+- update README
+- add Empirical studies section
+
+# genecovr 0.1.0
+
+- add pkgdown site
+
+# genecovr 0.0.0.9013
 
 - fix factor level ordering for geneBodyCoverage plot
 - save geneBodyCoverage as tsv
 
-# Release 0.0.0.9012
+# genecovr 0.0.0.9012
 
 - adjust factor levels for number of inserts (#4)
 - summarize number of inserts by transcript (#5)
 
-# Release 0.0.0.9011
+# genecovr 0.0.0.9011
 
 - fix order of factors
 
-# Release 0.0.0.9010
+# genecovr 0.0.0.9010
 
 - remove duplicate entries in psl input
 
-# Release 0.0.0.9009
+# genecovr 0.0.0.9009
 
 - add plot of transcript length distributions conditioned on number of
   mapped contigs
@@ -26,21 +37,21 @@
   DataFrame inputs, obviating the need to rerun geneBodyCoverage
   multiple times in genecovr script
 
-
-# Release 0.0.0.9008
+# genecovr 0.0.0.9008
 
 - Remove characters trailing first space in fasta headers
 
-# Release 0.0.0.9007
+# genecovr 0.0.0.9007
 
 - Fix conversion of DNAStringSet to Seqinfo
 - Make sure geneBodyCoverage table has nmax levels
 
-
-# Release 0.0.0.9006
+# genecovr 0.0.0.9006
 
 - add depthOfCoverage function and analysis to vignette and script
 - reduceHitCoverage is deprecated
 - improve some docs
 - add wrapper for saving plots
 - add tests mainly for alignmentpairs and test setup
+
+
diff --git a/README.Rmd b/README.Rmd
index dad725e..6292e73 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -40,6 +40,10 @@ GitHub](https://github.com/nbis) with:
 devtools::install_github("NBISweden/genecovr")
 ```
 
+The tool has been developed and tested on GNU/Linux systems but should
+work on any system that runs `R`. Installation is expected to take at
+most a couple of minutes.
+
 ## Usage
 
 ### genecovr script quick start
diff --git a/vignettes/empirical.Rmd b/vignettes/empirical.Rmd
new file mode 100644
index 0000000..b21c26e
--- /dev/null
+++ b/vignettes/empirical.Rmd
@@ -0,0 +1,66 @@
+---
+title: "Empirical studies"
+author: "Per Unneberg"
+date: "`r Sys.Date()`"
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Empirical studies}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+biblio-style: plain
+bibliography: bibliography.bib
+---
+
+# Northern krill
+
+`genecovr` was used to assess the quality metrics of the Northern
+krill genome.
+
+To test genecovr with the 19 Gb Northern krill genome and gene data
+(16,509 transcripts of protein coding genes), access the collection in
+the SciLifeLab Data Repository named "Ecological genomics of the
+Northern krill" using the following permanent link:
+
+< URL to be provided >
+
+1. Genome file
+
+Access item: 1. Ecological genomics of the Northern krill: Genome
+assembly DNA sequences
+
+Download: northern_krill.genome_assembly.tar.gz
+
+Extract genome assembly for evaluation:
+1.m_norvegica.main_w_mito.fasta
+
+2. Gene models
+
+Access item: 3. Ecological genomics of the Northern krill: Genome
+assembly annotations (genes and repeats)
+
+Download: trinity_transcript.16509_single_isoforms.cds.fasta.tar.gz
+
+Extract and use transcripts for evaluation:
+trinity_transcript.16509_single_isoforms.cds.fasta
+
+3. gmap alignment
+
+Map transcripts to assembly with gmap:
+
+    # Build index
+    gmap_build --genomedb mnorvegica 1.m_norvegica.main_w_mito.fasta
+    # Map with gmap; format=1 -> psl output
+    gmap -t 12 --dir . --db mnorvegica --format 1 trinity_transcript.16509_single_isoforms.cds.fasta > mnorvegica.psl
+
+4. genecovr input file
+
+Generate a comma-separated file, assemblies.csv, with the following contents:
+
+    main,mnorvegica.psl,1.m_norvegica.main_w_mito.fasta,trinity_transcript.16509_single_isoforms.cds.fasta
+
+and run
+
+    genecovr assemblies.csv
+
+This will generate a number of summary data files along with png and
+pdf plots based on the summary data.
diff --git a/vignettes/genecovr.Rmd b/vignettes/genecovr.Rmd
index b60a1b7..ab8d348 100644
--- a/vignettes/genecovr.Rmd
+++ b/vignettes/genecovr.Rmd
@@ -6,7 +6,7 @@ output: rmarkdown::html_vignette
 vignette: >
   %\VignetteIndexEntry{Gene body coverage analysis in R}
   %\VignetteEngine{knitr::rmarkdown}
-  \usepackage[utf8]{inputenc}
+  %\VignetteEncoding{UTF-8}
 biblio-style: plain
 bibliography: bibliography.bib
 ---
@@ -71,15 +71,17 @@ are `GenomicRanges::GRanges` objects or objects derived from the
 # Analysing gene body coverage
 
 In this section we analyse the mapping of a transcriptome to a
-non-polished and polished assembly. The mapping results consist of two
-gmap files in psl format, `transcripts2nonpolished.psl` and
-`transcripts2polished.psl`. In addition there are fasta index files
-for both assemblies (`nonpolished.fai` and `polished.fai`) and for the
-transcriptome (`transcripts.fai`). The fasta indices are used to
-generate `GenomeInfoDb::Seqinfo` objects that can be used to set
-sequence information on the parsed output. We load the fasta indices
-and parse the psl files with `genecovr::readPsl`, storing the results
-in an `genecovr::AlignmentPairsList` for convenience.
+non-polished and polished assembly, using example data. The entire
+analysis takes less than 5 minutes to execute using these datasets.
+The mapping results consist of two gmap files in psl format,
+`transcripts2nonpolished.psl` and `transcripts2polished.psl`. In
+addition there are fasta index files for both assemblies
+(`nonpolished.fai` and `polished.fai`) and for the transcriptome
+(`transcripts.fai`). The fasta indices are used to generate
+`GenomeInfoDb::Seqinfo` objects that can be used to set sequence
+information on the parsed output. We load the fasta indices and parse
+the psl files with `genecovr::readPsl`, storing the results in an
+`genecovr::AlignmentPairsList` for convenience.
 
 ``` {r gbc-load-data}
 assembly_fai_fn <- list(