🌲 Annotation of the plastid genome of white spruce (Picea glauca), genotype WS77111 https://www.ncbi.nlm.nih.gov/nuccore/MK174379
The white spruce WS77111 chloroplast assembly was annotated using GeSeq. The GenBank file generated by GeSeq was then converted into a General Feature Format (GFF) file using EMBOSS Seqret, after which duplicate annotations were removed and manual annotations were added. The reference chloroplast genomes used were interior spruce PG29 and Sitka spruce Q903, and occasionally Norway spruce. In addition to GeSeq, two third-party tRNA annotators were used: tRNAscan-SE v2.0 and ARAGORN v1.2.38. Although these third-party annotators detected some 'novel' tRNAs, those tRNAs were not found in all of the reference chloroplast genomes. They were analyzed further with RNAweasel and ARAGORN, which produced 2D structures and folding results. Because those results were inconclusive and the spruce chloroplast genome is known to be highly conserved, these tRNAs were excluded from the final annotation. Inverted repeats were also found but excluded from the final annotation.
The assembled FASTA file was used as input to GeSeq.
- see Documentation
- GeSeq Settings
- GeSeq generates an OGDRAW .png file and a GenBank .gb file.
The .gb file was converted to a .gff file using EMBOSS Seqret.
Duplicates were removed. Most conflicts came from the Picea morrisonicola and Picea asperata reference annotations: one names tRNAs with their anticodons and the other does not, so GeSeq treated the same tRNA as two different annotations and wrote it to the .gb file twice. The entries without anticodons were removed from the final file.
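The duplicate removal described above could be scripted roughly as follows. This is a minimal sketch, not the procedure actually used for this annotation: the function name `drop_anticodonless_duplicates`, the `trnX-NNN` naming pattern, and the rule of matching duplicates by sequence ID, start, end, and strand are all assumptions made for illustration.

```python
import re

# Assumption: tRNAs annotated with an anticodon are named like "trnS-GCU".
ANTICODON = re.compile(r"trn[A-Za-z]+-[ACGUT]{3}")


def drop_anticodonless_duplicates(gff_lines):
    """Keep one tRNA feature per locus, preferring the entry with an anticodon.

    Non-tRNA lines, comments, and malformed lines pass through unchanged.
    """
    kept = []   # output lines, in original order
    seen = {}   # (seqid, start, end, strand) -> index into `kept`
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "tRNA":
            kept.append(line)
            continue
        key = (cols[0], cols[3], cols[4], cols[6])
        if key not in seen:
            seen[key] = len(kept)
            kept.append(line)
            continue
        # Duplicate locus: swap in this line only if it adds an anticodon.
        previous_attrs = kept[seen[key]].rstrip("\n").split("\t")[8]
        if ANTICODON.search(cols[8]) and not ANTICODON.search(previous_attrs):
            kept[seen[key]] = line
    return kept
```

Reading the .gff file with `open(path).readlines()` and passing the result to the function would return the filtered lines, which should still be reviewed by hand before being written back.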
ARAGORN and tRNAscan detected some tRNAs that GeSeq did not. However, these were not detected with high confidence and are absent from the PG29, Q903, and Norway spruce annotations. Because most of these sequences are highly conserved, the extra tRNAs are unlikely to be genuine and were removed from the final annotation (see diagram).
- GeSeq files were regenerated with and without the third-party tRNA annotators to cross-reference which annotations were valid:
- with only ARAGORN, no tRNAscan
- with only tRNAscan, no ARAGORN
- with only GeSeq (without both ARAGORN and tRNAscan)
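The cross-referencing of these runs can be sketched as a set comparison. The snippet below is illustrative only: the function names, the run labels, and the choice to identify a tRNA by its start, end, and strand are assumptions, and the comparison in this project may have been done by inspection instead.

```python
def trna_loci(gff_text):
    """Return the set of (start, end, strand) tuples for tRNA features."""
    loci = set()
    for line in gff_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "tRNA":
            continue
        loci.add((int(cols[3]), int(cols[4]), cols[6]))
    return loci


def compare_runs(runs):
    """Compare tRNA calls across GeSeq runs.

    `runs` maps a run label (hypothetical, e.g. "geseq_only") to GFF text.
    Returns the loci shared by every run and, per run, the loci that are
    not shared by all runs.
    """
    if not runs:
        return set(), {}
    loci = {label: trna_loci(text) for label, text in runs.items()}
    shared = set.intersection(*loci.values())
    extras = {label: sorted(found - shared) for label, found in loci.items()}
    return shared, extras
```

A tRNA that appears only in the extras of the ARAGORN or tRNAscan run, and never in the GeSeq-only run, is a candidate for the kind of low-confidence call discussed above.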
Four genes needed to be annotated manually: rps12, petB, petD, and rpl16. rps12 is trans-spliced, while the other three genes have short first exons.
GeSeq also missed some mRNAs and exons, which were likewise annotated manually (see final annotation). In the final annotation, all 114 genes were conserved: 74 coding regions (CDS), 4 rRNAs, and 36 tRNAs. The annotation also contains 15 introns (9 of them in coding regions, 6 in tRNAs).
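The gene counts above (74 + 4 + 36 = 114) can be checked against a GFF file with a short tally. This is a sketch under stated assumptions: it assumes each feature carries a `Name=` attribute and that multi-exon genes repeat the same name on every row, so genes are counted by unique name per feature type. Genes that share a name at different loci would be collapsed, so the result is a sanity check rather than a definitive count. The function name and the `EXPECTED` table are introduced here for illustration.

```python
import re

# Counts stated in this README; the three classes sum to 114 genes.
EXPECTED = {"CDS": 74, "rRNA": 4, "tRNA": 36}


def count_genes_by_type(gff_lines, types=("CDS", "rRNA", "tRNA")):
    """Count unique Name= values for each feature type of interest."""
    names = {feature_type: set() for feature_type in types}
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] not in names:
            continue
        match = re.search(r"(?:^|;)Name=([^;]+)", cols[8])
        if match:
            names[cols[2]].add(match.group(1))
    return {feature_type: len(found) for feature_type, found in names.items()}
```

Comparing the returned dictionary with `EXPECTED` flags any class whose count drifts after manual edits.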
RNAweasel was used to confirm that the tRNAs were not valid, although tRNA-Ser warranted a closer look.
ARAGORN was run independently of GeSeq to generate the ARAGORN text report with 2D tRNA structures. The tRNA in question, tRNA-Ser, is tRNA #17.
tRNAscan, in conjunction with ARAGORN, was used to determine tRNA products.
The GFF annotation was validated using table2asn_GFF:
- Generated files: Sequin ASN.1 file, discrepancy report, error list, GenBank file
- See Makefile
The .gbf file generated by table2asn_GFF was fed through OGDRAW:
- OGDRAW settings
- OGDRAW files: see ogdraw
Lin D, Coombe L, Jackman SD, Gagalova KK, Warren RL, Hammond SA, Kirk H, Pandoh P, Zhao Y, Moore RA, Mungall AJ, Ritland C, Jaquish B, Isabel N, Bousquet J, Jones SJM, Bohlmann J, Birol I. 2019. Complete chloroplast genome sequence of a white spruce (Picea glauca, genotype WS77111) from eastern Canada. Microbiol Resour Announc 8:e00381-19. doi: 10.1128/MRA.00381-19.