Species-level 16S kraken2 database? #862

Open
DntBScrdDv opened this issue Jul 31, 2024 · 5 comments

Comments

@DntBScrdDv

DntBScrdDv commented Jul 31, 2024

Hi all,

I'm in need of a species-level 16S database. I had relied on the RDP database on other analysis platforms (e.g. FROGS), but the pre-built Kraken2 RDP database only goes to genus level.

I built a database from RefSeq, but it is missing many key taxa (e.g. Candidatus Omnitrophus).

Does anyone know of a species-level 16S database for Kraken2 that is broader than RefSeq, e.g. a species-level Silva or RDP?

Many thanks

@ChillarAnand

Kraken 2 / Bracken 16S rRNA indexes are available for Greengenes, RDP, and Silva.

https://benlangmead.github.io/aws-indexes/k2

Does this help?

@DntBScrdDv
Author

Hi @ChillarAnand ,

Many thanks for your reply. Unfortunately, no, this doesn't help. The Kraken2 RDP and Silva databases are limited to genus level, while Greengenes has not been updated since (I think) 2016.

@Username-felix-is-not-available

Hi @DntBScrdDv ,
I have built an unofficial strain-level version of the RDP database for my research: https://www.bioinformatics.uni-muenster.de/tools/metag/download Maybe you can get it to work with Kraken2, but I guess it will require quite some tinkering. If you use the database in your research, I would appreciate it if you cited my preprint, which is linked on the download website.
Have a nice day,
Felix

@DntBScrdDv
Author

Many thanks for this @Username-felix-is-not-available ,

I'm sorry, but could you explain a little what all the different files are? Is the .fa the sequences? What's the giant .suf file?

Thanks!

@Username-felix-is-not-available

You are very welcome, @DntBScrdDv . For your purposes, you can ignore all files except the "RDP16s28s.fa" (sequences) and the "tax.RDP16s28s.txt" (taxonomy) files. The other files either provide metadata or are specific to the LAST alignment program which I used for my project.

I hope my message will not send you down the rabbit hole, because Kraken2 uses a vastly different approach to taxonomy files than I did. In my files, you can use the sequence ID in the FASTA file to find the matching taxonomic string in the taxonomy file. The string contains the full lineage. Kraken2 uses an approach based on taxonomy IDs and splits the lineage into single taxa (see the names.dmp and nodes.dmp files in a Kraken2 database). For the special databases, it is best to assume that the taxonomy IDs are not identical to the NCBI taxonomy IDs (i.e. they are artificial).

I think translating my files to Kraken2 format could be very difficult. It may be easier to take the logic in my script and add it to Kraken2's build_rdp_taxonomy.pl. The logic is described here (Supplementary Methods 4.2) in more general terms. Nevertheless, I don't know what downstream effects this would have.

My automated approach to fixing the taxonomy is also not foolproof, and I am not a taxonomist by training, so there will be some room for improvement. If you come up with a better approach, please let me know.

Best,
Felix
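
To make the difference between the two taxonomy styles concrete, here is a minimal Python sketch of the translation Felix describes: it splits full lineage strings into single taxa, assigns artificial (non-NCBI) taxonomy IDs, and formats lines in the style of nodes.dmp and names.dmp. Everything about the input is an assumption: the thread does not describe the layout of "tax.RDP16s28s.txt", so the sketch takes a plain dict of sequence ID to semicolon-separated lineage, and it guesses ranks purely from depth in the lineage. The function names are hypothetical and this is not the logic of Felix's script or of build_rdp_taxonomy.pl.

```python
from typing import Dict, List, Tuple

# Assumed rank order, indexed by depth in the lineage (an illustration only;
# real lineages may skip or add levels, which this does not handle).
RANKS = ["domain", "phylum", "class", "order", "family", "genus", "species"]


def build_taxonomy(
    lineages: Dict[str, str],
) -> Tuple[List[Tuple[int, int, str]], List[Tuple[int, str]], Dict[str, int]]:
    """Turn {seq_id: 'A;B;C'} into node rows, name rows and a seq_id -> taxid map."""
    taxid_of = {(): 1}  # lineage path (tuple of taxa) -> artificial taxid; root is 1
    nodes = [(1, 1, "no rank")]  # (taxid, parent taxid, rank)
    names = [(1, "root")]  # (taxid, scientific name)
    seq_to_taxid = {}
    for seq_id, lineage in lineages.items():
        path: Tuple[str, ...] = ()
        parent = 1
        taxa = [t.strip() for t in lineage.split(";") if t.strip()]
        for depth, taxon in enumerate(taxa):
            # Key on the whole path so equal names under different parents stay distinct.
            path = path + (taxon,)
            if path not in taxid_of:
                taxid = len(taxid_of) + 1
                taxid_of[path] = taxid
                rank = RANKS[depth] if depth < len(RANKS) else "no rank"
                nodes.append((taxid, parent, rank))
                names.append((taxid, taxon))
            parent = taxid_of[path]
        seq_to_taxid[seq_id] = parent  # deepest taxon of the lineage
    return nodes, names, seq_to_taxid


def node_line(taxid: int, parent: int, rank: str) -> str:
    # NCBI-style dump row: fields separated by "\t|\t", row ends with "\t|".
    return f"{taxid}\t|\t{parent}\t|\t{rank}\t|\t-\t|\n"


def name_line(taxid: int, name: str) -> str:
    return f"{taxid}\t|\t{name}\t|\t-\t|\tscientific name\t|\n"


def fasta_header(seq_id: str, taxid: int) -> str:
    # Kraken2 can read the taxid from a "kraken:taxid|N" tag in the sequence ID.
    return f">{seq_id}|kraken:taxid|{taxid}"


if __name__ == "__main__":
    demo = {  # made-up example lineages, not taken from the RDP files
        "seq1": "Bacteria;Firmicutes;Bacilli",
        "seq2": "Bacteria;Firmicutes;Clostridia",
    }
    nodes, names, seq_to_taxid = build_taxonomy(demo)
    for row in nodes:
        print(node_line(*row), end="")
    for seq_id, taxid in seq_to_taxid.items():
        print(fasta_header(seq_id, taxid))
```

If the resulting names.dmp and nodes.dmp are placed in the taxonomy/ directory of a new database and the FASTA headers are rewritten with the kraken:taxid tag, the sequences could in principle be added with kraken2-build --add-to-library and the database built with kraken2-build --build. Whether the classifications that come out are sensible depends on how clean the lineages are, which is exactly the caveat raised above.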
