Run with third-party taxonomy db #18

Open
alexaibio opened this issue Nov 16, 2020 · 6 comments

@alexaibio

Hi Joao,

Is it possible to run MAPseq with the Greengenes, RDP, or SILVA taxonomy databases?
They seem to use a different format.

If so, could you please update the readme file as well?

Best
Alex

@colinbrislawn

I'm interested in running MAPseq with SILVA 138 pre-clustered at 99% identity (SILVA_138_SSURef_NR99_tax_silva.fasta.gz).

Here's what the SILVA files look like:

gzip -dc SILVA_138_SSURef_NR99_tax_silva.fasta.gz | head -n 2
>AY846380.1.2583 Eukaryota;Archaeplastida;Chloroplastida;Chlorophyta;Chlorophyceae;Monoraphidium minutum
AACCUG...

gzip -dc tax_ncbi_ssu_ref_nr99_138.txt.gz | head -n 4
root;   1       no rank
root;Viruses;   10239   superkingdom
root;Viruses;Caudovirales;      28883   order
root;Viruses;Caudovirales;Ackermannviridae;     2169529 family

gzip -dc taxmap_ncbi_ssu_ref_nr99_138.txt.gz | head -n 3
primaryAccession        start   stop    Unclassified;   submitted_name
BD359736        3       2150    root;cellular organisms;Eukaryota;Alveolata;Apicomplexa;Aconoidasida;Haemosporida;Plasmodiidae;Plasmodium <genus>;Plasmodium (Plasmodium);Plasmodium malariae;                                                                                                                                        Plasmodium malariae
AB000278        1       1410    root;cellular organisms;Bacteria <prokaryotes>;Proteobacteria;Gammaproteobacteria;Vibrionales;Vibrionaceae;Photobacterium;Photobacterium iliopiscarium;                                                                                                                                               Photobacterium iliopiscarium

There's a QIIME 2 plugin that normalizes taxonomy levels, which could be helpful here.

@evilvenom

@colinbrislawn @alexaibio
Hello. Did either of you figure out how to use custom databases? If so, it would be really helpful to hear how. Thanks in advance.

@colinbrislawn

I have not figured out how to use custom databases, but I also have not worked on this since posting. I would be interested in updates, though.

@jfmrod
Owner

jfmrod commented Jan 11, 2022

Hi! To use a custom database, you need two files: a FASTA file with the sequences (SILVA already provides this) and a taxonomy file with two tab-separated columns, the first holding the IDs of the FASTA sequences and the second the taxonomic label for each sequence. The taxonomic annotations should be normalized so that every sequence has the same number of ranks.

That will get you a result. The problem is that SILVA sequences still contain many misannotations that will throw off mapseq, so for optimal results you would need to clean up the SILVA sequences and annotations a bit.

Some collaborators recently made such a set for SILVA, which we were planning to include in the next release. If you are interested, I can ask them for the dataset and try to push it out faster.
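
The two-column taxonomy file described above could be derived from the SILVA FASTA headers roughly as follows. This is only a minimal sketch, not the maintainers' method: the function names, the choice of seven ranks, the `Unclassified` filler, and the semicolon separator inside the taxonomy column are all assumptions, so compare the result against the example taxonomy files shipped with mapseq. The padding is also naive, since SILVA lineages (especially eukaryotic ones) vary in depth and simple truncation or padding does not align ranks properly.

```python
import io

N_RANKS = 7  # assumption: seven taxonomic levels


def silva_header_to_row(header, n_ranks=N_RANKS, filler="Unclassified"):
    """Split a SILVA FASTA header into (sequence_id, normalized_lineage).

    The ID is taken to be the first whitespace-delimited token of the
    header (an assumption about how IDs are matched to the FASTA file).
    """
    seq_id, _, lineage = header.lstrip(">").strip().partition(" ")
    ranks = [r.strip() for r in lineage.split(";") if r.strip()]
    # Naive normalization: truncate deep lineages, pad shallow ones.
    ranks = ranks[:n_ranks] + [filler] * (n_ranks - len(ranks))
    return seq_id, ";".join(ranks)


def write_taxonomy(fasta_lines, out):
    """Write one 'ID<TAB>lineage' row per FASTA header line."""
    for line in fasta_lines:
        if line.startswith(">"):
            seq_id, lineage = silva_header_to_row(line)
            out.write(f"{seq_id}\t{lineage}\n")


if __name__ == "__main__":
    example = [
        ">AY846380.1.2583 Eukaryota;Archaeplastida;Chloroplastida;"
        "Chlorophyta;Chlorophyceae;Monoraphidium minutum",
        "AACCUG",
    ]
    buf = io.StringIO()
    write_taxonomy(example, buf)
    print(buf.getvalue(), end="")
```

For a real run, `fasta_lines` could be a file handle opened with `gzip.open(path, "rt")` on the SILVA FASTA file, and `out` a regular text file.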

@jfmrod
Owner

jfmrod commented Jan 11, 2022

Examples of the taxonomy files (NCBI and our OTUs) are included with mapseq: the NCBI taxonomy is mapref-2.2b.fna.ncbitax and the OTU "taxonomy" is mapref-2.2b.fna.otutax. You will want to copy the parameters from this line of the NCBI taxonomy file:
#cutoff: 0.00:0.08 0.70:0.35 0.70:0.35 0.70:0.35 0.80:0.25 0.92:0.08 0.95:0.05

These are needed to exclude hits based on identity cutoffs, and should also work for the SILVA set if you use 7 taxonomic levels.
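
As a quick sanity check before running with a custom taxonomy, one could confirm that the number of cutoff pairs matches the number of ranks in every taxonomy row. The sketch below is hypothetical helper code, not part of mapseq: the function names are made up, the semicolon separator in the taxonomy column is an assumption, and placing the `#cutoff:` line at the top of the custom file is also an assumption (check where it sits in mapref-2.2b.fna.ncbitax). The two numbers in each pair are parsed but not interpreted.

```python
CUTOFF_LINE = (
    "#cutoff: 0.00:0.08 0.70:0.35 0.70:0.35 0.70:0.35 "
    "0.80:0.25 0.92:0.08 0.95:0.05"
)


def parse_cutoff_line(line):
    """Parse a '#cutoff:' line into a list of (float, float) pairs."""
    prefix = "#cutoff:"
    if not line.startswith(prefix):
        raise ValueError("not a #cutoff line")
    pairs = []
    for token in line[len(prefix):].split():
        first, second = token.split(":")
        pairs.append((float(first), float(second)))
    return pairs


def rows_with_wrong_depth(cutoff_line, rows):
    """Return IDs whose lineage depth differs from the number of cutoff pairs.

    `rows` is an iterable of (sequence_id, lineage) tuples, with ranks
    separated by ';' (an assumption about the taxonomy column format).
    """
    expected = len(parse_cutoff_line(cutoff_line))
    return [seq_id for seq_id, lineage in rows
            if len(lineage.split(";")) != expected]


def with_cutoff_header(cutoff_line, rows):
    """Render the taxonomy text with the cutoff line first (assumed placement)."""
    body = "".join(f"{seq_id}\t{lineage}\n" for seq_id, lineage in rows)
    return cutoff_line + "\n" + body


if __name__ == "__main__":
    rows = [("seq1", "a;b;c;d;e;f;g"), ("seq2", "a;b;c")]
    print(len(parse_cutoff_line(CUTOFF_LINE)))        # number of cutoff pairs
    print(rows_with_wrong_depth(CUTOFF_LINE, rows))   # IDs needing normalization
```

Any ID reported by `rows_with_wrong_depth` would need its lineage normalized before the cutoff line copied from the NCBI file can be expected to apply.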

@evilvenom

evilvenom commented Jan 13, 2022

@jfmrod Thanks a lot for your response.
In that case I should also be able to use the Greengenes database, right? It also comes as a FASTA file with the taxonomy defined in a separate file.

I also have a question about the output. I saw Krona mentioned in some previous issue threads: how do I use the output with Krona, and was the Krona output flag ever added? I don't see it in the help message. There are the -otucounts and -otutables options, but when I import the generated -otutable into Krona, it says "|Unclassified| has no OTU code".

I would be really grateful for help with this. The question is whether the problem comes from MAPseq or from Krona.

Thanks again!
PS: MapSeq version: v2.0.1alpha
