This repository has been archived by the owner on Jun 17, 2024. It is now read-only.

Non-unique header names in BLAST and effects on PctSIM #4

Open
ctseto opened this issue Sep 28, 2016 · 0 comments


ctseto commented Sep 28, 2016

Summary: Non-unique sequence names in the headers may cause mismatches between PctSim and the e-value/bitscore. This looks like a blastn output issue more than an MMinte issue.

Background:
While building my own BLAST databases, I am encountering issues with match assignments.

For example, with the current 16Sdb a given OTU matches with 100 percent identity, 0 mismatches, and an e-value of 2e-120. When the database is augmented with new sequences, the best match switches to a genomeid with 27 percent identity and 170 mismatches. The same effect appears when the database is built from the new sequences alone, so the 27% identity anomaly is definitely associated with the new sequences.

Below: BLAST results before pre-processing names to make them unique:
denovo830 59620 27.35 234 170 0 1 234 523 756 5e-124 433
denovo830 59620 27.35 234 170 0 1 234 523 756 5e-124 433
denovo830 59620 27.35 234 170 0 1 234 523 756 5e-124 433
denovo830 59620 27.35 234 170 0 1 234 523 756 5e-124 433
denovo830 59620 21.84 206 147 12 1 199 540 738 3e-46 174

(Inspecting the alignment in the BLAST output shows a very poor alignment, consistent with the 27% identity and 170 mismatches rather than with the 5e-124 e-value and 433 bitscore.)

Analysis of the sequence headers in the new set suggests that 38 of them share a taxa_id and therefore share a header, which appears to interfere with parsing:
(

>59620
...
>59620
...
)
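A quick way to confirm this kind of collision before building a database is to count the IDs in the FASTA headers. The following is a minimal sketch, not part of MMinte; it assumes the ID is the first whitespace-delimited token after `>`, and the function name `duplicate_headers` is hypothetical.

```python
from collections import Counter

def duplicate_headers(fasta_lines):
    """Return {sequence_id: count} for IDs that appear in more than one header."""
    ids = Counter(
        line[1:].split()[0]              # first token after '>' is taken as the ID
        for line in fasta_lines
        if line.startswith(">") and len(line.strip()) > 1  # skip empty headers
    )
    return {seq_id: n for seq_id, n in ids.items() if n > 1}

# Toy input; real use would pass the lines of the FASTA file instead.
example = [">59620", "ACGT", ">59620", "ACGA", ">12345", "TTGA"]
print(duplicate_headers(example))
```

An empty result means every header ID is already unique.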

After making the names unique:

denovo830 59620.37 100.00 234 0 0 1 234 523 756 5e-124 433
denovo830 59620.14 100.00 234 0 0 1 234 523 756 5e-124 433
denovo830 59620.8 100.00 234 0 0 1 234 523 756 5e-124 433
denovo830 59620.7 100.00 234 0 0 1 234 523 756 5e-124 433
denovo830 59620.23 83.01 206 21 12 1 199 540 738 3e-46 174
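The renaming step could be sketched as below. This is an assumption about one workable approach, not the exact script used for the results above: it appends `.1`, `.2`, ... to every ID that occurs more than once, in file order, and leaves unique IDs alone. It assumes the lines have no trailing newlines, that every header has an ID, and that no existing ID already looks like `59620.1`. The name `make_headers_unique` is hypothetical.

```python
from collections import Counter

def make_headers_unique(fasta_lines):
    """Append .1, .2, ... to each FASTA ID that occurs more than once."""
    # First pass: count how often each ID occurs across all headers.
    totals = Counter(
        line[1:].split()[0] for line in fasta_lines if line.startswith(">")
    )
    seen = Counter()
    renamed = []
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split(maxsplit=1)
            seq_id = parts[0]
            if totals[seq_id] > 1:
                seen[seq_id] += 1
                # Keep any description text after the ID.
                desc = " " + parts[1] if len(parts) > 1 else ""
                line = f">{seq_id}.{seen[seq_id]}{desc}"
        renamed.append(line)
    return renamed

fasta = [">59620", "ACGT", ">59620 second copy", "ACGA", ">7", "TT"]
for out_line in make_headers_unique(fasta):
    print(out_line)
```

The database would then need to be rebuilt from the renamed FASTA so that each subject ID in the BLAST output maps to exactly one sequence.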

At this point, the >59620 sequence that was a mere 27% identity is no longer highly ranked.

If this particular issue is driven by how e-value, bitscore, and the other fields are assigned when headers are duplicated, then when a given OTU representative has a strong match to one member of a set of identically named sequences, the reported results might not always correspond to that match. In the above case, 59620 does have a match with a very low e-value and a high bitscore, but it was reported with 27 percent identity, which is passed into MMinte as PctSim. This may affect components of MMinte that rely on the PctSim value.
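One way to catch such rows before they reach MMinte is a simple consistency check on the tabular output. The sketch below is a heuristic, not a feature of BLAST or MMinte. It assumes the rows use the standard 12-column tabular layout (percent identity in column 3, e-value in column 11), which matches the field count of the rows shown above, and the thresholds `min_pident` and `max_evalue` are arbitrary illustrative choices.

```python
def suspicious_hits(blast_lines, min_pident=50.0, max_evalue=1e-10):
    """Flag hits whose e-value is highly significant but whose identity is low."""
    flagged = []
    for line in blast_lines:
        fields = line.split()
        if line.startswith("#") or len(fields) != 12:
            continue  # skip comment lines and rows that are not 12-column tabular
        pident = float(fields[2])    # percent identity
        evalue = float(fields[10])   # e-value
        if pident < min_pident and evalue < max_evalue:
            flagged.append((fields[0], fields[1], pident, evalue))
    return flagged

rows = [
    "denovo830 59620 27.35 234 170 0 1 234 523 756 5e-124 433",
    "denovo830 59620 21.84 206 147 12 1 199 540 738 3e-46 174",
    "denovo830 59620.37 100.00 234 0 0 1 234 523 756 5e-124 433",
    "denovo830 59620.23 83.01 206 21 12 1 199 540 738 3e-46 174",
]
for hit in suspicious_hits(rows):
    print(hit)
```

With these thresholds, the two rows from the database with duplicated headers are flagged and the two rows from the renamed database are not.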

Edit: This would probably have the most effect where a given taxon has multiple associated genomeIDs whose 16S sequences are sufficiently distinct that a query sequence would produce a different percent identity for each of them.

@ctseto changed the title from "Blast effects on" to "Non-unique header names in BLAST and effects on PctSIM" on Sep 29, 2016