This repository has been archived by the owner on Jun 17, 2024. It is now read-only.
ctseto changed the title from "Blast effects on" to "Non-unique header names in BLAST and effects on PctSIM" on Sep 29, 2016
Summary: Non-unique sequence names in the FASTA headers of a BLAST database may cause the reported percent identity (PctSim) to disagree with the e-value and bitscore. This is more a blastn output issue than an MMinte issue.
Background:
While building my own BLAST databases, I ran into problems with match assignments.
For example, with the current 16Sdb a given OTU matches with 100 percent identity, 0 mismatches, and an e-value of 2e-120. When the database is augmented with new sequences, the best match switches to a genomeID with 27 percent identity and 170 mismatches. Re-testing with a database built from the new sequences alone reproduces the effect, so the 27% identity anomaly is clearly associated with the new sequences.
Below: BLAST results before pre-processing names to make them unique:
denovo830 59620 27.35 234 170 0 1 234 523 756 5e-124 433
denovo830 59620 27.35 234 170 0 1 234 523 756 5e-124 433
denovo830 59620 27.35 234 170 0 1 234 523 756 5e-124 433
denovo830 59620 27.35 234 170 0 1 234 523 756 5e-124 433
denovo830 59620 21.84 206 147 12 1 199 540 738 3e-46 174
(Inspecting the alignment in the BLAST output shows a very poor alignment, which fits the 27% identity and 170 mismatches far better than the 5e-124 e-value and 433 bitscore.)
Analysis of the sequence headers in the new set suggests that 38 of them share a taxa_id and therefore share a header, which appears to interfere with parsing:
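For anyone wanting to check their own database for the same problem, here is a minimal sketch of how shared headers could be detected. It assumes the sequence ID is the first whitespace-delimited token after `>`; the function name `duplicate_fasta_ids` and the example records are made up for illustration and are not part of MMinte or BLAST.

```python
from collections import Counter


def duplicate_fasta_ids(lines):
    """Return {sequence_id: count} for IDs that appear in more than one header."""
    ids = Counter(
        line[1:].split()[0]  # first whitespace-delimited token after '>'
        for line in lines
        if line.startswith(">") and line[1:].strip()  # skip empty headers
    )
    return {seq_id: n for seq_id, n in ids.items() if n > 1}


# Hypothetical records: two sequences share the header ">59620".
example = [">59620", "ACGT", ">59620", "ACGA", ">1000", "TTGA"]
print(duplicate_fasta_ids(example))
```

To run it on a real file, pass an open file handle instead of the list, for example `duplicate_fasta_ids(open("new_sequences.fasta"))`, where the file name is a placeholder.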
After making the names unique:
denovo830 59620.37 100.00 234 0 0 1 234 523 756 5e-124 433
denovo830 59620.14 100.00 234 0 0 1 234 523 756 5e-124 433
denovo830 59620.8 100.00 234 0 0 1 234 523 756 5e-124 433
denovo830 59620.7 100.00 234 0 0 1 234 523 756 5e-124 433
denovo830 59620.23 83.01 206 21 12 1 199 540 738 3e-46 174
At this point, the >59620 sequence that was reported at a mere 27% identity no longer ranks highly.
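The renaming step can be sketched roughly as follows. This is only an illustration under assumptions: it appends a running `.N` suffix to every ID that occurs more than once, which resembles IDs such as `59620.37` above but is not necessarily the exact numbering I used. The helper name `make_headers_unique` is hypothetical.

```python
from collections import Counter


def make_headers_unique(lines):
    """Yield FASTA lines, appending '.N' to every ID that occurs more than once.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    The sketch does not check whether a generated ID such as '59620.1'
    already exists elsewhere in the file.
    """
    lines = list(lines)
    totals = Counter(
        l[1:].split()[0] for l in lines if l.startswith(">") and l[1:].strip()
    )
    seen = Counter()
    for line in lines:
        if line.startswith(">") and line[1:].strip():
            fields = line[1:].split(None, 1)
            seq_id = fields[0]
            desc = " " + fields[1].rstrip() if len(fields) > 1 else ""
            if totals[seq_id] > 1:
                seen[seq_id] += 1
                seq_id = f"{seq_id}.{seen[seq_id]}"
            yield f">{seq_id}{desc}"
        else:
            yield line.rstrip("\n")


# Hypothetical records: the shared ID gets numbered, the unique one is untouched.
records = [">59620", "ACGT", ">59620", "ACGA", ">1000", "TTGA"]
renamed = list(make_headers_unique(records))
print(renamed)
```

The BLAST database has to be rebuilt from the renamed FASTA file before re-running the search.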
If this particular issue is driven by how e-value, bitscore, and the other fields are assigned when headers are duplicated, then when a given OTU representative has a strong match to one member of a set of identically named sequences, the reported values might not all describe that best-matching sequence. In the above case, 59620 does have a match with a highly significant e-value and a high bitscore, but it was reported with 27 percent identity, which is passed into MMinte as PctSim. This may affect the components of MMinte that rely on the PctSim value.
Edit: This would probably have the most effect in cases where a given taxon has multiple associated genomeIDs, and those genomeIDs have 16S sequences sufficiently distinct that a query sequence would produce a different percent identity for each of the similar sequences.
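As a rough sanity check before passing results to MMinte, rows where percent identity and bitscore disagree could be flagged. The sketch below assumes the 12-column tabular layout seen in the results above (query, subject, percent identity, alignment length, ..., e-value, bitscore). The two thresholds are illustrative guesses based only on the rows in this issue (433 bits over 234 columns is about 1.85 bits per column); they are not derived from BLAST's scoring model, and the function name `suspicious_hits` is hypothetical.

```python
def suspicious_hits(rows, max_pident=50.0, min_bits_per_column=1.5):
    """Flag hits with low percent identity but a high bitscore per aligned column.

    Both thresholds are assumptions chosen for illustration only.
    """
    flagged = []
    for row in rows:
        f = row.split()
        if len(f) < 12:
            continue  # not a 12-column tabular BLAST row
        pident, length, bitscore = float(f[2]), int(f[3]), float(f[11])
        if length > 0 and pident <= max_pident and bitscore / length >= min_bits_per_column:
            flagged.append((f[0], f[1], pident, bitscore))
    return flagged


# Three rows copied from the results in this issue.
rows = [
    "denovo830 59620 27.35 234 170 0 1 234 523 756 5e-124 433",
    "denovo830 59620 21.84 206 147 12 1 199 540 738 3e-46 174",
    "denovo830 59620.37 100.00 234 0 0 1 234 523 756 5e-124 433",
]
flagged = suspicious_hits(rows)
print(flagged)
```

Only the first row is flagged: it pairs 27.35% identity with a bitscore typical of a near-perfect alignment. The second row has low identity but a correspondingly lower bitscore per column, so it is left alone.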