Definition of identity
Badread defines identity the same way BLAST does: the number of matching bases divided by the length of the alignment. Take this example of a 24 bp read which originated from a 24 bp fragment of DNA. The read has 3 errors: one deletion, one substitution and one insertion. The alignment therefore spans 25 positions, 22 of which match, so this read's identity is 22 / 25 = 88%. Note that the denominator is not the length of the read but rather the length of the alignment.
Read:              ACGAC-CAGCAGTCGCGACTAGCTT
                   ||||| |||||| || |||||||||
Original sequence: ACGACTCAGCAGACG-GACTAGCTT
You can read more in Heng Li's excellent blog post: On the definition of sequence identity.
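The calculation above can be sketched in a few lines of Python. This is only an illustration of the definition, not code from Badread: the function name `blast_identity` and its `gap` parameter are made up for this example, and it assumes the two sequences are already aligned and padded with `-` to the same length.

```python
def blast_identity(aligned_read: str, aligned_ref: str, gap: str = "-") -> float:
    """Return matches / alignment length for two gapped, equal-length strings.

    Hypothetical helper for illustration; not part of Badread.
    """
    if len(aligned_read) != len(aligned_ref):
        raise ValueError("aligned strings must have the same length")
    if not aligned_read:
        raise ValueError("alignment is empty")
    # A column counts as a match only if both bases are equal and not a gap.
    matches = sum(
        1 for r, o in zip(aligned_read, aligned_ref)
        if r == o and r != gap
    )
    # The denominator is the number of alignment columns, not the read length.
    return matches / len(aligned_read)


read     = "ACGAC-CAGCAGTCGCGACTAGCTT"
original = "ACGACTCAGCAGACG-GACTAGCTT"
print(f"{blast_identity(read, original):.0%}")  # 88%
```

Dividing the same 22 matches by the read length (24) instead would give about 91.7%, which is why the choice of denominator matters.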
Since DNA has only a 4 letter alphabet, two completely random sequences can typically align with >50% identity. As an example, here are two random sequences aligned to each other which match in 32 places over 59 alignment positions, giving an identity of 54%:
AAT-CGGCGCGTCCCGCGTTTCGGAAATTGA-C-ACTCTGACG-GTT---AGCACAG--
| | ||| | | |  || ||  ||   | || | | | ||| | |||   || | ||
ATTACGG-GAG-C--GC-TTA-GGC--T-GAACTATTATGATGCGTTGCGAGAAAAGGA
This means that any read with less than about 60% identity is difficult to distinguish from random sequence.
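To try this yourself, here is a rough sketch that globally aligns two random sequences and reports their identity. It is not how Badread or BLAST aligns sequences: the function names are hypothetical, and the scoring scheme (match +1, mismatch -1, gap -1) is an assumption chosen for simplicity. The identity you get depends on the scoring scheme, the sequence length and the random seed, so treat the result as indicative only.

```python
import random


def align_identity(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> float:
    """Globally align a and b (Needleman-Wunsch) and return matches / alignment length.

    Hypothetical helper; the scoring values are assumptions, not Badread's.
    """
    n, m = len(a), len(b)
    if n == 0 and m == 0:
        raise ValueError("both sequences are empty")

    def sub(i: int, j: int) -> int:
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # match or mismatch column
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back along one optimal path, counting matches and columns.
    i, j = n, m
    matches = columns = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            matches += a[i - 1] == b[j - 1]
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        columns += 1
    return matches / columns


def random_dna(length: int, rng: random.Random) -> str:
    """Return a uniformly random DNA string of the given length."""
    return "".join(rng.choices("ACGT", k=length))


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same sequences
    seq_a, seq_b = random_dna(50, rng), random_dna(50, rng)
    print(f"identity of two random sequences: {align_identity(seq_a, seq_b):.1%}")
```

Running it with several seeds gives a feel for how much identity unrelated sequences can reach by chance under this scoring scheme.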