Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to get the most random randstrobes in compiled languages? #8

Open
ksahlin opened this issue Dec 1, 2021 · 7 comments
Open

How to get the most random randstrobes in compiled languages? #8

ksahlin opened this issue Dec 1, 2021 · 7 comments

Comments

@ksahlin
Copy link
Owner

ksahlin commented Dec 1, 2021

Background

Randstrobes is a method for linking two or more k-mers together into a seed that can be used for sequence mapping. I intend the below post to be a self-contained rough description of randstrobes (but just in case, more formal details on strobemers are found here).

Creating a randstrobe

We obtain a randstrobe r consisting of two linked k-mers k_1 and k_2 by fixing the position of k_1 and selecting k_2 in a window [w_1, w_2] bases downstream from k_1 through some function. The aim was to select k_2 in a way that would yield a random but deterministic sampling of k_2 in the window. Strobemers was introduced as a concept, i.e., I didn't care to study the optimal method for selecting k_2. Originally, selecting k_2 was described as

k_2 = argmin_{k' \in W} ( h(k_1) + h(k') ) % p

where W denotes the window to pick from. I will denote this as method 1. It was realized by @shenwei356 in #1 (this post specifically) that MOD operation is relatively expensive, and that we could instead use bit arithmetic to do

k_2 = argmin_{k' \in W} ( h(k_1) + h(k') ) & q

where q is a bitmask and & is the bitwise AND operator. I will denote this as method 2

Problem

While we cannot achieve full randomness (nothing really is), it seems to be room for significant improvement on the randomness of sampling k_2 while keeping speed. Below I pasted the example output from method 2 where we are picking k_1 and sampling k_2 over a thinned space of syncmers as described in this preprint.

In the below output p1 and p2 denotes the positions of k_1 and k_2 on a metagenomic contig sequence, respectively. Repetitions occurring two times are marked with a * and repetitions occurring more than two times with **. Also, this is not a cherry-picked example, it actually seems to get worse later on in the sequence.

p1 p2 
-----
0 78
4 60
8 37
13 92
17 85 *
24 85 *
28 98
32 113
37 92
41 88
48 95 *
60 95 *
68 133
74 106 *
78 106 *
85 143
88 121 *
92 121 *
95 149 **
98 149 **
106 149 **
113 149 **

We can alleviate the repetitions by instead using a method (method3) I just came up with, which is:

k_2 = argmin_{k' \in W} bitcount( h(k_1) ^ h(k') )

which picks the k' with the smallest number of bits set after an XOR operation. See below the reduced repetitiveness and lower correlation. For further motivation, method3 increases the accuracy of strobealign by about 0.1% (preliminary small experiments), which is a big deal as strobealign was previously about 0.1-0.2% less accurate than BWA and minimap2. So this matters for read mapping!

p1 p2 
0 28
4 41
8 85 *
13 48 *
17 48 *
24 92
28 88 *
32 85 *
37 78
41 121
48 88 *
60 95
68 126 *
74 113
78 116
85 126 *
88 121
92 158
95 130
98 133
106 149 *
113 149 *

While some repetitions are expected even under randomness, it is good to reduce degenerate cases such as the ** case seen above.

Open questions

  1. How to measure "randomness" in picking k_2?
  2. Which operation gives the best randomness? (is there a better one than method3?)
  3. Which operation is the fastest?

The three methods are all implemented here in a strobealign commit

@ksahlin ksahlin changed the title How to get the most random randstrobes fast in compiled languages? How to get the most random randstrobes in compiled languages? Dec 1, 2021
@marcelm
Copy link

marcelm commented Dec 1, 2021

with the smallest number of bits set after a Bitwise OR operation

That should be XOR, right?

If you use method3, note that there’s a POPCNT instruction that recent CPUs support. It’s quite possible that bitset.count already uses it, but you should make sure. If your CPU has support and you use GCC, you can enable POPCNT with -mpopcnt. If bitset.count doesn’t use POPCNT, you can also use the GCC builtin __builtin_popcount (with -mpopcnt, this will compile to the POPCNT instruction. Without it, it uses a function.)

@ksahlin
Copy link
Owner Author

ksahlin commented Dec 1, 2021

Yes, I meant XOR, will fix it!

Thanks for this info on C++. Still so much to learn..

@ksahlin
Copy link
Owner Author

ksahlin commented Dec 1, 2021

As pointed out by @ekimb, the hash function matters. In my above experiments, I used simple bit encoding A=00, C=01, G=10, T=11 without any jumbling of the bits (through a hash function). Below I show the "randomness" of

  1. method2 together with applying robin_hash C++ hash function to the k-mers.
  2. method2 together with applying hash64 hash function from minimap2 code
  3. method3 together with applying hash64 hash function from minimap2 code

The randomness looks better under Heng Li's hash64 function than under robin_hash

4 36
10 60 **
16 82 *
19 60 **
23 60 **
27 60 **
32 82 *
36 102
41 96 **
46 86
50 82 *
53 124 *
60 96 **
66 96 **
72 141 **
78 141 **
82 124 *
86 141 **
93 167
96 141 **
2 74 *
14 63 *
22 63 *
26 84 **
29 74 *
34 84 **
37 113
46 117
54 84 **
60 105
63 127 *
69 121 **
74 109
80 121 **
84 121 **
92 165 *
97 127 *
105 156
109 142
113 148
117 165 *
121 174
127 192 *
131 186
139 200
142 192 *
2 63 *
14 46
22 84 *
26 63 *
29 97
34 80
37 113 *
46 84 *
54 127 *
60 131 **
63 113 *
69 131 **
74 131 **
80 131 **
84 156 *
92 121
97 127 *
105 156 *
109 152
113 142
117 174 *
121 186
127 174 *
131 192
139 177 *
142 177 *

@lh3
Copy link

lh3 commented Dec 2, 2021

Method 1 and 2 would not work well. Method 3 is better. I would do

k_2 = argmin_{k' \in W} h(k_1∘k')

k_1∘k_2 denotes the concatenation of k-mer k_1 and k_2 and h() can be minimap2's invertible hash function if the total k-mer length is below 32.

PS: just noticed that Daniel Liu suggested the nearly same thing on twitter except that he added a mask at the end. If you use an invertible hash function, you don't need to apply a mask again.

Also on notation, { k' | (h(k_1) + h(k')) % p } is a set of k-mers. Applying argmin to this set doesn't have a clear meaning. Daniel's and my notation should be the correct one.

@ksahlin
Copy link
Owner Author

ksahlin commented Dec 2, 2021

Thanks for those points @lh3.

For documentation, I will implement a small bake-off as soon as I have time, which will include the following

Hashing

  1. No hash - simple 2bit encoding of nucleotides
  2. Thomas Wang hash
  3. xxhash3

Let the hash function be denoted by h.

Linking

  1. Sahlin ( h(k_1) + h(k') ) % p (method1 above)
  2. Shen ( (h(k_1) + h(k') ) & p (method2 above)
  3. Sahlin bitcount( h(k_1) ^ h(k') ) (method3 above)
  4. Pibri h(k_1) ^ h(k') (global XOR), (also a variant: h(k_1 ^ k'))
  5. Liu-Patro-Li h( k_1 || k' ) (concatenation)

Viable combinations of (hashing, linking) seem to be (1,1)-(1,4), (2,1)-(2,4), (3,1)-(3,5) as the total length can be larger than 32.

@ekimb's idea is potentially wild - it must be investigated - but it is conceptually different and will not be considered in this bake-off.

Metrics

  • Construction time
  • Uniqueness:
    • Distribution of the number of times the position p2 is selected for k_2 (repetition distribution)
    • Fraction of unique positions p2.
  • Downstream metrics such as how it affects strobealign (accuracy, final runtime, mapping sites tried, calls to ksw2)

@jermp
Copy link

jermp commented Dec 2, 2021

A little update on what I proposed:
it is h(k_1) ^ h(k') , i.e., the XOR between the two hash codes, not just the hash of the XOR between the 2-bit encoding of the kmers (which is h(k_1 ^ k')) The reason is that the XOR of two hash codes is another random quantity, the XOR between two bit-packed kmers will not be that random even if k' changes.
h should be a hash function (like the ones you mentioned or murmur2).

@ksahlin
Copy link
Owner Author

ksahlin commented Dec 2, 2021

Ah, I see, sorry. Will try both of them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants