Finding BAM split boundaries is currently slow for cloud stores like S3 and GCS. The goal of this issue is to characterize the problem and implement fixes (e.g., finding splits in parallel on the client).
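One possible shape for the "parallel on the client" idea: resolve each candidate split offset to a record boundary concurrently, so the high-latency cloud-store probes overlap instead of running one after another. This is only a sketch; `guessRecordStart` here is a hypothetical stand-in for hadoop-bam's boundary-guessing logic (it just rounds up to a fixed stride for illustration), and the class and method names are mine, not the library's.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSplits {
    /** Hypothetical stand-in for the real boundary guesser: given a
     *  candidate offset, return the first record start at or after it.
     *  Here we simply round up to a fixed record stride. */
    static long guessRecordStart(long candidate, long recordStride) {
        return ((candidate + recordStride - 1) / recordStride) * recordStride;
    }

    /** Resolve all split boundaries concurrently, one task per candidate
     *  offset, so each (slow) remote probe is issued in parallel. */
    static List<Long> findSplits(long fileLength, int numSplits, long recordStride)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numSplits);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            long chunk = fileLength / numSplits;
            for (int i = 0; i < numSplits; i++) {
                final long candidate = i * chunk;
                futures.add(pool.submit(() -> guessRecordStart(candidate, recordStride)));
            }
            List<Long> boundaries = new ArrayList<>();
            for (Future<Long> f : futures) boundaries.add(f.get());
            return boundaries;
        } finally {
            pool.shutdown();
        }
    }
}
```

The thread count and chunking policy are placeholders; in practice each task would open its own seekable stream against S3/GCS and run the real guesser from its candidate offset.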
Another thing worth adding here: I had to guard against unreasonably large memory allocations in BAMRecordCodec. At non-record-start positions, the first 4 bytes of a candidate BAM record are arbitrary data, but they are still interpreted as a 4-byte int, and an array of that many bytes is allocated.
Without guarding against that, evaluating hadoop-bam's guessing logic at every position in a file often slowed to a crawl, apparently in regions where the 4-byte windows tended to decode to large integers; each checked virtual position then triggered a huge bogus-sized allocation, causing memory pressure and slowdowns.
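A minimal sketch of that guard, assuming the record's length is a 4-byte little-endian field read ahead of the record body: validate the decoded size against a plausibility cap (and against the bytes actually available) before allocating. The cap value and the class/method names here are my own assumptions, not hadoop-bam's code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class AllocationGuard {
    // Assumed cap on a plausible BAM record size; real records are rarely
    // more than a few MB. This bound is illustrative, not hadoop-bam's.
    static final int MAX_PLAUSIBLE_RECORD_BYTES = 8 * 1024 * 1024;

    /** Decode the candidate record's 4-byte little-endian length field and
     *  sanity-check it before allocating a buffer for the record body.
     *  Returns null (rather than allocating) when the size is implausible. */
    static byte[] readCandidateRecord(ByteBuffer buf) {
        if (buf.remaining() < 4) return null;
        int blockSize = buf.order(ByteOrder.LITTLE_ENDIAN).getInt();
        // At a non-record-start position these 4 bytes are arbitrary data,
        // so reject implausible sizes instead of allocating blindly.
        if (blockSize < 0 || blockSize > MAX_PLAUSIBLE_RECORD_BYTES
                || blockSize > buf.remaining()) {
            return null;
        }
        byte[] record = new byte[blockSize];
        buf.get(record);
        return record;
    }
}
```

With a check like this, a bogus window that happens to decode to, say, `Integer.MAX_VALUE` is rejected immediately instead of triggering a 2 GB allocation attempt at every probed position.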