Finding BAM split boundaries is currently slow for cloud stores like S3 and GCS. The goal of this issue is to characterize the problem and implement fixes (e.g., finding splits in parallel on the client).
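One possible shape for the "parallel on the client" idea: resolve each candidate split offset to a record boundary concurrently, so the high-latency cloud-store probes overlap instead of running one after another. This is only a sketch; `guessRecordStart` here is a hypothetical stand-in for hadoop-bam's boundary-guessing logic (it just rounds up to a fixed stride for illustration), and the class and method names are mine, not the library's.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSplits {
    /** Hypothetical stand-in for the real boundary guesser: given a
     *  candidate offset, return the first record start at or after it.
     *  Here we simply round up to a fixed record stride. */
    static long guessRecordStart(long candidate, long recordStride) {
        return ((candidate + recordStride - 1) / recordStride) * recordStride;
    }

    /** Resolve all split boundaries concurrently, one task per candidate
     *  offset, so each (slow) remote probe is issued in parallel. */
    static List<Long> findSplits(long fileLength, int numSplits, long recordStride)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numSplits);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            long chunk = fileLength / numSplits;
            for (int i = 0; i < numSplits; i++) {
                final long candidate = i * chunk;
                futures.add(pool.submit(() -> guessRecordStart(candidate, recordStride)));
            }
            List<Long> boundaries = new ArrayList<>();
            for (Future<Long> f : futures) boundaries.add(f.get());
            return boundaries;
        } finally {
            pool.shutdown();
        }
    }
}
```

The thread count and chunking policy are placeholders; in practice each task would open its own seekable stream against S3/GCS and run the real guesser from its candidate offset.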
Another thing worth adding here: I had to guard against unreasonably large memory allocations in BAMRecordCodec. At non-record-start positions, the first 4 bytes of a candidate BAM record are arbitrary data, but they are still interpreted as a 4-byte int, and an array of that many bytes is allocated.
Without guarding against that, evaluating hadoop-bam's guessing logic at every position in a file often slowed to a crawl, apparently in regions where the 4-byte windows tended to decode to large integers; each checked virtual position then triggered a huge bogus-sized allocation, causing memory pressure and slowdowns.
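A minimal sketch of that guard, assuming the record's length is a 4-byte little-endian field read ahead of the record body: validate the decoded size against a plausibility cap (and against the bytes actually available) before allocating. The cap value and the class/method names here are my own assumptions, not hadoop-bam's code.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class AllocationGuard {
    // Assumed cap on a plausible BAM record size; real records are rarely
    // more than a few MB. This bound is illustrative, not hadoop-bam's.
    static final int MAX_PLAUSIBLE_RECORD_BYTES = 8 * 1024 * 1024;

    /** Decode the candidate record's 4-byte little-endian length field and
     *  sanity-check it before allocating a buffer for the record body.
     *  Returns null (rather than allocating) when the size is implausible. */
    static byte[] readCandidateRecord(ByteBuffer buf) {
        if (buf.remaining() < 4) return null;
        int blockSize = buf.order(ByteOrder.LITTLE_ENDIAN).getInt();
        // At a non-record-start position these 4 bytes are arbitrary data,
        // so reject implausible sizes instead of allocating blindly.
        if (blockSize < 0 || blockSize > MAX_PLAUSIBLE_RECORD_BYTES
                || blockSize > buf.remaining()) {
            return null;
        }
        byte[] record = new byte[blockSize];
        buf.get(record);
        return record;
    }
}
```

With a check like this, a bogus window that happens to decode to, say, `Integer.MAX_VALUE` is rejected immediately instead of triggering a 2 GB allocation attempt at every probed position.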