Working on @Haotianz94's transcript aligner. Empirically, when column reads are multithreaded within a file, we get nondeterministic failures where the wrong bytes are read: sometimes the returned buffer is too short, and sometimes the bytes are corrupted (pickle raises an error when deserializing).
Been debugging this for a few hours. My experiments suggest a race condition in the S3 API: when multiple metadata files are read from S3 in parallel from separate threads/clients, a read of one file will nondeterministically return the bytes of a different file being read at the same time. Literally, executing read(LENGTH); seek(0); read(LENGTH) returns different bytes (also nondeterministically).
I'm at a complete loss as to how this is possible. It only seems to happen with the metadata files (small, 48 bytes), never with data files, so my guess is the bug is related to reading sufficiently small files. We've never seen it before because we've never run jobs with as few outputs per file as Haotian's (I/O packet size = 4).
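For reference, a minimal sketch of the kind of repro I've been running. The actual reads go through our own storage layer; this version uses s3fs purely for illustration, and the bucket/key paths are placeholders:

```python
# Hypothetical repro, not our actual reader: N threads each open their own
# small S3 object and read it twice. s3fs and the paths below are stand-ins.
from concurrent.futures import ThreadPoolExecutor
import s3fs

PATHS = [f"my-bucket/table/metadata_{i}" for i in range(16)]  # ~48-byte objects
LENGTH = 48

def check(path):
    fs = s3fs.S3FileSystem()  # fresh client per thread; the bug appeared even so
    with fs.open(path, "rb") as f:
        first = f.read(LENGTH)
        f.seek(0)
        second = f.read(LENGTH)
    # When the race fires, the two reads disagree, or a read returns another
    # file's bytes / a short buffer.
    assert first == second, f"nondeterministic read on {path}"
    return path

with ThreadPoolExecutor(max_workers=len(PATHS)) as pool:
    list(pool.map(check, PATHS))
```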
Just pushed a workaround that uses multiprocessing instead of multithreading (c275d03). Still need to figure out what the core issue is.
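The shape of the workaround, sketched with stdlib multiprocessing (read_metadata and the paths are placeholders, not our actual API): each read runs in its own process with its own client, so no connection state is shared between concurrent reads.

```python
# Sketch of the workaround's shape, not the actual commit: fan the per-file
# reads out to worker processes instead of threads. Names are placeholders.
from multiprocessing import Pool
import s3fs

LENGTH = 48

def read_metadata(path):
    # Each worker process constructs its own S3 client, so nothing about the
    # connection is shared with other concurrent reads.
    fs = s3fs.S3FileSystem()
    with fs.open(path, "rb") as f:
        return f.read(LENGTH)

if __name__ == "__main__":
    paths = [f"my-bucket/table/metadata_{i}" for i in range(16)]  # placeholders
    with Pool(processes=8) as pool:
        blobs = pool.map(read_metadata, paths)
```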