Working on @Haotianz94's transcript aligner. Empirically, when column reads are multithreaded within a file, we get nondeterministic failures where the wrong bytes are read: sometimes the returned buffer is too short, and sometimes the bytes are corrupted (pickle raises an error when deserializing).
Been debugging this for a few hours. My experiments suggest a race condition in the S3 API: when multiple metadata files are read from S3 in parallel from separate threads/clients, a read of one file will nondeterministically return the bytes of a different file being read at the same time. Literally, executing read(LENGTH); seek(0); read(LENGTH) returns different bytes (also nondeterministically).
I'm at a complete loss as to how this is possible. It only seems to happen with the metadata files (small, 48 bytes), never with data files, so my guess is the bug is related to reading sufficiently small files. We've never seen it before because we've never run jobs with as few outputs per file as Haotian's (I/O packet size = 4).
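For reference, a minimal sketch of the kind of repro I've been running. The actual reads go through our own storage layer; this version uses s3fs purely for illustration, and the bucket/key paths are placeholders:

```python
# Hypothetical repro, not our actual reader: N threads each open their own
# small S3 object and read it twice. s3fs and the paths below are stand-ins.
from concurrent.futures import ThreadPoolExecutor
import s3fs

PATHS = [f"my-bucket/table/metadata_{i}" for i in range(16)]  # ~48-byte objects
LENGTH = 48

def check(path):
    fs = s3fs.S3FileSystem()  # fresh client per thread; the bug appeared even so
    with fs.open(path, "rb") as f:
        first = f.read(LENGTH)
        f.seek(0)
        second = f.read(LENGTH)
    # When the race fires, the two reads disagree, or a read returns another
    # file's bytes / a short buffer.
    assert first == second, f"nondeterministic read on {path}"
    return path

with ThreadPoolExecutor(max_workers=len(PATHS)) as pool:
    list(pool.map(check, PATHS))
```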
Just pushed a workaround that uses multiprocessing instead of multithreading (c275d03). Still need to figure out what the core issue is.
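The shape of the workaround, sketched with stdlib multiprocessing (read_metadata and the paths are placeholders, not our actual API): each read runs in its own process with its own client, so no connection state is shared between concurrent reads.

```python
# Sketch of the workaround's shape, not the actual commit: fan the per-file
# reads out to worker processes instead of threads. Names are placeholders.
from multiprocessing import Pool
import s3fs

LENGTH = 48

def read_metadata(path):
    # Each worker process constructs its own S3 client, so nothing about the
    # connection is shared with other concurrent reads.
    fs = s3fs.S3FileSystem()
    with fs.open(path, "rb") as f:
        return f.read(LENGTH)

if __name__ == "__main__":
    paths = [f"my-bucket/table/metadata_{i}" for i in range(16)]  # placeholders
    with Pool(processes=8) as pool:
        blobs = pool.map(read_metadata, paths)
```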