It seems like reading sharded SAM/CRAM files correctly depends on which file system you're using. In particular, the order in which the shards are read differs depending on the underlying file system. I suspect that some call obtains an iterator over the part files directly from the file system and then never sorts it.
BAMInputFormat seems to work because it overrides getSplits and then sorts the splits by their path. Extending AnySAMInputFormat to also override getSplits seems to fix the problem, but it's a nasty hack that relies on casting things in various ways. We should fix it at the source instead.
where it says "Results are sorted by their names". Have you got some more details about the input file path you used, the files actually in the directory, and the order that the splits were read in?
The Hadoop-BAM fix would be to change AnySAMInputFormat to do the sort.
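To make the proposed sort concrete, here is a minimal sketch of ordering splits by path so the read order no longer depends on the file system. It uses plain Java rather than the Hadoop API: `ShardSplit` and `SplitSorter` are hypothetical stand-ins, not Hadoop-BAM classes, and the tiebreak on start offset is an assumption rather than something taken from BAMInputFormat.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a file split: only the fields needed for ordering.
final class ShardSplit {
    final String path;  // path of the part file this split belongs to
    final long start;   // byte offset of the split within that file

    ShardSplit(String path, long start) {
        this.path = path;
        this.start = start;
    }
}

final class SplitSorter {
    // Returns a new list ordered by path, then by start offset (assumed tiebreak),
    // leaving the input list untouched.
    static List<ShardSplit> sortByPath(List<ShardSplit> splits) {
        List<ShardSplit> sorted = new ArrayList<>(splits);
        sorted.sort(Comparator.comparing((ShardSplit s) -> s.path)
                .thenComparingLong(s -> s.start));
        return sorted;
    }
}
```

In an actual input format the same comparator idea would presumably be applied to the list returned by getSplits before handing it back to the framework.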
lbergelson added a commit to broadinstitute/gatk that referenced this issue Jun 11, 2018
* adding --sort-order option to SortSamSpark
adding a --sort-order option to SortSamSpark to let users specify what order to sort in
enabling disabled tests
fixing the tests which weren't actually asserting anything
* closes #1260
* adding hack to get around HadoopGenomics/Hadoop-BAM#199
created SplitSortingSamInputFormat which empirically fixes the issue although we don't necessarily completely understand the problem
We noticed this as part of adding tests on sharded files in https://github.com/broadinstitute/gatk/pull/4545/files. The tests passed on OSX but failed with out-of-order files on Travis (running Ubuntu).
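A platform-dependent result like this is what you would expect if part files are taken in raw directory-listing order, since Java's directory listing makes no ordering guarantee. The sketch below shows the defensive pattern with `java.nio.file` on a local directory. `PartFileLister` and the `part-` prefix filter are illustrative assumptions, and this is not the Hadoop FileSystem code path itself.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class PartFileLister {
    // Lists the part files in a directory in a deterministic, name-sorted order.
    // Files.list gives no ordering guarantee, so the explicit sort is what makes
    // the result identical across operating systems and file systems.
    static List<String> sortedPartNames(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .map(p -> p.getFileName().toString())
                    .filter(name -> name.startsWith("part-")) // assumed shard naming
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Plain lexicographic sorting matches shard order only when the numeric suffix is zero-padded to a fixed width, as in `part-r-00000`.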