Reading sharded SAM / CRAM fails on some filesystems #199

lbergelson · 2018-06-06T19:44:23Z

Whether sharded SAM / CRAM files are read correctly seems to depend on the underlying filesystem: the order in which the shards are read differs from one filesystem to another. I suspect that some call gets an iterator over the part files directly from the filesystem and never sorts it.
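
To illustrate the suspected failure mode, here is a minimal sketch that uses only java.nio rather than Hadoop's FileSystem API (an assumption made so the example is self-contained). The class ShardListing and the method sortedShards are hypothetical names, not part of Hadoop-BAM. Files.list makes no promise about iteration order, so the sketch sorts by file name before returning.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not part of Hadoop-BAM.
class ShardListing {
    /** Lists the regular files directly under dir, sorted by file name. */
    static List<Path> sortedShards(Path dir) throws IOException {
        // Files.list does not guarantee any particular order,
        // so the result is sorted explicitly before it is returned.
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .sorted(Comparator.comparing((Path p) -> p.getFileName().toString()))
                    .collect(Collectors.toList());
        }
    }
}
```

Without the explicit sort, the order of the returned paths would be whatever the filesystem happens to produce, which is consistent with the tests passing on one platform and failing on another.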

BAMInputFormat seems to work because it overrides getSplits and then sorts the splits by path. Extending AnySAMInputFormat to also override getSplits seems to fix the problem, but it's a nasty hack that relies on casting things in various ways. We should fix it at the source instead.

    public static class SplitSortingSamInputFormat extends AnySAMInputFormat {
        @SuppressWarnings("unchecked")
        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            final List<InputSplit> splits = super.getSplits(job);

            // Sort only when every split is a type that is known to expose a path.
            if (splits.stream().allMatch(split -> split instanceof FileVirtualSplit || split instanceof FileSplit)) {
                // InputSplit itself has no getPath(), hence the casts.
                splits.sort(Comparator.comparing(split -> {
                    if (split instanceof FileVirtualSplit) {
                        return ((FileVirtualSplit) split).getPath();
                    } else {
                        return ((FileSplit) split).getPath();
                    }
                }));
            }

            return splits;
        }
    }

We noticed this as part of adding tests on sharded files in https://github.com/broadinstitute/gatk/pull/4545/files. The tests passed on OSX but failed with out-of-order files on Travis (running Ubuntu).


lbergelson · 2018-06-06T19:44:45Z

@tomwhite Could you give us your thoughts on this when you get a chance?

tomwhite · 2018-06-07T11:05:18Z

I thought that Hadoop's FileInputFormat returned splits in lexicographic order, see

https://hadoop.apache.org/docs/r2.8.2/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)

where it says "Results are sorted by their names". Have you got some more details about the input file path you used, the files actually in the directory, and the order that the splits were read in?
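
As a small illustration of what sorting by name means for shard files, the sketch below sorts file names with Java's natural String ordering, which is lexicographic. The class ShardNameOrder and the file names are hypothetical; the issue does not say what the shards were actually called.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hadoop or Hadoop-BAM.
class ShardNameOrder {
    /** Returns a copy of names in lexicographic (String.compareTo) order. */
    static List<String> lexicographic(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        Collections.sort(sorted); // natural ordering of String is lexicographic
        return sorted;
    }
}
```

With zero-padded names such as part-r-00001, part-r-00002 and part-r-00010, lexicographic order matches shard order. With unpadded names such as part-1, part-2 and part-10 it does not, because part-10 sorts before part-2. That is one reason the exact file names matter when diagnosing an ordering problem.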

The Hadoop-BAM fix would be to change AnySAMInputFormat to do the sort.
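
A rough sketch of what sorting at the source could look like, written against plain Java collections rather than Hadoop's InputSplit types so that it stands alone. SplitSorter, sortedByPath and the pathOf function are hypothetical names, and this is not the actual change made in Hadoop-BAM. The point is only that once each split's path can be obtained through a single function, the sort needs no casts.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch; SplitSorter is not a Hadoop-BAM class.
class SplitSorter {
    /**
     * Returns a new list holding the given splits ordered by the path that
     * pathOf reports for each one. The input list is left unmodified.
     */
    static <S> List<S> sortedByPath(List<S> splits, Function<S, String> pathOf) {
        List<S> sorted = new ArrayList<>(splits);
        // List.sort is stable, so splits that share a path keep their relative order.
        sorted.sort(Comparator.comparing(pathOf));
        return sorted;
    }
}
```

Because the sort is stable, several splits that come from the same file stay in the order in which they were generated; only the order of the files changes.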

* adding --sort-order option to SortSamSpark
  adding a --sort-order option to SortSamSpark to let users specify what order to sort in
  enabling disabled tests
  fixing the tests which weren't actually asserting anything
* closes #1260
* adding hack to get around HadoopGenomics/Hadoop-BAM#199
  created SplitSortingSamInputFormat, which empirically fixes the issue although we don't necessarily completely understand the problem

tomwhite mentioned this issue Jun 14, 2018

Ensure splits are sorted by path name #200

Open
