Reading sharded SAM / CRAM fails on some filesystems #199

Open
lbergelson opened this issue Jun 6, 2018 · 2 comments

Comments

@lbergelson
Contributor

Reading sharded SAM / CRAM files correctly seems to depend on which filesystem you're using. In particular, the order in which the shards are read differs between underlying filesystems. I suspect that somewhere an iterator over the part files is obtained directly from the filesystem and never sorted.

BAMInputFormat seems to work because it overrides getSplits and then sorts the splits by their path. Extending AnySAMInputFormat to also override getSplits seems to fix the problem, but it's a nasty hack that relies on casting things in various ways. We should fix it at the source instead.

    public static class SplitSortingSamInputFormat extends AnySAMInputFormat {
        @SuppressWarnings("unchecked")
        @Override
        public List<InputSplit> getSplits(JobContext job) throws IOException {
            final List<InputSplit> splits = super.getSplits(job);

            // Only sort when every split is a type we know how to get a path from.
            if (splits.stream().allMatch(split -> split instanceof FileVirtualSplit || split instanceof FileSplit)) {
                // Order the splits by the path of the file they belong to.
                splits.sort(Comparator.comparing(split -> {
                    if (split instanceof FileVirtualSplit) {
                        return ((FileVirtualSplit) split).getPath();
                    } else {
                        return ((FileSplit) split).getPath();
                    }
                }));
            }

            return splits;
        }
    }

We noticed this while adding tests on sharded files in https://github.com/broadinstitute/gatk/pull/4545/files. The tests passed on OSX but failed with out-of-order files on Travis (running Ubuntu).
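
The suspected failure mode can be illustrated without Hadoop. The JDK's directory-listing APIs make no promise about the order of the entries they return, so any code that consumes a raw listing inherits whatever order the filesystem happens to produce. The following is only a minimal sketch of the "sort the listing yourself" idea, not the actual Hadoop-BAM code path; the class name `ShardListing`, its methods, and the `part-r-0000N` file names are made up for the illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop-BAM or GATK.
public class ShardListing {

    /** Lists the file names in dir and sorts them, so the result does not depend on the filesystem. */
    public static List<String> sortedShardNames(Path dir) {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .map(p -> p.getFileName().toString())
                    .sorted() // explicit lexicographic sort; the raw listing order is unspecified
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates three made-up shard files out of order and returns their sorted names. */
    public static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("shards");
            for (String name : new String[] {"part-r-00002", "part-r-00000", "part-r-00001"}) {
                Files.createFile(dir.resolve(name));
            }
            return sortedShardNames(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [part-r-00000, part-r-00001, part-r-00002]
    }
}
```

Without the `sorted()` call the same code could return the names in a different order on different machines, which would match the OSX versus Travis difference described above.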

@lbergelson
Contributor Author

@tomwhite Could you give us your thoughts on this when you get a chance?

@tomwhite
Member

tomwhite commented Jun 7, 2018

I thought that Hadoop's FileInputFormat returned splits in lexicographic order, see

https://hadoop.apache.org/docs/r2.8.2/api/org/apache/hadoop/fs/FileSystem.html#globStatus(org.apache.hadoop.fs.Path)

where it says "Results are sorted by their names". Have you got some more details about the input file path you used, the files actually in the directory, and the order that the splits were read in?

The Hadoop-BAM fix would be to change AnySAMInputFormat to do the sort.
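
As a rough sketch of what "do the sort" could look like, the snippet below orders a list of splits by path using only the standard library. It is not the real AnySAMInputFormat code: `Split` is a hypothetical stand-in for FileSplit / FileVirtualSplit (a path plus a start offset), and the class and method names are invented for the example. One detail worth noting is that `List.sort` on an `ArrayList` is stable, so splits belonging to the same file keep their original relative order when only the path is compared.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; none of these names exist in Hadoop-BAM.
public class SortSplitsSketch {

    /** Stand-in for a Hadoop split: the shard it belongs to and where it starts. */
    public static final class Split {
        final String path;
        final long start;

        Split(String path, long start) {
            this.path = path;
            this.start = start;
        }

        @Override
        public String toString() {
            return path + "@" + start;
        }
    }

    /** Returns a copy of splits ordered by path; splits with equal paths keep their relative order. */
    public static List<Split> sortedByPath(List<Split> splits) {
        List<Split> sorted = new ArrayList<>(splits);
        sorted.sort(Comparator.comparing((Split s) -> s.path));
        return sorted;
    }

    /** Sorts a small out-of-order list of made-up splits and returns them as strings. */
    public static List<String> demo() {
        List<Split> unordered = Arrays.asList(
                new Split("part-r-00001", 0),
                new Split("part-r-00000", 0),
                new Split("part-r-00000", 64),
                new Split("part-r-00001", 64));
        List<String> out = new ArrayList<>();
        for (Split s : sortedByPath(unordered)) {
            out.add(s.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [part-r-00000@0, part-r-00000@64, part-r-00001@0, part-r-00001@64]
        System.out.println(demo());
    }
}
```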

lbergelson added a commit to broadinstitute/gatk that referenced this issue Jun 11, 2018
* adding --sort-order option to SortSamSpark

adding a --sort-order option to SortSamSpark to let users specify what order to sort in
enabling disabled tests
fixing the tests which weren't actually asserting anything

* closes #1260

* adding hack to get around HadoopGenomics/Hadoop-BAM#199
  created SplitSortingSamInputFormat which empirically fixes the issue although we don't necessarily completely understand the problem