Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make AbstractVCFCodec thread-safe, allowing for multi-threaded VariantContext genotype decoding #1636

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

rcoleb
Copy link

@rcoleb rcoleb commented Nov 16, 2022

Description

Problem: Processing multiple VariantContext objects concurrently results in shared data due to the stateful nature of AbstractVCFCodec. Internal data structures (parts, genotypeParts, alleleMap, lineNo) are re-used to parse the LazyGenotypesContext contained within VariantContext objects; this re-use makes AbstractVCFCodec not thread-safe - calling LazyGenotypesContext.decode() in concurrent threads results in the genotypes of one VariantContext leaking over to other VariantContext objects and overwriting the genotypes there.

Solution: I have wrapped the internal state objects of AbstractVCFCodec in ThreadLocal objects. This will give each parsing thread its own copies of the internal state, removing cross-thread information leakage. This solution should not appreciably affect existing workflows, as single threaded applications will still maintain the resource re-use optimization. This solution also removes the need for expensive and rate-limiting synchronization.

I believe this fixes #1026, as well as the issue reported here.

This also resolves the // todo: make this thread safe? comment on line 69 of AbstractVCFCodec.

Things to think about before submitting:

  • Make sure your changes compile and new tests pass locally.
    • Some tests did not run / pass due to working in Windows / due to path issues. However, on reviewing these tests they do not appear related to this functionality (as best I can tell).
  • Add new tests or update existing ones: [the test suite is explicitly not multi-threaded, and the comments in the grade build file suggest that making it multi-threaded would break existing tests and make them nondeterministic. However I have confirmed locally that this PR solves the problem reported above.]
    • A bug fix should include a test that previously would have failed and passes now.
    • New features should come with new tests that exercise and validate the new functionality.
  • Extended the README / documentation, if necessary
  • Check your code style.
  • Write a clear commit title and message
    • The commit message should describe what changed and is targeted at htsjdk developers
    • Breaking changes should be mentioned in the commit message.

@rcoleb
Copy link
Author

rcoleb commented Nov 16, 2022

I have pushed a commit that fixes the issue that caused the automatic tests to fail. No error was shown in the IDE due to String.format accepting Object arguments rather than more specifically-typed values. I have checked for other usages of the variable in question and this appears to be the only remaining usage.

@lbergelson
Copy link
Member

@rcoleb Thank you for doing this. This seems like a good idea, it's very natural to want to decode these things in parallel and confusing to users that it doesn't work right.

I'll have to take a close look to make sure there aren't any wierd knock on effects, it's vaguely possible there are wierd expectations about String identity equality or something along those lines that might be disrupted by using multiple caches.

Do you know if there is any noticeable performance impact of using threadlocal access vs direct access? I assume it's non 0 but likely minimal compared to the extremely costly and wasteful vcf parsing.

We have a two very large outstanding pr's (#1581 and #1596) which are likely to conflict with this. I will probably hold off on trying to merge this until we get in #1581 (and possibly 1596) in order to reduce any rebase pain. I'm currently in the process of reviewing the first one so that should hopefully move along soon.

@rcoleb
Copy link
Author

rcoleb commented Nov 18, 2022

There will certainly be some performance cost, but it is likely minimal by my estimation: usage of ThreadLocals will inject a couple of additional method calls into the variable usage but because internally they're implemented as a map, not much more than that. And if the genotypes are even being decoded, it seems likely that the addition of method calls is minimal compared to whatever operations will be performed on the genotypes themselves.

As an aside, there is a possible solution for adding a "threadsafe" flag to VCFFileReader constructors, which could direct the AbstractVCFCodec to use thread-safe variables if requested, and direct-access otherwise, but that might be more pollution of the constructor that you'd prefer. Overall, I believe (and our internal testing supports) the performance impacts from ThreadLocal usage are much less than the performance gains from concurrent processing.

Also worth noting, I also don't know how HTSJDK performs with files that don't use AbstractVCFCodec (e.g. BCF files) and whether those would also benefit from thread-safe operations. We don't use those much, so I don't have the internal pressure to optimize a workflow around them.

Sounds good! Please of course take this in when it is appropriate (if at all) - I am just happy for the potential opportunity to contribute to a library that we use heavily in our lab. In the meantime we are synchronizing the genotype decoding, as even the performance impact of synchronization doesn't compare to the gains we are getting from parallel processing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants