Support faster GBWT #670

Open
glennhickey opened this issue Dec 17, 2018 · 3 comments

Comments

@glennhickey
Collaborator

Make sure toil-vg is using the latest and greatest logic from @jltsiren's wiki

@glennhickey
Collaborator Author

toil-vg is up to date (as far as I can see) with this wiki dating from July:

https://github.com/jltsiren/gbwt/wiki/Construction-Benchmarks

but I couldn't find anything more recent. @jltsiren, can you please point me to the newer one we should be using?

@jltsiren

The relevant wiki page is https://github.com/vgteam/vg/wiki/Indexing-Huge-Datasets .

There are four steps:

  1. Parse the VCF file with vg index -e.
  2. Divide the samples into a number of batches.
  3. Build GBWT for each batch separately using deps/gbwt/build_gbwt.
  4. Merge the GBWTs using deps/gbwt/merge_gbwt -p.

This makes building single-chromosome GBWT indexes several times faster than direct construction.
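Step 2 is the only step that needs no GBWT tooling, so it is easy to sketch. The following is a minimal Python illustration, assuming the sample names have already been read from the VCF header into a list. The function name `split_into_batches`, the sample names, and the batch count are hypothetical; none of them come from vg, toil-vg, or GBWT.

```python
from typing import List, Sequence


def split_into_batches(samples: Sequence[str], num_batches: int) -> List[List[str]]:
    """Split sample names into contiguous batches of near-equal size.

    Hypothetical helper for illustration only; it is not part of vg or GBWT.
    """
    if num_batches < 1:
        raise ValueError("num_batches must be at least 1")
    if not samples:
        return []
    # Never create more batches than there are samples.
    num_batches = min(num_batches, len(samples))
    base, extra = divmod(len(samples), num_batches)
    batches = []
    start = 0
    for i in range(num_batches):
        # The first `extra` batches take one additional sample each.
        size = base + (1 if i < extra else 0)
        batches.append(list(samples[start:start + size]))
        start += size
    return batches


# Made-up sample names, purely for demonstration.
example_samples = [f"sample{i}" for i in range(10)]
example_batches = split_into_batches(example_samples, 3)
print([len(batch) for batch in example_batches])  # [4, 3, 3]
```

Each resulting batch would then be the input to one separate GBWT build in step 3, and the per-batch indexes would be merged in step 4. Contiguous batches keep the sample order intact, which is a choice of this sketch rather than a stated requirement.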

Some issues remain:

  • When you build GBWT this way, the XG index will not contain thread names or haplotype count.
  • GBWT metadata will also be missing. You can write it manually using deps/gbwt/metadata, but I'm going to fix this soon.
  • The GBWT in the current VG master is a bit old, and merge_gbwt is significantly slower than the latest version.
  • The default parameters of merge_gbwt are appropriate for TOPMed, but the memory usage may be too high for smaller datasets like 1000GP. Some relevant parameters can't be changed yet with merge_gbwt options.

@glennhickey
Collaborator Author

OK, thanks! This should all be doable in toil-vg, though we'll have to put the deps/gbwt/ executables either into the vg Docker image or into vg gbwt.

I understand that we'll need this for TOPMed-sized VCFs, but I don't think it's relevant to @cmarkello's 9-days-to-index-the-1kg-graph issue that brought this discussion about. For that, @cmarkello, I think you'll have to post your command line and any changes to the config. I'd suspect a lack of parallelism due to something there rather than any problem with the existing GBWT code, which handles the 1kg graph just fine.
