int overflow for big variant lines in vcf #1539
See also samtools/bcftools#1614 and #1448.
Thanks. We will start using FORMAT/LPL and FORMAT/LAA for variants with many alleles.
It does look like Note that this may not buy you that much due to other limitations in the library, for example
It would be nice to have htslib print a warning and skip such entries. We cannot support such large records fully as that breaks too many things and is not worth it. The feasible solution was proposed above, generate LPL and LAA instead.
We have a variant with more than 50 alleles for almost 500000 samples. The line in question is over 3 GB in size.
bcftools and our own program, which uses htslib, cannot read it.
After debugging I noticed, at least for .vcf rather than .vcf.gz, that hts_getline appeared to read the line and returned the length read, which was over 3 GB. Because the returned value was stored in an int (4 bytes) in the function _reader_fill_buffer in synced_bcf_reader.c, it overflowed to a negative number, which was interpreted as a failure.
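To illustrate the failure mode described above, here is a minimal sketch, not htslib code. All function names are hypothetical, and the length of 3200000000 bytes is just an example value above the 4-byte int range. Converting an out-of-range value to int is implementation-defined in C; on common compilers such as GCC and Clang it wraps modulo 2^32, which is what makes a successful read look like an error.

```c
#include <stdint.h>

/* Hypothetical stand-in for a reader that reports how many bytes it read.
 * 3200000000 is an example length above INT_MAX (2147483647). */
int64_t example_line_length(void) {
    return INT64_C(3200000000);
}

/* Mimics the pattern from the report: a wide length stored in a 4-byte int.
 * The conversion is implementation-defined; GCC and Clang wrap modulo 2^32,
 * so a value between 2^31 and 2^32 becomes negative. */
int narrowed_length(int64_t len) {
    return (int)len;
}

/* The caller's usual convention: a negative return value means failure. */
int looks_like_failure(int ret) {
    return ret < 0;
}
```

With these definitions, `looks_like_failure(narrowed_length(example_line_length()))` is true on such compilers even though the read itself succeeded, while a short line of, say, 1000 bytes passes the check.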
As bigger datasets become more common, this problem will occur more often.
I'm sure this overflow problem can be found in more places.
I consider it important to solve this and allow longer lines, to handle the big data we have today and the even bigger data of the near future. Changing to 8-byte integers would solve this. Note that size_t is already used in many places, which handles this.