int overflow for big variant lines in vcf #1539

arnaldurgylfason · 2022-12-16T18:40:22Z

We have a variant with more than 50 alleles for almost 500000 samples. The line in question is over 3 gb in size.
bcftools and our own program that uses htslib can not read it.
After debugging I noticed, at least for .vcf instead of .vcf.gz, that hts_getline seemed to read it but it returned the length read which was over 3gb and since the returned value was captured in an int (4 bytes), (function _reader_fill_buffer in synced_bcf_reader.c)
it resulted in an overflow giving a negative number which was interpreted as failure.
Bigger datasets becoming more common in the future this problem will occur more often.
I´m sure this overflow problem can be found at more places.
I would consider it important to solve this and allow for bigger lines to handle the big data that we have today and even bigger in the near future. Changing to 8 byte integers would solve this. Note, size_t is used at many places, which handles this.

jmarshall · 2022-12-16T23:23:39Z

See also samtools/bcftools#1614 and #1448.

arnaldurgylfason · 2022-12-19T09:48:09Z

Thanks. We will start using FORMAT/LPL and FORMAT/LAA for variants with many alleles.
I still think htslib should be able to handle bigger lines though

daviesrob · 2023-01-03T10:30:34Z

It does look like bgzf_getline() has the same overflow problem. It should be easy to fix in the same way as #1448 did for uncompressed reads.

Note that this may not buy you that much due to other limitations in the library, for example bcf_fmt_t::p_len and bcf_fmt_t::p_off. As these are part of the public interface we can't easily change them. I'm also not entirely sure we'd want to as by the time individual records start getting to multi-gigabyte sizes you have to consider if you've reached the limit of what can sensibly be done with flat files, and consider solutions that are more database oriented (for example, hail).

pd3 · 2023-01-03T14:40:59Z

It would be nice to have htslib print a warning and skip such entries. We cannot support such large records fully as that breaks too many things and is not worth it. The feasible solution was proposed above, generate LPL and LAA instead.

daviesrob self-assigned this Jan 3, 2023

daviesrob mentioned this issue Jan 4, 2023

Improve support for very long input lines (> 2Gbyte) #1542

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

int overflow for big variant lines in vcf #1539

int overflow for big variant lines in vcf #1539

arnaldurgylfason commented Dec 16, 2022

jmarshall commented Dec 16, 2022

arnaldurgylfason commented Dec 19, 2022

daviesrob commented Jan 3, 2023

pd3 commented Jan 3, 2023

int overflow for big variant lines in vcf #1539

int overflow for big variant lines in vcf #1539

Comments

arnaldurgylfason commented Dec 16, 2022

jmarshall commented Dec 16, 2022

arnaldurgylfason commented Dec 19, 2022

daviesrob commented Jan 3, 2023

pd3 commented Jan 3, 2023