`warcio.bufferedreaders.BufferedReader.readline` can get stuck in an infinite loop #122
Comments
I see an endless-loop possibility involving a truncated file. Are you reading the last record when the problem happens?
I don't know, actually. I will have to investigate.
Do you know if the WARC records are gzipped or not? If they are (correctly) gzipped per record, I think the content-length should matter less, as the reader would hit the gzip boundary and end. Can you run the warcs through […]?
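One rough way to answer the per-record question is to count gzip members at the head of the file: a whole-file gzip is a single member, while a per-record-compressed WARC starts with many small members. A sketch, assuming nothing beyond the standard library (the function name and sample size are illustrative, not a warcio utility):

```python
import zlib

# Count gzip members at the head of the file. One giant member suggests
# whole-file compression; several small ones suggest record-level gzip.
def count_gzip_members(path, limit=5, sample=16 * 1024 * 1024):
    with open(path, 'rb') as f:
        data = f.read(sample)
    members = 0
    while data and members < limit:
        d = zlib.decompressobj(zlib.MAX_WBITS | 16)  # expect gzip framing
        d.decompress(data)
        if not d.eof:            # member continues past the sample; stop
            break
        members += 1
        data = d.unused_data     # bytes following this member, if any
    return members

print(count_gzip_members('homepages.warc.gz'))  # hypothetical file name
```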
I'm happiest to debug this if you send us the WARC with the problem. I could explore the possible bug I found, but really, solving your actual bug is probably the best path forward. Maybe it's the one I guessed, maybe it isn't.
I had to put this project on the back burner over Christmas and New Year. Now I can hopefully get back to working on it again.
I have got access to my data again now, and I can see that my script using an `ArchiveIterator` has hit the problem again.
@wumpus so far I can at least say that this does not seem to happen at the last record in the file. The file is approximately 53 GB and […]
I have identified the problematic record in my WARC now. I can open an `ArchiveIterator` on the file after seeking to offset 32624067591 and read an item from the iterator. So far, so good. I can also read another item from the iterator. This is where it starts acting up now: if I attempt to call `next()` on the iterator a third time, the call never returns.
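In code, that reproduction would look roughly like this, using warcio's public `ArchiveIterator` (the file name is hypothetical; the offset is the one quoted in the next comment):

```python
from warcio.archiveiterator import ArchiveIterator

# Seek to the region identified above and pull records one at a time.
# With the corrupt file, the third next() call never returns.
with open('homepages.warc.gz', 'rb') as stream:
    stream.seek(32624067591)
    records = iter(ArchiveIterator(stream))
    first = next(records)    # complete record, reads fine
    second = next(records)   # also fine
    third = next(records)    # hangs here on the truncated record
```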
Can you prepare and attach a cut-down WARC that starts at offset 32624067591 and is long enough to have two complete records plus a fair bit of the third? Compress it the same as the original file. I'd guess 242,221 + 241,594 bytes (the sum of the uncompressed record lengths) would be enough. You can then review the web pages inside to make sure they aren't sensitive.
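A byte-exact slice like that needs nothing beyond plain file I/O; a sketch, where the offset and record sizes come from this thread but the file names and the extra margin are illustrative:

```python
# Copy the two complete records plus a chunk of the third, starting at
# the offset given above. The record lengths are uncompressed sizes, so
# as a compressed byte budget they leave plenty of slack.
START = 32624067591
LENGTH = 242221 + 241594 + 256 * 1024   # two records plus slack for the third

with open('homepages.warc.gz', 'rb') as src, \
        open('cutdown.warc.gz', 'wb') as dst:
    src.seek(START)
    dst.write(src.read(LENGTH))
```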
I am trying to do that now.
I have read the problematic record now, directly from the file object. It is clearly a truncated file. I don't know what has happened here, but from some point in the record onward, the remaining contents are simply zero. I have browsed manually through the file, and this seems to go on, gigabyte by gigabyte. I have stored the contents in this gist; it appears to be a page about a bill being discussed in the Danish parliament.
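A quick scan along the lines of that manual browsing, to locate the NUL run within a sample and measure how far it extends (the file name is hypothetical; the offset is the one discussed in this thread):

```python
# Find where the zero bytes start and how long the run is in a sample.
with open('homepages.warc.gz', 'rb') as f:
    f.seek(32624067591)
    data = f.read(16 * 1024 * 1024)

start = data.find(b'\x00' * 64)          # first long run of zero bytes
if start == -1:
    print('no NUL run in this sample')
else:
    tail = data[start:]
    run = len(tail) - len(tail.lstrip(b'\x00'))
    print(f'NUL run starts at sample offset {start}, at least {run} bytes')
```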
This is not an unusual situation in the face of a hardware or software bug. Either the file ends in a hole, which means the filesystem metadata is corrupt, or your bits were written elsewhere on the disk, or not written anywhere. We ought to be able to trigger it by simply overwriting the end of a test warc with nulls. We have plenty of such files in the test data already. I think there are six interesting cases: the problem can start in the payload, in the HTTP headers, or in the WARC headers, each with and without warc-record-level compression. And I suppose all nulls vs. random bytes doubles that, so that's 12 cases.
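One of those fixtures could be fabricated along these lines; the paths and cut point are illustrative, not part of warcio's test suite:

```python
import os
import shutil

# Copy a known-good test warc and overwrite everything past keep_bytes,
# with NULs by default or with random bytes for the other variant.
def corrupt_tail(src, dst, keep_bytes, random_bytes=False):
    shutil.copyfile(src, dst)
    with open(dst, 'r+b') as f:
        f.seek(0, 2)                     # find the file size
        size = f.tell()
        f.seek(keep_bytes)
        filler = os.urandom(size - keep_bytes) if random_bytes \
            else b'\x00' * (size - keep_bytes)
        f.write(filler)

corrupt_tail('example.warc', 'example-nulled.warc', keep_bytes=1024)
```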
I am currently processing a large-ish (on the order of 600GB) batch of WARC files containing a number of dumped homepages.
I am sifting through all of these files for image content, which I then extract and process further. I use a `warcio.ArchiveIterator` to iterate through all of the records in all of the files. Once in a while, I come across WARC files that seem to cause the `warcio.ArchiveIterator` to get stuck in an infinite loop.
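A minimal sketch of that kind of loop, using warcio's documented record API (the file name and the image handling are placeholders, not the reporter's actual script):

```python
from warcio.archiveiterator import ArchiveIterator

# Iterate WARC records and pick out image responses for processing.
with open('homepages.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != 'response' or record.http_headers is None:
            continue
        ctype = record.http_headers.get_header('Content-Type', '')
        if ctype.startswith('image/'):
            image_bytes = record.content_stream().read()
            # ... extract and process the image further ...
```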
I suspect it happens in `warcio/warcio/bufferedreaders.py`, line 193 (commit aa702cb).
I found this problem in warcio 1.7.4, but I cannot identify this version in your repo as the releases do not seem to be tagged.
I suspect this could be related to #121. The way I speculate it might happen is that #121 may occur if the expected content length is somehow too short compared to the actual content. On the other hand, if the expected content length is too long compared to the actual length, reading the record might miss the newline character that `warcio.bufferedreaders.BufferedReader.readline()` is looking for, and cause the loop to never stop?

I notice that when this occurs, `warcio/warcio/archiveiterator.py`, line 174 (commit aa702cb), calls `readline()` without a `length` argument. Adding a sensible `length` argument here would probably be a good "emergency exit" mechanism.