Documentation: Clarify that capture_http writer with filename has no get_stream method #143

voltagex · 2022-04-26T04:49:30Z

I'm using Python 3.10.4 and warcio 1.7.4

Using a piece of code based on https://github.com/webrecorder/warcio#writing-warc-records, I'm getting

    for record in ArchiveIterator(writer.get_stream()):
AttributeError: 'WARCWriter' object has no attribute 'get_stream'. Did you mean: '_iter_stream'?

import os.path
import hashlib

from warcio.capture_http import capture_http
from warcio.archiveiterator import ArchiveIterator
import requests #https://github.com/webrecorder/warcio#writing-warc-records
from bs4 import BeautifulSoup

#https://gist.github.com/edsu/62bc39890806ffd19b597186a3619419

OUTPUT_PATH = 'output/'

def cache_and_return_bs(url):
    if url_already_retrieved(url):
        raise Exception(url + ' already there')
    with capture_http(get_output_filename(url),warc_version='1.1') as writer:
        #TODO: do we want to try to append to a single file?
        requests.get(url)
        for record in ArchiveIterator(writer.get_stream()):
            if record.rec_type == 'response':
                return BeautifulSoup(record.raw_stream)

def get_output_filename(url):
    return OUTPUT_PATH + hashlib.sha256(url.encode()).hexdigest()

def url_already_retrieved(url):
    return os.path.isfile(get_output_filename(url))


if __name__ == '__main__':
    print(cache_and_return_bs('https://example.org'))

I narrowed this down to passing a filename to capture_http: when I do, the writer is a WARCWriter with no get_stream method; when I omit the filename, get_stream exists.

wumpus · 2022-04-26T05:39:39Z

This is by design. I agree that this isn't obvious and that we can improve the documentation and runtime error messages for this case.

What you should do instead is capture to a file and, once the capture has finished, read that file back.

voltagex · 2022-04-26T08:00:39Z

Thanks @wumpus, should I leave this open as a documentation bug in that case?

wumpus · 2022-04-26T08:14:48Z

Yes, please leave it open; this is not the only place where we lack clarity about streaming vs. files.

wumpus self-assigned this Apr 26, 2022

voltagex changed the title ~~capture_http writer with filename has no get_stream method~~ Documentation: Clarify that capture_http writer with filename has no get_stream method Apr 26, 2022
