Documentation: Clarify that capture_http writer with filename has no get_stream method #143

Open
voltagex opened this issue Apr 26, 2022 · 3 comments
@voltagex

voltagex commented Apr 26, 2022

I'm using Python 3.10.4 and warcio 1.7.4

Using a piece of code based on https://github.com/webrecorder/warcio#writing-warc-records, I'm getting this error:

    for record in ArchiveIterator(writer.get_stream()):
AttributeError: 'WARCWriter' object has no attribute 'get_stream'. Did you mean: '_iter_stream'?

import os.path
import hashlib

from warcio.capture_http import capture_http
from warcio.archiveiterator import ArchiveIterator
import requests #https://github.com/webrecorder/warcio#writing-warc-records
from bs4 import BeautifulSoup

#https://gist.github.com/edsu/62bc39890806ffd19b597186a3619419

OUTPUT_PATH = 'output/'

def cache_and_return_bs(url):
    if url_already_retrieved(url):
        raise Exception(url + ' already there')
    with capture_http(get_output_filename(url),warc_version='1.1') as writer:
        #TODO: do we want to try to append to a single file?
        requests.get(url)
        for record in ArchiveIterator(writer.get_stream()):
            if record.rec_type == 'response':
                return BeautifulSoup(record.raw_stream)

def get_output_filename(url):
    return OUTPUT_PATH + hashlib.sha256(url.encode()).hexdigest()

def url_already_retrieved(url):
    return os.path.isfile(get_output_filename(url))


if __name__ == '__main__':
    print(cache_and_return_bs('https://example.org'))

I narrowed this down to passing a filename to capture_http: if I don't pass one, the returned writer has a get_stream method.

@wumpus
Collaborator

wumpus commented Apr 26, 2022

This is by design. I agree that this isn't obvious and that we can improve the documentation and runtime error messages for this case.

What you should do instead is capture to a file and, once that's done, read that file.

@voltagex
Author

Thanks @wumpus, should I leave this open as a documentation bug in that case?

wumpus self-assigned this Apr 26, 2022
@wumpus
Collaborator

wumpus commented Apr 26, 2022

Yes, please leave it open. This is not the only place where we lack clarity about streaming vs. files.

voltagex changed the title from "capture_http writer with filename has no get_stream method" to "Documentation: Clarify that capture_http writer with filename has no get_stream method" on Apr 26, 2022