Question: can it continue a suspended job? #55
Comments
Yes. Pywebcopy skips files that already exist, so you could consider the job resumed.
No. You have to rerun the script/command manually, i.e. with overwrite=False in a script or without the --overwrite flag on the command line.
Yes. Set debug=True or pass the --debug flag, and it will print logs that you can inspect manually.
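For reference, a minimal sketch of the scripted usage described above. The keyword arguments (bypass_robots, overwrite, debug) are taken from this thread and the CLI flags; the exact names and the save_webpage signature may differ between pywebcopy versions, so treat this as an assumption rather than the definitive API.

```python
# Minimal sketch: keyword names mirror the flags discussed in this thread
# (--bypass_robots, --overwrite, --debug); they may vary across pywebcopy versions.
from pywebcopy import save_webpage

save_webpage(
    url="http://y.tuwan.com/chatroom/3701",
    project_folder="./",
    bypass_robots=True,  # same as the --bypass_robots CLI flag
    overwrite=False,     # keep files already saved, effectively resuming the job
    debug=True,          # print logs so failed or stalled requests can be inspected
)
```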
I think he was talking about crawl delays between requests (i.e. timeouts / pauses / waits) to keep the load on the source low and avoid being banned. Is it possible to set such a delay between requests, like --wait in wget? It would be great for both sides: the source website won't be overloaded, and the crawler won't be banned in the middle of the process.
I don't think I got banned, and I wasn't talking about a delay between requests. What I was experiencing was that, after a while, the crawl would simply freeze, with no messages printed to the console for minutes, and I had to kill the process and start over (otherwise it wouldn't move).
Trying to clone a webpage, but it froze after a while, probably due to some network hiccups. I had to kill the process and start over (only to get stuck again, to be honest). Is it possible for this module to continue a suspended job, skipping files that have already been saved?
(Also, what are the timeout thresholds and retry limits for the requests? Can I specify these values? See the illustrative sketch at the end of this post.)
(Also, can I make it print some logs when a request fails or times out and is being retried?)
Windows 10, Python 3.8.1. The module was installed with pip install pywebcopy and invoked from the command line as python -m pywebcopy save_webpage http://y.tuwan.com/chatroom/3701 ./ --bypass_robots.
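For illustration only, here is what the requested timeout and retry configuration looks like at the level of the requests library (which pywebcopy appears to build on). This is a standalone sketch, not pywebcopy's API; the thread does not confirm whether pywebcopy exposes these knobs.

```python
# Standalone illustration of per-request timeouts, retry limits, and retry logging.
# This is NOT pywebcopy's API; it only shows the kind of configuration being asked for.
import logging

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# DEBUG logging makes urllib3 report each retry, so a stalled crawl is visible.
logging.basicConfig(level=logging.DEBUG)

session = requests.Session()
retries = Retry(total=3, backoff_factor=1.0, status_forcelist=[500, 502, 503, 504])
session.mount("http://", HTTPAdapter(max_retries=retries))
session.mount("https://", HTTPAdapter(max_retries=retries))

# (connect timeout, read timeout) in seconds; a hung connection fails instead of freezing.
response = session.get("http://y.tuwan.com/chatroom/3701", timeout=(5, 30))
```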