None of the proxies are checked - leading to a perpetual process where scraping never starts #45

Open
caffeinatedMike opened this issue Jul 31, 2020 · 0 comments


As a side note, it seems this middleware does not respect the graceful shutdown signal Scrapy sends, forcing the user to trigger an unclean shutdown. Roughly five and a half minutes passed between the graceful shutdown request and my forced unclean shutdown (a sketch of one possible fix follows the excerpt):

2020-07-31 10:03:05 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2020-07-31 10:08:42 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
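
Below is a minimal sketch of what I mean (hypothetical code, not the actual rotating_proxies middleware): once the engine has started closing the spider, the retry logic could simply stop re-queueing failed requests so in-flight work can drain and the process can exit cleanly. Note that engine.slot.closing is a private Scrapy internal; checking it here only illustrates the idea.

class ShutdownAwareRetryMixin:
    """Sketch of a downloader-middleware hook that stops retrying once
    Scrapy has begun a graceful shutdown."""

    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def _spider_is_closing(self):
        # NOTE: engine.slot.closing is a private Scrapy internal that
        # becomes truthy once close_spider() has been called (the
        # "Closing spider (shutdown)" line in the log). Illustration only.
        slot = getattr(self.crawler.engine, "slot", None)
        return bool(slot and getattr(slot, "closing", None))

    def process_exception(self, request, exception, spider):
        if self._spider_is_closing():
            # Hand the failure back to the engine instead of re-queueing
            # the request with another proxy, so shutdown can complete.
            return None
        # ... normal "retry with another proxy" logic would go here ...
        return None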

Note: This is a site on which my current IP is blocked, so I suspect that is the root cause. However, I think it would be a good idea for this middleware to detect that the site is blocking all requests and report that in the logs; a sketch of the idea follows.
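
Something along these lines could work (all names here are hypothetical, not part of this library): a small Scrapy extension that periodically checks crawl progress and closes the spider with a clear log message when zero responses have arrived, which is a strong hint that the site is banning every proxy.

import logging

from scrapy import signals
from twisted.internet import task

logger = logging.getLogger(__name__)


class NoProgressMonitor:
    """Close the spider with a loud log message if, after a grace period,
    zero pages have been crawled."""

    def __init__(self, crawler, interval=60.0, grace_checks=5):
        self.crawler = crawler
        self.interval = interval          # seconds between checks
        self.grace_checks = grace_checks  # checks to allow before giving up
        self.checks = 0
        self.loop = task.LoopingCall(self._check)
        crawler.signals.connect(self._start, signal=signals.spider_opened)
        crawler.signals.connect(self._stop, signal=signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def _start(self, spider):
        self.loop.start(self.interval, now=False)

    def _stop(self, spider, reason):
        if self.loop.running:
            self.loop.stop()

    def _check(self):
        self.checks += 1
        # response_received_count is a standard Scrapy core stat.
        crawled = self.crawler.stats.get_value("response_received_count", 0)
        if not crawled and self.checks >= self.grace_checks:
            logger.error(
                "No responses received after %d checks; the site appears "
                "to be blocking every proxy.", self.checks)
            self.crawler.engine.close_spider(
                self.crawler.spider, reason="all_proxies_blocked")

It could then be enabled via the EXTENSIONS setting, e.g. EXTENSIONS = {"myproject.extensions.NoProgressMonitor": 500}.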

Logs

2020-07-31 10:00:57 [scrapy.core.engine] INFO: Spider opened
2020-07-31 10:00:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 10:00:58 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-31 10:00:58 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 25, reanimated: 0, mean backoff time: 0s)
2020-07-31 10:01:28 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 25, reanimated: 0, mean backoff time: 0s)
2020-07-31 10:01:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 10:01:58 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 25, reanimated: 0, mean backoff time: 0s)
2020-07-31 10:02:28 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 25, reanimated: 0, mean backoff time: 0s)
2020-07-31 10:02:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 10:02:58 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 25, reanimated: 0, mean backoff time: 0s)
2020-07-31 10:03:05 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force 
2020-07-31 10:03:05 [scrapy.core.engine] INFO: Closing spider (shutdown)
2020-07-31 10:03:28 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 25, reanimated: 0, mean backoff time: 0s)
2020-07-31 10:03:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 10:03:58 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 0, unchecked: 25, reanimated: 0, mean backoff time: 0s)
2020-07-31 10:03:58 [rotating_proxies.expire] DEBUG: Proxy <http://XXXXXXXXXXX:8800> is DEAD
2020-07-31 10:03:58 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.kroger.com/robots.txt> with another proxy (failed 1 times, max retries: 5)
2020-07-31 10:04:28 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 24, reanimated: 0, mean backoff time: 253s)
2020-07-31 10:04:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 10:04:58 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 24, reanimated: 0, mean backoff time: 253s)
2020-07-31 10:05:28 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 24, reanimated: 0, mean backoff time: 253s)
2020-07-31 10:05:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 10:05:58 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 24, reanimated: 0, mean backoff time: 253s)
2020-07-31 10:06:28 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 24, reanimated: 0, mean backoff time: 253s)
2020-07-31 10:06:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 10:06:58 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 24, reanimated: 0, mean backoff time: 253s)
2020-07-31 10:06:58 [rotating_proxies.expire] DEBUG: Proxy <http://XXXXXXXXXXX:8800> is DEAD
2020-07-31 10:06:58 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.kroger.com/robots.txt> with another proxy (failed 2 times, max retries: 5)
2020-07-31 10:07:28 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 2, unchecked: 23, reanimated: 0, mean backoff time: 214s)
2020-07-31 10:07:58 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 10:07:58 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 2, unchecked: 23, reanimated: 0, mean backoff time: 214s)
2020-07-31 10:08:13 [rotating_proxies.middlewares] DEBUG: 1 proxies moved from 'dead' to 'reanimated'
2020-07-31 10:08:28 [rotating_proxies.middlewares] INFO: Proxies(good: 0, dead: 1, unchecked: 23, reanimated: 1, mean backoff time: 175s)
2020-07-31 10:08:42 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
2020-07-31 10:08:42 [rotating_proxies.expire] DEBUG: Proxy <http://XXXXXXXXXXX:8800> is DEAD
2020-07-31 10:08:42 [rotating_proxies.middlewares] DEBUG: Retrying <GET https://www.kroger.com/robots.txt> with another proxy (failed 3 times, max retries: 5)