
Shouldn't an IgnoreRequest be raised if max_proxies_to_try is reached #51

Open
codekoriko opened this issue Oct 2, 2020 · 0 comments


I implemented a ban policy that marks 302 redirects as a "ban".

But once the request reaches the maximum number of retries, it is let through and therefore picked up by scrapy.downloadermiddlewares.redirect,

which in turn restarts a full max_proxies_to_try cycle for the redirected request (a useless captcha page).
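For context, the ban policy I mean is roughly the following sketch. It assumes the duck-typed interface scrapy-rotating-proxies expects from a ban policy (an object with response_is_ban and exception_is_ban methods); the class name RedirectBanPolicy is mine:

```python
class RedirectBanPolicy:
    """Sketch of a ban policy: treat HTTP 302 redirects as bans so the
    middleware retries the request with another proxy."""

    def response_is_ban(self, request, response):
        # A 302 here means we were bounced to the captcha page.
        return response.status == 302

    def exception_is_ban(self, request, exception):
        # Treat download errors (timeouts, connection failures) as bans too.
        return True
```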

2020-10-02 05:31:07 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET http://www.url.com> (failed 6 times with different proxies)
2020-10-02 05:31:07 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://www.url.com/redirected/to/captacha> from <GET http://www.url.com>
2020-10-02 05:31:10 [rotating_proxies.middlewares] DEBUG: Gave up retrying <GET http://www.url.com/redirected/to/captacha> (failed 6 times with different proxies)

Shouldn't we add a raise IgnoreRequest() like so:

# requires: from scrapy.exceptions import IgnoreRequest
def _retry(self, request, spider):
    retries = request.meta.get('proxy_retry_times', 0) + 1
    max_proxies_to_try = request.meta.get('max_proxies_to_try',
                                          self.max_proxies_to_try)

    if retries <= max_proxies_to_try:
        logger.debug("Retrying %(request)s with another proxy "
                     "(failed %(retries)d times, "
                     "max retries: %(max_proxies_to_try)d)",
                     {'request': request, 'retries': retries,
                      'max_proxies_to_try': max_proxies_to_try},
                     extra={'spider': spider})
        retryreq = request.copy()
        retryreq.meta['proxy_retry_times'] = retries
        retryreq.dont_filter = True
        return retryreq
    else:
        logger.debug("Gave up retrying %(request)s (failed %(retries)d "
                     "times with different proxies)",
                     {'request': request, 'retries': retries},
                     extra={'spider': spider})
        raise IgnoreRequest("Max retries reached")
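For anyone reproducing this, my settings look roughly like the sketch below. The middleware priorities (610/620) follow the scrapy-rotating-proxies README; 'myproject.policy.RedirectBanPolicy' is a hypothetical dotted path to a custom ban policy class:

```python
# settings.py sketch (priorities as documented in the
# scrapy-rotating-proxies README; the policy path is illustrative)
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

# Custom ban policy that flags 302 redirects as bans (hypothetical path).
ROTATING_PROXY_BAN_POLICY = 'myproject.policy.RedirectBanPolicy'

# Controls max_proxies_to_try; 5 retries matches the "failed 6 times"
# seen in the log above (5 retries + the initial attempt).
ROTATING_PROXY_PAGE_RETRY_TIMES = 5
```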