
Refresh the list of proxies during scraping #40

Open
dibodin opened this issue Jun 1, 2020 · 11 comments

dibodin commented Jun 1, 2020

Hello,

I see that the proxy list is loaded in `from_crawler` (middlewares.py): the loading happens in the object's constructor.

I read this on a good scraping site: "...write some code that would automatically pick up and refresh the proxy list you use for scraping with working IP addresses. This will save you a lot of time and frustration."

I would like to change the proxy list dynamically, or add to it, during scraping. I think it would be a good feature.

Best regards.


StasDeep commented Jun 7, 2020

Hello! If I understand correctly, you want to dynamically load proxy lists from the internet so that you always have the latest proxies.

What you can do is define a custom middleware and a custom proxies class:

```python
import logging

from scrapy import signals
from twisted.internet import task

from rotating_proxies.middlewares import RotatingProxyMiddleware
from rotating_proxies.expire import Proxies, ProxyState
from rotating_proxies.utils import extract_proxy_hostport

logger = logging.getLogger(__name__)


class CustomRotatingProxiesMiddleware(RotatingProxyMiddleware):

    @classmethod
    def from_crawler(cls, crawler):
        mw = super(CustomRotatingProxiesMiddleware, cls).from_crawler(crawler)
        # Substitute the standard `proxies` object with a custom one,
        # re-reading the initial list from settings
        proxy_list = crawler.settings.getlist('ROTATING_PROXY_LIST')
        mw.proxies = CustomProxies(mw.cleanup_proxy_list(proxy_list),
                                   backoff=mw.proxies.backoff)

        # Connect `proxies` to engine signals in order to start and stop the looping task
        crawler.signals.connect(mw.proxies.engine_started,
                                signal=signals.engine_started)
        crawler.signals.connect(mw.proxies.engine_stopped,
                                signal=signals.engine_stopped)
        return mw


class CustomProxies(Proxies):

    def engine_started(self):
        """ Create a task that updates the proxies every hour """
        self.task = task.LoopingCall(self.update_proxies)
        self.task.start(3600, now=True)

    def engine_stopped(self):
        if self.task.running:
            self.task.stop()

    def update_proxies(self):
        new_proxies = ...  # fetch proxies from wherever you want
        for proxy in new_proxies:
            self.add(proxy)

    def add(self, proxy):
        """ Add a proxy to the proxy list """
        if proxy in self.proxies:
            logger.warning("Proxy <%s> is already in proxies list" % proxy)
            return

        hostport = extract_proxy_hostport(proxy)
        self.proxies[proxy] = ProxyState()
        self.proxies_by_hostport[hostport] = proxy
        self.unchecked.add(proxy)
```
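
For the `update_proxies` stub, here is a minimal sketch of one way to fill it in, assuming a hypothetical endpoint `https://example.com/proxies.txt` that serves one full proxy URL (e.g. `http://1.2.3.4:8080`) per line. Since `LoopingCall` runs in the reactor thread, the blocking download is pushed to a thread with `deferToThread`:

```python
import urllib.request

from twisted.internet import threads

# Hypothetical endpoint serving one proxy URL per line, e.g. "http://1.2.3.4:8080"
PROXY_LIST_URL = 'https://example.com/proxies.txt'


class CustomProxies(Proxies):
    # ... engine_started / engine_stopped / add as above ...

    def update_proxies(self):
        # Run the blocking download in a thread so the Twisted reactor
        # (and the crawl) is not stalled; LoopingCall waits for the
        # returned Deferred before scheduling the next run.
        return threads.deferToThread(self._fetch_proxies).addCallback(self._apply)

    def _fetch_proxies(self):
        with urllib.request.urlopen(PROXY_LIST_URL, timeout=30) as response:
            return [line.strip()
                    for line in response.read().decode('utf8').splitlines()
                    if line.strip()]

    def _apply(self, new_proxies):
        for proxy in new_proxies:
            self.add(proxy)
```

Note that `add` is given each line as-is, so this sketch assumes the endpoint serves full proxy URLs rather than bare host:port pairs.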

@victor-wyk

In settings.py, do you simply replace 'rotating_proxies.middlewares.RotatingProxyMiddleware' with 'YourProject.middlewares.CustomRotatingProxiesMiddleware'? And what about the other settings.py options?

@StasDeep

@victor-wyk I think replacing the original middleware with the custom one should do the trick.
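
For reference, a minimal sketch of that swap in settings.py, assuming the custom class lives in MyProject/middlewares.py (610 and 620 are the priorities the library README uses):

```python
DOWNLOADER_MIDDLEWARES = {
    # 'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'MyProject.middlewares.CustomRotatingProxiesMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
```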


victor-wyk commented Jul 12, 2020

Thanks for the quick response. I did some fiddling and found that after switching to the custom middleware, you still have to supply the ROTATING_PROXY_LIST option with a list of proxies, or else the custom middleware will not run at all. Once it runs, it ignores that list and continues as usual. How can this be solved?

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    'MyProject.middlewares.MyProjectDownloaderMiddleware': 543,
    'MyProject.middlewares.CustomRotatingProxiesMiddleware': 610,
    # 'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

ROTATING_PROXY_LIST = ['69.69.69.69:69']
```

@StasDeep

@victor-wyk when you run the spider, do you see `CustomRotatingProxiesMiddleware` in the list logged after `[scrapy.middleware] INFO: Enabled downloader middlewares:`?

@victor-wyk

@StasDeep I do, but only if I include the ROTATING_PROXY_LIST as shown above. If I get rid of the option, it does not appear.

@StasDeep

@victor-wyk but what goes wrong then? As in, what do you expect to happen, and what actually happens?


Kamranbarlas commented Nov 15, 2021

I am facing the same problem with changing the proxy list dynamically while scraping.
Can you tell me what `proxy_list` is in `cleanup_proxy_list(proxy_list)`?
@StasDeep

@milancelap

I get a NameError: name 'proxy_list' is not defined when implementing that custom middleware. new_proxy_list, logger, extract_proxy_hostport, and ProxyState are all undefined as well... @StasDeep


@reedjones

@Kamranbarlas basically you have to override the from_crawler method in your custom class and then set the proxy list to whatever you want.

Here is where `proxy_list` comes from in the library source:

```python
@classmethod
def from_crawler(cls, crawler):
    s = crawler.settings
    proxy_path = s.get('ROTATING_PROXY_LIST_PATH', None)
    if proxy_path is not None:
        with codecs.open(proxy_path, 'r', encoding='utf8') as f:
            proxy_list = [line.strip() for line in f if line.strip()]
    else:
        proxy_list = s.getlist('ROTATING_PROXY_LIST')
    if not proxy_list:
        raise NotConfigured()
    mw = cls(
        proxy_list=proxy_list,
        logstats_interval=s.getfloat('ROTATING_PROXY_LOGSTATS_INTERVAL', 30),
        stop_if_no_proxies=s.getbool('ROTATING_PROXY_CLOSE_SPIDER', False),
        max_proxies_to_try=s.getint('ROTATING_PROXY_PAGE_RETRY_TIMES', 5),
        backoff_base=s.getfloat('ROTATING_PROXY_BACKOFF_BASE', 300),
        backoff_cap=s.getfloat('ROTATING_PROXY_BACKOFF_CAP', 3600),
        crawler=crawler,
    )
    crawler.signals.connect(mw.engine_started,
                            signal=signals.engine_started)
    crawler.signals.connect(mw.engine_stopped,
                            signal=signals.engine_stopped)
    return mw
```
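
So a sketch of the override described above might look like the following; fetch_initial_proxies() is a hypothetical helper you would implement yourself (read from a URL, file, database, ...). Because it bypasses the ROTATING_PROXY_LIST lookup, the NotConfigured branch above is never hit, which also sidesteps the problem @victor-wyk ran into:

```python
from scrapy import signals

from rotating_proxies.middlewares import RotatingProxyMiddleware


class CustomRotatingProxiesMiddleware(RotatingProxyMiddleware):

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        # Hypothetical helper: build the seed list yourself instead of
        # reading it from ROTATING_PROXY_LIST in settings.py.
        proxy_list = fetch_initial_proxies()
        mw = cls(
            proxy_list=proxy_list,
            logstats_interval=s.getfloat('ROTATING_PROXY_LOGSTATS_INTERVAL', 30),
            stop_if_no_proxies=s.getbool('ROTATING_PROXY_CLOSE_SPIDER', False),
            max_proxies_to_try=s.getint('ROTATING_PROXY_PAGE_RETRY_TIMES', 5),
            backoff_base=s.getfloat('ROTATING_PROXY_BACKOFF_BASE', 300),
            backoff_cap=s.getfloat('ROTATING_PROXY_BACKOFF_CAP', 3600),
            crawler=crawler,
        )
        crawler.signals.connect(mw.engine_started, signal=signals.engine_started)
        crawler.signals.connect(mw.engine_stopped, signal=signals.engine_stopped)
        return mw
```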
