
Refresh the list of proxies during scraping #40

Open
dibodin opened this issue Jun 1, 2020 · 11 comments

dibodin commented Jun 1, 2020

Hello,

I see that the proxy list is loaded in `from_crawler` (middlewares.py): the loading happens in the object's constructor.

I read this on a good scraping site: "...write some code that would automatically pick up and refresh the proxy list you use for scraping with working IP addresses. This will save you a lot of time and frustration."

I would like to change the proxy list dynamically, or add to it, during scraping. I think it would be a good feature.

Best regards.


StasDeep commented Jun 7, 2020

Hello! If I understand correctly, you want to dynamically load proxy lists from the internet so that you always have the latest proxies.

What you can do is define a custom middleware and a custom proxies class:

```python
import logging

from scrapy import signals
from twisted.internet import task

from rotating_proxies.middlewares import RotatingProxyMiddleware
from rotating_proxies.expire import Proxies, ProxyState
from rotating_proxies.utils import extract_proxy_hostport

logger = logging.getLogger(__name__)


class CustomRotatingProxiesMiddleware(RotatingProxyMiddleware):

    @classmethod
    def from_crawler(cls, crawler):
        mw = super(CustomRotatingProxiesMiddleware, cls).from_crawler(crawler)
        # Substitute the standard `proxies` object with a custom one,
        # re-reading the initial list from settings
        proxy_list = crawler.settings.getlist('ROTATING_PROXY_LIST')
        mw.proxies = CustomProxies(mw.cleanup_proxy_list(proxy_list),
                                   backoff=mw.proxies.backoff)

        # Connect `proxies` to engine signals in order to start and stop the looping task
        crawler.signals.connect(mw.proxies.engine_started,
                                signal=signals.engine_started)
        crawler.signals.connect(mw.proxies.engine_stopped,
                                signal=signals.engine_stopped)
        return mw


class CustomProxies(Proxies):

    def engine_started(self):
        """ Create a task that updates the proxies every hour """
        self.task = task.LoopingCall(self.update_proxies)
        self.task.start(3600, now=True)

    def engine_stopped(self):
        if self.task.running:
            self.task.stop()

    def update_proxies(self):
        new_proxies = ...  # fetch proxies from wherever you want
        for proxy in new_proxies:
            self.add(proxy)

    def add(self, proxy):
        """ Add a proxy to the proxy list """
        if proxy in self.proxies:
            logger.warning("Proxy <%s> is already in proxies list" % proxy)
            return

        hostport = extract_proxy_hostport(proxy)
        self.proxies[proxy] = ProxyState()
        self.proxies_by_hostport[hostport] = proxy
        self.unchecked.add(proxy)
```
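
For the `update_proxies` stub, here is a minimal sketch of one way to fill it in, assuming a hypothetical endpoint `https://example.com/proxies.txt` that serves one full proxy URL (e.g. `http://1.2.3.4:8080`) per line. Since `LoopingCall` runs in the reactor thread, the blocking download is pushed to a thread with `deferToThread`:

```python
import urllib.request

from twisted.internet import threads

# Hypothetical endpoint serving one proxy URL per line, e.g. "http://1.2.3.4:8080"
PROXY_LIST_URL = 'https://example.com/proxies.txt'


class CustomProxies(Proxies):
    # ... engine_started / engine_stopped / add as above ...

    def update_proxies(self):
        # Run the blocking download in a thread so the Twisted reactor
        # (and the crawl) is not stalled; LoopingCall waits for the
        # returned Deferred before scheduling the next run.
        return threads.deferToThread(self._fetch_proxies).addCallback(self._apply)

    def _fetch_proxies(self):
        with urllib.request.urlopen(PROXY_LIST_URL, timeout=30) as response:
            return [line.strip()
                    for line in response.read().decode('utf8').splitlines()
                    if line.strip()]

    def _apply(self, new_proxies):
        for proxy in new_proxies:
            self.add(proxy)
```

Note that `add` is given each line as-is, so this sketch assumes the endpoint serves full proxy URLs rather than bare host:port pairs.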

@victor-wyk

In settings.py, do you simply replace 'rotating_proxies.middlewares.RotatingProxyMiddleware' with 'YourProject.middlewares.CustomRotatingProxiesMiddleware'? And what about the other settings.py options?

@StasDeep

@victor-wyk I think replacing the original middleware with the custom one should do the trick.
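
For reference, a minimal sketch of that swap in settings.py, assuming the custom class lives in MyProject/middlewares.py (610 and 620 are the priorities the library README uses):

```python
DOWNLOADER_MIDDLEWARES = {
    # 'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'MyProject.middlewares.CustomRotatingProxiesMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
```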


victor-wyk commented Jul 12, 2020

Thanks for the quick response. I did some fiddling and found that after switching to the custom middleware, you still have to supply the ROTATING_PROXY_LIST option with a list of proxies, or else the custom middleware will not run at all. Once it runs, it ignores that list and continues as usual. How can this be solved?

```python
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    'MyProject.middlewares.MyProjectDownloaderMiddleware': 543,
    'MyProject.middlewares.CustomRotatingProxiesMiddleware': 610,
    # 'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

ROTATING_PROXY_LIST = ['69.69.69.69:69']
```

@StasDeep

@victor-wyk when you run the spider, do you see `CustomRotatingProxiesMiddleware` in the list logged after `[scrapy.middleware] INFO: Enabled downloader middlewares:`?

@victor-wyk

@StasDeep I do, but only if I include the ROTATING_PROXY_LIST as shown above. If I get rid of the option, it does not appear.

@StasDeep

@victor-wyk but what goes wrong then? As in, what do you expect to happen, and what actually happens?


Kamranbarlas commented Nov 15, 2021

I am facing the same problem with changing the proxy list dynamically while scraping.
Can you tell me what `proxy_list` is in `cleanup_proxy_list(proxy_list)`?
@StasDeep

@milancelap

I get a NameError: name 'proxy_list' is not defined when implementing that custom middleware. new_proxy_list, logger, extract_proxy_hostport, and ProxyState are all undefined as well... @StasDeep


@reedjones

@Kamranbarlas basically you have to override the from_crawler method in your custom class and then set the proxy list to whatever you want.

Here is where `proxy_list` comes from in the library source:

```python
@classmethod
def from_crawler(cls, crawler):
    s = crawler.settings
    proxy_path = s.get('ROTATING_PROXY_LIST_PATH', None)
    if proxy_path is not None:
        with codecs.open(proxy_path, 'r', encoding='utf8') as f:
            proxy_list = [line.strip() for line in f if line.strip()]
    else:
        proxy_list = s.getlist('ROTATING_PROXY_LIST')
    if not proxy_list:
        raise NotConfigured()
    mw = cls(
        proxy_list=proxy_list,
        logstats_interval=s.getfloat('ROTATING_PROXY_LOGSTATS_INTERVAL', 30),
        stop_if_no_proxies=s.getbool('ROTATING_PROXY_CLOSE_SPIDER', False),
        max_proxies_to_try=s.getint('ROTATING_PROXY_PAGE_RETRY_TIMES', 5),
        backoff_base=s.getfloat('ROTATING_PROXY_BACKOFF_BASE', 300),
        backoff_cap=s.getfloat('ROTATING_PROXY_BACKOFF_CAP', 3600),
        crawler=crawler,
    )
    crawler.signals.connect(mw.engine_started,
                            signal=signals.engine_started)
    crawler.signals.connect(mw.engine_stopped,
                            signal=signals.engine_stopped)
    return mw
```
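
So a sketch of the override described above might look like the following; fetch_initial_proxies() is a hypothetical helper you would implement yourself (read from a URL, file, database, ...). Because it bypasses the ROTATING_PROXY_LIST lookup, the NotConfigured branch above is never hit, which also sidesteps the problem @victor-wyk ran into:

```python
from scrapy import signals

from rotating_proxies.middlewares import RotatingProxyMiddleware


class CustomRotatingProxiesMiddleware(RotatingProxyMiddleware):

    @classmethod
    def from_crawler(cls, crawler):
        s = crawler.settings
        # Hypothetical helper: build the seed list yourself instead of
        # reading it from ROTATING_PROXY_LIST in settings.py.
        proxy_list = fetch_initial_proxies()
        mw = cls(
            proxy_list=proxy_list,
            logstats_interval=s.getfloat('ROTATING_PROXY_LOGSTATS_INTERVAL', 30),
            stop_if_no_proxies=s.getbool('ROTATING_PROXY_CLOSE_SPIDER', False),
            max_proxies_to_try=s.getint('ROTATING_PROXY_PAGE_RETRY_TIMES', 5),
            backoff_base=s.getfloat('ROTATING_PROXY_BACKOFF_BASE', 300),
            backoff_cap=s.getfloat('ROTATING_PROXY_BACKOFF_CAP', 3600),
            crawler=crawler,
        )
        crawler.signals.connect(mw.engine_started, signal=signals.engine_started)
        crawler.signals.connect(mw.engine_stopped, signal=signals.engine_stopped)
        return mw
```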
