Refresh the list of proxies during scraping #40
Hello! If I understand correctly, you want to dynamically load proxy lists from the internet so that you always have the latest proxies. What you can do is define a custom middleware and a custom proxies class:

```python
import logging

from rotating_proxies.middlewares import RotatingProxyMiddleware
from rotating_proxies.expire import Proxies, ProxyState, extract_proxy_hostport
from scrapy import signals
from twisted.internet import task

logger = logging.getLogger(__name__)


class CustomRotatingProxiesMiddleware(RotatingProxyMiddleware):

    @classmethod
    def from_crawler(cls, crawler):
        mw = super(CustomRotatingProxiesMiddleware, cls).from_crawler(crawler)
        # Substitute the standard `proxies` object with a custom one,
        # seeded from the same setting the base middleware reads
        proxy_list = crawler.settings.getlist('ROTATING_PROXY_LIST')
        mw.proxies = CustomProxies(mw.cleanup_proxy_list(proxy_list),
                                   backoff=mw.proxies.backoff)
        # Connect `proxies` to engine signals in order to start and stop
        # the looping task
        crawler.signals.connect(mw.proxies.engine_started,
                                signal=signals.engine_started)
        crawler.signals.connect(mw.proxies.engine_stopped,
                                signal=signals.engine_stopped)
        return mw


class CustomProxies(Proxies):

    def engine_started(self):
        """ Create a task for updating proxies every hour """
        self.task = task.LoopingCall(self.update_proxies)
        self.task.start(3600, now=True)

    def engine_stopped(self):
        if self.task.running:
            self.task.stop()

    def update_proxies(self):
        new_proxies = ...  # fetch proxies from wherever you want
        for proxy in new_proxies:
            self.add(proxy)

    def add(self, proxy):
        """ Add a proxy to the proxy list """
        if proxy in self.proxies:
            logger.warning("Proxy <%s> is already in proxies list" % proxy)
            return
        hostport = extract_proxy_hostport(proxy)
        self.proxies[proxy] = ProxyState()
        self.proxies_by_hostport[hostport] = proxy
        self.unchecked.add(proxy)
```
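One way to implement the fetching step is to download a plain-text list and normalize each entry before passing it to `add()`. A minimal sketch, assuming the fetched text has one `host:port` per line (`parse_proxy_list` is a hypothetical helper, not part of rotating_proxies):

```python
def parse_proxy_list(text):
    """Turn a plain-text proxy list (one host:port per line, hypothetical
    format) into full proxy URLs, skipping blank lines and comments."""
    proxies = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # Default to an http:// scheme when none is given
        if '://' not in line:
            line = 'http://' + line
        proxies.append(line)
    return proxies
```

`update_proxies` could then fetch the text with any HTTP client and feed the result into `self.add`.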
In settings.py, do you simply replace 'rotating_proxies.middlewares.RotatingProxyMiddleware' with 'YourProject.middlewares.CustomRotatingProxiesMiddleware'? And what about the other settings.py options?
@victor-wyk I think replacing the original middleware with the custom one should do the work.
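For reference, the registration in settings.py would look like the usual rotating_proxies setup with the custom class swapped in. A sketch, assuming 'YourProject.middlewares' is a placeholder for your own module path (the 610/620 priorities follow the library's README):

```python
DOWNLOADER_MIDDLEWARES = {
    # Custom subclass in place of rotating_proxies.middlewares.RotatingProxyMiddleware
    'YourProject.middlewares.CustomRotatingProxiesMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}

# The base middleware still reads its initial list from this setting
ROTATING_PROXY_LIST = [
    'proxy1.example.com:8000',
    'proxy2.example.com:8031',
]
```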
Thanks for the quick response. I did some fiddling and found that even after switching to the custom middleware, you still have to supply the ROTATING_PROXY_LIST option with a list of proxies, or else the custom middleware will not run. The custom middleware then ignores that list and continues to run as usual. How can this be solved?

```python
DOWNLOADER_MIDDLEWARES = {
}

ROTATING_PROXY_LIST = ['69.69.69.69:69']
```
@victor-wyk when you run the spider, do you see the
@StasDeep I do, but only if I include the ROTATING_PROXY_LIST as shown above. If I remove the option, it does not appear.
@victor-wyk but what goes wrong then? As in, what's expected and what's actual? |
I am facing the same problem with dynamically changing the proxy list while scraping.
I get a NameError: name 'proxy_list' is not defined when implementing that custom middleware. new_proxy_list, logger, extract_proxy_hostport, and ProxyState are also not defined... @StasDeep
I am facing the same problem with dynamically changing the proxy list while scraping.
@Kamranbarlas from the source
Hello,
I found that the proxy list is loaded in from_crawler (middlewares.py): the load happens in the object's constructor.
I read this on a good scraping site: "...write some code that would automatically pick up and refresh the proxy list you use for scraping with working IP addresses. This will save you a lot of time and frustration."
I would like to change the proxy list dynamically, or add to it, during scraping. I think it would be a good feature.
Best regards.
(Sorry for my English...)