Using get_random() to select a proxy is not optimal #49

fredd-427 · 2020-09-25T15:31:31Z

Hello,
I found that using get_random() to choose a proxy from the list is not optimal. Here is my setup:

I crawl a site that uses DataDome to protect itself against crawling, so to avoid being banned I set DOWNLOAD_DELAY to 180 seconds.
I have 2 proxies in ROTATING_PROXY_LIST.
DOWNLOAD_DELAY=180
CONCURRENT_REQUESTS_PER_DOMAIN=1
CONCURRENT_REQUESTS=2 (equal to the number of proxies)

Sometimes get_random() returns the proxy that the spider is already using, so the request has to wait for that proxy's DOWNLOAD_DELAY to expire even though the other proxy is free.

Would it be possible to replace get_random() with a get_unused() function, i.e. one that returns the first "free" proxy that is not currently within its DOWNLOAD_DELAY?
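
A minimal sketch of the requested behaviour, not the library's actual code. The class name ProxyPool, the last_used bookkeeping, the `now` parameter, and the example proxy URLs are all assumptions made for illustration; only the name get_unused() and the 180 second delay come from the description above.

```python
import time
from typing import Dict, Iterable, Optional


class ProxyPool:
    """Hypothetical pool that remembers when each proxy was last handed out."""

    def __init__(self, proxies: Iterable[str], delay: float) -> None:
        self.delay = delay
        # None means the proxy has never been handed out.
        self.last_used: Dict[str, Optional[float]] = {p: None for p in proxies}

    def get_unused(self, now: Optional[float] = None) -> Optional[str]:
        """Return the first proxy outside its delay window, or None if all are busy."""
        if now is None:
            now = time.monotonic()
        for proxy, used_at in self.last_used.items():
            if used_at is None or now - used_at >= self.delay:
                # Record the hand-out time so the proxy counts as busy from now on.
                self.last_used[proxy] = now
                return proxy
        return None


# Two placeholder proxies and a delay of 180 seconds, mirroring the setup above.
pool = ProxyPool(
    ["http://proxy-a.example:8000", "http://proxy-b.example:8000"],
    delay=180,
)
first = pool.get_unused(now=0.0)     # proxy-a is free
second = pool.get_unused(now=1.0)    # proxy-a is busy, so proxy-b is returned
third = pool.get_unused(now=2.0)     # both are busy, so None
fourth = pool.get_unused(now=180.0)  # proxy-a's delay has elapsed
```

With a selection like this, the second request is never assigned the proxy that is still waiting out its delay. The caller decides what to do when None is returned, for example retrying later.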

Thank you,
fred

1st file: log showing the problem (see the comments on the right)
2nd file: log without the problem (see the comments on the right)
1st log.txt
2nd log.txt

fredd-427 mentioned this issue Mar 20, 2021

Please add ability to not assign proxy that is in-use #60

Open
