What's the current rate limit for CDX search? #153

Closed
itsrun opened this issue Dec 15, 2023 · 2 comments
Labels
question Further information is requested


itsrun commented Dec 15, 2023

Hi there, I'm currently sending a search request every 1.25 seconds continuously, but I soon started receiving 429 errors. May I ask what the current recommended rate limit is for the CDX search API? Thanks!

Member

Mr0grog commented Dec 16, 2023

Just to be clear, this isn't an official package from the Internet Archive, so for most questions not specifically about this Python package, you should contact them directly.

BUT I do try and keep in close contact with the staff there, and the current limit for requests to web.archive.org/cdx/* is 60 requests/minute, averaged over a 5-minute window. Those limits are generally based on IP address, so if you are sharing an IP with someone else (e.g. if you are behind any kind of proxy or router, or working from a shared server), your requests will be grouped together for the purposes of rate limiting. Limits can also differ for particular IPs that have been allowed more or fewer requests because of past abuse or other issues.
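To make the "averaged over a 5-minute window" part concrete, here is a minimal client-side sketch. It assumes one reading of the limit, namely at most 300 requests in any trailing 300 seconds; the server's actual accounting may differ. `SlidingWindowLimiter` and its methods are hypothetical names for illustration and are not part of this package.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most `max_calls` calls in any trailing `window` seconds.

    Defaults assume 60 requests/minute averaged over 5 minutes means
    300 requests per 300 seconds (an interpretation, not a documented rule).
    """

    def __init__(self, max_calls=300, window=300.0, clock=time.monotonic):
        self.max_calls = max_calls
        self.window = window
        self.clock = clock
        self.calls = deque()  # timestamps of recent calls, oldest first

    def wait_time(self):
        """Return how many seconds to wait before the next call is allowed."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        # The window is full: wait until the oldest call ages out.
        return self.window - (now - self.calls[0])

    def record(self):
        """Record that a call was just made."""
        self.calls.append(self.clock())
```

Before each request, sleep for `wait_time()` seconds and then call `record()`. Because the real limit is per IP address, anything else sharing your IP eats into the same budget, so a lower `max_calls` is the safer choice on shared infrastructure.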

If you are using this package, it does its best to stick to the limits for you automatically, but we fixed some significant issues around rate limits in the latest release (v0.4.4), and the next release (v0.5.0, hopefully later this month 🤞) completely overhauls rate limiting, so make sure you're on the latest version!

Also keep in mind that rate limits in this library are expressed in calls per second, so to make a request every 1.25s, you should configure:

from wayback import WaybackClient, WaybackSession
client = WaybackClient(WaybackSession(search_calls_per_second=0.8))

Make sure to lower that value even more if you are using multiple clients on multiple threads. Also be careful not to create too many HTTP connections if you are multithreading! That'll be easier in v0.5.0, but in the current release, doing so is messy; see #106 (comment).
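As a rough illustration of the arithmetic, the sketch below converts "one request every N seconds" into a calls-per-second value and splits it evenly across several clients. The even split is an assumption about how you might share the budget, and `per_client_rate` is a hypothetical helper, not something this package provides.

```python
def per_client_rate(seconds_between_requests, n_clients=1):
    """Return a calls-per-second value for each of `n_clients` clients.

    The total rate is 1 / seconds_between_requests; it is divided evenly
    so that all clients together stay at or under that total.
    """
    if seconds_between_requests <= 0:
        raise ValueError("seconds_between_requests must be positive")
    if n_clients < 1:
        raise ValueError("n_clients must be at least 1")
    total_rate = 1.0 / seconds_between_requests
    return total_rate / n_clients


# One client, one request every 1.25 s.
single = per_client_rate(1.25)

# Four clients on four threads sharing the same overall budget.
shared = per_client_rate(1.25, n_clients=4)
```

Each result could then be passed as `search_calls_per_second` when constructing the session for that client.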

Finally, once you receive a 429 response, make sure to stop all new requests immediately and do not start again for at least 60s. If you make new requests during that 60s window, your IP will get blocked for progressively longer time periods, from a few hours up to a few days.
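If you are managing requests yourself, one way to enforce that pause is a small gate shared by every thread. This is only a sketch: `CooldownGate` and its method names are hypothetical, and the 60 s default is taken from the advice above rather than from any official documentation.

```python
import threading
import time


class CooldownGate:
    """Block new requests for `cooldown` seconds after any 429 response."""

    def __init__(self, cooldown=60.0, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock
        self._lock = threading.Lock()
        self._resume_at = 0.0  # clock time at which requests may resume

    def note_response(self, status_code):
        """Call with every response status; a 429 starts (or extends) the pause."""
        if status_code == 429:
            with self._lock:
                self._resume_at = max(
                    self._resume_at, self.clock() + self.cooldown
                )

    def seconds_until_allowed(self):
        """Return how long to wait before sending the next request."""
        with self._lock:
            return max(0.0, self._resume_at - self.clock())
```

Every worker would check `seconds_until_allowed()` and sleep for that long before sending a request, then pass the response status to `note_response()`. Sharing a single gate across threads matters because one thread's 429 should pause all of them.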

Mr0grog added the "question" label Dec 16, 2023
Author

itsrun commented Dec 16, 2023

Thanks for the clear explanation! I'm running the script (single-threaded) from a GCP VM, so I guess that's why it got rate limited so quickly.

itsrun closed this as completed Dec 16, 2023