Add search_v2() method #103
base: main
wayback/_client.py
Outdated
# Since pages are a number of *blocks searched* and not results, a page
# in the middle of the result set may have nothing in it. The only way
# to know when to stop iterating is to check how many pages there are.
page_count = int(self.session.request('GET', CDX_SEARCH_2_URL, params={
    **query,
    'showNumPages': 'true'
}).text)
This might not be necessary! I just discovered that sending too high a page value gets a 400 error with the header `x-archive-wayback-runtime-error: page must be smaller than numpages`, so we can in theory check for that and stop.
That said, that’s a very human-readable message and feels unstable. We should check with folks at the Internet Archive about what approach they’d prefer people use.
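For comparison, here is a minimal sketch of the two stopping strategies discussed above: asking for the page count up front, and paging until the server returns the 400 error. It is not the code in this PR. The `fetch` callable, its `(status, headers, body)` return shape, the lowercase header keys, zero-based page numbering (inferred from the "page must be smaller than numpages" message), and the `fake_fetch` stand-in server are all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterator, Tuple

# Hypothetical response shape for this sketch: (status_code, headers, body_text).
Response = Tuple[int, Dict[str, str], str]
Fetch = Callable[[dict], Response]

ERROR_HEADER = 'x-archive-wayback-runtime-error'


def iter_pages_by_count(fetch: Fetch, query: dict) -> Iterator[str]:
    """Ask for the page count first, then request exactly that many pages."""
    _, _, text = fetch({**query, 'showNumPages': 'true'})
    page_count = int(text)
    for page in range(page_count):
        _, _, body = fetch({**query, 'page': page})
        # Pages count blocks searched, not results, so a page in the
        # middle of the result set may legitimately be empty.
        if body.strip():
            yield body


def iter_pages_until_error(fetch: Fetch, query: dict) -> Iterator[str]:
    """Request pages until the server says the page number is out of range."""
    page = 0
    while True:
        status, headers, body = fetch({**query, 'page': page})
        # Matching on a human-readable message is fragile; this is the
        # part that feels unstable.
        if status == 400 and 'page must be smaller' in headers.get(ERROR_HEADER, ''):
            return
        if status != 200:
            raise RuntimeError(f'Unexpected status {status} on page {page}')
        if body.strip():
            yield body
        page += 1


def fake_fetch(params: dict) -> Response:
    """Stand-in for the real endpoint: three pages, the middle one empty."""
    pages = ['a\n', '', 'b\n']
    if params.get('showNumPages') == 'true':
        return 200, {}, str(len(pages))
    page = params['page']
    if page >= len(pages):
        return 400, {ERROR_HEADER: 'page must be smaller than numpages'}, ''
    return 200, {}, pages[page]
```

Against the stand-in, both iterators yield the same non-empty pages. The first costs one extra request up front, while the second costs one extra request at the end and depends on the wording of the error header.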
Update: the way you control output format in the new search is not with
This adds support for the Internet Archive's new, beta CDX search endpoint at `/web/timemap/cdx`. It deals with pagination much better and is eventually slated to replace the search currently at `/cdx/search/cdx`, but is a little slower and still being tested. This commit is a start, but we still need to do more detailed testing and talk more with the Wayback Machine team about things that are unclear here. I'm also not sure if `filter`, `collapse`, `resolveRevisits`, etc. are actually supported. Fixes #8.
ada8423
to
5093982
5c41ba6
to
42d5f7d
Some updates:
At this point, there’s a little more I can do (rate limiting, cleanup), but the main blocker is the lack of clarity on bugs/intended behavior from Wayback, which we’ll have to wait to hear back on.
🚧 Work in Progress! 🚧
This adds support for the Internet Archive's new, beta CDX search endpoint at `/web/timemap/cdx`. It deals with pagination much better and is eventually slated to replace the search currently at `/cdx/search/cdx`, but is a little slower and still being tested. Fixes #8.

There are still a bunch of things to be done before merging:

- … `output=json` working), etc.
- … `filter`, `collapse`, `resolveRevisits`, etc) and whether there are new ones we can/should use. (Update: `resolveRevisits` is badly broken, but the rest are the same as original search and work fine. Checking w/ Wayback folks for more detail.)
- … `search_v2` is the right name, or if it should be something else (`search_beta()`? `search_next()`?)