WoS REST API causes problems with some author searches #1642
Comments
|
I tested the author query below by requesting only the WOS UIDs in the first call (asking for 0 results), taking the queryID from the response, and then requesting the UIDs in a separate call (using the Swagger API view), as described at the bottom of this page: https://github.com/sul-dlss/sul_pub/wiki/Clarivate-APIs#web-of-sciences-expanded-api-notes That part is fast. I then iterated over the resulting WOS UIDs, requesting one record at a time, and the records parsed successfully on several runs. So I wonder whether our current approach of requesting all of the records at once has issues with very large publications that produce giant amounts of XML. This is the query I ran, with a loadTimeSpan of 3W:
WoS API returned:
|
The last WOS UID has 2900 authors and is a giant XML record. It works fine when processed singly, but it plausibly causes issues when wrapped up with other publications:
|
Looks like most of those publications have a lot of authors ... they are physics publications:
|
So it seems entirely plausible that a result set like this will blow up if we ask for it all in one go (which we currently do) instead of requesting just the UIDs and then iterating over them one at a time to request the records. |
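To make the idea concrete, here is a minimal sketch of the "UIDs first, then one record at a time" flow. It is not the sul_pub implementation: `harvest_one_at_a_time`, `HarvestResult`, and the two callables are hypothetical names, and the in-memory lambdas stand in for the real WoS calls so the sketch runs without network access.

```ruby
# Hypothetical sketch: the caller supplies two callables.
#   fetch_uids   -> returns an array of UIDs for the author query
#   fetch_record -> returns the XML for a single UID (may raise)
HarvestResult = Struct.new(:records, :failures)

def harvest_one_at_a_time(fetch_uids:, fetch_record:)
  result = HarvestResult.new({}, {})
  fetch_uids.call.each do |uid|
    begin
      result.records[uid] = fetch_record.call(uid)
    rescue StandardError => e
      # One oversized or malformed record no longer sinks the whole result set.
      result.failures[uid] = e.message
    end
  end
  result
end

# In-memory stand-ins (assumed data, not real WoS UIDs or responses).
fake_records = { 'UID-1' => '<REC>one</REC>', 'UID-3' => '<REC>three</REC>' }
fetch_uids   = -> { %w[UID-1 UID-2 UID-3] }
fetch_record = ->(uid) { fake_records.fetch(uid) } # raises KeyError for UID-2

result = harvest_one_at_a_time(fetch_uids: fetch_uids, fetch_record: fetch_record)
puts "fetched: #{result.records.keys.inspect}"
puts "failed:  #{result.failures.keys.inspect}"
```

The trade-off is more round trips per author, in exchange for a failure that is confined to the one record that caused it.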
Of note: the same problem happens in the current SOAP-based API. In that case, while we may initially fetch just the WOS UIDs when running the name query, we then pass in all of the IDs and try to batch fetch many records at once. This also causes a failure:
|
I'm confused about why the XML being logged is a fragment rather than a seemingly complete document. For example, am I reading this HB alert correctly? {
"message" => "Error processing XML record from WoS",
"xml" => "izations>Med CtrChicagoILUSA",
"encoded_xml" => nil
} |
Good question. I am not 100% sure, but I think the WoS response is not being returned in full because it is so large. |
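If truncation is the cause, a cheap pre-check before parsing might help distinguish "response cut short" from "Clarivate sent invalid XML". The sketch below is only a heuristic under that assumption: `looks_like_complete_xml?` is a hypothetical helper, it does not validate the document, and it treats a self-closing root element as incomplete. The `records`/`REC` element names in the usage lines are illustrative.

```ruby
# Heuristic: a complete XML document should start with (an optional XML
# declaration and) a root element, and end with that root's closing tag.
def looks_like_complete_xml?(body)
  text = body.to_s.strip
  # Capture the name of the first element, skipping an XML declaration.
  root = text[/\A(?:<\?xml.*?\?>\s*)?<([A-Za-z_][\w.-]*)/m, 1]
  return false if root.nil?

  text.end_with?("</#{root}>")
end

# The fragment logged in the HB alert above fails the check immediately.
puts looks_like_complete_xml?('izations>Med CtrChicagoILUSA')
puts looks_like_complete_xml?('<records><REC>x</REC></records>')
puts looks_like_complete_xml?('<?xml version="1.0"?><records><REC>x</REC')
```

Logging the result of a check like this alongside the body size would show whether the failures correlate with cut-off responses.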
Here is another particularly problematic example. This author has 2210 publications. Previously, just asking for the UIDs would blow up; now that we fetch UIDs more efficiently, requesting only the UIDs works fine. However, so many of the publications have so many authors that the harvest blows up if you pull the publication data in groups of 100 (the default). Below, we pull one record at a time and count the authors (even this takes a very long time), showing that many of these publications have hundreds or thousands of authors.
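For reference, a rough version of the per-record author count could look like the sketch below. It assumes each author appears as a `<name role="author" ...>` element in the record XML, which is an assumption about the WoS record layout rather than something shown in this thread. It deliberately uses a pattern scan instead of an XML parser so that it still produces a number for truncated records. `rough_author_count`, `oversized_uids`, and the 500-author threshold are hypothetical.

```ruby
# Count author entries without parsing, so truncated XML still yields a number.
# \b after "name" keeps the wrapping <names> element from matching.
def rough_author_count(record_xml)
  record_xml.to_s.scan(/<name\b[^>]*\brole="author"/).size
end

# Report only the records whose author count exceeds a (hypothetical) threshold.
def oversized_uids(records_by_uid, max_authors: 500)
  records_by_uid.each_with_object({}) do |(uid, xml), counts|
    n = rough_author_count(xml)
    counts[uid] = n if n > max_authors
  end
end

# Synthetic records (assumed shape, not real WoS data).
small = '<names><name role="author">A</name><name role="book_editor">B</name></names>'
large = '<names>' + ('<name seq_no="1" role="author">X</name>' * 600) + '</names>'

puts rough_author_count(small)
puts oversized_uids({ 'UID-small' => small, 'UID-large' => large }).inspect
```

A count like this could be used to decide which UIDs are safe to fetch in groups and which should be fetched singly.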
|
The new WoS REST API seems to have issues with some authors, throwing errors.
https://app.honeybadger.io/projects/50046/faults/99684154/01HB323CCE7F43DYFQ5SBH0QJT?page=0
For example, this author fails to get results back with an XML error as you can see from the HB error above.
You can reproduce the problem without attempting to harvest by just querying for the UIDs
This is likely due to some problematic responses from Clarivate (like some invalid XML documents) and needs to be investigated with them.
If you use their Swagger interface (https://developer.clarivate.com/apis/wos) to generate the actual REST call and execute it on the console, you will get a response:
curl -X 'GET' 'https://wos-api.clarivate.com/api/wos/?databaseId=WOK&usrQuery=AU%3D%28%22Miller%2CD%22%20OR%20%22Miller%2CD%2CCraig%22%20OR%20%22Miller%2CD%2CC%22%29%20AND%20AD%3D%28%22stanford%22%29&loadTimeSpan=3W&count=100&firstRecord=1' -H 'accept: application/xml' -H 'X-ApiKey: API_KEY_REDACTED'
To see just the headers of the 200 response:
curl -I -X 'GET' 'https://wos-api.clarivate.com/api/wos/?databaseId=WOK&usrQuery=AU%3D%28%22Miller%2CD%22%20OR%20%22Miller%2CD%2CCraig%22%20OR%20%22Miller%2CD%2CC%22%29%20AND%20AD%3D%28%22stanford%22%29&loadTimeSpan=3W&count=100&firstRecord=1' -H 'accept: application/xml' -H 'X-ApiKey: API_KEY_REDACTED'
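The percent-encoded `usrQuery` in the curl calls above is hard to read, so here is a small sketch that rebuilds the same URL from the plain-text query. `wos_search_url` and `WOS_BASE` are hypothetical names for this illustration; the parameter names and values are copied from the curl command. Passing `count: 0` gives the "ask for 0 results" variant mentioned in the comments.

```ruby
require 'erb'

WOS_BASE = 'https://wos-api.clarivate.com/api/wos/'

# Build the search URL with the same parameters, in the same order,
# as the curl command above. Spaces are encoded as %20.
def wos_search_url(usr_query:, database_id: 'WOK', load_time_span: '3W',
                   count: 100, first_record: 1)
  params = {
    'databaseId' => database_id,
    'usrQuery' => usr_query,
    'loadTimeSpan' => load_time_span,
    'count' => count,
    'firstRecord' => first_record
  }
  query = params.map { |k, v| "#{k}=#{ERB::Util.url_encode(v.to_s)}" }.join('&')
  "#{WOS_BASE}?#{query}"
end

author_query = 'AU=("Miller,D" OR "Miller,D,Craig" OR "Miller,D,C") AND AD=("stanford")'

puts wos_search_url(usr_query: author_query)           # full records, 100 per page
puts wos_search_url(usr_query: author_query, count: 0) # no records returned
```

The API key header is left out on purpose; it would be supplied by whatever HTTP client issues the request.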