
WoS REST API causes problems with some author searches #1642

Open
peetucket opened this issue Sep 27, 2023 · 9 comments

peetucket commented Sep 27, 2023

The new WoS REST API seems to have issues with some authors, throwing errors.

https://app.honeybadger.io/projects/50046/faults/99684154/01HB323CCE7F43DYFQ5SBH0QJT?page=0

For example, this author fails to get results back, raising an XML error, as you can see from the HB fault above.

You can reproduce the problem without attempting to harvest by just querying for the UIDs

author = Author.find(157210)
options = { load_time_span: '3W', relDate: '21' }
author_query = WebOfScience::QueryAuthor.new(author, options)
puts "WOS (by name): #{author_query.name_query.send(:name_query)}"
uids = author_query.uids

This is likely due to some problematic responses from Clarivate (like some invalid XML documents) and needs to be investigated with them.

If you use their Swagger interface (https://developer.clarivate.com/apis/wos) to generate the actual rest call and execute on the console, you will get a response:

curl -X 'GET' 'https://wos-api.clarivate.com/api/wos/?databaseId=WOK&usrQuery=AU%3D%28%22Miller%2CD%22%20OR%20%22Miller%2CD%2CCraig%22%20OR%20%22Miller%2CD%2CC%22%29%20AND%20AD%3D%28%22stanford%22%29&loadTimeSpan=3W&count=100&firstRecord=1' -H 'accept: application/xml' -H 'X-ApiKey: API_KEY_REDACTED'

For just the 200 header:

curl -I -X 'GET' 'https://wos-api.clarivate.com/api/wos/?databaseId=WOK&usrQuery=AU%3D%28%22Miller%2CD%22%20OR%20%22Miller%2CD%2CCraig%22%20OR%20%22Miller%2CD%2CC%22%29%20AND%20AD%3D%28%22stanford%22%29&loadTimeSpan=3W&count=100&firstRecord=1' -H 'accept: application/xml' -H 'X-ApiKey: API_KEY_REDACTED'


peetucket commented Sep 27, 2023

  1. Perhaps add some additional logging around the XML parser in https://github.com/sul-dlss/sul_pub/blob/main/lib/web_of_science/xml_parser.rb#L25-L31 (or elsewhere, in the records class) so we can identify the exact WoS record that is not parsing correctly, and provide this info to Clarivate.
  2. Ignore records that do not parse instead of blowing up the whole harvest for that author (this would allow the other publications to be added instead of stopping the whole process for that author).
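A minimal sketch of suggestion 2, using only stdlib REXML (the method name and shape are illustrative, not the actual sul_pub parser): parse each record individually, collect the UID and error for any record that fails, and keep going.

```ruby
require 'rexml/document'

# Hypothetical sketch (not the actual sul_pub code): parse each WoS record
# on its own so one malformed document doesn't abort the whole harvest.
# Failing UIDs and errors are collected so they can be reported to Clarivate.
def parse_records(raw_records)
  parsed = []
  failures = {}
  raw_records.each do |uid, xml|
    parsed << REXML::Document.new(xml)
  rescue REXML::ParseException => e
    failures[uid] = e.message # log and continue instead of raising
  end
  [parsed, failures]
end
```

This would satisfy both suggestions at once: the failures hash is the logging, and the harvest continues with whatever did parse.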


peetucket commented Oct 2, 2023

I did a test with the author query below: request only the WOS UIDs in the first query (by asking for 0 results), take the queryId, and then request the UIDs separately via a second call (using the Swagger API view), as described at the bottom of the page here: https://github.com/sul-dlss/sul_pub/wiki/Clarivate-APIs#web-of-sciences-expanded-api-notes

This is fast. I then iterated over the resulting WOS UIDs, requesting one record at a time, and each parsed successfully. So I wonder if our current approach of requesting all of the records at once has issues when very large publications produce giant amounts of XML.

Query that I ran (with a loadTimeSpan of 3W):

AU=("Miller,D" OR "Miller,D,Craig" OR "Miller,D,C") AND AD=("stanford")
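As a sanity check, the usrQuery parameter in the curl calls above is just this query URL-encoded; the encoding can be reproduced with stdlib Ruby (nothing here depends on sul_pub):

```ruby
require 'erb'

usr_query = 'AU=("Miller,D" OR "Miller,D,Craig" OR "Miller,D,C") AND AD=("stanford")'

# ERB::Util.url_encode percent-encodes spaces as %20 (unlike CGI.escape,
# which uses '+'), matching the usrQuery value in the curl calls above.
encoded = ERB::Util.url_encode(usr_query)
puts encoded
# => AU%3D%28%22Miller%2CD%22%20OR%20...
```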

The WoS API returned:

["WOS:001021700000001",
 "WOS:001028170500007",
 "WOS:001037066800001",
 "WOS:001021392200001",
 "WOS:001022697000001",
 "WOS:001021461500001",
 "WOS:001022682600001",
 "WOS:000329880700016",
 "WOS:001035480600001",
 "WOS:001035434900001",
 "WOS:001030510600001",
 "WOS:001035476900001",
 "WOS:001022781200001",
 "WOS:001035251600001",
 "WOS:001023908100001",
 "WOS:001023760300001",
 "WOS:001035462900001",
 "WOS:001035458200001",
 "WOS:001035262600001",
 "WOS:001035431100001",
 "WOS:001035282400001",
 "WOS:001035569700001",
 "WOS:001035240700001",
 "WOS:001035251000001",
 "WOS:001035243000001"]


peetucket commented Oct 2, 2023

The last WOS UID has 2900 authors and is a giant XML record. It works fine processed singly, but plausibly causes issues when wrapped up with other publications:

wos_uid = 'WOS:001035243000001'
results = WebOfScience.queries.retrieve_by_id([wos_uid]).next_batch.to_a;
pub_hash = results[0].pub_hash; nil
puts pub_hash[:author].size
 => 2900
puts results[0].to_xml


peetucket commented Oct 2, 2023

Looks like most of those publications have a lot of authors ... they are physics publications:

wos_uids = ["WOS:001021700000001",
 "WOS:001028170500007",
 "WOS:001037066800001",
 "WOS:001021392200001",
 "WOS:001022697000001",
 "WOS:001021461500001",
 "WOS:001022682600001",
 "WOS:000329880700016",
 "WOS:001035480600001",
 "WOS:001035434900001",
 "WOS:001030510600001",
 "WOS:001035476900001",
 "WOS:001022781200001",
 "WOS:001035251600001",
 "WOS:001023908100001",
 "WOS:001023760300001",
 "WOS:001035462900001",
 "WOS:001035458200001",
 "WOS:001035262600001",
 "WOS:001035431100001",
 "WOS:001035282400001",
 "WOS:001035569700001",
 "WOS:001035240700001",
 "WOS:001035251000001",
 "WOS:001035243000001"]

resp = Hash.new
wos_uids.each do |wos_uid|
   results = WebOfScience.queries.retrieve_by_id([wos_uid]).next_batch.to_a;
   pub_hash = results[0].pub_hash
   resp[wos_uid] = pub_hash[:author].size
end;nil
resp
 =>
{"WOS:001021700000001"=>2900,
 "WOS:001028170500007"=>43,
 "WOS:001037066800001"=>2900,
 "WOS:001021392200001"=>2900,
 "WOS:001022697000001"=>2900,
 "WOS:001021461500001"=>2900,
 "WOS:001022682600001"=>2864,
 "WOS:000329880700016"=>17,
 "WOS:001035480600001"=>2898,
 "WOS:001035434900001"=>2856,
 "WOS:001030510600001"=>2864,
 "WOS:001035476900001"=>2933,
 "WOS:001022781200001"=>2900,
 "WOS:001035251600001"=>2864,
 "WOS:001023908100001"=>2898,
 "WOS:001023760300001"=>2898,
 "WOS:001035462900001"=>2900,
 "WOS:001035458200001"=>2900,
 "WOS:001035262600001"=>2900,
 "WOS:001035431100001"=>2900,
 "WOS:001035282400001"=>2900,
 "WOS:001035569700001"=>2900,
 "WOS:001035240700001"=>2900,
 "WOS:001035251000001"=>2900,
 "WOS:001035243000001"=>2900}

@peetucket

So it seems entirely plausible that a result set like this will blow up if we ask for it all in one go (which we currently do), instead of requesting just the UIDs and then iterating over them to request the records one at a time.
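One mitigation along those lines, as a sketch (not current sul_pub code): fetch UIDs first, then retrieve records in small slices rather than in one giant response. The block here stands in for something like WebOfScience.queries.retrieve_by_id(slice).next_batch.to_a.

```ruby
# Hypothetical sketch: fetch records in small slices of UIDs so that no
# single response has to carry dozens of ~2900-author physics records.
# The yielded block performs one modest API call per slice.
def fetch_in_slices(wos_uids, slice_size: 10)
  wos_uids.each_slice(slice_size).flat_map do |slice|
    yield slice
  end
end
```

With slice_size: 1 this degenerates to the one-record-at-a-time approach; a slightly larger slice trades fewer API calls against response size.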

@peetucket

Of note: the same problem happens in the current SOAP-based API. In that case, while we may initially fetch just the WOS UIDs when running the name query, we then pass all of the IDs in and try to batch-fetch many records at once. This also causes a failure:

wos_uids = ["WOS:001021700000001",
 "WOS:001028170500007",
 "WOS:001037066800001",
 "WOS:001021392200001",
 "WOS:001022697000001",
 "WOS:001021461500001",
 "WOS:001022682600001",
 "WOS:000329880700016",
 "WOS:001035480600001",
 "WOS:001035434900001",
 "WOS:001030510600001",
 "WOS:001035476900001",
 "WOS:001022781200001",
 "WOS:001035251600001",
 "WOS:001023908100001",
 "WOS:001023760300001",
 "WOS:001035462900001",
 "WOS:001035458200001",
 "WOS:001035262600001",
 "WOS:001035431100001",
 "WOS:001035282400001",
 "WOS:001035569700001",
 "WOS:001035240700001",
 "WOS:001035251000001",
 "WOS:001035243000001"]
results = WebOfScience.queries.retrieve_by_id(wos_uids).next_batch.to_a;
/opt/app/pub/sul-pub/shared/bundle/ruby/3.2.0/gems/savon-2.14.0/lib/savon/response.rb:132:in `raise_soap_and_http_errors!': (soap:Server) (WSE0002) Error processing your request. Reason: The (server-side) Web service could not create the call to a supporting server. Error processing results of query. Cause: [{0}]. Remedy: Call customer support. This is not a problem within your SOAP client.  : Java heap space (Savon::SOAPFault)


edsu commented Nov 17, 2023

I'm confused about why the XML being logged is a fragment rather than a complete document. For example, am I reading this HB alert correctly?

    {
      "message" => "Error processing XML record from WoS",
      "xml" => "izations>Med CtrChicagoILUSA",
      "encoded_xml" => nil
    }

@peetucket

Good question - I am not 100% sure. I think the WoS response is not being fully returned because it is so large.
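If the hypothesis is a truncated response, one cheap check (a sketch using stdlib REXML; the method is illustrative) would be to verify that the raw response body is well-formed XML before record parsing, and log the full body when it is not, rather than letting a mid-record fragment surface later:

```ruby
require 'rexml/document'

# Hypothetical guard: a truncated WoS response will not be well-formed XML,
# so this catches it up front instead of surfacing later as a confusing
# mid-record fragment in the error log.
def well_formed_xml?(body)
  REXML::Document.new(body)
  true
rescue REXML::ParseException
  false
end
```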


peetucket commented Nov 17, 2023

Here is another particularly problematic example. This author has 2210 publications. (Previously this would blow up just asking for the UIDs; now that we fetch UIDs more efficiently, requesting just the UIDs works fine.) However, so many of the publications have so many authors that it will blow up if you pull the publication data in groups of 100 (the default). Below, we pull one record at a time and count the authors (even this takes a very long time), showing that many of these publications have hundreds or thousands of authors.

author = Author.find_by(cap_profile_id: 34047)
author_query = WebOfScience::QueryAuthor.new(author)
puts author_query.name_query.send(:name_query);

 => AU=("Wu,Sean" OR "Wu,Sean,M." OR "Wu,Sean,M" OR "Wu,Ming" OR "Wu,Ming,Ming-yuan" OR "Wu,Ming,M" OR "Wu,S" OR "Wu,S,M") AND AD=("stanford" OR "massachusetts general hospital")

wos_uids = author_query.uids;
wos_uids.size
 => 2210

resp = Hash.new
wos_uids.each do |wos_uid|
   results = WebOfScience.queries.retrieve_by_id([wos_uid]).next_batch.to_a;
   pub_hash = results[0].pub_hash
   resp[wos_uid] = pub_hash[:author].size
end;nil

 # wait a long long time

puts resp
 =>
{"WOS:001062395700001"=>2856,
 "WOS:001035476900001"=>2933,
 "WOS:001062376700001"=>2900,
 "WOS:001062550200002"=>2856,
 "WOS:001062421400002"=>2900,
 "WOS:001062420000001"=>2898,
 "WOS:001062395800001"=>2898,
 "WOS:001062395800002"=>2898,
 "WOS:001062554100001"=>2898,
 "MEDLINE:37955510"=>2871,
 "WOS:001022682600001"=>2864,
 "WOS:001063965200001"=>2876,
 "WOS:001062420100001"=>2928,
 "MEDLINE:37925689"=>2935,
 "WOS:001063985300002"=>7,
 "WOS:001066442000003"=>17,
 "WOS:001002149400001"=>14,
 "WOS:000952205900001"=>21,
 "MEDLINE:37931634"=>733,
 "WOS:001063420300001"=>2882,
 "WOS:001063751200019"=>11,
 "WOS:001062451200001"=>2898,
 "WOS:001035434900001"=>2856,
 "WOS:001035431100001"=>2900,
 "WOS:001055270000001"=>2913,
 "MEDLINE:37897746"=>2920,
 "WOS:001035462900001"=>2900,
 "WOS:001069745300005"=>2900,
 "WOS:001035251600001"=>2864,
 "WOS:001035240700001"=>2900,
 "WOS:001035251000001"=>2900,
 "WOS:001035243000001"=>2900,
 "WOS:001062397100001"=>2856,
 "WOS:001062421500007"=>2898,
 "WOS:001062396000001"=>2898,
 "WOS:001071193900001"=>2900,
 "WOS:001062398000001"=>2900,
 "WOS:001062454100002"=>2900,
 "WOS:001062454100001"=>2900,
 "WOS:001062376700002"=>2900,
 "WOS:001061847500001"=>2911,
 "WOS:001062358800001"=>2896,
 "MEDLINE:37897770"=>2933,
 "WOS:001058590400001"=>2898,
 "WOS:001079098400001"=>2900,
 "WOS:001061829700001"=>2856,
 "WOS:001069542700001"=>2898,
 "WOS:001060591500001"=>2898,
 "WOS:001063971600001"=>2900,
 "WOS:001063973100001"=>2900,
 "WOS:001063486300001"=>2900,
 "WOS:001061803400002"=>2900,
 "WOS:001035262600001"=>2900,
 "PPRN:42553004"=>106,
 "WOS:001074938400001"=>13,
 "WOS:001035480600001"=>2898,
 "WOS:001035458200001"=>2900,
 "WOS:001035569700001"=>2900,
 "WOS:000989629700723"=>2,
 "WOS:000989629702211"=>7,
 "WOS:000989629700722"=>8,
 "WOS:001061852200001"=>2900,
 "WOS:001068857600001"=>107,
 "WOS:001059027100001"=>29,
 "WOS:001061751900002"=>2898,
 "WOS:001061751900001"=>2898,
 "WOS:001061876200001"=>2933,
etc etc etc
