-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Augment TRAPI results using PFOCR data #109
Conversation
1d7fc01
to
9b3af00
Compare
It may be useful to have a example response? |
Here's an example pulled from the query in this notebook: Expand to see JSON{
"node_bindings": {
"n1": [
{
"id": "NCBIGene:5599"
}
],
"n0": [
{
"id": "NCBIGene:211"
}
],
"n2": [
{
"id": "PUBCHEM.COMPOUND:3121"
}
]
},
"edge_bindings": {
"e01": [
{
"id": "5fc6cc476f4bcb068460c5d299db52dd"
}
],
"e02": [
{
"id": "fc671d4b62d2d983372271323c3e8be3"
},
{
"id": "7b6da63564f96e9b4d0aef2581eb3ba3"
},
{
"id": "3e93479f91fd45c376d4851eea747174"
},
{
"id": "8353796517ac7652afabf4042e944971"
},
{
"id": "e495f8080658dd1cd0acbec8c4cfeaf7"
},
{
"id": "551bd928ba58c60f465375306b81bbac"
},
{
"id": "6f6dd5467485ea133e64cb76b405c7d2"
}
]
},
"score": 2.1062617593370945,
"pfocr": [
{
"figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5354998/bin/nihms843846f10.jpg",
"pmc": "PMC5354998",
"nodes": [
"n1",
"n0"
]
},
{
"figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1906540/bin/cei0130-0363-f2.jpg",
"pmc": "PMC1906540",
"nodes": [
"n1",
"n0"
]
},
{
"figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7402116/bin/antioxidants-09-00636-g004.jpg",
"pmc": "PMC7402116",
"nodes": [
"n1",
"n0"
]
}
]
} |
Feedback for @ariutta to discuss with @andrewsu: Basically I'm not sure what conditions the pfocr augmenting happens in:
QueryRelated to July 2022 Translator Standup
An example of query with Gene QNode `is_set: true`This query is used in the creative-mode run for this disease.
|
As noted during the meeting today, BTE already does "ID resolution" based on biolink-model's id_prefix priority order, so the KG Node key itself should be NCBIGene if a mapping to one was found. However, for Disease / Chemical, the MESH IDs could be in the KG Node key or in the synonyms section of the node's attributes... It may be easier to pull the IDs out in internal data things, similar to the UMLS ID (which is done for semmeddb ngd scoring) Other notes from the meeting:
|
My current understanding is we don't want to include PFOCR in this case, so the code currently requires matching CURIEs for 2+ different QNodes for a TRAPI result.
The current code just checks whether there's an NCBIGene associated with a QNode. It doesn't actually check whether the category is "biolink:Gene". |
Some queries can result in trying to get essentially all of the PFOCR figures, e.g.: {
"message": {
"query_graph": {
"edges": {
"e01": {
"object": "n0",
"predicates": [
"biolink:related_to"
],
"subject": "n1"
},
"e02": {
"object": "n1",
"predicates": [
"biolink:related_to"
],
"subject": "n2"
}
},
"nodes": {
"n0": {
"categories": [
"biolink:Gene"
],
"ids": [
"NCBIGene:3855",
"NCBIGene:211",
"NCBIGene:26995"
]
},
"n1": {
"categories": [
"biolink:SmallMolecule"
],
"ids": [
"PUBCHEM.COMPOUND:3121"
]
},
"n2": {
"categories": [
"biolink:Gene"
]
}
}
}
}
} When I ran that once, it timed out. When I ran it again, it ended up getting 67,280 figures. Should we put a limit on how many genes we submit in queries to the PFOCR API? |
Hmmm I don't know what's going on in #109 (comment): is it a massive number of genes (so splitting into multiple queries would help)? is it that some genes are too common (maybe we can remove them - so we just don't use them for the pfocr related tasks)? |
In the example I gave, the problem is a massive number of genes. |
As far as I can tell, the gene count issue isn't really specific to PFOCR. If we get too many genes, the scoring can also fail, and for that matter, the entire BTE query can fail. So any gene count limits should probably be set at a system-level. If the PFOCR API is too slow, we could speed it up by updating it to support POST queries. |
Potential issues:
Notes in general:
|
Notes on performance (related to these comments): A. Took ~ 2 min to run the PFOCR section of this query that had 2 Gene QNodes and 522 results. BTE took 5 min 14 s total. query from Feb-March 2022 QotM work
some console logs
B. Took ~ 2 min to run the PFOCR section of some console logs
C. Took ~ 20 sec to run the PFOCR section of this query that had 29 results. BTE took 1 min 56 s total. Quinazolines -> Gene <- Bunch of Gene IDs
some console logs
D. Took ~ 32 sec to run the PFOCR section of this query that had 529 results. BTE took 4 min 20 s total. Type 2 Diabetes -> BiologicalEntity <- TXNIP
some console logs
E. Took ~ 51 sec to run the PFOCR section of this query that had 78 results. BTE took 1 min 20 s total. Gene MLXIP -> Gene <- Gene CTGF
some console logs
(It's really hard to run a Gene ID -> Gene -> Gene (Predict-style) without having it blow up with tons of results >.<) |
Will be updated
Will be added
Your interpretation was correct. Just note that the "split" can include many-to-many relationships (e.g., one figure could match multiple TRAPI results and one TRAPI result could match multiple figures), many-to-one, one-to-many or one-to-one. Any suggestions for better wording? Maybe one of the following?
We don't want the user to think each of the 19 TRAPI results matches 72 figures. But we probably do want the user to know that we required at least 2 CURIEs to be shared for each figure / TRAPI result pair in order to be a match.
Currently no sorting and no scoring, but in issue #451, we're exploring this topic. Once that's figured out, we'll want to add some type of sorting and scoring.
I think these bullet points correctly describe the current behavior, which I also understand to be the desired behavior. If you or @andrewsu think otherwise, just let me know. |
As soon as I get feedback on the wording for logging the counts of PFOCR figure and TRAPI result matches, I'll make another push to this PR to address each item you mentioned, @colleenXu. Thanks for the comments. |
Regarding the performance, this might be a question for @andrewsu and @erikyao. With the existing PFOCR API, I'm not sure of a good way to improve performance. If there's a large number of genes, the only way to get PFOCR data from the API is to make a large number of GET requests. I could just get results for the first 1k genes, but that seems wrong. Or maybe I could just download the entire PFOCR dataset from Dropbox if the number of genes exceeds a certain threshold? |
Feedback on this post: Out of the options for logs, I like On sorting / scoring: perhaps a quick / simple method is to prefer figures that have more matching QNodes. I noticed some answers where there'd be only 1 QNode listed, probably because there were multiple IDs corresponding to that QNode (an And Andrew is off this week. I haven't been involved in setting the requirements for this issue....so I'd like to defer to you and Andrew on whether the "notes in general"/current behavior is the desired behavior. |
On this comment, I think it's worth revisiting the issue with Chunlei and Yao. If you're regularly trying to query with LOTs of gene IDs...to retrieve basically the entire dataset in the API....this does seem inefficient / odd / slow somehow.... some ideas/food for thought: I wonder if it's possible to keep the pfocr data in a github file in the repo (compressed tabular data?), and basically run "filtering" operations on the data during the "pfocr augmenting" process... And also wondering: is there anyway to decrease the number of figures we're working with?
|
I agree, I think this one makes the most sense. |
The logging and PFOCR figure truncation is updated as of my latest commit. |
The only blocker remaining is how to handle cases where we send a very large number of genes to the PFOCR API and/or where we receive a very large number of figures back. Proposed solutions:
Regardless of the items above, we will still get a very large number of figures for certain genes, e.g., ~5k for MAPK1. |
On 1) you could try 100 (we do that right now with some other pending BioThings APIs)? and maybe more depending on how the retrieval of documents in a POST query goes with that PR On 2) sounds like a good idea to me On 3) I dunno. Sounds kinda like a good idea, but I'm not sure. |
Regarding bullet point 3), I pushed an experimental branch Some stats on the current PFOCR dataset:
If you want to use a Bloom or XOR filter, you could pick one and remove the code for the others. Then run |
After verifying functional parity, I've pushed my changes. Still to do is figure scoring. I've also confirmed that BTE still augments explain queries, and did additional performance testing based on these tests: With PR before my changes -> after my changes, in minutes: So, significant performance increases thanks to the intensive stuff being offloaded to the API! Tagging @andrewsu @colleenXu to review in whatever capacity they see fit. |
For a detailed example, given query D:
We get the attached file as a response (1.4Mb, compressed as zip): |
That example output file looks great to me. In terms of functionality, I think this is ready to deploy to dev. (If it were easy to cap the number of PFOCR results returned per TRAPI result at 20, I think that would be good. But that's a minor point that shouldn't hold up deployment...) |
Ah, it appears I had disabled figure limits during testing and forgot to re-enable. I’ll enable that before deploying to dev. |
Codecov Report
@@ Coverage Diff @@
## main #109 +/- ##
==========================================
- Coverage 53.29% 51.99% -1.30%
==========================================
Files 26 27 +1
Lines 2289 2577 +288
==========================================
+ Hits 1220 1340 +120
- Misses 1069 1237 +168
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Note: After checking out this branch and compiling, I encountered this message when running BTE: module missing
So I ran "npm i" which installed 1 package and changed 2 other packages. |
EDITED 2022-10-24 after discussion with Andrew: Feedback for devs
the TRAPI log
Are requirements met?Requirements from this post of the issue Not sure if the "scoring" requirements are met: that scoring is good and higher scores are better. However, I reviewed the top figures for some results, and I thought the stuff looked relevant. It'll be easier to see if the figures were correctly annotated with the genes once Feedback point 3 (adding Gene IDs/names) is done. I think requirements are met for:
Noticed some results that fulfill the criteria but don't have a pfocr section. Turns out that (correctly) no matching figures were found...See the example below from Case B. example: has 2 QNodes with NCBIGene IDs, but no pfocr section
Recording queries used for testingPFOCR-augmentingCases A, B are also examples of PFOCR-augmenting being useful. Case A1-hop Predict: Gene Some results have an entity with an NCBIGene ID mapped to the NamedThing QNode - so they were used for PFOCR-augmenting. Response Case A query and console logsquery
console logs
Case B2-hop Explain: quinazoline -> Gene <- TXNIP-related gene set. From the July Translator QotM (last point in this post). Also tested as Query C in this earlier post. Response Case B query and console logsquery
Console logs
Some queries that aren't use cases (for testing only)From this earlier post, looking at performance. Previously used for testing by Jackson as well. A is from Feb-March QotM. B-E are inspired by July QotM. A. Valproic acid -> Gene -> heme -> ALAS1Took 2 min 6 sec to run. 525 results total -> 121 are augmented with PFOCR section.
B. Gene TXNIP -> GeneTook 14 sec to run. 502 results total -> 147 are augmented with PFOCR section.
D. Type 2 Diabetes -> BiologicalEntity <- TXNIPTook 6 min 1 sec to run. 577 results total -> 107 are augmented with PFOCR section.
E. Gene MLXIP -> Gene <- Gene CTGFTook 20 sec to run. 86 results total -> 65 are augmented with PFOCR section.
PFOCR-augmenting didn't happenChecking that PFOCR-augmenting is only happening in certain situations. Requirements from this post of the issue Case 11 Gene QNode with query
The console-logs correctly show that only 1 QNode maps to NCBIGene IDs in the results. It looks like PFOCR-augmenting didn't happen (no more console logs regarding PFOCR).
Case 21 Gene QNode with From July Translator QotM (last query of this post, which I "didn't run through ARS") response Case 2 query and console logsquery
The console-logs correctly show that only 1 QNode maps to NCBIGene IDs in the results. It looks like PFOCR-augmenting didn't happen (no more console logs regarding PFOCR).
However, there are results that have multiple NCBIGene IDs mapped to 1 QNode (n2) - so maybe theoretically pfocr-augmenting could be done on them. This is 1 example:
|
EDITED 2022-10-24 after discussion with Andrew: Tasks for devs:
|
@tokebe I've edited both of my posts above...The next things to do are now listed in the post right before this one. EDIT: And for myself, here's something to check after the first starred ⭐ task above is done ("adding Gene IDs/names")... Even when there are 2 Gene QNodes, a result's PFOCR section says it only maps to 1 QNode's IDs. Is that correct? See the example below from Case B. It's related to this PFOCR hit. example: result with pfocr section mapped to 1 QNode
Hmmm...it's interesting that Case 2 has similar results but PFOCR-augmenting doesn't happen, maybe because the query doesn't meet the criteria (>=2 QNodes have NCBIGene IDs). |
This pull request adds the first PFOCR data to the TRAPI results:
biothings/biothings_explorer#420
The fields for each PFOCR entry:
figureUrl
,pmc
andnodes
(query node IDs). We can add other fields likescore
in a future round.@tokebe, the only change since your comment "looks good" was minor cleanup.
@colleenXu, did you want to check this new code too? It's in the branch add_pfocr.