augment TRAPI results using PFOCR data #420
Comments
It sounds like we want to incorporate this into BTE first after results assembly (either as default behavior or with true/false parameter to control it).... and maybe later make it an "endpoint" that can be used for the Translator workflow idea (operations)... Does that sound correct? |
yes, correct! |
Adding some notes from my discussion with @AlexanderPico
|
Moved a comment to a different repo: wikipathways/pathway-figure-ocr#16 (comment) |
@tokebe, any feedback on where this code should go? The basic idea: once the The code organization for getting scores looks like a good pattern to follow here. I can create an async |
I agree that the current organization regarding scores seems like a good pattern to follow. I think a new file for this purpose makes the most sense. It might also be good to make a 'results-assembly' folder for all such supporting files, just for ease-of-navigation? |
@tokebe, most of the files appear to use snake_case for names, so how about a folder name of |
Agreed, following the snake_case convention for filenames is probably best. |
As far as I can tell, the records are normalized for gene IDs to use NCBIGene, so that makes this easy for the first step -- adding pfocr data for genes -- because that's the datasource PFOCR uses as well. When and if we add pfocr data for other types like diseases, we'll have to double check the datasource normalization. I think MESH is available in the normalized data, but the normalized |
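For the gene-only first step described above, a minimal sketch of pulling NCBIGene identifiers out of a list of normalized CURIEs might look like the following. The function name and the assumption that identifiers arrive as plain `prefix:id` strings are illustrative only and do not reflect BTE's actual record structure.

```python
def ncbigene_ids(curies):
    """Return the local IDs of all NCBIGene CURIEs, in order, without duplicates.

    Assumption: each identifier is a plain "prefix:id" string.
    """
    prefix = "NCBIGene:"
    found = []
    for curie in curies:
        if curie.startswith(prefix):
            local_id = curie[len(prefix):]
            # Skip empty IDs and IDs already collected.
            if local_id and local_id not in found:
                found.append(local_id)
    return found
```

Identifiers with other prefixes (for example MESH) are simply skipped here, which matches the note that other types need their own normalization check first.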
@andrewsu, in the example you gave, each |
Another item to note: for some queries, we can get a large enough number of genes that the PFOCR API returns an error: "414 Request-URI Too Large". In other projects, I've gotten around an issue like this by using |
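The workaround referred to above is cut off in this thread. As one generic way to keep request URLs short (not necessarily the approach used in the PR, which per a later comment moved to a POST method), the gene list could be split into fixed-size batches, queried separately, and the hits merged afterwards. The helper name and batch size below are assumptions.

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]
```

Each batch would then go into its own request. Batching like this only preserves results when the per-gene lookups are independent, so it would not suit a query that requires all genes to co-occur in one figure.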
Related issue: |
Regarding the |
Re: Either a chi-sq or Fisher's would generate a p-value that could serve a |
@erikyao, do you know how to get all hits for this API query: There should be 317 hits. The API response correctly gives a |
@ariutta I've been doing &size=1000, with the understanding that 1000 might be the max that can be returned. Yao would know more than me though. https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:59272&size=1000 EDIT: yeah I get an error from trying to set size > 1000: https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:59272&size=2000 |
By default only the top 10 hits are returned. The max is currently set to 1,000 (up to 10,000), and can be implemented with a parameter |
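A rough sketch of composing such a query URL, using the endpoint, field name, and `size` parameter that appear in the URLs quoted in this thread. The function name is hypothetical, and the 1,000 cap is the limit reported above rather than something checked against the service.

```python
from urllib.parse import quote, urlencode

PFOCR_QUERY_URL = "https://biothings.ncats.io/pfocr/query"
GENE_FIELD = "associatedWith.mentions.genes.ncbigene"
MAX_SIZE = 1000  # limit reported in this thread


def pfocr_query_url(ncbigene_ids, size=10):
    """Build a query URL requiring every given gene, returning up to `size` hits."""
    if not ncbigene_ids:
        raise ValueError("at least one gene ID is required")
    if not 0 <= size <= MAX_SIZE:
        raise ValueError(f"size must be between 0 and {MAX_SIZE}")
    # Terms are joined with AND, as in the multi-gene query in the issue description.
    q = " AND ".join(f"{GENE_FIELD}:{gene_id}" for gene_id in ncbigene_ids)
    params = {"q": q, "size": size}
    # Keep ':' readable and encode spaces as %20 rather than '+'.
    return f"{PFOCR_QUERY_URL}?{urlencode(params, safe=':', quote_via=quote)}"
```

Calling `pfocr_query_url(["59272"], size=1000)` reproduces the first URL quoted above, and a `size` over 1,000 is rejected locally instead of being sent to the API.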
We have a potential problem with the PFOCR API. As mentioned, the maximum
That means it's not possible to get all the figures for genes like AKT1. The first query only gets 1k out of >11k, and the second query fails: Most or all of these are probably gene families, e.g., "WNT11" could be because the figure had "WNT", so we included all the WNTs. Any suggestions? We could just ignore genes that show up in more than 1k figures. |
@ariutta @colleenXu sorry I forgot that there is a |
@ariutta @colleenXu its behavior should be identical to what is documented in https://docs.mygene.info/en/latest/doc/query_service.html#fetch-all. Note that when |
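Going by the linked MyGene.info documentation, a scroll-style loop could collect all hits beyond the 1,000 limit. This is only a sketch: it assumes the PFOCR API mirrors the `fetch_all` / `_scroll_id` / `scroll_id` behavior described there, and `get_page` is a hypothetical callable that sends the given query parameters and returns the parsed JSON response.

```python
def fetch_all_hits(get_page, query):
    """Collect every hit for `query` by following scroll IDs until a page has no hits.

    `get_page(params)` is a caller-supplied function (hypothetical) that performs
    the HTTP request and returns the decoded JSON body as a dict.
    """
    hits = []
    page = get_page({"q": query, "fetch_all": "true"})
    scroll_id = None
    while page.get("hits"):
        hits.extend(page["hits"])
        # Reuse the previous scroll ID if a page does not repeat it.
        scroll_id = page.get("_scroll_id", scroll_id)
        if scroll_id is None:
            break
        page = get_page({"scroll_id": scroll_id})
    return hits
```

Passing the request function in keeps the loop independent of any particular HTTP client and makes it easy to exercise with canned responses.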
For this first iteration, I just worked with NCBIGene identifiers as the first step in getting PFOCR results into BTE. Next, we'll want to handle other identifiers. PFOCR uses |
For much of the latest discussion on this issue, please refer to the PR: |
I've done some additional testing and can confirm that I've achieved parity of results between Anders' code and my updates using the new POST method. There were some differing figures between the two, which I've now confirmed to be exclusively due to minute differences in the order in which figures are received/processed, causing different figures to be cut when trimming down to 20. I believe figure sorting by score has been discussed above but not yet implemented, so that'll be my next task after cleaning up my implementation and updating the PR for review. |
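One way to make the trim to 20 figures independent of arrival order is to sort on score with a stable tie-breaker before truncating. The sketch below assumes each figure is a dict with hypothetical `score` and `figure_id` keys; the real field names in the PR may differ.

```python
def top_figures(figures, limit=20):
    """Return the `limit` highest-scoring figures, breaking ties by figure ID.

    Assumes hypothetical keys "score" (higher is better) and "figure_id".
    """
    ranked = sorted(figures, key=lambda fig: (-fig["score"], fig["figure_id"]))
    return ranked[:limit]
```

Because ties are resolved by ID, two runs that receive the same figures in a different order keep the same 20.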
@andrewsu RE: chi-square p-value for scoring, do we want a higher or lower p-value to be "better"? I've implemented a working prototype of the behavior using a package I found, however documentation is a little...sparse, so I'm not 100% sure this will be acceptable. Unfortunately, not many relevant packages seem readily-available/working. |
in general the lower p-value will be considered "better"... |
Ok, I asked because I'm not entirely sure I understand what the chi-square test would be testing in this case...the null hypothesis here is a little unclear to me. That said, I'll push my changes to the PR...a PFOCR figure score in results is currently defined as |
Yeah, to be clear, this is a really sketchy use of the chi square test because the counts in our 2x2 contingency table are so small. (If the path has four nodes, then the largest the minimum cell count could be is 2, and that's much smaller than what you'd want.) So this sort of ranking of PFOCR figures for a given result is a very crude ranking metric, and it may end up being so crude that we'd take it out after we actually look at how it behaves... |
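To illustrate how little room such small counts leave, here is a generic Pearson chi-square computation for a 2x2 table (no continuity correction, one degree of freedom). What each cell should count for a PFOCR figure is not settled in this thread, so the cell names are placeholders and this is not the package used in the PR.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be non-zero")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With one degree of freedom the survival function reduces to erfc.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value
```

With only four observations the statistic cannot exceed 4, so the smallest reachable p-value is about 0.046 (the table `[[2, 0], [0, 2]]`), which shows how coarse the resulting ranking is.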
Organizing previous discussions:
Requirements
Future directions
|
I agree that chi square and Fisher's Exact Test are not ideal for these small n comparisons. Here's an alternative that was implemented by NDEx iQuery to address this same issue: Cosine similarity: This score characterizes the similarity between the query set and the genes in the pathway while considering that some genes are much more universal than others and will appear in many more pathways. So, it takes into consideration overall frequencies without applying (assuming) a rigorous statistical test. Implemented here in Java as for a REST service. @tokebe |
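As a rough illustration of the idea, a frequency-weighted cosine similarity between a query gene set and a figure's gene set could look like this. It is a generic inverse-frequency sketch, not the NDEx iQuery formula, and the function and parameter names are assumptions.

```python
import math


def weighted_cosine(query_genes, figure_genes, figure_counts, total_figures):
    """Cosine similarity between two gene sets with rarer genes weighted more.

    `figure_counts[gene]` is the number of figures mentioning that gene, and
    `total_figures` is the size of the whole collection (both assumed inputs).
    """
    def weight(gene):
        # Genes found in many figures get a weight close to zero.
        return math.log(total_figures / figure_counts[gene])

    q = {gene: weight(gene) for gene in set(query_genes)}
    f = {gene: weight(gene) for gene in set(figure_genes)}
    dot = sum(q[gene] * f[gene] for gene in q.keys() & f.keys())
    norm = math.sqrt(sum(w * w for w in q.values())) * math.sqrt(sum(w * w for w in f.values()))
    return dot / norm if norm else 0.0
```

A figure that shares a rarely mentioned gene with the query scores higher than one that shares only a ubiquitous gene, which is the behavior described above for genes that appear in many pathways.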
Deployed to prod 🚀 @andrewsu I assume anything in the future related to this would be discussed in a new issue? |
I posted the query below to our prod instance (https://bte.transltr.io/v1/query) and got the answer snippet below which includes a Query: NGLY1 - [Gene] - [Gene]
output snippet
|
Showing one more result from the The second ranked result corresponds to
The noted figure shows great context on how these three genes are related (plus a pointer to a highly relevant manuscript) |
We have an API for PFOCR that can be queried for multiple entities like this: http://pending.biothings.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:10879%20AND%20associatedWith.mentions.genes.ncbigene:7098. Let's experiment with augmenting TRAPI results with links to PFOCR pathway figures. Since PFOCR is mostly gene-based at the moment, let's focus on TRAPI results objects with two or more genes in them. For each such results object, let's query the PFOCR API and populate
results.pfocr
like this: