augment TRAPI results using PFOCR data #420

andrewsu · 2022-03-08T19:52:42Z

We have an API for PFOCR that can be queried for multiple entities like this: http://pending.biothings.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:10879%20AND%20associatedWith.mentions.genes.ncbigene:7098. Let's experiment with augmenting TRAPI results with links to PFOCR pathway figures. Since PFOCR is mostly gene-based at the moment, let's focus on TRAPI results objects with two or more genes in them. For each such results object, let's query the PFOCR API and populate results.pfocr like this:

        "results": [
            {
                "node_bindings": {
                    "n0": [{"id": "NCBIGene:64963"}],
                    "n1": [{"id": "NCBIGene:1017"}],
                    "n2": [{"id": "NCBIGene:695"}]
                },
                "edge_bindings": { ... },
                "score": 1,
                "pfocr": [
                    {
                        "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5845388/bin/fimmu-09-00427-g004.jpg",
                        "pubmed": "12345",
                        "pmc": "PMC5845388",
                        "nodes": ["n0", "n2"],
                        "score": 0.95
                    },
                    {
                        "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5845388/bin/fimmu-09-00427-g005.jpg",
                        "pubmed": "98765",
                        "pmc": "PMC5845388",
                        "nodes": ["n1", "n2"],
                        "score": 0.93
                    }
                ]
            }
        ]
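A minimal sketch of how the multi-gene query string above could be assembled. The base URL and field name are copied from the example query; the helper name `build_pfocr_query_url` is hypothetical, and the endpoint may have moved since this was written.

```python
from urllib.parse import urlencode

# Base URL and field name copied from the example query in this issue.
PFOCR_QUERY_URL = "http://pending.biothings.io/pfocr/query"
GENE_FIELD = "associatedWith.mentions.genes.ncbigene"


def build_pfocr_query_url(ncbigene_curies):
    """Return a GET URL asking for figures that mention ALL of the given genes."""
    if len(ncbigene_curies) < 2:
        raise ValueError("need at least two genes to look for a shared figure")
    # Strip the CURIE prefix, since the example query uses bare numeric IDs.
    bare_ids = [curie.split(":", 1)[-1] for curie in ncbigene_curies]
    q = " AND ".join(f"{GENE_FIELD}:{gene_id}" for gene_id in bare_ids)
    return f"{PFOCR_QUERY_URL}?{urlencode({'q': q})}"


url = build_pfocr_query_url(["NCBIGene:10879", "NCBIGene:7098"])
print(url)
```

`urlencode` percent-encodes the colons and turns spaces into `+`, which is equivalent to the `%20` form used in the hand-written URL.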


colleenXu · 2022-03-09T06:12:29Z

It sounds like we want to incorporate this into BTE first, after results assembly (either as default behavior or with a true/false parameter to control it)...

and maybe later make it an "endpoint" that can be used for the Translator workflow idea (operations)...

Does that sound correct?

andrewsu · 2022-03-09T06:31:43Z

yes, correct!

andrewsu · 2022-03-22T17:54:43Z

Adding some notes from my discussion with @AlexanderPico

- Good project for Yihang to work on with @ariutta
- Rather than doing an API call for each result, BTE could take the union of all entities in a result set and send them to the API in a single request (using a call like this). BTE can then sort out associating individual results with PFOCR records afterward.
- Linking individual TRAPI results with PFOCR entries is the first low-hanging fruit. Later, we could pitch adding a feature at the ARAX / Translator UI level to show how different results relate to each other.
- Separately, we should do a data update on the data underlying the https://biothings.ncats.io/pfocr API
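The "one request, then sort out the associations" idea could look roughly like the sketch below. It is only an illustration: the `figures` records use hypothetical keys (`figureUrl`, `genes`) rather than the real PFOCR response shape, and the threshold of two shared genes is an assumption taken from the focus on results with two or more genes.

```python
from itertools import chain


def union_of_genes(results):
    """Collect every NCBIGene CURIE bound anywhere in the result set."""
    bindings = chain.from_iterable(
        nodes for r in results for nodes in r["node_bindings"].values()
    )
    return {b["id"] for b in bindings if b["id"].startswith("NCBIGene:")}


def attach_figures(results, figures, min_shared=2):
    """Add a `pfocr` list to each result, using one shared batch of figures."""
    for result in results:
        gene_to_qnode = {
            b["id"]: qnode
            for qnode, nodes in result["node_bindings"].items()
            for b in nodes
        }
        result["pfocr"] = []
        for fig in figures:
            shared = sorted(fig["genes"].intersection(gene_to_qnode))
            if len(shared) >= min_shared:
                result["pfocr"].append({
                    "figureUrl": fig["figureUrl"],
                    "nodes": sorted({gene_to_qnode[g] for g in shared}),
                })
    return results


results = [
    {"node_bindings": {
        "n0": [{"id": "NCBIGene:64963"}],
        "n1": [{"id": "NCBIGene:1017"}],
        "n2": [{"id": "NCBIGene:695"}],
    }},
    {"node_bindings": {
        "n0": [{"id": "NCBIGene:64963"}],
        "n1": [{"id": "MONDO:0005015"}],
    }},
]
# Pretend these came back from a single request built from union_of_genes(results).
figures = [
    {"figureUrl": "fig-a", "genes": {"NCBIGene:64963", "NCBIGene:695", "NCBIGene:1"}},
    {"figureUrl": "fig-b", "genes": {"NCBIGene:1017"}},
]
all_genes = union_of_genes(results)
attach_figures(results, figures)
print(sorted(all_genes), results[0]["pfocr"], results[1]["pfocr"])
```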

ariutta · 2022-03-25T00:10:04Z

Moved a comment to a different repo: wikipathways/pathway-figure-ocr#16 (comment)

ariutta · 2022-05-10T19:28:17Z

@tokebe, any feedback on where this code should go? The basic idea: once the TrapiResultsAssembler finishes, we add a pfocr property to each TRAPI result. This requires calling an API to get PFOCR data, but the number of results will remain unchanged.

The code organization for getting scores looks like a good pattern to follow here. I can create an async annotate function and call it within TrapiResultsAssembler.update. With your recent cross-repo work on naming, I figured you might have an opinion.

tokebe · 2022-05-10T19:47:06Z

I agree that the current organization regarding scores seems like a good pattern to follow. I think a new file for this purpose makes the most sense. It might also be good to make a 'results-assembly' folder for all such supporting files, just for ease-of-navigation?

ariutta · 2022-05-13T18:04:05Z

@tokebe, most of the files appear to use snake_case for names, so how about a folder name of results_assembly? Or maybe TrapiResultsAssembler?

tokebe · 2022-05-13T18:07:29Z

Agreed, following the snake_case convention for filenames is probably best.

ariutta · 2022-05-13T18:07:37Z

@andrewsu, @yihangx is teaming on this with me, so I tried adding him to the assignees list, but it didn't let me. Maybe we need to change a permission somewhere or add him to BTE?

ariutta · 2022-05-13T18:13:36Z

As far as I can tell, the records are normalized for gene IDs to use NCBIGene, so that makes this easy for the first step -- adding pfocr data for genes -- because that's the datasource PFOCR uses as well.

When and if we add pfocr data for other types like diseases, we'll have to double check the datasource normalization. I think MESH is available in the normalized data, but the normalized primaryID appears to use MONDO.

ariutta · 2022-05-19T23:14:49Z

@andrewsu, in the example you gave, each result has a sub-sub-property pfocr.score like "score": 0.95, but with how we're querying the PFOCR API, I suspect this score is incorrect. That score would be correct if we made one PFOCR API query per result, but since we're trying to make just one query (see second bullet point), the score is based on all genes from all results. Should we drop pfocr.score?

ariutta · 2022-05-19T23:21:25Z

Another item to note: for some queries, we can get a large enough number of genes that the PFOCR API returns an error: "414 Request-URI Too Large". In other projects, I've gotten around an issue like this by using POST with a request body instead of GET with URI params.
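A rough sketch of the POST workaround, which builds the request without sending it. It assumes the PFOCR endpoint accepts the usual BioThings batch-query form fields `q` (comma-separated terms) and `scopes` (the field to match against); that assumption should be checked against the live API before relying on it. The helper name is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

PFOCR_QUERY_URL = "https://biothings.ncats.io/pfocr/query"
GENE_FIELD = "associatedWith.mentions.genes.ncbigene"


def build_post_request(gene_ids):
    """Build (but do not send) a POST request carrying the gene list in the body."""
    # ASSUMPTION: `q` and `scopes` are the form fields the batch endpoint expects.
    body = urlencode({"q": ",".join(gene_ids), "scopes": GENE_FIELD}).encode("utf-8")
    return Request(
        PFOCR_QUERY_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Thousands of IDs stay out of the URL, so its length no longer grows with the query.
request = build_post_request([str(n) for n in range(1, 5001)])
print(request.get_method(), len(request.full_url), len(request.data))
```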

ariutta · 2022-05-20T00:13:09Z

Related issue:
wikipathways/pathway-figure-ocr#24

andrewsu · 2022-05-20T04:39:37Z

Regarding the pfocr.score, I wasn't thinking of directly using the ES score that is returned by the PFOCR API. Rather, I was thinking that would be some simple score that we computed in the new code that you are writing. For example, that score might be the percentage of entities in the result that are also found in the figure. Or if we want to be a little more complex, it could be a chi-square statistic from a 2x2 contingency table (nicely worked out example in https://online.stat.psu.edu/statprogram/reviews/statistical-concepts/chi-square-tests).
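To make the two options concrete, here is a small sketch of the "percentage of entities" score and of the counts that would feed a 2x2 contingency table. The table layout and the `universe_size` background (total number of genes considered) are assumptions for illustration, not something settled in this thread.

```python
def overlap_fraction(result_genes, figure_genes):
    """Fraction of the result's genes that also appear in the figure."""
    result_genes = set(result_genes)
    if not result_genes:
        return 0.0
    return len(result_genes & set(figure_genes)) / len(result_genes)


def contingency_table(result_genes, figure_genes, universe_size):
    """2x2 counts: rows = in result / not in result, columns = in figure / not in figure."""
    result_genes, figure_genes = set(result_genes), set(figure_genes)
    both = len(result_genes & figure_genes)
    result_only = len(result_genes) - both
    figure_only = len(figure_genes) - both
    neither = universe_size - both - result_only - figure_only
    if neither < 0:
        raise ValueError("universe_size is smaller than the genes supplied")
    return [[both, result_only], [figure_only, neither]]


result_genes = {"A", "B", "C"}
figure_genes = {"A", "C", "X", "Y"}
fraction = overlap_fraction(result_genes, figure_genes)
table = contingency_table(result_genes, figure_genes, universe_size=20000)
print(fraction, table)
```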

AlexanderPico · 2022-05-24T17:42:53Z

Re: pfocr.score this online tool makes it easy to plug in numbers to see how it would work:
https://www.graphpad.com/quickcalcs/contingency1.cfm (see screenshot)

Either a chi-square or a Fisher's exact test would generate a p-value that could serve as a pfocr.score for a given figure and a given set of result genes.

ariutta · 2022-05-25T01:17:12Z

@erikyao, do you know how to get all hits for this API query:
https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:59272

There should be 317 hits. The API response correctly gives a total of 317, but in the hits field, there are just 10 items. Do I need to add a parameter to tell it to return all hits?

colleenXu · 2022-05-25T01:42:20Z

@ariutta I've been doing &size=1000, with the understanding that 1000 might be the max that can be returned. Yao would know more than me though.

https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:59272&size=1000

EDIT: yeah I get an error from trying to set size > 1000: https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:59272&size=2000

erikyao · 2022-05-25T16:49:57Z

> @erikyao, do you know how to get all hits for this API query: https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:59272
>
> There should be 317 hits. The API response correctly gives a total of 317, but in the hits field, there are just 10 items. Do I need to add a parameter to tell it to return all hits?

By default, only the top 10 hits are returned. The max is currently set to 1,000 (up to 10,000), and you can request it with the parameter &size=1000.

ariutta · 2022-05-26T00:39:20Z

We have a potential problem with the PFOCR API. As mentioned, the maximum size param is 1000, but some genes are in more than 1000 figures, e.g.:

| name | NCBIGene | figure count |
| --- | --- | --- |
| AKT1 | 207 | 11,343 |
| ATF2 | 1386 | 2,086 |
| PDK1 | 5163 | 1,983 |
| WNT11 | 7481 | 3,304 |

That means it's not possible to get all the figures for genes like AKT1. The first query only gets 1k out of >11k, and the second query fails:
https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:207&size=1000
https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:207&size=12000

Most or all of these are probably gene families, e.g., "WNT11" could be because the figure had "WNT", so we included all the WNTs.

Any suggestions? We could just ignore genes that show up in more than 1k figures.

erikyao · 2022-05-26T17:33:38Z

@ariutta @colleenXu sorry I forgot that there is a &fetch_all=true parameter that indicates fetching all documents. Give it a try!

ariutta · 2022-05-26T17:48:11Z

Thanks, @erikyao! This worked:
https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:207&fetch_all=true

erikyao · 2022-05-26T18:16:51Z

@ariutta @colleenXu its behavior should be identical to what is documented in https://docs.mygene.info/en/latest/doc/query_service.html#fetch-all

Note that when fetch_all=true, the response will contain a _scroll_id field, whose value is leveraged to fetch all the documents in batches of 1000. For more information on scroll ids, please refer to https://docs.mygene.info/en/latest/doc/query_service.html#scroll-id
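The scrolling behavior described above can be sketched as a simple loop. To keep the example runnable offline, the HTTP call is injected as `fetch_page`, a hypothetical callable that returns the decoded JSON response. Treating an empty `hits` list as the end of the scroll is also an assumption; the real service may signal completion differently (for example with an error body).

```python
def fetch_all_hits(fetch_page, query):
    """Collect every hit for `query` by following `_scroll_id` from batch to batch."""
    response = fetch_page({"q": query, "fetch_all": "true"})
    hits = list(response.get("hits", []))
    scroll_id = response.get("_scroll_id")
    while scroll_id:
        response = fetch_page({"scroll_id": scroll_id})
        page = response.get("hits", [])
        if not page:  # assumed end-of-scroll signal
            break
        hits.extend(page)
        scroll_id = response.get("_scroll_id", scroll_id)
    return hits


# Stand-in for the real API: three batches followed by an empty one.
FAKE_PAGES = {
    "start": {"_scroll_id": "s1", "total": 5, "hits": [{"_id": "f1"}, {"_id": "f2"}]},
    "s1": {"_scroll_id": "s2", "hits": [{"_id": "f3"}, {"_id": "f4"}]},
    "s2": {"_scroll_id": "s3", "hits": [{"_id": "f5"}]},
    "s3": {"hits": []},
}


def fake_fetch_page(params):
    return FAKE_PAGES[params.get("scroll_id", "start")]


all_hits = fetch_all_hits(
    fake_fetch_page, "associatedWith.mentions.genes.ncbigene:207"
)
print([hit["_id"] for hit in all_hits])
```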

ariutta · 2022-07-20T15:54:10Z

For this first iteration, I just worked with NCBIGene identifiers as the first step in getting PFOCR results into BTE. Next, we'll want to handle other identifiers. PFOCR uses NCBIGene for genes and MESH for chemicals and diseases, but we want to match TRAPI results that use other identifiers. I know there's been work related to this, and I'd like to have a meeting sometime to discuss the best way of handling this.

ariutta · 2022-08-05T18:44:33Z

For much of the latest discussion on this issue, please refer to the PR:
biothings/bte_trapi_query_graph_handler#109

colleenXu · 2022-08-10T17:19:49Z

Some discussion from 8/10 lab meeting on Translator stuff:

Plan: have Chunlei do some more investigation

- look into how to do batch queries where each entry has set-logic for multiple fields (and these entries can have varying lengths of stuff to match)
- if we can't do this on the PFOCR API, then we could do this as post-processing within the custom PFOCR handler in BTE

Not clear which dev works on this next (Jackson? Due to templating knowledge?)

Logic:

- Iterate through each TRAPI result. If it has NCBIGene IDs in >=2 QNodes, proceed. Otherwise, don't use that TRAPI result in the next steps.
- Generate batch queries to the PFOCR API. This is the complicated step, since we want each entry to represent the set-based logic for 1 TRAPI result (we want figures with genes X and Y, and in the future we may also want the figures to have diseases A and B in that same entry too).
  - Will use templating and set-logic stuff (haven't been used yet in queries…)
- Send them out. If scrolling is necessary to retrieve all matching hits for all entries, do that (relates to previous row's notes).
- Proceed with whatever logic Anders is doing to pick figures (just the top 20?) and write the pfocr sections for each TRAPI result.

Complication: entries are sometimes different sizes (how many and would involve multiple fields in the future
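The first step of the logic above (keep only results with NCBIGene IDs in at least two QNodes) might be sketched like this. The function names are hypothetical; the input follows the `node_bindings` shape shown in the first post.

```python
def gene_ids_by_qnode(result):
    """Map each QNode ID to the NCBIGene CURIEs bound to it (QNodes without genes are omitted)."""
    genes = {}
    for qnode_id, bindings in result["node_bindings"].items():
        ids = [b["id"] for b in bindings if b["id"].startswith("NCBIGene:")]
        if ids:
            genes[qnode_id] = ids
    return genes


def eligible_results(results, min_qnodes=2):
    """Keep only results with NCBIGene IDs in at least `min_qnodes` QNodes."""
    return [r for r in results if len(gene_ids_by_qnode(r)) >= min_qnodes]


sample_results = [
    {"node_bindings": {
        "n0": [{"id": "NCBIGene:55768"}],
        "n1": [{"id": "NCBIGene:1956"}],
        "n2": [{"id": "NCBIGene:1950"}],
    }},
    {"node_bindings": {
        "n0": [{"id": "NCBIGene:55768"}],
        "n1": [{"id": "MONDO:0005015"}],
    }},
]
kept = eligible_results(sample_results)
print(len(kept), gene_ids_by_qnode(kept[0]))
```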

tokebe · 2022-09-07T19:38:56Z

I've done some additional testing and can confirm that I've achieved parity of results between Anders' code and my updates using the new POST method.

There were some differing figures between the two that I've now confirmed to be exclusively due to minute differences in the order figures are received/processed, causing different results to be truncated when trimming down to 20 figures.

I believe the implementation of figure sorting by score has been discussed above, but not yet implemented, so that'll be my next task after cleaning up my implementation and updating the PR for review.
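One way to make the truncation independent of arrival order is to sort before trimming, with an explicit tie-breaker. The sketch below assumes each figure already carries a numeric `score`; breaking ties on `figureUrl` is an arbitrary choice made here for illustration.

```python
def top_figures(figures, limit=20):
    """Keep the `limit` best figures, sorted by descending score.

    Ties are broken by figureUrl so the outcome does not depend on arrival order.
    """
    return sorted(figures, key=lambda f: (-f["score"], f["figureUrl"]))[:limit]


# Thirty made-up figures with many tied scores.
figures = [{"figureUrl": f"fig-{i:02d}", "score": (i % 5) / 10} for i in range(30)]
forward = top_figures(figures)
backward = top_figures(list(reversed(figures)))
print(forward[0], forward == backward)
```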

tokebe · 2022-09-13T18:40:38Z

@andrewsu RE: chi-square p-value for scoring, do we want a higher or lower p-value to be "better"?

I've implemented a working prototype of the behavior using a package I found, however documentation is a little...sparse, so I'm not 100% sure this will be acceptable. Unfortunately, not many relevant packages seem readily-available/working.

andrewsu · 2022-09-13T19:05:23Z

in general the lower p-value will be considered "better"...

tokebe · 2022-09-13T19:22:02Z

Ok, I asked because I'm not entirely sure I understand what the chi-square test would be testing in this case...the null hypothesis here is a little unclear to me.

That said, I'll push my changes to the PR...a PFOCR figure score in results is currently defined as 1-p where p is the p-value. Let me know if you'd prefer me to just leave it as the raw p-value.
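For reference, the 1-p definition can be sketched with the standard library alone. This is not the package used in the PR; it is a plain Pearson chi-square test (no continuity correction) on a 2x2 table of counts, and how the table is filled in for a figure and a result remains an open assumption.

```python
import math


def chi_square_p_value(table):
    """P-value of a Pearson chi-square test on a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 1.0  # a row or column is empty, so the statistic is undefined
    statistic = n * (a * d - b * c) ** 2 / denominator
    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2))


def figure_score(table):
    """Score defined as 1 - p, so that higher means a stronger association."""
    return 1 - chi_square_p_value(table)


p_value = chi_square_p_value([[10, 20], [30, 40]])
print(p_value, figure_score([[1, 1], [1, 1]]))
```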

andrewsu · 2022-09-13T19:51:00Z

Yeah, to be clear, this is a really sketchy use of the chi square test because the counts in our 2x2 contingency table are so small. (If the path has four nodes, then the largest the minimum cell count could be is 2, and that's much smaller than what you'd want.) So this sort of ranking of PFOCR figures for a given result is a very crude ranking metric, and it may end up being so crude that we'd take it out after we actually look at how it behaves...

colleenXu · 2022-10-19T05:23:25Z

Organizing previous discussions:

Requirements

- first post of this issue shows what Andrew originally envisioned. The scope was narrowed to genes only. The pfocr section would include figureUrl, pmc, QNodeIDs, and a score
- pfocr section included for results that have NCBIGene IDs mapped to >= 2 QNodes. Discussed here, here, and here
  - Note that NCBIGene is the top of the Gene ID-namespace priority list, so the entity's key should be NCBIGene-ID if it has a mapping to this namespace
- to match figures with TRAPI results, we want >=2 NCBIGene IDs to be annotated to the figure and to the result. So 1 figure could show up in multiple results' pfocr sections, and 1 result may match many figures.
  - figure retrieval / matching uses scrolling behavior and templated / complex querying logic
- scores: discussed in these May posts, chi-square implementation in September starting with this post
- each result includes the top-scored figures up to a max of 20 (requirement set
- have informative TRAPI-level AND console logs

Future directions

- have a true/false parameter to control whether this behavior is done
- are there ways to improve the figure retrieval process (genes connected to lots of figures or errors in figure annotation)?
- are there ways to improve the scoring process (sorting figures)?
- Including chemicals / diseases annotated to figures + in results. Issue with finding the IDs in the node-object?
- allowing non-exact matches between figure entities and entities in results?
- grouping results together?
- making a module / endpoint for use by others?

colleenXu · 2022-12-22T21:51:09Z

biothings/bte_trapi_query_graph_handler#131

AlexanderPico · 2023-02-08T23:13:37Z

I agree that chi square and Fisher's Exact Test are not ideal for these small n comparisons. Here's an alternative that was implemented by NDEx iQuery to address this same issue:

Cosine similarity: This score characterizes the similarity between the query set and the genes in the pathway while considering that some genes are much more universal than others and will appear in many more pathways. So, it takes overall frequencies into consideration without applying (or assuming) a rigorous statistical test. Implemented here in Java for a REST service. @tokebe
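A hedged sketch of the general idea, not the NDEx iQuery implementation: each gene gets an inverse-document-frequency style weight (the `log(1 + N / count)` form is an assumption chosen here), and the query and pathway are compared as weighted binary vectors. The toy pathways are made up.

```python
import math


def idf_weights(pathways):
    """Weight per gene that shrinks as the gene appears in more pathways."""
    n = len(pathways)
    counts = {}
    for genes in pathways.values():
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    return {gene: math.log(1 + n / count) for gene, count in counts.items()}


def cosine_similarity(query_genes, pathway_genes, weights):
    """Cosine between two gene sets viewed as IDF-weighted binary vectors."""
    query_genes, pathway_genes = set(query_genes), set(pathway_genes)

    def norm(genes):
        return math.sqrt(sum(weights.get(g, 0.0) ** 2 for g in genes))

    dot = sum(weights.get(g, 0.0) ** 2 for g in query_genes & pathway_genes)
    denominator = norm(query_genes) * norm(pathway_genes)
    return dot / denominator if denominator else 0.0


pathways = {
    "p1": {"AKT1", "EGFR", "EGF"},
    "p2": {"AKT1", "NGLY1", "DDI2"},
    "p3": {"AKT1", "WNT11"},
}
weights = idf_weights(pathways)
# Sharing a rare gene (NGLY1) counts for more than sharing a ubiquitous one (AKT1).
rare = cosine_similarity({"NGLY1", "EGFR"}, pathways["p2"], weights)
common = cosine_similarity({"AKT1", "EGFR"}, pathways["p2"], weights)
print(rare, common)
```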

tokebe · 2023-03-28T20:37:32Z

Deployed to prod 🚀

@andrewsu I assume anything in the future related to this would be discussed in a new issue?

andrewsu · 2023-04-12T21:27:56Z

I posted the query below to our prod instance (https://bte.transltr.io/v1/query) and got the answer snippet below which includes a pfocr section in the results. All is working as intended, so closing this issue.

Query: NGLY1 - [Gene] - [Gene]

{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": [
                        "NCBIGene:55768"
                    ]
                },
                "n1": {
                    "categories": [
                        "biolink:Gene"
                    ]
                },
                "n2": {
                    "categories": [
                        "biolink:Gene"
                    ]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e02": {
                    "subject": "n1",
                    "object": "n2"
                }
            }
        }
    }
}

output snippet

        "results": [
            {
                "node_bindings": {
                    "n0": [
                        {
                            "id": "NCBIGene:55768"
                        }
                    ],
                    "n1": [
                        {
                            "id": "NCBIGene:1956"
                        }
                    ],
                    "n2": [
                        {
                            "id": "NCBIGene:1950"
                        }
                    ]
                },
                "edge_bindings": {
                    "e01": [
                        {
                            "id": "5aead8f41af2158496a3b1f29752b3b1"
                        },
                        {
                            "id": "02d8e265a19107ccdd6763dfb4a8163c"
                        },
                        {
                            "id": "8d67d590d0cfbc7e6b64fe666da3b849"
                        }
                    ],
                    "e02": [
                        {
                            "id": "e7ff2726ea4f571da3b1d0339a7ccbd0"
                        },
                        {
                            "id": "b78edf7537b2e885ce5b5fb95cbcb5c8"
                        }
                    ]
                },
                "score": 9.643175465683468,
                "pfocr": [
                    {
                        "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3983693/bin/nihms549611f6.jpg",
                        "pmc": "PMC3983693",
                        "nodes": [
                            "n1",
                            "n2"
                        ],
                        "matchedCuries": [
                            "NCBIGene:1956",
                            "NCBIGene:1950"
                        ],
                        "score": 0.5714285714285715
                    },
                    {
                        "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3304012/bin/nihms-350548-f0003.jpg",
                        "pmc": "PMC3304012",
                        "nodes": [
                            "n1",
                            "n2"
                        ],
                        "matchedCuries": [
                            "NCBIGene:1956",
                            "NCBIGene:1950"
                        ],
                        "score": 0.5
                    },
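For anyone consuming this output, a small sketch of walking a response and listing the pfocr entries per result. It assumes the standard TRAPI layout, with results under `message.results`; the sample data reuses values from the snippet above, and the helper name is hypothetical.

```python
def summarize_pfocr(response):
    """Yield (result index, figure URL, matched CURIEs, score) for every pfocr entry."""
    results = response.get("message", {}).get("results", [])
    for index, result in enumerate(results):
        for figure in result.get("pfocr", []):
            yield (
                index,
                figure["figureUrl"],
                figure.get("matchedCuries", []),
                figure["score"],
            )


sample_response = {
    "message": {
        "results": [
            {
                "score": 9.643175465683468,
                "pfocr": [
                    {
                        "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3983693/bin/nihms549611f6.jpg",
                        "pmc": "PMC3983693",
                        "nodes": ["n1", "n2"],
                        "matchedCuries": ["NCBIGene:1956", "NCBIGene:1950"],
                        "score": 0.5714285714285715,
                    }
                ],
            },
            {"score": 1.0},  # a result without a pfocr section is skipped
        ]
    }
}
rows = list(summarize_pfocr(sample_response))
for row in rows:
    print(row)
```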

andrewsu · 2023-04-25T22:45:40Z

Showing one more result from the NGLY1 - [Gene] - [Gene] example above that may illustrate the value of this work.

The second ranked result corresponds to NGLY1 - NFE2 - DDI2.

                    {
                        "node_bindings": {
                            "n0": [ { "id": "NCBIGene:55768" } ],
                            "n1": [ { "id": "NCBIGene:4779" } ],
                            "n2": [ { "id": "NCBIGene:84301" } ]
                        },
                        "edge_bindings": {
                            ...
                        },
                        "score": 7.260500069613187,
                        "pfocr": [
                            { ... },
                            {
                                "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5704294/bin/oc-2017-00224x_0001.jpg",
                                "pmc": "PMC5704294",
                                "nodes": [
                                    "n0",
                                    "n2"
                                ],
                                "matchedCuries": [
                                    "NCBIGene:84301",
                                    "NCBIGene:55768"
                                ],
                                "score": 0.3333333333333333
                            },

The noted figure shows great context on how these three genes are related (plus a pointer to a highly relevant manuscript)

andrewsu assigned ariutta Mar 8, 2022

andrewsu mentioned this issue Apr 5, 2022

update data in PFOCR API biothings/pending.api#64

Closed

ariutta assigned ariutta and unassigned ariutta May 13, 2022

andrewsu assigned yihangx May 16, 2022

ariutta mentioned this issue May 25, 2022

PFOCR for prioritization / clustering #451

Closed

ariutta added a commit to biothings/bte_trapi_query_graph_handler that referenced this issue Jul 15, 2022

Add PFOCR data. Close biothings/biothings_explorer#420

37653be

ariutta mentioned this issue Jul 15, 2022

Augment TRAPI results using PFOCR data biothings/bte_trapi_query_graph_handler#109

Merged

ariutta added a commit to biothings/bte_trapi_query_graph_handler that referenced this issue Jul 15, 2022

feat; add PFOCR data. close biothings/biothings_explorer#420

c8995e8

ariutta added a commit to biothings/bte_trapi_query_graph_handler that referenced this issue Jul 15, 2022

feat: add PFOCR data. close biothings/biothings_explorer#420

1d7fc01

ariutta added a commit to biothings/bte_trapi_query_graph_handler that referenced this issue Jul 15, 2022

feat: add PFOCR figure data. close biothings/biothings_explorer#420

9b3af00

erikyao mentioned this issue Aug 31, 2022

Support "minimum_should_match" in pfocr API biothings/pending.api#88

Closed

tokebe self-assigned this Sep 1, 2022

andrewsu unassigned ariutta and yihangx Sep 21, 2022

tokebe closed this as completed in biothings/bte_trapi_query_graph_handler@24c3d4c Dec 22, 2022

colleenXu reopened this Dec 22, 2022

tokebe mentioned this issue Mar 8, 2023

Deploying Biolink 3.1.x to Prod: Organizing Items #578

Closed

andrewsu closed this as completed Apr 12, 2023

colleenXu mentioned this issue Apr 19, 2023

overview and management of TRAPI 1.4 features #613

Closed

15 tasks

colleenXu mentioned this issue Jan 31, 2024

Investigate PFOCR options (strict, synonyms, all) for BTE use #778

Closed
