augment TRAPI results using PFOCR data #420

Closed
andrewsu opened this issue Mar 8, 2022 · 45 comments

@andrewsu
Member

andrewsu commented Mar 8, 2022

We have an API for PFOCR that can be queried for multiple entities like this: http://pending.biothings.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:10879%20AND%20associatedWith.mentions.genes.ncbigene:7098. Let's experiment with augmenting TRAPI results with links to PFOCR pathway figures. Since PFOCR is mostly gene-based at the moment, let's focus on TRAPI results objects with two or more genes in them. For each such results object, let's query the PFOCR API and populate results.pfocr like this:

        "results": [
            {
                "node_bindings": {
                    "n0": [{"id": "NCBIGene:64963"}],
                    "n1": [{"id": "NCBIGene:1017"}],
                    "n2": [{"id": "NCBIGene:695"}]
                },
                "edge_bindings": { ... },
                "score": 1,
                "pfocr": [
                    {
                        "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5845388/bin/fimmu-09-00427-g004.jpg",
                        "pubmed": "12345",
                        "pmc": "PMC5845388",
                        "nodes": ["n0", "n2"],
                        "score": 0.95
                    },
                    {
                        "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5845388/bin/fimmu-09-00427-g005.jpg",
                        "pubmed": "98765",
                        "pmc": "PMC5845388",
                        "nodes": ["n1", "n2"],
                        "score": 0.93
                    }
                ]
            }
        ]
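A minimal sketch of how the multi-gene query string in the example URL could be assembled. The base URL and field name are copied from that URL; the function name `build_pfocr_query_url` is hypothetical.

```python
from urllib.parse import quote

# Base URL and field name come from the example query in this comment;
# the function name is a hypothetical helper, not existing BTE code.
PFOCR_QUERY_URL = "http://pending.biothings.io/pfocr/query"
GENE_FIELD = "associatedWith.mentions.genes.ncbigene"


def build_pfocr_query_url(ncbigene_ids):
    """Return a URL asking for figures that mention ALL of the given genes."""
    clauses = [f"{GENE_FIELD}:{gene_id}" for gene_id in ncbigene_ids]
    q = " AND ".join(clauses)
    # Percent-encode the spaces; keep ':' readable as in the example URL.
    return f"{PFOCR_QUERY_URL}?q={quote(q, safe=':')}"


url = build_pfocr_query_url(["10879", "7098"])
print(url)
```

With the two gene IDs from the example, this reproduces the URL quoted at the top of this comment.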
@colleenXu
Collaborator

It sounds like we want to incorporate this into BTE first after results assembly (either as default behavior or with true/false parameter to control it)....

and maybe later make it an "endpoint" that can be used for the Translator workflow idea (operations)...

Does that sound correct?

@andrewsu
Member Author

andrewsu commented Mar 9, 2022

yes, correct!

@andrewsu
Member Author

Adding some notes from my discussion with @AlexanderPico

  • Good project for Yihang to work on with @ariutta
  • Rather than doing an API call for each result, BTE could take the union of all entities in a result set and send them to the API in a single request (using a call like this), and then BTE can sort out associating individual results with PFOCR records after
  • Linking individual TRAPI results with PFOCR entries is the first low-hanging fruit. Later, we could pitch adding a feature at the ARAX / Translator UI level to show how different results relate to each other
  • Separately, we should do a data update on the data underlying the https://biothings.ncats.io/pfocr API
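The "one request, then sort it out afterward" idea from the second bullet could look roughly like the sketch below. It assumes a simplified figure record of the form `{"id": ..., "genes": [...]}`, which is not the actual PFOCR response shape, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

PREFIX = "NCBIGene:"


def collect_gene_ids(results):
    """Union of all NCBIGene IDs (prefix stripped) across a list of TRAPI results."""
    gene_ids = set()
    for result in results:
        for bindings in result["node_bindings"].values():
            for binding in bindings:
                if binding["id"].startswith(PREFIX):
                    gene_ids.add(binding["id"][len(PREFIX):])
    return gene_ids


def index_figures_by_gene(figures):
    """Map each gene ID to the set of figure IDs that mention it."""
    index = defaultdict(set)
    for figure in figures:
        for gene_id in figure["genes"]:
            index[gene_id].add(figure["id"])
    return index


def figures_sharing_genes(result, index, min_shared=2):
    """Figure IDs that share at least `min_shared` genes with one result."""
    counts = Counter()
    for gene_id in collect_gene_ids([result]):
        for figure_id in index.get(gene_id, ()):
            counts[figure_id] += 1
    return sorted(fid for fid, n in counts.items() if n >= min_shared)


results = [{"node_bindings": {
    "n0": [{"id": "NCBIGene:64963"}],
    "n1": [{"id": "NCBIGene:1017"}],
    "n2": [{"id": "NCBIGene:695"}],
}}]
# Made-up figure records standing in for a single bulk API response.
figures = [
    {"id": "F1", "genes": ["64963", "695"]},
    {"id": "F2", "genes": ["1017"]},
    {"id": "F3", "genes": ["1017", "695", "999"]},
]
all_genes = collect_gene_ids(results)           # sent to the API in one request
index = index_figures_by_gene(figures)          # built once from the response
matches = figures_sharing_genes(results[0], index)
print(sorted(all_genes), matches)
```

Building the gene-to-figure index once keeps the per-result association step cheap even when there are many results.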

@ariutta
Collaborator

ariutta commented Mar 25, 2022

Moved a comment to a different repo: wikipathways/pathway-figure-ocr#16 (comment)

@ariutta
Collaborator

ariutta commented May 10, 2022

@tokebe, any feedback on where this code should go? The basic idea: once the TrapiResultsAssembler finishes, we add a pfocr property to each TRAPI result. This requires calling an API to get PFOCR data, but the number of results will remain unchanged.

The code organization for getting scores looks like a good pattern to follow here. I can create an async annotate function and call it within TrapiResultsAssembler.update. With your recent cross-repo work on naming, I figured you might have an opinion.

@tokebe
Member

tokebe commented May 10, 2022

I agree that the current organization regarding scores seems like a good pattern to follow. I think a new file for this purpose makes the most sense. It might also be good to make a 'results-assembly' folder for all such supporting files, just for ease-of-navigation?

@ariutta
Collaborator

ariutta commented May 13, 2022

@tokebe, most of the files appear to use snake_case for names, so how about a folder name of results_assembly? Or maybe TrapiResultsAssembler?

@ariutta ariutta assigned ariutta and unassigned ariutta May 13, 2022
@tokebe
Member

tokebe commented May 13, 2022

Agreed, following the snake_case convention for filenames is probably best.

@ariutta
Collaborator

ariutta commented May 13, 2022

@andrewsu, @yihangx is teaming on this with me, so I tried adding him to the assignees list, but it didn't let me. Maybe we need to change a permission somewhere or add him to BTE?

@ariutta
Collaborator

ariutta commented May 13, 2022

As far as I can tell, the records are normalized for gene IDs to use NCBIGene, so that makes this easy for the first step -- adding pfocr data for genes -- because that's the datasource PFOCR uses as well.

When and if we add pfocr data for other types like diseases, we'll have to double check the datasource normalization. I think MESH is available in the normalized data, but the normalized primaryID appears to use MONDO.

@ariutta
Collaborator

ariutta commented May 19, 2022

@andrewsu, in the example you gave, each result has a sub-sub-property pfocr.score like "score": 0.95, but with how we're querying the PFOCR API, I suspect this score is incorrect. That score would be correct if we made one PFOCR API query per result, but since we're trying to make just one query (see second bullet point), the score is based on all genes from all results. Should we drop pfocr.score?

@ariutta
Collaborator

ariutta commented May 19, 2022

Another item to note: for some queries, we can get a large enough number of genes that the PFOCR API returns an error: "414 Request-URI Too Large". In other projects, I've gotten around an issue like this by using POST with a request body instead of GET with URI params.
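A rough sketch of the POST workaround, assuming the PFOCR API follows the usual BioThings batch-query convention (a form-encoded body with `q` as a comma-separated list of terms and `scopes` naming the field to search). That convention is an assumption here, and the helper name is hypothetical. The request is only built, not sent.

```python
import urllib.request
from urllib.parse import urlencode

GENE_FIELD = "associatedWith.mentions.genes.ncbigene"


def build_pfocr_post_request(ncbigene_ids,
                             url="https://biothings.ncats.io/pfocr/query"):
    """Build (but do not send) a POST request carrying the gene IDs in the body."""
    body = urlencode({
        "q": ",".join(ncbigene_ids),   # assumed batch-query parameter
        "scopes": GENE_FIELD,          # assumed field selector
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


request = build_pfocr_post_request(["207", "1386", "5163", "7481"])
print(request.get_method(), request.full_url, len(request.data))
# urllib.request.urlopen(request) would actually send it.
```

Because the gene list travels in the body, the URL length stays constant no matter how many genes are in the result set.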

@ariutta
Collaborator

ariutta commented May 20, 2022

Related issue:
wikipathways/pathway-figure-ocr#24

@andrewsu
Member Author

Regarding the pfocr.score, I wasn't thinking of directly using the ES score that is returned by the PFOCR API. Rather, I was thinking that would be some simple score that we computed in the new code that you are writing. For example, that score might be the percentage of entities in the result that are also found in the figure. Or if we want to be a little more complex, it could be a chi-square statistic from a 2x2 contingency table (nicely worked out example in https://online.stat.psu.edu/statprogram/reviews/statistical-concepts/chi-square-tests).
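As a sketch of the simpler option (the share of a result's entities that also appear in the figure), something like the following would do. The function name is hypothetical and the gene sets are illustrative.

```python
def overlap_fraction(result_genes, figure_genes):
    """Fraction of the result's genes that are also annotated to the figure."""
    result_genes = set(result_genes)
    if not result_genes:
        return 0.0
    return len(result_genes & set(figure_genes)) / len(result_genes)


# Two of the three result genes appear in this made-up figure annotation.
score = overlap_fraction({"64963", "1017", "695"}, {"64963", "695", "7157"})
print(score)
```

This score ignores how many other genes the figure contains, so a figure listing hundreds of genes is not penalized relative to a focused one.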

@AlexanderPico
Collaborator

Re: pfocr.score this online tool makes it easy to plug in numbers to see how it would work:
https://www.graphpad.com/quickcalcs/contingency1.cfm (see screenshot)

Either a chi-sq or Fisher's would generate a p-value that could serve as a pfocr.score for a given figure and a given set of result genes.

[Screenshot: Screen Shot 2022-05-24 at 10 06 17 AM]
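For reference, a chi-square score from a 2x2 table can be sketched without extra packages. The table layout (result genes versus figure genes within some background gene universe) and the `universe_size` parameter are assumptions for illustration; no continuity correction is applied. For one degree of freedom the p-value is `erfc(sqrt(chi2 / 2))`.

```python
import math


def contingency_table(result_genes, figure_genes, universe_size):
    """Return (a, b, c, d): in both, result only, figure only, in neither."""
    result_genes, figure_genes = set(result_genes), set(figure_genes)
    a = len(result_genes & figure_genes)
    b = len(result_genes - figure_genes)
    c = len(figure_genes - result_genes)
    d = universe_size - a - b - c   # universe_size is an assumed background
    return a, b, c, d


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 degree of freedom)."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0             # degenerate table: no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


table = contingency_table({"1", "2", "3"}, {"2", "3", "4", "5"}, universe_size=100)
chi2, p_value = chi_square_2x2(*table)
print(table, chi2, p_value)
```

With counts this small the chi-square approximation is unreliable, which is the usual argument for Fisher's exact test in this setting.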

@ariutta
Collaborator

ariutta commented May 25, 2022

@erikyao, do you know how to get all hits for this API query:
https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:59272

There should be 317 hits. The API response correctly gives a total of 317, but in the hits field, there are just 10 items. Do I need to add a parameter to tell it to return all hits?

@colleenXu
Collaborator

colleenXu commented May 25, 2022

@ariutta I've been doing &size=1000, with the understanding that 1000 might be the max that can be returned. Yao would know more than me though.

https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:59272&size=1000

EDIT: yeah I get an error from trying to set size > 1000: https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:59272&size=2000

@erikyao

erikyao commented May 25, 2022

@erikyao, do you know how to get all hits for this API query: https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:59272

There should be 317 hits. The API response correctly gives a total of 317, but in the hits field, there are just 10 items. Do I need to add a parameter to tell it to return all hits?

By default only the top 10 hits are returned. The max is currently set to 1,000 (up to 10,000), and can be requested with the parameter &size=1000.

@ariutta
Collaborator

ariutta commented May 26, 2022

We have a potential problem with the PFOCR API. As mentioned, the maximum size param is 1000, but some genes are in more than 1000 figures, e.g.:

name    NCBIGene    figure count
AKT1    207         11,343
ATF2    1386        2,086
PDK1    5163        1,983
WNT11   7481        3,304

That means it's not possible to get all the figures for genes like AKT1. The first query only gets 1k out of >11k, and the second query fails:
https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:207&size=1000
https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:207&size=12000

Most or all of these are probably gene families, e.g., "WNT11" could be because the figure had "WNT", so we included all the WNTs.

Any suggestions? We could just ignore genes that show up in more than 1k figures.
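The "ignore genes in more than 1k figures" option could be sketched as a simple filter. The counts are the ones quoted in this thread (the table above, plus the 317 hits reported for NCBIGene 59272); the function name is hypothetical.

```python
MAX_HITS = 1000  # maximum value of the size parameter reported above

# Figure counts per NCBIGene ID, as quoted in this thread.
figure_counts = {
    "207": 11343,    # AKT1
    "1386": 2086,    # ATF2
    "5163": 1983,    # PDK1
    "7481": 3304,    # WNT11
    "59272": 317,
}


def split_by_figure_count(counts, limit=MAX_HITS):
    """Separate genes that fit in one page of hits from those that do not."""
    kept = sorted(g for g, n in counts.items() if n <= limit)
    skipped = sorted(g for g, n in counts.items() if n > limit)
    return kept, skipped


kept, skipped = split_by_figure_count(figure_counts)
print(kept, skipped)
```

The skipped genes are the likely gene-family matches, so dropping them trades recall for fewer noisy figure hits.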

@erikyao

erikyao commented May 26, 2022

@ariutta @colleenXu sorry I forgot that there is a &fetch_all=true parameter that indicates fetching all documents. Give it a try!

@ariutta
Collaborator

ariutta commented May 26, 2022

Thanks, @erikyao! This worked:
https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:207&fetch_all=true

@erikyao

erikyao commented May 26, 2022

@ariutta @colleenXu its behavior should be identical to what is documented in https://docs.mygene.info/en/latest/doc/query_service.html#fetch-all

Note that when fetch_all=true, the response will contain a _scroll_id field, whose value is leveraged to fetch all the documents in batches of 1000. For more information on scroll ids, please refer to https://docs.mygene.info/en/latest/doc/query_service.html#scroll-id
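A sketch of the scroll loop this implies. The HTTP call is abstracted behind `fetch_page`, a hypothetical callable that takes a dict of query parameters and returns the parsed JSON response; the stopping rule (an empty or missing `hits` list) is an assumption about how the API signals the end of the scroll.

```python
def fetch_all_hits(fetch_page, query):
    """Collect every hit for `query` by following _scroll_id batch by batch.

    `fetch_page(params)` is a hypothetical function that performs the GET
    request and returns the decoded JSON response as a dict.
    """
    response = fetch_page({"q": query, "fetch_all": "true"})
    hits = list(response.get("hits", []))
    scroll_id = response.get("_scroll_id")
    while scroll_id:
        response = fetch_page({"scroll_id": scroll_id})
        batch = response.get("hits", [])
        if not batch:           # assumed end-of-scroll signal
            break
        hits.extend(batch)
        scroll_id = response.get("_scroll_id", scroll_id)
    return hits


# Canned responses standing in for the API, to show the control flow.
pages = iter([
    {"hits": [{"_id": "a"}, {"_id": "b"}], "_scroll_id": "abc"},
    {"hits": [{"_id": "c"}], "_scroll_id": "abc"},
    {"success": False, "error": "No results to return."},
])
calls = []


def fake_fetch_page(params):
    calls.append(params)
    return next(pages)


all_hits = fetch_all_hits(
    fake_fetch_page, "associatedWith.mentions.genes.ncbigene:207")
print(len(all_hits), len(calls))
```

Injecting `fetch_page` keeps the loop testable without network access; a real implementation would wrap an HTTP client call.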

ariutta added 4 commits to biothings/bte_trapi_query_graph_handler that referenced this issue Jul 15, 2022
@ariutta
Collaborator

ariutta commented Jul 20, 2022

For this first iteration, I just worked with NCBIGene identifiers as the first step in getting PFOCR results into BTE. Next, we'll want to handle other identifiers. PFOCR uses NCBIGene for genes and MESH for chemicals and diseases, but we want to match TRAPI results that use other identifiers. I know there's been work related to this, and I'd like to have a meeting sometime to discuss the best way of handling this.

@ariutta
Collaborator

ariutta commented Aug 5, 2022

For much of the latest discussion on this issue, please refer to the PR:
biothings/bte_trapi_query_graph_handler#109

@colleenXu
Collaborator

Some discussion from 8/10 lab meeting on Translator stuff:

Plan: have Chunlei do some more investigation

  • look into how to do batch queries where each entry has set-logic for multiple fields (and these entries can have varying lengths of stuff to match)
  • if we can't do this on the PFOCR API, then we could do this as post-processing within the custom PFOCR handler in BTE

Not clear which dev works on this next (Jackson? Due to templating knowledge?)

Logic:

  • Iterate through each TRAPI result, if it has NCBIGene IDs in >=2 QNodes, proceed. Otherwise, don't use that TRAPI result in the next steps.
  • Generate batch-queries to PFOCR API. This is the complicated step since we want each entry to represent the set-based logic for 1 TRAPI result (We want figures with genes X and Y, and in the future we may also want the figures to have diseases A and B in that same entry too)
  • Will use templating and set-logic stuff (haven't been used yet in queries…)
  • Send them out. If scrolling is necessary to retrieve all matching hits for all entries, do that (relates to previous row's notes)
  • Proceed with whatever logic Anders is doing to pick figures (just the top 20?) and write the pfocr sections for each trapi result
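The first step of this logic (keep only results with NCBIGene IDs bound to at least two QNodes) could be sketched as below. Function names are hypothetical, and the second example result uses a placeholder non-gene CURIE.

```python
PREFIX = "NCBIGene:"


def qnodes_with_genes(result):
    """QNode IDs of a TRAPI result that are bound to at least one NCBIGene CURIE."""
    return [
        qnode
        for qnode, bindings in result["node_bindings"].items()
        if any(binding["id"].startswith(PREFIX) for binding in bindings)
    ]


def eligible_results(results, min_qnodes=2):
    """Results that qualify for PFOCR annotation."""
    return [r for r in results if len(qnodes_with_genes(r)) >= min_qnodes]


results = [
    {"node_bindings": {
        "n0": [{"id": "NCBIGene:64963"}],
        "n1": [{"id": "NCBIGene:1017"}],
        "n2": [{"id": "NCBIGene:695"}],
    }},
    {"node_bindings": {
        "n0": [{"id": "NCBIGene:64963"}],
        "n1": [{"id": "MONDO:0000001"}],   # placeholder non-gene CURIE
    }},
]
selected = eligible_results(results)
print(len(selected), qnodes_with_genes(results[1]))
```

Only the first result survives the filter; the second has a gene bound to just one QNode.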

Complication: entries are sometimes different sizes (how many things each one has to match) and would involve multiple fields in the future

@tokebe
Member

tokebe commented Sep 7, 2022

I've done some additional testing and can confirm that I've achieved parity of results between Anders' code and my updates using the new POST method.

There were some differing figures between the two that I've now confirmed to be exclusively due to minute differences in the order figures are received/processed, causing different results to be truncated when trimming down to 20 figures.

I believe the implementation of figure sorting by score has been discussed above, but not yet implemented, so that'll be my next task after cleaning up my implementation and updating the PR for review.
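One way to make the trimming independent of arrival order is to sort by score with a deterministic tie-breaker before cutting to 20. This is a sketch, not the code in the PR; the tie-break on `figureUrl` is an assumption chosen only because that field is unique per figure.

```python
def top_figures(figures, limit=20):
    """Highest-scoring figures first; ties broken by figureUrl for stability."""
    ranked = sorted(figures, key=lambda f: (-f["score"], f["figureUrl"]))
    return ranked[:limit]


# 25 made-up figures with identical scores, received in two different orders.
figures = [{"figureUrl": f"fig-{i:02d}.jpg", "score": 0.5} for i in range(25)]
first = top_figures(figures)
second = top_figures(list(reversed(figures)))
print(first == second, len(first))
```

Without the tie-breaker, equal-scoring figures would keep their arrival order, so the truncated set could still differ between runs.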

@tokebe
Member

tokebe commented Sep 13, 2022

@andrewsu RE: chi-square p-value for scoring, do we want a higher or lower p-value to be "better"?

I've implemented a working prototype of the behavior using a package I found, however documentation is a little...sparse, so I'm not 100% sure this will be acceptable. Unfortunately, not many relevant packages seem readily-available/working.

@andrewsu
Member Author

in general the lower p-value will be considered "better"...

@tokebe
Member

tokebe commented Sep 13, 2022

Ok, I asked because I'm not entirely sure I understand what the chi-square test would be testing in this case...the null hypothesis here is a little unclear to me.

That said, I'll push my changes to the PR...a PFOCR figure score in results is currently defined as 1-p where p is the p-value. Let me know if you'd prefer me to just leave it as the raw p-value.

@andrewsu
Member Author

Yeah, to be clear, this is a really sketchy use of the chi square test because the counts in our 2x2 contingency table are so small. (If the path has four nodes, then the largest the minimum cell count could be is 2, and that's much smaller than what you'd want.) So this sort of ranking of PFOCR figures for a given result is a very crude ranking metric, and it may end up being so crude that we'd take it out after we actually look at how it behaves...

@colleenXu
Collaborator

Organizing previous discussions:

Requirements

  • first post of this issue shows what Andrew originally envisioned. The scope was narrowed to genes only. The pfocr section would include figureUrl, pmc, QNodeIDs, and a score
    • pfocr section included for results that have NCBIGene IDs mapped to >= 2 QNodes. Discussed here, here, and here
      • Note that NCBIGene is the top of the Gene ID-namespace priority list, so the entity's key should be NCBIGene-ID if it has a mapping to this namespace
    • to match figures with TRAPI results, we want >=2 NCBIGene IDs to be annotated to the figure and to the result. So 1 figure could show up in multiple results' pfocr sections, and 1 result may match many figures.
    • scores: discussed in these May posts, chi-square implementation in September starting with this post
    • each result includes the top-scored figures up to a max of 20 (requirement set
  • have informative TRAPI-level AND console logs
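The matching requirement above could be sketched as a function that builds one pfocr entry for a result and figure pair, or returns nothing when fewer than two CURIEs match. The figure record shape (`curies` as a flat list) and the function name are assumptions; the example values are taken from the output snippet in the closing comment, and scoring is left out.

```python
def build_pfocr_entry(result, figure, min_matches=2):
    """Return a pfocr entry (without score) or None if too few genes match."""
    figure_curies = set(figure["curies"])
    nodes, matched = [], []
    for qnode, bindings in result["node_bindings"].items():
        hits = [b["id"] for b in bindings if b["id"] in figure_curies]
        if hits:
            nodes.append(qnode)
            matched.extend(hits)
    if len(set(matched)) < min_matches:
        return None
    return {
        "figureUrl": figure["figureUrl"],
        "pmc": figure["pmc"],
        "nodes": nodes,
        "matchedCuries": matched,
    }


result = {"node_bindings": {
    "n0": [{"id": "NCBIGene:55768"}],
    "n1": [{"id": "NCBIGene:1956"}],
    "n2": [{"id": "NCBIGene:1950"}],
}}
# Only the two matched CURIEs are known from this thread; the figure's
# full gene list is not shown here.
figure = {
    "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3983693/bin/nihms549611f6.jpg",
    "pmc": "PMC3983693",
    "curies": ["NCBIGene:1956", "NCBIGene:1950"],
}
entry = build_pfocr_entry(result, figure)
print(entry["nodes"], entry["matchedCuries"])
```

The same figure can be passed against many results, which is how one figure ends up in several results' pfocr sections.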

Future directions

@colleenXu
Collaborator

@AlexanderPico
Collaborator

I agree that chi square and Fisher's Exact Test are not ideal for these small n comparisons. Here's an alternative that was implemented by NDEx iQuery to address this same issue:

Cosine similarity: This score characterizes the similarity between the query set and the genes in the pathway while considering that some genes are much more universal than others and will appear in many more pathways. So, it takes into consideration overall frequencies without applying (assuming) a rigorous statistical test. Implemented here in Java for a REST service. @tokebe
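To make the idea concrete, here is a rough sketch of a frequency-weighted cosine similarity between gene sets. It is not the NDEx iQuery implementation: the inverse-document-frequency weighting (`log(N / df) + 1`) and all names are assumptions chosen to illustrate down-weighting genes that appear in many figures.

```python
import math


def idf_weights(figure_gene_sets):
    """Weight each gene by how rarely it appears across the figures."""
    n = len(figure_gene_sets)
    doc_freq = {}
    for genes in figure_gene_sets:
        for gene in set(genes):
            doc_freq[gene] = doc_freq.get(gene, 0) + 1
    return {gene: math.log(n / df) + 1.0 for gene, df in doc_freq.items()}


def weighted_cosine(query_genes, figure_genes, weights, default=1.0):
    """Cosine similarity of two gene sets viewed as weighted binary vectors."""
    query_genes, figure_genes = set(query_genes), set(figure_genes)

    def norm(genes):
        return math.sqrt(sum(weights.get(g, default) ** 2 for g in genes))

    dot = sum(weights.get(g, default) ** 2 for g in query_genes & figure_genes)
    denom = norm(query_genes) * norm(figure_genes)
    return dot / denom if denom else 0.0


# Toy corpus: gene "A" is in every figure, the others are rare.
corpus = [{"A", "B"}, {"A", "C"}, {"A", "D"}]
weights = idf_weights(corpus)
shares_common = weighted_cosine({"A", "B"}, {"A", "C"}, weights)
shares_rare = weighted_cosine({"A", "B"}, {"B", "E"}, weights)
print(shares_common, shares_rare)
```

Sharing the rare gene scores higher than sharing the ubiquitous one, which is the behavior the comment describes.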

@tokebe
Member

tokebe commented Mar 28, 2023

Deployed to prod 🚀

@andrewsu I assume anything in the future related to this would be discussed in a new issue?

@andrewsu
Member Author

I posted the query below to our prod instance (https://bte.transltr.io/v1/query) and got the answer snippet below which includes a pfocr section in the results. All is working as intended, so closing this issue.

Query: NGLY1 - [Gene] - [Gene]
{
    "message": {
        "query_graph": {
            "nodes": {
                "n0": {
                    "ids": [
                        "NCBIGene:55768"
                    ]
                },
                "n1": {
                    "categories": [
                        "biolink:Gene"
                    ]
                },
                "n2": {
                    "categories": [
                        "biolink:Gene"
                    ]
                }
            },
            "edges": {
                "e01": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e02": {
                    "subject": "n1",
                    "object": "n2"
                }
            }
        }
    }
}
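For anyone wanting to rerun this with a different starting gene, the query graph above can be generated programmatically. This is only a convenience sketch: the function name is hypothetical, and it builds the request body without sending it.

```python
import json


def build_two_hop_gene_query(start_curie):
    """TRAPI query graph: start gene -> [Gene] -> [Gene], as in the query above."""
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    "n0": {"ids": [start_curie]},
                    "n1": {"categories": ["biolink:Gene"]},
                    "n2": {"categories": ["biolink:Gene"]},
                },
                "edges": {
                    "e01": {"subject": "n0", "object": "n1"},
                    "e02": {"subject": "n1", "object": "n2"},
                },
            }
        }
    }


query = build_two_hop_gene_query("NCBIGene:55768")
body = json.dumps(query, indent=4)   # the JSON body that would be POSTed
print(body)
```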
output snippet
        "results": [
            {
                "node_bindings": {
                    "n0": [
                        {
                            "id": "NCBIGene:55768"
                        }
                    ],
                    "n1": [
                        {
                            "id": "NCBIGene:1956"
                        }
                    ],
                    "n2": [
                        {
                            "id": "NCBIGene:1950"
                        }
                    ]
                },
                "edge_bindings": {
                    "e01": [
                        {
                            "id": "5aead8f41af2158496a3b1f29752b3b1"
                        },
                        {
                            "id": "02d8e265a19107ccdd6763dfb4a8163c"
                        },
                        {
                            "id": "8d67d590d0cfbc7e6b64fe666da3b849"
                        }
                    ],
                    "e02": [
                        {
                            "id": "e7ff2726ea4f571da3b1d0339a7ccbd0"
                        },
                        {
                            "id": "b78edf7537b2e885ce5b5fb95cbcb5c8"
                        }
                    ]
                },
                "score": 9.643175465683468,
                "pfocr": [
                    {
                        "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3983693/bin/nihms549611f6.jpg",
                        "pmc": "PMC3983693",
                        "nodes": [
                            "n1",
                            "n2"
                        ],
                        "matchedCuries": [
                            "NCBIGene:1956",
                            "NCBIGene:1950"
                        ],
                        "score": 0.5714285714285715
                    },
                    {
                        "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3304012/bin/nihms-350548-f0003.jpg",
                        "pmc": "PMC3304012",
                        "nodes": [
                            "n1",
                            "n2"
                        ],
                        "matchedCuries": [
                            "NCBIGene:1956",
                            "NCBIGene:1950"
                        ],
                        "score": 0.5
                    },

@andrewsu
Member Author

Showing one more result from the NGLY1 - [Gene] - [Gene] example above that may illustrate the value of this work.

The second ranked result corresponds to NGLY1 - NFE2 - DDI2.

                    {
                        "node_bindings": {
                            "n0": [ { "id": "NCBIGene:55768" } ],
                            "n1": [ { "id": "NCBIGene:4779" } ],
                            "n2": [ { "id": "NCBIGene:84301" } ]
                        },
                        "edge_bindings": {
                           ...
                        },
                        "score": 7.260500069613187,
                        "pfocr": [
                            { ... },
                            {
                                "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5704294/bin/oc-2017-00224x_0001.jpg",
                                "pmc": "PMC5704294",
                                "nodes": [
                                    "n0",
                                    "n2"
                                ],
                                "matchedCuries": [
                                    "NCBIGene:84301",
                                    "NCBIGene:55768"
                                ],
                                "score": 0.3333333333333333
                            },

The noted figure shows great context on how these three genes are related (plus a pointer to a highly relevant manuscript)

[Figure: oc-2017-00224x_0001]

7 participants