Augment TRAPI results using PFOCR data #109

Merged
merged 9 commits into from
Dec 22, 2022

Conversation

ariutta
Collaborator

@ariutta ariutta commented Jul 15, 2022

This pull request adds the first PFOCR data to the TRAPI results:
biothings/biothings_explorer#420

Each PFOCR entry has three fields: figureUrl, pmc, and nodes (the IDs of the matching query nodes). We can add other fields, like score, in a future round.

@tokebe, the only change since your comment "looks good" was minor cleanup.

@colleenXu, did you want to check this new code too? It's in the branch add_pfocr.

@ariutta ariutta requested a review from tokebe July 15, 2022 00:14
@ariutta ariutta force-pushed the add_pfocr branch 3 times, most recently from 1d7fc01 to 9b3af00 Compare July 15, 2022 00:17
@colleenXu
Contributor

It may be useful to have an example response?

@ariutta
Collaborator Author

ariutta commented Jul 19, 2022

Here's an example pulled from the query in this notebook:

Expand to see JSON
{
    "node_bindings": {
        "n1": [
            {
                "id": "NCBIGene:5599"
            }
        ],
        "n0": [
            {
                "id": "NCBIGene:211"
            }
        ],
        "n2": [
            {
                "id": "PUBCHEM.COMPOUND:3121"
            }
        ]
    },
    "edge_bindings": {
        "e01": [
            {
                "id": "5fc6cc476f4bcb068460c5d299db52dd"
            }
        ],
        "e02": [
            {
                "id": "fc671d4b62d2d983372271323c3e8be3"
            },
            {
                "id": "7b6da63564f96e9b4d0aef2581eb3ba3"
            },
            {
                "id": "3e93479f91fd45c376d4851eea747174"
            },
            {
                "id": "8353796517ac7652afabf4042e944971"
            },
            {
                "id": "e495f8080658dd1cd0acbec8c4cfeaf7"
            },
            {
                "id": "551bd928ba58c60f465375306b81bbac"
            },
            {
                "id": "6f6dd5467485ea133e64cb76b405c7d2"
            }
        ]
    },
    "score": 2.1062617593370945,
    "pfocr": [
        {
            "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5354998/bin/nihms843846f10.jpg",
            "pmc": "PMC5354998",
            "nodes": [
                "n1",
                "n0"
            ]
        },
        {
            "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1906540/bin/cei0130-0363-f2.jpg",
            "pmc": "PMC1906540",
            "nodes": [
                "n1",
                "n0"
            ]
        },
        {
            "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7402116/bin/antioxidants-09-00636-g004.jpg",
            "pmc": "PMC7402116",
            "nodes": [
                "n1",
                "n0"
            ]
        }
    ]
}
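
To make the shape of the new field concrete, here is a minimal Python sketch that summarizes the `pfocr` entries of a result like the one above. The helper name `summarize_pfocr` is hypothetical, and treating a missing `pfocr` key as "no figures" is an assumption; only the field names `figureUrl`, `pmc` and `nodes` come from the example.

```python
from typing import Any


def summarize_pfocr(result: dict[str, Any]) -> list[str]:
    """Return one line per PFOCR figure attached to a TRAPI result."""
    lines = []
    # Assumption: results without a matching figure simply lack the "pfocr" key.
    for figure in result.get("pfocr", []):
        qnodes = ", ".join(sorted(figure["nodes"]))
        lines.append(f'{figure["pmc"]}: matches QNodes {qnodes}')
    return lines


# Trimmed copy of the example result above (one figure only).
example_result = {
    "score": 2.1062617593370945,
    "pfocr": [
        {
            "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5354998/bin/nihms843846f10.jpg",
            "pmc": "PMC5354998",
            "nodes": ["n1", "n0"],
        },
    ],
}

print(summarize_pfocr(example_result))
```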

@colleenXu
Contributor

colleenXu commented Jul 20, 2022

Feedback for @ariutta to discuss with @andrewsu: Basically I'm not sure what conditions the pfocr augmenting happens in:

  • only works on 1-hops from Gene -> Gene? I tried an Explain style query (Chemical X -> Gene <- set of Genes A, B, C, D...) and the PFOCR augmenting didn't seem to happen...
Query

Related to July 2022 Translator Standup

{
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e01": {
                    "subject": "n2",
                    "object": "n1"
                }
            },
            "nodes": {
                "n0": {                   
                    "ids": ["UMLS:C0034407"],
                    "categories": ["biolink:SmallMolecule"],
                    "name": "Quinazolines"
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                },
                "n2": {
                    "ids": ["NCBIGene:10628", "NCBIGene:22861", "NCBIGene:51085",
                            "NCBIGene:1490", "NCBIGene:389692", "NCBIGene:3480",
                            "NCBIGene:598"],
                    "categories": ["biolink:Gene"],
                    "is_set": true,
                    "name": "TXNIP, NLRP1, MLXIPL, CTGF, MAFA, IGF1R, BCL2L1"
                }
            }
        }
    }
}
  • limiting the number of figures per result to a manageable number (5 or fewer?): In a simple query of 1 Gene ID (NCBIGene:1742) -> Gene, I got 1 result that had 53 figures...
  • As expected, ARAX doesn't show any of the pfocr stuff when the response is pasted into the UI. I wonder if the edge_binding "attributes" from TRAPI are used by other KPs / ARAs to cover "scoring" stuff like this...
  • a QGraph with only 1 Gene QNode can have multiple genes in results when that Gene QNode has is_set: true...but it looks like PFOCR augmenting doesn't happen in this situation. Is that okay?
An example of a query with Gene QNode `is_set: true`

This query is used in the creative-mode run for this disease.

{
    "message": {
        "query_graph": {
            "nodes": {
                "creativeQueryObject": {
                    "ids":["MONDO:0007035"],
                    "categories":["biolink:Disease"],
                    "name": "acanth"
               },
                "nA": {
                    "categories":["biolink:Gene"],
                    "is_set": true
                },
                "creativeQuerySubject": {
                    "categories":["biolink:ChemicalEntity"]
                }
            },
            "edges": {
                "eA": {
                    "subject": "creativeQueryObject",
                    "object": "nA",
                    "predicates": ["biolink:caused_by"]
                },
                "eB": {
                    "subject": "nA",
                    "object": "creativeQuerySubject",
                    "predicates": ["biolink:entity_regulated_by_entity"]
                }
            }
        }
    }
}
  • PFOCR augmenting doesn't happen with QGraphs that have only 1 Gene QNode (I tried one-hops starting with a Gene ID NCBIGene:1742 and going to ChemicalEntity, Disease, SmallMolecule)

@colleenXu
Contributor

colleenXu commented Jul 20, 2022

As noted during the meeting today, BTE already does "ID resolution" based on biolink-model's id_prefix priority order, so the KG Node key itself should be NCBIGene if a mapping to one was found.

However, for Disease / Chemical, the MESH IDs could be in the KG Node key or in the synonyms section of the node's attributes...

It may be easier to pull the IDs out in internal data things, similar to the UMLS ID (which is done for semmeddb ngd scoring)


Other notes from the meeting:

  • Fixing the issue where a query didn't have pfocr stuff (Chemical X -> Gene <- set of Genes A, B, C, D...)
  • limit figures per result to max 20 (Andrew)
  • topics to still discuss with Andrew?
    • putting the pfocr stuff in TRAPI result edge-attribute
    • using pfocr when there's 1 Gene QNode but it has is_set (so some results will have multiple genes in them)
    • using pfocr in cases where there aren't 2 Gene QNodes but results with multiple Genes are in the results (Gene ID -> NamedThing <- Disease ID?)

@ariutta
Collaborator Author

ariutta commented Jul 20, 2022

Fixing the issue where a query didn't have pfocr stuff (Chemical X -> Gene <- set of Genes A, B, C, D...)

This query didn't return PFOCR data because the first TRAPI result had a gene with a UMLS ID. I updated the code to check all results for NCBIGene IDs, and now that query does get results with PFOCR figures:
image
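
The fix described above (scanning every result rather than only the first) can be sketched roughly as follows. This is an illustration in Python, not the code in this PR; the function name is hypothetical and `UMLS:C0000000` is a made-up placeholder ID.

```python
def qnodes_with_ncbigene(results):
    """Collect QNode IDs bound to at least one NCBIGene CURIE in any result."""
    matched = set()
    for result in results:  # scan every result, not only the first one
        for qnode_id, bindings in result["node_bindings"].items():
            if any(b["id"].startswith("NCBIGene:") for b in bindings):
                matched.add(qnode_id)
    return matched


results = [
    # The first result binds n1 to a UMLS ID (placeholder value), which would
    # hide n1 from a check that only looks at the first result.
    {"node_bindings": {"n1": [{"id": "UMLS:C0000000"}], "n2": [{"id": "NCBIGene:10628"}]}},
    {"node_bindings": {"n1": [{"id": "NCBIGene:3480"}], "n2": [{"id": "NCBIGene:10628"}]}},
]

print(sorted(qnodes_with_ncbigene(results)))
```

Looking only at `results[:1]` would report just `n2`, which is the situation that caused the Explain-style query to skip PFOCR augmenting.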

limit figures per result to max 20 (Andrew)

I further updated the code to limit to the first 20 figures per TRAPI result:
image

@ariutta
Collaborator Author

ariutta commented Jul 20, 2022

using pfocr when there's 1 Gene QNode but it has is_set (so some results will have multiple genes in them)

My current understanding is we don't want to include PFOCR in this case, so the code currently requires matching CURIEs for 2+ different QNodes for a TRAPI result.

using pfocr in cases where there aren't 2 Gene QNodes but results with multiple Genes are in the results (Gene ID -> NamedThing <- Disease ID?)

The current code just checks whether there's an NCBIGene associated with a QNode. It doesn't actually check whether the category is "biolink:Gene".
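
As a rough illustration of the rule described in this comment (a figure must share CURIEs with 2+ different QNodes of a result, with no category check), here is a Python sketch. The function names and the figure's gene set are hypothetical; the node bindings are taken from the example result earlier in this thread.

```python
def matched_qnodes(result, figure_genes):
    """Return QNode IDs in `result` bound to any CURIE that the figure contains."""
    return sorted(
        qnode_id
        for qnode_id, bindings in result["node_bindings"].items()
        if any(b["id"] in figure_genes for b in bindings)
    )


def figure_matches(result, figure_genes, min_qnodes=2):
    """A figure counts as a match only if it covers `min_qnodes`+ distinct QNodes."""
    return len(matched_qnodes(result, figure_genes)) >= min_qnodes


result = {
    "node_bindings": {
        "n0": [{"id": "NCBIGene:211"}],
        "n1": [{"id": "NCBIGene:5599"}],
        "n2": [{"id": "PUBCHEM.COMPOUND:3121"}],
    }
}
# Hypothetical gene set for one figure; real sets would come from the PFOCR API.
figure_genes = {"NCBIGene:211", "NCBIGene:5599", "NCBIGene:1742"}
print(matched_qnodes(result, figure_genes), figure_matches(result, figure_genes))

# A single is_set QNode bound to two genes still counts as one QNode: no match.
is_set_result = {"node_bindings": {"nA": [{"id": "NCBIGene:211"}, {"id": "NCBIGene:5599"}]}}
print(figure_matches(is_set_result, figure_genes))
```

Note that the sketch only compares CURIEs, so a QNode of any category would count if it happens to be bound to an NCBIGene ID that the figure contains.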

@ariutta
Collaborator Author

ariutta commented Jul 20, 2022

Some queries can result in trying to get essentially all of the PFOCR figures, e.g.:

{
    "message": {
        "query_graph": {
            "edges": {
                "e01": {
                    "object": "n0",
                    "predicates": [
                        "biolink:related_to"
                    ],
                    "subject": "n1"
                },
                "e02": {
                    "object": "n1",
                    "predicates": [
                        "biolink:related_to"
                    ],
                    "subject": "n2"
                }
            },
            "nodes": {
                "n0": {
                    "categories": [
                        "biolink:Gene"
                    ],
                    "ids": [
                        "NCBIGene:3855",
                        "NCBIGene:211",
                        "NCBIGene:26995"
                    ]
                },
                "n1": {
                    "categories": [
                        "biolink:SmallMolecule"
                    ],
                    "ids": [
                        "PUBCHEM.COMPOUND:3121"
                    ]
                },
                "n2": {
                    "categories": [
                        "biolink:Gene"
                    ]
                }
            }
        }
    }
}

When I ran that once, it timed out. When I ran it again, it ended up getting 67,280 figures. Should we put a limit on how many genes we submit in queries to the PFOCR API?

@colleenXu
Contributor

Hmmm I don't know what's going on in #109 (comment): is it a massive number of genes (so splitting into multiple queries would help)? is it that some genes are too common (maybe we can remove them - so we just don't use them for the pfocr related tasks)?

@ariutta
Collaborator Author

ariutta commented Jul 22, 2022

is it a massive number of genes (so splitting into multiple queries would help)? is it that some genes are too common (maybe we can remove them - so we just don't use them for the pfocr related tasks)?

In the example I gave, the problem is a massive number of genes.

@ariutta
Collaborator Author

ariutta commented Jul 22, 2022

As far as I can tell, the gene count issue isn't really specific to PFOCR. If we get too many genes, the scoring can also fail, and for that matter, the entire BTE query can fail. So any gene count limits should probably be set at the system level. If the PFOCR API is too slow, we could speed it up by updating it to support POST queries.

@colleenXu
Contributor

colleenXu commented Jul 25, 2022

Potential issues:

  • now I'm getting 21 pfocr items (rather than 20) as the max.
  • I don't see any TRAPI-logs about the PFOCR scoring process (so only the console logs will record info about it).
  • Rewrite these console logs to make them clearer? And make them TRAPI logs as well?
    • bte:biothings-explorer-trapi:pfocr 72 PFOCR figures match 2+ genes in individual TRAPI result(s) +0ms I think this is saying there will be 72 PFOCR figures in the results section
    • bte:biothings-explorer-trapi:pfocr 19 TRAPI results match 2+ genes in individual PFOCR figure(s) +0ms So...there are 19 TRAPI results that will have pfocr sections (so the 72 figures are split between 19 TRAPI results)?
  • Are the figures within 1 result / pfocr section sorted in any way? Is there a helpful way to sort them? Is there a "pfocr_score" (I don't see one in the current response)?

Notes in general:

  • PFOCR augmenting (currently) only occurs when:
    • TRAPI Results have >=2 NCBIGene IDs, spread over >= 2 QNodes. Then the process of gathering the NCBIGene IDs, querying for PFOCR figure info, and matching to TRAPI results will happen
    • One TRAPI result will have a pfocr section when a PFOCR figure has been matched to it
  • Currently we don't do PFOCR augmenting when the multiple NCBIGene IDs are only on 1 QNode (this can happen when that QNode is set to is_set:true)
  • I've previously mentioned the TRAPI node_binding and edge_binding attributes as a place to put PFOCR stuff. After thinking some more, I think this won't really work:
    • node_binding attribute won't work when the PFOCR figure matches 2 QNodes
    • edge_binding attribute won't work when the PFOCR figure matches 2 QNodes that aren't attached to each other in the QGraph

@colleenXu
Contributor

colleenXu commented Jul 25, 2022

Notes on performance (related to these comments):


A. Took ~ 2 min to run the PFOCR section of this query that had 2 Gene QNodes and 522 results. BTE took 5 min 14 s total.

query from Feb-March 2022 QotM work
{
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e01": {
                    "subject": "n1",
                    "object": "n2"
                },
                "e02": {
                    "subject": "n2",
                    "object": "n3"
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["PUBCHEM.COMPOUND:3121"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                },
                "n2": {                   
                    "ids": ["CHEBI:30413"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n3": {
                    "ids": ["NCBIGene:211"],
                    "categories": ["biolink:Gene"]
                }
            }
        }
    }
}
some console logs
  bte:biothings-explorer-trapi:QueryResult Got 522 TRAPI result(s) +27ms
  bte:biothings-explorer-trapi:pfocr QNode(s) having CURIEs that PFOCR could potentially match: n3,n1 +0ms
  bte:biothings-explorer-trapi:pfocr Getting PFOCR figure data +2ms
  bte:biothings-explorer-trapi:pfocr Making 6 scrolling request(s) for PFOCR figure data (multiple required due to query string length limit for GET requests) +0ms

  bte:biothings-explorer-trapi:pfocr 56654 total PFOCR figure hits retrieved +47ms
  bte:biothings-explorer-trapi:pfocr 56654 PFOCR figures match at least one gene from any TRAPI result +1ms
  bte:biothings-explorer-trapi:pfocr Finding the PFOCR figures and TRAPI result sets that share 2+ CURIEs +604ms
  bte:biothings-explorer-trapi:pfocr 13901 unique PFOCR figure CURIEs +0ms
  bte:biothings-explorer-trapi:pfocr 471 unique TRAPI result CURIEs +0ms
  bte:biothings-explorer-trapi:pfocr 471 CURIEs common to both TRAPI results and PFOCR figures +0ms

  bte:biothings-explorer-trapi:pfocr 53 PFOCR figures match 2+ genes in individual TRAPI result(s) +15s
  bte:biothings-explorer-trapi:pfocr 121 TRAPI results match 2+ genes in individual PFOCR figure(s) +0ms

B. Took ~ 2 min to run the PFOCR section of Gene TXNIP (NCBIGene:10628) -> Gene that had 466 results. BTE took 2 min 5 s total.

some console logs
  bte:biothings-explorer-trapi:QueryResult Got 466 TRAPI result(s) +6ms
  bte:biothings-explorer-trapi:pfocr QNode(s) having CURIEs that PFOCR could potentially match: n0,n1 +41s
  bte:biothings-explorer-trapi:pfocr Getting PFOCR figure data +8ms
  bte:biothings-explorer-trapi:pfocr Making 6 scrolling request(s) for PFOCR figure data (multiple required due to query string length limit for GET requests) +0ms

  bte:biothings-explorer-trapi:pfocr 45782 total PFOCR figure hits retrieved +41ms
  bte:biothings-explorer-trapi:pfocr 45782 PFOCR figures match at least one gene from any TRAPI result +0ms
  bte:biothings-explorer-trapi:pfocr Finding the PFOCR figures and TRAPI result sets that share 2+ CURIEs +576ms
  bte:biothings-explorer-trapi:pfocr 13697 unique PFOCR figure CURIEs +0ms
  bte:biothings-explorer-trapi:pfocr 423 unique TRAPI result CURIEs +0ms
  bte:biothings-explorer-trapi:pfocr 368 CURIEs common to both TRAPI results and PFOCR figures +0ms

  bte:biothings-explorer-trapi:pfocr 118 PFOCR figures match 2+ genes in individual TRAPI result(s) +4s
  bte:biothings-explorer-trapi:pfocr 139 TRAPI results match 2+ genes in individual PFOCR figure(s) +0ms

C. Took ~ 20 sec to run the PFOCR section of this query that had 29 results. BTE took 1 min 56 s total.

Quinazolines -> Gene <- Bunch of Gene IDs
{
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e01": {
                    "subject": "n2",
                    "object": "n1"
                }
            },
            "nodes": {
                "n0": {                   
                    "ids": ["UMLS:C0034407"],
                    "categories": ["biolink:SmallMolecule"],
                    "name": "Quinazolines"
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                },
                "n2": {
                    "ids": ["NCBIGene:10628", "NCBIGene:22861", "NCBIGene:51085",
                            "NCBIGene:1490", "NCBIGene:389692", "NCBIGene:3480",
                            "NCBIGene:598", "NCBIGene:2308", "NCBIGene:22877", "NCBIGene:2033"],
                    "categories": ["biolink:Gene"],
                    "is_set": true,
                    "name": "TXNIP, NLRP1, MLXIPL, CTGF, MAFA, IGF1R, BCL2L1, FOXO1, MLXIP, EP300"
                }
            }
        }
    }
}
some console logs
  bte:biothings-explorer-trapi:QueryResult Got 29 TRAPI result(s) +2ms
  bte:biothings-explorer-trapi:pfocr QNode(s) having CURIEs that PFOCR could potentially match: n2,n1 +2m
  bte:biothings-explorer-trapi:pfocr Getting PFOCR figure data +19ms
  bte:biothings-explorer-trapi:pfocr Making 1 scrolling request(s) for PFOCR figure data +0ms

  bte:biothings-explorer-trapi:pfocr 22650 total PFOCR figure hits retrieved +8ms
  bte:biothings-explorer-trapi:pfocr 22650 PFOCR figures match at least one gene from any TRAPI result +1ms
  bte:biothings-explorer-trapi:pfocr Finding the PFOCR figures and TRAPI result sets that share 2+ CURIEs +336ms
  bte:biothings-explorer-trapi:pfocr 11664 unique PFOCR figure CURIEs +0ms
  bte:biothings-explorer-trapi:pfocr 27 unique TRAPI result CURIEs +0ms
  bte:biothings-explorer-trapi:pfocr 26 CURIEs common to both TRAPI results and PFOCR figures +0ms

  bte:biothings-explorer-trapi:pfocr 72 PFOCR figures match 2+ genes in individual TRAPI result(s) +0ms
  bte:biothings-explorer-trapi:pfocr 19 TRAPI results match 2+ genes in individual PFOCR figure(s) +0ms

D. Took ~ 32 sec to run the PFOCR section of this query that had 529 results. BTE took 4 min 20 s total.

Type 2 Diabetes -> BiologicalEntity <- TXNIP
{
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e01": {
                    "subject": "n2",
                    "object": "n1"
                }
            },
            "nodes": {
                "n0": {                   
                    "ids": ["MONDO:0005148"],
                    "categories": ["biolink:DiseaseOrPhenotypicFeature"],
                    "name": "diabetes 2"
                },
                "n1": {
                    "categories": ["biolink:BiologicalEntity"]
                },
                "n2": {
                    "ids": ["NCBIGene:10628"],
                    "categories": ["biolink:Gene"],
                    "name": "TXNIP"
                }
            }
        }
    }
}
some console logs
  bte:biothings-explorer-trapi:QueryResult Got 529 TRAPI result(s) +66ms
  bte:biothings-explorer-trapi:pfocr QNode(s) having CURIEs that PFOCR could potentially match: n2,n1 +16m
  bte:biothings-explorer-trapi:pfocr Getting PFOCR figure data +1ms
  bte:biothings-explorer-trapi:pfocr Making 3 scrolling request(s) for PFOCR figure data (multiple required due to query string length limit for GET requests) +0ms


  bte:biothings-explorer-trapi:pfocr 44178 total PFOCR figure hits retrieved +35ms
  bte:biothings-explorer-trapi:pfocr 44178 PFOCR figures match at least one gene from any TRAPI result +0ms
  bte:biothings-explorer-trapi:pfocr Finding the PFOCR figures and TRAPI result sets that share 2+ CURIEs +578ms
  bte:biothings-explorer-trapi:pfocr 13504 unique PFOCR figure CURIEs +0ms
  bte:biothings-explorer-trapi:pfocr 192 unique TRAPI result CURIEs +1ms
  bte:biothings-explorer-trapi:pfocr 187 CURIEs common to both TRAPI results and PFOCR figures +0ms

  bte:biothings-explorer-trapi:pfocr 120 PFOCR figures match 2+ genes in individual TRAPI result(s) +4s
  bte:biothings-explorer-trapi:pfocr 102 TRAPI results match 2+ genes in individual PFOCR figure(s) +0ms

E. Took ~ 51 sec to run the PFOCR section of this query that had 78 results. BTE took 1 min 20 s total.

Gene MLXIP -> Gene <- Gene CTGF
{
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "A",
                    "object": "B"
                },
                "e01": {
                    "subject": "C",
                    "object": "B"
                }
            },
            "nodes": {
                "A": {                   
                    "ids": ["NCBIGene:22877"],
                    "categories": ["biolink:Gene"],
                    "name": "MLXIP"
                },
                "B": {
                    "categories": ["biolink:Gene"]
                },
                "C": {
                    "ids": ["NCBIGene:1490"],
                    "categories": ["biolink:Gene"],
                    "name": "CTGF"
                }
            }
        }
    }
}
some console logs
  bte:biothings-explorer-trapi:QueryResult Got 78 TRAPI result(s) +1ms
  bte:biothings-explorer-trapi:pfocr QNode(s) having CURIEs that PFOCR could potentially match: A,B,C +0ms
  bte:biothings-explorer-trapi:pfocr Getting PFOCR figure data +0ms
  bte:biothings-explorer-trapi:pfocr Making 1 scrolling request(s) for PFOCR figure data +0ms


  bte:biothings-explorer-trapi:pfocr 34198 total PFOCR figure hits retrieved +19ms
  bte:biothings-explorer-trapi:pfocr 34198 PFOCR figures match at least one gene from any TRAPI result +0ms
  bte:biothings-explorer-trapi:pfocr Finding the PFOCR figures and TRAPI result sets that share 2+ CURIEs +498ms
  bte:biothings-explorer-trapi:pfocr 12853 unique PFOCR figure CURIEs +0ms
  bte:biothings-explorer-trapi:pfocr 79 unique TRAPI result CURIEs +0ms
  bte:biothings-explorer-trapi:pfocr 78 CURIEs common to both TRAPI results and PFOCR figures +0ms


  bte:biothings-explorer-trapi:pfocr 185 PFOCR figures match 2+ genes in individual TRAPI result(s) +525ms
  bte:biothings-explorer-trapi:pfocr 58 TRAPI results match 2+ genes in individual PFOCR figure(s) +0ms

(It's really hard to run a Gene ID -> Gene -> Gene (Predict-style) without having it blow up with tons of results >.<)

@ariutta
Collaborator Author

ariutta commented Jul 25, 2022

now I'm getting 21 pfocr items (rather than 20) as the max.

Will be updated

I don't see any TRAPI-logs about the PFOCR scoring process (so only the console logs will record info about it).

Will be added

Rewrite these console logs to make them clearer? And make them TRAPI logs as well?

  • bte:biothings-explorer-trapi:pfocr 72 PFOCR figures match 2+ genes in individual TRAPI result(s) +0ms I think this is saying there will be 72 PFOCR figures in the results section
  • bte:biothings-explorer-trapi:pfocr 19 TRAPI results match 2+ genes in individual PFOCR figure(s) +0ms So...there are 19 TRAPI results that will have pfocr sections (so the 72 figures are split between 19 TRAPI results)?

Your interpretation was correct. Just note that the "split" can include many-to-many relationships (e.g., one figure could match multiple TRAPI results and one TRAPI result could match multiple figures), many-to-one, one-to-many or one-to-one.

Any suggestions for better wording? Maybe one of the following?

  1. 2+ CURIE matches: 72 PFOCR figures and 19 TRAPI results
  2. 19 TRAPI results share 2+ CURIEs with 72 PFOCR figures
  3. 19 TRAPI results match 72 PFOCR figures (2+ CURIEs shared)

We don't want the user to think each of the 19 TRAPI results matches 72 figures. But we probably do want the user to know that we required at least 2 CURIEs to be shared for each figure / TRAPI result pair in order to be a match.
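
To show why the two counts are independent of each other, here is a small Python sketch that derives them from a list of matched (figure, result) pairs and formats them using wording option 1 from the list above. The function name and the pair representation are assumptions made for illustration, not the structure used in this PR.

```python
def summarize_matches(pairs):
    """Format match counts from (figure_id, result_index) pairs.

    Each pair is assumed to share 2+ CURIEs already.
    """
    pairs = list(pairs)  # allow generators, since we iterate twice
    figures = {figure_id for figure_id, _ in pairs}
    results = {result_index for _, result_index in pairs}
    return (
        f"2+ CURIE matches: {len(figures)} PFOCR figures "
        f"and {len(results)} TRAPI results"
    )


pairs = [
    ("figA", 0), ("figA", 1),  # one figure matching two results
    ("figB", 1),               # result 1 also matches a second figure
    ("figC", 2),               # a plain one-to-one match
]
print(summarize_matches(pairs))
```

With four pairs, the summary reports 3 figures and 3 results, so neither count can be read as "figures per result".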

Are the figures within 1 result / pfocr section sorted in any way? Is there a helpful way to sort them? Is there a "pfocr_score" (I don't see one in the current response)?

Currently no sorting and no scoring, but in issue #451, we're exploring this topic. Once that's figured out, we'll want to add some type of sorting and scoring.

Notes in general:

I think these bullet points correctly describe the current behavior, which I also understand to be the desired behavior. If you or @andrewsu think otherwise, just let me know.

@ariutta
Collaborator Author

ariutta commented Jul 25, 2022

As soon as I get feedback on the wording for logging the counts of PFOCR figure and TRAPI result matches, I'll make another push to this PR to address each item you mentioned, @colleenXu. Thanks for the comments.

@ariutta
Collaborator Author

ariutta commented Jul 25, 2022

(It's really hard to run a Gene ID -> Gene -> Gene (Predict-style) without having it blow up with tons of results >.<)

Regarding the performance, this might be a question for @andrewsu and @erikyao. With the existing PFOCR API, I'm not sure of a good way to improve performance. If there's a large number of genes, the only way to get PFOCR data from the API is to make a large number of GET requests. I could just get results for the first 1k genes, but that seems wrong. Or maybe I could just download the entire PFOCR dataset from Dropbox if the number of genes exceeds a certain threshold?
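
For context on why a large gene list turns into many GET requests (the console logs above mention a query string length limit), here is a rough Python sketch of splitting CURIEs into batches that each fit a length budget. The function name and the `max_chars` value are assumptions; the real limit depends on the API and on URL encoding, which this sketch ignores.

```python
def batch_by_length(curies, max_chars):
    """Group CURIEs into batches whose space-joined length stays within `max_chars`."""
    batches, current, current_len = [], [], 0
    for curie in curies:
        # +1 accounts for the separator when the batch already has a term.
        added = len(curie) + (1 if current else 0)
        if current and current_len + added > max_chars:
            batches.append(current)
            current, current_len = [], 0
            added = len(curie)
        current.append(curie)
        current_len += added
    if current:
        batches.append(current)
    return batches


curies = ["NCBIGene:1", "NCBIGene:22", "NCBIGene:333"]
# Hypothetical budget of 22 characters, small enough to force a second batch.
for batch in batch_by_length(curies, max_chars=22):
    print(" ".join(batch))
```

A CURIE longer than the budget still gets its own batch, so no gene is silently dropped.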

@colleenXu
Contributor

colleenXu commented Jul 27, 2022

Feedback on this post:

Out of the options for logs, I like 1. 2+ CURIE matches: 72 PFOCR figures and 19 TRAPI results. Perhaps @tokebe could weigh in quickly?

On sorting / scoring: perhaps a quick / simple method is to prefer figures that have more matching QNodes. I noticed some answers where there'd be only 1 QNode listed, probably because there were multiple IDs corresponding to that QNode (an is_set:true situation, I think)....and I wondered if they were less relevant (I would expect them lower on the list). (And just a note, I guess any sorting would ideally happen before truncating the number of figures to 20...)
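
A minimal Python sketch of that idea, assuming each figure keeps the `nodes` list shown in the example response: sort by the number of matched QNodes first, then truncate. The function name is hypothetical and this is not the scoring approach being explored in the linked issue.

```python
MAX_FIGURES_PER_RESULT = 20  # cap agreed on earlier in this thread


def rank_and_truncate(figures, limit=MAX_FIGURES_PER_RESULT):
    """Prefer figures matching more QNodes, then keep at most `limit` of them."""
    # sorted() is stable, so figures with equal counts keep their original order.
    ranked = sorted(figures, key=lambda fig: len(fig["nodes"]), reverse=True)
    return ranked[:limit]  # slicing yields exactly `limit` items at most


figures = [
    {"pmc": "PMC1", "nodes": ["n0"]},
    {"pmc": "PMC2", "nodes": ["n0", "n1"]},
    {"pmc": "PMC3", "nodes": ["n1"]},
]
print([fig["pmc"] for fig in rank_and_truncate(figures)])
```

The PMC IDs here are placeholders. Sorting before slicing matters: truncating first could discard the figures that match the most QNodes.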

And Andrew is off this week. I haven't been involved in setting the requirements for this issue....so I'd like to defer to you and Andrew on whether the "notes in general"/current behavior is the desired behavior.

@colleenXu
Contributor

On this comment, I think it's worth revisiting the issue with Chunlei and Yao. If you're regularly trying to query with LOTs of gene IDs...to retrieve basically the entire dataset in the API....this does seem inefficient / odd / slow somehow....


some ideas/food for thought:

I wonder if it's possible to keep the pfocr data in a github file in the repo (compressed tabular data?), and basically run "filtering" operations on the data during the "pfocr augmenting" process...

And also wondering: is there any way to decrease the number of figures we're working with?

  • if there was a score for the quality of the text-mining / entity-recognition, we could work with only the best quality ones?
  • there's some subset of figures that we could use...

@tokebe
Member

tokebe commented Jul 27, 2022

Out of the options for logs, I like 1. 2+ CURIE matches: 72 PFOCR figures and 19 TRAPI results. Perhaps @tokebe could weigh in quickly?

I agree, I think this one makes the most sense.

@ariutta
Collaborator Author

ariutta commented Aug 3, 2022

The logging and PFOCR figure truncation are updated as of my latest commit.

@ariutta
Collaborator Author

ariutta commented Aug 3, 2022

The only blocker remaining is how to handle cases where we send a very large number of genes to the PFOCR API and/or where we receive a very large number of figures back. Proposed solutions:

  1. Once the PFOCR API uses this PR, try POST queries instead of GET. The maximum number of genes I can submit in a single GET request is about 80. How many should I be able to submit in a single POST request without overloading the server?
  2. When we only have two gene QNodes, we could get away with just sending the CURIEs associated with one of them, because both QNodes must be matched. So if we pick the QNode with fewer associated CURIEs, we can send fewer genes to the PFOCR API.
  3. Use a Bloom filter to only request CURIEs that are in PFOCR. A Bloom filter is a probabilistic data structure that allows you to check set membership: it tells you whether the item definitely is not in the set or probably is in the set. When requiring a false positive rate of no more than 1%, we can load all 14,253 PFOCR gene CURIEs into a Bloom filter and serialize the data into a 17KB file (just 103B if gzipped). Then we can load the Bloom filter data once upon launching BTE and use it every time we process a query to check whether each gene CURIE in a TRAPI result is in PFOCR. If it's definitely not in PFOCR, we can skip submitting it in a PFOCR API query. Note: I also checked how large it would be to serialize Bloom filter data for every pair of gene CURIEs, and it's too large at 58MB gzipped (up to 10% false positive rate). The best I could get was using an alternative to the Bloom filter, the Xor filter: 867KB gzipped (up to 10% false positive rate).

Regardless of the items above, we will still get a very large number of figures for certain genes, e.g., ~5k for MAPK1.
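
Proposed solution 2 can be sketched in Python as follows. It relies on the assumption that the API returns each figure's full gene list, so the second QNode can still be checked locally; the function name and the fallback for other QNode counts are hypothetical.

```python
def curies_to_query(curies_by_qnode):
    """With exactly two gene QNodes, query only the QNode with fewer CURIEs.

    Any figure that matches both QNodes must contain a gene from the smaller
    one, so querying that side alone cannot miss a valid match.
    """
    if len(curies_by_qnode) == 2:
        smallest = min(curies_by_qnode, key=lambda q: len(curies_by_qnode[q]))
        return set(curies_by_qnode[smallest])
    # Fallback (not optimized in this sketch): send every CURIE.
    return set().union(*curies_by_qnode.values())


curies_by_qnode = {
    "n0": {"NCBIGene:211"},
    "n1": {"NCBIGene:5599", "NCBIGene:3855", "NCBIGene:26995"},
}
print(curies_to_query(curies_by_qnode))
```

For a one-hop query that starts from a single gene ID, this reduces the request to one gene, although the number of figures returned for that gene can still be large.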

@colleenXu
Contributor

On 1) you could try 100 (we do that right now with some other pending BioThings APIs)? and maybe more depending on how the retrieval of documents in a POST query goes with that PR

On 2) sounds like a good idea to me

On 3) I dunno. Sounds kinda like a good idea, but I'm not sure.

@ariutta
Collaborator Author

ariutta commented Aug 4, 2022

Regarding bullet point 3), I pushed an experimental branch add_pfocr_bloom to demo Bloom and XOR filters. The code pulls the full PFOCR dataset from Dropbox and loads filters with either all gene CURIEs or all gene CURIE pairs in PFOCR. I tried both Bloom (two implementations) and XOR filters. Generating the filter data for the pairs does take a while, but it would only need to be calculated once (until the next PFOCR dataset update).

Some stats on the current PFOCR dataset:

  • unique genes: 14,253
    • plain text list file size: 204K (41K gzipped)
    • bloom-filters.XorFilter serialized data file size: 120K (25K gzipped)
    • bloom-filters.BloomFilter serialized data file size: 22K (184B gzipped) at a maximum false positive rate of 1%
    • bloomit.BloomFilter serialized data file size: 17K (108B gzipped) at a maximum false positive rate of 1%
  • unique gene pairs: 4,630,449
    • bloom-filters.XorFilter serialized data file size: 38M (7.9M gzipped)
    • bloom-filters.BloomFilter serialized data file size
      • 7.1M (5.3M gzipped) at a maximum false positive rate of 1%
      • 3.5M (2.7M gzipped) at a maximum false positive rate of 10%
    • bloomit.BloomFilter serialized data file size
      • 71M (20M gzipped) at a maximum false positive rate of 1%
      • 35M (10M gzipped) at a maximum false positive rate of 10%
  • the full PFOCR API dataset file size: 41M (9.9M gzipped)

If you want to use a Bloom or XOR filter, you could pick one and remove the code for the others. Then run generateFilter() to get a serialized data file and commit it to the repo. Add some code so that every time the server is launched, it would parse the file and load it into the filter. When checking for matches between PFOCR figures and TRAPI results, check the gene CURIEs in the TRAPI result against the filter before querying the PFOCR API.
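
The load-once, check-per-query flow described above could look roughly like the sketch below. It uses a tiny hand-rolled filter so that it is self-contained; `TinyBloomFilter` and `curiesWorthQuerying` are hypothetical names, and this is not the API of bloom-filters or bloomit.

```javascript
// Illustrative only: a tiny in-memory Bloom filter. A real deployment would
// more likely use one of the libraries compared above and load serialized data.
class TinyBloomFilter {
  constructor(bitCount, hashCount) {
    this.bitCount = bitCount;
    this.hashCount = hashCount;
    this.bits = new Uint8Array(Math.ceil(bitCount / 8));
  }

  // FNV-1a style 32-bit string hash with a seed mixed into the offset basis.
  static fnv1a(str, seed) {
    let h = (0x811c9dc5 ^ seed) >>> 0;
    for (let i = 0; i < str.length; i++) {
      h ^= str.charCodeAt(i);
      h = Math.imul(h, 0x01000193) >>> 0;
    }
    return h;
  }

  // Derive hashCount bit positions via double hashing: (h1 + i * h2) mod m.
  positions(item) {
    const h1 = TinyBloomFilter.fnv1a(item, 0);
    const h2 = (TinyBloomFilter.fnv1a(item, 0x9e3779b9) | 1) >>> 0; // odd, so never 0
    const out = [];
    for (let i = 0; i < this.hashCount; i++) {
      out.push((h1 + i * h2) % this.bitCount);
    }
    return out;
  }

  add(item) {
    for (const pos of this.positions(item)) {
      this.bits[pos >> 3] |= 1 << (pos & 7);
    }
  }

  // false => definitely absent; true => probably present.
  has(item) {
    return this.positions(item).every(
      (pos) => (this.bits[pos >> 3] & (1 << (pos & 7))) !== 0
    );
  }
}

// Built once at server start (here from a short in-memory list; in practice
// from the serialized file committed to the repo).
const pfocrGenes = ["NCBIGene:5599", "NCBIGene:211", "NCBIGene:10628"];
const filter = new TinyBloomFilter(1024, 7);
pfocrGenes.forEach((curie) => filter.add(curie));

// Per query: keep only CURIEs that might be in PFOCR before calling the API.
function curiesWorthQuerying(curies, bloom) {
  return curies.filter((curie) => bloom.has(curie));
}
```

A Bloom filter never produces false negatives, so every CURIE that is in PFOCR survives the pre-filter; the only cost of a false positive is one unnecessary entry in the API query.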

@tokebe
Member

tokebe commented Sep 7, 2022

After verifying functional parity, I've pushed my changes.

Still to do is figure scoring.

I've also confirmed that BTE still augments explain queries, and did additional performance testing based on these tests:

With PR before my changes -> after my changes, in minutes:
A: 3.02 -> 1.67
B: 2.65 -> 2.12
C: 5.38 -> 4.14
D: 1.02 -> 0.36 (21.8 seconds)

So, significant performance increases thanks to the intensive stuff being offloaded to the API!

Tagging @andrewsu @colleenXu to review in whatever capacity they see fit.

@tokebe
Member

tokebe commented Sep 21, 2022

For a detailed example, given query D:

{
  "message": {
    "query_graph": {
      "edges": {
        "e00": {
          "subject": "A",
          "object": "B"
        },
        "e01": {
          "subject": "C",
          "object": "B"
        }
      },
      "nodes": {
        "A": {
          "ids": [
            "NCBIGene:22877"
          ],
          "categories": [
            "biolink:Gene"
          ],
          "name": "MLXIP"
        },
        "B": {
          "categories": [
            "biolink:Gene"
          ]
        },
        "C": {
          "ids": [
            "NCBIGene:1490"
          ],
          "categories": [
            "biolink:Gene"
          ],
          "name": "CTGF"
        }
      }
    }
  }
}

We get the attached file as a response (1.4 MB, compressed as zip):
output.zip

@andrewsu
Member

That example output file looks great to me. In terms of functionality, I think this is ready to deploy to dev. (If it were easy to cap the number of PFOCR results returned per TRAPI result at 20, I think that would be good. But that's a minor point that shouldn't hold up deployment...)

@tokebe
Member

tokebe commented Sep 28, 2022

Ah, it appears I had disabled figure limits during testing and forgot to re-enable. I’ll enable that before deploying to dev.

@codecov

codecov bot commented Sep 28, 2022

Codecov Report

Merging #109 (6f4c4e7) into main (7a4f2db) will decrease coverage by 1.29%.
The diff coverage is 63.06%.

@@            Coverage Diff             @@
##             main     #109      +/-   ##
==========================================
- Coverage   53.29%   51.99%   -1.30%     
==========================================
  Files          26       27       +1     
  Lines        2289     2577     +288     
==========================================
+ Hits         1220     1340     +120     
- Misses       1069     1237     +168     

| Impacted Files | Coverage Δ |
|---|---|
| src/results_assembly/score.js | 63.63% <ø> (ø) |
| src/results_assembly/pfocr.js | 61.81% <61.81%> (ø) |
| src/results_assembly/query_results.js | 55.82% <80.00%> (ø) |
| src/index.js | 64.28% <100.00%> (+1.26%) ⬆️ |
| src/utils.js | 65.51% <0.00%> (+17.24%) ⬆️ |


@colleenXu
Contributor

Note: After checking out this branch and compiling, I encountered this message when running BTE:

module missing
➜  bte-trapi-workspace git:(main) ✗ USE_THREADING=false npm start

> start
> ./scripts/start_server.sh


> bte-trapi-monorepo@ compile /Users/colleenxu/Desktop/bte-trapi-workspace
> tsc -b tsconfig.build.json


> @biothings-explorer/[email protected] debug /Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/bte-trapi
> DEBUG=biomedical-id-resolver,bte* nodemon --ignore './data/' src/server.js

[nodemon] 2.0.20
[nodemon] to restart at any time, enter `rs`
[nodemon] watching path(s): *.*
[nodemon] watching extensions: js,mjs,json
[nodemon] starting `node src/server.js`
  bte:biothings-explorer-trapi:EdgeReverse BioLink-model class is initiated. +0ms
node:internal/modules/cjs/loader:985
  const err = new Error(message);
              ^

Error: Cannot find module 'chi-square-p-value'
Require stack:
- /Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/results_assembly/pfocr.js
- /Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/results_assembly/query_results.js
- /Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/index.js
- /Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/bte-trapi/src/server.js
    at Function.Module._resolveFilename (node:internal/modules/cjs/loader:985:15)
    at Function.Module._load (node:internal/modules/cjs/loader:833:27)
    at Module.require (node:internal/modules/cjs/loader:1057:19)
    at require (node:internal/modules/cjs/helpers:103:18)
    at Object.<anonymous> (/Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/results_assembly/pfocr.js:6:30)
    at Module._compile (node:internal/modules/cjs/loader:1155:14)
    at Object.Module._extensions..js (node:internal/modules/cjs/loader:1209:10)
    at Module.load (node:internal/modules/cjs/loader:1033:32)
    at Function.Module._load (node:internal/modules/cjs/loader:868:12)
    at Module.require (node:internal/modules/cjs/loader:1057:19) {
  code: 'MODULE_NOT_FOUND',
  requireStack: [
    '/Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/results_assembly/pfocr.js',
    '/Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/results_assembly/query_results.js',
    '/Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/query_graph_handler/built/index.js',
    '/Users/colleenxu/Desktop/bte-trapi-workspace/packages/@biothings-explorer/bte-trapi/src/server.js'
  ]
}
[nodemon] app crashed - waiting for file changes before starting...

So I ran "npm i" which installed 1 package and changed 2 other packages.

@colleenXu
Contributor

colleenXu commented Oct 19, 2022

EDITED 2022-10-24 after discussion with Andrew:

Feedback for devs

  1. When a query doesn't trigger PFOCR augmenting, the TRAPI log below still appears in the response, which is confusing.
    • If the intent of this log is "starting the PFOCR-augmenting module and checking whether this query meets the criteria for augmenting", then perhaps the "message" could be changed to: "Checking whether this query meets criteria for augmenting results with PFOCR figure info"
    • Perhaps there could also be a TRAPI-level log that makes it clear that the augmenting didn't happen: "Query did not meet criteria: BTE will not augment results with PFOCR figure info"
the TRAPI log
        {
            "timestamp": "2022-10-19T05:25:17.893Z",
            "level": "DEBUG",
            "message": "Enriched TRAPI results with PFOCR figures",
            "code": null
        },
  2. When there's a query that DOES run the PFOCR-augmenting, there's only 1 TRAPI log in the response (the one above). Perhaps we can provide more info in the TRAPI-level logs?
    • How many results were augmented with a pfocr section? (And how many matching figures were there?)
    • How many pfocr sections were truncated?
  3. ⭐ For each figure (item) in a pfocr section, could we list the Genes that were in the figure + the result? Either by CURIE or by name.
    • This helps me when I'm double-checking whether this code is working "correctly"
    • This helps users see what Genes in the result should be in the figure (sometimes there's a lot of entities mapped to a QNode).
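
To make the log-related suggestions concrete, here is a minimal sketch of how such TRAPI-level log entries could be assembled. The entry shape (timestamp, level, message, code) mirrors the log shown above; the function name `buildPfocrLogs`, the `stats` fields, and the wording of the summary messages are assumptions, not existing BTE code.

```javascript
// Hypothetical sketch: build TRAPI-level log entries summarizing PFOCR
// augmentation. Function name, stats fields, and message wording are assumed.
function buildPfocrLogs(stats, now = new Date()) {
  const entry = (message) => ({
    timestamp: now.toISOString(),
    level: "DEBUG",
    message,
    code: null,
  });

  if (!stats.criteriaMet) {
    return [
      entry("Query did not meet criteria: BTE will not augment results with PFOCR figure info"),
    ];
  }
  return [
    entry(
      `Augmented ${stats.augmentedResults} of ${stats.totalResults} results with PFOCR figures ` +
        `(${stats.matchedFigures} matching figures)`
    ),
    entry(
      `Truncated the pfocr section of ${stats.truncatedResults} results at ${stats.figureLimit} figures`
    ),
  ];
}

// Example using the counts from the Case A console logs later in this post.
const caseALogs = buildPfocrLogs({
  criteriaMet: true,
  totalResults: 2376,
  augmentedResults: 83,
  matchedFigures: 59,
  truncatedResults: 21,
  figureLimit: 20,
});
```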

Are requirements met?

Requirements from this post of the issue

I'm not sure whether the "scoring" requirements are met, i.e., that the scoring is good and that higher scores are better. However, I reviewed the top figures for some results, and they looked relevant. It'll be easier to see whether the figures were correctly annotated with the genes once Feedback point 3 (adding Gene IDs/names) is done.

I think requirements are met for:

  • Execution:
    • Pfocr section is on results with NCBIGene IDs mapped to >= 2 QNodes
    • Figure retrieval involved scrolling/templated/special-logic POST queries
    • There are TRAPI-level / console logs. But I have feedback on that (see above section)
  • In the response:
    • Each pfocr entry has figureUrl, pmc, nodes, and score fields.
    • Returns max of 20 (top-scoring) figures

Noticed some results that fulfill the criteria but don't have a pfocr section. It turns out that (correctly) no matching figures were found.

See the example below from Case B.

example: has 2 QNodes with NCBIGene IDs, but no pfocr section
            {
                "node_bindings": {
                    "n0": [
                        {
                            "id": "PUBCHEM.COMPOUND:9210"
                        }
                    ],
                    "n1": [
                        {
                            "id": "NCBIGene:7384"
                        }
                    ],
                    "n2": [
                        {
                            "id": "NCBIGene:22877"
                        },
                        {
                            "id": "NCBIGene:1490"
                        }
                    ]
                },
                "edge_bindings": {
                    "e00": [
                        {
                            "id": "344ff1a9445124d6b35dfa824fc9270f"
                        }
                    ],
                    "e01": [
                        {
                            "id": "350a8b15b18e8e88af18a7c8075de9f6"
                        },
                        {
                            "id": "688a7ee29cd8908484a2cd30371c8349"
                        }
                    ]
                },
                "score": 1.3847382974915614
            },
  • n1 is NCBIGene:7384 aka UQCRC1. I searched PFOCR's figures for this gene, and didn't find the other two genes (22877 and 1490)
  • one of n2's genes is NCBIGene:22877 aka MLXIP. I searched PFOCR's figures for this gene, and didn't find the other n2 gene (1490)
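
The manual check above amounts to asking whether any single figure mentions at least two of the result's genes. A minimal sketch of that rule follows; the function names and the `genes` field on each figure are made up for illustration, and the two figures are invented test data rather than real PFOCR records.

```javascript
// Hypothetical sketch of the matching rule: a figure is kept for a result only
// if it mentions at least two of the result's NCBIGene IDs.
function geneIdsInResult(result) {
  const ids = new Set();
  for (const bindings of Object.values(result.node_bindings)) {
    for (const { id } of bindings) {
      if (id.startsWith("NCBIGene:")) ids.add(id.slice("NCBIGene:".length));
    }
  }
  return ids;
}

function matchingFigures(result, figures, minMatches = 2) {
  const resultGenes = geneIdsInResult(result);
  return figures.filter(
    (figure) => figure.genes.filter((gene) => resultGenes.has(gene)).length >= minMatches
  );
}

// Node bindings copied from the example result above.
const exampleResult = {
  node_bindings: {
    n0: [{ id: "PUBCHEM.COMPOUND:9210" }],
    n1: [{ id: "NCBIGene:7384" }],
    n2: [{ id: "NCBIGene:22877" }, { id: "NCBIGene:1490" }],
  },
};
// Invented figures: each mentions only one of the result's genes.
const inventedFigures = [
  { figureId: "fig-a", genes: ["7384", "5599"] },
  { figureId: "fig-b", genes: ["22877", "211"] },
];
console.log(matchingFigures(exampleResult, inventedFigures)); // [] -> no pfocr section
```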

Recording queries used for testing

PFOCR-augmenting

Cases A, B are also examples of PFOCR-augmenting being useful.

Case A

1-hop Predict: Gene NCBIGene:23595 (ORC3) -> NamedThing. Inspired by a PMI case and testing the case described in the last point of this post.

Some results have an entity with an NCBIGene ID mapped to the NamedThing QNode - so they were used for PFOCR-augmenting.

Response
example1.txt

Case A query and console logs

query

{
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "n0",
                    "object": "n1"
                }
            },
            "nodes": {
                "n0": {                   
                    "ids": ["NCBIGene:23595"],
                    "categories": ["biolink:Gene"]
                },
                "n1": {
                    "categories": ["biolink:NamedThing"]
                }
            }
        }
    }
}

console logs

  bte:biothings-explorer-trapi:QueryResult Got 2376 TRAPI result(s) +58ms
  bte:biothings-explorer-trapi:pfocr QNode(s) having CURIEs that PFOCR could potentially match: n0,n1 +17m
  bte:biothings-explorer-trapi:pfocr Getting PFOCR figure data +5ms
  bte:biothings-explorer-trapi:pfocr Batch window 0-1000: 1375 hits retrieved for PFOCR figure data +552ms
  bte:biothings-explorer-trapi:pfocr 59 total PFOCR figure hits retrieved +1ms
  bte:biothings-explorer-trapi:pfocr 59 PFOCR figures match at least 2 genes from any TRAPI result +0ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 4999 +7ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 5599 +13ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 8317 +5ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 5000 +3ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 4176 +4ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 990 +6ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 8318 +3ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 7157 +2ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 4173 +6ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 254394 +3ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 84515 +5ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 1019 +5ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 1017 +1ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 10926 +9ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 4172 +4ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 4171 +4ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 4175 +4ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 4174 +3ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 4254 +3ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 1032 +3ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 23595 55869 +5ms
  bte:biothings-explorer-trapi:pfocr 2+ CURIE matches: 59 PFOCR figures and 83 TRAPI results +3ms
  bte:biothings-explorer-trapi:QueryResult Successfully scored 356 results, couldn't score 2020 results. +664ms
  bte:biothings-explorer-trapi:Graph pruning BTEGraph nodes/edges... +7s
  bte:biothings-explorer-trapi:Graph pruned 0 nodes and 0 edges from BTEGraph. +7ms
  bte:biothings-explorer-trapi:main (14) TRAPI query finished. +7s
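
The "Truncating PFOCR figures at 20" lines in the log above correspond to a per-result cap on the figures returned. A minimal sketch of such a cap, assuming each figure carries a numeric `score` and that higher is better (`capFigures` is a hypothetical name, not the function used in pfocr.js):

```javascript
// Hypothetical sketch: keep only the top-scoring figures for one TRAPI result.
// The default limit of 20 comes from the log lines above.
function capFigures(figures, limit = 20) {
  // Copy before sorting so the caller's array is left untouched.
  return [...figures].sort((a, b) => b.score - a.score).slice(0, limit);
}

// Invented scores, capped at 2 for brevity.
const topTwo = capFigures(
  [
    { pmc: "PMC-a", score: 0.2 },
    { pmc: "PMC-b", score: 0.9 },
    { pmc: "PMC-c", score: 0.5 },
  ],
  2
);
console.log(topTwo.map((figure) => figure.pmc)); // [ 'PMC-b', 'PMC-c' ]
```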

Case B

2-hop Explain: quinazoline -> Gene <- TXNIP-related gene set. From the July Translator QotM (last point in this post).

Also tested as Query C in this earlier post.

Response
example2.txt

Case B query and console logs

query

{
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e01": {
                    "subject": "n2",
                    "object": "n1"
                }
            },
            "nodes": {
                "n0": {                   
                    "ids": ["UMLS:C0034407"],
                    "categories": ["biolink:SmallMolecule"],
                    "name": "Quinazolines"
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                },
                "n2": {
                    "ids": ["NCBIGene:10628", "NCBIGene:22861", "NCBIGene:51085",
                            "NCBIGene:1490", "NCBIGene:389692", "NCBIGene:3480",
                            "NCBIGene:598", "NCBIGene:2308", "NCBIGene:22877", "NCBIGene:2033"],
                    "categories": ["biolink:Gene"],
                    "is_set": true,
                    "name": "TXNIP, NLRP1, MLXIPL, CTGF, MAFA, IGF1R, BCL2L1, FOXO1, MLXIP, EP300"
                }
            }
        }
    }
}

Console logs

  bte:biothings-explorer-trapi:QueryResult Got 30 TRAPI result(s) +2ms
  bte:biothings-explorer-trapi:pfocr QNode(s) having CURIEs that PFOCR could potentially match: n1,n2 +5m
  bte:biothings-explorer-trapi:pfocr Getting PFOCR figure data +0ms
  bte:biothings-explorer-trapi:pfocr Batch window 0-1000: 5348 hits retrieved for PFOCR figure data +2s
  bte:biothings-explorer-trapi:pfocr 2172 total PFOCR figure hits retrieved +5ms
  bte:biothings-explorer-trapi:pfocr 2172 PFOCR figures match at least 2 genes from any TRAPI result +1ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 1950 2033 2308 3480 598 22877 1490 +14ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 4609 2033 3480 22861 2308 598 10628 +31ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 1956 51085 3480 1490 598 22877 10628 2033 +29ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 6774 2033 1490 3480 2308 22877 10628 598 +19ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 7124 10628 1490 2308 22877 3480 598 2033 +23ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 2033 2308 10628 +23ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 1017 2033 2308 598 3480 +4ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 2033 10628 2308 +20ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 1432 2033 598 389692 2308 +4ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 7040 1490 22877 2033 10628 2308 +23ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 10013 2033 2308 3480 22877 10628 +18ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 10628 598 2033 1490 2308 +26ms
  bte:biothings-explorer-trapi:pfocr Truncating PFOCR figures at 20 for TRAPI result w/ 6934 2308 2033 +8ms
  bte:biothings-explorer-trapi:pfocr 2+ CURIE matches: 2172 PFOCR figures and 20 TRAPI results +8ms
  bte:biothings-explorer-trapi:QueryResult Successfully scored 30 results, couldn't score 0 results. +2s

Some queries that aren't use cases (for testing only)

From this earlier post, looking at performance. Previously used for testing by Jackson as well.

A is from Feb-March QotM. B-E are inspired by July QotM.

A. Valproic acid -> Gene -> heme -> ALAS1

Took 2 min 6 sec to run. 525 results total -> 121 are augmented with PFOCR section.

{
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e01": {
                    "subject": "n1",
                    "object": "n2"
                },
                "e02": {
                    "subject": "n2",
                    "object": "n3"
                }
            },
            "nodes": {
                "n0": {
                    "ids": ["PUBCHEM.COMPOUND:3121"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                },
                "n2": {                   
                    "ids": ["CHEBI:30413"],
                    "categories": ["biolink:SmallMolecule"]
                },
                "n3": {
                    "ids": ["NCBIGene:211"],
                    "categories": ["biolink:Gene"]
                }
            }
        }
    }
}

B. Gene TXNIP -> Gene

Took 14 sec to run. 502 results total -> 147 are augmented with PFOCR section.

{
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "n0",
                    "object": "n1"
                }
            },
            "nodes": {
                "n0": {                   
                    "ids": ["NCBIGene:10628"],
                    "categories": ["biolink:Gene"],
                    "name": "TXNIP"
                },
                "n1": {
                    "categories": ["biolink:Gene"]
                }
            }
        }
    }
}

D. Type 2 Diabetes -> BiologicalEntity <- TXNIP

Took 6 min 1 sec to run. 577 results total -> 107 are augmented with PFOCR section.

{
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e01": {
                    "subject": "n2",
                    "object": "n1"
                }
            },
            "nodes": {
                "n0": {                   
                    "ids": ["MONDO:0005148"],
                    "categories": ["biolink:DiseaseOrPhenotypicFeature"],
                    "name": "diabetes 2"
                },
                "n1": {
                    "categories": ["biolink:BiologicalEntity"]
                },
                "n2": {
                    "ids": ["NCBIGene:10628"],
                    "categories": ["biolink:Gene"],
                    "name": "TXNIP"
                }
            }
        }
    }
}

E. Gene MLXIP -> Gene <- Gene CTGF

Took 20 sec to run. 86 results total -> 65 are augmented with PFOCR section.

{
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "A",
                    "object": "B"
                },
                "e01": {
                    "subject": "C",
                    "object": "B"
                }
            },
            "nodes": {
                "A": {                   
                    "ids": ["NCBIGene:22877"],
                    "categories": ["biolink:Gene"],
                    "name": "MLXIP"
                },
                "B": {
                    "categories": ["biolink:Gene"]
                },
                "C": {
                    "ids": ["NCBIGene:1490"],
                    "categories": ["biolink:Gene"],
                    "name": "CTGF"
                }
            }
        }
    }
}

PFOCR-augmenting didn't happen

Checking that PFOCR-augmenting is only happening in certain situations. Requirements from this post of the issue

Case 1

1 Gene QNode with is_set: true (and no IDs provided). This query is used in the creative-mode run for this disease (it's the 3rd template). Previously discussed here and here.

query

{
    "message": {
        "query_graph": {
            "nodes": {
                "creativeQueryObject": {
                    "ids":["MONDO:0007035"],
                    "categories":["biolink:Disease"],
                    "name": "acanth"
               },
                "nA": {
                    "categories":["biolink:Gene"],
                    "is_set": true
                },
                "creativeQuerySubject": {
                    "categories":["biolink:ChemicalEntity"]
                }
            },
            "edges": {
                "eA": {
                    "subject": "creativeQueryObject",
                    "object": "nA",
                    "predicates": ["biolink:caused_by"]
                },
                "eB": {
                    "subject": "nA",
                    "object": "creativeQuerySubject",
                    "predicates": ["biolink:entity_regulated_by_entity"]
                }
            }
        }
    }
}

The console-logs correctly show that only 1 QNode maps to NCBIGene IDs in the results. It looks like PFOCR-augmenting didn't happen (no more console logs regarding PFOCR).

  bte:biothings-explorer-trapi:QueryResult Got 1196 TRAPI result(s) +110ms
  bte:biothings-explorer-trapi:pfocr QNode(s) having CURIEs that PFOCR could potentially match: nA +0ms
  bte:biothings-explorer-trapi:QueryResult Successfully scored 1196 results, couldn't score 0 results. +4ms
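
The gating seen in this log (only one QNode with gene CURIEs, so no augmenting) could be sketched as below. This is an assumption about the rule based on the requirement that at least 2 QNodes have NCBIGene IDs; the function names are hypothetical and the bound IDs in the example are invented.

```javascript
// Hypothetical sketch of the gating rule: collect QNodes that have at least one
// NCBIGene CURIE bound in any result, and only augment when there are 2 or more.
function qnodesWithGeneCuries(results) {
  const qnodes = new Set();
  for (const result of results) {
    for (const [qnodeId, bindings] of Object.entries(result.node_bindings)) {
      if (bindings.some(({ id }) => id.startsWith("NCBIGene:"))) qnodes.add(qnodeId);
    }
  }
  return qnodes;
}

function shouldAugmentWithPfocr(results, minQNodes = 2) {
  return qnodesWithGeneCuries(results).size >= minQNodes;
}

// Shaped like a Case 1 result; the gene and chemical IDs are invented.
const caseOneLike = [
  {
    node_bindings: {
      creativeQueryObject: [{ id: "MONDO:0007035" }],
      nA: [{ id: "NCBIGene:10628" }, { id: "NCBIGene:1490" }],
      creativeQuerySubject: [{ id: "PUBCHEM.COMPOUND:3121" }],
    },
  },
];
console.log(shouldAugmentWithPfocr(caseOneLike)); // false: only nA has gene CURIEs
```

Under this rule, several genes bound to a single is_set QNode still count as one QNode, so such a result would not be augmented.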

Case 2

1 Gene QNode with is_set: true AND multiple IDs provided.

From July Translator QotM (last query of this post, which I "didn't run through ARS")

response
example2no.txt

Case 2 query and console logs

query

{
    "message": {
        "query_graph": {
            "edges": {
                "e00": {
                    "subject": "n0",
                    "object": "n1"
                },
                "e01": {
                    "subject": "n2",
                    "object": "n1"
                }
            },
            "nodes": {
                "n0": {                   
                    "ids": ["UMLS:C0038760"],
                    "categories": ["biolink:SmallMolecule"],
                    "name": "Sulfonamides"
                },
                "n1": {
                    "categories": ["biolink:DiseaseOrPhenotypicFeature", "biolink:BiologicalProcessOrActivity"]
                },
                "n2": {
                    "ids": ["NCBIGene:10628", "NCBIGene:22861", "NCBIGene:51085",
                            "NCBIGene:1490", "NCBIGene:389692", "NCBIGene:3480",
                            "NCBIGene:598"],
                    "categories": ["biolink:Gene"],
                    "is_set": true,
                    "name": "TXNIP, NLRP1, MLXIPL, CTGF, MAFA, IGF1R, BCL2L1"
                }
            }
        }
    }
}

The console-logs correctly show that only 1 QNode maps to NCBIGene IDs in the results. It looks like PFOCR-augmenting didn't happen (no more console logs regarding PFOCR).

 bte:biothings-explorer-trapi:QueryResult Got 146 TRAPI result(s) +8ms
  bte:biothings-explorer-trapi:pfocr QNode(s) having CURIEs that PFOCR could potentially match: n2 +10m
  bte:biothings-explorer-trapi:QueryResult Successfully scored 144 results, couldn't score 2 results. +0ms
  bte:biothings-explorer-trapi:Graph pruning BTEGraph nodes/edges... +5s
  bte:biothings-explorer-trapi:Graph pruned 0 nodes and 0 edges from BTEGraph. +0ms
  bte:biothings-explorer-trapi:main (14) TRAPI query finished. +5s

However, there are results that have multiple NCBIGene IDs mapped to 1 QNode (n2) - so maybe theoretically pfocr-augmenting could be done on them. This is 1 example:

            {
                "node_bindings": {
                    "n0": [
                        {
                            "id": "UMLS:C0038759"
                        }
                    ],
                    "n1": [
                        {
                            "id": "MONDO:0002118"
                        }
                    ],
                    "n2": [
                        {
                            "id": "NCBIGene:22861"
                        },
                        {
                            "id": "NCBIGene:1490"
                        }
                    ]
                },
                "edge_bindings": {
                    "e00": [
                        {
                            "id": "901e3f7f26ba45faa17917f31452be6d"
                        },
                        {
                            "id": "ae959a73220b08d04dd83e2278d79dea"
                        }
                    ],
                    "e01": [
                        {
                            "id": "c16d90ebf485587fdface60065f56604"
                        },
                        {
                            "id": "5ac509c7cb0eae92bb9002dbfb6cea92"
                        }
                    ]
                },
                "score": 4.838056316067967
            },

@colleenXu
Contributor

colleenXu commented Oct 20, 2022

EDITED 2022-10-24 after discussion with Andrew:

Tasks for devs:

  • ⭐ Point 3 from the "Feedback for devs" section of the above post, also pasted here: For each figure (item) in a pfocr section, could we list the Genes that were in the figure + the result? Either by CURIE or by name.
  • ⭐ Only allow this behavior for the v1/query endpoint (and v1/asyncquery). Those are the "BTE as ARA" endpoints, and the reasons for this are similar to those for scoring in this issue: disable scoring for "KP endpoints" (by-api and by-team), biothings_explorer#520
  • optional: the log-related points from the "Feedback for devs" section of the above post
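
As a rough illustration of the two starred tasks, the sketch below attaches the shared gene CURIEs to each pfocr entry and gates augmenting by endpoint path. Everything here is hypothetical: `matchedCuries`, the figure's `genes` field, `toPfocrEntry`, and `pfocrEnabledFor` are assumed names, and the figure data is a placeholder rather than a real PFOCR record.

```javascript
// Hypothetical sketch for the first starred task: list the gene CURIEs that
// appear in both the figure and the TRAPI result.
function toPfocrEntry(figure, resultGeneCuries) {
  const inResult = new Set(resultGeneCuries);
  return {
    figureUrl: figure.figureUrl,
    pmc: figure.pmc,
    score: figure.score,
    matchedCuries: figure.genes.filter((curie) => inResult.has(curie)),
  };
}

// Hypothetical sketch for the second starred task: only the "BTE as ARA"
// endpoints named above get PFOCR augmenting.
const ARA_ENDPOINTS = new Set(["/v1/query", "/v1/asyncquery"]);
function pfocrEnabledFor(path) {
  return ARA_ENDPOINTS.has(path);
}

// Placeholder figure (not a real PFOCR record).
const placeholderFigure = {
  figureUrl: "https://example.org/figure.jpg",
  pmc: "PMC0000000",
  score: 0.5,
  genes: ["NCBIGene:10628", "NCBIGene:2033", "NCBIGene:5599"],
};
const pfocrEntry = toPfocrEntry(placeholderFigure, ["NCBIGene:10628", "NCBIGene:2033"]);
console.log(pfocrEntry.matchedCuries); // [ 'NCBIGene:10628', 'NCBIGene:2033' ]
```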

@colleenXu
Contributor

colleenXu commented Oct 25, 2022

@tokebe I've edited both of my posts above...The next things to do are now listed in the post right before this one.


EDIT: And for myself, here's something to check after the first starred ⭐ task above is done ("adding Gene IDs/names")...

Even when there are 2 Gene QNodes, a result's PFOCR section says it only maps to 1 QNode's IDs. Is that correct? See the example below from Case B.

It's related to this PFOCR hit.

example: result with pfocr section mapped to 1 QNode
            {
                "node_bindings": {
                    "n0": [
                        {
                            "id": "PUBCHEM.COMPOUND:9210"
                        }
                    ],
                    "n1": [
                        {
                            "id": "UMLS:C0031727"
                        }
                    ],
                    "n2": [
                        {
                            "id": "NCBIGene:10628"
                        },
                        {
                            "id": "NCBIGene:2033"
                        }
                    ]
                },
                "edge_bindings": {
                    "e00": [
                        {
                            "id": "10369da438bed7deb5ae5c76d4690195"
                        }
                    ],
                    "e01": [
                        {
                            "id": "60e35d68e6283bdf4a86176cab5376a4"
                        },
                        {
                            "id": "047149fcdcc6fe8d56618551cb8f769f"
                        }
                    ]
                },
                "score": 1.9713651331873512,
                "pfocr": [
                    {
                        "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5973613/bin/pone.0198016.g007.jpg",
                        "pmc": "PMC5973613",
                        "nodes": [
                            "n2"
                        ],
                        "score": 0.73934
                    }
                ]
            },

Hmmm...it's interesting that Case 2 has similar results but PFOCR-augmenting doesn't happen, maybe because the query doesn't meet the criteria (>=2 QNodes have NCBIGene IDs).

@tokebe tokebe merged commit 6844aaf into main Dec 22, 2022
tokebe added a commit that referenced this pull request Dec 22, 2022
This reverts commit 6844aaf, reversing
changes made to 17ecb35.