Additional content extraction and annotations from pathway figures #16

AlexanderPico · 2019-11-13T02:00:49Z

for summarizing query path results during evaluation. For example, annotating pathway figures with terms from the Pathway Ontology will allow BTE users to filter genes by hierarchically-organized pathway classes.

AlexanderPico · 2020-03-31T18:51:29Z

@ariutta Shall we dive into this one? I was thinking we could add:

1. PMIDs
2. Pathway Ontology terms (IDs and names)
3. Disease Ontology terms (IDs and names)

You have 1 already. We have the start of 3 (from Jensen). And we should brainstorm on 2.

Other ideas?

AlexanderPico · 2020-04-06T22:29:09Z

@khanspers Why don't you try getting GO:BP terms associated with PFOCR results.

AlexanderPico · 2020-04-06T22:32:28Z

@ariutta I'll send you a TSV with figid mappings to disease terms. Can you work with Kevin to get consensus on the formatting?

ariutta · 2020-04-08T20:32:30Z

@kevinxin90, we're looking to add more fields (PMID, Pathway Ontology, Disease Ontology) to our data for BTE. What do you think about the format below (the values are just placeholders)? The keys match http://identifiers.org/ prefixes in order to ensure they stay consistent and unique.

{"_id": "PMC0000000__nihm00000000",
 "associatedWith": {
    "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000000/bin/nihm00000000.jpg",
    "pmc": "PMC0000000",
    "pubmed": "00000000",
    "ncbigene": ["1000", "1001", "1002", "1003"],
    "PW": ["0000001", "0000002"],
    "DOID": ["10000", "10001"]
  }
}

kevinxin90 · 2020-04-08T22:00:01Z

@ariutta Hi Anders, this looks great to me!!

AlexanderPico · 2020-04-09T18:30:46Z

@ariutta, Here's a map of figid to doid and human readable terms:
https://www.dropbox.com/s/vgazcxdq4wsl5yk/pfocr_disease_map.tsv?dl=0

@kevinxin90, Should we include both DOID and term name? As a proposed "node property" the human readable term name may be more useful than DOID. But if you already have a mapping for Disease Ontology IDs and terms, then it's perhaps not necessary. Thoughts?

kevinxin90 · 2020-04-09T19:29:46Z

@AlexanderPico Yes, I think it's a good idea to include both DOID and term name in the API response. For BioThings Explorer, it doesn't matter whether you include the name or not (we will do the ID resolution internally). But for the API itself, it would be very useful (in case the user doesn't access it through BioThings Explorer).

ariutta · 2020-10-19T22:29:54Z

@kevinxin90, in this reporting period, @AlexanderPico and I have done more work to parse our PFOCR data, using APIs like PubTator to extract chemical and disease mentions in the OCR text. Notice that these disease mentions are pulled directly from the OCR text, unlike the disease associations Alex referenced above, which come from pathway enrichment analysis. I'm thinking of the former as disease mentions and the latter as disease annotations.

In order to include this additional information, I propose we use the following updated JSON format, but I'm open to suggestions. We also need to distinguish between mentions vs. annotations somehow, so I set the mentions as key/value pairs under associatedWith, and the annotations as key/value pairs under associatedWith.annotations.

{
  "_id": "PMC0000000__nihm00000000",
  "associatedWith": {
    "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000000/bin/nihm00000000.jpg",
    "pmc": "PMC0000000",
    "pubmed": "00000000",
    "ncbigene": [
      "1000",
      "1001",
      "1002",
      "1003"
    ],
    "disease:mesh": [
      "D000860"
    ],
    "chemical:mesh": [
      "D000431",
      "D000079",
      "D000085"
    ],
    "annotations": {
      "PW": [
        "0000001",
        "0000002"
      ],
      "DOID": [
        "10000",
        "10001"
      ]
    }
  }
}

(dummy data -- just showing the format)

Questions:

How do you want to distinguish mentions vs. annotations?
The PubTator API returns MeSH IDs, so that's why I used MeSH IDs for disease:mesh rather than DOIDs. Would it be worth translating them so that we use a single identifier for diseases?
Unlike the other identifiers, mesh covers genes, diseases, chemicals, etc. That's why I specified disease:mesh. Do you prefer something else?

ariutta · 2020-10-20T21:14:17Z

@andrewsu, you're also welcome to comment on this format, if you're interested, especially how to distinguish mentions vs. annotations.

kevinxin90 · 2020-10-20T22:38:37Z

Hi @ariutta, thanks for the update. Here are a few suggestions from me.

Group the mentions and annotations as two separate dictionaries.
Group the results based on the semantic type, e.g., gene, chemical, disease.
It's fine to keep mesh for results from PubTator.

Here is what I would propose:

{
  "_id": "PMC0000000__nihm00000000",
  "associatedWith": {
    "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000000/bin/nihm00000000.jpg",
    "pmc": "PMC0000000",
    "pubmed": "00000000",
    "annotations": {
      "pathway": {
        "PW": ["00001", "00002"]
      },
      "disease": {
        "DOID": ["DOID:0001", "DOID:000012"]
      }
    },
    "mentions": {
      "disease": {
        "mesh": "D000860"
      },
      "chemical": {
        "mesh": ["D0001", "D00002"]
      }
    }
  }
}
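For anyone converting existing records, here is a minimal Python sketch of how the flat layout from the Oct 19 proposal could be regrouped into the mentions/annotations layout suggested above. The `regroup` function and the two lookup tables are hypothetical helpers, not part of any PFOCR or BioThings code, and the choice of `gene` as the semantic type for `ncbigene` is an assumption.

```python
# Hypothetical mapping: flat key -> (semantic type, identifier prefix) for mentions.
MENTION_KEYS = {
    "ncbigene": ("gene", "ncbigene"),
    "disease:mesh": ("disease", "mesh"),
    "chemical:mesh": ("chemical", "mesh"),
}
# Hypothetical mapping: annotation prefix -> semantic type.
ANNOTATION_TYPES = {"PW": "pathway", "DOID": "disease"}


def regroup(record):
    """Return a copy of `record` with mentions and annotations grouped by semantic type."""
    flat = record["associatedWith"]
    # Provenance fields stay at the top level of associatedWith.
    grouped = {k: flat[k] for k in ("figureUrl", "pmc", "pubmed") if k in flat}
    for key, (sem_type, prefix) in MENTION_KEYS.items():
        if key in flat:
            grouped.setdefault("mentions", {}).setdefault(sem_type, {})[prefix] = list(flat[key])
    for prefix, ids in flat.get("annotations", {}).items():
        # Raises KeyError for prefixes not listed in ANNOTATION_TYPES.
        sem_type = ANNOTATION_TYPES[prefix]
        grouped.setdefault("annotations", {}).setdefault(sem_type, {})[prefix] = list(ids)
    return {"_id": record["_id"], "associatedWith": grouped}


flat_record = {
    "_id": "PMC0000000__nihm00000000",
    "associatedWith": {
        "pmc": "PMC0000000",
        "pubmed": "00000000",
        "ncbigene": ["1000", "1001"],
        "disease:mesh": ["D000860"],
        "chemical:mesh": ["D000431"],
        "annotations": {"PW": ["0000001"], "DOID": ["10000"]},
    },
}
grouped_record = regroup(flat_record)
```

The input record is left untouched, so both layouts can be compared side by side while the format is still under discussion.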

ariutta · 2020-10-20T23:01:44Z

@kevinxin90, that looks great. I just need to get you an updated file with this format.

ariutta · 2020-12-01T18:17:22Z

Hi @kevinxin90, I wanted to let you know I have the chemicals extracted from almost all the PFOCR pathway figures (63591 of 64643 because the PubTator API returned an error for some of them). I currently have them in TSV format (link included in case you want to take a sneak preview), but I'll get an updated file with the format we agreed on.

Summary Stats

All pathway figures

24873 figures with at least one chemical found
120262 retained matches, excluding duplicates of matched_ocr_text within the same figure
18427 unique matched_ocr_texts
6628 unique matched MeSH terms
5677 unique MeSH IDs

Just the pathway figures with 10+ genes:

10483 figures with at least one chemical found
39540 retained matches, excluding duplicates of matched_ocr_text within the same figure
7157 unique matched_ocr_texts
3367 unique matched MeSH terms
2941 unique MeSH IDs
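
As a rough illustration of how counts like these could be derived, here is a small Python sketch that drops duplicate matched_ocr_text values within a figure and then tallies the retained rows. Only `matched_ocr_text` is a name taken from the stats above; the `figid` and `mesh_id` column names, the `chemical_stats` function, and the sample rows are assumptions made for the example.

```python
import csv
import io


def chemical_stats(rows):
    """Deduplicate (figure, OCR text) pairs, then count figures, matches, texts and IDs."""
    seen = set()
    retained = []
    for row in rows:
        key = (row["figid"], row["matched_ocr_text"])
        if key in seen:
            continue  # duplicate OCR text within the same figure
        seen.add(key)
        retained.append(row)
    return {
        "figures": len({r["figid"] for r in retained}),
        "retained_matches": len(retained),
        "unique_ocr_texts": len({r["matched_ocr_text"] for r in retained}),
        "unique_mesh_ids": len({r["mesh_id"] for r in retained}),
    }


# Made-up sample rows in TSV form (placeholder values only).
sample_tsv = (
    "figid\tmatched_ocr_text\tmesh_id\n"
    "fig1\tchem_a\tD000001\n"
    "fig1\tchem_a\tD000001\n"
    "fig1\tchem_b\tD000002\n"
    "fig2\tchem_a\tD000001\n"
)
stats = chemical_stats(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))
```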

ariutta · 2020-12-04T02:20:31Z

@kevinxin90,

Here is an updated file in the format we discussed, including chemicals in addition to genes this time:
https://www.dropbox.com/s/m03hd447oi3yjz1/pfocr_biothings_65k_20201203.ndjson?dl=0

Summary Stats

for pathway figures having at least 3 chemicals and/or genes found

figure count: 55309
unique gene count: 13442
total gene count: 1069777
unique chemical count: 5560
total chemical count: 108340

Questions:

For identifiers that could be represented as integers, I'm still formatting them as strings, e.g., NCBI Gene or PubMed PMID. Let me know if you want them to be integers.
For cases where a figure had genes found but no chemicals (or vice versa), I still created an empty array [] for mentions.genes.ncbigene or mentions.chemicals.mesh. Let me know if you'd prefer those items to just not be included at all.
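
For reference, here is a minimal Python sketch of how summary stats like the ones above could be recomputed from an .ndjson file in this format. The `summarize` function is hypothetical, and it assumes "at least 3 chemicals and/or genes" means the combined count of gene and chemical mentions per figure; the sample lines are made up.

```python
import io
import json


def summarize(lines, min_entities=3):
    """Count figures, genes and chemicals for figures with enough mentions."""
    figure_count = total_genes = total_chemicals = 0
    unique_genes, unique_chemicals = set(), set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        mentions = json.loads(line)["associatedWith"].get("mentions", {})
        genes = mentions.get("genes", {}).get("ncbigene", [])
        chemicals = mentions.get("chemicals", {}).get("mesh", [])
        # Assumption: the threshold applies to genes and chemicals combined.
        if len(genes) + len(chemicals) < min_entities:
            continue
        figure_count += 1
        total_genes += len(genes)
        total_chemicals += len(chemicals)
        unique_genes.update(genes)
        unique_chemicals.update(chemicals)
    return {
        "figure count": figure_count,
        "unique gene count": len(unique_genes),
        "total gene count": total_genes,
        "unique chemical count": len(unique_chemicals),
        "total chemical count": total_chemicals,
    }


# Made-up NDJSON sample; an empty list stands in for "none found".
sample = io.StringIO(
    '{"_id": "a", "associatedWith": {"mentions": {"genes": {"ncbigene": ["1017", "1019"]}, "chemicals": {"mesh": ["D000431"]}}}}\n'
    '{"_id": "b", "associatedWith": {"mentions": {"genes": {"ncbigene": ["1017"]}, "chemicals": {"mesh": []}}}}\n'
    '{"_id": "c", "associatedWith": {"mentions": {"genes": {"ncbigene": ["1017", "595", "207"]}, "chemicals": {"mesh": []}}}}\n'
)
summary = summarize(sample)
```

Because the lookups fall back to empty lists, the same code works whether empty arrays are kept in the file or the keys are omitted entirely.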

kevinxin90 · 2020-12-04T17:52:10Z

@ariutta I found a couple of places where the value of pubmed is equal to 0. Does that mean the PubMed ID was not found?
Example:

{
    '_id': 'PMC6887797__10.1177_1758835919887665-fig1.jpg',
    'associatedWith': {
        'figureUrl': 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6887797/bin/10.1177_1758835919887665-fig1.jpg',
        'pmc': 'PMC6887797',
        'pubmed': 0,
        'mentions': {
            'chemicals': {
                'mesh': ['D004967']
            },
            'genes': {
                'ncbigene': ['10000', '1019', '1021', '144455', '1869', '1870', '1871', '1874', '1875', '1876', '1956', '2064', '2065', '207', '208', '2475', '595', '6198', '6199', '7248', '7249', '79733']
            }
        }
    }
}

ariutta · 2020-12-04T17:57:46Z

Good catch! Yes, you're right. I'll take a look at filling those in.

kevinxin90 · 2020-12-04T18:09:47Z

Great! For now, I have just removed those pubmed fields where they equal 0.
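
A cleanup along these lines might look like the following Python sketch. It is only an illustration of the idea, not the code actually used for the API; `drop_missing_pubmed` is a hypothetical helper, and treating the string "0" as missing too is an assumption.

```python
def drop_missing_pubmed(record):
    """Return a copy of `record` without associatedWith.pubmed when it is 0."""
    assoc = dict(record["associatedWith"])
    # Assumption: both the integer 0 and the string "0" mean "no PubMed ID found".
    if assoc.get("pubmed") in (0, "0"):
        del assoc["pubmed"]
    return {**record, "associatedWith": assoc}


with_zero = {"_id": "x", "associatedWith": {"pmc": "PMC6887797", "pubmed": 0}}
with_pmid = {"_id": "y", "associatedWith": {"pmc": "PMC0000000", "pubmed": "12345678"}}
cleaned_zero = drop_missing_pubmed(with_zero)
cleaned_pmid = drop_missing_pubmed(with_pmid)
```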

@ariutta The new API is currently up at: https://biothings.ncats.io/pfocr

Some example queries:

https://biothings.ncats.io/pfocr/geneset/PMC4671449__medscimonit-21-3736-g001.jpg

Query by Chemical MESH ID: https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.chemicals.mesh:D000431

Query By Gene NCBIGene ID: https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:1017
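
The query URLs above follow a `field:value` pattern, so they can be assembled programmatically. Here is a small Python sketch using only the standard library; `build_query_url` is a hypothetical helper, and the code only builds the string without contacting the API.

```python
from urllib.parse import urlencode

BASE_URL = "https://biothings.ncats.io/pfocr"


def build_query_url(field, value):
    """Build a /query URL of the form ?q=<field>:<value>."""
    # safe=":" keeps the colon unescaped so the URL matches the examples above.
    return f"{BASE_URL}/query?" + urlencode({"q": f"{field}:{value}"}, safe=":")


chemical_url = build_query_url("associatedWith.mentions.chemicals.mesh", "D000431")
gene_url = build_query_url("associatedWith.mentions.genes.ncbigene", "1017")
```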

AlexanderPico · 2020-12-05T00:07:39Z

@kevinxin90 Do the chemical MeSH IDs connect in BTE to any other entries, e.g., pharmaceutical products? It would be nice to mention a query path in the annual report that is enabled by this new deposition of chemicals in pathway figures.

AlexanderPico · 2020-12-09T00:40:37Z

@kevinxin90 Is there a particular place we should upload future .ndjson files, i.e., somewhere that is periodically checked and queued up for BTE integration? Or is this something we should work out next year?

kevinxin90 · 2020-12-09T01:14:02Z

@AlexanderPico Hi Alex, if it's small, you could just upload it to this GitHub repository. For us, we only need a static URL whose last-modified date we can track. It could also be Google Cloud Storage or another cloud storage service. You could view an example here.
In the example I showed above, if you look at line 19, we set a cron job to check for updates every Monday, and that one is hosted on Google Cloud API.

kevinxin90 · 2020-12-09T01:18:49Z

@kevinxin90 Do the chemical MeSH IDs connect in BTE to any other entries, e.g., pharmaceutical products? It would be nice to mention a query path in the annual report that is enabled by this new deposition of chemicals in pathway figures.

We don't really distinguish between pharmaceutical products (I assume you mean drugs, right?) and ChemicalSubstance in BTE. But we can do some ID translation (e.g., MeSH -> DrugBank ID) within BTE.

We can make a query path like Disease -> Gene -> Chemical, with the latter step using PFOCR as one of the consulted resources, since PFOCR has information about which genes and chemicals are in the same pathway. I'm open to better suggestions.

AlexanderPico · 2021-12-03T20:29:53Z

@ariutta Should this be closed? Are the chemical, Disease Ontology and Pathway Ontology terms integrated?

ariutta · 2021-12-08T22:45:57Z

Chemicals: yes. Diseases, pathways or amino acids: I'm not sure.

AlexanderPico · 2022-03-22T19:51:30Z

Next, let's add title and disease mentions like so:

{
  "_id": "PMC0000000__nihm00000000",
  "associatedWith": {
    "title": "signaling in immune response",
    "figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000000/bin/nihm00000000.jpg",
    "pmc": "PMC0000000",
    "pubmed": "00000000",
    "mentions": {
      "diseases": {
        "mesh": ["D000860", "D000333"]
      },
      "chemicals": {
        "mesh": ["D0001", "D00002"]
      },
      "genes": {
        "ncbigene": ["1234", "2345"]
      }
    }
  }
}

ariutta · 2022-03-25T20:13:48Z

How inclusive should we be for chemicals? For example, some of our collaborators don't want side metabolites included in their analyses, e.g., Na+, NADPH or S-adenosyl-L-methioninate.

Thanks to @tokebe's suggestion, I ran a sample of the PFOCR chemicals results through the node normalizer. I got back the following:

biolink types

ChemicalMixture
ChemicalEntityOrProteinOrPolypeptide
MolecularEntity
ChemicalOrDrugOrTreatment
SmallMolecule
BiologicalEntity
Entity
MolecularMixture
PhysicalEssenceOrOccurrent
NamedThing
ThingWithTaxon
ChemicalEntityOrGeneOrGeneProduct
PhysicalEssence
ChemicalEntity
Polypeptide

normalized identifier namespaces

UMLS
MESH
KEGG.COMPOUND
CHEMBL.COMPOUND
PUBCHEM.COMPOUND
DrugCentral

I'm proposing that we go ahead and incorporate all the results into BTE, without any filtering to exclude items like side metabolites. However, if it's preferable, I could split up the chemicals into biolink categories.

andrewsu · 2022-03-28T19:25:36Z

I'm proposing that we go ahead and incorporate all the results into BTE, without any filtering to exclude items like side metabolites.

I agree with this strategy!

ariutta · 2022-03-31T19:33:45Z

I've got a draft version of the latest export file ready:
https://github.com/wikipathways/pfocr-pipeline/raw/main/export/bte_chemicals_diseases_genes.ndjson

Note this version does not have PubMed IDs. Is it important that I provide them, or is this something easily handled by existing TRAPI APIs like node normalizer?

andrewsu · 2022-03-31T19:53:32Z

On quick glance, looks good! I do not believe that Translator has an easy mechanism to translate PMCIDs to PMIDs. So if it's easy for you to add, please do. If not, the PMCID will be fine as provenance for now...

ariutta · 2022-04-05T20:49:48Z

@andrewsu, we have an updated dataset available here:
https://www.dropbox.com/s/1f14t5zaseocyg6/bte_chemicals_diseases_genes.ndjson?dl=0

This dataset is formatted the same way as the previous one, except:

we now have titles for figures
we're not including pubmed

We now have chemicals, diseases and genes. The new genes are from our latest batch of PFOCR data. The new chemicals are from both the latest batch of PFOCR data as well as from relaxing the filtering restrictions we applied last time -- we've included all chemicals except ones with names only containing the letters HCONSP because those ones introduced too many false positives.
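
To make the HCONSP rule concrete, here is a minimal Python sketch of one way such a filter could be written. It is not the pipeline's actual code; `is_element_letters_only` is a hypothetical helper, and it assumes the check is case-sensitive and applies to the whole name, so names containing digits or other characters are kept.

```python
import re

# Matches names made up solely of the uppercase letters H, C, O, N, S and P.
ELEMENT_LETTERS_ONLY = re.compile(r"[HCONSP]+")


def is_element_letters_only(name):
    """True if every character of `name` is one of H, C, O, N, S, P."""
    return ELEMENT_LETTERS_ONLY.fullmatch(name) is not None


candidates = ["NO", "CO", "NADPH", "Na+", "HCN"]
kept = [name for name in candidates if not is_element_letters_only(name)]
```

With these assumptions, short names such as "NO" and "CO" are dropped, while "NADPH" and "Na+" are kept.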

Summary stats:

275,456 chemicals (14,482 unique) from 47,831 figures
20,465 diseases (1,430 unique) from 13,622 figures
1,369,680 genes (14,253 unique) from 73,876 figures

AlexanderPico added the enhancement New feature or request label Nov 13, 2019

AlexanderPico added this to the Segment 2 milestone Nov 13, 2019

AlexanderPico assigned ariutta, AlexanderPico and khanspers Nov 13, 2019

AlexanderPico added the Group 4 label Feb 15, 2020

andrewsu mentioned this issue Apr 7, 2020

Support expansion and filtering operations in BTE through the PFOCR API with expanded concept extraction #15

Closed

AlexanderPico changed the title ~~Add node properties to our figure-based gene sets~~ Additional content extraction and annotations from pathway figures Dec 8, 2020

ariutta mentioned this issue Mar 25, 2022

augment TRAPI results using PFOCR data biothings/biothings_explorer#420

Closed

andrewsu mentioned this issue Apr 5, 2022

update data in PFOCR API biothings/pending.api#64

Closed
