Additional content extraction and annotations from pathway figures #16
@ariutta Shall we dive into this one? I was thinking we could add:
You have 1 already. We have the start of 3 (from Jensen). And we should brainstorm on 2. Other ideas? |
@khanspers Why don't you try getting GO:BP terms associated with PFOCR results. |
@ariutta I'll send you a tsv with |
@kevinxin90, we're looking to add more fields (PMID, Pathway Ontology, Disease Ontology) to our data for BTE. What do you think about the format below (the values are just placeholders)? The keys match http://identifiers.org/ prefixes in order to ensure they stay consistent and unique.
|
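As a rough illustration of why matching keys to identifiers.org prefixes is convenient: a prefix plus a local ID can be joined into a compact identifier and a resolvable URL. This is only a sketch; the helper names `to_curie` and `to_identifiers_org_url` are hypothetical and not part of any project code.

```python
def to_curie(prefix: str, local_id: str) -> str:
    """Join a prefix and a local ID into a compact identifier (prefix:id)."""
    return f"{prefix}:{local_id}"


def to_identifiers_org_url(prefix: str, local_id: str) -> str:
    """Build an identifiers.org URL for a compact identifier."""
    return f"https://identifiers.org/{to_curie(prefix, local_id)}"


# Example with a gene ID used later in this thread (NCBI Gene 1017).
print(to_identifiers_org_url("ncbigene", "1017"))
```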
@ariutta Hi Anders, this looks great to me!! |
@ariutta, Here's a map of
@kevinxin90, should we include both DOID and term name? As a proposed "node property", the human-readable term name may be more useful than the DOID. But if you already have a mapping for Disease Ontology IDs and terms, then it's perhaps not necessary. Thoughts? |
@AlexanderPico Yes. I think it's a good idea to include both DOID and term name in the API response. For BioThings Explorer, it doesn't matter whether you include the name or not (we will do the ID resolving internally). But for the API itself, it would be very useful (in case the user doesn't access it through BioThings Explorer). |
@kevinxin90, in this reporting period, @AlexanderPico and I have done more work to parse our PFOCR data, using APIs like PubTator to extract chemical and disease mentions from the OCR text. Note that these disease mentions are pulled directly from the OCR text, unlike the disease associations Alex referenced above, which come from pathway enrichment analysis. I'm thinking of the former as
In order to include this additional information, I propose we use the following updated JSON format, but I'm open to suggestions. We also need to distinguish between
{
"_id": "PMC0000000__nihm00000000",
"associatedWith": {
"figureUrl": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC0000000/bin/nihm00000000.jpg",
"pmc": "PMC0000000",
"pubmed": "00000000",
"ncbigene": [
"1000",
"1001",
"1002",
"1003"
],
"disease:mesh": [
"D000860"
],
"chemical:mesh": [
"D000431",
"D000079",
"D000085"
],
"annotations": {
"PW": [
"0000001",
"0000002"
],
"DOID": [
"10000",
"10001"
]
}
}
}
(dummy data -- just showing the format)
Questions:
|
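For reference, the `_id` and `figureUrl` fields in the dummy record above appear to be related: the `_id` looks like the PMC ID and the figure file name joined by a double underscore. A minimal sketch under that assumption (the function names and the default `jpg` extension are illustrative, not confirmed by the thread):

```python
def split_figure_id(figure_id: str) -> tuple[str, str]:
    """Split an _id like 'PMC0000000__nihm00000000' into (pmc, figure name)."""
    pmc, separator, figure = figure_id.partition("__")
    if not separator or not pmc.startswith("PMC") or not figure:
        raise ValueError(f"unexpected figure id: {figure_id!r}")
    return pmc, figure


def figure_url(figure_id: str, extension: str = "jpg") -> str:
    """Rebuild the figure URL pattern shown in the dummy record."""
    pmc, figure = split_figure_id(figure_id)
    return f"https://www.ncbi.nlm.nih.gov/pmc/articles/{pmc}/bin/{figure}.{extension}"


print(figure_url("PMC0000000__nihm00000000"))
```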
@andrewsu, you're also welcome to comment on this format, if you're interested, especially how to distinguish mentions vs. annotations. |
Hi @ariutta, thanks for the update. Here are a few suggestions from me.
Here is what I would propose:
|
@kevinxin90, that looks great. I just need to get you an updated file with this format. |
Hi @kevinxin90, I wanted to let you know I have the chemicals extracted from almost all the PFOCR pathway figures (63591 of 64643, because the PubTator API returned an error for some of them). I currently have them in TSV format (link included in case you want a sneak preview), but I'll get you an updated file in the format we agreed on.
Summary Stats
All pathway figures:
Just the pathway figures with 10+ genes:
|
Here is an updated file in the format we discussed, including chemicals in addition to genes this time:
Summary Stats for pathway figures having at least 3 chemicals and/or genes found:
Questions:
|
@ariutta I found a couple of places where the value of pubmed equals 0. Does that mean the PubMed ID was not found?
|
Good catch! Yes, you're right. I'll take a look at filling those in. |
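Until the missing values are filled in, one option is to drop the placeholder before loading. The sketch below assumes the `pubmed` field sits under `associatedWith`, as in the dummy record earlier in this thread, and that the export is newline-delimited JSON; the function names are hypothetical.

```python
import json


def drop_missing_pubmed(record: dict) -> dict:
    """Return a copy of the record without a placeholder pubmed value (0 or empty)."""
    associated = record.get("associatedWith")
    if not isinstance(associated, dict):
        return record
    pubmed = associated.get("pubmed")
    if pubmed is not None and str(pubmed).strip() in {"0", ""}:
        kept = {key: value for key, value in associated.items() if key != "pubmed"}
        return {**record, "associatedWith": kept}
    return record


def clean_ndjson(lines):
    """Yield cleaned JSON lines, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield json.dumps(drop_missing_pubmed(json.loads(line)))
```

Dropping the field rather than keeping a 0 avoids the placeholder being indexed as if it were a real PubMed ID.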
Great! For now, I've just removed the pubmed fields that equal 0. @ariutta The new API is currently up at: https://biothings.ncats.io/pfocr
Some example queries:
https://biothings.ncats.io/pfocr/geneset/PMC4671449__medscimonit-21-3736-g001.jpg
Query by chemical MeSH ID: https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.chemicals.mesh:D000431
Query by gene NCBIGene ID: https://biothings.ncats.io/pfocr/query?q=associatedWith.mentions.genes.ncbigene:1017 |
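The example queries above follow a `field:value` pattern on the `query` endpoint. A small sketch of building such URLs programmatically; `build_query_url` is a hypothetical helper, and only the base URL and field paths are taken from the examples in this comment.

```python
from urllib.parse import urlencode

BASE_URL = "https://biothings.ncats.io/pfocr"


def build_query_url(field: str, value: str) -> str:
    """Build a query URL of the form <base>/query?q=<field>:<value>."""
    # Keep ':' unescaped so the URL matches the style of the examples above.
    return f"{BASE_URL}/query?" + urlencode({"q": f"{field}:{value}"}, safe=":")


print(build_query_url("associatedWith.mentions.chemicals.mesh", "D000431"))
print(build_query_url("associatedWith.mentions.genes.ncbigene", "1017"))
```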
@kevinxin90 Do the chemical MeSH IDs connect in BTE to any other entries, e.g., pharmaceutical products? It would be nice to mention a query path in the annual report that is enabled by this new deposition of chemicals in pathway figures. |
@kevinxin90 Is there a particular place we should upload future .ndjson files, i.e., somewhere that is periodically checked and queued up for BTE integration? Or is this something we should work out next year? |
@AlexanderPico Hi Alex, if it's small, you could just upload it to this GitHub repo. For us, we only need a static URL whose last-modified date we can track. It could also be Google Cloud Storage or another cloud storage service. You could view an example here. |
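To illustrate the "track the last-modified date" idea: a checker could compare the `Last-Modified` HTTP header of the static URL with the time of the last import. This is only a sketch of the comparison step; how the header is fetched and where the last-seen time is stored are left out, and `needs_refresh` is a hypothetical name, not the actual BTE mechanism.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def needs_refresh(last_modified_header: str, last_seen: Optional[datetime]) -> bool:
    """Return True if the file changed after the last import (or was never imported)."""
    if last_seen is None:
        return True
    # HTTP dates look like 'Wed, 21 Oct 2015 07:28:00 GMT'.
    modified = parsedate_to_datetime(last_modified_header)
    return modified > last_seen


header = "Wed, 21 Oct 2015 07:28:00 GMT"
print(needs_refresh(header, datetime(2015, 10, 20, tzinfo=timezone.utc)))
```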
We don't really distinguish pharmaceutical products (I assume drug, right?) and ChemicalSubstance in BTE. But we can do some ID translation (e.g. MeSH -> DrugBank ID) within BTE. We can make a query path like Disease -> Gene -> Chemical, with the latter step using PFOCR as one of the consulted resources, since PFOCR has information about which genes and chemicals are in the same pathway. Open to better suggestions. |
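A toy version of the Gene -> Chemical step could look like the following: given genes already linked to a disease, collect the chemicals that co-occur with them in the same pathway figure. The record layout is inferred from the query field paths earlier in this thread (`associatedWith.mentions.genes.ncbigene`, `associatedWith.mentions.chemicals.mesh`); the function name and sample records are made up, and this is not how BTE itself executes queries.

```python
def chemicals_for_genes(figures, gene_ids, min_shared_genes=1):
    """Map each chemical MeSH ID to the figure IDs where it co-occurs with the genes."""
    wanted = set(gene_ids)
    chemical_to_figures = {}
    for figure in figures:
        mentions = figure.get("associatedWith", {}).get("mentions", {})
        figure_genes = set(mentions.get("genes", {}).get("ncbigene", []))
        if len(wanted & figure_genes) < min_shared_genes:
            continue
        for mesh_id in mentions.get("chemicals", {}).get("mesh", []):
            chemical_to_figures.setdefault(mesh_id, set()).add(figure["_id"])
    return chemical_to_figures


# Made-up sample records, just to show the shape of the input.
sample_figures = [
    {"_id": "fig1", "associatedWith": {"mentions": {
        "genes": {"ncbigene": ["1017", "1000"]},
        "chemicals": {"mesh": ["D000431"]}}}},
    {"_id": "fig2", "associatedWith": {"mentions": {
        "genes": {"ncbigene": ["1003"]},
        "chemicals": {"mesh": ["D000079"]}}}},
]
print(chemicals_for_genes(sample_figures, ["1017"]))
```

Raising `min_shared_genes` would require more overlap between the disease gene set and a figure before its chemicals are reported.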
@ariutta Should this be closed? Are chemical, disease ont and pathway ont terms integrated? |
Chemicals: yes. Diseases, pathways or amino acids: I'm not sure. |
Next, let's add
|
How inclusive should we be for chemicals? For example, some of our collaborators don't want side metabolites included in their analyses, e.g., Na+, NADPH or S-adenosyl-L-methioninate.
Thanks to @tokebe's suggestion, I ran a sample of the PFOCR chemicals results through the node normalizer. I got back the following:
biolink types
normalized identifier namespaces
I'm proposing that we go ahead and incorporate all the results into BTE, without any filtering to exclude items like side metabolites. However, if it's preferable, I could split up the chemicals into biolink categories. |
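If splitting by biolink category turns out to be preferable, the grouping itself is simple. The sketch below assumes the normalizer output has already been reduced to a mapping from identifier to a list of biolink categories, most specific first; that simplified input shape, the function name, and the sample category assignments are all assumptions for illustration, not the node normalizer's actual response format or real results.

```python
from collections import defaultdict


def group_by_category(categories_by_curie):
    """Group identifiers by their first (assumed most specific) biolink category."""
    groups = defaultdict(list)
    for curie, categories in categories_by_curie.items():
        primary = categories[0] if categories else "unclassified"
        groups[primary].append(curie)
    return {category: sorted(curies) for category, curies in groups.items()}


# Placeholder assignments, only to show the input and output shapes.
sample = {
    "MESH:D000431": ["biolink:SmallMolecule", "biolink:ChemicalEntity"],
    "MESH:D000079": ["biolink:SmallMolecule"],
    "MESH:D000085": [],
}
print(group_by_category(sample))
```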
I agree with this strategy! |
I've got a draft version of the latest export file ready: Note this version does not have PubMed IDs. Is it important that I provide them, or is this something easily handled by existing TRAPI APIs like node normalizer? |
At a quick glance, looks good! I do not believe that Translator has an easy mechanism to translate PMCIDs to PMIDs. So if it's easy for you to add, please do. If not, the PMCID will be fine as provenance for now... |
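One way the PMIDs could be added on the export side is a lookup against a PMCID-to-PMID mapping table. The sketch assumes such a table is available as CSV text with `PMCID` and `PMID` columns; the column names, the function name, and the sample rows are assumptions, and no particular download source is implied.

```python
import csv
import io


def load_pmcid_to_pmid(csv_text: str) -> dict:
    """Build a PMCID -> PMID lookup, skipping rows where either value is missing."""
    mapping = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        pmcid = (row.get("PMCID") or "").strip()
        pmid = (row.get("PMID") or "").strip()
        if pmcid and pmid:
            mapping[pmcid] = pmid
    return mapping


# Dummy rows, just to show the expected table shape.
sample_csv = "PMCID,PMID\nPMC0000001,11111111\nPMC0000002,\n"
lookup = load_pmcid_to_pmid(sample_csv)
print(lookup.get("PMC0000001"))
```

Figures whose PMCID has no PMID in the table would simply be exported without a `pubmed` field, with the PMCID serving as provenance.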
@andrewsu, we have an updated dataset available here: This dataset is formatted the same way as the previous one, except:
We now have chemicals, diseases and genes. The new genes are from our latest batch of PFOCR data. The new chemicals are from both the latest batch of PFOCR data as well as from relaxing the filtering restrictions we applied last time -- we've included all chemicals except ones with names only containing the letters
Summary stats:
|
for summarizing query path results during evaluation. For example, annotating pathway figures with terms from the Pathway Ontology will allow BTE users to filter genes by hierarchically-organized pathway classes.