Panel 2 (technical) theme #28
-
All sounds good to me! Couple of comments below (trying to avoid giving my own opinions 🙃)
I really like this angle --- as an individual researcher/developer willing to invest (time!) in the tech side of this, I still feel there is a huge opportunity cost to "going semantic". Bio/life sciences show that there are great benefits to be had by going down this road, but how does the structure of our domain (quite literally, the microstructure and bonding in materials) affect the topology of resulting knowledge graphs? Are they still useful? Ties in with the named entities discussion in your following point, and with the fuzziness breakout.
I guess one distinction to make here is between capturing data for all samples/measurements (even ones where, e.g., the synthesis method yielded an impure substance) vs. capturing "everything" about each sample (just because it is available), e.g., height above sea level, partial pressure of radon every second whilst the sample was under storage.
Also, how do we ensure the chemists/materials scientists are empowered by their tooling to create, modify and disseminate their own schemas?
-
re-reading Shirky: Ontology is Overrated -- Categories, Links, and Tags
-
re-reading JSON-LD and Why I Hate the Semantic Web | The Beautiful, Tormented Machine
-
re-reading Whatever Happened to the Semantic Web?
-
re-reading “Is the semantic web still a thing?”
-
These resources are all great @kjappelbaum, perhaps begging for a section on the awesome interop list?
-
re-reading metacrap
Interesting observation: for Google, observational metadata (the links) turned out to be more useful than "real" metadata
-
re-reading FLY ME TO THE MOON
-
I'd been planning to relate XML and JSON so this seems a great discussion.
Disclaimer: I ran the mailing list for the development of XML, have worked
a lot with the RDF/Linked_data people such as Dan Brickley.
First and fundamental - this all depends on people and the current wiring
of their brains. There was a time before the hierarchical filing system and
that hurt people's brains. HTML and XML revolutionised documents and the
idea of computable objects. They also promoted democracy - in 1980 much of
the technical support was in closed company specifications. We now take
this for granted.
We have a very complex problem. It's rightly said of code "You cannot hide
complexity, you can only move it around". But you can also chunk parts of
the problem to agreed labels and then you don't have to explain them. CIF
does this. We have concepts such as _unit_cell where the community decides
that there is enough agreement that they can use this label and not its
contents. Those entering science now (including at school) are fluent in
scripting languages and the use of open public knowledge. They use larger
chunks than those weaned on FORTRAN.
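As a minimal illustration of that chunking (hand-rolled string handling for the sketch; a real program would use a CIF library): once the community has agreed on a label such as _cell_length_a, code can consume it without ever explaining its contents.

```python
# A sketch of the "agreed label" idea. The CIF fragment is illustrative;
# _cell_length_a/b/c are real CIF tags, the parsing is deliberately naive.
cif_fragment = """
_cell_length_a   5.431
_cell_length_b   5.431
_cell_length_c   5.431
"""

cell = {}
for line in cif_fragment.strip().splitlines():
    tag, value = line.split()
    cell[tag] = float(value)

print(cell["_cell_length_a"])  # 5.431 - the label, not its contents, is the contract
```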
JSON/Python_dictionaries and XML overlap to a large extent. We should be
prepared to use either and to interconvert. Each does some things better
than the other and vice versa.
XML is good at:
* namespaces (important for interoperability with other languages)
* running text (and especially inline tags: <i>, etc.)
* supporting document creation and publishing (e.g. PubMedCentral)
* encouraging domain specific schemas
JSON is good at:
* reducing clutter
* integration with code (e.g. Python dictionaries)
* web display
(add more to both)
Both will last for at least 20 years.
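To make the "interconvert" point concrete, here is a minimal sketch (element and key names are my own, not CML or any agreed schema) moving the same record between an XML chunk and a Python dict/JSON:

```python
import json
import xml.etree.ElementTree as ET

# One record, two serialisations. The XML carries the same content the
# dict/JSON does; neither is privileged.
xml_record = '<molecule id="m1"><formula>H2O</formula><mass unit="g/mol">18.015</mass></molecule>'

elem = ET.fromstring(xml_record)
record = {
    "id": elem.get("id"),
    "formula": elem.findtext("formula"),
    "mass": {"value": float(elem.findtext("mass")),
             "unit": elem.find("mass").get("unit")},
}
print(json.dumps(record, indent=2))  # json.loads() takes it straight back
```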
RDF/SPARQL is in constant use for large sections of the bioscience
community and for Wikidata. I can't see a current alternative.
The danger in rejecting these is that you end up with a limited,
non-extensible domain-specific language. These persist for decades.
Examples include:
* MOL/SDF files - limited by the vision of 1980s FORTRAN thinking.
* SMILES. It works for the set of well-behaved covalent organic molecules.
* InChI (disclaimer - I was actively involved in its development).
Attempts to extend SMILES and InChI (reactions, mixtures) are very limited
in scope. They have no extensibility; you have to write code for any
extension and *you* will have to maintain it. By using world standards
you get the huge benefit of other people's and companies' endeavours.
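As a sketch of that benefit (the query is mine, but the endpoint and the properties P233/P234/P235 for canonical SMILES, InChI, and InChIKey are real Wikidata infrastructure maintained by other people):

```python
import requests

# Ask Wikidata for identifiers of water (Q283) that someone else curates.
query = """
SELECT ?smiles ?inchi ?inchikey WHERE {
  wd:Q283 wdt:P233 ?smiles ;
          wdt:P234 ?inchi ;
          wdt:P235 ?inchikey .
}
"""
r = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
    headers={"User-Agent": "materials-panel-demo/0.1"},  # WDQS asks for a UA
    timeout=30,
)
for row in r.json()["results"]["bindings"]:
    print(row["smiles"]["value"], row["inchikey"]["value"])
```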
> On Mon, Jan 31, 2022 at 11:18 AM Kevin Jablonka wrote:
> stats about how JSON took over XML are here:
> https://twobithistory.org/2017/09/21/the-rise-and-rise-of-json.html
--
Peter Murray-Rust
Founder ContentMine.org
and
Reader Emeritus in Molecular Informatics
Dept. Of Chemistry, University of Cambridge, CB2 1EW, UK
-
Not sure where the best place to post this is, but this is the most active technical discussion and it may be useful feedstock for the panel. Here is a draft of one of the slides from the organizer talk that situates some of the technologies/formats/standards being discussed here and where they fall on the file format -> data model -> resource layout -> resource description scale. Of course this is not exhaustive, but if anything extremely relevant (e.g. InChI) is missing then I can add it to the slide. My aim now is to somehow weave in the ecosystem of archives (Zenodo, institutional, figshare, etc.), domain repositories (MaterialsCloud, NOMAD), databases (OPTIMADE, Chemspider, COD, Materials Project, etc.), and specific research project databases (dynamic) and datasets (static).
-
Rehashed version, will keep editing.

*Feasibility of semantic techniques*
In the talks so far, we have already seen that many propose the use of semantic techniques. However, there is a large opportunity cost to going semantic. It feels like you have to learn a completely new ecosystem. “The reasons are complex but it basically boils down to: going through all the effort of putting semantic markup with no guarantee of a payoff for yourself was a stupid idea.”

*How do we actually capture the data?*

*Quality (control)*
Following Fletcher et al., reliable data is … (see also Donny's post).
If we want to use the data for ML purposes, we do want high-quality, high-fidelity data. However, quoting metacrap: people make typos all the time, people are stupid, people lie. How do we handle quality control in this setting? What do you think would be a good mechanism? Should there be a review process upon publication (as in the chemotion repository)? Should there be a system of votes, comments, and pull requests (similar to open source software)? Linked to that:
Will someone maintain this?
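One mechanical piece of an answer, sketched with the (real) jsonschema library: machine-checkable constraints can catch the typo class of metacrap problems on deposition, before any human review or voting. The schema itself is an illustrative assumption, not a community standard:

```python
from jsonschema import ValidationError, validate

# Illustrative deposition schema: a formula string and a non-negative band gap.
schema = {
    "type": "object",
    "required": ["formula", "band_gap_eV"],
    "properties": {
        "formula": {"type": "string", "pattern": r"^([A-Z][a-z]?\d*)+$"},
        "band_gap_eV": {"type": "number", "minimum": 0},
    },
}

entry = {"formula": "TiO2", "band_gap_eV": -3.2}  # a typo: negative gap
try:
    validate(entry, schema)
except ValidationError as err:
    print("rejected on deposition:", err.message)
```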
-
Thank you, this is a very useful list.
Some of the "we hate the SWeb/XML/Schemas" pieces probably relate to trying to
solve everything, the limited toolset, and the novelty. We've come a long
way in the last few years and I'm generally optimistic that there are
valuable, if limited, solutions *in materials science*. My recommendation is
that we find subsets of materials science that:
* are of interest to a critical mass of people
* include people prepared to experiment with code and data structures
* show rapid benefit
We explored most of these areas in CML (polymers, mixtures,
non-stoichiometry, non-bonds, reactions, spectra, declarative computing)
*and created running code*. So I am strongly asserting that in materials science
it is possible to create semantic tools covering useful subsections of the
discipline. The SW naysayers are generally talking about much broader
problems than a single scientific discipline, often several years ago,
without modern tools and Wikidata.
As inspiration:
* reductionist bioscience (genes -> proteins -> enzymes -> metabolism) is
already highly semantic. (We are abstracting metabolic pathways from the
literature in semantic form).
* crystallography - the whole experiment is captured as much as we actually
understand the science. If we think it's valuable we have a new CIF
dictionary.
The easiest of all sciences to semanticise is computation, especially
compchem. The only reason it isn't already fully semantic is restrictive
practices (in companies and some academia) and a fragmented "market". In
bioscience it would have been done by now and be part of the
infrastructure. It's the easiest technical starting point since all
programs basically do the same experiment. "Compute the structures and
energies for this substance (molecule or crystal) with this Wikidata ID,
using this method (ID) with basis set (ID) functionals (ID) and calculate
properties P1(ID), P2(ID) under these constraints (Param(ID))".
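That sentence is already nearly machine-readable. A sketch of it as a record (key names and the context URL are my own illustrative assumptions; Q283 is the real Wikidata ID for water, the other IDs are placeholders to be minted or looked up):

```python
# "Compute the structures and energies for this substance ... using this
# method (ID) with basis set (ID) ... under these constraints" as data.
job = {
    "@context": "https://example.org/compchem.jsonld",   # hypothetical context
    "substance": "http://www.wikidata.org/entity/Q283",  # water
    "method": "wd:Q_METHOD_ID",          # placeholder, e.g. a DFT method entity
    "basis_set": "wd:Q_BASIS_ID",        # placeholder
    "functional": "wd:Q_FUNCTIONAL_ID",  # placeholder
    "properties": ["wd:Q_P1", "wd:Q_P2"],                       # placeholders
    "constraints": [{"param": "wd:Q_PARAM", "value": 298.15}],  # placeholder
}
```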
The next easiest is physical properties. These have a defined data type
(scalar, matrix, tensor, etc). That extends to bundles such as spectra or
electrochemical measurements.
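A sketch of such a typed property container (the class and field names are mine, not an established schema); the point is that scalar, tensor, and bundle can share one shape-aware structure:

```python
from dataclasses import dataclass

@dataclass
class PhysicalProperty:
    name: str          # e.g. "band_gap"
    value: object      # float for a scalar, nested lists for a tensor/spectrum
    unit: str          # e.g. "eV"
    shape: tuple = ()  # () scalar, (3, 3) rank-2 tensor, (n, 2) spectrum

band_gap = PhysicalProperty("band_gap", 1.12, "eV")
stiffness = PhysicalProperty("elastic_tensor",
                             [[0.0] * 6 for _ in range(6)], "GPa", (6, 6))
```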
A lot of recipes are tractable: We took X, mixed with Y, in solvent Z,
filtered A, purified with P to give A1 ... This is a narrative with defined
objects. Most of the objects are already well definable. We need a
narrative structure to tie the events together - I'd start with HTML,
including defined objects, and free text for the narrative and gradually
semanticize some of the verbs.
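A sketch of that starting point (tag and attribute names are illustrative; only the idea - free text for the narrative, tagged identifiers for the defined objects - comes from the paragraph above):

```python
import re

# The recipe stays readable prose; the objects carry identifiers.
recipe_html = """
<p class="synthesis">
  We took <span class="compound" data-id="wd:Q_X">X</span>,
  mixed with <span class="compound" data-id="wd:Q_Y">Y</span>
  in solvent <span class="compound" data-id="wd:Q_Z">Z</span>,
  filtered, and purified with <span class="compound" data-id="wd:Q_P">P</span>
  to give <span class="compound" data-id="local:A1">A1</span>.
</p>
"""

print(re.findall(r'data-id="([^"]+)"', recipe_html))
# ['wd:Q_X', 'wd:Q_Y', 'wd:Q_Z', 'wd:Q_P', 'local:A1']
```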
There are enough success stories of determined people / small groups that
I think this can work. Jmol and Figshare are examples that have captured the
world.
Comments on Kevin's list below...
> On Tue, Feb 1, 2022 at 8:22 AM Kevin Jablonka wrote:
> Rehashed version, will keep editing.
> *Feasibility of semantic techniques*
> 1. Is it worth the effort?
Yes, absolutely. There is no alternative - we cannot survive by sighted
humans reading PDFs and retyping.
> 2. Do we actually have use cases for all of this in mind? Or success
> stories we could share?
This is critical. Pick one that can work and be useful. Don't try to do
all of materials science.
> - What would be a simple demonstrator (e.g., data explorer tools) that
> one could build to show the use of these tools?
The opportunities include:
* discovery (all papers about X, Y ...) This can be rapid - all preprints
this week with X, Y...
* automation - making repeated jobs automatic. Universities don't count the
cost of grad student labour, but the time wasted is huge
* quality - automation avoids human mistakes
* aggregation - collecting "all" the data on a subsystem
* workflow - chaining together different operations
> 3. A lot of this works really well if you have a finite number of stable,
> well-defined concepts.
Yes - and we should identify these. CML has reaction, spectrum, property,
fragment, etc.
> 4. However, in some parts of chemistry - I think of inorganic
> chemistry - the focus of the research is to break the rules and to get new
> insights into basic concepts. For instance, a chemical bond is nothing else
> than a convenient fiction that might work well for many parts of “standard”
> organic chemistry but is relatively useless in inorganic chemistry. (see
> also Shirky: Ontology is Overrated -- Categories, Links, and Tags
> <https://digitalcuration.umaine.edu/resources/shirky_ontology_is_overrated.pdf>)
In CML a bond is a styled, annotated line drawn on paper. In organic
molecules it's generally useful enough to use graph theory for searching
and creating subcomponents. In solid state it probably has no use.
> - Is it actually feasible in such a case to come up with a workable
> representation?
> - To what extent does the structure of the knowledge graph impact the
> questions we can ask/answer? - Links to Cory Doctorow's “schemas aren't
> neutral” in metacrap
> <https://chnm.gmu.edu/digitalhistory/links/pdf/preserving/8_17.pdf>
I know and respect Cory but this is a 20-year-old, very broad criticism.
> 5. Semantic technologies depend on URIs/IRIs. However, can we mint an
> identifier for every compound or instrument?
For purified stoichiometric compounds it's been done for over 50 years by
CAS. The problem is political. PubChem and Wikidata are rapidly building
an acceptable alternative. For non-stoichiometry we need free variables
(Mg(x), Ca(1-x), etc.). CML did this.
Marketed instruments should be straightforward and can be scraped from the
literature and added to Wikidata.
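The free-variable idea can be sketched very simply (class and field names are mine, not CML's actual representation):

```python
from dataclasses import dataclass

@dataclass
class Species:
    element: str
    amount: str  # symbolic: "x", "1-x", or a plain number as text

composition = [Species("Mg", "x"), Species("Ca", "1-x"), Species("O", "1")]

def instantiate(comp, **values):
    """Substitute numbers for the free variables (eval is fine for a sketch;
    a real tool would parse the expressions properly)."""
    return {s.element: eval(s.amount, {}, values) for s in comp}

print(instantiate(composition, x=0.25))  # {'Mg': 0.25, 'Ca': 0.75, 'O': 1}
```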
> - What do we do when the substance is ill-defined (polymers, alloys,
> …) and how do we decide what is ill-defined? There will always be
> impurities in experiment, so how can we make sure we talk about the same
> thing?
That's hard. We spent years on polymers (and created
PolymerMarkupLanguage). But the conclusion was that the processing was
often more important than the simple composition. If you don't understand
the science completely, you can't easily make it semantic.
> - This shows the huge importance of provenance / tracking the history
> of a sample. Not only might this give you a way to better understand why
> the sample is as it is, but also you capture that some processes might be
> destructive to samples (which is quite different from simulations).
> - What are the technologies we have available to capture this
> provenance of samples? How could we make it accessible to the users and how
> would we query it?
I suspect this has to be largely by Natural Language Processing (NLP). We
have built this for organic syntheses (http://chemicaltagger.ch.cam.ac.uk/).
The architecture allows it to be extended for other domains (we did
atmospheric chemistry).
... This is in danger of turning into a huge interleaved document, so I'll
stop for breath.
The analysis and preparation for this meeting is impressive. The most
important thing is that this is seen as a long-term process. It can't be
solved in 2 years. We set up the Blue Obelisk (
https://en.wikipedia.org/wiki/Blue_Obelisk) 17 years ago and it's brought a
lot of people and software into better synchronisation than without it.
It's one of many social models that has had some success.
TL;DR - it's hard, committed work but it's possible to make enough useful
progress in a few subdomains to show the world a new future.
P.
-
You asked two questions:
I think our systems will be massively hybrid. The current generation of software allows different technologies to be mixed. The (apparent) limitations of RDF, XML, etc. were frequently about needing to do everything in a single language. (Tim BL tried to convince me to implement chemistry in triples - it doesn't work - you need containers.) We need chunking/containment and we need links/relations, and they need different technologies. We can identify the current technologies that work for particular problems and use them to "chunk" the system. A "spectrum" is a chunk. We often don't need to look inside the details. Bibliography is a chunk. It's a solved problem. So are crystal structures. So are covalent molecules, etc.
-
> PMR: Tim BL tried to convince me to implement chemistry in triples - it
> doesn't work - you need containers.
KJ: I would love to hear more about this.
1997-ish Tim had a view that the whole world can be represented in RDF. But
it doesn't fit into most people's brains (and it also only worked then with
toy examples). In the early years of the web there was very little
client-side code - I had to write my own menu system in Java. Computer
scientists had been building theoretical views of the world, so many
thought this was the chance to put them into practice. By building in OWL
we could validate and compute reality. Triple stores were the magic solution.
It didn't work - there wasn't enough data and there wasn't enough code and
it didn't interoperate. So the Web forged ahead with whatever people liked
working with. Some of the early web was based on pragmatics - others on
theory. XML and XPath were pragmatic; they worked. XSLT and XSD were driven
by theoretical designs, inspired by LISP, and left most people behind. They
are complex and overengineered. There are a lot of remains of unsuccessful
experiments.
It's important that today's approaches are both machine- *and*
people-friendly.
If we all agree what something is, it's generally not so difficult to write
code, and it also provides "chunking" where our brains don't have to
worry about the details. <molecule>...</molecule> is a well-defined chunk,
whereas doing it in B-nodes is a mess of wiring. The chunks give the same
freedom as integrated circuits did - you don't have to worry about what's
inside. We need integrated knowledge components for materials - crystals,
surfaces, mixtures, measurements, synthesis. Since it's the first time
through this exercise not all the bits will work out. But we know a lot
about the system to start with.
We need chunks and we need links. XML or JSON does well for chunks. Fixed-format
files (like SDF/MOL) cannot provide extensibility and cannot support links.
The chunks need links to the outside world, and for that we need well-defined
connection points (such as "id" attributes) and semantic relations such as
those RDF provides.
And "running code"
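For instance, a sketch with the (real) rdflib library: the XML chunk keeps its internal detail, and two triples wire its id to the outside world (the namespaces and predicates are illustrative; Q283 is Wikidata's entry for water):

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("https://example.org/lab/")        # hypothetical local namespace
WD = Namespace("http://www.wikidata.org/entity/")

chunk = "<molecule id='m1'><formula>H2O</formula></molecule>"  # the chunk itself

g = Graph()
g.add((EX["m1"], EX["representedBy"], Literal(chunk)))  # id -> chunk
g.add((EX["m1"], EX["sameSubstanceAs"], WD["Q283"]))    # id -> outside world
print(g.serialize(format="turtle"))
```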
P.
-
Why don't we already have a solution?
Is it feasible?
Do we need to capture all data?
What do we do with legacy systems and instruments?
What do we actually standardize?
Agility vs. flexibility
Who is responsible for quality control and how do we do this?