
Notes: Google Hangout of Vocabulary Management Task Group (VOCAB)

Prepared by Steve Baskauf, VOCAB TG Convener

Note: The Google Hangout occurred at 21:00 UTC on Wednesday, July 15. I think that corresponds to:

Australia Eastern Standard Time 07:00 Thursday, July 16

US Pacific Daylight 14:00 Wednesday, July 15

US Central Daylight 16:00 Wednesday, July 15

US Eastern Daylight 17:00 Wednesday, July 15

Western European Daylight 22:00 Wednesday, July 15

The URL to join the hangout is:

https://plus.google.com/hangouts/_/calendar/M2JkMzB0dmlxYXNnYmRvYm9kMWk3MnRybG9AZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ.31pqmol6aom9p6sg1di0tfh83k?authuser=0

Thanks to Bob Morris for helping me set that up.

Participating: Bob Morris, Terry Catapano, Steve Baskauf, Greg Whitbread, John Wieczorek, and Stan Blum

Some thoughts about our peer standards organizations (does not require discussion)

(with reference to the notes at https://github.com/tdwg/vocab/blob/master/process-models.md)

  • The W3C and IETF are similar to TDWG in that they are open to participation by all and are consensus-driven. Any interested parties can create a task group and move forward with a proposed standard.

  • DCMI is similar to TDWG in that terms in its vocabulary standard (the DCMI Metadata Terms) can be modified over time (although in practice, its terms have changed very little). Terms in W3C vocabulary Recommendations seem to be modified only when the vocabulary is replaced by another Recommendation that supersedes it. IETF doesn’t really produce vocabularies. However, DCMI seems to be more centralized administratively and less open to change from the public.

  • The W3C process seems to be the most analogous to TDWG’s process, although the scope of and participation in W3C standards development is much larger and consequently more complex than in TDWG. Nevertheless, many of the procedures of the W3C process seem applicable as a model for TDWG.

Strategy for this meeting:

  • Aim for the low hanging fruit in the Issues Tracker (Items A and B) to show some progress and build momentum.

  • Assign specific people to work on some of the technical writing required to resolve the issues in item C.

  • Formulate a plan to move forward within the timeframe specified in the TG charter.

A. Designation of normative parts of standards (Issue #4, Issue #5, and to some extent Issue #9)

After examining the practices of peer organizations and reflecting on our experience in TDWG, I think that the current TDWG practice of designating particular documents as normative (“Type 1” in the language of the draft Standards Documentation Specification) and non-normative (“Type 2”) is too restrictive. I think it would be easier to write a coherent descriptive document if particular sections of a standards document could be designated as normative or non-normative. For example, see http://www.w3.org/TR/vocab-dcat/ where particular sections say “This section is non-normative.” It would also be possible to designate entire documents as non-normative (e.g. use cases and examples).

It would also be possible to state that particular components of a document are non-normative. For example, in the description of a vocabulary term, one could indicate that the label, URI, and description are normative, while the comments about the term are not. If this practice were followed, one could also have a stricter set of requirements for changing normative parts of a standard. For example, making changes to non-normative comments or examples could be done after a period of public notification, while changes to normative definitions could require a more extensive process that requires consensus and approval by the executive.

With respect to “how people find standards”, if a standard is simple and contained in a single document, the permanent URL of the standard could link to the HTML version of that document. If the standard is complicated and consists of multiple documents (e.g. Darwin Core), the permanent URL of the standard could link to a landing page that describes all of the documents that form a part of the standard, with notes about the purpose and status of each document. For example, see the OWL 2 Web Ontology Language Document Overview; Section 4 of that document contains a document overview of this sort.

On the question of whether the normative description of a vocabulary should be human or machine-readable [comment by Bob: These aren't always mutually exclusive, if part of a normative document references a machine readable serialization of a vocabulary.], I found the working draft of the Data on the Web Best Practices informative. The overall tone is that humans “know” and “understand” things, while machines “access” or “process” things. Best Practice 16 specifically spells out that “The description of the vocabulary must be human-readable.” I think that this argues that the normative description of a vocabulary should be in human-readable form. There is also the issue that an RDF document may not be able to contain all of the information necessary to fully define the vocabulary. For example, Dublin Core terms are part of Darwin Core. However, the normative RDF document contains no information that indicates that this is the case.

Proposed resolutions:

Issue #4 (How is the normative part of a standard designated?)

Eliminate the “Type” designation of documents. Each document that forms a part of the standard is clearly identified as part of the standard, including the permanent URL of the standard. If the standard is complicated with multiple documents, a landing page will summarize and link to the component documents. The normative parts of documents are designated by text in the document itself.

Issue #5 (Should the normative document be RDF, human-readable, or either?)

The normative description of a vocabulary should be in a human-readable format.

Discussion:

(Bob: Re Issue # 5: AudubonCore illustrates that it may be hard to say what is the normative document when the human readable form is derived entirely from an arguably non-human document, namely data intended for the MediaWiki template system…) Steve: Hmm. Well, the source of many Web documents are actually written in HTML, but what we care about is the part that humans see when the page is displayed (the text). If you had text that appeared the same to humans but one copy was in a Word document and the other a Web page, I would say that the text content was the same, even if the underlying characters in the document differed.

Normative content is what is required to conform to the standard. Examples and diagrams would be non-normative. In some standards, the actual XML schemas are not normative; the descriptions are.


B. RFC 2119. (Issue #19)

There has been some discussion about the circumstances under which it is appropriate to use the terms “MUST”, “MAY”, “SHALL NOT”, etc. from RFC 2119 in a standards document. RFC 2119 is a Best Practice statement of the IETF, whose primary concern is creating technical specifications that specify protocols to facilitate communication across interconnected networks. In a Technical Specification of that sort, failing to meet a “MUST” or “SHALL NOT” requirement probably means that an application will fail to work and communication will fail. In other kinds of standards such as an Applicability Statement that specifies particular values of parameters, failure to follow a recommendation may result in communicating incorrect or nonsensical information, but that doesn’t “break” an application. So it seems like the RFC 2119 terms should only be applied in circumstances where not following a directive containing them would cause a compliant application to fail. Is this the correct distinction? How do we articulate a guideline for the circumstances under which RFC 2119 terms should be used in TDWG Standards?

Progress towards resolution:

Issue #19 (Under what circumstances (if any) should the terms defined in RFC 2119 be used in TDWG Standards documentation?)

Draft a paragraph describing the circumstances under which it is appropriate to use RFC 2119 terms.

Discussion:

There are probably circumstances where this is appropriate. The task is to determine what those circumstances might be, and whether use of the RFC 2119 terms should be required or recommended.


C. Documenting relationships among vocabularies, terms, documents, versions, etc. (Issue #20, Issue #17, Issue #14, some parts of Issue #9, Issue #3)

This issue grouping breaks down four ways:

  • There is a versioning component: how do we keep track of versions of resources (vocabularies, terms, documents)?

  • There is a hierarchical component: how do we indicate that terms are part of a vocabulary, that a vocabulary is described in a document, that a document is part of a standard, etc.?

  • There is a term-to-term component: how do we indicate that the entity described by some term has some relationship to another (part of, broader, equivalent)?

  • There is a serialization/representation component: how do we relate a vocabulary to its various representations (human-readable, RDF), and how do we use human-readable text or RDF properties to describe all of the kinds of relationships listed in the bullet points above?

With respect to versioning, I think that the approach used by the W3C for documents works well (example). There is a generic URI for the “Latest published version” (e.g. http://www.w3.org/TR/vocab-dcat/) that always dereferences to the most current version. There is also a specific version URI (e.g. http://www.w3.org/TR/2014/REC-vocab-dcat-20140116/) that always dereferences to a specific, stable version. Each stable version lists the specific version URI of the previous version as well as the generic “current version”. This system definitely works for web documents. Darwin Core applies this approach to individual terms (see the complete historical record), where dwc:preparations has the version dwc:preparations-2014-10-23 that replaces dwc:preparations-2009-04-24. The approach could probably also be applied to versions of vocabularies and standards (as abstract things vs. the documents that describe them), although I haven’t thought carefully about how URI dereferencing might allow machines to discover and inter-relate the versions. The DCMI terms dcterms:hasVersion, dcterms:replaces, etc. can be used to describe relationships among versions. I don’t entirely understand how a semantic client processes the information in the DwC RDF to link the dated URI versions to the generic term URIs - maybe we can spell this out more clearly. I have taken a first pass at fleshing out how this can work. The WIP is at https://github.com/tdwg/vocab/blob/master/version-model.md . See https://docs.google.com/drawings/d/1HNfS2JiuCqh_aKSbujZA3klbxCKiowjyE5duQgF83ic/edit?usp=sharing for a diagram.
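To make the versioning pattern above concrete, here is a minimal Turtle sketch of how DCMI terms could link dated term versions to the generic term URI. The dwcver: namespace and the exact dated URIs are illustrative assumptions, not the actual URIs in the Darwin Core history record:

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix dwc: <http://rs.tdwg.org/dwc/terms/> .
# Hypothetical namespace for dated term versions (illustrative only):
@prefix dwcver: <http://example.org/dwc/history/> .

# The generic term URI points to its versions.
dwc:preparations dcterms:hasVersion dwcver:preparations-2014-10-23 ,
                                    dwcver:preparations-2009-04-24 .

# Each dated version links back to the generic term and to the
# version it replaces or is replaced by.
dwcver:preparations-2014-10-23
    dcterms:isVersionOf dwc:preparations ;
    dcterms:replaces dwcver:preparations-2009-04-24 .

dwcver:preparations-2009-04-24
    dcterms:isVersionOf dwc:preparations ;
    dcterms:isReplacedBy dwcver:preparations-2014-10-23 .
```

A semantic client could follow dcterms:isVersionOf from any dated version back to the generic URI, which is the linkage the paragraph above says needs to be spelled out.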

There are some ideas about how to handle the hierarchical relationships in the Data Catalog Vocabulary (DCAT) W3C Recommendation. If one considers a vocabulary to be a dcat:Dataset, then particular representations might be considered dcat:Distributions. The VoID Vocabulary is also used for datasets that are sets of RDF triples. The example here uses owl:versionIRI and owl:priorVersion. There are probably other examples. I am unsure as to the circumstances where skos:Collection and skos:member would be useful. Since SKOS concept collections are groups of SKOS concepts, I’m not sure what sorts of vocabularies might be instances of skos:Collection. Maybe controlled vocabularies? SKOS has so many entailments that it makes me nervous to suggest using it. DCMI has commonly used properties, such as dcterms:isPartOf and dcterms:references, that might relate terms to vocabularies and documents to vocabularies. Darwin Core already uses rdfs:isDefinedBy to relate a term to its defining vocabulary. I have also taken a first pass at this one. The WIP is at https://github.com/tdwg/vocab/blob/master/hierarchy-model.md (text) and https://docs.google.com/drawings/d/1xIa74GiLFQAhclO7baP1lNHKYk0Uvhs4bAdRytQUwmc/edit?usp=sharing (diagram). This model addresses several issues:

  1. Currently, there is no single machine-readable document that actually “defines” Darwin Core. There are several (dwcterms:, dwcattributes:, dwciri:), but there is no explicit connection to the DCMI terms that are imported. The model deals with this by creating umbrella ontologies (e.g., the “Darwin Core Basic Vocabulary”) that import several term lists.

  2. There are some constituencies in TDWG who want to develop Darwin Core into a more heavyweight ontology, some who would like it to be a lightweight ontology, and some who only care about its terms with basic text definitions. It was suggested on tdwg-content several years ago that we could use a layered approach where additional semantics could be overlaid on top of a basic Darwin Core vocabulary. This model addresses this issue by allowing for the creation of several umbrella ontologies that import the term lists that define the semantics that are important to the use cases that a particular umbrella ontology is designed to satisfy. It would also be possible to add additional components that translate the vocabulary definitions to other languages as such components are created.

  3. This model addresses many of the issues raised in Section 5 of the VoMaG report (https://github.com/tdwg/vocab/blob/master/gbif_TDWG_Vocabulary_Management_Task_Group_en_v1.0.pdf ), particularly 5.3, 5.4, 5.5, 5.6, and 5.12. In particular, it specifies classes for most of the components and suggests properties that can be used to link the different types of components so that a machine could discover all of the components.
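The umbrella-ontology idea in item 1 could be sketched in Turtle as follows. All of the URIs here are hypothetical placeholders; the real hierarchy-model document linked above would define the actual identifiers:

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
# All URIs below are illustrative; none are real TDWG identifiers.

# An umbrella ontology that pulls together the term lists that
# jointly "define" the basic vocabulary, including imported DCMI terms.
<http://example.org/dwcbasic>
    a owl:Ontology ;
    rdfs:label "Darwin Core Basic Vocabulary"@en ;
    owl:imports <http://example.org/termlists/dwcterms> ,
                <http://example.org/termlists/dwcattributes> ,
                <http://example.org/termlists/dcmi-imports> .
```

The layered approach in item 2 would then just be additional owl:Ontology resources importing a different (e.g. richer, OWL-heavy) selection of term lists.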

It seems like SKOS may provide the capability to describe many of the desired term-to-term relationships, like broader, closeMatch, etc.
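A minimal sketch of that term-to-term use of SKOS, with made-up URIs standing in for terms from two vocabularies:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
# Illustrative URIs only: SKOS semantic-relation and mapping
# properties recording relationships between terms.

<http://example.org/vocabA/preservedSpecimen>
    skos:broader <http://example.org/vocabA/specimen> ;
    skos:closeMatch <http://example.org/vocabB/PreservedSpecimen> .
```

Whether we want the entailments that come with these properties is part of the SKOS nervousness noted above.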

Progress towards resolution:

Note: I’ve started hacking away at some code I’ve pulled from various sources. They are at

https://github.com/tdwg/vocab/blob/master/code-examples/ontology-vocabulary.ttl

https://github.com/tdwg/vocab/blob/master/code-examples/controlled-vocabulary.ttl

Maybe we can eventually morph them into meaningful examples relevant to Darwin Core (as below). 2015-07-10: I’ve worked on modifying the ontology-vocabulary.ttl document so that it reflects the hierarchical model that I laid out in the Google diagram linked above.

2015-07-14: “Finished” the ontology-vocabulary.ttl doc, loaded it into a triplestore and ran the competency question SPARQL queries to make sure they actually worked.

Issue #20 (Describe how controlled vocabularies should be documented)

Look at extant examples of how SKOS is used to document controlled vocabularies and try to apply them to values of Darwin Core terms where “best practice is to use a controlled vocabulary”. See http://rs.gbif.org/vocabulary/gbif/ for GBIF lists of controlled values. See http://id.loc.gov/vocabulary/iso639-2/eus.rdf for the Library of Congress’ description of a language using SKOS. http://auscope-services.arrc.csiro.au/sissvoc/isc2014/resource.rdf?uri=http://resource.geosciml.org/classifier/ics/ischart/Devonian for a description of a geological era using SKOS. There are probably many other examples.
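Modeled on the SKOS examples cited above, a controlled vocabulary of Darwin Core term values might look something like this sketch. The scheme URI, concept URIs, and definition text are all invented for illustration:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
# Hypothetical controlled vocabulary for basisOfRecord-style values;
# every URI and string below is illustrative.

<http://example.org/cv/basisOfRecord> a skos:ConceptScheme ;
    skos:prefLabel "Basis of Record Controlled Vocabulary"@en .

<http://example.org/cv/basisOfRecord/PreservedSpecimen> a skos:Concept ;
    skos:inScheme <http://example.org/cv/basisOfRecord> ;
    skos:prefLabel "Preserved Specimen"@en ;
    skos:definition "A specimen that has been preserved."@en .
```

The GBIF and Library of Congress examples linked above should show how far real implementations go beyond this minimal shape (notations, translations, top concepts, etc.).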

Issue #17 (Specify a model for deprecating vocabularies)

I’ve taken a stab at this. See https://github.com/tdwg/vocab/issues/17#issuecomment-117758058 Does this work? Can we write up some version of it as a model? There is also term deprecation in the version model: https://github.com/tdwg/vocab/blob/master/version-model.md

Issue #14 (Establish best practices for asserting relationships within RDF documents)

The document https://github.com/pyvandenbussche/lov/blob/master/public/Recommendations_Vocabulary_Design.pdf (http://lov.okfn.org/Recommendations_Vocabulary_Design.pdf) was cited in Annex 1 of the VoMaG Report. As Bob has pointed out, this document doesn’t have any particular standing. Is there a more authoritative set of recommendations elsewhere? At least this is a starting point and one could look at some of the vocabularies at http://lov.okfn.org/dataset/lov/ to see if there is any apparent best-practice.

Issue #9 [partial] (Where do we keep documents, how do people find them, and how do people look at them?) and Issue #3 (Implications of vocabulary changes on interaction between applications and archived data)

Can we formalize the W3C mechanism described above for web documents and extend it to URIs for vocabularies and terms? How can or should owl:deprecated be used to indicate that a particular term is deprecated? How does one indicate the replacement term? We need RDF and human-readable examples.
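One way the deprecation-plus-replacement pattern asked about here could look in RDF, as a sketch with hypothetical term URIs (whether owl:deprecated and dcterms:isReplacedBy are the right pair is exactly the question to settle):

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
# Hypothetical URIs: flag a term as deprecated and point to its successor.

<http://example.org/terms/oldTerm>
    owl:deprecated true ;
    dcterms:isReplacedBy <http://example.org/terms/newTerm> ;
    rdfs:comment "Deprecated; use newTerm instead."@en .
```

The human-readable analogue would be a prominently flagged "Deprecated" status in the term's entry, with a link to the replacement.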

Discussion:

There was support among the group for the general models presented in the work-in-progress documents for the vocabulary hierarchy model and the version model. Steve will continue to work on cleaning these up and fleshing them out. Terry said he could try to work up an example using SKOS to describe a controlled vocabulary (I think he said towards the end of July). Some candidate controlled vocabularies are at http://rs.gbif.org/vocabulary/gbif/


Next steps:

The target date in the charter for submission of a working draft was 2015-07-15. That’s going to be tough to meet.

  1. Work on a draft for the Documentation Standard, starting with the existing document and adding text we create while following up on Issue Grouping B and C. If I have time, I might start on this in advance of the meeting.

  2. Discuss Issue #8 (Clarify roles of task groups, interest groups, the TAG in maintenance of vocabularies) in a dedicated Google Hangout that would include Cyndy Parr and some representation from the Executive. This issue is blocking several other issues and needs to be resolved soon.

  3. Create an outline/flow chart for vocabulary change/maintenance process based on the DwC Namespace Policy. Consider Issue #6 (Revise Darwin Core Namespace Policy to make it applicable across TDWG vocabulary standards), Issue #7 (Create a process to govern how non-term changes should be made to vocabulary standards), Issue #11 (Establish a timeframe for addressing proposed changes to vocabularies), Issue #12 (Clarify the communication mechanisms that should be used during the change process), and Issue #13 (Criterion for assessing the need for a change to a vocabulary) in addition to Issue #8 (above).

  4. Work on a draft for the Vocabulary Maintenance Specification.

Discussion:

The timeline in the charter is looking overly ambitious. We talked about the following steps forward:

  1. Steve would continue to work on fleshing out the machine-readable aspects of the Standards Documentation Specification that he’s been working on. Hopefully he’ll have some sort of draft by the time the fall semester cranks up in mid August.

  2. Terry was going to try to come up with some SKOS prototype for a controlled vocabulary (forgot to write down timeframe - I think around the end of July).

  3. Stan was going to figure out who should be involved in a Google hangout to discuss process related to adapting the DwC Namespace policy into a general Vocabulary Maintenance Specification. In particular, what roles do the various groups (IG, TG, TAG, Exec) play in the stated policy vs. in reality. How can we lay out a process where reality matches the policy? I think the timeframe for this was early August.

  4. The next major milestone would be to have some kind of material to present at the TDWG Annual Meeting at the end of September. Draft documents would be good, but that may be expecting too much.