The CWLProv folder structure is a Research Object that conforms to the RO BagIt profile and contains PROV traces detailing the execution of the workflow and its steps.
A relevant parts of the CWLProv folder structure is here explained using the revsort-run-1 example:
- bag-info.txt - minimal bag metadata (notably the
External-Identifier
) - manifest-*.txt - checksums of files under
data/
(algorithms subject to change) - tagmanifest-*.txt - checksums of metadata files excluding
data/
- metadata/manifest.json - Research Object manifest as JSON-LD. Types and relates files within bag.
See the CWLProv BagIt profile for details on the BagIt structures and suggested file paths.
This document defines what elements should be present in the Research Object manifest; forming the CWLProv Research Object profile.
While the BagIt manifests provides checksums of CWLProv files, they cannot include any additional information, such as file type, provenance, attribution or relations. To this end CWLProv uses the Research Object specifications, which reuse existing Linked Data standards like OAI-ORE, JSON-LD, Web Annotation Model, PROV and PAV.
While advanced users may facilitate these underlying standards and their corresponding tooling, CWLProv is intended to be generated or consumed without any deep knowledge of their working.
The file metadata/manifest.json follows the structure defined for Research Object Bundles. Note that .ro/
is instead called metadata/
as CWLProv conforms to the derived RO BagIt profile for storing a Research Object using BagIt.
The metadata/manifest.json
file SHOULD follow the JSON structure defined here, and MUST be valid JSON-LD, e.g. escaping space in file name URIs as %20
.
Consumers of CWLProv MAY parse the RO manifest as pure JSON, alternatively as JSON-LD using tools like Apache Jena for querying or integration.
The expected keys of the CWLProv manifest are explained below. Note that hashes/UUIDs below may not match exactly the revsort-run-1 example.
The @context
SHOULD be of the form:
"@context": [
{
"@base": "arcp://uuid,4cca2dd8-3bd5-45cb-8d34-e7f346be027e/metadata/"
},
"https://w3id.org/bundle/context"
]
This JSON-LD context enables consumers to alternatively consume the JSON file as Linked Data with absolute identifiers, and provides mapping to namespaces of the reused standards.
The @base
value SHOULD be based on the [arcp External-Identifier](bagit.md#External Identifier) in the bag-info.txt
, and SHOULD be an absolute URI for the /metadata/
folder within this bag (where this manifest is stored).
CWLProv research objects MUST declare conformsTo
to indicate their conformance with this document. The value SHOULD match a published CWLProv permalink.
"conformsTo": "https://w3id.org/cwl/prov/0.4.0",
The manifest SHOULD lists which software version created the Research Object under createdBy
:
"createdBy": {
"uri": "urn:uuid:7c9d9e88-666b-4977-85f4-c02da08a942d",
"name": "cwltool 1.0.20180416145054"
}
Note that the uri
here constitutes a particular execution on a particular machine. This identifier SHOULD be a UUID, but it MAY be http
or https
based to indicate a particular web portal installation.
The manifest SHOULD list the person who "authored the run" - e.g. who requested cwltool
to execute the workflow with the given inputs. In a portal environment this will be the logged in user who clicked the Run button.
"authoredBy": {
"orcid": "https://orcid.org/0000-0002-1825-0097",
"name": "Stian Soiland-Reyes"
}
The author SHOULD be identified at orcid
using ORCID identifiers starting with https://orcid.org/
. The uri
field MAY be included, e.g. http://portal.example.com/user/2
.
Engines SHOULD propagate the value of the ORCID
shell environment variable if provided, ensuring the ORCID identifier format is valid.
Note that the author of the workflow run may differ from the author of the workflow definition, which can instead be indicated under aggregates.
The list of aggregates
are the main resources that this Research Object transports.
FIXME: Rewrite this section to recommendation language
"aggregates": [
{
"uri": "urn:hash::sha1:53870991af88a6d678cbeed3255bb65993c52925",
...
},
{ "provenance/primary.cwlprov.xml",
...
},
{
"uri": "../workflow/packed.cwl",
"createdBy": {
"uri": "urn:uuid:7c9d9e88-666b-4977-85f4-c02da08a942d",
"name": "cwltool 1.0.20180416145054"
},
"conformsTo": "https://w3id.org/cwl/",
"mediatype": "text/x+yaml; charset=\"UTF-8\"",
"createdOn": "2018-04-16T18:27:09.513824"
},
{
"uri": "../snapshot/hello-workflow.cwl",
"conformsTo": "https://w3id.org/cwl/",
"mediatype": "text/x+yaml; charset=\"UTF-8\"",
"createdOn": "2018-04-04T13:29:55.717707"
}
Beyond being a listing of file names and identifiers, this also lists formats and light-weight provenance. We note that the CWL file is marked to conform to the https://w3id.org/cwl/ CWL specification.
Some of the files like packed.cwl
have been created by cwltool
as part of the run, while others have been created "outside" the run (e.g. inputs).
(Note that cwltool
is currently unable to extract the original authors and contributors of the original files, this is planned for future versions).
Under annotations
we see that the main point of this whole research object (/
aka arcp://uuid,67f38794-d24a-435f-bd4a-0242a56a581b/
)
is to describe something called urn:uuid:67f38794-d24a-435f-bd4a-0242a56a581b
:
"annotations": [
{
"about": "urn:uuid:67f38794-d24a-435f-bd4a-0242a56a581b",
"content": "/",
"oa:motivatedBy": {
"@id": "oa:describing"
}
},
We will later see that this is the UUID for the workflow run. A workflow run is an activity, something that happens - it can't be directly saved to a file. However it can be described in different ways, in this case as CWLProv provenance:
{
"about": "urn:uuid:67f38794-d24a-435f-bd4a-0242a56a581b",
"content": [
"provenance/primary.cwlprov.xml",
"provenance/primary.cwlprov.nt",
"provenance/primary.cwlprov.ttl",
"provenance/primary.cwlprov.provn",
"provenance/primary.cwlprov.jsonld",
"provenance/primary.cwlprov.json"
],
"oa:motivatedBy": {
"@id": "http://www.w3.org/ns/prov#has_provenance"
}
Finally the research object wants to highlight the workflow file:
{
"about": "workflow/packed.cwl",
"oa:motivatedBy": {
"@id": "oa:highlighting"
}
},
And links the run ID 67f38794..
to the primary-job.json
and packed.cwl
:
{
"about": "urn:uuid:67f38794-d24a-435f-bd4a-0242a56a581b",
"content": [
"workflow/packed.cwl",
"workflow/primary-job.json"
],
"oa:motivatedBy": {
"@id": "oa:linking"
}
}