[Requested feature] Interaction with identifiers.org API Web Services #59

Open
M-casado opened this issue Dec 8, 2022 · 7 comments

@M-casado
Contributor

M-casado commented Dec 8, 2022

Summary

A feature to check whether a CURIE resolves against the identifiers.org API web services, so that we can tell whether an element exists in another resource.

Motivation

A feature of this type would greatly improve the utility of the schemas by adding an extra step of semantic validation against the resourceful identifiers.org. See the use cases below for examples of how I envision this feature enriching the metadata standards of my resource (EGA).

Details

Similar to how the current custom keywords interact with the OLS API, I would like to request a feature (e.g. a new keyword) that makes a quick API call to identifiers.org and validates whether an element exists in another resource based on a given CURIE.

In order to resolve a CURIE, identifiers.org only requires a Compact Identifier consisting of a unique prefix and a provider-designated local accession number (prefix:accession). Given this structure, an example of the minimal custom keyword I envisioned (named identifiersExists here, but it could take any other name) is:

{
    "type": "object",
    "properties": {
        "arrayOrEnaIdentifier": {
            "type": "string",
            "identifiersExists":  {
                "prefixes" : ["arrayexpress", "ena.embl"]
            }
        }
    }
}

In the above example, we indicate that the given arrayOrEnaIdentifier (a CURIE) has to exist in either the ArrayExpress or the ENA EMBL namespace (arrayexpress and ena.embl respectively). Therefore, the following JSON documents (i.e. data) would be valid:

# JSON document 1
{
    "arrayOrEnaIdentifier": "arrayexpress:E-MEXP-1712"
}

# JSON document 2
{
    "arrayOrEnaIdentifier": "ena.embl:BN000065"
}

These last two identifiers would resolve automatically against identifiers.org using the following URI structure:

  • identifiers.org + compact identifier
    • JSON document 1: https://identifiers.org/arrayexpress:E-MEXP-1712
    • JSON document 2: https://identifiers.org/ena.embl:BN000065
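The URI structure above can be sketched in a few lines of Python. This is only an illustration: the function name `resolution_url` is made up for the example, and the base URL is the one shown in the list above.

```python
from urllib.parse import quote

# Base URL as it appears in the resolution examples above.
RESOLVER_BASE = "https://identifiers.org/"


def resolution_url(curie: str) -> str:
    """Build the identifiers.org URL for a compact identifier (prefix:accession)."""
    prefix, sep, accession = curie.partition(":")
    if not sep or not prefix or not accession:
        raise ValueError(f"not a compact identifier: {curie!r}")
    # Keep ':' and '/' readable; percent-encode anything unsafe in a URL path.
    return RESOLVER_BASE + quote(f"{prefix}:{accession}", safe=":/")


print(resolution_url("arrayexpress:E-MEXP-1712"))
print(resolution_url("ena.embl:BN000065"))
```

Splitting on the first colon only means an accession that itself contains a colon is left untouched.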

Nevertheless, it is also important to account for the designated namespace prefix: a compact identifier not only needs to resolve to an existing record in a resource, but also needs to carry one of the designated prefixes. One of the namespaces of identifiers.org is identifiers.org itself, which could also be used, if needed, to assert that a namespace exists (when compiling the schemas). Therefore, the following JSON document would not be valid, even though identifiers.org resolves it correctly:

{
    "arrayOrEnaIdentifier": "ncbigene:100010"
}
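As a rough sketch of the prefix rule, assuming the keyword compares the part before the first colon against the `prefixes` list by exact string match (the helper name `has_allowed_prefix` is hypothetical and not part of any existing validator):

```python
from typing import Iterable


def has_allowed_prefix(curie: str, allowed_prefixes: Iterable[str]) -> bool:
    """Return True if the CURIE's prefix is one of the prefixes listed in the schema."""
    prefix, sep, accession = curie.partition(":")
    if not sep or not prefix or not accession:
        return False  # not of the form prefix:accession
    return prefix in set(allowed_prefixes)


# Prefixes from the identifiersExists example in this issue.
ALLOWED = ["arrayexpress", "ena.embl"]

for curie in ("arrayexpress:E-MEXP-1712", "ena.embl:BN000065", "ncbigene:100010"):
    print(curie, has_allowed_prefix(curie, ALLOWED))
```

This check needs no network access, so it could run before any call to identifiers.org.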

Likewise, it would be invalid if the compact identifier, even with the correct prefix, did not resolve to a record in the resource. For example, if I used the following made-up accession, arrayexpress:E-MEXP-17121 (with an extra 1 added at the end):

{
    "arrayOrEnaIdentifier": "arrayexpress:E-MEXP-17121"
}

It is also important to distinguish between a record that is invalid because identifiers.org rejected the API call (e.g. a format error such as arrayexpress:hello-world) and one that is invalid because the record does not exist in the designated resource (e.g. arrayexpress:E-MEXP-17121). Although the latter depends on how each resource redirects non-existing records, it should be straightforward to address once the identifier is resolved to the registry URI.
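To make the distinction concrete, here is a minimal sketch of how the different failure reasons could be reported separately. Everything in it is hypothetical: the `Outcome` names, the `classify` function, and the two callables `resolver_accepts` and `record_exists`, which stand in for whatever calls to identifiers.org and to the resource end up being used. The stubs at the bottom are hard-coded from the examples in this issue and make no network requests.

```python
from enum import Enum
from typing import Callable, Iterable


class Outcome(Enum):
    VALID = "valid"
    MALFORMED = "not a prefix:accession string"
    PREFIX_NOT_ALLOWED = "prefix not allowed by the schema"
    REJECTED_BY_RESOLVER = "identifiers.org rejected the identifier"
    RECORD_NOT_FOUND = "record not found in the resource"


def classify(
    curie: str,
    allowed_prefixes: Iterable[str],
    resolver_accepts: Callable[[str], bool],
    record_exists: Callable[[str], bool],
) -> Outcome:
    """Run the checks from cheapest to most expensive and report the first failure."""
    prefix, sep, accession = curie.partition(":")
    if not sep or not prefix or not accession:
        return Outcome.MALFORMED
    if prefix not in set(allowed_prefixes):
        return Outcome.PREFIX_NOT_ALLOWED
    if not resolver_accepts(curie):
        return Outcome.REJECTED_BY_RESOLVER
    if not record_exists(curie):
        return Outcome.RECORD_NOT_FOUND
    return Outcome.VALID


# Stand-in stubs (assumptions for illustration only).
KNOWN_RECORDS = {"arrayexpress:E-MEXP-1712", "ena.embl:BN000065"}


def stub_resolver_accepts(curie: str) -> bool:
    return curie != "arrayexpress:hello-world"


def stub_record_exists(curie: str) -> bool:
    return curie in KNOWN_RECORDS


ALLOWED = ["arrayexpress", "ena.embl"]
for curie in ("arrayexpress:E-MEXP-1712", "ncbigene:100010",
              "arrayexpress:hello-world", "arrayexpress:E-MEXP-17121"):
    print(curie, "->", classify(curie, ALLOWED, stub_resolver_accepts, stub_record_exists).name)
```

Keeping the two remote checks as injected callables leaves open how each one is implemented.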

Use cases

  • Asserting that a gene referenced in an experimental design exists: https://identifiers.org/ncbigene:100010
  • Asserting that an Array Design Format (ADF) exists in ArrayExpress instead of having to submit it to the EGA: https://identifiers.org/arrayexpress.platform:A-AFFY-98
  • Relying on platforms submitted to ArrayExpress instead of having to submit them to the EGA: https://identifiers.org/arrayexpress.platform:A-GEOD-50
  • Asserting that previously submitted objects exist in EGA without the need for internal access: https://identifiers.org/ega.dataset:EGAD00000000001
  • Asserting that a referenced pipeline exists in GitHub: https://identifiers.org/github:EbiEga/ega-metadata-schema
@theisuru
Collaborator

@M-casado
I started working on this but got stuck on a point regarding identifier resolution.

From your issue:

Likewise, it would be invalid if the compact identifier, even with the correct prefix, did not resolve to a record in the resource. For example, if I used the following made-up accession, arrayexpress:E-MEXP-17121 (with an extra 1 added at the end):

{
    "arrayOrEnaIdentifier": "arrayexpress:E-MEXP-17121"
}

It seems that identifiers.org will resolve/redirect even if it is given an invalid identifier with the correct format. The request is redirected to the resource page, and each resource has its own mechanism for handling non-existing identifiers. Any thoughts on this?

@M-casado
Contributor Author

@theisuru

It's basically what I was envisioning, but more difficult, perhaps:

It is also important to distinguish between a record that is invalid because identifiers.org rejected the API call (e.g. a format error such as arrayexpress:hello-world) and one that is invalid because the record does not exist in the designated resource (e.g. arrayexpress:E-MEXP-17121). Although the latter depends on how each resource redirects non-existing records, it should be straightforward to address once the identifier is resolved to the registry URI.

I was hoping that the responses could be aggregated and interpreted easily (e.g. 200 meaning a record exists, and anything else meaning it failed). In our use case, being able to know through identifiers.org whether the format of another identifier is correct (#61) is only half of the problem.

My hopes:

  • When an identifier resolves to a record (in any archive), it has a similar response (i.e. not 404).
  • Identifiers.org may have a programmatic way to interpret the records of the archives it compiles.

Besides, at least in my use-case, the content of the record is not relevant, just that the record exists. So hopefully all archives respond in a similar way when a record is missing and is unresolvable (?)

@theisuru
Collaborator

The problem is that each archive has a different way of responding when a record is not there. If they all responded with 404, it would be possible. For your example arrayexpress:E-MEXP-17121, ArrayExpress returns 200 and an HTML page.
I would expect that if I set "Content-type" to JSON, they would return a JSON payload with a 404, but that does not seem to be the case.

I believe it is the responsibility of identifiers.org to report whether a resource actually exists. As users of their API, it is out of our scope to infer beyond what they provide. We can contact them and ask whether this is possible.

@M-casado
Contributor Author

That's a shame; it would be amazing to have that feature working at some point. We should definitely ask identifiers.org if there is a way to do so.

@M-casado
Contributor Author

I took a quick look at their API documentation and they have this Validate Sample ID section, but I believe it's probably just what you were doing already, right? The fact that it's a validation of the ID doesn't mean, I guess, that there's an existing record behind the ID.

@theisuru
Collaborator

That seems to be the correct API to use, but as you suspected, it does not work correctly when archives return the wrong HTTP status codes.

Check the two examples below, both with invalid IDs:

# responds with error message: Id does not exist
curl -X POST "https://registry.api.identifiers.org/prefixRegistrationApi/validateSampleId" -H "accept: */*" -H "Content-Type: application/json" -d '{
	"apiVersion": "1.0",
	"payload": {
		"sampleId": "SAMEA23976766",
		"providerUrlPattern": "https://www.ebi.ac.uk/biosamples/samples/{$id}"
	}
}'

# responds with VALIDATION OK
curl -X POST "https://registry.api.identifiers.org/prefixRegistrationApi/validateSampleId" -H "accept: */*" -H "Content-Type: application/json" -d '{
	"apiVersion": "1.0",
	"payload": {
		"sampleId": "E-MEXP-171211",
		"providerUrlPattern": "https://www.ebi.ac.uk/biostudies/arrayexpress/studies/{$id}"
	}
}'
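For anyone who wants to script the same call, the request can be assembled in Python with only the standard library. This is a sketch that mirrors the curl commands above (same URL, headers and payload); the function name `build_validate_request` is made up for the example, and the request is only built here, not sent.

```python
import json
import urllib.request

# Endpoint taken from the curl commands above.
VALIDATE_URL = "https://registry.api.identifiers.org/prefixRegistrationApi/validateSampleId"


def build_validate_request(sample_id: str, provider_url_pattern: str) -> urllib.request.Request:
    """Assemble (but do not send) the POST request used in the curl examples."""
    body = {
        "apiVersion": "1.0",
        "payload": {
            "sampleId": sample_id,
            "providerUrlPattern": provider_url_pattern,
        },
    }
    return urllib.request.Request(
        VALIDATE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Accept": "*/*", "Content-Type": "application/json"},
        method="POST",
    )


req = build_validate_request(
    "E-MEXP-171211",
    "https://www.ebi.ac.uk/biostudies/arrayexpress/studies/{$id}",
)
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=10)
```

Error responses would surface as `urllib.error.HTTPError` when the request is sent, so a caller would need to handle that case.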

This might also delay the validation considerably, given the addition of 2 more API calls.
I am thinking that maybe we could introduce an extra property in the keyword for when this second resolution is necessary.

@M-casado
Contributor Author

@theisuru I agree with the extra property in the keyword, something to denote that not just the format but also "record exists" should be enforced.

Now, onto how to do it... We could always contact the archives we intend to use so that they provide correct responses. It could be either that their API response is lazy or that the endpoint they mapped to identifiers.org is not the correct one.

For example, knowing that BSD does provide correct responses, we could use it as it is with this API call. And if we add another archive to the bunch, we would check it first and, if needed, contact them (?)

I cannot think of any other way to interpret CURIEs in a generic way through identifiers.org.
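One way this could look, purely as a sketch: an opt-in flag in the keyword plus a list of archives already checked to return correct responses. The property name `recordExists`, the function `plan_checks`, and the `biosample` prefix used in the usage lines are all assumptions for illustration, not an agreed design.

```python
from typing import Any, Dict, Iterable, List


def plan_checks(keyword_config: Dict[str, Any],
                verified_prefixes: Iterable[str]) -> Dict[str, List[str]]:
    """Decide, per prefix, which checks to run when the schema is compiled.

    The record check is opt-in (it costs extra API calls) and is only allowed
    for archives already verified to answer correctly for missing records.
    """
    verified = set(verified_prefixes)
    want_record = bool(keyword_config.get("recordExists", False))
    plan: Dict[str, List[str]] = {}
    for prefix in keyword_config["prefixes"]:
        checks = ["format"]
        if want_record:
            if prefix not in verified:
                raise ValueError(
                    f"record existence cannot be checked reliably for prefix {prefix!r}"
                )
            checks.append("record")
        plan[prefix] = checks
    return plan


# Hypothetical allow-list: archives confirmed to report missing records correctly.
VERIFIED = {"biosample"}

print(plan_checks({"prefixes": ["arrayexpress", "ena.embl"]}, VERIFIED))
print(plan_checks({"prefixes": ["biosample"], "recordExists": True}, VERIFIED))
```

Failing at schema compilation, rather than silently skipping the record check, would make it obvious when a newly added archive still needs to be checked or contacted.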
