Remove restrictions on netCDF object names #237
Comments
While I generally approve of relaxing the character set restrictions, I think we may need to consider certain patterns that should either be reserved or restricted. As an example, the use of slashes ('/') in names wreaks havoc with group path formalisms that are already in place outside of CF. In addition to the prohibition on having leading underscores that is mentioned in the proposal, the netCDF-LD project (@marqh) is making use of doubled underscores within a name as a mechanism for marking namespaces. There may be other cases "in the wild" where certain patterns are in use, and I think we should be careful to avoid causing problems by being overly loose here. I suggest that, at minimum, we should disallow the use of slashes ('/') or backslashes ('\') in names, and should call out two or more sequential underscores ('__') as reserved. |
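The patterns suggested in this comment can be sketched as a small check. This is only an illustration under stated assumptions: the helper name `reserved_pattern_problems` and its messages are hypothetical and are not part of CF, the NUG, or netCDF-LD.

```python
def reserved_pattern_problems(name):
    """Return the reasons a name would clash with the patterns suggested above.

    Hypothetical helper for illustration; not part of any CF or netCDF library.
    """
    problems = []
    if "/" in name:
        problems.append("contains '/' (clashes with group paths)")
    if "\\" in name:
        problems.append("contains a backslash")
    if name.startswith("_"):
        problems.append("leading underscore")
    if "__" in name:
        problems.append("two or more sequential underscores")
    return problems


# Made-up example names, one per pattern.
for candidate in ["air_temperature", "group/var", "_private", "bald__concept"]:
    print(candidate, reserved_pattern_problems(candidate))
```

A name can trip more than one rule at once: `"__x"` has both a leading underscore and a doubled underscore.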
I support the constraint indicated above. Especially allowing slashes and backslashes in names will be confusing. |
Agreed, I think it would be best if the restrictions were presented in a table for readability. |
We may get some benefit from considering other standardisation activity in this domain? RFC3986 defines the generic syntax for the Uniform Resource Identifier (URI). As netCDF variables are resources that are being identified within the domain of a netCDF file, could we benefit from just adopting RFC3986? This has a reserved character section: Disclaimer: I have not cross-referenced this in detail with the NUG to examine consistency or problem areas (potential for contribution if useful). If these are consistent, then adopting the NUG definition unchanged looks sensible to me. It already mandates against the use of a '/' character, which is the most problematic one for me, given groups and variable identity within groups. I'd like to see an explicit reference to the relevant NUG section in the text or linked, as I had to search a bit and I know what I'm looking for. mark |
@marqh I like the overall suggestion of RFC3986. I think we should not adopt the "% encoding" concept of RFC3986. And, again, I think we should reserve leading "_" characters (per NUG) and multiple sequential "_" characters (per netCDF-LD). Are there any other special character sequences in the wild that anyone is aware of, in UGRID or Radial perhaps? I notice that the NUG section you referenced implies that space characters are allowed as long as they are not at the end of a variable name. Do we want to allow internal spaces? |
I agree, @JimBiardCics, that adoption of %encoding is not a path I would want to walk. it's perhaps a useful cross reference, but points like this suggest against including some specific use of RFC3986 within CF
internal spaces!?!? really if we can stop that, then that is a good thing. Why would the NUG allow variable names with spaces in them?? my reading of
led me to view space as not allowed. However, the following:
Could someone from a Unidata background confirm or deny that in netCDF4, a space may be used within a variable name? |
I have zero Unidata authority, but I'd like to state the obvious: Unicode is complicated. |
I'm afraid I'm the odd man out here - I don't think the list of benefits in the original issue stacks up against the costs; in fact some of them don't seem to BE benefits. Maybe some use cases would be helpful ... Could you elaborate on how this change would support international usage? Is improved compliance for some existing data sets really a goal? What's in these data sets that needs to be described with a name that begins with a number or contains spaces or special characters? Maybe this is a selfish concern - we use Matlab's built-in netCDF library, and I'm not sure how that would deal with this change. If it's really needed for some specific reason, we'll deal with it, but absent that explanation, this is just a headache for a lot of CF users. |
Is there a user asking for this extension, a particular use case that needs addressing? CF has generally tried to avoid extensions that seem like a good idea but don’t have a current use case. Having said that, if we do move forward, I think we should be very cautious. Not only is Unicode very complicated as @zklaus points out, so are the rules around reserved character sets in URLs (and in which part of the URL) and file systems. Extending the set of characters allowed to include those reserved characters means they will need to be properly encoded when used in URLs (e.g., OPeNDAP and OGC WCS). Which, it turns out, isn’t as easy as it might seem. Also, this or similar proposals/discussions have come up before, I think several times but so far I've only found these two:
|
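As a rough illustration of the encoding burden mentioned above, here is a minimal sketch using only Python's standard `urllib.parse`. The variable names are made up, and no OPeNDAP or WCS client API is involved; a real service may apply different rules to different parts of the URL.

```python
from urllib.parse import quote, unquote

# Hypothetical variable names; the last three contain characters that are
# reserved or unsafe in URLs.
names = ["air_temperature", "a/b", "rain rate", "O2?"]

for name in names:
    # safe="" makes quote() encode "/" too, which it leaves alone by default.
    encoded = quote(name, safe="")
    # The name only survives if every layer decodes exactly once.
    assert unquote(encoded) == name
    print(f"{name!r:20} -> {encoded}")
```

With the default `safe="/"`, the name `a/b` would pass through unencoded and be indistinguishable from a path with two segments, which is one way this can go wrong in practice.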
@WardF and @lesserwhirls - Could you address the question of whether whitespace characters are allowed in netCDF variable names? |
Having blank spaces in names would break other CF conventions like use of the ancillary variables attribute. "The attribute ancillary_variables is used to express these types of relationships. It is a string attribute whose value is a blank separated list of variable names. " How to parse this? |
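The parsing problem can be shown in a few lines. This is a sketch with a made-up attribute value and made-up variable names, not output from any real file.

```python
# Hypothetical attribute value: the writer intended two variables,
# "quality flag" and "standard_error".
ancillary_variables = "quality flag standard_error"

# The usual way to read a blank-separated list.
parsed = ancillary_variables.split()
print(parsed)

# Names actually defined in the (imaginary) file.
defined = {"quality flag", "standard_error"}
missing = [name for name in parsed if name not in defined]
print(missing)
```

The reader recovers three tokens instead of two names, and two of the tokens match no variable in the file. Nothing in the attribute value itself says where one name ends and the next begins.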
I must be missing something, but if a variable is named, for example, "a-b", and one uses that in a computer code, how is it interpreted? How is that variable distinguished from the operation: subtract variable "b" from variable "a"? Don't "+", "-", "/", "*", " " all have this problem? |
@taylor13 Your code would have to parse the variable name into code. Until you did something like that, it is just a string. |
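To make the "it is just a string" point concrete, here is a small sketch. The dictionary is a hypothetical stand-in for variables read from a file; it is not the API of any particular netCDF library.

```python
import keyword

# Hypothetical stand-in for the variables in a file: a plain mapping from
# name to value.
variables = {"a": 5.0, "b": 2.0, "a-b": 10.0}

# Looked up by string, "a-b" is just a key; no subtraction happens.
print(variables["a-b"])
print(variables["a"] - variables["b"])


def usable_as_python_identifier(name):
    """True if the name could also serve as a Python variable name."""
    return name.isidentifier() and not keyword.iskeyword(name)


print(usable_as_python_identifier("a_b"))
print(usable_as_python_identifier("a-b"))
```

The helper at the end addresses a separate question: whether a file's name can double as a program variable. That depends on the programming language, not on netCDF.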
As a user of data, I usually like the names of my variables (in my codes) to be the same as their names in the netCDF file. With the current naming convention for CF, this is always possible, I think. If, however certain restrictions were removed, as suggested above, this would no longer be true. |
Well, thank you for all your thoughtful responses. I see that we are rehashing the 2014 discussion, and probably others. Thanks @ethanrd for finding that. There are good arguments pro and con there, and it is worth reading. The difference is that only 4 extra characters were proposed in 2014. I simply want to legalize all the other 137 thousand!
No, I do not have a current use case. This is a recurring issue, so I thought this comprehensive approach would be beneficial. Past use cases were mentioned or implied in the 2014 discussion, and in trac 157. NetCDF developers put some care into expanded name capability, 12 years ago. However, CF restrictions are copied virtually unchanged from 25 year old COARDS rules, which were probably based on ASCII only. CF is overdue to allow the full naming range for creative purposes by all scientific users. Name quoting is generally easy and well supported in most modern programming languages. This takes care of UTF-8, math symbols, and other active characters. IMO, naming freedom should outweigh exactly matching names of program variables. |
Not everyone writes their own netCDF translators, and some packages no doubt take the variable and attribute names from the netCDF variable and attribute names. Those who use these packages are least likely to be in a position to accommodate this change. When I have a minute I'll give it a try with the Matlab netCDF interface. I'd be much happier to spend the time on it if there was more than 'creative purposes' for a reason. The trac ticket has an example of isotopes with names that begin with a number, which has some weight, but the work around for that seems simple compared to what would be needed by someone using code that auto-assigns variable names. On the other hand, most folks probably work with multiple standards; OceanSITES would no doubt maintain the variable name restriction, if CF doesn't. |
I agree that it would be good to have use cases. @ngalbraith is also right that not everyone is writing their CF code based on naked netCDF access. Indeed, I consider such an approach foolish, since CF is far too rich by now to stand a serious chance of getting it right. However, while using the netCDF variable name as a program variable name might be excused in small, not reused code that only ever will deal with, say Hence, I don't think the argument that all netCDF variable names should be permissible program variable names in all programming languages should guide the design of CF. |
I had the same thoughts as @zklaus when thinking about the security implications of what I could only imagine was an |
I agree that some use cases would be helpful. I'm not sure about the specific proposal that initiated the discussion, but I do agree with the thought behind it that we should have a considered and reasoned policy on this, rather than just having a frozen-in rule based on past library constraints. One reason that we might want to depart from the full freedom allowed in NetCDF is that we have, in CF, a range of different attributes to describe a variable. The Some application libraries need, in places, identifiers with a restricted character set. For example, I can construct a Note that the Another potential use case is for identifiers of concepts described in RDF Turtle, which has a character restriction on object names, broader, I think, than "alphanumeric characters and underscores", but definitely narrower than the 137 thousand characters available in UTF-8. The desire to have a simple identifier is linked, in my mind at least, to the concept of a namespace, which is being discussed in the context of NetCDF (see NetCDF-ld and discussion on namespace delimiters). I don't think this is simply a matter of upgrading software to make it accept generic strings: there is a wide range of applications that exploit identifiers constructed from a limited character set in order to enable the use of identifiers within a text string. |
One potential use case that always came to my mind, without an actual example at hand, is the native names of weather stations, say a temperature time-series from the Umeå station, where the variable name contains the station name. What makes this particularly interesting is that it seems to be permitted already under current CF conventions, since under CF-1.8, Section 2.3 Naming Conventions it says:
|
Hi @zklaus: good point about the existing rules. Regarding your use case; wouldn't that use case be covered by setting the The cfchecker (4.0) takes a narrower view of what is allowed, restricting variable names to strings matching the Python regex: |
Yes, that might be a good way to encode the information. What I wanted to say is this: I find it very plausible that in a national weather service a group sits together and decides to code their station data using variable names So I think being more explicit about what is meant by "letter" would be good, even if that means saying that only ascii letters are allowed. |
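The ambiguity around "letter" can be seen by comparing two readings of a rule along the lines of "begin with a letter and be composed of letters, digits, and underscores". Both patterns below are assumptions made for illustration; neither is taken from the cfchecker or from the CF text, and the variable names are invented.

```python
import re

# Two illustrative readings of "letter": ASCII only, or any Unicode letter.
ASCII_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
# For str patterns \w is Unicode-aware; [^\W\d_] means "a word character
# that is neither a digit nor an underscore", i.e. a letter.
UNICODE_NAME = re.compile(r"[^\W\d_]\w*")

for name in ["temperature_umea", "temperature_Umeå", "2m_temperature"]:
    print(
        name,
        bool(ASCII_NAME.fullmatch(name)),
        bool(UNICODE_NAME.fullmatch(name)),
    )
```

The name containing "å" passes the Unicode reading and fails the ASCII one, while the name starting with a digit fails both. A checker and a data producer that pick different readings will disagree about the same file.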
@JonathanGregory, no, this issue is not waiting on #477. This issue #237 is a free-standing proposal to remove all CF-specific restrictions on Netcdf object names. In my view, this #237 is currently an open discussion, and waiting vaguely on a general consensus. |
Early on in this thread there were references to work on "Netcdf-LD", and I found a github repo. Anyone know the current status of this proposal in general, and in relation to OGC? Maybe @marqh or @ethanrd? I am asking because of the comment that
|
Hi Lars @larsbarring - I believe this OGC netCDF-LD GH repo is the more current one. It provides a link to the OGC netCDF-LD draft specification. The OGC process involves a public comment period before proceeding to a vote. If I'm remembering correctly, the specification went out for public comment but hasn't yet gone out for a vote. Mark @marqh may be able to provide more details. |
If this is waiting on general consensus to come to a resolution, I'll jump in and say that I oppose this proposal. A lot of very serious interoperability and security concerns have been raised about the idea of removing all restrictions on naming, and I don't see any benefits that outweigh them. Moreover, we don't have an actual motivating use case; this is an anticipatory change, which CF generally tries to avoid. I'm open to motivated proposals that extend the allowed set of characters in a specific and more limited way, such as #477 (which has been accepted and is just waiting for a PR), but I think the discussion there demonstrates why it's important to be conservative and carefully discuss all the impacts of adding new allowed characters. |
I fully agree with @sethmcg. Moreover, the opening sentence of Section 2.3 reads
where the operative word is should, which, if we interpret it as being in uppercase according to BCP14/RFC2119 means:
This interpretation of "should" strikes me as a reasonable balance between strictness/limitations and openness/flexibility. If the CF Community moves to introduce BCP14 in the Conventions document there is of course the possibility that the word should is replaced by MUST, but that is a good time to revisit this issue. |
The opening sentence of Section 2.3 states that
Lars is correct that the word "should" here is a recommendation, as is clarified by Sect 2.3 of the conformance document. The conformance document further clarifies it
and both the standard and the conformance document add (again, as a recommendation)
which results from the agreed proposal #477 of @Dave-Allured. @larsbarring and @sethmcg have expressed views against a blanket removal of restrictions on the characters to be used in CF-netCDF object names. I agree that removing all restrictions would not be consistent with the usual CF approach. Normally, we consider specific proposals to change the status quo, motivated by present use cases. Are there other views on this question? It would be good to reach a consensus. Thanks. |
The sections @JonathanGregory points at essentially provide a whitelist of explicitly allowed characters; all other characters are not recommended (or recommended against) but not explicitly disallowed. But throughout this conversation there have been several remarks that some characters should indeed be explicitly disallowed. This could easily be done by amending the text in section 2.3 to list which characters and character ranges CF explicitly disallows, i.e. creating a blacklist. All other characters would then belong to a "greylist" where users are on their own and cannot expect the same level of interoperability and support from common libraries and software tools. |
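The three tiers described in this comment could look something like the sketch below. Both the "recommended" pattern and the disallowed set are placeholders chosen for illustration; the thread has not agreed on either, and `classify` is a hypothetical helper.

```python
import re

# Assumption: an ASCII reading of the recommended names in section 2.3.
RECOMMENDED = re.compile(r"[A-Za-z][A-Za-z0-9_]*")
# Assumption: a placeholder blacklist containing only "/" and "\".
DISALLOWED_CHARS = set("/\\")


def classify(name):
    """Sort a name into the three tiers described above."""
    if any(ch in DISALLOWED_CHARS for ch in name):
        return "blacklist"
    if RECOMMENDED.fullmatch(name):
        return "whitelist"
    return "greylist"


for name in ["sea_surface_temperature", "2m_temperature", "temp/K"]:
    print(name, "->", classify(name))
```

The blacklist check runs first so that a name containing a disallowed character is never reported as merely "greylist".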
This wording with "should" is confusing and unfriendly in context of that opening paragraph on netCDF object names. Witness multiple tickets filed to remove character restrictions which did not really exist. If that were simply reworded to clearly express the allowed versus recommended character sets, that would be sufficient. CF is for scientists and programmers, not lawyers. |
We've already agreed elsewhere that we will check all the "must", "should" etc. words to make them conform to BCP-14, in which "should" indicates a recommendation. In this case, our interpretation has apparently changed. The text in sect 2.3
has been the same since CF version 1.0. However, up to version 1.7 of the conformance document this was listed as a requirement
In version 1.8 of the conformance document it turned into a recommendation
That change was made by @davidhassell in 2a44ccc and c3fa6fd. Do you remember why this change was made, David? According to principle 9 of sect 1.2, we shouldn't revert to making it a requirement:
Therefore I propose that we change the first sentence of 2.3 to read
which makes it consistent with the present conformance document. I believe that all those who've contributed recently think that this is what the text should mean. Are you content with making this change? |
Hello @JonathanGregory,
Those commits were from PR #227 that fixed issue #226 (Correct the wording in the conformance document section 2.3 "Naming Conventions"). Thanks, David |
Dear @davidhassell Thanks. I didn't remember about #226, where we previously decided that "should" was intended to mean a recommendation. Since the discussion above shows that it is open to question, I believe that my proposal to change the text in sect 2.3 would be helpful, from
to
I'm relabelling this issue as a Best wishes Jonathan |
Well aware of my ever so often much too "free and relaxed interpretation" of English spelling and grammar, I nevertheless venture to ask if it would be possible to somehow exclude the "should" in the suggested wording:
? |
Hi Lars, That sounds like a good suggestion. BCP14 says
so providing both words (... recommended ... should ...) doesn't add anything beyond using just one of them. |
It's true that "should" doesn't convey any information, given "recommended" for clarity. It would be OK in English to say
where begin is a subjunctive (a vestigial feature of English grammar). That's not such a common construction though. Maybe some readers might find it obscure? What do you think of
or
|
I too recommend that we should avoid both "should" and "recommend" in the same sentence. :) . Personally, I prefer the first of the 3 options appearing in the previous post (with the subjunctive construct). I don't find it confusing. Perhaps I'm just a vestige of a disappearing generation, so as a second choice I might slightly prefer "are recommended to begin", but that seems a bit awkward to me. |
Is there some reference that can be added where users can read the disadvantages/problems they may have if they don't follow the recommendations? |
There is quite a lot of discussion of pros and cons earlier in this issue. Jonathan |
Four weeks have passed without objection to the proposed remedy for the defect. Therefore we've agreed to make the change, and I've prepared pull request 526 to implement it. The PR replaces the existing sentence in 2.3
with the wording preferred by @larsbarring and Karl @taylor13
to indicate that this is not a requirement, but a recommendation, as shown by the conformance document. Please could someone check and merge this PR e.g. @larsbarring or @davidhassell? In addition, I am labelling this issue for consideration as a FAQ, in view of the question from Tim @MTG-Formats "Is there some reference that can be added where users can read the disadvantages/problems they may have if they don't follow the recommendations?" It seems to me that if someone has time it would be useful to summarise the early discussion about the advantages of sticking to the convention in the FAQ, or at least we could refer to this issue as a reference from the FAQ. Thanks to all for contributions to this issue and to @Dave-Allured for raising it. PS Discussion 323 on creating a character blacklist is also relevant. |
@JonathanGregory I have just approved and merged the PR. But it just struck me that the label |
Dear @larsbarring Thanks for merging the PR. I see your point about Best wishes Jonathan |
What do you think of |
Title: Remove restrictions on netCDF object names
Moderator:
Moderator Status Review: New issue, 2020 January 23
Requirement Summary: None.
Technical Proposal Summary: Remove CF 1.7 section 2.3 restrictions on characters in names of variables, attributes, etc. Resolve ambiguous use of such restrictions.
Benefits
Caveats
Status Quo: Object names are now restricted to a traditional yet limited character set which does not accommodate many non-western languages, nor other desired naming patterns.
Detailed Proposal: Change the first paragraph of 2.3 Naming Conventions as follows. The remainder of 2.3 is left unchanged.
Current version (1.8 draft):
Proposed:
(Edit: Added forward slash "/" after following comments were posted.)