Remove restrictions on netCDF object names #237

Closed
Dave-Allured opened this issue Jan 23, 2020 · 89 comments · Fixed by #526
Labels
change agreed: Issue accepted for inclusion in the next version and closed
defect: Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors

Comments

@Dave-Allured
Contributor

Dave-Allured commented Jan 23, 2020

Title: Remove restrictions on netCDF object names

Moderator:

Moderator Status Review: New issue, 2020 January 23

Requirement Summary: None.

Technical Proposal Summary: Remove CF 1.7 section 2.3 restrictions on characters in names of variables, attributes, etc. Resolve ambiguous use of such restrictions.

Benefits

  • Support international usage.
  • Allow special characters in names.
  • Remove ambiguity over requirement versus preference.
  • Simplify CF rules.
  • Simplify conformance checking.
  • Improve compliance for some existing data sets.

Caveats

  • Breaks compliance with COARDS name rules, but is a superset of them.
  • Some existing software cannot handle non-traditional characters. It would need upgrades, but only when presented with new files that use the expanded character set.

Status Quo: Object names are now restricted to a traditional yet limited character set which does not accommodate many non-western languages, nor other desired naming patterns.

Detailed Proposal: Change the first paragraph of 2.3 Naming Conventions as follows. The remainder of 2.3 is left unchanged.

Current version (1.8 draft):

  • Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores. Note that this is in conformance with the COARDS conventions, but is more restrictive than the netCDF interface which allows use of the hyphen character. The netCDF interface also allows leading underscores in names, but the NUG states that this is reserved for system use.

Proposed:

  • Variable, dimension, attribute, and group names are not generally restricted by this convention. Any names that are acceptable to the netCDF library may be used. The most notable rules from netCDF are ASCII or UTF-8 character set, forward slash "/" not allowed, and names should not begin with underscore or certain other special characters. Refer to file format specs in the NUG for more details.

(Edit: Added forward slash "/" after following comments were posted.)

@Dave-Allured Dave-Allured added the enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format label Jan 23, 2020
@JimBiardCics
Contributor

While I generally approve of relaxing the character set restrictions, I think we may need to consider certain patterns that should either be reserved or restricted. As an example, the use of slashes ('/') in names wreaks havoc with group path formalisms that are already in place outside of CF. In addition to the prohibition on having leading underscores that is mentioned in the proposal, the netCDF-LD project (@marqh) is making use of doubled underscores within a name as a mechanism for marking namespaces. There may be other cases "in the wild" where certain patterns are in use, and I think we should be careful to avoid causing problems by being overly loose here.

I suggest that, at minimum, we should disallow the use of slashes ('/') or backslashes ('\') in names, and should call out two or more sequential underscores ('__') as reserved.
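
A minimal sketch of how the patterns named in this comment could be flagged. The helper name `reserved_pattern_hits` and the example names are hypothetical; this is not part of CF, the netCDF library, or netCDF-LD, and it only encodes the four patterns mentioned here (slash, backslash, leading underscore, doubled underscore).

```python
import re

# Hypothetical table of patterns taken from the comment above.
RESERVED_PATTERNS = {
    "contains a slash": re.compile(r"/"),
    "contains a backslash": re.compile(r"\\"),
    "starts with an underscore": re.compile(r"^_"),
    "contains two or more sequential underscores": re.compile(r"__"),
}


def reserved_pattern_hits(name: str) -> list:
    """Return a description of every reserved pattern found in name."""
    return [label for label, pattern in RESERVED_PATTERNS.items()
            if pattern.search(name)]


# Example names are made up for illustration.
for candidate in ["air_temperature", "group/tas", "bald__tas", "_FillValue"]:
    print(candidate, reserved_pattern_hits(candidate))
```

A table-driven check like this keeps each reserved pattern separately reportable, which may be easier to maintain than one combined regular expression.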

@steingod

I support the constraint indicated above. Especially allowing slashes and backslashes in names will be confusing.

@erget
Member

erget commented Jan 28, 2020

Agreed, I think it would be best if the restrictions were presented in a table for readability.

@marqh
Member

marqh commented Jan 28, 2020

We may get some benefit from considering other standardisation activity in this domain?

RFC3986 defines the generic syntax for the Uniform Resource Identifier (URI)
https://tools.ietf.org/html/rfc3986

As netCDF variables are resources that are being identified within the domain of a netCDF file, could we benefit from just adopting RFC3986?

This has a reserved character section:
https://tools.ietf.org/html/rfc3986#section-2.2

Disclaimer: I have not cross referenced this in detail with the NUG to examine consistency or problem areas (potential for contribution if useful)
First glance, these look pretty similar.

If these are consistent, then adopting the NUG definition unchanged looks sensible to me. It already mandates against the use of a '/' character, which is the most problematic one for me, given groups and variable identity within groups.

I'd like to see an explicit reference to the relevant NUG section in the text or linked, as I had to search a bit and I know what I'm looking for
I think:
https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_data_set_components.html#Permitted
is stable enough for a standards document
(@ethanrd do you agree this is a stable URI for the resource please?)

mark

@JimBiardCics
Contributor

@marqh I like the overall suggestion of RFC3986. I think we should not adopt the "% encoding" concept of RFC3986. And, again, I think we should reserve leading "_" characters (per NUG) and multiple sequential "_" characters (per netCDF-LD). Are there any other special character sequences in the wild that anyone is aware of, in UGRID or Radial perhaps?

I notice that the NUG section you referenced implies that space characters are allowed as long as they are not at the end of a variable name. Do we want to allow internal spaces?

@marqh
Member

marqh commented Jan 28, 2020

@marqh I like the overall suggestion of RFC3986. I think we should not adopt the "% encoding" concept of RFC3986. And, again, I think we should reserve leading "_" characters (per NUG) and multiple sequential "_" characters (per netCDF-LD). Are there any other special character sequences in the wild that anyone is aware of, in UGRID or Radial perhaps?

I agree, @JimBiardCics, that adoption of %encoding is not a path I would want to walk. it's perhaps a useful cross reference, but points like this suggest against including some specific use of RFC3986 within CF

I notice that the NUG section you referenced implies that space characters are allowed as long as they are not at the end of a variable name. Do we want to allow internal spaces?

internal spaces!?!? really

if we can stop that, then that is a good thing. Why would the NUG allow variable names with spaces in them??

my reading of

The names of dimensions, variables and attributes (and, in netCDF-4 files, groups, user-defined types, compound member names, and enumeration symbols) consist of arbitrary sequences of alphanumeric characters, underscore '_', period '.', plus '+', hyphen '-', or at sign '@', but beginning with an alphanumeric character or underscore. However names commencing with underscore are reserved for system use.

leads me to view space as not allowed. However the following:

Beginning with versions 3.6.3 and 4.0, names may also include UTF-8 encoded Unicode characters as well as other special characters, except for the character '/', which may not appear in a name.
Names that have trailing space characters are also not permitted.

Could someone from a Unidata background confirm or deny that in netCDF4, a space may be used within a variable name?

@zklaus

zklaus commented Jan 28, 2020

I have zero Unidata authority, but I'd like to state the obvious: Unicode is complicated.
This may already account for the somewhat vague formulation in the NUG if one takes a look at the list of whitespace characters in unicode. Indeed, whether one wants to go with a blacklist or a whitelist approach, it may be a good idea to think and write in terms of Unicode character categories (cf here or here).
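
As a rough illustration of thinking in Unicode character categories, the sketch below prints the general category of a few characters using Python's standard `unicodedata` module. The sample characters and the helper name `general_category` are chosen for illustration only; which categories CF should allow or disallow is exactly the open question in this thread.

```python
import unicodedata

# Sample characters chosen for illustration only.
samples = {
    "space": " ",
    "no-break space": "\u00a0",
    "em space": "\u2003",
    "tab": "\t",
    "a with ring": "\u00e5",
    "underscore": "_",
    "hyphen": "-",
    "slash": "/",
}


def general_category(ch: str) -> str:
    """Return the two-letter Unicode general category of one character."""
    return unicodedata.category(ch)


for label, ch in samples.items():
    print(f"{label:15} U+{ord(ch):04X} {general_category(ch)}")
```

Note that the three space characters share the category `Zs` while the tab is a control character (`Cc`), so a rule written as "no spaces" would need to say which of these it means.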

@ngalbraith

I'm afraid I'm the odd man out here - I don't think the list of benefits in the original issue stacks up against the costs; in fact some of them don't seem to BE benefits. Maybe some use cases would be helpful ... Could you elaborate on how this change would support international usage?

Is improved compliance for some existing data sets really a goal? What's in these data sets that needs to be described with a name that begins with a number or contains spaces or special characters?

Maybe this is a selfish concern - we use Matlab's built-in netCDF library, and I'm not sure how that would deal with this change. If it's really needed for some specific reason, we'll deal with it, but absent that explanation, this is just a headache for a lot of CF users.

@ethanrd
Member

ethanrd commented Jan 28, 2020

Is there a user asking for this extension, a particular use case that needs addressing? CF has generally tried to avoid extensions that seem like a good idea but don’t have a current use case.

Having said that, if we do move forward, I think we should be very cautious. Not only is Unicode very complicated as @zklaus points out, so are the rules around reserved character sets in URLs (and in which part of the URL) and file systems. Extending the set of characters allowed to include those reserved characters means they will need to be properly encoded when used in URLs (e.g., OPeNDAP and OGC WCS). Which, it turns out, isn’t as easy as it might seem.
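
To illustrate the encoding concern, here is a minimal sketch using Python's standard `urllib.parse.quote`. The variable name is made up, and the helper `encode_name_for_url` is hypothetical; real OPeNDAP or WCS clients may apply different rules to different parts of a URL, which is part of why this is harder than it looks.

```python
from urllib.parse import quote, unquote


def encode_name_for_url(name: str) -> str:
    """Percent-encode a name so it can sit inside a single URL path segment."""
    # safe="" forces "/" to be encoded too; by default quote() leaves it alone.
    return quote(name, safe="")


name = "tas/Umeå station"  # hypothetical name with a slash, a non-ASCII letter and a space
encoded = encode_name_for_url(name)
print(encoded)  # tas%2FUme%C3%A5%20station

# Decoding restores the original name.
assert unquote(encoded) == name
```

Names restricted to ASCII letters, digits and underscores pass through `quote` unchanged, which is one practical argument for the existing recommendation.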

Also, this or similar proposals/discussions have come up before, I think several times but so far I've only found these two:

  • A 2014 discussion on the email list (the initial email is here) focused mainly on expanding the set of characters allowed to include ‘@’, ‘+’, ‘-’, and ‘.’ with some mention of Unicode coming fairly late in the discussion.
  • Trac Ticket #157 suggested moving from “should” to “must” on the current set of allowed characters.

@ethanrd
Member

ethanrd commented Jan 28, 2020

@WardF and @lesserwhirls - Could you address the question of whether whitespace characters are allowed in netCDF variable names?

@MTG-Formats

Having blank spaces in names would break other CF conventions like use of the ancillary variables attribute.

"The attribute ancillary_variables is used to express these types of relationships. It is a string attribute whose value is a blank separated list of variable names. "

How to parse this?
float q_error_limit(time)
q_error_limit:standard_name = "specific humidity standard error" ;
q_error_limit:units = "g/g" ;
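
The parsing problem can be sketched in a few lines. The function `parse_ancillary_variables` and the attribute values are hypothetical; the sketch simply assumes that a reader splits the blank-separated list on whitespace, as the quoted CF text implies.

```python
def parse_ancillary_variables(value: str) -> list:
    """Split a blank-separated CF list attribute into variable names."""
    return value.split()


# With underscore-only names, the list splits into the two intended names.
print(parse_ancillary_variables("q_error_limit q_detection_limit"))

# If names could contain blanks, the same split yields six tokens and the
# two intended names can no longer be recovered.
print(parse_ancillary_variables("q error limit q detection limit"))
```

Any attribute defined as a blank-separated list of names would hit the same ambiguity unless a quoting or escaping rule were added to the convention.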

@taylor13

I must be missing something, but if a variable is named, for example, "a-b", and one uses that in a computer code, how is it interpreted? How is that variable distinguished from the operation: subtract variable "b" from variable "a"? Don't "+", "-", "/", "*", " " all have this problem?

@JimBiardCics
Contributor

@taylor13 Your code would have to parse the variable name into code. Until you did something like that, it is just a string.

@taylor13

As a user of data, I usually like the names of my variables (in my codes) to be the same as their names in the netCDF file. With the current naming convention for CF, this is always possible, I think. If, however certain restrictions were removed, as suggested above, this would no longer be true.
I would echo others and ask what particular use cases are driving this?

@Dave-Allured
Contributor Author

Dave-Allured commented Jan 28, 2020

Well, thank you for all your thoughtful responses. I see that we are rehashing the 2014 discussion, and probably others. Thanks @ethanrd for finding that. There are good arguments pro and con there, and it is worth reading.

The difference is that only 4 extra characters were proposed in 2014. I simply want to legalize all the other 137 thousand!

Is there a user asking for this extension, a particular use case that needs addressing? CF has generally tried to avoid extensions that seem like a good idea but don’t have a current use case.

No, I do not have a current use case. This is a recurring issue, so I thought this comprehensive approach would be beneficial. Past use cases were mentioned or implied in the 2014 discussion, and in trac 157.

NetCDF developers put some care into expanded name capability, 12 years ago. However, CF restrictions are copied virtually unchanged from 25 year old COARDS rules, which were probably based on ASCII only. CF is overdue to allow the full naming range for creative purposes by all scientific users.

Name quoting is generally easy and well supported in most modern programming languages. This takes care of UTF-8, math symbols, and other active characters. IMO, naming freedom should outweigh exactly matching names of program variables.

@ngalbraith

@taylor13 Your code would have to parse the variable name into code. Until you did something like that, it is just a string.

Not everyone writes their own netCDF translators, and some packages no doubt take the variable and attribute names from the netCDF variable and attribute names. Those who use these packages are least likely to be in a position to accommodate this change.

When I have a minute I'll give it a try with the Matlab netCDF interface. I'd be much happier to spend the time on it if there was more than 'creative purposes' for a reason. The trac ticket has an example of isotopes with names that begin with a number, which has some weight, but the work around for that seems simple compared to what would be needed by someone using code that auto-assigns variable names.

On the other hand, most folks probably work with multiple standards; OceanSITES would no doubt maintain the variable name restriction, if CF doesn't.

@zklaus

zklaus commented Jan 30, 2020

I agree that it would be good to have use cases.

@ngalbraith is also right that not everyone is writing their CF code based on naked netCDF access. Indeed, I consider such an approach foolish, since CF is far too rich by now to stand a serious chance of getting it right.

However, while using the netCDF variable name as a program variable name might be excused in small, not reused code that only ever will deal with, say tas, it is inexcusable in general-purpose library code. How would such a variable enter the namespace without the program knowing its name beforehand? Ultimately, the only way is via the equivalent of eval(var_name). Such code is prone to breakage no matter what restrictions we put on the character set since it would always leave open the possibility of having reserved words of the particular programming language as variable names. Another serious problem is that it opens the possibility to maliciously crafted variable names: How about var_name='system("rm -rf .")'?

Hence, I don't think the argument that all netCDF variable names should be permissible program variable names in all programming languages should guide the design of CF.
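
A minimal sketch of the alternative to `eval`: keep file variables in a mapping keyed by their name, so the name is never executed or injected into the program's namespace. The dictionary `file_variables` is a hypothetical stand-in for data read from a file, and `usable_as_python_name` is an illustrative helper, not an API of any netCDF package.

```python
import keyword

# Hypothetical stand-in for variables read from a file: name -> values.
file_variables = {
    "tas": [280.1, 281.4],
    "lambda": [0.5, 0.6],          # letters only, yet a Python reserved word
    'system("rm -rf .")': [1.0],   # hostile-looking name; here it is only a dictionary key
}


def usable_as_python_name(name: str) -> bool:
    """True if name could be used directly as a Python variable name."""
    return name.isidentifier() and not keyword.iskeyword(name)


# Every name works as a dictionary key, whether or not it is a valid identifier.
for name, values in file_variables.items():
    print(repr(name), usable_as_python_name(name), values)
```

The `lambda` entry shows that even a name made only of ASCII letters can fail as a program variable, so a character restriction alone cannot make name injection safe.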

@DocOtak
Member

DocOtak commented Jan 30, 2020

I had the same thoughts as @zklaus when thinking about the security implications of what I could only imagine was an eval(var_name). I've even seen some of the matlab code which does exactly this to load all the variable into a matlab namespace. I'd even go so far as to recommend that the CF document itself warn against doing this...

@martinjuckes
Contributor

I agree that some use cases would be helpful. I'm not sure about the specific proposal that initiated the discussion, but I do agree with the thought behind it that we should have a considered and reasoned policy on this, rather than just having a frozen-in rule based on past library constraints.

One reason that we might want to depart from the full freedom allowed in NetCDF is that we have, in CF, a range of different attributes to describe a variable. The long_name is designed to hold human-readable text, while the standard_name and units both have strongly constrained values.

Some application libraries need, in places, identifiers with a restricted character set. For example, I can construct a collections.namedtuple with name tas, but not with name tas.Amon because, in python "Type names and field names can only contain alphanumeric characters and underscores" (cited from an error message generated by collections.namedtuple). Could this be considered as a use case for having a place in the convention to specify, for CF objects, an identifier which is composed of "alphanumeric characters and underscores"? The variable name is the de facto place which many people use for this kind of identifier (perhaps because of legacy packages).
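
The `namedtuple` behaviour described here can be reproduced with a short sketch. The helper `make_record_type` and the type name `Record` are hypothetical; the exact wording of the error message depends on the Python version, so it is printed rather than assumed.

```python
from collections import namedtuple


def make_record_type(field_names):
    """Try to build a namedtuple type; return (type, error message)."""
    try:
        return namedtuple("Record", field_names), None
    except ValueError as exc:
        return None, str(exc)


ok_type, ok_error = make_record_type(["tas"])
bad_type, bad_error = make_record_type(["tas.Amon"])
print(ok_type._fields, ok_error)
print(bad_type, bad_error)

# rename=True replaces invalid field names with positional ones such as _0,
# which avoids the error but loses the original name.
renamed = namedtuple("Record", ["tas.Amon"], rename=True)
print(renamed._fields)
```

The `rename=True` escape hatch shows the trade-off: software can cope with unrestricted names, but only by discarding them as identifiers.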

Note that the standard_name fits the character restriction, but does not fit the use case because different variables may have the same standard_name.

Another potential use case is for identifiers of concepts described in RDF Turtle which has a character restriction on object names, broader, I think, than "alphanumeric characters and underscores", but definitely narrower than the 137 thousand characters available in UTF-8.

The desire to have a simple identifier is linked, in my mind at least, to the concept of a namespace, which is being discussed in the context of NetCDF (see NetCDF-ld and discussion on namespace delimiters). I don't think this is simply a matter of upgrading software to make it accept generic strings: there is a wide range of applications that exploit identifiers constructed from a limited character set in order to enable the use of identifiers within a text string.

@zklaus

zklaus commented Nov 23, 2020

One potential use-case that always came to my mind without an actual example at hand is the native names of weather stations, say a temperature time-series from the Umeå station, where the variable name contains the station name.

What makes this particularly interesting is that it seems to be permitted already under current CF conventions, since under CF-1.8, Section 2.3 Naming Conventions it says:

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores. [...] Languages other than English are permitted for variables, dimensions, and non-standardized attributes.

@martinjuckes
Contributor

Hi @zklaus : good point about the existing rules.

Regarding your use case; wouldn't that use case be covered by setting the long_name to "Temperature time-series from the Umeå station"? The current convention appears to permit "Umeå_station", but not "Umeå station" (blanks not allowed).

The cfchecker (4.0) takes a narrower view of what is allowed, restricting variable names to strings matching the python regex: '^[a-zA-Z][a-zA-Z0-9_]*$'.
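
A minimal sketch of what that regex accepts and rejects, compared with Python's own Unicode-aware notion of an identifier. Only the pattern itself comes from the comment above; the function name and the example names are illustrative, and this is not the cfchecker's actual code.

```python
import re

# Pattern quoted in the comment above.
CFCHECKER_NAME = re.compile(r"^[a-zA-Z][a-zA-Z0-9_]*$")


def matches_cfchecker_pattern(name: str) -> bool:
    """True if name matches the ASCII-only pattern."""
    return CFCHECKER_NAME.match(name) is not None


examples = ["tas", "Umeå_station", "Umeå station", "tas_station-name", "2m_temperature"]
for name in examples:
    # str.isidentifier() accepts non-ASCII letters such as "å", the regex does not.
    print(f"{name!r:20} regex={matches_cfchecker_pattern(name)} identifier={name.isidentifier()}")
```

The second example is the interesting one: "Umeå_station" is built from a letter class that Python treats as valid, yet it fails the ASCII-only pattern, which is the gap between "letters" in the convention text and "letters" in the checker.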

@zklaus

zklaus commented Nov 23, 2020

Yes, that might be a good way to encode the information. What I wanted to say is this: I find it very plausible that in a national weather service a group sits together and decides to code their station data using variable names tas_station-name with a number of non ascii letters in the station names. Furthermore, that would appear to be perfectly valid CF.

So I think being more explicit about what is meant by "letter" would be good, even if that means saying that only ascii letters are allowed.

@Dave-Allured
Contributor Author

@JonathanGregory, no, this issue is not waiting on #477. This issue #237 is a free-standing proposal to remove all CF-specific restrictions on Netcdf object names. In my view, this #237 is currently an open discussion, and waiting vaguely on a general consensus.

@larsbarring
Contributor

Early on in this thread there were references to work on "Netcdf-LD", and I found a github repo. Anyone know the current status of this proposal in general, and in relation to OGC? Maybe @marqh or @ethanrd?

I am asking because of the comment that

... at minimum, we should disallow the use of slashes ('/') or backslashes ('\') in names, and should call out two or more sequential underscores ('__') as reserved.

@ethanrd
Member

ethanrd commented Jan 9, 2024

Hi Lars @larsbarring - I believe this OGC netCDF-LD GH repo is the more current one. It provides a link to the OGC netCDF-LD draft specification.

The OGC process involves a public comment period before proceeding to a vote. If I'm remembering correctly, the specification went out for public comment but hasn't yet gone out for a vote. Mark @marqh may be able to provide more details.

@sethmcg
Contributor

sethmcg commented Feb 28, 2024

In my view, this #237 is currently an open discussion, and waiting vaguely on a general consensus.

If this is waiting on general consensus to come to a resolution, I'll jump in and say that I oppose this proposal.

A lot of very serious interoperability and security concerns have been raised about the idea of removing all restrictions on naming, and I don't see any benefits that outweigh them. Moreover, we don't have an actual motivating use case; this is an anticipatory change, which CF generally tries to avoid.

I'm open to motivated proposals that extend the allowed set of characters in a specific and more limited way, such as #477 (which has been accepted and is just waiting for a PR), but I think the discussion there demonstrates why it's important to be conservative and carefully discuss all the impacts of adding new allowed characters.

@larsbarring
Contributor

I fully agree with @sethmcg.

Moreover, the opening sentence of Section 2.3 reads

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores. By the word letters we mean the standard ASCII letters uppercase A to Z and lowercase a to z. By the word digits we mean the standard ASCII digits 0 to 9, and similarly underscores means the standard ASCII underscore _.

where the operative word is should, which, if we interpret it as being in uppercase according to BCP14/RFC2119 means:

SHOULD This word, or the adjective "RECOMMENDED", mean that there
may exist valid reasons in particular circumstances to ignore a
particular item, but the full implications must be understood and
carefully weighed before choosing a different course.

This interpretation of "should" strikes me as a reasonable balance between strictness/limitations and openness/flexibility. If the CF Community moves to introduce BCP14 in the Conventions document there is of course the possibility that the word should is replaced by MUST, but that is a good time to revisit this issue.

@JonathanGregory
Contributor

The opening sentence of Section 2.3 states that

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores.

Lars is correct that the word "should" here is a recommendation, as is clarified by Sect 2.3 of the conformance document. The conformance document further clarifies it

This corresponds to ASCII characters in the decimal ranges (65-90), (97-122), (48-57), and (95). The corresponding Unicode codepoints are (U+0041-U+005A), (U+0061-U+007A), (U+0030-U+0039), and (U+005F).

and both the standard and the conformance document add (again, as a recommendation)

ASCII period (.) and ASCII hyphen (-) may also be included in attribute names only.

which results from the agreed proposal #477 of @Dave-Allured.

@larsbarring and @sethmcg have expressed views against a blanket removal of restrictions on the characters to be used in CF-netCDF object names. I agree that removing all restrictions would not be consistent with the usual CF approach. Normally, we consider specific proposals to change the status quo, motivated by present use cases. Are there other views on this question? It would be good to reach a consensus. Thanks.
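
The recommendation quoted in this comment can be sketched as a check on code point ranges. The function `follows_recommendation` is hypothetical, and it assumes that the period and hyphen allowance for attribute names applies only after the first character; the quoted text does not spell that out.

```python
# Code point ranges quoted from the conformance document:
# A-Z, a-z, 0-9 and underscore.
RECOMMENDED_RANGES = [(0x41, 0x5A), (0x61, 0x7A), (0x30, 0x39), (0x5F, 0x5F)]
LETTER_RANGES = RECOMMENDED_RANGES[:2]
ATTRIBUTE_EXTRAS = {".", "-"}  # recommended in attribute names only


def _in_ranges(ch, ranges):
    """True if the code point of ch falls in any (low, high) range."""
    return any(low <= ord(ch) <= high for low, high in ranges)


def follows_recommendation(name: str, is_attribute: bool = False) -> bool:
    """Check a name against the recommended character set."""
    if not name or not _in_ranges(name[0], LETTER_RANGES):
        return False  # must begin with an ASCII letter
    for ch in name[1:]:
        if _in_ranges(ch, RECOMMENDED_RANGES):
            continue
        # Assumption: "." and "-" are acceptable after the first character.
        if is_attribute and ch in ATTRIBUTE_EXTRAS:
            continue
        return False
    return True


print(follows_recommendation("air_temperature"))
print(follows_recommendation("flag-values"))
print(follows_recommendation("flag-values", is_attribute=True))
```

Because the text is a recommendation, a `False` result here would suggest a warning rather than a conformance error.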

@larsbarring
Contributor

The sections @JonathanGregory points at essentially provide a whitelist of explicitly allowed characters; all other characters are not recommended (or recommended against) but not explicitly disallowed. But throughout this conversation there have been several remarks that some characters should indeed be explicitly disallowed. This could easily be done by amending the text in section 2.3 to list which characters and character ranges CF explicitly disallows, i.e. creating a blacklist. All other characters would then belong to a "greylist" where users are on their own and cannot expect the same level of interoperability and support from common libraries and software tools.

@Dave-Allured
Contributor Author

the word "should" here is a recommendation, as is clarified by Sect 2.3 of the conformance document

This wording with "should" is confusing and unfriendly in context of that opening paragraph on netCDF object names. Witness multiple tickets filed to remove character restrictions which did not really exist. If that were simply reworded to clearly express the allowed versus recommended character sets, that would be sufficient. CF is for scientists and programmers, not lawyers.

@JonathanGregory
Contributor

We've already agreed elsewhere that we will check all the "must", "should" etc. words to make them conform to BCP-14, in which "should" indicates a recommendation. In this case, our interpretation has apparently changed. The text in sect 2.3

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores.

has been the same since CF version 1.0. However, up to version 1.7 of the conformance document this was listed as a requirement

Variable, dimension and attribute names must begin with a letter and be composed of letters, digits, and underscores.

In version 1.8 of the conformance document it turned into a recommendation

Variable, dimension and attribute names should begin with a letter and be composed of letters, digits, and underscores.

That change was made by @davidhassell in 2a44ccc and c3fa6fd. Do you remember why this change was made, David?

According to principle 9 of sect 1.2, we shouldn't revert to making it a requirement:

Because many datasets remain in use for a long time after production, it is desirable that metadata written according to previous versions of the convention should also be compliant with and have the same interpretation under later versions.

Therefore I propose that we change the first sentence of 2.3 to read

It is recommended that variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores.

which makes it consistent with the present conformance document. I believe that all those who've contributed recently think that this is what the text should mean. Are you content with making this change?

@davidhassell
Contributor

Hello @JonathanGregory,

That change was made by @davidhassell in 2a44ccc and c3fa6fd. Do you remember why this change was made, David?

Those commits were from PR #227 that fixed issue #226 (Correct the wording in the conformance document section 2.3 "Naming Conventions").

Thanks, David

@JonathanGregory
Contributor

Dear @davidhassell

Thanks. I didn't remember about #226, where we previously decided that "should" was intended to mean a recommendation. Since the discussion above shows that it is open to question, I believe that my proposal to change the text in sect 2.3 would be helpful, from

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores.

to

It is recommended that variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores.

I'm relabelling this issue as a defect, meaning that the above change will be adopted three weeks from now (10th June) if no-one disagrees before then.

Best wishes

Jonathan

@JonathanGregory JonathanGregory added defect Conventions text meaning not as intended, misleading, unclear, has typos, format or language errors and removed enhancement Proposals to add new capabilities, improve existing ones in the conventions, improve style or format labels May 20, 2024
@larsbarring
Contributor

Well aware of my ever so often much too "free and relaxed interpretation" of English spelling and grammar, I nevertheless venture to ask if it would be possible to somehow exclude the "should" in the suggested wording:

It is recommended that variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores.

?

@davidhassell
Contributor

Hi Lars,

That sounds like a good suggestion. BCP14 says

SHOULD   This word, or the adjective "RECOMMENDED", mean that there
may exist valid reasons in particular circumstances to ignore a
particular item, but the full implications must be understood and
carefully weighed before choosing a different course.

so providing both words (... recommended ... should ...) doesn't add anything beyond using just one of them.

@JonathanGregory
Contributor

It's true that "should" doesn't convey any information, given "recommended" for clarity. It would be OK in English to say

It is recommended that variable, dimension, attribute and group names begin with a letter and be composed of letters, digits, and underscores.

where begin is a subjunctive (a vestigial feature of English grammar). That's not such a common construction though. Maybe some readers might find it obscure? What do you think of

It is recommended for variable, dimension, attribute and group names to begin with a letter and be composed of letters, digits, and underscores.

or

Variable, dimension, attribute and group names are recommended to begin with a letter and be composed of letters, digits, and underscores.

@taylor13

I too recommend that we should avoid both "should" and "recommend" in the same sentence. :) . Personally, I prefer the first of the 3 options appearing in the previous post (with the subjunctive construct). I don't find it confusing. Perhaps I'm just a vestige of a disappearing generation, so as a second choice I might slightly prefer "are recommended to begin", but that seems a bit awkward to me.

@MTG-Formats

Is there some reference that can be added where users can read the disadvantages/problems they may have if they don't follow the recommendations?

@JonathanGregory
Contributor

Is there some reference that can be added where users can read the disadvantages/problems they may have if they don't follow the recommendations?

There is quite a lot of discussion of pros and cons earlier in this issue. Jonathan

@JonathanGregory
Contributor

JonathanGregory commented Jun 17, 2024

Four weeks have passed without objection to the proposed remedy for the defect. Therefore we've agreed to make the change, and I've prepared pull request 526 to implement it. The PR replaces the existing sentence in 2.3

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores.

with the wording preferred by @larsbarring and Karl @taylor13

It is recommended that variable, dimension, attribute and group names begin with a letter and be composed of letters, digits, and underscores.

to indicate that this is not a requirement, but a recommendation, as shown by the conformance document. Please could someone check and merge this PR e.g. @larsbarring or @davidhassell?

In addition, I am labelling this issue for consideration as a FAQ, in view of the question from Tim @MTG-Formats "Is there some reference that can be added where users can read the disadvantages/problems they may have if they don't follow the recommendations?" It seems to me that if someone has time it would be useful to summarise the early discussion about the advantages of sticking to the convention in the FAQ, or at least we could refer to this issue as a reference from the FAQ.

Thanks to all for contributions to this issue and to @Dave-Allured for raising it.


PS Discussion 323 on creating a character blacklist is also relevant.

@larsbarring
Contributor

@JonathanGregory I have just approved and merged the PR. But it just struck me that the label change agreed is both correct, as we did agree on some changes, and incorrect, as the changes we agreed on are rather the opposite of the initial suggestion. I wonder whether it would be prudent/relevant/informative to actually use both labels, change agreed and agreement not to change? This may seem confusing to a reader, but it may be even more confusing if the reader is searching the conventions text for CF support for a wide range of Unicode characters.

@JonathanGregory
Contributor

Dear @larsbarring

Thanks for merging the PR. I see your point about change agreed. It was intended to be the opposite of agreement not to change. They should be mutually exclusive. The aim is to indicate why the issue was closed. To avoid this possible confusion, I suggest we should rename change agreed to something clearer, which indicates that some change was agreed, although not necessarily what was originally proposed. That's often the case, of course. The renamed label would appear in all the places where change agreed currently appears i.e. it's the same identity, just different text.

Best wishes

Jonathan

@JonathanGregory
Contributor

What do you think of convention was changed? Does it avoid the problem you raised?
