Localized metadata in NetCDF files #244
Interesting idea. If you'd like more input/discussion, this could form the basis for a breakout at the upcoming 2023 CF Workshop.
Oh, that's cool - I can't find any info on that yet, I guess more info will be coming later?
Hi Erin - The dates for the 2023 CF Workshop (virtual) were just announced (issue #243). There has also been a call for breakout session proposals (issue #233). Further information will be broadcast here as well so everyone watching this repo will get the updates. A web page for the workshop will be added to the CF meetings page in the next month or two.
Hi @turnbullerin and others, I wanted to echo my interest in seeing a metadata translation convention come about from the CF Conventions. My team and I have been developing some implementations of metadata translations to better support our French-speaking users, as well as open the possibility of supporting other language translations for climate metadata. One of our major open source projects for calculating climate indicators (xclim) has an internationalization module built into it for conditionally providing translated fields based on the ISO 639 Language code found within the running environment's locale or set explicitly. For more information, here is some documentation that better describes our approach:
We would love to take part in this discussion if there happens to be a session in October. Best,
Erin, I like the general direction of your localization proposal. I would like to suggest a simplified strategy. I do not see a need for those global attributes or the level of indirection represented in them. In short, I suggest simply adding ISO 19115 suffixes to standard CF attribute names, as needed. Here are a few more details.
More details:
The choice of the primary delimiter will be controversial. I like period "." for visual flow and general precedent in language design. Some will hold out for underscore as the CF precedent. I think underscore is overused in CF. In particular, the ISO suffix deserves some kind of special character to stand out as a modifier. The general use of special characters such as "." and "-" is part of proposal cf-convention/cf-conventions#237.
Thanks for your feedback! I think there is value in the two attributes. Defining English (and which English, eng-US, eng-CA, eng-UK, etc.) as the universal default is very Anglo-centric. There is a clear use case for datasets produced in other countries to have a primary language that is not English, and documenting it is valuable to inform locale-aware applications processing CF-compliant files. Not everyone will want to provide an English version of every string. So having an attribute that defines the default locale of the text strings in the file is still useful, I feel, but perhaps we could define the default if not present as "eng" (no country specified) so that it can be omitted in many cases. For the other locales, I think it helps applications and humans reading the metadata to know what languages are in the file. If we did not list them, applications would need to be aware of all ISO-639 codes and check whether each attribute exists with any mix of country/language code suffix to build a list of all languages that exist in the metadata. Having a single attribute list them all has a lot of value in my opinion. In unilingual datasets, it can of course be omitted. This also raises the question of whether we should use ISO 639-1 or ISO 639-2/T or ISO 639-3. ISO 19115 allows users to specify the vocabulary that codes are taken from, but if we were to specify one I would recommend ISO 639-2/T for language and ISO 3166 alpha-3 for country (this aligns with the North American Profile of ISO-19115). Alternatively, we could just specify the delimiter and let people override the vocabulary for language and country codes in attributes if they want.
I am torn on the delimiter - I see the value in what you propose, but I would not want to delay this issue if #237 is not adopted quickly, and I foresee some technical issues adopting it even if it is agreed to (for example, the Python NetCDF4 library supports attributes as Python variables on the dataset or variable objects, and thus they are restricted to [A-Za-z0-9_]; allowing arbitrary names would require them to make a significant change before the standard could be adopted; see https://unidata.github.io/netcdf4-python/#attributes-in-a-netcdf-file). I do like the idea of standardizing the suffixes, though, and if we can agree on a format, I support that wholeheartedly. I would propose _xxxYYY where xxx is the lower-case ISO 639-2/T code and YYY is the ISO 3166 alpha-3 country code. If #237 is adopted, .xxx-YYY is also a good solution I think. We could include both for compatibility with applications and libraries that won't support #237 right away if adopted. Also, I fully agree on UTF-8. It supports all natural languages as far as I know, so there should be no issue with using it as the default encoding. However, I do note that the NetCDF standard allows for other character sets - I guess we are then just saying that all text data must be in UTF-8 (i.e. _Encoding="utf-8")? In terms of display, I agree with you that locale-aware applications (given a country and language code they should display in) should use the attributes in the following order:
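As a concrete sketch of the `_xxxYYY` suffix scheme proposed above (the helper names here are hypothetical and the scheme itself is only a proposal, not part of any standard), the suffix could be assembled like this:

```python
# Hypothetical helpers for the proposed "_xxxYYY" suffix scheme:
# xxx is a lower-case ISO 639-2/T language code and YYY an upper-case
# ISO 3166 alpha-3 country code. Nothing here is standardized.

def make_suffix(language: str, country: str = "") -> str:
    """Build a suffix such as '_fraCAN', or '_fra' with no country."""
    suffix = "_" + language.lower()
    if country:
        suffix += country.upper()
    return suffix


def localized_name(attribute: str, language: str, country: str = "") -> str:
    """Append the suffix to an attribute name, e.g. title -> title_fraCAN."""
    return attribute + make_suffix(language, country)
```

For example, `localized_name("title", "fra", "can")` gives `title_fraCAN`, which stays within the `[A-Za-z0-9_]` character set that current libraries expect.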
Erin, thank you for your very thoughtful reply. Anglo-centric: Yes, I was thinking about that when I wrote down my initial thoughts, but I decided to test the waters. I am glad to have triggered that direct conversation. English is a dominant language in the science and business worlds. However, this CF enhancement is a great opportunity for constructs to level the playing field, within the technical context of file metadata. I agree immediately to the value of a global attribute that sets the default language for the current data file, such that all string attributes with no suffix are interpreted in the specified language. I leave the name of such attribute up to you and others. Yes, keep the default as English if the global attribute is not included.
I think adding support for multiple languages to selected CF attribute values would be a great addition. As I have absolutely zero insight into the technical aspects, please bear with me if I am asking a stupid question: If this functionality is implemented without a universal default language, does it mean that all string-valued attributes are expected to follow a specified locale? If so, how would CF attributes that can only take values from a controlled vocabulary be treated, e.g.
Thanks,
List of languages present: It really is no problem to scan a file's metadata, pull off all the language specifiers, and sort them into an organized inventory. This is the kind of thing that can be programmed once, added into a convenience library, and then used by everybody. If you have a redundant inventory attribute, you immediately have issues with maintenance and mismatches. Such issues will persist forever.
ISO vocabulary: It would be really nice if CF could settle on single universal choices for the
@larsbarring I think we would apply this only to natural language attributes, not to those taking their values from a controlled vocabulary. So title, summary, acknowledgement, etc. are translated; units, standard_name, cell_methods, etc. are not. Perhaps some form of identification of those would be useful?
I think identifying what is and is not a language specifier might be challenging. Assuming attribute_xxx[YYY] as an algorithm, I would write this:
Versus, with an attribute, it is:
I think, while it can be done, having an attribute with all languages in the file greatly simplifies the code for understanding which languages are present (which is the point of some of the metadata; for example, we could calculate geospatial_max_lon and geospatial_min_lon but we have those for convenience). It also ensures attributes which happen to look like valid localized attributes are not actually treated as such.
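The trade-off being debated here can be sketched in Python (the suffix pattern and the `locales` listing attribute are illustrative assumptions, not anything standardized): scanning names can misfire on ordinary attributes, while a listing attribute is a single lookup.

```python
import re

# Candidate locale suffix: underscore, 2-3 lower-case letters, optional
# 3 upper-case letters, at the end of the name. Illustrative only.
SUFFIX_RE = re.compile(r"_([a-z]{2,3})([A-Z]{3})?$")


def locales_by_scanning(attributes: dict) -> set:
    """Guess locales by scanning every attribute name; ordinary names
    such as 'standard_err' can match the pattern by accident."""
    found = set()
    for name in attributes:
        match = SUFFIX_RE.search(name)
        if match:
            found.add(match.group(0)[1:])
    return found


def locales_by_listing(attributes: dict) -> set:
    """Read a single hypothetical listing attribute instead: one lookup,
    no guessing, but it can drift out of sync with the actual names."""
    return set(attributes.get("locales", "").split())
```

Note how `standard_err` below produces the false positive `err` under scanning, which is exactly the "attributes which happen to look like valid localized attributes" problem, while the listing approach illustrates the redundancy/mismatch risk raised earlier in the thread.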
Identifying: Yeah. ;-) Add this to my list of reasons for dot notation.
I see great value in settling on a single, optimal syntax up front, and not providing alternative syntaxes. I also value adopting an exact syntax from ISO 19115, rather than having a new CF creation. You already see my preference for dot and dash, and my reasons. I think it is worth holding out for the optimal syntax. I see a growing interest in character set expansion for CF. The classic netCDF APIs included special character handling from the moment of their creation. Python can adapt. I like 2-letter ISO 639 language codes, but 3-letter will be okay too. Choose one. I defer to your greater expertise on the various ISO flavors. I am not well studied there.
Erin, take everything I said as mere suggestions. I do not want to bog you down with too much technical detail right before the upcoming workshop. Good luck!
So, after today's workshop on this, here's a rough draft of what I think we should include for the moment. It is still open for discussion.

ADDITION TO 2.5 (prior to 2.5.1 heading, after the existing text)

Files that wish to provide localized (i.e. multilingual) versions of variables shall reference section #TBD for details on

ADDITION TO 2.6 (prior to 2.6.1 heading, after the existing text following 2.6)

Files that wish to provide localized (i.e. multilingual) versions of attributes shall reference section #TBD for details on

NEW SECTION TBD. Localization

Certain attributes and variables in NetCDF files contain natural language text. Natural language text is written for a specific locale: this defines the language (English), the country (Canada), the script (English alphabet), and other features. This section defines a standard pattern for localizing a file, which means to specify the default locale of a file and to provide alternative versions of such attributes or variables in alternative locales using a suffix. The use of localization is OPTIONAL. If localization information is not provided, applications SHOULD assume the locale of the file is

Localization of attributes and variables is limited to natural language string values that are not taken from a controlled vocabulary. See Appendix A for recommendations on localization of CF attributes. Non-CF text attributes that use a natural language may also be localized using these rules. Locales are defined by a "locale string" that follows the format specified in BCP 47. Localized files MUST define an attribute

Localized files with more than one locale MUST define an attribute

Applications that support localized NetCDF files SHOULD apply BCP 47 in determining the appropriate content to show a user if the requested locale is not available. If one cannot be found, the default value to display MUST be the attribute without suffix if available. Supporting localization is OPTIONAL for applications.
The following is an example of a file with Canadian English (default), Canadian French, and Mexican Spanish, with the title and summary attributes translated but the Spanish summary missing.
An application supporting localization would display the following:
ADDITION TO APPENDIX A
References
@turnbullerin I did more research after our post-meeting discussion:
Here is a CDL strawman for what I was asking about regarding namespacing:
I think @ethanrd said there was an attribute namespace discussion; my quick searching couldn't find it. I would suggest that
Happy for more discussion on this at tomorrow's (or Thursday's) session. I also have some code I'd like to share.
Will update to cite BCP 47 explicitly - I imagine that's so that if the underlying RFCs change, the reference doesn't have to change. I think the IANA list is fine (I imagine it's what the RFCs refer to) and we can include a link. Rather than following the Accept-Language in HTTP, I think we should match the current standard for lists in CF (space-delimited, no commas). Here's a (very old) discussion I found on namespacing: https://cfconventions.org/Data/Trac-tickets/27.html Personally I find namespacing for languages confusing; namespacing is usually to group things of a common type rather than a more specific version of a thing. Instead of namespacing at the beginning, maybe instead we could reserve a trailing set of square brackets for containing a locale? Like
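A quick sketch of how the trailing-bracket idea could be parsed (the regex and function name are illustrative assumptions only, and `[]` has other established meanings in contexts such as DAP hyperslabs):

```python
import re

# Sketch of reserving a trailing "[locale]" on attribute names, e.g.
# "title[fr-CA]". The pattern and helper are illustrative only.
BRACKET_RE = re.compile(r"^(.+)\[([A-Za-z0-9-]+)\]$")


def split_bracketed(attribute: str):
    """Return (base_name, locale_tag), with locale_tag None when the
    name carries no trailing bracketed locale."""
    match = BRACKET_RE.match(attribute)
    if match:
        return match.group(1), match.group(2)
    return attribute, None
```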
@turnbullerin Adding this here so it isn't lost in the Zoom chat. I coded up some examples using Python and xarray (the ncdump CDL is at the bottom): https://github.com/DocOtak/2023_cf_workshop/blob/master/localization/localized_examples.ipynb My takeaway from the Unicode breakout was that the proposal will not be rejected, but details need to be worked out. So we can expect any of the options that use attribute names outside what is currently allowed to be OK in the future.
Thanks for the coding example! I was looking into what ERDDAP supports and apparently it only supports NetCDF attributes that follow the pattern
That said, they have to consider other metadata formats as well, so there might be restrictions in those.
From the ERDDAP docs
@MathewBiddle yeah, that's going to be an issue - that said, cf-convention/cf-conventions#237 has identified several very good use cases where these restrictions are not reasonable for the description of scientific variables (notably some chemistry names that include apostrophes, dashes, and commas), so I don't think that is going to block this change.
I see you created an issue in the ERDDAP repo, so I'll comment over there on the specifics for ERDDAP. I just need to say that this is a fantastic proposal and I'm glad to see such a robust conversation here.
After discussions with the ERDDAP people, I think a full Unicode implementation is going to take a long time, and I suspect there are other applications out there that will also struggle to adapt to the new standard. There are a lot of special characters out there that have special meanings ([] is used as a hyperslab operator in DAP, for example) and I'm concerned about interoperability if we do something that greatly changes how names usually work. I would propose that we then stick to the current naming convention for attributes and variables in making a proposal for localization (possibly using a double underscore to make it clearly a separate thing) for now, since it would maximize interoperability with other systems that use NetCDF files. We could keep the prefix system or we could just use the locale but replace hyphens with underscores (so title_en_CA and title_fr_CA).
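That underscore-only mangling could look like the following sketch (function names are illustrative assumptions): a BCP 47 tag maps to a suffix by swapping hyphens for underscores, which keeps attribute names inside `[A-Za-z0-9_]`.

```python
# Mangle a BCP 47 tag such as "fr-CA" into an attribute suffix "_fr_CA"
# and back. The round trip works only because well-formed BCP 47
# subtags never contain underscores. Names here are illustrative.

def tag_to_suffix(tag: str) -> str:
    return "_" + tag.replace("-", "_")


def suffix_to_tag(suffix: str) -> str:
    return suffix.lstrip("_").replace("_", "-")
```

So `title` plus `fr-CA` becomes `title_fr_CA`; the cost, raised elsewhere in the thread, is that the mangled form is no longer a standard BCP 47 tag.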
Here are some further suggestions.
@Dave-Allured excellent points. I will rewrite as suggested and will shift the text to its own repo here so we can do a pull request when we're done. After thinking about this a lot, I think I'm seeing some good real use cases for why one might not want to follow a particular naming convention - in certain contexts, some characters might be more challenging to use and predicting them all is difficult (see my post on the Unicode thread for reserved characters in different contexts). Making what I think of as a fairly core feature of metadata (multilingualism) dependent on Unicode support or even broader US-ASCII support is maybe not the best choice. Downstream applications relying on NetCDF files might specify their own standard. That said, using an alternative naming structure like
My suggestion to resolve this would be to define the default behaviour suffixes like
The code for it, in Python, would be something like:

```python
import typing


def parse_locale_others(other_locales: str) -> dict[str, str]:
    # Map each attribute-name suffix to its locale tag. An entry ending
    # in ":" names a suffix explicitly; otherwise the tag itself (in
    # square brackets) is used as the suffix.
    locale_map = {}
    pieces = [x for x in other_locales.split(' ') if x != '']
    i = 0
    while i < len(pieces):
        if pieces[i][-1] == ":":
            locale_map[pieces[i][:-1]] = pieces[i + 1]
            i += 2
        else:
            locale_map[f"[{pieces[i]}]"] = pieces[i]
            i += 1
    return locale_map


def localized_title(metadata: dict[str, typing.Any]) -> dict[str, typing.Optional[str]]:
    default_locale = metadata['locale_default'] if 'locale_default' in metadata else 'en'
    other_locales = parse_locale_others(metadata['locale_others']) if 'locale_others' in metadata else {}
    titles = {
        default_locale: metadata['title'] if 'title' in metadata else None
    }
    for locale_suffix in other_locales:
        localized_title_key = f"title{locale_suffix}"
        titles[other_locales[locale_suffix]] = metadata[localized_title_key] if localized_title_key in metadata else None
    return titles
```
Edit: We can also add text strongly suggesting people use the default unless there is a good reason not to.
@larsbarring for clarity, in Option 1 the format of the suffix is entirely up to the originator of the file and is specified completely in
The list of valid suffixes can then be determined from the
Just to make sure I understand, this is proposing basically one extra variable with no data per language that would have the global attributes set as variable attributes? And one extra variable per language per variable with both localized metadata (i.e.
This feels inefficient to me but I'll let others weigh in as well :).
@turnbullerin thanks for explaining how you envisage your option 1. I will here continue my previous comment that I had to pause. As I wrote, I think that we should be very careful about overloading the underscore with conceptually new roles. I interpret earlier comments from @turnbullerin and @aulemahal (and possibly others) that this is seen as a necessity to meet restrictions from downstream systems and applications, rather than something desirable in its own right. I do think that interoperability is a key concept for CF (essentially that is why we have CF in the first place...). But there will always be software somewhere for which some new functionality or concept will not be possible to implement at all, or just not practical for some reason. Hence I think that we have to be concrete and specific when using concerns for interoperability as an argument. In this issue ERDDAP has been used as a use case of an important downstream application. Thank you @rmendels for your comment regarding ERDDAP, and for the link to the ERDDAP/erddap#114 issue! When browsing through that issue I see that the conversation soon expanded to deal with the implications for ERDDAP if all (or at least a large set of) Unicode characters were to be allowed in attribute names, and in variable names. In that respect it pretty much mirrors what is going on in cf-convention/cf-conventions#237. This was maybe where we were at in our conversation here a couple of weeks ago when "your issue" was initiated. Since then the conversation here has developed so that now only two or three additional characters are needed to implement localization. And these are hyphen
@turnbullerin Yes, basically one extra variable per language/locale. I didn't want to use the term "namespace" but this is a mechanism I saw in the netCDF-LD ODC proposal. My example is a little busy as I hadn't thought about the variable localization at all yet in these discussions. And when I realized I could localize data, it felt really powerful and I immediately tried it. The localization variables could be shared between data variables, so perhaps not every data variable would need an independent localization variable (e.g. if it uses entirely controlled attributes). The intent of my proposal was to avoid all the attribute name convention arguments. My proposal uses some pretty well established CF mechanisms to keep our proposal from conflicting with what might already be in the file. One of the other issues that the ERDDAP team raised was how to parse attribute names and how that might conflict in existing datasets. Using a non-standard BCP 47 locale tag (i.e. one that has had dashes replaced with the underscore) I think would be bad. Even though I disagree strongly with using netCDF variable and attribute names as program language symbols... I'm now more hesitant about introducing any sort of parsing grammar for the attribute names themselves given the concerns expressed by the ERDDAP team (@Dave-Allured ?). So my most recent proposal completely does away with needing to parse attribute names other than matching exactly in the same file. I suspect that, given what I know about how ERDDAP is configured, the extra variable proposal would allow localizations to be added to an existing dataset on ERDDAP today and it would ignore the extra variables, unless reconfigured to be aware of them. The extra attributes in the data variables and global attributes would not have anything ERDDAP-breaking in them. ERDDAP would continue to be unaware of localizations in the dataset until that functionality is added.
I would like to prepare a "real" data file with only French and English to see what it would actually look like. @turnbullerin do you have anything that could be used for this example?
From the recent conversation over at ERDDAP/erddap#114 I think the position of the ERDDAP folks is clear. They are pretty dependent on a character set limited to [A-Za-z0-9_] for variables and attributes. From my side I do not have much more to contribute regarding how to introduce localization into CF than what I stated before. If the top priority is to support existing software (irrespective of age and provenance), then using underscore to implement localization seems to be the only option. The drawback is that this introduces a new role for the underscore as a delimiter between attribute and locale. Moreover, and importantly, CF would then become even more locked into the current character set restrictions, while the general netCDF community goes to the other extreme by allowing almost all of Unicode. And at the same time there are (and will be) more and more well-motivated requests from various communities for relaxing the restrictions. But that is a conversation more suited to cf-convention/cf-conventions#237. If the conclusion is that the CF community should go ahead with underscore to implement localization, I will not be the one that blocks.
@larsbarring I've attempted an option that eliminates the use of underscore or any attribute name parsing (only attribute values) in this comment. Please take a look. If my kitchen sink example is too busy or hard to understand, I could make a simpler one. PS: @turnbullerin Don't be discouraged by this long process; actual changes to the conventions take time and everyone here is a volunteer.
Hi @DocOtak, it took me a little while and some experimentation to get into what you suggest. To me it looks like a general and powerful approach, but also a bit awkward in requiring one variable per locale for each variable that has localized attributes (as @turnbullerin notes). This might be a possible solution. At the same time I was looking back with (at least somewhat) fresh eyes on other suggested solutions:
Based on these comments, here is another simplified alternative inspired by Erin's first and second alternative. Only the one global attribute
An example:
In this example I have used "mangled" language tags as keys, but this is not required (but good practice?). This has the advantage of easy reading for humans, and still simple decoding for software. If one wants to restrict the freedom in choosing keys, an alternative is to allow only "loc1", "loc2", "loc3", ...., but I do not think this is necessary. This suggestion has the following advantages: it requires only one global attribute, the keys (suffixes) only have an underscore as first character, the tags follow the established format, and it is "lightweight". It seems so simple that I wonder if I have overlooked something?
@larsbarring Your concept seems very similar to what I was proposing and I think it is great. I think there's some value in separating out the "default" tag but I'm not married to the idea of it being in a separate attribute; having it with a "magic" suffix seems fine too. I'm just not a fan of "magic number" type things and adding an extra attribute made more sense to me. As an alternative to a magic suffix, we could say that
Is there value in restricting the character set like this though? From what I was reading, CF doesn't tend to make things mandatory without a good cause. I appreciate the "underscore = space" argument but I think that's actually a good reason to not make it REQUIRED so others can make their own decisions on how to mangle and what characters to use or omit. Instead, I would suggest we RECOMMEND starting suffixes with an underscore followed by ASCII letters (A-Z and a-z) for maximum compatibility. To avoid parsing complications, I would suggest though that we say that colons MUST NOT be part of the suffix (and they won't be part of the language tag by BCP 47), which makes it very easy to parse and to identify the default tag if we use my suggestion above (it is the one without a colon after splitting on spaces).
For consistency, we could also have it have a "blank" suffix, which I like better than a keyword like "default" (so it would be
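The colon rule sketched above could be parsed like this in Python (the attribute name and exact syntax are still under discussion; this assumes space-separated entries where a suffix is marked by a trailing colon and the lone entry without one is the default locale's tag):

```python
def parse_locales_attribute(value: str):
    """Return (default_tag, {suffix: tag}) from a single attribute
    value such as "en-CA _fr: fr-CA _es: es-MX". Hypothetical syntax."""
    default = None
    mapping = {}
    pieces = value.split()
    i = 0
    while i < len(pieces):
        if pieces[i].endswith(":"):
            # "_fr:" names a suffix; the next piece is its locale tag.
            mapping[pieces[i][:-1]] = pieces[i + 1]
            i += 2
        else:
            # No colon: this entry is the default locale's tag.
            default = pieces[i]
            i += 1
    return default, mapping
```

Because colons are excluded from both suffixes and BCP 47 tags, splitting on whitespace is unambiguous and no attribute-name grammar is needed.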
Thanks for the pick-me-up :) I'm not too discouraged, I work for the Government lol. Change takes time and even if I'm usually more of the approach of "well, try something and take good notes, then do it better next time", I recognize a major feature like this to a significant and widely used standard will be both contentious and lengthy to agree on. But it's so worth it :). Plus I get paid to have these discussions at work, which is nice.
I think this is good cause to RECOMMEND but not REQUIRE the
@turnbullerin a couple of comments and questions
Yes, it is your idea, no doubt, I was just making some minor adjustments here and there: credit where credit's due. Regarding which of the following is best I am not sure:
I am not sure that I follow when you write:
Given the current CF limitation to [A-Za-z0-9_] for variable and attribute names, which I think might take some time to change, should I understand that you suggest that any of the other characters is acceptable (although not RECOMMENDED), e.g.
@larsbarring other than the ":" character, I think it would be acceptable but not recommended practice. It doesn't affect a programmatic interpretation of the attributes; it's just more confusing to human readers. I think people would avoid that anyways. But it would allow things like
Maybe it would be better to say:
Though the last is redundant and perhaps confusing as long as CF doesn't allow colons in attribute/variable names anyways. As an analogy for why I feel this way, I would note CF doesn't restrict people from doing confusing things in other areas - for example, I can name my variables
I'd also add quickly that a REQUIRED format of
We are relying on people to choose suffixes that clearly represent the locale with any system where we let them define a suffix, so my thought is to leave it as open as possible and trust them to do something sensible for human readability (as long as we can parse it).
I have started CF #477 to enable period (.) and hyphen (-) in attribute names only. This is in support of my recommended strategy; #477 is intended to remove one roadblock to adopting proposal 3, or similar strategies that need either the period or hyphen characters. #477 is not intended to express preference or foreclose on any other localization strategies. If you agree with adding these two characters for attribute names only, please post a supporting comment on #477.
Hi Erin @turnbullerin, I think this would be a very useful extension of the CF Conventions. Many thanks,
Hi All, Just getting back into all the CF things after my long expedition (and the Ocean Sciences meeting). In my opinion, CF should strongly resist adding something to the standard that requires any programmatic parsing and interpretation of the attribute keys themselves. Complexities of parsing attributes aside, I'm also concerned about "breaking" ERDDAP. At the Ocean Sciences meeting, all the talks/town halls I went to about the technical implementation of the goals of the UN Ocean Decade had ERDDAP featured somewhat heavily (if any data system was mentioned at all) and I think it is set to become the recommended way of serving data in national systems.
@DocOtak I didn't know we had become so popular!!!! :-) More seriously, if I remember the lengthy discussion related to this (on a different list), which Bob Simons knows a lot more about than I do, part of the discussion had to do with problems in ERDDAP code and part had to do with breaking clients (mostly where traversing some structure) as well as reading CDL files; I believe there were a few more examples.
@rmendels Kevin O'Brien is quite the advocate. I didn't really want to say "look at my proposal again" since I'm not too attached to it, but my feeling is that this discussion got stuck on what the best way to mangle attributes is and not the possibility of alternatives. Would folks (@turnbullerin @larsbarring @Dave-Allured others?) be willing to find time for a call to discuss/make progress?
@DocOtak I am happy to make time for a call! @larsbarring I'm also happy to work on the enhancement and pull request. My thoughts haven't changed too much, but I agree with a number of key points made, which I'll outline below as a starting point:
The discussion is now focused on the technical issues of how to implement this. With this in mind, mangling the names in any way that requires expanding the character set from CF 1.10 is probably a no-go as it goes against 2 - ERDDAP won't be able to easily support these without significant issues. This leaves us with two options for implementation for attributes:
A. Using a suffix or other alteration of the attribute name to identify them using existing character sets.
Personally, using variables to group together locale-related global attributes seems counter-intuitive to me - structurally they're in the incorrect place, and for someone not familiar with CF's use of them, it could be confusing. I wonder if a reasonable alternative would be to store triples that map an attribute name to a new attribute name in a given locale, e.g. as follows:
Maybe it would make the localizations attribute too long though? I don't know if there's a maximum length - could also compress it a bit by saying "locale1 en_attr1 fr_attr1 en_attr2 fr_attr2 ; locale2..."? I'm open to other ideas too! Maybe we can brainstorm other solutions, but I'm still leaning towards a clean and backwards-compatible mangling approach as the easiest to manage.
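The compressed form floated above could be decoded with a sketch like this (entirely hypothetical syntax: semicolon-separated segments, each a locale tag followed by pairs of default-locale and localized attribute names):

```python
def parse_localizations(value: str):
    """Return {locale_tag: {attribute: localized_attribute}} from a
    hypothetical compressed mapping attribute."""
    table = {}
    for segment in value.split(";"):
        tokens = segment.split()
        if not tokens:
            continue
        tag, pairs = tokens[0], tokens[1:]
        # Pair up alternating source / localized attribute names.
        table[tag] = dict(zip(pairs[0::2], pairs[1::2]))
    return table
```

This keeps all the parsing in a single attribute value, at the cost of that value growing with the number of locales and translated attributes.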
Yes, I am happy to participate in a call. /Lars
I am way behind making the call, sorry :). Life of a national manager. I looked a bit more at the parsing and processing side of things though, and I am more strongly leaning towards the suffix-based approach but with user-defined suffixes - I think mandating
I think by specifying the allowable suffixes and meanings in an attribute itself, we aren't then interpreting the attribute names themselves, merely the presence or absence of specific names (e.g.
So, given the challenges I foresee ERDDAP having with meta variables, I would propose we move forward with an update based on suffixes. I'll prepare some sample text.
ADDITION TO 2.5 (prior to 2.5.1 heading, after the existing text)

Files that wish to provide localized (i.e. multilingual) versions of the content of variables shall reference section #TBD for details on how to do so.

ADDITION TO 2.6 (prior to 2.6.1 heading, after the existing text following 2.6)

Files that wish to provide localized (i.e. multilingual) versions of the content of attributes shall reference section #TBD for details on how to do so.

NEW SECTION TBD. Localization

Certain attributes and variables in NetCDF files contain natural language text. Natural language text is written for a specific locale: this defines the language (e.g. English), the country (e.g. Canada), the script (e.g. the English alphabet), and/or other features of the natural language. Locales are defined by a "language tag" that follows the format specified in BCP 47 //link to BCP47 here//, such as

Localization of attributes and variables is limited to natural language string values that are not taken from a controlled vocabulary. See Appendix A for recommendations on localization of CF attributes. Non-CF text attributes that use a natural language may also be localized using these rules. To localize an attribute or variable, an alternative version of it is supplied using a suffix for its name that is associated with the language tag.

TBD.1 Localized Files

A "localized file" is one that provides the global attribute

The default locale should be chosen to represent the most complete set of attributes and variables; if only some of the natural language text attributes have localized versions, then the more complete language should be chosen as the default. Where there are two or more complete sets, the predominant language that the content was originally written in should be chosen. An attribute or a variable in a localized file must not have a name ending with a locale suffix unless it is used to indicate the locale as per this section.
Applications that process NetCDF files are encouraged to apply BCP 47 in determining which content to show a user when localized content is available. When content is not available in a suitable locale for the user, the default locale should be used.

#### TBD.2 Localized Attributes

Localized attributes are created by appending a locale suffix to the usual attribute name. For example:
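As an illustration of how a reading application might resolve such suffixed attributes, here is a hedged Python sketch; plain dicts stand in for NetCDF attributes, and the suffix-to-tag mapping is assumed to have been read from the file's locale declarations. The matching here is deliberately simplified; a real implementation should follow the lookup scheme of RFC 4647.

```python
def choose_localized(attrs, name, preferred, default_tag, suffix_map):
    """Return the value of attribute `name` best matching the user's
    preferred BCP 47 tags, falling back to the default locale."""
    # Map each declared locale tag to the attribute name that carries it,
    # e.g. {"fr-CA": "title_fr", "en-CA": "title"}.
    candidates = {tag: f"{name}_{suffix}" for suffix, tag in suffix_map.items()}
    candidates[default_tag] = name
    for want in preferred:
        for tag, attr_name in candidates.items():
            # Simplified matching: exact tag, or same primary language
            # subtag ("fr" matches "fr-CA").
            if tag == want or tag.split("-")[0] == want.split("-")[0]:
                if attr_name in attrs:
                    return attrs[attr_name]
    return attrs.get(name)  # default-locale fallback


attrs = {"title": "Water levels", "title_fr": "Niveaux d'eau"}
suffix_map = {"fr": "fr-CA"}
french = choose_localized(attrs, "title", ["fr"], "en-CA", suffix_map)
# french -> "Niveaux d'eau"
```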
#### TBD.3 Localized Variables

Localized variables are created by appending a locale suffix to the variable name; note that this is only necessary where the data stored in the variable itself is localized and does not come from a controlled vocabulary. Natural language attributes for a localized variable should be provided in the locale of that variable.

Localized versions of a variable must be of the same data type and dimensions and must contain the same number of elements appearing in the same order (i.e.
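The variable-level rule above could be consumed along these lines; this is a hedged sketch in which plain Python lists stand in for NetCDF string variables, and the names are illustrative.

```python
def pick_variable(variables, base_name, want_suffix):
    """Return (name, values) for the localized variant of `base_name` if one
    exists for `want_suffix`, otherwise the default-locale variable.

    `variables` is a {name: list} stand-in for a file's variables.
    """
    candidate = f"{base_name}_{want_suffix}"
    name = candidate if candidate in variables else base_name
    values = variables[name]
    # Per the draft rule above, a localized variant must hold the same number
    # of elements, in the same order, as the default-locale variable.
    if len(values) != len(variables[base_name]):
        raise ValueError(f"{name} is not element-aligned with {base_name}")
    return name, values


variables = {
    "station_name": ["Lake Superior", "Hudson Bay"],
    "station_name_fr": ["Lac Supérieur", "Baie d'Hudson"],
}
name, values = pick_variable(variables, "station_name", "fr")
# name -> "station_name_fr"
```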
**ADDITION TO APPENDIX A**
**References**

**EDIT NOTES:**
I'd especially like to draw people's attention to the change to Appendix A above, as there are still some open questions there that have not been answered.
Dear Erin @turnbullerin

Thanks for your proposal. Although this issue started as a discussion, you're now making a definite proposal to change the convention. Therefore I think it would be appropriate if you began a new issue with this in the conventions repo.

Best wishes

Jonathan
Will do!
Thanks, @turnbullerin. All interested in Erin's proposal, please comment on #528, and thanks for the discussion up to now.
Hi Everyone!
So I work for the Government of Canada and I am working on defining the required metadata fields for us to publish data in NetCDF format. We'll be moving a lot of data into this format, so we are trying to make sure we get the format right the first time. The CF conventions are our starting point for metadata attributes.
As the data will be officially published by the Government of Canada eventually, we will have to make sure the metadata is available in both English and French. If the data contains English or French text (not from a controlled list), it needs to be translated too. I haven't found any efforts towards creating a convention for bilingual (or multilingual) metadata and data in NetCDF formats, so I wanted to reach out here to see if anyone has been working on this so we could collaborate on it.
My initial thought is that the metadata should be included in such a way as to make it easy to programmatically extract each language separately. This would allow applications that use NetCDF files (or tools that draw on the CF conventions like ERDDAP) to display the available language options and let the user select which one they would like to see without additional clutter. It should also be included in a way that does not impact existing applications to ensure compatibility.
Of note though is that some data comes from controlled lists where the values have meaning beyond the English meaning. This data probably shouldn't be translated as it would lose its meaning. For many controlled lists, applications can use their own lookup tables to translate the display if they want, and bigger vocabulary lists (like GCMD keywords) can have translations available on the web.
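For instance, an application could keep its own display table for controlled values rather than translating the stored value itself; the table below is a hypothetical illustration, not an official translation source.

```python
# Hypothetical client-side display table: the stored controlled value stays
# untranslated in the file, only the presentation changes.
DISPLAY_NAMES = {
    "sea_water_temperature": {
        "en": "Sea water temperature",
        "fr": "Température de l'eau de mer",
    },
}

def display_name(controlled_value, lang):
    """Look up a translated label, falling back to English, then to the raw value."""
    entry = DISPLAY_NAMES.get(controlled_value, {})
    return entry.get(lang) or entry.get("en") or controlled_value
```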
ISO-19115 handles this by defining "locales" (a mix of a mandatory ISO 639 language code, an optional ISO 3166 country code, and an optional IANA character set) and using PT_FreeText to define one value per locale for different text fields. I like this approach and I think it can translate fairly cleanly to NetCDF attributes. To align with ISO-19115, I would propose two global attributes, one called `locale_default` and one called `locale_others` (I kept the word "locale" in front instead of at the end like in ISO-19115, since this groups similar attributes, and I see this is what CF has usually done). The `locale_others` attribute could use a prefix system (like what `keywords_vocabulary` uses) to separate different values. I would propose using the typical standards used in the HTTP protocol for separating the language, country, and encoding, e.g. `language-COUNTRY;encoding`. Maybe encoding and country are not necessary, I'm not sure; I just know ISO included them.

I would then propose using the prefixes from `locale_others` as suffixes on existing attribute names to represent the value of that attribute in another locale.

For example, this would give us the following global attributes if we wanted to include English (Canada), French (Canada), and Spanish (Mexico) in our locales and translate the title:

I was torn over whether the default locale should define a prefix too; if it did, it would let one use the non-suffixed attribute name for a combination of languages as the default (for applications that don't support localization); for example:

But then this seems like an inaccurate use of `locale_default`, since the default is actually a combo. Maybe English should be added to `locale_others` in this case and `locale_default` changed to something like `und;utf-8`, or even just use the delimiter like `[eng] | [fra]` to show the format.

I haven't run into a data variable that needs translating yet, but if so, my thought was to define an attribute on the data variable that would allow an application to identify all the related localized variables (i.e. same data, different locale) and which variable goes with which locale. Something like
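As a side note, the `language-COUNTRY;encoding` locale strings proposed above could be parsed along these lines; this is a hedged sketch, and the returned field names are my own, not part of any convention.

```python
def parse_locale_spec(spec):
    """Split a 'language-COUNTRY;encoding' string into its parts.

    Country and encoding are optional, matching the proposal above.
    """
    lang_country, _, encoding = spec.partition(";")
    language, _, country = lang_country.partition("-")
    return {
        "language": language,
        "country": country or None,
        "encoding": encoding or None,
    }
```

For example, `parse_locale_spec("fra-CA;utf-8")` would yield the language `fra`, country `CA`, and encoding `utf-8`, while a bare `eng` would leave country and encoding unset.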
Thoughts, feedback, any other suggestions are very welcome!