Automatically update data/translations #9946

Merged
merged 8 commits into from
Jul 5, 2024

stephen-huan
Contributor

@stephen-huan stephen-huan commented Jul 4, 2024

This implements the "automatic" translation system proposed in #9930.

  • The first three commits fix existing issues unrelated to this PR.
    • The third commit removes ":" from pl (although this was a manual tweak, no other file has a colon in it).
  • The fourth commit adds the translation script (implemented in Python, with no external dependencies).
  • For ease of review, the update to data/translations has been split into two commits:
    • A commit updating existing translations, avoiding overwriting existing manual tweaks (to the best of my ability).
    • A commit adding new translation files.

Implementation notes:

  • Running python tools/update-translations.py will not leave the git tree untouched. Instead, the resulting changes in the git tree show precisely which files were manually modified or deleted (that is, the changes that were not automatically generated).
  • Babel and Polyglossia are downloaded by git from the pandoc branch of my forks: babel, polyglossia.
  • Babel exposes its configuration data as ini, which can be parsed by Python natively. Polyglossia's latex files are parsed with a hacky regex parser. Python lacks a native yaml reader/writer, so both are done ad hoc. Consistency is checked against pandoc's yaml parser, exposed through json (which actually parses strings as markdown).
  • If Babel and Polyglossia conflict in their translations, the data source with more keys is preferred. Ties are broken in favor of Babel. This heuristic currently only affects ja, kn, lt, and lv.
  • I tried to avoid adding too many subtags. For example, zh-Hans and zh-Hant are subtags of zh. Theoretically, for consistency, es, pt, and sr should be removed, as I've added the more specific variants es-ES/es-MX, pt-BR/pt-PT, and sr-Cyrl/sr-Latn. I haven't removed these existing translation files as I'm not familiar with the conventions around these languages.
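
As a rough illustration of two of the notes above (this is not the code in tools/update-translations.py): a minimal Python sketch of reading ini data with the standard library and of the "more keys wins, ties go to Babel" rule. The sample ini text, its [captions] section and keys, and the function names parse_captions and prefer_source are assumptions made up for this example.

```python
import configparser

# Hypothetical excerpt in the style of a Babel locale ini file.
# The section name and keys are assumptions for illustration only.
BABEL_INI = """
[captions]
figure = Figure
table = Table
contents = Contents
"""


def parse_captions(text: str) -> dict[str, str]:
    """Parse ini text and return the [captions] section as a plain dict."""
    # interpolation=None keeps any literal '%' in values from being
    # treated as configparser interpolation syntax.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if not parser.has_section("captions"):
        return {}
    return dict(parser["captions"])


def prefer_source(babel: dict[str, str], polyglossia: dict[str, str]) -> dict[str, str]:
    """For a conflicting language, keep the source with more keys.

    Ties are broken in favor of Babel, hence the >= comparison.
    """
    return babel if len(babel) >= len(polyglossia) else polyglossia


if __name__ == "__main__":
    captions = parse_captions(BABEL_INI)
    print(sorted(captions))
    print(prefer_source(captions, {"figure": "Fig.", "table": "Tab."}))
```

The sketch only covers the case where the two sources already disagree; how non-conflicting data is combined is not shown here.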

@jgm
Owner

jgm commented Jul 4, 2024

Looks great!

Why is the script downloading babel & polyglossia from your fork rather than the authoritative repositories? The latter would be better going forward (there may be changes to these, and no guarantee that they'll be updated in your fork).

It would be nice to add a Makefile target that runs the script.

@stephen-huan
Contributor Author

Why is the script downloading babel & polyglossia from your fork rather than the authoritative repositories? The latter would be better going forward (there may be changes to these, and no guarantee that they'll be updated in your fork).

The first-order reason is I can use it to integrate some fixes that haven't been upstreamed yet, e.g. latex3/babel#303, reutenauer/polyglossia#651, reutenauer/polyglossia#649. The first two commits just fix some minor mistakes I discovered while working on this and aren't really necessary, but the last commit allows me to import Polyglossia's tools/bcp47.py from python3 which has some useful data. Without the fixes, it's a python2 script that can't be imported from python3. Of course I could always vendor the data in tools/update-translations.py itself, but I consider that a worse solution.

More generally, it's because the script isn't as automatic as I'd like, and there are a lot of manual hacks littered around the code, so I'm not sure how amenable it would be to updates. Fixing a branch of a fork is a way of implicitly pinning the data that the script processes. My intended workflow for updating the translations would be something like:

  • Bump the pandoc branch of the forks to the latest commit.
  • Run the script and fix any issues that arise.
  • Manually inspect the changes, reverting overwrites to the existing manual tweaks.
  • Commit the updated translation files.

Now that I think about it, the reproducibility of the script is not particularly important as its goal is just to update the translations. Of course, I'm happy to remove the forks once the Polyglossia PR is merged or change the script not to depend on bcp47.py.

It would be nice to add a Makefile target that runs the script.

Added the update-translations target. I'm not familiar with the Makefile syntax, so it's based on the other targets.

@jgm
Owner

jgm commented Jul 4, 2024

Great, sounds good.

@jgm
Owner

jgm commented Jul 4, 2024

The test is failing because of a new translation. You can update all the golden tests by doing make TESTARGS=--accept.
(Just check to make sure that nothing else was changed but that one test, and that that test looks right.)

@jgm
Owner

jgm commented Jul 4, 2024

Also, if you put a descriptive comment after the makefile target ## does so and so, it will appear when users do make help (see the other targets).
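
For readers unfamiliar with the convention, here is a minimal Python sketch of how a help listing can be derived from "target: ## description" lines. It assumes the common self-documenting Makefile pattern and is not pandoc's actual make help implementation. The sample Makefile text, its description wording, and the function name documented_targets are hypothetical.

```python
import re

# Matches lines of the form "target: prerequisites ## description".
HELP_LINE = re.compile(r"^([A-Za-z0-9_.-]+):.*?##\s*(.+)$")

# Hypothetical Makefile fragment; the description text is made up.
SAMPLE_MAKEFILE = """\
update-translations: ## update data/translations from Babel and Polyglossia
\tpython tools/update-translations.py

.PHONY: update-translations
"""


def documented_targets(makefile_text: str) -> dict[str, str]:
    """Map each target that carries a '##' comment to its description."""
    targets = {}
    for line in makefile_text.splitlines():
        match = HELP_LINE.match(line)
        if match:
            targets[match.group(1)] = match.group(2).strip()
    return targets


if __name__ == "__main__":
    for name, description in documented_targets(SAMPLE_MAKEFILE).items():
        print(f"{name:24} {description}")
```

Targets without a ## comment (such as .PHONY in the sample) are simply skipped, which is why adding the comment is what makes a target show up.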

@stephen-huan
Contributor Author

The test is failing because of a new translation. You can update all the golden tests by doing make TESTARGS=--accept. (Just check to make sure that nothing else was changed but that one test, and that that test looks right.)

Thanks for the heads-up. The situation with Serbian (sr) is more subtle than I thought. The test is failing because sr now refers to Serbian written with the Cyrillic alphabet (sr-Cyrl), rather than with the Latin alphabet (sr-Latn).

See below for the updated test.

--- a/test/command/translations.md
+++ b/test/command/translations.md
@@ -25,5 +25,10 @@
 \figurename~2
 \figurename.
 ^D
-[ Para [ Str "Slika\160\&2" , SoftBreak , Str "Slika." ] ]
+[ Para
+    [ Str "\1057\1083\1080\1082\1072\160\&2"
+    , SoftBreak
+    , Str "\1057\1083\1080\1082\1072."
+    ]
+]
 ```

This behavior matches current babel and has been the case since ~2017. However, polyglossia uses sr to mean sr-Latn (I think). According to Wikipedia, it's pretty evenly split. I've decided to follow babel's convention, and let sr refer to sr-Cyrl. But rather than canonicalize this (fairly arbitrary) decision in the test, I've decided to keep the test output the same and instead refer explicitly to sr-Latn.

--- a/test/command/translations.md
+++ b/test/command/translations.md
@@ -21,7 +21,7 @@
 ```

 ```
-% pandoc -f latex -t native -M lang=sr
+% pandoc -f latex -t native -M lang=sr-Latn
 \figurename~2
 \figurename.
 ^D

Also, if you put a descriptive comment after the makefile target ## does so and so, it will appear when users do make help (see the other targets).

Thanks for the tip. Added a message.

@stephen-huan
Contributor Author

stephen-huan commented Jul 5, 2024

This is perhaps only tangentially related to this PR, but I think it's a conversation worth having now. This confusion around sr referring to sr-Cyrl or sr-Latn was why I floated the idea of removing es, pt, and sr in favor of their more specific variants. Deciding what script sr should refer to (or not deciding at all) is an inherently political decision!

To quote from Wikipedia,

The Latin script continues to be used in official contexts, although
the government has indicated its desire to phase out this practice due
to national sentiment. The Ministry of Culture believes that Cyrillic
is the "identity script" of the Serbian nation.[36]
...
To most Serbians, the Latin script tends to imply a
cosmopolitan or neutral attitude, while Cyrillic appeals
to a more traditional or vintage sensibility.[37]
...
Latin script has become more and more popular in Serbia,
as it is easier to input on phones and computers.[39]

and the linked article [39],

Much of that comfort is likely aided by the fact that the Internet in
Serbia has gone decidedly Latin, as it has in most of the cyberworld.

According to the Serbian National Internet Domain Registry, 101,648
websites have been registered on the .rs Latin-script domain,
compared with just 2,512 for its Cyrillic-script equivalent.

To give the example of Chinese, which has been discussed previously in #6904 and whose situation I know a bit better, the correct way to indicate whether text is written with simplified or traditional characters is with the script tag, zh-Hans or zh-Hant (see the w3c quote below). This is the position pandoc currently takes.

Although for common uses of language tags it is not likely that you will
need to specify the script, there are one or two situations that have
been crying out for it for some time. One such example is Chinese. There
are many Chinese dialects, often mutually unintelligible, but these
dialects are all written using either Simplified or Traditional Chinese
script. People typically want to label Chinese text as either Simplified
or Traditional, but until recently there was no way to do so. People had
to bend something like zh-CN (meaning Chinese as spoken in China) to
mean Simplified Chinese, even in Singapore, and zh-TW (meaning Chinese
as spoken in Taiwan) for Traditional Chinese. (Other people, however,
use zh-HK for Traditional Chinese.) The availability of zh-Hans and
zh-Hant for Chinese written in Simplified and Traditional scripts should
improve consistency and accuracy, and is already becoming widely used.
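
To make the distinction between script and region subtags concrete, here is a simplified Python sketch that splits a tag into language, script, and region. It assumes only the basic shape of BCP 47 tags (4-letter script, 2-letter or 3-digit region) and ignores variants, extensions, and private-use subtags. LangTag and parse_tag are hypothetical names, not part of pandoc.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LangTag:
    language: str
    script: Optional[str] = None
    region: Optional[str] = None


def parse_tag(tag: str) -> LangTag:
    """Split a language tag into language, script, and region subtags.

    Simplified: anything after the region (or any unrecognized subtag)
    is ignored rather than validated.
    """
    parts = tag.split("-")
    language = parts[0].lower()
    script = None
    region = None
    for part in parts[1:]:
        is_script = len(part) == 4 and part.isalpha()
        is_region = (len(part) == 2 and part.isalpha()) or (
            len(part) == 3 and part.isdigit()
        )
        if is_script and script is None and region is None:
            script = part.title()  # conventional casing, e.g. "Hans"
        elif is_region and region is None:
            region = part.upper()  # conventional casing, e.g. "CN"
        else:
            break  # variants, extensions, private use: not handled here
    return LangTag(language, script, region)


if __name__ == "__main__":
    for example in ["zh-Hans", "zh-CN", "sr-Latn-BA", "de-CH-1901"]:
        print(example, parse_tag(example))
```

Under this reading, zh-Hans carries a script and no region, while zh-CN carries a region and no script, which is exactly the ambiguity the quote describes.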

However, zh-CN is still extremely common on the web. For example, mdn uses zh-Hans in its example for the lang attribute, but its actual Chinese page has the attribute lang="zh-CN". As previously noted, pandoc's decision to only include the script subtags (and not zh-CN or zh) means these common tags aren't translated correctly. I think the response in #6904 (comment) that the translation system is a new feature internal to pandoc and therefore can reasonably expect users to avoid legacy language tags is a bit misleading,

Especially from the bold sentence it seems to suggest bare "zh" should be
used only for historical purposes. Since this is a new "feature" here that
no one else has relied on in pandoc before, may be we should just expect
people to use the more precise variants (with language subtags.)

as pandoc is frequently used to build websites, e.g. hakyll. Indeed, the reason I opened #9930 adding Japanese translations was to hint the language of a page on my personal website, using pandoc to convert from markdown to html.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="ja" xml:lang="ja">

#9932 was submitted for the exact same reason, to fix a warning in quarto-dev/quarto-cli#10178; quarto leverages pandoc as a website generator (markdown to html). quarto-dev/quarto-cli#5197 (comment) is another instance of the same issue.

This means which tags we decide to include also affects which language tags websites generated with pandoc use. Babel, for example, uses "chinese" to mean zh-Hans. Polyglossia uses "chinese" to mean zh-Hans-CN and also provides zh as an alias for chinese, zh-CN for simplified, and zh-TW for traditional.

There are three natural things we could do.

  1. Have no official position, and defer to external contributors. This is the status quo.
  2. Strictly enforce having no general tag if more specific tags exist (e.g. sr, zh). We would avoid deciding what a general tag should mean (which is sometimes controversial), but we would be going a bit contrary to existing practices and people's expectations. Currently, this would imply removing es, pt, and sr.
  3. Defer to Babel/Polyglossia for general tags. This would imply adding the files below.

My personal opinion lies somewhere between [1] and [3]. I made this PR with the intention of reducing the number of language PRs. Expecting users to copy data from Babel or Polyglossia to $XDG_DATA_HOME/pandoc/translations/ja.yaml as a workaround for missing languages in pandoc upstream is a bit much.

However, I made this PR following [2], respecting the precedent in #6904. For transparency, a full list of the files I decided not to include is given below. I decided not to include tags that appeared fully or almost fully redundant with an existing tag (for example, the English variants en-* are identical to en.yaml). The German variants are slightly different (Swiss German, Austrian German, blackletter script), but they differed in only a few keys. I did not include the French variants since they differed from the manual PR #4766, and did not include the Chinese variants, following #6904.

aeb.yaml
afb.yaml
apd.yaml
ar-DZ.yaml
ar-EG.yaml
ar-IQ.yaml
ar-JO.yaml
ar-LB.yaml
ar-MA.yaml
ar-MR.yaml
ar-PS.yaml
ar-SA.yaml
ar-SY.yaml
ar-TN.yaml
ar-YE.yaml
arq.yaml
ary.yaml
arz.yaml
ayl.yaml
bs-Latn.yaml
ckb.yaml
cu-Cyrs.yaml
de-1901.yaml
de-1996.yaml
de-AT-1901.yaml
de-AT-1996.yaml
de-AT.yaml
de-CH-1901.yaml
de-CH-1996.yaml
de-CH.yaml
de-DE-1901.yaml
de-DE-1996.yaml
de-DE.yaml
de-Latf-AT-1901.yaml
de-Latf-AT-1996.yaml
de-Latf-AT.yaml
de-Latf-CH-1901.yaml
de-Latf-CH-1996.yaml
de-Latf-CH.yaml
de-Latf-DE-1901.yaml
de-Latf-DE-1996.yaml
de-Latf-DE.yaml
de-Latf.yaml
el-polyton.yaml
en-AU.yaml
en-CA.yaml
en-GB.yaml
en-NZ.yaml
en-US.yaml
fr-BE.yaml
fr-CA.yaml
fr-CH.yaml
fr-FR.yaml
fr-LU.yaml
fr-x-acadian.yaml
kmr.yaml
la-x-classic.yaml
la-x-ecclesia.yaml
la-x-medieval.yaml
pa-Guru.yaml
ro-MD.yaml
sr-Cyrl-BA.yaml
sr-Cyrl-ME.yaml
sr-Cyrl-XK.yaml
sr-Latn-BA.yaml
sr-Latn-ME.yaml
sr-Latn-XK.yaml
sr-Latn-ijekavsk.yaml
sr-ijekavsk.yaml
zh-CN.yaml
zh-TW.yaml
zh.yaml
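
The redundancy criterion described before this list (a variant is left out when it matches, or nearly matches, an existing tag) could be sketched roughly as follows. This is only an illustration, not the procedure actually used for the PR. It assumes each translation file has been loaded into a flat dict of strings. The names differing_keys and is_redundant, the max_diff threshold, and the placeholder sample data are all hypothetical.

```python
def differing_keys(variant: dict[str, str], base: dict[str, str]) -> set[str]:
    """Return the keys whose values differ, or that exist on only one side."""
    keys = set(variant) | set(base)
    return {key for key in keys if variant.get(key) != base.get(key)}


def is_redundant(variant: dict[str, str], base: dict[str, str], max_diff: int = 0) -> bool:
    """Treat a variant as redundant if it differs from base in at most max_diff keys."""
    return len(differing_keys(variant, base)) <= max_diff


if __name__ == "__main__":
    # Placeholder data, not taken from the real translation files.
    base = {"Figure": "A", "Table": "B", "Contents": "C"}
    identical_variant = dict(base)
    near_variant = {"Figure": "A", "Table": "B", "Contents": "X"}

    print(is_redundant(identical_variant, base))         # exact match
    print(is_redundant(near_variant, base))              # differs in one key
    print(is_redundant(near_variant, base, max_diff=1))  # tolerated difference
```

With max_diff=0 only exact duplicates (the en-* case) are flagged. A small positive threshold would also flag variants that differ in a few keys (the German case), at the cost of discarding those differences.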

@jgm
Owner

jgm commented Jul 5, 2024

I'd like to avoid 2, because it's going to be natural for people to use e.g. es and they will be puzzled if it doesn't work at all. 3 seems like a good way to go. Let me get clearer about the concern. Is the worry that we'll be bloating pandoc with a bunch of mostly superfluous data files?

@stephen-huan
Contributor Author

Let me get clearer about the concern. Is the worry that we'll be bloating pandoc with a bunch of mostly superfluous data files?

That's one concern, but I think fairly minimal since the language files are small and mostly independent of everything else in pandoc. My thinking was that the current precedent in pandoc (based on only #6904) avoids ambiguity in language tags whenever possible (e.g. zh-Hans instead of the more ambiguous zh-CN or zh). I was unsure whether this is a policy we want to explicitly enforce in the future.

Even if we want to add more language tags, I'm not sure how useful some of them are. For example, the Arabic variants listed above are spoken variants, I believe. Similarly, in Chinese, cmn (Mandarin Chinese) and yue (Cantonese) refer to spoken variants of Chinese that possibly, but not necessarily, differ in writing. Since pandoc is ultimately a text processing system, it makes sense to only add tags/subtags that distinguish based on script differences.

To be concrete, I think the PR is fine as-is, for now. If people who know Arabic, French, German, Chinese, etc. want additional tags, they can be easily added in additional PRs. My concerns are admittedly a bit theoretical---in the nearly 4 years since #6904, to the best of my knowledge there hasn't been an issue/PR requesting zh-CN or zh (the closest is #7945, but I don't think that was ultimately a data/translations issue).

@jgm
Owner

jgm commented Jul 5, 2024

OK, let's merge this then, and see what issues come up in practice.

@jgm jgm merged commit 9aea033 into jgm:main Jul 5, 2024
9 of 12 checks passed
@jgm
Owner

jgm commented Jul 5, 2024

And thank you!

@stephen-huan
Contributor Author

Thank you for all your hard work on pandoc and being so responsive on this PR!!

@stephen-huan stephen-huan deleted the auto-translation branch July 5, 2024 21:44