Automatically update data/translations #9946
Looks great! Why is the script downloading babel & polyglossia from your fork rather than the authoritative repositories? The latter would be better going forward (there may be changes to these, and no guarantee that they'll be updated in your fork). It would be nice to add a Makefile target that runs the script.
The first-order reason is that I can use the forks to integrate some fixes that haven't been upstreamed yet, e.g. latex3/babel#303, reutenauer/polyglossia#651, reutenauer/polyglossia#649. The first two commits just fix some minor mistakes I discovered while working on this and aren't really necessary, but the last commit allows me to import Polyglossia's tools/bcp47.py from python3, which has some useful data. Without the fixes, it's a python2 script that can't be imported from python3. Of course, I could always vendor the data in tools/update-translations.py itself, but I consider that a worse solution.

More generally, it's because the script isn't as automatic as I'd like, and there are a lot of manual hacks littered around the code, so I'm not sure how amenable it would be to updates. Fixing a branch of a fork is a way of implicitly pinning the data that the script processes. My intended workflow for updating the translations would be something like:
Now that I think about it, the reproducibility of the script is not particularly important as its goal is just to update the translations. Of course, I'm happy to remove the forks once the Polyglossia PR is merged or change the script not to depend on bcp47.py.
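For readers following along, here is a minimal sketch of what "importing tools/bcp47.py from python3" could look like once the file has been downloaded. The helper name `load_module_from_path`, the stub file, and its `LANGUAGES` table are hypothetical stand-ins. The real contents of bcp47.py are not reproduced here, and this is not necessarily how tools/update-translations.py does it.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(name: str, path: Path):
    """Import a Python source file that is not on sys.path."""
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Stand-in for a downloaded checkout. The table below is invented for
# illustration and does not mirror the real bcp47.py.
with tempfile.TemporaryDirectory() as checkout:
    stub = Path(checkout) / "bcp47.py"
    stub.write_text("LANGUAGES = {'japanese': 'ja'}\n", encoding="utf-8")
    bcp47 = load_module_from_path("bcp47", stub)
    print(bcp47.LANGUAGES["japanese"])
```

Loading by path only works if the file parses under python3, which is why a python2-only script cannot be used this way without a fix.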
Added the
Great, sounds good.
The test is failing because of a new translation. You can update all the golden tests by doing
Also, if you put a descriptive comment after the makefile target
Thanks for the heads-up. The situation with Serbian ( See below for the updated test.

    --- a/test/command/translations.md
    +++ b/test/command/translations.md
    @@ -25,5 +25,10 @@
     \figurename~2
     \figurename.
     ^D
    -[ Para [ Str "Slika\160\&2" , SoftBreak , Str "Slika." ] ]
    +[ Para
    +    [ Str "\1057\1083\1080\1082\1072\160\&2"
    +    , SoftBreak
    +    , Str "\1057\1083\1080\1082\1072."
    +    ]
    +]
     ```

This behavior matches current babel and has been the case since ~2017. However, polyglossia uses

    --- a/test/command/translations.md
    +++ b/test/command/translations.md
    @@ -21,7 +21,7 @@
     ```
     
     ```
    -% pandoc -f latex -t native -M lang=sr
    +% pandoc -f latex -t native -M lang=sr-Latn
     \figurename~2
     \figurename.
     ^D

Thanks for the tip. Added a message.
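As a reading aid for the updated expectation: the native output shows strings with Haskell-style decimal escapes, where `\160` is a no-break space and `\&` is an empty separator that keeps the following digit from being read as part of the escape. A small sketch, with the hypothetical helper `decode_haskell_escapes`, that decodes just those two escape forms (it does not handle `\n`, `\"`, or other Haskell escapes):

```python
import re

# Matches either a decimal escape such as \1057 or the empty escape \&.
_ESCAPE = re.compile(r"\\(\d+)|\\&")


def decode_haskell_escapes(text: str) -> str:
    """Decode decimal character escapes and drop \\& separators."""
    def replace(match: "re.Match[str]") -> str:
        digits = match.group(1)
        return chr(int(digits)) if digits is not None else ""
    return _ESCAPE.sub(replace, text)


caption = decode_haskell_escapes(r"\1057\1083\1080\1082\1072\160\&2")
print(caption)
```

The five code points decode to the Cyrillic spelling of the caption, so the new expectation is the Cyrillic counterpart of the old Latin "Slika".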
This is perhaps only tangentially related to this PR, but I think it's a conversation worth having now. This confusion around To quote from Wikipedia,
and the linked article [39],
To give the example of Chinese, which has been discussed previously in #6904 and whose situation I know a bit better, the correct way to indicate whether text is written with simplified or traditional characters is with the script tag,
However,
as pandoc is frequently used to build websites, e.g. hakyll. Indeed, the reason I opened #9930 adding Japanese translations was to hint the language of a page on my personal website, using pandoc to convert from markdown to html.

    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml" lang="ja" xml:lang="ja">

#9932 was submitted for the exact same reason, to fix a warning in quarto-dev/quarto-cli#10178; quarto leverages pandoc as a website generator (markdown to html). quarto-dev/quarto-cli#5197 (comment) is another instance of the same issue. This means which tags we decide to include also affects which language tags websites generated with pandoc use. Babel, for example, uses "chinese" to mean

There are three natural things we could do.
My personal opinion lies somewhere between [1] and [3]. I made this PR with the intention of reducing the number of language PRs. Expecting users to copy data from Babel or Polyglossia to However, I made this PR following [2], respecting the precedent in #6904. For transparency, a full list of the files I decided not to include is given below. I decided not to include tags that appeared fully redundant or near completely redundant with an existing tag (for example, the English variants
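To make the language/script/region distinction in this comment concrete, here is a rough sketch that splits a tag into those three subtags. It covers only a simplified subset of BCP 47 (a 2-3 letter language, an optional 4-letter script, an optional 2-letter or 3-digit region). The names `LangTag` and `parse_tag` are made up for illustration and are not pandoc's API.

```python
import re
from typing import NamedTuple, Optional


class LangTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


# Simplified subset of BCP 47: language[-Script][-REGION].
_TAG = re.compile(
    r"^([A-Za-z]{2,3})(?:-([A-Za-z]{4}))?(?:-([A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_tag(tag: str) -> LangTag:
    """Split a tag into language, script, and region, normalizing case."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unsupported language tag: {tag!r}")
    language, script, region = match.groups()
    return LangTag(
        language.lower(),
        script.title() if script else None,
        region.upper() if region else None,
    )


print(parse_tag("zh-Hans"))
print(parse_tag("pt-BR"))
```

Under this reading, `zh-Hans` carries a script subtag and no region, while `pt-BR` carries a region and no script.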
I'd like to avoid 2, because it's going to be natural for people to use e.g.
That's one concern, but I think fairly minimal since the language files are small and mostly independent of everything else in pandoc. My thinking was that the current precedent in pandoc (based on only #6904) avoids ambiguity in language tags whenever possible (e.g. Even if we want to add more language tags, I'm not sure how useful some of them are. For example, the Arabic variants listed above are spoken variants I believe. In Chinese, for example, To be concrete, I think the PR is fine as-is, for now. If people who know Arabic, French, German, Chinese, etc. want additional tags, they can be easily added in additional PRs. My concerns are admittedly a bit theoretical: in the nearly 4 years since #6904, to the best of my knowledge there hasn't been an issue/PR requesting
OK, let's merge this then, and see what issues come up in practice.
And thank you!
Thank you for all your hard work on pandoc and being so responsive on this PR!!
This implements the "automatic" translation system proposed in #9930.
pl (although this was a manual tweak, no other file has a colon in it).

data/translations has been split into two commits:

Implementation notes:

python tools/update-translations.py will not leave the git tree untouched. Instead, running it shows precisely which files were manually modified/deleted (the changes that were not automatically generated).

pandoc branch of my forks: babel, polyglossia.

ja, kn, lt, and lv.

zh-Hans and zh-Hant are subtags of zh. Theoretically, for consistency, es, pt, and sr should be removed, as I've added the more specific variants es-ES/es-MX, pt-BR/pt-PT, and sr-Cyrl/sr-Latn. I haven't removed these existing translation files as I'm not familiar with the conventions around these languages.
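One way to think about the generic versus specific files is a most-specific-first lookup that drops trailing subtags until a file matches. The sketch below is an assumption for illustration only: the functions `fallback_chain` and `pick_translation` are hypothetical, and this is not a description of how pandoc actually resolves translation files.

```python
from typing import Iterable, List, Optional


def fallback_chain(tag: str) -> List[str]:
    """List candidate tags from most to least specific."""
    parts = tag.split("-")
    return ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]


def pick_translation(tag: str, available: Iterable[str]) -> Optional[str]:
    """Return the most specific available tag, or None if nothing matches."""
    have = set(available)
    for candidate in fallback_chain(tag):
        if candidate in have:
            return candidate
    return None


files = {"es", "es-ES", "es-MX", "zh-Hans", "zh-Hant"}
print(pick_translation("es-AR", files))
print(pick_translation("zh-Hant-TW", files))
print(pick_translation("zh", files))
```

Under such a scheme a generic file like es would still serve regions without their own file, whereas a bare zh request would find nothing if only zh-Hans and zh-Hant exist.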