AsciiDoc exclude cross-references and anchors ids from the *.po(t) translation file #400

Open
benoitrolland opened this issue Mar 26, 2023 · 13 comments


@benoitrolland

benoitrolland commented Mar 26, 2023

First, thank you so much for the great job of bringing AsciiDoc support into po4a, which allows translating AsciiDoc files.
Reading the Locale::Po4a::AsciiDoc documentation,
I do not see how to exclude anchor IDs, cross-reference IDs and custom IDs from the generated *.po(t) translation file.
I believe this might prevent effective automatic translation of AsciiDoc files with po4a.
Please find the details here.

@mquinson
Owner

Hello @benoitrolland, thanks for the feedback.

Could you please provide me with an example file showing what you want to exclude? I'm not an expert in all the formats handled by po4a and often struggle to generate such test cases myself.

Thanks,

@jnavila
Collaborator

jnavila commented Mar 27, 2023

Hi @benoitrolland

Referring to your points, first please understand that cross-references are inline formatting and, as such, they are not modified in the string to translate. Take for example the following text:

See the <<URLS,GIT URLS>> section below for more information on specifying repositories.

We want to keep the cross-reference and its formatting. The PO format is not powerful enough (and was not designed) to handle attributes in inline text. It is up to the translator to understand this formatting and make the translation follow the same pattern. In some languages the translation may completely change how the sentence is written and where the cross-reference sits.

As for your points:

  • take in account "Various things" since only the reference key "img-Things-various" seems to be candidate for translation. ref: [[img-Things-various, Various things]]
    Anchor names are not translated by default. When internationalizing AsciiDoc sources, the anchors in the text should be made formal and should not rely on strings that get translated.
  • only take in account the string value ("Sketches: ")of <<ill-sketches-intro,Sketches: >>
    This is a corner case of the more general case where the cross-reference appears in the middle of a sentence and requires context to be translated.
  • ignore key-only references like: <<img-Things-various>>
    I don't think the proposed string refers to the key-only reference; rather, it is the string of the picture description. As already said, cross-references are passed "as is".

Hope this clarifies your issues.

@benoitrolland
Author

From an AsciiDoc file containing:

    [[ill-sketches-intro,Sketches]]
    [NOTE,icon=texte-introduction.svg]
    .Sketches
    ====
    . <<img-Things-various>>
    ====

    (...)

    [[img-Things-various, Various things]]
    .<<ill-sketches-intro,Sketches: >><<img-Things-various>>
    [caption=""]
    image::intro/07ThingsVarious.jpg[img-Things-various,180,100,float="left",align="center"]

Using po4a version 0.69,

the generated *.pot file content is:

    #. type: Block title
    #: bookname.adoc.pp:343
    #, no-wrap
    msgid "<<ill-sketches-intro,Sketches: >><<img-Things-various>>"
    msgstr ""
    
    #. type: Positional ($1) AttributeList argument for macro 'image'
    #: bookname.adoc.pp:345
    #, no-wrap
    msgid "img-Things-various"
    msgstr ""
    
    #. type: Target for macro image
    #: bookname.adoc.pp:345
    #, no-wrap
    msgid "intro/07ThingsVarious.jpg"
    msgstr ""

    #. type: Block title
    #: bookname.adoc.pp:2469
    #: bookname.adoc.pp:2483
    #: bookname.adoc.pp:2492
    #: bookname.adoc.pp:2517
    #: bookname.adoc.pp:2545
    #: bookname.adoc.pp:2576
    #, no-wrap
    msgid "Sketches"
    msgstr ""

    #. type: delimited block =
    #: bookname.adoc.pp:2472
    msgid "<<img-Things-various>>"
    msgstr ""

Following the available Locale::Po4a::AsciiDoc.3pm documentation, I could declare my image macro in the AsciiDoc source so that it is not translated, like this: //po4a: macro image[]

The po4a generated *.pot file now contains:

    #. type: Block title
    #: bookname.adoc.pp:343
    #, no-wrap
    msgid "<<ill-sketches-intro,Sketches: >><<img-Things-various>>"
    msgstr ""
    
    #. type: Block title
    #: bookname.adoc.pp:2469
    #: bookname.adoc.pp:2483
    #: bookname.adoc.pp:2492
    #: bookname.adoc.pp:2517
    #: bookname.adoc.pp:2545
    #: bookname.adoc.pp:2576
    #, no-wrap
    msgid "Sketches"
    msgstr ""

    #. type: delimited block =
    #: bookname.adoc.pp:2472
    msgid "<<img-Things-various>>"
    msgstr ""

But would you know how to make the *.po translation file:

  • take into account "Various things", since the reference key "img-Things-various" seems to be the candidate for translation instead. ref: [[img-Things-various, Various things]]
  • only take into account the string value ("Sketches: ") of <<ill-sketches-intro,Sketches: >>
  • ignore key-only references like: <<img-Things-various>>

Simply put, how can cross-reference IDs, anchor IDs and custom IDs be excluded from the *.po(t) translation file generated by po4a?
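
To make the three requests above concrete, here is a minimal Python sketch of the desired extraction. It is not po4a code (po4a is written in Perl) and it does not reflect how the AsciiDoc module parses its input; the two regular expressions and the function name `translatable_parts` are assumptions made for illustration, covering only the `[[id, label]]` and `<<id,label>>` forms shown in this issue.

```python
import re

# Hypothetical patterns for the two constructs discussed in this issue.
# [[anchor-id, optional reference text]]
ANCHOR = re.compile(r"\[\[([^\],]+)(?:,\s*([^\]]*))?\]\]")
# <<target-id,optional label>>
XREF = re.compile(r"<<([^>,]+)(?:,([^>]*))?>>")


def translatable_parts(line: str) -> list[str]:
    """Return only the human-readable labels found in anchors and xrefs.

    IDs are never returned, and key-only references yield nothing.
    """
    parts = []
    for pattern in (ANCHOR, XREF):
        for match in pattern.finditer(line):
            label = match.group(2)
            if label and label.strip():
                parts.append(label.strip())
    return parts
```

With this sketch, `[[img-Things-various, Various things]]` yields only `Various things`, `<<ill-sketches-intro,Sketches: >>` yields only `Sketches:`, and `<<img-Things-various>>` yields nothing.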

@benoitrolland
Author

benoitrolland commented Mar 30, 2023

Thank you for reading my case. I understand, but this seems to disqualify po4a/AsciiDoc as a candidate for fully automated translation...

@jnavila
Collaborator

jnavila commented Mar 31, 2023

Sorry to read that.

One point I'd like to mention though: you won't find any translation tool that can "exclude cross-references and anchors ids as well as custom-ids", because the translation tool needs this information to correctly guide the translator in converting the original anchors and cross-references into the translated ones.

@benoitrolland
Author

benoitrolland commented Mar 31, 2023

Yes, but a scenario where isolated text is first translated and then reviewed might help AsciiDoc/po4a adapt to new challenges in automation, not to mention that text references are often used for isolated text.
Besides, some tools such as deepl.com are smart enough not to translate elements like <<img-Things-various>>
Maybe it could evolve to translate Sketches: when it appears within Asciidoctor elements like <<ill-sketches-intro,Sketches: >><<img-Things-various>>
The remaining problem in that case is when string like "Various Things" is not reported at all in the .pot/.po file (like in [[img-Things-various, Various things]])

@jnavila
Collaborator

jnavila commented Mar 31, 2023

The remaining problem in that case is when string like "Various Things" is not reported at all in the .pot/.po file (like in [[img-Things-various, Various things]])

This seems to be a bug, indeed. I'll look into it.

@mquinson
Owner

mquinson commented Apr 13, 2023

Hello,

I'm a bit late to the party here, but I wanted to mention that the XML module of po4a has a notion of placeholder: a specific tag and all its content can be hidden by po4a and replaced with <place attr="thetagtoprotect" id="0"> to ensure that (1) it won't bother the translators, (2) the translators will not try to translate it when they should not, and (3) they won't break the content formatting. Indeed, po4a then checks that the translated string still contains the placeholders it expects when reinjecting the content.
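
The kind of check described here can be sketched in a few lines of Python. This is only an illustration, not po4a's actual implementation (po4a is written in Perl): the function name `placeholders_preserved` is hypothetical, and the pattern assumes placeholders have exactly the `<place attr="..." id="N">` shape quoted above.

```python
import re

# Assumed placeholder shape, copied from the example in this comment.
PLACEHOLDER = re.compile(r'<place attr="[^"]*" id="\d+">')


def placeholders_preserved(msgid: str, msgstr: str) -> bool:
    """True when msgstr holds exactly the placeholders found in msgid.

    The order may differ, since a translation can reorder the sentence,
    but no placeholder may be lost, altered, or duplicated.
    """
    return sorted(PLACEHOLDER.findall(msgid)) == sorted(PLACEHOLDER.findall(msgstr))
```

A translation that drops or rewrites a placeholder fails the check, so it can be rejected before the hidden content is reinjected.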

Maybe something similar could be done here. For example, some text <<ill-sketches-intro,Sketches: >> blah <<img-Things-various>> could result in the following PO chunk:

    msgid "some text [PLACEHOLDER 1] blah [PLACEHOLDER 2]"
    msgstr ""

    msgid "Sketches"
    msgstr ""

Also, without the surrounding text (i.e. for the content <<ill-sketches-intro,Sketches: >><<img-Things-various>>), we could avoid generating a msgid containing only placeholders and leave it out of the PO file.

I did not dig into the code, but I think that all this could be possible. If it does not make the code too ugly, that'd probably be a good set of improvements, don't you think? But my main concern here is that @benoitrolland was speaking of a fully automated translation process. If some robot changes PLACEHOLDER to e.g. ESPACE RÉSERVÉ (the French for placeholder), then the whole process would fail. I'm not sure how to "collaborate" with a fully automated translation system here.

Btw, if [...] does not sound very asciidoc-ish, do not hesitate to use another markup.
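
As a rough illustration of the proposal above, the following Python sketch masks cross-references with `[PLACEHOLDER n]`, emits each label as its own string, and skips a msgid that would contain nothing but placeholders. Nothing here exists in po4a: the function name `extract_msgids`, the regular expressions, and the placeholder syntax are all assumptions, and labels are kept as written apart from surrounding whitespace (so `Sketches: ` becomes `Sketches:`).

```python
import re

# Hypothetical pattern for <<target-id,optional label>> cross-references.
XREF = re.compile(r"<<([^>,]+)(?:,([^>]*))?>>")
# A string made only of placeholders and whitespace carries nothing to translate.
ONLY_PLACEHOLDERS = re.compile(r"(?:\s*\[PLACEHOLDER \d+\]\s*)+")


def extract_msgids(text: str) -> tuple[list[str], list[str]]:
    """Return (msgids, hidden).

    msgids: strings to offer for translation (masked text, then xref labels).
    hidden: the original cross-references, in placeholder order, kept for
            reinjection after translation.
    """
    hidden: list[str] = []
    labels: list[str] = []

    def hide(match: re.Match) -> str:
        hidden.append(match.group(0))
        label = (match.group(2) or "").strip()
        if label:
            labels.append(label)
        return f"[PLACEHOLDER {len(hidden)}]"

    masked = XREF.sub(hide, text)
    # Skip the masked string entirely when it holds only placeholders.
    msgids = [] if ONLY_PLACEHOLDERS.fullmatch(masked) else [masked]
    return msgids + labels, hidden
```

For the block title `<<ill-sketches-intro,Sketches: >><<img-Things-various>>`, only the label is returned as a msgid. Reinjecting translated labels into the hidden cross-references is left out of this sketch.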

@jnavila
Collaborator

jnavila commented Apr 14, 2023

Replacing AsciiDoc markup with another markup that no translation tool recognizes is useless. The replacement markup must be natively handled by PO tools. To this end, we can use placeholders borrowed from the programming languages natively handled by gettext, which do not interfere with AsciiDoc's own markup. For instance:

  • use python placeholders
    "this is a *strong* text" → "this is a {1}strong{2} text"
  • use xml tags
    "this is a *strong* text" → "this is a <tag1>strong<tag2> text"

I haven't yet looked at how common translation applications and automatic translation tools handle these tags, which would be needed to select the most widely supported tag system.
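
The first option can be sketched as follows in Python. This is a toy round trip for `*strong*` spans only, under the assumption that each delimiter gets its own numbered `{n}` placeholder as in the example above; the names `mask_strong` and `unmask_strong` are hypothetical and no such functions exist in po4a.

```python
import re

# Matches a *strong* span with no nested asterisks (simplified AsciiDoc).
STRONG = re.compile(r"\*([^*]+)\*")


def mask_strong(text: str) -> str:
    """Replace the delimiters of each *strong* span with numbered {n} placeholders."""
    counter = 0

    def repl(match: re.Match) -> str:
        nonlocal counter
        opening, closing = counter + 1, counter + 2
        counter += 2
        return f"{{{opening}}}{match.group(1)}{{{closing}}}"

    return STRONG.sub(repl, text)


def unmask_strong(text: str) -> str:
    """Restore the asterisks in a translated string.

    Every placeholder maps back to '*' because this sketch handles one
    kind of markup; a real tool would need a table of placeholder -> markup.
    """
    return re.sub(r"\{\d+\}", "*", text)
```

Here `mask_strong("this is a *strong* text")` gives `this is a {1}strong{2} text`, and a translation such as `ceci est un texte {1}fort{2}` is turned back into `ceci est un texte *fort*`.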

@mquinson
Owner

You are perfectly right. I personally tend to prefer the <placeholder id="1"> version, because I find it more explicit, but both of your proposals could do the trick. We could also add a comment to the msgid telling manual translators not to change these strings.

@silopolis
Contributor

silopolis commented Apr 15, 2023 via email

@mquinson
Owner

@silopolis do you have a link or two to describe what's existing in these TMS? Thx

@silopolis
Contributor

silopolis commented Apr 18, 2023 via email
