Translations source copy check #13419
base: 2.4
d83d292
to
7afcc71
Compare
3da94ca
to
a2cdac2
Compare
…le multi threading
a2cdac2
to
a26bc01
Compare
I have now finished cleaning the translations of false source copies. The resulting source_copy_allow_list.xml is up to date with all allowed source copies, as far as I was able to check.
.pre-commit-config.yaml
Outdated
-    exclude: ^(packaging/wix/LICENSE.rtf.in|src/dialog/dlgabout\.cpp|.*\.(?:pot?|(?<!d\.)ts|wxl|svg))$
+    exclude: ^(packaging/wix/LICENSE.rtf.in|src/dialog/dlgabout\.cpp|.*\.(?:pot?|ts|wxl|svg))$
This excludes the .d.ts files used for the controller API.
same issues below
all ts files are excluded but not "d.ts" files so it should be OK
Ah, I see. Thank you. fixed.
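To make the point of this thread concrete, here is a small sketch that runs both `exclude` patterns quoted above against two file paths. The paths are hypothetical and only serve as illustration, and the sketch assumes that the pattern is applied with `re.search` semantics.

```python
import re

# The two `exclude` patterns quoted in this review thread.
WITH_LOOKBEHIND = (
    r"^(packaging/wix/LICENSE.rtf.in|src/dialog/dlgabout\.cpp"
    r"|.*\.(?:pot?|(?<!d\.)ts|wxl|svg))$"
)
WITHOUT_LOOKBEHIND = (
    r"^(packaging/wix/LICENSE.rtf.in|src/dialog/dlgabout\.cpp"
    r"|.*\.(?:pot?|ts|wxl|svg))$"
)

# Hypothetical paths, for illustration only.
TRANSLATION = "res/translations/mixxx_de.ts"
DECLARATION = "res/controllers/engine-api.d.ts"


def is_excluded(pattern: str, path: str) -> bool:
    """Return True if `path` matches the exclude pattern (re.search assumed)."""
    return re.search(pattern, path) is not None


for label, pattern in (("with lookbehind", WITH_LOOKBEHIND),
                       ("without lookbehind", WITHOUT_LOOKBEHIND)):
    for path in (TRANSLATION, DECLARATION):
        print(f"{label}: {path} excluded={is_excluded(pattern, path)}")
```

With the negative lookbehind `(?<!d\.)`, only the translation file is excluded; without it, the `.d.ts` file is excluded as well, which is what the reviewer objected to.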
<source>Param EQ</source>
<allow_all_languages>true</allow_all_languages>
questionable
What do you mean? Any suggestion?
Well, in other languages, it may be translated differently. I don't think we can make the blanket statement that it's "Param EQ" in all languages. This concern applies to many strings here.
<source>Loudness</source>
<allow_all_languages>true</allow_all_languages>
questionable
I'm not sure I like this allowlist approach. There are many strings which are debatable and having to maintain this giant list is not great either.
Also, XML diffs are painful to review.
As I understand it, XML is painful to review when a tool restructures it. That should not happen here, because new entries are only appended.
I picked XML because the ts files are also XML.
Any suggestions?
@Swiftb0y what could be the alternative?
If we really want an allow list (not sure if we really want this), let's use a plain text file with one source per line, followed by a tab character and then a comma separated list of fnmatch expressions:
Phase allowed for en and German variants en,de*
This is allowed for all languages. *
That would be much less verbose than a huge XML.
I have little interest in writing a custom parser. Can we settle on an established format?
> not sure if we really want this
Can you confirm the issue? Is there an alternative way to distinguish wanted from unwanted source copies?
I have checked the ts format and Transifex, but there is nothing we can use as a flag.
Custom parser would be straightforward:
import fnmatch
import pathlib
import typing


def parse_allowlist(path: pathlib.Path) -> typing.Iterable[tuple[str, list[str]]]:
    with path.open(mode="r", encoding="utf-8") as fp:
        for line in fp:
            # One entry per line: source text, a tab, then comma-separated fnmatch patterns.
            source, _, langstr = line.rstrip("\n").partition("\t")
            assert source
            langs = langstr.split(",")
            assert langs
            yield source, langs


allowlist = dict(parse_allowlist(pathlib.Path("path/to/allowlist")))
# Check if (current_source, current_lang) is on allowlist
is_allowed = any(fnmatch.fnmatchcase(current_lang, lang) for lang in allowlist.get(current_source, []))
(wrote this on my phone, so it's untested)
Btw, I can't open the allowlist in the GitHub app on my phone because it is already too large.
The allow list currently has 3755 lines, at ~4 lines per string; we would still have almost 2000 lines. Still long.
Any custom format is not well defined, not extensible, and has no escape rules, whereas all these issues are solved with XML, because this format is also used for the source TS files.
Since the script extends this file automatically, there is no need to build it by hand. We just need to confirm new entries, which are only new lines and do not suffer from any review issues.
I am not convinced to replace XML.
If the lack of definition and the unwillingness to write a parser is the problem, just use CSV with two columns... The tree nature of XML is overkill.
> The allow list has currently 3755 lines with ~4 lines per string we will still almost 2000 lines. Still long.
Where does the 4 lines per string come from? With my proposal it would be 1 line per allowed string (you can list multiple allowed languages in the same line)
@Swiftb0y actually the format I proposed is already CSV with a tab delimiter (or TSV). You could probably already use the stdlib csv module with delimiter='\t' instead of str.partition if there are multiline strings.
TOML would also work (see tomllib in the Python stdlib), but that would not be much shorter than XML (but way more readable).
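As a rough illustration of the csv-module suggestion, the sketch below reads a tab-delimited allow list in which a quoted field may span several lines. The column order (source first, then languages) follows the proposal in this thread; the sample entries and the `read_allowlist` helper are made up for the example.

```python
import csv
import io

# Hypothetical sample data: source text, a tab, then comma-separated patterns.
# The second entry is quoted because its source text spans two lines.
SAMPLE = 'Phase\ten,de*\n"Line one\nline two"\t*\n'


def read_allowlist(fp) -> dict[str, list[str]]:
    """Map each source string to its list of allowed language patterns."""
    allowlist: dict[str, list[str]] = {}
    for row in csv.reader(fp, delimiter="\t"):
        if not row:  # skip blank lines
            continue
        source, langs = row
        allowlist[source] = langs.split(",")
    return allowlist


# newline="" keeps embedded newlines intact, as the csv docs recommend.
allowlist = read_allowlist(io.StringIO(SAMPLE, newline=""))
print(allowlist)
```

Compared with `str.partition`, the csv reader takes care of quoting and escaping, which addresses the "no escape rules" objection without a custom parser.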
The nice thing about TSV is that it's rendered as a table by github when you use the Web UI (does not work in the Android app unfortunately).
Example: https://github.com/Holzhaus/helicon/blob/main/mapping.tsv
Missing context here. Under which circumstances are English source strings copied to the translated target language? Is it a manual thing from a Transifex user? Or why does this happen?
Yes, it is a manual thing done by a user. If they are too lazy and want to gain some more percentage, they seem to just copy the source strings.
In that case I'm questioning if we really want to check it on pull/during committing. Can this somehow be prevented on Transifex? Or maybe a monthly check which opens a GitHub issue if necessary? Because you cannot really fix the commit locally anyway. Instead you need to go to Transifex and remove translations and then perform a fresh tx pull, or am I misunderstanding this?
pre-commit takes care that the check is only done if one is committing changes to the ts files. This is the right moment to reject false translations.
Correct. I think the CI of the new automatically created PR will also fail in that case, which is desired, right?
9a47c31
to
a6e2409
Compare
Done.
a6e2409
to
f97a0c7
Compare
@@ -0,0 +1,436 @@
lang source
If you swap the columns, it's much more readable IMHO.
Unfortunately not. This was my first version and I have swapped columns to have languages aligned.
nl is
nl Cover
nl Track BPM:
nl Artist + Title
Isn't "Titel" the Dutch word?
Probably yes. I will remove the translation.
vi Shuffle
vi Relink
nl Lossy
nl Lossless
Isn't this "exact omkeerbaar"?
de,nl Decks
de,nl Track
de,nl Tracks
de Add Crate as Track Source
Nope
fixed
tools/ts_source_copy_check.py
Outdated
if source == source_text:
    if lang == "*":
        return True
    if language in lang:
This means that en will also match if the string is de,en_US,fr although en is not in the list.
-    if language in lang:
+    if language in lang.split(","):
Even better would be to use fnmatch as I suggested before, because if you really want to match all English dialects, you could write de,en*,fr without having to list each and every one of them.
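The difference between the three matching strategies discussed in this comment can be sketched as follows. The helper names are invented for the example and are not taken from tools/ts_source_copy_check.py.

```python
import fnmatch


def matches_substring(language: str, langs: str) -> bool:
    """Substring test, as in the reviewed line: also matches partial codes."""
    return language in langs


def matches_split(language: str, langs: str) -> bool:
    """Exact match against each comma-separated entry."""
    return language in langs.split(",")


def matches_fnmatch(language: str, langs: str) -> bool:
    """Glob-style match, so that en* covers every English variant."""
    return any(fnmatch.fnmatchcase(language, pattern)
               for pattern in langs.split(","))


print(matches_substring("en", "de,en_US,fr"))   # substring of "en_US"
print(matches_split("en", "de,en_US,fr"))       # no exact entry "en"
print(matches_fnmatch("en_US", "de,en*,fr"))    # covered by the en* pattern
```

The substring test accepts `en` because it occurs inside `en_US`; splitting first avoids that, and `fnmatch.fnmatchcase` additionally allows wildcard entries.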
    return False


def add_to_allow_list(source_text, language):
This function basically duplicates the parsing logic from above.
tools/ts_source_copy_check.py
Outdated
if ret:
    print(
        "\n"
        "All not allowed copied source translations need to be removed"
-        "All not allowed copied source translations need to be removed"
+        "All disallowed copied source translations need to be removed"
tools/ts_source_copy_check.py
Outdated

if ret:
    print(
        "\n"
-        "\n"
This is intended, to keep some distance from the individual complaints.
f97a0c7
to
233777b
Compare
Done
When a source string is copied into the translation, it is listed as translated=yes and will likely not be translated later.
This is the issue this PR tries to fix.
On the other hand, there are strings like "1/4" which are the same in all languages, and terms that happen to be the same in the target language.
It is solved by checking all new translations for source == translation. If this is the case, an allow list is consulted and, if the string is not listed there, the commit is rejected. This then needs to be fixed at Transifex or, if it is a valid untranslated string, the allow list has to be extended.
Over the last few days I have used the script to get Transifex into good shape. But that was really tedious work, especially because these bogus translations pop up again from the translation memory when not explicitly deleted for each language.
This script will hopefully prevent future faults.
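The check described above can be sketched roughly as follows. This is not the PR's tools/ts_source_copy_check.py, only a minimal illustration under stated assumptions: the sample TS content, the in-memory allow list, and the `find_disallowed_copies` helper are all hypothetical, and the allow list is assumed to map a source string to a list of fnmatch language patterns.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Hypothetical, minimal Qt TS document for the language "de".
SAMPLE_TS = """\
<TS version="2.1" language="de">
<context>
    <name>ExampleContext</name>
    <message>
        <source>1/4</source>
        <translation>1/4</translation>
    </message>
    <message>
        <source>Loudness</source>
        <translation>Loudness</translation>
    </message>
    <message>
        <source>Artist</source>
        <translation>Interpret</translation>
    </message>
</context>
</TS>
"""

# Hypothetical allow list: "1/4" may stay untranslated in every language.
ALLOWLIST = {"1/4": ["*"]}


def find_disallowed_copies(ts_xml: str, allowlist: dict[str, list[str]]) -> list[str]:
    """Return source strings copied verbatim into the translation without permission."""
    root = ET.fromstring(ts_xml)
    language = root.get("language", "")
    disallowed = []
    for message in root.iter("message"):
        source = message.findtext("source", default="")
        translation = message.findtext("translation", default="")
        if not translation or source != translation:
            continue  # untranslated or genuinely translated
        patterns = allowlist.get(source, [])
        if not any(fnmatch.fnmatchcase(language, p) for p in patterns):
            disallowed.append(source)
    return disallowed


disallowed = find_disallowed_copies(SAMPLE_TS, ALLOWLIST)
print(disallowed)
```

A pre-commit hook built on this idea would exit with a non-zero status whenever the returned list is non-empty, which is what rejects the commit.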