Check pofile string delimiters #1151

rtobar · 2024-11-17T15:41:22Z

This PR adds checks to the pofile parser to validate that message strings are correctly delimited by double quotes. In keeping with the current design, an error is raised only if requested; otherwise a warning is printed, the faulty lines are corrected, and parsing continues.

I found this issue while processing a pofile used in the Spanish translation of the CPython documentation. One of our files was incorrectly written, and of all our tooling only the msgcat tool from GNU gettext complained, while babel, polib and others didn't. See python/python-docs-es#2873, izimobil/polib#161 and https://git.afpy.org/AFPy/powrap/pulls/4 for further reference.

While implementing this change I found that the _NormalizedString class was not only used to hold message lines, but also took part in the parsing process (and hid some of that parsing as well). I therefore split my changes into three separate commits:

I first clarified the usage of the current _NormalizedString class across the codebase (see details in the commit).
I then added the double-quote delimiter check to the parser.
Now that all strings have the same form, I constrained more formally how _NormalizedString behaves.

Along the way I also implemented three small quality-of-life changes. They are included as the first three commits of this PR; I'm happy to submit them separately if required:

Avoid re-compiling a regular expression
Remove a duplicate test assertion
Perform a fuller assertion in one particular test, presumably what was intended in the first place

While the re module caches some of the latest compilations, it's better form to not rely on it doing so. Signed-off-by: Rodrigo Tobar <[email protected]>
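A minimal sketch of the pattern this commit moves towards, not the actual babel code: the regular expression and the function name below are hypothetical and chosen only for illustration. The pattern is compiled once at import time and the compiled object is reused on every call, so nothing depends on the `re` module's internal cache.

```python
import re

# Hypothetical pattern, for illustration only: matches the index of a
# plural `msgstr[N]` line. Compiled once, at module import time.
_PLURAL_INDEX_RE = re.compile(r"^msgstr\[(\d+)\]")


def plural_index(line):
    """Return the plural index of a `msgstr[N]` line, or None if it is not one."""
    match = _PLURAL_INDEX_RE.match(line)
    return int(match.group(1)) if match else None
```

Calling `plural_index` repeatedly reuses `_PLURAL_INDEX_RE` instead of passing the pattern string to `re.match` each time.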

The exact same check is performed a few lines above. Signed-off-by: Rodrigo Tobar <[email protected]>

Since Python 2.3 sorted() has been guaranteed to be stable. The comment was wrong, and thus it makes sense to do the full assertion as clearly intended. Signed-off-by: Rodrigo Tobar <[email protected]>
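To illustrate why the full assertion is safe, here is a small sketch with made-up data (the tuples below are not taken from the babel test suite). Because the sort is stable, items that compare equal under the key keep their original relative order, so the complete expected list can be asserted rather than only part of it.

```python
# Made-up records: (name, count). Two pairs share the same count.
records = [("b", 2), ("a", 1), ("c", 1), ("d", 2)]

# Sort by count only; ties are not reordered because the sort is stable.
by_count = sorted(records, key=lambda item: item[1])

# "a" stays before "c" and "b" stays before "d", as in the input.
expected = [("a", 1), ("c", 1), ("b", 2), ("d", 2)]
assert by_count == expected
```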

The _NormalizedString helper class mixes responsibilities: it not only acts as a container for the potentially multiple lines of a single message string, but also does and hides some of the parsing of those strings. "Does" because it performs a .strip() on all incoming strings to remove any surrounding whitespace, and "hides" because when the "denormalize" method is invoked, each line is sliced to remove its first and last characters, which are implicitly assumed to be the string delimiters (double quotes, in principle).

These multiple roles have already led to confusion within the codebase as to how this class is supposed to be used. Its existing unit test doesn't provide strings with proper delimiters (and thus calling .denormalize() on these objects would return unexpected results, namely empty strings in all cases). Similarly, missing msgstr instances also result in a call to _NormalizedString(""), which does work but is conceptually incorrect, as the empty string is something that _NormalizedString should never see coming in.

This commit changes all the places where the _NormalizedString class is used in a confusing way. In particular, the strings in the existing unit test are now always delimited by double quotes (so calling .denormalize() on them yields the expected value). A number of new unit tests have also been added to exercise the denormalize() method, including the unescaping of escaped characters. Finally, the construction of an empty message string has been simplified to _NormalizedString(). Signed-off-by: Rodrigo Tobar <[email protected]>
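The effect of that implicit slicing can be seen in a small sketch. The function below is a hypothetical stand-in, not babel's denormalize(): it only drops the first and last character of each line and joins the results, and it leaves out unescaping entirely.

```python
def naive_denormalize(lines):
    """Join lines after dropping the first and last character of each one.

    Hypothetical helper: it assumes, without checking, that those two
    characters are the double-quote delimiters.
    """
    return "".join(line[1:-1] for line in lines)


# Properly delimited lines lose only their quotes.
well_formed = ['"Hello, "', '"world"']
good_result = naive_denormalize(well_formed)  # 'Hello, world'

# Badly delimited lines silently lose real content instead.
malformed = ['Hello, "', '"world']
bad_result = naive_denormalize(malformed)  # 'ello, worl'
```

No error is raised in the second case; the content is just trimmed incorrectly, which is why the delimiters need to be checked before the lines reach the container.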

Strings should be delimited on both ends by double quotes, but a violation of this rule is currently not detected, and content is simply trimmed incorrectly. This commit adds a check that each string starts and ends with a double quote character, issuing a warning or an error if that's not the case (and fixing the string as appropriate). A few new test cases have been added to check that strings lacking their double-quote delimiters produce errors as expected. Signed-off-by: Rodrigo Tobar <[email protected]>
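A rough sketch of such a check, under stated assumptions: the function name, the exception class and the message text are hypothetical and do not reproduce the code in this PR. Only the `abort_invalid` flag name is borrowed from babel's `read_po`. The sketch also ignores the case of a trailing quote that is itself escaped.

```python
import warnings


class PoSyntaxError(Exception):
    """Hypothetical error type used when aborting on invalid input."""


def check_delimiters(line, lineno, abort_invalid=False):
    """Return `line` with double-quote delimiters on both ends.

    A well-formed line is returned unchanged. A malformed one raises
    PoSyntaxError if `abort_invalid` is true; otherwise a warning is
    issued and the missing quote(s) are added.
    """
    line = line.strip()
    if len(line) >= 2 and line[0] == '"' and line[-1] == '"':
        return line

    message = f"line {lineno}: string not delimited by double quotes: {line!r}"
    if abort_invalid:
        raise PoSyntaxError(message)
    warnings.warn(message)

    # Correct the line so that parsing can go on.
    if not line.startswith('"'):
        line = '"' + line
    if len(line) < 2 or not line.endswith('"'):
        line = line + '"'
    return line
```

With the default `abort_invalid=False`, `check_delimiters('hello"', 3)` warns and returns `'"hello"'`; with `abort_invalid=True` the same input raises instead.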

Now that all strings given as input to _NormalizedString have been verified (or corrected) to be correctly delimited with double quotes, there's no reason to keep doing an internal strip. Moreover, we can express this internal constraint with an assertion to avoid issues in the future. Signed-off-by: Rodrigo Tobar <[email protected]>
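As an illustration of expressing that constraint with an assertion, here is a sketch of a container that trusts its callers to pass validated lines. The class name `QuotedLines` and its methods are hypothetical and much simpler than the real class (no unescaping, no comparison operators).

```python
class QuotedLines:
    """Hypothetical container for the lines of one message string.

    Every line must already be delimited by double quotes; the caller
    (the parser) is responsible for checking or correcting that.
    """

    def __init__(self, *lines):
        self._lines = []
        for line in lines:
            self.append(line)

    def append(self, line):
        # Internal invariant, not input validation: no strip() here.
        assert len(line) >= 2 and line[0] == '"' and line[-1] == '"', line
        self._lines.append(line)

    def denormalize(self):
        # Safe to slice: the assertion above guarantees the delimiters.
        return "".join(line[1:-1] for line in self._lines)
```

An empty message is simply `QuotedLines()`, which denormalizes to the empty string. Since assertions are skipped when Python runs with `-O`, this documents an internal invariant rather than replacing the parser's own check.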

In `library/re.po` there was an entry that was not correctly delimited with double quotes (in the full diff it is the last entry, or you can simply look at the first commit of this PR). This caused `powrap --check` to skip the file without validating it. That, in turn, happened because the `msgcat` utility from `gettext` detected the syntax error and failed when run. `powrap` did not take those errors into account when computing the process exit code, so the file not only remained invalid but was also never checked. Likewise, the file could not be wrapped correctly with `powrap library/re.po`. I opened a PR against `powrap` to change this behavior at https://git.afpy.org/AFPy/powrap/pulls/4 (update: that PR has been merged and a new version of powrap has been released, so in this PR I also updated our powrap dependency and the powrap pre-commit hook).

On the other hand, the rest of our tools did *not* consider this file invalid. This is because `polib` did not perform the corresponding validation and parsed the entry incorrectly. I also opened a PR against polib for this at izimobil/polib#161. Update: in the meantime I also realized that the `babel` package suffers from the same problem (I had incorrectly assumed that babel depended on polib); PR created against babel: python-babel/babel#1151.

After fixing the syntax error, I ran powrap so that `library/re.po` is now properly formatted. --------- Signed-off-by: Rodrigo Tobar <[email protected]>

rtobar · 2024-11-21T07:21:45Z

Gentle ping, at least to kick off CI and check whether there are any obvious mistakes to fix.

rtobar added 6 commits November 17, 2024 23:22

Avoid re-compiling regular expression

43c5386

Remove duplicate test assertion

df1072f

Perform full intended assertion in test

8f736f5

This was referenced Nov 17, 2024

Arregla y wrapea library/re.po python/python-docs-es#2873

Merged

Detect and report incorrectly delimited strings izimobil/polib#161

Open
