Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pp_ser: handle unicode characters in the source file with Python 3 #249

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

skosukhin
Copy link

@skosukhin skosukhin commented Feb 11, 2021

Currently, pp_ser fails to process source files containing Unicode character with Python 3. For example:

UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 604: ordinal not in range(128)

This is normally not a problem if we use Python 2. The reason is that in the case of Python 2 files are read and written using the same encoding locale.getpreferredencoding() (usually UTF-8) but in the case of Python 3, strings that are read with the locale.getpreferredencoding() encoding are written using ascii.

One way to fix the problem is to replace ascii with locale.getpreferredencoding() inside the to_ascii function (the name of the function is misleading in the case of Python2, therefore I would rename it to to_bytes).

The second approach, which is implemented in this PR, is to eliminate the explicit conversion of strings. All we need to do is to open the temporary file in the text mode (the default on is 'w+b').

@jenkins-apn
Copy link
Collaborator

Hi there, this is jenkins continuous integration...
Do you want me to verify this patch?

@skosukhin
Copy link
Author

It turns out that both of the suggested approaches fail with older versions of Python3 (e.g. 3.5.2) if the environment variable LC_ALL is set to C (the case when the script is run within Spack build environment, which is relevant for some users).

As a recap. Currently, we get the following when a source file contains Unicode characters:

Traceback (most recent call last):
  File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 1108, in <module>
    ser.preprocess()
  File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 950, in preprocess
    output_file.write(to_ascii(self.__outputBuffer))
  File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 53, in to_ascii
    return bytes(text, 'ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 9104: ordinal not in range(128)

The original solution of this PR with Python 3.5.2 and LC_ALL=C:

Traceback (most recent call last):
  File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 1101, in <module>
    ser.preprocess()
  File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 936, in preprocess
    self.parse()                # first pass, analyse only
  File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 901, in parse
    for line in input_file:
  File "/sw/rhel6-x64/python/python-3.5.2-gcc49/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 887: ordinal not in range(128)

The updated solution is to convert the input file to UTF-8 explicitly.

@skosukhin skosukhin changed the title pp_ser: avoid explicit conversion of native strings to bytes pp_ser: handle unicode characters in the source file with Python 3 Jun 1, 2021
@havogt
Copy link
Collaborator

havogt commented Jun 2, 2021

Hi @skosukhin, thanks for your contribution. And sorry that I didn't reply earlier. For contributions to a GridTools project we need you or your organization to sign a contributor license agreement with ETH. Probably the latter makes more sense, assuming we have more contributions in the future from different people. I'll check and come back to you.

@lxavier
Copy link

lxavier commented Jun 4, 2021

@havogt we need that for the os upgrade on Tsa (testing ongoing now) , would it be possible to go for an individual agreement first ? We were going to do it but jsut realised Sergey did it also.

@havogt
Copy link
Collaborator

havogt commented Jun 7, 2021

Sure, but not sure if that's actually faster, because @skosukhin probably needs approval from his employer anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants