pp_ser: handle unicode characters in the source file with Python 3 #249

skosukhin · 2021-02-11T13:16:11Z

Currently, pp_ser fails to process source files containing Unicode character with Python 3. For example:

UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 604: ordinal not in range(128)

This is normally not a problem if we use Python 2. The reason is that in the case of Python 2 files are read and written using the same encoding locale.getpreferredencoding() (usually UTF-8) but in the case of Python 3, strings that are read with the locale.getpreferredencoding() encoding are written using ascii.

One way to fix the problem is to replace ascii with locale.getpreferredencoding() inside the to_ascii function (the name of the function is misleading in the case of Python2, therefore I would rename it to to_bytes).

The second approach, ~~which is implemented in this PR~~, is to eliminate the explicit conversion of strings. All we need to do is to open the temporary file in the text mode (the default on is 'w+b').

jenkins-apn · 2021-02-11T13:20:03Z

Hi there, this is jenkins continuous integration...
Do you want me to verify this patch?

skosukhin · 2021-06-01T15:18:09Z

It turns out that both of the suggested approaches fail with older versions of Python3 (e.g. 3.5.2) if the environment variable LC_ALL is set to C (the case when the script is run within Spack build environment, which is relevant for some users).

As a recap. Currently, we get the following when a source file contains Unicode characters:

Traceback (most recent call last):
  File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 1108, in <module>
    ser.preprocess()
  File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 950, in preprocess
    output_file.write(to_ascii(self.__outputBuffer))
  File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 53, in to_ascii
    return bytes(text, 'ascii')
UnicodeEncodeError: 'ascii' codec can't encode character '\u2013' in position 9104: ordinal not in range(128)

The original solution of this PR with Python 3.5.2 and LC_ALL=C:

Traceback (most recent call last):
  File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 1101, in <module>
    ser.preprocess()
  File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 936, in preprocess
    self.parse()                # first pass, analyse only
  File "./serialbox/src/serialbox-python/pp_ser/pp_ser.py", line 901, in parse
    for line in input_file:
  File "/sw/rhel6-x64/python/python-3.5.2-gcc49/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 887: ordinal not in range(128)

The updated solution is to convert the input file to UTF-8 explicitly.

havogt · 2021-06-02T07:04:08Z

Hi @skosukhin, thanks for your contribution. And sorry that I didn't reply earlier. For contributions to a GridTools project we need you or your organization to sign a contributor license agreement with ETH. Probably the latter makes more sense, assuming we have more contributions in the future from different people. I'll check and come back to you.

lxavier · 2021-06-04T09:39:34Z

@havogt we need that for the os upgrade on Tsa (testing ongoing now) , would it be possible to go for an individual agreement first ? We were going to do it but jsut realised Sergey did it also.

havogt · 2021-06-07T07:05:05Z

Sure, but not sure if that's actually faster, because @skosukhin probably needs approval from his employer anyway.

pp_ser: avoid explicit conversion of native strings to bytes

ab62eb3

pp_ser: compatibility with old Python3 installations

08e470f

skosukhin changed the title ~~pp_ser: avoid explicit conversion of native strings to bytes~~ pp_ser: handle unicode characters in the source file with Python 3 Jun 1, 2021

pp_ser: handle UTF-8 conversion errors

e165cb5

skosukhin mentioned this pull request Oct 14, 2022

serialbox: update patch to handle UTF-8 conversion errors spack/spack#33317

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pp_ser: handle unicode characters in the source file with Python 3 #249

pp_ser: handle unicode characters in the source file with Python 3 #249

skosukhin commented Feb 11, 2021 •

edited

Loading

jenkins-apn commented Feb 11, 2021

skosukhin commented Jun 1, 2021

havogt commented Jun 2, 2021

lxavier commented Jun 4, 2021 •

edited

Loading

havogt commented Jun 7, 2021

pp_ser: handle unicode characters in the source file with Python 3 #249

Are you sure you want to change the base?

pp_ser: handle unicode characters in the source file with Python 3 #249

Conversation

skosukhin commented Feb 11, 2021 • edited Loading

jenkins-apn commented Feb 11, 2021

skosukhin commented Jun 1, 2021

havogt commented Jun 2, 2021

lxavier commented Jun 4, 2021 • edited Loading

havogt commented Jun 7, 2021

skosukhin commented Feb 11, 2021 •

edited

Loading

lxavier commented Jun 4, 2021 •

edited

Loading