Automatically detect character encoding of YAML files and ignore files #630

Open
Jayman2000 wants to merge 6 commits into adrienverge:master from Jayman2000:auto-detect-encoding

Conversation

@Jayman2000 (Contributor) commented on Jan 3, 2024:

This PR makes sure that yamllint never uses open()’s default encoding. Specifically, it uses the character encoding detection algorithm specified in chapter 5.2 of the YAML spec when reading both YAML files and files that are on the ignore-from-file list.
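
For readers unfamiliar with that algorithm, here is a rough sketch of the idea in Python (illustrative only; the function name and details are assumptions, not the actual code in yamllint/decoder.py):

import codecs

def detect_yaml_encoding(data):
    """Guess the codec of a YAML byte stream, following YAML 1.2, chapter 5.2.

    Sketch only: assumes the stream begins with a byte order mark or an
    ASCII character, as the spec requires.
    """
    # Byte order marks. Check the 4-byte UTF-32 BOMs before the 2-byte
    # UTF-16 ones, because the UTF-32 LE BOM starts with the UTF-16 LE BOM.
    if data.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return 'utf_32'      # this codec consumes the BOM itself
    if data.startswith(codecs.BOM_UTF8):
        return 'utf_8_sig'   # strips the UTF-8 BOM while decoding
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return 'utf_16'
    # No BOM: deduce the encoding from the pattern of null (x00) bytes
    # around the first (ASCII) character.
    if data[:3] == b'\x00\x00\x00':
        return 'utf_32_be'
    if data[1:4] == b'\x00\x00\x00':
        return 'utf_32_le'
    if data[:1] == b'\x00':
        return 'utf_16_be'
    if data[1:2] == b'\x00':
        return 'utf_16_le'
    return 'utf_8'

The detected name can then be used to decode the whole byte string, for example data.decode(detect_yaml_encoding(data)).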

There are two other PRs that are similar to this one. Here’s how this PR compares to those two:

  • This PR doesn’t have any merge conflicts.
  • This PR has a cleaner commit history. You can run the tests and flake8 on each commit in this PR, and they’ll report no errors. I don’t think that you can do that with Detect encoding per yaml spec (fix #238) #240.
  • This PR has longer commit messages. I really tried to explain why I think that my changes make sense.
  • This PR detects the encoding of files being linted, config files, and files on the ignore-from-file list. Those two PRs only detect the encoding of files being linted.
  • The Detect encoding per yaml spec (fix #238) #240 PR adds a dependency on chardet. This PR doesn’t add any dependencies.
  • This PR only supports UTF-8, UTF-16 and UTF-32. Both of those PRs support additional encodings.
  • Unicode yaml #581 adds support for running tests on Windows. This PR doesn’t.
  • The code that this PR adds to the yamllint package is simpler.
  • The code that this PR adds to the test package is much more complicated, but hopefully it tests things more thoroughly.

Fixes #218. Fixes #238. Fixes #347.
Closes #240. Closes #581.

@coveralls commented on Jan 3, 2024:

Coverage Status

coverage: 99.789% (-0.04%) from 99.825%
when pulling a6031a4 on Jayman2000:auto-detect-encoding
into 8513d9b on adrienverge:master.

@Jayman2000 force-pushed the auto-detect-encoding branch 4 times, most recently from 8cedbee to 3fa4c57, on January 10, 2024 at 12:40
@Jayman2000 force-pushed the auto-detect-encoding branch from 3fa4c57 to 75b2889 on January 13, 2024 at 11:46
@Jayman2000 force-pushed the auto-detect-encoding branch from 75b2889 to be0cc85 on January 20, 2024 at 12:04
@Jayman2000 force-pushed the auto-detect-encoding branch 2 times, most recently from fd2c72d to bb8dc2b, on February 8, 2024 at 16:01
@Jayman2000 (Contributor, Author) commented:

I just noticed that one of the checks for this PR is failing. The coverage for yamllint/config.py went down, but that’s just because the total number of relevant lines went down. There are only two lines that aren’t covered, and those same two lines aren’t covered on the master branch either. Is there anything that I need to do here?

@adrienverge (Owner) commented:

Is there anything that I need to do here?

At the moment, no. I'm sorry for the delay: this is a big change with a lot of impact, and I need a large time slot to review it, which I haven't found yet.

@adrienverge (Owner) left a comment:

Hello Jason, please excuse the very long delay for reviewing this... This was a big piece and I needed time. I apologize.

The 6 commits are well split, well explained, and make the review much easier. Thanks a lot!

In my opinion this PR is good to go. I suspect it can solve problems in several cases (including the issues you pointed out), but I also see a small risk of breakage on exotic systems the day it's released. If this happens, will you be around to help find a solution?

A few notes:

  • I notice that you used encoding names with underscores (e.g. utf_8 vs. utf-8). I just read on https://docs.python.org/fr/3/library/codecs.html#standard-encodings that not only are they valid, but they also seem to be the "right" notation (see the quick interpreter check after these notes):

    Notice that spelling alternatives that only differ in case or use a hyphen instead of an underscore are also valid aliases; therefore, e.g. 'utf-8' is a valid alias for the 'utf_8' codec.

  • I feared that using open().decode() would put the whole contents of files in memory before even linting them, and affect performance. But this is already what yamllint does currently.
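
Regarding the first note, a quick interpreter check (illustrative only, not code from this PR) confirms that both spellings resolve to the same codec and that Python reports the hyphenated form as the canonical name:

import codecs

print(codecs.lookup('utf_8').name)   # -> utf-8
print(codecs.lookup('UTF-8').name)   # -> utf-8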

Resolved review threads: tests/common.py (4, outdated), tests/test_decoder.py, yamllint/decoder.py (outdated).
Before this change, build_temp_workspace() would always encode a path
using UTF-8 and the strict error handler [1]. Most of the time, this is
fine, but systems do not necessarily use UTF-8 and the strict error
handler for paths [2].

[1]: <https://docs.python.org/3.12/library/stdtypes.html#str.encode>
[2]: <https://docs.python.org/3.12/glossary.html#term-filesystem-encoding-and-error-handler>
Before this commit, test_run_default_format_output_in_tty() changed the
values of sys.stdout and sys.stderr, but it would never change them
back. This commit makes sure that they get changed back.

At the moment, this commit doesn’t make a user-visible difference. A
future commit will add a new test named test_ignored_from_file_with_multiple_encodings().
That new test requires stdout and stderr to be restored, or else it will
fail.
Before this change, yamllint would open YAML files using open()’s
default encoding. As long as UTF-8 mode isn’t enabled, open() defaults
to using the system’s locale encoding [1][2]. This can cause problems in
multiple different scenarios.

The first scenario involves linting UTF-8 YAML files on Linux systems.
Most of the time, the locale encoding on Linux systems is set to UTF-8
[3][4], but it can be set to something else [5]. In the unlikely event
that someone was using Linux with a locale encoding other than UTF-8,
there was a chance that yamllint would crash with a UnicodeDecodeError.

The second scenario involves linting UTF-8 YAML files on Windows
systems. The locale encoding on Windows systems is the system’s ANSI
code page [6]. The ANSI code page on Windows systems is NOT set to UTF-8
by default [7]. In the very likely event that someone was using Windows
with a locale encoding other than UTF-8, there was a chance that
yamllint would crash with a UnicodeDecodeError.

Additionally, using open()’s default encoding is a violation of the YAML
spec. Chapter 5.2 says:

	“On input, a YAML processor must support the UTF-8 and UTF-16
	character encodings. For JSON compatibility, the UTF-32
	encodings must also be supported.

	If a character stream begins with a byte order mark, the
	character encoding will be taken to be as indicated by the byte
	order mark. Otherwise, the stream must begin with an ASCII
	character. This allows the encoding to be deduced by the pattern
	of null (x00) characters.” [8]

This change fixes all of those problems by implementing the YAML spec’s
character encoding detection algorithm. Now, as long as YAML files
begin with either a byte order mark or an ASCII character, yamllint
will automatically detect them as being UTF-8, UTF-16 or UTF-32. Other
character encodings are not supported at the moment.

Credit for the idea of having tests with pre-encoded strings goes to
@adrienverge [9].

Fixes adrienverge#218. Fixes adrienverge#238. Fixes adrienverge#347.

[1]: <https://docs.python.org/3.12/library/functions.html#open>
[2]: <https://docs.python.org/3.12/library/os.html#utf8-mode>
[3]: <https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html>
[4]: <https://wiki.musl-libc.org/functional-differences-from-glibc.html#Character-sets-and-locale>
[5]: <https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/SUPPORTED;h=c8b63cc2fe2b4547f2fb1bff6193da68d70bd563;hb=36f2487f13e3540be9ee0fb51876b1da72176d3f>
[6]: <https://docs.python.org/3.12/glossary.html#term-locale-encoding>
[7]: <https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page>
[8]: <https://yaml.org/spec/1.2.2/#52-character-encodings>
[9]: <adrienverge#630 (comment)>
Before this change, yamllint would decode files on the ignore-from-file
list using open()’s default encoding [1][2]. This can cause decoding to
fail in some situations (see the previous commit message for details).

This change makes yamllint automatically detect the encoding for files
on the ignore-from-file list. It uses the same algorithm that it uses
for detecting the encoding of YAML files, so the same limitations apply:
files must use UTF-8, UTF-16 or UTF-32 and they must begin with either a
byte order mark or an ASCII character.

[1]: <https://docs.python.org/3.12/library/fileinput.html#fileinput.input>
[2]: <https://docs.python.org/3.12/library/fileinput.html#fileinput.FileInput>
In general, using open()’s default encoding is a mistake [1]. This
change makes sure that every time open() is called, the encoding
parameter is specified. Specifically, it makes it so that all tests
succeed when run like this:

	python -X warn_default_encoding -W error::EncodingWarning -m unittest discover

[1]: <https://peps.python.org/pep-0597/#using-the-default-encoding-is-a-common-mistake>
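
For readers unfamiliar with PEP 597, here is a minimal illustration of what that command turns into a hard failure (the file name is hypothetical; this is not code from the PR). EncodingWarning exists since Python 3.10:

# demo.py; run as:
#     python -X warn_default_encoding -W error::EncodingWarning demo.py
with open('demo.txt', 'w', encoding='utf-8') as f:   # explicit encoding: no warning
    f.write('hello\n')

with open('demo.txt') as f:   # no encoding given: EncodingWarning, raised as an error
    print(f.read())
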
The previous few commits have removed all calls to open() that use its
default encoding. That being said, it’s still possible that code added
in the future will contain that same mistake. This commit makes it so
that the CI test job will fail if that mistake is made again.

Unfortunately, it doesn’t look like coverage.py allows you to specify -X
options [1] or warning filters [2] when running your tests [3]. To work
around this problem, I’m running all of the Python code, including
coverage.py itself, with -X warn_default_encoding and
-W error::EncodingWarning. As a result, the CI test job will also fail
if coverage.py uses open()’s default encoding. Hopefully, coverage.py
won’t do that. If it does, then we can always temporarily revert this
commit.

[1]: <https://docs.python.org/3.12/using/cmdline.html#cmdoption-X>
[2]: <https://docs.python.org/3.12/using/cmdline.html#cmdoption-W>
[3]: <https://coverage.readthedocs.io/en/7.4.0/cmd.html#execution-coverage-run>
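
In other words, the CI invocation would be along these lines (an illustrative sketch; the actual workflow configuration may differ):

	python -X warn_default_encoding -W error::EncodingWarning -m coverage run -m unittest discover
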
@Jayman2000 (Contributor, Author) commented on Nov 29, 2024:

Hello Jason, please excuse the very long delay for reviewing this... This was a big piece and I needed time. I apologize.

You shouldn’t be apologizing, I should be thanking you.

Thank you for taking the time to review this PR and to write valuable review comments. Sometimes, maintainers don’t take the time to thoroughly review my contributions. When that happens, my contributions end up being rejected without being understood, which is frustrating and saddening. Your review comments are different, though. They clearly demonstrate that you took the time to read, understand and think about this PR. I’d much rather receive a thoughtful review than a timely one.

In my opinion this PR is good to go. I suspect it can solve problems in several cases (including the issues you pointed out), but I also see a small risk of breakage on exotic systems the day it's released. If this happens, will you be around to help find a solution?

Sure! I had thought about adding a --force-encoding option that would disable encoding autodetection. The idea is that someone could use --force-encoding shift_jis to make yamllint decode everything using Shift JIS or --force-encoding cp1252 to make yamllint decode everything using code page 1252. I decided against adding the --force-encoding option in this PR because this PR was already so big.

Now that you mention the small risk of breakage on exotic systems, it’s making me think about the --force-encoding idea again. It might be wise to implement --force-encoding after this PR gets merged and before there’s a new stable release of yamllint. That way, we can give users an easy workaround if encoding autodetection breaks something. If you think that that is a good idea, then I can open another PR that adds a --force-encoding option after this PR gets merged.


I just pushed a new version of this pull request with the following changes:

  • I rebased it on 8513d9b (the tip of master, at the moment).
  • I implemented the review suggestions (see the conversations that just recently got resolved).
  • The variable in tests/common.py that used to be named test_strings is now called TEST_STRINGS_TO_ENCODE_AT_RUNTIME. I made this change for two reasons:
    1. I needed to add a new variable containing a different type of test string, in order to implement one of the suggestions from your review comments.
    2. The variable name should have been all uppercase anyway because the variable is a constant.
  • For similar reasons, I also renamed the error1 and error2 variables to ERROR1 and ERROR2 in the function that used to be named test_detect_encoding().
  • I added a repr() call to the msg="auto_decode({repr(input_bytes)}) returned the wrong value." line in tests/test_decoder.py. The repr() call helps ensure that the error message has proper syntax.
  • I fixed a dead link in one of the commit messages (https://sourceware.org/glibc/manual/html_node/Extended-Char-Intro.html → https://www.gnu.org/software/libc/manual/html_node/Extended-Char-Intro.html).
  • I revised the commit message for “decoder: Autodetect detect encoding of YAML files” to (hopefully) make it easier to understand for people who are unfamiliar with this issue.
  • I made the commit message for “decoder: Autodetect encoding for ignore-from-file” slightly simpler.
  • I improved the commit message for “CI: Fail when open()’s default encoding is used”. I reworded part of it in order to make it more clear why coverage.py is affected by default encoding errors.
  • I updated the copyright years for some files to account for the fact that I worked on them in both 2023 and 2024.

@adrienverge (Owner) left a comment:

Thanks for following up; again, it's great work.
The amended commits are still perfectly clear and anticipate some of my questions :)

Sure! I had thought about adding a --force-encoding option that would disable encoding autodetection. The idea is that someone could use --force-encoding shift_jis to make yamllint decode everything using Shift JIS or --force-encoding cp1252 to make yamllint decode everything using code page 1252. I decided against adding the --force-encoding option in this PR because this PR was already so big.

Now that you mention the small risk of breakage on exotic systems, it’s making me think about the --force-encoding idea again. It might be wise to implement --force-encoding after this PR gets merged and before there’s a new stable release of yamllint. That way, we can give users an easy workaround if encoding autodetection breaks something. If you think that that is a good idea, then I can open another PR that adds a --force-encoding option after this PR gets merged.

If I understand correctly, the YAML specification says that files should be Unicode-encoded, but in the real world not all YAML files are.
But I'm not sure whether today, with yamllint 1.35.1, it's possible to successfully run yamllint on non-Unicode-encoded YAML files. Do you know whether that's the case?
Here is what I tried on my system (a recent GNU/Linux):

  • Python's sys.getdefaultencoding() always equals utf-8, even when setting environment variables PYTHONIOENCODING=cp1252, PYTHONUTF8=0 and LANG=fr_FR.CP1252.
  • After creating a CP1252-encoded file (python -c 'with open("/tmp/cp1252.yaml", "wb") as f: f.write("- éçà".encode("cp1252"))'), running PYTHONIOENCODING=cp1252 yamllint /tmp/cp1252.yaml fails with UnicodeDecodeError: 'utf-8' codec can't decode ….

I'm not an expert of how Python handles encoding, so your input is more than welcome on this 🙂

  • If it's not the case, then there's no need for an extra --force-encoding option.
  • If it is the case, then I agree it would be better to add a way to avoid breakage for those users. However, I'm not a fan of a new command-line option (--force-encoding), because ideally it would be temporary and removed after a few yamllint versions (once users have fixed their files to use UTF-8), so removing the option later would break their command lines.
    Instead, we could use an environment variable YAMLLINT_IO_ENCODING, that detect_encoding() would use to override the encoding if defined (then we could unit-test 2 or 3 cases in a new test_detect_encoding_with_env_var_override()). Removing support for this env var in the future won't break command-line options. In my opinion, this should be fairly simple and should go in the same commit.
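
A rough sketch of that suggestion (hypothetical: neither the environment variable nor this exact function exists in yamllint today):

import codecs
import os

def detect_encoding(data):
    """Return the codec name to use for data (sketch of the suggested override)."""
    override = os.environ.get('YAMLLINT_IO_ENCODING')
    if override:
        # Trust the user-supplied encoding; codecs.lookup() fails loudly on typos.
        return codecs.lookup(override).name
    return 'utf-8'  # placeholder for the auto-detection this PR implements

A test like test_detect_encoding_with_env_var_override() could then patch os.environ (for example with unittest.mock.patch.dict) and assert that the override wins over auto-detection.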

I revised the commit message for “decoder: Autodetect detect encoding of YAML files”

I like the new version too, it's clearer about the Windows case. Did you mean “Autodetect detect” → “Autodetect”?

I improved the commit message for “CI: Fail when open()’s default encoding is used”. I reworded part of it in order to make it more clear why coverage.py is affected by default encoding errors.

Indeed, it's slightly clearer 👍 and will allow grepping warn_default_encoding or EncodingWarning inside Git history.

Comment on lines +76 to +81
test_codec_infos = {
    'utf_32_be_sig': CodecInfo(encode_utf_32_be_sig, codecs.getdecoder('utf_32')),  # noqa: E501
    'utf_32_le_sig': CodecInfo(encode_utf_32_le_sig, codecs.getdecoder('utf_32')),  # noqa: E501
    'utf_16_be_sig': CodecInfo(encode_utf_16_be_sig, codecs.getdecoder('utf_16')),  # noqa: E501
    'utf_16_le_sig': CodecInfo(encode_utf_16_le_sig, codecs.getdecoder('utf_16')),  # noqa: E501
}
@adrienverge (Owner):

I understand these # noqa originate from #630 (comment), but instead of overriding the linter, can you use this:

test_codec_infos = {
    'utf_32_be_sig':
    CodecInfo(encode_utf_32_be_sig, codecs.getdecoder('utf_32')),

or this:

test_codec_infos = {
    'utf_32_be_sig': CodecInfo(encode_utf_32_be_sig,
                               codecs.getdecoder('utf_32')),

?

    # An empty string
    PreEncodedTestStringInfo(
        b'',
        None,
@adrienverge (Owner):

Good idea.

Resolved review thread: tests/test_decoder.py.
Comment on lines +343 to +348
    def detect_encoding_test_helper(
            self,
            original_string,
            input_bytes,
            expected_output
    ):
@adrienverge (Owner):

These variable names confused me for 2 minutes (I thought "output" referred to a string) 😅
Can you get rid of original_string (it isn't used) and use something like this:

Suggested change
    def detect_encoding_test_helper(
            self,
            original_string,
            input_bytes,
            expected_output
    ):
    def detect_encoding_test_helper(self, input_bytes, expected_codec):
… and 3 lines later, actual_output → actual_codec?

PS: For consistency, we could also change auto_decode_test_helper() to use expected_string instead of expected_output. It would be more explicit and easier to grasp.

            decoder.auto_decode(input_bytes)
        except UnicodeDecodeError as exception:
            return exception
        return None
@adrienverge (Owner):

Returning None is implicit and not strictly needed here. As you prefer.

Comment on lines +425 to +436
exception = self.auto_decode_test_helper(
    pre_encoded_test_string_info.input_bytes,
    pre_encoded_test_string_info.codec_for_input_bytes,
    pre_encoded_test_string_info.expected_output_str
)
if exception is not None:
    new_exception = self.failureException(
        msg=ERROR.format(
            repr(pre_encoded_test_string_info.input_bytes)
        )
    )
    raise new_exception from exception
@adrienverge (Owner):

I understand the logic and I'm OK to keep this code, but if it were me, I would simplify all this and just let the exception pop up from auto_decode_test_helper(). It's a bit less clean, but shorter 😉

(same later)

Comment on lines +451 to +454
msg=(
    "None of the TEST_STRINGS_TO_ENCODE_AT_RUNTIME triggered a "
    + "decoding error."
)
@adrienverge (Owner):

The + is not needed:

Suggested change
msg=(
    "None of the TEST_STRINGS_TO_ENCODE_AT_RUNTIME triggered a "
    + "decoding error."
)
msg=("None of the TEST_STRINGS_TO_ENCODE_AT_RUNTIME triggered a "
     "decoding error.")

Or if you really like Black 😉:

Suggested change
msg=(
    "None of the TEST_STRINGS_TO_ENCODE_AT_RUNTIME triggered a "
    + "decoding error."
)
msg=(
    "None of the TEST_STRINGS_TO_ENCODE_AT_RUNTIME triggered a "
    "decoding error."
)

)


def encode_utf_32_be_sig(obj, errors='strict'):
@adrienverge (Owner):

It looks like errors='strict' is already the default: https://docs.python.org/3/library/codecs.html#codecs.encode and no yamllint code uses this argument anywhere.
But I'm OK to keep it if you think it's better to be explicit.
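
A quick check of the claim about the default (illustrative only, not code from the PR):

import codecs

# 'strict' is the documented default error handler for codecs.encode(),
# so passing errors='strict' explicitly changes nothing.
assert codecs.encode('é', 'utf_32_be') == codecs.encode('é', 'utf_32_be', errors='strict')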

from yamllint import decoder


class PreEncodedTestStringInfo():
@adrienverge (Owner):

Suggested change
class PreEncodedTestStringInfo():
class PreEncodedTestStringInfo:
