Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix tests with non-GNU iconv #16840

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Conversation

orlitzky
Copy link
Contributor

@orlitzky orlitzky commented Nov 17, 2024

Get the iconv test suite passing with musl. This fixes some of #13696

Many of these issues relate to the //IGNORE suffix that can be appended to the target charset during conversion. This //IGNORE was recently standardized in POSIX 2024, but there are still some pitfalls:

  • The standard says that //IGNORE will ignore untranslatable sequences (valid input sequences that cannot be represented in the target charset), but sequences that are a priori invalid should instead cause an EILSEQ -- even in earlier versions of the standard.
  • The two GNU implementations have always ignored that, and have returned an EILSEQ for invalid input sequences. Their //IGNORE allows these errors to be ignored. In other words, the two GNU implementations have never conformed to POSIX.
  • Other implementations may conform to POSIX, but not the 2024 version yet. In particular, musl follows the earlier standards and will e.g. replace untranslatable sequences by a * rather than fail. But it does not yet implement //IGNORE, and will simply crash if it encounters //IGNORE in the charset name.

So, what is done here:

  • Two of our tests rely on the non-standard (GNU) illegal input sequence //IGNORE behavior. These get SKIPIFs.
  • One test uses //IGNORE, but it has no effect on the output. Here the //IGNORE was simply dropped.
  • Two ISO-2022-JP tests are looking for GNU output, but that encoding is stateful and there is no single correct answer. The musl answer is different. Rather than whitelist the GNU implementations here, I've blacklisted "unknown" implementations, which includes musl. (I don't see a good way to fix these tests otherwise.)

@petk @arnaud-lb

This test ensures that iconv() will not return part of the string if
it encounters an error, but the particular error it wants to cause is
not portable. When the input sequence is not representable in the
output charset, POSIX says that "iconv() shall perform an
implementation-defined conversion on the character" rather than
fail. As far as I know, the only two implementations that go against
this recommendation (and return the error that this test is expecting)
are GNU libiconv and glibc.

This commit adds a SKIPIF check for the two GNU implementations to
ensure that an error is actually encountered during this test.
This test performs a conversion from UCS-2 to UTF-8 using iconv and
stream filters, with the suffix "//IGNORE" being added to the target
charset. This magic //IGNORE string was recently standardized in POSIX
2024, but it is not yet portable: musl for example will simply fail
when it encounters //IGNORE in what is supposed to be a charset name.

Fortunately, we do not need to think too hard about the general
problem here: the input is "abc", which has no un-translatable
sequences. We can simply drop the "//IGNORE".
@orlitzky
Copy link
Contributor Author

Hm, it looks like some versions of glibc iconv actually do the right thing with invalid input sequences and //IGNORE, but we are still testing for the wrong thing:

- string(10) "aa%C3%B8aa"
+ Notice: iconv(): Detected an illegal character in input string in /home/runner/work/php-src/php-src/ext/iconv/tests/bug48147.php on line 4
+ string(0) ""

That SHOULD warn/fail because the input is invalid, and //IGNORE should not ignore invalid inputs. I'll do some digging and try to figure out where the divergence happened.

@petk
Copy link
Member

petk commented Nov 17, 2024

Thanks for the patch. Yes, that GNU libiconv is non-standard and also LGPL based library, which makes it so-so useful. I certainly would add priority to built-in iconv implementation over GNU libiconv. I'm not sure if any packages out there rely on it as there isn't proper integration of POSIX iconv in PHP yet for some characters. I remember I've always had some annoyances with converting characters using it. :D

Not to mention that with gnu-libiconv some tests would fail on PHP's CI setup.

Perhaps those IGNORE hacks should be removed but I should check in more details.

@orlitzky
Copy link
Contributor Author

Yuck.

  • The glibc bug that started this whole thing is still there; it still eats up more of the string than it should when //IGNORE is used.
  • But it shouldn't matter, because in 2010, PHP was changed to not return part of the string on failure, in e3fdf31.
  • However, the test for ICONV_BROKEN_IGNORE is, ironically, broken. It tests that //IGNORE will ignore invalid input sequences. (It should not). Glibc, musl, and anything conforming to any version of POSIX will fail this test, and trigger the workaround.
  • The ICONV_BROKEN_IGNORE workaround turns what should be an error into a success, so the 2010 change does not kick in when //IGNORE is used. (It should, because illegal input sequences are an error and shouldn't be ignored, even with //IGNORE).

This is annoying because, like, three mistakes add up to something that is sort of correct. POSIX and implementations have always agreed that processing would stop at an invalid input sequence, even in the presence of //IGNORE. So, in C, you'll get part of the string back. Glibc however still has that bug that makes it mess this up with //IGNORE. PHP, on the other hand, is trying to return false if iconv() fails, rather than the partially converted string. But the old glibc workaround and wonky ICONV_BROKEN_IGNORE test are conspiring to make PHP return the partially converted string, so long as //IGNORE was given. It would not be crazy to rely on this if you have not read the PHP documentation, because that's how it works in C.

The PHP docs for iconv, however, do say that any failure will get you false rather than the input string. So in short, I think we should treat the new output above as correct. If you get an EILSEQ, even with //IGNORE, that's a failure and you should get back false rather than a partially converted string. This conforms to POSIX, the iconv docs, and the PHP docs. But it is a breaking change for people who are expecting to get the partial result back.

This test meaningfully uses the //IGNORE charset suffix that was was a
GNU/Solaris extension but is now standardized in POSIX 2024. The way
//IGNORE is used, however, is non-standard: POSIX says that //IGNORE
will cause untranslatable sequences to be skipped, but this test is
expecting it to skip input sequences that are invalid rather than
unexpressible in the target charset. That behavior is specific to the
two GNU implementations (and was always non-conforming...), so we add
a SKIPIF block to ensure that one of the GNU implementations is used.
This test does a byte comparison of the expected ISO-2022-JP encoding
with the result of iconv(). The ISO-2022-JP encoding, however, is
stateful; there are many ways to arrive at a correct answer. Musl is
known to have an inefficient encoding that causes it to fail this
test, so we add a SKIPIF for the "unknown" iconv implementation that
musl has.

Nothing is wrong here per se, but to support the musl output (and that
of other iconvs), an expert would need to verify it.
This test checks the output of iconv_mime_encode() with a target
charset of ISO-2022-JP. That charset is stateful, though, and there
are many ways to arrive at a correct answer. Musl has an inefficient
encoding of it that causes this to fail because the actual output
differs from (and is much longer than) the expected output. We add a
SKIPIF for the "unknown" iconv implementation that musl has.
@orlitzky
Copy link
Contributor Author

orlitzky commented Nov 17, 2024

I dropped the two commits that change the //IGNORE workaround, since the others should be uncontroversial and suitable for 8.3.

I'll open a new PR that makes the case for eliminating the //IGNORE workaround and bringing everything to parity with POSIX and the PHP manual.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants