Fix tests with non-GNU iconv #16840

orlitzky · 2024-11-17T14:50:59Z

Get the iconv test suite passing with musl. This fixes some of #13696

Many of these issues relate to the //IGNORE suffix that can be appended to the target charset during conversion. This //IGNORE was recently standardized in POSIX 2024, but there are still some pitfalls:

The standard says that //IGNORE will ignore untranslatable sequences (valid input sequences that cannot be represented in the target charset), but sequences that are a priori invalid should instead cause an EILSEQ -- even in earlier versions of the standard.
The two GNU implementations have always ignored that, and have returned an EILSEQ for invalid input sequences. Their //IGNORE allows these errors to be ignored. In other words, the two GNU implementations have never conformed to POSIX.
Other implementations may conform to POSIX, but not the 2024 version yet. In particular, musl follows the earlier standards and will e.g. replace untranslatable sequences by a * rather than fail. But it does not yet implement //IGNORE, and will simply crash if it encounters //IGNORE in the charset name.

So, what is done here:

Two of our tests rely on the non-standard (GNU) illegal input sequence //IGNORE behavior. These get SKIPIFs.
One test uses //IGNORE, but it has no effect on the output. Here the //IGNORE was simply dropped.
Two ISO-2022-JP tests are looking for GNU output, but that encoding is stateful and there is no single correct answer. The musl answer is different. Rather than whitelist the GNU implementations here, I've blacklisted "unknown" implementations, which includes musl. (I don't see a good way to fix these tests otherwise.)

@petk @arnaud-lb

This test ensures that iconv() will not return part of the string if it encounters an error, but the particular error it wants to cause is not portable. When the input sequence is not representable in the output charset, POSIX says that "iconv() shall perform an implementation-defined conversion on the character" rather than fail. As far as I know, the only two implementations that go against this recommendation (and return the error that this test is expecting) are GNU libiconv and glibc. This commit adds a SKIPIF check for the two GNU implementations to ensure that an error is actually encountered during this test.

This test performs a conversion from UCS-2 to UTF-8 using iconv and stream filters, with the suffix "//IGNORE" being added to the target charset. This magic //IGNORE string was recently standardized in POSIX 2024, but it is not yet portable: musl for example will simply fail when it encounters //IGNORE in what is supposed to be a charset name. Fortunately, we do not need to think too hard about the general problem here: the input is "abc", which has no un-translatable sequences. We can simply drop the "//IGNORE".

orlitzky · 2024-11-17T15:28:15Z

Hm, it looks like some versions of glibc iconv actually do the right thing with invalid input sequences and //IGNORE, but we are still testing for the wrong thing:

- string(10) "aa%C3%B8aa"
+ Notice: iconv(): Detected an illegal character in input string in /home/runner/work/php-src/php-src/ext/iconv/tests/bug48147.php on line 4
+ string(0) ""

That SHOULD warn/fail because the input is invalid, and //IGNORE should not ignore invalid inputs. I'll do some digging and try to figure out where the divergence happened.

petk · 2024-11-17T16:38:25Z

Thanks for the patch. Yes, that GNU libiconv is non-standard and also LGPL based library, which makes it so-so useful. I certainly would add priority to built-in iconv implementation over GNU libiconv. I'm not sure if any packages out there rely on it as there isn't proper integration of POSIX iconv in PHP yet for some characters. I remember I've always had some annoyances with converting characters using it. :D

Not to mention that with gnu-libiconv some tests would fail on PHP's CI setup.

Perhaps those IGNORE hacks should be removed but I should check in more details.

orlitzky · 2024-11-17T16:58:38Z

Yuck.

The glibc bug that started this whole thing is still there; it still eats up more of the string than it should when //IGNORE is used.
But it shouldn't matter, because in 2010, PHP was changed to not return part of the string on failure, in e3fdf31.
However, the test for ICONV_BROKEN_IGNORE is, ironically, broken. It tests that //IGNORE will ignore invalid input sequences. (It should not). Glibc, musl, and anything conforming to any version of POSIX will fail this test, and trigger the workaround.
The ICONV_BROKEN_IGNORE workaround turns what should be an error into a success, so the 2010 change does not kick in when //IGNORE is used. (It should, because illegal input sequences are an error and shouldn't be ignored, even with //IGNORE).

This is annoying because, like, three mistakes add up to something that is sort of correct. POSIX and implementations have always agreed that processing would stop at an invalid input sequence, even in the presence of //IGNORE. So, in C, you'll get part of the string back. Glibc however still has that bug that makes it mess this up with //IGNORE. PHP, on the other hand, is trying to return false if iconv() fails, rather than the partially converted string. But the old glibc workaround and wonky ICONV_BROKEN_IGNORE test are conspiring to make PHP return the partially converted string, so long as //IGNORE was given. It would not be crazy to rely on this if you have not read the PHP documentation, because that's how it works in C.

The PHP docs for iconv, however, do say that any failure will get you false rather than the input string. So in short, I think we should treat the new output above as correct. If you get an EILSEQ, even with //IGNORE, that's a failure and you should get back false rather than a partially converted string. This conforms to POSIX, the iconv docs, and the PHP docs. But it is a breaking change for people who are expecting to get the partial result back.

This test meaningfully uses the //IGNORE charset suffix that was was a GNU/Solaris extension but is now standardized in POSIX 2024. The way //IGNORE is used, however, is non-standard: POSIX says that //IGNORE will cause untranslatable sequences to be skipped, but this test is expecting it to skip input sequences that are invalid rather than unexpressible in the target charset. That behavior is specific to the two GNU implementations (and was always non-conforming...), so we add a SKIPIF block to ensure that one of the GNU implementations is used.

This test does a byte comparison of the expected ISO-2022-JP encoding with the result of iconv(). The ISO-2022-JP encoding, however, is stateful; there are many ways to arrive at a correct answer. Musl is known to have an inefficient encoding that causes it to fail this test, so we add a SKIPIF for the "unknown" iconv implementation that musl has. Nothing is wrong here per se, but to support the musl output (and that of other iconvs), an expert would need to verify it.

This test checks the output of iconv_mime_encode() with a target charset of ISO-2022-JP. That charset is stateful, though, and there are many ways to arrive at a correct answer. Musl has an inefficient encoding of it that causes this to fail because the actual output differs from (and is much longer than) the expected output. We add a SKIPIF for the "unknown" iconv implementation that musl has.

orlitzky · 2024-11-17T20:30:17Z

I dropped the two commits that change the //IGNORE workaround, since the others should be uncontroversial and suitable for 8.3.

I'll open a new PR that makes the case for eliminating the //IGNORE workaround and bringing everything to parity with POSIX and the PHP manual.

orlitzky added 2 commits November 17, 2024 09:25

github-actions bot added Category: Build System Extension: iconv labels Nov 17, 2024

orlitzky added 3 commits November 17, 2024 15:25

orlitzky force-pushed the nongnu-iconv-tests branch from 699b51c to f25e0e5 Compare November 17, 2024 20:26

orlitzky mentioned this pull request Nov 18, 2024

Make iconv //IGNORE behavior conform to POSIX and the docs #16846

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix tests with non-GNU iconv #16840

Fix tests with non-GNU iconv #16840

orlitzky commented Nov 17, 2024 •

edited

Loading

orlitzky commented Nov 17, 2024

petk commented Nov 17, 2024

orlitzky commented Nov 17, 2024

orlitzky commented Nov 17, 2024 •

edited

Loading

Fix tests with non-GNU iconv #16840

Are you sure you want to change the base?

Fix tests with non-GNU iconv #16840

Conversation

orlitzky commented Nov 17, 2024 • edited Loading

orlitzky commented Nov 17, 2024

petk commented Nov 17, 2024

orlitzky commented Nov 17, 2024

orlitzky commented Nov 17, 2024 • edited Loading

orlitzky commented Nov 17, 2024 •

edited

Loading

orlitzky commented Nov 17, 2024 •

edited

Loading