How to generate a Unicode slug? #1929

hjellinek · 2024-12-23T21:55:04Z

hjellinek
Dec 23, 2024
Maintainer

I've got code that writes to a UTF-8 output stream, meaning that my program outputs XCCS-coded characters and the stream implementation converts them to Unicode on the way out.

In certain unusual cases I need to explicitly write a U+FFFD � REPLACEMENT CHARACTER char to the output file. Because the stream converts XCCS to Unicode automatically, in effect what I need is either (1) an XCCS character code that maps to U+FFFD, or (2) a way to go "around" the conversion and write the UTF-8 equivalent of U+FFFD into the file. I couldn't find anything suitable in xccs_medley.txt.

Alternative (1) is most appealing, but if it's too hard I could do (2). Maybe calling BOUT would suffice? I think the UTF-8 version of U+FFFD is (#xEF #xBF #xBD).

@rmkaplan, this is your area of expertise. Your thoughts?

nbriggs · 2024-12-23T23:32:40Z

nbriggs
Dec 23, 2024
Maintainer

You'll get the right sequence in the file if you call BOUT for those three bytes.

% od -t x1 -c /tmp/test-utf8.txt
0000000    54  68  69  73  20  69  73  20  61  20  74  65  73  74  ef  bf
           T   h   i   s       i   s       a       t   e   s   t   �  **
0000020    bd  20  61  6e  64  20  61  66  74  65  72  20  74  68  65  20
          **       a   n   d       a   f   t   e   r       t   h   e
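For anyone who wants to double-check the byte sequence outside Medley, here is a small Python sketch (not Interlisp; the function name `utf8_three_byte` is made up for illustration). It derives the three bytes for U+FFFD by hand from the UTF-8 bit layout and compares them with Python's own encoder.

```python
def utf8_three_byte(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF (excluding surrogates) as three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("only the three-byte UTF-8 range is handled here")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    return bytes([
        0xE0 | (code_point >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),         # 10xxxxxx: low 6 bits
    ])


replacement = utf8_three_byte(0xFFFD)
print(replacement.hex(" "))  # ef bf bd
assert replacement == "\uFFFD".encode("utf-8")
```

The result matches the `ef bf bd` bytes visible in the `od` dump above.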

rmkaplan · 2024-12-23T23:43:58Z

rmkaplan
Dec 23, 2024
Maintainer

Is there an XCCS notion of a “slug”? Then printing that XCCS code should (or could be made to) print the right bytes.

nbriggs · 2024-12-24T00:41:46Z

nbriggs
Dec 24, 2024
Maintainer

I don't recall there being an XCCS character code for which the meaning is a slug. The Alto fonts are set up so that there is a slug glyph available one entry past the last character code present in the font to use in place of a missing character -- but that's a rendering convenience, nothing to do with the meaning of a particular code.

nbriggs · 2024-12-24T00:43:02Z

nbriggs
Dec 24, 2024
Maintainer

@rmkaplan Perhaps a browse through your XCCS standard? I'd look, but I have never found one online.

hjellinek · 2024-12-24T01:16:18Z

hjellinek
Dec 24, 2024
Maintainer Author

I found a gist from John Cowan containing a link to the XCCS 2.0 standard, which turns out to be in our repo. (Google searches did not unearth it for me.) The document mentions a character called "Replacement symbol (IBM)", XCCS code 360B/307B. That could work for me if it has a Unicode equivalent, but when I attempt to write it using \OUTCHAR I get

EXHAUSTED RANGE FOR UNMAPPED CODES
61639

So I guess it doesn't.
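As a sanity check on the numbers, the octal pair 360B/307B and the decimal 61639 in the error message are the same code, assuming the usual layout of character set in the high byte and character in the low byte. A short Python sketch (the helper name `xccs_code` is hypothetical, not a Medley function):

```python
def xccs_code(charset_octal: str, char_octal: str) -> int:
    """Combine an octal character-set byte and an octal character byte into one 16-bit code."""
    charset = int(charset_octal, 8)
    char = int(char_octal, 8)
    if not (0 <= charset <= 0xFF and 0 <= char <= 0xFF):
        raise ValueError("each part must fit in one byte")
    # Assumed layout: character set in the high byte, character in the low byte.
    return (charset << 8) | char


code = xccs_code("360", "307")
print(code, hex(code))  # 61639 0xf0c7
```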

nbriggs · 2024-12-24T08:05:06Z

nbriggs
Dec 24, 2024
Maintainer

After looking at the Character Code Standard, I think that 360b/307b [Replacement symbol (for undefined code points)] is the right symbol to translate to Unicode U+FFFD (and therefore UTF-8 #xEF #xBF #xBD).

nbriggs · 2024-12-24T16:42:02Z

nbriggs
Dec 24, 2024
Maintainer

From the Wikipedia entry on the Specials Unicode block:

At one time the replacement character was often used when there was no glyph available in a font for that character, as in font substitution. However, most modern text rendering systems instead use a font's .notdef character, which in most cases is an empty box, or "?" or "X" in a box[7] (this browser displays 􏿮), sometimes called a 'tofu'. There is no Unicode code point for this symbol.

Thus the replacement character is now only seen for encoding errors. Some software programs translate invalid UTF-8 bytes to matching characters in Windows-1252 (since that is the most common source of these errors), so that the replacement character is never seen.

What is the situation where you need to emit a replacement character?

1 reply

hjellinek Dec 24, 2024
Maintainer Author

My OUTCHARFN needs to handle cases in which the character code it's passed is not valid in XCCS. That breaks down into two sub-cases:

(1) the character set is not defined in XCCS. For example, character set 0x20.
(2) the character set is valid, but there is no character with that CHAR8CODE. E.g., charset 0x2A (Extended Cyrillic), charcode 0xEE. Or charset 0, character 0, come to think of it.

I can detect both cases. But how to handle this? My code could simply output no character at all, but that seems like a poor design.

Ideally, I'd like to substitute an XCCS character code that means "undefined" or "not found" and which the output stream will convert to Unicode 0xFFFD.

Would it work for me to create a PR that maps XCCS 360b/307b [Replacement symbol (for undefined code points)] to U+FFFD?
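To make the two sub-cases concrete, here is a rough Python sketch of the check-and-substitute idea. It is not the actual OUTCHARFN. The `DEFINED` table is a hypothetical, heavily abbreviated stand-in for whatever the XCCS standard really defines, and the substitution only helps once 360b/307b actually has a mapping to U+FFFD.

```python
from enum import Enum


class Validity(Enum):
    OK = "ok"
    UNDEFINED_CHARSET = "undefined character set"
    UNDEFINED_CHAR = "undefined character in a defined set"


# Hypothetical table: character set number -> set of defined 8-bit codes.
# The ranges are placeholders chosen to match the examples in this thread.
DEFINED = {
    0x00: set(range(0x20, 0x7F)),
    0x2A: set(range(0x21, 0x7F)),  # placeholder; deliberately excludes 0xEE
    0o360: {0o307},                # the replacement symbol itself
}

REPLACEMENT_XCCS = (0o360 << 8) | 0o307  # 61639 decimal


def classify(code: int) -> Validity:
    """Report whether a 16-bit code is usable, or which sub-case it falls into."""
    charset, char = code >> 8, code & 0xFF
    defined = DEFINED.get(charset)
    if defined is None:
        return Validity.UNDEFINED_CHARSET
    if char not in defined:
        return Validity.UNDEFINED_CHAR
    return Validity.OK


def sanitize(code: int) -> int:
    """Pass valid codes through; substitute the replacement symbol otherwise."""
    return code if classify(code) is Validity.OK else REPLACEMENT_XCCS


print(classify(0x2000).value)  # undefined character set
print(classify(0x2AEE).value)  # undefined character in a defined set
print(sanitize(0x2AEE))        # 61639
```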

rmkaplan · 2024-12-24T16:51:35Z

rmkaplan
Dec 24, 2024
Maintainer

Are you getting this error after you have filled it up with a lot of other unmapped codes? When the conversion sees an unmapped code (in either direction), it allocates and assigns a code to be used internally so that the original code can be resurrected if it is later written to a new file. But there are a limited number of such codes available: they have to be in the intersection of the unused regions of both XCCS and Unicode, and they have to be smallp.
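A simplified model of that allocation scheme, in Python rather than Interlisp, may help show why the error appears. The class name, the data structures, and the range bounds are all assumptions made for illustration; the thread does not show the real range or implementation.

```python
class UnmappedCodePool:
    """Hands out internal stand-in codes for codes that have no mapping."""

    def __init__(self, first: int, last: int):
        self._next = first          # next free internal code
        self._last = last           # last usable internal code (inclusive)
        self._by_original = {}      # original code -> internal code
        self._by_internal = {}      # internal code -> original code

    def internal_for(self, original: int) -> int:
        """Return the stand-in for an unmapped code, allocating one if needed."""
        if original in self._by_original:
            return self._by_original[original]
        if self._next > self._last:
            raise RuntimeError(f"EXHAUSTED RANGE FOR UNMAPPED CODES {original}")
        internal = self._next
        self._next += 1
        self._by_original[original] = internal
        self._by_internal[internal] = original
        return internal

    def original_for(self, internal: int) -> int:
        """Recover the original code so it can be written back out unchanged."""
        return self._by_internal[internal]


pool = UnmappedCodePool(0xE000, 0xE003)  # hypothetical pool of four codes
for code in (0x0A21, 0x0A22, 0x0A23, 0x0A24):
    pool.internal_for(code)
try:
    pool.internal_for(61639)  # a fifth distinct unmapped code
except RuntimeError as err:
    print(err)  # EXHAUSTED RANGE FOR UNMAPPED CODES 61639
```

In this model, repeated use of the same unmapped code is cheap, but each new distinct code consumes one slot, so a test that feeds in many different undefined codes can use up the pool.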

1 reply

hjellinek Dec 24, 2024
Maintainer Author

My test program attempts to write ~28 characters from character set 0x0A, which does not exist. I got that error message after I added 360b/307b (61639 decimal) to the set of chars to output.

rmkaplan · 2024-12-24T17:59:01Z

rmkaplan
Dec 24, 2024
Maintainer

Where are these codes coming from, if they aren’t either Unicode or XCCS?

1 reply

hjellinek Dec 24, 2024
Maintainer Author

They are coming from my test software only, part of making my code bulletproof. (Having an XCCS character that maps to U+FFFD just seems like a good idea anyway.)

rmkaplan · 2024-12-24T21:33:49Z

rmkaplan
Dec 24, 2024
Maintainer

We could add that mapping, if there isn’t already a mapping for that XCCS character. But I’m not sure how you intend to use it. If you see that particular XCCS code in the input, then it would produce that Unicode code. And it would work in the other direction, like any other mapping (modulo the fact that some mappings are not one-to-one).

If you are thinking of substituting that XCCS code for any codes that don’t have a mapping (like in a character set that is undefined), I think that’s a mistake. The setup for the (limited) number of unmapped codes is to assign an internal code-token that records the original code, so the original code can be written out when it shows up in the printing stream. That gives the invariant that you get a proper round-trip: if you copychar an XCCS file to a UTF-8 file, and then copychar it back, you end up where you started. I.e. (PRINTCCODE (READCCODE (PRINTCCODE (READCCODE)))) (between XCCS and UTF-8) should end up with file bytes that encode the original character.

I don’t think it is a useful stress test to pump it with non-existent codes.
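The round-trip point can be illustrated with a toy Python model. This is only a sketch: the two-entry mapping table, the stand-in range starting at 0xE000, and both function names are invented for the example, and the limit on the number of stand-ins is left out. It contrasts substituting U+FFFD for unmapped codes with parking them on stand-ins that remember the original.

```python
# Hypothetical two-entry mapping table; the real XCCS<->Unicode tables are far larger.
XCCS_TO_UNICODE = {0x0041: 0x0041, 0x0042: 0x0042}
UNICODE_TO_XCCS = {u: x for x, u in XCCS_TO_UNICODE.items()}

REPLACEMENT_UNICODE = 0xFFFD
REPLACEMENT_XCCS = 0xF0C7  # 360b/307b, assuming the proposed mapping exists


def lossy_round_trip(codes):
    """Substitute the replacement character for anything unmapped, then convert back."""
    as_unicode = [XCCS_TO_UNICODE.get(c, REPLACEMENT_UNICODE) for c in codes]
    return [UNICODE_TO_XCCS.get(u, REPLACEMENT_XCCS) for u in as_unicode]


def preserving_round_trip(codes, first_stand_in=0xE000):
    """Park each unmapped code on a stand-in that remembers the original."""
    stand_in_for, original_for = {}, {}
    as_unicode = []
    for c in codes:
        if c in XCCS_TO_UNICODE:
            as_unicode.append(XCCS_TO_UNICODE[c])
            continue
        if c not in stand_in_for:
            token = first_stand_in + len(stand_in_for)
            stand_in_for[c] = token
            original_for[token] = c
        as_unicode.append(stand_in_for[c])
    # Converting back: stand-ins resolve to their originals, the rest use the table.
    return [original_for.get(u, UNICODE_TO_XCCS.get(u)) for u in as_unicode]


original = [0x0041, 0x0A21, 0x0A22, 0x0042]
print(preserving_round_trip(original) == original)  # True
print(lossy_round_trip(original) == original)       # False
```

With substitution, both undefined codes collapse to the same replacement code on the way back, which is the information loss described above.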

1 reply

hjellinek Dec 24, 2024
Maintainer Author

I was unit testing my HTML OUTCHARFN, checking that it would behave properly in the face of questionable input. Unknowingly, that turned into a stress test of the Unicode <-> XCCS mechanism. I need to make sure that a program that renders HTML output doesn't break with an error because the input character sequence is flawed, which is currently possible, or in the less-likely case that we overlooked some XCCS mappings.

I'm not too concerned about possible problems with a return trip for U+FFFD, since U+FFFD is defined as the "replacement" character, and so it's understood that it's an "onto," information-losing, mapping, which I think you covered in your comment "(modulo the fact that some mappings are not one-to-one)."

In any case, it seems to me that XCCS 360b/307b "Replacement symbol (IBM)" (61639 decimal) can/arguably should map to U+FFFD � REPLACEMENT CHARACTER, "used to replace an incoming character whose value is unknown or unrepresentable in Unicode." Even the glyph (screenshot_734.png) in the standard suggests "we can't turn this into Unicode." (A coincidence, of course!) In the reverse direction, it's appropriate for it to mean "we can't turn this into XCCS."

rmkaplan · 2024-12-24T22:41:23Z

rmkaplan
Dec 24, 2024
Maintainer

I think an error is not unreasonable if you are doing something that doesn’t make sense, e.g. treating a code as XCCS when it really isn’t. Otherwise, in this situation, the program would sail through and possibly lose information without telling the user. We can put a mapping to FFFD for 360/307 (= F0C7 ?) in the mapping table for charset 360Q, although that may be a perfectly good (but unnamed) character whose glyph is somewhere else in Unicode. But I don’t think that the underlying Unicode machinery should do more than what it is doing now. A missing mapping may or may not be because the character is undefined; it may be that there is a missing entry in the table for a perfectly good character (as possibly for this one).

1 reply

hjellinek Dec 24, 2024
Maintainer Author

I don't want to change the Unicode machinery (that is, logic) at all. Adding the 360/307 to FFFD mapping would solve my problem. I don't know if there's an invariant that requires the inverse mapping also, but my code doesn't need it.

nbriggs · 2024-12-24T22:55:04Z

nbriggs
Dec 24, 2024
Maintainer

I'm a bit confused as to why you have an OUTCHARFN for your UTF-8 HTML stream that is different from the generic UTF-8 stream OUTCHARFN that I think Ron implemented.

2 replies

hjellinek Dec 24, 2024
Maintainer Author

> I'm a bit confused as to why you have an OUTCHARFN for your UTF-8 HTML stream that is different from the generic UTF-8 stream OUTCHARFN that I think Ron implemented.

The HTML.OUTCHARFN is what clients call (via OUTCHAR); behind the scenes it maintains a normal backing output stream that writes to the file. HTML.OUTCHARFN has to do various small translations, like turning < and & into HTML character entities, as well as deciding when to break lines, etc. The backing stream is set up with its external format equal to UTF-8, and I call OUTCHAR on it in the expected way.
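The layering described here looks roughly like the following Python sketch. It is an analogy, not the Interlisp code: `HtmlCharWriter` and its methods are invented names, and line breaking and the other small translations are left out. Clients hand characters to the wrapper, which escapes markup characters and forwards everything to a backing stream that does the UTF-8 encoding.

```python
import io


class HtmlCharWriter:
    """Escapes markup characters, then forwards text to a UTF-8 backing stream."""

    # Only the two characters mentioned above; a fuller version would cover more.
    ENTITIES = {"<": "&lt;", "&": "&amp;"}

    def __init__(self, backing: io.BytesIO):
        self._backing = backing

    def outchar(self, ch: str) -> None:
        """Write one character, replacing it with an HTML entity when required."""
        text = self.ENTITIES.get(ch, ch)
        self._backing.write(text.encode("utf-8"))


raw = io.BytesIO()
writer = HtmlCharWriter(raw)
for ch in "a<b & \uFFFD":
    writer.outchar(ch)
print(raw.getvalue())  # b'a&lt;b &amp; \xef\xbf\xbd'
```

Keeping the entity translation in the wrapper means the backing stream stays a plain UTF-8 stream with no HTML knowledge.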

nbriggs Dec 25, 2024
Maintainer

Ah, got it. Didn't realize you were doing the HTML character entity translations at this level.
