-
You'll get the right sequence in the file if you call BOUT for those three bytes.
-
Is there an XCCS notion of a “slug”? If so, printing that XCCS code should (or could be made to) print the right bytes.
… On Dec 23, 2024, at 3:33 PM, Nick Briggs ***@***.***> wrote:
You'll get the right sequence in the file if you call BOUT for those three bytes.
% od -t x1 -c /tmp/test-utf8.txt
0000000    54  68  69  73  20  69  73  20  61  20  74  65  73  74  ef  bf
            T   h   i   s       i   s       a       t   e   s   t   �  **
0000020    bd  20  61  6e  64  20  61  66  74  65  72  20  74  68  65  20
           **       a   n   d       a   f   t   e   r       t   h   e
-
I don't recall there being an XCCS character code for which the meaning is a slug. The Alto fonts are set up so that there is a slug glyph available one entry past the last character code present in the font to use in place of a missing character -- but that's a rendering convenience, nothing to do with the meaning of a particular code.
-
@rmkaplan Perhaps a browse through your XCCS standard? I'd look, but I have never found one online.
-
I found a gist from John Cowan containing a link to the XCCS 2.0 standard, which turns out to be in our repo. (Google searches did not unearth it for me.) The document mentions a character called "Replacement symbol (IBM)", XCCS code 360B/307B. That could work for me if it has a Unicode equivalent, but when I attempt to write it using \OUTCHAR I get EXHAUSTED RANGE FOR UNMAPPED CODES, so I guess it doesn't.
-
After looking at the Character Code Standard, I think that 360b/307b [Replacement symbol (for undefined code points)] is the right symbol to translate to the Unicode U+FFFD (and therefore UTF-8 #xEF #xBF #xBD).
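As a quick check of the numbers in this comment, here is a small Python sketch (Python is used only for illustration; it is not part of Medley). It assumes that the notation 360b/307b means octal character set 360 and octal character 307, packed into one 16-bit code with the character set in the high byte.

```python
# Assumption: "360b/307b" is <octal charset>/<octal char>, packed as
# (charset << 8) | char. This packing is an illustration, not Medley code.
charset = 0o360
char = 0o307
xccs_code = (charset << 8) | char

# U+FFFD REPLACEMENT CHARACTER and its UTF-8 encoding.
replacement = "\ufffd"
utf8_bytes = replacement.encode("utf-8")

print(xccs_code, hex(xccs_code))  # decimal and hex forms of the XCCS code
print(utf8_bytes.hex(" "))        # the three UTF-8 bytes, space separated
```

Under that assumption the code comes out as 61639 decimal (F0C7 hex), and the UTF-8 form of U+FFFD is the three bytes EF BF BD.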
-
In the Wikipedia entry on specials in the Unicode block --
What is the situation where you need to emit a replacement character?
-
Are you getting this error after you have filled it up with a lot of other unmapped codes?
When it sees an unmapped code (in either direction), it allocates and assigns a code to be used internally so that the original code can be resurrected if it is later written to a new file.
But there are a limited number of such codes available—they have to be in the intersection of the unused regions of both XCCS and Unicode, and they have to be smallp.
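A toy Python model of the behavior described here may help show why the error appears. It is only a sketch: the class name, the method names, and the range 0xE000 to 0xE001 are hypothetical and are not taken from the Medley sources.

```python
class UnmappedCodeTable:
    """Toy model: hand out internal stand-in codes for codes with no mapping."""

    def __init__(self, first_free, last_free):
        # Hypothetical range. Per the comment above, the real range must be
        # unused in both XCCS and Unicode and still fit in a small integer.
        self._next = first_free
        self._last = last_free
        self._to_internal = {}  # original code -> internal stand-in
        self._to_original = {}  # internal stand-in -> original code

    def internal_for(self, original):
        # Reuse the stand-in if this original code was seen before.
        if original in self._to_internal:
            return self._to_internal[original]
        if self._next > self._last:
            raise RuntimeError("EXHAUSTED RANGE FOR UNMAPPED CODES")
        internal = self._next
        self._next += 1
        self._to_internal[original] = internal
        self._to_original[internal] = original
        return internal

    def original_for(self, internal):
        # Recover the original code when writing back out.
        return self._to_original[internal]


table = UnmappedCodeTable(0xE000, 0xE001)  # room for only two stand-ins
first = table.internal_for(0x0A41)         # a code from a nonexistent charset
second = table.internal_for(0x0A42)
print(hex(first), hex(second), hex(table.original_for(first)))
```

With only two slots in this toy range, a third distinct unmapped code would raise the error, which is the same shape of failure as pushing many codes from an undefined character set through the real mechanism.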
… On Dec 23, 2024, at 5:16 PM, Herb Jellinek ***@***.***> wrote:
but when I attempt to write it using \OUTCHAR I get
EXHAUSTED RANGE FOR UNMAPPED CODES
61639
-
Where are these codes coming from, if they aren’t either Unicode or XCCS?
… On Dec 24, 2024, at 9:49 AM, Herb Jellinek ***@***.***> wrote:
My test program attempts to write ~28 characters from character set 0x0A, which does not exist. I got that error message after I added 360b/307b (61639 decimal) to the set of chars to output.
-
We could add that mapping, if there isn’t already a mapping for that XCCS character.
But I’m not sure how you intend to use it. If you see that particular XCCS code in the input, then it would produce that Unicode code. And it would work in the other direction, like any other mapping (modulo the fact that some mappings are not one-to-one).
If you are thinking of substituting that XCCS code for any codes that don’t have a mapping (like in a character set that is undefined), I think that’s a mistake. The setup for the (limited) number of unmapped codes is to assign an internal code-token that records the original code, so the original code can be written out when it shows up in the printing stream. That gives the invariant that you get a proper round trip: if you copychar an XCCS file to a UTF-8 file, and then copychar it back, you end up where you started. I.e., (PRINTCCODE (READCCODE (PRINTCCODE (READCCODE)))) (between XCCS and UTF-8) should end up with file bytes that encode the original character.
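The difference between the two approaches can be sketched in Python. This is an illustration only: the tiny mapping table, the function names, and the stand-in range starting at 0xE000 are all hypothetical and do not reflect the actual Medley tables or code.

```python
REPLACEMENT = 0xFFFD

def lossy_convert(codes, mapping):
    # Substitute U+FFFD for anything unmapped: the original code is lost.
    return [mapping.get(c, REPLACEMENT) for c in codes]

def preserving_convert(codes, mapping, standins):
    # Give each unmapped code its own stand-in so it can be restored later.
    out = []
    for c in codes:
        if c in mapping:
            out.append(mapping[c])
        else:
            out.append(standins.setdefault(c, 0xE000 + len(standins)))
    return out

def invert(codes, mapping, standins):
    # Reverse both the real mappings and the stand-ins; unknown codes -> None.
    back = {v: k for k, v in mapping.items()}
    back.update({v: k for k, v in standins.items()})
    return [back.get(c) for c in codes]

mapping = {0x41: 0x41, 0x42: 0x42}        # hypothetical two-entry table
original = [0x41, 0x0A01, 0x0A02, 0x42]   # two codes have no mapping

standins = {}
preserved = invert(preserving_convert(original, mapping, standins), mapping, standins)
lossy = invert(lossy_convert(original, mapping), mapping, {})

print(preserved == original)  # round trip holds with stand-ins
print(lossy == original)      # round trip fails once U+FFFD is substituted
```

In this sketch the stand-in approach returns the original sequence, while the substitution approach cannot tell the two unmapped codes apart on the way back.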
I don’t think it is a useful stress test to pump it with non-existent codes.
… On Dec 24, 2024, at 10:10 AM, Herb Jellinek ***@***.***> wrote:
They are coming from my test software only, part of making my code bulletproof. (Having an XCCS character that maps to U+FFFD just seems like a good idea anyway.)
-
I think an error is not unreasonable if you are doing something that doesn’t make sense, e.g. treating a code as XCCS when it really isn’t. Otherwise, in this situation, the program would sail through and possibly lose information without telling the user.
We can put a mapping to FFFD for 360/307 (= F0C7 ?) in the mapping table for charset 360Q, although that may be a perfectly good (but unnamed) character whose glyph is somewhere else in Unicode.
But I don’t think that the underlying Unicode machinery should do more than what it is doing now. A missing mapping may or may not be because the character is undefined; it may be that there is a missing entry in the table for a perfectly good character (as possibly for this one).
… On Dec 24, 2024, at 2:03 PM, Herb Jellinek ***@***.***> wrote:
I was unit testing my HTML OUTCHARFN, checking that it would behave properly in the face of questionable input. Unknowingly, that turned into a stress test of the Unicode <-> XCCS mechanism. I need to make sure that a program that renders HTML output doesn't break with an error because the input character sequence is flawed, which is currently possible, or in the less-likely case that we overlooked some XCCS mappings.
I'm not too concerned about possible problems with a return trip for U+FFFD, since U+FFFD is defined as the "replacement" character, and so it's understood that it's an "onto," information-losing, mapping, which I think you covered in your comment "(modulo the fact that some mappings are not one-to-one)."
In any case, it seems to me that XCCS 360b/307b "Replacement symbol (IBM)" (61639 decimal) can/arguably should map to U+FFFD � REPLACEMENT CHARACTER, "used to replace an incoming character whose value is unknown or unrepresentable in Unicode." Even the glyph (screenshot_734.png (view on web) <https://github.com/user-attachments/assets/12865ee8-80f2-41aa-8eb4-05eedf1421b1>) in the standard suggests "we can't turn this into Unicode." In the reverse direction, it's appropriate for it to mean "we can't turn this into XCCS."
-
I'm a bit confused as to why you have an OUTCHARFN for your UTF-8 HTML stream that is different from the generic UTF-8 stream OUTCHARFN that I think Ron implemented.
-
I've got code that writes to a UTF-8 output stream, meaning that my program outputs XCCS-coded characters and the stream implementation converts them to Unicode on the way out.
In certain unusual cases I need to explicitly write a U+FFFD � REPLACEMENT CHARACTER char to the output file. Because the stream converts XCCS to Unicode automatically, in effect what I need is either (1) an XCCS character code that maps to U+FFFD, or (2) a way to go "around" the conversion and write the UTF-8 equivalent of U+FFFD into the file. I couldn't find anything suitable in xccs_medley.txt.
Alternative (1) is most appealing, but if it's too hard I could do (2). Maybe calling BOUT would suffice? I think the UTF-8 version of U+FFFD is (#xEF #xBF #xBD).
@rmkaplan, this is your area of expertise. Your thoughts?
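To make alternative (2) concrete, here is a minimal Python sketch of going "around" a converting layer by writing the three bytes directly. It models the idea only: the helper names are hypothetical, an in-memory buffer stands in for the output file, and it does not show how BOUT or the Medley stream machinery actually behave.

```python
import io

REPLACEMENT_UTF8 = bytes([0xEF, 0xBF, 0xBD])  # UTF-8 encoding of U+FFFD

def write_text(stream, text):
    # Stand-in for the normal path: characters are converted on the way out.
    stream.write(text.encode("utf-8"))

def write_raw_replacement(stream):
    # Bypass the conversion and emit the three bytes as-is.
    stream.write(REPLACEMENT_UTF8)

buf = io.BytesIO()  # in-memory stand-in for the output file
write_text(buf, "This is a test")
write_raw_replacement(buf)
write_text(buf, " and after")

data = buf.getvalue()
print(data[14:17].hex(" "))  # the raw bytes that followed the text
print(data.decode("utf-8"))  # decodes cleanly, with U+FFFD in the middle
```

Because the three bytes are themselves valid UTF-8, a reader of the file sees an ordinary U+FFFD between the two pieces of text.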