Use Unicode line breaking algorithm when truncating posts #1625

sayunuh · 2024-12-14T22:55:29Z

I observe that when a long ActivityPub post has to be truncated for Bluesky, Bridgy Fed truncates only at explicit word breaks. This behaviour causes problems for languages (or writing systems) that don’t use U+0020 SPACE to delimit words or sentences, such as Japanese. Often entire paragraphs are lost.

At least in Japanese, you can truncate basically anywhere in a sentence. The same goes for Chinese. Every Chinese character, hiragana, or katakana is a breaking opportunity. I believe you can refer to the Unicode line breaking properties for a comprehensive treatment.

By the way thanks for the service existing at all, it helps tremendously.

snarfed · 2024-12-15T01:10:03Z

Oh wow, great point, makes sense. Thank you for filing!

Looks like the Unicode line breaking algorithm is http://www.unicode.org/reports/tr14/ . Python has https://pypi.org/project/uniseg/0.6.4/ (native) and https://pypi.org/project/unicode-linebreak/ (Rust wrapped), and also a feature request to add it to textwrap, python/cpython#86141.

Tamschi · 2024-12-15T10:04:19Z

I partially implemented Chinese and Japanese word wrap yesterday (unrelated to this project, just a small patch plugin for game development). From that: there's some info on break-excluded characters for CJK languages on pages 85-86 of Office Open XML Part 4: Markup Language Reference, in section 2.3.1.16 kinsoku (Use East Asian Typography Rules for First and Last Character per Line).

sayunuh · 2024-12-15T16:00:50Z

I thought “just use the existing implementation” would suffice for the purpose of truncation, but since kinsoku is brought up here… To be precise, some punctuation marks prevent line‐breaking in Japanese typography. Opening brackets and the like must not be at the end of a line, and things like commas must not be at the start of a line.

As for whether they should be taken care of when truncating text, I think end‐of‐line prohibition (gyōmatsu kinsoku) may be obeyed, but start‐of‐line prohibition (gyōtō kinsoku) is better ignored. For example…

- In これが「不気味の谷」です, 「不 is a non‐breaking pair because 「 is an end‐of‐line kinsoku character. It can be argued that これが […] looks better than これが「 […] as an output of truncation, though I can live with both.
- In おはようございます。, す。 is a non‐breaking pair because 。 is a start‐of‐line kinsoku character. But regardless of that fact, I’m confident that おはようございます […] is better than おはようございま […] as an output of truncation.

So it can be complicated. At the end of the day you may ellipsize anywhere and it will be understandable. Using a line‐breaking algorithm will mostly work, but the results might not be ideal, since line‐breaking and truncation are slightly different problems.

On reflection, the rule might be this: the last character kept before the truncation mark must be one that allows a break right after it, i.e. it must not be an end‐of‐line kinsoku character. The removed characters don’t matter.
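
For illustration, the rule above could be sketched roughly as follows. This is not Bridgy Fed's code and not an implementation of UAX #14: the names `NO_LINE_END`, `is_cjk`, `can_cut_at`, and `truncate` are hypothetical, the bracket set is a small illustrative subset, the CJK test is a crude approximation based on Unicode character names, and lengths are counted in code points rather than whatever unit the target network actually uses.

```python
import unicodedata

# Illustrative subset of end-of-line prohibited (gyomatsu kinsoku) characters.
# A real implementation would use the Unicode line breaking classes instead.
NO_LINE_END = set("「『（［｛〈《【〔([{")


def is_cjk(ch):
    """Rough check for ideographs and kana, which allow a break on either side."""
    name = unicodedata.name(ch, "")
    return name.startswith(("CJK UNIFIED IDEOGRAPH", "HIRAGANA", "KATAKANA"))


def can_cut_at(text, i):
    """May the text be cut so that text[i - 1] is the last kept character?"""
    prev, nxt = text[i - 1], text[i]
    if prev in NO_LINE_END:
        return False  # never leave an opening bracket dangling
    if nxt.isspace():
        return True  # ordinary word boundary
    # Start-of-line prohibition on nxt (e.g. a full stop) is deliberately
    # ignored, because the removed characters don't matter.
    return is_cjk(prev) or is_cjk(nxt)


def truncate(text, limit, marker=" […]"):
    """Cut text to at most `limit` code points, marker included."""
    if len(text) <= limit:
        return text
    budget = limit - len(marker)
    # Walk backwards from the longest cut that fits to the shortest.
    for i in range(min(budget, len(text) - 1), 0, -1):
        if can_cut_at(text, i):
            return text[:i].rstrip() + marker
    return marker.strip()  # no acceptable cut point found


print(truncate("これが「不気味の谷」です", 8))
print(truncate("おはようございます。今日はいい天気ですね。", 13))
print(truncate("The quick brown fox jumps", 14))
```

With these inputs the first call cuts before 「 rather than after it, the second keeps す and drops 。, and the third falls back to the space-delimited word boundary.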

snarfed changed the title ~~Better truncation of Japanese posts~~ Use Unicode line breaking algorithm when truncating posts Dec 15, 2024
