Use Unicode line breaking algorithm when truncating posts #1625

Open
sayunuh opened this issue Dec 14, 2024 · 3 comments

@sayunuh

sayunuh commented Dec 14, 2024

I've noticed that when a long ActivityPub post has to be truncated for Bluesky, Bridgy Fed truncates only at explicit word breaks. This causes problems for languages (or writing systems) that don’t use U+0020 SPACE to delimit words or sentences, for example Japanese. Often entire paragraphs are dropped.

At least in Japanese you can truncate basically anywhere in a sentence. The same goes for Chinese. Every Chinese character, hiragana, or katakana is a breaking opportunity. I believe you can refer to the Unicode line breaking properties for a comprehensive treatment.
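
To make the difference concrete, here is a minimal sketch of break opportunities that treats every hiragana, katakana, and CJK ideograph as a place to cut, in addition to whitespace. The function names and the three code point ranges are assumptions for illustration only. This is not Bridgy Fed's code and it is far cruder than the Unicode line breaking properties.

```python
def is_cjk(ch: str) -> bool:
    """Rough check (assumption): hiragana, katakana, or CJK unified ideograph."""
    cp = ord(ch)
    return (
        0x3040 <= cp <= 0x309F     # Hiragana block
        or 0x30A0 <= cp <= 0x30FF  # Katakana block
        or 0x4E00 <= cp <= 0x9FFF  # CJK Unified Ideographs block
    )


def break_opportunities(text: str) -> list[int]:
    """Return indexes i such that text[:i] is an acceptable truncated prefix."""
    positions = []
    for i in range(1, len(text)):
        prev, nxt = text[i - 1], text[i]
        if prev.isspace():
            continue  # never end a truncated prefix on whitespace
        # Break before whitespace, or next to any CJK character.
        if nxt.isspace() or is_cjk(prev) or is_cjk(nxt):
            positions.append(i)
    return positions
```

With space-only breaking, a Japanese sentence has no break opportunities at all, which is why whole paragraphs disappear. With the sketch above, every position inside it becomes a candidate.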

By the way thanks for the service existing at all, it helps tremendously.

@snarfed
Owner

snarfed commented Dec 15, 2024

Oh wow, great point, makes sense. Thank you for filing!

Looks like the Unicode line breaking algorithm is http://www.unicode.org/reports/tr14/ . Python has https://pypi.org/project/uniseg/0.6.4/ (native) and https://pypi.org/project/unicode-linebreak/ (Rust wrapped), and also a feature request to add it to textwrap, python/cpython#86141.

@snarfed changed the title from "Better truncation of Japanese posts" to "Use Unicode line breaking algorithm when truncating posts" Dec 15, 2024
@Tamschi
Collaborator

Tamschi commented Dec 15, 2024

I implemented a partial Chinese and Japanese word-wrap yesterday (unrelated to this project, just a small patch plugin for game development), so this is fresh in my mind: there's some info on break-excluded characters for CJK languages on pages 85-86 of Office Open XML Part 4: Markup Language Reference, in section 2.3.1.16 kinsoku (Use East Asian Typography Rules for First and Last Character per Line).

@sayunuh
Author

sayunuh commented Dec 15, 2024

I thought “just use the existing implementation” would suffice for the purpose of truncation, but since kinsoku is brought up here… To be precise, some punctuation marks prevent line‐breaking in Japanese typography. Opening brackets and the like must not be at the end of a line, and things like commas must not be at the start of a line.
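
As a rough sketch of those two prohibitions, a line-break check between two adjacent characters could look like this. The character sets are small illustrative subsets I picked by hand, not the full kinsoku lists, and the names are hypothetical.

```python
# Illustrative subsets only (assumption); real kinsoku lists are longer.
NO_LINE_END = set("「『（［｛〈《【")            # opening brackets: must not end a line
NO_LINE_START = set("」』）］｝〉》】、。，．！？")  # closers and punctuation: must not start a line


def can_break_between(before: str, after: str) -> bool:
    """True if a line break is allowed between two adjacent characters."""
    return before not in NO_LINE_END and after not in NO_LINE_START
```

For example, `can_break_between("「", "不")` and `can_break_between("す", "。")` are both false, while `can_break_between("不", "気")` is true.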

As for whether they should be taken care of when truncating text, I think end‐of‐line prohibition (gyōmatsu kinsoku) may be obeyed, but start‐of‐line prohibition (gyōtō kinsoku) is better ignored. For example…

  • In これが「不気味の谷」です, 「不 is a non‐breaking pair because 「 is an end‐of‐line kinsoku character. It can be argued that これが […] looks better than これが「 […] as an output of truncation, though I can live with both.
  • In おはようございます。, す。 is a non‐breaking pair because 。 is a start‐of‐line kinsoku character. But regardless of that fact, I’m confident that おはようございます […] is better than おはようございま […] as an output of truncation.

So it can be complicated. At the end of the day you may ellipsize anywhere and it will be understandable. Using a line‐breaking algorithm will mostly work, but the results might not be ideal, since line‐breaking and truncation are slightly different problems.

On reflection, the rule might be this: the character immediately before the truncation mark must be one that allows a break right after it. The removed characters don’t matter.
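
A minimal sketch of that rule, assuming Japanese text without space‐delimited words, a limit counted in code points, and a hand‐picked subset of end‐of‐line kinsoku characters. The names `truncate`, `NO_LINE_END`, and `ELLIPSIS` are hypothetical, and this is not how Bridgy Fed currently truncates.

```python
# Illustrative subset of end-of-line kinsoku characters (assumption).
NO_LINE_END = set("「『（［｛〈《【")
ELLIPSIS = " […]"


def truncate(text: str, limit: int) -> str:
    """Truncate to at most `limit` code points, ellipsis included.

    Only the last kept character is checked: it must not be an opening
    bracket (or whitespace). The removed characters are ignored, so the
    start-of-line prohibition never applies.
    """
    if len(text) <= limit:
        return text
    cut = max(limit - len(ELLIPSIS), 0)
    # Back off while the last kept character may not be followed by a break.
    while cut > 0 and (text[cut - 1] in NO_LINE_END or text[cut - 1].isspace()):
        cut -= 1
    return text[:cut] + ELLIPSIS
```

With this sketch, `truncate("これが「不気味の谷」です", 8)` backs off past 「 and returns これが […], and `truncate("おはようございます。今日は良い天気ですね。", 13)` returns おはようございます […] even though the first removed character is 。. A real implementation would also have to handle Latin words and Bluesky's own length counting, which this sketch leaves out.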
