Update Emojis to include Unicode 15.0+ #621

Open
anaclumos opened this issue Jun 3, 2024 · 1 comment

Comments

@anaclumos
Contributor

Bug report

Description / Observed Behavior

What kind of issues did you encounter with Satori?

It doesn't render Unicode 15.0 emojis, such as 🪈

@Vizards

Vizards commented Jul 5, 2024

I investigated this issue and found that it appears to be a bug in linebreak, not only a matter of the default Emoji Providers not yet supporting all Unicode 15 emojis.

I created a simple playground to demonstrate this more clearly:

Playground Preview

  • 🪈: A single emoji, code point U+1FA88, correctly identified as { languageCode: 'emoji' } and correctly rendered as <image />. It likely does not display in the playground because the playground's Emoji Providers have not yet been updated to support it.
  • 🫸🏽: An emoji modifier sequence (a base emoji followed by a skin-tone modifier, with no ZWJ), code points U+1FAF8 U+1F3FD, correctly identified as { languageCode: 'emoji' } but incorrectly rendered as <path /> instead of <image />.
  • 🫸🏽 with style wordBreak: 'break-all': Correctly identified as { languageCode: 'emoji' } and correctly rendered as <image />.

I found that the default wordBreak logic in src/utils.ts#L285 causes the emoji sequence to be segmented incorrectly:

  if (wordBreak === 'break-all') {
    return { words: segment(content, 'grapheme'), requiredBreaks: [] }
  }

  if (wordBreak === 'keep-all') {
    return { words: segment(content, 'word'), requiredBreaks: [] }
  }

  const breaker = new LineBreaker(content)

Intl.Segmenter is called to handle text segmentation only when wordBreak === 'break-all' or wordBreak === 'keep-all' is specified. When wordBreak is not specified, linebreak handles segmentation instead. linebreak currently supports Unicode version 13, so it splits 🫸🏽 into ['🫸', '🏽'], and Satori cannot render the emoji correctly.
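
For comparison, here is a minimal sketch of grapheme segmentation with Intl.Segmenter, assuming a runtime that ships it (for example a recent Node.js). The graphemes helper is a hypothetical stand-in for illustration, not Satori's own segment function:

```typescript
// Hypothetical helper (not Satori's code): split text into grapheme clusters.
function graphemes(text: string): string[] {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' })
  return Array.from(segmenter.segment(text), (s) => s.segment)
}

// U+1FAF8 followed by the skin-tone modifier U+1F3FD.
const clusters = graphemes('🫸🏽')

// The modifier attaches to the preceding base, so the sequence stays whole.
console.log(clusters.length) // 1
```

The skin-tone modifier is treated as an extending character by the grapheme cluster rules, which is why the sequence is not split here even though the base emoji is newer than Unicode 13.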

A possible workaround, in the hope that it helps those experiencing similar issues:

  1. Specify the style wordBreak: 'break-all' or wordBreak: 'keep-all' on the text container that needs to display emoji sequences newer than Unicode 13, such as 🫸🏽.
  2. Supply the emoji images yourself through loadAdditionalAsset or graphemeImages, since the Emoji Providers in the Playground do not support 🪈 or 🫸🏽.
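
The two steps could be combined roughly as follows. This is a sketch under assumptions: toCodePoints and the placeholder SVG are hypothetical, the element uses the plain-object form that Satori accepts in place of JSX, and a real loader would return an actual emoji image (for instance fetched from an emoji set of your choice) rather than a placeholder.

```typescript
// Hypothetical helper: hyphen-joined hex code points of an emoji sequence,
// a naming scheme that some emoji image sets use for their files.
function toCodePoints(emoji: string): string {
  return Array.from(emoji, (ch) => ch.codePointAt(0)!.toString(16)).join('-')
}

// Step 1: set wordBreak on the text container so that segmentation goes
// through Intl.Segmenter instead of linebreak.
const textContainer = {
  type: 'div',
  props: {
    style: { display: 'flex', wordBreak: 'break-all' },
    children: '🫸🏽',
  },
}

// Step 2: hypothetical asset loader. For emoji segments it returns an image
// as a data URI; the SVG below is only a placeholder carrying the code points.
async function loadAdditionalAsset(
  code: string,
  segment: string
): Promise<string | unknown[]> {
  if (code === 'emoji') {
    const svg =
      '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 1 1">' +
      `<title>${toCodePoints(segment)}</title></svg>`
    return `data:image/svg+xml,${encodeURIComponent(svg)}`
  }
  // Assumption: no additional fonts are supplied in this sketch.
  return []
}

// Both would then be passed to satori along with your own size and fonts:
// const svg = await satori(textContainer, { width, height, fonts, loadAdditionalAsset })

console.log(toCodePoints('🫸🏽')) // 1faf8-1f3fd
```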

However, when wordBreak is not specified, Satori still cannot correctly segment the emoji (🫸🏽) in the example. Has replacing the default wordBreak segmentation with Intl.Segmenter been considered? I'm willing to help with further investigation if needed. @shuding
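
One possible direction, sketched here purely as an illustration and not as a proposed patch: keep the existing line breaker, but discard any break opportunity that falls inside a grapheme cluster as reported by Intl.Segmenter. The function names and the rawBreaks positions below are hypothetical; they are not taken from Satori or from actual linebreak output.

```typescript
// Collect the UTF-16 offsets at which grapheme clusters start, plus the end.
function graphemeBoundaries(content: string): Set<number> {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' })
  const boundaries = new Set<number>([content.length])
  for (const { index } of segmenter.segment(content)) {
    boundaries.add(index)
  }
  return boundaries
}

// Keep only the break positions that do not split a grapheme cluster.
function keepSafeBreaks(content: string, breakPositions: number[]): number[] {
  const boundaries = graphemeBoundaries(content)
  return breakPositions.filter((pos) => boundaries.has(pos))
}

const sample = 'a🫸🏽b'
// Hypothetical break positions from an outdated breaker; offset 3 lies
// between the base emoji and its skin-tone modifier.
const rawBreaks = [1, 3, 5, 6]
const safeBreaks = keepSafeBreaks(sample, rawBreaks)

console.log(safeBreaks) // [ 1, 5, 6 ]
```

This leaves the line-breaking rules themselves untouched and only guards against splits inside a cluster, which may be a smaller change than swapping the default segmentation entirely.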
