[Feature request] Chinese Pinyin support #257

Open
lethefrost opened this issue Jul 9, 2023 · 5 comments
Labels
help wanted Extra attention is needed

Comments

@lethefrost

Is your feature request related to a problem? Please describe.

There is another plugin for Obsidian that introduces fuzzy Pinyin search, allowing you to search for Chinese characters by their Pinyin (pronunciation), using either the full spelling or the acronym (initials).

Describe the solution you'd like

I hope we can find a chance to cooperate with the authors of that plugin and integrate this feature into Omnisearch.

Describe alternatives you've considered

Additional context

@scambier
Owner

scambier commented Jul 9, 2023

I know basically nothing about CJK languages, so what does this plugin bring over this one?

I could integrate it into Omnisearch as long as it provides a function that I can plug into the tokenizer

@lethefrost
Author

> I know basically nothing about CJK languages, so what does this plugin bring over this one?
>
> I could integrate it into Omnisearch as long as it provides a function that I can plug into the tokenizer

Hi! Thank you for responding! Let me try to explain a little more about Chinese characters and the writing system, if it could help.

The one you have already integrated is a tokenizer, or word splitter. As you can see, CJK languages do not naturally have spaces between words in their sentences, which is why algorithms that determine word boundaries are helpful in NLP.

For the one I brought up, I want to introduce a little background on the Pinyin system. Each Chinese character can be transcribed or romanized into Latin characters in many different ways; Pinyin is the most popular and widely used method among Chinese speakers. The main idea of Pinyin is to spell out a character's pronunciation in the Latin script (and that's why Pinyin is only used by Chinese users: even though we share many Chinese characters, 汉字/漢字 (Hanzi/Kanji/Hanja), across our writing systems, the same character can be pronounced very differently in Japanese and Korean).

The mapping introduced by Pinyin enables users to input Chinese characters on a standard English keyboard with the help of a tool called an Input Method Editor (IME). You can check the link above for a more formal definition and details. Pinyin isn't the only type of Chinese IME, but it is the one used by most people.

An IME is similar to the auto-complete feature provided in IDEs and code editors. In Chinese, many characters correspond to the same pinyin spelling (either identical in pronunciation or differing only in tone). You can imagine it as having multiple symbols with the same name in a workspace or scope: even if you type the complete symbol name, the editor may still show a dropdown menu so you can select the specific object you are looking for. By the way, a Chinese IME's "dropdown menu" may look different from the auto-suggestions in an editor; it is often displayed horizontally to the right, but it can also resemble a common dropdown menu. The appearance is not the main point; the concept is what matters, and it is indeed quite similar to autocomplete and suggestion systems.

Furthermore, because of this one-to-many mapping, it can be inconvenient to manually pick the desired character out of a massive list of candidates. Chinese users therefore often input a series of pinyin syllables representing the multiple characters that form a word or phrase. The context formed by this input helps the IME provide more intelligent suggestions, similar to the disambiguation algorithms used in English natural language processing. Mainstream IMEs nowadays offer convenient ways to help Chinese users input large amounts of text quickly. For example, when you type just the initial letter(s) or the first few letters of each character's pinyin in a word, the input method can present a dropdown menu with sufficiently intelligent candidate words to choose from.

Allow me to provide an example to illustrate this. 线性代数 is a noun phrase formed by an adjective, 线性 (Xian Xing, which means linear), and a noun, 代数 (Dai Shu, which means algebra). So the pinyin together is Xian Xing Dai Shu; or, if you prefer an analogy to a single symbol in code that the editor could suggest a completion for, you can think of it as CamelCase XianXingDaiShu (though this is not friendly to a tokenizer, because it cannot simply be split on a delimiter character) or snake_case Xian_Xing_Dai_Shu. Then, if you want to search for the characters 线性代数, you can just type xxds, xianxds, xianxingds, xxdais, and so on. This is the example they use in their README.
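To make the idea concrete, here is a rough sketch of how a query like "xxds" or "xianxds" could be matched against the pinyin of 线性代数. This is not that plugin's actual code, just an illustration with a tiny hand-made pinyin table; it treats the query as a concatenation of prefixes of each character's pinyin:

```ts
// Hypothetical lookup table: one pinyin reading per character
// (real characters can have several readings).
const pinyinOf: Record<string, string> = {
  "线": "xian",
  "性": "xing",
  "代": "dai",
  "数": "shu",
};

// Returns true if `query` can be read as a concatenation of non-empty
// prefixes of each character's pinyin, e.g. "xxds", "xianxds", "xxdais".
function matchesPinyinAbbreviation(word: string, query: string): boolean {
  const chars = Array.from(word);
  const q = query.toLowerCase();

  // match(i, j): can chars[i..] consume exactly q[j..]?
  const match = (i: number, j: number): boolean => {
    if (i === chars.length) return j === q.length;
    const full = pinyinOf[chars[i]] ?? "";
    // Try every non-empty prefix of this character's pinyin.
    for (let len = 1; len <= full.length; len++) {
      if (q.startsWith(full.slice(0, len), j) && match(i + 1, j + len)) {
        return true;
      }
    }
    return false;
  };

  return match(0, 0);
}

console.log(matchesPinyinAbbreviation("线性代数", "xxds"));    // true
console.log(matchesPinyinAbbreviation("线性代数", "xianxds")); // true
console.log(matchesPinyinAbbreviation("线性代数", "xxdais"));  // true
console.log(matchesPinyinAbbreviation("线性代数", "xyz"));     // false
```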

The plugin doesn't only support search by pinyin; it also introduces fuzzy search of Chinese characters. For now, Omnisearch's fuzzy search support for CJK characters is not great. It is based on the tokenizer library, and it loses the ability to search for a single character, or for a sequence of characters that doesn't form a lexical entry or that crosses token boundaries. Many occurrences of the exact query string in the searched text are not retrieved at all, and some that are found show no highlight in the dropdown context strings; it seems that only full tokens produced by the library get highlighted. I am wondering whether it would be possible to traverse the full text and index it word by word (or, in a CJK context, character by character)? Would that significantly reduce indexing performance? Because of this tokenization-based search, Omnisearch finds far fewer matching results than the built-in global search for the same query.
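Just to illustrate what I mean by character-level indexing (a naive sketch, not a proposal for the actual implementation): emitting every character and every adjacent pair would let queries that cross word boundaries still hit indexed tokens.

```ts
// Sketch: emit every single CJK character and every adjacent pair (bigram),
// so queries that cross word boundaries can still match indexed tokens.
function cjkCharTokens(text: string): string[] {
  const isCjk = (ch: string) => /\p{Script=Han}/u.test(ch);
  const tokens: string[] = [];
  const chars = Array.from(text);
  for (let i = 0; i < chars.length; i++) {
    if (!isCjk(chars[i])) continue;
    tokens.push(chars[i]); // unigram
    if (i + 1 < chars.length && isCjk(chars[i + 1])) {
      tokens.push(chars[i] + chars[i + 1]); // bigram
    }
  }
  return tokens;
}

console.log(cjkCharTokens("线性代数"));
// ["线", "线性", "性", "性代", "代", "代数", "数"]
// The bigram "性代" crosses the 线性 | 代数 word boundary,
// so a query containing it can still be matched.
```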

Thank you so much again for your amazing plugin! If I wasn't able to explain anything clearly, please let me know; I would love to clarify further, or to join the coding work if you need.

@scambier
Owner

Thanks for that very complete answer :)

From what I see and understand, the plugin obsidian-fuzzy-chinese implements its own indexing logic that's quite different from Omnisearch's tokenizer.

Omnisearch works by tokenizing (basically splitting into words) each document, and then indexing all tokens from each document. When you type a query in the search box, the same tokenization process happens; Omnisearch compares the search tokens with the indexed tokens and returns a sorted list of results.
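In very simplified pseudocode (illustrative names only, not the actual Omnisearch implementation), the flow is roughly:

```ts
// Simplified sketch of the index-then-search flow.
type Index = Map<string, Set<string>>; // token -> set of document paths

function indexDocument(
  index: Index,
  path: string,
  content: string,
  tokenize: (s: string) => string[]
) {
  for (const token of tokenize(content)) {
    if (!index.has(token)) index.set(token, new Set());
    index.get(token)!.add(path);
  }
}

function search(index: Index, query: string, tokenize: (s: string) => string[]): string[] {
  const hits = new Map<string, number>(); // path -> number of matching tokens
  for (const token of tokenize(query)) {
    for (const path of index.get(token) ?? []) {
      hits.set(path, (hits.get(path) ?? 0) + 1);
    }
  }
  // Sort by number of matched tokens (the real scoring is more involved).
  return [...hits.entries()].sort((a, b) => b[1] - a[1]).map(([path]) => path);
}
```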

If we want to integrate obsidian-fuzzy-chinese into Omnisearch to take advantage of its smart weighting logic, we still have to convert each document into an array of tokens. There's no way around that 🤷‍♂️

If obsidian-fuzzy-chinese can provide an API similar to cm-chs-patch's, to a) tokenize documents and b) convert a search like "xianxds" into "线性代数", then let's go. Otherwise, I think it would be quite a big refactor, because I need tokens for the smart weighting.

Note that a single word can spawn several tokens; e.g. "CamelCase" is tokenized into "Camel", "Case", and "CamelCase"
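For illustration only (not the actual tokenizer code), that CamelCase case looks roughly like this:

```ts
// Sketch: one word can yield several tokens,
// e.g. "CamelCase" -> ["Camel", "Case", "CamelCase"].
function camelCaseTokens(word: string): string[] {
  const parts = word.split(/(?=[A-Z])/).filter(Boolean);
  return parts.length > 1 ? [...parts, word] : [word];
}

console.log(camelCaseTokens("CamelCase")); // ["Camel", "Case", "CamelCase"]
console.log(camelCaseTokens("plain"));     // ["plain"]
```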

@alexinea

alexinea commented Sep 23, 2023

For example, the Chinese writing of Obsidian is "黑曜石", and its pinyin is "Hei Yao Shi". So I can match it by entering "hys".

I think you can refer to this project: https://github.com/lazyloong/obsidian-fuzzy-chinese.

@scambier

I am looking forward to being pleasantly surprised one day. :)


Sorry, I just saw that you already noticed the obsidian-fuzzy-chinese plugin. 🫨🫨

@scambier
Owner

scambier commented Nov 1, 2023

So. From what I understand, obsidian-fuzzy-chinese could work with cm-chs-patch, which is used in Omnisearch's tokenizer.

The basic search flow would be:

  • Type query with pinyin "words"
  • Give the pinyin query to obsidian-fuzzy-chinese, which converts it into a list of Chinese words
  • If needed, use cm-chs-patch to further tokenize this list
  • Use the resulting tokens in Omnisearch

Is that correct?

I'd be happy to work on this feature, but I need an API from obsidian-fuzzy-chinese; at least a function to convert pinyin to Chinese.
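For reference, the kind of API I'd need would look roughly like this. The names below are hypothetical; obsidian-fuzzy-chinese does not currently expose them:

```ts
// Hypothetical API surface -- not an existing obsidian-fuzzy-chinese export.
interface PinyinApi {
  // "xianxds" -> ["线性代数", ...] (candidate Chinese words for a pinyin query)
  pinyinToChinese(query: string): string[];
}

// Hypothetical integration point in Omnisearch's query pre-processing.
function expandQuery(
  query: string,
  api: PinyinApi,
  tokenize: (s: string) => string[] // e.g. the cm-chs-patch-based tokenizer
): string[] {
  const candidates = api.pinyinToChinese(query);
  // Tokenize the candidates and merge them with the original query tokens.
  return [...tokenize(query), ...candidates.flatMap(tokenize)];
}
```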

@scambier scambier added the help wanted Extra attention is needed label Apr 19, 2024