[Bug]: Issues with generating new dictionaries using cspell-tools #6379

Open
1 task done
gothrek22 opened this issue Oct 16, 2024 · 6 comments
Comments

@gothrek22

Kind of Issue

Runtime - command-line tools, Building / Compiling

Tool or Library

cspell-tools

Version

8.14.4 and 8.15.2 for cspell-tools-cli

Supporting Library

No response

OS

Other

OS Version

Doesn't really matter

Description

Thanks for the great software.

I've been trying to help out by converting the Hunspell Korean dictionary into a cspell-compatible source. But no matter what I try when running the conversion, I get core dumps.

That is almost certainly caused by the size of the dictionary (11 MB for the .aff, 44 MB for the .dic). I've tried raising Node's max old space size to 60 GB (I have 64 GB available right now), and it still dies. Any idea how I could split this job into chunks, so it runs longer but doesn't die?

I'm reporting this as a bug because it appears to load everything into memory at once and process it there, which causes it to run out (and it would probably keep running out until given some ludicrous amount of memory).
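
One way the chunking idea could be tried, as an untested sketch and not something cspell-tools is known to support directly: a Hunspell .dic file is a count line followed by one entry per line, so it can be cut into smaller .dic files that are each processed separately against the same .aff. The helper name `split_dic`, the chunk size, and the UTF-8 encoding are assumptions. Nothing in this thread confirms that chunking avoids the out-of-memory kill, since a single entry may still expand into a very large number of forms.

```python
from pathlib import Path


def split_dic(dic_path: str, out_dir: str, entries_per_chunk: int = 50_000) -> list[Path]:
    """Split a Hunspell .dic file into smaller .dic files (hypothetical helper).

    Assumes the usual layout: the first line is the entry count and every
    following non-empty line is one entry. Assumes the file is UTF-8.
    """
    src = Path(dic_path)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    chunks: list[Path] = []
    buffer: list[str] = []

    def flush() -> None:
        if not buffer:
            return
        target = out / f"{src.stem}.part{len(chunks):04d}.dic"
        # Each chunk gets its own count header so it is a valid .dic file.
        target.write_text(f"{len(buffer)}\n" + "\n".join(buffer) + "\n", encoding="utf-8")
        chunks.append(target)
        buffer.clear()

    with src.open(encoding="utf-8") as handle:
        next(handle, None)  # skip the original count line
        for line in handle:  # stream line by line instead of reading the whole file
            entry = line.rstrip("\n")
            if not entry:
                continue
            buffer.append(entry)
            if len(buffer) >= entries_per_chunk:
                flush()
    flush()  # write whatever is left over
    return chunks
```

Each chunk would presumably need a copy of the .aff file alongside it under the same base name before being run through the conversion, with the resulting word lists concatenated afterwards.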

Steps to Reproduce

No response

Expected Behavior

No response

Additional Information

No response

cspell.json

No response

cspell.config.yaml

No response

Example Repository

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@Jason3S
Collaborator

Jason3S commented Oct 16, 2024

@gothrek22,

Thank you for trying.

Some dictionaries are very complicated and include nested compound rules.

Can you share some more information:

  • The hunspell source.
  • Did you use the script on cspell-dicts or are you running cspell-tools directly?
  • Do you have a cspell-tools.config.yaml file? Example: Basque/cspell-tools.config.yaml
    • maxDepth is used to limit nested compound rules. If 1 isn't working, try 0.
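
To illustrate what a depth limit on nested rules does, here is a toy sketch. It is not the cspell-tools or hunspell-reader implementation: the `expand` function, the `RULES` table, and the flag names are all made up for illustration. The point is only that derived forms can carry flags of their own, so every extra level of nesting produces more output.

```python
from typing import Iterator

# Hypothetical rules: flag -> list of (suffix, flags carried by the derived form).
RULES: dict[str, list[tuple[str, str]]] = {
    "A": [("s", "B"), ("ed", "B")],
    "B": [("ly", "A"), ("ness", "")],
}


def expand(word: str, flags: str, max_depth: int) -> Iterator[str]:
    """Yield `word` plus every form reachable within `max_depth` nested rules."""
    yield word
    if max_depth <= 0:
        return  # depth exhausted: do not apply any further rules
    for flag in flags:
        for suffix, next_flags in RULES.get(flag, []):
            # The derived form carries its own flags, which is what makes rules nest.
            yield from expand(word + suffix, next_flags, max_depth - 1)


# Count how many forms one root produces at each depth limit.
for depth in range(4):
    print(depth, sum(1 for _ in expand("root", "A", depth)))
```

With a depth of 0 only the root word itself comes out. Because the two made-up flags refer to each other, the count keeps growing as the limit is raised and never settles.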

@gothrek22
Author

gothrek22 commented Oct 16, 2024

@Jason3S I've used the one that's packaged by Fedora, which is this one: https://github.com/spellcheck-ko/hunspell-dict-ko

There is also: https://github.com/wooorm/dictionaries/tree/main/dictionaries/ko

I've set up the cspell config like so:

  - name: ko
    sources:
      - ko_KR.aff
    format: trie3
    generateNonStrict: true

Will try maxDepth in a sec.

I've tried installing cspell-tools globally and using that directly. I also tried hunspell-reader. Same result either way.

@gothrek22
Author

Tried just now with this config:

---
targets:
  - name: ko
    sources:
      - ko_KR.aff
    format: trie3
    generateNonStrict: true
    maxDepth: 0

NODE_OPTIONS="--max_old_space_size=30720" cspell-tools-cli build

Still got an OOM Kill. :(

@Jason3S
Collaborator

Jason3S commented Oct 17, 2024

@gothrek22,

That means applying the rules is causing something to break.

The result is a limited dictionary, so it is not ideal, but hunspell-reader can produce a basic word list without applying the rules.

Like this:

hunspell-reader words --no-transform ko_KR.aff -o ko-words.txt

---
targets:
  - name: ko
    sources:
      - ko-words.txt
    format: trie3
    generateNonStrict: true
    maxDepth: 0

Do you have a link to ko_KR.aff/dic you are using?

@Jason3S
Collaborator

Jason3S commented Oct 17, 2024

Do you have a link to ko_KR.aff/dic you are using?

I just noticed that you included it in a previous comment.

@gothrek22
Author

Yep, basic dict generated properly. I'm guessing that the issue with compounding rules is that words in Korean can get weirdly complex.

As in, the root word can be pluralized (in different ways), conjugated on top of that, and potentially given additional suffixes, which can turn a single four-radical root word into tens of permutations.
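
As a rough back-of-the-envelope illustration of that multiplication: the slot names, the placeholder suffixes, and the slot sizes below are invented, and are neither real Korean morphology nor figures taken from the actual dictionary.

```python
from itertools import product
from math import prod
from typing import Iterator

# Hypothetical suffix slots; "" means the slot is left empty.
SLOTS = {
    "plural": ["", "-P1", "-P2"],
    "conjugation": ["", "-C1", "-C2", "-C3", "-C4"],
    "extra": ["", "-X1", "-X2", "-X3"],
}


def surface_forms(root: str) -> Iterator[str]:
    """Lazily yield every combination of one choice per slot."""
    for combo in product(*SLOTS.values()):
        yield root + "".join(combo)


# The number of forms per root is the product of the slot sizes.
forms_per_root = prod(len(options) for options in SLOTS.values())
print(forms_per_root)  # 3 * 5 * 4 = 60
```

Multiply that by the number of root entries in the .dic file and the full expansion can outgrow memory if it is materialized all at once, whereas a generator like `surface_forms` produces one form at a time.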

@Jason3S I'll try to link that to cspell and test it on some of the content I have and get back to you ASAP. Thank you for your help too mate 👍
