Is there any limit for the vocab size (#types)? #14
The code fails (with a "core dump: segmentation fault" message) when I run it on a huge text file (about 20M types, 14 GB). I have already used wcluster on several files with far fewer types and it worked well.

Is there any limit on the vocabulary size (#types)?

Comments
I'm not sure what the exact limit is, but I'm not surprised that it failed.
I noticed that a new commit was pushed at the end of March. The commit is labeled "Enable >= 2^31 tokens in input data", so I thought it would have addressed the issue raised here. However, I still ran into a problem similar to the one rasoolims mentioned: I can run the code successfully only on a file containing 10M tokens (700K types); with bigger files it fails with "core dump: segmentation fault". Thanks.
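For anyone else hitting this wall, here is a minimal sketch of the failure mode that commit title points at, assuming token counts or offsets were previously held in 32-bit signed integers. It illustrates the general overflow problem only; it is not the actual wcluster code or patch.

```cpp
// Illustration only (not wcluster code): why a corpus with >= 2^31 tokens
// breaks 32-bit signed indexing and can lead to out-of-bounds reads/segfaults.
#include <cstdint>
#include <iostream>

int main() {
    const int64_t num_tokens = 3000000000LL;  // ~3 billion tokens, > 2^31 - 1

    // Narrowing to a 32-bit signed counter wraps to a negative value on
    // typical two's-complement platforms, so any array offset derived from
    // it points far outside the allocation.
    int32_t narrow = static_cast<int32_t>(num_tokens);
    std::cout << "token count as int32_t: " << narrow << "\n";  // negative

    // A 64-bit type (int64_t / size_t) keeps the count and offsets correct.
    std::cout << "token count as int64_t: " << num_tokens << "\n";
    return 0;
}
```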
Did you try using the flag to restrict the vocabulary?
Do you mean the min-occur flag?
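For anyone who wants the same effect independently of the tool's options, below is a rough sketch of pre-filtering a corpus by minimum word frequency before clustering. The standalone utility is hypothetical and not part of wcluster; it just approximates a minimum-occurrence cutoff.

```cpp
// Hypothetical pre-filter: replace rare word types with <unk> before clustering,
// which bounds the vocabulary size the clustering tool has to handle.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

int main(int argc, char** argv) {
    if (argc != 4) {
        std::cerr << "usage: " << argv[0] << " <input.txt> <output.txt> <min_count>\n";
        return 1;
    }
    const long min_count = std::stol(argv[3]);

    // Pass 1: count how often each whitespace-separated token occurs.
    std::unordered_map<std::string, long> counts;
    {
        std::ifstream in(argv[1]);
        std::string tok;
        while (in >> tok) ++counts[tok];
    }

    // Pass 2: rewrite the corpus, mapping rare types to a single <unk> token.
    std::ifstream in(argv[1]);
    std::ofstream out(argv[2]);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream iss(line);
        std::string tok;
        bool first = true;
        while (iss >> tok) {
            if (!first) out << ' ';
            out << (counts[tok] >= min_count ? tok : "<unk>");
            first = false;
        }
        out << '\n';
    }
    return 0;
}
```

With roughly 20M types the count table still fits comfortably in memory, and because rare types dominate large vocabularies, even a small cutoff typically shrinks the vocabulary dramatically.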
I know this is late and probably not relevant to the OP anymore, but for anyone else facing the same issue, this PR fixed it for me.