Create words_alpha_clean.txt #108
base: master
Conversation
Add file `words_alpha_clean.txt`, a copy of `words_alpha.txt` from which words that do not exist in English have been removed. The filtering was done with the API of [wordsapi](https://www.wordsapi.com/), which allows looking up English words: from a script I called the API for each word, and when a word did not exist I removed it from the file. You can find the API docs [here](https://www.wordsapi.com/docs/). The exact filter for a word is based on the `frequency` data of the API:

```javascript
if (!!response.word && typeof response.frequency == "object") {
  if (response.frequency.perMillion >= 15) {
    // here the word is not removed
    realWords.push(response.word);
  }
  // else the word is removed
}
```

The documentation gives the text below for the [frequency](https://www.wordsapi.com/docs/#frequency) data:

> This is the number of times the word is likely to appear in any English corpus, per million words.
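The frequency check above can be sketched as a pure function. The original script is JavaScript; this is a minimal Python equivalent that assumes the WordsAPI response has already been parsed into a dict (the name `is_real_word` and the sample responses are mine, for illustration only):

```python
def is_real_word(response: dict, threshold: float = 15.0) -> bool:
    """Mirror the JS filter: keep a word only when the response has a
    'word' field and a frequency object with perMillion >= threshold."""
    if not response.get("word"):
        return False
    frequency = response.get("frequency")
    # The JS code requires typeof response.frequency == "object",
    # so a missing or scalar frequency means the word is dropped.
    if not isinstance(frequency, dict):
        return False
    return frequency.get("perMillion", 0.0) >= threshold

# Fabricated response shapes, just to show the three branches:
common = {"word": "house", "frequency": {"perMillion": 512.7}}    # kept
rare = {"word": "absquatulate", "frequency": {"perMillion": 0.01}}  # dropped
no_freq = {"word": "zzzz"}                                        # dropped
```

With `threshold=15` as in the PR, only fairly common words survive, which is consistent with the small output file discussed below.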
Nice work, but from 350,000+ lines only around 2,500 survived? Seems like the parameters used were a little too strict...
I used a less strict filter with
The API is free for 2500 words per day. That is probably why...
@Orivoir did get
maybe it kills too many words. for example: white lives matter, too [:joke:]
Hi all, I have run words_alpha.txt through the "nltk" Python library. The total is 210,693 words. This seems to be a bit better, but I have noticed there are still a few oddities in there (maybe things like common abbreviations remain, which aren't actual words). But overall I think this has cleaned out the non-English words.
@SDidge appreciate the share!
@SDidge At first glance I can't seem to find any non-English words in the file, so I'd say this one is the cleanest file so far, nice work!
@Timokasse, what exactly did you use from the NLTK library to check the list of words?
@Timokasse, I just checked whether each word exists in the "words" corpus. E.g. `from nltk.corpus import words`, then `[word for word in words_alpha if word in words.words()]`. Something like this.