Modified preprocess.py to accept syllabic prediction... #64
base: master
Conversation
…ocumentation accordingly.
Sounds great! Will check it out. It just might be the perfect compromise to make a few of my small datasets (300-1000 KB) yield acceptable results.
Hi, I had no problem installing PyHyphen with pip as root (inside a docker).
Yes, that's probably the best option, because NLTK is well maintained, while PyHyphen seems to be abandoned and partially broken. I've tried to install the dictionaries manually by downloading them from the LibreOffice git and pointing the path variable in config.py to the directory with the dicts, but it doesn't seem to work. I installed it on a Gentoo system and have no idea why it ends up broken when installed with pip as root.
I've modified your preprocessing module to work with another Python hyphenation library, Pyphen, instead of PyHyphen. It has dictionaries built in, so it avoids the problems I had with PyHyphen. It looks fine as far as I can tell; I just need to check whether the preprocessed datasets train into anything sensible. If anyone is interested, I could share my changes. Because I mostly train networks on non-English texts with occasional English words, I thought it would make sense to use two hyphenators, one for the specified language and one for en_US as a fallback; otherwise the hyphenator would fail to split the syllables of one of the two languages. In the end, the script picks whichever hyphenator returns the list with the most items as the basis for the syllabic splitting. For English texts, it uses a single hyphenator. A rough sketch of the idea is below.
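A minimal sketch of that two-hyphenator selection, assuming Pyphen's documented `Pyphen(lang=...)` constructor and `inserted()` method; the language codes and the helper name `syllables_for` are only illustrative, not the actual code from the modified script:

```python
# Sketch only: pick the finer of two hyphenations (corpus language vs. en_US fallback).
import pyphen

primary = pyphen.Pyphen(lang='ru_RU')   # language of the corpus (example)
fallback = pyphen.Pyphen(lang='en_US')  # fallback for occasional English words

def syllables_for(word):
    """Split a word with both hyphenators and keep the split with the most pieces."""
    candidates = []
    for dic in (primary, fallback):
        # inserted() marks hyphenation points with '-', e.g. 'hy-phen-ation'
        candidates.append(dic.inserted(word).split('-'))
    # The hyphenator that knows the word's language usually finds more break
    # points, so take the candidate list with the most items.
    return max(candidates, key=len)

print(syllables_for('hyphenation'))  # e.g. ['hy', 'phen', 'ation']
```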
Well, that worked quite fine. The network converges into readability really quickly, even on really small datasets. I suspect it might pick up on features more poorly, since the dataset is smaller, but it works for me, because I use torch-rnn mostly for fun. Here's my modified preprocess.py. Update: there's still a problem: the sampler is not aware of the syllabic splitting and it will fail to pre-seed with
Maybe there should be a workaround like this:

```diff
diff --git a/scripts/preprocess.py b/scripts/preprocess.py
index 4881bca..6e13359 100644
--- a/scripts/preprocess.py
+++ b/scripts/preprocess.py
@@ -63,7 +63,7 @@ if __name__ == '__main__':
       space = False
       continue
     if len(word)>0 :
-      syls = separator.syllables(word.lower())
+      syls = separator.syllables(word.lower()[:80])
       if len(syls) == 0 :
         syls = [ word.lower() ]
       word = ''
```
`dict_info` no longer exists in pyhyphen. Instead, language packs are downloaded on the fly. This upgrade should be compatible with both the old and the new versions of pyhyphen.
Upgrade hyphenation to the latest version of pyhyphen
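A minimal sketch of what such a dual-compatibility shim could look like; it assumes older PyHyphen releases expose `hyphen.dictools` with `install`/`is_installed` while newer ones fetch the language pack automatically when a `Hyphenator` is created, and it is only an illustration, not the code from the commit:

```python
# Sketch only: support both old PyHyphen (explicit dictionary install)
# and new PyHyphen (dictionaries downloaded on the fly).
from hyphen import Hyphenator

try:
    # Older PyHyphen: dictionaries must be installed explicitly.
    from hyphen.dictools import install, is_installed
except ImportError:
    # Newer PyHyphen: no dictools module; Hyphenator handles downloads itself.
    install = is_installed = None

def make_hyphenator(lang='en_US'):
    if install is not None and not is_installed(lang):
        install(lang)
    # With recent PyHyphen this call alone fetches the dictionary if needed.
    return Hyphenator(lang)

separator = make_hyphenator('en_US')
```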
I'm playing with torch-rnn to do computational poetry (didn't Turing's own interest in AI start with that?!), and I found that the letter-by-letter predictor requires really huge corpora (e.g., Shakespeare) to even start making sense, while the word-by-word predictor has limitations of its own. The syllable predictor converges quickly to something that... sounds right, even when it means nothing. It might be an interesting compromise between vocabulary size and amount of context for other explorations. The syllabic separation is based on PyHyphen, which uses LibreOffice's hyphenation dictionaries.
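For reference, a minimal sketch of the kind of syllable splitting PyHyphen provides; the words and language code are just examples, and the exact output depends on the installed LibreOffice dictionary:

```python
# Illustrative only: split words into syllables with PyHyphen's Hyphenator,
# which is backed by LibreOffice hyphenation dictionaries.
from hyphen import Hyphenator

h = Hyphenator('en_US')
for word in ['computational', 'poetry']:
    syls = h.syllables(word)
    # Words the dictionary cannot split come back as an empty list,
    # so fall back to the whole word (as the modified preprocess.py does).
    if not syls:
        syls = [word]
    print(word, '->', syls)
```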