
Modified preprocess.py to accept syllabic prediction... #64

Open: wants to merge 3 commits into master

Conversation

@dreavjr commented Apr 11, 2016

I'm playing with torch-rnn to do computational poetry (didn't Turing's own interest in AI start with that?!) and I found that the letter-by-letter predictor requires really huge corpora (e.g., Shakespeare) to even start making sense, while the word-by-word predictor has limitations of its own. The syllable predictor converges quickly to something that... sounds right, even when it means nothing. It might be an interesting compromise between vocabulary size and amount of context for other explorations. The syllabic separation is based on PyHyphen, which uses LibreOffice's hyphenation dictionaries.
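For reference, a minimal sketch of the splitting this patch relies on, assuming PyHyphen with the en_US dictionary installed (the empty-result fallback mirrors what the modified preprocess.py does):

from hyphen import Hyphenator

separator = Hyphenator('en_US')
word = 'computational'
syls = separator.syllables(word.lower())
if len(syls) == 0:
    # Short or unknown words come back as an empty list; keep the whole word.
    syls = [word.lower()]
print(syls)  # e.g. ['com', 'pu', 'ta', 'tion', 'al']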

@ostrosablin

Sounds great! Will check it out. It just might be the perfect compromise to make a few of my small datasets (300-1000 KB) yield acceptable results.

@dreavjr (Author) commented Apr 14, 2016

Hi, I had no problem installing PyHyphen with pip as root (inside a Docker container), but I talked to a colleague and he convinced me we should dump PyHyphen altogether and move to NLTK (http://www.nltk.org/). I'm looking forward to attempting it.
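If that move happens, newer NLTK releases ship a sonority-based syllable tokenizer; a hedged sketch, assuming a recent NLTK (SyllableTokenizer did not exist in 2016-era releases):

from nltk.tokenize import SyllableTokenizer

ssp = SyllableTokenizer()
print(ssp.tokenize('justification'))  # ['jus', 'ti', 'fi', 'ca', 'tion']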



On 12 April 2016 at 06:01, Ostrosablin Vitaly wrote:

> I'm having trouble setting up the pyhyphen module. I've installed it via
> pip; according to the docs it should autoconfigure itself (if I understood
> correctly), but it doesn't set up config.py with proper values for the
> repository, etc. As a result, it gives a 404 when attempting to install
> the dicts. The config had the placeholder value $repo for the repository.
> I've tried replacing it with
> https://cgit.freedesktop.org/libreoffice/dictionaries/plain/dictionaries
> but it still doesn't download the dicts.



@ostrosablin

Yes, that's probably the best option, because NLTK is well maintained, while PyHyphen seems to be abandoned and partially broken.

I've tried to install the dictionaries manually by downloading them from the LibreOffice git and pointing the path variable in config.py to the directory with the dicts, but it doesn't seem to work. I installed it on a Gentoo system; I have no idea why it installs broken with pip as root.

@ostrosablin

I've modified your preprocessing module to make it work with another Python hyphenation library, Pyphen, instead of PyHyphen. It has its dictionaries built in, so it has none of the problems I had with PyHyphen. It looks fine as far as I can tell; I just need to check whether the preprocessed datasets train into anything sensible. If anyone is interested, I could share my changes.

Because I mostly train networks on non-English texts with occasional English words, I thought it would make sense to use two hyphenators: one for the specified language and one for en_US as a fallback. In the end, the script selects the list with the most items as the basis for the syllabic splitting, because otherwise the hyphenator would fail to split the syllables of one of the two languages. For English texts, it uses a single hyphenator. (A sketch of the idea follows.)
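A hedged sketch of that two-hyphenator fallback using Pyphen (the ru_RU code is only an illustrative choice for the corpus language):

import pyphen

primary = pyphen.Pyphen(lang='ru_RU')   # main corpus language (illustrative)
fallback = pyphen.Pyphen(lang='en_US')  # catches the occasional English word

def syllables(word):
    # Pyphen returns the word with hyphens inserted at break points.
    a = primary.inserted(word).split('-')
    b = fallback.inserted(word).split('-')
    # Keep whichever hyphenator actually managed to split the word.
    return a if len(a) >= len(b) else b

print(syllables('hyphenation'))  # e.g. ['hy', 'phen', 'ation']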

@ostrosablin commented Jun 20, 2016

Well, that worked quite well. The network converges into readability really quickly, and on really small datasets. I suspect it might catch on to features more poorly, since the dataset is smaller, but it works for me, because I use torch-rnn mostly for fun.

Here's my modified preprocess.py.

Update: there's still a problem in that the sampler is not aware of the syllabic splitting, so it will fail to pre-seed with -start_text. It's difficult to do anything about that, because the sampler is written in Lua; a rough Python-side check is sketched below.
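As a sketch: split the -start_text with the same hyphenator used at preprocessing time and verify every syllable exists in the generated vocabulary (the token_to_idx key follows torch-rnn's vocabulary JSON; the file name is a placeholder):

import json

from hyphen import Hyphenator

with open('data/my_dataset.json') as f:
    token_to_idx = json.load(f)['token_to_idx']

separator = Hyphenator('en_US')
tokens = []
for word in 'shall i compare thee'.split():
    syls = separator.syllables(word)
    tokens.extend(syls if len(syls) > 0 else [word])

missing = [t for t in tokens if t not in token_to_idx]
print('start tokens:', tokens)
print('missing from vocab:', missing)  # any hit here would break pre-seeding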

@vi commented Aug 18, 2016

ValueError: Word to be hyphenated may have at most 100 characters.

Maybe there should be a workaround like this:

diff --git a/scripts/preprocess.py b/scripts/preprocess.py
index 4881bca..6e13359 100644
--- a/scripts/preprocess.py
+++ b/scripts/preprocess.py
@@ -63,7 +63,7 @@ if __name__ == '__main__':
                   space = False
                   continue
               if len(word)>0 :
-                  syls = separator.syllables(word.lower())
+                  syls = separator.syllables(word.lower()[:80])
                   if len(syls) == 0 :
                     syls = [ word.lower() ]
                   word = ''
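An alternative sketch that hyphenates over-long words in chunks instead of truncating, so no characters are silently dropped (the 80-character chunk size is just a safe margin under PyHyphen's 100-character limit):

def safe_syllables(separator, word, chunk=80):
    # PyHyphen raises ValueError for words over 100 characters, so
    # hyphenate fixed-size chunks and concatenate the results.
    syls = []
    for i in range(0, len(word), chunk):
        part = word[i:i + chunk]
        s = separator.syllables(part)
        syls.extend(s if len(s) > 0 else [part])
    return syls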

regisb and others added 2 commits September 15, 2017

Upgrade hyphenation to the latest version of pyhyphen: dict_info no longer exists in pyhyphen; instead, language packs are downloaded on the fly. This upgrade should be compatible with both the old and the new versions of pyhyphen.
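A hedged sketch of what such a dual-version shim could look like (the dictools helpers are PyHyphen's old explicit-install API; treat the exact names as an assumption):

from hyphen import Hyphenator

try:
    # Old pyhyphen: hyphenation dictionaries must be installed explicitly.
    from hyphen.dictools import install, is_installed
    if not is_installed('en_US'):
        install('en_US')
except ImportError:
    # New pyhyphen: language packs are downloaded on the fly when the
    # Hyphenator is instantiated, so there is nothing to do here.
    pass

separator = Hyphenator('en_US')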