
Modified preprocess.py to accept syllabic prediction... #64

Open: wants to merge 3 commits into master

Conversation

@dreavjr commented Apr 11, 2016

I'm playing with torch-rnn to do computational poetry (didn't Turing's own interest in AI start with that?!) and I found that the letter-by-letter predictor requires really huge corpora (e.g., Shakespeare) to even start making sense, while the word-by-word predictor has limitations of its own. The syllable predictor converges quickly to something that... sounds right, even when it means nothing. It might be an interesting compromise between vocabulary size and amount of context for other explorations. The syllabic separation is based on PyHyphen, which uses LibreOffice's hyphenation dictionaries.
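For reference, a minimal sketch of the splitting this patch relies on, assuming PyHyphen with the en_US dictionary installed (the empty-result fallback mirrors what the modified preprocess.py does):

from hyphen import Hyphenator

separator = Hyphenator('en_US')
word = 'computational'
syls = separator.syllables(word.lower())
if len(syls) == 0:
    # Short or unknown words come back as an empty list; keep the whole word.
    syls = [word.lower()]
print(syls)  # e.g. ['com', 'pu', 'ta', 'tion', 'al']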

@ostrosablin

Sounds great! Will check it out. It just might be the perfect compromise to make a few of my small datasets (300-1000 KB) yield acceptable results.

@dreavjr (Author) commented Apr 14, 2016

Hi, I had no problem installing PyHyphen with pip as root (inside a Docker container), but I talked to a colleague and he convinced me we should dump PyHyphen altogether and move to NLTK (http://www.nltk.org/). I'm looking forward to attempting it.
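If that move happens, newer NLTK releases ship a sonority-based syllable tokenizer; a hedged sketch, assuming a recent NLTK (SyllableTokenizer did not exist in 2016-era releases):

from nltk.tokenize import SyllableTokenizer

ssp = SyllableTokenizer()
print(ssp.tokenize('justification'))  # ['jus', 'ti', 'fi', 'ca', 'tion']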



On 12 April 2016 at 06:01, Ostrosablin Vitaly wrote:

> I'm having trouble setting up the pyhyphen module. I've installed it via
> pip; according to the docs it should autoconfigure itself (if I understood
> correctly), but it doesn't set up config.py with proper values for the
> repository, etc. As a result, it gives a 404 when attempting to install
> the dicts. The config had the placeholder value $repo for the repository.
> I've tried replacing it with
> https://cgit.freedesktop.org/libreoffice/dictionaries/plain/dictionaries
> but it still doesn't download the dicts.



@ostrosablin

Yes, that's probably the best option, because NLTK is well maintained, while PyHyphen seems to be abandoned and partially broken.

I've tried to install the dictionaries manually by downloading them from the LibreOffice git and pointing the path variable in config.py to the directory with the dicts, but it doesn't seem to work. I installed it on a Gentoo system; I have no idea why it installs broken with pip as root.

@ostrosablin

I've modified your preprocessing module to make it work with another Python hyphenation library, Pyphen, instead of PyHyphen. It has its dictionaries built in, so it has none of the problems I had with PyHyphen. It looks fine as far as I can tell; I just need to check whether the preprocessed datasets train into anything sensible. If anyone is interested, I could share my changes.

Because I mostly train networks on non-English texts with occasional English words, I thought it would make sense to use two hyphenators: one for the specified language and one for en_US as a fallback. In the end, the script selects the list with the most items as the basis for the syllabic splitting, because otherwise the hyphenator would fail to split the syllables of one of the two languages. For English texts, it uses a single hyphenator. (A sketch of the idea follows.)
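A hedged sketch of that two-hyphenator fallback using Pyphen (the ru_RU code is only an illustrative choice for the corpus language):

import pyphen

primary = pyphen.Pyphen(lang='ru_RU')   # main corpus language (illustrative)
fallback = pyphen.Pyphen(lang='en_US')  # catches the occasional English word

def syllables(word):
    # Pyphen returns the word with hyphens inserted at break points.
    a = primary.inserted(word).split('-')
    b = fallback.inserted(word).split('-')
    # Keep whichever hyphenator actually managed to split the word.
    return a if len(a) >= len(b) else b

print(syllables('hyphenation'))  # e.g. ['hy', 'phen', 'ation']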

@ostrosablin commented Jun 20, 2016

Well, that worked quite well. The network converges into readability really quickly, and on really small datasets. I suspect it might catch on to features more poorly, since the dataset is smaller, but it works for me, because I use torch-rnn mostly for fun.

Here's my modified preprocess.py.

Update: there's still a problem in that the sampler is not aware of the syllabic splitting, so it will fail to pre-seed with -start_text. It's difficult to do anything about that, because the sampler is written in Lua; a rough Python-side check is sketched below.
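As a sketch: split the -start_text with the same hyphenator used at preprocessing time and verify every syllable exists in the generated vocabulary (the token_to_idx key follows torch-rnn's vocabulary JSON; the file name is a placeholder):

import json

from hyphen import Hyphenator

with open('data/my_dataset.json') as f:
    token_to_idx = json.load(f)['token_to_idx']

separator = Hyphenator('en_US')
tokens = []
for word in 'shall i compare thee'.split():
    syls = separator.syllables(word)
    tokens.extend(syls if len(syls) > 0 else [word])

missing = [t for t in tokens if t not in token_to_idx]
print('start tokens:', tokens)
print('missing from vocab:', missing)  # any hit here would break pre-seeding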

@vi commented Aug 18, 2016

ValueError: Word to be hyphenated may have at most 100 characters.

Maybe there should be a workaround like this:

diff --git a/scripts/preprocess.py b/scripts/preprocess.py
index 4881bca..6e13359 100644
--- a/scripts/preprocess.py
+++ b/scripts/preprocess.py
@@ -63,7 +63,7 @@ if __name__ == '__main__':
                   space = False
                   continue
               if len(word)>0 :
-                  syls = separator.syllables(word.lower())
+                  syls = separator.syllables(word.lower()[:80])
                   if len(syls) == 0 :
                     syls = [ word.lower() ]
                   word = ''
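An alternative sketch that hyphenates over-long words in chunks instead of truncating, so no characters are silently dropped (the 80-character chunk size is just a safe margin under PyHyphen's 100-character limit):

def safe_syllables(separator, word, chunk=80):
    # PyHyphen raises ValueError for words over 100 characters, so
    # hyphenate fixed-size chunks and concatenate the results.
    syls = []
    for i in range(0, len(word), chunk):
        part = word[i:i + chunk]
        s = separator.syllables(part)
        syls.extend(s if len(s) > 0 else [part])
    return syls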

regisb and others added 2 commits September 15, 2017

Upgrade hyphenation to the latest version of pyhyphen: dict_info no longer exists in pyhyphen; instead, language packs are downloaded on the fly. This upgrade should be compatible with both the old and the new versions of pyhyphen.
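A hedged sketch of what such a dual-version shim could look like (the dictools helpers are PyHyphen's old explicit-install API; treat the exact names as an assumption):

from hyphen import Hyphenator

try:
    # Old pyhyphen: hyphenation dictionaries must be installed explicitly.
    from hyphen.dictools import install, is_installed
    if not is_installed('en_US'):
        install('en_US')
except ImportError:
    # New pyhyphen: language packs are downloaded on the fly when the
    # Hyphenator is instantiated, so there is nothing to do here.
    pass

separator = Hyphenator('en_US')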