Ancient Greek language #6604
Replies: 6 comments 10 replies
-
If the training data doesn't contain these quote symbols at all, the tagger/morphologizer won't learn how to tag them and will typically just pick high-frequency tags instead. The attribute ruler is one option. What we do for some of the provided models is to augment the training data by substituting quotes, dashes, etc. You can have a look at
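A minimal sketch of the augmentation approach, assuming spaCy v3's `spacy.orth_variants.v1` augmenter; the tag names, level values, and variant characters below are placeholders that would need to match your treebank:

```ini
[corpora.train.augmenter]
@augmenters = "spacy.orth_variants.v1"
# fraction of examples to augment, and fraction to lowercase
level = 0.1
lower = 0.0

[corpora.train.augmenter.orth_variants]
@readers = "srsly.read_json.v1"
path = "corpus/grc_orth_variants.json"
```

where `grc_orth_variants.json` could contain paired quote substitutions such as:

```json
{
  "single": [],
  "paired": [
    {"tags": ["PUNCT"], "variants": [["'", "'"], ["‘", "’"], ["«", "»"]]}
  ]
}
```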
-
Hi, I have a question regarding the normalization table in spacy-lookups-data. I cannot get grc_lexeme_norm.json to load, but the exceptions in tokenizer_exceptions.py work just fine. My norm file is registered in spacy-lookups-data in the same way as the lemma files that the lemmatizer is able to load. I checked other languages' tokenizer_exceptions files, and it seems to me that I'm calling the same modules. So I'm kind of clueless. Any hints? Thanks. Jacobo
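For debugging, the table mechanism can be exercised directly with spaCy's `Lookups` API, outside the pipeline; the entries below are made-up placeholders, not real grc_lexeme_norm.json content:

```python
from spacy.lookups import Lookups

# Build a lexeme_norm table by hand, mirroring the shape of what
# spacy-lookups-data loads from a *_lexeme_norm.json file.
# The glyph-normalization entries here are illustrative only.
lookups = Lookups()
lookups.add_table("lexeme_norm", {"ϑ": "θ", "ϰ": "κ"})

table = lookups.get_table("lexeme_norm")
print(table.get("ϑ"))  # "θ"
```

If the registered JSON still fails to load, calling `spacy.lookups.load_lookups("grc", ["lexeme_norm"])` should reproduce the error in isolation, which can help pinpoint whether the problem is in the entry-point registration or in the file itself.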
-
Update: I have given up on the idea of using a POS-based lemmatizer for grc. After running several tests and comparing it to a simple lookup table, I found that my POS lemmatizer was 15% less accurate than the lookup method. I think the rule-based lemmatizer route is better, although it requires defining many exceptions.
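For illustration, a lookup lemmatizer is essentially a dictionary from surface form to lemma, scored against gold lemmas; the toy table and gold data below are invented, not the actual evaluation:

```python
# Toy lookup lemmatizer: surface form -> lemma, falling back
# to the surface form itself when the table has no entry.
lemma_table = {
    "δοκῶ": "δοκέω",
    "μοι": "ἐγώ",
    "πυνθάνεσθε": "πυνθάνομαι",
}

def lookup_lemma(form: str) -> str:
    return lemma_table.get(form, form)

# Accuracy against a (made-up) gold standard: the table misses
# "οὐκ", so only two of the three forms come out right.
gold = [("δοκῶ", "δοκέω"), ("μοι", "ἐγώ"), ("οὐκ", "οὐ")]
correct = sum(lookup_lemma(form) == lemma for form, lemma in gold)
print(correct, "of", len(gold), "correct")
```

The weakness of the pure lookup approach is exactly what the rule-based route addresses: ambiguous or unseen forms need rules and exception lists rather than a single table entry.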
-
After some experimentation, I finally put together a project to build four ancient Greek models: small, medium, large, and transformer. The medium and large models were trained with floret vectors, and the transformer model with a transformer that I trained myself. The transformer model outperforms the other ancient Greek models offered by Stanza and Trankit, and it is also much smaller. The project can be found here: https://github.com/jmyerston/graCy I'm planning to add a NER pipeline in the future and to improve lemmatization and sentence boundary detection, but the models already perform very well and could be useful for those who work with ancient Greek texts or develop applications for processing Greek. They could be an interesting addition to the spaCy ecosystem.
-
I will wait a little bit to make it into a Python package (I still have to figure out how; I've never done it before). It would be useful to install the models from the prompt with something like this:
python -m gracy install small
which would be a shorthand for:
pip install https://huggingface.co/Jacobo/grc_ud_proiel_sm/resolve/main/grc_ud_proiel_sm-any-py3-none-any.whl
If you could point me to existing code I could look at and adapt, that would help a lot.
Ultimately, it would be nice to have a package that installs models for ancient languages, but for now we only have ancient Greek, although there is a reference to Sanskrit in the spaCy documentation (why?!).
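One way such a command could be sketched is as a thin wrapper around pip; the `MODELS` registry and `install` helper below are hypothetical, not an existing graCy API:

```python
import subprocess
import sys

# Hypothetical model-name -> wheel-URL registry for a
# "python -m gracy install <name>" style command.
MODELS = {
    "small": ("https://huggingface.co/Jacobo/grc_ud_proiel_sm/"
              "resolve/main/grc_ud_proiel_sm-any-py3-none-any.whl"),
}

def install(name: str) -> None:
    """Install a registered model wheel with the current interpreter's pip."""
    url = MODELS[name]
    subprocess.check_call([sys.executable, "-m", "pip", "install", url])

if __name__ == "__main__" and len(sys.argv) == 3 and sys.argv[1] == "install":
    install(sys.argv[2])
```

As far as I know, spaCy's own `spacy download` command works along similar lines: it resolves a model name to a wheel URL and shells out to pip, so it may be a good reference implementation to adapt.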
… On Oct 24, 2022, at 9:57 PM, polm ***@***.***> wrote:
Just wanted to ping you about this again - we'd love to have this in the Universe. If there's anything I can help with let me know.
-
Before submitting the PR, I wanted to make sure that my project builds with the latest version of spaCy. Unfortunately, it doesn't build with version 3.4.2, but it does with 3.4.1, and I could not figure out what is wrong. This is the error I am getting:
-
Hello,
we are about to finish a language module for ancient Greek. It will obviously not be the most popular model, but it will still be useful to some people.
We had it almost ready, and then we decided to port it to spaCy 3 and are running into a few issues that may just be us not knowing the new version of spaCy well enough. We have a POS lemmatizer that uses one of the largest ancient Greek lemmata lists, and it is working quite well. For a sentence like this:
doc2 = nlp("δοκῶ μοι περὶ ὧν πυνθάνεσθε οὐκ ἀμελέτητος εἶναι.")
The lemmatizer gets every form right:
δοκῶ δοκέω VERB
μοι ἐγώ PRON
περὶ περί ADP
ὧν ὅς PRON
πυνθάνεσθε πυνθάνομαι VERB
οὐκ οὐ ADV
ἀμελέτητος ἀμελέτητος ADJ
εἶναι εἰμί VERB
. . PUNCT
Most of the remaining problems come from issues in the corpus data (mistakes in the UD training files) rather than from our code, but we are still having a problem with the morphologizer, which is POS-tagging quotation marks as verbs, nouns, and so on:
ὦ ὦ INTJ
Φαληρεύς φαληρεύς NOUN
' ' VERB
, , PUNCT
ἔφη φημί VERB
, , PUNCT
‘ ‘ VERB
How should we address this issue? Through the attribute_ruler, assigning the POS PUNCT to the quotation marks?
Thanks.
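The attribute_ruler route described above could look roughly like this minimal sketch; the quote characters listed are assumptions about what appears in the corpus, and a blank pipeline is used here only for demonstration:

```python
import spacy

nlp = spacy.blank("grc")
ruler = nlp.add_pipe("attribute_ruler")

# Force POS=PUNCT on quotation marks that the statistical
# morphologizer mis-tags; extend the character list as needed.
patterns = [[{"ORTH": {"IN": ["'", "‘", "’", "“", "”", "«", "»"]}}]]
ruler.add(patterns=patterns, attrs={"POS": "PUNCT"})

doc = nlp("‘ ἔφη ’")
print(doc[0].pos_)  # PUNCT
```

In a trained pipeline, the attribute_ruler would go after the morphologizer so that it overwrites the statistical predictions for the matched tokens.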