Ancient Greek language #6604
Replies: 6 comments 10 replies
-
If the training data doesn't contain these quote symbols at all, the tagger/morphologizer won't learn how to tag them and will typically just pick high-frequency tags instead. The attribute ruler is one option. What we do for some of the provided models is to augment the training data by substituting quotes, dashes, etc. You can have a look at
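A minimal sketch of the augmentation approach, assuming spaCy v3's `spacy.orth_variants.v1` augmenter; the tag names, level values, and variant characters below are placeholders that would need to match your treebank:

```ini
[corpora.train.augmenter]
@augmenters = "spacy.orth_variants.v1"
# fraction of examples to augment, and fraction to lowercase
level = 0.1
lower = 0.0

[corpora.train.augmenter.orth_variants]
@readers = "srsly.read_json.v1"
path = "corpus/grc_orth_variants.json"
```

where `grc_orth_variants.json` could contain paired quote substitutions such as:

```json
{
  "single": [],
  "paired": [
    {"tags": ["PUNCT"], "variants": [["'", "'"], ["‘", "’"], ["«", "»"]]}
  ]
}
```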
-
Hi, I have a question regarding the normalization table in spacy-lookups-data. I cannot get grc_lexeme_norm.json to load, but the exceptions in tokenizer_exceptions.py work just fine. My norm file is registered in spacy-lookups-data in the same way as the lemma files that the lemmatizer is able to load. I checked other languages' tokenizer_exceptions files, and it seems to me that I'm calling the same modules. So I'm kind of clueless. Any hints? Thanks. Jacobo
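For debugging, the table mechanism can be exercised directly with spaCy's `Lookups` API, outside the pipeline; the entries below are made-up placeholders, not real grc_lexeme_norm.json content:

```python
from spacy.lookups import Lookups

# Build a lexeme_norm table by hand, mirroring the shape of what
# spacy-lookups-data loads from a *_lexeme_norm.json file.
# The glyph-normalization entries here are illustrative only.
lookups = Lookups()
lookups.add_table("lexeme_norm", {"ϑ": "θ", "ϰ": "κ"})

table = lookups.get_table("lexeme_norm")
print(table.get("ϑ"))  # "θ"
```

If the registered JSON still fails to load, calling `spacy.lookups.load_lookups("grc", ["lexeme_norm"])` should reproduce the error in isolation, which can help pinpoint whether the problem is in the entry-point registration or in the file itself.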
-
Update: I have given up on the idea of using a POS-based lemmatizer for grc. After running several tests and comparing it to a simple lookup table, I found that my POS lemmatizer was 15% less accurate than the lookup method. I think the rule-based lemmatizer route is better, although it requires defining many exceptions.
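For illustration, a lookup lemmatizer is essentially a dictionary from surface form to lemma, scored against gold lemmas; the toy table and gold data below are invented, not the actual evaluation:

```python
# Toy lookup lemmatizer: surface form -> lemma, falling back
# to the surface form itself when the table has no entry.
lemma_table = {
    "δοκῶ": "δοκέω",
    "μοι": "ἐγώ",
    "πυνθάνεσθε": "πυνθάνομαι",
}

def lookup_lemma(form: str) -> str:
    return lemma_table.get(form, form)

# Accuracy against a (made-up) gold standard: the table misses
# "οὐκ", so only two of the three forms come out right.
gold = [("δοκῶ", "δοκέω"), ("μοι", "ἐγώ"), ("οὐκ", "οὐ")]
correct = sum(lookup_lemma(form) == lemma for form, lemma in gold)
print(correct, "of", len(gold), "correct")
```

The weakness of the pure lookup approach is exactly what the rule-based route addresses: ambiguous or unseen forms need rules and exception lists rather than a single table entry.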
-
After some experimentation, I finally put together a project to build four ancient Greek models: small, medium, large, and transformer. The medium and large models were trained with floret vectors, and the transformer model with a transformer that I trained myself. The transformer model outperforms the other ancient Greek models offered by Stanza and Trankit, and it is also much smaller. The project can be found here: https://github.com/jmyerston/graCy I'm planning to add a NER pipeline in the future and to improve lemmatization and sentence boundary detection, but the models already perform very well and could be useful for those who work with ancient Greek texts or develop applications for processing Greek. They could be an interesting addition to the spaCy ecosystem.
-
I will wait a little bit to make it into a Python package (I still have to figure out how; I've never done it before). It would be useful to install the models from the prompt with something like this:
python -m gracy install small
which would be a shorthand for:
pip install https://huggingface.co/Jacobo/grc_ud_proiel_sm/resolve/main/grc_ud_proiel_sm-any-py3-none-any.whl
If you could point me to existing code I could look at and adapt, that would help a lot.
Ultimately, it would be nice to have a package that installs models for ancient languages, but for now we only have ancient Greek, although there is a reference to Sanskrit in the spaCy documentation (why?!).
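One way such a command could be sketched is as a thin wrapper around pip; the `MODELS` registry and `install` helper below are hypothetical, not an existing graCy API:

```python
import subprocess
import sys

# Hypothetical model-name -> wheel-URL registry for a
# "python -m gracy install <name>" style command.
MODELS = {
    "small": ("https://huggingface.co/Jacobo/grc_ud_proiel_sm/"
              "resolve/main/grc_ud_proiel_sm-any-py3-none-any.whl"),
}

def install(name: str) -> None:
    """Install a registered model wheel with the current interpreter's pip."""
    url = MODELS[name]
    subprocess.check_call([sys.executable, "-m", "pip", "install", url])

if __name__ == "__main__" and len(sys.argv) == 3 and sys.argv[1] == "install":
    install(sys.argv[2])
```

As far as I know, spaCy's own `spacy download` command works along similar lines: it resolves a model name to a wheel URL and shells out to pip, so it may be a good reference implementation to adapt.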
… On Oct 24, 2022, at 9:57 PM, polm ***@***.***> wrote:
Just wanted to ping you about this again - we'd love to have this in the Universe. If there's anything I can help with let me know.
-
Before submitting the PR, I wanted to make sure that my project builds with the latest version of spaCy. Unfortunately, it doesn't build with version 3.4.2, but it does with 3.4.1, and I could not figure out what is wrong. This is the error I am getting:
-
Hello,
we are about to finish a language module for ancient Greek. It will obviously not be the most popular model, but it will still be useful to some people.
We had it almost ready, and then we decided to port it to spaCy 3 and are running into a few issues that may just be us not knowing the new version of spaCy well enough. We have a POS lemmatizer that uses one of the largest ancient Greek lemmata lists, and it is working quite well. For a sentence like this:
doc2 = nlp("δοκῶ μοι περὶ ὧν πυνθάνεσθε οὐκ ἀμελέτητος εἶναι.")
The lemmatizer gets every form right:
δοκῶ δοκέω VERB
μοι ἐγώ PRON
περὶ περί ADP
ὧν ὅς PRON
πυνθάνεσθε πυνθάνομαι VERB
οὐκ οὐ ADV
ἀμελέτητος ἀμελέτητος ADJ
εἶναι εἰμί VERB
. . PUNCT
Most of the remaining problems come from issues in the corpus data (mistakes in the UD training files) rather than from our code, but we are still having a problem with the morphologizer, which is POS-tagging quotation marks as verbs, nouns, and so on:
ὦ ὦ INTJ
Φαληρεύς φαληρεύς NOUN
' ' VERB
, , PUNCT
ἔφη φημί VERB
, , PUNCT
‘ ‘ VERB
How should we address this issue? Through the attribute_ruler, assigning the POS PUNCT to the quotation marks?
Thanks.
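The attribute_ruler route described above could look roughly like this minimal sketch; the quote characters listed are assumptions about what appears in the corpus, and a blank pipeline is used here only for demonstration:

```python
import spacy

nlp = spacy.blank("grc")
ruler = nlp.add_pipe("attribute_ruler")

# Force POS=PUNCT on quotation marks that the statistical
# morphologizer mis-tags; extend the character list as needed.
patterns = [[{"ORTH": {"IN": ["'", "‘", "’", "“", "”", "«", "»"]}}]]
ruler.add(patterns=patterns, attrs={"POS": "PUNCT"})

doc = nlp("‘ ἔφη ’")
print(doc[0].pos_)  # PUNCT
```

In a trained pipeline, the attribute_ruler would go after the morphologizer so that it overwrites the statistical predictions for the matched tokens.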