Adding Romanian NER #4736
Replies: 17 comments
-
Hi, thanks for getting in touch! For the models that spacy releases and maintains, we like to have a tagger, dependency parser, and NER model, so we wouldn't release an individual NER-only model. Looking around for other resources we could use for Romanian, I think it's possible we can use your NER data in combination with the UD Romanian RRT corpus in order to have a full model to release. It looks like it's similar in size to the corpora used for several other languages in spacy, so I'll try training a model with all the data, and if it works well enough, we can plan to add Romanian with a tagger/parser/NER for v3. Since I don't know much about Romanian, can I ask you a few questions related to this?
-
Hi @adrianeboyd, I'm Stefan, and together with Andrei (@avramandrei, the original poster) we developed RONEC. I also worked with the main person in charge of developing UD Romanian RRT. Here are my thoughts:
As for future work, we're actually considering the following:
Thanks!
-
Thanks for the info!
I trained a few models yesterday and the results look decent:
A few of the NER tags aren't great and it might make sense to combine them into a MISC category (PRODUCT, LOC, FACILITY, WORK_OF_ART):
Edited: update NER results with improved tokenization (removing
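In case anyone wants to experiment with that, here's a minimal sketch of collapsing those types into a MISC label in IOB-style data before training; the file names and the token/tag column layout are assumptions for illustration, not the actual RONEC export format.

```python
# Sketch only: collapse rare entity types into a MISC label in IOB data.
# File names and the token<TAB>tag column layout are assumed for illustration.
COLLAPSE = {"PRODUCT", "LOC", "FACILITY", "WORK_OF_ART"}

def collapse_label(tag):
    # Tags look like "B-PRODUCT", "I-LOC", or "O".
    if "-" in tag and tag.split("-", 1)[1] in COLLAPSE:
        return tag.split("-", 1)[0] + "-MISC"
    return tag

with open("ronec_train.iob", encoding="utf8") as fin, \
        open("ronec_train_misc.iob", "w", encoding="utf8") as fout:
    for line in fin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            parts[1] = collapse_label(parts[1])
        fout.write("\t".join(parts) + "\n")
```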
-
Hi! Responses, in turn:
where the dash belongs to. If there is no way to train a tokenizer on RRT, then let me ask some people if we could come up with a more complete prefix/suffix file, with the caveat that if the prefix always gets split before the suffix, there will always be wrongly split words like in your example.
Let me see what I can do for punctuation and stopwords. Thanks!
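In case it's useful, extra prefixes/suffixes can also be plugged into spaCy's rule-based tokenizer at runtime; here's a minimal sketch, where the clitic suffixes are placeholders for whatever ends up in the more complete Romanian list:

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("ro")

# Placeholder clitic suffixes; the real entries would come from the
# more complete prefix/suffix inventory discussed above.
extra_suffixes = ["-mi", "-l", "-le"]

suffixes = list(nlp.Defaults.suffixes) + extra_suffixes
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search
```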
-
I'm relatively happy with the 99.6% accuracy for the tokenizer for the RRT dev set, but I'm not surprised that there are lots of prefixes/suffixes missing from the lists. I mainly wanted to get a version working a bit better for the initial round of training and it can definitely be improved further. What tokenizer do you typically use? Spacy's tokenizer is rule-based, but if the tokenization of the hyphens depends a lot on the context, it might be possible to have the parser do some retokenization. The tokenizer would split on all hyphens and the parser would learn how to rejoin them. The training option for this is still a bit experimental, but it would be worth a try. (It's not working as intended for Chinese, but this is a much easier case.) I'll check if something seems wrong in the LOC conversion.
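To make the hyphen experiment concrete, here's a rough sketch (not the shipped rules) of forcing the tokenizer to split on every intra-word hyphen, so that a learned component would then have to rejoin them:

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("ro")

# Split on any hyphen between letters (including Romanian diacritics);
# a statistical component would then have to learn when to rejoin them.
hyphen_infix = r"(?<=[A-Za-zĂăÂâÎîȘșȚț])-(?=[A-Za-zĂăÂâÎîȘșȚț])"
infixes = list(nlp.Defaults.infixes) + [hyphen_infix]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("floarea-soarelui")])
```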
-
We use an LSTM model trained on UD's tokenization to predict joint sentence segmentation and word tokenization. This approach yields something like 99.74% tokenization and 95.5% sentence segmentation accuracy. However, a 99.6% rule-based (lightweight) model seems pretty good! For a 0.1% gain I wouldn't bother changing too much code; however, an experiment like the one you suggest with the parser should be worth it. I'll get back to you if I can find some significantly better stop words.
-
Yeah, it's best for spacy if the tokenizer is deterministic and fast, so I'm okay with a rule-based tokenizer that's a little bit worse. Here are the segmentation results for the same parser model as above on the RRT test set:
On a random 90% of RONEC (the same train set as above):
Stop words are totally separate from the tokenizer, so it wouldn't hurt to update the list, but most people end up bringing a task-specific list anyway, so that's not a priority.
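For reference, the stop word list can be tweaked entirely independently of the tokenizer; a small sketch where the added words are placeholders rather than a proposed list:

```python
import spacy
from spacy.lang.ro.stop_words import STOP_WORDS

nlp = spacy.blank("ro")

# Placeholder additions; a real task-specific list would go here.
for word in ["deci", "totuși"]:
    STOP_WORDS.add(word)
    nlp.vocab[word].is_stop = True
```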
-
Wait, those sentence segmentation scores were inflated by some fake document boundaries (I grouped every 10 sentences into a document, so the boundary at the start of each document is counted as correct for free): RRT test:
RONEC:
-
If anyone would like to test the upcoming models for Romanian, the initial models have been published and can be tested with spacy v2.3.0.dev1:
Replace
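A quick way to try one of them out, assuming the package follows spaCy's usual naming scheme (ro_core_news_sm below is my guess, substitute whatever name the release notes list):

```python
# Assumes: pip install spacy==2.3.0.dev1
# and a downloaded Romanian package; "ro_core_news_sm" is an assumed name.
import spacy

nlp = spacy.load("ro_core_news_sm")
doc = nlp("George Enescu s-a născut la Liveni, în județul Botoșani.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```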
-
There's still some work to do on the Romanian tokenizer and character-based orthographic variants, but I think the main issue here has been addressed. I decided that it would be better to keep all the entity types from the original corpus in the models that we provide, even if the performance for some types is poor, so that the model corresponds closely to the references listed in the sources. The data is available if others would like to collapse or reduce the entity types and train their own models.
-
Hi @adrianeboyd, I have managed to get a bit of time to test this, and while everything seems to work reasonably, especially tokenization and sentence segmentation, the POS tagging is having issues:
gives:
This is Romanian RRT sentence test-6 taken from: https://raw.githubusercontent.com/UniversalDependencies/UD_Romanian-RRT/master/ro_rrt-ud-test.conllu :
So basically UD's UPOS (Spacy's
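For anyone reproducing this, the mismatch is easy to see by printing both tag layers side by side (model name assumed as above; the sentence is just a placeholder, not the RRT test-6 sentence):

```python
import spacy

nlp = spacy.load("ro_core_news_sm")  # package name assumed, as above
doc = nlp("Mașina este în curte.")   # placeholder sentence, not RRT test-6
for token in doc:
    # token.pos_ is the coarse UD UPOS tag, token.tag_ the fine-grained corpus tag
    print(token.text, token.pos_, token.tag_)
```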
-
Thanks for testing this! This is probably an issue with the tag map; I'll have a look.
-
Sure, I forgot to mention: the same behaviour occurs for sm, md and lg.
-
Yeah, the training corpus and the tag map got out of sync. Since the fine-grained tags are detailed enough without the additional morphology, I intended to train from data with the plain fine-grained tags rather than the ones with the morphological features attached. We will train and release an updated model.
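For context, in spaCy v2 the tag map is just a dictionary from the corpus's fine-grained tags to UD part-of-speech values, so the training data and the map have to use exactly the same tag inventory. A tiny illustrative sketch (these Romanian MSD tags are examples, not the full map):

```python
# Illustrative only: a spaCy v2-style tag map entry maps a fine-grained
# corpus tag to a UD UPOS value (plus optional morphological features).
TAG_MAP = {
    "Ncms-n": {"pos": "NOUN"},   # e.g. common noun, masc. sing., indefinite
    "Vmip3s": {"pos": "VERB"},   # e.g. main verb, indicative present, 3rd sing.
    "Ds3fsrs": {"pos": "DET"},   # example determiner tag
}
```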
-
@adrianeboyd Thanks in advance!
-
I think that split wasn't available yet when we set this up. We modified the conversion script slightly (see: https://github.com/adrianeboyd/ronec/commits/bugfix/spacy-iob-script -- I think the authors have addressed these bugs in a slightly different way in the main repo since I worked on this) and we did our own randomized 80/10/10 split. The NER results, especially for the rarer types, will vary a lot between splits. I've attached a list of the sentence IDs per split. Our reported evaluations are on the dev sets, not the test sets, which we leave untouched for now.
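The split itself is nothing fancy; a minimal sketch of the kind of randomized 80/10/10 split over sentence IDs (the ID list and the seed below are placeholders, not the ones actually used):

```python
import random

# Placeholder sentence IDs; in practice these come from the corpus.
sentence_ids = ["sent-%d" % i for i in range(5000)]

random.seed(0)  # placeholder seed
random.shuffle(sentence_ids)

n = len(sentence_ids)
train_ids = sentence_ids[: int(0.8 * n)]
dev_ids = sentence_ids[int(0.8 * n): int(0.9 * n)]
test_ids = sentence_ids[int(0.9 * n):]
```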
-
Thank you!
-
Hello,
I submitted a pull request several months ago for adding a Romanian NER model: #4151 (comment). I saw it was added to the Spacy Universe: https://spacy.io/universe/project/ronec, but when I try to download the model with
python -m spacy download ro_ner
it says that: No compatible model found for 'ro_ner' (spaCy v2.2.3). We would like to know what the progress is on introducing the model and/or whether you encountered any problems in training a model on the corpus (we would be happy to help in that case).

P.S. We also refactored the spacy training tutorial for the Romanian NER corpus since then: https://github.com/dumitrescustefan/ronec/tree/master/spacy/train-local-model, and we also added an already trained model (as a demo) that should work with spacy: https://github.com/dumitrescustefan/ronec/tree/master/spacy/online-model.
Thank you,
Avram Andrei