Adding Romanian NER #4736
Replies: 17 comments
-
Hi, thanks for getting in touch! For the models that spacy releases and maintains, we like to have a tagger, dependency parser, and NER model, so we wouldn't release an individual NER-only model. Looking around for other resources we could use for Romanian, I think it's possible we can use your NER data in combination with the UD Romanian RRT corpus in order to have a full model to release. It looks like it's similar in size to the corpora used for several other languages in spacy, so I'll try training a model with all the data, and if it works well enough, we can plan to add Romanian with a tagger/parser/NER for v3. Since I don't know much about Romanian, can I ask you a few questions related to this?
-
Hi @adrianeboyd, I'm Stefan, and together with Andrei (@avramandrei, the original poster) we developed RONEC. I also worked with the main person in charge of developing UD Romanian RRT. Here are my thoughts:
As for future work, we're actually considering the following:
Thanks!
-
Thanks for the info!
I trained a few models yesterday and the results look decent:
A few of the NER tags aren't great and it might make sense to combine them into a MISC category (PRODUCT, LOC, FACILITY, WORK_OF_ART):
Edited: update NER results with improved tokenization (removing
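In case anyone wants to experiment with that, here's a minimal sketch of collapsing those types into a MISC label in IOB-style data before training; the file names and the token/tag column layout are assumptions for illustration, not the actual RONEC export format.

```python
# Sketch only: collapse rare entity types into a MISC label in IOB data.
# File names and the token<TAB>tag column layout are assumed for illustration.
COLLAPSE = {"PRODUCT", "LOC", "FACILITY", "WORK_OF_ART"}

def collapse_label(tag):
    # Tags look like "B-PRODUCT", "I-LOC", or "O".
    if "-" in tag and tag.split("-", 1)[1] in COLLAPSE:
        return tag.split("-", 1)[0] + "-MISC"
    return tag

with open("ronec_train.iob", encoding="utf8") as fin, \
        open("ronec_train_misc.iob", "w", encoding="utf8") as fout:
    for line in fin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 2:
            parts[1] = collapse_label(parts[1])
        fout.write("\t".join(parts) + "\n")
```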
-
Hi! Responses, in turn:
where the dash belongs to. If there is no way to train a tokenizer on RRT, then let me ask some people if we could come up with a more complete prefix/suffix file, with the caveat that if the prefix always gets split before the suffix, there will always be wrongly split words like in your example.
Let me see what I can do for punctuation and stopwords. Thanks!
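In case it's useful, extra prefixes/suffixes can also be plugged into spaCy's rule-based tokenizer at runtime; here's a minimal sketch, where the clitic suffixes are placeholders for whatever ends up in the more complete Romanian list:

```python
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("ro")

# Placeholder clitic suffixes; the real entries would come from the
# more complete prefix/suffix inventory discussed above.
extra_suffixes = ["-mi", "-l", "-le"]

suffixes = list(nlp.Defaults.suffixes) + extra_suffixes
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search
```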
-
I'm relatively happy with the 99.6% accuracy for the tokenizer for the RRT dev set, but I'm not surprised that there are lots of prefixes/suffixes missing from the lists. I mainly wanted to get a version working a bit better for the initial round of training and it can definitely be improved further. What tokenizer do you typically use? Spacy's tokenizer is rule-based, but if the tokenization of the hyphens depends a lot on the context, it might be possible to have the parser do some retokenization. The tokenizer would split on all hyphens and the parser would learn how to rejoin them. The training option for this is still a bit experimental, but it would be worth a try. (It's not working as intended for Chinese, but this is a much easier case.) I'll check if something seems wrong in the LOC conversion.
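To make the hyphen experiment concrete, here's a rough sketch (not the shipped rules) of forcing the tokenizer to split on every intra-word hyphen, so that a learned component would then have to rejoin them:

```python
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.blank("ro")

# Split on any hyphen between letters (including Romanian diacritics);
# a statistical component would then have to learn when to rejoin them.
hyphen_infix = r"(?<=[A-Za-zĂăÂâÎîȘșȚț])-(?=[A-Za-zĂăÂâÎîȘșȚț])"
infixes = list(nlp.Defaults.infixes) + [hyphen_infix]
nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

print([t.text for t in nlp("floarea-soarelui")])
```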
-
We use an LSTM model trained on UD's tokenization to predict joint sentence segmentation and word tokenization. This approach yields something like 99.74% tokenization and 95.5% sentence segmentation accuracy. However, a 99.6% rule-based (lightweight) model seems pretty good! For a 0.1% gain I wouldn't bother changing too much code; however, an experiment like the one you suggest with the parser should be worth it. I'll get back to you if I can find some significantly better stop words.
-
Yeah, it's best for spacy if the tokenizer is deterministic and fast, so I'm okay with a rule-based tokenizer that's a little bit worse. Here are the segmentation results for the same parser model as above on the RRT test set:
On a random 90% of RONEC (the same train set as above):
Stop words are totally separate from the tokenizer, so it wouldn't hurt to update the list, but most people end up bringing a task-specific list anyway, so that's not a priority.
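For reference, the stop word list can be tweaked entirely independently of the tokenizer; a small sketch where the added words are placeholders rather than a proposed list:

```python
import spacy
from spacy.lang.ro.stop_words import STOP_WORDS

nlp = spacy.blank("ro")

# Placeholder additions; a real task-specific list would go here.
for word in ["deci", "totuși"]:
    STOP_WORDS.add(word)
    nlp.vocab[word].is_stop = True
```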
-
Wait, those sentence segmentation scores were inflated by some fake document boundaries (I grouped every 10 sentences into a document, so the boundary at the start of each document is counted as correct for free): RRT test:
RONEC:
-
If anyone would like to test the upcoming models for Romanian, the initial models have been published and can be tested with spacy v2.3.0.dev1:
Replace
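A quick way to try one of them out, assuming the package follows spaCy's usual naming scheme (ro_core_news_sm below is my guess, substitute whatever name the release notes list):

```python
# Assumes: pip install spacy==2.3.0.dev1
# and a downloaded Romanian package; "ro_core_news_sm" is an assumed name.
import spacy

nlp = spacy.load("ro_core_news_sm")
doc = nlp("George Enescu s-a născut la Liveni, în județul Botoșani.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```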
-
There's still some work to do on the Romanian tokenizer and character-based orthographic variants, but I think the main issue here has been addressed. I decided that it would be better to keep all the entity types from the original corpus in the models that we provide, even if the performance for some types is poor, so that the model corresponds closely to the references listed in the sources. The data is available if others would like to collapse or reduce the entity types and train their own models.
-
Hi @adrianeboyd, I have managed to get a bit of time to test this, and while everything seems to work reasonably, especially tokenization and sentence segmentation, the POS tagging is having issues:
gives:
This is Romanian RRT sentence test-6 taken from: https://raw.githubusercontent.com/UniversalDependencies/UD_Romanian-RRT/master/ro_rrt-ud-test.conllu :
So basically UD's UPOS (Spacy's
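For anyone reproducing this, the mismatch is easy to see by printing both tag layers side by side (model name assumed as above; the sentence is just a placeholder, not the RRT test-6 sentence):

```python
import spacy

nlp = spacy.load("ro_core_news_sm")  # package name assumed, as above
doc = nlp("Mașina este în curte.")   # placeholder sentence, not RRT test-6
for token in doc:
    # token.pos_ is the coarse UD UPOS tag, token.tag_ the fine-grained corpus tag
    print(token.text, token.pos_, token.tag_)
```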
-
Thanks for testing this! This is probably an issue with the tag map; I'll have a look.
-
Sure, I forgot to mention: the same behaviour occurs for sm, md and lg.
-
Yeah, the training corpus and the tag map got out of sync. Since the fine-grained tags are detailed enough without the additional morphology, I intended to train from data with the plain fine-grained tags rather than the ones with the morphological features attached. We will train and release an updated model.
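For context, in spaCy v2 the tag map is just a dictionary from the corpus's fine-grained tags to UD part-of-speech values, so the training data and the map have to use exactly the same tag inventory. A tiny illustrative sketch (these Romanian MSD tags are examples, not the full map):

```python
# Illustrative only: a spaCy v2-style tag map entry maps a fine-grained
# corpus tag to a UD UPOS value (plus optional morphological features).
TAG_MAP = {
    "Ncms-n": {"pos": "NOUN"},   # e.g. common noun, masc. sing., indefinite
    "Vmip3s": {"pos": "VERB"},   # e.g. main verb, indicative present, 3rd sing.
    "Ds3fsrs": {"pos": "DET"},   # example determiner tag
}
```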
-
@adrianeboyd Thanks in advance!
-
I think that split wasn't available yet when we set this up. We modified the conversion script slightly (see: https://github.com/adrianeboyd/ronec/commits/bugfix/spacy-iob-script -- I think the authors have addressed these bugs in a slightly different way in the main repo since I worked on this) and we did our own randomized 80/10/10 split. The NER results, especially for the rarer types, will vary a lot between splits. I've attached a list of the sentence IDs per split. Our reported evaluations are on the dev sets, not the test sets, which we leave untouched for now.
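The split itself is nothing fancy; a minimal sketch of the kind of randomized 80/10/10 split over sentence IDs (the ID list and the seed below are placeholders, not the ones actually used):

```python
import random

# Placeholder sentence IDs; in practice these come from the corpus.
sentence_ids = ["sent-%d" % i for i in range(5000)]

random.seed(0)  # placeholder seed
random.shuffle(sentence_ids)

n = len(sentence_ids)
train_ids = sentence_ids[: int(0.8 * n)]
dev_ids = sentence_ids[int(0.8 * n): int(0.9 * n)]
test_ids = sentence_ids[int(0.9 * n):]
```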
-
Thank you!
-
Hello,
I submitted a pull request several months ago for adding a Romanian NER model: #4151 (comment). I saw it was added to the Spacy Universe: https://spacy.io/universe/project/ronec, but when I try to download the model with
python -m spacy download ro_ner
it says that: No compatible model found for 'ro_ner' (spaCy v2.2.3). We would like to know what the progress is on introducing the model and/or whether you encountered any problems in training a model on the corpus (we would be happy to help in that case).

P.S. We also refactored the spacy training tutorial for the Romanian NER corpus since then: https://github.com/dumitrescustefan/ronec/tree/master/spacy/train-local-model, and we also added an already trained model (as a demo) that should work with spacy: https://github.com/dumitrescustefan/ronec/tree/master/spacy/online-model.
Thank you,
Avram Andrei