Using SparkNLP's NerCrfApproach with custom labels #6270

mwunderlich · 2021-10-13T07:30:27Z

I am trying to train a NerCrfApproach model on a dataset in CoNLL format that has custom labels for product entities (B-Prod, I-Prod, etc.). However, when I use the trained model to make predictions, every token is assigned the label "O". When I train the same model on the CoNLL data from the SparkNLP workshop example, the classification works fine.

So, the question is: Does NerCrfApproach rely on the standard tag set for NER labels used by the CoNLL data? Or can I use it for any custom labels and, if yes, do I need to specify these somehow? My assumption was that the labels are inferred from the training data.

Cheers,
Martin
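
One quick way to rule out a parsing problem with custom labels is to list the distinct tags that actually appear in the training file. The following is a minimal sketch, not part of Spark NLP: it assumes the NER tag is the last whitespace-separated column, as in the CoNLL-2003 layout, and the helper name `collect_tags` and the sample sentence are made up for illustration.

```python
def collect_tags(conll_lines):
    """Return the sorted set of NER tags found in the last column of CoNLL-style lines."""
    tags = set()
    for line in conll_lines:
        line = line.strip()
        # Skip sentence separators (blank lines) and document markers.
        if not line or line.startswith("-DOCSTART-"):
            continue
        # Assumption: the NER tag is the last whitespace-separated column.
        tags.add(line.split()[-1])
    return sorted(tags)


# Hypothetical sample in CoNLL-2003 layout: token, POS tag, chunk tag, NER tag.
sample = """-DOCSTART- -X- -X- O

The DT B-NP O
Acme NNP I-NP B-Prod
Blender NNP I-NP I-Prod
broke VBD B-VP O
. . O O
""".splitlines()

tags = collect_tags(sample)
print(tags)
```

If the custom tags show up here as expected, the file itself is probably not the reason for the all-"O" predictions.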

mwunderlich · 2021-10-14T06:00:36Z

Update: The issue might not be related to the labels after all. I replaced my custom labels with the standard CoNLL labels and I am still not getting the expected classification results.

mwunderlich · 2021-10-14T06:26:44Z

As it turns out, this issue was not caused by the labels, but by the size of the dataset. I was using a rather small dataset for development purposes. It was not only small but also heavily imbalanced, with far more "O" labels than any other label. After switching to a dataset 10x the original size (in terms of sentences), I get meaningful results, even for my custom labels.
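
For anyone who wants to check for this kind of imbalance before training, here is a rough sketch that counts tokens per label and sentences that contain at least one entity. It is plain Python and does not use Spark NLP. It assumes a CoNLL-style file with the NER tag in the last column and blank lines between sentences; the function name `label_stats` and the sample data are hypothetical.

```python
from collections import Counter


def label_stats(conll_lines):
    """Count tokens per NER tag and sentences with at least one non-"O" tag."""
    token_counts = Counter()
    sentences = 0
    sentences_with_entity = 0
    in_sentence = False
    has_entity = False
    # The extra blank line at the end flushes the last sentence.
    for raw in list(conll_lines) + [""]:
        line = raw.strip()
        if line.startswith("-DOCSTART-"):
            continue
        if not line:
            if in_sentence:
                sentences += 1
                sentences_with_entity += int(has_entity)
            in_sentence = False
            has_entity = False
            continue
        # Assumption: the NER tag is the last whitespace-separated column.
        tag = line.split()[-1]
        token_counts[tag] += 1
        in_sentence = True
        has_entity = has_entity or tag != "O"
    total = sum(token_counts.values())
    o_share = token_counts["O"] / total if total else 0.0
    return token_counts, o_share, sentences, sentences_with_entity


# Hypothetical two-sentence sample: token, POS tag, chunk tag, NER tag.
sample = """-DOCSTART- -X- -X- O

The DT B-NP O
Acme NNP I-NP B-Prod
Blender NNP I-NP I-Prod
broke VBD B-VP O
. . O O

It PRP B-NP O
was VBD B-VP O
old JJ B-ADJP O
. . O O
""".splitlines()

counts, o_share, n_sentences, n_with_entity = label_stats(sample)
print(dict(counts))
print(f"share of 'O' tokens: {o_share:.2f}")
print(f"sentences with an entity: {n_with_entity} of {n_sentences}")
```

A very high share of "O" tokens combined with only a handful of entity-bearing sentences would match the situation described above. What counts as "too few" depends on the data, so the numbers are only a hint.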
