Can't train_test_split with test_size < 0.32 #4946
-
Hey everyone! I get empty results when I evaluate my model, but if I split the data with a test_size equal to 0.32 or bigger than that I get normal results. This doesn't make any sense to me, since I can't understand what the difference between 0.32 and 0.31 would be. The way I do it is like this:
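A minimal sketch of the split being described, assuming scikit-learn's `train_test_split` and a hypothetical `examples` list of training tuples:

```python
from sklearn.model_selection import train_test_split

# `examples` is a hypothetical placeholder for the list of
# (text, annotations) training tuples described in the post.
train_data, test_data = train_test_split(
    examples, test_size=0.32, random_state=0  # 0.32 works, 0.31 does not
)
```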
I have been searching for an answer and didn't find anything related to my problem, so I started adding print statements to see what was happening:
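A sketch of the kind of debugging print being described, assuming the evaluation iterates over the test examples (`nlp` and `test_data` are the trained pipeline and the split from above):

```python
# Hypothetical debugging loop: "YES" is printed only if the loop body
# actually runs for at least one test example.
for text, annotations in test_data:
    doc = nlp(text)
    print("YES")
```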
So with my testing set being 0.32 or more it goes through and prints "YES", but if I set the test_size to less than 0.32, for example 0.30, it will not go through, which I think is what leads to the empty results. If any of you might know what's happening, I would greatly appreciate it! Thanks in advance!
-
Thanks for the report, this is an interesting case related to subtle behavior around character offsets and tokenization. spaCy doesn't warn or explain enough about what's going on here, and at the very least this behavior needs to be more transparent for users. I think it would be even better if there were some additional settings for how to handle misaligned data, too, but that would be a larger change.
I suspect the underlying problem is that you have a lot of cases where the tokenization in your annotation doesn't line up with spaCy's tokenization. As an example, an obvious case would be something like this, where my annotation says that `ome i` is an entity:
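A sketch of such an annotation, using hypothetical text and v2-style character-offset entities, where the span starts and ends mid-token:

```python
# Characters 9-14 of the text are "ome i", which cuts across the
# tokens "some" and "interesting".
text = "This is some interesting text."
annotations = {"entities": [(9, 14, "ENTITY")]}
assert text[9:14] == "ome i"
```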
After the gold data is loaded, the internal tokenization and the IOB tags the model is learning from look like this:
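A sketch that reproduces this, assuming spaCy v2.x, where `GoldParse` exposes the per-token NER tags and marks misaligned tokens with the unknown tag `-`:

```python
import spacy
from spacy.gold import GoldParse  # spaCy v2.x API

nlp = spacy.blank("en")
doc = nlp("This is some interesting text.")
gold = GoldParse(doc, entities=[(9, 14, "ENTITY")])

print([t.text for t in doc])  # ['This', 'is', 'some', 'interesting', 'text', '.']
print(gold.ner)               # ['O', 'O', '-', '-', 'O', 'O']
```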
The model tries to learn as much as it can from the tags that are known (nothing in this example). Then, when you get to the evaluation, you want to evaluate on sentences where you know what's correct for all tokens, so the scorer skips sentences where some of the tags aren't known for sure. Your data seems to have a lot of these cases, so when your test set gets smaller, you can end up with only sentences with misalignments and nothing for the scorer to evaluate on.
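One way to gauge how widespread this is in your data, continuing the sketch above (`nlp` and `GoldParse` as before, `TRAIN_DATA` as a hypothetical name for your full list of examples):

```python
# Count examples whose gold NER tags contain the unknown marker '-',
# i.e. examples with at least one misaligned entity span.
misaligned = 0
for text, annots in TRAIN_DATA:
    doc = nlp.make_doc(text)
    gold = GoldParse(doc, entities=annots["entities"])
    if "-" in gold.ner:
        misaligned += 1
print(f"{misaligned} of {len(TRAIN_DATA)} examples contain misaligned spans")
```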