Can't train_test_split with test_size < 0.32 #4946
-
Hey everyone! I get empty results when I evaluate my model, but if I split the data with a test_size equal to 0.32 or bigger than that I get normal results. This doesn't make any sense to me, since I can't understand what the difference between 0.32 and 0.31 would be. The way I do it is like this:
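A minimal sketch of the split being described, assuming scikit-learn's `train_test_split` and a hypothetical `examples` list of training tuples:

```python
from sklearn.model_selection import train_test_split

# `examples` is a hypothetical placeholder for the list of
# (text, annotations) training tuples described in the post.
train_data, test_data = train_test_split(
    examples, test_size=0.32, random_state=0  # 0.32 works, 0.31 does not
)
```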
I have been searching for an answer and didn't find anything related to my problem, so I started adding print statements to see what was happening:
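A sketch of the kind of debugging print being described, assuming the evaluation iterates over the test examples (`nlp` and `test_data` are the trained pipeline and the split from above):

```python
# Hypothetical debugging loop: "YES" is printed only if the loop body
# actually runs for at least one test example.
for text, annotations in test_data:
    doc = nlp(text)
    print("YES")
```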
So with my testing set being 0.32 or more it goes through and prints "YES", but if I set the test_size to less than 0.32, for example 0.30, it will not go through, which I think is what leads to the empty results. If any of you might know what's happening, I would greatly appreciate it! Thanks in advance!
-
Thanks for the report, this is an interesting case related to subtle behavior around character offsets and tokenization. spaCy doesn't warn or explain enough about what's going on here, and at the very least this behavior needs to be more transparent for users. I think it would be even better if there were some additional settings for how to handle misaligned data, too, but that would be a larger change.
I suspect the underlying problem is that you have a lot of cases where the tokenization in your annotation doesn't line up with spaCy's tokenization. As an example, an obvious case would be something like this, where my annotation says that `ome i` is an entity:
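A sketch of such an annotation, using hypothetical text and v2-style character-offset entities, where the span starts and ends mid-token:

```python
# Characters 9-14 of the text are "ome i", which cuts across the
# tokens "some" and "interesting".
text = "This is some interesting text."
annotations = {"entities": [(9, 14, "ENTITY")]}
assert text[9:14] == "ome i"
```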
After the gold data is loaded, the internal tokenization and the IOB tags the model is learning from look like this:
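A sketch that reproduces this, assuming spaCy v2.x, where `GoldParse` exposes the per-token NER tags and marks misaligned tokens with the unknown tag `-`:

```python
import spacy
from spacy.gold import GoldParse  # spaCy v2.x API

nlp = spacy.blank("en")
doc = nlp("This is some interesting text.")
gold = GoldParse(doc, entities=[(9, 14, "ENTITY")])

print([t.text for t in doc])  # ['This', 'is', 'some', 'interesting', 'text', '.']
print(gold.ner)               # ['O', 'O', '-', '-', 'O', 'O']
```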
The model tries to learn as much as it can from the tags that are known (nothing in this example). Then, when you get to the evaluation, you want to evaluate on sentences where you know what's correct for all tokens, so the scorer skips sentences where some of the tags aren't known for sure. Your data seems to have a lot of these cases, so when your test set gets smaller, you can end up with only sentences with misalignments and nothing for the scorer to evaluate on.
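One way to gauge how widespread this is in your data, continuing the sketch above (`nlp` and `GoldParse` as before, `TRAIN_DATA` as a hypothetical name for your full list of examples):

```python
# Count examples whose gold NER tags contain the unknown marker '-',
# i.e. examples with at least one misaligned entity span.
misaligned = 0
for text, annots in TRAIN_DATA:
    doc = nlp.make_doc(text)
    gold = GoldParse(doc, entities=annots["entities"])
    if "-" in gold.ner:
        misaligned += 1
print(f"{misaligned} of {len(TRAIN_DATA)} examples contain misaligned spans")
```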