One annotation change corrupts entire model #7168
Replies: 13 comments 3 replies
-
Could you share some sample output? Is it just the training loss that jumps so strongly, or also the accuracy on the dev set?
-
Truthfully, I need to do some work to answer that question. Once my model was mature, I streamlined some of the testing and used loss as a proxy. As my annotation set incrementally grew, the increase in accuracy was so slow as to be uninteresting. Seeing the loss jump so dramatically, and not come down even after dozens of iterations, made me think it unlikely that other metrics would be unaffected.
-
I'll go ahead and move this to the discussion forum in the meantime, as this type of discussion is perfectly suited for that. We haven't really encountered the behaviour you described before, so it would be good to get a sample output / some more background / some reproducible code snippet to be able to look into this further!
-
Last training iteration before the annotation change... loss = 12000 (seems large, but it's a huge dataset with many custom entities).
Testing on my demo sample set: these models are trained using 5k+ annotations across 1000 documents. I use the full training set for all model runs to avoid forgetting issues. My code is almost identical to the sample you posted for spaCy 2.
-
Are you adding a label to the model that wasn't seen before, and wasn't present during initialization?

I don't remember exactly which versions were affected, but it was an extended struggle to make "live" learning of labels work well. One issue that was particularly tricky was that it turned out the NER model settled into a pattern where most of the scores were fairly large negative values. When a new label was added, the initial weights for the label were zero, and that meant that it received a score of 0 --- which would be by far the best score! So as soon as the label is added, the model predicts it like crazy, even before it's seen an example of it. This would explain the sudden spike in loss.

You can avoid the issue by ensuring all the labels are added at the beginning of training. I think the issue was fixed by v2.3.
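For concreteness, a minimal sketch of that setup with the spaCy v2 API; the label names are taken from examples elsewhere in this thread, everything else is a placeholder and the training loop itself is omitted:

```python
# Minimal sketch: register every label before the model weights are initialized
# (spaCy v2 API). ALL_LABELS is a placeholder for the full label set the data uses.
import spacy

ALL_LABELS = ["fintable", "NOI", "building_count"]

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)

for label in ALL_LABELS:
    ner.add_label(label)          # add labels up front, before begin_training()

optimizer = nlp.begin_training()  # weights are created with all labels known
# ... the usual nlp.update() loop follows
```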
-
I am using 2.3. I posted my code and do include all labels. Since this model has a long training history, I also ensure that any new labels are added. The model vocabulary has grown without issue, except for the edge condition where a token is redefined or sub-divided. Even then, I recognize the reality of stepping back to move forward, but neither the expected forgetting nor the expected learning is taking place.

I have not seen any reported negative losses. Imagine a scenario where token-start 0 to token-end 50 is named "fintable" and trained extensively. We then decide we wish to remove the "fintable" annotation and replace it with "NOI" at tokens 12-20. NOI may or may not be a new label. This blows things up.

I've seen the same issue when consolidating token names. Suppose I have 20 entities labeled "building_count", and 5 labeled "property_structures". If I change the 5 to "building_count" and eliminate that tag, bad things happen.

I would love to get some guidance on any parameters or pipeline config I am using that might cause this. I hate the idea of keeping a training audit table and building in logic to exclude anything that used to overlap with something else.
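To make the two scenarios above concrete, a rough sketch with annotations written as (token_start, token_end, label) tuples; the offsets and labels come from the examples above, everything else is illustrative only:

```python
# Illustrative only: offsets and labels mirror the two scenarios described above.

# Scenario 1: a broad span is removed and replaced by a narrower, overlapping
# span carrying a different label.
before = [(0, 50, "fintable")]
after = [(12, 20, "NOI")]

# Scenario 2: consolidating labels. Every "property_structures" entity is folded
# into "building_count" and the old tag is retired.
relabel = {"property_structures": "building_count"}
annotations = [(3, 5, "property_structures"), (10, 11, "building_count")]
annotations = [(start, end, relabel.get(label, label))
               for start, end, label in annotations]
```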
-
In v2.3 there is still a weird bug if you add additional labels to a trained model, see #6525 (comment). The simplest workaround is to save the model to disk and reload before training.
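A rough sketch of that workaround with the spaCy v2 API; the model paths and the added label are placeholders:

```python
# Workaround sketch (spaCy v2): add the new label, save to disk, reload, then train.
# "existing_model" and "NOI" are placeholders for the actual path and label.
import spacy

nlp = spacy.load("existing_model")            # previously trained pipeline
ner = nlp.get_pipe("ner")
ner.add_label("NOI")                          # label not seen during initial training

nlp.to_disk("existing_model_with_new_label")        # save ...
nlp = spacy.load("existing_model_with_new_label")   # ... and reload before resuming training
# ... continue with the usual nlp.update() loop on the reloaded pipeline
```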
-
Are you saying that after I add the labels to the pipeline, I should save the model before nlp.begin...?

I am not sure this is the same issue. The labels I am using and moving between have been in the model since inception. I am doing a consolidation, or sometimes a drill-down.

Semantically: "Sources and Uses" precedes a simple table, and the model has been able to discover it with decent success. One of the rows in sources and uses might be "Equity" followed by a float. In the broader context, this too has good success. Perhaps this is a source of the issue, since any given span can only belong to one label?

The other example is where I am removing a label and moving a small set to another existing one.

Thanks
-
No, these actions reflect refinement of the training annotations. I am using a custom PDF viewer to select text and then translate it to token offsets. In this case, a larger block of text with label X has been deleted, and a smaller block of text contained within the previous offsets has been added with label Y.
On 2/23/2021 7:06 AM, Matthew Honnibal wrote:

> > I am using 2.3. I posted my code and do include all labels. Since this is a long training model, I also ensure that any new labels are added.
>
> Sorry, I do see that now.
>
> > Imagine a scenario where token-start 0 to token-end 50 is named "fintable" and trained extensively. We then decide we wish to remove "fintable" annotation and replace with "NOI" and token 12-20. NOI may or may not be new provision.
> >
> > I've seen the same issue when consolidating token names. Suppose I have 20 entities labeled "building_count", and 5 labeled "property_structures". If I change the 5 to "building_count" and eliminate that tag
>
> Apologies, but I just don't think I follow what you mean here. Could you provide example code? These operations aren't in your code above, right?
-
Everything is independent. The text is unchanging... other annotations are unchanging... I also have a tool that checks the annotation text against text[token_start:token_end], and this shows no issues.
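Roughly, a check along those lines (the tuple layout here is an assumption, not the actual tool):

```python
# Hypothetical sanity check: confirm the stored annotation text still matches the
# slice recovered from its offsets. The (start, end, label, expected_text) layout
# is assumed for illustration.
def find_offset_mismatches(doc_text, annotations):
    return [
        (start, end, label)
        for start, end, label, expected_text in annotations
        if doc_text[start:end] != expected_text
    ]
```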
On 2/23/2021 7:23 AM, Matthew Honnibal wrote:

> Are you sure you're not messing up the offsets somehow when you do this, for instance the offsets of other entities?
-
I'm not sure it is related, but it is something I will address. I add labels before training, so it's easy to tell if a new one is added. Would I then save the model AND null out the model and reload, or is saving sufficient?
On 2/23/2021 7:25 AM, Adriane Boyd wrote:

> It might be a separate issue, but existing model + `ner.add_label(new_label)` + immediate training leads to weird losses and bad performance. If your script always starts from `spacy.blank("en")` then I don't think this is related, but if you are starting from an existing model loaded with `spacy.load()` and `add_label` is actually adding a new label, then this could be part of what's going on.
-
I added a little code to save and reload the model if new labels were present, and then ran a single iteration of training. The results are promising: instead of losses increasing by 100x, they are only increasing by 5x. I'll need to research further, and I'm not sure it explains all the behavior, but it certainly helps.
-
More work to do, but I think Adriane's suggestion has addressed the issue. While the initial losses are much larger than I would expect, they quickly settle back down to a more reasonable level. Thanks very much for your insights into this issue. I am still not sure what combination of factors is triggering this, but the pre-training save seems to have broken the event chain. I will continue to test and will report back if I can recreate the problem outside of this solution.
-
Hello,

I have a custom named entity model with thousands of annotations that has been incrementally trained for months. I am using spaCy 2.2, Python 3, and my own training script originally based on your documentation.

If I remove a single annotation and replace it with one or more different labels that overlap with the same token start and end, my models, on incremental training paths, will show losses increased by 100-200x... 5x larger than a clean model at the start.

Since I have little control over changes that come from client text annotations, this seems like an unexpected outcome. I could see where changes and subsequent relearning would require some time, but a single change seems to corrupt everything.