Why can't a picklable Doc be used in doc extension attributes when multiprocessing? #5571

lingvisa · 2020-06-10T17:22:01Z

In the ChineseTokenizer class, I want to attach the character-based tokenization doc to the word-based tokenization doc as an extension attribute, as follows:

    def __call__(self, text):
        # Word-based tokenization with jieba
        word_nlp_doc = None
        if self.use_jieba:
            if self.cut_all:
                jieba_words = [x for x in self.jieba_seg.cut(text, cut_all=True) if x]
            else:
                jieba_words = [x for x in self.jieba_seg.cut(text) if x]
            if len(jieba_words) == 0:
                jieba_words.append(' ')
            words = [jieba_words[0]]
            spaces = [False]
            for i in range(1, len(jieba_words)):
                word = jieba_words[i]
                # Both branches append False, so whitespace is kept as its own token
                if word.isspace():
                    spaces.append(False)
                else:
                    spaces.append(False)
                words.append(word)
            word_nlp_doc = Doc(self.vocab, words=words, spaces=spaces)

        # Character-based tokenization: one token per character
        words = []
        spaces = []
        for char in list(text):
            if char == ' ':
                spaces.append(True)
            else:
                spaces.append(False)
            words.append(char)

        char_nlp_doc = Doc(self.vocab, words=words, spaces=spaces)
        # Attach the character-based Doc to the word-based Doc.
        # Assumes use_jieba is True; otherwise word_nlp_doc is still None here.
        word_nlp_doc.user_data['char_doc'] = char_nlp_doc
        return word_nlp_doc

This code raises the following error when multiprocessing is used:

  File "_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "_packer.pyx", line 264, in srsly.msgpack._packer.Packer._pack
  File "_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'spacy.tokens.doc.Doc' object

A spaCy Doc is picklable, right? Why can't it be set as an extension attribute when multiprocessing is used?

Answered by adrianeboyd

adrianeboyd · 2020-06-11T07:42:15Z

You can pickle a Doc, but we would strongly recommend against it because there are better ways to save the annotation you need in a more secure and much more compact format. For the core token attributes, Doc.to_array() is a good option, and for a large collection of docs, you can use DocBin.
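For illustration, here is a minimal sketch of both options, not a definitive recipe. The sample words are made up, and a bare `Vocab()` stands in for a real pipeline's vocab so that no model is needed.

```python
from spacy.attrs import ORTH, SPACY
from spacy.tokens import Doc, DocBin
from spacy.vocab import Vocab

# Assumption: a bare Vocab is enough for this demo; a real pipeline would use nlp.vocab.
vocab = Vocab()
doc = Doc(vocab, words=["hello", "world"], spaces=[True, False])

# Export selected token attributes as a numpy array, one row per token.
arr = doc.to_array([ORTH, SPACY])
# ORTH values are string hashes that can be resolved through the vocab.
first_word = vocab.strings[int(arr[0, 0])]

# Serialize a collection of docs compactly, then restore them.
doc_bin = DocBin()
doc_bin.add(doc)
data = doc_bin.to_bytes()
restored = list(DocBin().from_bytes(data).get_docs(vocab))
```

The restored docs carry the same tokens and whitespace as the originals, and `data` is a plain bytes object that can be written to disk or sent between processes.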

You can't serialize a Doc with msgpack, which is what this is trying to do, since msgpack doesn't support most of the object types for Doc.

What information do you really need to save from the character-based doc overall? Can you just save the words and spaces instead? You could also consider saving the output of doc.to_array() with the features you're interested in, since that would be serializable with msgpack as a numpy array.
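A minimal sketch of the words-and-spaces idea, assuming later components only need the character tokens: plain lists of strings and booleans can sit in `user_data`, and a character-based Doc can be rebuilt from them on demand. The `char_words` and `char_spaces` keys, the helper name, and the sample text are made up for illustration.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab


def char_words_and_spaces(text):
    """Return one token per character, with no trailing whitespace flags."""
    words = list(text)
    spaces = [False] * len(words)
    return words, spaces


# Assumption: a bare Vocab is enough for this demo; a real pipeline would use nlp.vocab.
vocab = Vocab()
text = "我爱北京"
word_doc = Doc(vocab, words=["我", "爱", "北京"], spaces=[False, False, False])

# Store plain lists instead of a Doc object.
words, spaces = char_words_and_spaces(text)
word_doc.user_data["char_words"] = words
word_doc.user_data["char_spaces"] = spaces

# The word-based doc now serializes, because user_data holds only msgpack-friendly types.
restored = Doc(vocab).from_bytes(word_doc.to_bytes())

# A later component can rebuild the character-based doc when it needs one.
char_doc = Doc(
    restored.vocab,
    words=list(restored.user_data["char_words"]),
    spaces=list(restored.user_data["char_spaces"]),
)
```

Rebuilding the character doc costs one extra `Doc` construction per text, but it keeps `user_data` limited to types that msgpack can handle.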

lingvisa · 2020-06-11T18:56:56Z

Yes, I have a workaround for my purpose. For entity recognition, I couple a dictionary-based recognizer with machine-learning and rule-based approaches. In particular, I want the dictionary matching to work on characters instead of words, to boost recall. When the pipeline reaches the dictionary recognizer, I want to switch from the word-based doc to the character-based doc. Initially, I wanted to pass both the word and char docs to later components, which is why I tried to set char_doc as an attribute of word_doc. Given the serialization issue, I have stopped doing that. I now just convert between the char doc and the word doc right before the dictionary matching, and I still associate the tagging results with the original word doc. Converting character position indices to word position indices is a headache, especially when dictionary entries don't line up with tokens, but overall it works, since users mostly want to use a dictionary of strings with various NLP elements, not the original Doc. If you have any suggestions on this, please share, and thanks.
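The character-to-word index conversion described above could be sketched roughly as follows. This is a hypothetical helper, not part of spaCy. It assumes the words concatenate to the original text with no whitespace in between, as in the tokenizer posted earlier, and that the offsets fall inside the text.

```python
def char_span_to_word_span(words, start_char, end_char):
    """Map the character range [start_char, end_char) to word indices.

    Returns (first, last_exclusive, aligned), where aligned is True only
    if the character range starts and ends exactly on word boundaries.
    """
    # Compute the character offsets at which each word starts and ends.
    starts, ends = [], []
    offset = 0
    for word in words:
        starts.append(offset)
        offset += len(word)
        ends.append(offset)

    # First word that extends past start_char, last word that reaches end_char.
    first = next(i for i, end in enumerate(ends) if end > start_char)
    last = next(i for i, end in enumerate(ends) if end >= end_char)
    aligned = starts[first] == start_char and ends[last] == end_char
    return first, last + 1, aligned


words = ["我", "爱", "北京", "天安门"]
exact = char_span_to_word_span(words, 2, 4)    # "北京", matches one word
partial = char_span_to_word_span(words, 3, 5)  # "京天", crosses a word boundary
```

The `aligned` flag makes the awkward case explicit: a dictionary hit that cuts through a word can either be widened to the covering words or dropped. spaCy's own `Doc.char_span` does a similar mapping, but by default it returns `None` when the offsets do not fall on token boundaries.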
