Why can't picklable Doc be used in doc extension attributes when multiprocessing? #5571
In the ChineseTokenizer class, I want to set the character-based tokenization doc as an extension attribute of the word-based tokenization doc.
A spaCy Doc is picklable, right? Why can't it be set as an extension attribute when multiprocessing is used?
Replies: 2 comments
Yes, I found a workaround for my purpose. For entity recognition, I have a dictionary-based recognizer that is coupled with machine-learning and rule-based approaches. In particular, I want to use character-based matching for the dictionary, instead of word-based matching, to boost recall. When the pipeline reaches the dictionary recognizer, I want to switch from the word-based doc to the character-based doc. Initially, I wanted to send both the word and char docs to later components, which is why I wanted to set char_doc as an attribute of word_doc. Given the serialization issue, I have stopped doing that and just convert the char doc to word doc right before the dictionary matching, while still associating the tagging results with the original word doc. This causes headaches when converting char position indices to word position indices, especially when dictionary entries don't match tokens, but overall it works, since users mostly want to use a dictionary of strings with various NLP elements, not the original Doc. If you have any suggestions on this, please share. Thanks.
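The char-to-word index conversion described above can be sketched in plain Python, without a Doc at all. This is only an illustration under assumptions: the helper names `word_offsets` and `char_span_to_word_span` are hypothetical, the segmentation is hard-coded, and a match that cuts through a word is expanded to the smallest covering word range and flagged as not aligned.

```python
from bisect import bisect_left, bisect_right


def word_offsets(words, spaces):
    """Return (starts, ends): character offsets of each word in the text."""
    starts, ends = [], []
    pos = 0
    for word, space in zip(words, spaces):
        starts.append(pos)
        pos += len(word)
        ends.append(pos)
        if space:  # a trailing space shifts the next word by one character
            pos += 1
    return starts, ends


def char_span_to_word_span(starts, ends, char_start, char_end):
    """Map the character range [char_start, char_end) to a word range [i, j).

    Assumes a non-empty range that lies inside the text. The third return
    value says whether the range falls exactly on word boundaries.
    """
    first = bisect_right(starts, char_start) - 1  # last word starting at or before char_start
    last = bisect_left(ends, char_end)            # first word ending at or after char_end
    aligned = starts[first] == char_start and ends[last] == char_end
    return first, last + 1, aligned


# Hard-coded word segmentation, for illustration only.
words = ["我", "爱", "北京", "天安门"]
spaces = [False, False, False, False]
starts, ends = word_offsets(words, spaces)

exact = char_span_to_word_span(starts, ends, 2, 4)    # dictionary hit on "北京"
partial = char_span_to_word_span(starts, ends, 3, 5)  # dictionary hit on "京天"
```

Whether to expand, drop, or otherwise handle the non-aligned hits is a design choice that depends on how the tagging results are consumed downstream.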
You can pickle a `Doc`, but we would strongly recommend against it because there are better ways to save the annotation you need in a more secure and much more compact format. For the core token attributes, `Doc.to_array()` is a good option, and for a large collection of docs, you can use `DocBin`.

You can't serialize a `Doc` with `msgpack`, which is what this is trying to do, since `msgpack` doesn't support most of the object types for `Doc`.

What information do you really need to save from the character-based doc overall? Can you just save the `words` and `spaces` instead? You could also consider saving the output of `doc.to_array()` with the features you're interested in, since that would be serializable …
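A minimal sketch of the `words` and `spaces` suggestion, assuming spaCy is installed. The extension names `char_words` and `char_spaces` are hypothetical, both segmentations are hard-coded rather than produced by a real tokenizer, and the `to_bytes()` round trip only stands in for whatever serialization multiprocessing performs.

```python
from spacy.attrs import LENGTH, ORTH
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hypothetical extension names, chosen for this sketch only.
Doc.set_extension("char_words", default=None, force=True)
Doc.set_extension("char_spaces", default=None, force=True)

vocab = Vocab()
text = "我爱北京"

# Hard-coded segmentations; a real pipeline would get these from its tokenizer.
word_doc = Doc(vocab, words=["我", "爱", "北京"], spaces=[False, False, False])
char_doc = Doc(vocab, words=list(text), spaces=[False] * len(text))

# Store plain lists on the word-based doc instead of the Doc object itself.
word_doc._.char_words = [token.text for token in char_doc]
word_doc._.char_spaces = [bool(token.whitespace_) for token in char_doc]

# The word-based doc, including the extension data, now survives a bytes round trip.
restored = Doc(vocab).from_bytes(word_doc.to_bytes())

# A later component can rebuild the character-based doc when it needs one.
rebuilt = Doc(
    restored.vocab,
    words=restored._.char_words,
    spaces=restored._.char_spaces,
)

# Alternative: export selected token attributes as a NumPy array of integer IDs.
features = char_doc.to_array([ORTH, LENGTH])
```

The rebuilt doc carries only the tokenization; any other annotation on the character-based doc would have to be saved separately, for example through the attribute array.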