Split long sentences into arrays of capped token length #11529
-
Hi there! I want to take very long sentences in my corpus and split them at a max length (say 512 tokens). Are there any utilities in Spark NLP to help with this? Would the best way be to use tokenizer(), do some list manipulation, and then re-join elements of the capped token arrays to reconstruct the split sentences? Thanks!
-
Hi @jenghub
We have 2 sentence splitter annotators:
- SentenceDetector, which is rule-based
- SentenceDetectorDL, which is a trainable, DL-based annotator (SentenceDetectorDLModel for prediction)

Both of these annotators have a parameter called splitLength that forcibly splits a sentence at that length into two or more sentences:
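For reference, a minimal sketch (not from the original reply) of how setSplitLength can be used with the rule-based SentenceDetector; the column names, sample text, and the 512 value are illustrative:

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Turn raw text into Spark NLP's DOCUMENT annotation type
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Rule-based sentence splitter; setSplitLength forcibly splits
# any sentence longer than the given length into multiple pieces.
# SentenceDetectorDLModel.pretrained() could be swapped in here,
# as it exposes the same splitLength parameter.
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setSplitLength(512)

pipeline = Pipeline(stages=[document_assembler, sentence_detector])

data = spark.createDataFrame([["A very long sentence ..."]]).toDF("text")
result = pipeline.fit(data).transform(data)

# One row per (possibly force-split) sentence
result.selectExpr("explode(sentence.result) as sentence").show(truncate=80)
```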
Hope this helps :)