Split long sentences into arrays of capped token length #11529
-
Hi there! I want to take very long sentences in my corpus and split them at a max length (say 512 tokens). Are there any utilities in Spark NLP to help with this? Would the best way be to use tokenizer(), do some list manipulation, and then re-join elements of the capped token arrays to reconstruct the split sentences? Thanks!
-
Hi @jenghub
We have 2 sentence splitter annotators:
- SentenceDetector, which is rule-based
- SentenceDetectorDL, which is a trainable, DL-based annotator (SentenceDetectorDLModel for prediction)

Both of these annotators have a parameter called splitLength that forcibly splits a sentence at that length into two or more sentences:
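For reference, a minimal sketch (not from the original reply) of how setSplitLength can be used with the rule-based SentenceDetector; the column names, sample text, and the 512 value are illustrative:

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Turn raw text into Spark NLP's DOCUMENT annotation type
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Rule-based sentence splitter; setSplitLength forcibly splits
# any sentence longer than the given length into multiple pieces.
# SentenceDetectorDLModel.pretrained() could be swapped in here,
# as it exposes the same splitLength parameter.
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setSplitLength(512)

pipeline = Pipeline(stages=[document_assembler, sentence_detector])

data = spark.createDataFrame([["A very long sentence ..."]]).toDF("text")
result = pipeline.fit(data).transform(data)

# One row per (possibly force-split) sentence
result.selectExpr("explode(sentence.result) as sentence").show(truncate=80)
```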
Hope this helps :)