sentence span is wrong if there are sentences containing only space tokens #42

jwijffels · 2022-01-27T11:49:59Z

The sentence span (`start_char` / `end_char`) is wrong when the input contains sentences that consist only of space tokens:

>>> import spacy
>>> import spacy_udpipe
>>> spacy_udpipe.download("nl")
Already downloaded a model for the 'nl' language
>>> nlp = spacy_udpipe.load("nl")
>>>
>>> def line_splitter(x):
...     text = str(x)
...     text = text.split(sep = "\n")
...     text = [sent + "\n" for sent in text]
...     return text
...
>>> text_raw = "We gingen naar Brussel \n\n \nen kochten op 13/12/2021 veel eten. Jullie ook?"
>>> text = line_splitter(text_raw)
>>> text
['We gingen naar Brussel \n', '\n', ' \n', 'en kochten op 13/12/2021 veel eten. Jullie ook?\n']
>>> doc = nlp(text)
>>> for sent_i, sent in enumerate(doc.sents):
...     print(sent.start_char, sent.end_char)
...
0 22
23 70
>>> text_raw[0:(22+1)]
'We gingen naar Brussel '
>>> text_raw[23:(70+1)]
'\n\n \nen kochten op 13/12/2021 veel eten. Jullie o'
>>>

jwijffels · 2022-01-27T13:14:45Z

The reason, of course, is that UDPipe does not return spaces as tokens; all spaces are recorded in the misc column.
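To illustrate what that implies for character offsets, here is a minimal sketch, not the spacy-udpipe implementation. It assumes the whitespace is stored in the misc column as `SpacesBefore=` / `SpacesAfter=` values with the escapes `\s`, `\t`, `\r`, `\n`, `\p` and `\\`, and that a token without such a field is followed by one space unless it carries `SpaceAfter=No`. The helper names (`decode_spaces`, `parse_misc`, `token_offsets`) and the sample misc values are hypothetical and only serve the example.

```python
import re

# Assumed escape sequences inside SpacesBefore/SpacesAfter values.
_UNESCAPE = {"s": " ", "t": "\t", "r": "\r", "n": "\n", "p": "|", "\\": "\\"}


def decode_spaces(value):
    """Turn an escaped misc value such as '\\s\\n' back into real whitespace."""
    return re.sub(r"\\(.)", lambda m: _UNESCAPE.get(m.group(1), m.group(0)), value)


def parse_misc(misc):
    """Split a misc column ('A=1|B=2', or '_' when empty) into a dict."""
    if misc in ("", "_"):
        return {}
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def token_offsets(tokens):
    """Compute (form, start_char, end_char) for (form, misc) pairs in document order."""
    pos = 0
    result = []
    for form, misc in tokens:
        fields = parse_misc(misc)
        # Whitespace that precedes the token shifts its start offset.
        pos += len(decode_spaces(fields.get("SpacesBefore", "")))
        start = pos
        pos += len(form)
        result.append((form, start, pos))
        if "SpacesAfter" in fields:
            # Explicit whitespace run, possibly spanning several blank lines.
            pos += len(decode_spaces(fields["SpacesAfter"]))
        elif fields.get("SpaceAfter") != "No":
            # Default convention: a single space follows the token.
            pos += 1
    return result


# Hypothetical misc values resembling the whitespace-only lines in the report.
tokens = [
    ("Brussel", "SpacesAfter=\\s\\n\\n\\s\\n"),
    ("en", "_"),
    ("kochten", "SpaceAfter=No"),
]
offsets = token_offsets(tokens)
```

Under these assumptions the whitespace-only lines never become tokens of their own, so their length has to be added to the running offset; otherwise every later `start_char` and `end_char` drifts.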
