fix pretokenized input format
Now, a list of tokens is not valid input anymore
Bram Vanroy authored and Bram Vanroy committed Jan 17, 2023
1 parent 8bfe62b commit 5e82adc
Showing 6 changed files with 38 additions and 52 deletions.
14 changes: 14 additions & 0 deletions HISTORY.md
@@ -1,5 +1,19 @@
# History

## 3.3.0 (January 17th, 2023)

Since spaCy 3.2.0, the input that is passed to a spaCy pipeline is validated more strictly: passing a list of
pretokenized tokens (`["This", "is", "a", "pretokenized", "sentence"]`) is no longer accepted. The `is_tokenized`
option has been adapted to reflect this. It is still possible to pass a string in which tokens are separated by
whitespace, e.g. `"This is a pretokenized sentence"`, which continues to work for spaCy and stanza. Support for
pretokenized data has been dropped for UDPipe.

Specific changes:

- **[conllparser]** Breaking change: `is_tokenized` is no longer a valid argument to `ConllParser`.
- **[utils/conllparser]** Breaking change: when using UDPipe, pretokenized data is no longer supported.
- **[utils]** Breaking change: `SpacyPretokenizedTokenizer.__call__` no longer accepts a list of tokens.

## 3.2.0 (April 4th, 2022)

- **[conllformatter]** Fixed an issue where `SpaceAfter=No` was not added correctly to tokens
33 changes: 13 additions & 20 deletions README.md
@@ -1,9 +1,6 @@
# Parsing to CoNLL with spaCy, spacy-stanza, and spacy-udpipe

**The last version to support spaCy v2 can be found** [here](<https://github.com/BramVanroy/spacy_conll/tree/v2.1.0>).
The current version only supports v3.

-This module allows you to parse text into CoNLL-U format\_. You can use it as a command line tool, or embed it in your
+This module allows you to parse text into CoNLL-U format. You can use it as a command line tool, or embed it in your
own scripts by adding it as a custom pipeline component to a spaCy, `spacy-stanza`, or `spacy-udpipe` pipeline. It
also provides an easy-to-use function to quickly initialize a parser as well as a ConllParser class with built-in
functionality to parse files or text.
@@ -26,23 +23,22 @@ By default, this package automatically installs only [spaCy](https://spacy.io/us
*are* trained on UD data.

**NOTE**: `spacy-stanza` and `spacy-udpipe` are not installed automatically as a dependency for this library, because
-it might be too much overhead for those who don't need UD. If you wish to use their functionality (e.g. better
-performance, real UD output), you have to install them manually or use one of the available options as described
-below.
+it might be too much overhead for those who don't need UD. If you wish to use their functionality, you have to install
+them manually or use one of the available options as described below.

If you want to retrieve CoNLL info as a `pandas` DataFrame, this library will automatically export it if it detects
that `pandas` is installed. See the Usage section for more.
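For instance, a small sketch of the DataFrame export (assumes `pandas` is installed; `conll_pd` is the extension attribute this library registers when `pandas` is available):

```python
from spacy_conll import init_parser

nlp = init_parser("en_core_web_sm", "spacy")
doc = nlp("I like cookies.")

# The CoNLL representation of the Doc as a pandas DataFrame
print(doc._.conll_pd.head())
```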

To install the library, simply use pip.

-```bash
+```shell
# only includes spacy by default
pip install spacy_conll
```

A number of options are available to make installation of additional dependencies easier:

-```bash
+```shell
# include spacy-stanza and spacy-udpipe
pip install spacy_conll[parsers]
# include pandas
@@ -100,11 +96,8 @@ Because this library supports different spaCy wrappers (`spacy`, `stanza`, and `
find the function's signature below. Have a look at the [source code](spacy_conll/utils.py) to read more about all the
possible arguments or try out the [examples](examples/).

-**NOTE**: `is_tokenized` does not work for `spacy-udpipe` and `disable_sbd` only works for `spacy`. `spacy-udpipe` has
-made a change to allow pretokenized text, but it depends on the input format and cannot be fixed at initialisation of
-the parser. See release v0.3.0 of spacy-udpipe or [this PR](https://github.com/TakeLab/spacy-udpipe/pull/19). Using
-`is_tokenized` for `spacy-stanza` also affects sentence segmentation, effectively *only* splitting on new
-lines. With `spacy`, `is_tokenized` disables sentence splitting completely.
+**NOTE**: `is_tokenized` does not work for `spacy-udpipe`. Using `is_tokenized` for `spacy-stanza` also affects sentence
+segmentation, effectively *only* splitting on new lines. With `spacy`, `is_tokenized` disables sentence splitting completely.
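An illustrative sketch of these options (not the full signature, which follows below; assumes the `en_core_web_sm` model and stanza's English models have been downloaded):

```python
from spacy_conll import init_parser

# spaCy: pretokenized input; sentence splitting is disabled entirely
nlp_spacy = init_parser("en_core_web_sm", "spacy", is_tokenized=True)

# stanza: pretokenized input; sentences are split on new lines only
nlp_stanza = init_parser("en", "stanza", is_tokenized=True)

doc = nlp_spacy("This is a pretokenized sentence")
print(doc._.conll_str)
```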

```python
def init_parser(
@@ -221,8 +214,8 @@ for sent in doc.sents:
Upon installation, a command-line script is added under the alias `parse-as-conll`. You can use it to parse a
string or file into CoNLL format given a number of options.

-```bash
-> parse-as-conll -h
+```shell
+parse-as-conll -h
usage: parse-as-conll [-h] [-f INPUT_FILE] [-a INPUT_ENCODING] [-b INPUT_STR] [-o OUTPUT_FILE]
[-c OUTPUT_ENCODING] [-s] [-t] [-d] [-e] [-j N_PROCESS] [-v]
[--ignore_pipe_errors] [--no_split_on_newline]
@@ -295,8 +288,8 @@ optional arguments:
For example, parsing a single line, multi-sentence string:
-```bash
-> parse-as-conll en_core_web_sm spacy --input_str "I like cookies. What about you?" --include_headers
+```shell
+parse-as-conll en_core_web_sm spacy --input_str "I like cookies. What about you?" --include_headers
# sent_id = 1
# text = I like cookies.
@@ -315,8 +308,8 @@ For example, parsing a single line, multi-sentence string:
For example, parsing a large input file and writing output to a given output file, using four processes:
-```bash
-> parse-as-conll en_core_web_sm spacy --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4
+```shell
+parse-as-conll en_core_web_sm spacy --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4
```
2 changes: 1 addition & 1 deletion spacy_conll/__init__.py
@@ -1,4 +1,4 @@
-__version__ = "3.2.0"
+__version__ = "3.3.0"

from .formatter import ConllFormatter
from .parser import ConllParser
22 changes: 1 addition & 21 deletions spacy_conll/parser.py
@@ -29,13 +29,9 @@ class ConllParser:
Constructor arguments:
:param nlp: instantiated spaCy-like parser
-:param is_tokenized: whether or not the expected input format is pre-tokenized. This must correspond with how
-'nlp' was initialized! If you initialized the 'nlp' object with 'init_parser', make sure you used 'is_tokenized'
-in the same way
"""

nlp: Language
-is_tokenized: bool = False
parser: str = field(init=False, default=None)

def __post_init__(self):
@@ -56,21 +52,7 @@ def __post_init__(self):
self.parser = "spacy"

def __repr__(self) -> str:
-return f"{self.__class__.__name__}(is_tokenized={self.is_tokenized}, parser={self.parser})"

-def prepare_data(self, lines: List[str]) -> List[str]:
-    """Prepares data according to whether or not is_tokenized was given and depending on the parser.
-    Each parser requires a different type of input when the data is pre_tokenized.
-    :param lines: a list of lines to process
-    :return: the lines in the correct format for the parser
-    """
-    if self.is_tokenized:
-        if self.parser == "spacy":
-            lines = [l.split() for l in lines]
-        elif self.parser == "udpipe":
-            lines = [[l.split()] for l in lines]
-
-    return lines
+return f"{self.__class__.__name__}(parser={self.parser})"

def parse_file_as_conll(
self, input_file: Union[PathLike, Path, str], input_encoding: str = getpreferredencoding(), **kwargs
@@ -128,8 +110,6 @@ def parse_text_as_conll(
else:
text = text.splitlines()

-text = self.prepare_data(text)

conll_idx = 0
output = ""
for doc_idx, doc in enumerate(self.nlp.pipe(text, n_process=n_process)):
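To make the `parser.py` change concrete, a small usage sketch of the updated `ConllParser` (illustrative only; `is_tokenized` is now configured on the pipeline via `init_parser`, no longer on `ConllParser` itself, and the `en_core_web_sm` model is assumed):

```python
from spacy_conll import ConllParser, init_parser

# Pretokenization is handled by the pipeline, not by ConllParser
nlp = init_parser("en_core_web_sm", "spacy", is_tokenized=True)
parser = ConllParser(nlp)

# Input is a plain string with whitespace-separated tokens
conll_str = parser.parse_text_as_conll("This is a pretokenized sentence")
print(conll_str)
```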
12 changes: 5 additions & 7 deletions spacy_conll/utils.py
@@ -51,7 +51,7 @@ def init_parser(
See the stanza documentation for more:
https://stanfordnlp.github.io/stanza/tokenize.html#start-with-pretokenized-text
-This option does not affect UDPipe.
+This option is not supported in UDPipe.
:param disable_sbd: disables automatic sentence boundary detection in spaCy and stanza. For stanza, make sure that
your input is in the correct format, that is: sentences must be separated by two new lines. If you want to
disable both tokenization and sentence segmentation in stanza, do not enable this option but instead only
@@ -60,7 +60,7 @@ def init_parser(
See the stanza documentation for more:
https://stanfordnlp.github.io/stanza/tokenize.html#tokenization-without-sentence-segmentation
-This option does not affect UDPipe.
+This option is not supported in UDPipe.
:param exclude_spacy_components: spaCy components to exclude from the pipeline, which can greatly improve
processing speed. Only works when using spaCy as a parser.
:param parser_opts: will be passed to the core pipeline. For spacy, it will be passed to its
@@ -133,19 +133,17 @@ def __init__(self, vocab: Vocab):
"""
self.vocab = vocab

-def __call__(self, inp: Union[List[str], str]) -> Doc:
+def __call__(self, inp: str) -> Doc:
"""Call the tokenizer on input `inp`.
-:param inp: either a string to be split on whitespace, or a list of tokens
+:param inp: a string to be split on whitespaces
:return: the created Doc object
"""
if isinstance(inp, str):
words = inp.split()
spaces = [True] * (len(words) - 1) + ([True] if inp[-1].isspace() else [False])
return Doc(self.vocab, words=words, spaces=spaces)
-elif isinstance(inp, list):
-    return Doc(self.vocab, words=inp)
else:
-raise ValueError("Unexpected input format. Expected string to be split on whitespace, or list of tokens.")
+raise ValueError("Unexpected input format. Expected string to be split on whitespace.")


@Language.factory("disable_sbd")
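A minimal sketch of the updated tokenizer behavior (illustrative; constructs the tokenizer directly with a bare `Vocab`):

```python
from spacy.vocab import Vocab

from spacy_conll.utils import SpacyPretokenizedTokenizer

tokenizer = SpacyPretokenizedTokenizer(Vocab())

# A whitespace-separated string is still accepted...
doc = tokenizer("This is a pretokenized sentence")
assert [t.text for t in doc] == ["This", "is", "a", "pretokenized", "sentence"]

# ...but a list of tokens now raises a ValueError
try:
    tokenizer(["This", "is", "a", "pretokenized", "sentence"])
except ValueError as err:
    print(err)
```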
7 changes: 4 additions & 3 deletions tests/conftest.py
@@ -50,9 +50,10 @@ def conllparser(request):
yield ConllParser(get_parser(request.param, include_headers=True))


-@pytest.fixture(params=["spacy", "stanza", "udpipe"])
+# Not testing with UDPipe, which does not support this
+@pytest.fixture(params=["spacy", "stanza"])
def pretokenized_conllparser(request):
-yield ConllParser(get_parser(request.param, is_tokenized=True, include_headers=True), is_tokenized=True)
+yield ConllParser(get_parser(request.param, is_tokenized=True, include_headers=True))


@pytest.fixture
@@ -101,7 +102,7 @@ def base_doc(base_parser, text):
def pretokenized_doc(pretokenized_parser):
name = pretokenized_parser[1]
if name == "spacy":
-yield pretokenized_parser[0](single_sent().split())
+yield pretokenized_parser[0](single_sent())
else:
yield pretokenized_parser[0](single_sent())

