fix pretokenized input format
Now, a list of tokens is not valid input anymore
Bram Vanroy authored and Bram Vanroy committed Jan 17, 2023
1 parent 8bfe62b commit 5e82adc
Showing 6 changed files with 38 additions and 52 deletions.
14 changes: 14 additions & 0 deletions HISTORY.md
@@ -1,5 +1,19 @@
# History

## 3.3.0 (January 17th, 2023)

Since spaCy 3.2.0, the input that is passed to a spaCy pipeline is validated more strictly: passing a list of
pretokenized tokens (`["This", "is", "a", "pretokenized", "sentence"]`) is no longer accepted. The `is_tokenized`
option has been adapted to reflect this. It is still possible to pass a string in which tokens are separated by
whitespace, e.g. `"This is a pretokenized sentence"`, which continues to work for spaCy and stanza. Support for
pretokenized data has been dropped for UDPipe.

Specific changes:

- **[conllparser]** Breaking change: `is_tokenized` is no longer a valid argument to `ConllParser`.
- **[utils/conllparser]** Breaking change: when using UDPipe, pretokenized data is no longer supported.
- **[utils]** Breaking change: `SpacyPretokenizedTokenizer.__call__` no longer accepts a list of tokens.

## 3.2.0 (April 4th, 2022)

- **[conllformatter]** Fixed an issue where `SpaceAfter=No` was not added correctly to tokens
33 changes: 13 additions & 20 deletions README.md
@@ -1,9 +1,6 @@
# Parsing to CoNLL with spaCy, spacy-stanza, and spacy-udpipe

**The last version to support spaCy v2 can be found** [here](<https://github.com/BramVanroy/spacy_conll/tree/v2.1.0>).
The current version only supports v3.

-This module allows you to parse text into CoNLL-U format\_. You can use it as a command line tool, or embed it in your
+This module allows you to parse text into CoNLL-U format. You can use it as a command line tool, or embed it in your
own scripts by adding it as a custom pipeline component to a spaCy, `spacy-stanza`, or `spacy-udpipe` pipeline. It
also provides an easy-to-use function to quickly initialize a parser as well as a ConllParser class with built-in
functionality to parse files or text.
@@ -26,23 +23,22 @@ By default, this package automatically installs only [spaCy](https://spacy.io/us
*are* trained on UD data.

**NOTE**: `spacy-stanza` and `spacy-udpipe` are not installed automatically as a dependency for this library, because
-it might be too much overhead for those who don't need UD. If you wish to use their functionality (e.g. better
-performance, real UD output), you have to install them manually or use one of the available options as described
-below.
+it might be too much overhead for those who don't need UD. If you wish to use their functionality, you have to install
+them manually or use one of the available options as described below.

If you want to retrieve CoNLL info as a `pandas` DataFrame, this library will automatically export it if it detects
that `pandas` is installed. See the Usage section for more.
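For instance, a small sketch of the DataFrame export (assumes `pandas` is installed; `conll_pd` is the extension attribute this library registers when `pandas` is available):

```python
from spacy_conll import init_parser

nlp = init_parser("en_core_web_sm", "spacy")
doc = nlp("I like cookies.")

# The CoNLL representation of the Doc as a pandas DataFrame
print(doc._.conll_pd.head())
```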

To install the library, simply use pip.

-```bash
+```shell
# only includes spacy by default
pip install spacy_conll
```

A number of options are available to make installation of additional dependencies easier:

-```bash
+```shell
# include spacy-stanza and spacy-udpipe
pip install spacy_conll[parsers]
# include pandas
@@ -100,11 +96,8 @@ Because this library supports different spaCy wrappers (`spacy`, `stanza`, and `
find the function's signature below. Have a look at the [source code](spacy_conll/utils.py) to read more about all the
possible arguments or try out the [examples](examples/).

-**NOTE**: `is_tokenized` does not work for `spacy-udpipe` and `disable_sbd` only works for `spacy`. `spacy-udpipe` has
-made a change to allow pretokenized text, but it depends on the input format and cannot be fixed at initialisation of
-the parser. See release v0.3.0 of spacy-udpipe or [this PR](https://github.com/TakeLab/spacy-udpipe/pull/19). Using
-`is_tokenized` for `spacy-stanza` also affects sentence segmentation, effectively *only* splitting on new
-lines. With `spacy`, `is_tokenized` disables sentence splitting completely.
+**NOTE**: `is_tokenized` does not work for `spacy-udpipe`. Using `is_tokenized` for `spacy-stanza` also affects sentence
+segmentation, effectively *only* splitting on new lines. With `spacy`, `is_tokenized` disables sentence splitting completely.
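An illustrative sketch of these options (not the full signature, which follows below; assumes the `en_core_web_sm` model and stanza's English models have been downloaded):

```python
from spacy_conll import init_parser

# spaCy: pretokenized input; sentence splitting is disabled entirely
nlp_spacy = init_parser("en_core_web_sm", "spacy", is_tokenized=True)

# stanza: pretokenized input; sentences are split on new lines only
nlp_stanza = init_parser("en", "stanza", is_tokenized=True)

doc = nlp_spacy("This is a pretokenized sentence")
print(doc._.conll_str)
```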

```python
def init_parser(
@@ -221,8 +214,8 @@ for sent in doc.sents:
Upon installation, a command-line script is added under the alias `parse-as-conll`. You can use it to parse a
string or file into CoNLL format given a number of options.

-```bash
-> parse-as-conll -h
+```shell
+parse-as-conll -h
usage: parse-as-conll [-h] [-f INPUT_FILE] [-a INPUT_ENCODING] [-b INPUT_STR] [-o OUTPUT_FILE]
[-c OUTPUT_ENCODING] [-s] [-t] [-d] [-e] [-j N_PROCESS] [-v]
[--ignore_pipe_errors] [--no_split_on_newline]
@@ -295,8 +288,8 @@ optional arguments:
For example, parsing a single line, multi-sentence string:
-```bash
-> parse-as-conll en_core_web_sm spacy --input_str "I like cookies. What about you?" --include_headers
+```shell
+parse-as-conll en_core_web_sm spacy --input_str "I like cookies. What about you?" --include_headers
# sent_id = 1
# text = I like cookies.
@@ -315,8 +308,8 @@ For example, parsing a single line, multi-sentence string:
For example, parsing a large input file and writing output to a given output file, using four processes:
-```bash
-> parse-as-conll en_core_web_sm spacy --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4
+```shell
+parse-as-conll en_core_web_sm spacy --input_file large-input.txt --output_file large-conll-output.txt --include_headers --disable_sbd -j 4
```
2 changes: 1 addition & 1 deletion spacy_conll/__init__.py
@@ -1,4 +1,4 @@
-__version__ = "3.2.0"
+__version__ = "3.3.0"

from .formatter import ConllFormatter
from .parser import ConllParser
22 changes: 1 addition & 21 deletions spacy_conll/parser.py
@@ -29,13 +29,9 @@ class ConllParser:
Constructor arguments:
:param nlp: instantiated spaCy-like parser
-:param is_tokenized: whether or not the expected input format is pre-tokenized. This must correspond with how
-'nlp' was initialized! If you initialized the 'nlp' object with 'init_parser', make sure you used 'is_tokenized'
-in the same way
"""

nlp: Language
-is_tokenized: bool = False
parser: str = field(init=False, default=None)

def __post_init__(self):
@@ -56,21 +52,7 @@ def __post_init__(self):
self.parser = "spacy"

def __repr__(self) -> str:
-return f"{self.__class__.__name__}(is_tokenized={self.is_tokenized}, parser={self.parser})"

-def prepare_data(self, lines: List[str]) -> List[str]:
-    """Prepares data according to whether or not is_tokenized was given and depending on the parser.
-    Each parser requires a different type of input when the data is pre_tokenized.
-    :param lines: a list of lines to process
-    :return: the lines in the correct format for the parser
-    """
-    if self.is_tokenized:
-        if self.parser == "spacy":
-            lines = [l.split() for l in lines]
-        elif self.parser == "udpipe":
-            lines = [[l.split()] for l in lines]
-
-    return lines
+return f"{self.__class__.__name__}(parser={self.parser})"

def parse_file_as_conll(
self, input_file: Union[PathLike, Path, str], input_encoding: str = getpreferredencoding(), **kwargs
@@ -128,8 +110,6 @@ def parse_text_as_conll(
else:
text = text.splitlines()

-text = self.prepare_data(text)

conll_idx = 0
output = ""
for doc_idx, doc in enumerate(self.nlp.pipe(text, n_process=n_process)):
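To make the `parser.py` change concrete, a small usage sketch of the updated `ConllParser` (illustrative only; `is_tokenized` is now configured on the pipeline via `init_parser`, no longer on `ConllParser` itself, and the `en_core_web_sm` model is assumed):

```python
from spacy_conll import ConllParser, init_parser

# Pretokenization is handled by the pipeline, not by ConllParser
nlp = init_parser("en_core_web_sm", "spacy", is_tokenized=True)
parser = ConllParser(nlp)

# Input is a plain string with whitespace-separated tokens
conll_str = parser.parse_text_as_conll("This is a pretokenized sentence")
print(conll_str)
```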
12 changes: 5 additions & 7 deletions spacy_conll/utils.py
@@ -51,7 +51,7 @@ def init_parser(
See the stanza documentation for more:
https://stanfordnlp.github.io/stanza/tokenize.html#start-with-pretokenized-text
-This option does not affect UDPipe.
+This option is not supported in UDPipe.
:param disable_sbd: disables automatic sentence boundary detection in spaCy and stanza. For stanza, make sure that
your input is in the correct format, that is: sentences must be separated by two new lines. If you want to
disable both tokenization and sentence segmentation in stanza, do not enable this option but instead only
@@ -60,7 +60,7 @@ def init_parser(
See the stanza documentation for more:
https://stanfordnlp.github.io/stanza/tokenize.html#tokenization-without-sentence-segmentation
-This option does not affect UDPipe.
+This option is not supported in UDPipe.
:param exclude_spacy_components: spaCy components to exclude from the pipeline, which can greatly improve
processing speed. Only works when using spaCy as a parser.
:param parser_opts: will be passed to the core pipeline. For spacy, it will be passed to its
@@ -133,19 +133,17 @@ def __init__(self, vocab: Vocab):
"""
self.vocab = vocab

-def __call__(self, inp: Union[List[str], str]) -> Doc:
+def __call__(self, inp: str) -> Doc:
"""Call the tokenizer on input `inp`.
-:param inp: either a string to be split on whitespace, or a list of tokens
+:param inp: a string to be split on whitespaces
:return: the created Doc object
"""
if isinstance(inp, str):
words = inp.split()
spaces = [True] * (len(words) - 1) + ([True] if inp[-1].isspace() else [False])
return Doc(self.vocab, words=words, spaces=spaces)
-elif isinstance(inp, list):
-    return Doc(self.vocab, words=inp)
else:
-raise ValueError("Unexpected input format. Expected string to be split on whitespace, or list of tokens.")
+raise ValueError("Unexpected input format. Expected string to be split on whitespace.")


@Language.factory("disable_sbd")
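A minimal sketch of the updated tokenizer behavior (illustrative; constructs the tokenizer directly with a bare `Vocab`):

```python
from spacy.vocab import Vocab

from spacy_conll.utils import SpacyPretokenizedTokenizer

tokenizer = SpacyPretokenizedTokenizer(Vocab())

# A whitespace-separated string is still accepted...
doc = tokenizer("This is a pretokenized sentence")
assert [t.text for t in doc] == ["This", "is", "a", "pretokenized", "sentence"]

# ...but a list of tokens now raises a ValueError
try:
    tokenizer(["This", "is", "a", "pretokenized", "sentence"])
except ValueError as err:
    print(err)
```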
7 changes: 4 additions & 3 deletions tests/conftest.py
@@ -50,9 +50,10 @@ def conllparser(request):
yield ConllParser(get_parser(request.param, include_headers=True))


-@pytest.fixture(params=["spacy", "stanza", "udpipe"])
+# Not testing with UDPipe, which does not support this
+@pytest.fixture(params=["spacy", "stanza"])
def pretokenized_conllparser(request):
-yield ConllParser(get_parser(request.param, is_tokenized=True, include_headers=True), is_tokenized=True)
+yield ConllParser(get_parser(request.param, is_tokenized=True, include_headers=True))


@pytest.fixture
@@ -101,7 +102,7 @@ def base_doc(base_parser, text):
def pretokenized_doc(pretokenized_parser):
name = pretokenized_parser[1]
if name == "spacy":
-yield pretokenized_parser[0](single_sent().split())
+yield pretokenized_parser[0](single_sent())
else:
yield pretokenized_parser[0](single_sent())

