Releases · BramVanroy/spacy_conll

02 Jul 08:38

BramVanroy

v4.0.0

22a14f7

v4.0.0 Latest

What's Changed

Two new changes thanks to user @rominf:

Repackaged the library to bring it up to modern standards, notably by relying on a pyproject.toml file and
removing support for Python <3.8.
When the dep, pos, tag, or lemma fields are empty, an underscore (_) is used instead.
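
The underscore fallback can be illustrated with a minimal sketch. This is not the library's actual code: the helper conll_value and the fields dictionary are hypothetical names chosen for illustration, and the sketch assumes that both None and the empty string count as "empty".

```python
from typing import Optional


def conll_value(value: Optional[str]) -> str:
    """Return the value itself, or "_" when it is None or an empty string."""
    return value if value else "_"


# Hypothetical token annotations: lemma and tag are empty here.
fields = {"lemma": "", "pos": "NOUN", "tag": None, "dep": "nsubj"}

# Replace every empty annotation by the CoNLL-U placeholder.
filled = {name: conll_value(value) for name, value in fields.items()}
print(filled)
```

In CoNLL-U, the underscore marks an unspecified value, so a row stays well-formed even when the parser did not produce a given annotation.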

New Contributors

@rominf made their first contribution in #32

Full Changelog: v3.4.0...v4.0.0

Contributors

rominf

07 Apr 13:21

BramVanroy

v3.4.0

fc25688

Update default field names and allow custom ones

What's Changed

improve CoNLL-U fields by @BramVanroy in #25

Full Changelog: v3.3.0...v3.4.0

Contributors

BramVanroy

17 Jan 10:03

BramVanroy

v3.3.0

5e82adc

Changes to input format of pretokenized text

Since spaCy 3.2.0, spaCy has become stricter about the data that is passed to a pipeline. This means that passing
a list of pretokenized tokens (["This", "is", "a", "pretokenized", "sentence"]) is no longer accepted. Therefore,
the is_tokenized option had to be adapted to reflect this. It is still possible to pass a string in which tokens
are separated by whitespace, e.g. "This is a pretokenized sentence", which will continue to work for spaCy and
stanza. Support for pretokenized data has been dropped for UDPipe.

Specific changes:

[conllparser] Breaking change: is_tokenized is not a valid argument to ConllParser any more.
[utils/conllparser] Breaking change: when using UDPipe, pretokenized data is not supported any more.
[utils] Breaking change: SpacyPretokenizedTokenizer.__call__ does not support a list of tokens any more.
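
If your data is still stored as lists of tokens, one way to migrate is to join each list into a whitespace-separated string before passing it on. The sketch below is plain Python and does not call spacy_conll; tokens_to_text is a hypothetical helper name. It assumes that no token is empty or contains whitespace itself, because such a token could not be recovered by splitting on whitespace.

```python
from typing import List


def tokens_to_text(tokens: List[str]) -> str:
    """Join pretokenized tokens into one whitespace-separated string."""
    for token in tokens:
        # A token with internal whitespace would be split into several tokens later on.
        if not token or any(char.isspace() for char in token):
            raise ValueError(f"token must be non-empty and free of whitespace: {token!r}")
    return " ".join(tokens)


tokens = ["This", "is", "a", "pretokenized", "sentence"]
text = tokens_to_text(tokens)
print(text)
```

Splitting the resulting string on whitespace gives back the original token list, which is the property a pretokenized pipeline relies on.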

04 Apr 12:16

BramVanroy

v3.2.0

e8ff86d

Entry points and quality of life improvements

[conllformatter] Fixed an issue where SpaceAfter=No was not added correctly to tokens
[conllformatter] Added ConllFormatter as an entry point, which means that you no longer have to import
spacy_conll when you want to add the pipe to a parser: spaCy knows where to find the CoNLL
formatter when you call nlp.add_pipe("conll_formatter")
[conllformatter] The component is now registered through a construction function rather than directly on the class,
as recommended by spaCy. The formatter has also been rewritten as a dataclass
[conllformatter/utils] Moved merge_dicts_strict to utils, outside the formatter class
[conllparser] Make ConllParser directly importable from the root of the library, i.e.,
from spacy_conll import ConllParser
[init_parser] Allow users to exclude pipeline components when using the spaCy parser with the
exclude_spacy_components argument
[init_parser] Fixed an issue where disabling sentence segmentation would not work if your model does
not have a parser
[init_parser] Enable more options when using stanza in terms of pre-segmented text. Now you can also disable
sentence segmentation for stanza (but still do tokenization) with the disable_sbd option
[utils] Added SpacyDisableSentenceSegmentation as an entry-point custom component so that you can use it in your
own code, by calling nlp.add_pipe("disable_sbd", before="parser")

14 Jul 15:16

BramVanroy

v3.0.2

bd83765

Fix no_split_on_newline

[conllparser] Fix: fixed an issue with no_split_on_newline in combination with nlp.pipe

14 Jul 05:50

BramVanroy

v3.0.1

0f916ff

Bugfix for ConllParser: do not require stanza and udpipe

[conllparser] Fix: make sure the parser also runs if stanza and UDPipe are not installed

12 Jul 10:17

BramVanroy

v3.0.0

670d002

Release for spaCy v3

This release makes spacy_conll compatible with spaCy's new v3 release. On top of that, some improvements were made to make the project easier to maintain.

[general] Breaking change: spaCy v3 required (closes #8)
[init_parser] Breaking change: in all cases, is_tokenized now disables sentence segmentation
[init_parser] Breaking change: no more default values for parser or model anywhere. Important to note here that
spaCy does not work with short-hand codes such as en any more. You have to provide the full model name, e.g.
en_core_web_sm
[init_parser] Improvement: models are automatically downloaded for Stanza and UDPipe
[cli] Reworked the position of the CLI script in the directory structure as well as the arguments. Run
parse-as-conll -h for more information.
[conllparser] Made the ConllParser class available as a utility to easily create a wrapper for a spaCy-like
parser which can return the parsed CoNLL output of a given file or text
[conllparser,cli] Improvements to the usability of n_process. The library will try to figure out whether multiprocessing
is available for your platform and, if not, tell you so. Such a priori error messages can be disabled with
ignore_pipe_errors, both on the command line and in ConllParser's parse methods

23 Jun 13:00

BramVanroy

v2.1.0

1c70c46

Preparing for v3 release

Last version to support spaCy v2. New versions will require spaCy v3
Last version to support spacy-stanfordnlp. spacy-stanza is still supported

11 May 17:36

BramVanroy

v2.0.0

e119d90

Stanza and UDPipe support, easy-to-use utility function, Token-attributes, and more

Fully reworked version!

Tested support for both spacy-stanza and spacy-udpipe! (Not included as a dependency, install manually)
Added a useful utility function init_parser that can easily initialise a parser together with the custom
pipeline component. (See the README or examples)
Added the disable_pandas flag to the formatter class in case you want to disable setting the pandas
attribute even when pandas is installed.
Added custom properties for Tokens as well. A Doc, its sentence Spans, and its Tokens now all have custom attributes
Reworked datatypes of output. In version 2.0.0 the data types are as follows:
- ._.conll: raw CoNLL format
  - in Token: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as
    values.
  - in sentence Span: a list of its tokens' ._.conll dictionaries (list of dictionaries).
  - in a Doc: a list of its sentences' ._.conll lists (list of list of dictionaries).
- ._.conll_str: string representation of the CoNLL format
  - in Token: tab-separated representation of the contents of the CoNLL fields ending with a newline.
  - in sentence Span: the expected CoNLL format where each row represents a token. When
    ConllFormatter(include_headers=True) is used, two header lines are included as well, as per the
    CoNLL format.
  - in Doc: all its sentences' ._.conll_str combined and separated by new lines.
- ._.conll_pd: pandas representation of the CoNLL format
  - in Token: a Series representation of this token's CoNLL properties.
  - in sentence Span: a DataFrame representation of this sentence, with the CoNLL names as column
    headers.
  - in Doc: a concatenation of its sentences' DataFrames, leading to a new DataFrame whose
    index is reset.
field_names has been removed, assuming that you do not need to change the column names of the CoNLL properties
Removed the Spacy2ConllParser class
Many doc changes, added tests, and a few examples
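
The nesting of the ._.conll and ._.conll_str data types described above can be mimicked with plain Python structures. The sketch below does not use spaCy or spacy_conll. It assumes the ten standard CoNLL-U column names as dictionary keys (the keys the library actually uses may differ), and make_token, token_to_str, sentence_to_str, and doc_to_str are hypothetical helpers with hand-written annotations.

```python
from typing import Dict, List

# Standard CoNLL-U column names; assumed here, the library's own keys may differ.
CONLL_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

Token = Dict[str, object]      # like Token._.conll: one dictionary per token
Sentence = List[Token]         # like Span._.conll: list of dictionaries
Document = List[Sentence]      # like Doc._.conll: list of lists of dictionaries


def make_token(idx: int, form: str, lemma: str, upos: str, head: int, deprel: str) -> Token:
    """Build a token dictionary, leaving the other fields unspecified ("_")."""
    values = [idx, form, lemma, upos, "_", "_", head, deprel, "_", "_"]
    return dict(zip(CONLL_FIELDS, values))


def token_to_str(token: Token) -> str:
    """Tab-separated fields ending with a newline, like Token._.conll_str."""
    return "\t".join(str(token[field]) for field in CONLL_FIELDS) + "\n"


def sentence_to_str(sentence: Sentence) -> str:
    """One row per token, like Span._.conll_str without header lines."""
    return "".join(token_to_str(token) for token in sentence)


def doc_to_str(doc: Document) -> str:
    """Sentence strings separated by a newline, like Doc._.conll_str."""
    return "\n".join(sentence_to_str(sentence) for sentence in doc)


doc = [
    [make_token(1, "Hello", "hello", "INTJ", 0, "root"), make_token(2, "!", "!", "PUNCT", 1, "punct")],
    [make_token(1, "Bye", "bye", "INTJ", 0, "root")],
]
print(doc_to_str(doc))
```

Because every token row already ends with a newline, joining sentences with one more newline leaves a blank line between them, which is how CoNLL-U separates sentences.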

28 Apr 08:29

BramVanroy

v1.3.0

bec4e1f

Add SpaceAfter=No property

IMPORTANT: This will be the last release that supports the deprecated Spacy2ConllParser class!
Community addition: add SpaceAfter=No to the Misc field when applicable (#6). Thanks @KoichiYasuoka!
Fixed failing tests
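
As a rough illustration of the SpaceAfter=No property, the sketch below derives the Misc value from the whitespace that follows each token. It is plain Python rather than the library's implementation: misc_field is a hypothetical helper, the token list is hand-written, and treating the final token like any other token is an assumption of this sketch.

```python
from typing import List, Tuple


def misc_field(trailing_whitespace: str) -> str:
    """Return "SpaceAfter=No" when no whitespace follows the token, else "_"."""
    return "_" if trailing_whitespace else "SpaceAfter=No"


# (form, trailing whitespace) pairs for the text "Hello, world!"
tokens: List[Tuple[str, str]] = [("Hello", ""), (",", " "), ("world", ""), ("!", "")]

misc = [misc_field(whitespace) for _, whitespace in tokens]
for (form, _), value in zip(tokens, misc):
    print(f"{form}\t{value}")
```

Recording the missing space makes it possible to reconstruct the original text from the token rows.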
