Releases: BramVanroy/spacy_conll
v4.0.0
What's Changed
Two new changes thanks to user @rominf:
- Repackaged the library to bring it up to modern standards, notably relying on a pyproject.toml file and removing support for Python <3.8.
- When dep, pos, tag, or lemma fields are empty, the underscore `_` will be used
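The underscore fallback can be pictured with a small plain-Python sketch. It does not call spacy_conll; the helper name `conll_value` and the sample values are hypothetical and only mimic the behaviour described above.

```python
def conll_value(value):
    """Return the value as a string, or "_" when it is empty (None or "")."""
    # Hypothetical helper, not part of spacy_conll: it only mimics the
    # documented behaviour that empty fields are written as an underscore.
    if value is None or value == "":
        return "_"
    return str(value)


# A made-up token whose lemma and fine-grained tag are missing.
fields = {"lemma": "", "pos": "NOUN", "tag": None, "dep": "nsubj"}
row = "\t".join(conll_value(fields[name]) for name in ("lemma", "pos", "tag", "dep"))
print(row)
```

The printed row keeps the column count stable, which is the reason an explicit placeholder is preferable to an empty string in a tab-separated format.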
New Contributors
Full Changelog: v3.4.0...v4.0.0
Update default field names and allow custom ones
Changes to input format of pretokenized text
Since spaCy 3.2.0, the data that is passed to a spaCy pipeline has become more strict. This means that passing
a list of pretokenized tokens (`["This", "is", "a", "pretokenized", "sentence"]`) is not accepted anymore. Therefore,
the `is_tokenized` option needed to be adapted to reflect this. It is still possible to pass a string where tokens
are separated by whitespaces, e.g. `"This is a pretokenized sentence"`, which will continue to work for spaCy and
stanza. Support for pretokenized data has been dropped for UDPipe.
Specific changes:
- [conllparser] Breaking change: `is_tokenized` is not a valid argument to `ConllParser` any more.
- [utils/conllparser] Breaking change: when using UDPipe, pretokenized data is not supported any more.
- [utils] Breaking change: `SpacyPretokenizedTokenizer.__call__` does not support a list of tokens any more.
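Code that used to pass a list of tokens could be migrated by joining the tokens into a single whitespace-separated string first. The following is a minimal sketch in plain Python; `to_pretokenized_string` is a hypothetical helper and not part of spacy_conll, and rejecting tokens that contain whitespace is an assumption made here so the round trip stays unambiguous.

```python
def to_pretokenized_string(tokens):
    """Join a list of tokens into one whitespace-separated string."""
    # Hypothetical helper, not part of spacy_conll. A token that is empty or
    # contains whitespace would be split differently later on, so reject it.
    for token in tokens:
        if not token or any(ch.isspace() for ch in token):
            raise ValueError(f"Token cannot be empty or contain whitespace: {token!r}")
    return " ".join(tokens)


legacy_input = ["This", "is", "a", "pretokenized", "sentence"]
text = to_pretokenized_string(legacy_input)
print(text)
```

The resulting string is the form that, according to the notes above, continues to work for spaCy and stanza.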
Entry points and quality of life improvements
- [conllformatter] Fixed an issue where `SpaceAfter=No` was not added correctly to tokens
- [conllformatter] Added `ConllFormatter` as an entry point, which means that you do not have to import `spacy_conll` anymore when you want to add the pipe to a parser! spaCy will know where to look for the CoNLL formatter when you use `nlp.add_pipe("conll_formatter")` without you having to import the component manually
- [conllformatter] Now adds the component constructor on a construction function rather than directly on the class, as recommended by spaCy. The formatter has also been re-written as a dataclass
- [conllformatter/utils] Moved `merge_dicts_strict` to utils, outside the formatter class
- [conllparser] Make ConllParser directly importable from the root of the library, i.e., `from spacy_conll import ConllParser`
- [init_parser] Allow users to exclude pipeline components when using the spaCy parser with the `exclude_spacy_components` argument
- [init_parser] Fixed an issue where disabling sentence segmentation would not work if your model does not have a parser
- [init_parser] Enable more options when using stanza in terms of pre-segmented text. Now you can also disable sentence segmentation for stanza (but still do tokenization) with the `disable_sbd` option
- [utils] Added SpacyDisableSentenceSegmentation as an entry-point custom component so that you can use it in your own code, by calling `nlp.add_pipe("disable_sbd", before="parser")`
Fix no_split_on_newline
- [conllparser] Fix: fixed an issue with no_split_on_newline in combination with nlp.pipe
Bugfix for ConllParser: do not require stanza and udpipe
- [conllparser] Fix: make sure the parser also runs if stanza and UDPipe are not installed
Release for spaCy v3
This release makes `spacy_conll` compatible with spaCy's new v3 release. On top of that, some improvements were made to make the project easier to maintain.
- [general] Breaking change: spaCy v3 required (closes #8)
- [init_parser] Breaking change: in all cases, `is_tokenized` now disables sentence segmentation
- [init_parser] Breaking change: no more default values for parser or model anywhere. Important to note here that spaCy does not work with short-hand codes such as `en` any more. You have to provide the full model name, e.g. `en_core_web_sm`
- [init_parser] Improvement: models are automatically downloaded for Stanza and UDPipe
- [cli] Reworked the position of the CLI script in the directory structure as well as the arguments. Run `parse-as-conll -h` for more information.
- [conllparser] Made the ConllParser class available as a utility to easily create a wrapper for a spaCy-like parser which can return the parsed CoNLL output of a given file or text
- [conllparser,cli] Improvements to usability of `n_process`. Will try to figure out whether multiprocessing is available for your platform and if not, tell you so. Such a priori error messages can be disabled with `ignore_pipe_errors`, both on the command line and in ConllParser's parse methods
Preparing for v3 release
- Last version to support spaCy v2. New versions will require spaCy v3
- Last version to support `spacy-stanfordnlp`. `spacy-stanza` is still supported
Stanza and UDPipe support, easy-to-use utility function, Token-attributes, and more
Fully reworked version!
- Tested support for both `spacy-stanza` and `spacy-udpipe`! (Not included as a dependency, install manually)
- Added a useful utility function `init_parser` that can easily initialise a parser together with the custom pipeline component. (See the README or examples)
- Added the `disable_pandas` flag to the formatter class in case you would want to disable setting the pandas attribute even when pandas is installed.
- Added custom properties for Tokens as well. So now a Doc, its sentence Spans as well as Tokens have custom attributes
- Reworked datatypes of output. In version 2.0.0 the data types are as follows:
  - `._.conll`: raw CoNLL format
    - in `Token`: a dictionary containing all the expected CoNLL fields as keys and the parsed properties as values.
    - in sentence `Span`: a list of its tokens' `._.conll` dictionaries (list of dictionaries).
    - in a `Doc`: a list of its sentences' `._.conll` lists (list of list of dictionaries).
  - `._.conll_str`: string representation of the CoNLL format
    - in `Token`: tab-separated representation of the contents of the CoNLL fields ending with a newline.
    - in sentence `Span`: the expected CoNLL format where each row represents a token. When `ConllFormatter(include_headers=True)` is used, two header lines are included as well, as per the CoNLL format.
    - in `Doc`: all its sentences' `._.conll_str` combined and separated by new lines.
  - `._.conll_pd`: `pandas` representation of the CoNLL format
    - in `Token`: a `Series` representation of this token's CoNLL properties.
    - in sentence `Span`: a `DataFrame` representation of this sentence, with the CoNLL names as column headers.
    - in `Doc`: a concatenation of its sentences' `DataFrame`s, leading to a new `DataFrame` whose index is reset.
- `field_names` has been removed, assuming that you do not need to change the column names of the CoNLL properties
- Removed the `Spacy2ConllParser` class
- Many doc changes, added tests, and a few examples
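The nesting of the 2.0.0 output types described in this release can be sketched with plain Python containers. This is an illustration only and does not call spacy_conll: the four-column `FIELDS` tuple, the helper functions, and the sample tokens are all hypothetical, and joining sentences with a single newline is an assumption about what "separated by new lines" means.

```python
# Hypothetical, abbreviated field set used only for this illustration.
FIELDS = ("id", "form", "lemma", "upos")


def token_conll_str(token):
    """Token level: tab-separated field values ending with a newline."""
    return "\t".join(str(token[name]) for name in FIELDS) + "\n"


def span_conll_str(sentence):
    """Sentence level: one row per token."""
    return "".join(token_conll_str(token) for token in sentence)


def doc_conll_str(doc):
    """Document level: sentence blocks combined, separated by a newline."""
    return "\n".join(span_conll_str(sentence) for sentence in doc)


# Token level: a dict. Sentence level: a list of dicts.
# Document level: a list of lists of dicts.
sent_1 = [
    {"id": 1, "form": "Hello", "lemma": "hello", "upos": "INTJ"},
    {"id": 2, "form": "!", "lemma": "!", "upos": "PUNCT"},
]
sent_2 = [{"id": 1, "form": "Bye", "lemma": "bye", "upos": "INTJ"}]
doc = [sent_1, sent_2]

print(doc_conll_str(doc))
```

Each level is built from the level below it, which mirrors how the Token, sentence Span, and Doc attributes relate to one another in the description above.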
Add SpaceAfter=No property
- IMPORTANT: This will be the last release that supports the deprecated `Spacy2ConllParser` class!
- Community addition: add SpaceAfter=No to the Misc field when applicable (#6). Thanks @KoichiYasuoka!
- Fixed failing tests
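As a rough illustration of when SpaceAfter=No applies, the sketch below derives a Misc value from a token's trailing whitespace. It is plain Python and not the library's code: `misc_field` and the sample tokens are hypothetical, and applying the rule to every token, including the last one, is a simplifying assumption.

```python
def misc_field(whitespace_after):
    """Return the Misc value for a token given its trailing whitespace."""
    # Hypothetical helper, not the library's code: a token that is directly
    # followed by the next token (no whitespace) gets SpaceAfter=No.
    return "_" if whitespace_after else "SpaceAfter=No"


# (token text, trailing whitespace) pairs for the text "Hello, world!"
tokens = [("Hello", ""), (",", " "), ("world", ""), ("!", "")]
misc = [misc_field(ws) for _, ws in tokens]
print(misc)
```

With this information in the Misc column, the original spacing of the text can be reconstructed from the token rows.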