Merge `master` into `develop` #13271

Merged: danieldk merged 40 commits into explosion:develop from danieldk:maintenance/develop-merge-master-20240125 on Jan 26, 2024
Conversation
Sync `docs/llm_main` with `master`
Sync `docs/llm_develop` with `docs/llm_main`
- Replace `np.trapz` with vendored `trapezoid` from scipy
- Replace `np.float_` with `np.float64`
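To illustrate the kind of replacement involved, here is a minimal pure-Python sketch of the trapezoidal rule that the vendored SciPy `trapezoid` function implements (NumPy 2.0 removed `np.trapz` and `np.float_`). This is illustrative only; the vendored version handles NumPy arrays, axes, and broadcasting.

```python
def trapezoid(y, x):
    """Trapezoidal-rule integration over sample points.

    A simplified stand-in for the `trapezoid` function vendored
    from SciPy; illustrative, not the actual vendored code.
    """
    # Sum the area of each trapezoid between consecutive x samples.
    return sum(
        (x[i + 1] - x[i]) * (y[i + 1] + y[i]) / 2.0
        for i in range(len(y) - 1)
    )

# Integrate f(x) = x over [0, 1]; the exact area is 0.5.
area = trapezoid([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
```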
…#13086)

* Update Tokenizer.explain for special cases with whitespace

  Update `Tokenizer.explain` to skip special case matches if the exact text has not been matched due to intervening whitespace. Enable fuzzy `Tokenizer.explain` tests with additional whitespace normalization.

* Add unit test for special cases with whitespace, xfail fuzzy tests again
Co-authored-by: Ridge Kimani <[email protected]>
Build with `build` if available. Warn and fall back to the previous `setup.py`-based build if building with `build` fails.
* Update supported OpenAI models.
* Update with new GPT-3.5 and GPT-4 versions.
* Add links to OpenAI model docs.
…#13081)

* Update the "Missing factory" error message

  This accounts for model installations that took place during the current Python session.

* Add a note about Jupyter notebooks
* Move error to `spacy.cli.download`; add extra message for Jupyter sessions
* Add additional note for interactive sessions
* Remove note about `spacy-transformers` from error message
* `isort`
* Improve checks for colab (also helps displacy)
* Update warning messages
* Improve flow for multiple checks

Co-authored-by: Adriane Boyd <[email protected]>
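The idea behind the improved message can be sketched as follows. This is a hedged approximation: spaCy's real checks live in `spacy.cli.download` and are more thorough (colab detection, displacy handling); the helper names below are assumptions, not spaCy's API.

```python
import sys

def in_interactive_session():
    """Rough heuristic for a Jupyter/interactive session.

    Illustrative only -- spaCy's actual detection is more
    elaborate than this sketch.
    """
    return hasattr(sys, "ps1") or "IPython" in sys.modules

def missing_factory_hint(name):
    """Build an error hint that accounts for packages installed
    during the current Python session (hypothetical helper)."""
    hint = f"Can't find factory '{name}'. If you just installed it, "
    if in_interactive_session():
        # In a notebook, the new package isn't visible until restart.
        return hint + "restart your kernel/session so the package is picked up."
    return hint + "check that it was installed into the current environment."
```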
* add language extensions for norwegian nynorsk and faroese
* update docstring for nn/examples.py
* use relative imports
* add fo and nn tokenizers to pytest fixtures
* add unittests for fo and nn and fix bug in nn
* remove module docstring from fo/__init__.py
* add comments about example sentences' origin
* add license information to faroese data credit
* format unittests using black
* add __init__ files to test/lang/nn and tests/lang/fo
* fix import order and use relative imports in fo/__init__.py and nn/__init__.py
* Make the tests a bit more compact
* Add fo and nn to website languages
* Add note about jul.
* Add "jul." as exception

Co-authored-by: Adriane Boyd <[email protected]>
…13149)

* Update `TextCatBOW` to use the fixed `SparseLinear` layer

  A while ago, we fixed the `SparseLinear` layer to use all available parameters: explosion/thinc#754

  This change updates `TextCatBOW` to `v3`, which uses the new `SparseLinear_v2` layer. This results in a sizeable improvement on a text categorization task that was tested.

  While at it, `spacy.TextCatBOW.v3` also adds the `length_exponent` option to make it possible to change the hidden size. Ideally, we'd just have an option called `length`, but the way that `TextCatBOW` uses hashes results in a non-uniform distribution of parameters when the length is not a power of two.

* Replace the TextCatBOW `length_exponent` parameter by `length`

  We now round up the length to the next power of two if it isn't a power of two.

* Remove some tests for TextCatBOW.v2
* Fix missing import
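The rounding described in the second bullet can be sketched in a few lines. The function name is an assumption for illustration, not spaCy's internal helper:

```python
def round_up_to_power_of_two(length: int) -> int:
    """Round `length` up to the next power of two.

    TextCatBOW's hashing gives a non-uniform parameter distribution
    unless the hidden length is a power of two, so a requested
    `length` is rounded up (illustrative sketch).
    """
    if length < 1:
        raise ValueError("length must be >= 1")
    # (length - 1).bit_length() is the number of bits needed for
    # length - 1, so shifting 1 by it yields the next power of two.
    return 1 << (length - 1).bit_length()
```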
* Add documentation for EL task.
* Fix EL factory name.
* Add llm_entity_linker_mentio.
* Apply suggestions from code review (co-authored by Madeesh Kannan)
* Update EL task docs. (repeated across several commits)
* Apply suggestions from code review (co-authored by Sofie Van Landeghem)
* Incorporate feedback.
* Format.
* Fix link to KB data.

Co-authored-by: Sofie Van Landeghem <[email protected]>
Co-authored-by: Madeesh Kannan <[email protected]>
Sync `llm_develop` with `llm_main`
* Add section on RawTask.
* Fix API docs.
* Update website/docs/api/large-language-models.mdx

Co-authored-by: Sofie Van Landeghem <[email protected]>
* Describe translation task.
* Fix references to examples and template.
* Format.
* correct char_span output type - can be None
* unify type of exclude parameter
* black
* further fixes to from_dict and to_dict
* formatting
…y blog. (explosion#13197)

* Update README.md to include links for GPU processing, LLMs, and spaCy's blog.
* Create ojo4f3.md
* Corrected README to the most current version, with links to GPU processing, LLMs, and the spaCy blog.
* Delete .github/contributors/ojo4f3.md
* Changed LLM icon
* Apply suggestions from code review

Co-authored-by: Adriane Boyd <[email protected]>
* Add TextCatReduce.v1

  This is a textcat classifier that pools the vectors generated by a tok2vec implementation and then applies a classifier to the pooled representation. Three reductions are supported for pooling: first, max, and mean. When multiple reductions are enabled, the reductions are concatenated before providing them to the classification layer.

  This model is a generalization of the TextCatCNN model, which only supports mean reductions and is a bit of a misnomer, because it can also be used with transformers. This change also reimplements TextCatCNN.v2 using the new TextCatReduce.v1 layer.

* Doc fixes
* Fully specify `TextCatCNN` <-> `TextCatReduce` equivalence
* Move TextCatCNN docs to legacy, in prep for moving to spacy-legacy
* Add back a test for TextCatCNN.v2
* Replace TextCatCNN in pipe configurations and templates
* Add an infobox to the `TextCatReduce` section with a `TextCatCNN` anchor
* Add last reduction (`use_reduce_last`)
* Remove non-working TextCatCNN Netlify redirect
* Revert layer changes for the quickstart
* Revert one more quickstart change
* Remove unused import
* Fix docstring
* Fix setting name in error message

Co-authored-by: Sofie Van Landeghem <[email protected]>
Co-authored-by: Adriane Boyd <[email protected]>
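The pooling-and-concatenation scheme described above can be sketched in plain Python. This is a toy illustration of the idea, not the actual Thinc layer; the parameter names mirror the description (`use_reduce_last` etc.) but are assumptions here:

```python
def reduce_pool(vectors, use_first=False, use_last=False,
                use_max=False, use_mean=False):
    """Pool a sequence of token vectors with the enabled reductions
    and concatenate the results (toy sketch of TextCatReduce)."""
    n, width = len(vectors), len(vectors[0])
    pooled = []
    if use_first:
        pooled.extend(vectors[0])           # first-token reduction
    if use_last:
        pooled.extend(vectors[-1])          # last-token reduction
    if use_max:
        # elementwise max over the sequence
        pooled.extend(max(v[i] for v in vectors) for i in range(width))
    if use_mean:
        # elementwise mean over the sequence
        pooled.extend(sum(v[i] for v in vectors) / n for i in range(width))
    return pooled

vecs = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0]]
# first -> [1, 2], max -> [5, 4], mean -> [3, 2], concatenated:
out = reduce_pool(vecs, use_first=True, use_max=True, use_mean=True)
```

A classification layer would then operate on the concatenated vector, whose width is the token-vector width times the number of enabled reductions.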
* Add spacy.TextCatParametricAttention.v1

  This layer is a simplification of the ensemble classifier that only uses parametric attention. We have found empirically that with a sufficient amount of training data, using the ensemble classifier with BoW does not provide a significant improvement in classifier accuracy. However, plugging in a BoW classifier does reduce GPU training and inference performance substantially, since it uses a GPU-only kernel.

* Fix merge fallout
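Parametric attention in this sense scores each token vector against a learned query vector, softmaxes the scores, and returns the attention-weighted sum. A pure-Python sketch of that computation (illustrative only; the real layer is a Thinc model with a trained query parameter):

```python
import math

def parametric_attention(vectors, query):
    """Weight token vectors by softmaxed dot products with a
    (normally learned) query vector and return the weighted sum."""
    scores = [sum(q * x for q, x in zip(query, v)) for v in vectors]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(vectors[0])
    # Attention-weighted sum of the token vectors.
    return [sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(width)]
```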
* Updated docs w.r.t. infinite doc length.
* Fix typo.
* Fix typos.
* Fix table formatting.
* Update formatting.

Co-authored-by: Sofie Van Landeghem <[email protected]>
Sync `docs/llm_main` with `docs/llm_develop`
# Conflicts:
#	website/docs/api/large-language-models.mdx
…ith-llm_main Sync `master` with `docs/llm_main`
Before this change, the worker processes of `pipe` calls with `n_process != 1` were stopped by calling `terminate` on the processes. However, terminating a process can leave queues, pipes, and other concurrent data structures in an invalid state.

With this change, we stop using `terminate` and take the following approach instead:

* When all documents are processed, the parent process puts a sentinel in the queue of each worker.
* The parent process then calls `join` on each worker process to let them finish up gracefully.
* Worker processes break from the queue processing loop when the sentinel is encountered, so that they exit.

We need special handling when one of the workers encounters an error and the error handler is set to raise an exception. In this case, we cannot rely on the sentinel to finish all workers -- the queue is a FIFO queue and there may be other work queued up before the sentinel. We use the following approach to handle error scenarios:

* The parent puts the end-of-work sentinel in the queue of each worker.
* The parent closes the reading end of the channel of each worker.
* Then:
  - If a worker was waiting for work, it will encounter the sentinel and break from the processing loop.
  - If a worker was processing a batch, it will attempt to write results to the channel. This will fail because the channel was closed by the parent, and the worker will break from the processing loop.
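The sentinel-and-join pattern in the happy path can be sketched with a single-process analogue using threads and queues (the real change uses `multiprocessing`; this simplification just shows the shutdown protocol):

```python
import queue
import threading

SENTINEL = None  # end-of-work marker put in each worker's queue

def worker(in_q, out_q):
    """Loop until the sentinel is seen, then exit so the parent
    can join() the worker instead of terminating it."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break  # graceful exit from the processing loop
        out_q.put(item * 2)  # stand-in for processing a batch

in_q, out_q = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(in_q, out_q))
t.start()
for i in range(3):
    in_q.put(i)
in_q.put(SENTINEL)  # parent signals end of work...
t.join()            # ...and joins instead of terminating
results = sorted(out_q.get() for _ in range(3))
```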
macOS now uses port 5000 for the AirPlay receiver functionality, so this test will always fail on a macOS desktop (unless AirPlay receiver functionality is disabled like in CI).
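A common way to make such a test portable is to ask the OS for an ephemeral free port instead of hard-coding one like 5000. A minimal sketch of that pattern (this is a generic idiom, not necessarily how the test was fixed):

```python
import socket

def get_free_port():
    """Bind to port 0 so the OS picks an unused ephemeral port,
    then return it; avoids collisions with reserved ports such
    as 5000 (the macOS AirPlay receiver)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```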
Fix typo in method name
svlandeg approved these changes on Jan 26, 2024
LGTM!
Description

Merge `master` into `develop`.

Types of change

Checklist