Spark NLP 5.2.3: ONNX support for XLM-RoBERTa Token and Sequence Classifications, and Question Answering task, AWS SDK optimizations, New notebooks, Over 400 new state-of-the-art Transformer Models in ONNX, and bug fixes! #14142
Announced by maziyarpanahi in Announcement
📢 Overview
Spark NLP 5.2.3 🚀 comes with an array of exciting features and optimizations. We're thrilled to announce support for ONNX Runtime in the `XLMRoBertaForTokenClassification`, `XLMRoBertaForSequenceClassification`, and `XLMRoBertaForQuestionAnswering` annotators. This release also brings a significant refinement to the use of the AWS SDK in Spark NLP, shifting from `aws-java-sdk-bundle` to `aws-java-sdk-s3`, resulting in a substantial ~320MB reduction in library size and a 20% faster startup, plus new notebooks for importing external models from Hugging Face, over 400 new state-of-the-art transformer models in ONNX, and more!

We're also pleased to announce that our Models Hub now hosts 36,000+ free and truly open-source models & pipelines 🎉. Our deepest gratitude goes out to our community for their invaluable feedback, feature suggestions, and contributions.
🔥 New Features & Enhancements
- Introducing support for ONNX Runtime in the `XLMRoBertaForTokenClassification` annotator
- Introducing support for ONNX Runtime in the `XLMRoBertaForSequenceClassification` annotator
- Introducing support for ONNX Runtime in the `XLMRoBertaForQuestionAnswering` annotator
- Refactored the AWS SDK usage in Spark NLP, moving from the `aws-java-sdk-bundle` to the `aws-java-sdk-s3` dependency. This change has resulted in a 318MB reduction in the library's overall size and has improved Spark NLP startup time by 20%. For instance, using `sparknlp.start()` in Google Colab is now 14 to 20 seconds faster. Special thanks to @c3-avidmych for requesting this feature.
- Added new notebooks to import `DeBertaForQuestionAnswering`, `DebertaForSequenceClassification`, and `DeBertaForTokenClassification` models from HuggingFace
- Added a new `DocumentTokenSplitter` notebook
- Added a new `INSTRUCTOR` Embeddings notebook
- Added a new `RoBertaForTokenClassification` notebook
- Added a new `RoBertaForSequenceClassification` notebook
- Updated the `OpenAICompletion` notebook with the new `gpt-3.5-turbo-instruct` model

🐛 Bug Fixes
- Fixed `BGEEmbeddings` not downloading in Python

ℹ️ Known Issues
- ONNX models crash when they are used in Colab's T4 GPU runtime #14109

📓 New Notebooks
📖 Documentation
❤️ Community support
Installation
Python
# PyPI
pip install spark-nlp==5.2.3
Spark Packages
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x: (Scala 2.12):
GPU
Apple Silicon (M1 & M2)
AArch64
Maven
spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x:
spark-nlp-gpu:
spark-nlp-silicon:
spark-nlp-aarch64:
FAT JARs
CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.2.3.jar
GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-5.2.3.jar
M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-silicon-assembly-5.2.3.jar
AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x/3.4.x/3.5.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-5.2.3.jar
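The FAT JARs above can be passed straight to a Spark session at launch; a sketch using the CPU jar URL from the list (the memory flag is illustrative, not a requirement from this release):

```shell
# Launch spark-shell with the Spark NLP 5.2.3 CPU fat jar attached.
# Standard Spark flags; adjust driver memory to your environment.
spark-shell \
  --driver-memory 16g \
  --jars https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-5.2.3.jar
```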
What's Changed
New Contributors
Full Changelog: 5.2.2...5.2.3