Highlights Include
- GenAI updates
  - No-code LLM deployments with TorchServe + vLLM & TensorRT-LLM using the ts.llm_launcher script
  - OpenAI API support for TorchServe + vLLM
  - Integration of the TensorRT-LLM engine
  - Stateful inference on AWS SageMaker (see blog)
- Support for linux-aarch64
  - CI & nightly regression added
  - Docker & KServe images published
- PyTorch updates
  - Support for PyTorch 2.4
  - Deprecation of TorchText
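As a quick illustration of the no-code deployment path, ts.llm_launcher can stand up a vLLM-backed endpoint with a single command. The model id below is illustrative; substitute any Hugging Face model you have access to, and note that `--disable_token_auth` turns off token authorization and is suitable for local testing only.

```shell
# Sketch of a no-code LLM deployment via ts.llm_launcher (vLLM engine by default).
# The model id is an example -- use any model you have access to.
python -m ts.llm_launcher \
    --model_id meta-llama/Meta-Llama-3-8B-Instruct \
    --disable_token_auth
```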
PyTorch Updates
- upgrade to PyTorch 2.4 & deprecation of TorchText by @agunapal in #3289
- Resnet152 batch inference torch.compile example by @andrius-meta in #3259
- squeezenet torch.compile example by @wdvr in #3277
GenAI
- Implement stateful inference session timeout by @namannandan in #3263
- Use Case: Enhancing LLM Serving with Torch Compiled RAG on AWS Graviton by @agunapal in #3276
- Feature add openai api for vllm integration by @mreso in #3287
- Set vllm multiproc method to spawn by @mreso in #3310
- TRT LLM Integration with LORA by @agunapal in #3305
- Bump vllm from 0.5.0 to 0.5.5 in /examples/large_models/vllm by @dependabot in #3321
- Use startup time in async worker thread instead of worker timeout by @mreso in #3315
- Rename vllm dockerfile by @mreso in #3330
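With the OpenAI API support added in #3287, a deployed vLLM model can be queried with an OpenAI-style completions request. The model name (`model`) and URL path below are assumptions based on a default ts.llm_launcher registration and TorchServe's default inference port 8080; adjust them to match your deployment.

```shell
# Query the OpenAI-compatible completions endpoint of the vLLM integration.
# Model name and path are assumptions -- adapt to your registered model.
curl -X POST http://localhost:8080/predictions/model/1.0/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello world", "max_tokens": 16}'
```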
Support for linux-aarch64
- Adding Graviton Regression test CI by @udaij12 in #3273
- adding graviton docker image release by @udaij12 in #3313
- Fixing kserve nightly for arm64 by @udaij12 in #3319
- Docker aarch by @udaij12 in #3323
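With the aarch64 image release, the published image can be pulled directly on an arm64 host. This sketch assumes the official `pytorch/torchserve` Docker Hub repository publishes a multi-arch manifest, so Docker resolves the arm64 variant automatically.

```shell
# Pull and sanity-check TorchServe on a linux-aarch64 host.
docker pull pytorch/torchserve:latest
docker run --rm pytorch/torchserve:latest torchserve --version
```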
Documentation
- Security doc update by @udaij12 in #3256
- Remove compile note for hpu by @RafLit in #3271
- doc update of the rag usecase blog by @agunapal in #3280
- Add some hints for java devs by @mreso in #3282
- add TorchServe with Intel® Extension for PyTorch* guidance by @jingxu10 in #3285
- Update quickstart llm docker in serve/readme; added ts.llm_launcher example by @mreso in #3300
- typo fixes in HF Transformers example by @EFord36 in #3307
- docs: update WaveGlow links by @emmanuel-ferdman in #3317
- Fix typo: "a asynchronous" -> "an asynchronous" by @tadayosi in #3314
- Fix typo: vesion -> version, succsesfully -> successfully by @tadayosi in #3322
Improvements and Bug Fixes
- Bump torchserve from 0.10.0 to 0.11.0 in /examples/large_models/ipex_llm_int8 by @dependabot in #3257
- add JDK17 compatible groovy dependency for frontend log4j ScriptFilter by @lanxih in #3235
- Leave response and sendError when request is canceled by @slashvar in #3267
- add kserve gpu tests by @rohithkrn in #3283
- Configurable startup time by @Isalia20 in #3262
- Add REPO_URL in Dockerfile to allow docker builds from contributor repos by @mreso in #3291
- Fix docker repo url in github action workflow by @mreso in #3293
- Fix docker ci repo_url by @mreso in #3294
- Fix/docker repo url3 by @mreso in #3297
- Remove debug step in docker ci by @mreso in #3298
- Fix wild card in extra files by @mreso in #3304
- Example to demonstrate building a custom endpoint plugin by @namannandan in #3306
- Benchmark fix by @udaij12 in #3316
- Update TS version to 0.12.0 by @agunapal in #3318
- Clear up neuron cache by @chen3933 in #3326
- Fix Dockerfile for renamed forks by @mreso in #3327
- Load all models including targz by @m10an in #3329
- fix for snapshot variables missing/null by @udaij12 in #3328
New Contributors
- @andrius-meta made their first contribution in #3259
- @slashvar made their first contribution in #3267
- @RafLit made their first contribution in #3271
- @wdvr made their first contribution in #3277
- @Isalia20 made their first contribution in #3262
- @jingxu10 made their first contribution in #3285
- @EFord36 made their first contribution in #3307
- @emmanuel-ferdman made their first contribution in #3317
- @tadayosi made their first contribution in #3314
- @m10an made their first contribution in #3329
Platform Support
Ubuntu 20.04, macOS 10.14+, Windows 10 Pro, Windows Server 2019, and Windows Subsystem for Linux (Windows Server 2019, WSLv1, Ubuntu 18.04). TorchServe requires Python >= 3.8 and JDK 17.
GPU Support Matrix
TorchServe version | PyTorch version | Python | Stable CUDA | Experimental CUDA |
---|---|---|---|---|
0.12.0 | 2.4.0 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.11.1 | 2.3.0 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.11.0 | 2.3.0 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.10.0 | 2.2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.9.0 | 2.1 | >=3.8, <=3.11 | CUDA 11.8, CUDNN 8.7.0.84 | CUDA 12.1, CUDNN 8.9.2.26 |
0.8.0 | 2.0 | >=3.8, <=3.11 | CUDA 11.7, CUDNN 8.5.0.96 | CUDA 11.8, CUDNN 8.7.0.84 |
0.7.0 | 1.13 | >=3.7, <=3.10 | CUDA 11.6, CUDNN 8.3.2.44 | CUDA 11.7, CUDNN 8.5.0.96 |
Inferentia2 Support Matrix
TorchServe version | PyTorch version | Python | Neuron SDK |
---|---|---|---|
0.12.0 | 2.1 | >=3.8, <=3.11 | 2.18.2+ |
0.11.1 | 2.1 | >=3.8, <=3.11 | 2.18.2+ |
0.11.0 | 2.1 | >=3.8, <=3.11 | 2.18.2+ |
0.10.0 | 1.13 | >=3.8, <=3.11 | 2.16+ |
0.9.0 | 1.13 | >=3.8, <=3.11 | 2.13.2+ |
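The support matrices above can be encoded as a small lookup table, for example to sanity-check an environment before upgrading. The data below mirrors the most recent rows of the GPU support matrix; the helper function itself is illustrative and not part of TorchServe.

```python
import sys

# Python ranges from the GPU support matrix above (inclusive bounds).
# Illustrative helper -- not part of TorchServe itself.
SUPPORT_MATRIX = {
    "0.12.0": {"pytorch": "2.4.0", "python": ((3, 8), (3, 11))},
    "0.11.1": {"pytorch": "2.3.0", "python": ((3, 8), (3, 11))},
    "0.11.0": {"pytorch": "2.3.0", "python": ((3, 8), (3, 11))},
}

def python_supported(ts_version: str, py_version=None) -> bool:
    """Return True if the given Python version is in the supported range
    for the given TorchServe release."""
    py = tuple(py_version or sys.version_info[:2])
    lo, hi = SUPPORT_MATRIX[ts_version]["python"]
    return lo <= py <= hi

print(python_supported("0.12.0", (3, 10)))  # True: 3.8 <= 3.10 <= 3.11
print(python_supported("0.12.0", (3, 12)))  # False: above the supported range
```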