[LLM] add deploy server #9581

Open · wants to merge 1 commit into base: develop
11 changes: 11 additions & 0 deletions llm/server/.dockerignore
@@ -0,0 +1,11 @@
README.md
requirements-dev.txt
pyproject.toml
Makefile

dockerfiles/
docs/
server/__pycache__
server/http_server
server/engine
server/data
39 changes: 39 additions & 0 deletions llm/server/README.md
@@ -0,0 +1,39 @@

<h1 align="center"><b><em>LLM Serving Deployment</em></b></h1>

*This deployment tool is built on NVIDIA's Triton framework and designed for server-side LLM serving. It provides gRPC and HTTP service interfaces with streaming token output. The underlying inference engine supports acceleration optimizations such as continuous batching, weight-only INT8, and post-training quantization (PTQ), delivering an easy-to-use, high-performance deployment experience.*
Collaborator: Please also add a usage introduction to /llm/readme.md.


# Quick Start

Collaborator: Does this have to be used with a specific image? Can't it be deployed as freely as vllm?

Deploy using the prebuilt image. This section uses Meta-Llama-3-8B-Instruct-A8W8C8 as an example; for more models, see [LLaMA](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/llama.md), [Qwen](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/qwen.md), and [Mixtral](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/mixtral.md). For more detailed model inference and quantization tutorials, see the [LLM inference tutorial](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/inference.md):

```
# Download the model
wget https://paddle-qa.bj.bcebos.com/inference_model/Meta-Llama-3-8B-Instruct-A8W8C8.tar
mkdir Llama-3-8B-A8W8C8 && tar -xf Meta-Llama-3-8B-Instruct-A8W8C8.tar -C Llama-3-8B-A8W8C8

# Collaborator: Could more out-of-the-box quantized models be provided?

# Mount the model files
export MODEL_PATH=${PWD}/Llama-3-8B-A8W8C8

docker run --gpus all --shm-size 5G --network=host --privileged --cap-add=SYS_PTRACE \
-v ${MODEL_PATH}:/models/ \
-dit registry.baidubce.com/paddlepaddle/fastdeploy:llm-serving-cuda123-cudnn9-v1.2 \
bash -c 'export USE_CACHE_KV_INT8=1 && cd /opt/output/Serving && bash start_server.sh; exec bash'
# Collaborator: Can the dependency on a specific Docker image be removed? Requiring a specific image reduces ease of use.
```

Wait for the service to start successfully (the first startup takes about 40 seconds), then test it with the following command:

```
curl 127.0.0.1:9965/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{"text": "hello, llm"}'
```
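
Since the first startup takes on the order of 40 seconds, a small readiness loop can be useful before sending real traffic. The sketch below is only an illustration: it assumes the `requests` package is available and reuses the endpoint and JSON body from the curl example above.

```
import time

import requests

URL = "http://127.0.0.1:9965/v1/chat/completions"  # endpoint from the curl example above


def wait_until_ready(timeout_s: int = 120) -> bool:
    """Poll the endpoint until it responds, or give up after timeout_s seconds."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.post(URL, json={"text": "ping"}, timeout=5).ok:
                return True
        except requests.RequestException:
            pass  # server not up yet
        time.sleep(5)
    return False


if wait_until_ready():
    print(requests.post(URL, json={"text": "hello, llm"}, timeout=60).json())
else:
    print("service did not become ready in time")
```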

Note:
1. Make sure shm-size >= 5G; otherwise the service may fail to start

For more on how to use this deployment tool, see the [serving deployment tutorial](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/server/docs/deploy_usage_tutorial.md)

# License

Licensed under the [Apache-2.0 License](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/LICENSE).
110 changes: 110 additions & 0 deletions llm/server/client/README.md
@@ -0,0 +1,110 @@
# Client Usage

## Introduction

The serving client provides a command-line interface and a Python interface for quickly calling an LLM service deployed with the serving backend.

## Installation

Install from source:
```
pip install .
```

## Command-Line Interface

First set the model service mode, model service URL, and model ID via environment variables, then call the model service through the command-line interface.

| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| FASTDEPLOY_MODEL_URL | IP address and port where the model service is deployed, in the format `x.x.x.x:xxx`. | Yes | |

```
export FASTDEPLOY_MODEL_URL="x.x.x.x:xxx"

# Streaming interface
fdclient stream_generate "你好?"

# Non-streaming interface
fdclient generate "你好,你是谁?"
```

## Python Interface

First set the model service URL (hostname + port) in Python code, then call the model service through the Python interface.

| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| hostname + port | IP address and port where the model service is deployed, in the format `x.x.x.x` plus a port number. | Yes | |


```
from fastdeploy_client.chatbot import ChatBot
# Collaborator: Hmm, is fd still required? Could installing fd run into problems, such as version or compatibility requirements?
# Collaborator: Suggestion: do not use fd for this folder; it is easy to confuse.

hostname = "x.x.x.x"
port = xxx

# Streaming interface; see the parameter description below for the stream_generate API
chatbot = ChatBot(hostname=hostname, port=port)
stream_result = chatbot.stream_generate("你好", topp=0.8)
for res in stream_result:
print(res)

# Non-streaming interface; see the parameter description below for the generate API
chatbot = ChatBot(hostname=hostname, port=port)
result = chatbot.generate("你好", topp=0.8)
print(result)
```

### API Description
```
ChatBot.stream_generate(message,
max_dec_len=1024,
min_dec_len=2,
topp=0.0,
temperature=1.0,
frequency_score=0.0,
penalty_score=1.0,
presence_score=0.0,
eos_token_ids=254186)

# This function returns an iterator; each element is a dict, e.g. {"token": "好的", "is_end": 0}
# where token is the generated text and is_end indicates whether it is the last piece (0 = no, 1 = yes)
# Note: if generation fails, an error message is returned instead; eos_token_ids differs between models
```
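
As a hedged illustration of how this iterator might be consumed (the field names follow the comment above; treating a payload without a "token" field as an error is an assumption, since the exact error format is not documented here):

```
from fastdeploy_client.chatbot import ChatBot

# hostname and port are placeholders; use the address of your deployed service
chatbot = ChatBot(hostname="x.x.x.x", port=8000)

pieces = []
for res in chatbot.stream_generate("你好", topp=0.8):
    # Expected element shape: {"token": "...", "is_end": 0}
    if "token" not in res:
        # Assumption: a payload without a "token" field is an error message
        print("generation error:", res)
        break
    pieces.append(res["token"])
    if res.get("is_end") == 1:
        break

print("".join(pieces))
```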

```
ChatBot.generate(message,
max_dec_len=1024,
min_dec_len=2,
topp=0.0,
temperature=1.0,
frequency_score=0.0,
penalty_score=1.0,
presence_score=0.0,
eos_token_ids=254186)

# This function returns a dict, e.g. {"results": "好的,我知道了。"}, where results is the generated text
# Note: if generation fails, an error message is returned instead; eos_token_ids differs between models
```

### Parameter Description

| Field | Type | Description | Required | Default | Notes |
| :---: | :-----: | :---: | :---: | :-----: | :----: |
| req_id | str | Request ID used to identify a request. Setting req_id and keeping it unique is recommended | No | random id | If the inference service receives two requests with the same req_id at the same time, a duplicate req_id error is returned |
| text | str | Request text | Yes | None | |
| max_dec_len | int | Maximum number of generated tokens; if the token length of the request text plus max_dec_len exceeds the model's max_seq_len, a length-exceeded error is returned | No | max_seq_len minus the text token length | |
| min_dec_len | int | Minimum number of generated tokens; the minimum is 1 | No | 1 | |
| topp | float | Sampling randomness; larger values mean more randomness; range 0~1 | No | 0.7 | |
| temperature | float | Sampling randomness; smaller values mean more randomness; must be greater than 0 | No | 0.95 | |
| frequency_score | float | Frequency score | No | 0 | |
| penalty_score | float | Penalty score | No | 1 | |
| presence_score | float | Presence score | No | 0 | |
| stream | bool | Whether to stream the results | No | False | See the notes below for the difference from return_all_tokens |
| return_all_tokens | bool | Whether to return all results at once | No | False | See the notes below for the difference from stream |
| timeout | int | Request timeout in seconds | No | 300 | |

* When the PUSH_MODE_HTTP_PORT field is correctly configured, the service accepts both gRPC and HTTP requests
* The stream parameter only takes effect for HTTP requests
* The return_all_tokens parameter takes effect for both gRPC and HTTP requests
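
A hedged sketch of an HTTP request exercising several of the fields above. It assumes the endpoint path and port from the Quick Start README (`/v1/chat/completions` on port 9965) and that the HTTP body accepts the fields documented in this table; adjust both to your actual deployment.

```
import json
import uuid

import requests

# Endpoint is an assumption borrowed from the Quick Start example; replace it with
# the address where PUSH_MODE_HTTP_PORT is exposed in your deployment.
url = "http://127.0.0.1:9965/v1/chat/completions"

payload = {
    "req_id": str(uuid.uuid4()),  # recommended to be unique per request
    "text": "你好,你是谁?",
    "max_dec_len": 128,
    "topp": 0.7,
    "temperature": 0.95,
    "return_all_tokens": True,    # return the full result in one response
    "timeout": 300,
}

response = requests.post(url, json=payload, timeout=310)
print(json.dumps(response.json(), ensure_ascii=False, indent=2))
```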
20 changes: 20 additions & 0 deletions llm/server/client/fastdeploy_client/__init__.py
@@ -0,0 +1,20 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging
import sys

__version__ = "4.4.0"

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)