[LLM] add deploy server #9581
base: develop
@@ -0,0 +1,11 @@
README.md
requirements-dev.txt
pyproject.toml
Makefile

dockerfiles/
docs/
server/__pycache__
server/http_server
server/engine
server/data
@@ -0,0 +1,39 @@

<h1 align="center"><b><em>LLM Serving Deployment</em></b></h1>

*This deployment tool is built on NVIDIA's Triton framework and designed for serving large language models in server scenarios. It provides service interfaces over gRPC and HTTP, along with streaming token output. The underlying inference engine supports acceleration strategies such as continuous batching, weight-only int8, and post-training quantization (PTQ), delivering an easy-to-use, high-performance deployment experience.*

# Quick Start
> **Review comment:** Does this have to be used with a specific image? Can it not be deployed freely, the way vLLM can?
Deployment uses a prebuilt image. This section takes Meta-Llama-3-8B-Instruct-A8W8C8 as an example; for more models, see [LLaMA](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/llama.md), [Qwen](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/qwen.md), and [Mixtral](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/mixtral.md). For more detailed inference and quantization tutorials, see the [LLM inference tutorial](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/docs/predict/inference.md):
```
# Download the model
wget https://paddle-qa.bj.bcebos.com/inference_model/Meta-Llama-3-8B-Instruct-A8W8C8.tar
mkdir Llama-3-8B-A8W8C8 && tar -xf Meta-Llama-3-8B-Instruct-A8W8C8.tar -C Llama-3-8B-A8W8C8

# Mount the model files
export MODEL_PATH=${PWD}/Llama-3-8B-A8W8C8

docker run --gpus all --shm-size 5G --network=host --privileged --cap-add=SYS_PTRACE \
    -v ${MODEL_PATH}:/models/ \
    -dit registry.baidubce.com/paddlepaddle/fastdeploy:llm-serving-cuda123-cudnn9-v1.2 \
    bash -c 'export USE_CACHE_KV_INT8=1 && cd /opt/output/Serving && bash start_server.sh; exec bash'
```

> **Review comment** (on the model download step): Could more out-of-the-box quantized models be provided?

> **Review comment** (on the Docker image): Could the dependency on a specific Docker image be removed? Relying on a particular image reduces ease of use.
Wait for the service to start successfully (the first startup takes about 40 s), then test it with the following command:

```
curl 127.0.0.1:9965/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"text": "hello, llm"}'
```
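The same check can also be done from Python. This is a minimal sketch that assumes only what the curl example above shows (host `127.0.0.1:9965`, path `/v1/chat/completions`, and a JSON body with a `text` field); the response schema is not documented here, so the raw JSON is simply printed:

```
# Minimal sketch of calling the HTTP endpoint from Python.
# Host, path, and request body are taken from the curl example above;
# the response schema is not documented here, so the raw JSON is printed.
import requests

resp = requests.post(
    "http://127.0.0.1:9965/v1/chat/completions",
    json={"text": "hello, llm"},  # json= also sets the Content-Type header
    timeout=300,
)
print(resp.status_code)
print(resp.json())
```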
Note:
1. Make sure shm-size >= 5 GB; otherwise the service may fail to start.

For more on how to use this deployment tool, see the [serving deployment tutorial](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/llm/server/docs/deploy_usage_tutorial.md).

# License

Released under the [Apache-2.0 license](https://github.com/PaddlePaddle/PaddleNLP/blob/develop/LICENSE).
@@ -0,0 +1,110 @@
# Client Usage

## Introduction

The serving client provides a command-line interface and a Python interface for quickly calling an LLM service deployed with the serving backend.

## Installation

Install from source:
```
pip install .
```

## Command-Line Interface

First set the model service URL through an environment variable, then call the model service from the command line.
| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| FASTDEPLOY_MODEL_URL | IP address and port where the model service is deployed, in the form `x.x.x.x:xxx`. | Yes | |

```
export FASTDEPLOY_MODEL_URL="x.x.x.x:xxx"

# Streaming interface
fdclient stream_generate "你好?"

# Non-streaming interface
fdclient generate "你好,你是谁?"
```
## Python Interface

First set the model service URL (hostname and port) in Python code, then call the model service through the Python interface.

| Parameter | Description | Required | Default |
| --- | --- | --- | --- |
| hostname + port | IP address and port where the model service is deployed; the hostname is in the form `x.x.x.x`. | Yes | |

```
from fastdeploy_client.chatbot import ChatBot

hostname = "x.x.x.x"
port = xxx

# Streaming interface; see the parameter reference below for the stream_generate API
chatbot = ChatBot(hostname=hostname, port=port)
stream_result = chatbot.stream_generate("你好", topp=0.8)
for res in stream_result:
    print(res)

# Non-streaming interface; see the parameter reference below for the generate API
chatbot = ChatBot(hostname=hostname, port=port)
result = chatbot.generate("你好", topp=0.8)
print(result)
```

> **Review comment** (on `from fastdeploy_client.chatbot import ChatBot`): Hmm, does this still require fd? Could installing fd run into problems with version requirements or compatibility?

> **Review comment:** Suggest not using the fd name for this folder; it is easy to confuse.
### API Reference
```
ChatBot.stream_generate(message,
                        max_dec_len=1024,
                        min_dec_len=2,
                        topp=0.0,
                        temperature=1.0,
                        frequency_score=0.0,
                        penalty_score=1.0,
                        presence_score=0.0,
                        eos_token_ids=254186)

# Returns an iterator whose elements are dicts, e.g. {"token": "好的", "is_end": 0},
# where token is the generated text piece and is_end indicates whether it is the last one (0 = no, 1 = yes).
# Note: if generation fails, an error message is returned instead; eos_token_ids differs across models.
```
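As a small usage sketch based only on the return format described above (hostname, port, and prompt are placeholders), the streamed pieces can be concatenated into the full reply:

```
# Sketch: assemble the streamed pieces into the full reply.
# Assumes each element is a dict like {"token": "...", "is_end": 0} as described above.
from fastdeploy_client.chatbot import ChatBot

chatbot = ChatBot(hostname="x.x.x.x", port=8000)  # placeholder address and port
reply = ""
for piece in chatbot.stream_generate("你好", topp=0.8):
    if "token" in piece:
        reply += piece["token"]
    if piece.get("is_end") == 1:  # last piece of this generation
        break
print(reply)
```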
```
ChatBot.generate(message,
                 max_dec_len=1024,
                 min_dec_len=2,
                 topp=0.0,
                 temperature=1.0,
                 frequency_score=0.0,
                 penalty_score=1.0,
                 presence_score=0.0,
                 eos_token_ids=254186)

# Returns a dict, e.g. {"results": "好的,我知道了。"}, where results is the generated text.
# Note: if generation fails, an error message is returned instead; eos_token_ids differs across models.
```
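Since a failed generation returns an error message instead of a `results` field, a caller may want to check for that key. This sketch assumes only the dict format described above; the exact error format is not specified here:

```
# Sketch: treat any response without a "results" key as an error.
# The exact error format is not documented here, so the whole dict is reported.
from fastdeploy_client.chatbot import ChatBot

chatbot = ChatBot(hostname="x.x.x.x", port=8000)  # placeholder address and port
result = chatbot.generate("你好", topp=0.8)
if isinstance(result, dict) and "results" in result:
    print(result["results"])
else:
    print("generation failed:", result)
```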
### Parameter Reference

| Field | Type | Description | Required | Default | Notes |
| :---: | :---: | :---: | :---: | :---: | :---: |
| req_id | str | Request ID used to identify a request; setting a unique req_id is recommended | No | random id | If the inference service receives two requests with the same req_id at the same time, a duplicate-req_id error is returned |
| text | str | Input text of the request | Yes | none | |
| max_dec_len | int | Maximum number of tokens to generate; if the input token length plus max_dec_len exceeds the model's max_seq_len, a length-exceeded error is returned | No | max_seq_len minus the input token length | |
| min_dec_len | int | Minimum number of tokens to generate; at least 1 | No | 1 | |
| topp | float | Sampling randomness; larger values mean more randomness; range 0 to 1 | No | 0.7 | |
| temperature | float | Sampling randomness; smaller values mean more randomness; must be greater than 0 | No | 0.95 | |
| frequency_score | float | Frequency score | No | 0 | |
| penalty_score | float | Penalty score | No | 1 | |
| presence_score | float | Presence score | No | 0 | |
| stream | bool | Whether to stream the results | No | False | |
| return_all_tokens | bool | Whether to return all results at once | No | False | See the notes below for how this differs from stream |
| timeout | int | Request timeout in seconds | No | 300 | |

* With the PUSH_MODE_HTTP_PORT field correctly configured, the service accepts both gRPC and HTTP requests.
* The stream parameter only takes effect for HTTP requests.
* The return_all_tokens parameter applies to both gRPC and HTTP requests.
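For HTTP requests, these fields presumably go into the JSON body (the quick-start curl example sends the `text` field this way). The sketch below reuses the endpoint from that example; the host, port, and path are assumptions carried over from there, and the response schema is not documented here, so the raw JSON is printed:

```
# Sketch: send the request fields above over HTTP.
# Host, port, and path are assumed from the quick-start curl example.
import requests

payload = {
    "req_id": "demo-0001",      # hypothetical unique request id
    "text": "你好,你是谁?",
    "max_dec_len": 256,
    "topp": 0.7,
    "temperature": 0.95,
    "return_all_tokens": True,  # return the full result in one response
    "timeout": 300,
}
resp = requests.post("http://127.0.0.1:9965/v1/chat/completions", json=payload)
print(resp.json())
```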
@@ -0,0 +1,20 @@
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import logging
import sys

__version__ = "4.4.0"

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
> **Review comment:** Please also add a usage introduction to /llm/readme.md.