
Deploying TensorRT-LLM on Triton 24.02 and Serving HTTP Queries

Choosing the Right Environment

  1. Pick the versions. According to NVIDIA's official documentation, the most recent container at the time of writing is 24.02.
  • The NVIDIA Driver row recommends driver version 545 or newer. For data-center cards this requirement can be relaxed a bit. If you are on a consumer (gaming) card without a 545 driver and do not want to upgrade, at least do not fall too far behind; 535, for example, also works.
  • The Triton Inference Server row shows that the container ships Triton Server 2.43 and requires TensorRT-LLM 0.8.0.
  2. Pull the image. Go to the NVIDIA NGC container catalog, find the tritonserver image, pick the tag related to TensorRT-LLM (trtllm for short), copy the image address, and pull it with docker pull.
docker pull nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
  3. Clone the TensorRT-LLM project.
  • You can use the official repository, but make sure to use v0.8.0:
git clone https://github.com/NVIDIA/TensorRT-LLM.git -b v0.8.0
  • Or you can use my project; the main branch is currently 0.8.0 and may be tagged later, so please check the repository for a 0.8.0 tag before relying on main:
git clone https://github.com/Tlntin/Qwen-TensorRT-LLM
  • The walkthrough below is based on my project and deploys Qwen-1.8B-Chat on triton_server (it is a conveniently small model).
  4. Clone tensorrtllm_backend. This project orchestrates the TensorRT-LLM service inside Triton and must match the TensorRT-LLM version, so again use 0.8.0:
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git -b v0.8.0
  5. Start the tritonserver container:
docker run -d \
    --name triton \
    --net host \
    --shm-size=2g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    --gpus all \
    -v ${PWD}/tensorrtllm_backend:/tensorrtllm_backend \
    -v ${PWD}/Qwen-TensorRT-LLM/examples:/root/examples \
    nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 sleep 864000
  6. Check the services.
  • Enter the container:
docker exec -it triton /bin/bash
  • Check the NVIDIA driver:
nvidia-smi
  • Check the tritonserver version; it should match the 2.43 mentioned above (or newer):
cat /opt/tritonserver/TRITON_VERSION
  • Check the tensorrtllm_backend version; the value must match the contents of tools/version.txt at the official repository's v0.8.0 tag (official repository link):
cat /tensorrtllm_backend/tools/version.txt
  7. Install TensorRT-LLM directly with pip (you can skip this step if you built the container yourself); a quick version check is sketched after the install command:
pip install tensorrt_llm==0.8.0 --extra-index-url https://pypi.nvidia.com --extra-index-url https://download.pytorch.org/whl/cu121
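To confirm the installed wheel matches tensorrtllm_backend v0.8.0, here is a minimal check to run inside the container (it only assumes the pip install above succeeded):

# Print the installed TensorRT-LLM version; it should report 0.8.0.
import tensorrt_llm

print(tensorrt_llm.__version__)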

Building the Engine

  1. Enter the container:
docker exec -it triton /bin/bash
  2. Repeat the earlier workflow: install the Qwen dependencies and build the engine. Enabling inflight batching together with smooth int8 is recommended; reference commands below.
  • Enter the qwen directory:
cd /root/examples/qwen
  • Install the dependencies:
pip install -r requirements.txt
  • Build. Below is a simple fp16 example with batch_size=2 and paged_kv_cache enabled, which makes inflight-batching deployment easier:
python3 build.py \
    --paged_kv_cache \
    --remove_input_padding \
    --max_batch_size=2
  • Run a quick test (an optional sketch for inspecting the generated engine config follows this step):
python3 run.py
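If you want to confirm the build options took effect, the engine output directory contains a config.json describing the build. A minimal sketch follows; the path below is an assumption based on this project's default output directory, so adjust it to your actual --output_dir:

# Inspect the builder and plugin settings recorded in the engine config.
import json

with open("/root/examples/qwen/trt_engines/fp16/1-gpu/config.json") as f:
    cfg = json.load(f)

print(json.dumps(cfg.get("builder_config", {}), indent=2))
print(json.dumps(cfg.get("plugin_config", {}), indent=2))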

Temporary Triton Deployment

  1. Enter the container:
docker exec -it triton /bin/bash
  2. Set up the model repository directory:
cd /tensorrtllm_backend
cp all_models/inflight_batcher_llm/ -r triton_model_repo
  3. Copy the engine files built in the previous section. After copying, edit /tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/config.json and replace the max_output_len value with the value used for new_token_len; for this project that means changing 6144 to 2048, otherwise you will get an error (a Python sketch of this edit follows the copy commands).
cd /root/examples/qwen2/trt_engines/fp16/1-gpu/
cp -r ./* /tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/
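A minimal sketch of the config.json edit described above, replacing max_output_len (6144 in this project) with 2048. The builder_config key is an assumption based on the TensorRT-LLM 0.8.0 engine config layout; inspect your file first if it differs:

# Patch max_output_len in the copied engine config.
import json

cfg_path = "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["builder_config"]["max_output_len"] = 2048  # was 6144 for this project

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)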
  4. Copy the tokenizer files:
cd /root/examples/qwen2
cp -r qwen1.5_7b_chat /tensorrtllm_backend/triton_model_repo/tensorrt_llm/

# Optionally remove the Hugging Face model weights from the tokenizer directory
rm /tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen1.5_7b_chat/*.safetensors
  5. Fill in Triton's preprocessing and postprocessing configuration (see the reference documentation). A placeholder check is sketched after the fill_template commands.
cd /tensorrtllm_backend
export HF_QWEN_MODEL="/tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen1.5_7b_chat"
export ENGINE_DIR="/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1"
export MAX_BATCH_SIZE=2
export TOKENIZE_TYPE=auto
# Set according to the number of CPU threads; typically 2x the batch size or half the CPU threads
export INSTANCE_COUNT=4
# I only have one GPU; you can specify which GPUs to use, separated by commas
export GPU_DEVICE_IDS=0


python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:${HF_QWEN_MODEL},tokenizer_type:${TOKENIZE_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:${INSTANCE_COUNT}

python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:${HF_QWEN_MODEL},tokenizer_type:${TOKENIZE_TYPE},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:${INSTANCE_COUNT}

python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,bls_instance_count:${INSTANCE_COUNT},accumulate_tokens:True

python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}

python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:True,max_beam_width:1,engine_dir:${ENGINE_DIR},exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_batching,max_queue_delay_microseconds:600,gpu_device_ids:${GPU_DEVICE_IDS}
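An optional sanity check, sketched below, that lists any ${...} placeholders still left in the generated config.pbtxt files so you can spot a parameter that was not filled:

# Report unfilled fill_template placeholders across the model repo.
from pathlib import Path

repo = Path("/tensorrtllm_backend/triton_model_repo")
for pbtxt in sorted(repo.glob("*/config.pbtxt")):
    leftovers = [ln.strip() for ln in pbtxt.read_text().splitlines() if "${" in ln]
    if leftovers:
        print(pbtxt)
        for ln in leftovers:
            print("   ", ln)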
  6. Make a small change to the initialize function in the preprocessing and postprocessing model.py files; the stock example targets LLaMA, and we need to adapt it to the Qwen tokenizer configuration (an optional tokenizer-id check is sketched after the modified code).
  • Before (the preprocessing model has three of these lines; the postprocessing model has only one):
self.tokenizer.pad_token = self.tokenizer.eos_token
self.tokenizer_end_id = self.tokenizer.encode(
    self.tokenizer.eos_token, add_special_tokens=False)[0]
self.tokenizer_pad_id = self.tokenizer.encode(
    self.tokenizer.pad_token, add_special_tokens=False)[0]
  • After:
import json
import os


# Read the pad/end token ids from Qwen's generation_config.json
gen_config_path = os.path.join(tokenizer_dir, 'generation_config.json')
with open(gen_config_path, 'r') as f:
    gen_config = json.load(f)
if isinstance(gen_config["eos_token_id"], list):
    # chat models list several eos ids; use the first one for both
    pad_id = end_id = gen_config["eos_token_id"][0]
else:
    # if the model type is base, run this branch
    pad_id = gen_config["bos_token_id"]
    end_id = gen_config["eos_token_id"]
self.tokenizer_pad_id = pad_id
self.tokenizer_end_id = end_id
eos_token = self.tokenizer.decode(end_id)
self.tokenizer.eos_token = self.tokenizer.pad_token = eos_token
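An optional check, sketched here, that the end_id/pad_id used in the HTTP examples later (151645) matches the Qwen chat tokenizer's <|im_end|> token; the tokenizer path follows the copy step above:

# Encode <|im_end|> with the deployed tokenizer; expect [151645] for Qwen chat models.
from transformers import AutoTokenizer

tokenizer_dir = "/tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen1.5_7b_chat"
tok = AutoTokenizer.from_pretrained(tokenizer_dir, trust_remote_code=True)
print(tok.encode("<|im_end|>", add_special_tokens=False))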
  7. Start the service:
cd /tensorrtllm_backend
python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo
  8. Open another terminal and test the HTTP endpoint (a Python equivalent of the curl request is sketched after the output).
  • Request:
curl -X POST localhost:8000/v2/models/ensemble/generate \
-d '{"text_input": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你好,你叫什么?<|im_end|>\n<|im_start|>assistant\n", "max_tokens": 100, "bad_words": "", "stop_words": "", "end_id": [151645], "pad_id": [151645]}'
  • Output:
{"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"你好,我是来自阿里云的大规模语言模型,我叫通义千问。"}%

Calling the Service

Python client requests
  1. Install the Python dependencies (optional):
pip install tritonclient transformers gevent geventhttpclient tiktoken grpcio
  2. Run the qwen/triton_client/inflight_batcher_llm_client.py script to start a session:
cd /root/examples/triton_client
python3 inflight_batcher_llm_client.py --tokenizer_dir=/tensorrtllm_backend/triton_model_repo/tensorrt_llm/qwen1.5_7b_chat
  3. Test results:
====================
Human: 你好
Output: 你好!有什么我可以帮助你的吗?
Human: 你叫什么?
Output: 我是来自阿里云的大规模语言模型,我叫通义千问。

HTTP streaming calls
  1. Prerequisites
  • The engine was built with paged_kv_cache enabled
  • When deploying Triton, gpt_model_type in tensorrt_llm/config.pbtxt is set to inflight_batching
  2. Run the command:
curl -X POST localhost:8000/v2/models/ensemble/generate_stream \
-d '{"text_input": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n你好,你叫什么?<|im_end|>\n<|im_start|>assistant\n", "max_tokens": 100, "bad_words": "", "stop_words": "", "end_id": [151645], "pad_id": [151645], "stream": true}'
  3. Output (a Python sketch for consuming the stream follows the output):
data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":0.0,"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"你好"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":","}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"我是"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"来自"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"阿里"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"云"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"的大"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"规模"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"语言"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"模型"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":","}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"我"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"叫"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"通"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"义"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"千"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"问"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"。"}

data: {"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":""}

Shutting Down the Triton Service

pkill tritonserver

Permanent Deployment

  1. In the previous deployment we started the server with python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/tensorrtllm_backend/triton_model_repo. After editing tensorrtllm_backend/scripts/launch_triton_server.py and adding a print("cmd", cmd) on the second-to-last line, the command it actually runs is printed as follows:
["mpirun", "--allow-run-as-root", "-n", "1", "/opt/tritonserver/bin/tritonserver", "--model-repository=/tensorrtllm_backend/triton_model_repo", "--grpc-port=8001", "--http-port=8000", "--metrics-port=8002", "--disable-auto-complete-config", "--backend-config=python,shm-region-prefix-name=prefix0_", ":"]
  2. Write a Dockerfile that runs this command directly, replacing the container's default command:
FROM nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3
USER root
# Optional: bake the backend directory into the image instead of mounting it
# COPY tensorrtllm_backend /tensorrtllm_backend
WORKDIR /tensorrtllm_backend

CMD ["mpirun", "--allow-run-as-root", "-n", "1", "/opt/tritonserver/bin/tritonserver", "--model-repository=/tensorrtllm_backend/triton_model_repo", "--grpc-port=8001", "--http-port=8000", "--metrics-port=8002", "--disable-auto-complete-config", "--backend-config=python,shm-region-prefix-name=prefix0_", ":"]
  3. Build the new image and name it tritonserver:24.02:
docker build . -t tritonserver:24.02
  4. Test that it works:
docker run -it \
    --name triton_server \
    --net host \
    --shm-size=2g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -v ${PWD}/tensorrtllm_backend:/tensorrtllm_backend \
    --gpus all \
    tritonserver:24.02
  • Send a test request; if everything looks fine, exit and remove the container:
docker rm -f triton_server
  5. Start the container permanently: run it in the background with automatic restart enabled:
docker run -d \
    --name triton_server \
    --net host \
    --shm-size=2g \
    --restart always \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -v ${PWD}/tensorrtllm_backend:/tensorrtllm_backend \
    --gpus all \
    tritonserver:24.02
  6. Check the container logs and confirm it is running normally (an optional health-check sketch follows the log command).
docker logs triton_server
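As a final check, Triton's standard HTTP health endpoint can be probed; a minimal sketch assuming the default HTTP port 8000 used throughout this post:

# A 200 response on /v2/health/ready means the server and models are ready.
import requests

resp = requests.get("http://localhost:8000/v2/health/ready")
print("ready" if resp.status_code == 200 else f"not ready: HTTP {resp.status_code}")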

Copyright: tlntin
License: this work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license.
Updated: 2024-04-21 11:09

