TensorRT8.6.1.6 Inference cost too much time #3993

Open

kaixiangjin opened this issue Jul 9, 2024 · 14 comments
@kaixiangjin

Description

I used TensorRT 8.6.1.6 to implement YOLOv8 inference, and I ran into a confusing problem. When I increase the batch size from 1 to 12, the inference time increases proportionally: batch size 1 takes 10 ms, batch size 2 takes 20 ms, and so on, up to batch size 12 at 120 ms. It looks as if the model processes the images one by one rather than as a single batch. Is this normal? In my view, if batch size 2 costs 20 ms, then batch size 4 should also cost about 20 ms, since CUDA should process the batch in parallel. I do not know how to solve this problem. Could someone give me a demo to help me implement this?
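For reference, here is a minimal sketch of what single-call batched inference looks like with the TensorRT 8.6 Python API and pycuda. The engine path, the binding layout (one input at index 0, one output at index 1), and the 640x640 input size are illustrative assumptions, not details from this issue; the engine is assumed to have been built with a dynamic batch dimension.

import numpy as np
import pycuda.autoinit  # creates a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("yolov8s.engine", "rb") as f:  # hypothetical engine with a dynamic batch dim
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

batch = 12
context.set_binding_shape(0, (batch, 3, 640, 640))  # pin the dynamic batch dimension

# One contiguous buffer per binding, sized for the whole batch.
host_bufs, dev_bufs, bindings = [], [], []
for i in range(engine.num_bindings):
    shape = tuple(context.get_binding_shape(i))
    dtype = trt.nptype(engine.get_binding_dtype(i))
    host = np.zeros(shape, dtype=dtype)
    dev = cuda.mem_alloc(host.nbytes)
    host_bufs.append(host)
    dev_bufs.append(dev)
    bindings.append(int(dev))

host_bufs[0][...] = np.random.rand(*host_bufs[0].shape)  # stand-in for 12 preprocessed images
cuda.memcpy_htod(dev_bufs[0], host_bufs[0])
context.execute_v2(bindings)                 # one call for all 12 images
cuda.memcpy_dtoh(host_bufs[1], dev_bufs[1])  # batched outputs back to host

Note that even with a single execute_v2 call over the whole batch, latency can still grow roughly linearly once the GPU is saturated; batching mainly raises throughput rather than keeping latency flat.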

Environment

TensorRT Version: 8.6.1.6

NVIDIA GPU: RTX A4000

NVIDIA Driver Version:

CUDA Version: 11.6

CUDNN Version:

Operating System: windows

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

@lix19937

It looks as if the model processes the images one by one rather than as a single batch.

Batches run in parallel only when the GPU still has spare resources; otherwise execution is effectively serial.

@kaixiangjin
Author

It looks as if the model processes the images one by one rather than as a single batch.

Batches run in parallel only when the GPU still has spare resources; otherwise execution is effectively serial.

How do I know whether the GPU resources are enough? Can I compute it?

@lix19937

GPU resources include many things: registers, L1/L2 cache, memory bandwidth, shared memory, CUDA cores/Tensor Cores, and so on. You usually need to run experiments.

You can get a rough view with nvidia-smi by watching GPU utilization. On the other hand, a model has many layers (each layer launches some CUDA kernels), so there can also be parallelism across layers.
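To script that check, here is a sketch using the NVML Python bindings (pip install nvidia-ml-py); the device index and polling interval are arbitrary choices, and it simply samples what nvidia-smi reports while a benchmark runs in another process.

import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
for _ in range(20):  # sample for ~10 s while inference runs elsewhere
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"sm={util.gpu}%  mem={mem.used / 2**20:.0f} MiB")
    time.sleep(0.5)
pynvml.nvmlShutdown()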

@kaixiangjin
Author

GPU resources include many things: registers, L1/L2 cache, memory bandwidth, shared memory, CUDA cores/Tensor Cores, and so on. You usually need to run experiments.

You can get a rough view with nvidia-smi by watching GPU utilization. On the other hand, a model has many layers (each layer launches some CUDA kernels), so there can also be parallelism across layers.

I checked my model and GPU, and I think the GPU has enough resources. My GPU is an RTX A4000 and the model is YOLOv8s. Even if I use 224x224 as the input size, the phenomenon persists.

@lix19937

What is your benchmark command or code?

@xxHn-pro

xxHn-pro commented Jul 11, 2024

I had the same problem. The inference time for batch size 32 is about 32x larger than for batch size 1, but the same model run through TensorFlow-TensorRT behaves as expected. The hardware and environment are the same, inside an NVIDIA TensorFlow container (release 24.01). Here is the benchmark command:

trtexec --onnx=./tmp.onnx --saveEngine=./tmp.trt --shapes=input1:32x256x256x1,input2:32x256x256x1

@lix19937

@xxHn-pro How do you measure the time?

@xxHn-pro

In TensorRT, it is in the log output. I take "GPU Compute Time" as the inference time.

[07/11/2024-09:44:35] [I] === Performance summary ===
[07/11/2024-09:44:35] [I] Throughput: 15.4965 qps
[07/11/2024-09:44:35] [I] Latency: min = 63.6447 ms, max = 65.3091 ms, mean = 64.3707 ms, median = 64.2803 ms, percentile(90%) = 65.0261 ms, percentile(95%) = 65.238 ms, percentile(99%) = 65.3091 ms
[07/11/2024-09:44:35] [I] Enqueue Time: min = 0.492401 ms, max = 0.9552 ms, mean = 0.859185 ms, median = 0.863281 ms, percentile(90%) = 0.917953 ms, percentile(95%) = 0.927368 ms, percentile(99%) = 0.9552 ms
[07/11/2024-09:44:35] [I] H2D Latency: min = 1.3772 ms, max = 1.38623 ms, mean = 1.37917 ms, median = 1.37891 ms, percentile(90%) = 1.38007 ms, percentile(95%) = 1.38232 ms, percentile(99%) = 1.38623 ms
[07/11/2024-09:44:35] [I] GPU Compute Time: min = 59.7156 ms, max = 61.3806 ms, mean = 60.4414 ms, median = 60.3503 ms, percentile(90%) = 61.0979 ms, percentile(95%) = 61.3088 ms, percentile(99%) = 61.3806 ms
[07/11/2024-09:44:35] [I] D2H Latency: min = 2.54977 ms, max = 2.55176 ms, mean = 2.55008 ms, median = 2.55005 ms, percentile(90%) = 2.55029 ms, percentile(95%) = 2.55054 ms, percentile(99%) = 2.55176 ms
[07/11/2024-09:44:35] [I] Total Host Walltime: 3.03294 s
[07/11/2024-09:44:35] [I] Total GPU Compute Time: 2.84075 s

In TensorFlow-TensorRT, the code is run in Python and the inference time is measured as below.

import time

import tensorflow as tf
from tensorflow.python.saved_model import signature_constants, tag_constants

def LoadRT(saved_model_dir):
    # Load the TF-TRT converted SavedModel and return its serving signature.
    saved_model_loaded = tf.saved_model.load(
        saved_model_dir, tags=[tag_constants.SERVING])
    graph_func = saved_model_loaded.signatures[
        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY]
    return graph_func, saved_model_loaded

def RunRT(model, input_data):
    # Time a single inference call with a host-side wall clock.
    start_time = time.time()
    pred = model(**input_data)
    return pred, time.time() - start_time

model, _ = LoadRT(ModelName)            # ModelName: path to the SavedModel directory
pred, TimeIt = RunRT(model, InputData)  # InputData: dict of named input tensors
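Note that timing the very first call also includes one-time costs (function tracing and, for TF-TRT, engine build). Here is a sketch of a steadier host-side measurement under the same setup, with warm-up and averaging; benchmark is a hypothetical helper, not from the original post.

def benchmark(model, input_data, warmup=5, iters=50):
    # Warm-up runs absorb tracing and TF-TRT engine-build overhead.
    for _ in range(warmup):
        model(**input_data)
    start = time.time()
    for _ in range(iters):
        pred = model(**input_data)
    # Pull results back to the host so queued GPU work is included in the timing.
    _ = {key: value.numpy() for key, value in pred.items()}
    return (time.time() - start) / iters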

@xxHn-pro

I reproduced the problem with an open model from here. Here are the results: the time scales by a factor of about 1.7 each time the batch size doubles. (A scale of 2.0 would mean fully serial execution; 1.0 would mean fully parallel.) Is that normal? I believe the hardware (A100) is strong enough to handle these batch sizes in parallel.

BatchSize   4      8       16      32      64
Time (ms)   1.41   2.27    3.84    7.11    13.40
Scale       -      1.6099  1.6916  1.8516  1.8847
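One way to read the table: although latency grows with batch size, the implied throughput (batch size divided by time) still improves, so the GPU is parallelizing the batch partially rather than running it fully serially. A quick check, computed from the numbers above:

times_ms = {4: 1.41, 8: 2.27, 16: 3.84, 32: 7.11, 64: 13.40}  # from the table above
for bs, ms in times_ms.items():
    print(f"batch {bs:2d}: {bs / ms * 1000:.0f} images/s")
# batch  4: 2837 images/s ... batch 64: 4776 images/s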

Here is the info about the container:

================

== TensorFlow ==

NVIDIA Release 24.01-tf2 (build 78846615)
TensorFlow Version 2.14.0

Container image Copyright (c) 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright 2017-2023 The TensorFlow Authors. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
Using CUDA 12.3 driver version 545.23.08 with kernel driver version 525.60.13.
See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

NOTE: Mellanox network driver detected, but NVIDIA peer memory driver not
detected. Multi-node communication performance may be reduced.

The test was done with

trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:4x3x224x224  > log4.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:8x3x224x224  > log8.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:16x3x224x224  > log16.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:32x3x224x224  > log32.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --shapes=data:64x3x224x224  > log64.txt

The full log for batch size 32 is attached: log32.txt

Any advice or suggestions would be appreciated.

@xxHn-pro

@lix19937 Can you suggest something to try? Or comment on the results, please.

@lix19937

@xxHn-pro A dynamic-shape model needs the min/opt/max shapes set at build time:

  --minShapes=spec            Build with dynamic shapes using a profile with the min shapes provided
  --optShapes=spec            Build with dynamic shapes using a profile with the opt shapes provided
  --maxShapes=spec            Build with dynamic shapes using a profile with the max shapes provided

@xxHn-pro

I have tried these commands.

trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:8x3x224x224 --optShapes=data:8x3x224x224 --maxShapes=data:8x3x224x224 > log8.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:16x3x224x224 --optShapes=data:16x3x224x224 --maxShapes=data:16x3x224x224 > log16.txt

trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:4x3x224x224 --optShapes=data:8x3x224x224 --maxShapes=data:16x3x224x224 > log8.txt
trtexec --onnx=./resnet50-v2-7.onnx --saveEngine=./tmp.trt --minShapes=data:8x3x224x224 --optShapes=data:16x3x224x224 --maxShapes=data:32x3x224x224 > log16.txt

But the results are the same as before.

@lix19937

Can you upload the resnet50-v2-7.onnx file?
