Triton provides Prometheus metrics indicating GPU and request statistics. By default, these metrics are available at http://localhost:8002/metrics. The metrics are only available by accessing the endpoint; they are not pushed or published to any remote server. The metrics are reported in plain text, so you can view them directly, for example:
$ curl localhost:8002/metrics
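The same output can also be read programmatically. Below is a minimal sketch, assuming Triton is running locally with the default metrics port, that fetches the metrics text using only the Python standard library:

```python
# Fetch the Prometheus-format metrics text from a local Triton server.
# Assumes the default metrics endpoint (port 8002); adjust the URL if needed.
import urllib.request

with urllib.request.urlopen("http://localhost:8002/metrics") as resp:
    print(resp.read().decode("utf-8"))
```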
The tritonserver --allow-metrics=false option disables all metric reporting, and --allow-gpu-metrics=false disables just the GPU Utilization and GPU Memory metrics. The --metrics-port option exposes the metrics on a different port.
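For example, the following invocation disables GPU metrics and serves the remaining metrics on port 8004 (the model repository path is illustrative; substitute your own):

$ tritonserver --model-repository=/models --allow-gpu-metrics=false --metrics-port=8004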
The following table describes the available metrics.
| Category | Metric | Description | Granularity | Frequency |
|---|---|---|---|---|
| GPU Utilization | Power Usage | GPU instantaneous power | Per GPU | Per second |
| GPU Utilization | Power Limit | Maximum GPU power limit | Per GPU | Per second |
| GPU Utilization | Energy Consumption | GPU energy consumption in joules since Triton started | Per GPU | Per second |
| GPU Utilization | GPU Utilization | GPU utilization rate (0.0 - 1.0) | Per GPU | Per second |
| GPU Memory | GPU Total Memory | Total GPU memory, in bytes | Per GPU | Per second |
| GPU Memory | GPU Used Memory | Used GPU memory, in bytes | Per GPU | Per second |
| Count | Request Count | Number of inference requests | Per model | Per request |
| Count | Execution Count | Number of inference executions (request count / execution count = average dynamic batch size) | Per model | Per request |
| Count | Inference Count | Number of inferences performed (one request counts as "batch size" inferences) | Per model | Per request |
| Latency | Request Time | Cumulative end-to-end inference request handling time | Per model | Per request |
| Latency | Queue Time | Cumulative time requests spend waiting in the scheduling queue | Per model | Per request |
| Latency | Compute Input Time | Cumulative time requests spend processing inference inputs (in the framework backend) | Per model | Per request |
| Latency | Compute Time | Cumulative time requests spend executing the inference model (in the framework backend) | Per model | Per request |
| Latency | Compute Output Time | Cumulative time requests spend processing inference outputs (in the framework backend) | Per model | Per request |
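As a concrete illustration of how the Count metrics relate, the sketch below scrapes the metrics endpoint and computes the average dynamic batch size per model as request count divided by execution count. The metric names nv_inference_request_count and nv_inference_exec_count are assumptions that may differ across Triton versions; check your server's /metrics output for the exact names.

```python
# Sketch: compute average dynamic batch size per model from Triton's metrics.
# The metric names below are assumptions -- verify them against your
# server's /metrics output before relying on this.
import re
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"  # default Triton metrics endpoint

def scrape(url=METRICS_URL):
    """Return the raw Prometheus text exposed by the metrics endpoint."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

def parse_counter(text, name):
    """Map each metric's label set (as a raw string) to its numeric value."""
    pattern = re.compile(
        r"^" + re.escape(name) + r"\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$",
        re.MULTILINE,
    )
    return {m.group("labels"): float(m.group("value")) for m in pattern.finditer(text)}

body = scrape()
request_counts = parse_counter(body, "nv_inference_request_count")  # assumed name
execution_counts = parse_counter(body, "nv_inference_exec_count")   # assumed name

for labels, requests in request_counts.items():
    executions = execution_counts.get(labels)
    if executions:
        # Request Count / Execution Count = average dynamic batch size
        print(f"{labels}: average dynamic batch size = {requests / executions:.2f}")
```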