A tool for evaluating the performance of LLM APIs. This repo is forked from https://github.com/ray-project/llmperf and builds upon this awesome project to log benchmarking metrics to Truefoundry.
git clone https://github.com/truefoundry/llmperf.git
cd llmperf
pip install -e .
export OPENAI_API_KEY=<YOUR_API_KEY>
# For privately hosted models, export your organization's URL as the base
# Example: https://*.truefoundry.com/api/llm/openai
export OPENAI_API_BASE="https://llm-gateway.truefoundry.com/openai"
python token_benchmark_ray.py \
--model "truefoundry-public/CodeLlama-Instruct(13B)" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
The Quickstart script accepts several parameters:
Parameter | Comments |
---|---|
model | Chat-based model from the llm-gateway. Example: truefoundry-public/Falcon-Instruct(7B), truefoundry-public/CodeLlama-Instruct(7B), etc. |
mean-input-tokens | This is the average number of input tokens in a dataset. For example, if you have 3 sentences with 5, 7, and 8 tokens, the mean-input-tokens would be 6.67. |
stddev-input-tokens | This is the standard deviation of the number of input tokens, which measures the amount of variation or dispersion in the token counts. For instance, if your input tokens across different requests vary greatly, this number will be high. |
mean-output-tokens | This is the average number of output tokens generated. For example, if your model generates 3 responses with 10, 12, and 15 tokens respectively, the mean-output-tokens would be 12.33. |
stddev-output-tokens | This is the standard deviation of the number of output tokens. It measures the variability in the number of tokens in the output. A high value indicates a wide range of token counts in the output. |
max-num-completed-requests | The maximum number of requests to complete before the benchmark stops. |
timeout | This is the maximum time allowed for a request to be processed. For instance, if the timeout is set to 30 seconds, any request that takes longer than this will be terminated. |
num-concurrent-requests | This is the number of requests that can be processed at the same time. For example, if this value is 10, the system will handle 10 requests simultaneously. |
results-dir | The directory where the results will be saved. |
llm-api | Type of LLM client to be used. Supported clients: openai, anthropic, litellm. |
additional-sampling-params | These are extra parameters used for sampling. |
tokenizer_id | [Optional] The name of the Hugging Face tokenizer used for counting tokens. Defaults to hf-internal-testing/llama-tokenizer. |
ml_repo | [Optional] This specifies the name of the Machine Learning repository. You need to have access to this repository. |
run_name | [Optional] The run name which is logged in the Machine Learning repository. It helps in identifying and tracking different runs or experiments. |
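The mean/stddev parameters above define per-request token-count distributions. As a rough sketch of how such counts could be drawn (assuming simple Gaussian sampling clamped to a minimum; the actual script's sampling logic may differ):

```python
import random

def sample_token_count(mean: int, stddev: int, minimum: int = 1) -> int:
    """Draw a per-request token count from a normal distribution,
    clamped so it never falls below `minimum`."""
    return max(minimum, int(random.gauss(mean, stddev)))

random.seed(0)
# With mean-input-tokens 550 and stddev-input-tokens 150, individual
# requests get varying input sizes whose average is close to 550.
counts = [sample_token_count(550, 150) for _ in range(1000)]
avg = sum(counts) / len(counts)
```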
- Navigate to Settings > API Keys tab
- Click on Create New API Key
- Give any name to the API Key
- On Generate, the API Key will be generated.
- Please save the value or download it
For more details visit here.
ML Repositories are like specialized Git repositories for machine learning, managing runs, models, and artifacts within MLFoundry.
We’ll use mlfoundry to log parameters and metrics to the ML Repository. Read more about ML Repositories here.
export OPENAI_API_KEY=<YOUR_API_KEY>
# For privately hosted models, export your organization's URL as the base
# Example: https://*.truefoundry.com/api/llm/openai
export OPENAI_API_BASE="https://llm-gateway.truefoundry.com/openai"
python token_benchmark_ray.py \
--model "<MODEL_NAME>" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--tokenizer_id "<TOKENIZER_ID_FROM_HUGGINGFACE>" \
--ml_repo "<ML_REPO_NAME>" \
--run_name "<RUN_NAME>" \
--additional-sampling-params '{}'
Once the run completes, the logged parameters and metrics can be viewed in the ML Repository:
We implement a load test to evaluate the latency metrics of Large Language Models (LLMs) and assess their performance.
The load test is designed to simulate multiple simultaneous requests to the LLM API.
It measures two key metrics: the time taken between tokens (inter-token latency) and the rate of token generation (generation throughput) for each request and across all concurrent requests.
In addition to these, the test also gauges several other metrics. These include time to first token, end-to-end latency, number of output tokens, and total number of tokens used.
The test is also designed to monitor and record any errors that occur during the process. It logs the error message, error code, frequency of each error code, and total number of errors.
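For a single request, the core latency metrics can be computed from per-token arrival times. A simplified sketch (it assumes timestamps are measured in seconds from the moment the request was sent; the benchmark's own bookkeeping is more involved):

```python
def request_metrics(token_timestamps: list[float]) -> dict:
    """Compute latency metrics from the wall-clock times (seconds since
    the request was sent) at which each output token arrived."""
    ttft = token_timestamps[0]    # time to first token
    total = token_timestamps[-1]  # end-to-end latency
    # Inter-token latency: average gap between consecutive tokens.
    gaps = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
    inter_token = sum(gaps) / len(gaps) if gaps else 0.0
    throughput = len(token_timestamps) / total  # tokens per second
    return {"ttft_s": ttft, "e2e_s": total,
            "inter_token_s": inter_token, "throughput_tok_s": throughput}

# Five tokens arriving at a steady 50 ms after a 200 ms first-token delay:
m = request_metrics([0.2, 0.25, 0.3, 0.35, 0.4])
```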
The test involves sending a prompt with each request, which is structured as follows:
Randomly stream lines from the following text. Don't generate eos tokens:
LINE 1,
LINE 2,
LINE 3,
...
In this prompt, the lines are randomly selected from a collection of lines taken from Shakespeare's sonnets.
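A prompt of this shape can be assembled in a few lines. The sonnet lines below are an illustrative stand-in; the actual script draws from its own bundled source text:

```python
import random

# Illustrative stand-in for the script's bundled collection of sonnet lines.
SONNET_LINES = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "Rough winds do shake the darling buds of May,",
]

def build_prompt(num_lines: int, rng: random.Random) -> str:
    """Build a load-test prompt from randomly chosen source lines."""
    header = "Randomly stream lines from the following text. Don't generate eos tokens:\n"
    lines = [rng.choice(SONNET_LINES) for _ in range(num_lines)]
    return header + ",\n".join(lines)

prompt = build_prompt(3, random.Random(42))
```

In the real benchmark, lines are appended until the sampled input-token budget for that request is reached.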
You have the option to count tokens using any tokenizer available from Hugging Face by providing its ID in the parameters of the run. By default, the tokenizer used is hf-internal-testing/llama-tokenizer.
To conduct a basic load test, you can use the token_benchmark_ray script.
- The endpoint provider's backend might vary widely, so the results are not a reflection of how the software runs on any particular hardware.
- The results may vary with time of day.
- The results may vary with the load.
- The results may not correlate with users’ workloads.
The results of the load test and correctness test are saved in the results directory specified by the --results-dir argument. The results are saved in two files: one with the summary metrics of the test, and one with metrics from each individual request.
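The summary file can then be inspected programmatically. A sketch, assuming the summary is a flat JSON object mapping metric names to values (the exact keys and filename pattern depend on the llmperf version, so this example writes its own synthetic file):

```python
import json
from pathlib import Path

def load_summary(results_dir: str) -> dict:
    """Load the first summary JSON file found in the results directory."""
    path = next(Path(results_dir).glob("*summary*.json"))
    return json.loads(path.read_text())

# Synthetic stand-in for a real summary file; key names are illustrative.
Path("result_outputs").mkdir(exist_ok=True)
Path("result_outputs/demo_summary.json").write_text(
    json.dumps({"results_inter_token_latency_s_mean": 0.021,
                "results_num_completed_requests": 2})
)
summary = load_summary("result_outputs")
```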