LLMPerf

A tool for evaluating the performance of LLM APIs. This repo is forked from https://github.com/ray-project/llmperf and builds upon that awesome project to log benchmarking metrics to TrueFoundry.

Installation

git clone https://github.com/truefoundry/llmperf.git
cd llmperf
pip install -e .

Quickstart

export OPENAI_API_KEY=<YOUR_API_KEY>

# For a privately hosted model, export your organization's gateway URL as the base
# Example: https://*.truefoundry.com/api/llm/openai
export OPENAI_API_BASE="https://llm-gateway.truefoundry.com/openai"

python token_benchmark_ray.py \
--model "truefoundry-public/CodeLlama-Instruct(13B)" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
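
Before running a long benchmark, it can help to confirm that the gateway endpoint and API key work. The snippet below is a minimal sanity check in Python, assuming the gateway is OpenAI-compatible (the model name is just the one from the Quickstart; substitute any model enabled on your gateway):

# Sanity-check the gateway before benchmarking (assumes an OpenAI-compatible API).
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ["OPENAI_API_BASE"],
)

# Replace with a model that is actually enabled on your gateway.
response = client.chat.completions.create(
    model="truefoundry-public/CodeLlama-Instruct(13B)",
    messages=[{"role": "user", "content": "Say hello in one word."}],
    max_tokens=8,
)
print(response.choices[0].message.content)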

Parameters

The Quickstart script accepts several parameters:

  • model: A chat-based model from llm-gateway. Examples: truefoundry-public/Falcon-Instruct(7B), truefoundry-public/CodeLlama-Instruct(7B), etc.
  • mean-input-tokens: The average number of input tokens per request. For example, if you have 3 prompts with 5, 7, and 8 tokens, the mean input length would be 6.67 (see the sampling sketch after this list).
  • stddev-input-tokens: The standard deviation of the number of input tokens, which measures how much the token counts vary. If your input lengths differ greatly across requests, this number will be high.
  • mean-output-tokens: The average number of output tokens generated per request. For example, if the model generates 3 responses with 10, 12, and 15 tokens, the mean output length would be 12.33.
  • stddev-output-tokens: The standard deviation of the number of output tokens. A high value indicates a wide range of output lengths.
  • max-num-completed-requests: The maximum number of requests to complete before the benchmark stops.
  • timeout: The maximum time allowed for a request to be processed, in seconds. For instance, with a timeout of 30 seconds, any request that takes longer is terminated.
  • num-concurrent-requests: The number of requests processed at the same time. For example, a value of 10 means 10 requests are sent simultaneously.
  • results-dir: The directory where the results will be saved.
  • llm-api: The type of LLM client to use. Supported clients: openai, anthropic, litellm.
  • additional-sampling-params: Extra sampling parameters passed with each request, given as a JSON string (the Quickstart passes '{}').
  • tokenizer_id [Optional]: The name of the Hugging Face tokenizer used for counting tokens; the default is hf-internal-testing/llama-tokenizer.
  • ml_repo [Optional]: The name of the ML repository to log to. You need to have access to this repository.
  • run_name [Optional]: The run name logged in the ML repository. It helps in identifying and tracking different runs or experiments.
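
To make the mean/stddev parameters concrete, here is a minimal sketch of how per-request token budgets could be sampled, assuming a normal distribution clipped to a minimum (the helper name and clipping value are illustrative, not taken from the repository):

# Illustrative only: sample per-request token budgets from a normal distribution.
import random

def sample_num_tokens(mean: int, stddev: int, minimum: int = 1) -> int:
    """Draw a token count around `mean` with spread `stddev`, never below `minimum`."""
    return max(minimum, int(random.gauss(mean, stddev)))

# Example: input lengths centred on 550 tokens, outputs centred on 150 tokens.
input_tokens = sample_num_tokens(550, 150)
output_tokens = sample_num_tokens(150, 10)
print(input_tokens, output_tokens)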

Generate an API Key

  • Navigate to Settings > API Keys tab
  • Click on Create New API Key
  • Give any name to the API Key
  • On Generate, the API Key will be generated.
  • Please save the value or download it.

For more details, refer to the TrueFoundry documentation on API keys.

Logging params and metrics to ML Repo

ML Repositories are like specialized Git repositories for machine learning, managing runs, models, and artifacts within MLFoundry.

We’ll use mlfoundry to log parameters and metrics to the ML repository. Read more about ML Repositories in the TrueFoundry documentation.
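
For reference, logging to an ML repo with mlfoundry typically looks like the sketch below. The repo name, run name, parameter names, and metric values are placeholders; the benchmark script performs this logging for you when --ml_repo and --run_name are set.

# Rough sketch of mlfoundry logging; names and values here are placeholders.
import mlfoundry

client = mlfoundry.get_client()
run = client.create_run(ml_repo="llm-benchmarks", run_name="codellama-13b-load-test")

run.log_params({"model": "truefoundry-public/CodeLlama-Instruct(13B)", "num_concurrent_requests": 1})
run.log_metrics({"mean_ttft_s": 0.42, "mean_inter_token_latency_s": 0.03})
run.end()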

Script

export OPENAI_API_KEY=<YOUR_API_KEY>
# For a privately hosted model, export your organization's gateway URL as the base
# Example: https://*.truefoundry.com/api/llm/openai
export OPENAI_API_BASE="https://llm-gateway.truefoundry.com/openai"

python token_benchmark_ray.py \
--model "<MODEL_NAME>" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--tokenizer_id "<TOKENIZER_ID_FROM_HUGGINGFACE>" \
--ml_repo "<ML_REPO_NAME>" \
--run_name "<RUN_NAME>" \
--additional-sampling-params '{}'

After the run completes, the logged parameters and metrics look like this:

[Result: screenshot of parameters and metrics logged to the ML repo]

Basic Usage

We implement a load test for evaluating the latency metrics of Large Language Models (LLMs) to assess their performance.

Load test

The load test is designed to simulate multiple simultaneous requests to the LLM API.

It measures two key metrics: the time taken between tokens (inter-token latency) and the rate of token generation (generation throughput) for each request and across all concurrent requests.

In addition to these, the test also gauges several other metrics. These include time to first token, end-to-end latency, number of output tokens, and total number of tokens used.
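
As an illustration of how these metrics relate, the sketch below derives them from per-request timing information. The variable names and timing values are illustrative, not the repository's internal ones.

# Illustrative metric derivation for a single request.
# Assume we recorded when the request was sent, when the first and last tokens
# arrived, and how many output tokens were generated.
request_start_s = 0.00
first_token_s = 0.35
last_token_s = 4.85
num_output_tokens = 150

time_to_first_token = first_token_s - request_start_s                          # seconds
end_to_end_latency = last_token_s - request_start_s                            # seconds
inter_token_latency = (last_token_s - first_token_s) / max(num_output_tokens - 1, 1)
generation_throughput = num_output_tokens / (last_token_s - first_token_s)     # tokens/s

print(time_to_first_token, end_to_end_latency, inter_token_latency, generation_throughput)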

The test is also designed to monitor and record any errors that occur during the process. It logs the error message, error code, frequency of each error code, and total number of errors.

The test involves sending a prompt with each request, which is structured as follows:

Randomly stream lines from the following text. Don't generate eos tokens:
LINE 1,
LINE 2,
LINE 3,
...

In this prompt, the lines are randomly selected from a collection of lines taken from Shakespeare's sonnets.
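
For illustration, a prompt of roughly the requested input length could be assembled like this. The short list of sonnet lines and the helper function are placeholders, not the repository's actual implementation.

# Illustrative prompt construction: pad the instruction with randomly chosen
# lines of text until the requested number of lines is reached.
import random

SONNET_LINES = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "Rough winds do shake the darling buds of May,",
]

def build_prompt(num_lines: int) -> str:
    header = "Randomly stream lines from the following text. Don't generate eos tokens:\n"
    body = ",\n".join(random.choice(SONNET_LINES) for _ in range(num_lines))
    return header + body

print(build_prompt(5))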

You have the option to count tokens using any tokenizer available on Hugging Face. You can specify the tokenizer by providing its ID in the run parameters. By default, the tokenizer used is hf-internal-testing/llama-tokenizer.
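
Counting tokens with a Hugging Face tokenizer looks roughly like this, using the default tokenizer mentioned above (requires the transformers package):

# Count tokens with the default Hugging Face tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")
text = "Shall I compare thee to a summer's day?"
print(len(tokenizer.encode(text)))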

To conduct a basic load test, you can use the token_benchmark_ray script.

Caveats and Disclaimers

  • The endpoint provider's backend might vary widely, so this is not a reflection of how the software runs on particular hardware.
  • The results may vary with time of day.
  • The results may vary with the load.
  • The results may not correlate with users’ workloads.

Saving Results

The results of the load test and correctness test are saved in the results directory specified by the --results-dir argument. The results are saved in two files: one with the summary metrics of the test, and one with metrics from each individual request that is returned.
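
Both files are JSON, so they can be inspected with standard tooling. A minimal sketch is shown below; the exact file names depend on the model and run, so it simply globs all JSON files in the results directory.

# Load the saved benchmark results for inspection.
import glob
import json

for path in glob.glob("result_outputs/*.json"):
    with open(path) as f:
        results = json.load(f)
    print(path, json.dumps(results, indent=2)[:500])  # preview each results file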
