A tool for evaluating the performance of LLM APIs. This repo is forked from https://github.com/ray-project/llmperf and builds upon this awesome project to log benchmarking metrics to Truefoundry.
git clone https://github.com/truefoundry/llmperf.git
cd llmperf
pip install -e .
export OPENAI_API_KEY=<YOUR_API_KEY>
# For privately hosted models, export your organization's URL as the base
# Example: https://*.truefoundry.com/api/llm/openai
export OPENAI_API_BASE="https://llm-gateway.truefoundry.com/openai"
python token_benchmark_ray.py \
--model "truefoundry-public/CodeLlama-Instruct(13B)" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--additional-sampling-params '{}'
The Quickstart script accepts several parameters:
Parameter | Comments |
---|---|
model | Chat-based model from the llm-gateway. Example: truefoundry-public/Falcon-Instruct(7B), truefoundry-public/CodeLlama-Instruct(7B), etc. |
mean-input-tokens | This is the average number of input tokens in a dataset. For example, if you have 3 sentences with 5, 7, and 8 tokens, the mean-input-tokens would be 6.67. |
stddev-input-tokens | This is the standard deviation of the number of input tokens, which measures the amount of variation or dispersion in the token counts. For instance, if your input tokens across different requests vary greatly, this number will be high. |
mean-output-tokens | This is the average number of output tokens generated. For example, if your model generates 3 responses with 10, 12, and 15 tokens respectively, the mean-output-tokens would be 12.33. |
stddev-output-tokens | This is the standard deviation of the number of output tokens. It measures the variability in the number of tokens in the output. A high value indicates a wide range of token counts in the output. |
max-num-completed-requests | The maximum number of requests to complete before the benchmark stops. |
timeout | This is the maximum time allowed for a request to be processed. For instance, if the timeout is set to 30 seconds, any request that takes longer than this will be terminated. |
num-concurrent-requests | This is the number of requests that can be processed at the same time. For example, if this value is 10, the system will handle 10 requests simultaneously. |
results-dir | The directory where the results will be saved. |
llm-api | Type of LLM client to be used. Supported clients: openai, anthropic, litellm. |
additional-sampling-params | These are extra parameters used for sampling. |
tokenizer_id | [Optional] The name of the Hugging Face tokenizer used for counting tokens. Defaults to hf-internal-testing/llama-tokenizer. |
ml_repo | [Optional] This specifies the name of the Machine Learning repository. You need to have access to this repository. |
run_name | [Optional] The run name which is logged in the Machine Learning repository. It helps in identifying and tracking different runs or experiments. |
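The mean/stddev parameters above define per-request token-count distributions. As a rough sketch of how such counts could be drawn (assuming simple Gaussian sampling clamped to a minimum; the actual script's sampling logic may differ):

```python
import random

def sample_token_count(mean: int, stddev: int, minimum: int = 1) -> int:
    """Draw a per-request token count from a normal distribution,
    clamped so it never falls below `minimum`."""
    return max(minimum, int(random.gauss(mean, stddev)))

random.seed(0)
# With mean-input-tokens 550 and stddev-input-tokens 150, individual
# requests get varying input sizes whose average is close to 550.
counts = [sample_token_count(550, 150) for _ in range(1000)]
avg = sum(counts) / len(counts)
```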
- Navigate to Settings > API Keys tab
- Click on Create New API Key
- Give any name to the API Key
- On Generate, the API Key will be generated.
- Please save the value or download it
For more details visit here.
ML Repositories are like specialized Git repositories for machine learning, managing runs, models, and artifacts within MLFoundry.
We’ll use mlfoundry to log parameters and metrics to the ML Repository. Read more about ML Repositories here.
export OPENAI_API_KEY=<YOUR_API_KEY>
# For privately hosted models, export your organization's URL as the base
# Example: https://*.truefoundry.com/api/llm/openai
export OPENAI_API_BASE="https://llm-gateway.truefoundry.com/openai"
python token_benchmark_ray.py \
--model "<MODEL_NAME>" \
--mean-input-tokens 550 \
--stddev-input-tokens 150 \
--mean-output-tokens 150 \
--stddev-output-tokens 10 \
--max-num-completed-requests 2 \
--timeout 600 \
--num-concurrent-requests 1 \
--results-dir "result_outputs" \
--llm-api openai \
--tokenizer_id "<TOKENIZER_ID_FROM_HUGGINGFACE>" \
--ml_repo "<ML_REPO_NAME>" \
--run_name "<RUN_NAME>" \
--additional-sampling-params '{}'
Once the run completes, the logged parameters and metrics can be viewed in the ML Repository:
We implement a load test to evaluate the latency metrics of Large Language Models (LLMs) and assess their performance.
The load test is designed to simulate multiple simultaneous requests to the LLM API.
It measures two key metrics: the time taken between tokens (inter-token latency) and the rate of token generation (generation throughput) for each request and across all concurrent requests.
In addition to these, the test also gauges several other metrics. These include time to first token, end-to-end latency, number of output tokens, and total number of tokens used.
The test is also designed to monitor and record any errors that occur during the process. It logs the error message, error code, frequency of each error code, and total number of errors.
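For a single request, the core latency metrics can be computed from per-token arrival times. A simplified sketch (it assumes timestamps are measured in seconds from the moment the request was sent; the benchmark's own bookkeeping is more involved):

```python
def request_metrics(token_timestamps: list[float]) -> dict:
    """Compute latency metrics from the wall-clock times (seconds since
    the request was sent) at which each output token arrived."""
    ttft = token_timestamps[0]    # time to first token
    total = token_timestamps[-1]  # end-to-end latency
    # Inter-token latency: average gap between consecutive tokens.
    gaps = [b - a for a, b in zip(token_timestamps, token_timestamps[1:])]
    inter_token = sum(gaps) / len(gaps) if gaps else 0.0
    throughput = len(token_timestamps) / total  # tokens per second
    return {"ttft_s": ttft, "e2e_s": total,
            "inter_token_s": inter_token, "throughput_tok_s": throughput}

# Five tokens arriving at a steady 50 ms after a 200 ms first-token delay:
m = request_metrics([0.2, 0.25, 0.3, 0.35, 0.4])
```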
The test involves sending a prompt with each request, which is structured as follows:
Randomly stream lines from the following text. Don't generate eos tokens:
LINE 1,
LINE 2,
LINE 3,
...
In this prompt, the lines are randomly selected from a collection of lines taken from Shakespeare's sonnets.
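A prompt of this shape can be assembled in a few lines. The sonnet lines below are an illustrative stand-in; the actual script draws from its own bundled source text:

```python
import random

# Illustrative stand-in for the script's bundled collection of sonnet lines.
SONNET_LINES = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "Rough winds do shake the darling buds of May,",
]

def build_prompt(num_lines: int, rng: random.Random) -> str:
    """Build a load-test prompt from randomly chosen source lines."""
    header = "Randomly stream lines from the following text. Don't generate eos tokens:\n"
    lines = [rng.choice(SONNET_LINES) for _ in range(num_lines)]
    return header + ",\n".join(lines)

prompt = build_prompt(3, random.Random(42))
```

In the real benchmark, lines are appended until the sampled input-token budget for that request is reached.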
You have the option to count tokens using any tokenizer available from Hugging Face by providing its ID in the parameters of the run. By default, the tokenizer used is hf-internal-testing/llama-tokenizer.
To conduct a basic load test, you can use the token_benchmark_ray script.
- The endpoint provider's backend might vary widely, so the results are not a reflection of how the software runs on any particular hardware.
- The results may vary with time of day.
- The results may vary with the load.
- The results may not correlate with users’ workloads.
The results of the load test and correctness test are saved in the results directory specified by the --results-dir argument. The results are saved in two files: one with the summary metrics of the test, and one with metrics from each individual request.
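The summary file can then be inspected programmatically. A sketch, assuming the summary is a flat JSON object mapping metric names to values (the exact keys and filename pattern depend on the llmperf version, so this example writes its own synthetic file):

```python
import json
from pathlib import Path

def load_summary(results_dir: str) -> dict:
    """Load the first summary JSON file found in the results directory."""
    path = next(Path(results_dir).glob("*summary*.json"))
    return json.loads(path.read_text())

# Synthetic stand-in for a real summary file; key names are illustrative.
Path("result_outputs").mkdir(exist_ok=True)
Path("result_outputs/demo_summary.json").write_text(
    json.dumps({"results_inter_token_latency_s_mean": 0.021,
                "results_num_completed_requests": 2})
)
summary = load_summary("result_outputs")
```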