LLM Load Testing on Azure (Serverless or Managed Compute)

Introduction

In the constantly evolving field of artificial intelligence, the ability to efficiently load test large language models (LLMs) is critical to ensuring optimal performance and scalability. llm-load-test-azure is a tool designed to facilitate load testing of LLMs running on various Azure deployment setups.

Why use llm-load-test-azure?

Load testing an LLM is essential to ensure it can handle real-world usage scenarios. With llm-load-test-azure, developers can identify potential bottlenecks, optimize performance, and confirm that their models are ready for deployment. The tool's flexibility, comprehensive feature set, and support for a wide range of Azure AI models make it an invaluable resource for anyone running LLMs on Azure.

Some scenarios where this tool can be useful:

- You are setting up your endpoints and need to determine how many tokens per minute they can process and what latency to expect.
- You have deployed a large language model (LLM) on your own infrastructure and want to benchmark different compute options for your application.
- You want to measure real token throughput and stress test your provisioned throughput units (PTUs).

Key Features

llm-load-test-azure is packed with features that make it a must-have tool for anyone running LLMs on Azure. Here are some of the highlights:

- Customizable test datasets: Create custom test datasets tailored to your use case and settings. This flexibility ensures that your load tests are as relevant and accurate as possible.
- Load testing options: The tool supports configurable concurrency, duration, and warm-up options, allowing you to simulate different load scenarios and measure model performance under varying conditions.
- Support for multiple Azure AI models: Whether you use Azure OpenAI, Azure OpenAI Embeddings, Azure Model Catalog serverless (MaaS), or managed compute (MaaP), llm-load-test-azure has you covered. The tool's modular design allows developers to integrate new endpoints with minimal effort.
- Detailed results: Get comprehensive statistics such as throughput, time to first token, time between tokens, and end-to-end latency in JSON format, providing valuable insight into model performance.

Getting Started

Using llm-load-test-azure is simple. Here is a quick guide to get you started.

1. Create a dataset (optional): Generate a custom dataset with the generate_dataset.py script, specifying the input and output token lengths, the number of samples, and the output file name.

   python datasets/generate_dataset.py --tok_input_length 250 --tok_output_length 50 --N 100 --output_file datasets/random_text_dataset.jsonl

   --tok_input_length: the length of the input, in tokens (minimum 25)
   --tok_output_length: the length of the output, in tokens
   --N: the number of samples to generate
   --output_file: the name of the output file (default: random_text_dataset.jsonl)

2. Run the tool: Run the load_test.py script with your desired configuration options. Use a YAML configuration file to customize the tool's behavior and specify parameters such as output format, storage type, and warm-up options.

   load_test.py [-h] [-c CONFIG] [-log {warn,warning,info,debug}]

   optional arguments:
     -h, --help            show this help message and exit
     -c CONFIG, --config CONFIG
                           config YAML file name
     -log {warn,warning,info,debug}, --log_level {warn,warning,info,debug}
                           Provide logging level. Example --log_level debug, default=warning
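For convenience, here is a minimal sketch of the same two steps driven from Python. It assumes you have cloned the repository, installed its dependencies, and prepared a YAML configuration file; the file name config.yaml is only an illustrative placeholder, so substitute whatever your actual configuration file is called (see the repository's sample configuration for the exact schema).

   import subprocess

   # Step 1: generate a synthetic dataset of 100 samples (~250 input tokens, ~50 output tokens each).
   subprocess.run(
       [
           "python", "datasets/generate_dataset.py",
           "--tok_input_length", "250",   # minimum supported value is 25
           "--tok_output_length", "50",
           "--N", "100",
           "--output_file", "datasets/random_text_dataset.jsonl",
       ],
       check=True,  # raise if dataset generation fails
   )

   # Step 2: run the load test with your YAML config (placeholder name) and verbose logging.
   subprocess.run(
       ["python", "load_test.py", "-c", "config.yaml", "--log_level", "debug"],
       check=True,
   )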
Results

The tool provides valuable insight into the performance of your Azure LLM endpoints by generating comprehensive statistics in JSON format, such as throughput, time to first token, time between tokens, and end-to-end latency.

Example JSON output:

   {
     "results": [
       # per-request stats
       ...
     ],
     "config": {
       # the run settings
       ...
       "load_options": {
         "type": "constant",
         "concurrency": 8,
         "duration": 20
         ...
       }
     },
     "summary": {
       # overall stats
       "output_tokens_throughput": 159.25729928295627,
       "input_tokens_throughput": 1592.5729928295625,
       "full_duration": 20.093270540237427,
       "total_requests": 16,
       "complete_request_per_sec": 0.79,  # number of completed requests / full_duration
       "total_failures": 0,
       "failure_rate": 0.0,

       # time per output token
       "tpot": {
         "min": 0.010512285232543946,
         "max": 0.018693844079971312,
         "median": 0.01216195583343506,
         "mean": 0.012808671338217597,
         "percentile_80": 0.012455177783966065,
         "percentile_90": 0.01592913103103638,
         "percentile_95": 0.017840550780296324,
         "percentile_99": 0.018523185420036312
       },

       # time to first token
       "ttft": {
         "min": 0.4043765068054199,
         "max": 0.5446293354034424,
         "median": 0.46433258056640625,
         "mean": 0.4660029411315918,
         "percentile_80": 0.51033935546875,
         "percentile_90": 0.5210948467254639,
         "percentile_95": 0.5295632600784301,
         "percentile_99": 0.54161612033844
       },

       # inter-token latency
       "itl": {
         "min": 0.008117493672586566,
         "max": 0.01664590356337964,
         "median": 0.009861880810416522,
         "mean": 0.010531313198552402,
         "percentile_80": 0.010261738599844314,
         "percentile_90": 0.013813444118403915,
         "percentile_95": 0.015781731761280615,
         "percentile_99": 0.016473069202959836
       },

       # time to ack
       "tt_ack": {
         "min": 0.404374361038208,
         "max": 0.544623851776123,
         "median": 0.464330792427063,
         "mean": 0.46600091457366943,
         "percentile_80": 0.5103373527526855,
         "percentile_90": 0.5210925340652466,
         "percentile_95": 0.5295597910881042,
         "percentile_99": 0.5416110396385193
       },

       "response_time": {
         "min": 2.102457046508789,
         "max": 3.7387688159942627,
         "median": 2.3843793869018555,
         "mean": 2.5091602653265,
         "percentile_80": 2.4795608520507812,
         "percentile_90": 2.992232322692871,
         "percentile_95": 3.541854977607727,
         "percentile_99": 3.6993860483169554
       },

       "output_tokens": {
         "min": 200,
         "max": 200,
         "median": 200.0,
         "mean": 200.0,
         "percentile_80": 200.0,
         "percentile_90": 200.0,
         "percentile_95": 200.0,
         "percentile_99": 200.0
       },

       "input_tokens": {
         "min": 2000,
         "max": 2000,
         "median": 2000.0,
         "mean": 2000.0,
         "percentile_80": 2000.0,
         "percentile_90": 2000.0,
         "percentile_95": 2000.0,
         "percentile_99": 2000.0
       }
     }
   }
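If you want to post-process these numbers, for example to compare runs or feed a dashboard, the summary block can be read with a few lines of Python. The snippet below is a minimal sketch: results.json is a placeholder file name, since the actual output location depends on your configuration, and the keys referenced are the ones shown in the example output above.

   import json

   # Load the results file produced by the run (file name/location depends on your config).
   with open("results.json") as f:
       run = json.load(f)

   summary = run["summary"]
   print("total requests:       ", summary["total_requests"], "failures:", summary["total_failures"])
   print("output tokens/sec:    ", round(summary["output_tokens_throughput"], 1))
   print("median TTFT (s):      ", round(summary["ttft"]["median"], 3))
   print("median TPOT (s):      ", round(summary["tpot"]["median"], 4))
   print("p95 response time (s):", round(summary["response_time"]["percentile_95"], 2))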
Conclusion

llm-load-test-azure is a powerful and versatile tool that simplifies load testing of large language models on Azure. Whether you are a developer or an AI enthusiast, this repository provides the tools you need to ensure your models perform optimally under a variety of conditions. Check out the repository on GitHub and start optimizing your LLM deployments today. Bookmark the GitHub link: maljazaery/llm-load-test-azure (github.com).

Acknowledgements

Special thanks to Jack Sonnen for code contributions, Vlad Feigin for feedback and reviews, and Andrew Thomas, Gunjan Shah, and my manager Joel Borealis for idea generation and discussion. llm-load-test-azure was originally derived from the load testing tool openshift-psap/llm-load-test (github.com); thanks to its creators.

Disclaimer

This tool is unofficial and not a Microsoft product. It is still under development, so feedback and bug reports are welcome.