LLM Load Testing on Azure (Serverless or Managed Compute)

Introduction

In the constantly evolving field of artificial intelligence, the ability to efficiently load test large language models (LLMs) is critical to ensuring optimal performance and scalability. llm-load-test-azure is a tool designed to facilitate load testing of LLMs running on various Azure deployment setups.

Why use llm-load-test-azure?

Load testing an LLM is essential to ensure it can handle real-world usage scenarios. With llm-load-test-azure, developers can identify potential bottlenecks, optimize performance, and confirm that their models are ready for deployment. The tool's flexibility, comprehensive feature set, and support for a wide range of Azure AI models make it an invaluable resource for anyone running LLMs on Azure.

Some scenarios where this tool can be useful:

- You are setting up your endpoints and need to determine how many tokens per minute they can process and what latency to expect.
- You have deployed a large language model (LLM) on your own infrastructure and want to benchmark different compute options for your application.
- You want to measure real token throughput and stress test your provisioned throughput units (PTUs).

Key Features

llm-load-test-azure is packed with features that make it a must-have tool for anyone running LLMs on Azure. Here are some of the highlights:

- Customizable test datasets: Create custom test datasets tailored to your use case and settings. This flexibility ensures that your load tests are as relevant and accurate as possible.
- Load testing options: The tool supports configurable concurrency, duration, and warm-up options, allowing you to simulate different load scenarios and measure model performance under varying conditions.
- Support for multiple Azure AI models: Whether you use Azure OpenAI, Azure OpenAI Embeddings, Azure Model Catalog serverless (MaaS), or managed compute (MaaP), llm-load-test-azure has you covered. The tool's modular design allows developers to integrate new endpoints with minimal effort.
- Detailed results: Get comprehensive statistics such as throughput, time to first token, time between tokens, and end-to-end latency in JSON format, providing valuable insight into model performance.

Getting Started

Using llm-load-test-azure is simple. Here is a quick guide to get you started.

1. Create a dataset (optional): Generate a custom dataset with the generate_dataset.py script, specifying the input and output token lengths, the number of samples, and the output file name.

   python datasets/generate_dataset.py --tok_input_length 250 --tok_output_length 50 --N 100 --output_file datasets/random_text_dataset.jsonl

   --tok_input_length: the length of the input, in tokens (minimum 25)
   --tok_output_length: the length of the output, in tokens
   --N: the number of samples to generate
   --output_file: the name of the output file (default: random_text_dataset.jsonl)

2. Run the tool: Run the load_test.py script with your desired configuration options. Use a YAML configuration file to customize the tool's behavior and specify parameters such as output format, storage type, and warm-up options.

   load_test.py [-h] [-c CONFIG] [-log {warn,warning,info,debug}]

   optional arguments:
     -h, --help            show this help message and exit
     -c CONFIG, --config CONFIG
                           config YAML file name
     -log {warn,warning,info,debug}, --log_level {warn,warning,info,debug}
                           Provide logging level. Example --log_level debug, default=warning
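For convenience, here is a minimal sketch of the same two steps driven from Python. It assumes you have cloned the repository, installed its dependencies, and prepared a YAML configuration file; the file name config.yaml is only an illustrative placeholder, so substitute whatever your actual configuration file is called (see the repository's sample configuration for the exact schema).

   import subprocess

   # Step 1: generate a synthetic dataset of 100 samples (~250 input tokens, ~50 output tokens each).
   subprocess.run(
       [
           "python", "datasets/generate_dataset.py",
           "--tok_input_length", "250",   # minimum supported value is 25
           "--tok_output_length", "50",
           "--N", "100",
           "--output_file", "datasets/random_text_dataset.jsonl",
       ],
       check=True,  # raise if dataset generation fails
   )

   # Step 2: run the load test with your YAML config (placeholder name) and verbose logging.
   subprocess.run(
       ["python", "load_test.py", "-c", "config.yaml", "--log_level", "debug"],
       check=True,
   )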
Results

The tool provides valuable insight into the performance of your Azure LLM endpoints by generating comprehensive statistics in JSON format, such as throughput, time to first token, time between tokens, and end-to-end latency.

Example JSON output:

   {
     "results": [
       # per-request stats
       ...
     ],
     "config": {
       # the run settings
       ...
       "load_options": {
         "type": "constant",
         "concurrency": 8,
         "duration": 20
         ...
       }
     },
     "summary": {
       # overall stats
       "output_tokens_throughput": 159.25729928295627,
       "input_tokens_throughput": 1592.5729928295625,
       "full_duration": 20.093270540237427,
       "total_requests": 16,
       "complete_request_per_sec": 0.79,  # number of completed requests / full_duration
       "total_failures": 0,
       "failure_rate": 0.0,

       # time per output token
       "tpot": {
         "min": 0.010512285232543946,
         "max": 0.018693844079971312,
         "median": 0.01216195583343506,
         "mean": 0.012808671338217597,
         "percentile_80": 0.012455177783966065,
         "percentile_90": 0.01592913103103638,
         "percentile_95": 0.017840550780296324,
         "percentile_99": 0.018523185420036312
       },

       # time to first token
       "ttft": {
         "min": 0.4043765068054199,
         "max": 0.5446293354034424,
         "median": 0.46433258056640625,
         "mean": 0.4660029411315918,
         "percentile_80": 0.51033935546875,
         "percentile_90": 0.5210948467254639,
         "percentile_95": 0.5295632600784301,
         "percentile_99": 0.54161612033844
       },

       # inter-token latency
       "itl": {
         "min": 0.008117493672586566,
         "max": 0.01664590356337964,
         "median": 0.009861880810416522,
         "mean": 0.010531313198552402,
         "percentile_80": 0.010261738599844314,
         "percentile_90": 0.013813444118403915,
         "percentile_95": 0.015781731761280615,
         "percentile_99": 0.016473069202959836
       },

       # time to ack
       "tt_ack": {
         "min": 0.404374361038208,
         "max": 0.544623851776123,
         "median": 0.464330792427063,
         "mean": 0.46600091457366943,
         "percentile_80": 0.5103373527526855,
         "percentile_90": 0.5210925340652466,
         "percentile_95": 0.5295597910881042,
         "percentile_99": 0.5416110396385193
       },

       "response_time": {
         "min": 2.102457046508789,
         "max": 3.7387688159942627,
         "median": 2.3843793869018555,
         "mean": 2.5091602653265,
         "percentile_80": 2.4795608520507812,
         "percentile_90": 2.992232322692871,
         "percentile_95": 3.541854977607727,
         "percentile_99": 3.6993860483169554
       },

       "output_tokens": {
         "min": 200,
         "max": 200,
         "median": 200.0,
         "mean": 200.0,
         "percentile_80": 200.0,
         "percentile_90": 200.0,
         "percentile_95": 200.0,
         "percentile_99": 200.0
       },

       "input_tokens": {
         "min": 2000,
         "max": 2000,
         "median": 2000.0,
         "mean": 2000.0,
         "percentile_80": 2000.0,
         "percentile_90": 2000.0,
         "percentile_95": 2000.0,
         "percentile_99": 2000.0
       }
     }
   }
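If you want to post-process these numbers, for example to compare runs or feed a dashboard, the summary block can be read with a few lines of Python. The snippet below is a minimal sketch: results.json is a placeholder file name, since the actual output location depends on your configuration, and the keys referenced are the ones shown in the example output above.

   import json

   # Load the results file produced by the run (file name/location depends on your config).
   with open("results.json") as f:
       run = json.load(f)

   summary = run["summary"]
   print("total requests:       ", summary["total_requests"], "failures:", summary["total_failures"])
   print("output tokens/sec:    ", round(summary["output_tokens_throughput"], 1))
   print("median TTFT (s):      ", round(summary["ttft"]["median"], 3))
   print("median TPOT (s):      ", round(summary["tpot"]["median"], 4))
   print("p95 response time (s):", round(summary["response_time"]["percentile_95"], 2))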
Conclusion

llm-load-test-azure is a powerful and versatile tool that simplifies load testing of large language models on Azure. Whether you are a developer or an AI enthusiast, this repository provides the tools you need to ensure your models perform optimally under a variety of conditions. Check out the repository on GitHub and start optimizing your LLM deployments today. Bookmark the GitHub link: maljazaery/llm-load-test-azure (github.com).

Acknowledgements

Special thanks to Jack Sonnen for code contributions, Vlad Feigin for feedback and reviews, and Andrew Thomas, Gunjan Shah, and my manager Joel Borealis for idea generation and discussion. llm-load-test-azure was originally derived from the load testing tool openshift-psap/llm-load-test (github.com); thanks to its creators.

Disclaimer

This tool is unofficial and not a Microsoft product. It is still under development, so feedback and bug reports are welcome.