ONNX and NPU Acceleration for Speech on ARM

October 31, 2024

Introduction

Automatic Speech Recognition (ASR) is a technology that lets machines process human speech and convert it into written text, and it is widely used across many fields today. However, low inference speed remains one of the major challenges in ASR. The goal of this project is to investigate ways to accelerate inference of Whisper models by leveraging the Open Neural Network Exchange (ONNX) format with different runtimes and with specialized hardware accelerators such as NPUs.

The project is part of the Industry Exchange Network (IXN) programme, an educational framework that gives master's and undergraduate students opportunities to collaborate with industry companies. It is supported by UCL CS, Microsoft, and Intel, and is led by Professor Dean Mohamedally from UCL CS together with Lee Jonathan Stott and Chris Noring from Microsoft. Intel provided technical support for Intel PCs equipped with AI-accelerator NPUs. The project ran for three months, from June 2024 to September 2024.

Project Overview

This project explores the benefits of ONNX and NPU accelerators for speeding up inference of Whisper models, and develops local Whisper models that leverage these techniques on ARM-based systems. The goals include:

- Investigate end-to-end speech models (Whisper and its variants) and evaluate the performance of different approaches to running the Whisper model.
- Find different ways to convert PyTorch Whisper models to the ONNX format and apply optimization techniques to the ONNX models.
- Compare the performance of the PyTorch and ONNX models.
- Leverage NPUs for inference and compare their performance with other hardware devices.
- Compile ONNX models for ARM and test their compatibility on ARM-based systems, making the models more practical for IoT and embedded devices such as phones and chatbots.

Technical Details

The Whisper model was initially built and evaluated in PyTorch. To improve inference speed, we converted the model to the ONNX format using tools such as PyTorch's torch.onnx.export, the Optimum library, and Microsoft's Olive tool. The exported models were then optimized with graph optimization and quantization techniques to reduce model size and speed up inference. To improve performance further, the team explored different inference engines, such as ONNX Runtime and OpenVINO, which significantly accelerate model execution. The project also investigated the use of NPUs, which, although they did not deliver the expected performance gains, provided insight into future improvements.
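The post does not include the project's own scripts, so the following is only a minimal sketch of the export, quantize, and run pipeline described above. It assumes the openai/whisper-tiny checkpoint from Hugging Face, exports just the encoder, and uses illustrative file names and opset choices; the original work also used the Optimum library and Microsoft's Olive tool.

```python
# Sketch: export the Whisper-tiny encoder to ONNX, optionally quantize it,
# and run it with ONNX Runtime. File names and settings are illustrative.
import torch
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny").eval()
model.config.return_dict = False  # plain tuple outputs export more cleanly

# Whisper consumes 30 s of audio as an 80-bin log-mel spectrogram: (batch, 80, 3000).
dummy_features = torch.randn(1, 80, 3000)

# 1) Export the encoder with PyTorch's built-in exporter.
torch.onnx.export(
    model.model.encoder,
    (dummy_features,),
    "whisper_tiny_encoder.onnx",
    input_names=["input_features"],
    output_names=["last_hidden_state"],
    dynamic_axes={"input_features": {0: "batch"}},
    opset_version=17,
)

# 2) Optional dynamic (weight-only) INT8 quantization to shrink the model.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    "whisper_tiny_encoder.onnx",
    "whisper_tiny_encoder_int8.onnx",
    weight_type=QuantType.QInt8,
)

# 3) Run the exported encoder with ONNX Runtime
#    (graph optimizations are enabled by default).
import onnxruntime as ort

session = ort.InferenceSession(
    "whisper_tiny_encoder.onnx", providers=["CPUExecutionProvider"]
)
hidden_states = session.run(None, {"input_features": dummy_features.numpy()})[0]
print(hidden_states.shape)  # (1, 1500, 384) for whisper-tiny
```

For a complete encoder-decoder export, the Optimum CLI offers a one-command route (optimum-cli export onnx --model openai/whisper-tiny <output_dir>), and Olive can additionally search over optimization passes; the exact tool chain used for each experiment is not detailed in the post.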
Video and Media

The project produced a variety of visualizations and performance charts comparing the inference speed of Whisper models at different levels of optimization across hardware accelerators (CPU, GPU, NPU) and inference engines (ONNX Runtime, OpenVINO). These visual tools were key to understanding the trade-offs and the improvements in model performance across platforms.

Figure 1: Accuracy and speed of various Whisper checkpoints.
Figure 2: Accuracy and speed of different calling methods for Whisper-tiny.
Figure 3: Average inference time per sample (in seconds) across acceleration methods and datasets.
Figure 4: Inference time using CPU, GPU, and NPU on the song-clip dataset.

Results

Figure 1: As the model size increases from the smallest to the largest checkpoint, inference time grows from 0.98 s to 1.37 s, 5.15 s, 17.12 s, and 29.09 s, while the word error rate (WER) falls from 6.78% to 4.39%, 3.49%, 2.43%, and 2.22% in the same order.

Figure 2: Among the four calling methods, the Hugging Face API endpoint had the longest average inference time per sample at 1.82 seconds, while the remaining three each took less than one second: 0.91 s with the Hugging Face pipeline, 0.78 s with the OpenAI Whisper library, and 0.97 s with the Hugging Face Transformers library. In terms of accuracy, the Hugging Face Transformers library achieved the lowest word error rate at 6.78%, and the other three methods all had slightly higher error rates of around 7%.

Figure 3: The project demonstrated up to a 5x improvement in inference speed over the original PyTorch Whisper models when using the OpenVINO-optimized ONNX format. It also showed that these optimized models can be deployed on ARM-based systems, opening up their use in real-time speech applications on IoT and edge devices.

Figure 4: The plot shows the inference time for Whisper-tiny under different execution-device configurations, tested on the song-clip dataset; the CPU, the GPU, and the NPU hardware accelerator were all used in this part of the study. Results are sorted from the configuration with the longest inference time at the top to the one with the shortest at the bottom. Inference times range from 1.371 seconds down to 0.182 seconds: the fastest configuration is the CPU alone and the slowest is MULTI:GPU,NPU. In other words, the CPU outperformed the NPU, and the NPU outperformed the GPU. In certain configurations, MULTI or HETERO mode delivers shorter inference times than the GPU or NPU alone; for example, HETERO:CPU,NPU, HETERO:CPU,GPU, and MULTI:GPU,NPU,CPU all ran faster than the NPU or GPU by themselves. The results for the other three cases are similar: CPU and MULTI:GPU,NPU,CPU give the shortest inference times, and MULTI:GPU,NPU gives the longest.
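To make the device strings in the Figure 4 discussion concrete, here is a minimal sketch of how those configurations can be selected through OpenVINO's Python API. The timing loop, the single warm-up run, and the encoder file name (reused from the earlier export sketch) are illustrative assumptions rather than the project's actual benchmarking code, and the GPU, NPU, MULTI, and HETERO targets only compile on machines with the corresponding Intel hardware and drivers installed.

```python
# Sketch: compile the exported encoder for the device configurations
# compared in Figure 4 and time one inference per configuration.
import time

import numpy as np
import openvino as ov

core = ov.Core()
print("Available devices:", core.available_devices)

# OpenVINO reads the ONNX file directly (file name from the earlier export sketch).
model = core.read_model("whisper_tiny_encoder.onnx")
features = np.random.randn(1, 80, 3000).astype(np.float32)

device_configs = [
    "CPU",
    "GPU",
    "NPU",
    "HETERO:CPU,NPU",
    "HETERO:CPU,GPU",
    "MULTI:GPU,NPU",
    "MULTI:GPU,NPU,CPU",
]

for device in device_configs:
    try:
        compiled = core.compile_model(model, device_name=device)
    except RuntimeError as err:
        print(f"{device}: unavailable ({err})")
        continue
    compiled([features])  # warm-up request
    start = time.perf_counter()
    compiled([features])[compiled.output(0)]
    print(f"{device}: {time.perf_counter() - start:.3f} s per sample")
```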
Lessons Learned

Several important lessons emerged from this project:

- The ONNX format is a game changer: converting models to ONNX significantly improves cross-platform compatibility and inference speed, making them easier to deploy on a variety of devices.
- NPUs need further investigation: while promising, the NPU did not provide the expected performance gains, which indicates that more sophisticated optimizations are needed in future work.
- Quantization requires careful management: reducing the precision of model weights can improve performance, but it comes with a trade-off in accuracy. Future iterations should manage this balance more effectively.

Collaboration and Teamwork

The success of this project was a testament to the power of collaboration. The combination of UCL's academic expertise and Microsoft's industry knowledge ensured that the project ran smoothly and achieved its objectives. Team roles were clearly defined, allowing for a smooth and productive workflow.

Future Development

This project has laid a solid foundation for further research and development. Several areas remain open for future exploration:

- NPU-specific optimization: the team plans to explore more advanced methods to exploit the full potential of the NPU.
- Deployment to real ARM devices: now that the model runs successfully in a Docker container, the next step is to deploy it to a real ARM-based device, such as a smartphone or an IoT device.
- Multilingual testing: so far the project has focused mainly on English speech. Extending the model to other languages will open new possibilities for global applications.

Conclusion

This project successfully demonstrated how state-of-the-art tools and optimizations that leverage ONNX, NPUs, and ARM-based platforms can significantly improve the performance of ASR systems. The results highlight the importance of model optimization for resource-constrained environments and provide a clear path for deploying advanced ASR models in the field.