
ONNX and NPU Acceleration for Speech on ARM



Introduction

Automatic Speech Recognition (ASR) is the technology that lets machines process human speech and convert it into written text, and it is widely used across many fields today. However, slow inference remains one of the major challenges in ASR. The goal of this project is to investigate different ways to accelerate inference of Whisper models by leveraging the Open Neural Network Exchange (ONNX) format with different runtimes and specialized hardware accelerators such as NPUs. The project is part of the Industry Exchange Network (IXN) programme, an educational framework that gives master's and undergraduate students opportunities to collaborate with industry partners. It is supported by UCL CS, Microsoft, and Intel, and is led by Professor Dean Mohamedally from UCL CS together with Lee Stott and Chris Noring from Microsoft. Intel provided technical support for Intel PCs equipped with AI-accelerator NPUs. The project ran for three months, from June 2024 to September 2024.

Project Overview

This project explores the benefits of ONNX and NPU accelerators for accelerating inference of Whisper models and develops local Whisper models that leverage these techniques for ARM-based systems. The goals include:

  • Investigating end-to-end speech models (Whisper and its variants) and evaluating the performance of different approaches to using the Whisper model.
  • Finding different ways to convert PyTorch Whisper models to ONNX format and applying different optimization techniques to the ONNX models.
  • Comparing the performance of the PyTorch models and the ONNX models.
  • Leveraging NPUs for inference and comparing their performance with other hardware devices.
  • Compiling ONNX models for ARM and testing compatibility on ARM-based systems, making the models more practical for use in IoT and embedded devices such as phones and chatbots.

Technical Details

The Whisper model was initially built and evaluated in PyTorch. To improve inference speed, we converted the model to ONNX format using tools such as PyTorch’s torch.onnx.export, the Optimum library, and Microsoft’s Olive tool. These models were then optimized using graph optimization and quantization techniques to reduce model size and speed up inference.
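As an illustration, the sketch below shows one such conversion path: exporting a Whisper checkpoint to ONNX with the Optimum library and then applying ONNX Runtime's dynamic quantization. The exact commands, model size, and file layout used in the project are not given in the article, so the names and paths here are illustrative.

```python
# Sketch: export a Whisper checkpoint to ONNX with Optimum, then apply
# dynamic INT8 quantization with ONNX Runtime. Model name and paths are
# illustrative; the project may equally have used torch.onnx.export or Olive.
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from onnxruntime.quantization import QuantType, quantize_dynamic

# Export the PyTorch Whisper model to ONNX (encoder and decoder graphs).
model = ORTModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny", export=True)
model.save_pretrained("whisper-tiny-onnx")

# Quantize the decoder weights to INT8 to shrink the model and speed up
# CPU inference (the exported file names depend on the Optimum version).
quantize_dynamic(
    "whisper-tiny-onnx/decoder_model.onnx",
    "whisper-tiny-onnx/decoder_model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```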

To further improve performance, the team explored different inference engines, such as ONNX Runtime and OpenVINO, which significantly accelerated model execution. The project also investigated the use of NPUs, which, although they did not deliver the expected performance gains, provided insight into future improvements.
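As a rough sketch of how the two engines can be driven from the same checkpoint (the project's actual benchmark harness is not shown, and the model name, audio file, and devices below are assumptions):

```python
# Sketch: run the same Whisper checkpoint through ONNX Runtime and OpenVINO
# via the Optimum integrations. Model name, audio file, and device are illustrative.
from transformers import AutoProcessor, pipeline
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq  # ONNX Runtime backend
from optimum.intel import OVModelForSpeechSeq2Seq         # OpenVINO backend

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")

# ONNX Runtime (CPU execution provider by default).
ort_model = ORTModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny", export=True)

# OpenVINO; .to() selects the target device, e.g. "CPU", "GPU", or "NPU".
ov_model = OVModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny", export=True)
ov_model.to("CPU")

for name, model in [("onnxruntime", ort_model), ("openvino", ov_model)]:
    asr = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
    )
    print(name, asr("sample.wav")["text"])
```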

Video and Media

The project produced a variety of visualizations and performance charts comparing the inference speed of Whisper models at different levels of optimization across hardware accelerators (CPU, GPU, NPU) and inference engines (ONNX Runtime, OpenVINO). These visual tools were key to understanding the trade-offs and improvements in model performance across platforms.
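The timing measurement behind such charts is straightforward; a minimal sketch is shown below, with the dataset loading and the transcription callable left as placeholders, since the project's own harness is not included in the article.

```python
# Sketch: average inference time per sample for a transcribe(audio) callable.
# `transcribe` and `samples` are placeholders for the model under test and
# the benchmark dataset.
import time

def average_inference_time(transcribe, samples):
    """Return the mean number of seconds per sample."""
    total = 0.0
    for audio in samples:
        start = time.perf_counter()
        transcribe(audio)                      # run a single inference
        total += time.perf_counter() - start
    return total / len(samples)
```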


Figure 1: Accuracy and speed of various Whisper checkpoints.


Figure 2: Accuracy and speed of different calling methods for Whisper-tiny.


Figure 3: Average inference time per sample (in seconds) across acceleration methods and datasets.


Figure 4: Inference time using CPU, GPU, and NPU on the song clip dataset.

Results

Figure 1

As the model size increases from the smallest checkpoint to the largest, the inference time grows from 0.98 s to 1.37 s, 5.15 s, 17.12 s, and 29.09 s, while the WER falls to 6.78%, 4.39%, 3.49%, 2.43%, and 2.22% in the same order.
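The article does not say how the WER figures were computed; a common approach is the jiwer library, sketched below with placeholder transcripts.

```python
# Sketch: word error rate (WER) between reference transcripts and model output,
# using the jiwer library (an assumption; the project's evaluation code is not shown).
import jiwer

references = ["the quick brown fox jumps over the lazy dog"]
hypotheses = ["the quick brown fox jumped over the lazy dog"]

print(f"WER: {jiwer.wer(references, hypotheses):.2%}")  # fraction of word-level errors
```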

Figure 2

Among the four methods, the Hugging Face API endpoint had the longest average inference time per sample at 1.82 seconds, while the remaining three methods each took less than one second: 0.91 s with the Transformers pipeline, 0.78 s with the OpenAI Whisper library, and 0.97 s with the Hugging Face Transformers library. In terms of accuracy, the Hugging Face Transformers library achieved the lowest word error rate at 6.78%, while the other three methods had slightly higher error rates of around 7%.
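For reference, two of these calling methods look roughly as follows; the model size and audio path are illustrative, and the project's exact invocation code is not shown in the article.

```python
# Sketch: two of the Whisper calling methods compared above.

# 1) Hugging Face Transformers pipeline
from transformers import pipeline
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
print(asr("sample.wav")["text"])

# 2) OpenAI whisper library
import whisper
model = whisper.load_model("tiny")
print(model.transcribe("sample.wav")["text"])
```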

Figure 3

The project demonstrated up to a 5x improvement in inference speed for Whisper models in the OpenVINO-optimized ONNX format compared to the PyTorch models. It also demonstrated the feasibility of deploying these optimized models on ARM-based systems, opening up the possibility of using them in real-time voice applications on IoT and edge devices.

Figure 4

The plot shows the inference time for Whisper-tiny using different executor configurations, tested on the song clip dataset; the CPU, the GPU, and the special-purpose NPU accelerator were used in this part of the study. Results are sorted in descending order from top to bottom, from the device configuration with the longest inference time to the one with the shortest. Inference times range from 1.371 seconds down to 0.182 seconds, with the fastest configuration being the CPU alone and the slowest being MULTI:GPU,NPU. Overall, the CPU outperforms the NPU, and the NPU outperforms the GPU. In certain configurations, using MULTI or HETERO mode gives shorter inference times than using the GPU or NPU alone: for example, HETERO:CPU,NPU, HETERO:CPU,GPU, and MULTI:GPU,NPU,CPU all have lower inference times than the NPU or the GPU by themselves. The remaining configurations give similar results, with the CPU (or MULTI:GPU,NPU,CPU) yielding the shortest inference time and MULTI:GPU,NPU the longest.
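The MULTI and HETERO labels above are OpenVINO device strings; a minimal sketch of how such device plans can be selected when compiling a model is shown below (the model path is illustrative, and each device must actually be present on the machine).

```python
# Sketch: compiling a model in OpenVINO for the device plans compared in Figure 4.
# The model path is illustrative.
import openvino as ov

core = ov.Core()
model = core.read_model("whisper-tiny-onnx/encoder_model.onnx")

for device in ["CPU", "GPU", "NPU", "MULTI:GPU,NPU", "HETERO:CPU,NPU"]:
    compiled = core.compile_model(model, device)
    # ...run timed inference requests on `compiled` here...
```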

Lessons Learned

Several important lessons emerged from this project.

  1. The ONNX format is a game changer: converting models to ONNX significantly improves cross-platform compatibility and inference speed, making them easier to deploy on a variety of devices.
  2. Further investigation of NPUs is needed: while promising, NPUs did not deliver the expected performance gains, which indicates that more sophisticated optimizations are needed in future work.
  3. Quantization requires careful management: reducing the precision of model data can improve performance, but comes with a trade-off in accuracy. Future iterations should manage this balance more effectively.

Collaboration and Teamwork

The success of this project was a testament to the power of collaboration. The combination of UCL’s academic expertise and Microsoft’s industry knowledge ensured that the project ran smoothly and achieved its objectives. Team roles were clearly defined, allowing for a smooth and productive workflow.

Future Development

This project has laid a solid foundation for further research and development. There are several areas available for future exploration.

  • NPU-specific optimization: The team plans to explore more advanced methods to exploit the full potential of NPUs.
  • Deployment to real ARM devices: Having run successfully in a Docker container, the model's next step is deployment to a real ARM-based device, such as a smartphone or an IoT device.
  • Multilingual testing: So far, the project has focused mainly on English speech. Extending the model to other languages will open new possibilities for global applications.

Conclusion

This project successfully demonstrated how state-of-the-art tools and optimizations, leveraging ONNX, NPUs, and ARM-based platforms, can significantly improve the performance of ASR systems. The project's results highlight the importance of model optimization for resource-constrained environments and provide a clear path for deploying advanced ASR models in the field.




