Local AI on Windows: Explaining the Audio Editor App Sample

August 20, 2024

Building a Windows app that leverages on-device AI models can seem like a daunting task. There is a lot of work that goes into defining your use case, choosing and tuning the right models, and refining the logic that surrounds them. There is no quick and easy shortcut for getting up to speed on AI on Windows, but by breaking down the sample application we showcased at Build, we can show you how to use on-device models to enhance your own applications.

The sample we will be looking at is an AI-powered audio editor built with WinUI3 and the Windows App SDK. The application itself is minimal in functionality, but it provides a good framework for demonstrating the AI portion of the app.

Audio Smart Trimming

The audio editor uses several models to enable "smart trimming" of audio, which has the following flow:

1. The user uploads an audio file containing recognizable speech.
2. The user provides a theme keyword or phrase and a trim duration.
3. The audio editor produces a trimmed audio clip containing the audio segments most relevant to the given theme.

The input UI for this flow helps visualize exactly what the sample expects. Once Create New Clip is clicked, the new, trimmed clip is loaded into the app and can be played back for verification.

Now let's look at the models used to make this happen.

Enabling Smart Trimming with Silero, Whisper, and MiniLM L6 v2

The "smart trimming" operation relies on three different ONNX models to process the input audio into the output we expect. Let's look at what each model does, what it accomplishes for our use case, and where to find more information about it, in order of use.

Silero Voice Activity Detection (VAD)

We use this model to "smart chunk" the audio into smaller pieces that the transcription model can process. This is necessary because Whisper (our transcription model) can only handle 30 seconds of audio at a time. We can't naively cut the audio into 30-second chunks, because sentences would be cut off mid-word and the resulting transcript would not accurately reflect the structure and grammar of the spoken audio. Instead, we use Silero VAD to detect voice activity and split on pauses in speech, producing chunks that are small enough for Whisper to process but still contain cleanly separated speech. A sketch of this chunking logic follows below. Click here to learn more about Silero VAD itself.
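To make the chunking step more concrete, here is a minimal sketch of how VAD output could be merged into Whisper-sized chunks. The `SpeechSegment` record, the `AudioChunker` class, and the 30-second default are illustrative assumptions rather than the sample's actual types; the real app also has to map these timestamps back onto the raw audio data.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for the start/end timestamps Silero VAD reports
// for each detected stretch of speech.
public record SpeechSegment(double StartSeconds, double EndSeconds);

public static class AudioChunker
{
    // Merge consecutive speech segments into chunks no longer than Whisper's
    // 30-second window, always splitting at a pause between segments.
    // (Assumes no single speech segment is itself longer than the limit.)
    public static List<(double Start, double End)> BuildChunks(
        IReadOnlyList<SpeechSegment> segments, double maxChunkSeconds = 30.0)
    {
        var chunks = new List<(double Start, double End)>();
        if (segments.Count == 0)
        {
            return chunks;
        }

        double chunkStart = segments[0].StartSeconds;
        double chunkEnd = segments[0].EndSeconds;

        for (int i = 1; i < segments.Count; i++)
        {
            var segment = segments[i];
            if (segment.EndSeconds - chunkStart <= maxChunkSeconds)
            {
                // The segment still fits in the current window: extend the chunk.
                chunkEnd = segment.EndSeconds;
            }
            else
            {
                // Adding the segment would exceed the window: close the chunk
                // at the last pause and start a new one.
                chunks.Add((chunkStart, chunkEnd));
                chunkStart = segment.StartSeconds;
                chunkEnd = segment.EndSeconds;
            }
        }

        chunks.Add((chunkStart, chunkEnd));
        return chunks;
    }
}
```

Each resulting (start, end) pair can then be used to slice the waveform before it is handed to the transcription model.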
Whisper Tiny

Once the audio is chunked, we take the output and feed it into the Whisper Tiny model. This model converts speech to text and is probably the most straightforward step in the pipeline: audio chunks go in, transcribed text comes out. We use the Tiny variant of Whisper to optimize for performance, which comes with trade-offs such as English-only support and somewhat lower accuracy, but it works well for our use case. Click here to learn more about Whisper Tiny and its variants on Hugging Face.

MiniLM

The last model we use is a text embedding model called MiniLM. MiniLM maps a sentence into a multidimensional vector space that captures the semantic information the sentence contains. In other words, this model turns everything we understand about language (semantics, grammar, vocabulary, and so on) into a numerical representation of that information. This is useful for all sorts of tasks, but here we use it for semantic search.

For this sample, we take the transcribed text from the Whisper model and the theme phrase to search for, and use MiniLM to generate text embeddings for both. Once we have the embeddings, we can compute the cosine similarity between the theme phrase embedding and the embedding of each transcribed section to extract the sections of audio that are most semantically similar to the theme. From there, it's just a matter of trimming the audio based on those sections' timestamps and loading the result into the player. A minimal sketch of this similarity-and-selection step appears at the end of this post. Click here to learn more about MiniLM on Hugging Face.

Run the sample and check the code

To run the sample yourself or dig into how it all works, head over to the repository where the sample lives. We also have code walkthrough documentation if you want to see how the sample was built. Getting all of the models set up in the project takes a few extra steps (since they can't be checked into source control), but everything you need is laid out in the README. Go check it out!

To learn more about using local models on Windows, see this documentation.
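To close, here is a minimal sketch of the semantic-search step described in the MiniLM section. It assumes the embeddings have already been computed as `float[]` vectors; the `TranscribedSection` record, the `SemanticSearch` class, and the greedy duration-based selection are illustrative assumptions, not the sample's actual implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape for one transcribed stretch of audio plus its MiniLM embedding.
public record TranscribedSection(string Text, double StartSeconds, double EndSeconds, float[] Embedding);

public static class SemanticSearch
{
    // Standard cosine similarity between two embedding vectors of equal length.
    public static double CosineSimilarity(float[] a, float[] b)
    {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.Length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    // Rank sections by similarity to the theme phrase, then greedily keep the
    // most relevant ones until the requested trim duration is filled.
    public static List<TranscribedSection> SelectSections(
        float[] themeEmbedding,
        IEnumerable<TranscribedSection> sections,
        double trimDurationSeconds)
    {
        var selected = new List<TranscribedSection>();
        double total = 0;

        foreach (var section in sections.OrderByDescending(s => CosineSimilarity(themeEmbedding, s.Embedding)))
        {
            double length = section.EndSeconds - section.StartSeconds;
            if (total + length > trimDurationSeconds)
            {
                break;
            }
            selected.Add(section);
            total += length;
        }

        // Restore chronological order so the trimmed clip plays back naturally.
        return selected.OrderBy(s => s.StartSeconds).ToList();
    }
}
```

The timestamps of the selected sections are what the app would then use to cut the original audio down to the final trimmed clip.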