
Local AI on Windows: Explaining the Audio Editor App Sample



Building a Windows app that leverages on-device AI models can seem like a daunting task: it takes real work to define your use cases, select and tune the right models, and refine the logic that surrounds them.

There’s no instant shortcut to getting up to speed on local AI on Windows, but we’ll break down a sample application we showcased at Build to show you how on-device models can enhance your applications.

The sample we will be looking at is an AI-powered audio editor built with WinUI 3 and the Windows App SDK. The application itself is minimal in functionality, but it provides a good framework for demonstrating the AI parts of the app.

[Screenshot: the AI-powered audio editor sample app]

Audio Smart Trimming

The audio editor app uses several models to enable “smart trimming” of audio, which has the following flow (sketched in code after the list):

  1. Users upload audio files containing recognizable speech.
  2. They provide a theme keyword or phrase and a trim period.
  3. The audio editor creates a trimmed audio clip containing the most relevant audio segments related to the given theme.
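Before digging into each model, here is a rough outline of the whole pipeline. This is a hypothetical Python sketch, not code from the sample (the sample itself is C# on WinUI 3), and every helper name is a placeholder for a step expanded in the sections below.

```python
from typing import Dict, List

def smart_trim(audio_path: str, theme: str, trim_seconds: float) -> List[Dict]:
    """Hypothetical outline of the smart-trimming pipeline.

    Each step is expanded with a concrete sketch later in this post;
    all helper names here are placeholders.
    """
    chunks = chunk_by_voice_activity(audio_path)    # 1. Silero VAD: cut at speech pauses
    segments = [transcribe(c) for c in chunks]      # 2. Whisper Tiny: speech-to-text per chunk
    ranked = rank_by_theme(segments, theme)         # 3. MiniLM: embeddings + cosine similarity
    return take_top_segments(ranked, trim_seconds)  # 4. keep the best segments within the budget
```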

The input UI for this flow helps you visualize what exactly is expected from the sample.

[Screenshot: the smart trimming input UI]

Once “Create a new clip” is clicked, the new audio clip is loaded into the app, and you can play it back for verification.

[Screenshot: an uploaded audio clip ready for playback]

Now let’s look at the models used to do this.

Enabling Smart Trimming with Silero, Whisper, and MiniLM-L6-v2

The “smart trimming” operation requires three different ONNX models to process the input audio data into the output we expect. Let’s look at what each model does, what it accomplishes in our use case, and where to find more information about it.

In order of use:

Silero Voice Activity Detection (VAD)

We use this model to “smart chunk” the audio into smaller bits that can be processed by the transcription model.

This is necessary because Whisper (our transcription model) can only process 30-second chunks of audio at a time. We can’t naively cut the audio into 30-second chunks, because the resulting transcript wouldn’t accurately reflect the structure and grammar of the spoken audio: sentences would be cut off partway through.

As a solution, we use Silero VAD to detect voice activity and remove speech pauses, creating audio chunks that are small enough for Whisper to process, but still have properly separated speech segments.
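The sample runs Silero VAD as an ONNX model from C#, but as a rough illustration of this chunking step, here is a minimal Python sketch using the silero-vad model’s torch.hub loader. The 30-second grouping logic is our own simplification, not the sample’s exact algorithm.

```python
import torch

# Load Silero VAD from torch.hub (downloads the model on first use).
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, *_ = utils

SAMPLE_RATE = 16000  # Silero VAD expects 16 kHz mono audio
wav = read_audio('input.wav', sampling_rate=SAMPLE_RATE)

# Detect speech regions; the pauses between them are natural cut points.
speech = get_speech_timestamps(wav, model, sampling_rate=SAMPLE_RATE)

# Group speech regions into chunks of at most 30 s, cutting only at pauses,
# so every chunk fits Whisper's window without splitting a sentence.
MAX_SAMPLES = 30 * SAMPLE_RATE
chunks = []
for ts in speech:
    if chunks and ts['end'] - chunks[-1]['start'] <= MAX_SAMPLES:
        chunks[-1]['end'] = ts['end']   # extend the current chunk through this region
    else:
        chunks.append(dict(ts))         # start a new chunk at this speech region
```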

To learn more about Silero VAD itself, see the model’s GitHub repository.

Whisper Tiny

After the audio is chunked, we take the output and feed it into the Whisper Tiny model. This model converts speech to text and is probably the simplest step in the pipeline: audio chunks go in, transcribed text comes out.

We use the Tiny version of Whisper to optimize performance, at the cost of trade-offs like English-only support and somewhat lower accuracy. It works well for our use case, though.
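Again, the sample runs Whisper Tiny via ONNX from C#; as an illustration, here is what this step might look like in Python with the openai-whisper package, reusing `wav`, `chunks`, and `SAMPLE_RATE` from the VAD sketch above.

```python
import whisper  # pip install openai-whisper

# The English-only Tiny checkpoint trades accuracy for speed.
model = whisper.load_model("tiny.en")

segments = []
for chunk in chunks:  # `wav` and `chunks` come from the VAD sketch above
    # transcribe() accepts a float32 waveform sampled at 16 kHz.
    result = model.transcribe(wav[chunk['start']:chunk['end']].numpy())
    segments.append({
        'start': chunk['start'] / SAMPLE_RATE,  # sample index -> seconds
        'end': chunk['end'] / SAMPLE_RATE,
        'text': result['text'],
    })
```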

To learn more about Whisper Tiny and its variants, see Hugging Face.

MiniLM

The last model we use is a text embedding model called MiniLM. MiniLM maps a written sentence to a multidimensional vector space that encapsulates all the semantic information contained in the sentence. In other words, this model maps everything we understand about language (semantics, grammar, vocabulary, etc.) into a numerical representation of that information. This is useful for all sorts of tasks, but we will use it for semantic retrieval.

For this example, we take the transcribed text from the Whisper model and the input theme phrase to search for, and use MiniLM to generate text embeddings for both. Once we have the embeddings, we can compute the cosine similarity between the theme phrase and each transcribed section, and extract the sections of audio that are most semantically similar to the theme.
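In Python, the embedding-plus-similarity step might look like the sketch below, using the sentence-transformers package and its all-MiniLM-L6-v2 checkpoint (the sample itself does this with the ONNX model from C#); the theme phrase is just an example, and `segments` comes from the Whisper sketch above.

```python
from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 maps each sentence to a 384-dimensional vector.
embedder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

theme = 'dogs and other pets'          # example theme phrase from the user
texts = [s['text'] for s in segments]  # transcribed sections from Whisper

theme_emb = embedder.encode(theme, convert_to_tensor=True)
text_embs = embedder.encode(texts, convert_to_tensor=True)

# Cosine similarity between the theme and every transcribed section.
scores = util.cos_sim(theme_emb, text_embs)[0]

# Rank sections from most to least relevant to the theme.
ranked = sorted(zip(segments, scores.tolist()), key=lambda p: p[1], reverse=True)
```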

Now you just need to trim the audio based on those timestamps and load the result into your player!
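The trimming itself is just audio slicing by timestamp. Here is a minimal sketch using the pydub package (an assumption for illustration; the sample uses its own audio handling), keeping the highest-scoring sections until the requested clip length is filled:

```python
from pydub import AudioSegment  # pip install pydub (requires ffmpeg)

audio = AudioSegment.from_file('input.wav')

trim_budget_ms = 60 * 1000      # e.g. the user asked for a 60-second clip
clip = AudioSegment.empty()
for segment, score in ranked:   # `ranked` comes from the MiniLM sketch above
    start_ms = int(segment['start'] * 1000)
    end_ms = int(segment['end'] * 1000)
    if len(clip) + (end_ms - start_ms) > trim_budget_ms:
        break                   # adding this section would exceed the trim budget
    clip += audio[start_ms:end_ms]

clip.export('trimmed.wav', format='wav')
```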

To learn more about MiniLM, see Hugging Face.

Run the sample and check out the code

To run the sample yourself or learn more about how it all works, head to the repository where the sample lives.

We also have code walkthrough documentation if you want to see how this sample was built.

Getting all the models into the project requires quite a bit of setup (since the model files can’t be checked into source control), but all the necessary steps are laid out in the README. Go check it out!

To learn more about using local models on Windows, see this documentation.




