Practical Guide to Azure Custom Neural Voice: Essential Tips for Success

by info.odysseyx@gmail.com · August 26, 2024

Teaser image created by DALL·E 3

Custom Neural Voice (CNV) is a feature of Azure Cognitive Services that lets you create a personalized synthetic voice for your applications. This text-to-speech feature uses human speech samples as training data to produce a natural-sounding voice for your brand or character.

Recently, while working on a project involving custom voice generation, I ran into some features and hidden issues that are not covered in the official documentation, so I would like to share some tips and tricks in this article. The theoretical aspects are well documented, so the advice here is based mainly on my personal experience. I hope you find these insights useful. Let’s get started!

Audio recording

First, you need to prepare a balanced script. It is more important to have a good mix of questions, exclamations, and statements than to ensure that the training set closely matches the target domain. In short, a good dataset should include:

Statements: 70-80%
Questions: 10-20%, split evenly between rising and falling tones (yes/no questions typically use a rising tone, while wh-questions very commonly use a falling tone)
Exclamations: 10-20%
Short words/phrases: 10%
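
As a rough way to check a draft script against the mix above, here is a minimal Python sketch. It is only a heuristic and rests on assumptions of mine: utterances are classified by their closing punctuation, anything of three words or fewer counts as a short phrase (`SHORT_MAX_WORDS` is a made-up threshold), and the rising/falling split for questions is not checked because it cannot be read reliably from text.

```python
from collections import Counter

# Assumption: utterances with this many words or fewer count as "short".
SHORT_MAX_WORDS = 3

KINDS = ("statement", "question", "exclamation", "short")


def classify(utterance: str) -> str:
    """Label one utterance by length first, then by closing punctuation."""
    text = utterance.strip()
    if len(text.split()) <= SHORT_MAX_WORDS:
        return "short"
    if text.endswith("?"):
        return "question"
    if text.endswith("!"):
        return "exclamation"
    return "statement"


def script_mix(utterances: list[str]) -> dict[str, float]:
    """Return the percentage of each utterance kind, ignoring blank lines."""
    lines = [u for u in utterances if u.strip()]
    if not lines:
        return {kind: 0.0 for kind in KINDS}
    counts = Counter(classify(u) for u in lines)
    return {kind: round(100 * counts[kind] / len(lines), 1) for kind in KINDS}


if __name__ == "__main__":
    draft = [
        "The weather is nice today in Seattle.",
        "Are you coming to the meeting tomorrow?",
        "What a wonderful surprise this is!",
        "Thank you.",
    ]
    for kind, share in script_mix(draft).items():
        print(f"{kind}: {share}%")
```

Running this on the full script before the recording session makes it easier to spot a category that is under-represented while there is still time to add lines.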

Sound editing software

There are several options, such as Adobe Audition and Audacity. I recommend Audacity, not only because Adobe Audition is paid, but also because Audacity’s limited feature set is ideal for our needs: we only need to select the speech, export it, and cut it. Minimalism is the key to success. Audacity also makes it easy to navigate the track and keeps unnecessary tools out of the way.

The File menu in Audacity provides commands to create, open, and save projects, and to import and export audio files. The command for exporting a selection has no keyboard shortcut assigned by default, but you can easily create one, which speeds up the process considerably. In my experience, the same amount of work took me two days in Audacity, compared to four days in Adobe Audition.

Price

Here are my project details:

Model type: Neural V5.2022.05
Engine version: 2023.01.16.0
Training time: 30.48 hours
Data size: 440 utterances
Price: $1,584.27

Pricing may vary depending on the engine version and the number of training hours, but these figures at least give you a reference point.
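
To turn these figures into a rough estimate for a different training run, here is a small Python sketch. It assumes the reported training time of 30.48 is in hours and that cost scales linearly with training hours. Both are my assumptions, not published pricing rules, and the function names are illustrative.

```python
# Figures taken from the project details above.
TRAINING_HOURS = 30.48      # assumption: the reported training time is in hours
TOTAL_PRICE_USD = 1584.27


def effective_hourly_rate(total_usd: float, hours: float) -> float:
    """Return the average price paid per training hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return total_usd / hours


def estimate_training_cost(hours: float, rate_usd_per_hour: float) -> float:
    """Estimate the cost of a run, assuming cost is linear in training hours."""
    return round(hours * rate_usd_per_hour, 2)


if __name__ == "__main__":
    rate = effective_hourly_rate(TOTAL_PRICE_USD, TRAINING_HOURS)
    print(f"Effective rate: ${rate:.2f} per training hour")
    print(f"Rough cost of a 20-hour run: ${estimate_training_cost(20, rate):.2f}")
```

Treat the result as a ballpark only, and check the current Azure pricing page before budgeting.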

Intake form

As you probably know, access is granted only after you complete the intake form, and decisions are made based on eligibility and usage criteria. Before providing any project information, review Microsoft’s Responsible AI Standards. This will allow you to tailor your description and scenario accordingly.

Prepare your audio

The process is very simple. Create a notepad file listing all the utterances and their IDs. Select the utterances one by one, export each one, save it under its ID, and then delete it from the notepad. Choose a zoom level in advance and do not zoom in or out while working: you will get used to the timeline scale and will find it easier to add the 100-200 milliseconds of silence you need.
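
If you want to double-check the silence padding after exporting, the following Python sketch measures the leading and trailing silence of a WAV file using only the standard library. It assumes 16-bit mono PCM exports, and the amplitude threshold of 500 is an arbitrary value of mine that you would need to tune to your recording's noise floor. The function names are illustrative.

```python
import sys
import wave
from array import array


def edge_silence_ms(path: str, threshold: int = 500) -> tuple[float, float]:
    """Return (leading, trailing) silence in milliseconds for a 16-bit mono WAV.

    A sample counts as silent when its absolute amplitude is below `threshold`.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM")
        rate = wav.getframerate()
        samples = array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        # The whole file is below the threshold.
        total = 1000 * len(samples) / rate
        return total, total
    lead = 1000 * loud[0] / rate
    trail = 1000 * (len(samples) - 1 - loud[-1]) / rate
    return lead, trail


def silence_in_range(lead: float, trail: float,
                     low: float = 100.0, high: float = 200.0) -> bool:
    """Check both edges against the 100-200 ms window mentioned above."""
    return low <= lead <= high and low <= trail <= high


if __name__ == "__main__":
    # Usage: python check_silence.py 0001.wav 0002.wav ...
    for wav_path in sys.argv[1:]:
        lead, trail = edge_silence_ms(wav_path)
        status = "ok" if silence_in_range(lead, trail) else "check"
        print(f"{wav_path}: lead={lead:.0f} ms, trail={trail:.0f} ms [{status}]")
```

A check like this does not replace listening to the clips, but it can catch files where the selection was cut too tight or too loose.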
