Acoustic Models

Acoustic Model Development

“A Speech Recognition Engine is Only as Good as its Acoustic Model”

Customized acoustic models are instrumental in developing speech recognition systems that achieve their highest possible recognition accuracy.  If you are striving for excellence in your speech recognition product and want the best customer experience from your speech recognition system, then an acoustic model that is custom fit to your needs is a must.

Unfortunately, most speech systems on the market today provide little ability to modify or custom train their acoustic models.  This can compromise speech recognition results and lead to increased customer dissatisfaction.

MatrixHCI understands the need for better acoustic models and can provide custom tailored and trained acoustic models to meet your needs.  Our speech and acoustic engineers can analyze your speech recognition requirements and design a speech solution that provides your customers with a more positive experience.  MatrixHCI has extensive experience designing, developing, and training acoustic models in even the most challenging speech recognition environments.

Whether you are just starting out in speech or have a system that is not performing to your expectations, MatrixHCI can help you and your customers gain better and more satisfying results through our custom acoustic model and ASR tuning service.

Please contact MatrixHCI about its services, and learn how we can help you meet your needs.  To learn more about acoustic models and what they do, continue reading the acoustic model FAQ:

What is an acoustic model?

An acoustic model is created from audio recordings of speech and their text transcriptions.  Software processes these inputs to create statistical representations of the sounds that make up each word.  The speech recognition decoding engine then uses the acoustic model to actually recognize speech.

Because of this, acoustic models are essential components of a speech recognition system.  Without them, a speech recognition system cannot recognize words.  In addition, if an acoustic model is not properly trained or designed for its target speech environment, the speech system's ability to recognize speech will suffer significantly.

Using a speech recognition system whose acoustic model is not custom trained for its specific speech environment will almost always result in implementation failures, where customers are either unable or unwilling to use the system.  These types of failures can have a significant impact on customer satisfaction and the success of a speech application.


Do I need my own custom acoustic model?

Acoustic models have many dependencies and are subject to a variety of implementation-specific constraints, which rule out generic “plug and play” solutions that fit all speech environments.

If your current speech solution does not meet your expectations, or if you have special speech recognition requirements and want the very best recognition results possible, then it is wise to consider the feasibility of developing a custom acoustic model.

What is the sampling rate and how does it affect the acoustic model?

Audio that is stored digitally has to be captured as a series of samples.  The rate at which the audio is sampled is called the “sampling rate.”  Audio can be encoded at different sampling rates.  Sampling rate is measured in samples per second, with the most common rates being 8 kHz, 16 kHz, 32 kHz, 44.1 kHz, 48 kHz, and 96 kHz.  In addition, the number of bits per sample is variable, with the most common being 8 bits, 16 bits, and 32 bits.

Speech recognition engines work best when the acoustic model has been trained with (i.e., matched to) speech audio that was recorded at the same sampling rate and bits per sample as the speech being recognized.
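As a rough illustration of this matching requirement, the sketch below reads the sampling rate and bits per sample from a WAV file header and compares them with the format a model was trained on.  It assumes uncompressed PCM WAV input and uses only Python's standard `wave` module; the function names and the idea of storing the model's training format as two numbers are hypothetical, not part of any particular recognition engine.

```python
import wave


def wav_format(path):
    """Return (sampling rate in Hz, bits per sample) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        # getsampwidth() reports bytes per sample, so convert to bits.
        return wav.getframerate(), wav.getsampwidth() * 8


def matches_model(path, model_rate_hz, model_bits):
    """True if the recording uses the same format the model was trained on.

    model_rate_hz and model_bits are assumed to be known for the model;
    how they are stored is engine specific.
    """
    return wav_format(path) == (model_rate_hz, model_bits)
```

A check like this could run before audio is handed to the recognizer, so that a mismatch is reported up front instead of surfacing later as poor recognition results.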

Which sampling frequency should I use?

The higher the sampling frequency, the more data (and frequency range) there is for the speech recognition system to use.

However, as mentioned above, the model data and input data should match, so the first limitation is the actual source of your audio.

In telephony-based speech recognition, the limiting factor is the bandwidth at which speech can be transmitted.  For example, a standard landline telephone only has a bandwidth of 64 kbit/s at a sampling rate of 8 kHz and 8 bits per sample (8000 samples per second * 8 bits per sample = 64000 bit/s).  Therefore, for telephony-based speech recognition, you need acoustic models trained with 8 kHz/8-bit speech audio files.
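The bandwidth arithmetic above can be reproduced in a few lines.  This is only a sketch for uncompressed audio; the function name and the `channels` parameter are illustrative assumptions.

```python
def bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Raw (uncompressed) bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


telephony = bit_rate(8000, 8)    # landline telephony format
desktop = bit_rate(16000, 16)    # 16 kHz / 16-bit audio

print(telephony)  # 64000
print(desktop)    # 256000
```

The 16 kHz/16-bit format carries four times as much data per second as the telephony format, which is the storage and processing cost that has to be weighed against the extra detail.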

In the case of Voice Over IP, the codec determines the sampling rate/bits per sample of speech transmission.  If you use a codec with a higher sampling rate/bits per sample for speech transmission (to improve the sound quality), then your acoustic model must be trained with audio data that matches that sampling rate/bits per sample.

For speech recognition on a standard desktop PC, the limiting factor is the sound card.  Most sound cards today can record at sampling rates between 16 kHz and 48 kHz with 8 to 16 bits per sample, and can play back at up to 96 kHz.

Another consideration is processing power.  Audio with a high sampling rate and many bits per sample requires more storage space and CPU power, which can slow the recognition engine down.  A compromise is needed.  Thus for desktop speech recognition, the current standard is acoustic models trained with speech audio data recorded at 16 kHz with 16 bits per sample.


Why don’t all speech recognition solutions offer custom acoustic models?

Many companies offering speech recognition solutions are re-sellers or integrators who do not actually control the underlying speech technology.  The current marketplace simply doesn’t have the technical capacity or flexibility to address acoustic models on an individual customer basis.  Additional reasons why speech recognition companies don’t offer more customized acoustic models are:

The acoustic model has been designed as a closed model that cannot be easily changed.

It may have been licensed from another company or come as part of the “engine.”

The company does not have the technical expertise to fully develop and train a custom acoustic model.

Customizing acoustic models adds time and labor and may not fit the business model of the company and/or there aren’t resources to handle this custom service.

What are the most challenging acoustic model development issues?

There are many challenging issues when it comes to building and training an acoustic model.  A few of these issues are described below:

  1. Language Issues – Different spoken languages can pose a problem when developing acoustic models.  Not only does an acoustic model need to be developed for each language, but problems also arise when spoken languages are intermixed.  An example of this might be when a speaker intermixes names in Spanish while speaking primarily in English.
  2. Accent Issues – Accents can severely compromise speech recognition systems, especially if the acoustic model has not accounted for expected accents.  This is why an acoustic model trained for US English will produce less successful results when used in a UK English setting.
  3. Speech Speed Issues – Since the speech recognition system is looking at pieces of speech in patterns, the speed at which the words are uttered can have an impact on recognition success.  As an example, an acoustic model trained for normal rate speech will not perform well when trying to recognize commands spoken rapidly by an air traffic controller.
  4. Sampling Rate Issues – If there is a mismatch between the sampling rate of the audio used to train the acoustic model and the audio that the system is trying to recognize, there will be serious recognition failures.
  5. Acoustic Background Issues – Background noise, cross talk (multiple speakers), and poor sound quality are just a few of the acoustic interference issues that are cause for concern when building and implementing a speech recognition system.
  6. Over-training/Under-training – Too little training can result in a high word error rate (WER) and poor recognition.  On the other hand, over-training can make the system overly sensitive to its training data and cause failures too.  As providers of custom solutions, we run tuning and testing cycles until the optimal acoustic model performance is reached.
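The word error rate (WER) mentioned in the last item is commonly computed as the number of word substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference.  The sketch below is one minimal way to calculate it, assuming whitespace-separated words and no text normalization; it is an illustration, not a description of MatrixHCI's own tooling.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (previous row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "turn left heading two seven zero" with the output "turn left heading to seven" gives one substitution and one deletion over six reference words, a WER of about 0.33.  Tracking this figure on held-out test audio across tuning cycles is one way to see whether additional training is still helping.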

What are the phases of an acoustic model development process?

MatrixHCI will evaluate your unique speech recognition needs and use this information when designing and training your acoustic model.  Some of the factors to consider when designing and training an acoustic model are:

1. Identification of Speech Recognition Phrases/Grammar

2. Analysis of Audio Source Environment

– Languages
– Accents
– Sampling Rate and Audio Background Quality

3. Audio Corpus Data Acquisition

– Live-in-the-Field Acquisition
– Existing Audio Corpus Data

4. Transcription of Audio Corpus Data
5. Acoustic Data Pruning Prior to Training
6. Gender and Age Group Training
7. Dictionary Development
8. Language Model Design

– Unigram, Bigram, Trigram Design
– Defining the Type of Acoustic Model to Build

9. Acoustic Model Training – Finally Building the Model!
10. Testing Against Test Audio Data
11. Tweaking and Refining the Training Process
12. Garbage In – Garbage Out.  It is a delicate balance deciding what data should be used for acoustic model training.
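To make the unigram, bigram, and trigram design in step 8 more concrete, the sketch below counts word sequences of length one, two, and three in a handful of transcripts.  Counts like these are the raw material for a statistical language model.  The sample phrases and the function name are made up for illustration, and a real design would also involve smoothing and vocabulary decisions that are not shown here.

```python
from collections import Counter


def ngram_counts(sentences, n):
    """Count every run of n consecutive words across the given sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Hypothetical transcripts standing in for a transcribed audio corpus.
transcripts = ["call home", "call the office", "call home now"]

unigrams = ngram_counts(transcripts, 1)
bigrams = ngram_counts(transcripts, 2)
trigrams = ngram_counts(transcripts, 3)

print(unigrams[("call",)])               # 3
print(bigrams[("call", "home")])         # 2
print(trigrams[("call", "home", "now")]) # 1
```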

How Can MatrixHCI Help Develop Your Acoustic Model?

Please contact our sales office to discuss how our distinctive ASR tuning and custom acoustic model services can improve the accuracy of your speech recognition system.

MatrixHCI offers Custom Software Development with an emphasis on Advanced Cutting-Edge Speech Recognition

If you have a need for specialized custom speech recognition solutions, please contact us for a free confidential consultation.