Speech Technology

Offering Custom Speech Technology

Speech Technology Design and Consulting

MatrixHCI is a speech technology consulting company providing ASR engineering, support services, and voice recognition software development to clients who need quality products quickly.  It is also a leading developer in the rapidly growing area of audio-mining technologies, which allow for search and retrieval of audio and video information.

MatrixHCI is not a re-seller or integrator, and is not affiliated with any vendor; we offer leading-edge technical expertise and custom algorithm development for speech recognition.  While we are able to work with a variety of engines and vendors, our core value proposition lies in our proprietary technology for improving speech recognition in challenging environments and extracting greater performance from existing installations.

MatrixHCI is dedicated to raising the quality of your speech applications.  Below is a partial list of our speech technology and programming services:

Audiomining and Audio Text Search
Development of Complete Custom Voice Search Solutions
Integration and Training of Existing Audiomining Applications
Consulting and Management of Text and Video Search Applications
Audio Conditioning
Sampling Rate (Frequency) Changes
Pre & Post Filtering
Acoustic Models
HMM – Statistical and Vector Output from Analog Signal
Adaptive Data-driven Models
Proprietary ASR Engine
Optimized for Use Cases (Decoding, Embedded, Large Vocabulary, Continuous, etc.)
ASR Code Development
Algorithm Development
Language and Grammar Models
Domain Specific Grammar Construction
Analysis and Application Tuning
Semantic Annotation
Dialog Management Systems
AI / NLP – Rules for More Complex HCI



Acoustic Model

The compiled statistical representation of speech against which incoming audio is compared and identified.  Properly training an acoustic model is one way to improve ASR accuracy.


Automatic Speech Recognition

Automatic Speech Recognition (ASR) is commonly synonymous with “speech recognition,” in which the machine decodes human speech.  Specific types of ASR include continuous speech recognition, speaker independent recognition, and speaker dependent recognition.

ASR Engine

Term to describe the core code and functionality of the ASR system.

Call Flow

The design and logic of phone menu systems.  A blueprint that includes all levels of the phone menu, decision trees for user input, and actions generated from user decisions.

Continuous Speech Recognition

Voice recognition for free-form input, such as dictation.  The opposite of constrained input, such as a finite set of commands.


Corpus

A speech corpus is the dataset from which the acoustic model is generated.  The corpus is a database of speech audio along with the accompanying transcriptions.  The database provides a reference or baseline for how sounds map to words and vice versa.

Grammar Engineering

Grammar engineering describes an intelligent system approach beyond a typical “XML” grammar descriptor file.  Using iterative loops that pass ASR output through the NLP inference engine, we can inject reasoning and create intelligent or dynamic grammars.


Human Computer Interaction

Human Computer Interaction, or Interface (HCI), describes the science and art of creating human-to-machine interfaces.  The computer mouse is a good example, although voice is far more complex.


Digital Signal Processing

Digital Signal Processing (DSP) is used to discern human-made signals, which are ordinarily structured, from noise signals, which are chaotic.  DSP describes various techniques to clarify, standardize, or isolate the levels or states of digital signals.


Grammar File

A grammar file is typically an XML reference that defines the words, syntax, and/or phonetics of the most probable words or phrases the system expects to encounter.
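For illustration, a minimal grammar in the W3C SRGS XML format, constraining recognition to “yes” or “no” (the rule name is chosen for this example):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="yesno">
  <rule id="yesno">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>
```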


Decoder

The core of the speech engine that crawls the waveform, finds the phoneme boundaries, and segments the words into text.

Dialog Manager

The design of a “conversational” voice interface.  Dialog is helpful for confirming, adding, or augmenting information.  This component works with the ASR engine and the grammar as high-level structured logic to help the human succeed in their voice-directed goal.


Hidden Markov Model

Hidden Markov Models (HMMs) are statistical models used to score acoustic observations in speech recognition.  As speech is decoded, it can be compared to such models to determine a given phoneme or the likelihood of connecting phonetic units.
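As a toy sketch of how such a model is decoded, the Viterbi algorithm below recovers the most likely hidden state sequence for a two-state HMM.  The state names, observation symbols, and probabilities are invented for illustration; a real acoustic model has many states per phoneme and continuous emission densities.

```python
# Toy Viterbi decode over a two-state HMM.  All names and numbers are
# illustrative only, not taken from any production recognizer.
states = ["s", "t"]                     # hypothetical phonetic units
start = {"s": 0.6, "t": 0.4}            # initial state probabilities
trans = {"s": {"s": 0.7, "t": 0.3},     # state transition probabilities
         "t": {"s": 0.4, "t": 0.6}}
emit = {"s": {"lo": 0.8, "hi": 0.2},    # emission probabilities for two
        "t": {"lo": 0.3, "hi": 0.7}}    # discretized acoustic observations

def viterbi(observations):
    """Return the most likely hidden state path for the observations."""
    # v[state] = probability of the best path ending in that state
    v = {st: start[st] * emit[st][observations[0]] for st in states}
    path = {st: [st] for st in states}
    for obs in observations[1:]:
        v_next, path_next = {}, {}
        for st in states:
            best_prev = max(states, key=lambda p: v[p] * trans[p][st])
            v_next[st] = v[best_prev] * trans[best_prev][st] * emit[st][obs]
            path_next[st] = path[best_prev] + [st]
        v, path = v_next, path_next
    return path[max(states, key=lambda st: v[st])]
```

Running `viterbi(["lo", "lo", "hi"])` returns the path `["s", "s", "t"]`, the state sequence that best explains those observations under this toy model.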


Interactive Voice Response

Interactive Voice Response (IVR) describes the overall telephone voice system, which encapsulates the phone system, ASR engine, grammars, call flow, and other components such as prompts, databases, and call recording software.

Large Vocabulary

Very Large Vocabulary (VLV) describes a near-unbounded voice system, such as “Internet search,” where literally anything might be said.


Multi-modal

Multi-modal stands for multiple modalities and typically refers to human-computer interfaces that involve faculties beyond a keyboard and screen, for instance voice, touch, and/or motion.


Natural Language Processing

Natural Language Processing (NLP) is the area of human-computer interaction (HCI) concerned with words, structure, and meaning.  NLP sits above the speech recognizer to provide added rules.  A hierarchy of rules or weighted outcomes constitutes a knowledge base or inference engine.

Noise Cancellation

There are many types of noise that can interfere with or cause errors in voice recognition systems.  Advanced filters can be used to attenuate noise.  Common noises include reverb, ambient room sounds, phone line echoes, pops, and hisses from breath.


Normalization

Audio normalization is a signal processing technique that increases the amplitude (gain) of the audio waveform without introducing further signal noise.  This is one of many filters used to condition audio for speech recognition and decoding.
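A minimal sketch of one common variant, peak normalization, which scales the whole waveform uniformly so its loudest sample reaches a chosen target level (the `target_peak` value here is an illustrative choice, not a standard):

```python
def peak_normalize(samples, target_peak=0.9):
    """Scale a waveform (samples in the range -1.0..1.0) so its loudest
    sample reaches target_peak.  The gain is applied uniformly, so the
    signal gets louder without any change to its shape."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```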


Out of Vocabulary

Out of Vocabulary (OOV) refers to words that are undefined in a grammar, and which might not be properly recognized.  For instance, in a “Yes or No” question, what happens if the caller says “Billing”?


Phoneme

A phoneme is an elemental unit of speech, comprising the vowel and consonant sounds of a language.  In speech technology, the combinations and timbres of phonemes are modeled so that machines can detect and decode them.

Sampling Rate

Sampling rate describes the number of audio samples captured per second in the encoded audio data.  Many phone applications use 8 kHz audio, while desktop applications are more likely 16 kHz or 22 kHz.  This becomes an issue when an acoustic model built at one sampling rate is applied to audio with a different sampling rate.
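As a rough sketch of the conversion problem, the function below performs 2:1 decimation (e.g., 16 kHz down to 8 kHz) with only a crude two-tap average; a production resampler would apply a proper anti-aliasing low-pass filter first:

```python
def downsample_2x(samples):
    """Naive 2:1 decimation, e.g. 16 kHz -> 8 kHz.  Averaging each pair
    of samples is a crude stand-in for real anti-alias filtering; use a
    proper polyphase resampler for production audio."""
    return [(samples[i] + samples[i + 1]) / 2
            for i in range(0, len(samples) - 1, 2)]
```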

Speaker Dependent

A speaker dependent system is one where the acoustic model is trained to a single individual’s voice-print.  Speaker dependent systems take time to train, but become more accurate as a customized profile of the user’s speech patterns is developed.

Speaker Independent

A speaker independent system is capable of taking immediate input from a large population of users.  This type of system requires no formal training by individual users to begin using the system and relies on finely tuned acoustic models and grammar files to be universally accessible.

Speech Interface

Synonym for VUI, and closely related to HCI and multi-modal.

Speech to Text

Synonym for speech recognition, voice recognition, and ASR.


Text to Speech

Text-to-Speech (TTS) is the opposite of ASR or voice-to-text systems.  In the case of TTS, the computer talks (i.e., provides output) to the user.


Transcription

Transcription, and the verb transcribe, refer to a text-based representation of a given piece of speech audio.  Transcriptions are the alpha and the omega in the sense that they help build the models that power the software that generates more transcriptions.


Unmanned Aerial Vehicle

Unmanned Aerial Vehicle (UAV) describes an aircraft controlled from a remote location, and in this context, by a multi-modal voice interface.


Utterance

An utterance is a complete segment of speech, bounded by silence: when a silence is reached, the utterance is finished.  For a speech application, an utterance can be a single word, like “yes” or “no,” or a complete sentence.

Voice to Text

Synonym for speech recognition, voice recognition, and ASR.


Voice User Interface

Voice User Interface (VUI), like HCI above, describes the science and art of designing, building, testing, and deploying speech-based interfaces for applications such as telephony, navigation, and hands-free systems.


Word Error Rate

Word Error Rate (WER) is a measure used to determine the accuracy of a speech application.  It is commonly used in the testing, tuning, and analytic phases to benchmark progress or problems with a speech system.
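WER is conventionally computed as substitutions plus deletions plus insertions, divided by the number of words in the reference transcript, via a word-level edit distance.  A minimal sketch (the example phrases are invented):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words,
    computed with word-level Levenshtein distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substituted word out of four reference words gives a WER of 0.25:
# word_error_rate("press one for billing", "press one for building")
```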

MatrixHCI offers custom software development with an emphasis on cutting-edge speech recognition.

If you have a need for specialized custom speech recognition solutions, please contact us for a free confidential consultation.