Speech Technology
OFFERING CUSTOM SPEECH TECHNOLOGY
Speech Technology Design, ASR Engineering & Consulting Services
MatrixHCI is a speech technology consulting company providing ASR engineering, support services, and voice recognition code development to clients who need quality products quickly. It is also a leading developer in the rapidly growing field of audio-mining technologies, which enable search and retrieval of audio and video information.
MatrixHCI is not a re-seller or integrator, and is not affiliated with any vendor; we offer leading-edge technical expertise and custom algorithm development for speech recognition. While we can work with a variety of engines and vendors, our core value proposition lies in our proprietary technology for improving speech recognition in challenging environments and extracting greater performance from existing installations.
MatrixHCI is dedicated to raising the quality of your speech applications. Below is a partial list of our speech technology and programming services:
✔ Audiomining and Audio Text Search
✔ Development of Complete Custom Voice Search Solutions
✔ Integration and Training of Existing Audiomining Applications
✔ Consulting and Management of Text and Video Search Applications
✔ Audio Conditioning
✔ Sampling Rate (Frequency) Changes
✔ Pre & Post Filtering
✔ Acoustic Models
✔ HMM – Statistical and Vector Output from Analog Signal
✔ Adaptive Data-driven Models
✔ Proprietary ASR Engine
✔ Optimized for Use Cases (Decoding, Embedded, Large Vocabulary, Continuous, etc.)
✔ ASR Code Development
✔ Algorithm Development
✔ Language and Grammar Models
✔ Domain Specific Grammar Construction
✔ Analysis and Application Tuning
✔ Semantic Annotation
✔ Dialog Management Systems
✔ AI / NLP – Rules for More Complex HCI
Glossary
ACOUSTIC MODEL
The compiled statistical representation of speech against which incoming audio is compared and identified. Properly training an acoustic model is one way to improve ASR accuracy.
ASR
Automatic Speech Recognition (ASR). This acronym is commonly synonymous with “speech recognition” in which the machine decodes human speech. Other specific types of ASR include continuous speech recognition, speaker independent, and speaker dependent recognition.
ASR ENGINE
Term to describe the core code and functionality of the ASR system.
CALL FLOW
The design and logic of phone menu systems. A blueprint that includes all levels of the phone menu, decision trees for user input, and the actions generated from user decisions.
CONTINUOUS SPEECH RECOGNITION
Voice recognition for free-form input, such as dictation. The opposite of constrained input, such as a finite set of commands.
CORPUS
A speech corpus is the dataset by which the acoustic model is generated. The corpus is a database of speech audio along with the accompanying transcriptions. The database provides a reference or baseline for how sounds compare to words and vice versa.
GRAMMAR ENGINEERING
Grammar engineering describes an intelligent system approach beyond a typical “XML” grammar descriptor file. Using iterative loops that pass ASR output through the NLP inference engine, we can inject reasoning and create intelligent or dynamic grammars.
HCI
Human Computer Interaction or Interface (HCI) describes the science and art of creating human-to-machine interfaces. The computer mouse is a good example, although voice is far more complex.
DSP
Digital Signal Processing (DSP) is used to distinguish human-made signals, which are ordinarily structured, from noise signals, which are chaotic. DSP encompasses various techniques to clarify, standardize, or isolate the levels or states of digital signals.
GRAMMAR
A grammar file is typically an XML reference that defines the words, syntax, and/or phonetics of the most probable words or phrases the system expects to encounter.
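As a concrete illustration, a minimal yes/no grammar in the W3C SRGS XML format (one common grammar dialect; vendors also use GRXML and ABNF variants) might look like this sketch:

```xml
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" root="answer" xml:lang="en-US">
  <!-- root rule: the recognizer expects exactly one of these words -->
  <rule id="answer">
    <one-of>
      <item>yes</item>
      <item>no</item>
    </one-of>
  </rule>
</grammar>
```

Constraining the recognizer to a small set of expected phrases like this is what makes command-style applications far more accurate than open dictation.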
DECODER
The core of the speech engine that actually crawls the waveform, finds the phoneme boundaries, and segments the speech into text.
DIALOG MANAGER
The design of a “conversational” voice interface. Dialog is helpful for the purposes of confirming, adding, or augmenting information. This component works with the ASR engine and the grammar as a high-level structured logic to help the human succeed in their voice-directed goal.
HMM
Hidden Markov Models (HMM) are algorithms for deriving statistics and vectors in speech recognition. As speech is decoded, it can be compared to such models to determine a given phoneme or the likelihood of connecting phonetic units.
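To make the idea concrete, here is a toy sketch of the HMM forward algorithm in Python: it computes the total probability that a two-state model (the state names, transition, and emission probabilities below are invented for illustration) produced a short sequence of coarse acoustic labels.

```python
def forward_probability(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm: total probability that the HMM emits `obs`."""
    # alpha[s] = probability of being in state s after the observations so far
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * trans_p[p][s] for p in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

# Toy model: two hidden phoneme-like states emitting coarse acoustic labels.
states = ("S1", "S2")
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"low": 0.8, "high": 0.2}, "S2": {"low": 0.3, "high": 0.7}}

p = forward_probability(("low", "high"), states, start_p, trans_p, emit_p)
```

Real recognizers work the same way in principle, but over thousands of states trained from a speech corpus rather than a hand-written table.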
IVR
Interactive Voice Response (IVR) describes the overall telephone voice system, which encapsulates the phone system, ASR engine, grammars, call flow, and other components, such as prompts, database, and call recording software.
LARGE VOCABULARY
Large vocabulary, or very large vocabulary (VLV), describes a near-unbounded voice system, such as “Internet search,” where literally anything might be said.
MULTI-MODAL
Multi-modal stands for multiple modalities, and typically refers to human computer interfaces that involve faculties beyond a keyboard and screen, for instance voice, touch, and/or motion.
NLP
Natural Language Processing (NLP) is the area of human-computer interaction (HCI) concerned with words, structure, and meaning. NLP sits above the speech recognizer to provide added rules. A hierarchy of rules or weighted outcomes constitutes a knowledge base or inference engine.
NOISE CANCELLATION
Many types of noise can interfere with voice recognition systems and cause errors. Advanced filters can be used to attenuate noise. Common noises include reverb, ambient room sounds, phone-line echoes, pops, and hisses from breath.
NORMALIZATION
Audio normalization is a signal processing technique to increase the amplitude (gain) of the audio waveform without introducing further signal noise. This is one of many filters used to condition audio for speech recognition and decoding.
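A minimal sketch of one common variant, peak normalization, in plain Python (the function name and 0.9 target are illustrative; samples are assumed to be floats on a -1.0 to 1.0 scale):

```python
def normalize_peak(samples, target=0.9):
    """Scale a waveform so its loudest sample reaches `target` of full scale."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    gain = target / peak
    return [s * gain for s in samples]

quiet = [0.05, -0.1, 0.02]
loud = normalize_peak(quiet)  # peak raised to 0.9, relative shape preserved
```

Because every sample is multiplied by the same gain, the waveform's shape (and therefore its phonetic content) is unchanged; only its level is standardized for the decoder.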
OOV
Out of Vocabulary (OOV) refers to words that are undefined in a grammar, and which might not be properly recognized. For instance, in a “Yes or No” question, what happens if the caller says “Billing?”
PHONEME
A phoneme is an elemental unit of speech, such as a vowel or consonant sound. In speech technology, these units and their combinations and timbres are modeled so machines can perform detection and recognition.
SAMPLING RATE
Sampling rate describes the frequency of the encoded audio data. Many phone applications are 8kHz, while desktop applications are more likely 16kHz or 22kHz. This can be an issue when an acoustic model from one sampling rate is applied to another audio with a different sampling rate.
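One way to reconcile mismatched rates is to resample the audio before decoding. The sketch below shows the simplest possible form, decimation from 16kHz to 8kHz by keeping every second sample (a real resampler would low-pass filter first to avoid aliasing; the function name is illustrative):

```python
def downsample(samples, factor):
    """Naive decimation: keep every `factor`-th sample.
    Halving a 16 kHz signal this way yields an 8 kHz signal,
    but a production resampler applies an anti-aliasing filter first."""
    return samples[::factor]

audio_16k = list(range(8))            # stand-in for 16 kHz samples
audio_8k = downsample(audio_16k, 2)   # half as many samples per second
```

The reverse direction (running 8kHz telephone audio against a 16kHz acoustic model) cannot be fixed by resampling alone, since the higher frequencies were never captured.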
SPEAKER DEPENDENT
A speaker dependent system is one where the acoustic model is trained to a single individual’s voice-print. Speaker dependent systems take time to train, but become more accurate as a customized profile of the user’s speech patterns is developed.
SPEAKER INDEPENDENT
A speaker independent system is capable of taking immediate input from a large population of users. This type of system requires no formal training by individual users to begin using the system and relies on finely tuned acoustic models and grammar files to be universally accessible.
SPEECH INTERFACE
Synonym for VUI, and closely related to HCI and multi-modal.
SPEECH TO TEXT
Synonym for speech recognition, voice recognition, and ASR.
TTS
Text-to-Speech (TTS) is the opposite of ASR or Voice-to-Text systems. In the case of TTS, the computer talks (i.e., provides output) to the user.
TRANSCRIPTION
Transcription (and the verb transcribe) refers to a text-based representation of a given piece of speech audio. Transcriptions are the alpha and the omega in the sense that they help build the models that power the software that generates more transcriptions.
UAV
Unmanned Aerial Vehicle (UAV) describes an aircraft controlled from a remote location, and in this context, by a multi-modal voice interface.
UTTERANCE
An utterance is defined as a complete segment of speech. It is bounded by the spoken act: when a silence is reached, the utterance is finished. For a speech application, an utterance can be a single word, such as “yes” or “no,” or a complete sentence.
VOICE TO TEXT
Synonym for speech recognition, voice recognition, and ASR.
VUI
Voice User Interface (VUI), like HCI above, describes the science and art of designing, building, testing, and deploying speech-based interfaces for applications, such as telephony, navigation, and hands-free systems.
WER
Word Error Rate (WER) is a measure used to determine the accuracy of a speech application. It is commonly used in the testing, tuning, and analytic phases to benchmark progress or problems with a speech system.
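WER is typically computed as a word-level edit distance: substitutions, deletions, and insertions needed to turn the recognizer's output into the reference transcript, divided by the reference length. A minimal sketch (the function name is illustrative):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Standard dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution

    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("turn the lights on", "turn lights off")
# one deletion ("the") plus one substitution ("on" -> "off") over 4 reference words
```

Note that WER can exceed 100% when the recognizer inserts more words than the reference contains, which is why it is reported as a rate rather than a percentage of correct words.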
MatrixHCI offers Custom Software Development with an Emphasis on Cutting-Edge Speech Recognition
If you have a need for specialized custom speech recognition solutions, please contact us for a free confidential consultation.