You have probably already used speech recognition in many products, whether on your iPhone or on an Android device, but you may still not know how it works. This post gives a technical overview of speech recognition so you can understand what happens behind the scenes.

Speech recognition fundamentally functions as a pipeline that converts PCM (Pulse Code Modulation) digital audio, captured from a sound source such as a microphone, into recognized speech.

How Speech Recognition Works

Speech recognition works in the following steps:

1. Transform the PCM digital audio into a better acoustic representation.
2. Figure out which phonemes are spoken.
3. Apply a “grammar” so the speech recognizer knows what phonemes to expect. A grammar could be anything from a context-free grammar to full-blown English.
4. Convert the phonemes into words.

I’ll cover each of these steps individually.

1. Transform the PCM digital audio:

First, the sound coming from the microphone is digitized as PCM. Plotted over time, this signal looks like a wave, sampled about 16,000 times per second.
The input to the speech recognizer therefore begins as a stream of 16,000 PCM values per second. Fast Fourier transforms convert short slices of this stream into the frequency domain, and a codebook lookup (described next) boils each slice down to its essential information, producing about 100 feature numbers per second.
The output of this step is a pattern of feature numbers.

The speech recognizer has a database of several thousand frequency-domain graphs (called a codebook) that identify the different types of sounds the human voice can make. Each slice of sound is “identified” by matching it to its closest entry in the codebook, producing a number that describes the sound. This number is called the “feature number.”
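The framing and codebook lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses a plain discrete Fourier transform instead of a real FFT to stay dependency-free, and the names `magnitude_spectrum`, `nearest_codebook_entry`, `feature_numbers`, `tone`, and the two-entry `TOY_CODEBOOK` are all hypothetical. A real codebook would hold several thousand entries.

```python
import cmath
import math

SAMPLE_RATE = 16000        # PCM samples per second, as in the text
FRAMES_PER_SECOND = 100    # feature numbers produced per second
FRAME_SIZE = SAMPLE_RATE // FRAMES_PER_SECOND  # 160 samples per frame


def magnitude_spectrum(frame, num_bins=8):
    """Magnitudes of the first num_bins non-DC bins of a naive DFT."""
    n = len(frame)
    spectrum = []
    for k in range(1, num_bins + 1):
        total = sum(
            sample * cmath.exp(-2j * math.pi * k * i / n)
            for i, sample in enumerate(frame)
        )
        spectrum.append(abs(total) / n)
    return spectrum


def nearest_codebook_entry(spectrum, codebook):
    """Index of the codebook vector closest to spectrum (squared distance)."""
    def squared_distance(entry):
        return sum((a - b) ** 2 for a, b in zip(spectrum, entry))
    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


def feature_numbers(pcm, codebook):
    """Split PCM into 160-sample frames and map each frame to a feature number."""
    features = []
    for start in range(0, len(pcm) - FRAME_SIZE + 1, FRAME_SIZE):
        frame = pcm[start:start + FRAME_SIZE]
        features.append(nearest_codebook_entry(magnitude_spectrum(frame), codebook))
    return features


def tone(freq_hz, num_samples):
    """Synthetic test signal: a pure sine wave standing in for microphone input."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(num_samples)]


# Toy codebook (an assumption for illustration): with 160-sample frames,
# bin k covers k * 100 Hz, so index 1 is 200 Hz and index 4 is 500 Hz.
TOY_CODEBOOK = [
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # energy near 200 Hz
    [0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0],  # energy near 500 Hz
]
```

With this toy codebook, `feature_numbers(tone(200, SAMPLE_RATE), TOY_CODEBOOK)` turns one second of a 200 Hz tone (16,000 samples) into 100 feature numbers, all of them 0.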

2. Figure Out Which Phonemes Are Spoken:

In an ideal world, you could match each feature number to a phoneme. If a segment of audio resulted in feature #52, it could always mean that the user made an “h” sound. Feature #53 might be an “f” sound, etc. If this were true, it would be easy to figure out what phonemes the user spoke. In practice the mapping is not that clean, so the recognizer matches patterns of feature numbers against each phoneme and assigns every possible match a probability.

This probability tells us how likely each candidate match is. The matching algorithm itself is quite complex.
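To make the idea of a match "with a probability" concrete, here is a deliberately simplified sketch. It assumes a hypothetical table `EMISSION` giving the probability of seeing each feature number while a phoneme is spoken, treats every frame as independent, and gives all phonemes equal prior weight. Real engines use far more elaborate models, so read this as an illustration rather than the actual algorithm.

```python
import math

# Hypothetical probabilities of observing a feature number for each phoneme.
EMISSION = {
    "h": {52: 0.6, 53: 0.3, 54: 0.1},
    "f": {52: 0.2, 53: 0.7, 54: 0.1},
}
FLOOR = 1e-6  # small probability for feature numbers missing from a table


def phoneme_probabilities(features, emission=EMISSION):
    """Return a normalized probability for each phoneme given feature numbers."""
    # Sum log probabilities to avoid underflow on long feature sequences.
    log_scores = {
        phoneme: sum(math.log(table.get(f, FLOOR)) for f in features)
        for phoneme, table in emission.items()
    }
    best = max(log_scores.values())
    weights = {p: math.exp(s - best) for p, s in log_scores.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}
```

For the feature sequence `[52, 52, 53]`, this toy table favors “h” with a probability of roughly 0.79, leaving about 0.21 for “f”.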

Once the speech recognizer has identified which phonemes were spoken, figuring out what words were spoken should be an easy task. If the user spoke the phonemes “h eh l oe”, then you know they spoke “hello”. The recognizer should only have to compare the phonemes against a lexicon of pronunciations.
But the recognizer can make mistakes. This is where grammars come in.

3. Apply a “grammar”:

Every speech application has a grammar that tells the recognizer what the user is expected to say. The recognizer matches what it hears against the grammar and returns the result.
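As a rough sketch of how a grammar narrows down the result, suppose the recognizer has produced a few candidate phrases with scores, and the application's grammar is simply a set of phrases it is prepared to accept. Both the `GRAMMAR` set and the `apply_grammar` helper are hypothetical, and real grammars are usually expressed as rules rather than a flat list of phrases.

```python
# Hypothetical grammar for a small voice-command application.
GRAMMAR = {"call home", "call office", "play music"}


def apply_grammar(candidates, grammar=GRAMMAR):
    """Return the best-scoring candidate phrase the grammar allows, or None.

    candidates is a list of (phrase, score) pairs from the recognizer.
    """
    allowed = [(phrase, score) for phrase, score in candidates if phrase in grammar]
    if not allowed:
        return None
    return max(allowed, key=lambda item: item[1])[0]
```

Given the candidates `[("cole home", 0.5), ("call home", 0.4)]`, the higher-scoring “cole home” is rejected because the grammar does not expect it, and “call home” is returned instead.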

4. Convert the phonemes into words:

Once the phonemes are known, the speech recognition engine assembles them into words, again using the grammar to guide it.
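The lexicon comparison can be sketched as follows. Because the phoneme sequence may contain mistakes, this example picks the word whose pronunciation is closest by edit distance, which is a technique chosen here for illustration and not necessarily what any given engine does. The `LEXICON` entries other than “hello” are made-up pronunciations, and the optional `allowed_words` argument stands in for the words the grammar expects.

```python
# Tiny hypothetical lexicon of pronunciations ("hello" is from the text above).
LEXICON = {
    "hello": ["h", "eh", "l", "oe"],
    "yellow": ["y", "eh", "l", "oe"],
    "help": ["h", "eh", "l", "p"],
}


def edit_distance(a, b):
    """Number of phoneme insertions, deletions, and substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # delete x
                current[j - 1] + 1,          # insert y
                previous[j - 1] + (x != y),  # substitute x with y
            ))
        previous = current
    return previous[-1]


def closest_word(phonemes, lexicon=LEXICON, allowed_words=None):
    """Return the lexicon word closest to phonemes, or None if none is allowed."""
    words = [w for w in lexicon if allowed_words is None or w in allowed_words]
    return min(words, key=lambda w: edit_distance(phonemes, lexicon[w]), default=None)
```

Here `closest_word(["h", "eh", "l", "oe"])` returns “hello”, and restricting the search with `allowed_words` shows how a grammar can steer the result when the phonemes are imperfect.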


Speech recognition systems “adapt” to the user’s voice, vocabulary, and speaking style to improve accuracy. A system that has had enough time to adapt to an individual can have one fourth the error rate of a speaker-independent system. Adaptation works because the speech recognizer is often informed (directly or indirectly) by the user whether its recognition was correct, and if not, what the correct recognition is.

The recognizer can adapt to the speaker’s voice and variations of phoneme pronunciations in a number of ways. First, it can gradually adapt the codebook vectors used to calculate the acoustic feature number. Second, it can adapt the probability that a feature number will appear in a phoneme. Both of these are done by weighted averaging.
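Weighted averaging is simple enough to show directly. In this sketch the function names and the `weight` parameter (how strongly one new observation pulls the stored value) are assumptions for illustration. The first function nudges a codebook vector toward what the user's voice actually produced; the second shifts a phoneme's feature probabilities toward the feature number that was just observed.

```python
def adapt_vector(current, observed, weight=0.1):
    """Move a codebook vector a small step toward an observed vector."""
    return [(1 - weight) * c + weight * o for c, o in zip(current, observed)]


def adapt_probabilities(table, observed_feature, weight=0.1):
    """Shift a phoneme's feature probabilities toward the observed feature.

    table maps feature numbers to probabilities that sum to 1; the result
    still sums to 1 because it averages two distributions.
    """
    keys = set(table) | {observed_feature}
    return {
        k: (1 - weight) * table.get(k, 0.0)
        + weight * (1.0 if k == observed_feature else 0.0)
        for k in keys
    }
```

A small `weight` makes adaptation gradual, so a single misrecognized utterance cannot distort the model much.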

The language model can also be adapted in a number of ways. The recognizer can learn new words, and slowly increase probabilities of word sequences so that commonly used word sequences are expected. Both these techniques are useful for learning names.
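One way to picture language model adaptation is a word-pair (bigram) counter that learns from confirmed recognitions. The class below is a hypothetical sketch: it assumes a bigram model with add-one smoothing, which is just one simple choice. Learning a sentence adds its words to the vocabulary and makes its word sequences more probable the next time.

```python
from collections import defaultdict


class AdaptiveBigramModel:
    """Toy language model that adapts as the user confirms recognitions."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.vocabulary = set()

    def learn(self, sentence):
        """Add new words and count each pair of adjacent words."""
        words = sentence.lower().split()
        self.vocabulary.update(words)
        for previous, following in zip(words, words[1:]):
            self.counts[previous][following] += 1

    def probability(self, previous, following):
        """Add-one smoothed probability that following comes after previous."""
        if not self.vocabulary:
            return 0.0
        seen = self.counts.get(previous, {})
        total = sum(seen.values())
        return (seen.get(following, 0) + 1) / (total + len(self.vocabulary))
```

After `learn("call priya sharma")` is called twice, `probability("priya", "sharma")` rises from 0.5 to 0.6, which is how a frequently spoken name gradually becomes an expected word sequence.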


This was a high-level overview of how speech recognition works. It’s not nearly enough detail to actually write a speech recognizer, but it exposes the basic concepts. Most speech recognition engines work in a similar manner, although not all of them do.

This work is not easy, but speech recognizers are becoming smarter every day as new grammars and matching techniques are added and as many speech scientists put continuous effort into the field.

If you want more detail you should purchase one of the numerous technical books on speech recognition.

The author is a founder, CTO, and senior software architect. He is an expert in various mobile and web technologies, including Node.js, Angular.js, and Meteor.js, and a founding member of habilelabs Pvt. Ltd.