How Speech Recognition Works

Overview

This article gives you a technical overview of speech recognition so you can understand how it works and better appreciate the capabilities and limitations of the technology.

Speech recognition fundamentally functions as a pipeline that converts PCM (Pulse Code Modulation) digital audio from a sound card into recognized speech. The elements of the pipeline are:

  1. Transform the PCM digital audio into a better acoustic representation.
  2. Apply a “grammar” so the speech recognizer knows what phonemes to expect. A grammar could be anything from a context-free grammar to full-blown natural language.
  3. Figure out which phonemes are spoken.
  4. Convert the phonemes into words.

Transform the PCM digital audio

The first element of the pipeline converts digital audio coming from the sound card into a format that’s more representative of what a person hears.

The wave format can vary; it may be 16 kHz 8-bit mono/stereo, 8 kHz 16-bit mono, and so forth. Plotted over time, the samples form a wavy line that repeats quasi-periodically while the user is speaking. In this form, the data isn’t useful to speech recognition because it’s too difficult to identify any patterns that correlate with what was actually said.

To make pattern recognition easier, the PCM digital audio is transformed into the “frequency domain.” Transformations are done using a windowed Fast-Fourier Transform (FFT). The output is similar to what a spectrograph produces. In frequency domain, you can identify the frequency components of a sound. From the frequency components, it’s possible to approximate how the human ear perceives the sound.
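
To make this concrete, here is a minimal sketch in Python (using NumPy) of how audio might be cut into short frames, windowed, and converted into frequency-domain amplitudes. The 1/100th-second frame length and the Hann window are illustrative assumptions, not any particular engine’s settings.

import numpy as np

SAMPLE_RATE = 16000              # 16 kHz PCM audio
FRAME_LEN = SAMPLE_RATE // 100   # 1/100th of a second = 160 samples

def frames_to_spectra(pcm):
    """Split PCM samples into 1/100th-second frames and return magnitude spectra."""
    n_frames = len(pcm) // FRAME_LEN
    window = np.hanning(FRAME_LEN)          # taper each frame to reduce spectral leakage
    spectra = []
    for i in range(n_frames):
        frame = pcm[i * FRAME_LEN:(i + 1) * FRAME_LEN] * window
        spectra.append(np.abs(np.fft.rfft(frame)))   # frequency-domain amplitudes
    return np.array(spectra)

# Example: one second of a 440 Hz tone yields 100 frames of 81 frequency amplitudes.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
print(frames_to_spectra(np.sin(2 * np.pi * 440 * t)).shape)   # (100, 81)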

The FFT analyzes every 1/100th of a second and converts the audio data into the frequency domain. Each 1/100th of a second’s results are a graph of the amplitudes of frequency components, describing the sound heard for that 1/100th of a second. The speech recognizer has a database of several thousand such graphs (called a codebook) that identify different types of sounds the human voice can make. The sound is “identified” by matching it to its closest entry in the codebook, producing a number that describes the sound. This number is called the “feature number.” (Actually, there are several feature numbers generated for every 1/100th of a second, but the process is easier to explain assuming only one.)

The input to the speech recognizer began as a stream of 16,000 PCM values per second. By using Fast-Fourier Transforms and the codebook, it is boiled down into essential information, producing 100 feature numbers per second.
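
Continuing the sketch above, the codebook lookup can be illustrated as a nearest-neighbor search. The codebook below is random placeholder data and its size is an assumption; a real engine’s codebook is trained from large amounts of recorded speech.

import numpy as np

rng = np.random.default_rng(0)
CODEBOOK = rng.random((4096, 81))    # placeholder: 4,096 entries, 81 spectral amplitudes each

def to_feature_numbers(spectra):
    """Map each frame's spectrum to the index of its closest codebook entry."""
    features = []
    for frame in spectra:
        distances = np.linalg.norm(CODEBOOK - frame, axis=1)
        features.append(int(np.argmin(distances)))   # this index is the "feature number"
    return features

# 100 spectra per second in, 100 feature numbers per second out.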

Figure Out Which Phonemes Are Spoken

To make the recognition process easier to understand, you first should know how the recognizer determines which phonemes were spoken; grammars come later. Determining the phonemes isn’t straightforward, for several reasons:

  • Every time a user speaks a word, it sounds different. Users do not produce exactly the same sound for the same phoneme.
  • The background noise from the microphone and the user’s office sometimes causes the recognizer to hear a different feature number than it would have if the user were in a quiet room with a high-quality microphone.
  • The sound of a phoneme changes depending on what phonemes surround it. The “t” in “talk” sounds different than the “t” in “attack” and “mist.”

    The sound produced by a phoneme changes from the beginning to the end of the phoneme, and is not constant. The beginning of a “t” will produce different feature numbers than the end of a “t.”

The background noise and variability problems are solved by allowing a feature number to be used by more than just one phoneme, and using statistical models to figure out which phoneme is spoken. This can be done because a phoneme lasts for a relatively long time, 50 to 100 feature numbers, and it’s likely that one or more sounds are predominant during that time. Hence, it’s possible to predict what phoneme was spoken.

Actually, the approximation is a bit more complex than this. I’ll explain by starting at the origin of the process. For the speech recognizer to learn how a phoneme sounds, a training tool is passed hundreds of recordings of the phoneme. It analyzes each 1/100th of a second of these hundreds of recordings and produces a feature number. From these, it learns statistics about how likely it is for a particular feature number to appear in a specific phoneme. Hence, for the phoneme “h,” there might be a 55% chance of feature #52 appearing in any 1/100th of a second, 30% chance of feature #189 showing up, and 15% chance of feature #53. Every 1/100th of a second of an “f” sound might have a 10% chance of feature #52, 10% chance of feature #189, and 80% chance of feature #53.
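
A hedged sketch of that training step: given recordings whose frames have already been labeled with their phoneme, the per-phoneme feature-number statistics can be estimated simply by counting. The labeled input format is an assumption made for illustration.

from collections import Counter, defaultdict

def train_feature_probabilities(labeled_frames):
    """labeled_frames: iterable of (phoneme, feature_number) pairs, one per 1/100th of a second."""
    counts = defaultdict(Counter)
    for phoneme, feature in labeled_frames:
        counts[phoneme][feature] += 1
    probabilities = {}
    for phoneme, counter in counts.items():
        total = sum(counter.values())
        probabilities[phoneme] = {feature: n / total for feature, n in counter.items()}
    return probabilities

# With enough data, probabilities["h"][52] might come out around 0.55, as in the example above.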

The probability analysis done during training is used during recognition. The six feature numbers that are heard during recognition might be:

52, 52, 189, 53, 52, 52

The recognizer computes the probability of the sound being an “h” and the probability of it being any other phoneme, such as “f.” The probability of “h” is:

55% * 55% * 30% * 15% * 55% * 55% ≈ 0.41%

The probability of the sound being an “f” is:

10% * 10% * 10% * 80% * 10% * 10% = 0.0008%

You can see that, given the current data, “h” is the more likely candidate. (For those of you who are mathematical sticklers, you’ll notice that these “probabilities” are no longer true probabilities because they don’t sum to one. From now on, I’ll call them “scores” because they’re un-normalized probabilities.)
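
The same arithmetic as a tiny sketch. The probability tables simply restate the numbers from the text; the small “floor” value for unseen feature numbers is an assumption so that a single mismatch doesn’t zero out a score.

# Per-phoneme chance of each feature number, taken from the example above.
FEATURE_PROBS = {
    "h": {52: 0.55, 189: 0.30, 53: 0.15},
    "f": {52: 0.10, 189: 0.10, 53: 0.80},
}

def score(phoneme, features, floor=1e-6):
    """Multiply the per-frame probabilities to get an un-normalized score."""
    result = 1.0
    for feature in features:
        result *= FEATURE_PROBS[phoneme].get(feature, floor)
    return result

heard = [52, 52, 189, 53, 52, 52]
print(score("h", heard))   # ~0.0041, i.e. 0.41%
print(score("f", heard))   # ~0.000008, i.e. 0.0008%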

The speech recognizer needs to know when one phoneme ends and the next begins. Speech recognition engines use a mathematical technique called “Hidden Markov Models” (HMMs) to figure this out. This article won’t get into the mathematics of how HMMs work, but here’s an intuitive feel. Assume that the recognizer heard a word with an “h” phoneme followed by an “ee” phoneme. The “ee” phoneme has a 75% chance of producing feature #82 every 1/100th of a second, a 15% chance of feature #98, and a 10% chance of feature #52. Notice that feature #52 also appears in “h.” If you saw a lineup of the data, it might look like this:

52, 52, 189, 53, 52, 52, 82, 52, 82, etc.

So, where does the “h” end and the “ee” begin? From looking at the features, you can see that the 52s are grouped at the beginning, and the 82s grouped at the end. The split occurs someplace in between. Humans can eyeball this. Computers use Hidden Markov Models.
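
Here is a hedged sketch of that boundary-finding idea: try every possible split point and keep the one whose combined “h” and “ee” scores are highest. A real HMM does this with dynamic programming over many phonemes at once, but the brute-force version shows the intuition. The probability tables restate the numbers used above.

FEATURE_PROBS = {
    "h":  {52: 0.55, 189: 0.30, 53: 0.15},
    "ee": {82: 0.75, 98: 0.15, 52: 0.10},
}

def score(phoneme, features, floor=1e-6):
    result = 1.0
    for feature in features:
        result *= FEATURE_PROBS[phoneme].get(feature, floor)
    return result

def best_split(features):
    """Return the frame index where 'h' most plausibly ends and 'ee' begins."""
    return max(range(1, len(features)),
               key=lambda i: score("h", features[:i]) * score("ee", features[i:]))

heard = [52, 52, 189, 53, 52, 52, 82, 52, 82]
print(best_split(heard))   # 6: the "ee" most plausibly starts at the first 82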

By the way, the speech recognizer figures out when speech starts and stops because it has a “silence” phoneme, and each feature number has a probability of appearing in silence, just like any other phoneme.

Now, the recognizer can recognize what phoneme was spoken even if there’s background noise or the user’s voice has some variation. However, there’s another problem. The sound of a phoneme changes depending upon what phonemes come before and after it. You can hear this with words such as “he” and “how.” You don’t speak an “h” followed by an “ee” or “ow”; the vowels intrude into the “h,” so the “h” in “he” has a bit of “ee” in it, and the “h” in “how” has a bit of “ow” in it.

Speech recognition engines solve the problem by creating “tri-phones,” which are phonemes in the context of surrounding phonemes. Thus, there’s a tri-phone for “silence-h-ee” and one for “silence-h-ow.” Because there are roughly 50 phonemes in English, you can calculate that there are 50*50*50 = 125,000 tri-phones. That’s just too many for current PCs to deal with, so similar-sounding tri-phones are grouped together.

And now, for the last issue: the sound of a phoneme is not constant. A “t” sound is silent at first, then produces a sudden burst of high-frequency noise, which then fades to silence. Speech recognizers solve this by splitting each phoneme into several segments and generating a different set of feature-number probabilities for each segment. The recognizer figures out where each segment begins and ends in the same way it figures out where a phoneme begins and ends.
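
A small sketch of the bookkeeping implied by the last two paragraphs: enumerate tri-phone labels, split each into beginning/middle/end segments, and tie similar-sounding segments to a shared model. The tiny phoneme list, the three-way split, and the toy tying rule are illustrative assumptions; real engines group tri-phones with data-driven clustering.

from itertools import product

PHONEMES = ["sil", "h", "ee", "ow", "t", "k"]   # a tiny stand-in for the ~50 English phonemes

# Every phoneme in the context of its left and right neighbors: a "tri-phone".
triphones = [f"{left}-{center}+{right}"
             for left, center, right in product(PHONEMES, repeat=3)]
print(len(triphones))   # len(PHONEMES) ** 3; with 50 phonemes this would be 125,000

def tied_model_id(triphone, segment):
    """Map a (tri-phone, segment) pair to a shared model so similar sounds reuse one model."""
    left, rest = triphone.split("-")
    center, right = rest.split("+")
    if segment == 0:
        return f"{left}-{center}:begin"   # the beginning is shaped mostly by the left neighbor
    if segment == 2:
        return f"{center}+{right}:end"    # the end is shaped mostly by the right neighbor
    return f"{center}:middle"             # the middle is treated as context-independent here

print(tied_model_id("sil-h+ee", 0))   # "sil-h:begin"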

After all this work, the speech recognizer has all the mechanics necessary to recognize if a particular phoneme was spoken. An important question still needs answering: How does it determine which phoneme to look for?

A speech recognizer works by hypothesizing a number of different “states” at once. Each state contains a phoneme with a history of previous phonemes. The hypothesized state with the highest score is used as the final recognition result.

When the speech recognizer starts listening, it has one hypothesized state. It assumes the user isn’t speaking and that the recognizer is hearing the “silence” phoneme. Every 1/100th of a second, it hypothesizes that the user has started speaking, and adds a new state per phoneme, creating 50 new states, each with a score associated with it. After the first 1/100th of a second, the recognizer has 51 hypothesized states.

In the next 1/100th of a second, another feature number comes in. The scores of the existing states are recalculated with the new feature. Then, each phoneme has a chance of transitioning to yet another phoneme, so 51 * 50 = 2,550 new states are created. The score of each state is the score of the first 1/100th of a second times the score of the 2nd 1/100th of a second. After 2/100ths of a second, the recognizer has 2,601 hypothesized states.

This same process is repeated every 1/100th of a second. The score of each new hypothesis is the score of its parent hypothesis times the score derived from the new 1/100th of a second. In the end, the hypothesis with the best score is what’s used as the recognition result.

Of course, a few optimizations are introduced.

If the score of a hypothesis is much lower than the highest score, the hypothesis is dropped. This is called pruning. The optimization is intuitively obvious. If the recognizer is millions of times more confident that it heard “h eh l oe” than “z z z z,” there’s not much point in continuing the hypothesis that the recognizer heard “z z z z.” However, if too much is pruned, errors can be introduced because the recognizer might make a mistake about which phoneme was spoken.

Recognizers also optimize by not hypothesizing a transition to a new phoneme every 1/100th of a second. To do this, though, the recognizer must limit what phonemes can follow other phonemes.
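
Putting the last few paragraphs together, here is a hedged sketch of the bookkeeping: every 1/100th of a second, each surviving hypothesis either stays in its current phoneme or transitions to one that may follow it, its score is multiplied by the new frame’s score, and anything far below the best hypothesis is pruned. The stay/transition weights, the beam width, and the toy probability tables are illustrative assumptions, not any engine’s actual values.

def decode(features, frame_score, allowed_next, stay_prob=0.9, beam=1e-6):
    """frame_score(phoneme, feature): score of hearing that feature in that phoneme.
    allowed_next(phoneme): phonemes permitted to follow it.
    Returns the phoneme history of the best-scoring hypothesis."""
    hypotheses = [(("sil",), 1.0)]   # start with a single hypothesis: silence
    for feature in features:
        expanded = []
        for history, score in hypotheses:
            current = history[-1]
            # Either stay in the current phoneme...
            expanded.append((history, score * stay_prob * frame_score(current, feature)))
            # ...or transition to a phoneme that is allowed to follow it.
            for nxt in allowed_next(current):
                expanded.append((history + (nxt,),
                                 score * (1 - stay_prob) * frame_score(nxt, feature)))
        best = max(s for _, s in expanded)
        # Pruning: drop hypotheses whose scores are far below the best one.
        hypotheses = [(h, s) for h, s in expanded if s >= best * beam]
    return max(hypotheses, key=lambda hs: hs[1])[0]

# Toy usage with the probabilities from the earlier examples (silence emits feature #7 here):
FEATURE_PROBS = {
    "sil": {7: 0.90},
    "h":   {52: 0.55, 189: 0.30, 53: 0.15},
    "ee":  {82: 0.75, 98: 0.15, 52: 0.10},
}
heard = [7, 52, 52, 189, 53, 52, 52, 82, 52, 82]
print(decode(heard,
             frame_score=lambda p, f: FEATURE_PROBS[p].get(f, 1e-6),
             allowed_next=lambda p: [q for q in FEATURE_PROBS if q != p]))
# ('sil', 'h', 'ee')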

Reducing Computation and Increasing Accuracy

The speech recognizer now can identify what phonemes were spoken. Figuring out what words were spoken should be an easy task: if the user spoke the phonemes “h eh l oe,” you know they spoke “hello.” The recognizer should only have to compare the recognized phonemes against a lexicon of pronunciations.

It’s not that simple.

  1. The user might have pronounced “hello” as “h uh l oe,” which might not be in the lexicon.
  2. The recognizer may have made a mistake and recognized “hello” as “h uh l oe.”
  3. Where does one word end and another begin?
  4. Even with all these optimizations, speech recognition still requires too much CPU.

To reduce computation and increase accuracy, the recognizer restricts the acceptable inputs from the user. On the whole, this isn’t a bad restriction because:

  • It’s unlikely that the user will speak “zwickangagang” because it’s not a valid word.
  • The user may limit him/herself to a relatively small vocabulary. There are millions of words, but most people use only a few thousand of them a day, and they may need even fewer to communicate with a computer.
  • When people speak, they have a specific grammar that they use. After all, users say, “Open the window,” not “Window the open.”
  • Certain word sequences are more common than others. “New York City” is more common than “New York Aardvark.”
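
The last point is the idea behind a statistical language model: score word sequences by how often one word follows another in training text. Here is a minimal bigram sketch; the counts are made up purely for illustration.

# Made-up counts of how often the second word followed the first in some training text.
BIGRAM_COUNTS = {
    ("new", "york"): 900,
    ("york", "city"): 500,
    ("york", "aardvark"): 1,
}
FOLLOWER_TOTALS = {"new": 1000, "york": 600}   # how often each first word was seen

def sequence_score(words, floor=1e-6):
    """Multiply the chance of each word given the word before it."""
    result = 1.0
    for prev, nxt in zip(words, words[1:]):
        count = BIGRAM_COUNTS.get((prev, nxt), 0)
        total = FOLLOWER_TOTALS.get(prev, 0)
        result *= count / total if count and total else floor
    return result

print(sequence_score(["new", "york", "city"]))      # ~0.75: a common sequence
print(sequence_score(["new", "york", "aardvark"]))  # ~0.0015: much less likely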
