
The Basics of Voice and Speech Recognition Technology

At its core, voice recognition technology is the process of converting spoken audio into text for conversational AI and voice applications.

Voice and speech recognition can be broken down into three stages:

- Automatic speech recognition (ASR): transcribing the audio into text
- Natural language processing (NLP): deriving meaning from the speech data and the transcribed text
- Text-to-speech (TTS): converting text into human-like speech
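The three stages above can be sketched as a toy pipeline. The functions below are stubs with made-up inputs and outputs (the names `asr`, `nlp`, `tts`, and the weather intent are illustrative assumptions, not any real API):

```python
def asr(audio: bytes) -> str:
    """Stub: transcribe audio to text. Real ASR is a large model."""
    return "what is the weather"

def nlp(transcript: str) -> dict:
    """Stub: derive the user's intent from the transcript."""
    return {"intent": "get_weather"}

def tts(text: str) -> bytes:
    """Stub: synthesize speech audio for the response text."""
    return text.encode()

def assistant(audio: bytes) -> bytes:
    """Chain the three stages: hear, understand, respond."""
    transcript = asr(audio)
    intent = nlp(transcript)
    response = "Sunny today" if intent["intent"] == "get_weather" else "Sorry?"
    return tts(response)
```

In a real assistant each stage is a substantial system in its own right; the point here is only how they compose.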

Where we see this play out most commonly is with virtual assistants. Think Amazon Alexa, Apple’s Siri, and Google Home. We
speak, they interpret what we are trying to ask of them, and they respond to the best of their programmed abilities.

The process begins with ASR digitizing a recorded speech sample. The speaker's unique voice is broken up into discrete segments, and the mix of tones in each segment is visualized in the form of a spectrogram.

These spectrograms are produced by the short-time Fourier transform, which divides the audio into short, overlapping timesteps and computes the frequency content of each.
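A minimal short-time Fourier transform can be sketched in a few lines of NumPy. The frame size, hop length, and the 8 kHz test tone below are arbitrary illustrative choices, not values from the article:

```python
import numpy as np

def stft_spectrogram(signal, frame_size=256, hop=128):
    """Split a signal into overlapping windowed frames (timesteps) and
    take the magnitude of each frame's FFT, yielding a spectrogram."""
    window = np.hanning(frame_size)
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)  # shape: (timesteps, frequency_bins)

# A pure 440 Hz tone sampled at 8 kHz: its energy should concentrate
# in the frequency bin nearest 440 Hz in every timestep.
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
spec = stft_spectrogram(tone)
peak_bin = spec[0].argmax()  # bin frequency = peak_bin * sr / frame_size
```

Real ASR front ends typically go one step further and convert these spectrograms to mel-scale features, but the time-frequency decomposition is the same idea.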

Each spectrogram is analyzed and transcribed by a model that predicts the probability of every word in the language's vocabulary. A contextual layer is added to help correct potential mistakes: the algorithm considers both what was said and the likeliest next word, based on its knowledge of the given language.
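A toy bigram model shows how "the likeliest next word" can be predicted from counts. The tiny corpus below is invented purely for illustration; production systems train neural language models on vastly more data:

```python
from collections import Counter, defaultdict

# Invented miniature corpus standing in for real training data.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count bigrams: how often each word follows another.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def likeliest_next(word):
    """Return the most probable next word and its probability."""
    counts = bigrams[word]
    best, n = counts.most_common(1)[0]
    return best, n / sum(counts.values())

# "the" is followed by cat (2x), mat (1x), fish (1x),
# so the model predicts "cat" with probability 0.5.
```

This is the same principle a recognizer uses to prefer "ice cream" over the acoustically similar "I scream" when the surrounding context makes one far more probable.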

Finally, the device will verbalize the best possible response to what it has heard and analyzed using TTS.

It’s not all that unlike how we learn language as children.

From day one of a child’s life, they hear words used all around them. Parents speak to the child knowing they can’t answer yet, but
even though the child doesn’t respond, they are absorbing all kinds of verbal cues, including intonation, inflection, and pronunciation.

This is the input stage. The child’s brain is forming patterns and connections based on how their parents use language. Though
humans are hardwired to listen and understand, we train our entire lives to apply this natural ability to detecting patterns in one or
more languages.
It takes five or six years to be able to have a full conversation, and then we spend the next 15 years in school collecting more data and
increasing our vocabulary. By the time we reach adulthood, we can interpret meaning almost instantly.

Voice recognition technology works in a similar way. The speech recognition software breaks the speech down into bits it can
interpret, converts it into a digital format, and analyzes the pieces of content.

It then makes determinations based on previous data and common speech patterns, making hypotheses about what the user is saying.
After determining what the user most likely said, the smart device can offer back the best possible response.
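That hypothesis-ranking step can be illustrated with the classic "recognize speech" vs. "wreck a nice beach" pair: two acoustically similar candidates disambiguated by how often the model has seen each phrase before. The counts below are invented for illustration:

```python
# Toy "previous data": how often the model has encountered each phrase
# (illustrative counts, not real statistics).
phrase_counts = {
    "recognize speech": 9,
    "wreck a nice beach": 1,
}

def pick_transcript(hypotheses):
    """Choose the candidate transcription seen most often in the past."""
    return max(hypotheses, key=lambda h: phrase_counts.get(h, 0))

# The two phrases sound nearly identical; prior frequency breaks the tie.
```

Real recognizers score thousands of partial hypotheses at once (e.g. with beam search), but the principle is the same: acoustics propose, language statistics dispose.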

But whereas humans have refined our process, we are still figuring out best practices for AI. We have to train these systems in much the same way our parents and teachers trained us, and that involves a lot of manpower, research, and innovation.

Speech Recognition Technology in Action


Shazam is a great example of how speech recognition technology works. The popular app, purchased by Apple in 2018 for $400M, can identify music, movies, commercials, and TV shows from a short audio sample captured with your device's microphone.

When you hit the Shazam button, you start an audio recording of your surroundings. The app differentiates the ambient noise from the intended source material, identifies the song's pattern, and compares the recording to its database. It then pinpoints the specific track that was playing and supplies the information to its curious end user.
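Shazam's actual algorithm hashes constellations of spectrogram peaks, which is far more robust than what follows; the sketch below captures only the basic idea, fingerprinting each frame by its dominant frequency bin and matching against a tiny in-memory "database". All names and parameters are illustrative:

```python
import numpy as np

def fingerprint(signal, frame_size=256, hop=128):
    """Toy fingerprint: the dominant frequency bin of each frame."""
    window = np.hanning(frame_size)
    peaks = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size] * window
        peaks.append(int(np.abs(np.fft.rfft(frame)).argmax()))
    return tuple(peaks)

def best_match(db, fp):
    """Return the track whose stored fingerprint best matches fp."""
    def similarity(a, b):
        return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))
    return max(db, key=lambda name: similarity(db[name], fp))

# Two invented "tracks": pure tones at 440 Hz and 880 Hz, 8 kHz sampling.
sr = 8000
t = np.arange(sr) / sr
db = {
    "Track A": fingerprint(np.sin(2 * np.pi * 440 * t)),
    "Track B": fingerprint(np.sin(2 * np.pi * 880 * t)),
}

# A noisy recording of Track A keeps the same dominant bins, so it
# still matches despite the added noise.
rng = np.random.default_rng(0)
noisy = np.sin(2 * np.pi * 440 * t) + 0.1 * rng.standard_normal(sr)
match = best_match(db, fingerprint(noisy))
```

The similarity-based lookup, rather than an exact hash match, is what lets the toy version tolerate the ambient noise the article describes.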

While this is a neat and simple example among recent innovations in speech technology, the process is not always that clean.
