
Introduction to Artificial Intelligence

Topic – Speech Recognition

Group members: -

Akanksha Kumari (BFT/18/1627)

Prince (BFT/18/513)

Sneha Mahto (BFT/18/1157)

Somya (BFT/18/526)
INTRODUCTION

Speech recognition is a technology that enables a computer to capture the words spoken by a human with the help of a microphone. These words are then recognized by a speech recognizer and, in the end, the system outputs the recognized words. Ideally, a speech recognition engine recognizes every word uttered by a human, but in practice its performance depends on a number of factors. Vocabulary size, multiple users, and noisy environments are the major factors that affect a speech recognition engine.

Speech recognition is also known as automatic speech recognition (ASR) or speech-to-text (STT). It is the process of converting an acoustic signal, captured by a microphone or another peripheral, into a set of words. To achieve speech understanding, linguistic processing can be applied on top of recognition.

The speech recognition process is performed by a software component known as the speech recognition engine. The primary function of the speech recognition engine is to process spoken input and translate it into text that an application understands.

Simply put, it is the process of converting spoken input to text; speech recognition is thus sometimes referred to as speech-to-text.

Speech recognition allows you to provide input to an application with your voice. Just as clicking with your mouse, typing on your keyboard, or pressing a key on the phone keypad provides input to an application, speech recognition allows you to provide input by talking.

HISTORY

The first speech recognition systems were focused on numbers, not words. In 1952, Bell Laboratories designed the “Audrey” system, which could recognize a single voice speaking digits aloud. Ten years later, IBM introduced “Shoebox”, which understood and responded to 16 words in English.
Across the globe other nations developed hardware that could recognize
sound and speech. And by the end of the ‘60s, the technology could
support words with four vowels and nine consonants.

The ‘80s saw speech recognition vocabulary go from a few hundred words
to several thousand words.

In 1978, the Speak & Spell, using a speech chip, was introduced to help children spell out words. The speech chip within would prove to be an important tool for the next phase of speech recognition software. In 1987, the Worlds of Wonder “Julie” doll came out. In an impressive (if not downright terrifying) display, Julie was able to respond to a speaker and had the capacity to distinguish between speakers’ voices.

Three short years after Julie, the world was introduced to Dragon,
debuting its first speech recognition system, the “Dragon Dictate”. Around
the same time, AT&T was playing with over-the-phone speech recognition
software to help field their customer service calls. In 1997, Dragon
released “Naturally Speaking,” which allowed for natural speech to be
processed without the need for pauses. What started out as a painfully
simple and often inaccurate system is now easy for customers to use.

By the year 2001, speech recognition technology had achieved close to 80% accuracy. For most of the decade there weren’t a lot of advancements until Google arrived with the launch of Google Voice Search. Because it was an app, this put speech recognition into the hands of millions of people. It was also significant because the processing power could be offloaded to its data centres. Not only that, Google was collecting data from billions of searches, which could help it predict what a person is actually saying. At the time, Google’s English Voice Search System included 230 billion words from user searches.

In 2010, Google made a game-changing development which brought speech recognition technology to the forefront of innovation: the Google Voice Search app. It aimed to reduce the hassle of typing on your phone’s tiny keyboard, and was the first of its kind to utilize cloud data centres. It was also personalized to your voice and was able to ‘learn’ your speech patterns for higher accuracy. This all paved the way for Siri.

One year later in 2011, Apple debuted ‘Siri’. ‘She’ became instantly
famous for her incredible ability to accurately process natural utterances.
And, for her ability to respond using conversational – and often shockingly
sassy – language. You’re sure to have seen a few screen-captures of her
pre-programmed humour floating around the internet. Her success,
boosted by zealous Apple fans, brought speech recognition technology to
the forefront of innovation and technology. With the ability to respond
using natural language and to ‘learn’ using cloud-based processing, Siri
catalysed the birth of other likeminded technologies such as Amazon’s
Alexa and Microsoft’s Cortana.

Terms and Concepts

Following are a few of the basic terms and concepts that are fundamental
to speech recognition.

Utterances

When the user says something, this is known as an utterance. An utterance is any stream of speech between two periods of silence. Utterances are sent to the speech engine to be processed.

Silence

Silence, in speech recognition, is almost as important as what is spoken, because silence delineates the start and end of an utterance. Here's how it works. The speech recognition engine is "listening" for speech input. When the engine detects audio input (in other words, a lack of silence), the beginning of an utterance is signaled. Similarly, when the engine detects a certain amount of silence following the audio, the end of the utterance occurs.

Utterances are sent to the speech engine to be processed. If the user doesn’t say anything, the engine returns what is known as a silence timeout: an indication that no speech was detected within the expected timeframe. The application then takes an appropriate action, such as re-prompting the user for input.
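
To make this concrete, here is a minimal sketch using Microsoft's managed System.Speech API (Windows, .NET Framework; it requires a reference to the System.Speech assembly) showing how silence-related timeouts might be configured and how a silence timeout could be handled. The specific timeout values are illustrative assumptions, not recommendations.

using System;
using System.Speech.Recognition;

class SilenceTimeoutDemo
{
    static void Main()
    {
        using (var engine = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US")))
        {
            engine.LoadGrammar(new DictationGrammar());
            engine.SetInputToDefaultAudioDevice();

            // How long the engine waits for any speech before giving up, and how much
            // trailing silence marks the end of an utterance.
            engine.InitialSilenceTimeout = TimeSpan.FromSeconds(5);
            engine.EndSilenceTimeout = TimeSpan.FromSeconds(0.7);

            engine.SpeechRecognized += (s, e) =>
                Console.WriteLine("Heard: " + e.Result.Text);
            engine.RecognizeCompleted += (s, e) =>
            {
                if (e.InitialSilenceTimeout)
                    Console.WriteLine("Silence timeout: no speech detected, re-prompting the user...");
            };

            engine.RecognizeAsync(RecognizeMode.Single);   // listen for a single utterance
            Console.ReadLine();                            // keep the console app alive
        }
    }
}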

Pronunciations

The speech recognition engine uses all sorts of data, statistical models,
and algorithms to convert spoken input into text. One piece of information
that the speech recognition engine uses to process a word is its
pronunciation, which represents what the speech engine thinks a word
should sound like.
Grammars

Grammars define the domain, or context, within which the recognition engine works. The engine compares the current utterance against the words and phrases in the active grammars. If the user says something that is not in the grammar, the speech engine will not be able to decipher it correctly.
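
As an illustration, the hypothetical sketch below (again using the System.Speech API on Windows) loads a small command grammar; the phrase list is made up for this example. Anything spoken that falls outside the active grammar surfaces through the rejection event instead of being recognized.

using System;
using System.Speech.Recognition;

class GrammarDemo
{
    static void Main()
    {
        var engine = new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US"));

        // The grammar defines the domain: only these phrases can be recognized.
        var commands = new Choices("open file", "save file", "close window");
        engine.LoadGrammar(new Grammar(new GrammarBuilder(commands)) { Name = "commands" });
        engine.SetInputToDefaultAudioDevice();

        engine.SpeechRecognized += (s, e) =>
            Console.WriteLine("Command: " + e.Result.Text);
        // Speech that is not in the active grammar cannot be deciphered correctly.
        engine.SpeechRecognitionRejected += (s, e) =>
            Console.WriteLine("Sorry, that is not in the grammar.");

        engine.RecognizeAsync(RecognizeMode.Multiple);   // keep listening
        Console.ReadLine();
    }
}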

TYPES

Speech recognition can be classified into the following types:

Isolated Speech

Isolated speech usually involves a pause between two utterances; it doesn’t mean that it only accepts a single word, but rather that it requires one utterance at a time (and it usually does the processing during the pauses).

Connected Speech

Connected speech is similar to isolated speech but allows separate utterances with a minimal pause between them.

Continuous speech

Continuous speech allows the user to speak almost naturally; it is also called computer dictation.

Spontaneous Speech

At a basic level, it can be thought of as speech that is natural sounding and not rehearsed. An automatic speech recognition system with spontaneous speech ability should be able to handle a variety of natural speech features such as words being run together, “ums” and “ahs”, and even slight stutters.

VOICE VERIFICATION/IDENTIFICATION

Some ASR systems have the ability to identify specific users.

Speaker-dependent software works by learning the unique characteristics of a single person's voice, in a way similar to voice recognition. New users must first "train" the software by speaking to it, so the computer can analyse how the person talks. This often means users have to read a few pages of text to the computer before they can use the speech recognition software.

Speaker-independent software is designed to recognize anyone's voice, so no training is involved. This means it is the only real option for applications such as interactive voice response systems, where businesses can't ask callers to read pages of text before using the system. The downside is that speaker-independent software is generally less accurate than speaker-dependent software.

A third variation of speaker models, called speaker-adaptive, is now emerging. Speaker adaptation refers to technology whereby a speech recognition system is adapted to the acoustic features of a specific user using an extremely small sample of utterances while the system is being used.

Technical Feasibility

There are many components that make up our system: hardware, software, and human components.

Hardware components

Network communication: a modem for connecting to the internet, and connecting wires.

Computer components

Computer devices which are used to implement the application (the Speech Recognition System).

Component    Minimum    Recommended
CPU          1.6 GHz    2.53 GHz
RAM          2 GB       4 GB
Human Components

Programmers, analysts, designers, etc.

Software Components

Visual Studio 2015: for building the project, creating the Windows Forms application and designing the interfaces.

MySQL: for managing the database (creating tables, storing the data).

Word processor: for writing the project report.

Programming language: The programming language is C SHARP (C#). It is easy to learn and is used to create Windows Forms applications; it is also a well-known, high-level programming language. The Microsoft Speech SDK is one of the many tools that enable a developer to add speech capability to an application.

C# is an open-source language and runs on Windows, Mac, and Linux. It helps you develop Windows Store applications, Android apps, and iOS apps. It can also be used to build back-end and middle-tier frameworks and libraries. It supports language interoperability, which means that C# can access code written in any .NET-compliant language.

C# can run on a variety of computer platforms, so a developer can easily reuse code. C# supports operator overloading and pre-processor directives, which can help when defining speech recognition grammars. With this language, we can easily handle speech recognition events. Freelance work in this sector can also be found online.
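
As a minimal sketch of what this section describes, the hypothetical Windows Forms program below uses the managed System.Speech API (the .NET wrapper around Microsoft's speech technology; it needs Windows and a reference to the System.Speech assembly) to handle a speech recognition event and append dictated text to a text box. It is only a starting point under those assumptions, not the project's actual code.

using System;
using System.Speech.Recognition;   // requires a reference to System.Speech.dll
using System.Windows.Forms;

// A minimal Windows Forms window that appends dictated speech to a text box.
public class DictationForm : Form
{
    private readonly TextBox _output = new TextBox { Multiline = true, Dock = DockStyle.Fill };
    private readonly SpeechRecognitionEngine _engine =
        new SpeechRecognitionEngine(new System.Globalization.CultureInfo("en-US"));

    public DictationForm()
    {
        Text = "Speech Recognition Demo";
        Controls.Add(_output);
    }

    protected override void OnLoad(EventArgs e)
    {
        base.OnLoad(e);
        _engine.LoadGrammar(new DictationGrammar());      // free-form dictation
        _engine.SetInputToDefaultAudioDevice();
        _engine.SpeechRecognized += OnSpeechRecognized;   // the speech recognition event
        _engine.RecognizeAsync(RecognizeMode.Multiple);   // keep listening continuously
    }

    private void OnSpeechRecognized(object sender, SpeechRecognizedEventArgs e)
    {
        // Marshal back to the UI thread before touching the text box.
        BeginInvoke(new Action(() => _output.AppendText(e.Result.Text + " ")));
    }

    [STAThread]
    static void Main()
    {
        Application.EnableVisualStyles();
        Application.Run(new DictationForm());
    }
}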
Components of a Speech Recognition System

• A speech capturing device: It consists of a microphone, which converts the sound wave into electrical signals, and an analog-to-digital converter, which samples and digitizes the analog signal into discrete data that the computer can understand.

• A digital signal module or processor: It performs processing on the raw speech signal, such as frequency-domain conversion and retaining only the required information.

• Pre-processed signal storage: The pre-processed speech is stored in memory to carry out the further tasks of speech recognition.

• Reference speech patterns: The system contains predefined speech patterns, or templates, already stored in memory to be used as the reference for matching.

• Pattern matching algorithm: The unknown speech signal is compared with the reference speech patterns to determine the actual words or the pattern of words. A toy illustration of this matching step is sketched below.
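
The components above can be tied together with a toy sketch: the hypothetical code below compares a feature vector extracted from an unknown utterance against stored reference templates using Euclidean distance and picks the nearest one. Real engines use far richer features and statistical models (such as hidden Markov models or neural networks); this only illustrates the idea of template matching.

using System;
using System.Collections.Generic;
using System.Linq;

class TemplateMatcher
{
    // Hypothetical reference patterns: one pre-computed feature vector per word.
    static readonly Dictionary<string, double[]> References = new Dictionary<string, double[]>
    {
        { "yes",  new[] { 0.9, 0.1, 0.3 } },
        { "no",   new[] { 0.2, 0.8, 0.5 } },
        { "stop", new[] { 0.4, 0.4, 0.9 } },
    };

    // Euclidean distance between two feature vectors of equal length.
    static double Distance(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());

    // The pattern matching step: pick the reference pattern closest to the unknown signal.
    static string Recognize(double[] unknownFeatures) =>
        References.OrderBy(r => Distance(r.Value, unknownFeatures)).First().Key;

    static void Main()
    {
        var unknown = new[] { 0.85, 0.15, 0.35 };   // features of an unknown utterance
        Console.WriteLine(Recognize(unknown));      // prints "yes"
    }
}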
The major steps in producing speech from text are as follows:

Structure analysis: process the input text to determine where paragraphs, sentences and other structures start and end. For most languages, punctuation and formatting data are used in this stage.

Text pre-processing: analyze the input text for special constructs of the
language. In English, special treatment is required for abbreviations,
acronyms, dates, times, numbers, currency amounts, email addresses
and many other forms. Other languages need special processing for these
forms and most languages have other specialized requirements.

The remaining steps convert the processed text to speech:

Text-to-phoneme conversion: convert each word to phonemes. A phoneme is a basic unit of sound in a language. US English has around 45 phonemes, including the consonant and vowel sounds. For example, "times" is spoken as four phonemes: "t ay m s". Different languages have different sets of sounds (different phonemes). For example, Japanese has fewer phonemes and includes sounds not found in English, such as "ts" in "tsunami".

Prosody analysis: process the sentence structure, words and phonemes to determine appropriate prosody for the sentence. Prosody includes many of the features of speech other than the sounds of the words being spoken. This includes the pitch (or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Correct prosody is important for making speech sound right and for correctly conveying the meaning of a sentence.

Waveform production: finally, the phonemes and prosody information are used to produce the audio waveform for each sentence. There are many ways in which the speech can be produced from the phoneme and prosody information. Most current systems do it in one of two ways: concatenation of chunks of recorded human speech, or formant synthesis using signal processing techniques based on knowledge of how phonemes sound and how prosody affects those phonemes. The details of waveform generation are not typically important to application developers.
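
In practice, application developers rarely implement these stages themselves; a synthesis engine runs them behind a simple API. The sketch below uses Microsoft's System.Speech.Synthesis API (Windows, .NET Framework) with a PromptBuilder to hint at prosody (rate, emphasis and pauses); the sentence and style settings are arbitrary examples.

using System;
using System.Speech.Synthesis;   // requires a reference to System.Speech.dll

class TtsDemo
{
    static void Main()
    {
        using (var synth = new SpeechSynthesizer())
        {
            synth.SetOutputToDefaultAudioDevice();

            // The engine handles structure analysis, text normalization, phoneme
            // conversion and waveform production internally; the PromptBuilder
            // only lets us influence the prosody.
            var prompt = new PromptBuilder();
            prompt.StartStyle(new PromptStyle { Rate = PromptRate.Slow, Emphasis = PromptEmphasis.Strong });
            prompt.AppendText("Speech recognition");
            prompt.EndStyle();
            prompt.AppendText(" converts speech to text,");
            prompt.AppendBreak(TimeSpan.FromMilliseconds(400));   // an explicit pause
            prompt.AppendText(" and speech synthesis converts text back to speech.");

            synth.Speak(prompt);
        }
    }
}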

PROCESS

The basic principle of voice recognition involves the fact that speech or
words spoken by any human being cause vibrations in air, known as
sound waves. These continuous or analog waves are digitized and
processed and then decoded to appropriate words and then appropriate
sentences.

When we speak, we create vibrations in the air. The analog-to-digital converter (ADC) translates this analog wave into digital data that the computer can understand. To do this, it samples, or digitizes, the sound by taking precise measurements of the wave at frequent intervals. The system filters the digitized sound to remove unwanted noise, and sometimes to separate it into different bands of frequency.

Next, the signal is divided into small segments. The program then matches these segments to known phonemes of a language. A phoneme is the smallest element of a language: a representation of the sounds we make and put together to form meaningful expressions.

The program examines phonemes in the context of the other phonemes around them. It runs the contextual phoneme plot through a complex statistical model and compares it to a large library of known words, phrases and sentences. The program then determines what the user was probably saying and either outputs it as text or issues a computer command.
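
As a simplified illustration of the early steps described here (digitized samples being cut into small segments, with silence separated from speech), the hypothetical snippet below splits an array of samples into fixed-length frames and flags each frame as speech or silence based on its energy. Real systems go on to extract spectral features and run statistical models; none of that is shown here, and the frame size and threshold are arbitrary assumptions.

using System;

class FrameEnergyDemo
{
    const int FrameSize = 160;            // e.g. 20 ms at an 8 kHz sampling rate
    const double SilenceThreshold = 0.01; // arbitrary energy threshold for this toy example

    // Average energy of one frame of digitized samples.
    static double Energy(double[] samples, int start, int length)
    {
        double sum = 0;
        for (int i = start; i < start + length && i < samples.Length; i++)
            sum += samples[i] * samples[i];
        return sum / length;
    }

    static void Main()
    {
        // Stand-in for ADC output: a short burst of "speech" surrounded by near-silence.
        var rng = new Random(0);
        double[] samples = new double[1600];
        for (int i = 480; i < 960; i++)
            samples[i] = Math.Sin(2 * Math.PI * 440 * i / 8000.0) * 0.5 + (rng.NextDouble() - 0.5) * 0.01;

        // Divide the signal into small segments and label each one.
        for (int start = 0; start < samples.Length; start += FrameSize)
        {
            double e = Energy(samples, start, FrameSize);
            string label = e > SilenceThreshold ? "speech" : "silence";
            Console.WriteLine($"frame at {start,4}: energy {e:F4} -> {label}");
        }
    }
}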
PROS

Access – For writers with physical disabilities that prevent them from using
a keyboard and mouse, being able to issue voice commands and dictate
words into a text document is a significant advantage.

Spelling – you will have access to the same editing tools as a standard
word processing solution. Of course, nothing is 100 percent accurate
(yet), but the software will catch the majority of spelling and grammatical
errors.

Speed – the software can capture your speech at a faster rate than you
might normally type. So, it is now possible to get your thoughts onto
electronic paper faster than waiting for your fingers to catch up.

Specialization – Voice Command Technology (VCT) has demonstrated considerable growth in the medical sector. For example, physicians find it very helpful to make file notations directly into the patient’s Electronic Health Record (EHR). Such systems are designed with a built-in, comprehensive medical vocabulary.

CONS
Set-up and Training can be a significant investment of time. Despite promises that you’ll be up and running a few minutes after installation, the reality of recording your voice commands is more complex. Capturing your tone and inflection accurately sometimes takes time. The software itself also pauses every few sentences as it tries to figure out what you said. Therefore, it all requires patience and clear enunciation.

Frequent Pauses can at times spoil your mood. Remember that the goal was to write faster than you could normally type. Changes in voice tone or speech clarity can cause glitches, such as unrecognized words or acronyms.

Limited Vocabulary – you should also be ready for lots of delays while the software stumbles on unusual words. The simple reason for this is that new industry-specific vocabularies are still being added.
Apple’s Siri

Apple’s Siri, which debuted back in 2011, was the first voice assistant created by a mainstream tech company.

Since then, it has been integrated on all iPhones, iPads, the AppleWatch,
the HomePod, Mac computers, and Apple TV.

Via your phone, Siri is even being used as the key user interface in Apple’s
CarPlay infotainment system for automobiles as well as the wireless
AirPod earbuds.

With the release of SiriKit, a development tool that lets third-party companies integrate with Siri, and the HomePod, Apple’s own attempt at an intelligent speaker, the voice assistant’s abilities have become even more robust.

Although Apple had a big head start with Siri, many users expressed
frustration at its seeming inability to properly understand and interpret
voice commands.

But, even today, Siri remains notorious for misunderstanding voice commands, even going so far as to respond to a request for help with alcohol poisoning by providing a list of nearby liquor stores.

If you ask Siri to send a text message or make a call on your behalf, it can
easily do so. However, when it comes to interacting with third-party apps,
Siri is a little less robust compared to its competitors, working with only six
types of apps: ride-hailing and sharing; messaging and calling; photo
search; payments; fitness; and auto infotainment systems.

Now, an iPhone user can say, “Hey Siri, I’d like a ride to the airport” or
“Hey Siri, order me a car,” and Siri will open whatever ride service app
you have on your phone and book the trip.

Focusing on the system’s ability to handle follow-up questions and language translation, and revamping Siri’s voice to something more human-esque, is definitely helping to iron out the voice assistant’s user experience.

In addition, Apple rules over its competitors in terms of availability by country, and thus in Siri’s understanding of foreign accents. Siri is available in more than 30 countries and 20 languages – and, in some cases, several different dialects.
Amazon Alexa

Housed inside Amazon’s smash-hit Amazon Echo smart speaker as well as the newly released Echo Show (a voice-controlled tablet) and Echo Spot (a voice-controlled alarm clock), Alexa is one of the most popular voice assistants out there today.

Whereas Apple focuses on perfecting Siri’s ability to do a small handful of things rather than expanding its areas of expertise, Amazon puts no such restrictions on Alexa.

Instead, Amazon is wagering that the voice assistant with the most “skills” (its term for apps on its Echo assistant devices) “will gain a loyal following, even if it sometimes makes mistakes and takes more effort to use”.

Although some users have pegged Alexa’s word recognition rate as being
a shade behind other voice platforms, the good news is that Alexa adapts
to your voice over time, offsetting any issues it may have with your
particular accent or dialect.

Speaking of skills, Amazon’s Alexa Skills Kit (ASK) is perhaps what has
propelled Alexa forward as a bonafide platform. ASK allows third-party
developers to create apps and tap into the power of Alexa without ever
needing native support.

With over 30,000 skills and growing, Alexa certainly outperforms Siri,
Google Voice and Cortana combined in terms of third-party integration.
With the incentive to “Add Voice to Your Big Idea and Reach More
Customers” (not to mention the ability to build for free in the cloud “no
coding knowledge required”) it’s no wonder that developers are rushing to
put content on the Skills platform.

Another huge selling point for Alexa is its integration with smart home
devices such as cameras, door locks, entertainment systems, lighting and
thermostats.

If you ask Alexa to re-order your rubbish bags, she’ll just go through
Amazon and order them. In fact, you can order millions of products off of
Amazon without ever lifting a finger; a natural and unique ability that Alexa
has over its competitors.
Microsoft’s Cortana

Based on a 26th-century artificially intelligent character in the Halo video game series, Cortana debuted in 2014 as part of Windows Phone 8.1, the next big update at the time for Microsoft’s mobile operating system.

Microsoft has since announced, in late 2017, that its conversational speech recognition system reached a 5.1% error rate, its lowest so far. This surpasses the 5.9% error rate reached in October 2016 by a group of researchers from Microsoft Artificial Intelligence and Research, and puts its accuracy on par with professional human transcribers, who have advantages like the ability to listen to text several times.

In this race, every inch counts; when Microsoft announced their 5.9%
accuracy rate in late 2016, they were ahead of Google. However, fast-
forwarding a year puts Google ahead – but only by 0.2%.

We’ve all watched 2001: A Space Odyssey where the mother of all
sentient computers, HAL 9000, goes on a killing rampage with its
unblinking red eye and smooth-as-butter robotic voice.

To avoid this, Microsoft spoke to a number of high-level personal assistants, finding that they all kept notebooks handy with key information about the person they were looking after. It was that simple idea that inspired Microsoft to create a virtual “Notebook” for Cortana, which stores personal information and anything that’s approved for Cortana to see and use.

For instance, if you aren’t comfortable with Cortana having access to your
email, your Notebook is where you can add or remove access. Another
stand-out feature? Cortana will always ask you before she stores any
information she finds in her Notebook.

Similarly to Google Assistant and Google Search, Cortana is backed by Microsoft’s Bing search engine, allowing the voice assistant to chew through whatever data it needs to answer your burning questions.

And, similarly to Amazon, Microsoft has come out with its own home smart
speaker, Invoke, which executes many of the same functions that their
rival devices do. Microsoft has another huge advantage when it comes to
market reach – with Cortana being available on all Windows computers
and mobiles running on Windows 10.
Google Assistant

One of the most common responses to voicing a question out loud these
days is, “LMGTFY”. In other words, “let me Google that for you”.

It only makes sense then, that Google Assistant prevails when it comes to
answering (and understanding) any and all questions its users may have.

From asking for a phrase to be translated into another language to converting the number of sticks of butter in one cup, Google Assistant not only answers correctly but also gives some additional context and cites a source website for the information. Given that it’s backed by Google’s powerful search technology, perhaps that’s unsurprising.

Though Amazon’s Alexa was released (through the introduction of the Echo) two years earlier than Google Home, Google has made great strides in catching up with Alexa in a very short time. Google Home was released in late 2016 and, within a year, had already established itself as the most meaningful opponent to Alexa.

As of late 2017, Google boasted a 95% word accuracy rate for U.S.
English; the highest out of all the voice-assistants currently out there. This
translates to a 4.9% word error rate – making Google the first of the group
to fall below the 5% threshold.

In what some call an attempt to strike back at Amazon, Google has launched many products eerily similar to Amazon’s. For instance, Google Home is reminiscent of Amazon’s Echo, and the Google Home Mini of the Amazon Echo Dot.

More recently, Google also announced some new, key partnerships with
companies including Lenovo, LG and Sony to launch a line of Google
Assistant-powered “smart displays,” which once again seems to ‘echo’ the
likeness of Amazon’s Echo Show.
In-Car Speech Recognition

Voice-activated devices and digital voice assistants aren’t just about making things easier. They’re also about safety – at least when it comes to in-car speech recognition.

Companies like Apple, Google and Nuance are completely reshaping the driver’s experience in the vehicle; by removing the distraction of looking down at your mobile phone while you drive, they aim to keep drivers’ eyes on the road.

Instead of texting while driving, you can now tell your car who to call or
what restaurant to navigate to.

Instead of scrolling through Apple Music to find your favorite playlist, you
can just ask Siri to find and play it for you.

If the fuel in your car is running low, your in-car speech system can not only inform you that you need to refuel, but also point out the nearest fuel station and ask whether you have a preference for a particular brand.

Or perhaps it can warn you that the petrol station you prefer is too far to
reach with the fuel remaining.

As beneficial as it may seem in an ideal scenario, in-car speech technology can be dangerous when implemented before it reaches high enough accuracy. Studies have found that voice-activated technology in cars can actually cause higher levels of cognitive distraction. This is because it is a relatively new technology; engineers are still working out the software kinks.

But, at the rate speech recognition technology and artificial intelligence are improving, perhaps we won’t even be behind the wheel at all a few years down the line.
Voice-Activated Video Games

Outside of these use-cases, in which speech recognition technology is implemented with the intent to simplify our lives, it’s also making strides in other areas. Namely, in the gaming industry.

Creating a video game is already extraordinarily difficult.

It takes years to properly flesh out the plot, the gameplay, character
development, customizable gear, lottery systems, worlds, and so on. Not
only that, but the game has to be able to change and adapt based on each
player’s actions.

Now, just imagine adding another level to gaming through speech recognition technology.

Many of the companies championing this idea do so with the intention of making gaming more accessible for visually and/or physically impaired people, as well as allowing players to immerse themselves further into gameplay through enabling yet another layer of integration.

Voice control could also potentially lower the learning curve for beginners,
seeing as less importance will be placed on figuring out controls; players
can “just” begin talking right away.

On the other hand, it’ll be extremely challenging for game developers, who will now have to account for hundreds (if not thousands) of hours of voice data collection, speech technology integration, testing and coding in order to retain their international audience.

However, despite all the goals tech companies are shooting for and the challenges they have to overcome along the way, there are already handfuls of video games out there whose creators believe the benefits outweigh the obstacles.

In fact, voice-activated video games have even begun to extend from the classic console and PC format to voice-activated mobile games and apps. From Seaman, starring a sarcastic man-fish brought to life by Leonard Nimoy’s voice in the late 1990s, to Mass Effect 3, released in 2012, the rise of speech technology in video games has only just begun.
