
“SPEECH RECOGNITION TECHNOLOGY”

A Technical Seminar report submitted to JNTUH in partial fulfillment of the


requirements for the award of the degree of

BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING
By

A.USHA - 19631A0523

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


SRI VENKATESWARA ENGINEERING COLLEGE
(SPONSORED BY THE EXHIBITION SOCIETY, HYD)
(Affiliated to Jawaharlal Nehru Technological University, HYD)
SURYAPET-508213
Jan-2023.

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SRI VENKATESWARA ENGINEERING COLLEGE
SURYAPET-508213

CERTIFICATE

This is to certify that the Technical Seminar Report entitled “SPEECH RECOGNITION TECHNOLOGY” is the bonafide work done by A.USHA (19631A0523), in partial fulfillment of the requirements for the award of BACHELOR OF TECHNOLOGY in Computer Science and Engineering by Jawaharlal Nehru Technological University, Hyderabad during the academic year 2023.

COORDINATOR: Mrs. E. SRI LAXMI, M.Tech, Assistant Professor

Head of the Department: Mr. B. RAMJI, M.Tech (Ph.D), Sr. Assistant Professor

DECLARATION

I, A. Usha, a student of B.Tech (CSE) at Sri Venkateswara Engineering College, Suryapet, bearing H.T. No. 19631A0523, hereby declare that the Technical Seminar with the title “SPEECH RECOGNITION TECHNOLOGY” is original work done by me.

To the best of my knowledge and belief, I hereby declare that this Technical Seminar
Report bears no resemblance to any other Technical Seminar Report submitted at Sri
Venkateswara Engineering College, Suryapet or any other college affiliated to Jawaharlal
Nehru Technological University, Hyderabad for the award of the degree.

Place: Suryapet

Date:03/01/2023

Signature of the candidate

A.USHA
(19631A0523)

ACKNOWLEDGEMENT

I thank the almighty for giving me the courage and perseverance to complete my Technical Seminar Report. This Technical Seminar Report is itself an acknowledgment of all those people who helped make this Technical Seminar a success.

I take this opportunity to express my deep and sincere gratitude to the Co-ordinator, Mrs. E. SRI LAXMI, M.Tech, Assistant Professor, for her valuable advice at every stage of this work. Without her supervision and valuable guidance, this report would never have come out in this form.

I am also thankful to Mr. B. RAMJI, M.Tech (Ph.D), Head of the Department of Computer Science and Engineering, Dr. M. Raju, Principal, and Dr. D. Kiran Kumar, Director, S.V.E.S., for providing excellent facilities, motivation and encouragement to complete this Technical Seminar Report work on time.

Last but not least, I would like to express my deep sense of gratitude and earnest thanks to my parents for their moral support and heartfelt cooperation in doing the Technical Seminar Report. I would also like to thank all the teaching and non-teaching staff and my friends whose direct or indirect help has enabled me to complete this work successfully.

A.USHA
19631A0523

ABSTRACT

As a cross-disciplinary field, speech recognition takes the human voice as its research object.

Speech recognition allows a machine to turn the speech signal into text or commands through a process of identification and understanding, and thereby enables natural voice communication. Speech recognition involves many fields, including physiology, psychology, linguistics, computer science and signal processing, and is even related to a person's body language; its ultimate goal is to achieve natural language communication between man and machine. Speech recognition technology is gradually becoming a key technology of the IT man-machine interface. This report describes the development of speech recognition technology and its basic principles and methods, reviews the classification of speech recognition systems and voice recognition technology, and analyses the problems faced by speech recognition.

Table of Contents
CERTIFICATE…………………………………………………………………………………….2
DECLARATION…………………………………………………………………………………...3

ACKNOWLEDGEMENT.................................................................................................................4

ABSTRACT........................................................................................................................................5

Chapter 1: INTRODUCTION ........................................................................................................7

Chapter 2: SPEECH RECOGNITION..........................................................................................8

Chapter 2.1: Automatic speech recognition...............................................................................8

Chapter 3: CLASSIFICATION TECHNIQUES........................................................................10


Chapter 4: WORKING TECHNOLOGY ....................................................................................12
Chapter 4.1: Acoustic model.. .............................................................................................14
Chapter 4.2: Language model…...........................................................................................15

Chapter 5: SPEECH RECOGNITION ALGORITHMS............................................................20

Chapter 5.1: Hidden Markov model ...................................................................................20

Chapter 5.2: N-grams...........................................................................................................31

Chapter 5.3: Artificial Intelligence......................................................................................33

Chapter 6: PROS AND CONS OF SPEECH RECOGNITION.............................................36


Chapter 6.1: Pros.................................................................36
Chapter 6.2: Cons……………............................................................37

Chapter 7: FEATURES.................................................................................................................39

Chapter 8: APPLICATIONS........................................................................................................40

Chapter 8.1: Google’s voice assistant.................................................................................41

Chapter 8.2: Amazon’s Alexa….........................................................................................41

Chapter 8.3: Apple’s Siri.....................................................................................................42

Chapter 8.4: Voice shopping...............................................................................................42

Chapter 8.5: Voice assistants in smart televisions.............................................................43

Chapter 8.6: Voice recognition for security.......................................................................44

Chapter 8.7: Forensic voice and criminal identification...................................................45

Chapter 8.8: Voice user interfaces......................................................................................45

Chapter 9: Conclusion ................................................................................................................47


REFERENCES...........................................................................................................................48

Chapter 1: INTRODUCTION

The computer revolution is now well advanced, but although we see computers entering many of the forms of work people do, the domain of computers is still significantly limited because of the specialized training needed to use them and the lack of intelligence in computer systems. In the history of computer science, five generations have passed by, each adding a new innovative technology that brought computers nearer and nearer to people. We are now in the sixth generation, whose prime objective is to make computers more intelligent, i.e., to build computer systems that can think like humans.

The fifth generation aimed at using conventional symbolic Artificial Intelligence techniques to achieve machine intelligence. This failed. Statistical modeling and neural nets really form the sixth generation. The goal of work in Artificial Intelligence is to build machines that perform tasks normally requiring human intelligence. True, but speech recognition, seeing and walking do not require "intelligence" so much as human perceptual ability and motor control. Speech technology is now one of the major scientific research fields under the broad domain of AI; indeed it is a major subdomain of computer science, apart from the traditional linguistics and other disciplines that study spoken language.

Chapter 2: SPEECH RECOGNITION

Speech recognition, or speech-to-text, is the ability of a machine or program to identify


words spoken aloud and convert them into readable text. Rudimentary speech recognition
software has a limited vocabulary and may only identify words and phrases when spoken
clearly. More sophisticated software can handle natural speech, different accents and various
languages. Speech recognition uses a broad array of research in computer science, linguistics
and computer engineering. Many modern devices and text-focused programs have speech
recognition functions in them to allow for easier or hands-free use of a device. Speech
recognition and voice recognition are two different technologies and should not be confused.
Speech recognition is used to identify words in spoken language. Voice recognition is a
biometric technology for identifying an individual’s voice.

Fig: Speech Recognition Technology

Chapter 2.1: Automatic Speech Recognition

The days when you had to keep staring at the computer screen and frantically hit the keys or click the mouse for the computer to respond to your commands may soon be a thing of the past. Today you can stretch out, relax and tell your computer to do your bidding. This has been made possible by ASR (Automatic Speech Recognition) technology. ASR technology would be particularly welcomed by automated telephone exchange operators, doctors and lawyers, besides others who seek freedom from tiresome conventional computer operation using the keyboard and the mouse. It is suitable for applications in which computers are used to provide routine information and services. ASR's direct speech-to-text dictation offers a significant advantage over traditional transcription. With further refinement of the technology, typing in text will become a thing of the past. ASR offers a solution to this fatigue-causing procedure by converting speech into text. ASR technology is presently capable of achieving recognition accuracies of 95%-98%, but only under ideal conditions.

The technology is still far from perfect in the uncontrolled real world. The roots of this technology can be traced to 1968, when the term Information Technology hadn't even been coined and Americans had only begun to realize the vast potential of computers. The Hollywood blockbuster 2001: A Space Odyssey featured a talking, listening computer, HAL 9000, which to date is a celebrated figure in both science fiction and the world of computing. Even today almost every speech recognition technologist dreams of designing an HAL-like computer with a clear voice and the ability to understand normal speech. Though ASR technology is still not as versatile as the imagined HAL, it can nevertheless be used to make life easier. New application-specific standard products, interactive error-recovery techniques, and better voice-activated user interfaces allow the handicapped, the computer-illiterate, and rotary-dial phone owners to talk to computers. ASR, by offering a natural human interface to computers, finds applications in telephone call centers, such as airline flight information systems, learning devices, toys, etc.

Fig : Automatic speech recognition

Chapter 3: SPEECH RECOGNITION
CLASSIFICATION TECHNIQUES

Speech Recognition (SR) can broadly be classified into two categories:

1. Small Vocabulary/ Large User Base: Good for automated tele-services like voice
activated dialing and IVR, but the usable vocabulary is highly limited in scope to certain
specific commands.
2. Large Vocabulary/ Small User Base: Suited for environments where small group of
people is involved. It however requires more rigorous training for that particular user group
and gives erroneous results for anyone outside that group.

The current methods rely on mathematically analyzing the digitized sound waves and their spectral properties. The process involves converting the sound waves spoken into the microphone (sampled at 16 kHz) into a digital signal through sampling and quantization following the Nyquist-Shannon sampling theorem, which, simply put, requires at least one sample to be collected for each compression and rarefaction consecutively. This means that the sampling frequency should be at least twice the highest frequency component in the signal. The speech recognition program then applies various algorithms and models to account for variations and to compress the raw speech signal to simplify processing. The initial compression may be achieved through many methods, including Fourier transforms, Perceptual Linear Prediction, Linear Predictive Coding and Mel-Frequency Cepstral Coefficients.
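
To make this front end concrete, here is a minimal Python sketch, assuming a 16 kHz mono WAV recording (the filename is purely illustrative): it frames the signal into short overlapping windows and computes a log power spectrum per frame, the kind of compact representation that MFCC, LPC or PLP front ends refine further.

    import numpy as np
    from scipy.io import wavfile

    # Load a 16 kHz mono recording (illustrative filename).
    rate, signal = wavfile.read("utterance.wav")
    signal = signal.astype(np.float64)

    frame_len = int(0.025 * rate)   # 25 ms analysis window
    hop_len = int(0.010 * rate)     # 10 ms hop between frames
    window = np.hamming(frame_len)

    features = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2     # power spectrum of the frame
        features.append(np.log(power + 1e-10))      # log compression

    features = np.array(features)   # shape: (num_frames, frame_len // 2 + 1)
    print(features.shape)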

There are four common approaches by which speech is recognized:

1. Template based: Predefined templates or samples are created and stored. Whenever a user utters a word, it is correlated with all the templates, and the one with the highest correlation is selected as the spoken word. This approach is not flexible enough to handle variations in voice patterns. Dynamic Time Warping may be considered one of these techniques (see the sketch after this list).
2. Knowledge based: These analyze spectrograms of the voice to collect data and create rules which are indicative of the uttered command. They do not use a language knowledge base or model speech variations, and are generally used for command-based systems.

3. Stochastic: Speech being a highly random phenomenon can be considered to be a
piecewise stationary process over which stochastic models can be applied. As stated earlier,
this is one of the most popular methods used by commercial programs. Hidden Markov Models
are an example of stochastic methods.
4. Connectionist: Artificial Neural Networks are used to store and extract various coefficients from the speech data over multilayered structures, with various neural nets used to deduce the spoken word.
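
As an illustration of the template-based approach above, the following is a minimal Dynamic Time Warping sketch; it assumes each stored template and the input utterance have already been reduced to feature-vector sequences (for example, the log-spectral frames computed earlier), and every name here is illustrative rather than a prescribed implementation.

    import numpy as np

    def dtw_distance(a, b):
        # Dynamic Time Warping distance between two feature sequences,
        # each of shape (num_frames, num_features).
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = np.linalg.norm(a[i - 1] - b[j - 1])    # local frame distance
                cost[i, j] = d + min(cost[i - 1, j],        # insertion
                                     cost[i, j - 1],        # deletion
                                     cost[i - 1, j - 1])    # match
        return cost[n, m]

    def recognize(utterance, templates):
        # templates: dict mapping a word label to its stored feature sequence.
        # The template with the smallest warped distance wins.
        return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))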

The performance is generally measured in terms of accuracy and speed. The general measures are the Single Word Error Rate, which is the misrecognition of one word in a spoken sentence, and the Command Success Rate, which is the accurate interpretation of the spoken command. Different methods give varying results, which further depend on various external factors.

Chapter 4: WORKING TECHNOLOGY
When a person speaks, compressed air from the lungs is forced through the
vocal tract as a sound wave that varies as per the variations in the lung pressure and
the vocal tract. This acoustic wave is interpreted as speech when it falls upon a
person’s ear. In any machine that records or transmits human voice, the sound wave
is converted into an electrical analogue signal using a microphone. When we speak
into a telephone receiver, for instance, its microphone converts the acoustic wave into
an electrical analogue signal that is transmitted through the telephone network. The
electrical signal's strength from the microphone varies in amplitude over time and is
referred to as an analogue signal or an analogue waveform. If the signal results from
speech, it is known as a speech waveform. Speech waveforms have the characteristic
of being continuous in both time and amplitude.

Speech recognition systems use computer algorithms to process and interpret spoken words and convert them into text. A software program turns the sound a microphone records into written language that computers and humans can understand.

Fig: Working of Speech Recognition Technology

Any speech recognition system involves the following major steps:

1. Converting sounds into electrical signals: When we speak into a microphone, it converts sound waves into electrical signals. In any machine that records or transmits human voice, the sound wave is converted into an electrical signal using a microphone. When we speak into a telephone receiver, for instance, its microphone converts the acoustic wave into an electrical analogue signal that is transmitted through the telephone network. The electrical signal's strength from the microphone varies in amplitude over time and is referred to as an analogue signal or an analogue waveform.
2. Background noise removal: The ASR program removes all noise and retains the words that you have spoken.
3. Breaking up words into phonemes: The words are broken down into individual sounds, known as phonemes, which are the smallest discernible sound units. For each small slice of time, feature values are computed from the wave; in this way the wave is divided into small parts, called phonemes.
4. Matching and choosing character combinations: This is the most complex phase. The program has a big dictionary of popular words that exist in the language. Each phoneme is matched against these sounds and converted into an appropriate character group. This is where the problem begins: the program checks and compares words that are similar in sound to what it has heard, and all these similar words are collected.
5. Language analysis: Here the system checks whether the language allows a particular syllable to appear after another.
6. After that, there is a grammar check. The program tries to find out whether or not the combination of words makes any sense; that is, there is a grammar check package.
7. Finally, the recognized words constituting the speech are output as text. Speech recognition programs come with their own word processor, and some can work with other word processing packages like MS Word and WordPerfect.
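
As a minimal illustration of this pipeline from the user's side, the sketch below uses the third-party Python SpeechRecognition package (an assumption for illustration, not something prescribed by this report) to run the capture, noise-handling and decoding steps end to end; the audio filename is hypothetical, and the phoneme matching, word selection and language analysis happen inside the recognition engine being called.

    import speech_recognition as sr   # third-party package: SpeechRecognition

    recognizer = sr.Recognizer()

    # Steps 1-2: read the recorded waveform and calibrate for background noise.
    with sr.AudioFile("command.wav") as source:        # illustrative filename
        recognizer.adjust_for_ambient_noise(source)    # noise calibration
        audio = recognizer.record(source)               # capture the signal

    # Steps 3-6 run inside the engine; here a cloud recognizer is queried.
    try:
        text = recognizer.recognize_google(audio)
        print("Recognized text:", text)
    except sr.UnknownValueError:
        print("Speech was unintelligible")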

Speech recognition software must adapt to the highly variable and context-specific nature of human speech. The software algorithms that process and organize audio into text are trained on different speech patterns, speaking styles, languages, dialects, accents and phrasings. The software also separates spoken audio from the background noise that often accompanies the signal.
To meet these requirements, speech recognition systems use two types of models:
1. Acoustic models
2. Language models

Chapter 4.1: Acoustic Model

Acoustic modeling of speech typically refers to the process of establishing statistical representations for the feature vector sequences computed from the speech waveform. The Hidden Markov Model (HMM) is one of the most common types of acoustic models. Other acoustic models include segmental models, super-segmental models (including hidden dynamic models), neural networks, maximum entropy models, and (hidden) conditional random fields, etc.
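
To make the idea of a statistical representation concrete, here is a minimal sketch that assumes a single diagonal-covariance Gaussian per HMM state (real systems use mixtures of many Gaussians, as discussed in Chapter 5.1); it simply scores how well one acoustic feature vector matches each state, and all values are toy placeholders.

    import numpy as np

    def log_gaussian(x, mean, var):
        # Log-likelihood of feature vector x under a diagonal-covariance Gaussian.
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

    # Toy acoustic model: one Gaussian per state over 13-dimensional features.
    rng = np.random.default_rng(0)
    states = {
        "state_0": (rng.normal(size=13), np.ones(13)),
        "state_1": (rng.normal(size=13), np.ones(13)),
    }

    frame = rng.normal(size=13)   # one MFCC-like feature vector
    scores = {name: log_gaussian(frame, m, v) for name, (m, v) in states.items()}
    print(max(scores, key=scores.get), scores)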

Acoustic modeling also encompasses "pronunciation modeling", which describes how a sequence or multiple sequences of fundamental speech units (such as phones or phonetic features) are used to represent larger speech units such as words or phrases, which are the object of speech recognition. Acoustic modeling may also include the use of feedback information from the recognizer to reshape the feature vectors of speech in order to achieve noise robustness in speech recognition.

Speech recognition engines usually require two basic components in order to recognize speech. One component is an acoustic model, created by taking audio recordings of speech and their transcriptions and then compiling them into statistical representations of the sounds that make up words. The other component is called a language model, which gives the probabilities of sequences of words. Language models are often used for dictation applications. A special type of language model is the regular grammar, typically used in desktop command-and-control or telephony IVR-type applications. Our group has been working on acoustic modeling since its inception due to its critical importance in speech technology, and speech recognition in particular. We have world-class expertise and researchers in this area. Recently, we have been focusing on two aspects of acoustic modeling:
1) how to establish the statistical models and their structures; and 2) how to learn the
model parameters automatically from the data. The following are some of our recent
projects in the area of acoustic modeling:

 Discriminative Learning Algorithms and Procedures for Acoustic Models of Speech


 Large-Margin Learning of HMM Parameters
 Discriminative pronunciation modeling
 Joint discriminative learning of SLU and SR model parameters using N-best/lattice
results from the speech recognizer
 Discriminative acoustic models for Speech Recognition via the use of continuous
features in CRF and HCRF
 Acoustic feature enhancement by statistical methods with feedback from speech
recognition
 Compressing HMM parameters for adaptive noise-robust speech recognition
 Noise-adaptive and speaker-adaptive training of HMM parameters
 Parametric modeling of the acoustic environment with mixing phases between speech and
noise for speech recognition
 Multilingual and cross-lingual speech recognition
 Cross-Lingual Speech Recognition under Runtime Resource Constraints
 Modeling speech production mechanisms for speech recognition: hidden dynamic
modeling; minimum-effort principle for model learning and decoding
 Acoustic modeling for casual speech for enhanced voicemail
 Active learning for speech recognition
 Unsupervised learning for speech recognition

Chapter 4.2: Language Model

Language modeling (LM) is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models analyze bodies of text data to provide a basis for their word predictions. They are used in natural language processing (NLP) applications, particularly ones that generate text as an output. Some of these applications include machine translation and question answering.

How language modeling works:
Language models determine word probability by analyzing text data. They
interpret this data by feeding it through an algorithm that establishes rules for context
in natural language. Then, the model applies these rules in language tasks to
accurately predict or produce new sentences. The model essentially learns the features
and characteristics of basic language and uses those features to understand new
phrases.
There are several different probabilistic approaches to modeling language,
which vary depending on the purpose of the language model. From a technical
perspective, the various types differ by the amount of text data they analyze and the
math they use to analyze it. For example, a language model designed to generate
sentences for an automated Twitter bot may use different math and analyze text data
in a different way than a language model designed for determining the likelihood of
a search query.
Some common statistical language modeling types are:

N-gram. N-grams are a relatively simple approach to language models. They create a probability distribution for a sequence of n words. The n can be any number, and defines the size of the "gram", or sequence of words being assigned a probability. For example, if n = 5, a gram might look like this: "can you please call me." The model then assigns probabilities to sequences of size n. Basically, n can be thought of as the amount of context the model is told to consider. Some types of n-grams are unigrams, bigrams, trigrams and so on.
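
A minimal sketch of how such a probability distribution can be estimated by counting n-grams in text; the toy corpus is purely illustrative.

    from collections import Counter

    corpus = "can you please call me when you can please".split()   # toy corpus
    n = 2   # bigrams

    # Count every contiguous sequence of n words.
    ngrams = Counter(tuple(corpus[i:i + n]) for i in range(len(corpus) - n + 1))
    total = sum(ngrams.values())

    # Relative frequencies act as a simple probability distribution over bigrams.
    for gram, count in ngrams.most_common(3):
        print(" ".join(gram), count / total)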

Unigram. The unigram is the simplest type of language model. It doesn't look at any
conditioning context in its calculations. It evaluates each word or term independently.
Unigram models commonly handle language processing tasks such as information
retrieval. The unigram is the foundation of a more specific model variant called the
query likelihood model, which uses information retrieval to examine a pool of
documents and match the most relevant one to a specific query.

Bidirectional. Unlike n-gram models, which analyze text in one direction


(backwards), bidirectional models analyze text in both directions, backwards and
forwards. These models can predict any word in a sentence or body of text by using
every other word in the text. Examining text bidirectionally increases result accuracy.
This type is often utilized in machine learning and speech generation applications. For
example, Google uses a bidirectional model to process search queries.

Exponential. Also known as maximum entropy models, this type is more complex
than n-grams. Simply put, the model evaluates text using an equation that combines
feature functions and n-grams. Basically, this type specifies features and parameters
of the desired results, and unlike n-grams, leaves analysis parameters more ambiguous
-- it doesn't specify individual gram sizes, for example. The model is based on the
principle of entropy, which states that the probability distribution with the most
entropy is the best choice. In other words, the model with the most chaos, and least
room for assumptions, is the most accurate. Exponential models are designed to maximize entropy, which minimizes the number of statistical assumptions that can be made. This enables users to better trust the results they get from these models.

Continuous space. This type of model represents words as a non-linear combination


of weights in a neural network. The process of assigning a weight to a word is also
known as word embedding. This type becomes especially useful as data sets get
increasingly large, because larger datasets often include more unique words. The presence of a lot of unique or rarely used words can cause problems for a linear model like an n-gram. This is because the amount of possible word sequences increases, and
the patterns that inform results become weaker. By weighting words in a non-linear,
distributed way, this model can "learn" to approximate words and therefore not be
misled by any unknown values. Its "understanding" of a given word is not as tightly
tethered to the immediate surrounding words as it is in n-gram models.
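
A minimal sketch of the word-embedding idea behind continuous space models, with random vectors standing in for learned neural-network weights (the vocabulary, dimensions and values are illustrative only).

    import numpy as np

    rng = np.random.default_rng(1)
    vocab = ["rain", "flood", "heavy", "sunny"]
    dim = 8

    # In a real model these vectors are learned; here they are random stand-ins.
    embeddings = {word: rng.normal(size=dim) for word in vocab}

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Words are compared through their vectors rather than exact string matches.
    query = embeddings["rain"]
    for word in vocab:
        print(word, round(cosine(query, embeddings[word]), 3))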
The models listed above are more general statistical approaches from which more
specific variant language models are derived. For example, as mentioned in the n-gram
description, the query likelihood model is a more specific or specialized model that
uses the n-gram approach. Model types may be used in conjunction with one another.
The models listed also vary significantly in complexity. Broadly speaking, more
complex language models are better at NLP tasks, because language itself is extremely
complex and always evolving. Therefore, an exponential model or continuous space
model might be better than an n-gram for NLP tasks, because they are designed to
account for ambiguity and variation in language.
A good language model should also be able to process long-term dependencies,
handling words that may derive their meaning from other words that occur in far-
away, disparate parts of the text. An LM should be able to understand when a word is
referencing another word from a long distance, as opposed to always relying on
proximal words within a certain fixed history. This requires a more complex model.
Importance of language modeling
Language modeling is crucial in modern NLP applications. It is the reason that
machines can understand qualitative information. Each language model type, in one
way or another, turns qualitative information into quantitative information. This
allows people to communicate with machines as they do with each other to a limited
extent.
It is used directly in a variety of industries including tech, finance, healthcare,
transportation, legal, military and government. Additionally, it's likely most people
reading this have interacted with a language model in some way at some point in the
day, whether it be through Google search, an autocomplete text function or engaging
with a voice assistant.
The roots of language modeling as it exists today can be traced back to 1948. That
year, Claude Shannon published a paper titled "A Mathematical Theory of
Communication." In it, he detailed the use of a stochastic model called the Markov
chain to create a statistical model for the sequences of letters in English text. This
paper had a large impact on the telecommunications industry and laid the groundwork for information theory and language modeling. The Markov model is still used today, and
n-grams specifically are tied very closely to the concept.
Uses and examples of language modeling
Language models are the backbone of natural language processing (NLP). Below are
some NLP tasks that use language modeling, what they mean, and some applications
of those tasks:

Speech recognition -- involves a machine being able to process speech audio. This is
commonly used by voice assistants like Siri and Alexa.
Machine translation -- involves the translation of one language to another by a
machine. Google Translate and Microsoft Translator are two programs that do
this. SDL Government is another, which is used to translate foreign social media feeds
in real time for the U.S. government.

Parts-of-speech tagging -- involves the markup and categorization of words by


certain grammatical characteristics. This is utilized in the study of linguistics, first and perhaps most famously in the study of the Brown Corpus, a body of text composed of random English prose that was designed to be studied by computers. This corpus has
been used to train several important language models, including one used by Google
to improve search quality.

Parsing -- involves analysis of any string of data or sentence that conforms to formal
grammar and syntax rules. In language modeling, this may take the form of sentence
diagrams that depict each word's relationship to the others. Spell checking applications
use language modeling and parsing.

Sentiment analysis -- involves determining the sentiment behind a given phrase.


Specifically, it can be used to understand opinions and attitudes expressed in a text.
Businesses can use this to analyze product reviews or general posts about their
product, as well as analyze internal data like employee surveys and customer support
chats. Some services that provide sentiment analysis tools are Repustate and Hubspot's
ServiceHub. Google's NLP tool -- called Bidirectional Encoder Representations from
Transformers (BERT) -- is also used for sentiment analysis.

Optical character recognition -- involves the use of a machine to convert images of


text into machine encoded text. The image may be a scanned document or document
photo, or a photo with text somewhere in it -- on a sign, for example. It is often used in data entry when processing old paper records that need to be digitized. It can also be used to analyze and identify handwriting samples.
Information retrieval -- involves searching in a document for information, searching
for documents in general, and searching for metadata that corresponds to a document.
Web browsers are the most common information retrieval applications.

Chapter 5: SPEECH RECOGNITION
ALGORITHMS

A speech recognition algorithm or voice recognition algorithm is used in speech


recognition technology to convert voice to text.
Speech recognition systems have several advantages:
 Efficiency: This technology makes work processes more efficient. Documents are
generated faster, and companies have been able to save on labor costs.
 Customer service playback: Speech recognition is now used to provide basic information
to users on customer service phone lines. Customers can select menus and get answers by
answering questions.
 Helps hearing and visually impaired people: People with visual or hearing disabilities
can now use computers to type and have text read to them out loud.
 Handsfree communication: Smartphone assistants such as Apple’s Siri and Google
Assistant have made it possible to use voice to make calls, send emails, search, and more
without touching the phone.
Various algorithms used in speech recognition are:
 Hidden Markov Model
 N-Grams Model
 Natural Language Processing Model
 Artificial Intelligence

Chapter 5.1: Hidden Markov Model

An HMM can be used to model an unknown process that produces a sequence of observable outputs at discrete intervals, where the outputs are members of some finite alphabet. These models are called hidden Markov models precisely because the state sequence that produced the observable output is not known; it is hidden. An HMM is represented by a set of states, vectors defining transitions between certain pairs of those states, probabilities that apply to state-to-state transitions, sets of probabilities characterizing observed output symbols, and initial conditions.

An example is shown in the three-state diagram, where states are denoted by nodes and transitions by directed arrows (vectors) between nodes. The underlying model is a Markov chain. The circles represent states of the speaker's vocal system, specific configurations of the tongue, lips, etc. that produce a given sound. The arrows represent possible transitions from one state to another. At any given time, the model is said to be in one state. At each clock time, the model might change from its current state to any state for which a transition vector exists. A transition may occur only from the tail to the head of a vector. A state can have more than one transition leaving it and more than one leading to it.

Fig: Hidden Markov Model

The hidden Markov model (HMM) is the basis of a set of successful techniques for acoustic modeling in speech recognition systems. The main reasons for this success are the model's ability to characterize the speech phenomenon analytically and its accuracy in practical speech recognition systems. Another major strength of the HMM is its convergent and reliable parameter training procedure. Spoken utterances are represented as a non-stationary sequence of feature vectors. Therefore, to evaluate a speech sequence statistically, it is required to segment the speech sequence into stationary states. An HMM is a finite state machine. Each state may be modeled as a single Gaussian or a multi-modal Gaussian mixture. Due to the continuous nature of speech observations, continuous density pdfs are often used in this model. The topology of an HMM for speech is considered to be left-to-right to meet the observation arrangement criterion. This left-to-right topology allows transitions from each state to itself and to right-hand neighbors. HMM model parameters are usually estimated in
the training phase by maximum-likelihood-based [1] or discriminative training algorithms [2,3] using sufficient training data sets. The parameters of a continuous left-to-right HMM with N states and M mixtures can be stated as λ = {π, A, B}, where π = {π_i} is the initial state distribution and A = {a_ij} is the state transition probability distribution matrix.
The transition probabilities are defined as follows: a_{ij} = P[q_{t+1} = j | q_t = i] is the transition probability from state i at time t to state j at time t + 1, satisfying the constraints

a_{ij} \ge 0, \quad \sum_{j=1}^{N} a_{ij} = 1; \quad 1 \le i, j \le N.   (1)

Fig. 1. The overall block diagram of an automatic speech recognition system.

B = {b_j(o_t)} is the set of observation probability densities per state, which may be represented by a multi-modal Gaussian mixture model as

b_j(o_t) = \sum_{m=1}^{M} C_{jm} G(o_t, µ_{jm}, Σ_{jm})   (2)

where C_{jm} is the mixture coefficient for the m-th mixture in state j. C_{jm} satisfies the following constraints:

C_{jm} \ge 0, \quad \sum_{m=1}^{M} C_{jm} = 1; \quad 1 \le j \le N, \ 1 \le m \le M.   (3)

G(.) is a Gaussian distribution with mean vector µ_{jm} and covariance matrix Σ_{jm}.

Fig. shows the overall block diagram of an automatic speech recognition system in the
recognition phase. The continuous input speech utterance is segmented into frames by the
preprocessing module. In the next step, the feature extraction module extracts a feature vector
on each frame to represent its acoustic information. Hence, a discrete sequence of feature
vectors (observations), O = (o1o2 ... oT ), is obtained. In an utterance classification task with
vocabulary size v, the unknown input speech is compared with all of the HMMs λi according to
some search algorithms, and finally, the input speech is identified as one of the reference
HMMs with the highest score. In most HMM-based systems, Viterbi algorithm [1] is the core
of the recognition procedure. Viterbi algorithm is a full search method that tries all possible
solutions to find the best alignment path of the state sequence between the input utterance and
a given HMM. The full search in HMM can be formulated as

LL = P(O | λ) = \max_{q_1 q_2 \ldots q_T} P[q_1 q_2 \ldots q_T, o_1 o_2 \ldots o_T | λ]
   = \max_{q_1 q_2 \ldots q_T} \left[ \pi_{q_1} b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t} b_{q_t}(o_t) \right]   (4)

where q_t is the state at time t. The sequence q_1 q_2 ... q_T denotes an alignment of the observation sequence and the speech HMM, and T is the length of the observation sequence. Obviously, as the search space increases, the computational cost increases exponentially as O(N^T); therefore, it is impractical to solve this NP-complete problem. The Viterbi algorithm extracts the alignment path dynamically by a recursive procedure.

LL_t(j) = \max_{1 \le i \le N} [LL_{t-1}(i) a_{ij}] b_j(o_t)   (5)

where LL_t(j) is the partial cost function of the alignment path in state j at time t, and LL_{t-1}(i) is the score of the best path among possible paths that start from the first state and end in the i-th state at time t − 1. Fig. 2 shows a Viterbi trellis diagram in which the horizontal axis represents the time axis of the input utterance and the vertical axis represents the possible states of the reference HMM.
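
The recursion in Eq. (5) can be written down directly in a few lines. The following is a minimal log-domain sketch, assuming the per-frame log emission scores log b_j(o_t) have already been computed (for example with per-state Gaussians as sketched in Chapter 4.1); the toy model values are placeholders, not parameters from this report.

    import numpy as np

    def viterbi(log_pi, log_A, log_B):
        # log_pi: (N,) initial log-probabilities; log_A: (N, N) transition log-probs;
        # log_B: (T, N) log emission scores log b_j(o_t).
        T, N = log_B.shape
        ll = log_pi + log_B[0]                  # LL_1(j)
        back = np.zeros((T, N), dtype=int)
        for t in range(1, T):
            scores = ll[:, None] + log_A        # LL_{t-1}(i) + log a_ij
            back[t] = np.argmax(scores, axis=0)
            ll = scores[back[t], np.arange(N)] + log_B[t]
        path = [int(np.argmax(ll))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return float(np.max(ll)), path[::-1]

    # Toy 3-state left-to-right model with 5 observation frames.
    rng = np.random.default_rng(0)
    log_pi = np.log(np.array([1.0, 1e-10, 1e-10]))
    A = np.array([[0.6, 0.4, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]])
    log_B = rng.normal(size=(5, 3))
    print(viterbi(log_pi, np.log(A + 1e-12), log_B))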
The computational complexity of this method is O(N^2 T). Although it saves the
computational cost and memory requirements, it can however only be practically used where
the length of the input utterance is short and the number of HMM reference models is small.
In particular, for continuous speech recognition, this is not usually the case. Hence, to
overcome this deficiency, a Viterbi beam search [4] has been presented. The main idea
in beam search is to keep and extend possible paths with higher scores. This approach may
eliminate the optimality of the algorithm.

Fig. Viterbi trellis diagram.

Recently, evolutionary algorithms (EAs) have been extended in speech recognition


problems. However, there is little research in using these algorithms in the recognition phase
of HMM-based recognizers, and most studies have been focused on the training phase. EAs
are based on generating a group of random population of possible solutions and using a
collaborative search in each generation to achieve better solutions than previous ones. In HMM
training, genetic algorithms (GAs) [5–9] and particle swarm optimization (PSO) [10–12] have
been studied in recent years, where each individual solution is represented as an HMM and is
encoded as a string of real numbers. The studies have revealed that PSO can yield better
recognition performance and more capability to find the global optimum in comparison with
GA and the well-known Baum–Welch algorithm. These algorithms have also been applied in
optimizing the nonlinear time alignment of template-based speech recognition in the
recognition phase [13,14]. In these works, to solve the optimal warping path searching
problem, each potential solution is considered as a possible warping path in the search space.
It was shown that using PSO with a pruning strategy causes a considerable reduction in
recognition time while maintaining the system accuracy. In contrast, using a direct GA without
pruning is not a promising approach. PSO has been used to solve many NP-complete
optimization problems.
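
For readers unfamiliar with PSO, the following is a minimal generic particle swarm sketch (it is not the paper's Segment-ProbPSO; the fitness function, bounds and coefficients are illustrative assumptions): each particle's velocity is pulled toward its own best position and toward the swarm's best position.

    import numpy as np

    def pso_minimize(fitness, dim, n_particles=8, iters=40, w=0.7, c1=1.5, c2=1.5):
        # Generic PSO: returns the best position found for the given fitness function.
        rng = np.random.default_rng(0)
        pos = rng.uniform(-1.0, 1.0, size=(n_particles, dim))
        vel = np.zeros_like(pos)
        pbest, pbest_val = pos.copy(), np.array([fitness(p) for p in pos])
        gbest = pbest[np.argmin(pbest_val)].copy()
        for _ in range(iters):
            r1, r2 = rng.random((2, n_particles, dim))
            # Inertia + pull toward personal best + pull toward global best.
            vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
            pos = pos + vel
            vals = np.array([fitness(p) for p in pos])
            improved = vals < pbest_val
            pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
            gbest = pbest[np.argmin(pbest_val)].copy()
        return gbest

    # Toy usage: minimize a quadratic (a stand-in for a negative log-likelihood).
    print(pso_minimize(lambda x: float(np.sum(x ** 2)), dim=3))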

In this paper, a novel approach is proposed to apply particle swarm optimization
strategy in the recognition phase of a speech recognition system instead of the traditional
Viterbi algorithm to deal with PSO performance in finding the global optimum segmentation.
Preliminary results of this work were reported in [17]. To explore the performance of the proposed system, experiments were conducted on isolated word recognition and on stop consonant phones. Stop consonant classification is one of the most challenging tasks in speech recognition. In addition, a new classification method based on a tied segmentation
strategy is introduced. The method can be generalized to the continuous speech recognition
case. The remainder of this paper is organized as follows. The next section provides the details
of the proposed PSO-based recognition procedure. Section 3 presents the experimental results
and in the last section the paper is concluded.

Experimental results

Experimental setup

To evaluate the performance of the idea in an utterance classifier, a set of experiments


was conducted on the eight most frequently occurring words of the standard TIMIT speech database, which are presented in Table 2. The frequency of each word is more than 460 in the training set and more than 160 in the test set. Although TIMIT is a
continuous speech recognition benchmark, the variety of words and speakers in TIMIT makes
it a good benchmark for our task. In addition, the idea was evaluated on a stop consonants
phone classifier for six TIMIT stop phones which are presented in Table 3. We eliminated the
phones with length less than three frames in the test set. The total number of phones in the
resulting test set for six stop phones was 5184 phones. The overall block diagram of both
HMM-based recognition systems is similar to Fig. 1. The Baum–Welch algorithm was applied
to train a continuous density HMM of each word and phone. The numbers of states in each
word and each phone were assumed to be four and three, respectively. There are 9 mixtures
per state in the words model and 16 mixtures per state in the phones model.

In the preprocessing stage of both systems, the audio signal is transformed into 26-dimensional MFCC feature vectors. In the word recognizer, feature vectors are extracted over 20 ms windows of the utterance using overlapped 8 ms sliding frames. In contrast, in our phone classification test bed, the preprocessor produced the feature vectors every 10 ms over 16 ms windows. The first 12 features are based on 25 mel-scaled filter bank coefficients, the 13th element is a log energy coefficient, and the 13 remaining features are their first derivatives. The tests were simulated using the Matlab 7.6 programming language. A summary of implementation parameters is given in Tables 2 and 3, and Fig. 5 shows the block diagram of our system.

Isolated word recognizer results

The figures describe the overall behavior of the suggested system. Fig. 6 shows the effect of the particle definition on the convergence ratio with respect to the Viterbi path likelihood. When the particles are represented as state sequence vectors, the initial step starts from lower ratios. In addition, the recognition system is easily trapped into a local optimum. The experiments show that if particles are defined as segmentation vectors and movement updating is performed by the second method in Section 2.3, the probability of finding the global optimum and approaching the Viterbi path likelihood increases. Therefore, Segment-ProbPSO is taken as the baseline method in all of the experiments, and optimizations have also been performed using this method. The next figure shows the effect of the movement structure. Although both curves start from a common point, their convergence rates are different. Therefore, if the particle movement during generations is defined with a probabilistic structure, it is more probable to find the global optimum.

The results in Table 4 reveal that the Viterbi and Segment-ProbPSO algorithms are equal in error rates on average. Although we have applied the recognition procedure of the Viterbi algorithm as the benchmark for our system, the proposed method provides better results in some cases. This indicates the major drawback of the traditional recognition process, which makes the decision based on a comparison of the best paths between the unknown uttered word and the given word models. The figure shows an example of a comparison between the Viterbi and Segment-ProbPSO recognition processes over 10 iterations. The unknown input utterance belongs to word model 1. This test sample is recognized correctly by the Viterbi algorithm, while the proposed algorithm recognizes it as the second word model after 10 iterations. However, it is obvious that, after more generations, the system achieves the correct result. Therefore, more iterations were required to obtain sufficient accuracy in this case.

Fig: Speech recognizer block diagram

Fig: Influence of the particle definition on the convergence percentage to the Viterbi path likelihood (Segment-ProbPSO vs. SS-ProbPSO).

Fig: Influence of the movement construction on the convergence percentage to the Viterbi path likelihood (Segment-ProbPSO vs. SS-LCPSO).

Table 4: Comparison of error rates.

# Reference words | Viterbi error rate (%) | Segment-ProbPSO error rate (%)
2 | 0.86 | 0.86
4 | 0.58 | 0.44
6 | 0.87 | 0.66
8 | 0.73 | 0.73

In most cases, the difference between the competing models' likelihoods is large and the desirable accuracy will be reached in the primary iterations.
Figs. 9 and 10 report the system optimization results. Fig. 9 shows the influence of the α, β and
γ coefficients on the recognition error rate under fixed conditions for eight reference word
models in 20 iterations. The optimum value of β is 5 for α = 15 and γ = 15. The optimum value
of α is 5 for β = 5, γ = 15 and the optimum value of γ is 10 for α = 5 and β = 5.
Fig. 10 shows the effect of population size on the overall error rate. The computational cost
increases with the increase of population size. Therefore, the population size’s optimum value
is the position where the curve is saturated. This value is about eight particles in the empirical curve depicted in Fig. 10.

Phone recognition results


Considering the good performance of Segment-ProbPSO on the isolated word recognition system, some phone classification experiments were conducted using Segment-ProbPSO with the optimum values of α, β and γ achieved in the previous section. The results are depicted in Fig. 11 and Table 5, and are compatible with the stop recognition results in the literature. The figure shows the outstanding performance of this recognizer, which is even better than the PSO-based isolated word recognizer in achieving the global optimum. Furthermore, it is obvious that, in the initial iteration, the gbest likelihood values of the phone classes get close enough to Viterbi's best path value. In addition, the system error rate in the first iteration, as shown in Table 5, is 32.45%, which is a negligible difference in comparison with the 30.29% baseline system error rate based on the Viterbi recognition procedure. Therefore, with a few additional iterations, the system can easily achieve the desired accuracy.
Table 5 shows the variations of the system error rate versus different population sizes and iterations. If we neglect a 1% or 2% difference in error rate, we can claim that, in the phone classification task, the proposed algorithm's computational cost is almost equivalent to or even less than the computational cost of the Viterbi algorithm.
Tied segmentation method results
The results of the tied segmentation method in Table 6 show that both classifier types
have almost the same performance. However, in this method, the convergence rate to the desired accuracy is higher than in the previously proposed methods.

Fig: An example comparison of likelihood values versus number of iterations for the Viterbi and Segment-ProbPSO methods, for four reference word models.

Fig: The influence of α, β and γ on the recognition error rate for eight reference word models in 20 iterations.


Fig: The effect of population size on the recognition error rate for 20 iterations and α = β = 5, γ = 10.


Fig: Convergence percentage of Segment-ProbPSO's gbest likelihood values to the Viterbi path likelihood.

Table 5: Phone classifier error rates (%).

#Particles \ #Iterations | 1 | 2 | 3 | 4 | 5 | 6
1 | 39.12 | 36.94 | 35.63 | 34.88 | 34.22 | 33.12
2 | 36.65 | 34.86 | 33.72 | 33.04 | 32.64 | 32.31
3 | 35.38 | 33.35 | 32.48 | 31.85 | 31.58 | 31.46
4 | 34.26 | 32.54 | 31.98 | 31.69 | 31.27 | 31.15
5 | 33.04 | 32.37 | 31.69 | 31.19 | 30.92 | 30.90
6 | 32.45 | 31.79 | 31.53 | 30.74 | 30.63 | 30.55

Table 6: Phone classifier error rates (%).

#Particles \ #Iterations | 1 | 2 | 3 | 4 | 5 | 6
1 | 39.12 | 37.58 | 36.34 | 34.68 | 33.97 | 33.02
2 | 36.65 | 34.95 | 33.31 | 32.91 | 32.43 | 31.91
3 | 35.38 | 33.24 | 32.48 | 31.73 | 31.60 | 31.20
4 | 34.26 | 32.20 | 31.46 | 31.15 | 31.02 | 30.94
5 | 33.04 | 31.83 | 31.15 | 30.90 | 30.64 | 30.22
6 | 32.45 | 31.13 | 30.83 | 30.63 | 30.55 | 30.13

Chapter 5.2: N-Grams Model

An N-gram is a sequence of N words in NLP modeling. Consider an example statement for modeling: "I love reading history books and watching documentaries". In a one-gram or unigram, there is a one-word sequence; for the above statement, the unigrams are "I", "love", "reading", "history", "books", "and", "watching", "documentaries". In a two-gram or bi-gram, there is a two-word sequence, i.e. "I love", "love reading", or "history books". In a three-gram or tri-gram, there are three-word sequences, i.e. "I love reading", "reading history books", or "and watching documentaries" [3]. The illustration of N-gram modeling for N = 1, 2, 3 is given below in the Figure.

Given the preceding N − 1 words, N-gram modeling predicts the most probable words that can follow the sequence. The model is a probabilistic language model trained on a collection of text. This model is useful in applications such as speech recognition and machine translation. A simple model has some limitations that can be improved by smoothing, interpolation, and back-off. So, the N-gram language model is about finding probability distributions over sequences of words. Consider the sentences "There was heavy rain" and "There was heavy flood".

Fig: Uni-gram, Bi-gram, and Tri-gram Model

From experience, it can be said that the first statement is more plausible. The N-gram language model tells us that "heavy rain" occurs more frequently than "heavy flood". So, the first statement is more likely to occur and will be selected by this model. In the one-gram model, the model relies only on how often a word occurs, without considering the previous words. In the 2-gram, only the previous word is considered for predicting the current word. In the 3-gram, the two previous words are considered.
In the N-gram language model, the following probabilities are calculated:
P("There was heavy rain") = P("There", "was", "heavy", "rain") = P("There") P("was" | "There") P("heavy" | "There was") P("rain" | "There was heavy").
As it is not practical to calculate such conditional probabilities directly, by using the "Markov assumption" this is approximated with the bi-gram model as [4]:
P("There was heavy rain") ~ P("There") P("was" | "There") P("heavy" | "was") P("rain" | "heavy")
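
A minimal sketch of this bi-gram approximation, estimating the conditional probabilities from counts over a tiny toy corpus (the corpus and the sentence-start marker are illustrative assumptions, not from this report).

    from collections import Counter

    corpus = [
        "there was heavy rain",
        "there was heavy rain again",
        "there was heavy flood",
    ]
    tokens = [w for line in corpus for w in ("<s> " + line).split()]

    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def p_bigram(w, prev):
        # Maximum-likelihood estimate P(w | prev) = count(prev, w) / count(prev).
        return bigrams[(prev, w)] / unigrams[prev]

    def sentence_prob(sentence):
        words = ("<s> " + sentence).split()
        prob = 1.0
        for prev, w in zip(words, words[1:]):
            prob *= p_bigram(w, prev)
        return prob

    print(sentence_prob("there was heavy rain"))    # higher probability
    print(sentence_prob("there was heavy flood"))   # lower probability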

Applications of the N-gram Model in NLP

In speech recognition, the input can be noisy. This noise can lead to a wrong speech-to-text conversion. The N-gram language model corrects such errors by using probability knowledge. Likewise, this model is used in machine translation for producing more natural statements in the target and specified languages. For spelling error correction, the dictionary alone is sometimes useless. For instance, in "in about fifteen minuets", 'minuets' is a valid word according to the dictionary but it is incorrect in the phrase. The N-gram language model can rectify this type of error.

The N-gram language model generally operates at the word level. It is also used at the character level for stemming, i.e. for separating root words from suffixes. By looking at N-grams, languages can be classified, or US and UK spellings differentiated. Many applications benefit from the N-gram model, including part-of-speech tagging, natural language generation, word similarity, and sentiment extraction.

Limitations of N-gram Model in NLP

The N-gram language model also has some limitations. There is a problem with out-of-vocabulary words: these are words encountered during testing but not seen in training. One solution is to use a fixed vocabulary and then convert out-of-vocabulary words in the training data to pseudowords. When implemented in sentiment analysis, the bi-gram model outperformed the uni-gram model, but the number of features then doubled. So, scaling the N-gram model to larger data sets or moving to higher orders needs better feature selection approaches. The N-gram model also captures long-distance context poorly; it has been shown that beyond 6-grams the performance gain is limited.

Chapter 5.3: Artificial Intelligence

It is indisputable that speech recognition in AI (artificial intelligence) has come a long


way since Bell Laboratories invented Audrey, a device capable of recognizing a few spoken digits, in 1952. A recent study by Capgemini demonstrates how ubiquitous speech recognition has become: 74% of consumers sampled reported that they use conversational assistants to research and buy goods and services, create shopping lists, and check order status.

We are all familiar with digital assistants such as Google Assistant, Cortana, Siri, and Alexa.
Google Assistant and Siri are used by over 1 billion people globally, and Siri has over 40
million users in the US alone. But, have you ever wondered how these tools understand what
you say? Well, they use speech to text AI.

AI was initially used to analyze and quickly compute data, but it is now used to perform tasks that previously could only be performed by humans. Artificial intelligence is often confused with machine learning. Machine learning is a derivative of artificial intelligence and refers to the process of teaching a machine to recognize and learn from patterns rather than teaching it rules.

Computers are trained by feeding large volumes of data to an algorithm and then letting
it pick out the patterns and learn. In the nascent days of machine learning, programmers had to
write code for every object they wanted the computer to recognize – e.g., a cat vs. a human.
These days, computers are shown numerous examples of each object. Over time, they learn
without any human input.

Speech recognition, natural language processing, and translation use artificial


intelligence today. Many speech recognition applications are powered by automatic speech
recognition and Natural Language Processing (NLP). Automatic speech recognition refers to the conversion of audio to text, while NLP processes the text to determine its meaning. Humans rarely speak in a straightforward manner that computers can understand. Normal speech contains accents, colloquialisms, different cadences, emotions, and many other variations. It takes a great deal of natural language analysis to generate accurate text.

Challenges with Speech to Text AI

Despite the giant leap forward that AI speech to text has made over the last decade,
there remain several challenges that stand in the way of true ubiquity.
The first of these is accuracy. The best applications currently boast a 95% accuracy rate – first
achieved by Google Cloud Speech in 2017. Since then, many competitors have made great
strides and achieved the same rate of accuracy.

While this is good progress, it means that there will always be a 5% error rate. This
may seem like a small figure – and it is, where the issue at hand is a transcript that can be
quickly edited by a human to correct errors. But, it is a big deal where voice is used to give a
command to the computer. Imagine asking your car’s navigator to search the map for a
particular location, and it searches for something different and sends you on your way in the
wrong direction because it didn’t quite catch what you said.
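
Accuracy in this context is usually reported through the word error rate (WER): the edit
distance between the recognized transcript and a reference transcript, divided by the number of
reference words. The short sketch below is a generic illustration of that calculation, not taken
from any particular product.

```python
# Word error rate: (substitutions + deletions + insertions) / reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# "95% accuracy" corresponds roughly to a WER of 0.05; one wrong word in eight gives 0.125:
print(wer("search the map for the nearest petrol station",
          "search the map for the nearest petrol stations"))
```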

The other challenge is that humans don’t just listen to each other’s voices to understand
what is being said. They also observe non-verbal communication to understand what is being
communicated but isn’t being said. This includes facial expressions, gestures, and body
language. So, while computers can hear and understand the content, we are a long way from
getting to a point where they can pick up on non-verbal cues. The emotional robot that can
hear, feel and interpret like a human is the holy grail of speech recognition.

In speech recognition, the computer takes input in the form of sound vibrations. This is done
by using an analog-to-digital converter that converts the sound waves into a digital format the
computer can process. Advanced speech recognition in AI also encompasses voice recognition,
where the computer can distinguish a particular speaker's voice.
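
As a small illustration of what "digital format" means here, the sketch below uses Python's
standard wave module to inspect how an analog waveform has been sampled into discrete numbers;
the file name "speech.wav" and the 16-bit PCM format are assumptions for the example.

```python
# Inspect the digitized form of a recorded speech signal.
# Assumption: "speech.wav" is an existing 16-bit PCM WAV file.
import wave

with wave.open("speech.wav", "rb") as wav:
    print("Sample rate :", wav.getframerate(), "samples per second")
    print("Channels    :", wav.getnchannels())
    print("Sample width:", wav.getsampwidth() * 8, "bits per sample")
    frames = wav.readframes(wav.getnframes())

# Each pair of bytes is one signed 16-bit amplitude value; these integers are
# the "digital format" that the recognizer's feature extraction works on.
first_sample = int.from_bytes(frames[:2], byteorder="little", signed=True)
print("First amplitude value:", first_sample)
```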

Chapter 6: PROS AND CONS OF SPEECH RECOGNITION

Speech recognition, one of the most important biometric technologies, has become an
increasingly popular concept in recent years and is widely used for the various advantages it
provides. It allows documents to be created quickly, because the software generally produces
words as fast as they are uttered, which is usually much faster than a person can type. This
chapter therefore presents the advantages and disadvantages of speech recognition to better
understand the topic.

Chapter 6.1: Pros or advantages of speech recognition

There are several advantages to using speech recognition software, including the following:

Machine-to-human communication: The technology enables electronic devices to
communicate with humans in natural language or conversational speech.

Readily accessible: This software is frequently installed in computers and mobile devices,
making it accessible.

Easy to use: Well-designed software is straightforward to operate and often runs in the
background.

Continuous, automatic improvement: Speech recognition systems that incorporate AI
become more effective and easier to use over time. As systems complete speech recognition
tasks, they generate more data about human speech and get better at what they do.

It is fairly accurate: Although it should always be proofread, speech recognition software can
result in a document more or less free of errors. In addition, newer programs tend to be well
designed and can offer reliable results for some applications.

It allows for hands-free work: When working with a client or completing a task, the use of
speech recognition tools facilitates easy note taking, use of other materials, and professional
eye contact. Each of these activities is limited when someone has to type information into a
computer behind a screen.

Security: With this technology a powerful interface between man and computer is created, as
the voice recognition system responds only to enrolled (pre-recorded) voices, making it much
harder to tamper with data or break in.

Productivity: It reduces effort, as operations are carried out through voice recognition, so
paperwork is reduced to a minimum and the user can work more comfortably.

Advantage for blind and physically disabled users: This technology is a great boon for blind
and physically disabled users, who can use voice recognition to carry out their work.

Usability across languages: Since the technology needs only a voice input, speech can be
captured regardless of the language in which it is delivered, which makes it helpful for use in
many languages.

Personal voice macros can be created: Everyday tasks such as sending and receiving mail or
drafting documents can be done easily, and many tasks can be completed faster.

Chapter 6.2: Cons or disadvantages of speech recognition

While convenient, speech recognition technology still has a few issues to work through.
Limitations include:
Accuracy is always imperfect: “More or less accurate” is not perfectly accurate. This is a
very important factor to consider when choosing tools, especially for medical or legal needs. In
these cases accuracy is paramount, and a client’s professional reputation hangs in part on how
well they present themselves in writing.

Some voices don’t come across well: Speech recognition software may not be able to
transcribe the words of those who speak quickly, run words together or have an accent.
Accuracy also drops when more than one speaker is present and being recorded.

Inconsistent performance: The systems may be unable to capture words accurately because
of variations in pronunciation, lack of support for some languages and inability to sort through
background noise. Ambient noise can be especially challenging. Acoustic training can help
filter it out, but these programs aren't perfect. Sometimes it's impossible to isolate the human
voice.

Speed: Some speech recognition programs take time to deploy and master, and the speech
processing itself may feel relatively slow.

Source file issues: Speech recognition success depends on the recording equipment used, not
just the software.

Chapter 7: FEATURES OF SPEECH RECOGNITION
Good speech recognition programs let users customize them to their needs. The
features that enable this include:

Language weighting: This feature tells the algorithm to give special attention to certain words,
such as those spoken frequently or that are unique to the conversation or subject. For example,
the software can be trained to listen for specific product references.
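
One simple, generic way to think about language weighting is rescoring candidate transcripts so
that hypotheses containing the boosted terms score higher. The sketch below is only an
illustration of that idea, with made-up boost values and candidates; it is not how any
particular product implements the feature.

```python
# Toy rescoring step that "weights" candidate transcripts toward domain terms.
BOOSTED_TERMS = {"echo dot": 2.0, "fire stick": 2.0, "kindle": 1.5}  # illustrative only

def rescore(candidates):
    """candidates: list of (transcript, acoustic_score) pairs."""
    rescored = []
    for text, score in candidates:
        boost = sum(w for term, w in BOOSTED_TERMS.items() if term in text.lower())
        rescored.append((text, score + boost))
    # Best candidate after applying the language weighting.
    return max(rescored, key=lambda pair: pair[1])

candidates = [("I'd like an echo dot please", 4.1),
              ("I'd like an ego dote please", 4.3)]
print(rescore(candidates))  # the boosted product name wins despite a lower raw score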

Acoustic training: The software tunes out ambient noise that pollutes spoken audio.
Software programs with acoustic training can distinguish speaking style, pace and volume amid
the din of many people speaking in an office.

Speaker labeling: This capability enables a program to label individual participants and
identify their specific contributions to a conversation.

Profanity filtering: Here, the software filters out undesirable words and language.
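
A minimal word-list filter of the kind described here might look like the following sketch;
the word list and the masking token are assumptions made only for illustration.

```python
# Replace undesirable words in a transcript with a placeholder token.
# The word list is a stand-in; real filters use much larger curated lists.
import re

PROFANITY = {"darn", "heck"}  # placeholder examples only

def filter_profanity(transcript: str, mask: str = "****") -> str:
    def replace(match: re.Match) -> str:
        return mask if match.group(0).lower() in PROFANITY else match.group(0)
    # \w+ keeps punctuation intact while testing each word against the list.
    return re.sub(r"\w+", replace, transcript)

print(filter_profanity("Well darn, that meeting ran long."))
# -> "Well ****, that meeting ran long."
```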

Chapter 8: APPLICATIONS
Speech recognition technologies such as Alexa, Cortana, Google Assistant and Siri are
changing the way people interact with their devices, homes, cars, and jobs. The technology
allows us to talk to a computer or device that interprets what we are saying in order to respond
to our question or command. With a long history of development and innovation behind the field,
it was the introduction of these artificial intelligence voice-controlled assistants, or digital
assistants, into the voice recognition market that changed the landscape of this technology in
the 21st century. With digital assistants quickly becoming ubiquitous in various aspects of life,
understanding their capabilities and applications is paramount to individuals, businesses, and
organisations.

Digital assistants are designed to help people perform or complete basic tasks and
respond to queries. With the ability to access information from vast databases and various
digital sources, these assistants help to solve problems in real time, enhancing the user
experience and human productivity. Early voice recognition programs were not sophisticated
enough to understand everyone's voice, so many users were disappointed; however, voice
recognition has made an enormous amount of progress since then.

It’s present in our smartphones and on our computers, and it is used in a wide variety
of industries. The applications of voice recognition seem almost endless! Here are some of the
top trends and applications when it comes to voice recognition technology.

Popular digital assistants and a few popular speech recognition applications include:

• Google's Google Assistant
• Amazon's Alexa
• Apple's Siri
• Voice shopping
• Voice assistants in smart TVs
• Security authentication
• Forensic and criminal identification
• Voice user interfaces, etc.

Chapter 8.1: GOOGLE Voice Assistant

The privacy policy of Google Assistant states that it does not store the audio data
without the user's permission, but may store the conversation transcripts to personalise its
experience. Personalisation can be turned off in settings. If a user wants Google Assistant to
store audio data, they can go to Voice & Audio Activity (VAA) and turn on this feature. Audio
files are sent to the cloud and used by Google to improve the performance of Google Assistant,
but only if the VAA feature is turned on.

Fig: Google's voice assistant

Chapter 8.2: AMAZON’s Alexa

The privacy policy of Amazon's virtual assistant, Alexa, states that it only listens to
conversations when its wake word (such as "Alexa", "Amazon", or "Echo") is used. It starts
recording the conversation once a wake word is heard, stops recording after 8 seconds of
silence, and then sends the recorded conversation to the cloud. It is possible to delete the
recording from the cloud by visiting 'Alexa Privacy' in the Alexa app.
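
The wake-word-then-silence behaviour described above can be sketched generically as a small
state machine. The recorder below is purely illustrative, built around hypothetical helper
callables (get_next_chunk, transcribe, is_silence, send_to_cloud); it is not Amazon's
implementation.

```python
import time

WAKE_WORDS = {"alexa", "amazon", "echo"}
SILENCE_TIMEOUT = 8.0  # seconds of silence before recording stops

def run_recorder(get_next_chunk, transcribe, is_silence, send_to_cloud):
    """Sketch of a wake-word loop. All four callables are assumed helpers:
    get_next_chunk() yields short audio chunks, transcribe() turns a chunk into
    text, is_silence() detects low energy, send_to_cloud() uploads a recording."""
    recording, buffer, last_voice = False, [], time.monotonic()
    while True:
        chunk = get_next_chunk()
        if not recording:
            # Idle: only listen for a wake word, discard everything else.
            if any(w in transcribe(chunk).lower() for w in WAKE_WORDS):
                recording, buffer, last_voice = True, [], time.monotonic()
        else:
            buffer.append(chunk)
            if not is_silence(chunk):
                last_voice = time.monotonic()
            elif time.monotonic() - last_voice >= SILENCE_TIMEOUT:
                send_to_cloud(buffer)      # upload and return to idle
                recording = False
```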

Fig: Amazon's Alexa


Chapter 8.3: APPLE's Siri

Apple states that it does not record audio to improve Siri. Instead, it uses transcripts.
Transcript data is only sent if it is deemed important for analysis. Users can opt out at any
time if they don't want Siri to send transcripts to the cloud.

Fig: Apple's Siri

Chapter 8.4: Voice Shopping

Voice shopping is exactly what it sounds like: using your voice to make purchases. Just
as users have become accustomed to using these devices for carrying out simple search
requests or for operating devices around the home, they are also using them to make purchases
ranging from pizza to paper and from movie tickets to big-ticket items.
Here’s how a typical voice shopping event might take place:

“Hey Alexa, buy shampoo.”


“Okay, according to your order history, I found Maybelline shampoo. Would you like to buy
it?”
“Yes.”

Voice shopping has already become one of the biggest trends in retail. A TechCrunch
report showed that just 13 percent of U.S. homes have a smart speaker system, but that more
than a third of those who had these systems used it to make purchases on a regular basis. These
purchases generated an estimated $2 billion in revenue in 2017.

Fig: Voice Shopping

A 2017 survey of more than 1,600 shoppers showed that about one-quarter of those
surveyed owned at least one voice-activated digital assistant, with another 20 percent expected
to purchase one in the coming year. While just under 20 percent of those surveyed had made
purchases through voice shopping, the survey also showed that more than 40 percent of
millennials had used their voice-activated assistants for purchases in the preceding year.
"Voice commerce represents the next major disruption in the retail industry," said John
Franklin, Associate Partner at OC&C. "Just as e-commerce and mobile commerce changed the
retail landscape, shopping through smart speakers promises to do the same."

Many major retailers have taken notice of the voice shopping trend by forming
partnerships with smart speaker providers.
For users who don't specify a particular brand, Amazon will recommend an “Amazon's
Choice” brand, or even an Amazon-branded product where applicable. This capability gives
Amazon and its partners a leg up.

Chapter 8.5: Voice Assistants in Smart TVs

A voice assistant is a natural fit for a smart TV. It is especially helpful if you don't
want to get up and find the remote control, and navigating by voice saves time. All you have to
do is press and hold the microphone button and speak into it calmly; there is no need for active
listening, and you don't need to yell commands from across the room. Depending on the version
of voice recognition in your TV, you can flip through channels, browse and search content, and
open and close apps. Voice recognition also lets you change the sound mode or look up
information with a smart assistant.

Fig: Voice Assistants in Smart Televisions

Chapter 8.6: Voice recognition for security

Most people have a number of online accounts, and these accounts need protection. Many of
them, such as online banking applications, carry significant security risks. Internet banking
has become widespread in Industry 4.0, so there is a strong need for proper identification
systems that ensure only the account owner has access to important information. Voice
identification is a relatively recent kind of user identification: a speech authentication
factor identifies your distinctive voice.

Fig: Voice Recognition for Security

It is very similar to the AI assistants that respond to your voice: the method can be used
as a unique 'password' to unlock secured accounts, since everyone has a distinctive voice.
Biometric authentication, unlike passwords or token-based authentication, uses unique
biological characteristics to verify an individual's identity. It is harder to spoof and
generally more convenient for users, since they don't have to remember passwords or carry a
physical token that can easily be lost or stolen.
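
At a high level, voice authentication compares a "voiceprint" (an embedding vector extracted
from the user's speech) against an enrolled template. The sketch below is a generic
illustration of that comparison using cosine similarity; the embedding extraction itself, the
threshold, and the toy vectors are all assumptions for the example.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two voiceprint vectors, in the range [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# In a real system the voiceprints would come from a trained neural model;
# here they are just small made-up vectors.
def authenticate(enrolled_voiceprint, login_voiceprint, threshold=0.8):
    score = cosine_similarity(enrolled_voiceprint, login_voiceprint)
    return score >= threshold  # accept only if the voices match closely enough

print(authenticate([0.9, 0.1, 0.4], [0.85, 0.15, 0.38]))  # True  - similar voiceprints
print(authenticate([0.9, 0.1, 0.4], [0.1, 0.9, 0.2]))     # False - different voiceprints
```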

Chapter 8.7: Forensic voice recognition and criminal identification

One of the most startling trends in speech recognition is the use of this technology to assist in
the identification of criminals. If a crime suspect’s voice is recorded, the audio can now be
utilized as key evidence. AGNITIO and Morpho (Safran) are currently working together to
deliver Voice ID technology for forensics.

Fig: Forensic Voice Recognition and Criminal Identification

This product enables the use of speech biometrics technology all around the world, helping
to identify persons and carry out background checks, and adding to checks done with fingerprints
and other methods. The technology can match recorded or live voices in seconds and has a 99
percent accuracy rate. Furthermore, it does not distinguish between different accents or
languages: it detects the sound of a person's voice rather than the words or language they
speak.

Chapter 8.8: Voice User Interfaces

The touchscreen revolutionized the way we engage with our gadgets, and voice recognition
has the potential to do the same, which would be a great leap in technology. The top reason to
use voice recognition is that it is easier and quicker than typing. Smart speakers, assistants,
and IoT devices are pioneering speech UIs, but as the technology advances there will be further
changes: these devices will deal with the nuances of our speech more successfully, allowing
them to achieve more.

Fig: User interface voice assistant

Voice UI will have an impact on the regular mobile app business as well. Voice
recognition and speech UIs will especially help the elderly or people with vision impairment.
Instead of browsing with swipes and clicks, we can simply talk. Improved ASR will also
benefit home automation devices, and everything from robotics to interactive toys. Speech
UIs will also open the way for more natural human interaction with domestic technology.

Chapter 9: CONCLUSION

Voice AI technology is gaining traction in a variety of industries. Rather than
mechanical solutions, the latest trends in the sector emphasize giving a more distinct and
tailored experience to users. Furthermore, corporate and small-business entities are continuing
to migrate to voice-based AI systems, aiming to expedite workflows and improve customer
deliverables. Many digital businesses and apps use speech recognition as a standard function,
and as the use of voice recognition grows, numerous new themes have emerged. These voice
recognition technologies streamline many e-commerce operations and can also play a critical
role in other company activities.

This report has aimed to present the most recent and significant speech recognition
developments, which can be incorporated into products, processes, and business platforms.
Voice recognition promises a rosy future and offers a wide variety of services. The next
generation of voice recognition technology consists of neural networks built with artificial
intelligence: networks of interconnected nodes that process the input in parallel for fast
evaluation and, like human beings, learn new patterns of speech automatically.

REFERENCES

1. https://www.techtarget.com/searchcustomerexperience/definition/speech-recognition
2. https://www.knowledgenile.com/blogs/what-are-new-technology-trends-in-speech-recognition/

