Voice Speech Recognition Documentation
BACHELOR OF TECHNOLOGY
In
COMPUTER SCIENCE AND ENGINEERING
By
A.USHA - 19631A0523
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SRI VENKATESWARA ENGINEERING COLLEGE
SURYAPET-508213
CERTIFICATE
DECLARATION
To the best of my knowledge and belief, I hereby declare that this Technical Seminar
Report bears no resemblance to any other Technical Seminar Report submitted at Sri
Venkateswara Engineering College, Suryapet or any other college affiliated to Jawaharlal
Nehru Technological University, Hyderabad for the award of the degree.
Place: Suryapet
Date:03/01/2023
A.USHA
(19631A0523)
ACKNOWLEDGEMENT
I thank the Almighty for giving me the courage and perseverance to complete my Technical
Seminar Report. This Technical Seminar Report is itself an acknowledgment of all those people who
made this Technical Seminar a success.
I take this opportunity to express my deep and sincere gratitude to the Co-ordinator,
Mrs. E. SRILAXMI, M.Tech, Assistant Professor, for her valuable advice at every stage of this work. Without
her supervision and valuable guidance, this work would never have come out in this form.
I am also thankful to Mr. B. RAMJI, M.Tech (Ph.D), Head of the Department of Computer Science and
Engineering, Dr. M. Raju, Principal, and Dr. D. Kiran Kumar, Director, S.V.E.S., for providing
excellent facilities, motivation and encouragement to complete this Technical Seminar Report
on time.
Last but not least, I would like to express my deep sense of gratitude and earnest thanks to my
parents for their moral support and heartfelt cooperation in doing the Technical Seminar Report. I
would also like to thank all the teaching and non-teaching staff and my friends whose direct or indirect
help has enabled me to complete this work successfully.
A.USHA
19631A0523
ABSTRACT
Table of Contents
CERTIFICATE
DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
Chapter 7: FEATURES
Chapter 8: APPLICATIONS
Chapter 1: INTRODUCTION
The computer revolution is now well advanced, but although we see a growing
penetration of computers into many forms of work people do, the domain of computers
is still significantly small because of the specialized training needed to use them and the lack
of intelligence in computer systems. In the history of computer science, five generations have
passed by, each adding a new innovative technology that brought computers nearer and nearer
to the people. Now it is the sixth generation, whose prime objective is to make computers more
intelligent, i.e., to make computer systems that can think like humans.
The fifth generation aimed at using conventional symbolic Artificial Intelligence
techniques to achieve machine intelligence. This failed. Statistical modeling and neural nets
really belong to the sixth generation. The goal of work in Artificial Intelligence is to build
machines that perform tasks normally requiring human intelligence. True, but speech recognition,
seeing and walking don't require "intelligence" so much as human perceptual ability and motor
control. Speech technology is now one of the most significant scientific research fields under the
broad domain of AI; indeed, it is a major subfield of computer science, apart from the
traditional linguistics and other disciplines that study the spoken language.
Chapter 2: SPEECH RECOGNITION
The days when you had to keep staring at the computer screen and frantically hit a
key or click the mouse for the computer to respond to your commands may soon be a thing of
the past. Today you can stretch out, relax and tell your computer to do your bidding. This has
been made possible by ASR (Automatic Speech Recognition) technology. ASR
technology would be particularly welcomed by automated telephone exchange operators,
doctors and lawyers, besides others who seek freedom from tiresome conventional
computer operation using the keyboard and the mouse. It is suitable for applications in which
computers are used to provide routine information and services. ASR's direct
speech-to-text dictation offers a significant advantage over traditional transcription. With
further refinement of the technology, typing in text may become a thing of the past. ASR offers a
solution to this fatigue-causing procedure by converting speech into text. ASR technology is
presently capable of achieving recognition accuracies of 95%-98%, but only under ideal
conditions.
The technology is still far from perfect in the uncontrolled real world. The roots of this
technology can be traced to 1968, when the term Information Technology hadn't even been
coined and Americans had only begun to realize the vast potential of computers. The Hollywood
blockbuster 2001: A Space Odyssey featured a talking, listening computer, HAL-9000,
which to date is a cult figure both in science fiction and in the world of computing. Even
today almost every speech recognition technologist dreams of designing a HAL-like computer
with a clear voice and the ability to understand normal speech. Though ASR technology is
still not as versatile as the imaginary HAL, it can nevertheless be used to make life easier. New
application-specific standard products, interactive error-recovery techniques, and better voice-
activated user interfaces allow the handicapped, the computer-illiterate, and rotary-dial phone
owners to talk to computers. ASR, by offering a natural human interface to computers, finds
applications in telephone call centers (such as airline flight information systems), learning
devices, toys, etc.
Chapter 3: SPEECH RECOGNITION
CLASSIFICATION TECHNIQUES
1. Small Vocabulary / Large User Base: Good for automated tele-services like voice-
activated dialing and IVR, but the usable vocabulary is highly limited in scope to certain
specific commands.
2. Large Vocabulary / Small User Base: Suited for environments where a small group of
people is involved. It however requires more rigorous training for that particular user group
and gives erroneous results for anyone outside that group.
Current methods rely on mathematically analyzing the digitized sound waves and
their spectral properties. The process involves converting the sound waves spoken into
the microphone (at 16 kHz) into a digital signal through sampling and quantization, following
the Nyquist-Shannon sampling theorem, which, simply put, requires at least one sample to be
collected for each compression and rarefaction consecutively. This means that the sampling
frequency must be at least twice the highest frequency component in the signal. The speech
recognition program then follows various algorithms and models to account for variations and
to compress the raw speech signal to simplify processing. The initial compression may be
achieved through many methods including Fourier Transforms, Perceptual Linear Prediction,
Linear Predictive Coding and Mel-Frequency Cepstral Coefficients.
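As a rough illustration of the front end described above, the following Python sketch (using numpy; the 16 kHz rate and the 25 ms/10 ms frame sizes are common illustrative choices, not mandated by the text) digitizes a test tone and splits it into the short overlapping frames on which features such as MFCCs are later computed.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a digitized speech signal into short overlapping frames,
    the first step before feature extraction (e.g. MFCCs)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# Per Nyquist-Shannon, a 16 kHz sampling rate captures components up to 8 kHz.
t = np.arange(16000) / 16000.0        # one second of audio
tone = np.sin(2 * np.pi * 440 * t)    # a 440 Hz test tone
frames = frame_signal(tone)           # each row is one 25 ms analysis frame
```

Each frame is then treated as approximately stationary and passed on to the chosen compression or feature-extraction method.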
Speech is commonly recognized using four approaches:
1. Template based: Predefined templates or samples are created and stored. Whenever a
user utters a word, it is correlated with all the templates, and the one with the highest correlation
is selected as the spoken word. This approach isn't flexible enough to handle variation in voice
patterns. Dynamic Time Warping may be considered one of these techniques.
2. Knowledge based: These analyze spectrograms of the voice to collect data and create
rules that are indicative of the uttered command. They do not use a language knowledge base
or model speech variations and are generally used for command-based systems.
3. Stochastic: Speech, being a highly random phenomenon, can be considered a
piecewise stationary process over which stochastic models can be applied. As stated earlier,
this is one of the most popular methods used by commercial programs. Hidden Markov Models
are an example of stochastic methods.
4. Connectionist: Artificial Neural Networks are used to store and extract various
coefficients from the speech data over multilayered structures and various neural nets to deduce
the spoken word.
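The template-matching idea in approach 1 can be sketched with a Dynamic Time Warping distance; the one-dimensional "feature" sequences and the two-word template store below are toy stand-ins for real MFCC sequences.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two feature sequences.
    Allows one sequence to be stretched or compressed in time."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: insertion, deletion, match
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy template store; a real system would hold MFCC sequences per word.
templates = {"yes": [1, 3, 4, 3, 1], "no": [2, 2, 5, 5, 2]}
utterance = [1, 3, 3, 4, 3, 1]  # a time-stretched version of "yes"
best = min(templates, key=lambda w: dtw_distance(templates[w], utterance))
```

The utterance is matched against every template and the word with the smallest warped distance is selected.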
Performance is generally measured in terms of accuracy and speed. The common
measures are the Single Word Error Rate, the rate at which individual words in a
spoken sentence are misrecognized, and the Command Success Rate, the rate at which
spoken commands are interpreted accurately. Different methods give varying results, which
further depend on various external factors.
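A minimal sketch of the word-level error measure mentioned above, computed as an edit distance over words (a common definition; the exact metric used by any given system may differ):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a word-level edit (Levenshtein) distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

wer = word_error_rate("open the file", "open a file")  # one substitution
```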
Chapter 4: WORKING TECHNOLOGY
When a person speaks, compressed air from the lungs is forced through the
vocal tract as a sound wave that varies as per the variations in the lung pressure and
the vocal tract. This acoustic wave is interpreted as speech when it falls upon a
person’s ear. In any machine that records or transmits human voice, the sound wave
is converted into an electrical analogue signal using a microphone. When we speak
into a telephone receiver, for instance, its microphone converts the acoustic wave into
an electrical analogue signal that is transmitted through the telephone network. The
electrical signal from the microphone varies in amplitude over time and is
referred to as an analogue signal or an analogue waveform. If the signal results from
speech, it is known as a speech waveform. Speech waveforms have the characteristic
of being continuous in both time and amplitude.
Any speech recognition system involves five major steps.
Speech recognition software must adapt to the highly variable and context-
specific nature of human speech. The software algorithms that process and organize
audio into text are trained on different speech patterns, speaking
styles, languages, dialects, accents and phrasings. The software also separates
spoken audio from the background noise that often accompanies the signal.
To meet these requirements, speech recognition systems use two types of models:
1. Acoustic models
2. Language models
Acoustic modeling raises two main questions: 1) how to establish the statistical models
and their structures; and 2) how to learn the model parameters automatically from the data.
How language modeling works:
Language models determine word probability by analyzing text data. They
interpret this data by feeding it through an algorithm that establishes rules for context
in natural language. Then, the model applies these rules in language tasks to
accurately predict or produce new sentences. The model essentially learns the features
and characteristics of basic language and uses those features to understand new
phrases.
There are several different probabilistic approaches to modeling language,
which vary depending on the purpose of the language model. From a technical
perspective, the various types differ by the amount of text data they analyze and the
math they use to analyze it. For example, a language model designed to generate
sentences for an automated Twitter bot may use different math and analyze text data
in a different way than a language model designed for determining the likelihood of
a search query.
Some common statistical language modeling types are:
N-gram. N-grams are a relatively simple approach to language models. They create a
probability distribution for a sequence of n words. The n can be any number, and defines the
size of the "gram", or sequence of words being assigned a probability. For example,
if n = 5, a gram might look like this: "can you please call me." The model then assigns
probabilities using sequences of size n. Basically, n can be thought of as the amount
of context the model is told to consider. Some types of n-grams are unigrams, bigrams,
trigrams and so on.
Unigram. The unigram is the simplest type of language model. It doesn't look at any
conditioning context in its calculations. It evaluates each word or term independently.
Unigram models commonly handle language processing tasks such as information
retrieval. The unigram is the foundation of a more specific model variant called the
query likelihood model, which uses information retrieval to examine a pool of
documents and match the most relevant one to a specific query.
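A toy sketch of the unigram query-likelihood idea described above, assuming simple whitespace tokenization and a small floor probability for unseen words (both illustrative choices):

```python
from collections import Counter

def unigram_model(text):
    """Estimate each word's probability from its relative frequency."""
    words = text.split()
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def score(model, query, eps=1e-6):
    """Query likelihood: product of independent per-word probabilities.
    `eps` is an illustrative floor for unseen words."""
    p = 1.0
    for w in query.split():
        p *= model.get(w, eps)
    return p

docs = ["the cat sat on the mat", "stock prices rose sharply today"]
models = [unigram_model(d) for d in docs]
# Rank the documents against a query, unigram-style.
best_doc = max(range(len(docs)), key=lambda i: score(models[i], "cat on mat"))
```

Because each word is scored independently, no conditioning context is used, exactly as the unigram definition above states.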
Bidirectional. Unlike n-gram models, which analyze text in one direction, bidirectional
models analyze text in both directions, backwards and forwards. These models can predict
any word in a sentence or body of text by using every other word in the text. Examining text
bidirectionally increases result accuracy. This type is often utilized in machine learning and
speech generation applications. For example, Google uses a bidirectional model to process
search queries.
Exponential. Also known as maximum entropy models, this type is more complex
than n-grams. Simply put, the model evaluates text using an equation that combines
feature functions and n-grams. Basically, this type specifies features and parameters
of the desired results, and unlike n-grams, leaves analysis parameters more ambiguous
-- it doesn't specify individual gram sizes, for example. The model is based on the
principle of entropy, which states that the probability distribution with the most
entropy is the best choice. In other words, the model with the most chaos, and least
room for assumptions, is the most accurate. Exponential models are designed to
maximize cross entropy, which minimizes the number of statistical assumptions that can
be made. This enables users to better trust the results they get from these models.
Such a model might be better than an n-gram for NLP tasks, because it is designed to
account for ambiguity and variation in language.
A good language model should also be able to process long-term dependencies,
handling words that may derive their meaning from other words that occur in far-
away, disparate parts of the text. An LM should be able to understand when a word is
referencing another word from a long distance, as opposed to always relying on
proximal words within a certain fixed history. This requires a more complex model.
Importance of language modeling
Language modeling is crucial in modern NLP applications. It is the reason that
machines can understand qualitative information. Each language model type, in one
way or another, turns qualitative information into quantitative information. This
allows people to communicate with machines as they do with each other to a limited
extent.
It is used directly in a variety of industries including tech, finance, healthcare,
transportation, legal, military and government. Additionally, it's likely most people
reading this have interacted with a language model in some way at some point in the
day, whether it be through Google search, an autocomplete text function or engaging
with a voice assistant.
The roots of language modeling as it exists today can be traced back to 1948. That
year, Claude Shannon published a paper titled "A Mathematical Theory of
Communication." In it, he detailed the use of a stochastic model called the Markov
chain to create a statistical model for the sequences of letters in English text. This
paper had a large impact on the telecommunications industry and laid the groundwork for
information theory and language modeling. The Markov model is still used today, and
n-grams specifically are tied very closely to the concept.
Uses and examples of language modeling
Language models are the backbone of natural language processing (NLP). Below are
some NLP tasks that use language modeling, what they mean, and some applications
of those tasks:
Speech recognition -- involves a machine being able to process speech audio. This is
commonly used by voice assistants like Siri and Alexa.
Machine translation -- involves the translation of one language to another by a
machine. Google Translate and Microsoft Translator are two programs that do
this. SDL Government is another, which is used to translate foreign social media feeds
in real time for the U.S. government.
Parsing -- involves analysis of any string of data or sentence that conforms to formal
grammar and syntax rules. In language modeling, this may take the form of sentence
diagrams that depict each word's relationship to the others. Spell checking applications
use language modeling and parsing.
Chapter 5: SPEECH RECOGNITION
ALGORITHMS
An example is shown in the three-state diagram, where states are denoted by nodes
and transitions by directed arrows (vectors) between nodes. The underlying model is a Markov
chain. The circles represent states of the speaker's vocal system: a specific configuration of
tongue, lips, etc. that produces a given sound. The arrows represent possible transitions from one
state to another. At any given time, the model is said to be in one state. At each clock tick, the model
may change from its current state to any state for which a transition vector exists.
A transition may occur only from the tail to the head of a vector. A state can have more than one
transition leaving it and more than one leading to it.
1. Hidden Markov Model (HMM) is the basis of a set of successful techniques for acoustic
modeling in speech recognition systems. The main reasons for this success are the
model's analytic ability to describe the speech phenomenon and its accuracy in practical speech
recognition systems. Another major feature of HMM is its convergent and reliable
parameter training procedure. Spoken utterances are represented as a non-stationary sequence
of feature vectors. Therefore, to evaluate a speech sequence statistically, it is required to
segment the speech sequence into stationary states. An HMM is a finite state machine.
Each state may be modeled as a single Gaussian or a multi-modal Gaussian mixture. Due to
the continuous nature of speech observations, continuous density pdfs are often used in this
model. The topology of an HMM for speech is considered to be left-to-right to meet the
observation arrangement criterion. This left-to-right topology authorizes transitions from each
state to itself and to its right-hand neighbors. HMM parameters are usually estimated in
the training phase by maximum likelihood-based [1] or discriminative-based training
algorithms [2,3] using sufficient training data sets. A continuous left-to-right HMM with
N states and M mixtures can be stated by λ = {π, A, B}. π = {π_i} is the initial
state distribution matrix, and A = {a_ij} is the state transition probability distribution matrix.
The transition probabilities are defined as a_ij = P[q_t+1 = j | q_t = i], the probability of a
transition from state i to state j, subject to the constraint Σ_{j=1..N} a_ij = 1 for every state i.
B = {b_j(o_t)} is the set of observation probability densities per state, which may be represented by
a multi-modal Gaussian mixture model as

b_j(o_t) = Σ_{m=1..M} C_jm G(o_t; μ_jm, Σ_jm),

where C_jm is the mixture coefficient for the m-th mixture in state j. C_jm satisfies the
constraints C_jm ≥ 0 and Σ_{m=1..M} C_jm = 1.
G(·) is a Gaussian distribution with mean vector µ_jm and covariance matrix Σ_jm.
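The observation density b_j(o_t) above can be sketched in Python as follows, assuming diagonal covariance matrices for simplicity (a common but not universal modeling choice):

```python
import numpy as np

def gaussian_pdf(o, mu, var):
    """Diagonal-covariance Gaussian density G(o; mu, Sigma)."""
    o, mu, var = map(np.asarray, (o, mu, var))
    norm = np.prod(1.0 / np.sqrt(2 * np.pi * var))
    return norm * np.exp(-0.5 * np.sum((o - mu) ** 2 / var))

def observation_prob(o, weights, means, variances):
    """b_j(o_t) = sum_m C_jm * G(o_t; mu_jm, Sigma_jm).
    `weights` are the mixture coefficients C_jm (non-negative, sum to 1)."""
    return sum(c * gaussian_pdf(o, mu, var)
               for c, mu, var in zip(weights, means, variances))

# A two-mixture state evaluated on a 2-dimensional feature vector.
b = observation_prob([0.1, -0.2],
                     weights=[0.6, 0.4],
                     means=[[0.0, 0.0], [1.0, 1.0]],
                     variances=[[1.0, 1.0], [1.0, 1.0]])
```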
Fig. shows the overall block diagram of an automatic speech recognition system in the
recognition phase. The continuous input speech utterance is segmented into frames by the
preprocessing module. In the next step, the feature extraction module extracts a feature vector
on each frame to represent its acoustic information. Hence, a discrete sequence of feature
vectors (observations), O = (o1o2 ... oT ), is obtained. In an utterance classification task with
vocabulary size v, the unknown input speech is compared with all of the HMMs λi according to
some search algorithms, and finally, the input speech is identified as one of the reference
HMMs with the highest score. In most HMM-based systems, Viterbi algorithm [1] is the core
of the recognition procedure. Viterbi algorithm is a full search method that tries all possible
solutions to find the best alignment path of the state sequence between the input utterance and
a given HMM. The full search in HMM can be formulated as

P*(O | λ) = max over all state sequences q_1 q_2 ... q_T of
[ π_q1 b_q1(o_1) · Π_{t=2..T} a_{q_t−1, q_t} b_q_t(o_t) ],

where q_t is the state at time t. The sequence q_1 q_2 ... q_T denotes an alignment of the observation
sequence and the speech HMM, and T is the length of the observation sequence. Obviously, as the
search space increases, the computational cost grows exponentially, as O(N^T); an exhaustive
search is therefore impractical. The Viterbi algorithm instead extracts the alignment
path dynamically by the recursive procedure

LL_t(j) = max_{1≤i≤N} [ LL_{t−1}(i) + log a_ij ] + log b_j(o_t),

where LL_t(j) is the partial cost function of the alignment path in state j at time t and LL_{t−1}(i) is
the score of the best path among possible paths that start from first state and end in the ith state
at time t − 1. Fig. 2 shows a Viterbi trellis diagram in which the horizontal axis represents the
time axis of the input utterance and the vertical axis represents the possible states of the
reference HMM.
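The Viterbi recursion described above can be sketched in log-space Python; the two-state toy HMM below is illustrative, not taken from the text.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Best state path through an HMM in log-space.
    log_B[t, j] = log b_j(o_t); complexity is O(N^2 * T)."""
    T, N = log_B.shape
    LL = np.empty((T, N))
    back = np.zeros((T, N), dtype=int)
    LL[0] = log_pi + log_B[0]
    for t in range(1, T):
        for j in range(N):
            scores = LL[t - 1] + log_A[:, j]   # LL_{t-1}(i) + log a_ij
            back[t, j] = int(np.argmax(scores))
            LL[t, j] = scores[back[t, j]] + log_B[t, j]
    path = [int(np.argmax(LL[-1]))]            # backtrack from best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(LL[-1].max())

# Toy 2-state model: observations favor state 0 first, then state 1.
log_pi = np.log([0.9, 0.1])
log_A = np.log([[0.6, 0.4], [0.1, 0.9]])
log_B = np.log([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]])   # rows = time steps
path, score = viterbi(log_pi, log_A, log_B)
```

Working in log-space turns the products of probabilities into sums and avoids numerical underflow on long utterances.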
The computational complexity of this method is O(N²T). Although it saves
computational cost and memory, in practice it can only be used where
the length of the input utterance is short and the number of HMM reference models is small.
In particular, for continuous speech recognition this is not usually the case. Hence, to
overcome this deficiency, the Viterbi beam search [4] has been presented. The main idea
in beam search is to keep and extend only the paths with higher scores. This approach may
sacrifice the optimality of the algorithm.
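A minimal sketch of the beam idea: at each time step, states whose partial score falls more than a fixed beam width below the current best are pruned before being extended. The toy model is illustrative.

```python
import numpy as np

def viterbi_beam(log_pi, log_A, log_B, beam=5.0):
    """Viterbi scoring with beam pruning: partial paths scoring more than
    `beam` below the current best are discarded before extension,
    trading guaranteed optimality for speed."""
    T, N = log_B.shape
    LL = log_pi + log_B[0]
    for t in range(1, T):
        LL = np.where(LL >= LL.max() - beam, LL, -np.inf)   # prune
        LL = np.array([(LL + log_A[:, j]).max() + log_B[t, j]
                       for j in range(N)])
    return float(LL.max())

log_pi = np.log([0.9, 0.1])
log_A = np.log([[0.6, 0.4], [0.1, 0.9]])
log_B = np.log([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]])
exact = viterbi_beam(log_pi, log_A, log_B, beam=1e9)   # no pruning
fast = viterbi_beam(log_pi, log_A, log_B, beam=1.0)    # aggressive beam
```

With a very wide beam the result matches full Viterbi; narrowing the beam speeds the search but can drop the globally best path.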
In this paper, a novel approach is proposed that applies a particle swarm optimization
(PSO) strategy in the recognition phase of a speech recognition system, instead of the traditional
Viterbi algorithm, to exploit PSO's performance in finding the globally optimum segmentation.
Preliminary results of this work were reported in [17]. To explore the performance of the
proposed system, experiments were conducted on isolated word recognition and
stop consonant phones. Stop consonant classification is one of the most challenging tasks in
speech recognition. In addition, a new classification method based on a tied segmentation
strategy is introduced. The method can be generalized to the continuous speech recognition
case. The remainder of this paper is organized as follows. The next section provides the details
of the proposed PSO-based recognition procedure. Section 3 presents the experimental results,
and in the last section the paper is concluded.
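For readers unfamiliar with PSO, a generic sketch of the strategy follows; the sphere objective stands in for the (negative) path log-likelihood that the paper actually optimizes, and the coefficients are conventional defaults, not the paper's tuned values.

```python
import random

def pso_minimize(f, dim, n_particles=8, iters=100,
                 w=0.7, c1=1.5, c2=1.5, lo=-5.0, hi=5.0):
    """Plain particle swarm optimization: each particle moves under the
    pull of its personal best and the swarm's global best position."""
    rnd = random.Random(0)  # fixed seed for reproducibility
    X = [[rnd.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    V = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in X]
    pcost = [f(x) for x in X]
    g = min(range(n_particles), key=lambda i: pcost[i])
    gbest, gcost = pbest[g][:], pcost[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                V[i][d] = (w * V[i][d]
                           + c1 * rnd.random() * (pbest[i][d] - X[i][d])
                           + c2 * rnd.random() * (gbest[d] - X[i][d]))
                X[i][d] += V[i][d]
            c = f(X[i])
            if c < pcost[i]:
                pbest[i], pcost[i] = X[i][:], c
                if c < gcost:
                    gbest, gcost = X[i][:], c
    return gbest, gcost

# Toy objective standing in for a negative path log-likelihood.
best, cost = pso_minimize(lambda x: sum(v * v for v in x), dim=3)
```

In the paper's setting each particle would encode a candidate segmentation of the utterance rather than a point in Euclidean space, but the personal-best/global-best update structure is the same.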
Experimental results
Experimental setup
In the preprocessing stage of both systems, the audio signal is transformed into 26-
dimensional MFCC feature vectors. In the word recognizer, feature vectors are extracted in
20 ms windows of the utterance using overlapping 8 ms sliding frames. In contrast, in our phone
classification test bed, the preprocessor produced the feature vectors every 10 ms for 16 ms
windows. The first 12 features are based on 25 mel-scaled filter bank coefficients, the 13th
element is a log energy coefficient and the 13 remaining features are their first derivatives.
The tests were simulated in Matlab 7.6. A summary of implementation
parameters is given in Tables 2 and 3, and Fig. 5 shows the block diagram of our system.
The figures describe the overall behavior of the suggested system. Fig. 6 shows the effect of the
particle definition on the convergence ratio with respect to the Viterbi path likelihood. When the
particles are represented as state sequence vectors, the initial step starts from lower ratios. In
addition, the recognition system is easily trapped in a local optimum. The experiments show that if
particles are defined as segmentation vectors and movement updating follows the
second method in Section 2.3, the probability of finding the global optimum and approaching the
Viterbi path likelihood increases. Therefore, Segment-ProbPSO is taken as the baseline
method in all of the experiments, and optimizations have also been performed using this
method. The figure shows the effect of the movement structure. Although both curves start from a
common point, their convergence rates differ. Therefore, if the particle movement
across generations is defined as a probabilistic structure, it is more probable to find the global
optimum.
The results in Table 4 reveal that the Viterbi and Segment-ProbPSO algorithms have equal
error rates on average. Although we have applied the Viterbi recognition procedure
as the benchmark for our system, the proposed method provides better results in some cases.
This indicates the major drawback of the traditional recognition process, which makes
the decision by comparing the best paths between the unknown uttered word and the given
word models. The figure shows an example of a comparison between the Viterbi and Segment-ProbPSO
recognition processes over 10 iterations. The unknown input utterance belongs to word model 1.
This test sample is recognized correctly by the Viterbi algorithm, while
the proposed algorithm recognizes it as the second word model after 10 iterations.
However, it is obvious that, after more generations, the system achieves the correct result.
Therefore, more iterations were required to obtain sufficient accuracy.
Fig: Speech recognizer block diagram
Fig: Convergence ratio (LL.Viterbi/LL.PSO, %) versus iteration for Segment.ProbPSO vs. SS.ProbPSO (top panel) and Segment.ProbPSO vs. SS.LCPSO (bottom panel).
Table: Comparison of error rates.
In most cases, the difference between the competing models' likelihoods is large and the
desired accuracy is reached in the first iterations.
Figs. 9 and 10 report the system optimization results. Fig. 9 shows the influence of the α, β and
γ coefficients on the recognition error rate under fixed conditions for eight reference word
models in 20 iterations. The optimum value of β is 5 for α = 15 and γ = 15. The optimum value
of α is 5 for β = 5, γ = 15 and the optimum value of γ is 10 for α = 5 and β = 5.
Fig. 10 shows the effect of population size on the overall error rate. The computational cost
increases with population size; therefore, the optimum population size is the point where the
curve saturates. This value is about eight particles in the empirical curve depicted in Fig. 10.
In the classification task, the proposed algorithm's computational cost is almost equivalent to,
or even less than, that of the Viterbi algorithm.
Tied segmentation method results
The results of the tied segmentation method in Table 6 show that both classifier types
have almost the same performance. However, with this method, the convergence rate to the
desired accuracy is higher than with the previously proposed methods.
Fig: An example comparison of likelihood values versus number of iterations for the Viterbi
and Segment-ProbPSO methods for four reference word models.
Fig: The influence of α, β and γ on the recognition error rate for eight reference word models in 20 iterations.
Fig: The effect of population size on recognition error rate for 20 iterations and α = β = 5 and γ = 10.
Fig: Convergence percentage of Segment-ProbPSO's gbest likelihood values to the Viterbi path likelihood.
Table: Phone classifier error rates (%).

#Particles \ #Iterations      1      2      3      4      5      6
1                         39.12  36.94  35.63  34.88  34.22  33.12
2                         36.65  34.86  33.72  33.04  32.64  32.31
3                         35.38  33.35  32.48  31.85  31.58  31.46
4                         34.26  32.54  31.98  31.69  31.27  31.15
5                         33.04  32.37  31.69  31.19  30.92  30.90

Table: Phone classifier error rates (%).

#Particles \ #Iterations      1      2      3      4      5      6
1                         39.12  37.58  36.34  34.68  33.97  33.02
2                         36.65  34.95  33.31  32.91  32.43  31.91
3                         35.38  33.24  32.48  31.73  31.60  31.20
4                         34.26  32.20  31.46  31.15  31.02  30.94
5                         33.04  31.83  31.15  30.90  30.64  30.22
Given the previous N-1 words, an N-gram model predicts the most frequently occurring
word that can follow the sequence. The model is a probabilistic language model trained on a
collection of text. This model is useful in applications such as speech recognition and machine
translation. A simple model has some limitations that can be improved by smoothing,
interpolation, and back-off. So, the N-gram language model is about finding probability
distributions over sequences of words. Consider the sentences "There was heavy rain" and
"There was heavy flood".
By using experience, it can be said that the first statement is good. The N-gram language
model tells us that "heavy rain" occurs more frequently than "heavy flood". So, the first
statement is more likely to occur and it will then be selected by this model. In the one-gram
(unigram) model, the model relies only on which word occurs often, without considering the
previous words. In the 2-gram (bigram) model, only the previous word is considered for
predicting the current word; in the 3-gram (trigram) model, the two previous words are considered.
In the N-gram language model the following probabilities are calculated:
P (“There was heavy rain”) = P (“There”, “was”, “heavy”, “rain”) = P (“There”) P (“was”
|“There”) P (“heavy”| “There was”) P (“rain” |“There was heavy”).
As it is not practical to calculate the conditional probability directly, the "Markov
assumption" is used to approximate it with the bi-gram model as [4]:
P ("There was heavy rain") ~ P ("There") P ("was" | "There") P ("heavy" | "was") P ("rain"
| "heavy")
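The bigram approximation above can be sketched directly from counts; the three-sentence corpus is a toy stand-in:

```python
from collections import Counter

corpus = ["there was heavy rain", "there was heavy rain again",
          "there was heavy flood"]
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))   # consecutive word pairs

def bigram_prob(sentence):
    """P(w1..wn) ~ P(w1) * prod P(wi | wi-1) under the Markov assumption,
    with each conditional estimated as count(wi-1, wi) / count(wi-1)."""
    words = sentence.split()
    total = sum(unigrams.values())
    p = unigrams[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p
```

Since "heavy rain" is counted twice and "heavy flood" once, the model assigns the first sentence the higher probability, as the text describes.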
Applications of the N-gram Model in NLP
In speech recognition, the input can be noisy, and this noise can cause incorrect
speech-to-text conversion. The N-gram language model corrects such errors by using probability
knowledge. Likewise, this model is used in machine translation for producing more natural
statements in the target and source languages. For spelling error correction, the dictionary
alone is sometimes useless. For instance, in "in about fifteen minuets", "minuets" is a valid word
according to the dictionary but it is incorrect in the phrase. The N-gram language model can
rectify this type of error.
The N-gram language model generally operates at the word level. It is also used at the
character level for stemming, i.e. for separating the root word from a suffix. Using the
N-gram model, languages can be classified, or US and UK spellings can be differentiated.
Many applications benefit from the N-gram model, including part-of-speech tagging,
natural language generation, word similarity, and sentiment extraction.
The N-gram language model also has some limitations. There is a problem with
out-of-vocabulary words: words that appear during testing but not in training. One solution is
to use a fixed vocabulary and convert out-of-vocabulary words in the training data to
pseudowords. When applied to sentiment analysis, the bi-gram model outperformed the
uni-gram model, but the number of features then doubled. So scaling the N-gram model to
larger data sets, or moving to a higher order, needs better feature-selection approaches.
Finally, the N-gram model captures long-distance context poorly: it has been shown that
beyond 6-grams, the performance gain is limited.
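The pseudoword solution can be sketched as follows, assuming a small fixed vocabulary; `<UNK>` is a conventional choice of pseudoword, though any placeholder token would work:

```python
# Fixed vocabulary built from training data (a toy set for illustration);
# any token outside it is mapped to the pseudoword <UNK>.
vocabulary = {"there", "was", "heavy", "rain"}

def map_oov(tokens):
    """Replace out-of-vocabulary tokens with the <UNK> pseudoword."""
    return [t if t in vocabulary else "<UNK>" for t in tokens]

print(map_oov("there was heavy snowfall".split()))
# → ['there', 'was', 'heavy', '<UNK>']
```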
We are all familiar with digital assistants such as Google Assistant, Cortana, Siri, and Alexa.
Google Assistant and Siri are used by over 1 billion people globally, and Siri has over 40
million users in the US alone. But, have you ever wondered how these tools understand what
you say? Well, they use speech to text AI.
Artificial intelligence was initially used to analyze and quickly compute data, but it is now
used to perform tasks that previously could only be performed by humans. Artificial
intelligence is often confused with machine learning. Machine learning is a subset of artificial
intelligence and refers to the process of teaching a machine to recognize and learn from
patterns rather than teaching it rules.
Computers are trained by feeding large volumes of data to an algorithm and then letting
it pick out the patterns and learn. In the nascent days of machine learning, programmers had to
write code for every object they wanted the computer to recognize – e.g., a cat vs. a human.
These days, computers are shown numerous examples of each object. Over time, they learn
without any human input.
Challenges with Speech to Text AI
Despite the giant leap forward that AI speech to text has made over the last decade,
there remain several challenges that stand in the way of true ubiquity.
The first of these is accuracy. The best applications currently boast a 95% accuracy rate – first
achieved by Google Cloud Speech in 2017. Since then, many competitors have made great
strides and achieved the same rate of accuracy.
While this is good progress, it means that there will always be a 5% error rate. This
may seem like a small figure – and it is, where the issue at hand is a transcript that can be
quickly edited by a human to correct errors. But, it is a big deal where voice is used to give a
command to the computer. Imagine asking your car’s navigator to search the map for a
particular location, and it searches for something different and sends you on your way in the
wrong direction because it didn’t quite catch what you said.
The other challenge is that humans don’t just listen to each other’s voices to understand
what is being said. They also observe non-verbal communication to understand what is being
communicated but isn’t being said. This includes facial expressions, gestures, and body
language. So, while computers can hear and understand the content, we are a long way from
getting to a point where they can pick up on non-verbal cues. The emotional robot that can
hear, feel and interpret like a human is the holy grail of speech recognition.

In speech
recognition, the computer takes input in the form of sound vibrations. This is done by making
use of an analog to digital converter that converts the sound waves into a digital format that
the computer can understand. Advanced speech recognition in AI also comprises AI voice
recognition where the computer can distinguish a particular speaker’s voice.
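The analog-to-digital conversion step can be illustrated with a small simulation. The sample rate and bit depth below are typical values for speech audio, and the pure sine tone is a stand-in for real sound vibrations hitting a microphone:

```python
import math

SAMPLE_RATE = 16_000   # samples per second, common for speech audio
BIT_DEPTH = 16         # bits per sample

def sample_and_quantize(freq_hz, duration_s):
    """Simulate an ADC: sample a pure tone and quantize to signed integers."""
    max_amp = 2 ** (BIT_DEPTH - 1) - 1   # 32767 for 16-bit audio
    n_samples = int(SAMPLE_RATE * duration_s)
    return [
        int(max_amp * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE))
        for n in range(n_samples)
    ]

samples = sample_and_quantize(440.0, 0.01)   # 10 ms of a 440 Hz tone
print(len(samples), min(samples), max(samples))
```

The resulting list of integers is the digital format the rest of the recognition pipeline works with.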
Chapter 6: PROS AND CONS OF SPEECH RECOGNITION
Speech recognition, one of the most important biometric technologies, has become an
increasingly popular concept in recent years. The technology is widely used for the various
advantages it provides. It allows documents to be created quickly, because the software
generally produces words as fast as they are uttered, which is usually much faster than a
person can type. This chapter gives the advantages and disadvantages of speech recognition
to help understand the topic better.
There are several advantages to using speech recognition software, including the following:
Readily accessible: This software is frequently installed in computers and mobile devices,
making it accessible.
Easy to use: Well-designed software is straightforward to operate and often runs in the
background.
It is fairly accurate: Although it should always be proofread, speech recognition software can
result in a document more or less free of errors. In addition, newer programs tend to be well
designed and can offer reliable results for some applications.
It allows for hands-free work: When working with a client or completing a task, the use of
speech recognition tools facilitates easy note taking, use of other materials, and professional
eye contact. Each of these activities is limited when someone has to type information into a
computer behind a screen.
Security: With this technology, a powerful interface between human and computer is created,
as the voice recognition system responds only to enrolled (pre-recorded) voices; hence there
is little scope for tampering with data or breaking codes.
Productivity: It reduces workload, as operations are performed through voice commands;
paperwork is cut to a minimum and the user can work more comfortably, whatever the task.
Advantage for the handicapped and blind: This technology is a great boon for blind and
handicapped users, who can utilize voice recognition for their daily work.
Usability across languages: Since speech recognition needs only the voice, and the audio is
captured regardless of the language in which it is spoken, the technology can be adapted for
use in many languages.
Personal voice macros can be created: Everyday tasks such as sending and receiving mail or
drafting documents can be done easily, and many tasks can be sped up.
While convenient, speech recognition technology still has a few issues to work through.
Limitations include:
Accuracy is always imperfect: “More or less accurate” is not perfectly accurate. This is a
very important factor to consider when choosing tools, especially for medical or legal needs.
In these cases, accuracy is paramount: a client’s professional reputation hangs in part on how
well they present themselves in writing.
Some voices don’t come across well: Speech recognition software may not be able to
transcribe the words of those who speak quickly, run words together or have an accent.
It also drops in accuracy when more than one speaker is present and being recorded.
Inconsistent performance: The systems may be unable to capture words accurately because
of variations in pronunciation, lack of support for some languages and inability to sort through
background noise. Ambient noise can be especially challenging. Acoustic training can help
filter it out, but these programs aren't perfect. Sometimes it's impossible to isolate the human
voice.
Speed: Some speech recognition programs take time to deploy and master. The speech
processing may feel relatively slow.
Source file issues: Speech recognition success depends on the recording equipment used, not
just the software.
Chapter 7: FEATURES OF SPEECH RECOGNITION
Good speech recognition programs let users customize them to their needs. The
features that enable this include:
Language weighting: This feature tells the algorithm to give special attention to certain words,
such as those spoken frequently or that are unique to the conversation or subject. For example,
the software can be trained to listen for specific product references.
Acoustic training: The software tunes out ambient noise that pollutes spoken audio.
Software programs with acoustic training can distinguish speaking style, pace and volume amid
the din of many people speaking in an office.
Speaker labeling: This capability enables a program to label individual participants and
identify their specific contributions to a conversation.
Profanity filtering: Here, the software filters out undesirable words and language.
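Profanity filtering, the simplest of these features, can be sketched as a post-processing pass over the transcript. The block list here is a placeholder for illustration, not any real product's word list:

```python
import re

# Placeholder block list; a real filter ships with a much larger list.
BLOCKED = {"darn", "heck"}

def filter_profanity(transcript):
    """Mask any blocked word in the transcript with asterisks."""
    def mask(match):
        word = match.group(0)
        return "*" * len(word) if word.lower() in BLOCKED else word
    return re.sub(r"[A-Za-z']+", mask, transcript)

print(filter_profanity("Well darn, the demo crashed again"))
# → Well ****, the demo crashed again
```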
Chapter 8: APPLICATIONS
Speech recognition technologies such as Alexa, Cortana, Google Assistant and Siri are
changing the way people interact with their devices, homes, cars, and jobs. The technology
allows us to talk to a computer or device that interprets what we’re saying in order to respond
to our question or command. With a long history of development and innovation, it was the
introduction of these artificial intelligence voice-controlled assistants, or digital assistants,
into the voice recognition market that changed the landscape of this technology in the 21st
century. With digital assistants quickly becoming ubiquitous in various aspects of life,
understanding their capabilities and applications is paramount to individuals, businesses, and
organisations.
Digital assistants are designed to help people perform or complete basic tasks and
respond to queries. With the ability to access information from vast databases and various
digital sources, these assistants help to solve problems in real time, enhancing the user
experience and human productivity. Early voice recognition programs weren’t sophisticated
enough to understand everyone’s voice, so many users were disappointed. However, voice
recognition has made an enormous amount of progress since then.
It’s present in our smartphones and on our computers, and it is used in a wide variety
of industries. The applications of voice recognition seem almost endless! Here are some of the
top trends and applications when it comes to voice recognition technology.
Popular digital assistants and speech recognition applications include:
Chapter 8.1: GOOGLE Voice Assistant
The privacy policy of Google Assistant states that it does not store the audio data
without the user's permission, but may store the conversation transcripts to personalise its
experience. Personalisation can be turned off in settings. If a user wants Google Assistant to
store audio data, they can go to Voice & Audio Activity (VAA) and turn on this feature. Audio
files are sent to the cloud and used by Google to improve the performance of Google Assistant,
but only if the VAA feature is turned on.
The privacy policy of Amazon's virtual assistant, Alexa, states that it only listens to
conversations when its wake word (like Alexa, Amazon, Echo) is used. It starts recording the
conversation after the call of a wake word, and stops recording after 8 seconds of silence. It
sends the recorded conversation to the cloud. It is possible to delete the recording from the
cloud by visiting ‘Alexa Privacy’ in ‘Alexa’.
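The wake-word and eight-second-silence behaviour described above can be sketched as a simple state machine. The event stream below is a stand-in for real audio processing, and the details are assumptions for illustration, not Amazon's actual implementation:

```python
WAKE_WORDS = {"alexa", "amazon", "echo"}
SILENCE_TIMEOUT_S = 8.0

def segment_utterance(events):
    """Record words between a wake word and 8 seconds of silence.

    `events` is a list of (timestamp_seconds, word_or_None) pairs, where
    None marks a silent audio chunk.
    """
    recording, recorded, last_speech = False, [], None
    for t, word in events:
        if not recording:
            if word and word.lower() in WAKE_WORDS:
                recording, last_speech = True, t   # wake word heard: start
        else:
            if word is None:
                if t - last_speech >= SILENCE_TIMEOUT_S:
                    break                          # 8 s of silence: stop
            else:
                recorded.append(word)
                last_speech = t
    return recorded

events = [(0, "alexa"), (1, "play"), (2, "music"), (5, None), (11, None)]
print(segment_utterance(events))  # ['play', 'music']
```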
Apple states that it does not record audio to improve Siri. Instead, it uses transcripts.
Transcript data is only sent if it is deemed important for analysis. Users can opt out anytime if
they don't want Siri to send the transcripts to the cloud.
Voice shopping is exactly what it sounds like: using your voice to make purchases. Just
as users have become accustomed to using these devices for carrying out simple search
requests or for operating devices around the home, they are also using them to make purchases
ranging from pizza to paper and from movie tickets to big-ticket items.
Here’s how a typical voice shopping event might take place:
Voice shopping has already become one of the biggest trends in retail. A TechCrunch
report showed that just 13 percent of U.S. homes have a smart speaker system, but that more
than a third of those who had these systems used them to make purchases on a regular basis.
These purchases generated an estimated $2 billion in revenue in 2017.
Fig: Voice Shopping
A 2017 survey of more than 1,600 shoppers showed that about one-quarter of those
surveyed owned at least one voice-activated digital assistant, with another 20 percent expected
to purchase one in the coming year. While just under 20 percent of those surveyed had made
purchases through voice shopping, the survey also showed that more than 40 percent of
millennials had used their voice-activated assistants for purchases in the preceding year.
"Voice commerce represents the next major disruption in the retail industry," said John
Franklin, Associate Partner at OC&C. "Just as e-commerce and mobile commerce changed
the retail landscape, shopping through smart speakers promises to do the same."
Many major retailers have taken notice of the voice shopping trend by forming
partnerships with smart speaker providers.
For users who don't specify a particular brand, Amazon will recommend an “Amazon's
Choice” brand, or even an Amazon-branded product where applicable. This capability gives
Amazon and its partners a leg up.
A voice assistant is a natural fit for a smart TV. It is especially helpful if you don’t
want to get up and find the remote control, and why waste time when you can use your voice
to navigate? All you have to do is press and hold the microphone button and speak into it
calmly; there is no need for the TV to be actively listening, and you don’t need to yell
commands from across the room. Depending on the version of voice recognition in your TV,
you can flip through channels, browse and search content, and open and close apps. Voice
recognition tech also enables you to change the sound mode or look up information with a
smart assistant.
Most people have a number of online accounts, and these accounts need protection. Many of
them, such as online banking applications, pose significant security hazards. Internet banking
has become widespread in Industry 4.0, so there is a high need for proper ID systems to be
put in place. These ID systems would ensure that only the account owner has access to
important info. Voice identification is a more recent kind of user identification; a speech
authentication factor aids in identifying your distinctive voice.
It is very much similar to the AI assistants that respond to your voice. This method can be
utilized as a unique ‘password’ to unlock secured accounts, since everyone has a distinctive voice.
Biometric authentication, unlike passwords or token-based authentication, uses unique
biological characteristics to verify an individual’s identity. It’s harder to spoof and generally
more convenient for users since they don’t have to remember passwords or carry a physical
token that can easily be lost or stolen.
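Voice authentication of this kind is often implemented by comparing fixed-length "voiceprint" embeddings. The sketch below uses toy hand-written vectors and an illustrative threshold; a real system would derive the embeddings from audio with a trained speaker-encoder model:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "voiceprint" embeddings, hand-written for illustration.
enrolled = [0.9, 0.1, 0.4]          # stored at enrollment time
attempt_same = [0.85, 0.15, 0.38]   # same speaker, slightly different audio
attempt_other = [0.1, 0.9, -0.3]    # a different speaker

THRESHOLD = 0.95  # illustrative value; real systems tune this carefully

def authenticate(probe, reference=enrolled):
    """Accept the probe only if its voiceprint is close to the enrolled one."""
    return cosine_similarity(probe, reference) >= THRESHOLD

print(authenticate(attempt_same), authenticate(attempt_other))
```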
One of the most startling trends in speech recognition is the use of this technology to assist in
the identification of criminals. If a crime suspect’s voice is recorded, the audio can now be
utilized as key evidence. AGNITIO and Morpho (Safran) are currently working together to
deliver Voice ID technology for forensics.
This product enables the use of speech biometrics technology all around the world.
This would help to identify persons and do background checks. It will aid in added checks with
fingerprints and other methods. This technology can match recorded or live voices in
seconds. It also has a 99 percent accuracy rate. Furthermore, speech recognition does not
distinguish between different accents or languages. It detects the sound of a person’s voice
rather than the words or language they speak.
The touchscreen has revolutionized the way we engage with our gadgets. Voice
recognition also has the potential to do the same. This will be a great leap in technology. The
top reason to use voice recognition is that it is easier and quicker than typing. Smart
speakers, assistants, and IoT devices are pioneering speech UIs. But as technology advances,
there will be a few changes: these devices will deal with the nuances of our speech more
successfully, allowing them to achieve more.
Voice UI will have an impact on the regular mobile app business as well. Voice
recognition and speech UIs will especially help the elderly and people with vision
impairment: instead of browsing with swipes and clicks, we can simply talk. Improved ASR
will also benefit home automation devices, from robotics to interactive toys. Speech UI will
also open the way for richer human interaction with domestic technology.
Chapter 9: CONCLUSION
This report aimed to present the most recent and significant speech recognition
developments, which you can incorporate into your products, processes, and business
platforms. Voice recognition promises a rosy future and offers a wide variety of services. The
next generation of voice recognition technology consists of neural networks using artificial
intelligence. These networks are formed from interconnected nodes that process input in
parallel for fast evaluation. Like human beings, they learn new patterns of speech
automatically.
REFERENCES
1. https://www.techtarget.com/searchcustomerexperience/definition/speech-recognition
2. https://www.knowledgenile.com/blogs/what-are-new-technology-trends-in-speech-recognition/