
Chapter 8

Applications of NLP – Part II

Department of Computer Science


School of Computing
Dire Dawa Institute of Technology
Dire Dawa University

Tessfu Geteye (PhD)

2020/2021, Semester II
Outline

 Speech Recognition
 Speech Recognition Processes
 Types of Automatic Speech Recognition
 Difficulties with ASR
 Speech Recognition Approaches
 Speech Recognition Performance Evaluation
 NLP in Speech Recognition
 Optical Character Recognition

Automatic Speech Recognition

 Human-machine and human-human interactions are being significantly improved using different Human Language Technologies (HLTs), including:

 Text-based HLTs: Text Summarization, Machine Translation, and others.

 Speech-based HLTs: Automatic Speech Recognition (ASR), Speech Synthesis, Speech Translation, and others.

 Speech-based HLTs are more convenient in terms of communication efficiency, physical restrictions, and accuracy.

 ASR systems are used for transcribing speech sequences into the corresponding textual representation:

        Speech → [ASR] → Text


Automatic Speech Recognition

 Building blocks of ASR systems:


Automatic Speech Recognition

 ASR is defined as finding the most likely word sequence given the acoustic observations:

        Ŵ = argmax_W P(W | O) = argmax_W P(O | W) · P(W)

 Acoustic and Lexical models, which compute the likelihood P(O | W).

 Language model, which computes the prior probability P(W).

 W - sequence of words

 O - acoustic observation sequence


Automatic Speech Recognition


 The major building blocks of ASR system are:

 Acoustic Model

 Language Model

 Lexical Model

 Decoder
 Acoustic Model:

 It is used in ASR to represent the relationship between an audio signal and the
phonemes or other linguistic units that make up speech.

 The model is learned from a set of audio recordings and their corresponding
transcripts.

 It typically deals with the raw audio waveforms of human speech, predicting which phoneme each waveform segment corresponds to, typically at the character or subword level.

 It defines the probability that a basic sound unit, or phoneme, has been uttered.

 It represents the relationship between the speech signal and the linguistic or
acoustic units in the language.

 It can be developed via different approaches.
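As a hedged illustration of the acoustic front-end such a model consumes, the sketch below extracts MFCC features with the librosa library (an assumed dependency; "speech.wav" is a hypothetical recording) and treats the acoustic model as a per-frame phoneme scorer with a placeholder scoring function.

```python
# A minimal acoustic front-end sketch, assuming librosa is installed and
# "speech.wav" is a hypothetical recording.
import librosa
import numpy as np

# Load the waveform and compute 13 MFCC coefficients per 25 ms frame.
signal, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)   # shape: (13, n_frames)

# An acoustic model then scores P(phoneme | frame) for each frame.
# A placeholder uniform model stands in for a trained GMM or DNN here.
phonemes = ["a", "b", "e", "sil"]
def acoustic_scores(frame: np.ndarray) -> dict:
    """Return a (placeholder) probability for each phoneme given one frame."""
    return {p: 1.0 / len(phonemes) for p in phonemes}

scores = [acoustic_scores(frame) for frame in mfcc.T]
print(len(scores), "frames scored")
```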


Automatic Speech Recognition

 Language model:

 It defines the probability of the occurrence of a word or a word sequence.

 It provides context to distinguish between words and phrases that sound phonetically similar.

 It can be developed via a statistical approach (n-gram) or a neural network approach.

 Lexical Model:

 It is also called the Vocabulary or Lexicon Model.

 It contains information on how words are formed from phoneme sequences.

 It contains a list of words with their possible pronunciations in the language, as in the example below (sketched in code afterwards).

 Example: አበበ = አ ብ ኧ ብ ኧ or አበበ = አ ብኧ ብኧ
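Both the pronunciation lexicon and an n-gram language model can be represented with very simple data structures. The sketch below is a minimal illustration, not a production format: the lexicon entry mirrors the Amharic example above, and the bigram training data is invented for demonstration.

```python
# A minimal sketch of a pronunciation lexicon and a bigram language model.
from collections import defaultdict

# Lexicon: each word maps to one or more possible phoneme sequences.
lexicon = {
    "አበበ": [["አ", "ብ", "ኧ", "ብ", "ኧ"], ["አ", "ብኧ", "ብኧ"]],
}

bigram_counts = defaultdict(lambda: defaultdict(int))
unigram_counts = defaultdict(int)

def train(sentences):
    """Count unigrams and bigrams from a list of tokenized sentences."""
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            unigram_counts[w1] += 1
            bigram_counts[w1][w2] += 1

def bigram_prob(w1, w2):
    """P(w2 | w1) with add-one smoothing over the observed vocabulary."""
    vocab_size = len(unigram_counts) + 1
    return (bigram_counts[w1][w2] + 1) / (unigram_counts[w1] + vocab_size)

train([["አበበ", "በላ"]])            # toy training data
print(bigram_prob("<s>", "አበበ"))
```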


Automatic Speech Recognition

 Decoder:

 It combines acoustic, language, and lexical models given the feature vector
sequence and the hypothesized word sequence, and outputs the word sequence
with the highest score as the recognition result.
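To make this scoring step concrete, the following minimal sketch picks the candidate word sequence W that maximizes P(O|W)·P(W) in the log domain. All the candidate hypotheses and their scores are toy numbers, not the output of any real acoustic or language model.

```python
# Toy acoustic likelihoods log P(O | W) and language-model priors log P(W)
# for three hypothetical candidate transcriptions of one utterance.
candidates = {
    "recognize speech":   {"log_p_o_given_w": -12.0, "log_p_w": -3.2},
    "wreck a nice beach": {"log_p_o_given_w": -11.5, "log_p_w": -7.9},
    "recognise peach":    {"log_p_o_given_w": -13.1, "log_p_w": -6.4},
}

def decode(cands):
    """Return the hypothesis with the highest combined log score."""
    return max(cands, key=lambda w: cands[w]["log_p_o_given_w"] + cands[w]["log_p_w"])

print(decode(candidates))   # -> "recognize speech"
```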


Types of Automatic Speech Recognition


Types of ASR based on speech

 Isolated speech

 An isolated word recognition system recognizes single utterances, i.e., one word at a time.

 It is suitable for situations where the user is required to give only one-word responses or commands, but it is very unnatural for multiple-word input.

 Connected words

 A connected word system is similar to an isolated word system, but it allows separate utterances to be "run together" with a minimal pause between them. An utterance is the vocalization of a word or words that represents a single meaning to the computer.

 Continuous speech

 A continuous speech recognition system allows users to speak almost naturally, while the computer determines the content.

 Basically, it is computer dictation: adjacent words run together without pauses or any other division between them. Continuous speech recognition systems are difficult to develop.


Types of ASR based on speech

 Spontaneous speech

 A spontaneous speech recognition system recognizes natural speech.

 Spontaneous speech is natural speech produced on the spot, without prior planning.

 An ASR system for spontaneous speech must handle a variety of natural speech features, such as words being run together; spontaneous speech may also include mispronunciations, false starts, and non-words.

 Highly Conversational Speech


Types of ASR based on Size of Vocabulary

 The vocabulary size of an ASR system can affect:

 The complexity, processing requirements, and recognition rate of the ASR system.

 Based on vocabulary size, ASR systems are classified as:

 Small vocabulary - 1 to 1,000 words

 Medium vocabulary - 1,001 to 10,000 words

 Large vocabulary - 10,001 to 100,000 words

 Very large vocabulary - more than 100,000 words

 Unlimited vocabulary - contains all potential words of the language.


Types of ASR based on Speaker model

 Speaker Dependent Models

 Speaker dependent systems are developed for a particular type of speaker.

 They are generally more accurate for that particular speaker, but can be less accurate for other types of speakers.

 These systems are usually cheaper, easier to develop, and more accurate.

 However, these systems are not as flexible as speaker independent systems.

 Speaker Independent Models

 A speaker independent system can recognize a variety of speakers without any prior training.

 A speaker independent system is developed to operate for any type of speaker.

 Its drawback is that it limits the number of words in a vocabulary.

 Implementation of Speaker Independent system is the most difficult.

 It is expensive and its accuracy is lower than speaker dependent systems.


Difficulties with ASR

 Spoken language is not equal to written language


 Noise
 Body Language
 Channel Variability
 Speaker Variability

 Speaking Style

 Speaker Sex

 Dialects


ASR Approaches


ASR Approaches: Acoustic Phonetic Approach

 It is also called rule-based approach.


ASR Approaches: Acoustic Phonetic Approach

 Use knowledge of phonetics and linguistics to guide the search process.

 Usually, some rules are defined expressing everything (anything) that might help to decode:

 Phonetics, phonology, phonotactics

 Syntax

 Pragmatics

 The typical approach is based on a "blackboard" architecture:

 At each decision point, lay out the possibilities.

 Apply rules to determine which sequences are permitted.

 Performance is poor due to:

 Difficulty of expressing the rules

 Difficulty of making the rules interact

 Difficulty of knowing how to improve the system


ASR Approaches: Pattern Recognition Approach


ASR Approaches: Pattern Recognition Approach


 Feature measurement: Filter Bank, LPC, DFT, ...

 Pattern training: Creation of a reference pattern derived from an averaging technique.

 Pattern classification: Compare speech patterns with a local distance measure and a
global time alignment procedure (DTW).

 Decision logic: similarity scores are used to decide which is the best reference pattern.
 The pattern recognition approach has two steps, namely training of speech patterns and recognition of patterns by way of a pattern classifier.

 The pattern recognition approach can be template based or stochastic/statistical.

 This approach includes many techniques, such as:
 Dynamic Time Warping (DTW)

 Vector Quantization (VQ)

 Support Vector Machine (SVM)

 Polynomial Classifier

 Hidden Markov Model (HMM)


ASR Approaches: Pattern Recognition Approach: Template Matching

 A collection of prototypical speech patterns is stored as reference patterns characterizing the dictionary of candidate words.

 Recognition is then performed by matching an unknown spoken word against each of the reference templates and choosing the category of the best-matching pattern.

 Templates are configured for all the words in the vocabulary.

 The test pattern, T, and the reference patterns, {R1, ..., Rv}, are represented by sequences of feature measurements.

 Pattern similarity is determined by aligning the test pattern, T, with a reference pattern, Rv, with distortion D(T, Rv).

 The decision rule chooses the reference pattern, R*, with the smallest alignment distortion D(T, R*).

 Dynamic time warping (DTW) is used to compute the best possible alignment warp between T and Rv, and the associated distortion D(T, Rv).
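As a hedged illustration of this alignment step, the sketch below computes a basic DTW distortion between two feature sequences using the standard dynamic-programming recurrence; real systems typically add path constraints and operate on MFCC vectors, and the toy templates here are random stand-ins.

```python
import numpy as np

def dtw_distance(T: np.ndarray, R: np.ndarray) -> float:
    """Dynamic time warping distortion D(T, R) between two feature sequences.

    T has shape (n, d), R has shape (m, d); the local cost is Euclidean distance.
    """
    n, m = len(T), len(R)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(T[i - 1] - R[j - 1])
            # Best of insertion, deletion, or match from the three neighbors.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# Toy usage: choose the reference template with the smallest distortion.
test = np.random.rand(20, 13)                     # hypothetical test pattern T
templates = {"yes": np.random.rand(18, 13),       # hypothetical references Rv
             "no": np.random.rand(22, 13)}
best = min(templates, key=lambda w: dtw_distance(test, templates[w]))
print("Recognized word:", best)
```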


ASR Approaches: Pattern Recognition Approach: Statistics-based Approach

 It can be seen as an extension of the template-based approach, using more powerful mathematical and statistical tools.

 It is sometimes seen as an "anti-linguistic" approach.

 Fred Jelinek (IBM, 1988): "Every time I fire a linguist my system improves."

 Collect a large corpus of transcribed speech recordings.

 Train the computer to learn the correspondences ("machine learning").

 At run time, apply statistical processes to search through the space of all possible solutions, and pick the statistically most likely one.
 This approach includes:
 HMM

 SVM

 DTW and Bayesian Classification approaches


 The most popular stochastic approach nowadays is hidden Markov modelling (HMM).

 An HMM is characterized by a finite-state Markov model and a set of output distributions.

 The transition parameters of the Markov model capture temporal variability, while the parameters of the output distributions capture spectral variability.


ASR Approaches: Pattern Recognition Approach: Statistics-based Approach

HMM based Speech Recognition Architecture


ASR Approaches: Pattern Recognition Approach: Statistics-based Approach

 ASR systems developed using the statistical approach are commonly built as a hybrid GMM-HMM.

 This approach is also called the conventional statistical approach.

 It has been the widely used approach for ASR for more than four decades.

 In the GMM-HMM approach:

 GMM - is used for modeling the spectral features of the speech signal, and for
estimating the emission probabilities (observation likelihoods) of the HMM states
via the expectation-maximization algorithm.

 HMM - is used for modeling the temporal features of the speech signal with respect
to the linguistic units of the language, for computing the probabilities of
observation sequences using the forward algorithm, and for finding the optimal
sequences of HMM states using the Viterbi algorithm.
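The sketch below illustrates the Viterbi recurrence used to find the most likely HMM state sequence. The two-state, two-symbol model and all its probabilities are invented for demonstration, not taken from a trained system.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state sequence for a discrete-observation HMM.

    obs: list of observation indices; pi: initial probabilities (S,);
    A: state transition matrix (S, S); B: emission matrix (S, V).
    """
    S, T = len(pi), len(obs)
    delta = np.zeros((T, S))            # best path probability ending in state s at time t
    psi = np.zeros((T, S), dtype=int)   # backpointers to the best predecessor state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for s in range(S):
            scores = delta[t - 1] * A[:, s]
            psi[t, s] = np.argmax(scores)
            delta[t, s] = scores[psi[t, s]] * B[s, obs[t]]
    # Backtrack from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]

# Toy two-state, two-symbol HMM (all numbers are illustrative).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(viterbi([0, 1, 1], pi, A, B))
```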


ASR Approaches: Pattern Recognition Approach: Statistics-based Approach

 Limitations of this approach:

 GMM is unable to model the temporal characteristics of speech.

 GMM does not model high-dimensional speech features well.

 GMM is statistically inefficient for modeling data that lie on or near a nonlinear manifold in the data space.

 HMM has relatively poor discrimination power.

 HMM ignores long-term dependencies. These limitations make HMM inaccurate but simple to implement.


ASR Approaches: Pattern Recognition Approach: Neural Network Approach

 The ANN approach attempts to mechanize the recognition procedure according to the way a person applies intelligence in visualizing, analyzing, and finally making a decision on the measured acoustic features.

 The ANN approach is a hybrid of the acoustic phonetic approach, the pattern recognition approach, and a feature-extractor approach.

 The various techniques in the Artificial Neural Network approach include:


 Time Delay Neural Network (TDNNs)
 Multi-layer Perceptron (MLP)
 Radial basis Functions (RBF)
 Recurrent Neural Network (RNN)
 Self-Organizing Map (SOM)


ASR Approaches: Artificial Intelligence: Deep Learning Approach

 Deep neural network based acoustic modeling techniques reduce the limitations of GMM-HMM ASR systems.

 Deep neural networks such as feed-forward and recurrent networks are applied:

 For acoustic modeling in ASR, as a feature extractor for a GMM-HMM system

 For replacing the GMM to develop hybrid neural network-HMM systems

 For developing ASR in an end-to-end approach

 Feed Forward DL Networks:

 In these networks, the information always travels in one direction (from the input layer to the output layer via the hidden layers) and never goes backward; a minimal sketch follows the list below.

 Those networks include:
 DNN

 CNN

 TDNN networks.
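Below is a minimal, hypothetical PyTorch sketch of a feed-forward (DNN) acoustic model that maps a spliced window of acoustic features to per-frame phoneme posteriors. The layer sizes, class count, and use of PyTorch are assumptions for illustration, not the prescription of any particular toolkit.

```python
import torch
import torch.nn as nn

class FeedForwardAcousticModel(nn.Module):
    """DNN mapping a spliced feature window to phoneme-class log posteriors."""

    def __init__(self, feat_dim=440, hidden_dim=1024, num_phonemes=42):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_phonemes),
        )

    def forward(self, x):
        # x: (batch, feat_dim) -> log posteriors over phoneme classes
        return torch.log_softmax(self.net(x), dim=-1)

model = FeedForwardAcousticModel()
frames = torch.randn(8, 440)          # a batch of 8 hypothetical feature windows
print(model(frames).shape)            # torch.Size([8, 42])
```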


ASR Approaches: Artificial Intelligence: Deep Learning Approach

 Recurrent DL Networks

 RNNs are cyclic networks with self-connections: the outputs from previous time steps are used as inputs to the current time step.

 These networks capture a dynamic history of information about the input feature sequences and are less influenced by temporal distortion.

 Unlike feed-forward DL networks, RNNs can take a long sequence of input features and generate a long sequence of output values.

 Consequently, these networks are better at modeling long-term dependencies among frames of input features.

 The common RNNs are:


 conventional RNN
 LSTM
 GRU


ASR Approaches: Artificial Intelligence: Deep Learning Approach

 Data Sharing DL Networks

 Data sharing DL networks are vital for minimizing the overfitting problems of unilingual feed-forward networks and RNNs in low-resource ASR.

 These networks include multitask, multilingual, and weight-transfer learning techniques.

 Multitask learning is used to improve the overall performance of a learning task by jointly learning multiple associated tasks.
 This helps to transfer knowledge between or among tasks if the tasks
are associated with each other and share an internal representation
by joint learning.
 Example: Train ASR system for Amharic and Chaha languages.


ASR Approaches: Artificial Intelligence: Deep Learning Approach

 Multilingual learning is a special type of MTL in which multiple languages are trained jointly without specifying primary and ancillary languages.

 All languages have the same impact on the training of multilingual DL models.

 It is not mandatory that the languages are related to each other, and thus this technique allows training the DL models using several training corpora from multiple languages.

 Training can be done in two ways: with shared phone sets or with shared hidden layers among languages.
 Example: Train ASR of multiple languages jointly


ASR Approaches: Artificial Intelligence: Deep Learning Approach

 Weight-transfer technique: considers two major language classes, source and target.

 Source languages are typically high-resource languages with sufficient training corpora for training DL models, while the target language is usually a low-resource language whose limited training corpus is insufficient to train DL models.

 This technique allows transferring the weights from DL models trained on the source languages to train the target language.

 The hidden layers of the source DL models are trained using either unilingual or multilingual training corpora; the source output layer is then discarded and replaced with a new target-language output layer, whose node weights and biases are randomly initialized.

 Finally, either all the hidden layers are frozen and only the added output layer is trained, or all the hidden layers and the added output layer are retrained using a small training dataset of the target language; a minimal sketch follows.

 This technique is important for developing ASR systems for languages that have very limited training datasets, no known phone sets, and no well-defined orthographic systems.
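As a hedged sketch of the weight-transfer idea, the PyTorch code below reuses the hidden layers of a hypothetical source-language model, replaces the output layer for the target language, and optionally freezes the hidden layers. All names, layer sizes, and output-unit counts are illustrative assumptions.

```python
import torch.nn as nn

def build_target_model(source_model: nn.Sequential,
                       num_target_units: int,
                       freeze_hidden: bool = True) -> nn.Sequential:
    """Transfer hidden layers from a source-language model to a target language.

    The source output layer is discarded and replaced by a freshly initialized
    layer sized for the target language's output units.
    """
    hidden_layers = list(source_model.children())[:-1]    # drop source output layer
    if freeze_hidden:
        for layer in hidden_layers:
            for p in layer.parameters():
                p.requires_grad = False                    # train only the new layer
    hidden_dim = list(source_model.children())[-1].in_features
    new_output = nn.Linear(hidden_dim, num_target_units)  # randomly initialized
    return nn.Sequential(*hidden_layers, new_output)

# Hypothetical source model trained on a high-resource language.
source = nn.Sequential(nn.Linear(440, 1024), nn.ReLU(),
                       nn.Linear(1024, 1024), nn.ReLU(),
                       nn.Linear(1024, 3000))              # source output units
target = build_target_model(source, num_target_units=1500)
print(target)
```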


ASR Performance Evaluation

 The performance of speech recognition is specified in terms of accuracy and speed.

 Accuracy is measured by the Word Error Rate (WER).

 Speed is measured by the Real Time Factor (RTF).

 Word Error Rate (WER)

 It is a common metric of speech recognition performance. Because the recognized word sequence can have a different length from the reference word sequence, the two sequences are first aligned, and WER is then computed as:

        WER = (S + D + I) / N

 Where S - is the number of substitutions
       D - is the number of deletions
       I - is the number of insertions
       N - is the number of words in the reference.
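The sketch below computes WER via the standard edit-distance (Levenshtein) alignment between the reference and the hypothesis. It is a minimal illustration, not the evaluation script of any specific toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (S + D + I) / N computed via Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edit operations to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # 1/6 ≈ 0.167
```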


ASR Performance Evaluation

 Sometimes the word recognition rate (WRR = 1 - WER) is used instead of WER when describing the performance of speech recognition.

 Speed

 It is measured by the real time factor.

 If it takes time T to process an input of duration D, then the real time factor is defined by:

        RTF = T / D

 RTF ≤ 1 implies real-time processing.
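A minimal sketch of the RTF computation, using invented timing numbers:

```python
def real_time_factor(processing_time_s: float, audio_duration_s: float) -> float:
    """RTF = T / D; values at or below 1 indicate real-time processing."""
    return processing_time_s / audio_duration_s

# Hypothetical example: 10 s of audio processed in 4.5 s.
rtf = real_time_factor(processing_time_s=4.5, audio_duration_s=10.0)
print(rtf, "-> real time" if rtf <= 1 else "-> slower than real time")   # 0.45
```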


NLP in Speech Recognition

 NLP concepts which are very fundamental for ASR include:

 Word pronunciation – considering homonymy (homophones) – [acoustic and lexical models]

 Language syntax – [language model]

Outline (Optical Character Recognition)

 OCR Definition
 OCR Phases
 OCR Approaches
 OCR Performance Evaluation
 NLP in OCR

Optical Character Recognition

 Handwriting recognition is classified into two types:

 Off-line Handwritten Recognition

 On-line Handwritten recognition

 Off-line handwriting recognition:

 It involves automatic conversion of text in an image into letter codes which are
usable within computer and text-processing applications.

 It is more difficult, as different people have different handwriting styles.

 On-line character recognition:

 It deals with a data stream which comes from a transducer while the user is
writing.

 The typical hardware used to collect the data is a digitizing tablet, which is electromagnetic or pressure sensitive.

 When the user writes on the tablet, the successive movements of the pen are transformed into a series of electronic signals that are stored and analyzed by the computer.


Optical Character Recognition

 Optical Character Recognition (OCR) is a field of research in pattern recognition, artificial intelligence, machine vision, and signal processing.

 OCR usually refers to an off-line character recognition process, meaning that the system scans and recognizes static images of the characters.

 It refers to the mechanical or electronic translation of images of handwritten characters or printed text into machine-encoded text.

 It is used to convert handwritten, typed, or scanned text, or text inside images, into machine-readable text.


OCR Phases

 OCR has the following major phases:

 Digitization

 Preprocessing

 Segmentation

 Feature extraction

 Classification and Recognition

 Post processing


OCR Phases

 General Architecture of OCR


OCR Phases

 Digitization

 It is the process of converting a paper-based document (handwritten, typed, or printed text, or text inside images) into electronic format using a scanner or camera, producing an image file.

 Preprocessing

 The input image can be preprocessed before segmentation.

 The main preprocessing tasks are binarization and size normalization.

 Binarization converts gray-scale or color images into binary images, reducing storage space and increasing processing speed.

 Size normalization scales characters to a standard size, reducing size variation; a minimal sketch of both steps follows.
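A minimal sketch of these two preprocessing steps, assuming the OpenCV (cv2) library is available and "page.png" is a hypothetical scanned document image:

```python
# A minimal preprocessing sketch assuming OpenCV (cv2) is installed and
# "page.png" is a hypothetical scanned document image.
import cv2

# Digitized input: load the scan as a gray-scale image.
gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

# Binarization: Otsu's method picks a threshold that separates ink from paper.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Size normalization: scale a segmented character image to a fixed template size.
def normalize_character(char_img, size=(32, 32)):
    """Resize a character image to the template size used by the classifier."""
    return cv2.resize(char_img, size, interpolation=cv2.INTER_AREA)

normalized = normalize_character(binary)   # applied to the whole page here for brevity
print(binary.shape, normalized.shape)
```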


OCR Phases

 Segmentation:

 The position of the character in the image is found, and the size of the image is normalized to the template size.

 Segmentation can be external or internal.

 External segmentation is the isolation of larger writing units, such as paragraphs, sentences, or words.

 In internal segmentation, an image of a sequence of characters is decomposed into sub-images of individual characters.
 Feature extraction:

 Features of individual character are extracted.

 The performance of a character recognition system depends on the features that are extracted. The extracted features should allow classification of a character in a unique way.

 For example: diagonal features, intersection and open end points features,
transition features, zoning features, directional features, parabola curve fitting–
based features, and power curve fitting–based features in order to find the
feature set for a given character.


OCR Phases

 Recognition and Classification:

 The unknown symbol images (their extracted features) are compared with predefined stored samples in order to identify their type; this determines the region of feature space in which an unknown pattern falls.

 Post-processing:

 It is the final stage in an OCR system, and one of the most important.

 It checks the text produced by the previous stage and corrects it to make sure it is free of errors.


OCR - Recognition and Classification Techniques

 The commonly used OCR approaches are:
 Optimum statistical classifiers: Includes

 Support Vector Machines (SVM)

 Principal Component Analysis (PCA),

 Kernel Principal Component Analysis (KPCA) and others

 SVMs are a group of supervised learning methods that can be applied to classification. In a classification task, the data is usually divided into training and testing sets. The aim of SVM is to produce a model that predicts the target values of the test data. Different types of SVM kernel functions are: linear, polynomial, Gaussian Radial Basis Function (RBF), and sigmoid. (A minimal sketch follows this list.)

 Neural Networks / Deep learning

 Neural network based OCR makes the recognition process more application-aware by feeding the extracted features (or raw character images) into a neural network.

 A common deep learning approach that is effective for OCR is the CNN model.
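The sketch below illustrates SVM-based character classification with scikit-learn (an assumed dependency), using the library's bundled digits dataset as a stand-in for segmented character images; the kernel and hyperparameter choices are illustrative.

```python
# A minimal sketch of SVM character classification using scikit-learn.
# The bundled digits dataset stands in for segmented character images.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

digits = load_digits()                                  # 8x8 character images
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

# An RBF kernel is used here; 'linear', 'poly', or 'sigmoid' are also possible.
clf = SVC(kernel="rbf", gamma=0.001, C=10.0)
clf.fit(X_train, y_train)

print("Recognition accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```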


OCR Performance Evaluation

 Recognition rate

 The proportion of correctly classified characters.

 Rejection rate

 The proportion of characters which the system was unable to recognize.

 Rejected characters can be flagged by the OCR system, and are therefore easily
retraceable for manual correction.

 Error rate

 The proportion of characters erroneously classified.

 Misclassified characters go undetected by the system, and manual inspection of the recognized text is necessary to detect and correct these errors.
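A minimal sketch of these three rates, computed from hypothetical per-character OCR outcomes (the outcome labels are invented for illustration):

```python
def ocr_rates(results):
    """Compute recognition, rejection, and error rates.

    `results` is a list of per-character outcomes using the hypothetical
    labels 'correct', 'rejected', or 'error'.
    """
    total = len(results)
    return {
        "recognition_rate": results.count("correct") / total,
        "rejection_rate":   results.count("rejected") / total,
        "error_rate":       results.count("error") / total,
    }

# Toy example: 100 characters, 94 correct, 4 rejected, 2 misclassified.
outcomes = ["correct"] * 94 + ["rejected"] * 4 + ["error"] * 2
print(ocr_rates(outcomes))   # {'recognition_rate': 0.94, 'rejection_rate': 0.04, 'error_rate': 0.02}
```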



TOC: Course Syllabus

Previous: Approaches to NLP

Current: Applications of NLP-Part-II


Next: End of NLP Course
