Lecture 08

Natural Language Processing (COSC 6405)
Lecture 08: Related Fields
Department of Computer Science, Addis Ababa University
Yaregal Assabie
2018/19—Sem I
Modes of Language Representation
Speech Recognition
Optical Character Recognition

Written and Spoken Languages
• There are two modes used to represent languages.
♦ Written Language
[Figure: samples of machine-printed, handwritten, and machine-editable text]
♦ Spoken Language

Conversion to Machine-Editable Text
• Most of the NLP applications and tasks discussed so far assume that the language is
represented as machine-editable text.
• Optical Character Recognition (OCR) systems convert non-editable text into its
equivalent machine-editable text.
• Speech Recognition (SR) systems convert spoken language into its equivalent
machine-editable text.
• Both OCR and SR systems merge interdisciplinary technologies from Signal Processing,
Pattern Recognition, Natural Language Processing, and Linguistics into a unified framework.
Speech Recognition
Parameters of SR Systems
General Architecture
Components of SR Systems
Parameters of SR Systems
• The following parameters are considered in the development of Speech Recognition
systems.
♦ Speaking Mode
Isolated
Continuous
♦ Speaking Style
Read Speech
Spontaneous Speech
♦ Enrollment
Speaker-Dependent
Speaker-Independent
♦ Vocabulary
Small (< 20 words)
Large (> 20,000 words)
General Architecture
Acoustic Signal → Feature Extraction → Decoding → Text
Decoding draws on: Acoustic Model + Language Model + Lexical Model
Components of SR Systems: Acoustic Signal
• An acoustic signal is the waveform of a given utterance.
♦ Can be captured using a sound recorder and stored as a wave file.
[Figure: waveform of the Amharic utterance “አበበ በሶ በላ” (“Abebe ate beso”)]
Components of SR Systems: Feature Extraction
• Feature Extraction is the process of transforming the input acoustic signal data into a
set of features.
• Mel-Frequency Cepstral Coefficients (MFCCs) are commonly used as features in speech
recognition systems.
♦ A total of 39 features are extracted per frame (typically 12 MFCCs plus energy,
together with their first and second time derivatives).
Wave file → Feature Extraction → Feature Vectors
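As a rough illustration of turning a waveform into a sequence of feature vectors, the sketch below slices a signal into short frames and computes a single feature per frame, the log energy. This is not an MFCC computation. The 16 kHz sample rate, the 10 ms non-overlapping frames, and the function name `frame_log_energy` are assumptions made for this example only.

```python
import math

def frame_log_energy(samples, sample_rate=16000, frame_ms=10):
    """Split a signal into non-overlapping frames and return the log energy of each."""
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame)
        features.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return features

# Synthetic input: 50 ms of silence followed by 50 ms of a 440 Hz tone.
rate = 16000
silence = [0.0] * (rate * 50 // 1000)
tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate * 50 // 1000)]
features = frame_log_energy(silence + tone, rate)
print(len(features))                # one value per 10 ms frame
print(features[0] < features[-1])   # silence carries less energy than the tone
```

A real front end would use overlapping windows and compute the full MFCC vector for each frame; the framing step is the part this sketch has in common with it.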
Components of SR Systems: Language Modeling
• The notion of Language Modeling in Speech Recognition is similar to that of Statistical
Machine Translation [Lecture 08].
• N-gram language models are commonly used for Speech Recognition as well.
• We want to compute:
♦ the probability of a sequence of words
P(W) = P(w1, w2, w3, w4, w5, …, wn)
♦ the probability of a word given some previous words
P(w5 | w1, w2, w3, w4)
• Further reading:
Language Modeling in Statistical Machine Translation [Lecture 08].
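The two probabilities above can be sketched with a bigram model, which approximates P(W) by a product of P(wi | wi-1) terms estimated from counts. The three-sentence corpus, the `<s>` and `</s>` boundary markers, and the absence of smoothing are assumptions made to keep the example small.

```python
from collections import Counter

corpus = [
    "<s> i want to eat </s>",
    "<s> i want to sleep </s>",
    "<s> i like to eat </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words[:-1])             # counts of words used as a history
    bigrams.update(zip(words, words[1:]))   # counts of adjacent word pairs

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev), without smoothing."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    """P(W) under the bigram approximation of the chain rule."""
    words = sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(bigram_prob("i", "want"))
print(sentence_prob("<s> i want to eat </s>"))
```

Unseen word pairs receive probability zero here, which is why practical language models add smoothing.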
Components of SR Systems: Acoustic Modeling
• The goal of the probabilistic noisy channel architecture for speech recognition can be
summarized as the following question:
“What is the most likely sentence out of all sentences in the language L given some acoustic input O?”
• We can treat the acoustic input O as a sequence of individual “symbols” or
“observations” (for example by slicing up the input every 10 milliseconds, and
representing each slice by floating-point values of the energy or frequencies of that
slice).
O = o1,o2,o3, . . . ,ot
• Similarly, we treat a sentence as if it were composed of a string of words:
W = w1,w2,w3, . . . ,wn
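• The probabilistic implementation of our intuition above, then, can be expressed as
follows:
$$\hat{W} = \operatorname*{argmax}_{W \in L} P(W \mid O)$$
• The Noisy Channel Equation for Speech Recognition follows from Bayes' rule, since P(O)
is the same for every candidate W and can be dropped from the maximization:
$$\hat{W} = \operatorname*{argmax}_{W \in L} \underbrace{P(O \mid W)}_{\text{observation likelihood}} \; \underbrace{P(W)}_{\text{language model}}$$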
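A minimal sketch of the noisy channel decision rule follows, assuming a tiny fixed list of candidate sentences. The probabilities are invented for illustration: `acoustic[W]` stands in for P(O | W) and `language[W]` for P(W). Log probabilities are added rather than multiplied, which is the usual way to avoid numerical underflow.

```python
import math

# Hypothetical scores for three candidate sentences given one acoustic input O.
acoustic = {
    "recognize speech": 0.0020,
    "wreck a nice beach": 0.0025,
    "wreck an ice peach": 0.0010,
}
language = {
    "recognize speech": 0.010,
    "wreck a nice beach": 0.001,
    "wreck an ice peach": 0.0001,
}

def decode(candidates):
    """Return the candidate W that maximizes log P(O | W) + log P(W)."""
    return max(candidates, key=lambda w: math.log(acoustic[w]) + math.log(language[w]))

best = decode(list(acoustic))
acoustic_only = max(acoustic, key=acoustic.get)
print(best)
print(acoustic_only)
```

With these numbers the acoustic score alone prefers a different sentence than the combined score, which shows the role of the language model term.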
• The acoustic model needs to have the following:

♦ Observation likelihoods
♦ Pronunciation lexicon (lexical model)
The HMM structure for each word, built by hand
♦ Transition probabilities
Selecting Appropriate Units

• What is the best base unit for a continuous speech recognizer?
♦ Possible units: Phrase, word, syllable, phoneme, allophone, subphone
• Requirements
♦ Accurate
Can be recognized with high accuracy.
♦ Trainable
Can be well trained with the given size of the training data.
♦ Generalizable
Words not in the training data should be modeled with high precision.
Comparison of Different Units

• Phrase
♦ Pros: Captures coarticulation for a whole phrase.
♦ Cons: Very large number; only common phrases might be trainable.
• Word
♦ Pros: Captures intra-word, but not inter-word, coarticulation.
♦ Cons: Very large number; training for a large vocabulary is unrealistic.
• Syllable
♦ Pros: Close tying with prosody (stress, rhythm).
♦ Cons: Coarticulation at endpoints not captured; Large number.
• Phone
♦ Pros: Low number (around 50).
♦ Cons: Very sensitive to coarticulation.
• Context-dependent phone (diphone, triphone)
♦ Pros: Captures coarticulation from adjacent phones.
♦ Cons: High number of triphones (50^3 = 125,000 for an inventory of 50 phones).
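To make the context-dependent unit concrete, the sketch below expands a phone sequence into triphones and computes the size of the triphone inventory for 50 phones. The `left-phone+right` label format, the `sil` padding symbol, and the phone symbols are assumptions chosen for this example.

```python
def to_triphones(phones):
    """Expand a phone sequence into left-context / phone / right-context units."""
    padded = ["sil"] + list(phones) + ["sil"]   # "sil" marks silence at both edges
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

triphones = to_triphones(["k", "ae", "t"])
print(triphones)
print(50 ** 3)  # possible triphones for a 50-phone inventory
```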
Components of SR Systems: Decoding
• Given the language model and acoustic model (along with lexical model), a decoder
searches for the best sequence of words from speech.
• The Viterbi algorithm is widely used as a decoder in Speech Recognition systems.
• Currently, the HMM Toolkit (HTK) is one of the most widely used freely available toolkits
for implementing HMM-based Speech Recognition systems.
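The Viterbi search can be sketched on a toy HMM as follows. The two states, the three observation symbols, and all probabilities are invented for illustration. A real recognizer would search over HMM states of phones and words using the acoustic, lexical, and language models.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely state sequence."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return prob, path

states = ("S1", "S2")
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {
    "S1": {"a": 0.5, "b": 0.4, "c": 0.1},
    "S2": {"a": 0.1, "b": 0.3, "c": 0.6},
}
prob, path = viterbi(["a", "b", "c"], states, start_p, trans_p, emit_p)
print(path, prob)
```

The table `best` keeps only the best-scoring path into each state at each time step, so the cost grows linearly with the length of the observation sequence rather than exponentially.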
Optical Character Recognition
Types of OCR Systems
Processes of OCR Systems
Approaches to Recognition
• Optical Character Recognition (OCR) is a process that involves reading text from paper
in the form of an image and converting the image into a standard encoding scheme
representing the text, e.g. ASCII or Unicode.
• The idea of OCR came into existence when G. Tauschek obtained a patent on a ‘Reading
Machine’ in Germany in 1929.
♦ However, the modern history of OCR started with the advent of computers.
• In the early years of Latin OCR, some standard fonts were developed to make
recognition easier.
♦ Among the standard OCR fonts are OCR-A and OCR-B, which are widely used
in passports, bank checks, serial tracking labels, credit card imprints, cash
registers, license plates and postal mail.
[Figure: samples of the OCR-A and OCR-B fonts]
• The input text can be machine-printed or handwritten.

• Historically, the term OCR has been used for all types of input texts.
♦ However, in recent times, the term intelligent character recognition (ICR) is
often used for handwriting recognition.
♦ If recognition is performed at word level, the handwriting recognition is called
intelligent word recognition (IWR).
• When the text to be recognized is already available in some media such as paper, it
means that the recognition is done after the writing process has been completed.
♦ The text is digitized using a scanner and stored in image format.
♦ In such cases, the recognition process is called offline recognition.
• The recent explosion in the use of handheld digital devices has brought the need to
automatically recognize characters whilst the user is writing.
♦ The text is captured and stored, e.g. in UNIPEN format.
♦ This method of recognition is called online recognition.
Type of input text     Recognition method   Technology   Complexity
Machine printed        Offline              OCR          Easy
Offline handwritten    Offline              ICR          Difficult
Online handwritten     Online               ICR          Easy
Text Image → Preprocessing → Segmentation → Feature Extraction → Classification → Post-Processing → Editable Text
Optional component: Language Model
Processes of OCR Systems: Preprocessing
• The preprocessing stage aims to produce data from which recognition systems can easily
produce accurate results.
• It includes image enhancement, noise removal, skew and slant correction, size
normalization, and thinning.
• Image Enhancement
♦ Used to improve the quality of degraded documents; degradation is typically
observed in ancient documents.
♦ The enhancement can be done by filling in missing parts of the data or by
adjusting the intensity of images.
• Noise Removal
♦ Noise is commonly present in ancient documents, low-quality paper, or documents
produced under poor printing and writing conditions.
♦ Noisy documents are improved by smoothing operations, which replace
each pixel with some function of the pixel’s neighborhood.
♦ Morphological operations such as dilation and erosion can also be used for noise
removal.
♦ Smoothing with a Gaussian function is one of the most commonly used methods for
noise removal because it smooths isotropically.
♦ A 2-dimensional (2D) Gaussian function is defined as:
$$g(x, y) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$
0.004 0.015 0.026 0.015 0.004
0.015 0.059 0.095 0.059 0.015
0.026 0.095 0.150 0.095 0.026
0.015 0.059 0.095 0.059 0.015
0.004 0.015 0.026 0.015 0.004

[Figures: Graphical Representation of 2D Gaussian Function; Discrete Approximation to 2D Gaussian Function with σ = 1.0 (the 5×5 table above)]
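A minimal sketch of Gaussian smoothing follows: it samples g(x, y) on a 5×5 grid, normalizes the weights so that they sum to 1, and replaces each pixel by the weighted average of its neighborhood. The function names, the edge-clamping border rule, and the 7×7 test image are assumptions made for this example. The sampled kernel is close to, but not identical to, the 5×5 table above, whose entries appear to match the common integer approximation (1, 4, 7, 16, 26, 41) divided by 273.

```python
import math

def gaussian_kernel(size=5, sigma=1.0):
    """Sample g(x, y) on a size x size grid and normalize the weights to sum to 1."""
    half = size // 2
    raw = [
        [math.exp(-(x * x + y * y) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
         for x in range(-half, half + 1)]
        for y in range(-half, half + 1)
    ]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]

def smooth(image, kernel):
    """Replace each pixel by the kernel-weighted average of its neighborhood."""
    half = len(kernel) // 2
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    # Clamp coordinates so border pixels reuse the nearest edge value.
                    rr = min(max(r + dy, 0), rows - 1)
                    cc = min(max(c + dx, 0), cols - 1)
                    acc += kernel[dy + half][dx + half] * image[rr][cc]
            out[r][c] = acc
    return out

kernel = gaussian_kernel()
# A 7x7 blank image with one isolated noise pixel in the middle.
noisy = [[0.0] * 7 for _ in range(7)]
noisy[3][3] = 1.0
smoothed = smooth(noisy, kernel)
print(kernel[2][2])     # centre weight of the sampled kernel
print(41 / 273)         # centre weight of the integer approximation
print(smoothed[3][3])   # the noise pixel is attenuated and spread to its neighbours
```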
• Skew and Slant Correction
♦ Skew is the deviation of a document from its expected orientation during the
scanning process.
♦ It is a global feature of a document and can be detected by projection profiles,
correlation methods, etc.
♦ Skew is corrected by analyzing the direction and alignment of the text in the image.
♦ The purpose of skew correction is to align the paper document with the coordinate
system of the scanner.
♦ Slant refers to the local direction of text and is a characteristic feature of
handwriting.
♦ In slant correction, the characters in the text are brought to an upright (normal)
position.
• Size Normalization and Thinning
♦ Used to reduce or standardize the feature space representing characters.
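The projection-profile idea for skew detection mentioned above can be sketched as follows: for each candidate angle, undo that skew, count the ink pixels in each row, and keep the angle that produces the most sharply peaked row profile. The synthetic page, the integer-degree search range, and the sum-of-squares sharpness score are assumptions of this sketch.

```python
import math
from collections import Counter

def estimate_skew(pixels, max_angle=10):
    """Return the integer angle (degrees) whose correction gives the sharpest row profile."""
    best_angle, best_score = 0, -1
    for angle in range(-max_angle, max_angle + 1):
        slope = math.tan(math.radians(angle))
        # Undo a skew of `angle` degrees and count the ink pixels that land in each row.
        profile = Counter(round(y - x * slope) for x, y in pixels)
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Synthetic page: three text lines drawn as straight strokes, skewed by 5 degrees.
true_slope = math.tan(math.radians(5))
pixels = [(x, round(y0 + x * true_slope)) for y0 in (0, 20, 40) for x in range(200)]
skew = estimate_skew(pixels)
print(skew)
```

When the candidate angle matches the true skew, each text line collapses into a single row of the profile, which maximizes the score.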
Processes of OCR Systems: Segmentation
• Segmentation refers to all procedures in which observed patterns in the image are
segregated into units of sub-patterns such as graphical objects, tables, text lines,
words, and characters.
• Handwriting recognition systems usually have difficulty segmenting unconstrained text into
individual characters.
♦ In this regard, recognition systems follow one of two
paradigms: segmentation-based and segmentation-free.
♦ Segmentation-based approaches assume that the would-be characters are
extracted for further processing such as feature description or recognition.
This assumption is feasible for machine-printed documents but
is hard to satisfy for handwritten texts.
♦ Thus, most handwriting recognition systems are designed based on the
segmentation-free paradigm.
Here, words are considered to be the inputs for the system, and for this
reason the segmentation-free technique is also known as the holistic
approach.
Handwritten Amharic Document Image
Character Segmentation in Handwritten Amharic Document Image [From EthioReader]
Text Line Detection in Skewed Handwritten Amharic Document Images [From EthioReader]
Processes of OCR Systems: Feature Extraction
• Feature extraction involves the measurement or computation of the most relevant
information from given raw data.
• Features can be extracted in two ways:
♦ from the structures of patterns in the raw data; or
♦ by applying some transformations on the raw data and then extracting the
features from the patterns.
• A good feature design requires that features be invariant to various distortions
of the patterns.
• The design of discriminating features is an important factor in the success of any
pattern recognition algorithm in classification.
• The choice of discriminating features mostly depends on the nature of character
structures and writing styles.
• The most commonly used structural features for character recognition include strokes,
loops, corners, contours, curves, intersection points, end points of lines, etc.
• The aggregate shape of words can also be used as a feature in the case of the holistic
approach.
• In addition, online handwriting recognition uses the directional features generated as a
result of pen-tip movements.
• On the other hand, features can be computed by using image transformations.

• The number of features, known as dimensionality, has its own implications on the
complexity of classification.
• As the dimensionality of features linearly increases, the required number of training
samples increases exponentially.
♦ This phenomenon is known as the curse of dimensionality.
♦ Thus, dimensionality reduction is an important component in feature
extraction.
• Extracted features are not equally useful for classification.
♦ A limited yet salient feature set both improves the recognition results and
reduces the complexity of classification.
♦ The process of choosing limited yet good features that lead to efficient
classification is called feature selection.
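A small worked example of the curse of dimensionality described above: if each feature axis is divided into a fixed number of bins and at least one training sample is wanted per cell, the number of cells grows exponentially with the number of features. The choice of 10 bins per feature is an assumption for illustration.

```python
def cells_to_cover(bins_per_feature, dimensions):
    """Number of grid cells when every feature axis is split into equal bins."""
    return bins_per_feature ** dimensions

for d in (1, 2, 3, 10):
    print(d, cells_to_cover(10, d))
```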
Low Level Feature Extraction in Offline Recognition
Image of the Ethiopic character “ም” scanned from a noisy document
Low Level Feature Extraction in Offline Recognition: Gradient Fields
Gradient field image of the Ethiopic character “ም” scanned from a noisy document
• Gradient field is a low level feature describing the change in gray level with direction.
♦ Calculated by taking the difference in value of neighboring pixels, producing a
vector for each pixel.
Can be computed by convolving the image with a Gaussian and
derivatives of Gaussian operators.
♦ The gradient of pixels is expressed in the range of [0..360] degrees, where
pixels with directions of zero are represented by the red color.
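A minimal sketch of a gradient field, assuming plain central differences between neighboring pixels instead of the Gaussian-derivative filters mentioned above. Each interior pixel receives a magnitude and a direction in the range [0, 360) degrees; border pixels are left at zero for simplicity.

```python
import math

def gradient_field(image):
    """Return per-pixel (magnitude, direction in degrees within [0, 360))."""
    rows, cols = len(image), len(image[0])
    field = [[(0.0, 0.0)] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = (image[r][c + 1] - image[r][c - 1]) / 2.0   # horizontal change
            gy = (image[r + 1][c] - image[r - 1][c]) / 2.0   # vertical change
            magnitude = math.hypot(gx, gy)
            direction = math.degrees(math.atan2(gy, gx)) % 360.0
            field[r][c] = (magnitude, direction)
    return field

# Vertical edge: the gray level increases from left to right.
image = [[0, 0, 10, 10] for _ in range(4)]
magnitude, direction = gradient_field(image)[1][1]
print(magnitude, direction)
```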
Low Level Feature Extraction in Offline Recognition: Direction Fields
Direction field image of the Ethiopic character “ም” scanned from a noisy document
• Direction field represents the ideal local direction of pixels characterized by the fact
that the gray value remains constant in one direction (along the direction of lines), and
only changes in the orthogonal direction.
♦ Can be computed by convolving the image with a Gaussian and derivatives of
Gaussian operators, and then by pixel-wise complex squaring.
♦ The direction of pixels is represented in double angle and expressed in the
range of [0..180] degrees, where pixels with directions of zero are represented
by the red color.
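The effect of pixel-wise complex squaring can be sketched as follows. Squaring a complex gradient doubles its angle, so gradients pointing in opposite directions on the two sides of a stroke reinforce each other instead of cancelling. The gradient values are invented, and the Gaussian filtering step is omitted in this sketch.

```python
import cmath
import math

def local_direction(gradients):
    """Estimate the dominant line direction in [0, 180) degrees from (gx, gy) gradients."""
    doubled = 0j
    for gx, gy in gradients:
        z = complex(gx, gy)
        doubled += z * z          # complex squaring doubles the gradient angle
    gradient_angle = (math.degrees(cmath.phase(doubled)) / 2.0) % 180.0
    return (gradient_angle + 90.0) % 180.0   # lines run orthogonal to the gradient

# Gradients sampled on both sides of a vertical stroke point left and right.
gradients = [(1.0, 0.0), (-1.0, 0.0), (1.0, 0.0), (-1.0, 0.0)]
plain_sum = sum(complex(gx, gy) for gx, gy in gradients)
stroke_direction = local_direction(gradients)
print(abs(plain_sum))       # summing the raw gradients cancels them out
print(stroke_direction)     # the double-angle sum recovers the stroke direction
```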
Low Level Feature Extraction in Online Recognition
Handwritten Ethiopic Character “ጬ” Captured Online
Low Level Feature Extraction in Online Recognition: Gradient Fields
Time Parameterized Gradient Field for Online Handwritten “ጬ”
Low Level Feature Extraction in Online Recognition: Direction Fields
Time Parameterized Direction Field (Double Angle Representation) for Online Handwritten “ጬ”
Low Level Feature Extraction in Online Recognition: Direction Fields
Time Parameterized Direction Field (Normal Angle Representation) for Online Handwritten “ጬ”
High Level Feature Extraction in Online Recognition: Structural Features
Structural Feature Extraction from Direction Field (Normal Angle Representation)
High Level Feature Extraction in Offline Recognition: Structural Features
Structural Feature Extraction from Direction Fields
Processes of OCR Systems: Classification
• The primary goal of any recognition system is to classify unknown data into a set of
known categories.
♦ The basic idea is to take the extracted features and determine which label
(class) the input should have with minimal error.
♦ The classes in text recognition systems can be characters in a script or words in
a lexicon.
• Classification is the final stage in recognition systems in which a decision is made on
the recognition of a given input.
• The result of the decision made by the system can be:
♦ Correct Classification
A given input is assigned by the system to its correct class.
♦ Misclassification
A given input is assigned by the system to a wrong class.
♦ Rejection
Occurs when the system cannot match the input data with any of the
known classes.
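The three outcomes can be sketched with a classifier that rejects an input when no class scores high enough. The score values, the 0.5 rejection threshold, and the function names are assumptions made for this example.

```python
def classify(scores, reject_below=0.5):
    """Return the best-scoring label, or None (rejection) if no score is high enough."""
    label = max(scores, key=scores.get)
    return label if scores[label] >= reject_below else None

def outcome(predicted, truth):
    """Name the decision outcome for one input whose true class is known."""
    if predicted is None:
        return "rejection"
    return "correct classification" if predicted == truth else "misclassification"

print(outcome(classify({"ሀ": 0.9, "ለ": 0.1}), "ሀ"))
print(outcome(classify({"ሀ": 0.2, "ለ": 0.8}), "ሀ"))
print(outcome(classify({"ሀ": 0.4, "ለ": 0.3}), "ሀ"))
```

Raising the threshold trades misclassifications for rejections, which is often preferable when a wrong label is more costly than no label.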
• The field of pattern recognition, of which character recognition is a sub-field, has seen
much progress since its beginnings.
• A large number of different approaches have been proposed to solve pattern
recognition problems.
• However, most of them are grouped into one of the following four important
recognition techniques:
♦ Template matching
♦ Structural and syntactic
♦ Statistical
♦ Artificial neural network
• Despite their strengths in solving particular problems, no single approach has been found to
be optimal for all pattern recognition problems.
• Each of these recognition techniques has its own advantages and limitations, and
hybrid systems draw upon the synergy of two or more techniques.
• Hybrid methods aim at combining the advantages of different paradigms within a single
system.
Approaches to Recognition: Template Matching
• Template matching is one of the simplest and earliest character
recognition techniques, in which the character to be recognized is matched against a
database of stored templates of characters.
♦ Template matching assumes very small intra-class variability.
♦ Templates of characters are usually represented by features such as image
pixels, samples, curves, or directional properties of pixels.
• Recognition is made by measuring the correlation between the unknown input and the stored
templates.
• Template matching is effective for recognizing standardized machine-printed characters.
♦ Its applicability to handwriting recognition or general-purpose machine-printed
character recognition is limited since it needs a stored template of all variants.
• A few improvements have been made on the original rigid template matching technique.
♦ As it becomes difficult to include various types of sample templates, a
representation known as deformable templates has been used.
♦ Deformable templates provide a simple and compact representation of various
sample characters or words.
♦ Thus, a generic deformation model can be used to model a large set of classes
using a few examples from each class.
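Rigid template matching by correlation can be sketched as follows. The 3×3 binary templates and the noisy input are invented for illustration, and the correlation used here is the ordinary normalized (Pearson) correlation over pixel values.

```python
import math

def correlation(a, b):
    """Normalized correlation between two equally sized images."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return cov / norm if norm else 0.0

TEMPLATES = {
    "I": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "L": [[1, 0, 0],
          [1, 0, 0],
          [1, 1, 1]],
    "T": [[1, 1, 1],
          [0, 1, 0],
          [0, 1, 0]],
}

def recognize(image):
    """Return the label of the stored template that correlates best with the input."""
    return max(TEMPLATES, key=lambda label: correlation(image, TEMPLATES[label]))

# A "T" with one spurious extra pixel in the bottom-right corner.
unknown = [[1, 1, 1],
           [0, 1, 0],
           [0, 1, 1]]
print(recognize(unknown))
```

The approach works here because the input deviates from its template by a single pixel; larger intra-class variability would require many more stored templates.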
Approaches to Recognition: Structural and Syntactic Approach
• Syntactic and structural techniques utilize structural features and syntactic rules to
recognize patterns (characters).
♦ They are used for recognition of complex patterns which are represented in
terms of the interrelationships between simple sub-patterns called primitives.
♦ A large number of complex patterns can be described by a small number of
primitives and their spatial relationships.
♦ This provides a description of how a given character is constructed from the
given set of primitives.
• Recognition is made by parsing the sub-patterns according to predefined rules and
a grammar, and the recognition accuracy depends on the successful extraction of
primitives and their relationships.
• The choice of primitives is application dependent and relies on a general
understanding of the language and the script, as well as on technical and mathematical
model building.
• The relationships of primitive structural features are represented by means of symbolic
data structures such as strings, trees, and graphs.
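As a toy illustration of syntactic recognition, the sketch below represents a character as a string of stroke primitives and checks it against one rule per class. The primitive alphabet (`v`, `h`, `c`) and the rules are invented for this example, and regular expressions stand in for a full grammar.

```python
import re

# Hypothetical primitives: v = vertical stroke, h = horizontal stroke, c = curve.
# Each class is described by a rule over the primitive string, read in stroke order.
GRAMMAR = {
    "L": re.compile(r"v+h+"),   # vertical strokes followed by horizontal strokes
    "T": re.compile(r"h+v+"),   # horizontal strokes followed by vertical strokes
    "D": re.compile(r"v+c+"),   # vertical strokes followed by curves
}

def parse(primitives):
    """Return the classes whose rule accepts the whole primitive string."""
    return [label for label, rule in GRAMMAR.items() if rule.fullmatch(primitives)]

print(parse("vvh"))
print(parse("hvv"))
print(parse("cc"))    # no rule accepts this string
```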
Approaches to Recognition: Statistical Approach
• The statistical approach is based on statistical characterizations of patterns, assuming that
the patterns are generated by a probabilistic system.
♦ Each pattern is represented in terms of d features and is viewed as a point in
d-dimensional space.
♦ For effective representation of the patterns, the features of each pattern should
form disjoint regions in the d-dimensional feature space.
• The statistical approach is the most intensively studied technique; it represents each
pattern in terms of features or measurements.
♦ Many character and word recognition systems make use of this technique.
• In the statistical approach, the recognition system involves two operations:
♦ Training (learning)
The appropriate features representing the input patterns are extracted
and the classifier is trained to partition the feature space.
♦ Classification (testing)
Features are extracted from the unknown input and the trained
classifier assigns the input to one of the pattern classes under
consideration based on the measured features.
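The two operations can be sketched with a nearest-mean classifier, one of the simplest statistical classifiers. The two-dimensional feature vectors (for example an aspect ratio and a loop count), the class labels, and the function names are assumptions made for this example.

```python
import math

def train(samples):
    """Training: estimate one mean vector per class from labeled d-dimensional samples."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(model, features):
    """Classification: assign the input to the class with the nearest mean."""
    return min(model, key=lambda label: math.dist(model[label], features))

training_data = [
    ([0.30, 0.0], "l"),
    ([0.35, 0.0], "l"),
    ([0.90, 1.0], "o"),
    ([1.00, 1.0], "o"),
]
model = train(training_data)
print(classify(model, [0.4, 0.1]))
print(classify(model, [0.8, 0.9]))
```

The class means partition the feature space into regions; an unknown input is labeled according to the region its feature vector falls into.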
Approaches to Recognition: Neural Network Approach
• Artificial neural networks (ANNs) are pattern recognition techniques inspired by
neuronal operations in biological systems.
♦ Although established in the 1940s, ANNs have been widely applied in
the field of pattern recognition only since the 1980s.
• An ANN consists of a large number of highly interconnected processing elements called
neurons, which are organized into three layers:
♦ Input Layer
Takes data of the unknown pattern
♦ Hidden Layer
Contains many of the neurons in various interconnected structures
hidden from the outside view.
♦ Output Layer
Provides an interface for generating the recognition result.
• ANNs are known to be particularly effective for handwritten character recognition.
• Samples, pixels, or features can be used as inputs for the neural network system.
• Like statistical classification methods, neural networks must be trained on samples, from
which they learn how to classify new samples.
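The three-layer structure can be sketched as a forward pass through one hidden layer and one output layer. The weights below are hand-picked so that the network computes XOR of two binary inputs; they are not the result of training, and the sigmoid activation is an assumption of this sketch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, output_weights):
    """Propagate inputs through one hidden layer and one output layer."""
    # Each weight row is [bias, w1, w2, ...] for one neuron.
    hidden = [
        sigmoid(row[0] + sum(w * x for w, x in zip(row[1:], inputs)))
        for row in hidden_weights
    ]
    return [
        sigmoid(row[0] + sum(w * h for w, h in zip(row[1:], hidden)))
        for row in output_weights
    ]

HIDDEN = [[-10.0, 20.0, 20.0],    # fires if at least one input is 1 (OR)
          [-30.0, 20.0, 20.0]]    # fires only if both inputs are 1 (AND)
OUTPUT = [[-10.0, 20.0, -20.0]]   # OR and not AND, i.e. XOR

results = [round(forward([a, b], HIDDEN, OUTPUT)[0]) for a in (0, 1) for b in (0, 1)]
print(results)
```

In a trained network the weights would be learned from labeled samples, for example by backpropagation, instead of being set by hand.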