Abstract
This thesis focuses on automatic recognition of continuous English speech by means of image
processing of speech spectrograms and Fuzzy Logic. We first present a theoretical overview of
Fuzzy Logic and Mathematical Morphology and a short overview of speech phonetics. We then
present an algorithm for pitch estimation and conclude with a novel approach to the segmentation
of speech spectrograms.
The theory of Fuzzy Logic plays an important part in systems that are based on expert
knowledge. Fuzzy Logic is similar to Mathematical Morphology, and these two different theories
are used to tackle the same problem of speech recognition via spectrogram images. A Fuzzy Logic
variable is used in the debugging of the segmentation algorithm since, by definition, we cannot
decide whether the recognition is performed well according to a well-formulated metric. In fact, if
we had such a metric, we would have used it to perform the segmentation in the first place.
An algorithm that attempts to estimate the pitch from a narrowband spectrogram is
developed. The algorithm applies mathematical morphology methods to extract the pitch
harmonics, thin them to a single pixel width and then calculate the distance between these lines.
We explain the decisions that have led to the different algorithmic steps and the obstacles that
prevent the algorithm from being used reliably. The pitch algorithm fails to give correct results due
to the difficulty in determining the exact line distances and due to the fact that in some cases we
actually miss some pitch harmonics. We propose a computationally involved approach that
should produce more reliable estimates.
An algorithm to segment speech spectrograms is developed. The algorithm uses
morphological image processing techniques to perform the segmentation. Our objective is to
create a prototype of a building block that can later be used within an automatic speech
recognition application. Automatic speech recognition is still considered an open research topic
and current techniques do not exploit the vast knowledge humans possess on speech. Although it
is almost impossible for us to explain to others what our auditory system accepts as inputs and
what makes us distinguish between different words, we can easily explain the input to our visual
system and, for example, how we distinguish between different written words. Motivated by recent
improvements in computer hardware and software in conjunction with advances in the fields of
Mathematical Morphology (Morphological Image Processing), Optical Character Recognition
(OCR) and Fuzzy Logic, we establish a scheme to dissect a speech spectrogram,
treating it as if it were text written in some alien language. By segmenting the
speech spectrogram we allow an expert in spectrogram reading to write down rules based on a
collection of acquired knowledge and experience. We anticipate extracting in this way information
that is otherwise either hard to extract or simply missed by conventional recognition
techniques. An important transform used in the detection process is the Watershed Transform.
The Watershed Transform is a morphologically based technique that allows segmenting objects
in an image even when the objects partially occlude one another.
We conclude the thesis by presenting ideas for improving the segmentation algorithm so that it
produces better results for lower-energy regions and fast (rising or falling) formant movements.
We also give some ideas for future research that would take the results produced by the
segmentation and incorporate them within a Fuzzy Logic-based expert system.
Acknowledgments
First and foremost I would like to thank my thesis supervisor, Prof. Douglas
O'Shaughnessy, who inspired and motivated this work and who, through his immense knowledge
and most dedicated mentoring, contributed to the creation of this thesis.
I would also like to thank the technical and administrative staff of INRS-EMT for their
positive approach and willingness to help and assist at all times. I am very grateful to Prof.
Ioannis Psaromiligkos from McGill University, who provided me with a solid background and
methodology. I also thank Mr. Liron Yatziv from Siemens Research at Princeton, NJ, who
contributed his knowledge of biomedical image processing and especially of image
segmentation techniques.
Last but not least I would like to thank the many people I have met in the beautiful city of
Montreal, and the many different life experiences I have encountered, for helping me set a crisp
value in the Fuzzy Sets of “Quality of Life” and “Friendship”.
Contents
Chapter 1
Introduction
1.1 Automatic Speech Recognition (ASR)
1.2 Cepstral coefficients and Mel Frequency Cepstral Coefficients
1.3 Different Approaches to Automatic Speech Recognition
1.3.1 Hidden Markov Models (HMMs)
1.3.2 Neural Networks
1.3.3 Hybrid Systems
1.4 Drawbacks of existing automatic speech recognition systems
1.5 Image Processing Basic Concepts
1.6 Thesis Outline
Chapter 2
Fundamentals of Speech Theory and Time-Frequency Analysis
2.1 Phoneme
2.2 Pitch
2.3 International Phonetic Alphabet (IPA)
2.4 Voiced and Unvoiced Speech
2.5 Different Phoneme Classes
2.5.1 Sonorant
2.5.2 Vowels
2.5.3 Approximants
2.5.4 Nasals
2.5.5 Taps/Flaps
2.5.6 Trills
2.5.7 Obstruents
2.6 Coarticulation
2.7 The DARPA TIMIT Speech Database
2.8 The Uncertainty Principle
2.9 Time-Frequency Representation
2.10 Summary
Chapter 3
Mathematical Morphology
3.1 History of Mathematical Morphology
3.2 Useful Properties
3.2.1 Connectivity
3.2.2 Erosion
3.2.3 Ultimate Erosion
3.2.4 Dilation
3.2.5 Morphological Gradient
3.2.6 Opening
3.2.7 Closing
3.2.8 Hit and Miss
3.2.9 Thinning
3.2.10 Thickening
3.2.11 Pruning
3.3 Distance Transform
3.3.1 Continuous Case
3.3.2 Discrete Case
3.4 Skeleton
3.5 Watershed Transform
3.6 The relationship between Fuzzy Logic and Mathematical Morphology
3.7 Summary
Chapter 4
Fuzzy Logic
4.1 Introduction
4.2 Fuzzy Logic vs. Boolean Logic
4.3 Alpha cuts
4.4 DeFuzzification
4.5 Fuzzy Union
4.6 General Aggregation Operations
4.7 Dempster-Shafer (DS) Theory
4.7.1 Basic Probability Assignment (BPA)
4.7.2 Combining Evidence
4.8 Fuzzy Logic Toolbox
4.9 Vagueness and Ambiguity
4.10 Summary
Chapter 5
Pitch detection algorithm
5.1 Motivation
5.2 Theoretical Overview
5.3 Autocorrelation Method
5.4 Method of Aggregation
5.5 Suggested Algorithm
5.6 Algorithm Description
5.7 Results
5.8 Summary
Chapter 6
Automatic Spectrogram Segmentation Algorithm
6.1 Introduction
6.2 Overview
6.3 Algorithm Description
6.4 Adaptive Histogram Equalization
6.5 Gamma Correction
6.6 Window Selection
6.7 Connectivity Operations
6.8 Local vs. Global Threshold
6.9 Working with the TIMIT Database
6.10 Calculating the Local Threshold
6.11 Function Description
6.12 Results
6.13 Suggestions for Improving the Algorithm
6.14 Summary
Chapter 7
Conclusion
7.1 Review of the Work and Logical Deductions
7.2 Ideas for Future Work
Appendix
Justifications for Choosing a Triangular Membership Function (TMF)
References
List of Figures
Figure 3.1: Circles benchmark image.
Figure 3.2: Circles before and after applying ultimate erosion.
Figure 3.3: Circles image corrupted by white Gaussian noise, after applying the Beucher gradient.
Figure 3.4: Circles image corrupted by Salt and Pepper noise and after opening.
Figure 3.5: Original image; applying a skeleton; pruning the skeleton.
Figure 3.6: Example of cleaning and extracting information from an image using morphological operators.
Figure 4.1: Examples of common parametric membership functions.
Figure 4.2: Sugeno fuzzy complement for different lambda parameters.
Figure 4.3: Yager fuzzy complement for different parametric values.
Figure 5.1: Narrowband Speech Spectrogram.
Figure 5.2: Narrowband Speech Spectrogram after line detection.
Figure 5.3: Results of the Pitch Estimation Algorithm.
Figure 6.1: Algorithm Diagram Flow.
Figure 6.2: Algorithm Results for different cases.
Figure 6.3: Example of a grade 1 score.
Figure 6.4: Example of a grade 2 score.
Figure 6.5: Example of a grade 3 score.
Figure 6.6: Example of a grade 4 score.
Figure 6.7: Example of a grade 5 score.
List of Tables
Table 2.1: Average formant frequency values for selected phonemes.
Table 6.1: Results of a visual inspection.
Chapter 1
Introduction
Automatic Speech Recognition has been an active topic of research for the past four decades.
The main objective of the automatic speech recognition task is to convert a speech segment into
an interpretable text message without the need for human intervention. Many different algorithms
and schemes based on different mathematical paradigms have been proposed in an attempt to
improve recognition rates. Since the problem of speech recognition is complex, recognition rates
are far from optimal under certain circumstances. In addition, other constraints such as
computational complexity and real-time operation come into play in the design and
implementation of a working product. Computer hardware and software have significantly
improved in terms of speed, memory, cost and availability, which has enabled more
sophisticated and computationally demanding algorithms to be implemented even on low-power,
low-cost handheld electronic devices. However, we prefer algorithms with low computational and
memory requirements since they can be implemented easily and at lower cost. Due to
improvements both in algorithms and in hardware, automatic speech recognition has become
more affordable and available. Automatic speech recognition is still an open topic of research,
where improvements and changes are constantly made in the hope of better recognition rates.
In this work we propose a different approach to automatic speech recognition based
on mathematical morphology and fuzzy logic theories. This new approach is unconventional in
the sense that it involves three major fields of research, namely speech theory with emphasis on
automatic speech recognition, image processing with emphasis on image segmentation and
evidence theory with emphasis on fuzzy logic, decision making and combining evidence.
Before delving into the worlds of phonology and morphological image processing, we present
an overview of automatic speech recognition and give insight into some commonly used techniques
that attempt to solve this formidable task.
1.1 Automatic Speech Recognition (ASR)
Automatic speech recognition (ASR) is the process of converting human speech into written text.
Many advancements have been made in the field that have led to systems with high recognition
rates; however, there are still many open problems in particular due to four major parameters:
1. Vocabulary size – The minimal possible vocabulary size is 2 (for example Yes/No).
Another common size is the 10 digits. Telephone conversations or news reports require a
vocabulary of about 60,000 words, which makes the recognition task more difficult.
Professional journals and text use an even more esoteric vocabulary, and this often
requires the use of specialized dictionaries.
2. Fluency – Isolated words are easier to recognize than continuous speech. Read speech
is usually clearer than conversational speech. In addition, isolated words give the system
more time to process results and have lower inter-speaker variability.
3. Channel and noise – A lab microphone and lab environment have lower noise
interference and lower signal distortion than speech sampled through a cell phone
microphone in a moving car with the window open. A low signal-to-noise ratio can cause
severe interference and degrade the performance of a system significantly. Signal
distortion and a high compression rate can cause some words to sound the same.
4. Accents and other speaker-specific parameters – Children have a different frequency
range than adults. Foreign accents will degrade the performance of a system, as will
uncommon accents that the system was not designed to handle. Usually an ASR
system can be fine-tuned to a specific speaker in order to reduce the error rate.
ASR is used in numerous applications. We present several examples for common uses of ASR:
1. Call centers route calls and give out information according to user requests. Usually the
call centers operate on a limited vocabulary related to their field of operation. The
purpose of the ASR system is to help the customer service representative perform the task
more efficiently and in less time.
2. Dictation allows almost completely hands-free conversion of speech into text. An
example of a commonly used software application for dictation is Nuance's Dragon
NaturallySpeaking software package. With high recognition rates for native speakers, the
user is rarely required to intervene and correct the dictation results.
3. Medical transcription is growing in importance. Regulations and practical needs require
that patient files be converted into digital text. Performing an Optical Character
Recognition (OCR) task to convert the handwritten information into text is especially
challenging when it comes to a doctor's handwriting. Currently medical transcription is done
either by professionals in the field or by ASR systems that have a vocabulary enriched
with medical terms [1].
4. Mobile phones and other communication devices use ASR for speed dialing. The phone
device contains ASR software which can identify names and dial the corresponding
number. Usually the system trains to a particular speaker, has a small vocabulary and
operates under harsh computational, memory and power consumption constraints.
5. Robotics uses ASR for guidance and instructions. A robot can be guided to perform a certain
task; for example, Dialog OS [2] is a software package that can be used to enable a Lego
Mindstorms RCX unit to understand speech [3]. The user can then guide the Lego robot
to perform certain preprogrammed tasks.
6. Security applications, such as automatic tapping of telephone lines triggered by specific
keywords. For instance, Nortel is developing an ASR system together with Qinghua
University in an attempt to monitor every civilian in China [4].
7. Automatic translation by converting speech to text and then translating the produced text.
8. Pronunciation evaluation in language learning applications. A speaker of a second
language has to make a special effort to correctly pronounce different words and
sentences. The ASR system indicates to the user if the pronunciation was clear.
9. Home devices use ASR as a friendlier human interface. Fujita et al. [5] propose a speech
remote control for digital TV that uses 15 buttons instead of the 70 needed to operate a
multi-channel TV. Difficult commands are made simple since, instead of complex programming
of buttons, the user simply speaks the desired command.
1.2 Cepstral coefficients and Mel Frequency Cepstral
Coefficients
Cepstral coefficients play an important part in speech theory and in automatic speech recognition
in particular due to their ability to compactly represent relevant information that is contained in a
short time sample of a continuous speech signal [6]. The definition for real cepstral coefficients is
given by the following equation:
(1.1) $\mathrm{cepstrum}(x) = \mathrm{IDFT}\left(\log\left|\mathrm{DFT}(x)\right|\right)$
We also note that
(1.2) $\mathrm{cepstrum}(x * y) = \mathrm{cepstrum}(x) + \mathrm{cepstrum}(y)$
where $*$ denotes convolution.
Equation 1.2 can be easily derived from 1.1 and is useful when we model the speech signal as
an excitation convolved with the impulse response of the vocal tract filter. DFT is the
Discrete Fourier Transform, often implemented by the Fast Fourier Transform algorithm. The Mel
Frequency Cepstral Coefficients (MFCCs) [7] are obtained by converting the log
absolute value frequency spectrum to a Mel perceptually based spectrum and taking an inverse
discrete cosine transform of the result. Using cepstral terminology we regard the Mel mapping to
be a rectangular low quefrency lifter followed by a discrete cosine transform. The result is a
smoothed cepstrum which can be further sampled to a specific number of coefficients. Quefrency
is the independent variable of the cepstrum (the cepstral counterpart of frequency), while a lifter
is a weighting applied to the cepstrum, in other words a filter for the cepstrum coefficients.
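As a rough numerical illustration of Equations 1.1 and 1.2, the following Python sketch computes a real cepstrum with NumPy and checks the additivity property. The function name `real_cepstrum`, the DFT length and the random test signals are assumptions made for this example and are not part of the thesis; the logarithm is taken of the DFT magnitude, as is usual for the real cepstrum. Additivity holds only when all cepstra use the same DFT length and that length is at least the length of the convolved signal, so that linear and circular convolution coincide.

```python
import numpy as np

def real_cepstrum(x, n_fft):
    """Real cepstrum of x: IDFT(log|DFT(x)|), using an n_fft-point DFT."""
    spectrum = np.fft.fft(x, n=n_fft)          # zero-pads x to n_fft samples
    log_magnitude = np.log(np.abs(spectrum))   # assumes no spectral bin is exactly zero
    return np.fft.ifft(log_magnitude).real     # imaginary part is only round-off

# Hypothetical stand-ins for an excitation and a vocal tract impulse response.
rng = np.random.default_rng(0)
excitation = rng.standard_normal(16)
impulse_response = rng.standard_normal(20)

n_fft = 64                                     # >= 16 + 20 - 1, so no circular wrap-around
speech = np.convolve(excitation, impulse_response)

lhs = real_cepstrum(speech, n_fft)
rhs = real_cepstrum(excitation, n_fft) + real_cepstrum(impulse_response, n_fft)
additive = np.allclose(lhs, rhs)               # Equation 1.2, checked up to round-off
```

The check relies on the DFT turning convolution into multiplication and the logarithm turning multiplication into addition, which is why the excitation and the vocal tract response separate additively in the cepstral domain.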
(1.3) $MFCC_i = \sum_{k=1}^{20} X_k \cos\left[ i \left( k - \frac{1}{2} \right) \frac{\pi}{20} \right], \quad i = 1, 2, \ldots, M$
$M$ is the number of cepstrum coefficients and $X_k$, $k = 1, \ldots, 20$, represents the
log-energy output of the $k$-th Mel filter. The triangular lifters are linearly spaced up to
1000 Hz and logarithmically spaced afterwards up to 4000 Hz. The hidden assumption is that more
important speech information is encapsulated in the low frequency band of 0-1000 Hz while the
higher 1000-4000 Hz band contains less information per Hz. The triangular lifters can be regarded
as a possibility function which serves as an upper bound to a symmetrical distribution where only
the mean and variance are known. The possibility function entails all the possible distributions
that might occur and is the coarsest upper bound we can obtain knowing only the mean and
variance of a stochastic process; a justification for a triangular function is given in the
appendix. The human ear filters sound linearly for lower frequencies and logarithmically for
higher frequencies. Partitioning the frequency range into two different spacing schemes that also
resemble the Bark scale yields an efficient representation of the spectrum.
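The cosine-transform step of Equation 1.3 can be sketched in a few lines of Python. The function name `mfcc_from_log_energies` and the synthetic input are illustrative assumptions; the sketch covers only the mapping from the 20 Mel filter log-energies $X_k$ to $M$ coefficients and does not build the filter bank itself.

```python
import numpy as np

def mfcc_from_log_energies(log_energies, num_coeffs):
    """MFCC_i = sum_k X_k * cos(i * (k - 1/2) * pi / K), for i = 1..num_coeffs."""
    X = np.asarray(log_energies, dtype=float)
    K = X.shape[0]                              # 20 filters in Equation 1.3
    k = np.arange(1, K + 1)                     # filter index k = 1..K
    i = np.arange(1, num_coeffs + 1)[:, None]   # coefficient index i = 1..M (column)
    basis = np.cos(i * (k - 0.5) * np.pi / K)   # cosine basis, shape (M, K)
    return basis @ X

# Synthetic log-energies (not real speech): a single cosine ripple across the filters.
k = np.arange(1, 21)
ripple = np.cos(3 * (k - 0.5) * np.pi / 20)
coeffs = mfcc_from_log_energies(ripple, num_coeffs=13)
```

Because the cosine basis vectors are mutually orthogonal, a ripple that matches the third basis vector shows up almost entirely in the third coefficient, while a flat set of log-energies gives coefficients near zero. This is one way to see why a small number of coefficients can summarize a smooth spectral envelope.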
1.3 Different Approaches to Automatic Speech Recognition
Automatic Speech Recognition has been an active research field since the invention of the
vocoder by Homer Dudley in the late 1930s [8]. Different approaches have been developed to
cope with the challenges presented in 1.1 while meeting the constraints of reasonable recognition
rates and affordable computational requirements. In the following we present an overview of three
common approaches to automatic speech recognition:
1.3.1 Hidden Markov Models (HMMs)
Hidden Markov Models (HMMs) are currently the most common approach to automatic speech
recognition. A Hidden Markov Model is a Markov chain in which the actual state of the chain is
hidden from the observer. A Markov chain is a stochastic process in which each state depends
only on the previous state. The
different states of the HMM represent different distributions. The speech signal is modeled as a
piecewise stationary stochastic process and in many applications time intervals are held constant
at 10 ms. A feature vector is computed for each time interval. Typically, a feature vector has 13
elements which are the cepstral coefficients of the sampled speech signal in the current time
interval. The features are then used to determine the state which represents the distribution
associated with the specified time interval. Finally the Viterbi algorithm is used to perform
Maximum A Posteriori (MAP) analysis of the data and produce the sequence with the highest
likelihood of occurrence. There has been a significant amount of work in the field of HMM-based
automatic speech recognition systems and many theoretical and application-specific algorithms
exist [9].
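The decoding step can be illustrated with a minimal Viterbi sketch in Python. The two-state model, its probabilities and the discrete observation symbols below are toy values invented for this example; a real recognizer would use states tied to phoneme models and emission densities over cepstral feature vectors. Working with log-probabilities is a common choice to avoid numerical underflow on long sequences.

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit, observations):
    """Return the most likely state path and its log-probability."""
    n_states = log_start.shape[0]
    T = len(observations)
    score = np.empty((T, n_states))             # best log-probability ending in each state
    back = np.zeros((T, n_states), dtype=int)   # best predecessor of each state
    score[0] = log_start + log_emit[:, observations[0]]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans   # cand[i, j]: come from i, move to j
        back[t] = np.argmax(cand, axis=0)
        score[t] = np.max(cand, axis=0) + log_emit[:, observations[t]]
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):               # backtrack from the best final state
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return path, float(np.max(score[-1]))

# Toy model (assumed values): 2 hidden states, 3 observation symbols.
start = np.log(np.array([0.6, 0.4]))
trans = np.log(np.array([[0.7, 0.3],
                         [0.4, 0.6]]))
emit = np.log(np.array([[0.5, 0.4, 0.1],
                        [0.1, 0.3, 0.6]]))

best_path, best_log_prob = viterbi(start, trans, emit, [0, 1, 2])
```

The dependence of each step only on `score[t - 1]` reflects the Markov assumption described above: the best path into a state at time t needs nothing beyond the best scores at time t - 1.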
1.3.2 Neural Networks
Neural Network (NN) based systems were popular in the late 1980s; however, due to the relative
success of HMMs they have been somewhat neglected. In [10] Rabiner et al. demonstrate the
importance of spectral parameterization of a speech signal that serves as input to a NN system.
Since linguistic isomorphism does not imply acoustic isomorphism, we can expect different
spectral representations for similar words/phonemes. Two methods of parameterization that are
commonly used are a bank of filters and an all-pole linear prediction model.
A bank of filters is a set of overlapping filters that are spaced in frequency according to either a
uniform or nonuniform law. The nonuniform law is usually exponential where a common
technique is to use a critical band scale that combines a linear and exponential filter placement
similar to the Bark or Mel scales. Both the Bark and Mel scales are justified based on perceptual
6
studies of speech. The Linear prediction analysis technique models speech as an allpole filter
and looks at the distance from the coefficients of an actual known utterance as an optimization
criteria measure. The Hilbert norm of the difference of the cepstral coefficients is used as the
distance measure and afterwards is optimized. Often the input to the NN is first categorized to
different clusters. The categorization improves the performance of the NN in particular in
schemes of pattern recognition. NN can also be used for statistical estimation of phonetic
probability that can later be used in a HMM to solve a continuous speech statistical ASR system
[11]. In general, the NN require training for a specific set of data to which the desired results are
already known. It is important not to over train the system to a specific data set. Different types of
neurons together with different types of connectivity exist. Several stages of neurons can be built
keeping in mind the tasks of the system on one hand and the computational and theoretical
complexity on the other. The main caveat of NNs, which are also used in Artificial Intelligence, is
that “real intelligence is a prerequisite for Artificial Intelligence” (Prof. David Malah); in other
words, we are attempting to build a complex system that mimics the way the human brain
operates but we do not have a complete understanding of the behavior of the system due to its
complexity.
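As an illustration of the non-uniform filter placement mentioned above, the following minimal sketch spaces filter center frequencies uniformly on a Mel axis. The conversion formula is one commonly used Mel approximation, and the number of filters and the 0 to 8000 Hz range are assumptions chosen for illustration; none of them is taken from [10].

```python
import math

def hz_to_mel(f_hz):
    """Convert Hz to mels using the common 2595*log10(1 + f/700) approximation."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_center_frequencies(n_filters, f_low_hz, f_high_hz):
    """Center frequencies (Hz) of n_filters spaced uniformly on the mel axis."""
    m_low, m_high = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (m_high - m_low) / (n_filters + 1)
    return [mel_to_hz(m_low + step * (k + 1)) for k in range(n_filters)]

# Assumed example: 10 filters between 0 Hz and 8000 Hz.
centers = mel_center_frequencies(10, 0.0, 8000.0)
for c in centers:
    print(round(c, 1))
```

The spacing between consecutive centers grows with frequency: it is close to linear at low frequencies and close to exponential at high frequencies.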
1.3.3 Hybrid Systems
Hybrid systems as their name implies combine different strategies with the objective of improving
recognition rates. A common hybrid system is the Neural Network Hidden Markov Model
described in [12]. Makhoul et al. suggest an N-best paradigm that uses multiple hypotheses
instead of a single one. A segmental neural net is constructed to model the different phonemic
connections. Such modeling is not possible to perform with a HMM since the basic assumption of
the HMM restricts the dependency of the current state only on the previous one. By using multiple
connections a consistent improvement of performance is obtained.
1.4 Drawbacks of existing automatic speech recognition systems
The main drawback of the previous three methods is their blind treatment of the problem. Since
the HMM relies on probability models to reach conclusions, there is no room for human reasoning
or human-based rules. Not having any option to incorporate human knowledge into the system is
particularly odd, due to the fact that humans are well trained since early childhood to recognize
speech and they perform the task better than any existing ASR system. While in general a person
cannot give a consistent reasoning to the parameters that allow distinguishing between different
words or phonemes, human experts can read speech spectrograms with a high level of accuracy.
Spectrogram reading requires a combination of different sources of knowledge such as
articulatory movement, phonotactics, linguistics and acoustic phonetics [13]. Prof. Victor Zue of
MIT has spent over 2500 hours learning spectrogram reading and has reached impressive
recognition rates. Zue and Cole [14] have given encouraging results for automatic speech
recognition based on speech spectrograms. Different experiments demonstrate recognition rates
in the range of 85%. Such encouraging recognition rates motivate the development of an
automatic tool to perform the reading task.
In an effort to mimic the human experts' behavior we choose a large time interval on the
order of 1 second in order to capture several phonemes that may be related through
co-articulation. Speech signals can be modeled as non-stationary signals. Movements of the vocal
tract can be well represented using a wideband spectrogram. The wideband spectrogram is
generated using a relatively short time window that gives good time resolution but less accurate
frequency resolution.
Previous attempts to extract information from speech spectrograms have been made. We
note here the work of [15] that used morphological skeletons to extract information. While in
general it is possible to extract information through a skeleton-based approach, we believe it is
necessary to identify and segment the speech spectrogram into Binary Large Objects (BLOBs).
The uncertainty principle, as demonstrated by the Heisenberg–Gabor inequality, stipulates that
the extent to which a particular frequency can be localized is inversely proportional to the length
of the time interval chosen. Attempting to track down frequency changes with time using a single
pixel skeleton path is futile when the time interval is too short to allow single pixel localization.
In [13], an expert system based on spectrogram reading knowledge was devised with an
objective to segment speech into different phonemes. It deals with voiced/unvoiced fricatives,
voiced/unvoiced stops, nasals and liquids. A rule-based expert system reports recognition rates
of about 90% for the aforementioned phoneme classes. These results motivate us to focus on the
classes that are more difficult to recognize by a rule-based segmentation, namely the vowels,
glides and nasals.
1.5 Image Processing Basic Concepts
A digital image can be either acquired by sampling the continuous space or through synthetic
computer generated methods. The sampling can be regarded as some form of averaging of
energies that is represented by a matrix grid of pixels (picture elements). Many different types
and formats of images exist. For example, a z-camera can provide, in addition to a color image,
the distance from the camera; infrared night vision produces green-like images; satellite
images obtained from espionage satellites use a ruler (vertical line) that delivers a stream of
pixels per second, and these columns are combined to form an image; 3D cameras produce two
images with a disparity on the order of the distance between the eyes that, combined, show a
3 dimensional image; other 3D cameras extract information from the image by way of inference
[16]. In digital cameras, for instance, a 3 dimensional CCD (Charge-coupled Device) is used for
each pixel's Red Green and Blue (RGB) values; therefore three matrices are needed for a single
image representation.
Our work focuses mainly on two types of image representations: binary images and
grayscale images. Both types of images are generated from processing an input speech signal.
Some spectrogram readers prefer to use a color spectrogram in which different colors indicate
decreasing/increasing energy levels and maximum/minimum energy points. Since these values
can be calculated we do not focus on the visual aspects of a color image. In binary images, a
pixel takes the logical value '1' if it is a foreground pixel (black) and '0' if it is a background pixel
(white). In grayscale images each pixel usually takes a value between 0 and 255 which can be
stored in 8 bits. In grayscale images the maximum value (255) corresponds to white and the
minimum value (0) corresponds to black. Apart from the toggle of black and white for the extreme
values, grayscale images can be regarded as an extension of binary images just as Fuzzy Logic
extends the traditional Boolean logic; chapter 4 will discuss Fuzzy Logic in more detail. An image
is stored in memory as a matrix and different signal processing operators such as convolution,
filtering, tracking etc. can be performed on the image as long as their two dimensional version is
applied.
In order to convert an image from gray scale to binary we need to perform a threshold
operation on the image. Each pixel that contains a value greater than the threshold is quantized
to a logical '1' while each pixel that contains a value smaller than the threshold is quantized to a
logical '0'. The reasoning behind the selection of a logical '1' as black and a logical '0' as white is
that since the paper we print on is white, background pixels are white. Applying a threshold
means that the pixels above the threshold produce a truth value (logical '1') and therefore are
foreground (black) pixels. The decision to select zero as black in grayscale images arises from
the fact that grayscale images can be represented by an arbitrary number of bits depending on
the display type and the characteristics of the problem at hand. Commonly but not necessarily they
are represented by 8 bits (one byte). Therefore, the simplest procedure is to take black as '0' since it
has the same effect in all display types (cathode rays are shut). Many algorithms for threshold
selection exist in the literature; we will see in chapter 6 that both a global and a local threshold
are required in addition to some preprocessing of the image in order to convert it to a binary
image.
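The global threshold operation described above can be sketched as follows. This is a minimal illustration only: the image is a plain list of rows, the function name is hypothetical, and a pixel exactly equal to the threshold is assigned to the background, which is an assumption since the text leaves that case open. It does not include the local threshold or the preprocessing of chapter 6.

```python
def threshold_image(gray, threshold):
    """Binarize a grayscale image given as a list of rows of 0..255 values.

    Pixels strictly greater than the threshold become logical 1 (foreground);
    all others, including pixels equal to the threshold (assumption), become 0.
    """
    return [[1 if pixel > threshold else 0 for pixel in row] for row in gray]

# Small made-up 2x3 grayscale image.
gray = [[12, 200, 90],
        [130, 128, 255]]
binary = threshold_image(gray, 128)
print(binary)
```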
1.6 Thesis Outline
Chapter 2 provides a technical overview of phonological terms and Time Frequency
representation that are important to the understanding of the proposed speech recognition
system.
Chapter 3 serves as a tutorial to morphological image processing. An overview of mathematical
morphology, its history, axioms and advantages are described in detail. Various morphological
operators that are used in the thesis are explained and accompanied with practical examples. As
an anecdote the relationship and similarity between Fuzzy Logic and Mathematical Morphology is
presented.
Chapter 4 introduces Fuzzy Logic and some concepts from Fuzzy set theory and Evidence
theory. The Matlab™ Fuzzy Logic toolbox is presented and justifications for using Fuzzy Logic
are given.
Chapter 5 shows an attempt to extract pitch from a narrowband image spectrogram and explains
why even though the recognition and extraction of information is performed in a good manner, the
desired result of obtaining the correct pitch is not reached. Reasons for the failure of this
approach to pitch estimation are discussed.
Chapter 6 gives a description of the proposed algorithm for spectrogram segmentation.
Experimental results are discussed.
Chapter 7 summarizes the thesis with conclusions and ideas for future research.
Chapter 2
Fundamentals of Speech Theory and Time-Frequency Analysis
This chapter presents some important concepts in speech theory and time-frequency analysis.
We have seen in Chapter 1 the challenges in designing an Automatic Speech Recognition (ASR)
system. ASR systems can be implemented to detect phonemes, words or even complete
sentences. Since we are dealing with continuous speech, we do not have any indication for word
boundaries. There are different ways we can model speech, namely, articulatory, acoustic,
phonetic and perceptual. We choose to focus our recognition task on a basic speech unit, the
phoneme, and design our system to be able to recognize phonemes. To perform good phoneme
recognition, we need to understand some basic concepts of articulatory phonetics. We conclude
the chapter with a description of the uncertainty principle and time frequency analysis that allows
us to better understand the intricate features of the speech spectrogram.
2.1 Phoneme
Our work concentrates on recognizing different kinds of phonemes in the continuous speech of a
single speaker. A phoneme can be defined as a theoretical representation of sound [17]. A
phoneme is the conception of sound that is sufficient to distinguish between two different words. It
is the smallest meaningful contrastive unit in the phonology of a language [6]. For example, bear
and tear differ only by their first letter. In this case /b/ and /t/ are considered different phonemes
and bear and tear can be distinguished due to the different phonetic transcription. Since
phonemes are conceptual units it is hard to quantify their start and end point in time on a given
speech signal. Most languages have about 20-40 phonemes [6]. The TIMIT database as
described in section 2.7 provides the start and end point of each phoneme in the database as
segmented by various speech experts. The additional information provided in TIMIT can be used
for debugging, improving and displaying performance results of the recognition algorithm.
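As a hedged sketch of how such time-aligned labels could be consumed, the code below parses label lines of the form "start_sample end_sample phoneme" and converts sample indices to seconds. The line layout, the function name and the example values are assumptions made for illustration (the example is made up, not taken from the database); the 16 kHz rate is the one stated in section 2.7.

```python
SAMPLE_RATE_HZ = 16000  # sampling rate stated in section 2.7

def parse_phone_labels(text):
    """Parse lines 'start_sample end_sample label' into (start_s, end_s, label)."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, end, label = int(parts[0]), int(parts[1]), parts[2]
        segments.append((start / SAMPLE_RATE_HZ, end / SAMPLE_RATE_HZ, label))
    return segments

# Made-up example labels (hypothetical values).
example = "0 1600 h#\n1600 3200 sh\n3200 4000 iy\n"
segments = parse_phone_labels(example)
for start_s, end_s, label in segments:
    print(f"{label}: {start_s:.3f} s to {end_s:.3f} s")
```

Such reference boundaries can then be compared against the boundaries produced by a segmentation algorithm.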
2.2 Pitch
Pitch is the perceived fundamental frequency (f0) of speech. Pitch cannot be defined rigidly in
mathematical terms since it is a perceived property that represents the frequency at which the
vocal cords vibrate. Pitch plays an important part in synthetic speech production, speech
compression, speech coding and other speech related techniques and algorithms. Pitch has not
played an important part in automatic speech recognition due to the difficulty of using the pitch
information in existing speech recognition systems. However, the prospect of
incorporating pitch into a rule-based automatic speech recognition system has motivated our
investigation of the pitch. Chapter 5 thoroughly deals with the concept of pitch as we attempt to
recognize pitch by using image processing techniques on a narrowband spectrogram.
2.3 International Phonetic Alphabet (IPA)
The International Phonetic Alphabet was first developed by linguists in 1886 in an attempt to
create a different set of phonetic symbols for each language. Eventually it was decided to merge
all languages to a unique set of phonetic symbols. The IPA attempts to associate each sound
with a single phonetic symbol while two symbols are used in case the phonetic unit is produced
by a combination of two sounds. Our work concentrates on the English language. There are a few
possible representations for phonemes where a very common one is the International Phonetic
Alphabet for English (IPA). Unicode is a computer coding standard that supports most known
font systems including the IPA. We use the IPA to easily identify different phonemes. In order to
match between the IPA and the phoneme representation in TIMIT which is given in standard
ASCII format that does not include the IPA, we perform a one-to-one mapping between the two
methods of representation.
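A one-to-one mapping of this kind can be held in a pair of dictionaries. The sketch below covers only a handful of vowels; the ASCII codes and the IPA symbols chosen for them are illustrative assumptions and are not the mapping table used in this work.

```python
# Illustrative subset only (assumed codes and symbols).
ASCII_TO_IPA = {
    "iy": "i", "ih": "ɪ", "eh": "ɛ", "ae": "æ", "aa": "ɑ",
    "ao": "ɔ", "uh": "ʊ", "uw": "u", "ah": "ʌ", "er": "ɝ",
}

# Invert the table; a one-to-one mapping must not lose entries when inverted.
IPA_TO_ASCII = {ipa: code for code, ipa in ASCII_TO_IPA.items()}
if len(IPA_TO_ASCII) != len(ASCII_TO_IPA):
    raise ValueError("mapping is not one-to-one")

def to_ipa(codes):
    """Translate a sequence of ASCII phoneme codes to IPA symbols."""
    return [ASCII_TO_IPA[c] for c in codes]

print(to_ipa(["iy", "ae", "uw"]))
```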
2.4 Voiced and Unvoiced Speech
A voiced sound is one in which the vocal cords vibrate. A speech recognition system must be
able to distinguish between voiced and unvoiced speech in order for it to detect moments of
silence, end of word, different phoneme classes, whispers, coughs, giggles and other information.
Since pitch can also give an indication of whether the speech segment is voiced or not we again
see the importance of pitch in the overall recognition process. In general, voicing is a binary
parameter even though in some particular cases there can be a degree of voicing. Degrees of
voicing are usually measured by duration (voice onset time) and by intensity. In English, a
half-voiced/partially voiced sound is caused by voicing over part of the sound (short duration). Low
intensity voicing occurs when the voicing is weak and is also considered as a partially voiced
sound.
2.5 Different Phoneme Classes
We present in the following some basic concepts in phonetics that are used throughout this work.
We also show a possible implementation of fuzzy membership functions for the vowels based on
statistical data.
2.5.1 Sonorant
A sonorant is a speech sound that is produced without turbulent airflow in the vocal tract. If a
sound can be produced continuously with the same pitch it is considered a sonorant. The
sonorants include the following classes: Vowels, Approximants, Nasals, Taps and Trills. Sonorants
are voiced in most world languages.
2.5.2 Vowels
Vowels and diphthongs are the phonemes with the greatest intensity. Most of the energy of the
vowels is concentrated in the first formant f1, which is usually below 1 kHz. The vowels can be
modeled as quasi-periodic with the periodicity being the fundamental frequency, f0.
Vowels can be distinguished primarily by their first three formants. Following the work of [18] we
obtain the following table for the average location of the first three formants of different English
phonemes:
Phoneme   f1 (Hz)   f2 (Hz)   f3 (Hz)
/i/       270       2290      3010
/I/       390       1990      2550
/E/       530       1840      2480
/@/       660       1720      2410
/a/       730       1090      2440
/c/       570       840       2410
/U/       440       1020      2240
/u/       300       870       2240
/A/       640       1190      2390
/R/       490       1350      1690
Table 2.1 Average formant frequency values for selected phonemes
We would like to examine a possible implementation of a Fuzzy Logic (FL) system.
Seeing what an FL system would look like assists in the development of the segmentation
algorithm, since it shows what kind of information we need to extract from the image
spectrogram. We use the table as reference for the expected formant locations. An automatic script
creates an FL system that assigns a grade to each phoneme based on its similarity to a given
vowel. The input to this FL system is the average formant frequency location of each phoneme. It is
compared with the average formant frequency locations for the English language. These average
values can further be adjusted and trained to specific speakers; however, in this case we use
constant reference data. A triangular membership function (TMF) is created around each
phoneme. The TMF is selected since it gives coverage of symmetric distributions with a minimal
number of parameters. The MF is constructed so that it attains zero at each adjacent
average formant value and attains one at the target average formant value. In this way, each
mid-frequency point can be associated with 2 membership functions. For three formants, at the
most, we obtain 2^3 = 8 possibilities; however, after matching the formants and giving priority to two
agreeing formants over one, and to three agreeing formants over two, we can further reduce this
number. We see that in order for this type of FL system to work we would need to provide it with
some kind of estimate of the formant frequencies. As explained in section 1.2 and the appendix,
TMFs are well suited to cases of limited or very low prior knowledge.
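The construction of the triangular membership functions can be sketched for the first formant as follows. This is a minimal illustration, not the automatic script mentioned above: the function names are hypothetical, only the f1 column of Table 2.1 is used, and the treatment of the lowest and highest averages (mirroring the distance to their single neighbor) is an assumption.

```python
# First-formant averages (Hz) copied from Table 2.1.
F1_HZ = {"i": 270, "I": 390, "E": 530, "@": 660, "a": 730,
         "c": 570, "U": 440, "u": 300, "A": 640, "R": 490}

def triangular(x, left, peak, right):
    """Triangular membership: 0 at left and right, 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def build_membership_functions(averages):
    """Return {phoneme: (left, peak, right)} with feet at the adjacent averages."""
    items = sorted(averages.items(), key=lambda kv: kv[1])
    values = [v for _, v in items]
    functions = {}
    for k, (name, peak) in enumerate(items):
        # Assumption: at either end of the axis, mirror the distance to the only neighbor.
        left = values[k - 1] if k > 0 else peak - (values[1] - peak)
        right = values[k + 1] if k < len(values) - 1 else peak + (peak - values[-2])
        functions[name] = (left, peak, right)
    return functions

def grade(x, functions):
    """Non-zero membership grades of a measured formant frequency x (Hz)."""
    grades = {name: triangular(x, *abc) for name, abc in functions.items()}
    return {name: g for name, g in grades.items() if g > 0.0}

mfs = build_membership_functions(F1_HZ)
print(grade(280.0, mfs))
```

A frequency lying between two adjacent averages receives non-zero grades from exactly two membership functions, which is the property used in the counting argument above.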
2.5.3 Approximants
Approximants are sounds that can be regarded as in between vowels and typical consonants.
Articulators narrow the vocal tract but leave enough space for air to flow without much audible
turbulence. Approximants can be slightly fricated in case the out flowing air becomes turbulent.
2.5.4 Nasals
Nasals are voiced sonorants and are produced by using the tongue to block the air, allowing it
to escape freely only through the nose. The tongue articulation (and not the nose itself) differentiates
among the different nasals. English nasal sounds consist of [m], [n], [ŋ]. Nasal waveforms are
called murmurs. Murmurs are similar to vowel waveforms but have significantly weaker energy
due to a spectral null. The frequency of the spectral null is inversely proportional to the length of the oral cavity
behind the constriction. Since humans have a very poor perceptual resolution of spectral nulls,
other cues such as formant transitions in adjacent sounds and spectral jumps distinguish nasals
from other phonemes. These characteristics together with the spectral null can be observed in a
spectrogram image of a nasal phoneme.
2.5.5 Taps/Flaps
An articulator is thrown towards another articulator using a single contraction of the muscles to
produce a consonantal sound that is a tap. Taps and flaps are considered by most linguists to
be the same even though some distinctions can be made to distinguish between the two.
Unlike a stop (plosive) consonant, a flap does not consist of a buildup of air pressure behind the
place of articulation, therefore there is no release burst upon producing the sound.
An example of an alveolar tap is the consonant /ɾ/ or /tt/ as in the English word latter.
2.5.6 Trills
A trill is a consonantal sound produced by vibrations between the articulator and the place of
articulation. Trills significantly differ from flaps. In a trill the articulator is held in place unlike
flaps/taps in which an active articulator is struck against a passive one. Trills, unlike flaps, vary in
the number of periods over which they occur. Normally trills vibrate for 2-3 periods; however, some trills last 5
periods or more and in some cases a trill can last for a single period. Single period trills differ from
flaps by articulation. The trills consist of three phonemes: [ʙ] [r] [ʀ].
2.5.7 Obstruents
Obstruents include stops (also known as plosives), fricatives and affricates.
Stops
Stops are transient phonemes and thus are acoustically complex. Most stops start with a silent
period due to closure of the articulators. Some stops have a voiced bar of energy in the first few
harmonics (some voiced stops). The voiced bar is caused by radiation of periodic glottal pulses
through the walls of the vocal tract. The throat and the cheek heavily attenuate all other
frequencies. Stops can be very brief in between vowels. Alveolar stops, in particular, may
become flaps in which the tongue tip retains contact with the palate for about 10-40 ms.
Fricatives
Fricatives are caused by forcing air through a narrow channel made by two articulators that are
moved close to each other. Producing /f/ using the lower lip against the upper teeth is an example
of placing two articulators closely together.
Sibilants (stridents) are a subset of fricatives. They are created by curling the tongue lengthwise
to direct the air caused by the closely placed articulators towards the edge of the teeth. Examples
of sibilants are English [s], [z], [ʃ], and [ʒ].
Affricates
An affricate is a sound that begins with a stop (plosive) and ends with a fricative. The two English
affricates are /dg/ or in IPA /ʤ/ as in Jump and /ch/ or in IPA /ʧ/ as in Charly. We can regard the
affricates either as a combination or as a single phonemic unit. We regard the affricates as a
single phonemic unit.
2.6 Coarticulation
Speech is produced by articulator gestures that essentially overlap in time. The shape of the
vocal tract is highly dependent on the previous and successive phonemes. In general there are
right-left (RL) and left-right (LR) coarticulations. In RL the articulator may move toward the
succeeding phoneme in case the articulator's new position does not interfere with the current
phoneme. RL coarticulation is also called anticipatory coarticulation since the brain prepares the
articulators during the current phoneme to pronounce the following phoneme. Unlike RL,
LR coarticulation does not involve any look-ahead planning but is driven by the
physical constraints that are imposed on the articulators when moving from one phoneme, for
example a consonant, to another phoneme, for example a vowel. In order to reduce the effort
required to pronounce the different phonemes, instead of moving fast the articulators
are allowed to return to their natural position over several phones. Thus LR coarticulation helps
reduce the effort in pronouncing different phonemes consecutively.
From the Automatic Speech Recognition perspective, coarticulation poses a major hurdle.
Instead of having a small possible alphabet of phonemes to recognize we have plenty of possible
combinations, all flavored with inter-speaker variability in pronunciation, energy levels, speed and
system noise. Understanding the effects of coarticulation is of particular importance when
designing an expert-based Fuzzy Logic system that relies on input from speech spectrograms.
Since there are dependencies between different phonemes we cannot simply use a regression
table for values of the locations of the first three formants of the vowels with appropriate
confidence intervals for example in order to perform vowel recognition. We need to take into
account different variations due to coarticulation that would probably cause different constraints to
contradict and result in no correct answer. We need to allow flexibility in the design of the system
to allow it to output different possible outcomes with various grades of belief that would later be
reconciled through a Maximum A-Posteriori algorithm such as the Viterbi algorithm.
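To make the reconciliation step concrete, here is a minimal Viterbi sketch. It treats the per-segment grades of belief as scores and combines them with transition weights to find the single most likely label sequence. The two labels, the scores and the transition weights are made-up values, and interpreting fuzzy grades as probabilities is an assumption of this sketch.

```python
def viterbi(frame_scores, transition, labels):
    """Return (score, path) of the best label sequence.

    frame_scores: list of dicts label -> score for each segment
    transition:   dict (previous_label, current_label) -> weight
    """
    # Best score and path ending in each label after the first segment.
    best = {lab: (frame_scores[0].get(lab, 0.0), [lab]) for lab in labels}
    for scores in frame_scores[1:]:
        new_best = {}
        for cur in labels:
            candidates = [
                (best[prev][0] * transition.get((prev, cur), 0.0) * scores.get(cur, 0.0),
                 best[prev][1] + [cur])
                for prev in labels
            ]
            new_best[cur] = max(candidates, key=lambda c: c[0])
        best = new_best
    return max(best.values(), key=lambda c: c[0])

labels = ["a", "i"]
# Hypothetical grades for three consecutive segments.
frame_scores = [{"a": 0.6, "i": 0.4}, {"a": 0.45, "i": 0.55}, {"a": 0.7, "i": 0.3}]
# Hypothetical transition weights favoring staying in the same label.
transition = {("a", "a"): 0.8, ("a", "i"): 0.2, ("i", "a"): 0.2, ("i", "i"): 0.8}

score, path = viterbi(frame_scores, transition, labels)
print(path, score)
```

In this example the segment-by-segment maximum would give a, i, a, while the jointly best sequence keeps the label a throughout because the weak preference for i in the middle segment does not outweigh the cost of two transitions.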
2.7 The DARPA TIMIT Speech Database
The DARPA (Defense Advanced Research Projects Agency) TIMIT speech database consists of
utterances of 630 speakers of eight major dialects of American English. The database is
designed to assist in the development and testing of Automatic Speech Recognition systems.
Both an orthographic transcription and a time-aligned phoneme transcription are included for
each binary speech file. Different files in the database have different purposes which are
distinguished according to the file/directory names. For example, files whose names start with sx
are MIT (Massachusetts Institute of Technology) phonetically compact sentences while files
that start with si are TI (Texas Instruments) random contextual variant sentences. Both genders
are represented and phonemes can be tracked according to their location in the sentence, given by
the sample numbers at which they are present. The database was recorded in low noise conditions at
a sampling rate of 16 kHz. The original database is in big-endian format and conversion to
little-endian is necessary when reading speech files from the CD-ROM to an Intel-based machine.
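The byte-order conversion can be sketched as below. The sketch assumes that the samples are 16-bit signed integers and that any file header has already been removed; both are assumptions of this illustration, and the function names are hypothetical.

```python
import struct

def big_endian_pcm16_to_samples(raw_bytes):
    """Decode big-endian 16-bit signed samples into a list of Python ints."""
    count = len(raw_bytes) // 2
    return list(struct.unpack(f">{count}h", raw_bytes[:2 * count]))

def samples_to_little_endian_pcm16(samples):
    """Encode a list of ints as little-endian 16-bit signed samples."""
    return struct.pack(f"<{len(samples)}h", *samples)

# Three made-up samples stored big-endian: 1, -2 and 32767.
raw = bytes([0x00, 0x01, 0xFF, 0xFE, 0x7F, 0xFF])
samples = big_endian_pcm16_to_samples(raw)
little = samples_to_little_endian_pcm16(samples)
print(samples, little.hex())
```

Decoding with an explicit byte order, rather than the machine's native order, gives the same sample values on any platform.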
2.8 The Uncertainty Principle
The uncertainty principle, also known as Heisenberg's uncertainty principle, is directly derived from
the Fourier transform equations. Analyzing a signal over a long time duration produces more
specific frequency results at the cost of reduced localization in time. If high frequency resolution
is desired we need a longer time segment. However, if high frequency resolution is not required
we can take a shorter time segment that gives better time resolution. This trade-off gives rise
to two common spectrograms, namely the narrowband and the wideband spectrogram.
The narrowband spectrogram, as used in chapter 5, gives good frequency resolution, which is
essential in determining the pitch, which takes values in the range of 100 Hz and requires
a resolution of a few Hz. The wideband spectrogram is used in determining the location
and direction in frequency of the different formants which range up to 4 kHz with a required
resolution in the range of 100 Hz.
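The trade-off can be made concrete with the rule of thumb that a window of duration T seconds resolves frequencies roughly 1/T Hz apart. The sketch below applies it at the 16 kHz sampling rate of section 2.7. The rule ignores the window shape, and taking "a few Hz" to mean 5 Hz is an assumption made only for this example.

```python
def frequency_resolution_hz(window_seconds):
    """Rule-of-thumb frequency resolution of an analysis window of given duration."""
    return 1.0 / window_seconds

def window_length_samples(resolution_hz, sample_rate_hz=16000):
    """Number of samples needed for a desired frequency resolution."""
    return round(sample_rate_hz / resolution_hz)

wideband_samples = window_length_samples(100.0)   # about 100 Hz resolution
narrowband_samples = window_length_samples(5.0)   # assumed 5 Hz resolution

print(wideband_samples, wideband_samples / 16000)      # samples, seconds
print(narrowband_samples, narrowband_samples / 16000)  # samples, seconds
```

Under these assumptions the wideband window is 160 samples (10 ms) and the narrowband window is 3200 samples (200 ms), so improving the frequency resolution by a factor of 20 costs a factor of 20 in time resolution.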
The Gaussian function is the only function that attains the maximal joint time-frequency
resolution. Since a Gaussian is an eigenfunction of the Fourier transform, the transform of a
Gaussian using a General Fourier transform is still a Gaussian. This special property gives the
theoretical justification to the Gabor functions (and Gabor wavelet transform) that are used in
chapter 5 to better identify horizontal lines in the spectrogram.
For this example we define the Fourier transform as:
(2.1) $F(w) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(t)\, e^{-jwt}\, dt$
And the inverse Fourier transform as:
(2.2) $f(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} F(w)\, e^{jwt}\, dw$.
Note that instead of a single constant used for the inverse transform we now have two identical
constants. As long as the product of the constants is $\frac{1}{2\pi}$ we can select the constants at will. In
most computational software the constant is used only on the inverse transform to save floating
point multiplications; however, in demonstrating the Gaussian property as an eigenfunction of the
Fourier transform, we make the aforementioned selection. We note that the Gaussian is not the
only function that serves as an eigenfunction of the Fourier transform; another example is
the pulse train signal.
Using integral tables (or calculating using the residue theorem) we obtain that for the above
definition of the Fourier transform, the relationship is:
(2.3) $f(t) = e^{-\frac{t^2}{2\sigma^2}} \;\overset{\text{Fourier}}{\Longrightarrow}\; F(w) = \sigma \cdot e^{-\frac{\sigma^2 w^2}{2}}$
We input a centered normalized Gaussian of unity variance, N(0,1), to the Fourier transform. The
output is identical to the input and the Fourier transform did not change the input (except for a
countable set of points of zero measure).
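The eigenfunction property can be checked numerically. The sketch below approximates the Fourier integral with the symmetric constant $\frac{1}{\sqrt{2\pi}}$ by a midpoint sum and compares the result with the Gaussian itself for unit σ. The integration limits and the number of steps are arbitrary choices of this sketch, and the function names are hypothetical.

```python
import cmath
import math

def unitary_fourier_transform(f, w, t_max=12.0, n=6000):
    """Approximate (1/sqrt(2*pi)) * integral of f(t) * exp(-j*w*t) dt by a midpoint sum."""
    dt = 2.0 * t_max / n
    total = 0.0 + 0.0j
    for k in range(n):
        t = -t_max + (k + 0.5) * dt  # midpoint of the k-th sub-interval
        total += f(t) * cmath.exp(-1j * w * t)
    return total * dt / math.sqrt(2.0 * math.pi)

def gaussian(t, sigma=1.0):
    """Unnormalized Gaussian exp(-t^2 / (2*sigma^2))."""
    return math.exp(-t * t / (2.0 * sigma * sigma))

# For sigma = 1 the transform evaluated at w should match the Gaussian evaluated at w.
for w in (0.0, 0.5, 1.0, 2.0):
    transformed = unitary_fourier_transform(gaussian, w)
    print(w, transformed.real, gaussian(w))
```

For a general σ the same routine can be compared against $\sigma \cdot e^{-\sigma^2 w^2 / 2}$; only the unit-σ case maps the function onto itself.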
2.9 Time Frequency Representation
Time-Frequency Representation (TFR) differs from a spectrogram representation by taking the
square value of the signal energy instead of its logarithm. Nadine Martin examined different
algorithms for TFR segmentation [19, 20]. In [19] two algorithms for TFR segmentation are
suggested. The first is based on morphological filters and the watershed transform and the
second is based on tracking using a Kalman filter. Another interesting segmentation scheme
based on statistical features of a spectrogram is presented in [20]. The TFR has better resolution
and the variance of the Capon estimator used to segment the image is lower according to [19].
The segmentation is blind toward the analyzed signal; it does not assume that the signal is of any
particular type. A speech signal will be analyzed in exactly the same way as a seismic signal. In
addition the algorithm does not require tuning. Assuming a deterministic signal corrupted by
additive Gaussian noise a probability model is developed to allow for local segmentation of
objects. We note that by ignoring our knowledge of the signal's source we lose much prior
information that is known about speech signals and can be used in the segmenting process.
Wideband speech spectrograms are indeed noisy images. Vertical lines that striate the
spectrogram show that it may be inappropriate to model the noise as a lognormal distribution as
would be the case if we apply the algorithm developed in [19] for TFR to the spectrogram. The
vertical striating lines are caused by the opening and closure of the vocal cords. These lines
appear at a regular spacing that can be used as a rough approximation to the fundamental
frequency f0, also known as the pitch. Another caveat for using a blind method as proposed in
[19, 20] is the difficulty in adjusting it to recognize specific types of information present in the
speech spectrogram. Existing speech recognition systems have been developed for many years
and reach impressive results. Developing a new recognition system is a challenging task. Using
Dempster's Rule it is possible to combine two sources of evidence to a joint basic assignment.
We will see in chapter 4 a detailed description of Dempster's rule of combining evidence.
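As a brief preview, Dempster's rule in its standard form multiplies the masses of every pair of focal sets, assigns each product to the intersection of the pair, and renormalizes by the mass that did not fall on the empty set. The sketch below is a minimal illustration with two made-up sources of evidence over two candidate phonemes; the function name and all mass values are hypothetical.

```python
def dempster_combine(m1, m2):
    """Combine two basic assignments given as dicts frozenset -> mass."""
    combined = {}
    conflict = 0.0
    for set_a, mass_a in m1.items():
        for set_b, mass_b in m2.items():
            intersection = set_a & set_b
            if intersection:
                combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b  # mass that would fall on the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

A, I = frozenset({"a"}), frozenset({"i"})
EITHER = frozenset({"a", "i"})

# Hypothetical evidence: source 1 leans toward /a/, source 2 weakly toward /i/.
m1 = {A: 0.6, EITHER: 0.4}
m2 = {I: 0.3, EITHER: 0.7}

joint = dempster_combine(m1, m2)
for focal_set, mass in joint.items():
    print(sorted(focal_set), round(mass, 4))
```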
2.10 Summary
In this chapter we examined different concepts of speech. We also reviewed the uncertainty
principle and time-frequency representation methods. In chapter 6 we will see how different
phonemes take the form of a smeared energy shape in a speech spectrogram. An expert with
intricate knowledge of the speech process can read the spectrogram and make sense of it. We
will also see in chapter 6 how the morphological image processing tools presented in the
following chapter allow us to extract information from the speech spectrogram even though it is
corrupt due to the uncertainty principle and vertical lines caused by the pitch. Finally, appendix A
gives an idea on how to utilize the information given in table 2.1 in a Fuzzy Logic based expert
system.
Chapter 3
Mathematical Morphology
Mathematical morphology is a theoretical model that justifies particularly useful operators in
image processing applications. Based on lattice and set theory and axioms, mathematical
morphology provides solutions to handling of specific geometrical objects in different topological
spaces. Basic operators are used to construct more complex operators while all operators rely on
structure elements that serve as geometrical modular units to the different morphological
operators. We first begin with a historical overview of Mathematical morphology. We continue
with basic axioms and definitions that exemplify the importance of the different morphological
operators. We review the morphological operators starting with the most basic operators of
dilation and erosion, and finally we conclude with the algorithmic scheme of the Watershed
Transform that is used for image segmentation. We end this chapter by showing the close
relationship between Fuzzy Logic and Mathematical Morphology and by concluding on the
importance and relevancy of mathematical morphology to automated spectrogram reading.
3.1 History of Mathematical Morphology
Mathematical morphology was developed in 1964 by Jean Serra while doing his PhD work with
Georges Matheron [21, 22]. Serra was researching the iron deposits of Lorraine. A method to
distinguish between different shapes was needed and the Hit And Miss transform was developed
to identify specific shapes in the image. Opening and the essence of a structure element were
investigated. The idea was to use the particular previously known shape of the minerals in
question in order to identify and classify these minerals in images.
In the 1970s mathematical morphology was further developed and recursive algorithms were
implemented, such as ultimate erosion, binary thinning, Skeleton by Influence Zones, Watershed
Transform and more. Two notable improvements occurred in the 1980s. The first was the
establishment of the foundations of mathematical morphology with respect to the mathematical
field of complete lattices and graph theory. The second was the numerous books and industrial
products and applications based on mathematical morphology. The 1990s made mathematical
morphology an important tool in segmentation through the refinement of existing algorithms.
Non-linear mathematical morphology based filtering and different connectivity methods (topology)
were developed.
Today, mathematical morphology is an important tool in industrial applications that require fast
processing and detection of objects in acquired images. For example, in biomedical imaging
mathematical morphology is used to detect different objects in an image. In the oil industry,
mathematical morphology is used to estimate both intensity and range when addressing the issue of
sand and oil ore.
3.2 Useful Properties
The following sections are based on [21, 23, 24]:
Extensivity: An extensive operator never removes foreground pixels, so the output contains the
input:
(2.1) $X \subseteq \psi(X)$
Anti-extensivity: An anti-extensive operator never adds foreground pixels, so the output is
contained in the input:
(2.2) $\psi(X) \subseteq X$
Idempotence: Reapplying an idempotent operator does not change the result:
(2.3) $\psi(\psi(X)) = \psi(X)$
Nonlinearity: Morphological operators are nonlinear (except in special cases). In general they do
not satisfy the linearity mapping condition:
(2.4) $\psi(a_1 x + a_2 y) = a_1 \psi(x) + a_2 \psi(y), \quad a_1, a_2 \in \mathbb{R}$
The linearity mapping condition is a necessary and sufficient condition for the linearity of an
operator. One consequence of the nonlinearity is that in many cases an inverse does not exist
since there is loss of information. Reconstruction of an image that has undergone a
morphological operator is in most cases impossible. An example of a nonlinear operator that
does have an inverse is the Medial Axis Transform that will be described later in the chapter.
Duality Principle
In general, morphological operators exist in pairs, for example dilation and erosion, opening and
closing, thinning and thickening, etc. Other duality properties are:
Self dual
(2.5) $\psi(X^C) = (\psi(X))^C$
Invariant under duality
(2.6) $\psi(X^C) = \psi(X)$
In the following sections and examples we will use the circles image, which is a standard
benchmark binary image from the Matlab™ database, as the original image and the unit-radius disc
as the structure element. Each element with a logical value of '1' represents a foreground pixel
while each element with a logical value of '0' represents a background pixel. The structure
element (SE) and the benchmark circles image are as follows:
(2.7) $SE = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{pmatrix}$
Figure 3.1: Circles benchmark image.
3.2.1 Connectivity
Since we are using a digital grid instead of a continuous space, we need to perform quantization
to a discrete area/volume. We will concentrate on the two-dimensional case although connectivity
is a well-defined concept in higher dimensions. The quantization takes the form of pixels (picture
elements) which are a low-pass smoothed result of the actual continuous image. In order to
define a distance between two pixels, or to decide whether one pixel is a neighbor of another, we
need to specify a connectivity. A straightforward way of describing connectivity is to specify, in
matrix form, a logical '1' for all pixels that are neighbors of the current pixel. The current pixel
is the center of the matrix and is also denoted by a logical '1'. For example, we show a diagonal
(and in this case symmetrical) connectivity:
(2.8) $\begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}$
An example of an asymmetrical connectivity is:
(2.9) $\begin{pmatrix} 0 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$
The two most common connectivity schemes in 2D are “4-connected” and “8-connected”. The
“8-connected” case is also known as fully connected.
An example of “4-connected”:
(2.10) $\begin{pmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{pmatrix}$
An example of “8-connected”:
(2.11) $\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$
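The effect of the connectivity choice can be illustrated with a short sketch. The helper names `neighbor_offsets` and `count_components` are ours and purely illustrative; the sketch assumes a binary image stored as a list of rows and counts connected foreground components under a given 3x3 connectivity matrix.

```python
from collections import deque

# Connectivity matrices in the convention of this section: a 1 marks a neighbor
# of the center pixel (the center itself is also marked with a 1).
FOUR_CONNECTED = [[0, 1, 0],
                  [1, 1, 1],
                  [0, 1, 0]]
EIGHT_CONNECTED = [[1, 1, 1],
                   [1, 1, 1],
                   [1, 1, 1]]


def neighbor_offsets(connectivity):
    """Turn a 3x3 connectivity matrix into (row, col) offsets, skipping the center."""
    return [(r - 1, c - 1)
            for r in range(3) for c in range(3)
            if connectivity[r][c] and (r, c) != (1, 1)]


def count_components(image, connectivity):
    """Count foreground components of a binary image by breadth-first search."""
    rows, cols = len(image), len(image[0])
    offsets = neighbor_offsets(connectivity)
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] and (r, c) not in seen:
                count += 1
                seen.add((r, c))
                queue = deque([(r, c)])
                while queue:
                    cr, cc = queue.popleft()
                    for dr, dc in offsets:
                        nr, nc = cr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and image[nr][nc] and (nr, nc) not in seen):
                            seen.add((nr, nc))
                            queue.append((nr, nc))
    return count


# Two pixels that touch only at a corner.
diagonal_pair = [[1, 0],
                 [0, 1]]
print(count_components(diagonal_pair, FOUR_CONNECTED))   # 2
print(count_components(diagonal_pair, EIGHT_CONNECTED))  # 1
```

Two pixels that touch only at a corner form two objects under 4-connectivity but a single object under 8-connectivity, which is why the connectivity must be fixed before distances or neighbors are discussed.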
3.2.2 Erosion
Erosion is a basic morphological operation. A kernel, typically using a logical '1' as an indicator
of the Structure Element (SE), is used. The objective is to find the positions in the image at
which the SE fits entirely within the foreground. At each position we regard the center kernel
point as the reference point and output a logical '1' if the SE is fully contained within the image
with regard to that reference point. The operation is equivalent to a logical AND in the binary
case. The output image will contain all the points at which the SE is fully contained in the
original image. The image is therefore eroded with respect to the SE. Hence erosion is an
anti-extensive operator.
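A minimal sketch of binary erosion follows. It assumes the image is a list of rows of 0/1 values, that the SE reference point is its center, and that pixels outside the image count as background; these are choices of the sketch and not of any particular toolbox.

```python
def erode(image, se):
    """Binary erosion: keep a pixel only if the SE, centered there, fits in the foreground.

    Pixels outside the image are treated as background (an assumption of this sketch).
    """
    rows, cols = len(image), len(image[0])
    cr, cc = len(se) // 2, len(se[0]) // 2
    offsets = [(r - cr, c - cc)
               for r in range(len(se)) for c in range(len(se[0])) if se[r][c]]
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Logical AND over all image pixels covered by the SE.
            out[r][c] = int(all(
                0 <= r + dr < rows and 0 <= c + dc < cols and image[r + dr][c + dc]
                for dr, dc in offsets))
    return out


SE = [[0, 1, 0],
      [1, 1, 1],
      [0, 1, 0]]

image = [[0, 0, 0, 0, 0],
         [0, 1, 1, 1, 0],
         [0, 1, 1, 1, 0],
         [0, 1, 1, 1, 0],
         [0, 0, 0, 0, 0]]

eroded = erode(image, SE)
for row in eroded:
    print(row)
```

Only the center of the 3x3 square survives, since it is the only position at which the unit disc fits entirely inside the object.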
3.2.3 Ultimate Erosion
Consider eroding an image over and over again until an idempotent state is reached. The
Ultimate Erosion is the union of all differences of erosion and reconstruction using opening at all
stages. The Ultimate Erosion is important in robust marker selection which is usually a part of the
watershed transform.
In the following example, the ultimate erosion was calculated. We obtain the centers of all 13
circles and some additional spurious points. Using an additional simple restriction that limits
the distance between centers to be in the range of the radii (the most common distance between each
pair of points), it is possible to obtain the exact location of the circles' centers. Using that
information we can obtain a good estimate of the number of circles in the original image. A negative
of both the circles image and the ultimate erosion is presented:
Figure 3.2: Circles before (left) and after applying ultimate erosion.
3.2.4 Dilation
Dilation is the dual operator of erosion. However, in general, dilation is not the inverse operator
of erosion: dilating an eroded image recovers the original only if the image is invariant under
opening with the SE. In the same manner erosion is, in general, not the inverse operator of
dilation unless the image is invariant under closing with the SE.
Dilation follows the same scanning of the image by a kernel in which the SE is indicated by a
logical '1'. However, a logical OR is used in the binary case and a resulting '1' is written to the
output image if at least one pixel in the image corresponds with the SE.
Therefore dilation is an extensive operator.
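Dilation can be sketched in the same style. The sketch assumes a list-of-rows binary image, an SE whose reference point is its center, and background outside the image; the SE is reflected, which makes no difference for the symmetric unit disc used here.

```python
def dilate(image, se):
    """Binary dilation: set a pixel if the reflected SE, centered there, hits the foreground."""
    rows, cols = len(image), len(image[0])
    cr, cc = len(se) // 2, len(se[0]) // 2
    offsets = [(r - cr, c - cc)
               for r in range(len(se)) for c in range(len(se[0])) if se[r][c]]
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Logical OR over all image pixels reached through the SE.
            out[r][c] = int(any(
                0 <= r - dr < rows and 0 <= c - dc < cols and image[r - dr][c - dc]
                for dr, dc in offsets))
    return out


SE = [[0, 1, 0],
      [1, 1, 1],
      [0, 1, 0]]

image = [[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 1, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]

dilated = dilate(image, SE)
for row in dilated:
    print(row)
```

A single foreground pixel grows into a copy of the SE, and every original foreground pixel is kept, as expected of an extensive operator when the SE contains its own center.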
3.2.5 Morphological Gradient
Also known as the Beucher Gradient, the morphological gradient is defined as:
(2.12) g(f) = (f ⊕ B) − (f ⊖ B).
Usually the same structure element is used for both dilation and erosion. The morphological
gradient is used in determining the boundaries of an object which can be of particular importance
in segmentation algorithms such as the Watershed Transform.
An approximation of a gradient would in most cases require computing two directional gradients
in the horizontal and vertical directions. The directional gradients can be computed by convolving
the image with a Sobel kernel and then combining the results of the vertical and
horizontal “derivatives” to obtain the direction of the gradient at each point. Using mathematical
morphology to compute the gradient allows a nonlinear granular geometric approach. When
detecting objects with specific (known) geometrical boundaries, the advantage of a Beucher
gradient is evident: the structure element allows the boundary to be tracked closely.
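For a flat structure element the Beucher gradient reduces to a local maximum minus a local minimum. The sketch below assumes a gray-level image stored as a list of rows, the unit disc as SE, and that pixels outside the image are simply ignored; the border rule is a choice of this sketch.

```python
# Offsets of the unit disc (4-connected cross), including the center.
SE_OFFSETS = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]


def window(image, r, c, offsets):
    """Values of the image under the SE centered at (r, c), ignoring out-of-image pixels."""
    rows, cols = len(image), len(image[0])
    return [image[r + dr][c + dc] for dr, dc in offsets
            if 0 <= r + dr < rows and 0 <= c + dc < cols]


def beucher_gradient(image, offsets=SE_OFFSETS):
    """Gray-level dilation (local max) minus gray-level erosion (local min)."""
    return [[max(window(image, r, c, offsets)) - min(window(image, r, c, offsets))
             for c in range(len(image[0]))]
            for r in range(len(image))]


# A vertical step edge between gray levels 0 and 5.
step = [[0, 0, 5, 5],
        [0, 0, 5, 5],
        [0, 0, 5, 5]]

gradient = beucher_gradient(step)
for row in gradient:
    print(row)
```

The response is nonzero only in the two columns adjacent to the step, so a simple threshold isolates the boundary.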
An example of applying a Beucher gradient to the circles image corrupted by white Gaussian noise:
Figure 3.3: Circles image corrupted by white Gaussian noise after applying the Beucher gradient.
After a simple threshold we are left with the object borders. To reconstruct the
original image, in this case, we can use a flooding algorithm that identifies the interior of the
objects and fills in the gaps.
3.2.6 Opening
(2.13) Open(Im,SE) = Im ◦ SE
Opening is performed by dilating the result of an erosion of an image with the same structure
element (SE). Closing is the dual operator of opening. Opening is normally used to open an
image and clean it of salt noise, and to separate different object types (lines in
specific directions, specific objects). Simply eroding the image will remove pixels from 'clean'
image objects; it is therefore necessary to perform dilation to reconstruct these objects. The salt
noise will disappear after erosion and will not grow back in the subsequent dilation. Normally after
opening, features smaller than the structure element are removed, while features larger than the
SE remain about the same.
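The following sketch composes erosion and dilation into an opening and applies it to a small object with one isolated salt pixel. It assumes list-of-rows binary images, the unit disc as SE, and background outside the image; the helper names are illustrative.

```python
def _offsets(se):
    """Offsets of the SE's 1-entries relative to its center."""
    cr, cc = len(se) // 2, len(se[0]) // 2
    return [(r - cr, c - cc) for r, row in enumerate(se) for c, v in enumerate(row) if v]


def _pixel(image, r, c):
    # Outside the image counts as background (assumption of this sketch).
    return image[r][c] if 0 <= r < len(image) and 0 <= c < len(image[0]) else 0


def erode(image, se):
    offs = _offsets(se)
    return [[int(all(_pixel(image, r + dr, c + dc) for dr, dc in offs))
             for c in range(len(image[0]))] for r in range(len(image))]


def dilate(image, se):
    offs = _offsets(se)
    return [[int(any(_pixel(image, r - dr, c - dc) for dr, dc in offs))
             for c in range(len(image[0]))] for r in range(len(image))]


def opening(image, se):
    """Erosion followed by dilation with the same SE."""
    return dilate(erode(image, se), se)


SE = [[0, 1, 0],
      [1, 1, 1],
      [0, 1, 0]]

# A 3x3 square plus one isolated salt pixel in the top-right corner.
noisy = [[0, 0, 0, 0, 0, 1],
         [0, 1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0, 0],
         [0, 0, 0, 0, 0, 0]]

opened = opening(noisy, SE)
for row in opened:
    print(row)
```

The salt pixel disappears, while the square is reduced to the largest shape that the unit disc can sweep out inside it: its four corners are lost, which is what "about the same" means in practice for an object only slightly larger than the SE. Applying `opening` to the result again leaves it unchanged.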
The morphological opening presented is idempotent. It is also anti-extensive and increasing. In
algebra, every operation which is increasing, anti-extensive and idempotent is called an opening:
(2.14) $\gamma_B = \delta_B \varepsilon_B$
The following example demonstrates how opening can be used to reduce Salt and Pepper noise:
Figure 3.4: Circles image corrupted by Salt and Pepper noise (left) and after opening.
3.2.7 Closing
(2.15) Close(Im,SE) = Im • SE
Some Image Processing algorithms create holes in the image. Such holes, as well as pepper noise,
can be treated using the closing operator. Eroding the dilated image creates a smoother image that
also has better (less chaotic) skeleton properties. Again, simply dilating the image will expand
objects that are 'clean'. Performing the additional erosion with the same SE will reduce the
distortion caused to the clean objects while keeping the holes filled in. The structure element
should be designed according to the holes that need to be filled in.
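Closing can be sketched as the mirror image of opening. The sketch assumes list-of-rows binary images, the unit disc as SE, background outside the image, and objects that lie at least one SE radius away from the image border (otherwise the dilation would be clipped by the frame).

```python
def _offsets(se):
    """Offsets of the SE's 1-entries relative to its center."""
    cr, cc = len(se) // 2, len(se[0]) // 2
    return [(r - cr, c - cc) for r, row in enumerate(se) for c, v in enumerate(row) if v]


def _pixel(image, r, c):
    # Outside the image counts as background (assumption of this sketch).
    return image[r][c] if 0 <= r < len(image) and 0 <= c < len(image[0]) else 0


def erode(image, se):
    offs = _offsets(se)
    return [[int(all(_pixel(image, r + dr, c + dc) for dr, dc in offs))
             for c in range(len(image[0]))] for r in range(len(image))]


def dilate(image, se):
    offs = _offsets(se)
    return [[int(any(_pixel(image, r - dr, c - dc) for dr, dc in offs))
             for c in range(len(image[0]))] for r in range(len(image))]


def closing(image, se):
    """Dilation followed by erosion with the same SE."""
    return erode(dilate(image, se), se)


SE = [[0, 1, 0],
      [1, 1, 1],
      [0, 1, 0]]

# A 3x3 square with a one-pixel hole (pepper noise) at its center.
holed = [[0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0],
         [0, 0, 1, 1, 1, 0, 0],
         [0, 0, 1, 0, 1, 0, 0],
         [0, 0, 1, 1, 1, 0, 0],
         [0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0]]

closed = closing(holed, SE)
for row in closed:
    print(row)
```

The hole is filled and the outline of the square is unchanged; closing the result a second time has no further effect.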
The morphological closing presented is idempotent. It is also extensive and increasing. In
algebra, every operation which is increasing, extensive and idempotent is called a closing:
(2.16) $\varphi_B = \varepsilon_B \delta_B$
3.2.8 Hit and Miss
(2.17) HitAndMiss(Im,SE) = Im ⊛ SE
The Hit and Miss Transform can be used to find specific objects in an image. Constraining both
background and foreground pixels allows one to identify a specific shape. Foreground pixels are
usually denoted with a logical '1', background pixels with a logical '0' and pixels that can be
considered either as foreground or background as a don't care (written $\times$ below). The
transform is equivalent to finding the exact foreground object (erosion with a foreground
requirement) while matching the background objects (erosion of the background requirement on the
image background).
Examples:
A kernel for finding a corner:
(2.18) $\begin{pmatrix} \times & 0 & 0 \\ 1 & 1 & 0 \\ \times & 1 & \times \end{pmatrix}$
A kernel for finding pixels that are connected to at least one more pixel in a “4-connectivity”
lattice (using OR for the results of applying the kernel with 4 rotations of 90°):
(2.19) $\begin{pmatrix} \times & \times & \times \\ \times & 1 & \times \\ \times & 1 & \times \end{pmatrix}$
A kernel for locating an endpoint of a “4-connectivity” Skeleton (using OR for the results of
applying the kernel with 4 rotations of 90°):
(2.20) $\begin{pmatrix} \times & \times & \times \\ 0 & 1 & 0 \\ 0 & 1 & 0 \end{pmatrix}$
3.2.9 Thinning
Thinning is the operation of removing foreground pixels from the image. Usually thinning results
in some form of skeleton of the image. However, the process of thinning is different from that of
generating a skeleton. We would usually employ thinning in an iterative sequence where the
thinning mask differs at each step to produce a more refined result. We use an extended-type
structure element, like the one described for the Hit and Miss Transform, that contains zeros,
ones and don't care elements.
The logical relation between thinning and the Hit and Miss Transform is:
(2.21) Thinning(Im) = Im − HitAndMiss(Im,SE),
where
(2.22) Im − HitAndMiss(Im,SE) = Im ∩ (HitAndMiss(Im,SE))^C.
3.2.10 Thickening
Thickening is the dual of thinning and is used to grow foreground pixels in the image. It also uses
an extended-type structure element and is related to the Hit and Miss Transform through the
following equation:
(2.23) Thickening(Im) = Im ∪ HitAndMiss(Im,SE).
Using De Morgan's law we see that thickening an image with a structure element SE is equivalent,
up to complementing the result, to thinning the inverse image with the structure element whose
foreground and background entries are exchanged.
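The Hit and Miss Transform and its relation to thinning and thickening can be sketched as follows. The sketch assumes list-of-rows binary images, background outside the image, and kernels in which `None` stands for a don't care; the two kernels are illustrative choices of ours (a corner detector and a "grow to the right" pattern).

```python
def hit_and_miss(image, kernel):
    """Mark pixels where every 1 of the kernel meets foreground and every 0 meets background."""
    rows, cols = len(image), len(image[0])
    kr, kc = len(kernel) // 2, len(kernel[0]) // 2

    def pixel(r, c):
        # Outside the image counts as background (assumption of this sketch).
        return image[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            out[r][c] = int(all(
                want is None or pixel(r + i - kr, c + j - kc) == want
                for i, row in enumerate(kernel) for j, want in enumerate(row)))
    return out


def thinning(image, kernel):
    """Remove the hit-and-miss matches: Im intersected with the complement of the matches."""
    hits = hit_and_miss(image, kernel)
    return [[p & (1 - h) for p, h in zip(prow, hrow)] for prow, hrow in zip(image, hits)]


def thickening(image, kernel):
    """Add the hit-and-miss matches: Im united with the matches."""
    hits = hit_and_miss(image, kernel)
    return [[p | h for p, h in zip(prow, hrow)] for prow, hrow in zip(image, hits)]


# Illustrative kernels (None = don't care).
CORNER = [[None, 0, 0],
          [1, 1, 0],
          [None, 1, None]]
GROW_RIGHT = [[None, None, None],
              [1, 0, None],
              [None, None, None]]

square = [[0, 0, 0, 0, 0],
          [0, 1, 1, 1, 0],
          [0, 1, 1, 1, 0],
          [0, 1, 1, 1, 0],
          [0, 0, 0, 0, 0]]

corner_hits = hit_and_miss(square, CORNER)
thinned = thinning(square, CORNER)
thickened = thickening(square, GROW_RIGHT)
```

With these kernels only the top-right corner of the square is matched and removed by thinning, while thickening adds one column of pixels to the right of the square. Thinning kernels have a foreground center and thickening kernels a background center; otherwise the operation leaves the image unchanged.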
3.2.11 Pruning
Pruning removes branches that are shorter than a specific length. A branch is a group of pixels
connected according to the connectivity method that corresponds to the geometry of the problem,
in which each pixel except the end pixel is connected to two other pixels while the end pixel
is connected to only one pixel. Branches are common in applications involving skeletons,
in particular when the image is noisy. Closing was presented as a way to reduce the
complexity of a skeleton and therefore reduce the number of branches a skeleton has. Pruning is
typically performed after thinning or skeleton operations. It is important to choose a
branch length that on the one hand eliminates the spurious branches but on the other hand
preserves the original lines in question. Pruning can be regarded as a particular case of
thinning; however, due to its importance and unique description it is regarded as a separate
operator. Pruning is usually performed for a finite number of iterations, usually equal to the
length of the longest branch we would like to remove. Pruning iterated indefinitely converges
once the image contains only closed loops.
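A simple form of pruning is the repeated removal of end pixels. The sketch below assumes a one-pixel-wide binary skeleton stored as a list of rows, 4-connectivity, and that all end pixels found in one pass are removed together; it is one possible realization and not necessarily the one used for the figures.

```python
NEIGHBORS_4 = ((-1, 0), (1, 0), (0, -1), (0, 1))


def count_neighbors(image, r, c):
    """Number of 4-connected foreground neighbors of pixel (r, c)."""
    rows, cols = len(image), len(image[0])
    return sum(image[r + dr][c + dc]
               for dr, dc in NEIGHBORS_4
               if 0 <= r + dr < rows and 0 <= c + dc < cols)


def prune(image, iterations):
    """Remove end pixels (exactly one neighbor) for the given number of iterations."""
    current = [row[:] for row in image]
    for _ in range(iterations):
        endpoints = [(r, c)
                     for r in range(len(current)) for c in range(len(current[0]))
                     if current[r][c] and count_neighbors(current, r, c) == 1]
        if not endpoints:
            break  # only closed loops (or isolated pixels) remain
        for r, c in endpoints:
            current[r][c] = 0
    return current


# A closed loop with a two-pixel branch attached on its right side.
skeleton = [[0, 0, 0, 0, 0, 0, 0],
            [0, 1, 1, 1, 0, 0, 0],
            [0, 1, 0, 1, 1, 1, 0],
            [0, 1, 1, 1, 0, 0, 0],
            [0, 0, 0, 0, 0, 0, 0]]

loop_only = prune(skeleton, 2)
for row in loop_only:
    print(row)
```

Two iterations remove the two-pixel branch, and any further iterations leave the closed loop untouched. Open lines are shortened from both ends by the same procedure, which is why the number of iterations should not exceed the length of the longest branch to be removed.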
3.3 Distance Transform
The distance transform produces the distance of each foreground pixel from the nearest
boundary according to the selected connectivity. Ridges of the distance transform can be
considered as local maxima and represent the skeleton of the object in question.
It is possible to erode an image by a structure element that is a disk of a certain radius r simply
by removing the pixels that attain a value less than r. Consecutive erosions in this case would be
equivalent to alpha-cuts of the distance transform in intervals of size r.
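The link between erosion and thresholding of the distance transform can be checked on a small example. The sketch assumes the city-block (4-connected) distance rather than a Euclidean disk, treats everything outside the image as background, and assigns distance 1 to foreground pixels that touch the background; under this convention n erosions by the unit disc keep exactly the pixels whose distance exceeds n.

```python
from collections import deque


def city_block_distance(image):
    """Distance of each foreground pixel to the nearest background pixel (4-connected steps)."""
    rows, cols = len(image), len(image[0])
    # Pad with one ring of background so that border pixels get distance 1.
    padded = [[0] * (cols + 2)] + [[0] + list(row) + [0] for row in image] + [[0] * (cols + 2)]
    dist = [[0 if v == 0 else None for v in row] for row in padded]
    queue = deque((r, c) for r in range(rows + 2) for c in range(cols + 2)
                  if padded[r][c] == 0)
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows + 2 and 0 <= nc < cols + 2 and dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return [row[1:-1] for row in dist[1:-1]]


def erode_cross(image):
    """One erosion with the unit disc (4-connected cross), outside treated as background."""
    rows, cols = len(image), len(image[0])

    def pixel(r, c):
        return image[r][c] if 0 <= r < rows and 0 <= c < cols else 0

    return [[int(all(pixel(r + dr, c + dc)
                     for dr, dc in ((0, 0), (-1, 0), (1, 0), (0, -1), (0, 1))))
             for c in range(cols)] for r in range(rows)]


def threshold_above(dist, r):
    """Alpha-cut of the distance transform: keep pixels whose distance exceeds r."""
    return [[int(d > r) for d in row] for row in dist]


image = [[1] * 5 for _ in range(5)]
dist = city_block_distance(image)
for row in dist:
    print(row)
```

Here `threshold_above(dist, 1)` coincides with one erosion of the image and `threshold_above(dist, 2)` with two consecutive erosions.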
3.3.1 Continuous Case
Assuming the image f is an element of the space C(D) of real twice continuously differentiable
functions on a connected domain D with only isolated critical points (of zero
measure), we have:
(2.24) $T_f(A, B) = \inf_{\gamma} \int_{\gamma} \| \nabla f(\gamma(s)) \| \, ds.$
The infimum is over all paths (smooth curves of measure 1) inside the domain D with
(2.25) $\gamma(0) = A$ and $\gamma(1) = B$.
3.3.2 Discrete Case
In the discrete case the result depends on the connectivity chosen. Obviously, we have
more paths to check when the connectivity is higher. The distance between two points is the length
of the shortest path that lies entirely within the object. The process shown in
equation 2.28, which describes the watershed transform computation, assists in understanding the
distance function.
3.4 Skeleton
The skeleton is the locus of all centers of bitangent circles that fit entirely within the foreground
pixels. The Medial Axis Transform (MAT) is the gray-level image that represents the radii of these
bitangent circles. The skeleton provides a compact representation of a shape. Unlike the
skeleton, the MAT can also be used to exactly reconstruct the original shape.
Notable drawbacks of the skeleton/MAT are:
1. Highly sensitive to noise – Even small irregularities in the shape will cause large
distortions in the skeleton since each irregularity has to be included within a bitangent
circle. It is necessary to pre- or post-process the image (see the closing and pruning
operators for example) in order to obtain a less complex skeleton.
2. High computational complexity – typically computed using either a distance transform or
a constrained thinning.
3. The skeleton is not a one-to-one transform (the MAT is).
There have been several attempts to use the MAT and the skeleton in lossy or lossless image
compression algorithms; however, due to the aforementioned drawbacks they were not adopted
in any standard image compression algorithm.
Since connectivity rules change in discrete topology, a skeleton in the non-Euclidean (digital)
case does not preserve continuity if computed according to the mathematical definition and
therefore is not a homotopic transformation (i.e., not a continuous transformation). To solve that
problem we use the following transforms: HitAndMiss, Thinning, Thickening and the Distance
Transform.
In the following example we take a clean binary image of the printed acronym INRS and run the
skeleton algorithm on it. We then approximate the length of the branches to be pruned and perform
an iterative pruning algorithm to produce a cleaner skeleton.
Figure 3.5: Original image (top); applying a skeleton (middle); pruning the skeleton (bottom).
Since in general the stroke width of printed characters is uniform, we can use the skeleton to
reconstruct the original written text. In this case the difference between the MAT and the skeleton
is small. Text is usually compressed using the LZW algorithm; handwritten text, however, can be
compressed using this skeleton-based method. We conclude the example by assuming Salt
and Pepper (S&P) noise added to the original text image. S&P noise is often introduced in the
process of acquiring data through scanning devices. We see that the skeleton is highly sensitive
to noise and therefore it is essential to perform cleaning before calculating the skeleton. Another
conclusion is that in the case of more severe noise it could be better to adopt a different
segmentation strategy, such as the Watershed Transform, that will yield a cleaner image that can
later be processed using the skeleton.
Figure 3.6: Example of cleaning and extracting information from an image using morphological
operators: (a) Original image corrupted by Salt and Pepper noise. (b) A very noisy result after
applying the skeleton operator. (c) Result of cleaning (a) by employing the closing and opening
operators. (d) Result of applying the skeleton operator to (c). A much clearer skeleton is obtained.
Cleaning the S&P noise is done by a closing with a disk of unit radius (identical to the
4-connectivity kernel) and an opening with a disk of an 11-pixel radius. The result can be pruned
to obtain a more compact skeleton.
Skeleton by Influence Zones (SKIZ)
Skeleton by Influence Zones, also known as a Generalized Voronoi Diagram, consists of all points
which are equidistant, in the geodesic distance sense, to at least two nearest connected
components. The geodesic distance between a and b within A is the minimum (in the continuous
case, infimum) path length between the two points $\{a, b\} \subset A$ among all paths in A. In the
case of a digital grid it may be that no such point exists due to quantization effects.
3.5 Watershed Transform
Watershed – A region or area bounded peripherally by a divide and draining ultimately to a
particular watercourse or body of water. [25]
The Watershed Transform is a morphology-based algorithm to segment images. Segmentation
is not a mathematically defined term and therefore it is hard to compare one segmentation
algorithm to another. The watershed transform is common in numerous segmentation
applications, notably in medical image processing; for example in MRI (Magnetic Resonance
Imaging) it is used to identify different tissues and aid in detecting tumors. The strength of the
watershed transform can be attributed to its capability of segmenting objects
even in the case of partial occlusions [26, 27]. Many variants of the watershed transform exist in
the literature. In the following, we present a theoretical description of the watershed developed for
the continuous case. The notations have been kept as in [28] to facilitate the understanding of the
algorithm.
Assuming again that the image f is an element of the space C(D) of real twice continuously
differentiable functions on a connected domain D with only isolated critical points
(of zero measure), we have the following definition:
Let $f \in C(D)$ have minima $\{m_k\}_{k \in I}$ for some index set I. The catchment basin
$CB(m_i)$ of a minimum $m_i$ is defined as the set of points $x \in D$ which are topographically
closer to $m_i$ than to any other regional minimum $m_j$ in D:
(2.26) $CB(m_i) = \{ x \in D \mid \forall j \in I \setminus \{i\} : f(m_i) + T_f(x, m_i) < f(m_j) + T_f(x, m_j) \}$.
The watershed of f is the set of points which do not belong to any catchment basin:
(2.27) $\mathrm{Wshed}(f) = D \cap \left( \bigcup_{i \in I} CB(m_i) \right)^c$.
Let W be some label, $W \notin I$. The watershed transform of f is a mapping
$\lambda : D \to I \cup \{W\}$, such that $\lambda(p) = i$ if $p \in CB(m_i)$ and
$\lambda(p) = W$ if $p \in \mathrm{Wshed}(f)$.
As a result of the watershed transform we obtain a labeled image. Each catchment basin is labeled
with a different number and a special label W is assigned to all points of the watershed of f.
Computing the watershed transform on a digital grid can be a tedious task. In particular, large
parts of the image can be flat plateaus. The distance between pixels in a plateau is zero and an
additional ordering is therefore required. One possible way to overcome this problem is by a
simulated immersion. A flooding algorithm is used to help associate different pixels with
different catchment basins. By performing a threshold and a computation at each level we are able
to associate, at a relatively low computational cost, each pixel with its catchment basin. The
thresholds are performed on the gray levels of the image, an operation similar to performing
different alpha-cuts on a fuzzy membership function.
In the following example, we first find the basins, which in this case are the two lowest points
denoted by A and B. Then at each stage we threshold the image and, according to a fully
connected scheme, we associate each plateau point with its nearest basin. If a plateau point
has the same distance to points associated with both basins it is considered a (temporary)
watershed and is denoted by W. The process stops when the whole image is 'flooded', that is when we
threshold with the maximal value in the image. The watershed pixels are the output of the
watershed transform. We see that a temporary watershed value can become associated with a
basin when a closer basin-associated point is created in the iteration.
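(2.28)
$\begin{pmatrix} 0 & 1 & 0 \\ 1 & 1 & 3 \\ 2 & 2 & 3 \end{pmatrix}
\xrightarrow{h=0}
\begin{pmatrix} B & 1 & A \\ 1 & 1 & 3 \\ 2 & 2 & 3 \end{pmatrix}
\xrightarrow{h=1}
\begin{pmatrix} B & W & A \\ B & W & 3 \\ 2 & 2 & 3 \end{pmatrix}
\xrightarrow{h=2}
\begin{pmatrix} B & W & A \\ B & B & 3 \\ B & B & 3 \end{pmatrix}
\xrightarrow{h=3}
\begin{pmatrix} B & W & A \\ B & B & W \\ B & B & B \end{pmatrix}$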
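The flooding example can be reproduced with a short sketch. It rests on assumptions of ours that the text does not state: the two basins are supplied as explicit seeds, distances are geodesic within the current threshold with a cost of 1 for straight steps and √2 for diagonal steps, and at each level distances are measured only to the basin pixels known from the previous levels. The function names are illustrative.

```python
import heapq
import math


def geodesic_distances(mask, sources):
    """Dijkstra over an 8-connected grid restricted to `mask`.

    Straight steps cost 1 and diagonal steps cost sqrt(2) (assumed metric).
    """
    rows, cols = len(mask), len(mask[0])
    dist = {p: 0.0 for p in sources}
    heap = [(0.0, p) for p in sources]
    heapq.heapify(heap)
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[(r, c)]:
            continue
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr or dc) and 0 <= nr < rows and 0 <= nc < cols and mask[nr][nc]:
                    nd = d + math.hypot(dr, dc)
                    if nd < dist.get((nr, nc), math.inf):
                        dist[(nr, nc)] = nd
                        heapq.heappush(heap, (nd, (nr, nc)))
    return dist


def flood(image, seeds, watershed="W"):
    """Simulated immersion from explicit seeds; returns the labeling after each gray level."""
    rows, cols = len(image), len(image[0])
    labels = [[None] * cols for _ in range(rows)]
    for (r, c), name in seeds.items():
        labels[r][c] = name
    basins = sorted(set(seeds.values()))
    stages = []
    for h in sorted({v for row in image for v in row}):
        mask = [[image[r][c] <= h for c in range(cols)] for r in range(rows)]
        # Distances are measured to the basin pixels known before this level.
        dist = {}
        for name in basins:
            src = [(r, c) for r in range(rows) for c in range(cols) if labels[r][c] == name]
            dist[name] = geodesic_distances(mask, src)
        for r in range(rows):
            for c in range(cols):
                if not mask[r][c] or labels[r][c] in basins:
                    continue
                d = sorted((dist[name].get((r, c), math.inf), name) for name in basins)
                if d[0][0] == math.inf:
                    continue  # not reachable from any basin yet
                tie = len(d) > 1 and math.isclose(d[0][0], d[1][0])
                labels[r][c] = watershed if tie else d[0][1]
        stages.append([row[:] for row in labels])
    return stages


image = [[0, 1, 0],
         [1, 1, 3],
         [2, 2, 3]]
seeds = {(0, 0): "B", (0, 2): "A"}

stages = flood(image, seeds)
for row in stages[-1]:
    print(" ".join(row))
```

Under these assumptions the center pixel is a temporary watershed after the level h = 1 and is reassigned to basin B at h = 2, once the pixel to its left has joined B and lies closer than the diagonal neighbor belonging to A.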
3.6 The relationship between Fuzzy Logic and Mathematical
Morphology
A close relationship exists between Fuzzy Logic and Mathematical Morphology since both
theories are extensions of set theory and rely on similar mathematical axioms [24, 29].
1. Both theories rely on a basic union operation and its dual intersection. These operators
are in general nonlinear and therefore have some advantages in terms of robustness to noise and
outlier measurements.
In Fuzzy Logic we have as the logical conjunction (logical AND) a T-norm and as the logical
disjunction (logical OR) an S-norm, also known as a T-conorm. A common choice for basic norms is
the dual pair of min and max functions.
In Mathematical Morphology the dual pair Dilation and Erosion play the part of Union and
Intersection. The flexibility in selection is through a structuring element (SE) that is in general
the same for both operations.
2. Composing the basic operations yields more complex operators
In Fuzzy Logic a conclusion is usually deduced by taking the T-conorm of all T-norms.
In Mathematical Morphology taking the dilation of the erosion is called an opening operator.
3. Gray-level extensions to Mathematical Morphology
Many Morphological operators have been extended to deal with gray-level images. A common
technique uses Fuzzy Logic as its basis and assumes that each gray level represents a
membership value. Alpha-cuts are performed on the gray-level image, reducing it to binary
threshold images that correspond with the original Mathematical Morphology framework. Another
technique uses basic fuzzy logic operators to construct the dilation and erosion building blocks.
4. Both theories are based on human intuition and expertise in order to construct the system
In Fuzzy Logic a set of rules as well as basic norms and aggregation methods are specifically
chosen based on the task in question. In Mathematical Morphology, the structuring element can
be easily modified to capture the characteristics of the objects in question. The structure element
can be regarded as a measure of belief the system designer has in a certain geometric
specification for the objects to be investigated. Generalizing to the gray-level case, we obtain
degrees of uncertainty for values in the range of [0, 1]. The flexibility in constructing the basic
operators gives both theories their power.
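The correspondence between the two theories can be made concrete with a small sketch. It assumes a gray-level image whose values are membership degrees in [0, 1], a flat unit-disc SE, the min/max pair as T-norm and T-conorm, and a value of 0 outside the image; the function names are illustrative.

```python
CROSS = [(-1, 0), (0, -1), (0, 0), (0, 1), (1, 0)]  # flat unit disc


def _value(image, r, c):
    # Outside the image the membership is taken to be 0 (assumption of this sketch).
    return image[r][c] if 0 <= r < len(image) and 0 <= c < len(image[0]) else 0


def gray_erode(image, offsets=CROSS):
    """Flat gray-level erosion: the min T-norm over the SE neighborhood."""
    return [[min(_value(image, r + dr, c + dc) for dr, dc in offsets)
             for c in range(len(image[0]))] for r in range(len(image))]


def gray_dilate(image, offsets=CROSS):
    """Flat gray-level dilation: the max T-conorm over the SE neighborhood."""
    return [[max(_value(image, r - dr, c - dc) for dr, dc in offsets)
             for c in range(len(image[0]))] for r in range(len(image))]


def alpha_cut(image, alpha):
    """Binary image of the pixels whose membership is at least alpha."""
    return [[int(v >= alpha) for v in row] for row in image]


membership = [[0.1, 0.2, 0.1, 0.0],
              [0.6, 0.9, 0.7, 0.2],
              [0.8, 1.0, 0.9, 0.3],
              [0.7, 0.9, 0.8, 0.1]]

for alpha in (0.25, 0.5, 0.75):
    # Cutting after the gray-level operator equals applying the binary operator to the cut.
    assert alpha_cut(gray_erode(membership), alpha) == gray_erode(alpha_cut(membership, alpha))
    assert alpha_cut(gray_dilate(membership), alpha) == gray_dilate(alpha_cut(membership, alpha))
```

Because thresholding commutes with min and max, each alpha-cut of the gray-level result is the binary erosion or dilation of the corresponding alpha-cut of the input, which is the reduction to the binary framework described in item 3.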
3.7 Summary
This chapter gives a description of some of the more common morphological image processing
operators. The relationships between the different operators are presented and the justifications
for using each operator are given. In particular the development of more complex compound
morphological operators from basic operators is demonstrated. Specific examples demonstrate
the usefulness of mathematical morphology in solving everyday image processing tasks.
Mathematical Morphology, like Fuzzy Logic, is a nonlinear theory based on set theory. A note on
the association between these two theories is presented to show that different approaches to the
same problem may actually rely on the same methodology but with a different formulation.
The history of Mathematical Morphology helps to understand the ways morphological operators
are used and gives motivation to use mathematical morphology in different contexts. Indeed in
this work we choose to use mathematical morphology due to its high performance both in terms
of runtime complexity and in results. Using morphological operators extended to grayscale
images allows us to extract relevant information from our input speech spectrogram images.
Since the speech spectrograms are considered very noisy images with specific geometrical
objects that convey specific information, we tend to favor using mathematical morphology when
constructing a recognition algorithm.
Chapter 4
Fuzzy Logic
"So far as the laws of mathematics refer to reality, they are not certain. And so far as
they are certain, they do not refer to reality." - Albert Einstein
We start the chapter with a short historical overview and by explaining the importance of Fuzzy
Logic and Evidence Theory to an expert-system-based speech recognition system. We continue
with theoretical background that explains fundamental concepts in Fuzzy Logic such as fuzzy
variables, alpha-cuts, fuzzification and defuzzification, fuzzy union, fuzzy intersection and fuzzy
complement. We show how different sources of information can be combined using the
Dempster-Shafer rule of combining evidence and afterwards demonstrate the capabilities of the
Matlab Fuzzy Logic toolbox. Finally, we conclude the chapter with an overview of different terms
that describe situations that are not well defined, namely, vagueness and ambiguity.
4.1 Introduction
Fuzzy Logic can be traced back to the teachings of Buddha in about 500 BC, which held that
everything contains something of its opposite. In contrast to East-Asian philosophy, about 200
years later, the Greek philosopher Aristotle developed the well-known binary logic. Western
European philosophy adopted binary logic in daily life, to distinguish between good and bad. In
1964 Dr. Lotfi Zadeh, a professor at the University of California, Berkeley, developed a theory that
would make it easier to translate problems easily understood by humans to a computer that relies on
binary logic to make decisions [30]. Fuzzy Logic may be viewed by some as an extension of binary
logic, perhaps as a multi-valued logic. However, Fuzzy Logic is derived from Fuzzy Sets and has
much stronger practical advantages than merely extending the possible output levels of a logical
statement. Consider for example Eubulides' pile of sand paradox. How many sand grains does it
take to obtain a pile of sand? This problem is defined in vague terms; therefore even a
multiple-level solution to it is inadequate. We need to have more understanding of the linguistic
terms of the paradox and use some fuzzy technique to solve the paradox. We will see in the
following that a membership function devised by an expert can give a possible solution to the
paradox.
Today Fuzzy Logic applications are employed in numerous commercial devices; some
examples are: rice cookers, laundry machines, air conditioners, cameras, automatic braking
systems and many more. An example of a hardware implementation of a controller by Toshiba is
the EJ1000FN [31]. It can control up to 8 elevators and helps minimize waiting times. A neural
network based forecasting model helps predict variations that occur over large time intervals
while a fuzzy logic set of rules handles uncertainty and ambiguity that occurs over short time
intervals. The processor can be programmed to operate under different requirements and the
result is a shorter wait time under the constraints and limitations imposed by the building
manager.
4.2 Fuzzy Logic vs. Boolean Logic
Boolean logic, as well as crisp set theory and crisp probability theory, is inadequate to deal with
the imprecision, uncertainty and complexity of the real world. The limitation of crisp (Boolean)
theories is that they leave no room for ambiguous or unknown cases. If we have prior
information about a set of events, we would like to have a model that allows us to use that
knowledge. On the other hand, if we do not have information about other events, modeling them
as equally likely using the uniform distribution is not always the right choice. Fuzzy logic was
developed to deal with these types of situations and to allow us to easily write and later
implement logical rules that are the result of an expert's knowledge.
For example assume we would like to know if a person is tall. We might decide on a
specific fixed threshold of 1.70 meters and regard all candidates with a height greater or equal to
1.70 meters to be tall. The threshold can be determined heuristically or through median, average
or other statistical characteristics of a certain population. Evidently the problem with this method
is that two persons that differ by 1 cm will be categorized into two different groups. In order to
overcome the crisp boundaries of the set “tall” and its complementary set “not tall”, we construct a
fuzzy membership function that assigns values ranging from zero to one that represent the
tallness of a person. A person with height over 1.90 meters will be considered tall in most cases
(the population of basketball players is a good counterexample) while a person shorter than 1.50
meters will not be considered tall. We will assign values to the extreme cases, '1' to represent
that the person is clearly in the set of tall people and '0' to represent that the person is clearly
not a member of the set of tall people. All other heights will be assigned a mid-range value. In
fact the
crisp (Boolean) variable “tall” has been converted to a fuzzy variable that can take values on the
interval [0, 1].
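A membership function for “tall” can be sketched as follows. The text only fixes the two extremes (0 below 1.50 meters and 1 above 1.90 meters); the linear ramp between them and the function name `tall` are assumptions of this sketch, and an expert could choose any other monotone shape.

```python
def tall(height_m, low=1.50, high=1.90):
    """Degree of membership in the fuzzy set 'tall' for a height in meters.

    Heights up to `low` get 0, heights from `high` upward get 1, and heights in
    between are interpolated linearly (the ramp shape is an assumption).
    """
    if height_m <= low:
        return 0.0
    if height_m >= high:
        return 1.0
    return (height_m - low) / (high - low)


for height in (1.45, 1.69, 1.70, 1.80, 1.95):
    print(f"{height:.2f} m -> tall to degree {tall(height):.3f}")
```

Two persons who differ by 1 cm around 1.70 meters now receive nearly identical degrees of tallness instead of falling on opposite sides of a crisp threshold.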
A Fuzzy variable can be declared for the following categories:
• predicates (expensive, old, rare, dangerous etc.)
• quantifiers (many, few, almost all, usually, and the like)
• truth values (quite true, more or less true, mostly false etc.)
The previous example required converting a single crisp variable into a fuzzy variable.
Converting a crisp problem to a fuzzy one is called fuzzification. We now examine a person's
weight to determine if the person is slim. Instead of a single threshold we now have two
thresholds, since a person may weigh either more or less than the weight range
categorized as “slim”. More interestingly, we have a dependency between a person's height and
weight in the sense that we would expect a slim and tall person to weigh more than a slim but
shorter one. We can construct the membership functions of the “weight” variable as a function of
the fuzzy variable “tall” and create a Fuzzy set of type 2. We can also quantize the height and
regard each subsection of it to have a different membership function. For example, we can
quantize the height into the following fuzzy variables: {Below Average, Average, Tall, Very Tall}.
Another important concept in Fuzzy Logic is that of linguistic hedges. For example we can
apply the linguistic hedge 'very' to the fuzzy set 'tall' to create the fuzzy set 'very tall'. Other
examples of hedges are: 'slightly', 'extremely', 'rather', 'more or less' etc. We can afterwards
interpret the hedge in some predefined way. We can choose that the hedge 'very' would
correspond to squaring the membership function. Since membership values are between zero
and unity, squaring the values will decrease their membership in the set in a nonlinear manner,
so that fewer elements qualify as 'very tall' than as 'tall'.
Depending on the problem we have the flexibility of choosing different functions to different
hedges as long as we remain consistent in our definitions.
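As a minimal sketch of how a hedge can be interpreted, the following Python fragment applies hedges to a membership grade. Only the squaring interpretation of ‘very’ comes from the discussion above; the piecewise-linear shape of ‘tall’, the thresholds of 1.60 m and 1.90 m, the function names and the square-root interpretation of ‘more or less’ are assumptions made for illustration.

```python
def tall(height_m, lower=1.60, upper=1.90):
    """Membership grade in the fuzzy set 'tall'.

    The thresholds and the linear transition are assumed values,
    chosen only to illustrate the hedges below.
    """
    if height_m <= lower:
        return 0.0
    if height_m >= upper:
        return 1.0
    return (height_m - lower) / (upper - lower)


def very(mu):
    """Hedge 'very': square the membership grade (concentration)."""
    return mu ** 2


def more_or_less(mu):
    """Hedge 'more or less': square root of the grade (an assumed choice)."""
    return mu ** 0.5


if __name__ == "__main__":
    for h in (1.65, 1.75, 1.85):
        mu = tall(h)
        print(f"{h:.2f} m: tall={mu:.2f}, "
              f"very tall={very(mu):.2f}, "
              f"more or less tall={more_or_less(mu):.2f}")
```

For any grade strictly between zero and unity the hedged values are ordered as very(mu) < mu < more_or_less(mu), which is the nonlinear reduction (or increase) of membership that a consistent hedge definition is expected to produce.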
Another example of the importance of Fuzzy Logic and Fuzzy set theory can be seen through
the “student dilemma”. The student dilemma’s first implication is “The more I study, the more I
know”. The second implication is “The more I know, the more I forget”. Thus, using crisp deduction,
the student infers that “The more I study, the more I forget”, which may lead to the conclusion
that learning decreases knowledge. However, using Fuzzy Logic operators to link the different
assumptions, we would conclude that the part of the knowledge forgotten is negligible in
comparison to the knowledge gained. Instead of a crisp threshold that, once triggered, would
produce a certain result, we would obtain a value for the knowledge forgotten that should be low
in comparison to the value for the knowledge learnt.
There are many methods by which a membership function can be constructed. An expert
may assign values based on previous experience with the problem. Statistical methods may help
in giving a lower bound to a membership function. A neural network can be trained to tune
membership functions in a way that would produce a desired result. In general, we will use expert
knowledge when it is available and when it is feasible to construct each membership function in
the problem at hand. We can always further tune a predetermined membership function using a
neural network. Statistical methods can be justified by evidence theory methods that require that
the membership function should be an envelope function of the plausibility function, which is an
upper bound on all probability density functions that could be associated with the specific
problem. In the same manner a belief function serves as the lower bound for all possible
probability density functions that can be assigned to the problem. We denote by $\pi$ the plausibility
function and by $\mu$ the membership function and obtain the following relationship [32]:
(4.1) $\forall u \in U:\ \pi_X(u) \le \mu_A(u)$
In the following we present prototypes of different membership functions. The values of the vector
P stipulate the transition band(s) just as in DSP filter design.
Figure 4.1: Examples of common parametric membership functions: (a) Sigmoid membership
function. (b) Pi membership function. (c) Z membership function. (d) Triangular membership
function.
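The four prototypes of Figure 4.1 can be sketched in Python as follows. The exact parametric forms used to draw the figure are not stated, so the forms below are assumptions: a logistic sigmoid with slope a and centre c, a quadratic-spline Z function, a Pi function built from a rising and a falling spline, and a triangle. In each case the vector P holds the edges of the transition band(s).

```python
import math


def sigmoid_mf(x, a, c):
    """Sigmoid membership: slope a, crossover (grade 0.5) at x = c."""
    return 1.0 / (1.0 + math.exp(-a * (x - c)))


def z_mf(x, p):
    """Z membership: 1 below p[0], 0 above p[1], quadratic spline between."""
    a, b = p
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    mid = (a + b) / 2.0
    if x <= mid:
        return 1.0 - 2.0 * ((x - a) / (b - a)) ** 2
    return 2.0 * ((x - b) / (b - a)) ** 2


def pi_mf(x, p):
    """Pi membership: rises over [p[0], p[1]], falls over [p[2], p[3]]."""
    a, b, c, d = p
    rising = 1.0 - z_mf(x, (a, b))   # mirrored Z function (S shape)
    falling = z_mf(x, (c, d))
    return min(rising, falling)


def triangular_mf(x, p):
    """Triangular membership with feet p[0], p[2] and peak p[1]."""
    a, b, c = p
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)
```

A membership function for a formant location could, for example, be a Pi function whose plateau covers the expected formant frequency and whose transition bands cover the tolerated deviation.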
4.3 Alpha cuts
Alpha cuts are simply threshold levels that convert a fuzzy set into a crisp set. The process of
converting a fuzzy set to a crisp one is called defuzzification. We have already performed alpha
cuts in our previous example when we quantized the “height” fuzzy variable to different
subsections. An alpha cut is considered to be a strong alpha cut if the inequality of the threshold
is strict. For example x>.25 is a strong alpha cut while x≥.68 is a regular alpha cut. We present a
formal definition:
Regular alpha cut:
(4.2) $A_{\alpha} = \{x \in X \mid \mu_A(x) \ge \alpha\}$
Strong alpha cut:
(4.3) $A_{\alpha+} = \{x \in X \mid \mu_A(x) > \alpha\}$
Alpha cuts are mostly used to extract information from a membership function and are rarely used
to defuzzify a fuzzy set. Usually several alpha cuts are taken, whereby decreasing the
parameter alpha adds more elements to the generated crisp set. Hence a family of nested subsets is
created with different alpha parameters. The following method introduces a more common way to
defuzzify a set.
4.4 DeFuzzification
In order to obtain a result that can be used to make a decision it is necessary to defuzzify the
fuzzy set. There are many ways to perform defuzzification. One very common way is called the
Center of Gravity (COG).
The COG is defined as:
(4.4) $u' = \dfrac{\int_{u} u\, \mu_A(u)\, du}{\int_{u} \mu_A(u)\, du}$.
The concept of COG is simple: if we have both the horizontal and vertical locations of the
COG and if we regard the area under the membership function to have a mass that is evenly
distributed, then by placing a pin at the location of the COG we can (in terms of classical physics,
that is, disregarding any inherent uncertainties in measurements) balance the entire mass. The
COG is therefore an equilibrium point in terms of mass. Equation 4.4 calculates the horizontal
axis value of the center of gravity; in order to find the exact center of gravity we would need to
perform another calculation for the vertical axis value.
Replacing the membership function (MF) with a normalized one times a constant K, we
obtain the mean according to the normalized membership function. We have made an
assumption that the normalized MF can be regarded as a probability density function; however,
we keep in mind that the MF is not directly generated from a probability model since it contains
vagueness and is a result of non-additive measures. Since we obtained a mean value, it is the
solution of the following optimization problem:
(4.5) $u' = \arg\min_{u'} \int_{u} (u - u')^2\, \mu_A(u)\, du$.
Taking the derivative of the above expression w.r.t. u’ and setting it to zero, we obtain the value that is the best
in a mean-square sense and is given by the COG. In practical implementations we would
aggregate different membership functions according to different weights and rules, and obtain as
a result another membership function. Using the COG would allow us to use a single output value
instead of a collection of values with different membership grades.
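The alpha cuts of section 4.3 and the COG of equation 4.4 can be sketched for a membership function that is sampled on a uniform grid. This is a minimal sketch under that assumption: the integrals of equation 4.4 are replaced by sums, and the function names and the sample values are illustrative only.

```python
def alpha_cut(universe, mu, alpha, strong=False):
    """Crisp set of elements whose grade is >= alpha (or > alpha if strong)."""
    if strong:
        return [x for x, m in zip(universe, mu) if m > alpha]
    return [x for x, m in zip(universe, mu) if m >= alpha]


def centre_of_gravity(universe, mu):
    """Discrete approximation of equation 4.4 on a uniformly sampled universe."""
    total = sum(mu)
    if total == 0:
        raise ValueError("membership function is zero everywhere")
    return sum(x * m for x, m in zip(universe, mu)) / total


if __name__ == "__main__":
    universe = [0, 1, 2, 3, 4]
    mu = [0.0, 0.5, 1.0, 0.5, 0.0]          # a sampled triangular function
    print(alpha_cut(universe, mu, 0.5))       # regular cut keeps grade 0.5
    print(alpha_cut(universe, mu, 0.5, True)) # strong cut drops grade 0.5
    print(centre_of_gravity(universe, mu))
```

Lowering alpha can only add elements to the cut, which is the nesting property mentioned in section 4.3, while the COG collapses the whole function into one crisp output value.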
4.5 Fuzzy Union
In order to combine two fuzzy values in a constructive manner we need to define a union
operator. The union operator must satisfy four axioms in order for it to be logically consistent and
serve as a building block of fuzzy logic. Two additional axioms are satisfied only by the default
union and intersection, that is, by the max and min operators, which reduce to the logical OR and
AND operators on crisp values. Hence, u5 and u6 show the
uniqueness of these default operators. We note the similarity between the union and the
morphological dilation operator.
Axioms for the union operator:
Axiom u1: $u(0,0) = 0$, $u(0,1) = u(1,0) = u(1,1) = 1$ [conforms to crisp boundaries].
Axiom u2: $u(a,b) = u(b,a)$ [commutative].
Axiom u3: If $a \le a'$ and $b \le b'$ then $u(a,b) \le u(a',b')$ [monotonic].
Axiom u4: $u(u(a,b),c) = u(a,u(b,c))$ [associative].
Additional axioms satisfied only by the logical operators OR and AND:
Axiom u5: u is a continuous function.
Axiom u6: $u(a,a) = a$ [u is idempotent].
The dual operator of the fuzzy union is the fuzzy intersection. The intersection complies with all
axioms except u1. Changing axiom u1 to yield a logical value ‘1’ only in case both inputs have a
logical value of ‘1’ and to yield a logical value ‘0’ in all other cases would give the required
boundary axiom for the intersection operator. We obtain six axioms, i1 to i6, for the fuzzy
intersection.
Axioms for the fuzzy complement operator:
Axioms c1, c2 are the axiomatic skeleton for fuzzy complements.
Axiom c1: $c(0) = 1$ and $c(1) = 0$.
Axiom c2: c is monotonic nonincreasing, that is, $\forall a, b \in [0,1]$: if $a < b$ then $c(a) \ge c(b)$.
Axiom c3: c is continuous.
Axiom c4: c is involutive, that is, $c(c(a)) = a$ (and therefore necessarily continuous).
Examples of different types of complement functions that satisfy all 4 axioms presented above:
Sugeno class of fuzzy complement:
(4.6) $c_{\lambda}(a) = \dfrac{1 - a}{1 + \lambda a}$, where $\lambda \in (-1, \infty)$
Figure 4.2: Sugeno fuzzy complement for different lambda parameters. As λ increases, the
complement curve turns from concave (−1<λ<0) to convex (λ>0), as can be seen by taking the
second derivative of equation (4.6).
Yager class of fuzzy complements:
(4.7) $c_{w}(a) = (1 - a^{w})^{1/w}$, where $w \in (0, \infty)$
Figure 4.3: Yager fuzzy complement for different parametric values. For w=1 we obtain the
traditional complementary function c(a)=1−a. For 0<w<1 we obtain a convex function and for w>1 we
obtain a concave function.
A fuzzy complement has at most one equilibrium, that is, at most one value a for which c(a)=a.
Yager class of fuzzy unions:
(4.8) $u_{w}(a,b) = \min\left[1,\ (a^{w} + b^{w})^{1/w}\right]$, where $w \in (0, \infty)$
The Yager class of fuzzy intersection (Yager Intersection) can be obtained by inserting the Yager
union and Yager complement into De Morgan’s law:
(4.9) $i(a,b) = c\big(u(c(a), c(b))\big)$.
In the latter equation the union, intersection and complement correspond to the same parameter
value w.
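The parametric complement and union classes can be checked numerically with a short Python sketch. The function names are illustrative assumptions; the intersection is computed by applying De Morgan's law literally (complement of the union of the complements) with one common parameter w, rather than through a closed-form expression.

```python
def sugeno_complement(a, lam):
    """Sugeno class complement, valid for lam > -1."""
    return (1.0 - a) / (1.0 + lam * a)


def yager_complement(a, w):
    """Yager class complement, valid for w > 0."""
    return (1.0 - a ** w) ** (1.0 / w)


def yager_union(a, b, w):
    """Yager class union, valid for w > 0."""
    return min(1.0, (a ** w + b ** w) ** (1.0 / w))


def intersection_by_de_morgan(a, b, w):
    """Intersection obtained from the Yager union and complement."""
    inner = yager_union(yager_complement(a, w), yager_complement(b, w), w)
    return yager_complement(inner, w)


if __name__ == "__main__":
    a, b, w = 0.7, 0.6, 2.0
    print("complement:", yager_complement(a, w))
    print("union:", yager_union(a, b, w))
    print("intersection:", intersection_by_de_morgan(a, b, w))
    # Involution (axiom c4): applying the complement twice returns a.
    print("c(c(a)):", yager_complement(yager_complement(a, w), w))
```

On crisp inputs the functions reproduce the boundary axioms, and both complement classes return the original value when applied twice, in agreement with axiom c4.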
4.6 General Aggregation Operations
Aggregation operations on fuzzy sets are operations in which several fuzzy sets are combined to
produce a single fuzzy set. Aggregation operations must satisfy two axioms to be logically
consistent. The two additional axioms are common to most aggregation operations used.
Axiom h1: h(0,0,…,0)=0 , h(1,1,…,1)=1 (boundary conditions)
Axiom h2: For any pair $(a_i \mid i \in N_n)$ and $(b_i \mid i \in N_n)$, where $a_i \in [0,1]$ and $b_i \in [0,1]$, if
$a_i \ge b_i$ for all $i \in N_n$, then $h(a_i \mid i \in N_n) \ge h(b_i \mid i \in N_n)$;
that is, h is monotonic nondecreasing in all its arguments.
Additional (optional) axioms:
Axiom h3: h is a continuous function.
Axiom h4: h is a symmetric function in all its arguments, that is, $h(a_i \mid i \in N_n) = h(a_{p(i)} \mid i \in N_n)$
for any permutation p on $N_n$.
Axiom h4 assumes that ‘all fuzzy sets are created equal’. If one set is more
important than another, this axiom would not be satisfied when devising an aggregation operation.
Fuzzy unions and intersections qualify as aggregation operations on fuzzy sets, since the
associative axioms (u4/i4) extend them to any number of arguments.
One class of averaging operations that covers the entire interval between the min and max
operations consists of generalized means. These are defined by the formula:
(4.10) $h_{\alpha}(a_1, a_2, \ldots, a_n) = \left(\dfrac{a_1^{\alpha} + a_2^{\alpha} + \cdots + a_n^{\alpha}}{n}\right)^{1/\alpha}$,
where $\alpha \in \Re$ ($\alpha \ne 0$) is a parameter by
which different means are distinguished:
α→−∞ min
α=−1 Harmonic mean
α→0 Geometric mean
α=1 Arithmetic mean
α→∞ max
Weighted Generalized Means:
(4.11) $h_{\alpha}(a_1, a_2, \ldots, a_n;\ w_1, w_2, \ldots, w_n) = \left(\sum_{i=1}^{n} w_i a_i^{\alpha}\right)^{1/\alpha}$
where $\sum_{i=1}^{n} w_i = 1$ and $w_i \ge 0$ for $i = 1, \ldots, n$.
The weighted generalized mean makes it possible to give more importance to some inputs and less importance
to others. Extensions of this aggregation method would include dynamically changing the weights
to fit a non-stationary or quasi-stationary environment.
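The behaviour of the generalized means for different values of the parameter can be illustrated with a short Python sketch. The function name is an assumption, membership grades are assumed to be strictly positive (negative exponents and the logarithm are undefined at zero), and the case α = 0 is handled through its limit, the weighted geometric mean.

```python
import math


def generalized_mean(values, alpha, weights=None):
    """Weighted generalized mean of equation 4.11.

    With weights=None all weights equal 1/n, which gives equation 4.10.
    Values are assumed to lie in (0, 1].
    """
    n = len(values)
    if weights is None:
        weights = [1.0 / n] * n
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to one")
    if alpha == 0:
        # Limit alpha -> 0: weighted geometric mean.
        return math.exp(sum(w * math.log(a) for w, a in zip(weights, values)))
    return sum(w * a ** alpha for w, a in zip(weights, values)) ** (1.0 / alpha)


if __name__ == "__main__":
    grades = [0.2, 0.8]
    for alpha in (-50, -1, 0, 1, 50):
        print(alpha, round(generalized_mean(grades, alpha), 4))
    print("weighted:", generalized_mean(grades, 1, weights=[0.25, 0.75]))
```

As α moves from large negative to large positive values, the result moves from near the minimum of the grades to near their maximum, passing through the harmonic, geometric and arithmetic means.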
4.7 Dempster-Shafer (DS) Theory
4.7.1 Basic Probability Assignment (BPA)
Basic Probability Assignment (BPA) describes a mass value assigned by an information source
only to events for which it has direct evidence. The information source assigns values to subsets
of the frame of discernment X.
BPA follows three axioms:
A1: Empty set
The mass assigned to the empty set is zero.
(4.12) $m(\emptyset) = 0$
A2: Frame of Discernment
The masses assigned to all subsets of the frame of discernment sum to one.
(4.13) $\sum_{A \in P(X)} m(A) = 1$
A3: Normalized values
All mass assignments are normalized to the unit interval.
(4.14) $m(A) \in [0, 1]$
A particular case of a BPA is the well-known probability distribution, where the sum of all known
evidence for a particular event is unity.
4.7.2 Combining Evidence
Two basic probability assignments, $m_1$ on the set of events B and $m_2$ on the set of events C, are
combined to give a joint basic probability assignment $m_{1,2}$ on the set A. The set A contains events
that are included in both sets B and C. The normalizing factor K is needed to compensate for mass that is
assigned to events that are not common to both B and C.
(4.15) $m_{1,2}(A) = \dfrac{\sum_{B \cap C = A} m_1(B)\, m_2(C)}{1 - K}$, where $K = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C)$
The known formula for a joint probability distribution is a particular case of the Dempster-Shafer
(DS) rule for combining evidence. Combining an existing speech recognition system with a
spectrogram-based system can improve the performance of the existing system as long as the
spectrogram system gives reasonable results.
DS theory is a generalization of the Bayesian theory of subjective probability. Degrees of
belief need not have the mathematical properties of probability. The rule of combining evidence is a
generalization of Bayesian combining, and when two events are independent or when the basic
probability assignment sums to unity the rule of combining evidence and the rule of joint
probability are identical. There is no need to specify prior probabilities in the evidence theory
scheme. In Bayesian inference, a symmetric probability value of .5 would be assigned to a
variable when there is no knowledge about its value; in evidence theory a plausibility
value of unity would be assigned. The plausibility value of unity is more general and allows all
possible types of probability distributions since it serves as an upper bound for the underlying
(unknown) probability density function, if it exists. Instead of simply assuming a uniform
distribution due to lack of knowledge or inherent vagueness, we allow the system to take on all
types of probability distributions.
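The rule for combining evidence can be sketched in Python by representing each basic probability assignment as a dictionary from subsets of the frame of discernment to masses. The representation, the function name and the two-element frame in the example are assumptions made for illustration; the arithmetic follows equation 4.15.

```python
from itertools import product


def combine(m1, m2):
    """Combine two basic probability assignments by the Dempster-Shafer rule.

    m1, m2: dicts mapping frozenset (a subset of the frame) to its mass.
    Returns the joint assignment and the conflict K.
    """
    joint = {}
    conflict = 0.0
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        a = b & c
        if a:
            joint[a] = joint.get(a, 0.0) + mass_b * mass_c
        else:
            conflict += mass_b * mass_c   # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("the two sources are in total conflict")
    return {a: v / (1.0 - conflict) for a, v in joint.items()}, conflict


if __name__ == "__main__":
    x, y = frozenset({"x"}), frozenset({"y"})
    frame = x | y
    m1 = {x: 0.6, frame: 0.4}   # source 1: evidence for x, rest undecided
    m2 = {y: 0.5, frame: 0.5}   # source 2: evidence for y, rest undecided
    m12, k = combine(m1, m2)
    print("K =", k)
    for subset, mass in m12.items():
        print(sorted(subset), round(mass, 4))
```

Mass assigned to the whole frame expresses ignorance without forcing a uniform prior, and the combined masses again sum to one after division by 1 − K.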
4.8 Fuzzy Logic Toolbox
The Fuzzy Logic Toolbox in Matlab has a Graphical User Interface (GUI) that allows easy
manipulation of membership functions and a general construction of a Fuzzy Logic rule-based
system. The toolbox also contains different dedicated functions for fast computation of Fuzzy
Logic operations. An example of a feature vector extracted from the image spectrogram can be
an estimated location of the frequency of the first three formants. Since we have vagueness due
to the uncertainty principle, a reasonable membership function would accept different frequency
values around some expected formant value that can be either learnt by a Neural Network or
calculated through statistical regression for the particular speaker. Ideally, we would train the
membership functions to each speaker to improve performance. The output of the system would
be a crisp value that corresponds to a particular phoneme with the dimension being the number
of phonemes to be recognized. However, a different construction can at first distinguish between
different phoneme classes to reduce the dimensionality. Nasals for example can be easily
identified by their suppression of the second formant.
4.9 Vagueness and Ambiguity
We present two different concepts that deal with cases that are not well defined in terms of
classical Boolean logic [29]:
Vagueness – fuzziness, haziness, cloudiness, unclearness, indistinctiveness and sharplessness.
Vagueness is best described by fuzzy sets.
Ambiguity – nonspecificity, one-to-many relation, variety, generality, diversity and divergence.
Ambiguity is best described by fuzzy measures.
Three types of ambiguity that lead to three different measures are:
1. Nonspecificity in evidence – Nested subsets cause nonspecificity in evidence. The
larger the subset, the greater the nonspecificity is.
2. Dissonance in evidence – Two disjoint sets in X that give information about prospective
locations of an element of X contradict each other by their evidence.
3. Confusion in evidence – Subsets of X that do not overlap, or that partially overlap, cause
confusion in evidence.
Uncertainty can be used to measure information. In this case, the measure of information
does not include semantic and pragmatic aspects of information but can serve an important part
in system modeling, analysis and design. In our work we witness all three types of ambiguity. We
experience nonspecificity in evidence when we examine the results of the uncertainty principle in
the image spectrogram. The smearing of information over a wide frequency band does not allow
us to associate a specific frequency with a particular formant. Instead we are left with a wide band
and the particular frequency can be anywhere in the band. After identifying a specific region in the
image spectrogram that is associated with a single frequency, the amount of uncertainty can be
easily quantified.
Dissonance of evidence, the second type of ambiguity, is the prime reason for the slow
development of rule-based speech recognition systems. The contradiction in rules has led to null
results when crisp rule sets were used. Using fuzzy logic rules, however, allows overcoming this
problem since fuzzy logic will produce a valid result even when two bodies of evidence contradict
each other. The amount of uncertainty in this case is directly related to the proximity of formants
of different phonemes under inter-speaker variation and coarticulation effects.
The third ambiguity is introduced in our case when the segmentation algorithm does not
recognize all formants in a correct way. We have in this situation missing evidence that can still
be handled by a fuzzy logic based set of rules. Specifying the amount of uncertainty in this case
is directly related to the performance of the segmentation algorithm under different conditions.
4.10 Summary
The basics of Fuzzy Logic operators were introduced. We saw the similarity between the fuzzy
union/intersection and dilation/erosion in mathematical morphology. Fuzzy Logic serves as an
important and useful technique to model daily human tasks. Understanding Fuzzy Logic is
important to our work since when designing the automatic spectrogram reading we need to know
what type of outputs we need to generate that would be accepted by a Fuzzy Logic system. We
expect a human expert to design a set of rules that would use the information generated by the
segmentation algorithm to perform the speech recognition. Knowing that Fuzzy Logic can deal
with ambiguity and with contradicting evidence allows us to develop a scheme that does not
necessarily try to identify a specific phoneme in a first pass, but rather extracts information that
would be deciphered at a later stage.
Chapter 5
Pitch detection algorithm
5.1 Motivation
The wideband speech spectrogram is striated by vertical lines. These lines indicate the peaks
and valleys of the signal that are caused by the opening and closing of the glottis. The opening
and closing of the glottis generates the fundamental frequency also known as the pitch.
Estimating the pitch period can aid in removing these lines from the wideband spectrogram.
Removing the vertical lines from the wideband spectrogram reduces the noise in the image and
makes it easier to segment the different formants and consequently to perform an automatic
recognition task. In addition, pitch can also be used as a feature that aids an automatic speech
recognition system. Pitch can help distinguish between voiced and unvoiced speech and can indicate
the semantic and emotional status of the speaker, which can be analyzed using a higher level recognition
technique. For example, pitch can indicate an “end of sentence”. That information can be
incorporated within a language model to improve recognition rates. Other features that can assist
a speech recognition system are the emotional state of the speaker, transitions between
phonemes and different types of phoneme. Pitch that indicates that the emotional state of the
speaker has changed can help adjust parameters and assist in overcoming the inter-speaker
variability problem. Pitch can also indicate a transition between different phonemes and phoneme
types. We are therefore motivated to detect the pitch and examine a new approach to pitch
recognition.
To obtain a better view of the pitch we generate a narrowband spectrogram. A longer time
window gives better frequency resolution.
5.2 Theoretical Overview
Several pitch detection algorithms exist in the literature. These algorithms can be classified into
four groups, namely: Time domain, Frequency domain, Time-Frequency hybrid and Event Driven.
Pitch can be described as the rate of vibration of the vocal cords. The glottis generates a “saw”
wave that propagates through the vocal tract, which shapes the wave to produce the required spoken
utterance. The fundamental frequency is thus generated as part of the process of speech production.
Several difficulties in detecting pitch are:
1. Noise – The speech signal can be corrupted by noise (recording device, background
noise, compression over the channel etc.)
2. The interaction between the vocal tract and the glottal excitation can have an impact on
the clarity of the fundamental frequency.
3. Defining the start and end points of the pitch during voiced segments.
4. Differentiating between low-level voiced speech and unvoiced speech.
We present the autocorrelation method [33] followed by the cepstrum method [34]. We focus on
these methods since the autocorrelation method is considered to be the most common method
used and since the cepstral method is related to our work and can be easily obtained with minor
additional calculations from the speech spectrogram.
5.3 Autocorrelation Method
To perform good analysis, high pitch speakers require a short window of 5 to 20 ms while low pitch
speakers require a long window of 20 to 50 ms. Most methods use a sharp cutoff filter at 900 Hz to
reduce the effect of the second and higher formants. Such a low pass filter manages to preserve
enough harmonics of the pitch to allow detection. The autocorrelation method is robust, works in
the time domain and usually assumes stationary analysis (the system does not change during the
computation of the autocorrelation function). Difficulties are high computation costs, accurately
detecting the correct peak in the resulting autocorrelation function and the need to window the
signal. Autocorrelation requires multiply-accumulate (MAC) operations and the complexity of the
operation is $O(N^2)$. The complexity can be further reduced to the order of $O(N \log N)$ by computing
the autocorrelation with the FFT. The O
function serves as an upper bound to the actual complexity function and N is the number of
samples in question.
Windowing: the type of window chosen affects the result, and since the window tapers the
signal toward zero at its edges, there is a change in the autocorrelation function that can affect detection.
For a periodic signal s(n)=s(n+P) we have periodicity in the autocorrelation A(k)=A(k+P).
Since the speech signal is non-stationary, it is more reasonable to define a short-time
autocorrelation function with the underlying assumption of a quasi-stationary signal in each time
segment. The signal is preprocessed with a low pass filter with a cutoff frequency of 900 Hz and
stop band frequency of 1,700 Hz (a sampling rate of 10 kHz is assumed for the speech signal in
this scheme). Then the signal is nonlinearly clipped and as a result the spectral density function
is flattened. Since we are dealing with a discrete-time signal, we do not have discontinuity points
due to the clipping effect as would be expected in a continuous-time signal; thus the clipping
reduces the energy of the higher frequencies in a nonlinear fashion.
Three different types of clipping are presented:
1. Clip: keep signal above a threshold.
2. Clip and reduce signal by a threshold value.
3. Sign function of the clipped signal.
These three methods give rise to 10 (3*3+1) different ways to correlate the nonlinearly
processed signal with its shifted nonlinearly processed counterpart.
Rule of thumb: set the threshold value to 68% of the minimum of Q where Q is a two element
vector containing the maximum of the absolute values of the signal in the first and in the last
thirds of the analysis frame. Note that this operation is equivalent to a fuzzy union of the absolute
values of the signal at each interval and then the fuzzy intersection between both intervals. The
goal is to reduce the effects of the first formant so as to allow reliable pitch detection. Spectral
smoothing is achieved by a combination of nonlinear functions. An autocorrelation value below
.25 is assumed to be unvoiced speech. It is also assumed that there is a voiced/unvoiced
detector that passes only the voiced speech. Changing the analysis frame size is important in
particular when the application is to handle different speakers. The frame size must contain at
least two pitch periods but not be too large for it to be possible to detect the pitch at a given time
interval. The range of pitch variation within an utterance is generally one octave or less from the
average pitch for the utterance. Thus an instantaneous adaptive algorithm (for the window size) is
not required.
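The clipping rule of thumb and the voicing threshold can be put together in a short Python sketch of the autocorrelation method. This is a simplified illustration under stated assumptions and not the scheme of [33] in full: the low pass filter is omitted, only the second clipping type (clip and reduce by the threshold) is used, the autocorrelation is normalized by the frame energy, and the function names and the 50 to 300 Hz search range are assumptions.

```python
import math


def clipping_threshold(frame, fraction=0.68):
    """68% of the smaller of the peak magnitudes in the first and last thirds."""
    third = len(frame) // 3
    q = (max(abs(s) for s in frame[:third]),
         max(abs(s) for s in frame[-third:]))
    return fraction * min(q)


def center_clip(frame, c):
    """Clip and reduce: samples within [-c, c] become zero, others move by c."""
    return [s - c if s > c else s + c if s < -c else 0.0 for s in frame]


def autocorr_pitch(frame, fs, fmin=50.0, fmax=300.0, voiced=0.25):
    """Return the pitch in Hz, or None if the frame is judged unvoiced."""
    x = center_clip(frame, clipping_threshold(frame))
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return None
    lo = int(fs / fmax)
    hi = min(int(fs / fmin), len(x) - 1)
    best_lag, best = None, 0.0
    for k in range(lo, hi + 1):
        r = sum(x[n] * x[n + k] for n in range(len(x) - k)) / energy
        if r > best:
            best_lag, best = k, r
    if best_lag is None or best < voiced:
        return None   # normalized peak below .25: treated as unvoiced
    return fs / best_lag


if __name__ == "__main__":
    fs = 10000
    frame = [math.sin(2 * math.pi * 100 * n / fs) for n in range(400)]
    print(autocorr_pitch(frame, fs))
```

The direct sum costs on the order of N multiply-accumulate operations per lag, which is the quadratic cost mentioned above; the 40 ms frame in the example holds four periods of the 100 Hz test tone.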
In this work we develop a new scheme to detect pitch that is an extension of the Cepstrum
method (CEP) [34].
A short description of the CEP method:
1. The signal is partitioned into intervals on the order of 4 pitch periods.
2. A Hamming window is used to reduce spectral leakage.
3. The logarithm of the absolute value of a 512 point Fast Fourier Transform (FFT) of the
windowed signal is then computed.
4. A 512 point Inverse FFT is computed and the resulting peaks are detected.
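The four steps above can be sketched in Python as follows. This is a minimal illustration and not the implementation of [34]: the radix-2 FFT is written out so that the fragment needs no external library, the frame length is assumed to be a power of two (512 here), a small constant is added before the logarithm to avoid taking the logarithm of zero, and the search range of 50 to 300 Hz and the function names are assumptions.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(spectrum):
    """Inverse FFT through conjugation of the forward transform."""
    n = len(spectrum)
    conj = fft([v.conjugate() for v in spectrum])
    return [v.conjugate() / n for v in conj]


def cepstrum_pitch(frame, fs, fmin=50.0, fmax=300.0):
    """Pitch estimate in Hz from the highest cepstral peak in the search range."""
    n = len(frame)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]
    spectrum = fft([s * w for s, w in zip(frame, window)])
    log_mag = [math.log(abs(v) + 1e-12) for v in spectrum]
    cepstrum = [v.real for v in ifft(log_mag)]
    lo, hi = int(fs / fmax), min(int(fs / fmin), n // 2)
    peak = max(range(lo, hi), key=lambda q: cepstrum[q])
    return fs / peak


if __name__ == "__main__":
    fs = 10000
    # Synthetic excitation: one pulse every 80 samples, i.e. 125 Hz.
    frame = [1.0 if i % 80 == 0 else 0.0 for i in range(512)]
    print(cepstrum_pitch(frame, fs))
```

The peak index is a period expressed in samples, so the estimate is quantized to fs divided by an integer; the synthetic pulse train in the example stands in for a voiced frame of about six pitch periods.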
By taking a larger time window we can improve our frequency resolution of the pitch at the cost of
time localization. The result is a more accurate pitch estimate that applies to an interval of a few
pitch periods. It is possible to generate the narrowband spectrogram from the partial computation
of the wideband spectrogram. Combining two wideband (shorter time duration) sections into a
single narrowband section is done by using the appropriate twiddle factors as a last stage of the
corresponding FFT. The windowing can be left to a later stage since by cyclic convolution we
have:
(5.1) $W^{d}[k] \circledast \{X_1^{d}[k],\ X_2^{d}[k]\}$.
$W^{d}[k]$ is either a duplicate of a wideband length Hamming window or a full length narrowband
Hamming window in the frequency domain, where the superscript ‘d’ stands for the DFT. Therefore,
applications that are using wideband spectrograms can perform the pitch detection algorithm with
a relatively low additional computational cost. We want to measure the distance between the lines
in the spectrogram. The distance would give an indication of a half cycle of the sinusoid (peak to
peak). Since the image is noisy and since we sometimes miss lines where they are supposed to
be detected or have lines in places where there shouldn’t be any, we need to find a way to increase the
accuracy of our measurement. We need to aggregate the different measurements in some way
that will reduce errors. First we need to thin the lines to a single pixel width so the measurement
would have meaning. The thinning morphological operator was described in section 3.2.9.
5.4 Method of Aggregation
Using an arithmetic mean to calculate the average distance between the lines will not give
accurate results since it would only take into account the first and last lines and completely ignore
all the lines in the middle. This will increase the chances of error and inaccuracy since the
calculation depends only on the first and last line position.
Assume 26 parallel lines located at frequencies given by the 26 letters of the English alphabet
{a,b,c …}
We obtain a telescoping sum over the 25 distances between adjacent lines:
(5.2) {(a−b) + (b−c) + (c−d) + … + (x−y) + (y−z)}/25 = (a−z)/25
We see that only the locations of the first and last lines are taken into account, together with their
property as boundaries of the set of lines. Since the distance between the intermediate lines is
not important to the final result we avoid the measurement error associated with each
intermediate distance. However, this method assumes that we correctly identified each line. In
case a line is not identified we will have an error in the denominator. For a typical narrowband
spectrogram that shows 20 lines we will have an error of about 5% if we miss one line and about 10% if we
miss two lines. This error also inflates the original measurement error.
A median aggregation method is selected. This method has better properties in terms of
sensitivity to outliers. In case we have an even number of lines (odd number of distances) we are
guaranteed to obtain a measurement that is within the set. A limiter that requires a minimum
number of lines at a certain time instance and a maximum possible line distance (pitch height)
helps reduce unvoiced speech effects.
We continue by examining the properties of the median operator. In the following, we see that the
median can be regarded as the solution of a minimum absolute distance (MAD) optimization.
Consider the following optimization problem:
(5.3) $\arg\min_{x} f(x)$, where $f(x) = \sum_{i=1}^{N} |x_i - x|$.
Taking the derivative to find a global optimum we obtain:
(5.4) $f'(x) = \sum_{i=1}^{N} \mathrm{sgn}(x - x_i)$.
If N is odd we can obtain a unique minimum for the function f(x) by taking x to be the middle point
of the sorted samples. This would ensure that the derivative of the function is exactly zero. In case N is even, we can
select any value between the two middle points of the sorted samples, with indices q = N/2 and s = N/2 + 1.
A simple to implement solution
would be to take the mid-value of these two points (an addition followed by a one bit register shift in a hardware implementation). This would
also be an optimal solution in the mean square error sense in case we assume that the average
between these two points is also the average of the whole ensemble (a good assumption for a
large sample drawn from a symmetric distribution). This scheme is used in the 1D Matlab
implementation of the median. In the 2D median implementation, the point q is selected. 2D
median is used in images in order to reduce the effect of Salt and Pepper noise [35]. Unlike a
convolution operation, which smears (low passes) the energy of the speckles, the median filter, if
constructed properly, would remove these speckles while avoiding placing gray level values that
did not exist in the original image.
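The difference between the two aggregation methods can be seen in a small numerical sketch. The harmonic positions below are synthetic (multiples of 100 Hz) and the function names are assumptions; the example removes one line to imitate a missed detection.

```python
from statistics import median


def line_spacings(freqs):
    """Distances between adjacent lines, after sorting by frequency."""
    ordered = sorted(freqs)
    return [hi - lo for lo, hi in zip(ordered, ordered[1:])]


def mean_spacing(freqs):
    """Arithmetic mean of the spacings; telescopes to (last - first) / (n - 1)."""
    d = line_spacings(freqs)
    return sum(d) / len(d)


def median_spacing(freqs):
    """Median of the spacings; insensitive to a few missed or spurious lines."""
    return median(line_spacings(freqs))


if __name__ == "__main__":
    harmonics = [100.0 * k for k in range(1, 21)]        # 20 lines, 100 Hz apart
    missed = [f for f in harmonics if f != 1000.0]       # one line not detected
    print("mean, all lines:   ", mean_spacing(harmonics))
    print("mean, one missed:  ", round(mean_spacing(missed), 2))
    print("median, one missed:", median_spacing(missed))
```

With one line missing, the mean divides the unchanged span by one distance too few and overestimates the spacing by roughly 5%, while the median still returns the true spacing because the single doubled distance is an outlier.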
5.5 Suggested Algorithm
Our objective is to extract thin lines that represent the fundamental frequency’s harmonics and
use the distance between the lines to estimate the fundamental frequency.
We start by applying a local threshold to the image to better distinguish between the
lines we wish to detect and the background. Since we are dealing with a narrowband speech
spectrogram, which is considered a noisy image, we need to concentrate our detection efforts on
objects that would produce a reasonable result. Hence, we exclude from the image all small objects that might
have been generated by noise. We perform this task
using connectivity properties and disregard all segments that have less than 100 connected
pixels in “8-connectivity”. We proceed by finding segments that are within the objects and are in
fact centers of the thicker lines. We find the centers by thinning the image to infinity, which in
practical terms means thinning the image to a single pixel width so the segments would be
evident and well localized. We use these centers to identify again the objects and select from
those only objects that have at least 1000 connected pixels with a “4-connectivity” requirement. The latter
restriction is tighter than the 100 pixels “8-connectivity” and yields only long line sections. We
again thin the objects to a single pixel width.
We conclude the image processing part of the algorithm by performing pruning to remove
spurious branches shorter than 10 pixels. Having extracted the lines from the image we continue
by measuring the distance between the lines. We have seen in section 5.4 that a median will be a
more reliable choice than an average. Since the pitch is caused by movement of the vocal cords
we can safely assume that no significant change occurs between consecutive samples at the sampling rate of 16 kHz. The pitch
is usually in the order of 100 Hz for an adult male speaker, therefore sampling the speech at a
frequency of 16 kHz gives a very good resolution since it is much more than the Nyquist rate for
the 10th harmonic (which is expected to be around 1,000 Hz). In order to reduce the amount of
data we need to process we down sample the image spectrogram and calculate the distance
every 10 sample points. We ignore distances that are too large, that is, distances that indicate a pitch higher than
300 Hz, to reduce the possibility of an erroneous measurement. We also ignore points that have
less than 4 corresponding lines since a reliable measurement cannot be provided in such a case.
Points with less than 4 corresponding lines might indicate bad detection of lines or, more
commonly, indicate an unvoiced speech segment that does not have a well defined pitch period.
5.6 Algorithm Description
We present a step-by-step description of the pitch estimation algorithm:
1. Perform a local threshold.
2. Disregard objects that have fewer than 100 connected pixels using “8-connectivity”.
3. Thin the image indefinitely until all objects are of a single pixel width to find the centers of the lines.
4. Using a mask to identify the objects that contain the centers found by the thinning in
the previous step, disregard objects that have fewer than 1000 connected pixels using
“4-connectivity”.
5. Thin the identified objects to a single pixel width.
6. Prune the result to cut branches of less than 10 pixels.
7. Disregard line distances that correspond to a fundamental frequency higher than 300 Hz
and disregard time instances that have fewer than 4 lines associated with them.
8. Compute the median distance on a downsampled time axis. Downsample by a ratio of 1:10
to an effective sampling frequency of 1.6 kHz.
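Steps 7 and 8 can be sketched in a few lines of Python. This is an illustration only and not the linez implementation: the function names, the representation of each time instance as a list of row indices of the detected lines, and the max_distance parameter (the pixel spacing taken to correspond to 300 Hz) are assumptions made for the example.

```python
from statistics import median


def median_line_distance(line_rows, max_distance, min_lines=4):
    """Median spacing (in pixels) between detected lines at one time instance.

    Returns None when fewer than min_lines lines are present (step 7) or
    when no spacing survives the max_distance rejection.
    """
    rows = sorted(line_rows)
    if len(rows) < min_lines:
        return None
    gaps = [b - a for a, b in zip(rows, rows[1:])]
    # Step 7: drop spacings that would imply a pitch above the allowed maximum.
    gaps = [g for g in gaps if g <= max_distance]
    return median(gaps) if gaps else None


def pitch_track(columns, max_distance, step=10):
    """Step 8: evaluate the median spacing at every step-th time instance."""
    return [median_line_distance(c, max_distance) for c in columns[::step]]
```

The median is used rather than the mean so that a single missed or spurious line does not shift the estimate for that time instance.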
5.7 Results
Figure 5.1: Narrowband Speech Spectrogram.
Scale: Horizontal Axis 0 – 1 sec; Vertical Axis 0 – 4,000 Hz.
Figure 5.2: Narrowband Speech Spectrogram after line detection.
Scale: Horizontal Axis 0 – 1 sec; Vertical Axis 0 – 4,000 Hz.
We see that in general the lines are well detected. The algorithm manages to thin the lines to a
single pixel width. We also detect lines in areas where the pitch is not well defined, for
example in fricatives or stops. However, by simple examination of the results we can disregard
these areas or indicate to a higher level that they are caused by unvoiced speech. At first the
lines seem evenly spaced; however, close examination of figure 5.2 shows that the lines are
not perfectly spaced and that there are in fact some noise and erroneous line segments. The median
in our algorithm reduces the effects of outliers and lets us focus on a more stable and
conservative subset of the measurements.
We present some calculations that relate the pitch values to the spectrogram
image. The calculations rely on the following facts:
1. The speech signal is sampled at 16 kHz. We process a speech segment of 26,000
sample points, i.e. 26/16 = 1.625 seconds.
2. We have a 50% overlap between consecutive windows. Our window is 1,600 samples
long, which corresponds to 0.1 sec.
3. The FFT has 4,096 taps. We need a very long FFT in order to capture at least one full time
window.
4. We normalize the image spectrogram to a length of 1,000 pixels. We interpolate
the data further to spread half the information over the 1,000 pixels.
5. We calculate line distances at every 10th sample.
In order to compute the narrowband spectrogram we use the specwb function, as described
in section 6.11, with a specific setting to calculate a narrowband spectrogram. The signal is
downsampled to 3,200 Hz. To translate the distance between the lines to a valid frequency we
first multiply by two to have a full cycle in pixels and then multiply by 1.5625, which is the
frequency spacing of each pixel (1,600/1,024). Given all the information above, each pitch
sample corresponds to a duration of 16.25 ms (1.625 sec spread over 1,000 pixels, taken at
every 10th pixel). We obtain the following results for the pitch values:
Figure 5.3: Results of the Pitch Estimation Algorithm.
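The unit conversions stated before Figure 5.3 can be checked with a short Python sketch. The constant and function names are chosen for this example only; the numbers are the ones given in the text.

```python
SAMPLE_RATE_HZ = 16_000      # sampling rate of the speech signal
SEGMENT_SAMPLES = 26_000     # length of the processed segment
IMAGE_WIDTH_PX = 1_000       # normalized spectrogram width
COLUMN_STEP = 10             # line distances are computed every 10th column
HZ_PER_PIXEL = 1_600 / 1_024  # frequency spacing of one pixel (1.5625 Hz)

segment_seconds = SEGMENT_SAMPLES / SAMPLE_RATE_HZ
ms_per_estimate = 1000 * segment_seconds / IMAGE_WIDTH_PX * COLUMN_STEP


def distance_to_pitch_hz(distance_px):
    """Convert a line distance in pixels to a frequency in Hz.

    Follows the text: multiply by two for a full cycle, then by the
    frequency spacing of each pixel.
    """
    return 2 * distance_px * HZ_PER_PIXEL
```

Under these figures, a pitch of about 100 Hz corresponds to a measured line distance of 32 pixels.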
We see that the values obtained for the pitch are not stable. In addition, we have obtained very
high or very low pitch values that are usually caused by unvoiced speech segments. The main
problem of this technique is that we have large errors due to the difficulty in determining the exact
frequency of the sinusoid just from the maximum values that are obtained in the thinning process of the
quantized image. We would expect the pitch to be stable at values close to 100 Hz with slight
variations on the order of about 10% due to syntactic differences.
5.8 Summary
We have seen that attempting to detect the pitch, or any other noisy sinusoidal function, through
thinning of a band of maximum points obtained by a threshold (clipping) yields inaccurate results
when there is insufficient frequency resolution. While a rough approximation of the pitch is
possible and even though on a long time interval the errors cancel out and produce a reasonable
result, the errors in this type of measurement are too large to tolerate. On the other hand,
estimating the threshold regions from a known pitch is possible; however, due to the threshold the
regions are wide and in order to remove them further processing is required. Using morphological
image processing techniques to extract the pitch period from a narrowband speech spectrogram
is not a good idea since there is no visual advantage or expert knowledge that helps to achieve a
more accurate result than existing algorithms that use more information and achieve relatively
accurate and stable results. There is insufficient resolution to determine the pitch frequency
accurately; by increasing the resolution we lose time accuracy, which is important if we desire to
make any use of the pitch information. It is possible to employ cepstral-based algorithms, for
example [34], where the cepstral coefficients can be computed without much additional
computational effort directly from the spectrogram. Perhaps the main drawback of our method is
ignoring most of the available information that is lost in the quantization process. Relying on
extreme levels to compute the frequency of a sinusoid can be reasonable only if we wish to
obtain a rough approximation. In this case, where the frequency is in the range of 100 Hz and the
average error is over 15%, this method is unacceptable.
In order to perform correct recognition, it would be necessary to imitate the exact
procedure of an expert. After identifying and thinning the lines we would need to count them in an
‘intelligent’ way that would warn in case a line is missing. We would then sample a point from the
first and last lines and perform the average calculation. If the ‘intelligent’ system worked
properly, we would expect to obtain a correct result for the pitch estimate.
Chapter 6
Automatic Spectrogram Segmentation
Algorithm
6.1 Introduction
In this chapter we present an algorithm to enable efficient segmentation of phonemes in a speech
spectrogram. We review previous work and give motivation for automatic reading of speech
spectrograms. We then continue with particular and detailed descriptions of key algorithmic
ingredients followed by an explanation of the segmentation algorithm. After presenting and
analyzing the results we suggest ways to fit the algorithm so it could handle different procedures
and summarize the chapter with conclusions.
6.2 Overview
Inspired by the work of Prof. Zue from MIT, we focus our efforts on the task of automatically
reading a wideband speech spectrogram as if it were some kind of text written in a language
known to expert readers. Optical Character Recognition (OCR) is the task of identifying
handwritten words acquired by some imaging device and translating them into text. There are a
few differences between OCR and spectrogram reading. First, the time axis gives good
synchronization in spectrograms: unlike written text, which is not necessarily aligned along a certain
axis, the phonemes are well ordered. The length of each phoneme depends on the pace
and pronunciation of the particular word or sentence. Along the frequency axis we also have an ordering
of the different formants. We have an idea of the location of the different formants, and this prior
knowledge can be used to construct rules that aid in recognizing the particular phoneme.
We would like to perform automatic reading of a spectrogram image. In order to do that
we must first identify important spectrogram features. Voiced speech can be characterized by its
relatively high energy levels in certain frequency regions, which make up the formants associated
with the particular phoneme. Unvoiced speech can be better described by a smearing of energy
throughout all frequency bands in some random pattern. Concentrating on the voiced speech, we
attempt to identify and extract the different formants from the spectrogram. The extraction can be
done using image processing techniques for segmentation.
Several algorithms have been developed to perform image segmentation. Spectrograms are a
special kind of image that can be subcategorized under Time-Frequency Resolution images as
explained in section 2.9. Common techniques to perform image segmentation are:
1. Statistical techniques: able to identify and separate different areas in the image according
to their different statistical properties. In [36] statistical techniques are used to classify
and segment an image in an attempt to automatically detect cancer cell nuclei.
2. Differential calculus techniques: track down the borders of each object in the image using
locally computed divergence and Laplacian [37].
3. Tracking methods: algorithms such as the Kalman filter are used to track frequency
changes of each formant and segment the image according to the acquired information.
In [38] real-time segmentation of range images is performed using a Kalman filter.
The aforementioned techniques seem to be inappropriate in our case. The spectrogram
image is very noisy due to several factors. Noise caused as a result of the pitch makes it very
hard to determine object boundaries; general speech noise also exists and due to the uncertainty
principle we have smearing of all frequencies over a wide band which smears the formants and
sometimes causes nearby frequencies to merge into a single large segment. It is not clear how a
statistical model should be designed since it would have to include the vagueness due to the
uncertainty principle as well as a model of speech and the disturbing pitch lines. Differential
calculus entails a heavy computational burden and would need cleanup of the image to obtain
correct results. It would have problems in segmenting two formants that due to the uncertainty
principle are smeared into one large object. The Kalman filter on the other hand might get lost
tracking a line since, due to the uncertainty principle, an actual frequency is smeared over a band
of frequencies. If we use [38] and regard the spectrogram image as a range image, we still
need to perform a cleanup that separates the different BLOBs. Since partitioning the
image into different BLOBs is a major task, we do not obtain any advantage by using this method.
We wish to work with an algorithm that is robust to scaling in the time axis since we do not want
faster or slower speech to have a detrimental effect on our results. We also wish to have a robust
algorithm in terms of energy levels or at least one that does not require rapid, and often obscure,
changes to threshold parameters. In addition we would like to have an algorithm that has low
computational requirements and that can later be modified to run in real time, perhaps on a
downsampled low-resolution version of the speech spectrogram. These constraints lead us to select
mathematical morphology, and particularly the watershed transform, as the tool to segment the
image spectrogram. The watershed transform is efficiently implemented in Matlab and is intended
to perform image segmentation in harsh noise conditions.
6.3 Algorithm Description
1. Median filter is used once on a 3 by 20 rectangle and 4 times using a 20 pixel horizontal line.
2. Run a 2D Gaussian window (Gabor filter).
3. Smooth using a 2D Wiener filter. The local mean and variance are estimated in a 16 by 16
square around each filtered pixel.
4. Apply local threshold on (3).
5. Apply global threshold on (3).
6. Combine the results of (4) and (5) using a logical OR.
7. Dilate with a disk as a structuring element in order to disconnect thin lines and eliminate small
areas in the image.
8. Use morphological connectivity to disregard small sections that contain fewer than 40 pixels or
that have a maximum width of less than 20 pixels.
9. Perform an 8-connectivity watershed algorithm.
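As an illustration of step 1, a horizontal median filter can be sketched in plain Python. This is a minimal sketch and not the code used in this work: the function name is hypothetical, the image is assumed to be a list of rows, and the window is truncated at the image borders, which is one of several possible border conventions.

```python
from statistics import median


def horizontal_median(image, width):
    """Median-filter each row with a horizontal window of about `width` pixels.

    The window spans width // 2 pixels on each side of the current pixel and
    is truncated at the left and right borders (an assumption of this sketch).
    """
    half = width // 2
    filtered = []
    for row in image:
        n = len(row)
        filtered.append(
            [median(row[max(0, x - half):min(n, x + half + 1)]) for x in range(n)]
        )
    return filtered
```

A horizontal window removes isolated spikes along the time axis while preserving step-like onsets and offsets, which suits formants that run mostly horizontally in the spectrogram.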
Figure 6.1: Algorithm flow diagram. The image spectrogram passes through a 3 by 20 median filter (cnt = 1) and then a 1 by 20 median filter that is repeated until cnt = 4, followed by a local 2D Wiener filter. A local threshold and a global threshold are combined by a logical OR into a binary mask, followed by dilation (disk as structuring element), discarding of small connected sections and the watershed transform.
The algorithm uses both local and global processing techniques. Both smoothing and threshold
are done at the local and global level. Smoothing at the local level uses a locally estimated mean
and variance for a 2D Wiener filter while global smoothing procedures using a Gaussian window,
median filtering and image dilation are applied with global parameters. The watershed transform
is applied to the entire image since the interaction between different image objects plays an
important part in the segmentation process.
6.4 Adaptive Histogram Equalization
In order to obtain a spectrogram that is clear to read, adaptive histogram equalization is applied to the
image. The adaptive equalization works first on tiles of the image and then combines the tiles by
using bilinear interpolation in order to reduce border effects. The advantage of using adaptive
histogram equalization is that details in the image can be emphasized. A global equalization
would diminish most of the details in the image. The adaptive equalization is done through the
adapthisteq Matlab instruction. In general any invertible histogram function on a certain domain
can be equalized to any invertible function on the same domain. For example, if the image
histogram is $h$, equalizing it to $h_{eq}$ would require:
(6.1) $h_{eq} = s \circ h$, where $s = h_{eq} \circ \mathrm{inv}(h)$.
Equation 6.1 may seem trivial at first glance; however, since we must preserve a constant
number of pixels in each tiled rectangle, we cannot simply move pixels around and an
additional step is required. An algorithm can be devised to map each pixel from one histogram to
the other. Since the tiled histogram equalization operates on rectangular regions, it generates
more homogeneous energy values for the different formants. This method of equalization differs
from the pre-emphasis filter, since it is performed on a rectangular tile and not on particular
vertical lines/time instances. The result is an image spectrogram that clearly shows the first four
speech formants, f1 to f4. Bilinear interpolation is then used to reduce the bordering vertical lines
in the rectangular tiles. After applying the bilinear interpolation, the borders are smoothed down
and practically eliminated.
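The pixel mapping for a single tile can be sketched as follows. This Python fragment is a simplified illustration under stated assumptions and is not the adapthisteq implementation: it equalizes one tile towards a flat histogram through the cumulative distribution, and it omits the interpolation between neighboring tiles. The function name and the 256-level default are assumptions of the example.

```python
def equalize_tile(tile, levels=256):
    """Equalize one rectangular tile of integer gray levels in [0, levels)."""
    flat = [p for row in tile for p in row]
    hist = [0] * levels
    for p in flat:
        hist[p] += 1
    # Cumulative distribution: number of pixels at or below each gray level.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = min(c for c in cdf if c > 0)
    total = len(flat)
    if total == cdf_min:
        # A tile with a single gray level cannot be stretched.
        return [list(row) for row in tile]
    scale = (levels - 1) / (total - cdf_min)
    # Look-up table that maps each input level to its equalized level.
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[p] for p in row] for row in tile]
```

Because every pixel with the same input level receives the same output level, the number of pixels in the tile is preserved, which is the constraint noted for equation 6.1.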
6.5 Gamma Correction
A gamma characteristic is a power-law relationship that approximates the relationship between
the encoded luminance in a TV system and the actual desired brightness.
The cathode-ray tube is a nonlinear device. The relationship between the luminance $I$ and the voltage $V_S$
is given by the formula:
(6.2) $I = V_S^{\gamma}$.
For a computer Cathode-Ray Tube (CRT), gamma is about 2.2.
To compensate for the distortion effect, a gamma correction is applied to the voltage, giving the corrected voltage $V_C$:
(6.3) $V_C = V_S^{1/\gamma}$.
Gamma correction is performed on the whole image and is specific to the hardware used.
Changing the screen brightness is equivalent to performing gamma correction. Since the human
visual system views luminance using a logarithmic scale, in a similar way to the logarithmic scale
used in the human hearing system, the gamma correction serves to adjust this scale. In this work
we use a gamma correction value of 0.8.
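Equations 6.2 and 6.3 can be illustrated with a small Python sketch. The function names and the normalization of pixel values to the range 0 to 1 before applying the power law are assumptions of this example. The sketch applies the exponent $1/\gamma$ as in equation 6.3; whether the value of 0.8 quoted above is used as $\gamma$ or as the exponent itself is not specified here.

```python
def gamma_correct(image, gamma, max_value=255.0):
    """Apply the correction of equation 6.3 to values normalized to [0, 1]."""
    exponent = 1.0 / gamma
    return [[max_value * (p / max_value) ** exponent for p in row] for row in image]


def crt_response(image, gamma, max_value=255.0):
    """Model the display nonlinearity of equation 6.2 on normalized values."""
    return [[max_value * (p / max_value) ** gamma for p in row] for row in image]
```

Applying crt_response to the output of gamma_correct with the same gamma returns the original values up to rounding, which is the compensation that the correction is meant to achieve.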
6.6 Window Selection
A common trade-off in window selection is the main lobe width versus the side lobe roll-off rate. A
Hann window is often used due to its good roll-off properties: 60 dB/decade. The Hamming
window has a lower roll-off of 20 dB/decade but a narrower main lobe, since its maximum side
lobe level is 43 dB below the main lobe as opposed to 32 dB for the Hann window [39]. Choosing a narrow main lobe
reduces the uncertainty in frequency and allows us to better distinguish between formants that
have a small frequency difference. The lower roll-off introduces dependencies on previous and
future speech samples, resulting in a noisier image. However, the watershed transform can
produce better results since it can better capture low energy regions, in particular on rising and
falling formants, as seen by comparing fig. 6.2 (c) and fig. 6.2 (d). Therefore we choose to work
with a Hamming window for the specified short time interval of 6.5 ms.
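The side lobe comparison can be reproduced numerically. The following Python sketch, which uses only the standard library, builds both windows with their usual symmetric definitions and estimates the peak side lobe level from a zero-padded DFT. The function names, the window length and the DFT size are assumptions chosen for the example, and the estimate depends slightly on them.

```python
import cmath
import math


def hann(n):
    return [0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def hamming(n):
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def peak_sidelobe_db(window, nfft=2048):
    """Peak side lobe level in dB relative to the main lobe peak."""
    mags = []
    for b in range(nfft // 2):
        acc = sum(w * cmath.exp(-2j * math.pi * b * k / nfft)
                  for k, w in enumerate(window))
        mags.append(abs(acc))
    # Walk down the main lobe until the magnitude starts to rise again,
    # i.e. until just past the first null.
    b = 1
    while b < len(mags) - 1 and mags[b] <= mags[b - 1]:
        b += 1
    return 20 * math.log10(max(mags[b:]) / mags[0])


hann_db = peak_sidelobe_db(hann(64))
hamming_db = peak_sidelobe_db(hamming(64))
```

With these definitions the Hamming window shows the lower peak side lobe of the two, in line with the comparison above.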
6.7 Connectivity Operations
Mathematical morphology is a lattice-based theory in which connectivity plays an important role.
We need to separate larger objects from smaller ones in the image to get rid of spurious small
spots that are the residue of the threshold and dilation operations. A straightforward approach to
determine which objects are big is to predetermine a minimum object size in terms of the number
of connected pixels and discard all objects that contain fewer than that minimum value. We note
that we choose a rigid hard threshold value that was determined according to analysis of a few
image spectrograms; choosing an adaptive threshold value is also possible and might
improve the results at the cost of run time and algorithmic complications. We therefore choose a
fully connected grid (8-connectivity) and discard all objects with fewer than 40 pixels. We also
demand that the minimum width of each object be 20 pixels due to the minimal signal
bandwidth that should be present as a result of the windowing operation and the uncertainty
principle. As a result we manage to get rid of small objects and are left only with larger objects
that can later be further segmented and perhaps separated from one another using the
watershed transform.
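The size and width test can be sketched with a breadth-first search over 8-connected neighbors. This Python fragment is an illustration only: the function name is hypothetical, the mask is assumed to be a list of rows of 0/1 values, and the width of an object is taken to be the horizontal extent of its bounding box, which is an assumption of this sketch.

```python
from collections import deque


def keep_large_objects(mask, min_pixels=40, min_width=20):
    """Keep 8-connected objects with enough pixels and enough horizontal extent."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    kept = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Collect one connected object with a breadth-first search.
            seen[y][x] = True
            queue = deque([(y, x)])
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            columns = [px for _, px in pixels]
            extent = max(columns) - min(columns) + 1
            if len(pixels) >= min_pixels and extent >= min_width:
                for py, px in pixels:
                    kept[py][px] = 1
    return kept
```

Including the diagonal neighbors is what makes the grid fully connected; restricting the inner loops to the four axis-aligned neighbors would give 4-connectivity instead.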
6.8 Local vs. Global Threshold
We need to quantize the grayscale image spectrogram in order to obtain a binary image.
Quantization is performed by selecting a threshold level. Since lower formants tend to have
higher energy concentration than higher formants and since the spectrogram image contains
much detailed information regarding different formants, simply using a global threshold will not
yield good results. Since the speech can be modeled as quasi-stationary, its statistical properties
change several times during a 1 second interval, leading to changes in the spectral density for
different phonemes and even within the same phoneme. The global threshold would not be able
to capture small image details and would not do a good job of separating different
phonemes that are close to one another. The human vision system, when examining a particular
object, performs a local threshold operation. The global threshold serves to reduce white noise
with low energy that can appear as very small speckles in the image spectrogram.
A local threshold is used to isolate each BLOB from its surroundings. A global threshold is
used to clean the image from noise. Combining the local threshold image with the global
threshold image using a logical OR yields the desired result.
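The combination step itself is a pixel-wise operation on two binary masks. The Python sketch below uses hypothetical function names and assumes that both masks are lists of rows of 0/1 values with identical dimensions; the polarity of the masks (which value marks signal) is left to the caller.

```python
def global_threshold(image, level):
    """Binary mask that is 1 wherever the pixel exceeds a single global level."""
    return [[1 if p > level else 0 for p in row] for row in image]


def combine_or(mask_a, mask_b):
    """Pixel-wise logical OR of two binary masks of the same size."""
    return [[a | b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]
```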
6.9 Working with the TIMIT Database
We want to test our algorithm on different phonemes. In order to extract samples from the
database we use the function GetPhoneme and input the desired phoneme to be extracted. The
function creates a subfolder within the Database folder with the same name as the requested
phoneme, containing the extracted speech files that include the phoneme. The speech files are
extracted so that the desired phoneme begins 0.5 seconds into the extracted 1 second
speech segment (centered). We continue by manually creating a text file that contains the paths of
the speech files. We have 20 samples for each phoneme and we name the text file after the
phoneme to be tested. We then run the function manager, which takes the file, calculates the
wideband spectrogram using our specwb function, continues with the segmentation algorithm
using the recon function and finally saves both the identified spectrogram and the mask. The
function also marks borders that indicate the start and end of the phoneme according to the
information contained in the TIMIT database. Since the phoneme is centered, we only need to
compute the end point. The 1 second speech segment was saved under a name that encodes
the end point of the phoneme, so by simple text manipulation we have the end point of each
phoneme. We create a line that is 5 pixels wide and that can be easily identified by the user. The
function is run using a break point in debugging mode (similar to the event-driven ‘try’ and ‘catch’
instructions in Matlab/Java). The user can examine and grade the results of the segmentation
algorithm accordingly.
6.10 Calculating the Local Threshold
We need to obtain a local threshold for each pixel in the image. Using trial and error we select a
parameter for the Gaussian standard deviation to be 100 and create a vertical Gaussian filter of
length 200 pixels. The image that has passed several median stages as described in fig 6.1 is
then filtered with the Gaussian to obtain the local threshold parameters. The effect of the filter is
an averaging over a long line of pixels. The idea behind the process is that values close to the
pixel will have a greater effect on the threshold value than values farther away. The threshold is
adjusted locally to capture the BLOB boundary since an abrupt local change will have an
immediate effect on the threshold value. Since the threshold is set according to the vertical
surroundings it is not affected by intensity changes in the image caused by possibly lower
energies and narrower BLOBs in the upper frequency bands. Unlike the block processing of the
histogram equalization, as described in section 6.4, in this case we wish to have a very
discriminative and distinctive difference between pixels; we therefore desire a more local
environment that might contain a smaller number of pixels but will still allow us to discriminate
between the borders of the BLOBs.
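A local threshold of this kind can be sketched in Python as a vertical Gaussian-weighted average. The sketch rests on several assumptions that are not stated in the text: the function names are hypothetical, the weights are renormalized near the top and bottom borders, and a pixel is marked when it exceeds its own local threshold.

```python
import math


def vertical_gaussian_kernel(length=200, sigma=100.0):
    """Normalized Gaussian weights for a vertical filter of the given length."""
    center = (length - 1) / 2
    weights = [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(length)]
    total = sum(weights)
    return [w / total for w in weights]


def local_threshold_map(image, kernel):
    """Gaussian-weighted vertical average around every pixel."""
    height, width = len(image), len(image[0])
    half = len(kernel) // 2
    thresholds = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = norm = 0.0
            for i, k in enumerate(kernel):
                yy = y + i - half
                if 0 <= yy < height:  # ignore taps that fall outside the image
                    acc += k * image[yy][x]
                    norm += k
            thresholds[y][x] = acc / norm
    return thresholds


def apply_local_threshold(image, kernel):
    """Binary mask that is 1 where a pixel exceeds its local threshold."""
    thresholds = local_threshold_map(image, kernel)
    return [[1 if p > t else 0 for p, t in zip(row, t_row)]
            for row, t_row in zip(image, thresholds)]
```

Since the average is taken along the frequency axis only, a narrow high-energy band stands out against its vertical surroundings regardless of its absolute energy level.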
6.11 Function Description
We describe the functions that are used in this work, their input, output and objectives:
specwb
Objective: Function creates, displays, saves and returns a speech spectrogram for a given speech
segment.
Input:
1. Speech signal.
2. Sampling rate.
3. Parameters such as narrowband/wideband etc.
Output: Image Spectrogram
recon
Objective: Function segments the wideband speech spectrogram and determines the borders of
the different resulting BLOBs.
Input: Wideband speech spectrogram.
Output:
1. Segmented speech spectrogram.
2. Segmented mask image.
3. Borders of BLOBs.
Manager
Objective: Create, display and save segmented speech spectrograms and segmentation masks
of specific phonemes.
Input: Text file from the Input file directory containing paths to different speech segments of a
specific phoneme.
Output: Segmented speech spectrograms and segmentation masks saved in the directory
Results under the specific phoneme subdirectory. The phoneme can be identified by a left and
right border vertical line.
GetPhoneme
Objective: Extract from TIMIT speech segments that contain a specific phoneme. The phoneme is
centered and the speech segment is 1 second long.
Input: Phoneme name as appears in TIMIT.
Output: In the Database directory, a subdirectory containing all files extracted. Files that are
either at the end or the beginning of a sentence are marked with a different name. The name
indicates the sentence the phoneme was taken from and the location of the phoneme in the
complete sentence.
linez
Objective: Function detects and marks lines and outputs a vector of the median distance at
selected locations.
Input: Narrowband spectrogram.
Output: A vector containing the median distance between the lines. The spectrogram with the
lines and a mask containing only the lines are displayed.
readTIMIT
Objective: Function reads a .wav file from the TIMIT database and converts the file from
big-endian byte ordering to little-endian byte ordering.
Input:
1. Name of .wav file.
2. Full path in the TIMIT directory.
Output:
A .mat speech file in little-endian format.
fuzzybrain
Objective: Create triangular membership functions for the different vowels that can be used in
the fuzzy logic toolbox graphical user interface (GUI). This function also automatically constructs
the rules of the system and creates input and output variables.
Inputs: Through parameters: the estimated frequency location of vowel phonemes for the first
three formants.
Output: Saves a .fis file that can be later opened using the fuzzy logic GUI tool (the Matlab
instruction fuzzy).
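For reference, a triangular membership function of the kind created by fuzzybrain can be written as follows. This Python sketch is not the fuzzybrain code: the function name and parameter names are hypothetical, and the break points must satisfy left < peak < right.

```python
def triangular_membership(x, left, peak, right):
    """Degree of membership of x in a triangular fuzzy set.

    The membership rises linearly from 0 at `left` to 1 at `peak` and falls
    linearly back to 0 at `right`.
    """
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)
```

For example, with illustrative break points of 300, 500 and 700 Hz, a formant estimate of 400 Hz has a membership of 0.5 and an estimate of 500 Hz has a membership of 1.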
6.12 Results
The algorithm was tested on different speech samples from the TIMIT database. Results were
robust; the segmentation performed well on different speakers and different sentences. The
TIMIT database contains female and male speakers from 7 different dialect regions in the United
States. The speakers repeated sentences especially designed at SRI, MIT and TI to exemplify
different speech characteristics such as accent, coarticulation and different combinations of
phonemes. Orthographic transcription and time-aligned phonetic transcription are included for
every sentence.
Our first example uses the meaningful sentence “However, the litter remained,
augmented by several dozen lunchroom suppers”; bold face fonts indicate the 1 second of speech
(here “several dozen”) that is displayed in fig. 6.2 (a). We obtain good segmentation for the first and second
formants for all voiced phonemes. For the third and fourth formants, the segmentation misses part
of the phoneme /r/ but the general direction is preserved. In this example, all four formants are
well aligned and ready to be recognized by an appropriate system.
Our second example presents a more challenging case. We examine a different
section of the same sentence: “However, the litter remained, augmented by several dozen
lunchroom suppers.” As seen in fig. 6.2 (b), the algorithm has difficulty in segmenting the second
and third formants of /r/. Since these formants are very close together, it is hard to distinguish
between them and to segment them as different objects. In addition, high energy levels for f3
make it more difficult to separate it from f2. Another difficulty arises in the identification of the
nasal /m/. The low spectral density makes it hard to segment the phoneme correctly. The low
spectral density is caused by a spectral zero that reduces the second formant. One other problem
is small segments that do not represent a formant but still appear in the image (false positives).
This problem can be solved by changing the constant in step 8 of the algorithm. However,
changing the constant to accept only stronger energies would result in losing some real formants.
In general, the algorithm manages to perform well when the formant energies are strong.
As a last example, we choose: “Don't ask me to carry an oily rag like that.” As seen in
fig. 6.2 (c), we obtain several cases in which formants are segmented into more than one BLOB.
Even though oversegmentation was tackled in the watershed algorithm, we still have remainders
in the form of small binary objects that can cause problems in the recognition stage. On the other
hand, as was also noticed in the previous examples, BLOBs associated with f1 sometimes relate
to more than one phoneme. This phenomenon occurs in some cases for the higher formants as
well.
In order to check the algorithm behavior in a more systematic fashion, we test the results
on multiple runs. The criterion by which we judge the performance is the fuzzy variable ‘Grade’
that takes the values {‘Perfect’, ‘Good’, ‘Average’, ‘Below Average’, ‘Poor’} for the segmentation
results. We assign a number to each descriptor, where ‘Perfect’ takes the highest value of 5, ‘Poor’
takes the lowest value of 1 and it is believed that ‘Average’, which takes the value of 3, contains
enough information for automatic recognition. We select 10 phonemes and run 20 different tests
for each phoneme, in total 200 different speech segments. The results, including the mean and
variance of the visual measurements, are presented in Table 6.1.
Figure 6.2: Algorithm Results for different cases. Panels: (A) “several dozen”; (B) “the litter remained”; (C) Hamming window, “an oily rag”; (D) Hann window, “an oily rag”.
Scale: Horizontal Axis 0 – 1 sec; Vertical Axis 0 – 4,000 Hz.
After examining the algorithm we see that in general the algorithm obtains good
segmentation results for the formant energy levels throughout different phonemes. The algorithm
obtains better segmentation results when the phoneme duration is longer, since more information
is available and since our segmentation algorithm searches for large objects; we tend to miss
small concentrations of energy. In general, the vowels are well recognized. The nasal /m/ and the
glide /l/ have lower segmentation results due to the difficulty of tracking diagonal lines in the
spectrogram. It is possible to extend the algorithm to detect diagonal lines either by adding a
tracking procedure such as a Kalman filter or by a diagonal-line-emphasizing median filter. The
semivowel /w/ is better segmented on short-duration phonemes since there is a higher energy
concentration that enables better segmenting of f3 and f4.
Rows: test number; columns: phoneme; grades from 1 (‘Poor’) to 5 (‘Perfect’).
Test # aa ae eh ux ow oy r l m w
1 5 5 5 5 5 5 5 3 5 1
2 5 5 4 4 4 4 3 4 5 2
3 2 3 3 2 2 3 3 1 4 5
4 4 5 2 5 4 5 5 5 3 3
5 5 4 5 5 5 3 5 5 2 4
6 2 5 5 5 3 5 5 2 5 5
7 5 4 4 5 5 4 3 5 1 5
8 4 5 5 5 5 3 3 1 2 2
9 5 4 2 2 4 4 4 1 1 3
10 3 3 5 5 5 5 5 1 4 3
11 5 4 1 5 5 4 4 2 1 3
12 5 5 3 4 3 4 5 1 1 4
13 5 4 5 5 4 3 5 5 3 5
14 5 4 4 3 2 3 3 3 1 2
15 2 2 5 3 3 3 4 5 2 2
16 5 4 5 4 5 2 5 5 2 5
17 4 3 5 5 5 3 5 2 1 2
18 2 5 5 5 3 5 4 5 2 2
19 4 5 5 5 5 5 4 5 3 5
20 5 5 5 5 4 4 4 3 5 5
Mean 4.1 4.2 4.15 4.35 4.05 3.85 4.2 3.2 2.65 3.4
Variance 1.46 0.8 1.61 1.08 1.10 0.87 0.69 2.91 2.34 1.94
Table 6.1: Results of a visual inspection. The grades describe the accuracy of the segmentation
algorithm for each phoneme.
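The last two rows of Table 6.1 can be reproduced from the grades. The following Python sketch recomputes them for two of the columns; the variable names are chosen for this example, and the variance is the sample variance (division by n - 1), which matches the tabulated values.

```python
from statistics import mean, variance

# Grades for tests 1 to 20, copied from the 'aa' and 'l' columns of Table 6.1.
grades_aa = [5, 5, 2, 4, 5, 2, 5, 4, 5, 3, 5, 5, 5, 5, 2, 5, 4, 2, 4, 5]
grades_l = [3, 4, 1, 5, 5, 2, 5, 1, 1, 1, 2, 1, 5, 3, 5, 5, 2, 5, 5, 3]

for name, grades in (("aa", grades_aa), ("l", grades_l)):
    # statistics.variance computes the sample variance (n - 1 in the denominator).
    print(name, round(mean(grades), 2), round(variance(grades), 2))
```

The large variance of the /l/ column reflects the fact that its grades cover the full range from 1 to 5.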
We demonstrate the subjective selection of grades according to our fuzzy variable by
displaying a few speech spectrograms that correspond to different grades. We select the glide /l/,
which as seen in table 6.1 takes all 5 possible values. Once again the bold part of the sentence
indicates the speech segment that is actually displayed in the speech spectrogram, where the
glide /l/ is centered at 0.5 seconds from the start. The numbers in brackets indicate the row in
the table as well as the line location of the filenames in the text file inputl.txt in the Input directory.
We have the same sentence pronounced by different speakers.
1. Don't ask me to carry an oily rag like that. (9)
Figure 6.3: Example of a grade 1 score.
Justification: We miss the second formant almost completely due to the sharp rise and relatively
low energy. Since the main characteristic of the glide lies in the gliding second formant, we
give a grade of ‘1’ to this result.
2. Don't ask me to carry an oily rag like that. (6)
Figure 6.4: Example of a grade 2 score.
Justification: We have a clear first and fourth formant; the second formant is well represented but
the third formant is missing. Recognition would be difficult (though not impossible); therefore
a grade of ‘2’ is given.
3. She had your dark suit in greasy wash water all year. (1)
Figure 6.5: Example of a grade 3 score.
Justification: We have all four formants. We can infer from the mask the direction and location
of the second formant. The third formant can also be well estimated. Recognition should be
possible; therefore a grade of ‘3’ is given.
4. She had your dark suit in greasy wash water all year. (2)
Figure 6.6: Example of a grade 4 score.
Justification: We have all four formants. The second formant is well described. The location of the
third and fourth formants can be easily understood. Therefore a grade of ‘4’ is given.
5. Don't ask me to carry an oily rag like that. (5)
Figure 6.7: Example of a grade 5 score.
Justification: All four formants are well characterized. The segmentation algorithm captures
the formants well and automatic speech recognition is possible.
We see that, due to different accents and energy distributions, we obtain significantly
different segmentation results. Since our algorithm is tuned to follow horizontal lines and
shapes, we have a problem with the glides in particular and with rising and falling frequencies
in general. In common ASR, a linear model of a rise or fall is given by the delta coefficients
(first derivative). A second-order model also uses the delta-delta coefficients, which approximate
the second derivative and amount to fitting the data to a parabolic function. Although, with some
adjustments, the algorithm can be adapted to capture non-horizontal movements, we see that
even at this early, non-commercial stage of the algorithm we obtain results that are believed to
be sufficient for an automatic speech recognition system in most cases. We obtain very good
recognition results when the phoneme duration is short. We attribute this to the relatively
high concentration of energy and to the mild glide in the second formant, which are better suited to
an algorithm that aims at segmenting horizontal lines.
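As a hedged illustration of the delta and delta-delta coefficients mentioned above, the following Python sketch estimates the slope of a formant track with the regression formula commonly used for delta features. The function name `delta`, the window half-width `n_max` and the example track are assumptions made for illustration; they are not part of the segmentation algorithm described in this thesis.

```python
def delta(track, n_max=2):
    """First-order regression (delta) coefficients of a 1-D track.

    d[t] = sum_{n=1..N} n * (x[t+n] - x[t-n]) / (2 * sum_{n=1..N} n^2)
    Edge samples are replicated (an assumption) so the output has the
    same length as the input.
    """
    denom = 2 * sum(n * n for n in range(1, n_max + 1))
    last = len(track) - 1
    out = []
    for t in range(len(track)):
        acc = 0.0
        for n in range(1, n_max + 1):
            ahead = track[min(t + n, last)]
            behind = track[max(t - n, 0)]
            acc += n * (ahead - behind)
        out.append(acc / denom)
    return out


# Hypothetical second-formant track (Hz per frame) rising by 50 Hz per frame.
f2 = [1000.0 + 50.0 * t for t in range(10)]
d1 = delta(f2)   # delta: local slope estimate
d2 = delta(d1)   # delta-delta: local curvature estimate
```

Away from the edges, a linearly rising track gives a constant delta and a zero delta-delta, which is the behaviour a glide detector could look for.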
6.13 Suggestions for Improving the Algorithm
Several changes and modifications can be made in an attempt to improve the algorithm. We
present a few ideas for improvement together with their advantages and disadvantages.
1. Implementing a tracking algorithm that would help to identify formants that are rising or
falling in frequency.
Advantages: Would help in detecting nasals, liquids and glides, where low-energy
formants tend to rise or fall. A Kalman filter can be used to track the energy peaks
and associate rising or falling energy-dense regions with an identified BLOB.
Disadvantages: The tracking algorithm would require an additional stage of BLOB
detection to avoid the effect of the uncertainty principle. Adding regions that were not
previously detected as BLOBs would also introduce false positive results, since lower
spectral density regions would no longer be ignored.
2. Adaptive change of the local threshold value to include formants in low energy
phonemes.
Advantages: Since some phonemes contain less energy than others, lowering the
threshold would help in segmenting formants that otherwise would have been
partially segmented or completely ignored.
Disadvantages: Changing the threshold would require additional processing and would
introduce areas in the image that do not belong to any particular formant (false
positive detections).
3. Using additional information from the spectrogram to help in the recognition process.
Changes in energy level, peaks of energy and other parameters play an important part in
the recognition process. Simply taking a binary image that ignores these parameters
reduces the amount of information available to our recognition system.
Advantages: Recognition rates can be increased using additional information.
Disadvantages: More processing and storing of information is required. Rules need to
be constructed to deal with the additional information in an unambiguous manner.
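The tracking idea in the first suggestion can be sketched as follows. This is only a minimal constant-velocity Kalman filter under stated assumptions: the state is the peak frequency and its per-frame change, the input is one energy-peak frequency per frame (or None when no peak was detected), and the noise variances `q` and `r` are hypothetical tuning values. It is not the tracking stage of the thesis, which was not implemented.

```python
def track_formant(peaks, q=1.0, r=100.0):
    """Constant-velocity Kalman filter over per-frame peak frequencies (Hz).

    peaks[0] must be a number; later entries may be None, in which case
    the filter coasts on its prediction for that frame.
    q: process noise variance added to both states (assumed value).
    r: measurement noise variance (assumed value).
    """
    f, v = float(peaks[0]), 0.0                    # state: frequency, slope
    p00, p01, p10, p11 = r, 0.0, 0.0, 1e4          # state covariance
    estimates = [f]
    for z in peaks[1:]:
        # Predict: x <- F x, P <- F P F^T + Q, with F = [[1, 1], [0, 1]].
        f = f + v
        p00, p01, p10, p11 = (p00 + p01 + p10 + p11 + q,
                              p01 + p11,
                              p10 + p11,
                              p11 + q)
        if z is not None:
            # Update with the measured peak, H = [1, 0].
            s = p00 + r
            k0, k1 = p00 / s, p10 / s
            y = z - f
            f, v = f + k0 * y, v + k1 * y
            p00, p01, p10, p11 = ((1 - k0) * p00,
                                  (1 - k0) * p01,
                                  p10 - k1 * p00,
                                  p11 - k1 * p01)
        estimates.append(f)
    return estimates


# Hypothetical rising formant (40 Hz per frame) with two missed frames.
peaks = [1000.0 + 40.0 * t for t in range(20)]
peaks[10] = None
peaks[11] = None
estimates = track_formant(peaks)
```

During the two missed frames the filter extrapolates along the learned slope, which is how low-energy rising or falling regions could be linked to an already identified BLOB.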
6.14 Summary
A robust algorithm for speech spectrogram segmentation was presented. By using morphological
image processing techniques, we are able to obtain reliable segmentation of formants in most
cases. The algorithm performs well for all voiced phonemes and has better segmentation results
than previous techniques; however, difficulties occur when formant frequencies are close together
or when there is a low-energy formant that is rapidly rising or falling in frequency. Some
suggestions, such as changing the threshold level, were made to improve or tune the algorithm.
These results can be used as input to an automatic speech recognition system or in other general
uses of speech spectrograms. We believe that a spectrogram-based speech
recognition system can complement an existing recognition system by incorporating human
expert knowledge into the recognition task.
Chapter 7
Conclusion
This chapter summarizes the contributions presented in this thesis. We first give conclusions and
an overview of the results obtained from the different algorithms that were developed. We
conclude the chapter with ideas for future work and development.
7.1 Review of the Work and Logical Deductions
In the previous chapters we laid the foundations of three major theories: Speech Recognition,
Morphological Image Processing and Fuzzy Logic. We have seen that it is possible to combine
these methods in order to design a scheme that can perform automatic speech recognition. The
close relationship between Fuzzy Logic and Mathematical Morphology helped in understanding
how to link these two theories. Justifications for using Mathematical Morphology to
perform the image segmentation were presented. The main purpose of this thesis was to
segment an image spectrogram, and for that reason a segmentation algorithm was devised.
The segmentation algorithm performs well in most cases. We have seen how, by
choosing a Hamming window instead of a Hann window, we can obtain better segmentation
results, since we have better separation between adjacent frequencies and since the time
dependencies between pixels in the image spectrogram can be compromised to some extent. We
concluded that experts can extract information from wideband speech spectrograms and saw the
difference between narrowband and wideband spectrogram images, the information they contain
and the different shapes that require different morphological operators to extract information from
the images. In section 5.4 we saw the mathematical properties of the median operation. We also
used the median to smooth the narrowband and the wideband images as an initial stage before
applying stronger segmentation or extraction techniques such as the watershed transform or the
morphological thinning operator. The watershed transform is effective in segmenting noisy
images, in particular in cases in which different target objects partially occlude one another.
We obtain a labeled mask image, and in most cases each BLOB directly corresponds to a formant
of a particular phoneme. In some cases we obtain several small BLOBs that belong to the same
formant; however, this should not pose a particularly difficult problem, since most of the information
that we need for the recognition task is still maintained.
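The median smoothing step mentioned above can be illustrated with a short Python sketch. It is a plain sliding-window median written for clarity, not the implementation used in this work; the function name `median_smooth`, the window radius and the toy image are assumptions, and the window is simply clipped at the image borders.

```python
from statistics import median


def median_smooth(image, radius=1):
    """Median filter with a (2*radius+1)-square window, clipped at borders."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [image[rr][cc]
                      for rr in range(max(r - radius, 0), min(r + radius + 1, rows))
                      for cc in range(max(c - radius, 0), min(c + radius + 1, cols))]
            out[r][c] = median(window)
    return out


# Toy "spectrogram": a two-row horizontal band (a formant-like shape)
# plus one isolated bright pixel acting as salt noise.
clean = [[0] * 7, [0] * 7, [9] * 7, [9] * 7, [0] * 7, [0] * 7]
noisy = [row[:] for row in clean]
noisy[0][3] = 9
smoothed = median_smooth(noisy)
```

In this toy case the isolated pixel is removed while the horizontal band survives, which is the property that makes the median a reasonable first stage before the watershed transform or thinning.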
We used a fuzzy variable to quantify the results of the algorithm. This method of
debugging ensured that our algorithm would be optimized to yield results that are as close as
possible to the information extracted by an expert who visually inspects a
speech spectrogram. The vowels are segmented satisfactorily in
most cases. In most cases all of the first four formants are well detected
and recognized. Sometimes we miss a formant due to low energy levels. Another common problem
is a formant that breaks up into smaller pieces in the segmentation process due to lower energy
levels in its center area. The glides present a greater challenge, since they require
tracking formants that either rise or fall in frequency. Also, their energy levels are in
general lower than those of the vowels. We are able to obtain satisfactory results in most cases
for the glides, although these results are poorer than those obtained for the vowels.
7.2 Ideas for Future Work
We have managed to perform segmentation that works well in most cases. However, performing
equalization that takes as input the time and energy properties of each phoneme and is
adjusted to a specific speaker or accent group could yield even better results. A simple
equalization can use a gamma correction, as explained in section 6.5, to change the luminance
and therefore the darkness of the different energy sections in the speech spectrogram. Another
improvement to the segmentation algorithm could be fine-tuning the algorithm and adjusting it to
handle different phoneme types. By constraining the number of BLOBs we expect to segment over
a prespecified time period, we can significantly improve the results, since we will reduce the
number of small BLOBs and merge BLOBs that are actually constituents of the same formant.
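A gamma correction of the kind referred to above can be sketched in a few lines of Python. The power-law form below is the standard one; the function name `gamma_correct`, the 8-bit gray-level range and the example values are assumptions made for illustration and may differ from the exact mapping used in section 6.5.

```python
def gamma_correct(image, gamma, max_level=255):
    """Apply out = max_level * (in / max_level) ** gamma to every pixel."""
    return [[round(max_level * (pixel / max_level) ** gamma) for pixel in row]
            for row in image]


# Hypothetical gray levels of three spectrogram pixels.
row = [[0, 64, 255]]
brighter = gamma_correct(row, 0.5)   # gamma < 1 raises mid-range levels
darker = gamma_correct(row, 2.0)     # gamma > 1 lowers mid-range levels
```

The end points of the gray scale are left unchanged, so only the relative weight of the intermediate energy levels is altered; a speaker-specific equalization could choose `gamma` per speaker or accent group.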
In order to perform automatic speech recognition using the results of our algorithm, we will
need to construct an expert system. The expert system would rely on one or more experts in
spectrogram reading and would take the form of IF-THEN rules. The rules would also have an
aggregation method that specifies how to perform combination, intersection or negation. In
addition, we will need to extract a feature vector from our segmented image. Since we are not
restricted to the information we have in the mask (segmented binary image), we can use the mask
as a reference and extract more accurate information related to a specific BLOB from the original
spectrogram. The feature vector can include parameters such as the length of the BLOB, its
frequency band location, its slope, measured by a first- or second-order approximation in the same
way that delta coefficients are measured, and its energy. The feature vector would be
constructed according to the rules laid out in the expert system. The membership functions for
each element of the feature vector can be either manually designed or trained by a neural
network. Finally, a thorough regression would be performed to analyze the system performance in
different cases.
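To make the proposed feature vector more concrete, the following Python sketch extracts the four parameters named above from a single BLOB. It is a hypothetical illustration only: the function name `blob_features`, the dictionary keys, the array layout (`mask[f][t]`, frequency bin by frame) and the use of a first-order least-squares fit for the slope are all assumptions, and the BLOB is assumed to be non-empty.

```python
def blob_features(mask, spectrogram, frame_step_s, bin_width_hz):
    """Feature vector of one BLOB.

    mask[f][t] is 1 inside the BLOB and 0 elsewhere; spectrogram[f][t]
    holds the energy of frequency bin f at frame t (assumed layout).
    """
    bins_per_frame = {}
    for f, row in enumerate(mask):
        for t, inside in enumerate(row):
            if inside:
                bins_per_frame.setdefault(t, []).append(f)
    frames = sorted(bins_per_frame)
    # Mean frequency bin of the BLOB in each frame it occupies.
    centers = [sum(bins_per_frame[t]) / len(bins_per_frame[t]) for t in frames]
    n = len(frames)
    mean_t = sum(frames) / n
    mean_c = sum(centers) / n
    var_t = sum((t - mean_t) ** 2 for t in frames)
    cov = sum((t - mean_t) * (c - mean_c) for t, c in zip(frames, centers))
    slope_bins_per_frame = cov / var_t if var_t else 0.0  # first-order fit
    energy = [spectrogram[f][t] for t in frames for f in bins_per_frame[t]]
    return {
        "duration_s": n * frame_step_s,
        "center_hz": mean_c * bin_width_hz,
        "slope_hz_per_s": slope_bins_per_frame * bin_width_hz / frame_step_s,
        "mean_energy": sum(energy) / len(energy),
    }


# Hypothetical BLOB rising by one bin per frame over four frames.
mask = [[1 if f == t else 0 for t in range(4)] for f in range(4)]
spec = [[2.0] * 4 for _ in range(4)]
features = blob_features(mask, spec, frame_step_s=0.01, bin_width_hz=100.0)
```

Reading the energy from the original spectrogram rather than from the binary mask follows the remark above that the mask serves only as a reference.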
We expect that the information contained in speech spectrograms, as interpreted by a
well-trained human expert familiar with acoustics, phonetics, linguistics and speech production
models, can yield better recognition rates than current methods that do not incorporate human
knowledge in their algorithms. Information extracted by this method can also be combined with
existing systems to improve their results. A clear advantage of the proposed system is its intuitive
rule-based design and the possibility of incorporating the knowledge of more than one expert. A
possible solution for creating the set of rules is a wiki-based system that would allow experts
around the world to contribute their experience.
Appendix
Justifications for Choosing a Triangular
Membership Function (TMF)
The MF serves as an upper bound on the possibility function. Since the possibility and necessity
functions never bound the probability function tightly, the tightest interval should be
obtained in order to keep as much information as possible (maximally specific possibility
function). We can regard the values in Table 2.1 as measurements that contain some error
according to some unknown probability distribution. We assume that these
measurements approximate the mode of the distribution of outcomes, and since we do not have any
additional information about the standard deviation, we can only draw conclusions about the
range of measurements associated with a certain confidence interval. For example, if we have
two consecutive frequency values corresponding to two different phonemes, we can assume that
most measurements associated with the lower frequency fall below the measurements of the
higher one.
A Truncated Triangular Possibility Distribution (TTPD) can be used to cover efficiently
normal, Laplacian, uniform or triangular distributions [40]. Since in this case no human
knowledge is used, we can approximate the error distribution as symmetrical and require an upper
bound of a TTPD. The TTPD serves as a family of maximally specific probability distributions that
enables upper bounds of probabilities of events to be computed. Since the triangular possibility
distribution (TPD) is the transform of the uniform distribution function, it serves as an upper
bound to all other pdfs.
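For readers who prefer code to prose, a triangular membership function can be written as below. This is a minimal sketch: the function name `triangular` and the three frequency values are hypothetical placeholders and are not taken from Table 2.1. Each phoneme's TMF is assumed to peak at its own measured value and to fall to zero at the values of its two neighbours.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 at left and right, 1 at peak, linear between."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical first-formant values (Hz) of three neighbouring phonemes.
low, mid, high = 150.0, 270.0, 390.0
at_peak = triangular(270.0, low, mid, high)     # full membership
halfway = triangular(330.0, low, mid, high)     # midway towards the neighbour
at_neighbour = triangular(390.0, low, mid, high)
```

With this construction a measurement lying midway between two phoneme values belongs to each neighbouring class with degree 0.5, which reflects the assumption of a symmetrical error distribution.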
References
[1] Saha, S. “The new age electronic patient record system”; Biomedical Engineering Conference,
1995., Proceedings of the 1995 Fourteenth Southern. pp. 134-137, 7-9 Apr. 1995.
[2] Bobbert D.; Wolska M. “Dialog OS: An Extensible Platform for Teaching Spoken Dialogue
Systems”, Decalog 2007: Proceedings of the 11th Workshop on the Semantics and Pragmatics of
Dialogue, pp. 159–160. Trento, Italy, Jun. 2007.
[3] http://mindstorms.lego.com/
[4] http://www.hrichina.org/
[5] K. Fujita et al. “A New Digital TV Interface Employing Speech Recognition”. IEEE Trans. on
Consumer Electronics, Vol. 49, Issue 3, pp. 765 – 769, Aug. 2003.
[6] D. O'Shaughnessy, Speech Communication, Addison-Wesley Publishing Company, 1987.
[7] S. B. Davis and P. Mermelstein, "Comparison of Parametric Representations for Monosyllabic
Word Recognition in Continuously Spoken Sentences," IEEE Trans. Acoust., Speech, Signal
Processing, Vol. ASSP-28, pp. 357-366, Aug. 1980.
[8] Dudley H., The Vocoder, Bell Labs Record, Vol. 17, pp. 122-126, 1939.
[9] Renals S. et al. “Connectionist Probability Estimators in HMM Speech Recognition” IEEE
Trans. on Speech and Audio Processing, Vol. 2, No. 1, Part II, pp. 161-174, Jan. 1994.
[10] Juang, B.H.; Rabiner, L.R.; “Spectral representations for speech recognition by neural
networks-a tutorial”, Neural Networks for Signal Processing [1992] II., Proceedings of the 1992
IEEE-SP Workshop, pp. 214 – 222, Sep. 1992.
[11] Morgan, N.; Bourlard, H.A.; “Neural Networks for Statistical Recognition of Continuous
Speech”. Proceedings of the IEEE. Vol. 83, Issue 5, pp. 742 – 772, May 1995.
[12] Makhoul J. et al. “A Hybrid Segmental Neural Net/Hidden Markov Model System for
Continuous Speech Recognition” IEEE Trans. on Speech and Audio Processing. pp. 151-160,
Vol. 2, No. 1, Part 2, Jan. 1994.
[13] Lamel, L. F. and Zue, V.W. “An Expert Spectrogram Reader: A Knowledge-Based Approach
to Speech Recognition”. ICASSP, Vol. 11, pp. 558-561, 1986.
[14] Zue, V.W. and Cole, R.A. “Experiments on Spectrogram Reading”. ICASSP, Vol. 4,
pp. 116-119, 1979.
[15] Hemdal, J.F. and Lougheed, R.M. “Morphological Approaches to the Automatic Extraction of
Phonetic Features”. IEEE Trans. on Signal Processing, Issue 2, pp. 490-497, Feb. 1991.
[16] Roger Y. Tsai, “A Versatile Camera Calibration Technique for High-Accuracy 3D Machine
Vision Metrology Using Off-the-shelf TV Cameras and Lenses”. IEEE Journal of Robotics and
Automation, Vol. RA-3, No. 4, Aug. 1987.
[17] Wikipedia, The Free Encyclopedia. "Phoneme." 17 Aug 2007, 21:36 UTC. Wikimedia
Foundation, Inc. 24 Aug. 2007
<http://en.wikipedia.org/w/index.php?title=Phoneme&oldid=151909142>.
[18] Proakis et al. “Average location of formants in English vowels”. Discrete-Time Processing of
Speech Signals, Macmillan Publishing Company, 1993. ISBN 0023283017.
[19] Leprettre, B. and Martin, N. “Extraction of pertinent subsets from time-frequency
representations for detection and recognition purposes,” Signal Process., Vol. 82, No. 2, pp. 229–
238, Feb. 2002.
[20] Hory, C., Martin, N. and Chehikian, A. “Spectrogram Segmentation by Means of Statistical
Features for Non-Stationary Signal Interpretation” IEEE Trans. on Signal Processing, Vol. 50, No.
12, Dec 2002.
[21] Serra, J. “Lecture Notes on Morphological Operators”
http://cmm.ensmp.fr/~serra/cours/T34.pdf
[22] Matheron, G. and Serra, J. “The Birth of Mathematical Morphology”. Jun 1998.
[23] http://homepages.inf.ed.ac.uk/rbf/HIPR2/index.htm
[24] Dougherty, E. R., “Mathematical Morphology in Image Processing”. CRC Press. Sep. 1992.
ISBN: 0824787242.
[25] www.mw.com, Merriam-Webster online dictionary.
[26] Digabel, H. and Lantuéjoul, C. “Iterative Algorithms” in Proc. 2nd European Symp.
Quantitative Analysis of Microstructures in Material Science, Biology and Medicine, Caen, France,
Oct. 1977. J.L. Chermant, Ed. Stuttgart, West Germany: Riederer Verlag, pp. 85-99. 1978.
[27] Beucher, S. and Lantuéjoul, C. “Use of watersheds in contour detection” in Proc. Int.
Workshop Image Processing, Real-Time Edge and Motion Detection/Estimation, Rennes, France,
Sep. 1979.
[28] Vincent, L. and Soille, P. “Watersheds in Digital Spaces: An Efficient Algorithm Based on
Immersion Simulations”. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.
13, No. 6, Jun. 1991.
[29] Klir, G. J. and Folger, T. A. Fuzzy sets, uncertainty, and information. Englewood Cliffs,
N.J. Prentice Hall, 1988, ISBN: 0133459845.
[30] Zadeh, L.A. “Fuzzy logic”; Computer, Vol. 21, Issue 4, pp. 83 – 93, Apr. 1988.
[31] Imasaki, N. et al. “Elevator Group Control System Tuned by a Fuzzy Neural Network Applied
Method” Fuzzy Systems, 1995. International Joint Conference of the Fourth IEEE International
Conference on Fuzzy Systems and The Second International Fuzzy Engineering Symposium.,
Proceedings of 1995 IEEE International Conference on; Vol. 4, pp. 1735 – 1740, 20-24 Mar. 1995.
[32] Dubois, D. and Prade, H. and Ughetto, L., "A New Perspective on Reasoning with Fuzzy
Rules". International Journal of Intelligent Systems, No. 5, Vol. 18, pp. 541-563, 2003.
[33] Rabiner L. R. “Use of Autocorrelation Analysis for Pitch Detection”. IEEE Trans. on
Acoustics, Speech, and Signal Processing, Vol. 25, No. 1, Feb. 1977.
[34] Schafer, R.W. and Rabiner, L.R. “System for Automatic Formant Analysis of Voiced Speech,”
J. Acoust. Soc. Amer., Vol. 47, pp. 634-648, Feb. 1970.
[35] Jain, A.K. “Fundamentals of Digital Image Processing”. Prentice Hall, 1988. ISBN:
0133361659.
[36] Kapelner, A., Lee, P.P., Holmes, S. “An Interactive Statistical Image Segmentation and
Visualization System” Medical Information Visualisation - BioMedical Visualisation, 2007. MediVis
2007. International Conference on pp. 81 – 86, Jul. 2007.
[37] Zhu, W. et al. "Local Region Based Medical Image Segmentation Using J-Divergence
Measures"; Engineering in Medicine and Biology Society. IEEE-EMBS 2005. 27th Annual
International Conference of the. pp. 7174 – 7177, 2005.
[38] DePiero, F.W., Trivedi M.M. "Real-time range image segmentation using adaptive kernels
and Kalman filtering"; Pattern Recognition, Proceedings of the 13th International Conference on
Vol. 3, pp. 573 – 577, 25-29 Aug. 1996.
[39] Porat, B. “A Course in Digital Signal Processing”. John Wiley and Sons. Oct. 1996.
ISBN 0-471-14961-6.
[40] Dubois, Foulloy, Mauris, Prade. “Probability-Possibility Transformations, Triangular Fuzzy
Sets and Probabilistic Inequalities”, Institut de Recherche en Informatique de Toulouse IRIT;
Reliable Computing. Vol. 10, pp. 273-297, 2004.
Summary
This thesis focuses on automatic speech recognition of continuous English speech by
means of image processing of speech spectrograms and fuzzy logic. In this work we first
present a theoretical overview of fuzzy logic and mathematical morphology and a short
overview of speech phonetics; we continue by presenting an algorithm for pitch
estimation and conclude with a novel approach to the segmentation of speech
spectrograms.
The theory of fuzzy logic plays an important part in systems that are based on
expert knowledge. Fuzzy logic is similar to mathematical morphology, and these two
different theories are used to tackle the same problem of speech recognition via the
image spectrogram. A fuzzy logic variable is used in the debugging of the segmentation
algorithm since, by definition, we cannot decide whether the recognition is performed
well according to a well-formulated metric. In fact, if we had such a metric, we would
have used it to perform the segmentation in the first place.
An algorithm that attempts to estimate the pitch from a narrowband spectrogram
was developed. The algorithm applies mathematical morphology methods to extract
the pitch harmonics, thins them to a single-pixel width and then calculates the distance
between these lines. We explain the decisions that led to the different algorithmic steps
and the obstacles that prevent the algorithm from being used reliably. The pitch
algorithm failed to give correct results because of the difficulty of determining the line
distances and because in some cases we actually miss some pitch harmonics. We
propose a computationally involved approach that should produce more reliable
estimates.
An algorithm to segment speech spectrograms is developed. The algorithm uses
morphological image processing techniques to perform the segmentation. Our objective
is to create a prototype of a building block that can later be used within an automatic
speech recognition application. Automatic speech recognition is still considered an
open research topic, and current techniques do not exploit the vast knowledge that
humans possess about speech. Although it is almost impossible to explain to others
what our auditory system accepts as input and what makes us distinguish between
different words, we can easily explain the input to our visual system and, for example,
how we distinguish between different written words. Motivated by recent improvements
in computing together with advances in the fields of mathematical morphology
(morphological image processing), Optical Character Recognition (OCR) and fuzzy
logic, we establish a scheme for dissecting a speech spectrogram, treating it as if it
were some kind of text written in a foreign language. By segmenting the speech
spectrogram we allow an expert in spectrogram reading to write down rules based on
knowledge and acquired experience. In this way we expect to extract information that
is otherwise hard to extract or is simply missed by conventional recognition techniques.
An important transform used in the detection process is the watershed transform. The
watershed transform is a morphology-based technique that makes it possible to
segment objects in an image, even when the objects partially occlude one another.
We conclude the thesis by presenting ideas for improving the segmentation
algorithm so that it produces better results for low-energy and accelerating (rising or
falling) formant movements. We also give some ideas for future research that would
take the results produced by the segmentation and incorporate them into a
fuzzy-logic-based expert system.
Chapter 1
Introduction
Automatic speech recognition has been an active research topic for the past four
decades. The main objective of automatic speech recognition is to convert a
speech segment into an interpretable text message without the need for human
intervention. Many different algorithms and schemes based on different
mathematical paradigms have been proposed in an attempt to improve
recognition rates. Since the speech recognition problem is complex, under
certain circumstances recognition rates are far from optimal. In addition, other
constraints, such as computational complexity and time constraints, come into
play in the design and implementation of a working product. Computer hardware
and software have improved significantly in terms of speed, memory, cost and
availability, which has allowed more sophisticated and computationally
demanding algorithms to be implemented even on inexpensive, low-power
handheld electronic devices. However, we prefer algorithms with low
computational and memory requirements, since they can be implemented easily
and at lower cost. Owing to improvements in algorithms and hardware,
automatic speech recognition has become more accessible and available.
Automatic speech recognition is still an open research topic, in which
improvements and changes are constantly being made in the hope of obtaining
better recognition rates.
In this work we propose a different approach to automatic speech recognition
based on the theories of mathematical morphology and fuzzy logic. This new
approach is unusual in the sense that it involves three main fields of research,
namely speech theory with an emphasis on automatic speech recognition, image
processing with an emphasis on image segmentation, and evidence theory with
an emphasis on fuzzy logic, decision making and combination of evidence.
Before delving into the worlds of phonology and morphological image
processing, we present an overview of automatic speech recognition and of
several commonly used techniques that attempt to solve this formidable task.
1.1 Automatic Speech Recognition (ASR)
Automatic Speech Recognition (ASR) is the process of converting human speech
into written text. Many advances have been made in this field, leading to systems
with high recognition rates; however, many unsolved problems remain, due in
particular to four main parameters:
1. Vocabulary size: the minimal vocabulary size is two words (Yes/No). Another
common size consists of the 10 digits. Telephone conversations or news reports
require a vocabulary of about 60,000 words, which makes recognition more
difficult. Journals and professional texts use an even more esoteric vocabulary,
which often requires the use of specialized dictionaries.
2. Fluency: it is easier to recognize isolated words than continuous speech. Read
speech is usually clearer than conversational speech. In addition, isolated words
give the system more time to process the results and show less variability
between speakers.
3. Channel and noise: a laboratory microphone in a laboratory environment has
less noise interference and signal distortion than speech picked up by a cellular
phone microphone in a moving car with the windows open. A low signal-to-noise
ratio can cause severe interference and can degrade the performance of a
system significantly. Signal distortion and a high compression rate can make
some words sound alike.
4. Accents and other speaker-specific parameters: children have a different
frequency range than adults. Foreign accents will degrade the performance of a
system, as will uncommon accents for which it was not designed. Usually an
ASR system can be adapted to a specific speaker in order to reduce the error
rate.
ASR is used in numerous applications. We present several examples of common
uses of ASR:
1. Call centers provide the required information according to customers'
requests. Call centers usually operate with a limited vocabulary related to their
field of operation. The purpose of the ASR system is to assist the customer
service representative and to allow the representative to fulfil his or her role more
efficiently.
2. Dictation allows speech to be transferred into text almost entirely hands-free.
An example of a popular software package in common use is "Nuance's Dragon
Naturally Speaking". The user rarely has to intervene to correct the dictation
results, thanks to the high performance of the software.
3. Medical transcription is becoming more widespread; regulations and practical
needs require that patient records be converted into digital text. The Optical
Character Recognition (OCR) task of converting handwriting into text is
particularly challenging when it comes to a physician's handwriting. Currently,
medical transcription in the field is done by professionals or by ASR systems
whose vocabulary is enriched with medical terms [1].
4. Mobile phones and other communication devices use ASR for faster dialing.
The phone contains ASR software that can recognize names and dial the
corresponding number. Usually the system is trained for a particular speaker, has
a restricted vocabulary and operates under computational, memory and power
constraints.
5. Robotics uses ASR for guidance and instructions. A robot can be guided to
accomplish a specific task; for example, "Dialog OS" [2] is a software package
that can be used to enable a "Lego Mindstorms RCX" unit to understand speech
[3]. The user can then guide the Lego robot to perform a certain preprogrammed
task.
6. Security applications such as automatic tapping of telephone lines using
specific keywords. For example, Nortel is developing an ASR system together
with Qinghua University in an attempt to monitor every civilian in China [4].
7. Automatic translation by converting speech into text and then translating the
resulting text.
8. Pronunciation assessment in language learning applications. A
second-language speaker must make a special effort to pronounce different
words and phrases correctly. The ASR system indicates to the user whether the
pronunciation is clear.
9. Home appliances use ASR to provide a friendlier human interface. Fujita et al.
[5] propose a speech remote control for digital TV that uses 15 buttons instead of
the 70 required to operate a multi-channel TV. Difficult commands are made
simple since, instead of complex programming of buttons, the user simply speaks
the desired command.
Chapter 3
Mathematical Morphology
Mathematical morphology is a theoretical model that justifies, in particular,
operators that are useful in image processing applications. Based on lattice
theory, set theory and axioms, mathematical morphology provides solutions for
the manipulation of specific geometric objects in different topological spaces.
Basic operators are used to construct more complex operators, while all
operators rely on structuring elements that serve as geometric building blocks for
the different morphological operators. We begin with a historical overview of
mathematical morphology. We continue with the axioms and basic definitions
that exemplify the importance of the different morphological operators. We review
the morphological operators, starting with the most fundamental operators of
dilation and erosion, and finally we conclude with the algorithmic scheme of the
watershed transform, which is used for image segmentation. We end this chapter
by showing the close relationship between fuzzy logic and mathematical
morphology and by concluding on the importance and relevance of mathematical
morphology to automated spectrogram reading.
3.2.2 Erosion
Erosion is a basic morphological operation. A kernel is used in which a 'logical
1' typically marks the structuring element (SE). The objective is to find the
objects in the image that exactly match the SE. Each time, we take the center
point of the kernel as the reference point and output a 'logical 1' if the SE is
entirely contained in the image with respect to this reference point. The
operation is equivalent to a logical 'AND' in the binary case. The output image
contains all the points at which the SE is entirely contained in the original
image. The image is thus eroded with respect to the SE. Consequently, erosion is
an anti-extensive operator.
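The scan described above can be sketched in a few lines of plain Python. This is
only an illustration under stated assumptions: the function name erode, the
representation of the SE as a list of (dy, dx) offsets, the cross-shaped SE and
the rule that pixels outside the image count as background are all choices made
here, not taken from the thesis.

```python
def erode(image, se):
    """Binary erosion; `se` lists (dy, dx) offsets from the reference point."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            # Logical AND: every SE offset must land on a foreground pixel.
            # Offsets that fall outside the image are treated as background.
            out[y][x] = int(all(
                0 <= y + dy < rows and 0 <= x + dx < cols
                and image[y + dy][x + dx] == 1
                for dy, dx in se
            ))
    return out


# Hypothetical 4-connected cross used as the structuring element.
CROSS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]

image = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
eroded = erode(image, CROSS)  # only the center pixel of the square survives
```

Because this SE contains its own reference point, no output pixel can be set
where the input pixel was not, which is the anti-extensive property stated above.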
3.2.4 Dilation
Dilation is the dual operator of erosion. In general, however, dilation is not
the inverse of erosion, unless the image is unchanged by opening with the SE.
Likewise, erosion is in general not the inverse of dilation, unless the image is
unchanged by closing with the SE. Dilation follows the same scan of the image by
a kernel in which the SE is marked by a 'logical 1'. However, a logical 'OR' is
used in the binary case, and a '1' is written to the output image if at least one
pixel in the image matches the SE. Consequently, dilation is an extensive
operator.
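As a companion sketch, the only change with respect to erosion is that the
logical 'AND' becomes a logical 'OR'. The names dilate and CROSS are
illustrative, and the SE is assumed to be symmetric so that its reflection can
be ignored.

```python
def dilate(image, se):
    """Binary dilation; `se` lists (dy, dx) offsets (assumed symmetric)."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            # Logical OR: one foreground pixel under the SE is enough.
            out[y][x] = int(any(
                0 <= y + dy < rows and 0 <= x + dx < cols
                and image[y + dy][x + dx] == 1
                for dy, dx in se
            ))
    return out


CROSS = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]

image = [[0] * 5 for _ in range(5)]
image[2][2] = 1                 # a single foreground pixel
dilated = dilate(image, CROSS)  # grows into the shape of the cross
```

The single pixel grows into a copy of the SE, and every original foreground
pixel is kept, which illustrates why dilation is extensive for an SE that
contains its reference point.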
3.2.6 Opening
Open(Im,SE) = Im ◦ SE
Opening is performed by dilating the result of an erosion of an image with the
same structuring element (SE). Closing is the dual operator of opening. Opening
is normally used to open up an image, to clean it of salt noise, and to separate
objects of different types (lines in specific directions, specific objects). A
simple erosion of the image shrinks the bright pixels of the genuine objects; it
is therefore necessary to perform a dilation to reconstruct these objects. Salt
noise disappears after the erosion and does not grow back under the subsequent
dilation. Normally, after opening, features smaller than the structuring element
are removed, while features larger than the SE remain roughly the same. The
morphological opening presented here is idempotent. It is also anti-extensive and
increasing. In algebra, every operation that is increasing, anti-extensive and
idempotent is called an opening.
The following example demonstrates how opening can be used to reduce
salt-and-pepper noise:
Figure 3.4: The circles image corrupted by salt-and-pepper noise (left) and
after opening.
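A toy version of this example can be sketched as follows. The 7 by 7 test image,
the 3 by 3 square SE and the helper names are assumptions made for illustration;
they do not reproduce the circles image of Figure 3.4. Note that opening removes
only the bright (salt) specks; dark (pepper) specks would need a closing.

```python
# Hypothetical 3 by 3 square structuring element, as (dy, dx) offsets.
SQUARE = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def _scan(image, se, combine):
    """Slide the SE over the image and combine the hits with all() or any()."""
    rows, cols = len(image), len(image[0])
    return [[int(combine(
                0 <= y + dy < rows and 0 <= x + dx < cols
                and image[y + dy][x + dx] == 1
                for dy, dx in se))
             for x in range(cols)] for y in range(rows)]


def erode(image, se):
    return _scan(image, se, all)


def dilate(image, se):
    return _scan(image, se, any)


def opening(image, se):
    # Opening = erosion followed by dilation with the same SE.
    return dilate(erode(image, se), se)


# A 4 by 4 object plus one isolated "salt" pixel.
image = [[0] * 7 for _ in range(7)]
for y in range(1, 5):
    for x in range(1, 5):
        image[y][x] = 1
clean = [row[:] for row in image]
image[6][6] = 1

opened = opening(image, SQUARE)
```

The object is larger than the SE and is restored by the dilation, while the
isolated pixel is removed by the erosion and cannot grow back. Applying the
opening a second time changes nothing, which is the idempotence noted above.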
3.3 The Distance Transform
The distance transform gives the distance of each foreground pixel from the
nearest boundary according to the chosen connectivity. The ridges of the distance
transform can be regarded as local maxima, and they represent the skeleton of the
object in question. An image can be eroded by a structuring element that is a
disk of a given radius r simply by removing the pixels whose value is less than
r. Consecutive erosions in this case are equivalent to alpha-cuts of the distance
transform at intervals of size r.
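The equivalence between thresholding the distance transform and eroding can be
checked on a small example. The sketch below assumes 4-connectivity, so the
distance is the city-block distance and the "disk" becomes a diamond; the
two-pass algorithm and all function names are illustrative choices, not the
implementation used in the thesis.

```python
INF = float("inf")


def distance_transform(image):
    """City-block distance from each pixel to the nearest background pixel."""
    rows, cols = len(image), len(image[0])
    dist = [[INF if image[y][x] else 0 for x in range(cols)]
            for y in range(rows)]
    # Forward pass: propagate distances from the top and left neighbours.
    for y in range(rows):
        for x in range(cols):
            if y > 0:
                dist[y][x] = min(dist[y][x], dist[y - 1][x] + 1)
            if x > 0:
                dist[y][x] = min(dist[y][x], dist[y][x - 1] + 1)
    # Backward pass: propagate distances from the bottom and right neighbours.
    for y in range(rows - 1, -1, -1):
        for x in range(cols - 1, -1, -1):
            if y < rows - 1:
                dist[y][x] = min(dist[y][x], dist[y + 1][x] + 1)
            if x < cols - 1:
                dist[y][x] = min(dist[y][x], dist[y][x + 1] + 1)
    return dist


def alpha_cut(dist, level):
    """Keep the pixels whose distance value is at least `level`."""
    return [[int(d >= level) for d in row] for row in dist]


def erode_cross(image):
    """Erosion by the 4-connected cross; outside pixels count as foreground."""
    rows, cols = len(image), len(image[0])
    offsets = [(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)]
    return [[int(all(
                not (0 <= y + dy < rows and 0 <= x + dx < cols)
                or image[y + dy][x + dx] == 1
                for dy, dx in offsets))
             for x in range(cols)] for y in range(rows)]


# A 5 by 5 object surrounded by a one-pixel background border.
image = [[int(1 <= y <= 5 and 1 <= x <= 5) for x in range(7)]
         for y in range(7)]
dist = distance_transform(image)
once = erode_cross(image)
twice = erode_cross(once)
```

In this example one erosion matches the cut at level 2 and two consecutive
erosions match the cut at level 3, in line with the statement above.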
Chapter 4
Fuzzy Logic
4.2 Fuzzy Logic versus Boolean Logic
Boolean logic, together with crisp set theory and crisp probability theory, is
inadequate for dealing with the imprecision, uncertainty and complexity of the
real world. The limitation of crisp (Boolean) theories is that they leave no room
for ambiguity and ignorance. If we have prior information about a set of events,
we would like a model that lets us use this knowledge. On the other hand, if we
have no information about other events, modeling them with a uniform distribution
would not always be a good choice. Fuzzy Logic was developed to handle these
kinds of situations and to let us easily write, and later adapt along the way,
logical rules that are the result of expert knowledge.
For example, suppose we would like to know whether a person is tall. We could
decide on a fixed threshold of 1.70 meters and consider every candidate whose
height is greater than or equal to 1.70 meters to be tall. The threshold for a
given population can be determined intuitively, or from the median, the mean or
other statistics. The obvious problem with this method is that two people who
differ by 1 centimeter may be classified into two different groups. To overcome
the crisp boundaries of the set "tall" and its complement "not tall", we
construct a fuzzy membership function that assigns values ranging from zero to
one to a person's height. A person taller than 1.90 meters will be considered
tall in most cases (the population of basketball players is a good
counterexample), while a person shorter than 1.50 meters will not be considered
tall. We assign values to the extreme cases: '1' to represent that the person is
clearly in the set of tall people and '0' to represent that the person is clearly
not a member of the set of tall people. All other heights are assigned an
intermediate value. In effect, the crisp (Boolean) variable "tall" has been
converted into a fuzzy variable that can take values in the interval [0,1].
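The contrast between the crisp threshold and the fuzzy membership function can
be sketched as follows. The 1.50, 1.70 and 1.90 meter values come from the text;
the linear ramp between the two extremes and the function names are assumptions
made for illustration, since the text only says that intermediate heights
receive intermediate values.

```python
def tall_crisp(height_m, threshold=1.70):
    """Boolean membership: tall if and only if the threshold is reached."""
    return height_m >= threshold


def tall_membership(height_m, low=1.50, high=1.90):
    """Fuzzy membership in 'tall': 0 below `low`, 1 above `high`,
    and (by assumption) a linear ramp in between."""
    if height_m <= low:
        return 0.0
    if height_m >= high:
        return 1.0
    return (height_m - low) / (high - low)


# Two people who differ by 1 centimeter:
crisp_pair = (tall_crisp(1.69), tall_crisp(1.70))
fuzzy_pair = (tall_membership(1.69), tall_membership(1.70))
```

The crisp test puts the two people in different groups, whereas their fuzzy
memberships differ only slightly.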
The previous example required converting a single crisp variable into a fuzzy
variable. The conversion of a crisp problem into a fuzzy one is called
fuzzification. We now examine a person's weight to determine whether the person
is thin. Instead of a single threshold we now have two thresholds, since a person
can weigh more or less than the weight range in which a person is considered
thin. Note that there is a dependence between a person's height and weight, in
the sense that we would expect a thin, tall person to weigh more than a thin but
shorter person. We can construct the membership functions of the weight variable
as a function of the fuzzy variable "tall" and create a type-2 fuzzy set. We can
also quantize the height and let each subrange have a different membership
function. For example, we can quantize the height into the following fuzzy
variables: {Below Average, Average, Tall, Very Tall}.
Another important concept in Fuzzy Logic is that of linguistic hedges. For
example, we can apply the linguistic hedge "very" to the fuzzy set "tall" to
create the fuzzy set "very tall". Other examples of hedges are 'slightly',
'extremely', 'rather', 'more or less', etc. We can then interpret the hedge in
some predefined way. We may choose that the hedge "very" corresponds to squaring
the membership function. Since membership values lie between zero and one,
squaring them lowers the membership in the set in a nonlinear way. Depending on
the problem, we have the flexibility to choose different functions for different
hedges as long as we remain consistent with our definitions.
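A hedge is just a function applied to a membership value. The sketch below
implements "very" as squaring, as suggested in the text; the square root used
for "more or less" is a common convention that is assumed here and is not taken
from the thesis.

```python
def very(mu):
    """Hedge 'very': square the membership value (concentration)."""
    return mu ** 2


def more_or_less(mu):
    """Hedge 'more or less': square root (an assumed, commonly used choice)."""
    return mu ** 0.5


tall = 0.8                 # hypothetical membership of a person in 'tall'
very_tall = very(tall)     # lower than 0.8: 'very tall' is harder to satisfy
```

Values of exactly 0 and 1 are left unchanged by both hedges, so the clear-cut
cases stay clear-cut while the intermediate ones are reshaped.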
Another example of the importance of Fuzzy Logic and fuzzy set theory is the
"student's dilemma". The first implication of the student's dilemma is "the more
I study, the more I know". The second implication is "the more I know, the more I
forget". Using crisp deduction, the student then infers that "the more I study,
the more I forget", which can lead to the conclusion that studying decreases
knowledge. However, by using Fuzzy Logic operators to link the different
premises, we would conclude that the amount of knowledge forgotten is negligible
compared with the knowledge gained. Instead of a crisp threshold that, once
triggered, produces a certain outcome, we obtain a value for the forgotten
knowledge that should be low relative to the value of the knowledge learned.
Chapter 6
Automatic Spectrogram Segmentation Algorithm
6.1 Introduction
In this chapter we present an algorithm that segments the phonemes in a speech
spectrogram effectively. We review previous work and give the motivation for the
automatic reading of speech spectrograms. We continue with detailed descriptions
of the main algorithmic ingredients, followed by an explanation of the
segmentation algorithm. After presenting and analyzing the results, we suggest
different ways to adapt the algorithm so that it can handle different scenarios.
We close the chapter with conclusions.
6.3 Algorithm Description
1. Apply a median filter, the first time with a 3 by 20 rectangle and then 4 more
times with a horizontal line of 20 pixels.
2. Apply a 2D Gaussian window (Gabor filter).
3. Smooth with a 2D Wiener filter. The local mean and variance are estimated in a
16 by 16 neighborhood around each filtered pixel.
4. Apply a local threshold to (3).
5. Apply a global threshold to (3).
6. Combine the results of (4) and (5) using a logical 'OR'.
7. Dilate with a disk as the structuring element in order to disconnect thin
lines and eliminate small areas in the image.
8. Use morphological connectivity to discard small sections that contain fewer
than 40 pixels or that have a maximum width of less than 20 pixels.
9. Run a Watershed Transform algorithm with 8-connectivity.
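Steps 4 to 6 and step 8 can be sketched in plain Python as shown below. The
thesis does not define the threshold rules in this summary, so the sketch
assumes that the local threshold compares each pixel with the mean of its
neighborhood and that the global threshold compares it with a fixed level. All
function names and the window and offset parameters are hypothetical; only the
40-pixel and 20-pixel limits and the 8-connectivity are taken from the list
above.

```python
from collections import deque


def global_threshold(img, level):
    """Step 5 (assumed rule): 1 where the pixel exceeds one fixed level."""
    return [[int(v > level) for v in row] for row in img]


def local_threshold(img, half=1, offset=0.0):
    """Step 4 (assumed rule): 1 where the pixel exceeds its local mean."""
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            window = [img[j][i]
                      for j in range(max(0, y - half), min(rows, y + half + 1))
                      for i in range(max(0, x - half), min(cols, x + half + 1))]
            out[y][x] = int(img[y][x] > sum(window) / len(window) + offset)
    return out


def logical_or(a, b):
    """Step 6: combine the two binary masks pixel by pixel."""
    return [[p | q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def discard_small(mask, min_pixels=40, min_width=20):
    """Step 8: drop 8-connected components that are too small or too narrow."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    out = [[0] * cols for _ in range(rows)]
    for y0 in range(rows):
        for x0 in range(cols):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue, component = deque([(y0, x0)]), []
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            xs = [x for _, x in component]
            width = max(xs) - min(xs) + 1
            if len(component) >= min_pixels and width >= min_width:
                for y, x in component:
                    out[y][x] = 1
    return out


# Toy mask: a 2 by 25 horizontal bar (kept) and one stray pixel (discarded).
mask = [[0] * 30 for _ in range(5)]
for y in (1, 2):
    for x in range(25):
        mask[y][x] = 1
mask[4][29] = 1
kept = discard_small(mask)
```

The width test favors long horizontal shapes, which matches the emphasis on
horizontal formant tracks elsewhere in this chapter.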
[Flow chart: image spectrogram; median 3 by 20 (set cnt = 1); median 1 by 20,
repeated with cnt = cnt + 1 until cnt = 4; local 2D Wiener filter; local
threshold and global threshold combined by a logical OR; dilation (disk as
structuring element); discard small connected sections; Watershed Transform;
binary mask.]
Figure 6.1: Algorithm flow diagram
The algorithm uses both local and global processing techniques. Smoothing and
thresholding are performed at the local and the global level. Local smoothing
uses a locally estimated mean and variance for a 2D Wiener filter, whereas the
global smoothing procedures (a Gaussian window, median filtering and image
dilation) are applied with global parameters. The Watershed Transform is applied
to the entire image, since the interaction between the different objects in the
image plays an important part in the segmentation process.
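To make the role of the local mean and variance concrete, here is a minimal
sketch of an adaptive Wiener filter in a common textbook (Lee-type) form. It is
an assumption that the thesis uses this exact formulation. The function name,
the window half-width and the border handling (windows are simply truncated at
the image edge) are illustrative; the thesis specifies a 16 by 16 neighborhood,
whereas the toy example uses a 3 by 3 one.

```python
def local_wiener(img, half=1, noise=None):
    """Adaptive Wiener smoothing from a locally estimated mean and variance."""
    rows, cols = len(img), len(img[0])
    mean = [[0.0] * cols for _ in range(rows)]
    var = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            window = [img[j][i]
                      for j in range(max(0, y - half), min(rows, y + half + 1))
                      for i in range(max(0, x - half), min(cols, x + half + 1))]
            m = sum(window) / len(window)
            mean[y][x] = m
            var[y][x] = sum((v - m) ** 2 for v in window) / len(window)
    if noise is None:
        # With no noise estimate supplied, use the average local variance.
        noise = sum(map(sum, var)) / (rows * cols)
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            denom = max(var[y][x], noise)
            # Gain near 0 in flat regions (smooth), near 1 on strong detail.
            gain = max(var[y][x] - noise, 0.0) / denom if denom > 0 else 0.0
            out[y][x] = mean[y][x] + gain * (img[y][x] - mean[y][x])
    return out


flat = [[10.0] * 5 for _ in range(5)]
spike = [row[:] for row in flat]
spike[2][2] = 20.0                       # one isolated bright pixel
smoothed = local_wiener(spike, noise=1000.0)
```

With a large noise estimate the gain is zero everywhere, so the output reduces
to the local mean and the isolated spike is pulled towards its neighbors.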
6.12 Results
The algorithm was tested on different speech samples from the TIMIT database.
The results were robust; the segmentation performed well across different
speakers and different sentences. The TIMIT database contains female and male
speakers from 7 regions of the United States with different dialects. The
speakers read sentences specially designed at SRI, MIT and TI to exemplify
different speech characteristics such as accent, coarticulation and different
phoneme combinations. The orthographic transcription and the time-aligned
phonetic transcription are included for each sentence.
Our first example uses the meaningful sentence "However, the litter remained,
augmented by several dozen lunchroom suppers"; the emphasized words ("several
dozen", panel A of Figure 6.2) indicate the 1 second of speech that is shown for
this example in Fig. 6.2(a). We obtain good segmentation of the first and second
formants for all voiced phonemes. For the third and fourth formants, the
segmentation misses some parts of the /r/ phonemes, but the general direction is
preserved. In this example, each of the four formants is well aligned and ready
to be recognized by a suitable system.
Our second example presents a more challenging case. We examine a different
section of the same sentence: "However, the litter remained, augmented by
several dozen lunchroom suppers." As shown in Fig. 6.2(b), the algorithm has
difficulty segmenting the second and third formants of /r/. Since these formants
are very close together, it is hard to distinguish between them and to segment
them as different objects. In addition, the high energy of F3 makes it harder to
separate from F2. Another difficulty arises in identifying the nasal /m/. Its low
spectral density makes it hard to segment the phoneme correctly. The low spectral
density is caused by a spectral zero that reduces the second formant. Another
problem is small segments that do not represent a formant but still appear in the
image (false positives). This problem can be solved by changing the constant in
step #8 of the algorithm. However, changing the constant to accept only stronger
energies would result in the loss of some true formants. In general, the
algorithm performs well when the formant energies are strong.
As a last example we choose: "Don't ask me to carry an oily rag like that." As
shown in Fig. 6.2(c), we obtain several cases in which formants are segmented
into more than one BLOB. Although over-segmentation was addressed in the
Watershed Transform algorithm, we still have leftovers in the form of small
binary objects that can cause problems in the identification stage. On the other
hand, as was also noted in the previous examples, the BLOBs associated with F1
sometimes span more than one phoneme. In some cases this phenomenon also occurs
for the higher formants.
In order to verify the behavior of the algorithm more systematically, we examine
the results over multiple tests. The criterion by which we judge performance is a
fuzzy grade variable that takes the values {'perfect', 'good', 'average', 'below
average', 'poor'} for the segmentation results. We assign a number to each
descriptor, where 'perfect' takes the highest value of 5, 'poor' takes the lowest
value of 1, and 'average', which takes the value of 3, is believed to contain
enough information for automatic identification. We choose 10 phonemes and run 20
different tests for each phoneme, for a total of 200 different speech segments.
The results, including the mean and variance of the visual scores, are presented
in Table 6.1.
[Four spectrogram panels: A, "several dozen"; B, "the litter remained"; C, "an
oily rag" (Hamming window); D, "an oily rag" (Hann window).]
Figure 6.2: Algorithm results for different cases
After examining the algorithm we see that, in general, it obtains good
segmentation results for strong formants across different phonemes. The algorithm
obtains better segmentation results when the phoneme duration is longer, since
more information is available. Because our segmentation algorithm looks for large
objects, we tend to miss small concentrations of energy. In general, the vowels
are segmented well. The nasal /m/ and the glide /l/ have poorer segmentation
results because of the difficulty of tracking diagonal lines in the spectrogram.
The algorithm can be extended to detect diagonal lines by adding a tracking
procedure such as a Kalman filter, or by using a median filter that emphasizes
diagonal lines. The semivowel /w/ is segmented better in short-duration phonemes,
since there is a higher concentration of energy that allows better segmentation
of F3 and F4.
                                    Phoneme
Test #     aa     ae     eh     ux     ow     oy     r      l      m      w
1          5      5      5      5      5      5      5      3      5      1
2          5      5      4      4      4      4      3      4      5      2
3          2      3      3      2      2      3      3      1      4      5
4          4      5      2      5      4      5      5      5      3      3
5          5      4      5      5      5      3      5      5      2      4
6          2      5      5      5      3      5      5      2      5      5
7          5      4      4      5      5      4      3      5      1      5
8          4      5      5      5      5      3      3      1      2      2
9          5      4      2      2      4      4      4      1      1      3
10         3      3      5      5      5      5      5      1      4      3
11         5      4      1      5      5      4      4      2      1      3
12         5      5      3      4      3      4      5      1      1      4
13         5      4      5      5      4      3      5      5      3      5
14         5      4      4      3      2      3      3      3      1      2
15         2      2      5      3      3      3      4      5      2      2
16         5      4      5      4      5      2      5      5      2      5
17         4      3      5      5      5      3      5      2      1      2
18         2      5      5      5      3      5      4      5      2      2
19         4      5      5      5      5      5      4      5      3      5
20         5      5      5      5      4      4      4      3      5      5
Mean       4.1    4.2    4.15   4.35   4.05   3.85   4.2    3.2    2.65   3.4
Variance   1.46   0.8    1.61   1.08   1.10   0.87   0.69   2.91   2.34   1.94
Table 6.1: Results of a visual inspection. The grades describe the accuracy of
the segmentation algorithm for each phoneme.
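The summary rows of the table can be recomputed from the individual grades. The
sketch below copies the 'aa' and 'm' columns from Table 6.1 and uses the Python
standard library; the dictionary layout is an illustrative choice. The reported
variances match the sample variance (division by n - 1), which is what
statistics.variance computes.

```python
from statistics import mean, variance

# Grades copied from two columns of Table 6.1 (tests 1 to 20).
scores = {
    "aa": [5, 5, 2, 4, 5, 2, 5, 4, 5, 3, 5, 5, 5, 5, 2, 5, 4, 2, 4, 5],
    "m":  [5, 5, 4, 3, 2, 5, 1, 2, 1, 4, 1, 1, 3, 1, 2, 2, 1, 2, 3, 5],
}

# Per-phoneme (mean, sample variance) of the visual grades.
summary = {phoneme: (mean(grades), variance(grades))
           for phoneme, grades in scores.items()}
```

For 'aa' this gives a mean of 4.1 and a variance of about 1.46, and for 'm' a
mean of 2.65 and a variance of about 2.34, in agreement with the table.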
We illustrate the subjective choice of grades according to our fuzzy variable by
showing some speech spectrograms that correspond to the different grades. We
choose the glide /l/ which, as seen in Table 6.1, receives all 5 possible values.
Again, the bold part of the sentence indicates the speech segment that is
actually shown in the spectrogram, in which the glide /l/ is centered 0.5 seconds
from the start. The numbers in parentheses indicate the row of the table as well
as the line at which the file name is located in the text file inputl.txt in the
input directory. The same sentence is spoken by different speakers.
1. Don't ask me to carry an oily rag like that. (9)
Figure 6.3: Example of a grade 1 score.
Justification: We miss the second formant almost entirely because of its sharp
rise and relatively low energy. Since the main characteristic of the glide lies
in its second formant, we give this result the grade '1'.
2. Don't ask me to carry an oily rag like that. (6)
Figure 6.4: Example of a grade 2 score.
Justification: We have a clear first and fourth formant; the second formant is
well represented but the third formant is missing. Identification would be
difficult (though not impossible); hence the grade '2' was given.
3. She had your dark suit in greasy wash water all year. (1)
Figure 6.5: Example of a grade 3 score.
Justification: We have all four formants. From the mask we can infer the
direction and location of the second formant. The third formant can also be
estimated well. Identification should be possible; in this case a grade of '3'
was therefore given.
4. She had your dark suit in greasy wash water all year. (2)
Figure 6.6: Example of a grade 4 score.
Justification: We have all four formants. The second formant is well described.
The location of the third and fourth formants can be easily inferred.
Consequently the grade '4' was assigned.
5. Don't ask me to carry an oily rag like that. (5)
Figure 6.7: Example of a grade 5 score.
Justification: All four formants are well characterized. The algorithmic
segmentation captures the formants, so automatic speech recognition becomes
possible.
We see that, because of different accents and energy distributions, we obtain
noticeably different segmentation results. Since our algorithm is designed to
follow horizontal lines and shapes, we have a problem with glides, and in
particular with frequencies that rise and fall. A linear model for a rise or a
fall is well known in ASR as the "delta coefficients" (first derivative). A
second-order model also uses the delta-delta coefficients, which approximate the
second derivative and amount to fitting the data to a parabolic function.
Although with some adjustments it is possible to adapt the algorithm to capture
non-horizontal movements, we see that even at this early, non-commercial stage of
the algorithm we obtain, in most cases, results that can be judged sufficient for
an automatic speech recognition system. We obtain very good identification
results when the duration of the phoneme is short. We attribute this to the
relatively high concentration of energy and to the gentle glide in the second
formant, which are better suited to an algorithm that aims to segment horizontal
lines.
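For reference, delta coefficients are commonly computed in ASR front ends with a
regression over neighboring frames, d_t = sum_{k=1..N} k (c_{t+k} - c_{t-k}) /
(2 sum_{k=1..N} k^2). The sketch below applies that standard formula to a
hypothetical formant track in Hz. The thesis does not say how such a model would
be attached to the segmentation, so the track, the function name and the edge
handling (repeating the first and last frame) are assumptions.

```python
def delta(track, n=2):
    """Regression-based delta coefficients over +/- n frames."""
    denom = 2 * sum(k * k for k in range(1, n + 1))
    last = len(track) - 1
    out = []
    for t in range(len(track)):
        # Frames beyond either end are replaced by the first or last frame.
        num = sum(k * (track[min(t + k, last)] - track[max(t - k, 0)])
                  for k in range(1, n + 1))
        out.append(num / denom)
    return out


# Hypothetical second-formant track rising by 10 Hz per frame.
track = [500 + 10 * t for t in range(7)]
slope = delta(track)          # first derivative (delta)
curvature = delta(slope)      # second derivative (delta-delta)
```

Away from the edges, the delta of this linear rise equals its slope of 10 Hz per
frame and the delta-delta is zero, as expected for a track with no curvature.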
6.14 Conclusion
A robust algorithm for segmenting speech spectrograms has been presented. By
using morphological image processing techniques, we can obtain reliable
segmentation of the formants in most cases. The algorithm performs well for all
voiced phonemes and gives better segmentation results than previous techniques;
however, difficulties arise when the formant frequencies are close together or
when there is a low-energy formant that rises or falls rapidly in frequency. Some
suggestions, such as changing the threshold level, were made to improve or tune
the algorithm. These results can be used as input to an automatic speech
recognition system or in other general uses of speech spectrograms. It is the
authors' belief that a spectrogram-based speech recognition system can complement
an existing recognition system by incorporating human expert knowledge into the
recognition task.
Chapter 7
Summary
7.1 Review of the Work and Logical Deductions
In the previous chapters we laid the foundations of three main theories: speech
recognition, morphological image processing and Fuzzy Logic. We saw that these
methods can be combined to design a scheme that can perform automatic speech
recognition. The close relationship between Fuzzy Logic and mathematical
morphology helped us understand how to link these two theories. Justifications
were presented for using mathematical morphology to perform image segmentation.
The main goal of this thesis was to segment a spectrogram image, and for this
purpose a segmentation algorithm was designed.
The segmentation algorithm works well in most cases. We saw how, by choosing a
Hamming window instead of a Hann window, we can obtain better segmentation
results, since we get better separation between adjacent frequencies and because,
to some extent, the time dependencies between pixels in the spectrogram image can
be compromised. We concluded that experts can extract information from wideband
speech spectrograms, and we saw the difference between narrowband and wideband
spectrogram images, the information they contain, and the different shapes, which
require different morphological operators to extract the information from the
images.
In Section 5.4 we saw the mathematical properties of the median operation. We
also used the median to smooth the narrowband and wideband images as a first step
before applying stronger segmentation or extraction techniques such as the
Watershed Transform or the morphological "Thinning" operator. The Watershed
Transform is effective at segmenting noisy images, particularly in cases where
the different target objects partially occlude one another. We obtain a labeled
mask image, and in most cases each BLOB corresponds directly to a formant of a
particular phoneme. In some cases we obtain several small BLOBs that belong to
the same formant; however, this should not pose a particularly difficult problem,
since most of the information we need for the identification task is still
retained.
We used a fuzzy variable to measure the results of the algorithm. This debugging
method ensured that the algorithm is tuned to yield results that are as close as
possible to the information extracted by an expert who visually inspects a speech
spectrogram in order to extract information. In most cases the vowels are
segmented satisfactorily, and all of the first four formants are well detected
and identified. Sometimes a formant is missed because of its low energy. Another
common caveat is a formant that breaks into smaller pieces during segmentation
because of lower energy in its central region. Glides present a more challenging
case, since they require tracking formants that rise or fall in frequency. In
addition, their energy is generally lower than that of vowels. We can obtain
satisfactory results for glides in most cases, although these results are
inferior to those obtained for vowels.
7.2 Ideas for Future Work
We succeeded in performing a segmentation that works well in most cases. However,
an equalization step that takes as input the time and energy properties of each
phoneme, and that is tuned to a person or to a specific accent group, may help to
obtain even better results. A simple equalization can use a gamma correction, as
explained in Section 6.5, to change the luminance and hence the darkness of the
different energy regions in the speech spectrogram. Another improvement to the
segmentation algorithm would be to tune the algorithm and adjust it to different
phoneme types. By constraining the number of BLOBs we expect to segment over a
prespecified time period, that is, by reducing the number of small BLOBs and
merging BLOBs that are actually constituents of the same formant, we can improve
the results significantly.
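A gamma correction of the kind mentioned above can be sketched as a pointwise
power law on normalized pixel values. Section 6.5 is not reproduced in this
summary, so the formula below is the generic textbook form, and the 0 to 255
range, the function name and the sample values are assumptions.

```python
def gamma_correct(image, gamma, max_value=255.0):
    """Pointwise power law: out = max_value * (in / max_value) ** gamma."""
    return [[max_value * (v / max_value) ** gamma for v in row]
            for row in image]


# Hypothetical gray levels taken from a spectrogram image.
patch = [[0.0, 127.5, 255.0]]
darker = gamma_correct(patch, 2.0)     # gamma > 1 lowers the mid-tones
brighter = gamma_correct(patch, 0.5)   # gamma < 1 raises the mid-tones
```

The end points 0 and 255 are left unchanged, so only the relative darkness of
the intermediate energy regions is altered.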
In order to perform automatic speech recognition using the results of our
algorithm, we will need to build an expert system. The expert system would rely
on one or more experts in spectrogram reading and would take the form of IF/THEN
rules. The rules would also have an aggregation method that specifies how to
perform union, intersection or negation. In addition, we will need to extract a
feature vector from our segmented image. Since we are not limited to the
information in the mask (the segmented binary image), we can use the mask as a
reference and extract more precise information about a specific BLOB from the
original spectrogram. The feature vector can include parameters such as the
length of the BLOB, the location of its frequency band, and its slope,
approximated to first or second order. In the same way, delta coefficients and
energy strength can be measured. The feature vector would be built according to
the rules laid out in the expert system. The membership functions for each
element of the feature vector can be designed manually or trained by a neural
network. Finally, a full regression over different cases would be run to analyze
the performance of the system.
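As a rough illustration of what such IF/THEN rules might look like, the sketch
below uses the minimum, maximum and complement as the intersection, union and
negation operators. These are a common choice assumed here; the thesis leaves
the aggregation method open. The feature names, membership values and the two
rules are entirely hypothetical and do not come from an expert.

```python
def fuzzy_and(a, b):
    """Intersection (assumed: minimum)."""
    return min(a, b)


def fuzzy_or(a, b):
    """Union (assumed: maximum)."""
    return max(a, b)


def fuzzy_not(a):
    """Negation (assumed: standard complement)."""
    return 1.0 - a


# Hypothetical membership values extracted for one BLOB.
features = {"f2_rising": 0.8, "low_energy": 0.3, "long_duration": 0.6}

# Hypothetical rule 1: IF F2 is rising AND energy is NOT low THEN glide.
glide = fuzzy_and(features["f2_rising"], fuzzy_not(features["low_energy"]))

# Hypothetical rule 2: IF duration is long OR energy is NOT low THEN vowel.
vowel = fuzzy_or(features["long_duration"], fuzzy_not(features["low_energy"]))
```

Each rule yields a degree of support between 0 and 1 rather than a yes or no
answer, which is what allows the knowledge of several experts to be combined.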
We hope that the information contained in speech spectrograms, as interpreted by
human experts knowledgeable in acoustic models, phonetics, linguistics and speech
production, can yield better recognition rates than current methods that do not
incorporate human knowledge into their algorithms. The information extracted by
this method can also be combined with existing systems to improve their results.
A clear advantage of the proposed system is its intuitive rule-based design and
the possibility of incorporating the knowledge of more than one expert. One
possible solution for creating the rule set is a "wiki-based" system that would
let experts from different places around the world contribute their experience.
List of Figures
Fig. 3.1: Reference image of circles.
Fig. 3.2: The circles image before and after the final erosion.
Fig. 3.3: The circles image corrupted by white Gaussian noise, after applying the
Beucher gradient.
Fig. 3.4: The circles image corrupted by salt-and-pepper noise and after opening.
(*)
Fig. 3.5: Original image; application of a skeleton; pruning of the skeleton.
Fig. 3.6: Example of cleaning an image and extracting information from it using
morphological operators.
Fig. 4.1: Examples of common parametric membership functions.
Fig. 4.2: Sugeno fuzzy complement for different values of the lambda parameter.
Fig. 4.3: Yager fuzzy complement for different parameter values.
Fig. 5.1: Narrowband speech spectrogram.
Fig. 5.2: Narrowband speech spectrogram after line detection.
Fig. 5.3: Results of the pitch estimation algorithm.
Figure 6.1: Algorithm flow diagram. (*)
Figure 6.2: Algorithm results for different cases. (*)
Figure 6.3: Example of a grade 1 score. (*)
Figure 6.4: Example of a grade 2 score. (*)
Figure 6.5: Example of a grade 3 score. (*)
Figure 6.6: Example of a grade 4 score. (*)
Figure 6.7: Example of a grade 5 score. (*)
(*) The figure is included in the summary
List of Tables
Table 2.1: Mean formant frequency values for selected phonemes.
Table 6.1: Results of a visual inspection. The grades describe the accuracy of
the segmentation algorithm for each phoneme. (*)
(*) The table is included in the summary
Fig. 3.1: Reference image of circles.
Fig. 3.2: The circles image before (left) and after the final erosion.
Fig. 3.3: The circles image corrupted by white Gaussian noise, after applying the
Beucher gradient.
Fig. 3.5: Original image (top); application of a skeleton (middle); pruning of
the skeleton.
Fig. 3.6: Example of cleaning an image and extracting information from it using
morphological operators. (a) The original image corrupted by salt-and-pepper
noise. (b) A very noisy result after applying the skeleton operator. (c) Result
of cleaning (a) using the closing and opening operators. (d) Result of applying
the skeleton operator to (c). A much clearer skeleton is obtained.
Fig. 4.1: Examples of common parametric membership functions. (a) Sigmoid
membership function. (b) Pi membership function. (c) Z membership function. (d)
Triangular membership function.
Fig. 4.2: Sugeno fuzzy complement for different values of the lambda parameter.
Fig. 4.3: Yager fuzzy complement for different parameter values.
Fig. 5.1: Narrowband speech spectrogram.
Fig. 5.2: Narrowband speech spectrogram after line detection.
Fig. 5.3: Results of the pitch estimation algorithm.
Phoneme   f1 (Hz)   f2 (Hz)   f3 (Hz)
/i/       270       2290      3010
/I/       390       1990      2550
/E/       530       1840      2480
/@/       660       1720      2410
/a/       730       1090      2440
/c/       570       840       2410
/U/       440       1020      2240
/u/       300       870       2240
/A/       640       1190      2390
/R/       490       1350      1690
Table 2.1: Mean formant frequency values for selected phonemes.
collection of acquired knowledge and experience. We anticipate extracting in this way information that is otherwise either hard to extract or simply missed by conventional recognition techniques. An important transform used in the detection process is the Watershed Transform. The Watershed Transform is a morphologically based technique that allows segmenting objects in an image even when the objects partially occlude one another.
We conclude the thesis by presenting ideas for improving the segmentation algorithm so that it performs better in low-energy regions and on fast (rising or falling) formant movements. We also give some ideas for future research that would take the results produced by the segmentation and incorporate them within a Fuzzy Logic-based expert system.
Acknowledgments
First and foremost I would like to thank my thesis supervisor, Prof. Douglas O'Shaughnessy, who inspired and motivated this work and who, through his immense knowledge and dedicated mentoring, contributed to the creation of this thesis.
I would also like to thank the technical and administrative staff of INRS-EMT for their positive approach and willingness to help at all times. I am very grateful to Prof. Ioannis Psaromiligkos of McGill University, who provided me with a solid background and methodology. I also thank Mr. Liron Yatziv of Siemens Research in Princeton, NJ, who contributed his knowledge of biomedical image processing and especially of image segmentation techniques.
Last but not least, I would like to thank the many people I have met in the beautiful city of Montreal, and the many different life experiences I have encountered, for helping me set a crisp value in the Fuzzy Sets of “Quality of Life” and “Friendship”.
Contents
Chapter 1 Introduction
1.1 Automatic Speech Recognition (ASR)
1.2 Cepstral coefficients and Mel Frequency Cepstral Coefficients
1.3 Different Approaches to Automatic Speech Recognition
1.3.1 Hidden Markov Models (HMMs)
1.3.2 Neural Networks
1.3.3 Hybrid Systems
1.4 Drawbacks of existing automatic speech recognition systems
1.5 Image Processing Basic Concepts
1.6 Thesis Outline
Chapter 2 Fundamentals of Speech Theory and Time Frequency Analysis
2.1 Phoneme
2.2 Pitch
2.3 International Phonetic Alphabet (IPA)
2.4 Voiced and Unvoiced Speech
2.5 Different Phoneme Classes
2.5.1 Sonorant
2.5.2 Vowels
2.5.3 Approximants
2.5.4 Nasals
2.5.5 Taps/Flaps
2.5.6 Trills
2.5.7 Obstruents
2.6 Coarticulation
2.7 The DARPA TIMIT Speech Database
2.8 The Uncertainty Principle
2.9 Time Frequency Representation
2.10 Summary
Chapter 3 Mathematical Morphology
3.1 History of Mathematical Morphology
3.2 Useful Properties
3.2.1 Connectivity
3.2.2 Erosion
3.2.3 Ultimate Erosion
3.2.4 Dilation
3.2.5 Morphological Gradient
3.2.6 Opening
3.2.7 Closing
3.2.8 Hit and Miss
3.2.9 Thinning
3.2.10 Thickening
3.2.11 Pruning
3.3 Distance Transform
3.3.1 Continuous Case
3.3.2 Discrete Case
3.4 Skeleton
3.5 Watershed Transform
3.6 The relationship between Fuzzy Logic and Mathematical Morphology
3.7 Summary
Chapter 4 Fuzzy Logic
4.1 Introduction
4.2 Fuzzy Logic vs. Boolean Logic
4.3 Alpha cuts
4.4 DeFuzzification
4.5 Fuzzy Union
4.6 General Aggregation Operations
4.7 Dempster-Shafer (DS) Theory
4.7.1 Basic Probability Assignment (BPA)
4.7.2 Combining Evidence
4.8 Fuzzy Logic Toolbox
4.9 Vagueness and Ambiguity
4.10 Summary
Chapter 5 Pitch detection algorithm
5.1 Motivation
5.2 Theoretical Overview
5.3 Autocorrelation Method
5.4 Method of Aggregation
5.5 Suggested Algorithm
5.6 Algorithm Description
5.7 Results
5.8 Summary
Chapter 6 Automatic Spectrogram Segmentation Algorithm
6.1 Introduction
6.2 Overview
6.3 Algorithm Description
6.4 Adaptive Histogram Equalization
6.5 Gamma Correction
6.6 Window Selection
6.7 Connectivity Operations
6.8 Local vs. Global Threshold
6.9 Working with the TIMIT Database
6.10 Calculating the Local Threshold
6.11 Function Description
6.12 Results
6.13 Suggestions for Improving the Algorithm
6.14 Summary
Chapter 7 Conclusion
7.1 Review of the Work and Logical Deductions
7.2 Ideas for Future Work
Appendix Justifications for Choosing a Triangular Membership Function (TMF)
References
List of Figures
Figure 3.1: Circles benchmark image.
Figure 3.2: Circles before and after applying ultimate erosion.
Figure 3.3: Circles image corrupted by white Gaussian noise after applying Beucher gradient.
Figure 3.4: Circles image corrupted by Salt and Pepper noise and after opening.
Figure 3.5: Original image, applying a skeleton, pruning the skeleton.
Figure 3.6: Example of cleaning and extracting information from an image using morphological operators.
Figure 4.1: Examples of common parametric membership functions.
Figure 4.2: Sugeno fuzzy complement for different lambda parameters.
Figure 4.3: Yager fuzzy complement for different parametric values.
Figure 5.1: Narrowband Speech Spectrogram.
Figure 5.2: Narrowband Speech Spectrogram after line detection.
Figure 5.3: Results of the Pitch Estimation Algorithm.
Figure 6.1: Algorithm Diagram Flow.
Figure 6.2: Algorithm Results for different cases.
Figure 6.3: Example of a grade 1 score.
Figure 6.4: Example of a grade 2 score.
Figure 6.5: Example of a grade 3 score.
Figure 6.6: Example of a grade 4 score.
Figure 6.7: Example of a grade 5 score.
List of Tables
Table 2.1: Average formant frequency values for selected phonemes.
Table 6.1: Results of a visual inspection.
Chapter 1 Introduction
Automatic Speech Recognition has been an active topic of research for the past four decades. The main objective of the automatic speech recognition task is to convert a speech segment into an interpretable text message without the need for human intervention. Many different algorithms and schemes based on different mathematical paradigms have been proposed in an attempt to improve recognition rates. Due to improvements both in algorithms and in hardware, automatic speech recognition has become more affordable and available. Computer hardware and software have significantly improved in terms of speed, memory, cost and availability, which has enabled more sophisticated and computationally demanding algorithms to be implemented even on low-power, low-cost handheld electronic devices. However, under certain circumstances, recognition rates are far from optimal. Automatic speech recognition is still an open topic of research, where improvements and changes are constantly made in the hope of better recognition rates. Since the problem of speech recognition is complex, we prefer algorithms with low computational and memory requirements, since they can be implemented easily and at lower cost. In addition, other constraints such as computational complexity and real-time constraints come into play in the design and implementation of a working product.
In this work we propose a different approach to automatic speech recognition based on mathematical morphology and fuzzy logic theories. This new approach is unconventional in the sense that it involves three major fields of research, namely speech theory with emphasis on automatic speech recognition, image processing with emphasis on image segmentation, and evidence theory with emphasis on fuzzy logic, decision making and combining evidence. Before delving into the worlds of phonology and morphological image processing, we present an overview of automatic speech recognition and give insight into some commonly used techniques that attempt to solve this formidable task.
1.1 Automatic Speech Recognition (ASR)
Automatic speech recognition (ASR) is the process of converting human speech into written text. Many advancements have been made in the field that have led to systems with high recognition rates; however, there are still many open problems, in particular due to four major parameters:
1. Vocabulary size – The minimal possible vocabulary size is 2 (for example Yes/No). Another common size is the 10 digits. Telephone conversations or news reports require a vocabulary of about 60,000 words. Professional journals and texts use an even more esoteric vocabulary, and this often requires the use of specialized dictionaries.
2. Fluency – Isolated words are easier to recognize than continuous speech. Read speech is usually clearer than conversational speech. In addition, isolated words give the system more time to process results and have lower inter-speaker variability.
3. Channel and noise – A lab microphone and lab environment have lower noise interference and lower signal distortion than speech sampled through a cell phone microphone in a moving car with the window open. A low signal-to-noise ratio can cause severe interference and degrade the performance of a system significantly. Signal distortion and a high compression rate can cause some words to sound the same, which makes the recognition task more difficult.
4. Accents and other speaker-specific parameters – Children have a different frequency range than adults. Foreign accents degrade the performance of a system, as do uncommon accents that the system was not designed to handle. Usually an ASR system can be fine-tuned to a specific speaker in order to reduce the error rate.
ASR is used in numerous applications. We present several examples of common uses of ASR:
1. Dictation allows almost completely hands-free conversion of speech into text. An example of a commonly used software application for dictation is Nuance's Dragon Naturally Speaking software package. With high recognition rates for native speakers, the user is rarely required to intervene and correct the dictation results.
2. Call centers route calls and give out information according to user requests. Usually call centers operate on a limited vocabulary related to their field of operation. The purpose of the ASR system is to help the customer service representative perform the task more efficiently and in less time.
3. Medical transcription is growing in importance. Regulations and practical needs require that a patient file be converted into digital text. Performing an Optical Character Recognition (OCR) task to convert the handwritten information into text is especially challenging when it comes to doctors' handwriting. Currently medical transcription is done either by professionals in the field or by ASR systems that have a vocabulary enriched with medical terms [1].
4. Robotics uses ASR for guidance and instructions. A robot can be guided to a certain task; for example, Dialog OS [2] is a software package that can be used to enable a Lego Mindstorms RCX unit to understand speech [3]. The user can then guide the Lego robot to perform certain preprogrammed tasks.
5. Security applications such as automatic tapping of telephone lines using specific keywords. For instance, Nortel is developing an ASR system together with Qinghua University in an attempt to monitor every civilian in China [4].
6. Home devices use ASR as a friendlier human interface. Difficult commands are made simple since, instead of complex programming of buttons, the user simply speaks out the desired command. Fujita et al. [5] propose a speech remote control for digital TV that uses 15 buttons instead of the 70 needed to operate a multichannel TV.
7. Mobile phones and other communication devices use ASR for speed dialing. The phone contains ASR software which can identify names and dial the corresponding number. Usually the system is trained for a particular speaker, has a small vocabulary and operates under harsh computational, memory and power consumption constraints.
8. Pronunciation evaluation in language learning applications. A speaker of a second language has to make a special effort to correctly pronounce different words and sentences. The ASR system indicates to the user whether the pronunciation was clear.
9. Automatic translation by converting speech to text and then translating the produced text.
1.2 Cepstral coefficients and Mel Frequency Cepstral Coefficients
Cepstral coefficients play an important part in speech theory, and in automatic speech recognition in particular, due to their ability to compactly represent relevant information that is contained in a short time sample of a continuous speech signal [6]. The definition of the real cepstral coefficients is given by the following equation:
$$\mathrm{cepstrum}(x) = \mathrm{IDFT}\left(\log\left|\mathrm{DFT}(x)\right|\right) \tag{1.1}$$
DFT is the Discrete Fourier Transform, often implemented by the Fast Fourier Transform algorithm. We also note that
$$\mathrm{cepstrum}(x * y) = \mathrm{cepstrum}(x) + \mathrm{cepstrum}(y) \tag{1.2}$$
where $*$ denotes convolution. Equation 1.2 can be easily derived from 1.1, because the DFT maps a convolution to a product and the logarithm maps the product of the magnitudes to a sum. It is useful when we model the speech signal as the result of an excitation convolved with the impulse response of the vocal tract filter.
The Mel Frequency Cepstral Coefficients (MFCCs) [7] are obtained by converting the log-absolute value of the frequency spectrum to a perceptually based Mel spectrum and taking an inverse discrete cosine transform of the result:
$$\mathrm{MFCC}_i = \sum_{k=1}^{20} X_k \cos\left[ i \left( k - \frac{1}{2} \right) \frac{\pi}{20} \right], \quad i = 1, 2, \ldots, M \tag{1.3}$$
M is the number of cepstrum coefficients and $X_k$, $k = 1, \ldots, 20$, represents the log-energy output of the kth Mel filter.
Using cepstral terminology we regard the Mel mapping as a rectangular low quefrency lifter followed by a discrete cosine transform. Quefrency is a cepstrum value ('cepstrum frequency value'), while a lifter is a weighted cepstrum or, in other words, a filter for the cepstrum coefficients. The result is a smoothed cepstrum which can be further sampled to a specific number of coefficients.
The human ear filters sound linearly for lower frequencies and logarithmically for higher frequencies. The triangular lifters are linearly spaced up to 1000 Hz and logarithmically spaced afterwards up to 4000 Hz. The hidden assumption is that more important speech information is encapsulated in the low frequency band of 0-1000 Hz, while the higher 1000-4000 Hz band contains less information per Hz. Partitioning the frequency range into two different spacing schemes that also resemble the Bark scale yields an efficient representation of the spectrum.
The triangular lifters can be regarded as a possibility function which serves as an upper bound to a symmetrical distribution where only the mean and variance are known. The possibility function entails all the possible distributions that might occur and is the coarsest upper bound we can obtain knowing only the mean and variance of a stochastic process; a justification for a triangular function is given in the appendix.
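
As a rough illustration of Equations 1.1 to 1.3, the following Python sketch computes a real cepstrum, checks the additivity property numerically, and applies the cosine transform of Equation 1.3 to a vector of Mel log-energies. It assumes NumPy; the function names `real_cepstrum` and `mfcc_from_log_energies`, the `eps` guard and the toy signals are illustrative choices and do not come from the thesis. The additivity check uses circular convolution, which is the form that holds exactly for the DFT.

```python
import numpy as np

def real_cepstrum(x, eps=1e-12):
    """Real cepstrum as in Eq. 1.1: IDFT(log|DFT(x)|).

    eps is a small guard against log(0); it is an assumption of this sketch.
    """
    magnitude = np.abs(np.fft.fft(x))
    return np.fft.ifft(np.log(magnitude + eps)).real

def mfcc_from_log_energies(log_energies, num_coeffs):
    """Cosine transform of Eq. 1.3 applied to K Mel log-energies (K = 20 in the text)."""
    X = np.asarray(log_energies, dtype=float)
    K = X.size
    k = np.arange(1, K + 1)                      # filter index k = 1..K
    i = np.arange(1, num_coeffs + 1)[:, None]    # coefficient index i = 1..M
    return (X * np.cos(i * (k - 0.5) * np.pi / K)).sum(axis=1)

# Two toy signals whose spectra stay away from zero (illustrative only).
n = np.arange(64)
excitation = 0.5 ** n
impulse_response = (-0.3) ** n

# Circular convolution, computed through the product of the DFTs.
convolved = np.fft.ifft(np.fft.fft(excitation) * np.fft.fft(impulse_response)).real

# Eq. 1.2: the cepstrum of a convolution is the sum of the cepstra.
lhs = real_cepstrum(convolved)
rhs = real_cepstrum(excitation) + real_cepstrum(impulse_response)
additivity_holds = np.allclose(lhs, rhs, atol=1e-8)
```

A flat vector of log-energies gives coefficients that are numerically zero for i = 1, ..., M, which is a quick sanity check that the cosine basis in the sketch matches Equation 1.3.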
1.3 Different Approaches to Automatic Speech Recognition
Automatic Speech Recognition has been an active research field since the invention of the vocoder by Homer Dudley in the late 1930s [8]. Different approaches have been developed to cope with the challenges presented in 1.1 while meeting the constraints of reasonable recognition rates and affordable computational requirements. In the following we present an overview of three common approaches to automatic speech recognition:
1.3.1 Hidden Markov Models (HMMs)
Hidden Markov Models (HMMs) are currently the most common approach to automatic speech recognition. A Markov chain is a chain in which each state depends only on the previous state and not on any earlier state. A Hidden Markov Model is a Markov chain in which the actual state of the chain is hidden from the observer. The speech signal is modeled as a piecewise stationary stochastic process, and in many applications time intervals are held constant at 10 ms. A feature vector is computed for each time interval. Typically, a feature vector has 13 elements, which are the cepstral coefficients of the sampled speech signal in the current time interval. The different states of the HMM represent different distributions. The features are then used to determine the state which represents the distribution associated with the specified time interval. Finally the Viterbi algorithm is used to perform Maximum A-Posteriori (MAP) analysis of the data and produce the sequence with the highest likelihood of occurrence. There has been a significant amount of work in the field of HMM-based automatic speech recognition systems, and many theoretical and application-specific algorithms exist [9].
1.3.2 Neural Networks
Neural Network (NN) based systems were popular in the late 80s; however, due to the relative success of HMMs they have been somewhat neglected. In [10] Rabiner et al. demonstrate the importance of spectral parameterization of a speech signal that serves as input to a NN system. Since linguistic isomorphism does not imply acoustic isomorphism, we can expect different spectral representations for similar words/phonemes. Two methods of parameterization that are commonly used are a bank of filters and an all-pole linear prediction model. A bank of filters is a set of overlapping filters that are spaced in frequency according to either a uniform or a nonuniform law. The nonuniform law is usually exponential, where a common technique is to use a critical band scale that combines a linear and an exponential filter placement similar to the Bark or Mel scales. Both the Bark and Mel scales are justified based on perceptual studies of speech. The linear prediction analysis technique models speech as an all-pole filter and uses the distance from the coefficients of an actual known utterance as the optimization criterion. The Hilbert norm of the difference of the cepstral coefficients is used as the distance measure and is afterwards optimized.
Different types of neurons together with different types of connectivity exist. Several stages of neurons can be built, keeping in mind the tasks of the system on one hand and the computational and theoretical complexity on the other. In general, a NN requires training on a specific set of data for which the desired results are already known. It is important not to overtrain the system on a specific data set. Often the input to the NN is first categorized into different clusters. The categorization improves the performance of the NN, in particular in pattern recognition schemes. NN can also be used for statistical estimation of phonetic probabilities that can later be used in a HMM within a continuous speech statistical ASR system [11]. The main caveat of NN, which are also used in Artificial Intelligence, is that “real intelligence is a prerequisite for Artificial Intelligence” (Prof. David Malah); in other words, we are attempting to build a complex system that mimics the way the human brain operates, but we do not have a complete understanding of the behavior of the system due to its complexity.
1.3.3 Hybrid Systems
Hybrid systems, as their name implies, combine different strategies with the objective of improving recognition rates. Common hybrid systems are Neural Network Hidden Markov Models as described in [12]. Makhoul et al. suggest an N-best paradigm that uses multiple hypotheses instead of a single one. A segmental neural net is constructed to model the different phonemic connections. By using multiple connections a consistent improvement in performance is obtained. Such modeling is not possible with a HMM, since the basic assumption of the HMM restricts the dependency of the current state to the previous one only.
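
The decoding step mentioned in Section 1.3.1, where the Viterbi algorithm picks the most likely hidden state sequence given per-frame feature likelihoods, can be sketched as follows. This is a minimal illustration assuming NumPy and log-domain probabilities; the function name `viterbi`, the two-state toy model and all numeric values are hypothetical and are not taken from the thesis or from any particular ASR toolkit.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely state path of an HMM.

    log_init:  (S,)   log prior of each state
    log_trans: (S, S) log transition probabilities, indexed [from, to]
    log_emit:  (T, S) log-likelihood of each frame under each state
    """
    T, S = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        # cand[i, j]: score of the best path that is in state i at t-1 and moves to j
        cand = score[:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy model: two "sticky" states; the first three frames favour state 0,
# the last three favour state 1 (all values are illustrative).
log_init = np.log([0.5, 0.5])
log_trans = np.log([[0.9, 0.1],
                    [0.1, 0.9]])
log_emit = np.log([[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3)

best_path = viterbi(log_init, log_trans, log_emit)  # [0, 0, 0, 1, 1, 1]
```

In a real recognizer each row of `log_emit` would come from evaluating the state distributions on the 13-element cepstral feature vector of the corresponding 10 ms frame; here the rows are fixed by hand to keep the example short.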
there is no room for human reasoning or humanbased rules. as demonstrated by the Heisenberg–Gabor inequality. The wideband spectrogram is generated using a relatively short time window that gives good time resolution but less accurate frequency resolution. It deals with voiced/unvoiced fricatives. Such encouraging recognition rates motivate the development of an automatic tool to perform the reading task. we believe it is necessary to identify and segment the speech spectrogram into Binary Large Objects (BLOBs). an expert system based on spectrogram reading knowledge was devised with an objective to segment speech into different phonemes. voiced/unvoiced stops. Attempting to track down frequency changes with time using a single pixel skeleton path is futile when the time interval is too short to allow single pixel localization.1. human experts can read speech spectrograms with a high level of accuracy. A rulebased expert system reports recognition rates 7 . linguistics and acoustic phonetics [13]. stipulates that the extent to which a particular frequency can be localized is inversely proportional to the length of the time interval chosen. Different experiments demonstrate recognition rates in the range of 85%. In an effort to mimic the human experts‟ behavior we choose a large time interval on the order of 1 second in order to capture several phonemes that may be related through coarticulation. Since the HMM relies on probability models to reach conclusions. We note here the work of [15] that used morphological skeletons to extract information. In [13]. due to the fact that humans are well trained since early childhood to recognize speech and they perform the task better than any existing ASR system. Victor Zue of MIT has spent over 2500 hours learning spectrogram reading and has reached impressive recognition rates. 
Previous attempts to extract information from speech spectrograms have been made.4 Drawbacks of existing automatic speech recognition systems The main drawback of the previous three methods is their blind treatment of the problem. Zue and Cole [14] have given encouraging results for automatic speech recognition based on speech spectrograms. The uncertainty principle. Prof. nasals and liquids. While in general it is possible to extract information through a skeletonbased approach. phonotactics. Not having any option to incorporate human knowledge into the system is particularly odd. While in general a person cannot give a consistent reasoning to the parameters that allow distinguishing between different words or phonemes. Movements of the vocal tract can be well represented using a wideband spectrogram. Spectrogram reading requires a combination of different sources of knowledge such as articulatory movement. Speech signals can be modeled as nonstationary signals.
of about 90% for the aforementioned phoneme classes. These results motivate us to focus on the classes that are more difficult to recognize by a rule-based segmentation, namely the vowels, glides and nasals.
1.5 Image Processing Basic Concepts
A digital image can be either acquired by sampling the continuous space or through synthetic computer generated methods. The sampling can be regarded as some form of averaging of energies that is represented by a matrix grid of pixels (picture elements). Many different types and formats of images exist. For example, satellite images obtained from espionage satellites use a ruler (vertical line) that delivers a stream of pixels per second; these columns are combined to form an image. Infrared night vision produces green-like images. 3D cameras produce two images with disparity in the order of the distance between the eyes that combined would show a 3 dimensional image; other 3D cameras extract information from the image by way of inference [16]. A z-camera can provide, in addition to a color image, also the distance from the camera. In digital cameras, for instance, a 3 dimensional CCD (Charge-coupled Device) is used for each pixel's Red, Green and Blue (RGB) values, therefore three matrices are needed for a single image representation. An image is stored in memory as a matrix and different signal processing operators such as convolution, filtering, tracking etc. can be performed on the image as long as their two dimensional version is applied.
Our work focuses mainly on two types of image representations: binary images and grayscale images. Both types of images are generated from processing an input speech signal. Some spectrogram readers prefer to use a color spectrogram in which different colors indicate decreasing/increasing energy levels and maximum/minimum energy points. Since these values can be calculated we do not focus on the visual aspects of a color image. In binary images, a pixel takes the logical value '1' if it is a foreground pixel (black) and '0' if it is a background pixel (white). In grayscale images each pixel usually takes a value between 0 and 255 which can be stored in 8 bits. In grayscale images the maximum value (255) corresponds to white and the minimum value (0) corresponds to black. Apart from the toggle of black and white for the extreme values, grayscale images can be regarded as an extension of binary images just as Fuzzy Logic extends the traditional Boolean logic; chapter 4 will discuss Fuzzy Logic in more detail. In order to convert an image from grayscale to binary we need to perform a threshold operation on the image. Each pixel that contains a value greater than the threshold is quantized to a logical '1' while each pixel that contains a value smaller than the threshold is quantized to a logical '0'. The reasoning behind the selection of a logical '1' as black and a logical '0' as white is
that since the paper we print on is white, background pixels are white. Applying a threshold means that the pixels above the threshold produce a truth value (logical '1') and therefore are foreground (black) pixels. The decision to select zero as black in grayscale images arises from the fact that grayscale images can be represented by an arbitrary number of bits depending on the display type and the characteristic of the problem at hand. Commonly but not necessarily they are represented by 8 bits (byte). Therefore, the simplest procedure is to take black as '0' since it has the same effect in all display types (cathode rays are shut). Many algorithms on threshold selection exist in the literature; we will see in chapter 6 that both a global and a local threshold are required in addition to some preprocessing of the image in order to convert it to a binary image.
1.6 Thesis Outline
Chapter 2 provides a technical overview of phonological terms and Time Frequency representation that are important to the understanding of the proposed speech recognition system.
Chapter 3 serves as a tutorial to morphological image processing. An overview of mathematical morphology, its history, axioms and advantages are described in detail. Various morphological operators that are used in the thesis are explained and accompanied with practical examples. As an anecdote the relationship and similarity between Fuzzy Logic and Mathematical Morphology is presented.
Chapter 4 introduces Fuzzy Logic and some concepts from Fuzzy set theory and Evidence theory. The Matlab™ Fuzzy Logic toolbox is presented and justifications for using Fuzzy Logic are given.
Chapter 5 shows an attempt to extract pitch from a narrowband image spectrogram and explains why, even though the recognition and extraction of information is performed in a good manner, the desired result of obtaining the correct pitch is not reached. Reasons for the failure of this approach to pitch estimation are discussed.
Chapter 6 gives a description of the proposed algorithm for spectrogram segmentation. Experimental results are discussed.
Chapter 7 summarizes the thesis with conclusions and ideas for future research.
Chapter 2 Fundamentals of Speech Theory and Time Frequency Analysis
This chapter presents some important concepts in speech theory and time-frequency analysis. We have seen in Chapter 1 the challenges in designing an Automatic Speech Recognition (ASR) system. ASR systems can be implemented to detect phonemes, words or even complete sentences. Since we are dealing with continuous speech, we do not have any indication for word boundaries. We choose to focus our recognition task on a basic speech unit, the phoneme, and design our system to be able to recognize phonemes. There are different ways we can model speech, namely, articulatory, acoustic, phonetic and perceptual. To perform good phoneme recognition, we need to understand some basic concepts of articulatory phonetics. We conclude the chapter with a description of the uncertainty principle and time frequency analysis that allows us to better understand the intricate features of the speech spectrogram.
2.1 Phoneme
Our work concentrates on recognizing different kinds of phonemes in the continuous speech of a single speaker. A phoneme can be defined as a theoretical representation of sound [17]. It is the smallest meaningful contrastive unit in the phonology of a language [6]. A phoneme is the conception of sound that is sufficient to distinguish between two different words. For example, bear and tear differ only by their first letter. In this case /b/ and /t/ are considered different phonemes and bear and tear can be distinguished due to the different phonetic transcription. Most languages have about 20-40 phonemes [6]. Since phonemes are conceptual units it is hard to quantify their start and end point in time on a given speech signal. The TIMIT database as described in section 2.7 provides the start and end point of each phoneme in the database as segmented by various speech experts. The additional information provided in TIMIT can be used for debugging, improving and displaying performance results of the recognition algorithm.
2.2 Pitch
Pitch is the perceived fundamental frequency (f0) of speech. Pitch cannot be defined rigidly in mathematical terms since it is a perceived property that represents the frequency in which the
vocal cords vibrate. Pitch plays an important part in synthetic speech production, speech compression, speech coding and other speech related techniques and algorithms. Pitch has not played an important part in automatic speech recognition due to the difficulty of using the pitch information in existing speech recognition systems. However, the promising feature of incorporating pitch into a rule-based automatic speech recognition system has motivated our investigation of the pitch. Since pitch can also give an indication of whether the speech segment is voiced or not we again see the importance of pitch in the overall recognition process. Chapter 5 thoroughly deals with the concept of pitch as we attempt to recognize pitch by using image processing techniques on a narrowband spectrogram.
2.3 International Phonetic Alphabet (IPA)
The International Phonetic Alphabet was first developed by linguists in 1886 in an attempt to create a different set of phonetic symbols for each language. Eventually it was decided to merge all languages to a unique set of phonetic symbols. The IPA attempts to associate each sound with a single phonetic symbol while two symbols are used in case the phonetic unit is produced by a combination of two sounds. There are a few possible representations for phonemes where a very common one is the International Phonetic Alphabet for English (IPA). Our work concentrates on the English language. We use the IPA to easily identify different phonemes. Unicode is a computer coding standard that supports most known font systems including the IPA. In order to match between the IPA and the phoneme representation in TIMIT, which is given in standard ASCII format that does not include the IPA, we perform a one-to-one mapping between the two methods of representation.
2.4 Voiced and Unvoiced Speech
A voiced sound is one in which the vocal cords vibrate. In English, voicing is a binary parameter even though in some particular cases there can be a degree of voicing. Degrees of voicing are usually measured by duration (voice onset time) and by intensity. In general, a half voiced/partially voiced sound is caused by voicing over part of the sound (short duration). Low intensity voicing occurs when the voicing is weak and is also considered as a partially voiced sound. A speech recognition system must be able to distinguish between voiced and unvoiced speech in order for it to detect moments of silence, end of word, different phoneme classes, whispers, coughs, giggles and other information.
2.5 Different Phoneme Classes
We present in the following some basic concepts in phonetics that are used throughout this work.
2.5.1 Sonorant
A sonorant is a speech sound that is produced without turbulent airflow in the vocal tract. If a sound can be produced continuously with the same pitch it is considered a sonorant. Sonorants are voiced in most world languages. The sonorants include the following classes: Vowels, Approximants, Nasals, Taps and Trills.
2.5.2 Vowels
Vowels and diphthongs are the phonemes with the greatest intensity. The vowels can be modeled as quasi-periodic with the periodicity being the fundamental frequency, f0. Most of the energy of the vowels is concentrated in the first formant f1 which is usually below 1 kHz. Vowels can be distinguished primarily by their first three formants. Following the work of [18] we obtain the following table for the average location of the first three formants of different English phonemes:
Phoneme    /i/   /I/   /E/   /@/   /a/   /c/   /U/   /u/   /A/   /R/
f1 (Hz)    270   390   530   660   730   570   440   300   640   490
f2 (Hz)   2290  1990  1840  1720  1090   840  1020   870  1190  1350
f3 (Hz)   3010  2550  2480  2410  2440  2410  2240  2240  2390  1690
Table 2.1 Average formant frequency values for selected phonemes
We use the table as reference for an expected formant location. We would like to examine a possible implementation to a Fuzzy Logic (FL) system. Realizing how a FL system would look like would assist in the development of the segmentation algorithm. We would like to know what kind of information we need to extract from the image spectrogram. We also show a possible implementation of fuzzy membership functions for the vowels based on statistical data. An automatic script
creates a FL system that assigns a grade to each phoneme based on its similarity to a given vowel. The input to this FL system is the average frequency location of each phoneme. It is compared with the average frequency location of the formants of the English language. These average values can further be adjusted and trained to specific speakers, however, in this case we use constant reference data. A triangular membership function (TMF) is created around each phoneme. The MF is constructed so that it attains a zero value at each adjacent average formant value and attains a one at the target average formant value. In this way, each mid-frequency point can be associated with 2 membership functions, at the most. For three formants, we obtain $2^3 = 8$ possibilities, however, after matching the formants and giving priority to two agreeing formants over one, and to three agreeing formants over two, we can further reduce this number. TMF is selected since it gives coverage of symmetric distributions with a minimal number of parameters. As explained in section 1.2 and the appendix, the TMF are well used in cases of limited or very low prior knowledge. We see that in order for this type of FL system to work we would need to provide it with some kind of estimate of the formant frequency.
2.5.3 Approximants
Approximants are sounds that can be regarded as in between vowels and typical consonants. Articulators narrow the vocal tract but leave enough space for air to flow without much audible turbulence. Approximants can be slightly fricated in case the outflowing air becomes turbulent.
2.5.4 Nasals
Nasals are voiced sonorants and are caused by using the tongue to block the air, allowing it only to escape freely through the nose. The tongue articulation (and not the nose itself) differentiates among the different nasals. English nasal sounds consist of [m], [n], [ŋ]. Nasal waveforms are called murmurs. Murmurs are similar to vowel waveforms but have significantly weaker energy due to a spectral null. The frequency of the spectral null is inversely proportional to the length of the oral cavity behind the constriction. Since humans have a very poor perceptual resolution of spectral nulls, other cues such as formant transitions in adjacent sounds and spectral jumps distinguish nasals from other phonemes. These characteristics together with the spectral null can be observed in a spectrogram image of a nasal phoneme.
2.5.5 Taps/Flaps
An articulator is thrown towards another articulator using a single contraction of the muscles to produce a consonantal sound that is a tap. Taps and flaps are considered by most linguists to be the same even though some distinctions can be made to distinguish between the two.
Unlike a stop (plosive) consonant, a flap does not consist of a buildup of air pressure behind the place of articulation, therefore there is no release burst upon producing the sound. An example of an alveolar tap is the consonant /ɾ/ or /tt/ as in the English word latter.
2.5.6 Trills
A trill is a consonantal sound produced by vibrations between the articulator and the place of articulation. The trills consist of three phonemes: [ʙ] [r] [ʀ]. Trills significantly differ from flaps. In a trill the articulator is held in place unlike flaps/taps in which an active articulator is struck against a passive one. Trills, unlike flaps, vary in the number of periods they occur. Normally trills vibrate on 2-3 periods, however some trills last 5 periods or more and in some cases a trill can last for a single period. Single period trills differ from flaps by articulation.
2.5.7 Obstruents
Obstruents include stops (also known as plosives), fricatives and affricates.
Stops
Stops are transient phonemes and thus are acoustically complex. Most stops start with a silent period due to closure of the articulators. Some stops have a voiced bar of energy in the first few harmonics (some voiced stops). The voiced bar is caused by radiation of periodic glottal pulses through the walls of the vocal tract. The throat and the cheek heavily attenuate all other frequencies. Stops can be very brief in between vowels. Alveolar stops, in particular, may become flaps in which the tongue tip retains contact with the palate for about 10-40 ms.
Fricatives
Fricatives are caused by forcing air through a narrow channel made by two articulators that are moved close to each other. Producing /f/ using the lower lip against the upper teeth is an example of placing two articulators closely together. Sibilants (stridents) are a subset of fricatives. They are created by curling the tongue lengthwise to direct the air caused by the closely placed articulators towards the edge of the teeth. Examples of sibilants are English [s], [z], [ʃ], and [ʒ].
Affricates
An affricate is a sound that begins with a stop (plosive) and ends with a fricative. The two English affricates are /dg/ or in IPA /ʤ/ as in Jump and /ch/ or in IPA /ʧ/ as in Charly. We can regard the affricates either as a combination or as a single phonemic unit. We regard the affricates as a single phonemic unit.
2.6 Coarticulation
Speech is produced by articulator gestures that essentially overlap in time. The shape of the vocal tract is highly dependent on the previous and successive phonemes. In general there are right-left (RL) and left-right (LR) articulations where in RL the articulator may move toward the succeeding phoneme in case the articulator's new position does not interfere with the current phoneme. The RL are also called anticipatory coarticulation since the brain prepares the articulators during the current phoneme to pronounce the following phoneme. Unlike the RL, the LR coarticulations do not involve any look-ahead planning but are driven by the realistic physical constraints that are imposed on the articulators when moving from one phoneme, for example a consonant, to another phoneme, for example a vowel. In order to reduce the effort required to pronounce the different phonemes, instead of fast movement of the articulators they are allowed to return to their natural position over several phones. Thus the LR articulation helps reduce the effort in pronouncing different phonemes consecutively.
From the Automatic Speech Recognition perspective, coarticulation imposes a big hurdle. Instead of having a small possible alphabet of phonemes to recognize we have plenty of possible combinations all flavored with inter-speaker variabilities in pronunciation, energy levels, speed and system noise. Understanding the effects of coarticulation is of particular importance when designing an expert-based Fuzzy Logic system that relies on input from speech spectrograms. Since there are dependencies between different phonemes we cannot simply use a regression table for values of the locations of the first three formants of the vowels with appropriate confidence intervals, for example, in order to perform vowel recognition. We need to take into account different variations due to coarticulation that would probably cause different constraints to contradict and result in no correct answer. We need to allow flexibility in the design of the system to allow it to output different possible outcomes with various grades of belief that would later be reconciled through a Maximum A-Posteriori algorithm such as the Viterbi algorithm.
2.7 The DARPA TIMIT Speech Database
The DARPA (Defense Advanced Research Projects Agency) TIMIT speech database consists of utterances of 630 speakers of eight major dialects of American English. The database is designed to assist in the development and testing of Automatic Speech Recognition systems. Both an orthographic transcription and a time-aligned phoneme transcription are included for each binary speech file. Different files in the database have different purposes which are
distinguished according to the file/directory names. For example, files whose names start with sx are MIT (Massachusetts Institute of Technology) phonetically compact sentences, while files whose names start with si are TI (Texas Instruments) random contextual variant sentences. Both genders are represented and phonemes can be tracked according to the location in the sentence given by the sample number in which they are present. The database is sampled in low noise conditions at a sampling rate of 16 kHz. The original database is in big-endian format and conversion to little-endian is necessary when reading speech files from the CD-ROM to an Intel-based machine.
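The byte-order conversion mentioned above can be illustrated with a minimal sketch. It assumes headerless, signed 16-bit samples; any file header would have to be skipped first, which is not attempted here. The function names and the four example bytes are hypothetical and serve only as an illustration.

```python
import struct

def big_endian_pcm16_to_samples(raw):
    """Decode big-endian signed 16-bit samples into a list of Python ints."""
    if len(raw) % 2:
        raise ValueError("expected an even number of bytes")
    count = len(raw) // 2
    # ">" selects big-endian byte order, "h" a signed 16-bit integer.
    return list(struct.unpack(">%dh" % count, raw))

def samples_to_little_endian(samples):
    """Encode the samples again, this time in little-endian byte order."""
    return struct.pack("<%dh" % len(samples), *samples)

# Two made-up samples: 0x0102 and 0xFFFE in big-endian order.
raw_big = bytes([0x01, 0x02, 0xFF, 0xFE])
samples = big_endian_pcm16_to_samples(raw_big)
raw_little = samples_to_little_endian(samples)
```

Decoding to integers first, rather than swapping bytes in place, keeps the sample values available for further processing regardless of the byte order of the host machine.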
2.8 The Uncertainty Principle
The uncertainty principle, also known as Heisenberg's uncertainty principle, is directly derived from the Fourier transform equations. Analyzing a signal over a long time duration would produce more specific frequency results at the cost of reducing the localization in time. If high frequency resolution is desired we need a longer time segment. However, if high frequency resolution is not required we can take a shorter time segment that would give better time resolution. This tradeoff gives rise to two common spectrograms, namely the narrowband and the wideband spectrogram. The narrowband spectrogram, as used in chapter 5, gives good frequency resolution, which is essential in determining the pitch, which takes values in the range of 100 Hz and requires a resolution of a few Hz. The wideband spectrogram is used in determining the location and direction in frequency of the different formants, which range up to 4 kHz with a required resolution in the range of 100 Hz.
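The tradeoff can be made concrete with a small arithmetic sketch. It assumes that the frequency resolution is approximated by the spacing of the DFT bins, that is the sampling rate divided by the window length in samples; the true resolution also depends on the window shape. The window durations of 200 ms and 10 ms are illustrative choices, not values taken from this work, and the sampling rate of 16 kHz is that of the TIMIT database.

```python
def window_tradeoff(duration_ms, sample_rate_hz=16000):
    """Return (window length in samples, frequency bin spacing in Hz)."""
    samples = round(duration_ms * sample_rate_hz / 1000)
    spacing_hz = sample_rate_hz / samples
    return samples, spacing_hz

# A long window: fine frequency spacing, poor localization in time.
narrowband = window_tradeoff(200)
# A short window: coarse frequency spacing, good localization in time.
wideband = window_tradeoff(10)
```

Under these assumptions the long window separates frequencies that are a few Hz apart, which suits pitch harmonics, while the short window only resolves about 100 Hz, which is sufficient for formants.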
The Gaussian function is the only function that attains the maximal joint time-frequency resolution, that is, it meets the uncertainty bound with equality. Since a Gaussian is an eigenfunction of the Fourier transform, the Fourier transform of a Gaussian is still a Gaussian. This special property gives the theoretical justification to the Gabor functions (and Gabor wavelet transform) that are used in chapter 5 to better identify horizontal lines in the spectrogram.
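For this example we give the Fourier transform definition of:
(2.1)  $F(w) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(t)\, e^{-jwt}\, dt$
And the inverse Fourier transform as:
(2.2)  $f(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} F(w)\, e^{jwt}\, dw$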
Note that instead of a single constant used for the inverse transform we now have two identical constants. As long as the product of the constants is $\frac{1}{2\pi}$ we can select the constants at will. In most computational software the constant is used only on the inverse transform to save floating point multiplications; however, in demonstrating the Gaussian property as an eigenfunction of the Fourier transform, we make the aforementioned selection. We note that the Gaussian is not the only function that serves as an eigenfunction of the Fourier transform; another trivial example is the pulse train signal. Using integral tables (or calculating using the residue theorem) we obtain that for the above definition of the Fourier transform, the relationship is:
(2.3)  $f(t) = e^{-\frac{t^2}{2\sigma^2}} \;\xrightarrow{\text{Fourier}}\; F(w) = e^{-\frac{w^2}{2\sigma^2}}, \quad \sigma = 1$
We input a centered normalized Gaussian of unity variance, N(0,1), to the Fourier transform. The output is identical to the input and the Fourier transform did not change the input (except for a countable set of points of zero measure).
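The eigenfunction property can be checked numerically with a minimal sketch. It assumes the symmetric definition of the Fourier transform, with the constant $\frac{1}{\sqrt{2\pi}}$ on the forward transform, and approximates the integral by a Riemann sum over a truncated time axis. The function names, the integration limits and the test frequencies are arbitrary illustrative choices.

```python
import cmath
import math

def unitary_fourier(f, w, t_max=12.0, steps=4801):
    """Approximate (1/sqrt(2*pi)) * integral of f(t) * exp(-j*w*t) dt by a Riemann sum."""
    dt = 2.0 * t_max / (steps - 1)
    total = 0j
    for k in range(steps):
        t = -t_max + k * dt
        total += f(t) * cmath.exp(-1j * w * t)
    return total * dt / math.sqrt(2.0 * math.pi)

def gaussian(t):
    """Unit-variance Gaussian shape exp(-t^2 / 2)."""
    return math.exp(-t * t / 2.0)

# Compare the transform with the input function at a few frequencies.
test_frequencies = [-3.0, -1.5, 0.0, 0.5, 2.0]
max_error = max(abs(unitary_fourier(gaussian, w) - gaussian(w))
                for w in test_frequencies)
```

The difference between the transform and the original function stays at the level of the numerical round-off, which is consistent with the Gaussian being mapped onto itself.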
2.9 Time Frequency Representation
Time-Frequency Representation (TFR) differs from a spectrogram representation by taking the square value of the signal energy instead of its logarithm. Nadine Martin examined different algorithms for TFR segmentation [19, 20]. In [19] two algorithms for TFR segmentation are suggested. The first is based on morphological filters and the watershed transform and the second is based on tracking using a Kalman filter. Another interesting segmentation scheme based on statistical features of a spectrogram is presented in [20]. The TFR has better resolution and the variance of the Capon estimator used to segment the image is lower according to [19]. The segmentation is blind toward the analyzed signal; it does not assume that the signal is of any particular type. A speech signal will be analyzed in exactly the same way as a seismic signal. In addition the algorithm does not require tuning. Assuming a deterministic signal corrupted by additive Gaussian noise a probability model is developed to allow for local segmentation of objects. We note that by ignoring our knowledge of the signal's source we lose much prior information that is known about speech signals and can be used in the segmenting process.
Wideband speech spectrograms are indeed noisy images. Vertical lines that striate the spectrogram show that it may be inappropriate to model the noise as a lognormal distribution as would be the case if we apply the algorithm developed in [19] for TFR to the spectrogram. The vertical striating lines are caused by the opening and closure of the vocal cords. The spacing in time between these lines approximates the pitch period and can therefore be used as a rough approximation to the fundamental frequency f0, also known as the pitch. Another caveat for using a blind method as proposed in [19, 20] is the difficulty in adjusting it to recognize specific types of information present in the speech spectrogram. Existing speech recognition systems have been developed for many years
and reach impressive results. Developing a new recognition system is a challenging task. Using Dempster's Rule it is possible to combine two sources of evidence to a joint basic assignment. We will see in chapter 4 a detailed description of Dempster's rule of combining evidence.
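As a preview, the combination of two sources of evidence can be sketched as follows. The sketch uses the standard form of Dempster's rule: the masses of all pairs of focal sets with a common non-empty intersection are multiplied and accumulated, and the result is normalized by one minus the total conflicting mass. The two hypotheses 'vowel' and 'nasal' and all mass values are made up for illustration only.

```python
def dempster_combine(m1, m2):
    """Combine two basic assignments given as {frozenset: mass} dictionaries."""
    combined = {}
    conflict = 0.0
    for set_b, mass_b in m1.items():
        for set_c, mass_c in m2.items():
            intersection = set_b & set_c
            if intersection:
                combined[intersection] = combined.get(intersection, 0.0) + mass_b * mass_c
            else:
                # Mass assigned to contradicting hypotheses.
                conflict += mass_b * mass_c
    if 1.0 - conflict <= 1e-12:
        raise ValueError("the two sources are in total conflict")
    return {focal: mass / (1.0 - conflict) for focal, mass in combined.items()}

VOWEL = frozenset({"vowel"})
NASAL = frozenset({"nasal"})
EITHER = VOWEL | NASAL  # ignorance: the evidence does not discriminate

source_1 = {VOWEL: 0.6, EITHER: 0.4}   # hypothetical first source of evidence
source_2 = {NASAL: 0.3, EITHER: 0.7}   # hypothetical second source of evidence
joint = dempster_combine(source_1, source_2)
```

In this made-up example the conflicting mass is 0.18, so the remaining products are divided by 0.82 and the joint basic assignment again sums to one.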
2.10 Summary
In this chapter we examined different concepts of speech. We also reviewed the uncertainty principle and time-frequency representation methods. In chapter 6 we will see how different phonemes take the form of a smeared energy shape in a speech spectrogram. An expert with intricate knowledge of the speech process can read the spectrogram and make sense of it. We will also see in chapter 6 how the morphological image processing tools presented in the following chapter allow us to extract information from the speech spectrogram even though it is corrupted by the uncertainty principle and by the vertical lines caused by the pitch. Finally, appendix A gives an idea on how to utilize the information given in table 2.1 in a Fuzzy Logic based expert system.
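The triangular membership functions of section 2.5.2 can be sketched for the first formant as follows. The f1 values are copied from table 2.1. Each triangle peaks at the average of its phoneme and reaches zero at the two adjacent averages. The treatment of the lowest and highest averages, where one neighbour is missing, is an assumption of this sketch: the missing foot is mirrored from the existing one. All function names are hypothetical, and the sketch is not the script described in the appendix.

```python
# Average first-formant locations in Hz, copied from table 2.1.
F1_HZ = {"/i/": 270, "/I/": 390, "/E/": 530, "/@/": 660, "/a/": 730,
         "/c/": 570, "/U/": 440, "/u/": 300, "/A/": 640, "/R/": 490}

def triangular(x, left, peak, right):
    """Triangle that is 0 at left and right and 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def build_triangles(averages):
    """Map each phoneme to the (left, peak, right) points of its triangle."""
    ordered = sorted(averages.items(), key=lambda item: item[1])
    last = len(ordered) - 1
    triangles = {}
    for i, (phoneme, peak) in enumerate(ordered):
        # Assumption: an outermost triangle mirrors its only neighbour.
        left = ordered[i - 1][1] if i > 0 else 2 * peak - ordered[i + 1][1]
        right = ordered[i + 1][1] if i < last else 2 * peak - ordered[i - 1][1]
        triangles[phoneme] = (left, peak, right)
    return triangles

def grades(f1_hz, triangles):
    """Return the non-zero membership grades of a measured f1 value."""
    result = {}
    for phoneme, (left, peak, right) in triangles.items():
        grade = triangular(f1_hz, left, peak, right)
        if grade > 0.0:
            result[phoneme] = grade
    return result

TRIANGLES = build_triangles(F1_HZ)
midpoint_grades = grades(285, TRIANGLES)  # halfway between /i/ and /u/
```

A measurement that falls between two averages receives a grade from at most two membership functions, as stated in section 2.5.2. The same construction would apply to f2 and f3, provided that equal averages (such as the repeated f3 values in table 2.1) are handled separately.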
Chapter 3 Mathematical Morphology
Mathematical morphology is a theoretical model that justifies particularly useful operators in image processing applications. Based on lattice and set theory and axioms, mathematical morphology provides solutions to handling of specific geometrical objects in different topological spaces. Basic operators are used to construct more complex operators while all operators rely on structure elements that serve as geometrical modular units to the different morphological operators. We first begin with a historical overview of Mathematical morphology. We continue with basic axioms and definitions that exemplify the importance of the different morphological operators. We review the morphological operators starting with the most basic operators of dilation and erosion, and finally we conclude with the algorithmic scheme of the Watershed Transform that is used for image segmentation. We end this chapter by showing the close relationship between Fuzzy Logic and Mathematical Morphology and by concluding on the importance and relevancy of mathematical morphology to automated spectrogram reading.
3.1 History of Mathematical Morphology
Mathematical morphology was developed in 1964 by Jean Serra while doing his PhD work with Georges Matheron [21, 22]. Serra was researching the iron deposits of Lorraine. The idea was to use the particular previously known shape of the minerals in question in order to identify and classify these minerals in images. A method to distinguish between different shapes was needed and the Hit And Miss transform was developed to identify specific shapes in the image. Opening and the essence of a structure element were investigated. In the 70's mathematical morphology was further developed where recursive algorithms were implemented such as ultimate erosion, binary thinning, Skeleton by Influence Zones, Watershed Transform and more. Two notable improvements occurred in the 80's. The first was the establishment of the foundations of mathematical morphology with respect to the mathematical field of complete lattices and graph theory. The second was the numerous books and industrial products and applications based on mathematical morphology. The 90's made mathematical morphology an important tool in segmentation through the refinement of existing algorithms. Non-linear mathematical morphology based filtering and different connectivity methods (topology) were developed. Today, mathematical morphology is an important tool in industrial applications that require fast processing and detection of objects in acquired images. For example, in Biomedical imaging mathematical morphology is used in detecting different objects in an image. In the oil industry, in order to address the issue of sand and oil ore, mathematical morphology is used to estimate both intensity and range, etc.
3.2 Useful Properties
The following sections are based on [21, 23, 24]. In the following, $\psi$ denotes a morphological operator and $X$ an image.
Extensivity: Applying an extensive operator increases the number of foreground pixels
(2.1)  $X \subseteq \psi(X)$
Antiextensivity: Applying an antiextensive operator decreases the number of foreground pixels
(2.2)  $\psi(X) \subseteq X$
Idempotence: Reapplying an idempotent operator does not change the number of foreground pixels
(2.3)  $\psi(\psi(X)) = \psi(X)$
Nonlinearity: Morphological operators are nonlinear (except cardinal cases). In general they do not follow the linearity mapping condition:
(2.4)  $\psi(a_1 x + a_2 y) = a_1 \psi(x) + a_2 \psi(y), \quad \forall a_1, a_2$
The linearity mapping condition is a necessary and sufficient condition for the linearity of an operator. One consequence of the nonlinearity is that in many cases an inverse does not exist since there is loss of information. Reconstruction of an image that has undergone a morphological operator is in most cases impossible. An example of a nonlinear operator that does have an inverse is the Medial Axis Transform that will be described later in the chapter.
Duality Principle: In general, morphological operators exist in pairs. For example, dilation and erosion, opening and closing, thinning and thickening. Other duality properties are:
Self dual
(2.5)  $\psi(X^C) = [\psi(X)]^C$
Invariant under duality
(2.6)  $\psi(X^C) = \psi(X)$
In the following sections and examples we will use the circles image, which is a standard benchmark binary image from the Matlab™ database, as the original image and the unit radius disc as the structure element. The structure element (SE) and the benchmark circle image are as follows:
(2.7)  $SE = \begin{bmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}$
Each element with a logical value of '1' represents a foreground pixel while each element with a logical value of '0' represents a background pixel.
Figure 3.1: Circles benchmark image.
3.2.1 Connectivity
Since we are using a digital grid instead of a continuous space, we need to perform quantization to a discrete area/volume. The quantization takes the form of pixels (picture elements) which are a lowpass smoothed result of the actual continuous image. In order for us to define a distance between two pixels or if one pixel is a neighbor of another one, we need to specify a connectivity. We will concentrate on the two-dimensional case although connectivity is a well-defined concept in higher dimensions. A straightforward way to describing connectivity is by specifying in a matrix form a logical '1' to all pixels that are neighbors of the current pixel. The current pixel is the center of the matrix and is also denoted by a logical '1'. For example we show a diagonal (and in this case symmetrical) connectivity:
(2.8)  $\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}$
An example of an asymmetrical connectivity is:
(2.9)  $\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}$
The two most common connectivity schemes in 2D are "4-connected" and "8-connected". An example of "4-connected":
(2.10)  $\begin{bmatrix} 0 & 1 & 0 \\ 1 & 1 & 1 \\ 0 & 1 & 0 \end{bmatrix}$
An example of "8-connected":
(2.11)  $\begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$
The "8-connected" case is also known as fully-connected.
3.2.2 Erosion
Erosion is a basic morphological operation. A kernel typically using a logical '1' as an indicator to the Structure Element (SE) is used. The objective is to find objects in the image that exactly correspond with the SE. We regard each time the center kernel point as the reference point and output a logical '1' in case the SE is fully contained within the image with regard to that reference point. The operation is equivalent to a logical AND in the binary case. The output image will contain all the points in which the SE is fully contained in the original image. The image is therefore eroded with respect to the SE. Hence erosion is an antiextensive operator.
3.2.3 Ultimate Erosion
Consider eroding an image over and over again until an idempotent state is reached. The Ultimate Erosion is the union of all differences of erosion and reconstruction using opening at all stages. The Ultimate Erosion is important in robust marker selection which is usually a part of the watershed transform. In the following example, ultimate erosion was calculated. A negative of both the circles image and the ultimate erosion are presented:
Figure 3.2: Circles before (left) and after applying ultimate erosion.
We obtain the centers of all 13 circles and some additional spurious points. It is possible, using an additional simple restriction that limits the circle distance to be in the range of the radii (the most common distance between each pair of points), to obtain the exact location of the circles' centers. Using that information we can obtain a good estimate on the number of circles in the original image.
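The binary erosion of section 3.2.2 can be sketched with sets of pixel coordinates. This is a minimal illustration and not the Matlab implementation used in this work: a binary image is assumed to be represented by the set of (row, column) positions of its foreground pixels, and the SE by a set of offsets relative to its reference point. The cross-shaped SE follows equation (2.7); the test image is made up.

```python
# Offsets of the cross-shaped SE of equation (2.7), relative to its center.
CROSS = {(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)}

def erode(image, se):
    """Return every reference point at which the shifted SE lies fully inside image."""
    if not se:
        raise ValueError("the structure element must not be empty")
    # Any valid reference point is some foreground pixel minus some SE offset.
    dr0, dc0 = next(iter(se))
    candidates = {(r - dr0, c - dc0) for (r, c) in image}
    return {(r, c) for (r, c) in candidates
            if all((r + dr, c + dc) in image for (dr, dc) in se)}

# A made-up image: a 5x5 block of foreground pixels and one isolated pixel.
square = {(r, c) for r in range(5) for c in range(5)}
image = square | {(7, 7)}
eroded = erode(image, CROSS)
```

The logical AND of the text appears here as the `all(...)` test: a point survives only if every pixel under the SE is a foreground pixel. The block shrinks to its 3x3 interior and the isolated pixel disappears, which illustrates the antiextensive behaviour of erosion.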
3.2.4 Dilation
Dilation is the dual operator of erosion. Dilation follows the same scanning of the image by a kernel in which the SE is indicated by a logical '1'. However, a logical OR is used in the binary case and a resulting '1' is written to the output image in case there is at least one pixel in the image that corresponds with the SE. Therefore dilation is an extensive operator. Usually the same structure element is used for both dilation and erosion. However, in general, dilation is not the inverse operator of erosion unless the opening is idempotent with regards to the image and the SE. In the same manner erosion is, in general, not the inverse operator of dilation unless the closing is idempotent with regards to the image and the SE.
3.2.5 Morphological Gradient
Also known as the Beucher Gradient, the morphological gradient is defined as:
(2.12)  $g(f) = (f \oplus B) - (f \ominus B)$
An approximation of a gradient would in most cases require computing two directional gradients in the horizontal and vertical direction. Computing the directional gradients can be performed using a Sobel kernel to convolve the image and then combining the results of the vertical and horizontal "derivatives" to obtain the direction of the gradient at each point. Using mathematical morphology to compute the gradient allows a nonlinear granular geometric approach. The morphological gradient is used in determining the boundaries of an object which can be of particular importance in segmentation algorithms such as the Watershed Transform. When detecting objects with specific (known) geometrical boundaries the advantage of a Beucher gradient is evident by closely tracking the boundary through the use of the structure element. An example of using a Beucher gradient to the circles image corrupted by white Gaussian noise:
Figure 3.3: Circles image corrupted by white Gaussian noise after applying Beucher gradient.
After a simple threshold we are left with the image borders. To perform reconstruction of the original image, in this case, we can use a flooding algorithm that identifies the interior of the objects and fills in the gaps.

3.2.6 Opening

(2.13) Open(Im, SE) = Im ∘ SE

Opening is performed by dilating the result of an erosion of an image with the same structure element (SE). Opening is normally used to open an image and clean it from salt noise, and to separate between different object types (lines in specific directions, specific objects). Simply eroding the image will reduce pixels from 'clean' image objects; it is therefore necessary to perform dilation to reconstruct these objects. The salt noise will disappear after erosion and will not grow due to the subsequent dilation. Normally after opening, features smaller than the structure element are removed, while features larger than the SE remain about the same. The morphological opening presented is idempotent. It is also anti-extensive and increasing. In algebra every operation which is increasing, anti-extensive and idempotent is called opening:

(2.14) $\gamma_B \gamma_B = \gamma_B$.

Closing is the dual operator of opening. The following example demonstrates how opening can be used to reduce Salt and Pepper noise:

Figure 3.4: Circles image corrupted by Salt and Pepper noise (left) and after opening.
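As a hedged illustration of the opening in Section 3.2.6, the sketch below erodes and then dilates a small binary image with the same SE. The helper names (`_under_se`, `erode`, `dilate`, `open_image`), the 3x3 square SE and the test image are assumptions made for this example. The SE is assumed symmetric, so no reflection is needed in the dilation.

```python
def _under_se(image, se, r, c):
    """Yield the image values under the active SE cells centered at (r, c).

    Pixels outside the image are treated as background (an assumption).
    """
    rows, cols = len(image), len(image[0])
    ref_r, ref_c = len(se) // 2, len(se[0]) // 2
    for i, se_row in enumerate(se):
        for j, active in enumerate(se_row):
            if active:
                rr, cc = r + i - ref_r, c + j - ref_c
                inside = 0 <= rr < rows and 0 <= cc < cols
                yield image[rr][cc] if inside else 0


def erode(image, se):
    # Logical AND: the SE must be fully contained in the foreground.
    return [[int(all(_under_se(image, se, r, c)))
             for c in range(len(image[0]))] for r in range(len(image))]


def dilate(image, se):
    # Logical OR: at least one foreground pixel under the (symmetric) SE.
    return [[int(any(_under_se(image, se, r, c)))
             for c in range(len(image[0]))] for r in range(len(image))]


def open_image(image, se):
    return dilate(erode(image, se), se)


SQUARE = [[1, 1, 1],
          [1, 1, 1],
          [1, 1, 1]]

# Hypothetical image: a 3x3 object plus one isolated "salt" pixel at (2, 5).
noisy = [[0, 0, 0, 0, 0, 0, 0],
         [0, 1, 1, 1, 0, 0, 0],
         [0, 1, 1, 1, 0, 1, 0],
         [0, 1, 1, 1, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0]]

clean = [row[:] for row in noisy]
clean[2][5] = 0

opened = open_image(noisy, SQUARE)
```

The salt pixel is smaller than the SE and disappears after the erosion, while the 3x3 object is restored by the subsequent dilation. Applying `open_image` a second time leaves the result unchanged, in line with the idempotence noted above.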
3.2.7 Closing

(2.15) Close(Im, SE) = Im • SE

Some Image Processing algorithms create holes in the image. These holes, as well as pepper noise, can be treated using the closing operator. Again, simply dilating the image will expand objects that are 'clean'. Performing the additional erosion with the same SE will reduce the distortion caused to the clean objects while filling in the holes. The structure element should be designed according to the holes needed to be filled in. Eroding the dilated image creates a smoother image that also has better (less chaotic) skeleton properties. The morphological closing presented is idempotent. It is also extensive and increasing. In algebra, every operation which is increasing, extensive and idempotent is called closing:

(2.16) $\varphi_B \varphi_B = \varphi_B$.

3.2.8 Hit and Miss

(2.17) HitAndMiss(Im, SE) = Im ⊛ SE

The Hit and Miss Transform can be used to find specific objects in an image. Constraining both background and foreground pixels allows one to identify a specific shape. The transform is equivalent to finding the exact foreground object (erosion with a foreground requirement) while matching the background objects (erosion of the background requirement on the image background). Foreground pixels are usually denoted with a logical '1', background pixels with a logical '0' and pixels that can be considered either as foreground or background as a don't care.

Examples: A kernel for finding a corner:

(2.18) 1 0 1 1 0 0

A kernel for finding pixels that are connected to at least one more pixel in a "4-connectivity" lattice (using OR for the results of applying the kernel with 4 rotations of 90°):
(2.19) 1 1

A kernel for locating an endpoint of a "4-connectivity" skeleton (using OR for the results of applying the kernel with 4 rotations of 90°):

(2.20) 0 1 0 0 1 0

3.2.9 Thinning

Thinning is the operation of removing foreground pixels from the image. We use an extended-type structure element as the one described for the Hit and Miss Transform that contains zeros, ones and don't care elements. The logical relation between thinning and the Hit and Miss Transform is:

(2.21) Thinning(Im, SE) = Im − HitAndMiss(Im, SE)

where

(2.22) Im − HitAndMiss(Im, SE) = Im ∩ ¬HitAndMiss(Im, SE).

We would usually employ thinning in an iterative sequence where the thinning mask differs at each sequence to produce a more refined result. Usually thinning results in some form of skeleton of the image. However, the process of thinning is different from that of generating a skeleton.

3.2.10 Thickening

Thickening is the dual of thinning and is used to grow foreground pixels in the image. It also uses an extended-type structure element and is related to the Hit and Miss Transform through the following equation:

(2.23) Thickening(Im, SE) = Im ∪ HitAndMiss(Im, SE).

Using DeMorgan's law we see that thickening an image with a structure element SE is equivalent to thinning of the inverse image with the same structure element.
3.2.11 Pruning

Pruning reduces branches that are shorter than a specific length. A branch is a group of pixels connected according to the connectivity method that corresponds to the geometry of the problem and in which each pixel except the end pixel is connected to two other pixels while the end pixel is connected only to one pixel. Branches are usually common in applications involving skeletons and in particular when the image is noisy. Closing was presented as a way to reduce the complexity of a skeleton and therefore reduce the number of branches a skeleton has. Pruning is typically performed after thinning or skeleton operations. Pruning can be regarded as a particular case of thinning; however, due to its importance and unique description it is regarded as a separate operator. Pruning is usually performed for a finite number of iterations, usually equal to the length of the longest branch we would like to remove. Pruning run through an infinite number of runs will converge when the image contains only closed loops. It is important to choose a correct branch length that on one hand would eliminate the obscure branches but on the other hand would preserve the original lines in question.

3.3 Distance Transform

The distance transform produces the distance of each foreground pixel from the nearest boundary according to the selected connectivity. Ridges of the distance transform can be considered as local maxima and represent the skeleton of the object in question. It is possible to erode an image by a structure element that is a disk of a certain radius r simply by removing the pixels that attain a value less than r. Consecutive erosions in this case would be equivalent to alpha-cuts of the distance transform in intervals of size r.

3.3.1 Continuous Case

Assuming the image f is an element of the space C(D) of real twice-continuously-differentiable functions on a connected domain D with only an isolated number of critical points (of zero measure), we have:

(2.24) $T_f(A, B) = \inf_{\gamma} \int_{\gamma} \|\nabla f(\gamma(s))\| \, ds$.

The infimum is over all paths $\gamma$ (smooth curves of measure 1) inside the domain D with

(2.25) $\gamma(0) = A$ and $\gamma(1) = B$.
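The remark in Section 3.3 that thresholding the distance transform acts like an erosion can be tried on a small grid. The sketch below is an assumption-laden illustration: it uses a two-pass city-block (4-connectivity) distance, the names `distance_transform` and `erode_by_radius` are invented here, and background pixels are given distance 0 so that the outermost foreground pixels have distance 1. Under that convention, eroding by a city-block disk of radius r keeps the pixels whose distance exceeds r.

```python
def distance_transform(image):
    """City-block distance of every foreground pixel to the nearest
    background pixel, computed with a forward and a backward pass."""
    rows, cols = len(image), len(image[0])
    inf = rows + cols  # larger than any possible city-block distance
    dist = [[inf if image[r][c] else 0 for c in range(cols)]
            for r in range(rows)]
    # Forward pass: propagate distances from the top and left neighbors.
    for r in range(rows):
        for c in range(cols):
            if r > 0:
                dist[r][c] = min(dist[r][c], dist[r - 1][c] + 1)
            if c > 0:
                dist[r][c] = min(dist[r][c], dist[r][c - 1] + 1)
    # Backward pass: propagate from the bottom and right neighbors.
    for r in reversed(range(rows)):
        for c in reversed(range(cols)):
            if r < rows - 1:
                dist[r][c] = min(dist[r][c], dist[r + 1][c] + 1)
            if c < cols - 1:
                dist[r][c] = min(dist[r][c], dist[r][c + 1] + 1)
    return dist


def erode_by_radius(dist, radius):
    """Keep only the pixels that lie deeper than `radius` inside the object."""
    return [[int(d > radius) for d in row] for row in dist]


# Hypothetical example: a 5x5 object inside a 7x7 image.
image = [[1 if 1 <= r <= 5 and 1 <= c <= 5 else 0 for c in range(7)]
         for r in range(7)]
dist = distance_transform(image)
eroded = erode_by_radius(dist, 1)
expected = [[1 if 2 <= r <= 4 and 2 <= c <= 4 else 0 for c in range(7)]
            for r in range(7)]
```

Thresholding at successive radii gives nested shrinking objects, which is the alpha-cut view of consecutive erosions described above.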
3.3.2 Discrete Case

In the discrete case the result would depend on the connectivity chosen. The distance between two points is the path that lies totally within the object and leads to a minimum distance result. Since connectivity rules change in discrete topology, we have more paths to check when the connectivity is higher. The process shown in equation 2.28 that describes the watershed transform computation assists in understanding the distance function.

3.4 Skeleton

The skeleton is the locus of all centers of bitangent circles that fit entirely within the foreground pixels. The Medial Axis Transform (MAT) is the gray-level image that represents the radii of these bitangent circles. The skeleton provides a compact representation of a shape. The skeleton is not a one-to-one transform (the MAT is). Unlike the skeleton, the MAT can also be used to exactly reconstruct the original shape. Obviously, a skeleton in the non-Euclidean case (digital) does not preserve continuity if computed according to the mathematical definition and therefore is not a homotopic transformation (i.e., not a continuous transformation). To solve that problem we use the following transforms: HitAndMiss, Thinning, Thickening and the Distance Transform.

Notable drawbacks of the skeleton/MAT are:

1. Highly sensitive to noise: even small irregularities in the shape will cause large distortions in the skeleton since each irregularity has to be included within a bitangent circle. It is necessary to pre/post-process the image (see the closing and pruning operators for example) in order to obtain a less complex skeleton.

2. High computational complexity: typically computed using either a distance transform or a constrained thinning.

There have been several attempts to use the MAT and the skeleton in lossy or lossless image compression algorithms; however, due to the aforementioned drawbacks they were not implemented in any standard image compression algorithm.

In the following example we take a clean binary image of the printed acronym INRS and run the skeleton algorithm on it. We then approximate the length of the prunes and perform an iterative pruning algorithm to produce a cleaner skeleton.
Figure 3.5: Original image (top), applying a skeleton (middle), pruning the skeleton (bottom).

In this case the difference between the MAT and the skeleton is small. Since in general the width of printed characters is uniform, we can use the skeleton to reconstruct the original written text.
Compressing text is usually done by using the LZW algorithm. However, compressing handwritten text can be done using this method.

We conclude the example by assuming a Salt and Pepper (S&P) noise added to the original text image. The S&P noise is often caused in the process of acquiring data through scanning devices. We see that the skeleton is highly sensitive to noise and therefore it is essential to perform cleaning before calculating the skeleton. Another conclusion is that in case of a more severe noise it could be better to take a different strategy of segmentation such as the Watershed Transform that will yield a cleaner image that can later be processed using the skeleton.
Figure 3.6: Example of cleaning and extracting information from an image using morphological operators: (a) Original image corrupted by Salt and Pepper noise. (b) A very noisy result after applying the skeleton operator. (c) Result of cleaning (a) by employing the closing and opening operators. Cleaning the S&P noise is done by a closing with a disk of a unit radius (identical to the 4-connectivity kernel) and opening with an 11 pixel-length radius. (d) Result of applying the skeleton operator to (c). A much clearer skeleton is obtained. The result can be pruned to obtain a more compact skeleton.
Skeleton by Influence Zones (SKIZ)

Skeleton by Influence Zones, also known as a Generalized Voronoi Diagram, consists of all points which are equidistant, in the geodesic distance sense, to at least two nearest connected components. The geodesic distance between a and b within A is the minimum (in the continuous case, infimum) path length between two points {a, b} ∈ A among all paths in A. In the case of a digital grid it may be that no such point exists due to quantization effects.

3.5 Watershed Transform

Watershed: A region or area bounded peripherally by a divide and draining ultimately to a particular watercourse or body of water. [25]

The Watershed Transform is a morphological-based algorithm to segment images. Segmentation is not a mathematically defined term and therefore it is hard to compare one segmentation algorithm to another. The watershed transform is common in numerous segmentation applications, notably in medical image processing; for example, in MRI (Magnetic Resonance Images) it is used to identify different tissues and aid in detecting tumors. The strength of the watershed transform can be attributed to the capability of performing segmentation of objects even in the case of partial occlusions [26, 27]. Many variants of the watershed transform exist in the literature. In the following, we present a theoretical description of the watershed developed for the continuous case. The notations have been kept as in [28] to facilitate the understanding of the algorithm.

Assuming again the image f is an element of the space C(D) of real twice continuously differentiable functions on a connected domain D with only an isolated number of critical points (of zero measure), we have the following definition: Let $f \in C(D)$ have minima $\{m_k\}_{k \in I}$ for some index set I. The catchment basin $CB(m_i)$ of a minimum $m_i$ is defined as the set of points $x \in D$ which are topographically closer to $m_i$ than to any other regional minimum $m_j$:

(2.26) $CB(m_i) = \{x \in D \mid \forall j \in I \setminus \{i\}: f(m_i) + T_f(x, m_i) < f(m_j) + T_f(x, m_j)\}$.

The watershed of f is the set of points which do not belong to any catchment basin:

(2.27) $\mathrm{Wshed}(f) = D \cap \left(\bigcup_{i \in I} CB(m_i)\right)^c$.
Let W be some label, $W \notin I$. The watershed transform of f is a mapping $\lambda : D \to I \cup \{W\}$ such that $\lambda(p) = i$ if $p \in CB(m_i)$ and $\lambda(p) = W$ if $p \in \mathrm{Wshed}(f)$. As a result of the watershed we obtain a labeled image. Each catchment basin is labeled with a different number and a special label W is assigned to all points of the watershed of f. The watershed pixels are the output of the watershed transform.

Computing the watershed transform on a digital grid can be a tedious task. In particular, large parts of the image can be flat plateaus. The distance between pixels in a plateau is zero and an additional ordering is therefore required. One possible way to overcome this problem is by a simulated immersion. A flooding algorithm is used to help associate different pixels with different catchment basins. The thresholds are performed on the gray-levels of the image, a similar operation to performing different alpha-cuts on a fuzzy membership function. By performing threshold and computation at each level we are able to associate, at a relatively low computational cost, each pixel with its catchment basin. The process stops when the whole image is 'flooded', that is, when we threshold with the maximal value in the image.

In the following example, we first find the basins, which in this case are the two lowest points denoted by A and B. Then at each stage we threshold the image and according to a fully-connected scheme we associate each plateau point with its nearest basin. In case a plateau point has the same distance to both basin-associated points it is considered a (temporary) watershed and is denoted by W. We see that a temporary watershed value can become associated with a basin when a closer basin-associated point is created in the iteration.

(2.28)
original   h = 0    h = 1    h = 2    h = 3
3 2 2      3 2 2    3 2 2    3 B B    B B B
3 1 1      3 1 1    3 W B    3 W B    B B B
0 1 0      A 1 B    A W B    A W B    A W B
3.6 The relationship between Fuzzy Logic and Mathematical Morphology

A close relationship exists between Fuzzy Logic and Mathematical Morphology since both theories are extensions of set theory and rely on similar mathematical axioms [24, 29].

1. Both theories rely on a basic union operation and its dual intersection. In Fuzzy Logic we have as the logical conjunction (logical AND) a T-norm and as the logical disjunction (logical OR) an S-norm, also known as a T-conorm. A common choice for basic norms is the dual pair min and max functions. In Mathematical Morphology the dual pair Dilation and Erosion play the part of Union and Intersection. The flexibility in selection is through a structuring element (SE) that is in general the same for both operations.

2. The flexibility in constructing the basic operators gives both theories their power. These operators are in general nonlinear and therefore have some advantages in terms of noise and outlier measurements.

3. Composing the basic operations yields more complex operators. In Fuzzy Logic a conclusion is usually deduced by taking the T-conorm of all T-norms. In Mathematical Morphology taking the dilation of the erosion is called an opening operator.

4. Both theories are based on human intuition and expertise in order to construct the system. In Fuzzy Logic a set of rules as well as basic norms and aggregation methods are specifically chosen based on the task in question. In Mathematical Morphology, the structuring element can be easily modified to capture the characteristics of the objects in question. The structure element can be regarded as a measure of belief the system designer has in a certain geometric specification for the objects to be investigated.

Gray-level extensions to Mathematical Morphology

Many Morphological operators have been extended to deal with gray-level images. A common technique uses Fuzzy Logic as its basis and assumes that each gray-level represents a membership value. Alpha-cuts are performed on the gray-level image, reducing it to binary threshold images that correspond with the original Mathematical Morphology framework. Another technique uses basic fuzzy logic operators to construct the dilation and erosion building blocks. Generalizing to the gray-level case, we obtain degrees of uncertainty for values in the range of [0, 1].
3.7 Summary

This chapter gives a description of some of the more common morphological image processing operators. The history of Mathematical Morphology helps to understand the ways morphological operators are used and gives motivation to use mathematical morphology in different contexts. The relationships between the different operators are presented and the justifications for using each operator are given. In particular the development of more complex compound morphological operators from basic operators is demonstrated. Specific examples demonstrate the usefulness of mathematical morphology in solving everyday image processing tasks.

Mathematical Morphology, like Fuzzy Logic, is a nonlinear theory based on set theory. A note on the association between these two theories is presented to show that different approaches to the same problem may actually rely on the same methodology but with a different formulation. Since the speech spectrograms are considered very noisy images with specific geometrical objects that convey specific information, we tend to favor using mathematical morphology when constructing a recognition algorithm. Indeed in this work we choose to use mathematical morphology due to its high performance both in terms of runtime complexity and in results. Using morphological operators extended to grayscale images allows us to extract relevant information from our input speech spectrogram images.
Chapter 4 Fuzzy Logic

"So far as the laws of mathematics refer to reality, they are not certain. And so far as they are certain, they do not refer to reality." - Albert Einstein

We start the chapter with a short historical overview and by explaining the importance of Fuzzy Logic and Evidence Theory to an expert system-based speech recognition system. We continue with theoretical background that explains fundamental concepts in Fuzzy Logic such as fuzzy variables, fuzzy union, fuzzy intersection and fuzzy complement, alpha cuts, fuzzification and defuzzification. We show how different sources of information can be combined using the Dempster-Shafer rule of combining evidence and afterwards demonstrate the capabilities of the Matlab Fuzzy Logic toolbox. Finally, we conclude the chapter with an overview of different terms that describe situations that are not well defined, namely, vagueness and ambiguity.

4.1 Introduction

Fuzzy Logic can be traced back to the preaching of Buddha in about 500 BC that claimed that everything contains something from its opposite. In contrast to East-Asian philosophy, about 200 years later, the Greek philosopher Aristotle developed the well known binary logic. The Western-European philosophy adopted binary logic to daily life, to distinguish between good and bad. In 1964 Dr. Lotfi Zadeh, a professor from Berkeley University, developed a theory that would make it easier to translate problems easily understood by humans to a computer that relies on binary logic to make decisions [30].

Fuzzy Logic may be viewed by some as an extension to binary logic, perhaps as a multi-value logic. However, Fuzzy Logic is derived from Fuzzy Sets and has much stronger practical advantages than merely extending the possible output levels of a logical statement. Consider for example Eubulides' pile of sand paradox. How many sand grains does it take to obtain a pile of sand? This problem is defined in vague terms, therefore even a multiple level solution to it is inadequate. We need to have more understanding of the linguistic terms of the paradox and use some fuzzy technique to solve the paradox. We will see in the following that a membership function devised by an expert can give a possible solution to the paradox.
Today Fuzzy Logic applications are employed in numerous commercial devices; some examples are: rice cookers, laundry machines, air conditioners, cameras, automatic braking systems and many more. An example of a hardware implementation of a controller by Toshiba is the EJ1000FN [31]. It can control up to 8 elevators and helps minimize waiting times. A neural network based forecasting model helps predict variations that occur over large time intervals while a fuzzy logic set of rules handles uncertainty and ambiguity that occurs over short time intervals. The processor can be programmed to operate under different requirements and the result is a shorter wait time under the constraints and limitations imposed by the building manager.

4.2 Fuzzy Logic vs. Boolean Logic

Boolean logic as well as crisp set theory and crisp probability theory are inadequate to deal with the imprecision, uncertainty and complexity of the real world. The limitations of crisp (Boolean) theories are that no room is left to disambiguate or for unknown cases. If we have prior information about a set of events, we would like to have a model that allows us to use that knowledge. On the other hand if we do not have information about other events, modeling them as equally likely using the uniform distribution is not always the right choice. Fuzzy logic was developed to deal with these types of situations and to allow us to easily write and later implement logical rules that are the result of an expert's knowledge.

For example assume we would like to know if a person is tall. We might decide on a specific fixed threshold of 1.70 meters and regard all candidates with a height greater or equal to 1.70 meters to be tall. The threshold can be determined heuristically or through median, average or other statistical characteristics of a certain population. Evidently the problem with this method is that two persons that differ by 1 cm will be categorized into two different groups.

In order to overcome the crisp boundaries of the set "tall" and its complementary set "not tall", we construct a fuzzy membership function that assigns values ranging from zero to one that represent the tallness of a person. We will assign values to the extreme cases, '1' to represent that the person is clearly in the set of tall people and '0' to represent that the person is clearly not a member of the set of tall people. A person with height over 1.90 meters will be considered tall in most cases (the population of basketball players is a good counterexample) while a person shorter than 1.50 meters will not be considered tall. All other heights will be assigned a mid-range value. In fact the crisp (Boolean) variable "tall" has been converted to a fuzzy variable that can take values on the interval [0, 1].
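The contrast between the crisp threshold and the fuzzy variable "tall" can be made concrete with a short sketch. The 1.50 m, 1.70 m and 1.90 m values come from the text. The linear ramp between the two extremes and the function names are assumptions made only for this illustration, since an expert could choose any other monotone shape.

```python
def tall_crisp(height_m, threshold=1.70):
    """Boolean 'tall': a single fixed threshold."""
    return height_m >= threshold


def tall_membership(height_m, low=1.50, high=1.90):
    """Fuzzy 'tall': 0 below `low`, 1 above `high`, and an assumed
    linear ramp for the mid-range heights in between."""
    if height_m <= low:
        return 0.0
    if height_m >= high:
        return 1.0
    return (height_m - low) / (high - low)


# Two persons that differ by 1 cm around the crisp threshold.
crisp_pair = (tall_crisp(1.69), tall_crisp(1.70))
fuzzy_pair = (tall_membership(1.69), tall_membership(1.70))
```

With the crisp rule the two persons fall into different groups, whereas their fuzzy membership grades differ only slightly.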
Converting a crisp problem to a fuzzy one is called fuzzification. A Fuzzy variable can be declared for the following reasons: predicates (expensive, old, rare, dangerous etc.), quantifiers (many, few, almost all, usually, and the like) and truth values (quite true, more or less true, mostly false etc.).

The previous example required converting a single crisp variable into a fuzzy variable. We now examine a person's weight to determine if the person is slim. Instead of a single threshold we now have two thresholds since a person may either weigh more or less than the weight considered to be categorized as "slim". More interestingly, we have a dependency between a person's height and weight in the sense that we would expect a slim and tall person to weigh more than a slim but shorter one. We can construct the membership functions of the "weight" variable as a function of the fuzzy variable "tall" and create a Fuzzy set of type 2. We can also quantize the height and regard each subsection of it to have a different membership function. For example, we can quantize the height into the following fuzzy variables: {Below Average, Average, Tall, Very Tall}.

Another important concept in Fuzzy Logic is that of linguistic hedges. For example we can apply the linguistic hedge 'very' to the fuzzy set 'tall' to create the fuzzy set 'very tall'. We can afterwards interpret the hedge in some predefined way. We can choose that the hedge 'very' would correspond to squaring the membership function. Since membership values are between zero and unity, squaring the values will decrease their membership in the set in a nonlinear manner. Other examples of hedges are: 'slightly', 'more or less', 'rather', 'extremely', etc. Depending on the problem we have the flexibility of choosing different functions for different hedges as long as we remain consistent in our definitions.

Another example of the importance of Fuzzy Logic and Fuzzy set theory can be seen through the "student dilemma". The student dilemma's first implication is "The more I study, the more I know". The second implication is "The more I know, the more I forget". Thus using crisp deduction the student stipulates that "The more I study, the more I forget", which may lead to the conclusion that learning decreases knowledge. However, using Fuzzy Logic operators to link the different assumptions, we would obtain a value for the knowledge forgotten that should be low in comparison to the value for the knowledge learnt. Instead of a crisp threshold that, once triggered, would produce a certain result, we would conclude that the part of the knowledge forgotten is negligible in comparison to the knowledge gained.

There are many methods by which a membership function can be constructed. An expert may assign values based on previous experience with the problem. Statistical methods may help in giving a lower bound to a membership function. A neural network can be trained to tune membership functions in a way that would produce a desired result. In general, we will use expert
knowledge when it is available and when it is feasible to construct each membership function in the problem at hand. We can always further tune a predetermined membership function using a neural network. Statistical methods can be justified by evidence theory methods that require that the membership function should be an envelope function of the plausibility function, which is an upper bound on all probability density functions that could be associated with the specific problem. In the same manner a belief function serves as the lower bound for all possible probability density functions that can be assigned to the problem. We denote by the plausibility function and by the membership function and obtain the following relationship [32]:
4.1 u U , X u A u
i
In the following we present prototypes of different membership functions. The values of the vector P stipulate the transition band(s) just as in DSP filter design.
Figure 4.1: Examples of common parametric membership functions: (a) Sigmoid membership function. (b) Pi membership function. (c) Z membership function. (d) Triangular membership function.
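Two of the prototypes named in Figure 4.1, together with the 'very' hedge of Section 4.2, can be sketched as follows. The parameterizations (a slope and a center for the sigmoid, three breakpoints for the triangle) are common textbook forms and are assumptions here; they are not necessarily the parameter vector P used to draw the figure.

```python
import math


def sigmoid_mf(x, slope, center):
    """Sigmoid membership function; `center` is the 0.5 crossover point
    and `slope` controls the width of the transition band."""
    return 1.0 / (1.0 + math.exp(-slope * (x - center)))


def triangular_mf(x, left, peak, right):
    """Triangular membership function that is 0 outside (left, right)
    and 1 at `peak`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def very(mu):
    """Linguistic hedge 'very', interpreted as squaring the membership."""
    return mu ** 2


tall = sigmoid_mf(1.70, 20.0, 1.70)   # crossover point of a "tall" sigmoid
very_tall = very(tall)                # never larger than `tall`
```

Because membership grades lie in [0, 1], `very` never increases a grade, so 'very tall' is a subset of 'tall' under this interpretation.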
4.3 Alpha cuts
Alpha cuts are simply threshold levels that convert a fuzzy set into a crisp set. The process of converting a fuzzy set to a crisp one is called defuzzification. We have already performed alpha cuts in our previous example when we quantized the “height” fuzzy variable to different subsections. An alpha cut is considered to be a strong alpha cut if the inequality of the threshold is strict. For example x > 0.25 is a strong alpha cut while x ≥ 0.68 is a regular alpha cut. We present a formal definition: Regular alpha cut:
(4.2) $A_\alpha = \{x \in X \mid \mu_A(x) \ge \alpha\}$
Strong alpha cut:
(4.3) $A_{\alpha+} = \{x \in X \mid \mu_A(x) > \alpha\}$
Alpha cuts are mostly used to extract information from a membership function and are rarely used to defuzzify a fuzzy set. Usually several alpha cuts are taken, whereby decreasing the parameter alpha adds more elements to the generated crisp set. Hence the crisp sets generated with different alpha parameters form a nested family of subsets. Section 4.4 introduces a more common way to defuzzify a set.
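The regular and strong alpha cuts, and the nesting obtained when alpha decreases, can be checked on a small discrete fuzzy set. The dictionary representation, the function name `alpha_cut` and the membership grades below are assumptions chosen for illustration.

```python
def alpha_cut(fuzzy_set, alpha, strong=False):
    """Return the crisp set of elements whose membership grade is at
    least alpha (regular cut) or strictly above alpha (strong cut).

    `fuzzy_set` maps each element to its membership grade in [0, 1].
    """
    if strong:
        return {x for x, mu in fuzzy_set.items() if mu > alpha}
    return {x for x, mu in fuzzy_set.items() if mu >= alpha}


# Hypothetical discrete fuzzy set "tall" (height in meters -> grade).
tall = {1.50: 0.0, 1.60: 0.25, 1.70: 0.5, 1.80: 0.75, 1.90: 1.0}

regular = alpha_cut(tall, 0.5)               # {1.70, 1.80, 1.90}
strong = alpha_cut(tall, 0.5, strong=True)   # {1.80, 1.90}
```

Lowering alpha can only add elements, so the cuts at 0.75, 0.5 and 0.25 form a nested chain of crisp sets.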
4.4 DeFuzzification
In order to obtain a result that can be used to make a decision it is necessary to defuzzify the fuzzy set. There are many ways to perform defuzzification. One very common way is called the Center of Gravity (COG).
The COG is defined as:
(4.4) $u' = \dfrac{\int u \, \mu_A(u) \, du}{\int \mu_A(u) \, du}$.
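Equation 4.4 can be approximated numerically by sampling the membership function on a uniform grid, in which case the grid spacing cancels between the numerator and the denominator. The sketch below assumes such uniform sampling; the function name and the sample values are illustrative only.

```python
def center_of_gravity(us, mus):
    """Discrete approximation of the COG on a uniform grid.

    `us` are the sample positions and `mus` the membership grades at
    those positions.
    """
    total = sum(mus)
    if total == 0:
        raise ValueError("membership function is zero everywhere")
    return sum(u * m for u, m in zip(us, mus)) / total


# A symmetric triangular membership function balances at its peak.
symmetric = center_of_gravity([0, 1, 2, 3, 4], [0.0, 0.5, 1.0, 0.5, 0.0])

# Shifting mass to the right moves the COG to the right.
skewed = center_of_gravity([0, 1, 2], [0.0, 1.0, 1.0])
```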
The concept of COG is simple: if we have both the horizontal and vertical locations of the COG and if we regard the area under the membership function to have a mass that is evenly distributed, then by placing a pin at the location of the COG we can (in terms of classical physics, that is, disregarding any inherent uncertainties in measurements) balance the entire mass. The COG is therefore an equilibrium point in terms of mass. Equation 4.4 calculates the horizontal axis value of the center of gravity; in order to find the exact center of gravity we would need to perform another calculation for the vertical axis value.

Replacing the membership function (MF) with a normalized one times a constant K, the constant cancels between the numerator and the denominator and we obtain the mean according to the normalized membership function. Since we obtained a mean value it is the solution of the following optimization problem:

(4.5) $u' = \arg\min_{u'} \int \mu_A(u) \, (u - u')^2 \, du$.

Taking the derivative of the above expression w.r.t. u' and setting it to zero, we obtain the value that is the best in a mean-square sense and is given by the COG. We have made an assumption that the normalized MF can be regarded as a probability density function; however, we keep in mind that the MF is not directly generated from a probability model since it contains vagueness and is a result of non-additive measures. Using the COG would allow us to use a single output value instead of a collection of values with different membership grades.

4.5 Fuzzy Union

In order to combine two fuzzy values in a constructive manner we need to define a union operator. In practical implementations we would aggregate different membership functions according to different weights and rules, and obtain as a result another membership function. The union operator satisfies four axioms in order for it to be logically consistent and serve as a building block to fuzzy logic. We note the similarity between the union and the morphological dilation operator.

Axioms for the union operator:

Axiom u1: $u(0,0) = 0$, $u(0,1) = u(1,0) = u(1,1) = 1$ [conforms to crisp boundaries].
Axiom u2: $u(a,b) = u(b,a)$ [commutative].
Axiom u3: If $a \le a'$ and $b \le b'$ then $u(a,b) \le u(a',b')$ [monotonic].
Axiom u4: $u(u(a,b),c) = u(a,u(b,c))$ [associative].

Two additional axioms are only satisfied by the default union and intersection, that is, by the logical OR and AND operators. Hence, u5 and u6 show the uniqueness of these default operators.

Additional axioms satisfied only by the logical operators OR and AND:

Axiom u5: u is a continuous function.
Axiom u6: $u(a,a) = a$ [u is idempotent].
The dual operator of the fuzzy union is the fuzzy intersection. The intersection complies with all axioms except u1. Changing axiom u1 to accept a logical value '1' only in case both inputs have a logical value of '1' and to accept a logical value '0' in all other cases would give the required additional axiom for the intersection operator. We obtain six axioms, i1-i6, for the fuzzy intersection.

Axioms for the fuzzy complement operator:

Axioms c1, c2 are axiomatic skeletons for fuzzy complements.

Axiom c1: $c(0) = 1$ and $c(1) = 0$.
Axiom c2: c is monotonic non-increasing, that is: $\forall a, b \in [0,1]$: if $a \le b$, then $c(a) \ge c(b)$.
Axiom c3: c is continuous.
Axiom c4: c is involutive (therefore necessarily continuous).

Examples of different types of complement functions that satisfy all 4 axioms presented above:

Sugeno class of fuzzy complement:

(4.6) $c_\lambda(a) = \dfrac{1 - a}{1 + \lambda a}$ where $\lambda \in (-1, \infty)$.
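The Sugeno complement of equation (4.6) is easy to check against the axioms numerically. The short sketch below is only an illustration; the function name and the sampled values of a and lambda are assumptions.

```python
def sugeno_complement(a, lam):
    """Sugeno class of fuzzy complements, valid for lam > -1."""
    if lam <= -1:
        raise ValueError("lambda must be greater than -1")
    return (1.0 - a) / (1.0 + lam * a)


grades = [0.0, 0.1, 0.35, 0.5, 0.8, 1.0]
lambdas = [-0.9, -0.5, 0.0, 2.0, 10.0]

# Axiom c4 (involution): applying the complement twice returns the input.
involution_error = max(
    abs(sugeno_complement(sugeno_complement(a, lam), lam) - a)
    for a in grades for lam in lambdas
)
```

For lambda = 0 the function reduces to the traditional complement 1 - a.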
As $\lambda$ increases, the complement curve turns from concave ($-1 < \lambda < 0$) to convex ($\lambda > 0$), as can be seen by taking the second derivative of equation (4.6).

Figure 4.2: Sugeno fuzzy complement for different lambda parameters.

Yager class of fuzzy complements:

(4.7) $c_w(a) = (1 - a^w)^{1/w}$ where $w \in (0, \infty)$.

For w = 1 we obtain the traditional complementary function. For 0 < w < 1 we obtain a convex function and for w > 1 we obtain a concave function.

Figure 4.3: Yager fuzzy complement for different parametric values.

The complementary function has at most one equilibrium solution.

Yager class of fuzzy unions:

(4.8) $u_w(a, b) = \min\left(1, (a^w + b^w)^{1/w}\right)$ where $w \in (0, \infty)$.

The Yager class of fuzzy intersection (Yager Intersection) can be obtained by inserting the Yager union and Yager complement into DeMorgan's law:

(4.9) $i_w(a, b) = c_w\big(u_w(c_w(a), c_w(b))\big)$.
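The construction in equations (4.7) to (4.9) can be sketched directly, with the intersection built from the union and the complement through DeMorgan's law. The function names are assumptions, and the same parameter w is assumed for all three operators.

```python
def yager_complement(a, w):
    """Yager fuzzy complement, equation (4.7), for w > 0."""
    return (1.0 - a ** w) ** (1.0 / w)


def yager_union(a, b, w):
    """Yager fuzzy union, equation (4.8), for w > 0."""
    return min(1.0, (a ** w + b ** w) ** (1.0 / w))


def yager_intersection(a, b, w):
    """Intersection via DeMorgan's law, equation (4.9)."""
    inner = yager_union(yager_complement(a, w), yager_complement(b, w), w)
    return yager_complement(inner, w)


union_example = yager_union(0.3, 0.4, 2.0)                 # about 0.5
saturated = yager_union(0.9, 0.9, 2.0)                     # clipped to 1.0
intersection_example = yager_intersection(1.0, 0.7, 2.0)   # about 0.7
```

The boundary behavior matches the axioms: intersecting with 1 returns the other argument, and intersecting with 0 returns 0.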
In the latter equation the union, intersection and complement correspond to the same parameter value w.

4.6 General Aggregation Operations

Aggregation operations on fuzzy sets are operations in which several fuzzy sets are combined to produce a single fuzzy set. Aggregation operations must satisfy two axioms to be logically consistent.

Axiom h1: $h(0,0,\ldots,0) = 0$, $h(1,1,\ldots,1) = 1$ (boundary conditions).
Axiom h2: For any pair $(a_i \mid i \in N_n)$ and $(b_i \mid i \in N_n)$ where $a_i \in [0,1]$ and $b_i \in [0,1]$, if $a_i \ge b_i$ for all $i \in N_n$ then $h(a_i \mid i \in N_n) \ge h(b_i \mid i \in N_n)$, that is, h is monotonic nondecreasing in all its arguments.

The two additional axioms are common to most aggregation operations used.

Additional (optional) axioms:
Axiom h3: h is a continuous function.
Axiom h4: h is a symmetric function in all its arguments, that is: $h(a_i \mid i \in N_n) = h(a_{p(i)} \mid i \in N_n)$ for any permutation p on $N_n$.

Axiom h4 assumes that 'all fuzzy sets are created equal'. If the case is that one set is more important than the other, this axiom would not be satisfied when devising an aggregate operation. Fuzzy unions and intersections qualify as aggregation operations on fuzzy sets since they satisfy the associative axioms (u4/i4), which extend them from two arguments to any number of arguments.

One class of averaging operations that covers the entire interval between the min and max operations consists of generalized means. These are defined by the formula:

(4.10) $h_\alpha(a_1, a_2, \ldots, a_n) = \left(\dfrac{a_1^\alpha + a_2^\alpha + \cdots + a_n^\alpha}{n}\right)^{1/\alpha}$

where $\alpha \ne 0$ is a parameter by which different means are distinguished:
$\alpha \to -\infty$: min
$\alpha = -1$: Harmonic mean
$\alpha \to 0$: Geometric mean
$\alpha = 1$: Arithmetic mean
$\alpha \to \infty$: max
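To make the role of the parameter in the generalized mean concrete, here is a minimal sketch that evaluates the mean for several parameter values and handles the limiting cases explicitly. The function name and the sample membership grades are assumptions chosen for illustration.

```python
import math


def generalized_mean(values, alpha):
    """Generalized mean of membership grades in [0, 1].

    alpha = -inf gives min, -1 the harmonic mean, 0 (as a limit) the
    geometric mean, 1 the arithmetic mean and +inf gives max.
    """
    n = len(values)
    if alpha == math.inf:
        return max(values)
    if alpha == -math.inf:
        return min(values)
    if alpha <= 0 and any(v == 0 for v in values):
        return 0.0  # limit value when a grade is exactly zero
    if alpha == 0:
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** alpha for v in values) / n) ** (1.0 / alpha)


if __name__ == "__main__":
    grades = [0.2, 0.5, 0.8]
    for alpha in (-math.inf, -1, 0, 1, math.inf):
        print(alpha, round(generalized_mean(grades, alpha), 4))
```

For the sample grades the result increases with the parameter, moving from the minimum through the harmonic, geometric and arithmetic means up to the maximum, so the family covers the whole interval between min and max.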
Weighted Generalized Means:

(4.11) $h_\alpha(a_1, a_2, \ldots, a_n; w_1, w_2, \ldots, w_n) = \left(\sum_{i=1}^{n} w_i a_i^\alpha\right)^{1/\alpha}$ where $\sum_{i=1}^{n} w_i = 1$ and $w_i \ge 0$ for $i = 1, \ldots, n$.

The weighted general mean enables giving more importance to some inputs and less importance to others. Extensions of this aggregation method would include dynamic change for the weights to fit a nonstationary or quasi-stationary environment.

4.7 Dempster-Shafer (DS) Theory

4.7.1 Basic Probability Assignment (BPA)

Basic Probability Assignment (BPA) describes a mass value assigned by an information source only to events for which it has direct evidence. The information source assigns values on subsets of the frame of discernment. BPA follows three axioms:

A1: Empty set. The mass assigned to the empty set is zero.
(4.12) $m(\emptyset) = 0$

A2: Frame of Discernment. The total mass assigned to the frame of discernment is one.
(4.13) $\sum_{A \in P(X)} m(A) = 1$

A3: Normalized values. All mass assignments are normalized to the unit interval.
(4.14) $m(A) \in [0, 1]$
A particular case of a BPA is the well-known probability distribution, where the sum of all known evidence for a particular event is unity. DS theory is a generalization of the Bayesian theory of subjective probability. Degrees of belief need not have the mathematical properties of probability. There is no need to specify prior probabilities in the evidence theory scheme. In Bayesian inference, a symmetric probability value of 0.5 would be assigned to a variable when there is no knowledge about its value; in the case of evidence theory a plausibility value of unity would be assigned. The plausibility value of unity is more general and allows all possible types of probability distributions since it serves as an upper bound for the underlying (unknown) probability density function if it exists. Instead of simply assuming a uniform distribution due to lack of knowledge or inherent vagueness we allow the system to take upon all types of probability distributions.

4.7.2 Combining Evidence

Two basic probability assignments, $m_1$ on the set of events B and $m_2$ on the set of events C, are combined to give a joint basic probability assignment $m_{1,2}$ on the set A. The set A contains events that are included in sets B and C.

(4.15) $m_{1,2}(A) = K \sum_{B \cap C = A} m_1(B)\, m_2(C)$, with $K^{-1} = 1 - \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C)$

Normalizing factor K is needed to compensate for mass that is assigned to events that are not common to both B and C. The known formula for joint probability distribution is a particular case of the Dempster-Shafer (DS) rule for combining evidence. The rule of combining evidence is a generalization of Bayesian combining, and when two events are independent or when the basic probability assignment sums to unity the rule of combining evidence and the rule of joint probability are identical. Combining an existing speech recognition system with a spectrogram-based system can improve the performance of the existing system as long as the spectrogram system gives reasonable results.
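The combination rule can be sketched in a few lines. The example below is a minimal illustration of Dempster's rule, assuming that each basic probability assignment is stored as a dictionary from subsets of the frame of discernment (Python frozensets) to masses. The function name and the toy frame of three vowel labels are hypothetical and are not taken from the thesis.

```python
from itertools import product


def combine_evidence(m1, m2):
    """Combine two basic probability assignments with Dempster's rule.

    m1, m2: dict mapping frozenset (subset of the frame) -> mass.
    Mass that falls on empty intersections is treated as conflict and
    removed by renormalising the remaining mass.
    """
    joint = {}
    conflict = 0.0
    for (b, mass_b), (c, mass_c) in product(m1.items(), m2.items()):
        a = b & c
        if a:
            joint[a] = joint.get(a, 0.0) + mass_b * mass_c
        else:
            conflict += mass_b * mass_c
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    k = 1.0 / (1.0 - conflict)  # normalising factor
    return {a: k * mass for a, mass in joint.items()}


if __name__ == "__main__":
    frame = frozenset({"a", "e", "i"})  # hypothetical phoneme labels
    m1 = {frozenset({"a"}): 0.6, frame: 0.4}
    m2 = {frozenset({"e"}): 0.5, frozenset({"a", "e"}): 0.3, frame: 0.2}
    for subset, mass in combine_evidence(m1, m2).items():
        print(sorted(subset), round(mass, 4))
```

Mass left on the whole frame represents ignorance, so neither source has to supply prior probabilities for every single label. The combined masses again sum to one after normalisation.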
4.8 Fuzzy Logic Toolbox

The Fuzzy Logic Toolbox in Matlab has a Graphical User Interface (GUI) that allows easy manipulation of membership functions and a general construction of a Fuzzy Logic rule-based system. The toolbox also contains different dedicated functions for fast computation of Fuzzy Logic operations. An example of a feature vector extracted from the image spectrogram can be an estimated location of the frequency of the first three formants. Since we have vagueness due to the uncertainty principle, a reasonable membership function would accept different frequency values around some expected formant value that can be either learnt by a Neural Network or calculated through statistical regression for the particular speaker. Ideally, we would train the membership functions to each speaker to improve performance. The output of the system would be a crisp value that corresponds to a particular phoneme with the dimension being the number of phonemes to be recognized. However, a different construction can at first distinguish between different phoneme classes to reduce the dimensionality. Nasals for example can be easily identified by their suppression of the second formant.

4.9 Vagueness and Ambiguity

We present two different concepts that deal with cases that are not well defined in terms of classical Boolean logic [29]:

Vagueness: fuzziness, haziness, cloudiness, unclearness, indistinctiveness and sharplessness. Vagueness is best described by fuzzy sets.

Ambiguity: nonspecificity, one-to-many relation, variety, generality, diversity and divergence. Ambiguity is best described by fuzzy measures.

Three types of ambiguity that lead to three different measures are:
1. Nonspecificity in evidence: Nested subsets cause nonspecificity in evidence. The larger the subset, the greater the nonspecificity is.
2. Dissonance in evidence: Two disjoint sets in X that give information about prospective locations of an element of X contradict each other by their evidence.
3. Confusion in evidence: Subsets of X that do not overlap, or that partially overlap, cause confusion in evidence.

Uncertainty can be used to measure information. In this case, the measure of information does not include semantic and pragmatic aspects of information but can serve an important part in system modeling, analysis and design.

In our work we witness all three types of ambiguity. We experience nonspecificity in evidence when we examine the results of the uncertainty principle in the image spectrogram. The smearing of information over a wide frequency band does not allow us to associate a specific frequency with a particular formant. Instead we are left with a wide band and the particular frequency can be anywhere in the band. Specifying the amount of uncertainty in this case is directly related to the performance of the segmentation algorithm under different conditions. After identifying a specific region in the image spectrogram that is associated with a single frequency, the amount of uncertainty can be easily quantified.

Dissonance of evidence, the second type of ambiguity, is the prime reason for the slow development of rule-based speech recognition systems. The contradiction in rules has led to null results when crisp rule sets were used. Using fuzzy logic rules, however, allows overcoming this problem since fuzzy logic will produce a valid result even when two bodies of evidence contradict each other. The amount of uncertainty in this case is directly related to the proximity of formants of different phonemes under inter-speaker variation and coarticulation effects.

The third ambiguity is introduced in our case when the segmentation algorithm does not recognize all formants in a correct way. We have in this situation missing evidence that can still be handled by a fuzzy logic based set of rules.

4.10 Summary

The basics of Fuzzy Logic operators were introduced. We saw the similarity between the fuzzy union/intersection and dilation/erosion in mathematical morphology. Fuzzy Logic serves as an important and useful technique to model daily human tasks. Understanding Fuzzy Logic is important to our work since when designing the automatic spectrogram reading we need to know what type of outputs we need to generate that would be accepted by a Fuzzy Logic system. We expect a human expert to design a set of rules that would use the information generated by the segmentation algorithm to perform the speech recognition. Knowing that Fuzzy Logic can deal with ambiguity and with contradicting evidence allows us to develop a scheme that does not necessarily try to identify a specific phoneme at a first run, but more to extract information that would be deciphered at a later stage.
Chapter 5 Pitch detection algorithm

5.1 Motivation

The wideband speech spectrogram is striated by vertical lines. These lines indicate the peaks and valleys of the signal that are caused by the opening and closing of the glottis. The opening and closing of the glottis generates the fundamental frequency also known as the pitch. Estimating the pitch period can aid in removing these lines from the wideband spectrogram. Removing the vertical lines from the wideband spectrogram reduces the noise in the image and makes it easier to segment the different formants and consequently to perform an automatic recognition task.

In addition, pitch can also be used as a feature that aids an automatic speech recognition system. Pitch can help distinguish between voiced and unvoiced speech, and indicate semantic and emotional speaker status that can be analyzed using a higher level recognition technique. For example, pitch can indicate an "end of sentence". That information can be incorporated within a language model to improve recognition rates. Pitch can also indicate a transition between different phonemes and phoneme types. Other features that can assist a speech recognition system are the emotional state of the speaker, transition between phonemes and different types of phoneme. Pitch that indicates that the emotional state of the speaker has changed can help adjust parameters and assist in overcoming the inter-speaker variability problem.

We are therefore motivated to detect the pitch and examine a new approach to pitch recognition. To obtain a better view of the pitch we generate a narrowband spectrogram. A longer time window gives better frequency resolution.
5.2 Theoretical Overview

Several pitch detection algorithms exist in the literature. These algorithms can be classified into four groups namely: Time domain, Frequency domain, Time-Frequency hybrid and Event Driven. We present the autocorrelation method [33] followed by the cepstrum method [34]. We focus on these methods since the autocorrelation method is considered to be the most common method used and since the cepstral method is related to our work and can be easily obtained with minor additional calculations from the speech spectrogram.

Pitch can be described as the rate of vibrations of the vocal cords. The fundamental frequency is also caused in the process of the speech creation. The glottis generates a "saw" wave that propagates through the larynx, which shapes the wave to produce the required spoken utterance.

Several difficulties in detecting pitch are:
1. Noise: The speech signal can be corrupted by noise (recording device, background noise, compression over the channel etc.).
2. Differentiating between low-level voiced speech and unvoiced speech.
3. Defining the start and end points of the pitch during voiced segments.
4. The interaction between the vocal tract and the glottal excitation can have an impact on the clarity of the fundamental frequency.

5.3 Autocorrelation Method

The autocorrelation method is robust, works in the time domain and usually assumes stationary analysis (the system does not change during the computation of the autocorrelation function). Autocorrelation requires multiply-accumulate (MAC) operations and the complexity of the operation is $O(N^2)$. The O function serves as an upper bound to the actual complexity function and N is the number of samples in question. The complexity can be further reduced to the order of $O(N \log N)$. Difficulties are high computation costs, accurately detecting the correct peak in the resulting autocorrelation function and the need to window the signal.

To perform good analysis, high pitch speakers require a short window of 5-20 ms while low pitch speakers require a long window of 20-50 ms. Most methods use a sharp cutoff filter at 900 Hz to reduce the effect of the second and higher formants. Such a low pass filter manages to preserve enough harmonics of the pitch to allow detection.
Windowing: the type of window chosen affects the result and since the window tries to taper the function near zero, there is a change in the autocorrelation function that can affect detection. Changing the analysis frame size is important in particular when the application is to handle different speakers. The frame size must contain at least two pitch periods but not be too large for it to be possible to detect the pitch at a given time interval. The range of pitch variation within an utterance is generally one octave or less from the average pitch for the utterance. Thus an instantaneous adaptive algorithm (for the window size) is not required.

Since the speech signal is nonstationary, it is more reasonable to define a short-time autocorrelation function with the underlying assumption of a quasi-stationary signal in each time segment. For a periodic signal s(n)=s(n+P) we have periodicity in the autocorrelation A(k)=A(k+P).

The signal is preprocessed with a low pass filter with a cutoff frequency of 900 Hz and stop band frequency of 1,700 Hz (a sampling rate of 10 kHz is assumed for the speech signal in this scheme). It is also assumed that there is a voiced/unvoiced detector that passes only the voiced speech. An autocorrelation value below 0.25 is assumed to be unvoiced speech. Then the signal is nonlinearly clipped and as a result the spectral density function is flattened. The goal is to reduce the effects of the first formant so as to allow reliable pitch detection. Spectral smoothing is achieved by a combination of nonlinear functions.

Three different types of clipping are presented:
1. Clip: keep signal above a threshold.
2. Clip and reduce signal by a threshold value.
3. Sign function of the clipped signal.

These three methods give rise to 10 (3*3+1) different ways to correlate the nonlinearly processed signal with its shifted nonlinearly processed counterpart.

Rule of thumb: set the threshold value to 68% of the minimum of Q where Q is a two element vector containing the maximum of the absolute values of the signal in the first and in the last thirds of the analysis frame. Note that this operation is equivalent to a fuzzy union of the absolute values of the signal at each interval and then the fuzzy intersection between both intervals.

Since we are dealing with a discrete-time signal, we do not have discontinuity points due to the clipping effect as would be expected in a continuous-time signal, thus the clipping reduces the energy of the higher frequencies in a nonlinear fashion.
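The clipping and autocorrelation steps described in this section can be sketched as follows. This is a minimal illustration only: it uses the third clipping type (sign of the clipped signal), the 68% rule of thumb and the 0.25 voicing threshold mentioned in the text, and it omits the low pass prefilter. The function names, the pitch search range of 50 to 500 Hz and the synthetic test frame are assumptions made for this example.

```python
import math


def clip_threshold(frame, ratio=0.68):
    """68% of the smaller of the peak magnitudes in the first and last thirds."""
    third = len(frame) // 3
    q = (max(abs(x) for x in frame[:third]),
         max(abs(x) for x in frame[-third:]))
    return ratio * min(q)


def three_level_clip(frame, threshold):
    """Sign of the clipped signal: +1 above, -1 below, 0 inside the threshold."""
    return [1 if x > threshold else -1 if x < -threshold else 0 for x in frame]


def autocorrelation(x, lag):
    """Short-time autocorrelation at one lag, direct O(N) sum per lag."""
    return sum(x[n] * x[n + lag] for n in range(len(x) - lag))


def estimate_pitch(frame, fs, f_min=50.0, f_max=500.0, voiced_ratio=0.25):
    """Return a pitch estimate in Hz, or None if the frame looks unvoiced."""
    clipped = three_level_clip(frame, clip_threshold(frame))
    energy = autocorrelation(clipped, 0)
    if energy == 0:
        return None
    lag_lo, lag_hi = int(fs / f_max), int(fs / f_min)
    best_lag = max(range(lag_lo, lag_hi + 1),
                   key=lambda k: autocorrelation(clipped, k))
    if autocorrelation(clipped, best_lag) / energy < voiced_ratio:
        return None
    return fs / best_lag


def synthetic_voiced_frame(f0, fs, n_samples):
    """Toy periodic frame: fundamental plus a weaker second harmonic."""
    return [math.sin(2 * math.pi * f0 * n / fs)
            + 0.5 * math.sin(2 * math.pi * 2 * f0 * n / fs)
            for n in range(n_samples)]


if __name__ == "__main__":
    fs = 10_000                                  # sampling rate assumed in the text
    frame = synthetic_voiced_frame(100.0, fs, 300)  # 30 ms, three pitch periods
    print(estimate_pitch(frame, fs))
```

The frame is chosen to contain three pitch periods so that the lag of one full period falls well inside the search range. A silent frame returns None because the clipped signal carries no energy.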
In this work we develop a new scheme to detect pitch that is an extension of the Cepstrum method (CEP) [34]. A short description of the CEP method:
1. The signal is partitioned into intervals on the order of 4 pitch periods.
2. A Hamming window is used to reduce aliasing effects.
3. The logarithm of the absolute value of a 512 point Fast Fourier Transform (FFT) of the windowed signal is then computed.
4. A 512 point Inverse FFT is computed and the resulting peaks are detected.

By taking a larger time window we can improve our frequency resolution of the pitch at the cost of time localization. The result is a more accurate pitch estimate that applies to an interval of a few pitch periods. It is possible to generate the narrowband spectrogram from the partial computation of the wideband spectrogram. Combining two wideband (shorter time duration) sections into a single narrowband section is done by using the appropriate twiddle factors as a last stage of the corresponding FFT. The windowing can be left to a later stage since by cyclic convolution we have:

(5.1) $W^d(k) \circledast \left[X_1^d(k),\, X_2^d(k)\right]$

$W^d(k)$ is either a duplicate of a wideband length Hamming window or a full length narrowband Hamming window in the frequency domain, where 'd' stands for the DFT transform. Therefore, applications that are using wideband spectrograms can perform the pitch detection algorithm with a relatively low additional computational cost.

We want to measure the distance between the lines in the spectrogram. The distance would give an indication of a half cycle of the sinusoid (peak to peak). First we need to thin the lines to a single pixel width so the measurement would have meaning. The thinning morphological operator was described in section 3.9.2. Since the image is noisy and since we sometimes miss lines where they are supposed to be detected or have lines in places there shouldn't be any, we need to find a way to increase the accuracy of our measurement. We need to aggregate the different measurements in some way that will reduce errors.

5.4 Method of Aggregation

Using an arithmetic mean to calculate the average distance between the lines will not give accurate results since it would only take into account the first and last lines and completely ignore all the lines in the middle. This will increase the chances of error and inaccuracy since the calculation depends only on the first and last line position.
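The claim that the arithmetic mean depends only on the first and last line can be checked with a few lines of arithmetic. The sketch below is illustrative only; the line positions (in pixels) and the function name are made up for the example.

```python
def mean_spacing(lines):
    """Arithmetic mean of the gaps between consecutive line positions."""
    gaps = [b - a for a, b in zip(lines, lines[1:])]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    # 26 evenly spaced lines, 8 pixels apart (hypothetical positions).
    lines = [10.0 + 8.0 * k for k in range(26)]
    print(mean_spacing(lines))                            # sum of gaps telescopes
    print((lines[-1] - lines[0]) / (len(lines) - 1))      # same value

    # Moving an interior line changes two gaps but not their sum.
    jittered = lines[:12] + [lines[12] + 3.0] + lines[13:]
    print(mean_spacing(jittered))

    # Missing an interior line changes only the divisor, which biases the mean.
    missed = lines[:12] + lines[13:]
    print(mean_spacing(missed))
```

The sum of the gaps always collapses to the distance between the first and last line, so interior measurement errors cancel, but a missed line changes the number of gaps and therefore scales the estimate.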
Assume 26 parallel lines located at frequencies given by the 26 letters of the English alphabet {a, b, c, …}. Writing ab for the distance between lines a and b, the 25 consecutive distances form a telescopic sequence:

(5.2) {ab + bc + cd + … + xy + yz}/25 = {az}/25

We see that only the locations of the first and last lines are taken into account together with their property as boundaries of the set of lines. Since the distance between the intermediate lines is not important to the final result we avoid the measurement error associated with each intermediate distance. This would also be an optimal solution in the mean square error sense in case we assume that the average between these two points is also the average of the whole ensemble (a good assumption for a large sample drawn from a symmetric distribution). However, this method assumes that we correctly identified each line. In case a line is not identified we will have an error in the denominator. For a typical narrowband spectrogram that shows 20 lines we will have a 5% error if we miss one line and a 10% error if we miss two lines. This error also inflates the original measurement error. A limiter that requires a minimum number of lines at a certain time instance and a maximum possible line distance (pitch height) helps reduce nonvoiced speech effects.

A median aggregation method is selected. This method has better properties in terms of sensitivity to outliers. In case we have an even number of lines (odd number of distances) we are guaranteed to obtain measurements that are within the set. We continue by examining the properties of the median operator. In the following, we see the median can be regarded as an optimization of minimum absolute distances (MAD). Consider the following optimization problem:

(5.3) $\hat{x} = \arg\min_x f(x)$, $f(x) = \sum_{i=1}^{N} |x - x_i|$

Taking the derivative to find a global optimum we obtain:

(5.4) $f'(x) = \sum_{i=1}^{N} \mathrm{sgn}(x - x_i)$

If N is odd we can obtain a unique minimum for the function f(x) by taking x to be the midpoint of the sorted samples. This would ensure that the derivative of the function is exactly zero. In case N is even, we can select any value between the two middle points of the sorted samples, $q = x_{(N/2)}$ and $s = x_{(N/2+1)}$. A simple to implement solution would be to take the mid-value $(q+s)/2$ (one bit register shift in a hardware implementation). This scheme is used in the 1D Matlab implementation of the median. In the 2D median implementation, the point q is selected.

2D median is used in images in order to reduce the effect of Salt and Pepper noise [35]. Unlike a convolution operation which smears (low passes) the energy of the speckles, the median filter if constructed properly would remove these speckles while avoiding placing gray level values that did not exist in the original image.

5.5 Suggested Algorithm

Our objective is to extract thin lines that represent the fundamental frequency's harmonics and use the distance between the lines to estimate the fundamental frequency. The pitch is usually in the order of 100 Hz for an adult male speaker, therefore sampling the speech at a frequency of 16 kHz gives a very good resolution since it is much more than the Nyquist rate for the 10th harmonic (which is expected to be around 1,000 Hz).

We start by performing a local threshold to the image to better distinguish between the lines we wish to detect and the background. Since we are dealing with a narrowband speech spectrogram which is considered a noisy image, we need to concentrate our detection efforts on objects that would produce a reasonable result. Hence, we restrict all small objects that might have been generated by noise and exclude these objects from the image. We perform this task using connectivity properties and disregard all segments that have less than 100 connected pixels in "8-connectivity".

We proceed by finding segments that are within the objects and are in fact centers of the thicker lines. We find the centers by thinning the image to infinity which in practical terms means thinning the image to a single pixel width so the segments would be evident and well localized. We use these centers to identify again the objects and select from those only objects that have 1000 connected pixels with a "4-connectivity" requirement. The latter restriction is tighter than the 100 pixels "8-connectivity" and yields only long line sections. We again thin the objects to a single pixel width. We conclude the image processing part of the algorithm by performing pruning to remove spurious segments shorter than 10 pixels.

Having extracted the lines from the image we continue by measuring the distance between the lines. We have seen in section 5.4 that a median will be a more reliable choice than an average. Since the pitch is caused by movement of the vocal cords we can safely assume that no significant change occurs at the sampling rate of 16 kHz. In order to reduce the amount of data we need to process we down sample the image spectrogram and calculate the distance every 10 sample points. We ignore distances that are too large, that is, distances that indicate a pitch higher than 300 Hz, to reduce the possibility of an erroneous measurement. We also ignore points that have less than 4 corresponding lines since a reliable measurement cannot be provided in such a case. Points with less than 4 corresponding lines might indicate bad detection of lines or more commonly indicate a nonvoiced speech segment that does not have a well defined pitch period.
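The preference for the median over the mean, discussed in section 5.4, can be illustrated numerically. The sketch below compares the two on a set of line distances in which one gap is doubled because a line was missed, and checks by brute force that the median minimises the sum of absolute deviations. The gap values and the search grid are assumptions chosen for the example.

```python
import statistics


def sum_abs_dev(x, samples):
    """Sum of absolute deviations of the samples from the candidate x."""
    return sum(abs(x - s) for s in samples)


if __name__ == "__main__":
    # Hypothetical line distances in pixels; 16.0 is a gap across a missed line.
    gaps = [8.0, 8.0, 7.0, 9.0, 8.0, 16.0, 8.0]

    mean = sum(gaps) / len(gaps)
    median = statistics.median(gaps)
    print("mean  :", round(mean, 3))
    print("median:", median)

    # Brute-force search over a grid of candidates from 5.00 to 20.00.
    candidates = [i / 100 for i in range(500, 2001)]
    best = min(candidates, key=lambda x: sum_abs_dev(x, gaps))
    print("argmin of sum |x - x_i|:", best)
```

The single outlier pulls the mean above 9 pixels while the median stays at the typical gap of 8 pixels. For an even number of samples, `statistics.median` returns the average of the two middle values, which corresponds to the mid-value choice described in section 5.4.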
5.6 Algorithm Description

We present a step-by-step description of the pitch estimation algorithm:
1. Perform a local threshold.
2. Disregard objects that have less than 100 connected pixels using "8-connectivity".
3. Thin the image indefinitely until all objects are of a single pixel width to find centers of lines.
4. Using a mask to identify the objects that contain the centers from the thinning process as performed in the former stage, disregard objects that have less than 1000 connected pixels using "4-connectivity".
5. Thin the identified objects to a single pixel width.
6. Prune the result to cut branches of less than 10 pixels.
7. Compute a median distance of a down sample of the time axis. Down sample by a ratio of 1:10 to an effective sampling frequency of 1.6 kHz.
8. Disregard line distances that correspond to a fundamental frequency that is higher than 300 Hz and disregard time instances that have less than 4 lines associated with them.
5.7 Results

Figure 5.1: Narrowband Speech Spectrogram. Scale: Horizontal Axis 0 to 1 sec, Vertical Axis 0 to 4,000 Hz.
Figure 5.2: Narrowband Speech Spectrogram after line detection. Scale: Horizontal Axis 0 to 1 sec, Vertical Axis 0 to 4,000 Hz.

We see that in general the lines are well detected. The algorithm manages to thin the lines to a single pixel width. At first the lines seem evenly spaced, however close examination of figure 5.2 would show that the lines are not perfectly spaced and in fact there are some noise and erroneous line segments. The median in our algorithm reduces the effects of outliers and lets us focus on a more stable and conservative subset of the measurements. We also have line detection in areas in which the pitch is not well defined, for example in fricatives or stops. However, by simple examination of the results we can disregard these areas or indicate to a higher level that they are caused by unvoiced speech.
We present some calculations that will correlate between the pitch values and the spectrogram image. In order to compute the narrowband spectrogram we use the specwb function, as described in section 6.11, with a specific setting to calculate a narrowband spectrogram. Before the calculation we have the following facts:
1. The speech signal is sampled at 16 kHz.
2. We process a speech segment of 26,000 sample points, therefore 26/16 parts of the second.
3. The signal is down sampled to 3,200 Hz.
4. Our window is 1,600 samples long which correspond to 0.1 sec. The FFT has 4,096 taps. We need a very long FFT in order to capture at full at least one time window.
5. We have a 50% overlap between consecutive windows.

We normalize the image spectrogram to be in the length of 1,000 pixels. We interpolate the data further to spread half the information on the 1,000 pixels. We calculate line distances every 10th sample. Due to all the information above we obtain a pitch sample result that corresponds to a duration of 16.25 ms. To translate the distance between the lines to a valid frequency we need first to multiply by two to have a full cycle in pixels and then to multiply by 1.5625, which is the frequency spacing of each pixel (1,600/1,024).

We obtain the following results for the pitch values:

Figure 5.3: Results of the Pitch Estimation Algorithm.
We see that the values obtained for the pitch are not stable. We would expect the pitch to be stable at values close to 100 Hz with slight variations in the order of about 10% due to syntactic differences. In addition we have obtained very high or very low pitch values that are usually caused in unvoiced speech segments.

The main problem of this technique is that we have large errors due to the difficulty in determining the exact frequency of the sinusoid just by maximum values that are obtained in the thinning process of the quantized image. There is insufficient resolution to determine the pitch frequency accurately. On the other hand, by increasing the resolution we lose time accuracy, which is important if we desire to make any use of the pitch information. Perhaps the main drawback of our method is ignoring most of the available information that is lost in the quantization process. Relying on extreme levels to compute the frequency of a sinusoid can be reasonable only if we wish to obtain a rough approximation. In this case, where the frequency is in the range of 100 Hz and the average error is over 15%, this method is unacceptable. While a rough approximation of the pitch is possible and even though on a long time interval the errors cancel out and produce a reasonable result, the errors in this type of measurement are too large to tolerate.

In order to perform correct recognition, it would be necessary to imitate the exact procedure of an expert. After identifying and thinning the lines we would need to count them in an 'intelligent' way that would warn in case a line is missing. We will then sample a point from the first and last lines and perform the average calculation. If the 'intelligent' system would work properly we should expect to obtain a correct result for the pitch estimate.

It is possible to employ cepstral-based algorithms, for example [34], where the cepstral coefficients can be computed without much additional computational effort directly from the spectrogram. Estimating the threshold regions from a known pitch is possible; however, due to the threshold the regions are wide and in order to remove them further processing is required.

5.8 Summary

We have seen that attempting to detect the pitch, or any other noisy sinusoidal function, through thinning of a band of maximum points obtained by a threshold (clipping) yields inaccurate results when there is insufficient frequency resolution. Using morphological image processing techniques to extract the pitch period from a narrowband speech spectrogram is not a good idea since there is no visual advantage or expert knowledge that helps to achieve a more accurate result than existing algorithms that use more information and achieve relatively accurate and stable results.
Chapter 6 Automatic Spectrogram Segmentation Algorithm

6.1 Introduction

In this chapter we present an algorithm to enable efficient segmentation of phonemes in a speech spectrogram. We review previous work and give motivation for automatic reading of speech spectrograms. We then continue with particular and detailed descriptions of key algorithmic ingredients followed by an explanation of the segmentation algorithm. After presenting and analyzing the results we suggest ways to fit the algorithm so it could handle different procedures and summarize the chapter with conclusions.

6.2 Overview

Inspired by the work of Prof. Zue from MIT, we focus our efforts on the task of automatically reading a wideband speech spectrogram as if it were some kind of text written in a language known to expert readers. Optical Character Recognition (OCR) is the task of identifying handwritten words acquired by some imaging device and translating that into text. There are a few differences between OCR and spectrogram reading. First the time axis gives good synchronization in spectrograms. Unlike written text that is not necessarily aligned over a certain axis, we have good ordering of phonemes. The length of each phoneme depends on the pace and pronunciation of the particular word or sentence. In the frequency axis we also have ordering for the different formants. We have an idea on the location of the different formants and this prior knowledge can be used to construct rules that would aid in recognizing the particular phoneme.

We would like to perform automatic reading of a spectrogram image. In order to do that we must first identify important spectrogram features. Voiced speech can be characterized by its relative high energy levels in certain frequency regions which make up the formants associated with the particular phoneme. Unvoiced speech can be better described by a smearing of energy throughout all frequency bands in some random pattern. Concentrating on the voiced speech we attempt to identify and extract the different formants from the spectrogram. The extraction can be done using image processing techniques for segmentation.

Spectrograms are a special kind of images that can be subcategorized under Time-Frequency Resolution images as explained in section 2.9. The spectrogram image is very noisy due to several factors. Noise caused as a result of the pitch makes it very hard to determine object boundaries, general speech noise also exists and due to the uncertainty principle we have smearing of all frequencies over a wide band which smears the formants and sometimes causes nearby frequencies to merge into a single large segment.

Several algorithms have been developed to perform image segmentation. Common techniques to perform image segmentation are:
1. Statistical techniques: able to identify and separate different areas in the image according to their different statistical properties. In [36] statistical techniques are used to classify and segment an image in an attempt to automatically detect cancer cell nuclei.
2. Differential calculus techniques: track down the borders of each object in the image using locally computed divergence and Laplacian [37].
3. Tracking methods: algorithms such as the Kalman filter are used to track frequency changes of each formant and segment the image according to the acquired information. In [38] real-time segmentation of range images is performed using a Kalman filter.

The aforementioned techniques seem to be inappropriate in our case. It is not clear how a statistical model should be designed since it would have to include the vagueness due to the uncertainty principle as well as a model of speech and the disturbing pitch lines. It would have problems in segmenting two formants that due to the uncertainty principle are smeared into one large object. Differential calculus entails a heavy computational burden and would need cleanup of the image to obtain correct results. The Kalman filter on the other hand might get lost tracking a line since, due to the uncertainty principle, an actual frequency is smeared over a band of frequencies. If we use [38] and regard the spectrogram image to be a range image we still need to perform a cleanup that separates between the different BLOBs. Since partitioning the image into different BLOBs is a major task we do not obtain any advantages by using this method.

We wish to work with an algorithm that is robust to scaling in the time axis since we do not want faster or slower speech to have a detrimental effect on our results. We also wish to have a robust algorithm in terms of energy levels or at least one that does not require rapid, and often obscure, changes to threshold parameters. In addition we would like to have an algorithm that has low computational requirements and that can be later modified to run in real-time perhaps on a down-sampled low-resolution version of the speech spectrogram. These constraints lead us to select mathematical morphology and particularly the watershed transform as the tool to segment the image spectrogram. The watershed transform is efficiently implemented in Matlab and is intended to perform image segmentation in harsh noise conditions.

6.3 Algorithm Description

1. Run a 2D Gaussian window (Gabor filter).
2. Smooth using a 2D Wiener filter. The local mean and variance are estimated in a 16 by 16 square around each filtered pixel.
3. Median filter is used once on a 3 by 20 rectangle and 4 times using a 20 pixel horizontal line.
4. Apply local threshold on (3).
5. Apply global threshold on (3).
6. Combine the results of (4) and (5) using a logical OR.
7. Dilate with a disk as a structuring element in order to disconnect thin lines and eliminate small areas in the image.
8. Use morphological connectivity to disregard small sections that contain less than 40 pixels or that have a maximum width that is less than 20 pixels.
9. Perform an 8-connectivity watershed algorithm.
1: Algorithm Diagram Flow 65 . Set cnt = 1 Median 1 by 20 cnt = cnt + 1 Is cnt = 4? Yes No Local 2D Wiener Filter Local Threshold Global Threshold Logical OR Dilation (disk as structuring element) Discard Small Connected Sections Watershed Transform Binary Mask Figure 6.Image Spectrogram Median 3 by 20.
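As an illustration of the repeated horizontal median stage in the first step of the algorithm, the following is a minimal Python sketch rather than the Matlab code used in this work. The function names, the clipping of the window at the image edges and the representation of the image as a list of rows are all assumptions made for the example.

```python
from statistics import median


def horizontal_median(image, length=20):
    """Median-filter each row with a horizontal line of `length` pixels.

    `image` is a list of rows (lists of numbers). The window is clipped at
    the image borders (an assumption; the thesis does not state how edges
    are handled).
    """
    half = length // 2
    filtered = []
    for row in image:
        new_row = []
        for c in range(len(row)):
            window = row[max(0, c - half): c + half]
            new_row.append(median(window))
        filtered.append(new_row)
    return filtered


def repeated_horizontal_median(image, length=20, passes=4):
    """Apply the horizontal median filter `passes` times in sequence."""
    for _ in range(passes):
        image = horizontal_median(image, length)
    return image


if __name__ == "__main__":
    # One row with an isolated spike: the spike is removed by the filter.
    row = [0] * 30
    row[15] = 255
    print(repeated_horizontal_median([row])[0])
```

A long horizontal window favors structures that extend along the time axis, which is consistent with the later remark that the algorithm follows horizontal lines more easily than rising or falling formants.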
The algorithm uses both local and global processing techniques. Both smoothing and thresholding are done at the local and the global level. Smoothing at the local level uses a locally estimated mean and variance for a 2D Wiener filter, while the global smoothing procedures (a Gaussian window, median filtering and image dilation) are applied with global parameters. The watershed transform is applied to the entire image, since the interaction between different image objects plays an important part in the segmentation process.

6.4 Adaptive Histogram Equalization

In order to obtain a spectrogram that is clear to read, an adaptive histogram equalization is applied to the image. In general, any invertible histogram function on a certain domain can be equalized to any invertible function on the same domain. For example, if the image histogram is h, equalizing it to heq would require:

(6.1) $h_{eq} = s \circ h$, where $s = h_{eq} \circ h^{-1}$.

Equation 6.1 may seem trivial at first glance; however, since we must preserve a constant number of pixels in each rectangular tile, we cannot simply move pixels around and an additional step is required. An algorithm can be devised to map each pixel from one histogram to the other.

The advantage of using adaptive histogram equalization is that details in the image can be emphasized. A global equalization would diminish most of the details in the image. The adaptive equalization is done through the adapthisteq Matlab instruction. The adaptive histogram equalization works first on tiles of the image and then combines the tiles by using bilinear interpolation in order to reduce border effects. Since the tiled histogram equalization operates on rectangular regions, it generates more homogeneous energy values for the different formants. This method of equalization differs from the pre-emphasis filter, since it is performed on a rectangular tile and not on particular vertical lines (time instances). Bilinear interpolation is then used to reduce the bordering vertical lines between the rectangular tiles. After applying the bilinear interpolation, the borders are smoothed down and practically eliminated. The result is an image spectrogram that clearly shows the first four speech formants, f1 to f4.
6.5 Gamma Correction

A Gamma characteristic is a power-law relationship that approximates the relationship between the encoded luminance in a TV system and the actual desired brightness. The cathode ray is a nonlinear device. The relationship between the luminance and the voltage is given by the formula:

(6.2) $I = V_S^{\gamma}$

For a computer Cathode-Ray Tube (CRT), gamma is about 2.2. To compensate for the distortion effect a gamma correction is performed to the voltage:

(6.3) $V_C = V_S^{1/\gamma}$

Since the human visual system views luminance using a logarithmic scale, in a similar way to the logarithmic scale used in the human hearing system, the gamma correction serves to adjust this scale. Gamma correction is performed on the whole image and is specific to the hardware used (1). In this work we use a gamma correction value of 0.8.

(1) Changing the screen brightness is equivalent to performing gamma correction.

6.6 Window Selection

A common tradeoff in window selection is the main lobe width versus the side lobe roll-off rate. Choosing a narrow main lobe reduces the uncertainty in frequency and allows us to better distinguish between formants that have a small frequency difference. A Hann window is often used due to its good roll-off properties: 60 dB/decade. The Hamming window has a lower roll-off of 20 dB/decade but a lower main lobe width, since its maximum side lobe level is -43 dB as opposed to -32 dB for the Hann window [39]. The lower roll-off introduces dependencies on previous and future speech samples, resulting in a noisier image. However, the watershed transform can produce better results since it can better capture low energy regions, in particular on rising and falling formants, as seen by comparing fig. 6.2 (c) and fig. 6.2 (d). Therefore we choose to work with a Hamming window for the specified short time interval of 6.5 ms.
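To make the gamma correction of section 6.5 concrete, the following minimal Python sketch applies a power law to a normalized grayscale image. It is an illustration only: the function name, the 8-bit range and the convention that each pixel is raised directly to the power gamma (so that a value below 1, such as 0.8, brightens mid-range pixels) are assumptions and may differ from the convention used in the original Matlab processing.

```python
def gamma_correct(image, gamma=0.8, max_value=255):
    """Apply power-law (gamma) correction to a grayscale image.

    `image` is a list of rows of integer pixel values in [0, max_value].
    Each pixel is normalized to [0, 1], raised to the power `gamma` and
    scaled back. Assumed convention: gamma < 1 brightens the image.
    """
    corrected = []
    for row in image:
        corrected.append(
            [round(max_value * (pixel / max_value) ** gamma) for pixel in row]
        )
    return corrected


if __name__ == "__main__":
    spectrogram = [[0, 64, 128], [192, 255, 32]]
    print(gamma_correct(spectrogram))
```

The end points of the intensity range are left unchanged by the power law, so only the relative darkness of the intermediate energy levels is adjusted.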
6.7 Connectivity Operations

Mathematical morphology is a lattice-based theory in which connectivity plays an important role. We need to separate larger objects from smaller ones in the image to get rid of spurious small spots that are the residue of the threshold and dilation operations. A straightforward approach to determine which objects are big is to predetermine a minimum object size in terms of the number of connected pixels and discard all objects that contain less than that minimum value. We therefore choose a fully connected grid (8-connectivity) and discard all objects with less than 40 pixels. We also demand that the minimum width of each object be 20 pixels, due to the minimal signal bandwidth that should be present as a result of the windowing operation and the uncertainty principle. As a result we manage to get rid of small objects and are left only with larger objects that can later be further segmented, and perhaps separated from one another, using the watershed transform.

6.8 Local vs. Global Threshold

We need to quantize the grayscale image spectrogram in order to obtain a binary image. Quantization is performed by selecting a threshold level. Since lower formants tend to have a higher energy concentration than higher formants, and since the spectrogram image contains much detailed information regarding the different formants, simply using a global threshold will not yield good results. The global threshold would not be able to capture small image details and would not do a good job of separating different phonemes that are close to one another. Since speech can be modeled as quasi-stationary, its statistical properties change several times during a 1 second interval, leading to changes in the spectral density for different phonemes and even within the same phoneme.

The human vision system, when examining a particular object, performs a local threshold operation. A local threshold is used to isolate each BLOB from its surroundings. A global threshold is used to clean the image from noise: it serves to reduce low-energy white noise that can appear as very small speckles in the image spectrogram. Combining the local threshold image with the global threshold image using a logical OR will yield the desired result.

We note that we choose a rigid hard threshold value that was determined according to analysis of a few image spectrograms; however, choosing an adaptive threshold value is also possible and might improve the results at the cost of run time and algorithmic complications.
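The size rule of section 6.7 can be sketched in Python as follows. This is a minimal illustration, not the Matlab code used in this work: the function name is hypothetical, the mask is a list of rows of 0/1 values, and "width" is taken to be the horizontal extent of a connected object, which is an assumption about how the 20 pixel requirement is measured.

```python
from collections import deque


def remove_small_objects(mask, min_pixels=40, min_width=20):
    """Keep only 8-connected objects that are large and wide enough.

    An object is kept when it has at least `min_pixels` pixels and its
    horizontal extent is at least `min_width` pixels (assumed meaning of
    "width"). Returns a new binary mask.
    """
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    cleaned = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search over the 8 neighbors of each pixel.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            columns = [x for _, x in component]
            width = max(columns) - min(columns) + 1
            if len(component) >= min_pixels and width >= min_width:
                for y, x in component:
                    cleaned[y][x] = 1
    return cleaned


if __name__ == "__main__":
    mask = [[0] * 30 for _ in range(6)]
    for r in (0, 1):
        for c in range(25):
            mask[r][c] = 1          # 50 pixel object, 25 pixels wide: kept
    for r in (3, 4, 5):
        for c in (27, 28, 29):
            mask[r][c] = 1          # 9 pixel speckle: discarded
    print(sum(map(sum, remove_small_objects(mask))))
```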
6.9 Working with the TIMIT Database

We want to test our algorithm on different phonemes. In order to extract samples from the database we use the function GetPhoneme and input the desired phoneme to be extracted. The function creates a subfolder within the Database folder, with the same name as the requested phoneme, that holds the extracted speech files containing the phoneme. The speech files are extracted in such a way that the desired phoneme begins 0.5 seconds into the extracted 1 second speech segment (centered). We continue by manually creating a text file that contains the paths of the speech files. We have 20 samples for each phoneme and we name the text file after the phoneme to be tested.

We then run the function manager, which takes the file, calculates the wideband spectrogram using our specwb function, continues with the segmentation algorithm using the recon function and finally saves both the identified spectrogram and the mask. The function also marks borders that indicate the start and end of the phoneme according to the information contained in the TIMIT database. We create a line that is 5 pixels wide and that can be easily identified by the user. Since the phoneme is centered, we only need to compute the end point. The 1 second speech segment was saved under a name that encapsulates the end point of the phoneme, so by simple text manipulation we have the end point of each phoneme. The function is run using a break point in debugging mode (similar to the event-driven 'try' and 'catch' instructions in Matlab/Java). The user can examine and grade the results of the segmentation algorithm accordingly.

6.10 Calculating the Local Threshold

We need to obtain a local threshold for each pixel in the image. The idea behind the process is that values close to the pixel will have a greater effect on the threshold value than values farther away. Unlike the block processing of the histogram equalization, as described in section 6.4, in this case we wish to have a very discriminative and distinctive difference between pixels. We therefore desire a more local environment that might contain a smaller number of pixels but will still allow us to discriminate between the borders of the BLOBs.

Using trial and error we select the Gaussian standard deviation to be 100 and create a vertical Gaussian filter of length 200 pixels. The image that has passed several median stages, as described in fig. 6.1, is then filtered with the Gaussian to obtain the local threshold parameters. The effect of the filter is an averaging over a long line of pixels. The threshold is adjusted locally to capture the BLOB boundary, since an abrupt local change will have an immediate effect on the threshold value. Since the threshold is set according to the vertical surroundings, it is not affected by intensity changes in the image caused by possibly lower energies and narrower BLOBs in the upper frequency bands.
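The local threshold of section 6.10, combined with a global threshold through a logical OR as in section 6.8, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the original Matlab code: the function names are hypothetical, a pixel is marked when it strictly exceeds its Gaussian-weighted vertical average, the weights are renormalized where the window leaves the image, and the global level is an arbitrary example value.

```python
import math


def gaussian_weights(length=200, sigma=100.0):
    """Normalized 1D Gaussian window of the given length and deviation."""
    center = (length - 1) / 2.0
    weights = [math.exp(-((i - center) ** 2) / (2.0 * sigma ** 2))
               for i in range(length)]
    total = sum(weights)
    return [w / total for w in weights]


def local_threshold_map(image, weights):
    """Gaussian-weighted average along each column (vertical filter)."""
    rows, cols = len(image), len(image[0])
    half = len(weights) // 2
    thresholds = [[0.0] * cols for _ in range(rows)]
    for c in range(cols):
        for r in range(rows):
            acc = norm = 0.0
            for k, w in enumerate(weights):
                rr = r + k - half
                if 0 <= rr < rows:      # skip samples outside the image
                    acc += w * image[rr][c]
                    norm += w
            thresholds[r][c] = acc / norm
    return thresholds


def combine_thresholds(image, weights, global_level):
    """Binary mask: local threshold OR global threshold."""
    local = local_threshold_map(image, weights)
    return [[1 if (p > local[r][c] or p > global_level) else 0
             for c, p in enumerate(row)]
            for r, row in enumerate(image)]


if __name__ == "__main__":
    column_image = [[10], [10], [200], [10], [10]]
    weights = gaussian_weights(length=5, sigma=2.0)   # short window for the demo
    print(combine_thresholds(column_image, weights, global_level=250))
```

With a standard deviation of 100 over a 200 pixel window the weights are nearly flat, which matches the remark that the filter acts as an averaging over a long line of pixels.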
6.11 Function Description

We describe the functions that are used in this work, their input, output and objectives:

specwb
Objective: Function creates, displays, saves and returns a speech spectrogram according to a given speech segment.
Input:
1. Speech signal.
2. Sampling rate.
3. Parameters such as narrowband/wideband etc.
Output: Image Spectrogram

recon
Objective: Function segments the wideband speech spectrogram and determines the borders of the different resulting BLOBs.
Input: Wideband speech spectrogram.
Output:
1. Segmented speech spectrogram.
2. Segmented mask image.
3. Borders of BLOBs.

GetPhoneme
Objective: Extract from TIMIT speech segments that contain a specific phoneme. The phoneme is centered and the speech segment is 1 second long.
Input: Phoneme name as it appears in TIMIT.
Output: In the Database directory, a subdirectory containing all files extracted. The name indicates the sentence the phoneme was taken from and the location of the phoneme in the complete sentence. Files that are either at the end or the beginning of a sentence are marked with a different name.

Manager
Objective: Create, display and save segmented speech spectrograms and segmentation masks of specific phonemes. The phoneme can be identified by a left and right border vertical line.
Input: Text file from the Input file directory containing paths to different speech segments of a specific phoneme.
Output: Segmented speech spectrograms and segmentation masks saved in the directory Results under the specific phoneme subdirectory.

linez
Objective: Function detects and marks lines and outputs a vector of the median distance at selected locations. The spectrogram with the lines and a mask containing only the lines are displayed.
Input: Narrowband spectrogram.
Output: A vector containing the median distance between the lines.

fuzzybrain
Objective: Creating triangular membership functions for the different vowels that can be used in the fuzzy logic toolbox graphical user interface (GUI). This function also automatically constructs the rules of the system and creates input and output variables.
Inputs: Through parameters: the estimated frequency location of vowel phonemes for the first three formants.
Output: A .fis file that can later be opened using the fuzzy logic GUI tool (the Matlab instruction fuzzy).

readTIMIT
Objective: Function reads a .wav file from the TIMIT database and converts the file from big-endian byte ordering to little-endian byte ordering.
Input:
1. Full path in the TIMIT directory.
2. Name of .wav file.
Output: Saves a .mat speech file in little-endian format.
6.12 Results

The algorithm was tested on different speech samples from the TIMIT database. The TIMIT database contains female and male speakers from 7 different dialect regions in the United States. The speakers repeated sentences especially designed at SRI, MIT and TI to exemplify different speech characteristics such as accent, coarticulation and different combinations of phonemes. Orthographic transcription and time-aligned phonetic transcription are included for every sentence.

Our first example uses the meaningful sentence "However, the litter remained, augmented by several dozen lunchroom suppers". The 1 second of speech displayed in fig. 6.2(a) is the portion "several dozen" (set in bold face in the original). As seen in fig. 6.2(a), all four formants are well aligned and ready to be recognized by an appropriate system.

Our second example presents a more challenging scheme. We examine a different section of the same sentence, "the litter remained", as seen in fig. 6.2(b). In this example, the algorithm has difficulty in segmenting the second and third formant of /r/. Since these formants are very close together, it is hard to distinguish between them and to segment them as different objects. In addition, high energy levels for f3 make it more difficult to separate it from f2. Another difficulty arises in the identification of the nasal /m/. The low spectral density, which is caused by a spectral zero that reduces the second formant, makes it hard to segment the phoneme correctly.

As a last example, we choose: "Don't ask me to carry an oily rag like that." As seen in fig. 6.2(c), the segmentation misses part of the phoneme /r/ but the general direction is preserved.

Results were robust. In general, the segmentation performed well on different speakers and different sentences. We obtain good segmentation for the first and second formants for all voiced phonemes. For the third and fourth formant, the algorithm manages to perform well when the formant energies are strong. However, BLOBs associated with f1 sometimes relate to more than one phoneme. This phenomenon occurs in some cases for the higher formants as well. On the other hand, as was also noticed in the previous examples, we obtain several cases in which formants are segmented into more than one BLOB.

One other problem is small segments that do not represent a formant but still appear in the image (false positives). Even though oversegmentation was tackled in the watershed algorithm, we still have remainders in the form of small binary objects that can cause problems in the recognition stage. This problem can be solved by changing the constant in step #8 of the algorithm. However, changing the constant to accept only stronger energies would result in losing some real formants.
Figure 6.2: Algorithm Results for different cases. (A) "several dozen". (B) "the litter remained". (C) Hamming window, "an oily rag". (D) Hann window, "an oily rag". Scale: horizontal axis 0 to 1 sec, vertical axis 0 to 4,000 Hz.

In order to check the algorithm behavior in a more systematic fashion, we test the results on multiple runs. We select 10 phonemes and run 20 different tests for each phoneme, in total 200 different speech segments. The criterion by which we judge the performance is the fuzzy variable 'Grade', which takes the values {'Perfect', 'Good', 'Average', 'Below Average', 'Poor'} for the segmentation results. We assign numbers to each descriptor, where 'Perfect' takes the highest value of 5 and 'Poor' takes the lowest value of 1. It is believed that 'Average', which takes the value of 3, contains enough information for automatic recognition. The results, including the mean and variance of the visual measurements, are presented in Table 6.1.
Phoneme   Grades for tests 1 to 20                    Mean   Variance
aa        5 5 2 4 5 2 5 4 5 3 5 5 5 5 2 5 4 2 4 5     4.1    1.46
ae        5 5 3 5 4 5 4 5 4 3 4 5 4 4 2 4 3 5 5 5     4.2    0.8
eh        5 4 3 2 5 5 4 5 2 5 1 3 5 4 5 5 5 5 5 5     4.15   1.61
ux        5 4 2 5 5 5 5 5 2 5 5 4 5 3 3 4 5 5 5 5     4.35   1.08
ow        5 4 2 4 5 3 5 5 4 5 5 3 4 2 3 5 5 3 5 4     4.05   1.10
oy        5 4 3 5 3 5 4 3 4 5 4 4 3 3 3 2 3 5 5 4     3.85   0.87
r         5 3 3 5 5 5 3 3 4 5 4 5 5 3 4 5 5 4 4 4     4.2    0.69
l         3 4 1 5 5 2 5 1 1 1 2 1 5 3 5 5 2 5 5 3     3.2    2.91
m         5 5 4 3 2 5 1 2 1 4 1 1 3 1 2 2 1 2 3 5     2.65   2.34
w         1 2 5 3 4 5 5 2 3 3 3 4 5 2 2 5 2 2 5 5     3.4    1.94

Table 6.1: Results of a visual inspection. The grades describe the accuracy of the segmentation algorithm for each phoneme.

After examining the results we see that in general the algorithm obtains good segmentation results for the formant energy levels throughout different phonemes. In general, the vowels are well recognized. The algorithm obtains better segmentation results when the phoneme duration is longer. Since more information is available and since our segmentation algorithm is searching for large objects, we tend to miss small concentrations of energy. The semivowel /w/ is better segmented on short duration phonemes, since there is a higher energy concentration that enables better segmenting of f3 and f4. The nasal /m/ and the glide /l/ have lower segmentation results due to the difficulty of tracking diagonal lines in the spectrogram. It is possible to extend the algorithm to detect diagonal lines either by adding a tracking procedure such as a Kalman filter or by a median filter that emphasizes diagonal lines.

We demonstrate the subjective selection of grades according to our fuzzy variable by displaying a few speech spectrograms that correspond to different grades. We select the glide 'l', which as seen in table 6.1 takes all 5 possible values. We have the same sentence pronounced by different speakers. The numbers in brackets indicate the row from the table as well as the line location of the filenames in the text file inputl.txt in the Input directory. Once again the bold part of the sentence indicates the speech segment that is actually displayed in the speech spectrogram, where the glide 'l' is centered at 0.5 seconds from the start.
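The Mean and Variance columns of Table 6.1 can be checked with a few lines of Python. The sketch below uses the grades of the phonemes 'aa' and 'l' as read from the table and assumes that the reported variance is the sample variance (division by n - 1), which is the convention that reproduces the tabulated values; the variable names are chosen only for this example.

```python
from statistics import mean, variance

# Grades for tests 1 to 20, copied from the 'aa' and 'l' rows of Table 6.1.
grades = {
    "aa": [5, 5, 2, 4, 5, 2, 5, 4, 5, 3, 5, 5, 5, 5, 2, 5, 4, 2, 4, 5],
    "l": [3, 4, 1, 5, 5, 2, 5, 1, 1, 1, 2, 1, 5, 3, 5, 5, 2, 5, 5, 3],
}

for phoneme, values in grades.items():
    # statistics.variance computes the sample variance (n - 1 denominator).
    print(f"{phoneme}: mean = {mean(values):.2f}, variance = {variance(values):.2f}")

# Prints:
# aa: mean = 4.10, variance = 1.46
# l: mean = 3.20, variance = 2.91
```

The much larger variance for 'l' reflects the fact that its grades span the whole range from 1 to 5, which is why this phoneme is used for the graded examples.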
1. Don't ask me to carry an oily rag like that. (9)
Figure 6.3: Example of a grade 1 score.
Justification: We miss the second formant almost completely due to the sharp rise and relatively low energy. Since the main characteristic of the glide is entailed in the gliding second formant, we give a grade '1' to this result.

2. Don't ask me to carry an oily rag like that. (6)
Figure 6.4: Example of a grade 2 score.
Justification: We have a clear first and fourth formant, and the second formant is well represented, but the third formant is missing. Recognition would be difficult (even though not impossible), therefore a grade of '2' is given.

3. She had your dark suit in greasy wash water all year. (1)
Figure 6.5: Example of a grade 3 score.
Justification: We have all four formants. We can conclude from the mask the direction and location of the second formant. The third formant can also be well estimated. Recognition should be possible in this case, therefore a grade of '3' is given.

4. She had your dark suit in greasy wash water all year. (2)
Figure 6.6: Example of a grade 4 score.
Justification: We have all four formants. The second formant is well described. The location of the third and fourth formants can be easily understood. Therefore a grade '4' is given.

5. Don't ask me to carry an oily rag like that. (5)
Figure 6.7: Example of a grade 5 score.
Justification: All four formants are well characterized. The segmentation algorithm captures the formants well and automatic speech recognition is possible.

We see that due to different accents and energy distributions, we have significantly different results for the segmentation. Since our algorithm is tuned to follow horizontal lines and shapes, we have a problem with the glides in particular and with rising and falling frequencies in general. With some adjustments, however, it is possible to fit the algorithm to capture non-horizontal movements. A linear model for a rise and fall is known as the delta coefficients (first derivative) in common ASR. A second-order model also uses the delta-delta coefficients, which are an approximation of the second derivative and result in fitting the data to a parabolic function.

We obtain very good recognition results when the phoneme time duration is short. We attribute that to the relatively high concentration of energy and to the mild glide in the second formant, which are more suitable to an algorithm that aims at segmenting horizontal lines. We see that even at a non-commercial, premature stage of the algorithm, we obtain results that are believed to be sufficient for an automatic speech recognition system in most cases.
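For readers unfamiliar with delta coefficients, the following Python sketch shows the regression-style first-difference formula commonly used in ASR front ends, applied here to a hypothetical formant frequency track. It is an illustration only: the function name, the window of two frames on each side and the repetition of the end values at the track boundaries are assumptions, and this computation is not part of the segmentation algorithm presented in this chapter.

```python
def delta_coefficients(track, window=2):
    """Regression-based delta (first-derivative estimate) of a sequence.

    delta[t] = sum_{n=1..N} n * (x[t+n] - x[t-n]) / (2 * sum_{n=1..N} n^2)
    Frames beyond the ends of the track are replaced by the end values.
    """
    n_frames = len(track)
    denominator = 2 * sum(n * n for n in range(1, window + 1))
    deltas = []
    for t in range(n_frames):
        numerator = 0.0
        for n in range(1, window + 1):
            ahead = track[min(t + n, n_frames - 1)]
            behind = track[max(t - n, 0)]
            numerator += n * (ahead - behind)
        deltas.append(numerator / denominator)
    return deltas


if __name__ == "__main__":
    # Hypothetical formant track (Hz per frame) rising linearly by 10 Hz.
    rising = [1000, 1010, 1020, 1030, 1040, 1050]
    deltas = delta_coefficients(rising)
    print(deltas)                        # slope estimate per frame
    print(delta_coefficients(deltas))    # delta-delta (second derivative estimate)
```

Applying the same function to the delta sequence gives the delta-delta coefficients mentioned in the text.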
6.13 Suggestions for Improving the Algorithm

Several changes and modifications can be made in an attempt to improve the algorithm. We present a few ideas for improvement together with their advantages and disadvantages.

1. Using additional information from the spectrogram to help in the recognition process. Changes in energy level, peaks of energy and other parameters play an important part in the recognition process. Simply taking a binary image that ignores these parameters reduces the amount of information available to our recognition system.
Advantages: Recognition rates can be increased using additional information.
Disadvantages: More processing and storing of information is required. Rules need to be constructed to deal with the additional information in an unambiguous manner.

2. Adaptive change of the local threshold value to include formants in low energy phonemes.
Advantages: Since some phonemes contain less energy than others, lowering the threshold would help in segmenting formants that otherwise would have been partially segmented or completely ignored.
Disadvantages: Changing the threshold would require additional processing and will introduce areas in the image that do not belong to any particular formant (false positive detections).

3. Implementing a tracking algorithm that would help to identify formants that are rising or falling in frequency. A Kalman filter can be used to track the energy peaks and associate rising or falling energy-dense regions with an identified BLOB.
Advantages: Would help in detecting nasals, liquids and glides, where low energy formants tend to rise or fall.
Disadvantages: The tracking algorithm would require an additional stage of BLOB detection to avoid the effect of the uncertainty principle. Adding regions that were not detected before as BLOBs would also introduce false positive results, since lower spectral density regions are not ignored.
6.14 Summary

A robust algorithm for speech spectrogram segmentation was presented. By using morphological image processing techniques, we are able to obtain reliable segmentation of formants in most cases. The algorithm performs well for all voiced phonemes and has better segmentation results than previous techniques; however, difficulties occur when formant frequencies are close together or when there is a low-energy formant that is rapidly going up or down in frequency. Some suggestions, such as changing the threshold level, were made to improve or tune the algorithm. These results can be used as input to an automatic speech recognition system or in other general uses of speech spectrograms. It is the authors' belief that a spectrogram-based speech recognition system can complement an existing recognition system by incorporating human expert knowledge into the recognition task.
Chapter 7 Conclusion

This chapter summarizes the contributions presented in this thesis. We first give conclusions and an overview of the results obtained from the different algorithms that were developed. We conclude the chapter with ideas for future work and development.

7.1 Review of the Work and Logical Deductions

In the previous chapters we laid the foundations of three major theories: Speech Recognition, Morphological Image Processing and Fuzzy Logic. We have seen that it is possible to combine these methods in order to design a scheme that can perform automatic speech recognition. The close relationship between Fuzzy Logic and Mathematical Morphology helped in understanding how to link these two theories. We concluded that experts can extract information from wideband speech spectrograms and saw the difference between narrowband and wideband spectrogram images, the information they contain and the different shapes that require different morphological operators to extract information from the images. In section 5.4 we saw the mathematical properties of the median operation. We also used the median to smooth the narrowband and the wideband images as an initial stage before applying stronger segmentation or extraction techniques such as the watershed transform or the morphological thinning operator.

The main purpose of this thesis was to segment an image spectrogram, and for that reason a segmentation algorithm was devised. Justifications for using Mathematical Morphology to perform the image segmentation were presented. The watershed transform is effective in segmenting noisy images, in particular in cases in which different target objects partially occlude one another. We have seen how by choosing a Hamming window instead of a Hann window we can obtain better segmentation results, since we have better separation between adjacent frequencies and since the time dependencies between pixels in the image spectrogram can be compromised to some extent. We used a fuzzy variable to quantify the results of the algorithm. This method of debugging ensured that our algorithm would be optimized to yield results that are as close as possible to the information extracted by an expert performing a visual inspection of a speech spectrogram.

The segmentation algorithm performs well in most cases. We obtain a labeled mask image, and in most cases each BLOB directly corresponds to a formant of a particular phoneme. In some cases we obtain several small BLOBs that belong to the same formant. Another common caveat is a formant that breaks up into smaller pieces in the segmentation process due to lower energy levels in its center area. The vowels are segmented in a satisfying manner in most cases. In most cases all of the first four formants are well detected and recognized. Sometimes we miss a formant due to low energy levels; however, this should not pose a particularly difficult problem, since most of the information that we need for the recognition task is still maintained. We are able to obtain satisfactory results in most cases for the glides. These results are lower than those obtained for the vowels. The glides present a more challenging scheme since they require tracking formants that either increase or decrease in frequency. Also, their energy levels are in general lower than those of the vowels.

7.2 Ideas for Future Work

We have managed to perform segmentation that works well in most cases. However, performing equalization that would use as input the time and energy properties of each phoneme, and that would be adjusted to a specific speaker or accent group, can help obtain even better results. A simple equalization can use a Gamma correction, as explained in section 6.5, to change the luminance and therefore the darkness of the different energy sections in the speech spectrogram. Another improvement to the segmentation algorithm could be fine-tuning the algorithm and adjusting it to tackle different phoneme types. By constraining the number of BLOBs we expect to segment over a prespecified time period, we can significantly improve the results, since we will reduce the number of small BLOBs and merge BLOBs that are actually constituents of the same formant.

In order to perform automatic speech recognition using the results of our algorithm we will need to construct an expert system. The expert system would rely on one or more experts in spectrogram reading and will have the form of IF THEN rules. The rules will also have an aggregation method that explains how to perform combinations, intersection or negation. In addition we will need to extract a feature vector from our segmented image. The feature vector would be constructed according to the rules laid forth in the expert system. The feature vector can include parameters such as the length of the BLOB, its frequency band location, its slope measured by a first or second order approximation (in a similar way to how delta coefficients are measured) and its energy strength. Since we are not restricted to the information we have in the mask (segmented binary image), we can use the mask as a reference and extract more accurate information related to a specific BLOB from the original spectrogram. The membership functions for each element of the feature vector can be either manually designed or trained by a neural network. Finally, a thorough regression would be performed to analyze the system performance in different cases.

A clear advantage of the proposed system is its intuitive rule-based design and the possibility of incorporating the knowledge of more than one expert. A possible solution for creating the set of rules is a wiki-based system that will allow experts from different places around the world to contribute from their experience. Information extracted by this method can also be combined with existing systems to improve their results. We expect that the information contained in speech spectrograms, as interpreted by a human expert well trained in and familiar with acoustic, phonetic, linguistic and speech production models, can yield better recognition rates than the current methods that do not incorporate human knowledge in their algorithms.
Appendix

Justifications for Choosing a Triangular Membership Function (TMF)

We can regard the values in Table 2.1 as measurements that contain some error according to some unknown probability distribution. We assume (approximate) that these measurements are the mode of the distribution of outcomes, and since we do not have any additional information about the standard deviation we can only draw conclusions about the range of measurements associated with a certain confidence interval. For example, if we have two consecutive frequency values corresponding to two different phonemes, we can assume that most measurements associated with the lower frequency fall below the measurements of the higher one.

The MF serves as an upper bound to the possibility function. Since there is never a tight bound of the possibility and necessity functions on the probability function, the tightest interval should be obtained in order to keep as much information as possible (maximally specific possibility function). Since the Triangular Possibility Distribution (TPD) is the transform of the uniform distribution function, it serves as an upper bound to all other pdfs. A Truncated Triangular Possibility Distribution (TTPD) can be used to cover efficiently Normal, Laplacian, and Uniform or Triangular distributions [40]. The TTPD serves as a family of maximally specific probability distributions that enables upper bounds of probabilities of events to be computed. Since in this case no human knowledge is used, we can approximate the error distribution as symmetrical and require an upper bound of a TTPD.
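The symmetric triangular shape argued for in this appendix can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function name `triangular_mf`, the 500 Hz mode and the 100 Hz half-width are hypothetical values chosen only for illustration, and the truncation that distinguishes a TTPD from a plain triangle is left out.

```python
def triangular_mf(x, mode, half_width):
    """Symmetric triangular membership function.

    Returns 1.0 at the mode (the tabulated measurement), falls off linearly
    on both sides, and is 0.0 outside [mode - half_width, mode + half_width].
    """
    if half_width <= 0:
        raise ValueError("half_width must be positive")
    return max(0.0, 1.0 - abs(x - mode) / half_width)


# Hypothetical example: a formant tabulated at 500 Hz with a 100 Hz half-width.
for freq_hz in (400, 450, 500, 550, 600):
    print(freq_hz, triangular_mf(freq_hz, mode=500, half_width=100))
```

The membership is highest at the tabulated value and symmetric around it, which matches the assumption that the measurement is the mode of a symmetric error distribution.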
. “Spectral representations for speech recognition by neural networksa tutorial”. Vol. 2. 49.A. [7] S. “A New Digital TV Interface Employing Speech Recognition”. 490497.W. Vol. No. 742 – 772. Part 11. 357366. and Cole. 1992. Feb. AddisonWesley Publishing Company. 1979. [13] Lamel. et. pp. Issue 3. [12] Makhoul J. 558561." IEEE Trans. and Zue. “Experiments on Spectrogram Reading”.. ICASSP. S. Vol. 4. B. [14] Zue.com/ [4] http://www. Wolska M. on Speech and Audio Processing. Fujita et al. 2007. “An Expert Spectrogram Reader: A KnowledgeBased Approach to Speech Recognition”. IEEE Journal of Robotics and Automation.Trento. Issue 5. Bourlard. 1994. [15] Hemdal. pp. V. No. B. 1987. 4. [6] D. pp. [16] Roger Y. and Lougheed. 1994. [8] Dudley H. 116119. J.A. 1991. V. 122126.hrichina. Aug. [2] Bobbert D. 1995. pp. The Vocoder. [10] Juang. R. Speech. 214 – 222.org/ [5] K.W. Aug. 17. Vol. 1939.. H. R. Aug. Signal Processing. “A Versatile Camera Calibration Technique for HighAccuracy 3D Machine Vision Metrology Using Offtheshelf TV Cameras and Lenses”. 84 .. 1980.. pp. pp. "Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Trans. 765 – 769. pp. 151160. Part 2.R. on Consumer Electronics. Mermelstein. pp. 1986. pp.. Rabiner.F. 2. [3] http://mindstorms. IEEE Trans. Proceedings of the 1995 Fourteenth Southern. May 1995. 2003. 11. 1995. 83. 1987. al. 159–160. pp. Jan.. Tsai.References [1] Saha. on Signal Processing. F. Decalog 2007: Proceedings of the 11th Workshop on the Semantics and Pragmatics of Dialogue. Proceedings of the 1992 IEEESP Workshop. OShaughnessy. Vol. Vol. “Connectionist Probability Estimators in HMM Speech Recognition” IEEE Tran.. N.M.H. “Morphological Approaches to the Automatic Extraction of Phonetic Features”. [9] Renals S. Vol. Acoust. “Dialog OS: An Extensible Platform for Teaching Spoken Dialogue Systems”. “The new age electronic patient record system”. 
“Neural Networks for Statistical Recognition of Continuous Speech”. Davis and P. 161174. Sep. 134137. Jun. Proceedings of the IEEE.lego. 1. ICASSP. L. “A Hybrid Segmental Neural Net/Hidden Markov Model System for Continuous Speech Recognition” IEEE Trans. on Speech and Audio Processing. Jan. Neural Networks for Signal Processing [1992] II. Vol. [11] Morgan. 79 Apr. 1. L.. No. pp. Speech Communication. et al. Vol. RA3. Italy. Biomedical Engineering Conference. Issue 2. ASSP28. Bell Labs Record. pp.
Acoust.ac.ensmp.. MerriamWebster online dictionary. Vol. pp. Vol. [27] Beucher. Proceedings of 1995 IEEE International Conference on. “Extraction of pertinent subsets from timefrequency representations for detection and recognition purposes. E.fr/~serra/cours/T34. Macmillan Publishing Company. 1991. “Average location of formants in English vowels”. Amer. No. 2002.mw." 17 Aug 2007. Feb. Soc. 2024 Mar. “Elevator Group Control System Tuned by a Fuzzy Neural Network Applied Method” Fuzzy Systems. 24 Aug. ISBN: 0824787242. and Prade. France. [26] Digabel. pp. Jun.A.. on Signal Processing. Workshop Image Processing. The Free Encyclopedia. and Martin . 8599. Stuttgart. S. L. T. 85 .com .. [18] Proakis et al. J. pp. 4. Biology and Medicine. and Serra. and Lantuéjoul. Fuzzy sets. 12. Issue 4. [31] Imasaki. C. D. 1978. 229– 238. [30] “Fuzzy logic” Zadeh. [21] Serra. No. and Rabiner.pdf [22] Matheron. “System for Automatic Formant Analysis of Voiced Speech. N. Vol. et al. [25] www.. pp.J. 2003. 1995. [20] Hory. Jun 1998. CRC Press. and Folger. R. A. France. P. 2nd European Symp. IEEE Transactions on Pattern Analysis and Machine Intelligence. and Ughetto. A.inf. [29] Klir. No.org/w/index.. Ed. [23] http://homepages. “The Birth of Mathematical Morphology”. West Germany: Riederer.W."A New Perspective on Reasoning with Fuzzy Rules". 1995. Verlag. International Journal of Intelligent Systems. 50. 1979. ISBN: 0133459845. “Use of watersheds in contour detection” in Proc. Vol. Rennes. Prentice Hall. Chermant.R. 1988. L.php?title=Phoneme&oldid=151909142>. Sep. “Iterative Algorithms” in Proc. 1992."Phoneme.[17] Wikipedia. 1977.htm [24] Dougherty. L. 21:36 UTC. “Watersheds in Digital Spaces: An Efficient Algorithm Based on Immersion Simulations”.47. Vol.. J. H.” J. and Chehikian. No. Feb. uncertainty. and information. Wikimedia Foundation. and Lantuéjoul. 6. Sep. 83 – 93. “Mathematical Morphology in Image Processing”. 634648. 1970. N. pp. 2.ed.wikipedia. R. H. Int. 541563. and Soille. 
RealTime Edge and Motion Detection/Estimation. L. pp. [34] Schafer.uk/rbf/HIPR2/index. Vol. 82. J.” Signal Process. G. G.1735 – 1740. 5. 1993.N. 21.. J. [19] Leprettre. C. 18. 13. Caen. DiscreteTime Processing of Speech Signals. Inc.L. “Lecture Notes on Morphological Operators” http://cmm. N. 2007 <http://en. [28] Vincent. ISBN 0023283017. 1988. Apr. [32] Dubois. Vol. International Joint Conference of the Fourth IEEE International Conference on Fuzzy Systems and The Second International Fuzzy Engineering Symposium. Computer. Dec 2002. Oct. B. C. Quantative Analysis of Microstructures in Material Science. “Spectrogram Segmentation by Means of Statistical Features for NonStationary Signal Interpretation” IEEE Trans. Martin. Englewood Cliffs.
7174 – 7177. 1. [36] Kapelner. 2007. "Local Region Based Medical Image Segmentation Using JDivergence Measures" . “ProbabilityPossibility Transformations.. P. 25. pp. 2004. 2007.P. B. 273297. 1977. W. International Conference on pp. . ISBN: 0133361659. on Acoustics. Holmes. Lee. MediVis 2007. 81 – 86. Prentice Hall.BioMedical Visualisation. Reliable Computing. Prade. 2005. Oct. Proceedings of the 13th International Conference on Vol. “A Course in Digitial Signal Processing”. No. 2529 pp. [38] DePiero. A. 3. “Fundamentals of Digital Image Processing”. and Signal Processing.K. Aug 1996. [35] Jain. 1988. IEEEEMBS 2005. ISBN 0471149616. [37] Zhu. "Realtime range image segmentation using adaptive kernels and Kalman filtering". F. 1996. “An Interactive Statistical Image Segmentation and Visualization System” Medical Information Visualisation . Mauris. A. 10. 27th Annual International Conference of the.[33] Rabiner L. Triangular Fuzzy Sets and Probabilistic Inequalities“. “Use of Autocorrelation Analysis for Pitch Detection”. R. al. 86 . Foulloy. S. Feb. [40] Dubois.M. 573 – 577. Jul.Institut de Recherche en Informatique de Toulouse IRIT. pp. Speech. Trivedi M. [39] Porat. Pattern Recognition. John Wiley and Sons. et. IEEE Trans. Vol.W. Engineering in Medicine and Biology Society. Vol.
Summary

This thesis focuses on automatic speech recognition of continuous speech of the English language by means of image processing of speech spectrograms and fuzzy logic. In this work we first present a theoretical overview of the theories of fuzzy logic and mathematical morphology and a short overview of speech phonetics; we continue by presenting an algorithm for pitch estimation and conclude with a novel approach for segmentation of speech spectrograms.

The theory of fuzzy logic plays an important part in systems that are based on expert knowledge. Fuzzy logic is similar to mathematical morphology, and these two different theories are used to tackle the same problem of speech recognition via the spectrogram image. A fuzzy logic variable is used in the debugging of the segmentation algorithm since by definition we cannot decide whether the recognition is performed well or not according to a well-formulated metric. In fact, if we did have such a metric, we would have used it to perform the segmentation in the first place.

An algorithm that attempts to estimate the pitch from a narrowband spectrogram has been developed. The algorithm applies mathematical morphology methods to extract the pitch harmonics, thins them to a single-pixel width and then calculates the distance between these lines. We explain the decisions that led to the different algorithmic steps and the obstacles that prevent the algorithm from being used reliably. The pitch algorithm fails to give correct results because of the difficulty in determining the exact line distances and because in some cases we actually miss some pitch harmonics. We propose a computationally involved approach that should produce more reliable estimates.

An algorithm to segment speech spectrograms is developed. The algorithm uses morphological image processing techniques to perform the segmentation. Our objective is to create a prototype of a building block that can later be used within an automatic speech recognition application. Automatic speech recognition is still considered an open research topic, and current techniques do not exploit the vast knowledge that humans possess about speech. Although it is almost impossible to explain to others what our auditory system accepts as input and what makes us distinguish between different words, we can easily explain the input to our visual system and, for example, how we distinguish between different written words. Motivated by recent improvements in computing, in conjunction with advances in the fields of mathematical morphology (morphological image processing), Optical Character Recognition (OCR) and fuzzy logic, we establish a scheme for dissecting a speech spectrogram, treating it as if it were some kind of text written in a foreign language. By segmenting the speech spectrogram we enable an expert in spectrogram reading to write down rules based on knowledge and acquired experience. In this way we expect to extract information that is otherwise hard to extract or is simply missed by conventional recognition techniques.

An important transform used in the detection process is the Watershed Transform. The Watershed Transform is a morphology-based technique that allows objects in an image to be segmented, even when the objects partially occlude one another. We conclude the thesis by presenting ideas for improving the segmentation algorithm so that it produces better results for low-energy formants and for rapidly moving (rising or falling) formants. We also give some ideas for future research that would take the results produced by the segmentation and incorporate them into a fuzzy-logic-based expert system.
Chapter 1 Introduction

Automatic speech recognition has been an active research topic for the last four decades. The main objective of automatic speech recognition is to convert a speech segment into an interpretable text message without the need for human intervention. Many different algorithms and schemes based on different mathematical paradigms have been proposed in an attempt to improve recognition rates. Since the speech recognition problem is complex, recognition rates are, under certain circumstances, far from optimal. Automatic speech recognition is still an open research topic, in which improvements and changes are constantly being made in the hope of achieving better recognition rates.

Computer hardware and software have improved significantly in terms of speed, memory, cost and availability, which has allowed more sophisticated and computationally demanding algorithms to be implemented even on inexpensive, low-power handheld electronic devices. Owing to the improvements in algorithms and hardware, automatic speech recognition has become more accessible and available. In addition, other constraints such as computational complexity and time constraints come into play in the design and implementation of a working product. However, we prefer algorithms with low computational and memory requirements since they can be implemented easily and at a lower cost.

In this work we propose a different approach to automatic speech recognition based on the theories of mathematical morphology and fuzzy logic. This new approach is unusual in the sense that it involves three main fields of research, namely speech theory with emphasis on automatic speech recognition, image processing with emphasis on image segmentation, and evidence theory with emphasis on fuzzy logic, decision making and the combination of evidence. Before delving into the worlds of phonology and morphological image processing, we present an overview of automatic speech recognition and of several commonly used techniques that attempt to solve this formidable task.

1.1 Automatic Speech Recognition (ASR)

Automatic Speech Recognition (ASR) is the process of converting human speech into written text. Many advances have been made in this field, leading to systems with high recognition rates; however, many problems remain unsolved, due in particular to four main parameters:

1. Vocabulary size - the minimal vocabulary size is two words (Yes/No). Another common size consists of the 10 digits. Telephone conversations or news reports require a vocabulary of about 60,000 words. Newspapers and professional texts use a far more esoteric vocabulary, which makes recognition more difficult and often requires the use of specialized dictionaries.
2. Fluency - it is easier to recognize isolated words than continuous speech. In addition, isolated words give the system more time to process the results and show less variability between speakers. Read speech is usually clearer than conversational speech.
3. Channel and noise - a laboratory microphone in a laboratory environment has lower noise interference and signal distortion than speech picked up by a cell phone microphone in a moving car with the windows open. A low signal-to-noise ratio can cause severe interference and can degrade the performance of a system significantly. Signal distortion and a high compression rate can make some words sound alike.
4. Accents and other speaker-specific parameters - children have a different frequency range from adults. Foreign accents will degrade the performance of a system, as will uncommon accents for which it was not designed. Usually an ASR system can be adapted to a specific speaker in order to reduce the error rate.

ASR is used in numerous applications. We present several examples of common uses of ASR:

1. Dictation allows speech to be transferred into text almost completely hands-free. Thanks to the high performance of the software, the user is rarely obliged to intervene to correct the dictation results. An example of popular, commonly used software is Nuance's "Dragon Naturally Speaking".
2. Call centers provide the necessary information according to customers' requests. Usually call centers operate with a limited vocabulary related to their field of operation. The aim of the ASR system is to make the customer service representative's role easier and more efficient.
3. Medical transcription is becoming more widespread. Regulations and practical needs require that patient records be converted into digital text. The task of Optical Character Recognition (OCR) for converting handwriting into text is particularly challenging when it comes to a physician's handwriting. Currently, medical transcription in the field is done by professionals or by ASR systems whose vocabulary is enriched with medical terms [1].
4. Robotics uses ASR for guidance and instructions. A robot can be guided to accomplish a specific task; for example, "Dialog OS" [2] is a software package that can be used to enable a "Lego Mindstorms RCX" unit to understand speech [3]. The user can then guide the Lego robot to execute a certain pre-programmed task.
5. Security applications such as automatic tapping of telephone lines using specific keywords. For example, Nortel is developing an ASR system together with Qinghua University in an attempt to monitor every civilian in China [4].
6. Mobile phones and other communication devices use ASR for faster dialing. The phone contains ASR software that can identify names and dial the corresponding number. Usually the system is trained on a particular speaker, has a restricted vocabulary and operates under computational, memory and power constraints.
7. Pronunciation assessment in language learning applications. A second-language speaker must make a special effort to pronounce different words and phrases correctly. The ASR system tells the user whether the pronunciation is clear.
8. Machine translation, by converting speech into text and then translating the resulting text.
9. Home appliances use ASR to provide a friendlier human interface. Fujita et al. [5] propose a speech remote control for digital TV that uses 15 buttons instead of the 70 required to operate a multichannel TV. Difficult commands are made simple since, instead of complex button programming, the user simply speaks the desired command.

Chapter 3 Mathematical Morphology

Mathematical morphology is a theoretical model that justifies, in particular, operators that are useful in image processing applications. Based on lattice theory, set theory and axioms, mathematical morphology provides solutions for manipulating specific geometric objects in different topological spaces. Basic operators are used to build more complex operators, while all operators rely on structuring elements that serve as modular geometric units for the different morphological operators.

We begin with a historical overview of mathematical morphology. We continue with the axioms and basic definitions that exemplify the importance of the different morphological operators. We review the morphological operators, starting with the most fundamental operators of dilation and erosion, and finally we conclude with the algorithmic scheme of the Watershed Transform, which is used for image segmentation. We end this chapter by showing the close relationship between fuzzy logic and mathematical morphology and by concluding on the importance and relevance of mathematical morphology to automated spectrogram reading.

3.2.2 Erosion
Erosion is a basic morphological operation. A kernel is used in which a 'logical 1' typically marks the structuring element (SE). Each time, we take the center point of the kernel as the reference point and produce a 'logical 1' if the SE is entirely contained in the image with respect to this reference point. The operation is equivalent to a logical 'AND' in the binary case. The objective is to find the objects in the image that match the SE exactly. The output image will contain all the points at which the SE is entirely contained in the original image. The image is thus eroded with respect to the SE. Consequently, erosion is an anti-extensive operator.

3.2.4 Dilation

Dilation is the dual operator of erosion. Dilation follows the same scan of the image with a kernel in which the SE is marked by a 'logical 1'. However, a logical 'OR' is used in the binary case, and a resulting '1' is written to the output image if there is at least one pixel in the image that matches the SE. Consequently, dilation is an extensive operator. In general, however, dilation is not the inverse operator of erosion unless the image is left unchanged by the opening with the SE. In the same way, erosion is in general not the inverse operator of dilation unless the image is left unchanged by the closing with the SE.

3.2.6 Opening

Open(Im, SE) = Im ∘ SE

The opening is performed by dilating the result of an erosion of an image with the same structuring element (SE). The closing is the dual operator of the opening. The opening is normally used to open up an image and clean it of salt noise, and to separate different types of objects (lines in specific directions, specific objects). Salt noise disappears after the erosion and does not grow back in the subsequent dilation. A simple erosion of the image would also shrink the bright pixels of the genuine objects, so it is necessary to perform the dilation in order to reconstruct these objects. Normally, after opening, features smaller than the structuring element are removed, while features larger than the SE remain roughly the same.

In algebra, every operation that is increasing, anti-extensive and idempotent is called an opening. The morphological opening presented here is idempotent, (Im ∘ B) ∘ B = Im ∘ B for a structuring element B. It is also anti-extensive and increasing. The following example demonstrates how the opening can be used to reduce salt and pepper noise:

Figure 3.4: The image of circles corrupted by salt and pepper noise (left) and after the opening.

3.3 The Distance Transform

The distance transform yields the distance of each foreground pixel from the nearest boundary according to the chosen connectivity. The ridges of the distance transform can be regarded as local maxima, and they represent the skeleton of the object in question. It is possible to erode an image by a structuring element that is a disk of a certain radius r simply by removing the pixels whose value is less than r.
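The operators discussed in this chapter can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: images are lists of rows of 0/1 values, pixels outside the image count as background, the structuring element is a cross (the city-block disk of radius 1) rather than a Euclidean disk, and all function names are hypothetical. Under this convention a boundary pixel has distance 1, so erosion by radius r keeps the pixels whose distance is strictly greater than r.

```python
from collections import deque

# Symmetric cross-shaped structuring element, given as (dy, dx) offsets.
CROSS = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]


def _is_foreground(img, y, x):
    """True if (y, x) lies inside the image and is a foreground pixel."""
    return 0 <= y < len(img) and 0 <= x < len(img[0]) and img[y][x] == 1


def erode(img, se=CROSS):
    """Binary erosion: logical AND over the SE placed at each pixel."""
    return [[1 if all(_is_foreground(img, y + dy, x + dx) for dy, dx in se) else 0
             for x in range(len(img[0]))] for y in range(len(img))]


def dilate(img, se=CROSS):
    """Binary dilation: logical OR over the reflected SE placed at each pixel."""
    return [[1 if any(_is_foreground(img, y - dy, x - dx) for dy, dx in se) else 0
             for x in range(len(img[0]))] for y in range(len(img))]


def opening(img, se=CROSS):
    """Opening: erosion followed by dilation with the same SE."""
    return dilate(erode(img, se), se)


def distance_transform(img):
    """City-block distance of each pixel to the nearest background pixel.

    Assumes the image contains at least one background pixel.
    """
    h, w = len(img), len(img[0])
    dist = [[0 if img[y][x] == 0 else None for x in range(w)] for y in range(h)]
    queue = deque((y, x) for y in range(h) for x in range(w) if img[y][x] == 0)
    while queue:  # breadth-first search outwards from the background
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist


def erode_by_threshold(dist, r):
    """Erosion by a city-block disk of radius r as a cut of the distance transform."""
    return [[1 if d is not None and d > r else 0 for d in row] for row in dist]


# A 5x5 square object plus one isolated "salt" pixel, with a background border.
image = [[0] * 9 for _ in range(9)]
for y in range(1, 6):
    for x in range(1, 6):
        image[y][x] = 1
image[7][7] = 1

opened = opening(image)
dist = distance_transform(image)
```

In this example the opening removes the isolated pixel and keeps the square apart from its four corner pixels, and one or two consecutive erosions with the cross give the same result as cutting the distance transform at 1 or 2 respectively.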
Consecutive erosions in this case would be equivalent to alpha-cuts of the distance transform at intervals of size r.

Chapter 4 Fuzzy Logic

4.2 Fuzzy Logic versus Boolean Logic

Boolean logic, together with crisp set theory and crisp probability theory, is inadequate for dealing with the imprecision, uncertainty and complexity of the real world. The limitation of the crisp (Boolean) theories is that they leave no room for ambiguity and ignorance. If we have prior information about a set of events, we would like to have a model that allows us to use this knowledge. On the other hand, if we have no information about other events, modeling them with a uniform distribution would not always be a good choice. Fuzzy logic was developed to deal with these types of situations and to allow us to easily write, and later adapt along the way, the logical rules that are the result of expert knowledge.

For example, suppose that we would like to know whether a person is tall. We could decide on a specific fixed threshold of 1.70 meters and consider all candidates with a height greater than or equal to 1.70 meters to be tall. The threshold for a certain population can be determined intuitively or by median, mean or other statistics. Obviously the problem with this method is that two people who differ by 1 centimeter will be classified into two different groups.

In order to overcome the crisp boundaries of the set "tall" and its complementary set "not tall", we construct a fuzzy membership function that assigns values ranging from zero to one to represent the height of a person. A person with a height of more than 1.90 meters will be considered tall in most cases (the population of basketball players is a good counterexample), while a person below 1.50 meters will not be considered tall. We assign values to the extreme cases: '1' to represent that the person is clearly in the set of tall people and '0' to represent that the person is clearly not a member of the set of tall people. All other heights are assigned a mid-range value. In effect, the crisp (Boolean) variable "tall" has been converted into a fuzzy variable that can take values on the interval [0, 1]. The conversion of a crisp problem into a fuzzy one is called fuzzification.

We can also quantize the height and consider each subsection to have a different membership function. For example, we can quantize the height into the following fuzzy variables: {Below Average, Average, Tall, Very Tall}.

The previous example required converting a single crisp variable into a fuzzy variable. We now examine the weight of a person in order to determine whether the person is thin. Instead of a single threshold we now have two thresholds, since a person can weigh more or less than the weight at which a person is considered thin. It can be noted that there is a dependency between the height of a person and the weight, in the sense that we expect a thin, tall person to weigh more than a thin but shorter person. We can construct weight membership functions that vary as a function of the fuzzy variable "tall" and create a type-2 fuzzy set.

Another important concept in fuzzy logic is that of linguistic hedges. For example, we can apply the linguistic hedge "very" to the fuzzy set "tall" to create the fuzzy set "very tall". We can then interpret the hedge in some predefined manner. We can choose that the hedge "very" corresponds to squaring the membership function. Since the membership values are between zero and one, squaring the values adjusts their membership in the set in a non-linear way (every value strictly between zero and one is reduced). Other examples of hedges are: 'slightly', 'extremely', 'more or less', 'rather', etc.
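The "tall" example and the hedge "very" can be sketched as follows. This is a minimal illustration rather than the membership function used in this work: the text fixes only the extreme cases (0 below 1.50 m, 1 above 1.90 m) and the crisp 1.70 m threshold, so the linear ramp between the extremes and the function names are assumptions.

```python
def crisp_tall(height_m):
    """Boolean (crisp) classification with a fixed 1.70 m threshold."""
    return height_m >= 1.70


def tall(height_m):
    """Fuzzy membership in the set "tall".

    0 below 1.50 m, 1 above 1.90 m, and (as an assumption) a linear
    ramp for the mid-range heights in between.
    """
    if height_m <= 1.50:
        return 0.0
    if height_m >= 1.90:
        return 1.0
    return (height_m - 1.50) / (1.90 - 1.50)


def very(membership):
    """Linguistic hedge "very", interpreted as squaring the membership value."""
    return membership ** 2


for h in (1.45, 1.69, 1.70, 1.80, 1.95):
    print(h, crisp_tall(h), round(tall(h), 3), round(very(tall(h)), 3))
```

Two people who differ by 1 centimeter around 1.70 m fall into different crisp classes, whereas their fuzzy memberships differ only slightly. Squaring leaves the extreme values 0 and 1 unchanged and lowers every intermediate value.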
Depending on the problem, we have the flexibility to choose different functions for different hedges as long as we remain consistent with our definitions.

Another example of the importance of fuzzy logic and fuzzy set theory can be seen in the "student's dilemma". The first implication of the student's dilemma is "the more I study, the more I know". The second implication is "the more I know, the more I forget". Hence, using crisp deduction, the student states that "the more I study, the more I forget", which can lead to the conclusion that studying decreases knowledge. However, instead of a crisp threshold that, once triggered, would produce a certain result, by using fuzzy logic operators to link the different premises we would obtain a value for the forgotten knowledge that should be low relative to the value of the knowledge learned. We would conclude that the part of the knowledge that is forgotten is negligible compared with the knowledge gained.

Chapter 6 Automatic Spectrogram Segmentation Algorithm

6.1 Introduction

In this chapter we present an algorithm that performs effective segmentation of the phonemes in a speech spectrogram. We review previous work and give the motivation for automatic reading of speech spectrograms. We continue with specific and detailed descriptions of the main algorithmic ingredients, followed by an explanation of the segmentation algorithm. After presenting and analyzing the results, we suggest different ways of adapting the algorithm so that it can handle different procedures. We summarize the chapter with conclusions.

6.3 Algorithm Description

1. Apply a median filter, the first time with a 3 by 20 rectangle and then 4 more times using a horizontal line of 20 pixels.
2. Smooth using a 2D Wiener filter. The local mean and variance are estimated in a 16 by 16 neighborhood around each filtered pixel.
3. Apply a 2D Gaussian window (Gabor filter).
4. Apply a local threshold on (3).
5. Apply a global threshold on (3).
6. Combine the results of (4) and (5) using a logical 'OR'.
7. Dilate with a disk as the structuring element in order to disconnect thin lines and eliminate small areas in the image.
8. Use morphological connectivity to discard small sections that contain fewer than 40 pixels or that have a maximal width of less than 20 pixels.
9. Run a Watershed Transform algorithm with 8-connectivity.
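Step 8 of the list above can be sketched in Python as a connected-component pass over the binary mask. This is an illustration under stated assumptions, not the implementation used in this work: the mask is a list of rows of 0/1 values, components are formed with 8-connectivity, "width" is read as the horizontal extent of a component's bounding box, and the function name is hypothetical. The constants 40 and 20 are the ones given in the step.

```python
from collections import deque


def discard_small_sections(mask, min_pixels=40, min_width=20):
    """Keep only connected sections with enough pixels and enough width."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    out = [[0] * w for _ in range(h)]
    for y0 in range(h):
        for x0 in range(w):
            if mask[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Collect one connected section with a breadth-first search.
            queue = deque([(y0, x0)])
            seen[y0][x0] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            columns = [x for _, x in pixels]
            width = max(columns) - min(columns) + 1
            # Copy the section to the output only if it passes both tests.
            if len(pixels) >= min_pixels and width >= min_width:
                for y, x in pixels:
                    out[y][x] = 1
    return out


# Hypothetical mask: a long bar (kept), a 3x3 blob and a narrow block (both discarded).
mask = [[0] * 40 for _ in range(12)]
for y in (0, 1):
    for x in range(25):
        mask[y][x] = 1          # 50 pixels, width 25
for y in range(4, 7):
    for x in range(30, 33):
        mask[y][x] = 1          # 9 pixels, width 3
for y in range(5, 10):
    for x in range(10):
        mask[y][x] = 1          # 50 pixels, width 10

cleaned = discard_small_sections(mask)
```

Raising or lowering `min_pixels` and `min_width` trades false positives (small spurious segments) against the risk of discarding weak formant pieces.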
[Flow chart: Spectrogram Image → Median 3 by 20 (set cnt = 1) → Median 1 by 20 (cnt = cnt + 1, repeated until cnt = 4) → Local 2D Wiener Filter → Local Threshold and Global Threshold → Logical OR → Dilation (disk as structuring element) → Discard Small Connected Sections → Watershed Transform → Binary Mask]
12 Résultats L'algorithme a été examiné sur différents échantillons de la parole provenant de la base de données de TIMIT. mots accentués indiquent 1 seconde de la parole qui dans cet exemple est montrée dans la fig. chacun des quatre formants est bien aligné et préparé pour être reconnu par un système approprié. la segmentation manque quelques parties des phonèmes /r/ mais la direction générale est préservée. la coarticulation et les différentes combinaisons de phonèmes. the litter remained. Pour le troisième et quatrième formant. Dans cet exemple. un filtrage médian et une dilatation d'image sont appliquées avec des paramètres globaux. La transcription orthographique et la transcription phonétique tempsalignée sont incluses pour chaque phrase. Nous obtenons la bonne segmentation pour les premiers et deuxièmes formants pour tous les phonèmes exprimés. La base des données TIMIT contient les hautparleurs femelles et masculins de 7 régions avec de différents dialectes aux EtatsUnis. Lisser et seuil sont faits au niveau local et global. augmented by several dozen lunchroom suppers”. la segmentation a été bien exécuté sur différents hautparleurs et différentes phrases.1 : Écoulement De Diagramme D'Algorithme L'algorithme emploie des techniques de traitement locales et globales. 1(a). 101 . 6. La "Watershed Transform" est appliquée à l'image entière depuis l'interaction entre différents jeux d'objets d'image par partie importante dans le procédé de segmentation. Lisser au niveau local emploie un moyen et un désaccord localement estimés pour un 2D filtre Wiener tandis que des procédures douces globales en utilisant une fenêtre gaussienne. Les résultats étaient robustes . Les hautparleurs ont répété des phrases particulièrement conçues à SRI.Figure 6. Notre premier exemple emploie la phrase significative “However. à MIT et à TI pour exemplifier différentes caractéristiques de la parole telles que l'accent.
Our second example presents a more challenging case. We examine a different section of the same sentence: "However, the litter remained, augmented by several dozen lunchroom suppers." As shown in Fig. 6.2(b), the algorithm has difficulty segmenting the second and third formants of /r/. Since these formants are very close together, it is difficult to distinguish between them and to segment them as different objects. In addition, the high strength of f3 makes it harder to separate it from f2. Another difficulty arises in identifying the nasal /m/. The low spectral density, caused by a spectral zero that reduces the second formant, makes it hard to segment the phoneme correctly.
As a last example, we choose: "Don't ask me to carry an oily rag like that." As seen in Fig. 6.2(c), the algorithm manages to work well when the formant energies are strong. However, the BLOBs associated with f1 are sometimes connected across more than one phoneme. This phenomenon also occurs in some cases for the higher formants. On the other hand, as was also noted in the previous examples, there are several cases in which a formant is segmented into more than one BLOB. Although over-segmentation was addressed in the watershed transform algorithm, we still have remnants in the form of small binary objects that can cause problems in the identification stage. Another problem is small segments that do not represent a formant but still appear in the image (false positives). This problem can be addressed by changing the constant in step #8 of the algorithm. In general, however, changing the constant to accept only stronger energies would result in the loss of some true formants.
In order to verify the behavior of the algorithm more systematically, we examine the results over multiple tests. The criterion by which we judge performance is the fuzzy variable "grade", which takes the values {'perfect', 'good', 'average', 'below average', 'poor'} for the segmentation results. We assign a number to each descriptor: 'perfect' takes the highest value of 5, 'poor' takes the lowest value of 1, and 'average', which takes the value of 3, is believed to contain enough information for automatic identification. We choose 10 phonemes and run 20 different tests for each phoneme, for a total of 200 different speech segments. The results, including the mean and variance of the visual measurements, are presented in Table 6.1.
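As a small illustration of how the summary statistics of the grades can be computed, the sketch below maps the five descriptors to numbers and evaluates the mean and variance for one phoneme. The grade list is the glide /l/ column of Table 6.1; the values 4 and 2 for 'good' and 'below average' are implied by the ordering rather than stated explicitly, and the use of the sample variance (division by n - 1) is an assumption that happens to reproduce the tabulated figure.

```python
from statistics import mean, variance

# Numeric values assigned to the fuzzy grade descriptors.
# 5, 3 and 1 are stated in the text; 4 and 2 are implied by the ordering.
GRADE = {"perfect": 5, "good": 4, "average": 3, "below average": 2, "poor": 1}

# Grades of the 20 tests for the glide /l/, taken from Table 6.1.
l_grades = [3, 4, 1, 5, 5, 2, 5, 1, 1, 1, 2, 1, 5, 3, 5, 5, 2, 5, 5, 3]

l_mean = mean(l_grades)
l_var = variance(l_grades)  # sample variance (divides by n - 1)

print(f"mean = {l_mean:.2f}, variance = {l_var:.2f}")
# prints: mean = 3.20, variance = 2.91
```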
Figure 6.2: Algorithm results for different cases. (A) "several dozen"; (B) "the litter remained"; (C) Hamming window, "an oily rag"; (D) Hann window, "an oily rag".
After examining the algorithm, we see that in general it obtains good segmentation results for strong formants across different phonemes. The algorithm obtains better segmentation results when the phoneme duration is longer, since more information is available; and since our segmentation algorithm looks for large objects, we tend to miss small concentrations of energy. In general, the vowels are well recognized. The nasal /m/ and the glide /l/ have inferior segmentation results due to the difficulty of tracking diagonal lines in the spectrogram. It is possible to extend the algorithm to detect diagonal lines by adding a tracking procedure such as a Kalman filter, or by a median filter that emphasizes diagonal lines. The semivowel /w/ is segmented better for short-duration phonemes, since there is a higher concentration of energy that allows better segmentation of f3 and f4.
Test #     aa    ae    eh    ux    ow    oy    r     l     m     w
1          5     5     5     5     5     5     5     3     5     1
2          5     5     4     4     4     4     3     4     5     2
3          2     3     3     2     2     3     3     1     4     5
4          4     5     2     5     4     5     5     5     3     3
5          5     4     5     5     5     3     5     5     2     4
6          2     5     5     5     3     5     5     2     5     5
7          5     4     4     5     5     4     3     5     1     5
8          4     5     5     5     5     3     3     1     2     2
9          5     4     2     2     4     4     4     1     1     3
10         3     3     5     5     5     5     5     1     4     3
11         5     4     1     5     5     4     4     2     1     3
12         5     5     3     4     3     4     5     1     1     4
13         5     4     5     5     4     3     5     5     3     5
14         5     4     4     3     2     3     3     3     1     2
15         2     2     5     3     3     3     4     5     2     2
16         5     4     5     4     5     2     5     5     2     5
17         4     3     5     5     5     3     5     2     1     2
18         2     5     5     5     3     5     4     5     2     2
19         4     5     5     5     5     5     4     5     3     5
20         5     5     5     5     4     4     4     3     5     5
Mean       4.1   4.2   4.15  4.35  4.05  3.85  4.2   3.2   2.65  3.4
Variance   1.46  0.8   1.61  1.08  1.10  0.87  0.69  2.91  2.34  1.94
Table 6.1: Results of a visual inspection. The grades describe the accuracy of the segmentation algorithm for each phoneme.
We demonstrate the subjective choice of grades according to our fuzzy variable by showing some speech spectrograms that correspond to the different grades. We choose the glide /l/ which, as seen in Table 6.1, receives all 5 possible values. The same sentence is spoken by different speakers. Again, the bold part of the sentence indicates the segment of speech that is actually shown in the speech spectrogram, where the glide /l/ is centered at 0.5 seconds from the start. The numbers in parentheses indicate the row of the table as well as the line on which the file name is located in the text file inputl.txt in the input directory.
1. Don't ask me to carry an oily rag like that. (9)
Figure 6.3: Example of a grade 1 score.
Justification: We miss the second formant almost entirely, due to its sharp rise and relatively low energy. Since the main characteristic of the glide lies in its second formant, we give this result the grade '1'.
2. Don't ask me to carry an oily rag like that. (6)
Figure 6.4: Example of a grade 2 score.
Justification: We have a clear first and fourth formant; the second formant is well represented but the third formant is absent. Identification would be difficult (although not impossible), so the grade '2' was given.
3. She had your dark suit in greasy wash water all year. (1)
Figure 6.5: Example of a grade 3 score.
Justification: We have each of the four formants. From the mask we can infer the direction and location of the second formant. The third formant can also be estimated well. Identification should be possible in this case, so the grade '3' was given.
4. She had your dark suit in greasy wash water all year. (2)
Figure 6.6: Example of a grade 4 score.
Justification: We have each of the four formants. The second formant is well described. The location of the third and fourth formants can be easily understood. Consequently, the grade '4' was given.
5. Don't ask me to carry an oily rag like that. (5)
Figure 6.7: Example of a grade 5 score.
Justification: Each of the four formants is well characterized. The algorithmic segmentation captures the formants, so automatic speech recognition becomes possible.
We see that, owing to different accents and energy distributions, the segmentation results differ considerably. Since our algorithm is designed to follow horizontal lines and shapes, we have a problem with glides, and in particular with rising and falling frequencies. A linear model for a rise or a fall is well known in the context of ASR as the "delta coefficients" (first derivative). A second-order model also uses the delta-delta coefficients, which approximate the second derivative and amount to fitting the data to a parabolic function. Although with some adjustments it is possible to adapt the algorithm to capture non-horizontal movements, we see that even at this early, non-commercial stage of the algorithm we obtain, in most cases, results that can be considered sufficient for an automatic speech recognition system. We obtain very good identification results when the phoneme duration is short. We attribute this to the relatively high concentration of energy and to the gentle glide in the second formant, both of which are better suited to an algorithm that aims to segment horizontal lines.
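The delta and delta-delta coefficients mentioned above can be sketched as follows. This is an illustrative example only: it uses the regression formula commonly used for delta features in ASR front ends, which the thesis does not spell out, and the formant track, the window half-width `n_window`, and the edge handling (repeating the first and last values) are assumptions.

```python
def delta(track, n_window=2):
    """Regression-based delta coefficients of a 1-D sequence.

    d[t] = sum_{n=1..N} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2)
    Edges are handled by repeating the first and last values (an assumption).
    """
    norm = 2 * sum(n * n for n in range(1, n_window + 1))
    last = len(track) - 1
    out = []
    for t in range(len(track)):
        acc = 0.0
        for n in range(1, n_window + 1):
            ahead = track[min(t + n, last)]
            behind = track[max(t - n, 0)]
            acc += n * (ahead - behind)
        out.append(acc / norm)
    return out

# Hypothetical formant track (Hz per frame) rising along a parabola.
track = [1000 + 3 * t * t for t in range(12)]

d1 = delta(track)  # slope estimate; equals 6 * t away from the edges
d2 = delta(d1)     # delta-delta; equals 6 away from the edges
```

For a parabolic track the delta-delta value is constant away from the edges, which is why a second-order model corresponds to a parabolic fit.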
6.14 Conclusion
A robust algorithm for the segmentation of speech spectrograms has been presented. By employing morphological image processing techniques, we can obtain reliable segmentation of the formants in most cases. The algorithm performs well for all voiced phonemes and gives better segmentation results than previous techniques; however, difficulties arise when the formant frequencies are close together or when a low-energy formant rises or falls rapidly in frequency. Some suggestions, such as changing the threshold level, have been made to improve or tune the algorithm. These results can be used as input to an automatic speech recognition system or in other general uses of speech spectrograms. It is the authors' belief that a spectrogram-based speech recognition system can complement an existing recognition system by incorporating human expert knowledge into the recognition task.
Chapter 7 Summary
7.1 Review of the Work and Logical Deductions
In the previous chapters we laid the foundations of three main theories: speech recognition, morphological image processing and fuzzy logic. We saw that it is possible to combine these methods in order to design a scheme that can perform automatic speech recognition. The close relationship between fuzzy logic and mathematical morphology helped us understand how to link these two theories. Justifications were presented for using mathematical morphology to perform the image segmentation. The main goal of this thesis was to segment a spectrogram image, and for this reason a segmentation algorithm was designed.
The segmentation algorithm works well in most cases. We saw how, by choosing a Hamming window instead of a Hann window, we can obtain better segmentation results, since we get better separation between adjacent frequencies and because, to a certain extent, the time dependencies between the pixels in the spectrogram image can be compromised. We concluded that experts can extract information from wideband speech spectrograms, and we saw the difference between narrowband and wideband spectrogram images, the information they contain, and the different shapes that require different morphological operators to extract the information from the images. In section 5.4 we saw the mathematical properties of the median operation. We also used the median to smooth the narrowband and wideband images as a first step before applying stronger segmentation or extraction techniques such as the watershed transform or the morphological "Thinning" operator. The watershed transform is effective for segmenting noisy images, in particular in cases where the different target objects partially occlude one another. We obtain a labeled mask image, and in most cases each BLOB corresponds directly to a formant of a particular phoneme. In some cases we obtain several small BLOBs that belong to the same formant; however, this should not pose a particularly difficult problem, since most of the information we need for the identification task is still retained.
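For reference, the two analysis windows compared in this section differ only in their coefficients. The sketch below uses the standard symmetric definitions of the Hann and Hamming windows; it only shows the window shapes and does not by itself reproduce the segmentation comparison. The function names and the window length are illustrative.

```python
import math

def hann(length):
    """Symmetric Hann window: tapers to exactly zero at both ends."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def hamming(length):
    """Symmetric Hamming window: ends stay at about 0.08 instead of zero."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

# Illustrative 9-point windows; both reach 1.0 at the centre sample.
w_hann = hann(9)
w_hamming = hamming(9)
```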
We used a fuzzy variable to measure the results of the algorithm. This debugging method ensured that the algorithm would be optimized to yield results as close as possible to the information extracted by an expert, who performs a visual inspection of a speech spectrogram in order to extract the information. In most cases, the vowels are segmented satisfactorily and all of the first four formants are well detected and identified. Sometimes, because of low energy, a formant is missed. Another common caveat is a formant that breaks into smaller pieces during segmentation, due to lower strengths in its central region. Glides present a more challenging case, since they require tracking formants that rise or fall in frequency. Moreover, their strengths are generally lower than those of the vowels. We can obtain satisfactory results for glides in most cases, although these results are inferior to those obtained for the vowels.
7.2 Ideas for Future Work
We have succeeded in performing a segmentation that works well in most cases. However, performing an equalization that takes as input the time and energy properties of each phoneme, and that is tuned to a person or to a specific accent group, may help obtain even better results. A simple equalization can use a gamma correction, as explained in section 6.5, to change the luminance and hence the darkness of the different energy sections in the speech spectrogram. Another improvement of the segmentation algorithm could be to tune the algorithm and adjust it to different types of phonemes. By constraining the number of BLOBs we expect to segment over a pre-specified time period, that is, by reducing the number of small BLOBs and merging BLOBs that are really constituents of the same formant, we can improve the results significantly.
In order to perform automatic speech recognition using the results of our algorithm, we will need to build an expert system. The expert system would be based on one or more experts in spectrogram reading and would take the form of IF/THEN rules. The rules would also have an aggregation method that explains how to perform the union, the intersection or the negation in different cases. In addition, we will need to extract a feature vector from our segmented image. The feature vector may include parameters such as the length of the BLOB, the location of the frequency band, and the degree of its slope, approximated to first or second order. Delta coefficients and energy strength are measured in the same way. The feature vector would be constructed according to the rules laid out in the expert system. The membership functions for each element of the feature vector can be designed manually or trained by a neural network. Since we are not limited to the information we have in the mask (the segmented binary image), we can use the mask as a reference and extract more precise information related to a specific BLOB from the original spectrogram. Finally, a full regression would be run to analyze the performance of the system.
We hope that the information contained in speech spectrograms, as interpreted by human experts knowledgeable in acoustic models, phonetics, linguistics and speech production, can yield better recognition rates than current methods that do not incorporate human knowledge in their algorithms. The information extracted by this method can also be combined with existing systems to improve their results. A clear advantage of the proposed system is its intuitive rule-based design and the possibility of incorporating the knowledge of more than one expert. One possible solution for creating the rule set is a wiki-based system that would allow experts from different places around the world to contribute their experience.
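To make the proposed IF/THEN rule structure more concrete, here is a minimal sketch of a single fuzzy rule evaluated on a BLOB feature vector. Everything specific in it is hypothetical: the feature names, the numeric values, the triangular membership breakpoints and the rule itself are invented for illustration, and the min/max/complement operators are only one common choice of aggregation.

```python
def triangular(x, left, peak, right):
    """Triangular membership function: 0 outside (left, right), 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

def fuzzy_and(*degrees):
    """Intersection, taken here as the minimum."""
    return min(degrees)

def fuzzy_or(*degrees):
    """Union, taken here as the maximum."""
    return max(degrees)

def fuzzy_not(degree):
    """Negation, taken here as the standard complement."""
    return 1.0 - degree

# Hypothetical feature vector extracted for one BLOB.
blob = {"length_ms": 120.0, "band_hz": 700.0, "slope_hz_per_ms": 0.5}

# Hypothetical memberships; breakpoints are illustrative, not from an expert.
is_long = triangular(blob["length_ms"], 50.0, 150.0, 250.0)
is_low_band = triangular(blob["band_hz"], 200.0, 600.0, 1000.0)
is_steep = triangular(abs(blob["slope_hz_per_ms"]), 0.0, 5.0, 10.0)

# Hypothetical rule: IF the BLOB is long AND in a low band AND NOT steep
# THEN it supports the hypothesis "first formant of a vowel".
rule_strength = fuzzy_and(is_long, is_low_band, fuzzy_not(is_steep))
```

In a full system, the firing strengths of several such rules would be aggregated and the membership functions could be tuned by hand or learned, as the text suggests.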
List of Figures
Fig. 3.1: Reference circles image.
Fig. 3.2: Circles image before and after applying ultimate erosion.
Fig. 3.3: Circles image corrupted by white Gaussian noise, after applying the Beucher gradient.
Fig. 3.4: Circles image corrupted by salt-and-pepper noise, and after opening.
Fig. 3.5: Original image; applying a skeleton; skeleton pruning.
Fig. 3.6: Example of cleaning and extracting information from an image using morphological operators.
Fig. 4.1: Examples of common parametric membership functions.
Fig. 4.2: Sugeno fuzzy complement for different lambda parameters.
Fig. 4.3: Yager fuzzy complement for different parameter values.
Fig. 5.1: Narrowband speech spectrogram.
Fig. 5.2: Narrowband speech spectrogram after line detection.
Fig. 5.3: Results of the pitch estimation algorithm.
Figure 6.1: Algorithm flow diagram. (*)
Figure 6.2: Algorithm results for different cases. (*)
Figure 6.3: Example of a grade 1 score. (*)
Figure 6.4: Example of a grade 2 score. (*)
Figure 6.5: Example of a grade 3 score. (*)
Figure 6.6: Example of a grade 4 score. (*)
Figure 6.7: Example of a grade 5 score. (*)
(*) The figure was included in the summary.
List of Tables
Table 2.1: Average formant frequency values for selected phonemes.
Table 6.1: Results of a visual inspection. The grades describe the accuracy of the segmentation algorithm for each phoneme. (*)
(*) The table was included in the summary.
Fig. 3.1: Reference circles image.
Fig. 3.2: Circles image before (left) and after applying ultimate erosion.
Fig. 3.3: Circles image corrupted by white Gaussian noise, after applying the Beucher gradient.
Fig. 3.5: Original image (top); applying a skeleton (middle); skeleton pruning.
Fig. 3.6: Example of cleaning and extracting information from an image using morphological operators. (a) The original image corrupted by salt-and-pepper noise. (b) A very noisy result after applying the skeleton operator. (c) Result of cleaning (a) using the closing and opening operators. (d) Result of applying the skeleton operator to (c). A much clearer skeleton is obtained.
Fig. 4.1: Examples of common parametric membership functions. (a) Sigmoid membership function. (b) Pi membership function. (c) Z membership function. (d) Triangular membership function.
Fig. 4.2: Sugeno fuzzy complement for different lambda parameters.
Fig. 4.3: Yager fuzzy complement for different parameter values.
Fig. 5.1: Narrowband speech spectrogram.
Fig. 5.2: Narrowband speech spectrogram after line detection.
Fig. 5.3: Results of the pitch estimation algorithm.
Phoneme   f1 (Hz)   f2 (Hz)   f3 (Hz)
/i/       270       2290      3010
/I/       390       1990      2550
/E/       530       1840      2480
/@/       660       1720      2410
/a/       730       1090      2440
/c/       570       840       2410
/U/       440       1020      2240
/u/       300       870       2240
/A/       640       1190      2390
/R/       490       1350      1690
Table 2.1: Average formant frequency values for selected phonemes.