Multimedia Communication Signal Processing Group
Analysis, Modelling and Synthesis of
British, Australian and American Accents
Ram Gangwar
Multimedia & Project Lab
Department of CS & IT
MAHATMA JYOTIBA PHULE ROHILKHAND UNIVERSITY
Supported by EPSRC
Content
1- Introduction to Phonetics and Acoustics of Accents
2- Research Issues in Modelling Acoustics of Accents of English
3- Current Research Problems
4- Accent Analysis and Models
5- Accent Morphing
6- Audio Demo
1. Introduction to Phonetics and Acoustics of Accents
1.1 Background
• Accents are acoustic manifestations of differences in pronunciation and intonation among a community of people from a national, regional or socio-economic grouping.
• Accents are dynamic processes: they evolve over time, influenced by large-scale immigration, socio-economic changes and cultural trends.
• Applications of accent models include:
- speech recognition,
- text-to-speech synthesis,
- voice editing,
- accent morphing in broadcasting and films,
- toys and computer games,
- accent coaching and education.
1.2 Basic Structure of Accents
• Generally, the structural differences between accents can be divided into two broad parts:
(a) Differences in phonetic transcriptions.
(b) Differences in acoustic correlates and intonation of accents.
• The importance of an accent feature depends on its distance from that of the ‘standard’ or ‘received’ pronunciation and on the frequency with which that feature occurs in the acoustics of speech.
1.3 Phonetics of Accents
• A dominant aspect of accents is the difference in pronunciation as transcribed by a phonetic dictionary.
• The differences in phonetic transcription can be categorized into two classes:
a) Differences in the number and identity of the phonemes.
For example, British English as transcribed by Cambridge University’s BEEP dictionary has five extra vowels, /ax(ə) ea(ɛə) ia(iə) ua(uə) ah(ɒ)/, compared to American English as transcribed by the Carnegie Mellon University (CMU) dictionary. Of these, /iə ɛə uə/ are allophones of /i ɛ u/. In American English /ɒ/ is merged with /a/, unlike in the British accent.
The American transcription has three different levels of stress for vowels and diphthongs. Australian English also has distinctive vowels such as /æi/ instead of /ei/ and /æƆ/ for /au/.
b) Differences in phonetic realizations: phoneme substitution, deletion, insertion.
For example, ‘JOHN’ is pronounced as /ʤΛn/ in American but as /ʤƆn/ in British and Australian English. The word ‘SAY’ is pronounced as /sei/ in British and American but as /sæi/ in Australian.
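The substitution, deletion and insertion classes above can be illustrated by aligning two transcriptions of the same word. The following is only a minimal sketch: the function name `phoneme_edits` is hypothetical, the alignment uses Python's standard `difflib` module as a stand-in for whatever alignment the authors used, and the transcriptions are the ‘SAY’ and ‘JOHN’ examples from the text.

```python
from difflib import SequenceMatcher


def phoneme_edits(reference, variant):
    """List the edits that turn one phoneme sequence into another.

    Both arguments are lists of phoneme symbols. Each result is a tuple
    (edit type, phonemes in reference, phonemes in variant).
    """
    edits = []
    matcher = SequenceMatcher(a=reference, b=variant, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            edits.append(("substitution", reference[i1:i2], variant[j1:j2]))
        elif tag == "delete":
            edits.append(("deletion", reference[i1:i2], []))
        elif tag == "insert":
            edits.append(("insertion", [], variant[j1:j2]))
    return edits


# Transcriptions taken from the examples in the text.
british_say = ["s", "ei"]
australian_say = ["s", "æi"]
american_john = ["ʤ", "Λ", "n"]
british_john = ["ʤ", "Ɔ", "n"]

print(phoneme_edits(british_say, australian_say))
print(phoneme_edits(american_john, british_john))
```

Both examples come out as a single vowel substitution, which matches the way the slide describes them.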
1.4 Acoustics of Accents
• Perceived acoustic differences between accents are due to differences, during the production of sound, in the configuration, positioning, tension and movement of laryngeal and supra-laryngeal articulatory parameters, namely the vocal folds, vocal tract, tongue and lips.
• Four aspects of the acoustic correlates of accents are considered essential for accent models and accent synthesis. These are:
(a) Formant (i.e. frequency of vocal tract resonance) correlates of accents, including:
(i) Formant trajectories $F_{kj}(t)$, where k is the formant index and j is the phoneme index.
(ii) Timing and magnitude of the formant target point(s) in formant space for each phonetic unit.

(b) Pitch prosody correlates of accents, including:
(i) Pitch trajectory in various linguistic contexts and positions, e.g. pitch rise at the beginning of a voiced group or phrase and pitch fall at the end of a phrase.
(ii) Pitch nucleus, i.e. the timing and magnitude of the prominent pitch event in a voiced group.
(c) Duration and timing correlates of accents, including:
(i) Duration of vowels and diphthongs.
(ii) Relative duration and timing of the two constituent vowels of diphthongs.
(d) Laryngeal (glottal) correlates of accents, i.e. the voice quality of speech segments in certain contexts as a function of accent.
2. Research Issues in Modelling Acoustics of Accents of English
• Definition of an accent ‘feature set’ composed of formant trajectories, formant target points, pitch trajectory, power trajectory and duration.
• Separation, normalisation or averaging out of speaker characteristics from accent characteristics. This is required for modelling the parameters of an accent.
• Modelling the formants of vowels and diphthongs. A diphthong is composed of two connected elementary sounds.
• Modelling the duration of vowels and diphthongs and the relative duration of the two halves of diphthongs.
• Modelling pitch trajectory in different phonetic/linguistic positions and contexts.
• Modelling voice quality correlates of an accent in different phonetic/linguistic positions and contexts.
• Integration of all accent features within a coherent generative model.
Accent Profile (AP)
Parameters | Comments | Rank
Phonetic Parameters
Substitution, insertion, deletion | Pronunciation differences obtained from phonetic transcription dictionaries | *****
Supra-laryngeal and Laryngeal Correlates
Formants & their trajectories | 2nd formant with largest variance is most sensitive to accent | ****
Glottal pulse (Voice Quality) | Durations and shapes of opening and closing of glottal folds | **
Prosody Correlates
F0 mean | Average of pitch | *
F0 range | Range of pitch | *
Pitch Nucleus | Prominent point (stressed) within an intonation group (Tone Unit) | ***
Initial Pitch Rise | First pitch slope of a narrative utterance | ***
Final Pitch Lowering | Final fall pitch slope of a narrative utterance | ***
Final Pitch Rise | Final rise pitch slope of a narrative utterance | ***
Timing and Delivery Correlates
Speaking Rate | Phonemes or words per second | *
Phoneme Duration | Vowel duration elongation and complete pronunciation all affect | ***
Excessive Co-articulation | Clipped or short duration sounds | ****
Speech Accent Feature Analysis Method
The basic processes involved in accent analysis include:
• Speech phonetic labelling and boundary segmentation using HMMs
• Pitch trajectory and pitch nucleus estimation
• Formant models and formant track estimation
• Duration and power trajectory analysis
Figure: Block diagram illustration of the processes involved in accent analysis (blocks: Input Speech; HMM Training; Labeling & Segmentation; Formants & Trajectories; Pitch Contour Tracker; Pitch Marker; Tone Nucleus Features; F0 Range/Mean; Pitch Accents; Speaking Rate & Durations; Accent Profile).
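As an illustration of the pitch-trajectory step, the sketch below estimates the fundamental frequency of one voiced frame from the peak of its autocorrelation. This is a common textbook estimator used here as a stand-in. The slides do not say which pitch tracker the authors used, and the function name `estimate_f0`, the 50 to 400 Hz search range and the synthetic test tone are all assumptions.

```python
import math


def estimate_f0(frame, fs, f0_min=50.0, f0_max=400.0):
    """Estimate F0 (Hz) of one voiced frame from its autocorrelation peak.

    frame is a list of samples, fs the sampling frequency in Hz.
    The search is restricted to lags that correspond to f0_min..f0_max.
    """
    lag_min = int(fs / f0_max)
    lag_max = min(int(fs / f0_min), len(frame) - 1)
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalised autocorrelation at this lag.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return fs / best_lag


# Synthetic 200 Hz tone, 50 ms at 8 kHz (an assumed test signal, not real speech).
fs = 8000
frame = [math.sin(2 * math.pi * 200.0 * n / fs) for n in range(400)]
f0 = estimate_f0(frame, fs)
print(round(f0, 1))
```

Running such an estimator frame by frame over the voiced segments would give a pitch trajectory from which slope, range and nucleus features could then be measured.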
Analysis of Duration Correlate of AU, US and UK Accent Speech
Figure: Comparison of speaking rates of British, Australian and American accents.
Figure: Comparison of phoneme durations of British, Australian and American accents (x-axis: vowels aa ae ah ao aw ay eh er ey ih iy ow oy uh uw; y-axis: duration in seconds, 0.02 to 0.2).
Table: Word error rate (%) of speech recognition across British, American and Australian accents.
Input accent | British Model | American Model | Australian Model
British | 12.8 | 29.3 | 34.9
American | 30.6 | 8.8 | 29.94
Australian | 33.1 | 27.3 | 7.28

Table: Comparison of speaking rates (number/sec) of British, American and Australian accents.
Accent | Phone | Word
British | 12.1 | 3.64
American | 11.6 | 3.1
Australian | 10.8 | 2.8

• The Australian speaking (word) rate is 23% slower than the British.
• The American speaking (word) rate is 15% slower than the British.
• There is an apparent correlation between automatic speech recognition accuracy and speaking rate.
• Australian, with the slowest speaking rate, obtains the best matched-model recognition results, followed by American and then British.
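The percentages and the ordering claimed above can be checked directly from the two tables. The short sketch below does only that arithmetic. The variable names are illustrative, and "matched" word error means the diagonal of the recognition table, where the input accent equals the model accent.

```python
# Word rates (words/sec) and matched-model word error rates (%) from the tables.
word_rate = {"British": 3.64, "American": 3.1, "Australian": 2.8}
matched_wer = {"British": 12.8, "American": 8.8, "Australian": 7.28}


def percent_slower(rate, reference):
    """Relative reduction of a speaking rate with respect to a reference rate."""
    return 100.0 * (reference - rate) / reference


for accent in ("Australian", "American"):
    slower = percent_slower(word_rate[accent], word_rate["British"])
    print(f"{accent}: {slower:.1f}% slower than British")

# Accents ordered from slowest to fastest, and from lowest to highest error.
by_rate = sorted(word_rate, key=word_rate.get)
by_wer = sorted(matched_wer, key=matched_wer.get)
print(by_rate, by_wer, by_rate == by_wer)
```

The two orderings agree, but with only three accents this is suggestive rather than conclusive, which is consistent with the slide's wording of an "apparent" correlation.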
Formant Estimation with 2D-HMM
Formant feature extraction, as illustrated, consists of three main functions:
(1) an LP model,
(2) a polynomial root finder, and
(3) a contour trend estimator.
Figure: LP-based formant-candidate feature extraction method (blocks: Speech; Segmentation & window; LPC Model; Polynomial roots; Frequency, Bandwidth, Intensity Calculation; ∆-estimator; Formant candidate feature vector).
Consider the z-transfer function of an LP model with K real poles, I complex pole pairs and a gain factor G:
$$H(z) = G \prod_{k=1}^{K} \frac{1}{1 - A_k z^{-1}} \prod_{i=1}^{I} \frac{1}{\left(1 - A_i e^{j 2\pi F_i / F_s} z^{-1}\right)\left(1 - A_i e^{-j 2\pi F_i / F_s} z^{-1}\right)}$$
where $A_k$ and $A_i$ are the pole radii, $F_i$ is the pole frequency and $F_s$ is the sampling frequency.
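To make the link between a complex pole pair and a formant candidate concrete, the sketch below builds the second-order denominator of one pole pair from its radius and frequency, then recovers the frequency and a bandwidth from the coefficients. The function names are hypothetical. The bandwidth relation B = -(F_s/π) ln A is a commonly used approximation and is an assumption here, since the slides only name a "Frequency, Bandwidth, Intensity Calculation" block.

```python
import cmath
import math


def pole_pair_coeffs(radius, freq_hz, fs):
    """Coefficients (a1, a2) of 1 + a1*z^-1 + a2*z^-2 for one complex pole pair.

    Expanding (1 - A e^{j theta} z^-1)(1 - A e^{-j theta} z^-1) with
    theta = 2*pi*F/Fs gives a1 = -2 A cos(theta) and a2 = A^2.
    """
    a1 = -2.0 * radius * math.cos(2.0 * math.pi * freq_hz / fs)
    a2 = radius * radius
    return a1, a2


def formant_candidate(a1, a2, fs):
    """Recover (frequency, bandwidth) in Hz from a second-order section."""
    # Roots of z^2 + a1*z + a2 = 0; keep the pole in the upper half-plane.
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    pole = (-a1 + disc) / 2.0
    if pole.imag < 0:
        pole = pole.conjugate()
    freq = cmath.phase(pole) * fs / (2.0 * math.pi)
    # Assumed bandwidth approximation: B = -(Fs / pi) * ln(pole radius).
    bandwidth = -math.log(abs(pole)) * fs / math.pi
    return freq, bandwidth


# Illustrative values only: a pole of radius 0.97 at 500 Hz, sampled at 8 kHz.
a1, a2 = pole_pair_coeffs(0.97, 500.0, 8000.0)
freq, bandwidth = formant_candidate(a1, a2, 8000.0)
print(round(freq, 1), round(bandwidth, 1))
```

A full LP analysis would yield a higher-order polynomial, in which case a general root finder replaces the closed-form quadratic used here.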
Figure: Illustration of the LP spectrum and the modelling of 6 complex pole pairs (P1 to P6) of a speech segment with an HMM composed of 4 formant-states (axes: frequency in Hz, time in s).
• 2D HMMs span time and frequency dimensions
• Left-right HMM states across frequency model formants such that the first state models
the first formant, the second state the second formant and so on
• The distribution of formants in each state is modelled by a mixture Gaussian density.
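A mixture Gaussian density for one formant state can be written as a weighted sum of Gaussian components. The sketch below evaluates such a density over frequency and checks that it integrates to approximately one. The component weights, means and standard deviations are made-up illustrative numbers, not values estimated in the study.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, stds):
    """Density of a Gaussian mixture at x; the weights are assumed to sum to 1."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds))


# Hypothetical two-component mixture for a first-formant state (Hz).
weights, means, stds = [0.6, 0.4], [500.0, 650.0], [60.0, 90.0]

# Riemann sum over 0..2000 Hz in 1 Hz steps as a sanity check on normalisation.
area = sum(mixture_pdf(f, weights, means, stds) for f in range(0, 2001))
print(round(area, 3))
```

In the 2D-HMM described above, each formant state would carry its own mixture of this kind, and a formant candidate would be scored against each state.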
Three spectrogram examples of formant tracks superimposed on LPC spectrum of speech
Figure: Comparison of histograms (thin solid line) and Gaussian HMM densities (bold dashed line) of formants of Australian English. X axis: frequency (Hz); Y axis: probability.
The figures show that the HMMs model the distribution of the formants closely.
Comparison of Formant Spaces of American, Australian and British Accents
Note the following features:
• Raising of the vowels /ae/ and /eh/ in Australian.
• Fronting of the open vowel /aa/ and the high vowel /uw/ in Australian.
• Fronting and raising of the vowel /er/ in Australian.
• The vowels /iy/, /eh/ and /ae/ are closer together in Australian.
Figure: F1 vs F2 space of British, Australian and American English.
Figure: Comparison of formant trajectories and target times of British, Australian and American accents.
Figure: Comparison of formants of Australian, British and American (female) speakers. Four panels, each plotting frequency (Hz) against the vowels AA AE AH AO EH ER IH IY OH UW UH, with frequency axes spanning 200 to 1000, 600 to 2800, 2550 to 3225 and 3500 to 4600 Hz.

Formant Ranking using a normalised distance
$$\mathrm{Rank}_i = \sum_{v=1}^{V} \left[ \frac{F^{A}_{vi} - F^{B}_{vi}}{0.5\,\left(F^{A}_{vi} + F^{B}_{vi}\right)} \right]^{2}$$
where $F^{A}_{vi}$ and $F^{B}_{vi}$ are the i-th formants of vowel v in accents A and B, and V is the number of vowels.

Formant Ranking Order
Accent Pairs | 1 | 2 | 3 | 4
British & Australian | 1st | 2nd | 4th | 3rd
British & American | 2nd | 1st | 3rd | 4th
Australian & American | 2nd | 1st | 3rd | 4th

• The 2nd formant has the widest frequency range and is the most sensitive to accent.
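The formant ranking can be sketched as follows, reading the normalised distance as a sum over vowels of the squared difference between two accents' formant values, divided by their mean. That reading of the slide formula is an assumption, the function names are hypothetical, and the formant values are made-up numbers rather than the measurements behind the figure.

```python
def formant_distance(formants_a, formants_b, formant_index):
    """Sum over vowels of the squared mean-normalised formant difference.

    formants_a and formants_b map a vowel label to a list of formant
    frequencies in Hz; formant_index is 0 for F1, 1 for F2, and so on.
    """
    total = 0.0
    for vowel in formants_a:
        fa = formants_a[vowel][formant_index]
        fb = formants_b[vowel][formant_index]
        total += ((fa - fb) / (0.5 * (fa + fb))) ** 2
    return total


def rank_formants(formants_a, formants_b, n_formants):
    """Formant numbers (1-based) ordered from most to least accent-sensitive."""
    distances = [formant_distance(formants_a, formants_b, i) for i in range(n_formants)]
    order = sorted(range(n_formants), key=lambda i: distances[i], reverse=True)
    return [i + 1 for i in order]


# Made-up F1 and F2 values (Hz) for two vowels in two hypothetical accents.
accent_a = {"aa": [700.0, 1200.0], "iy": [300.0, 2300.0]}
accent_b = {"aa": [720.0, 1400.0], "iy": [310.0, 2500.0]}
print(rank_formants(accent_a, accent_b, 2))
```

Dividing by the mean of the two formant values keeps the higher formants, which have larger absolute frequencies, from dominating the ranking simply because of their scale.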
Accent Morphing Method
Figure: Diagram of a voice morphing system used for accent conversion (blocks: Source Speech; Speech Labeling & Segmentation; HMM Training/Adaptation; Accent Model; Formant Estimation; Formant Mapping; Pitch Tracker; Prosody Modification; Accent Synthesised Speech).
• Formant Mapping: transformation of the formants of the source towards those of the target accent, based on non-uniform frequency warping of the linear prediction model.
• Prosody Modification: based on the time domain pitch synchronous overlap and add (TD-PSOLA) method.
• Prosody Modification includes pitch slope, duration and power trajectory.
• Applications: text-to-speech synthesis, broadcasting systems (e.g. accent modification in films), education software such as language teaching, and speech interfaces in mobile, call centre and other electronic products.
Formant Transformation via Non-Uniform LP Frequency Warping

Figure: Illustration of non-uniform frequency warping using the LP model frequency response. The spectrum is divided into a number of bands centred on the formants, and a different set of warping parameters is applied to each band (x-axis: frequency; y-axis: magnitude in dB; labels: bands F01, F12, F23, F34, F45; bandwidths BW1 to BW4; intensities I12, I23, I34).
Figure: Illustration of the modification of the spectrum towards the formants of the target accent (blocks: Speech; Linear Prediction Model; Polynomial roots; Pole estimation; Formant Estimation; Formant HMMs; Formant Transformation Ratios; LP Spectrum Mapping; Accent modified spectrum).
The frequency bands of the source speaker [F01, F12, F23, F34, F45] are mapped to the target accent using a set of warping ratios derived from the differences in the formants of phonetic segments of speech across accents:
$$\alpha_{i(i+1)} = \frac{f^{T}_{i+1} - f^{T}_{i}}{f^{S}_{i+1} - f^{S}_{i}}$$
where $f^{S}_{i}$ and $f^{T}_{i}$ are the i-th formants of the source and target accents, respectively.
The frequency mapping can be expressed as
$$f'_{i(i+1)} = \alpha_{i(i+1)} \, f_{i(i+1)}$$
where $f_{i(i+1)}$ is a frequency in the band between the i-th and (i+1)-th formants and the prime marks its warped value.
Figure: Illustration of warped (solid line) and original (dash-dot line) formant trajectories of /aa/ in accent conversion from Australian to British.
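The band-wise warping can be sketched as a piecewise-linear map between formant-bounded bands. The ratio follows the definition above: the width of the target band divided by the width of the source band. Anchoring each warped band at the corresponding target formant is an assumption added here so that source formants land exactly on target formants. The slide gives only the multiplicative form, and the formant values below are made up.

```python
def warping_ratios(source_formants, target_formants):
    """Warping ratio for each band between adjacent formants.

    ratio[i] = (target[i+1] - target[i]) / (source[i+1] - source[i])
    """
    ratios = []
    for i in range(len(source_formants) - 1):
        target_width = target_formants[i + 1] - target_formants[i]
        source_width = source_formants[i + 1] - source_formants[i]
        ratios.append(target_width / source_width)
    return ratios


def warp_frequency(f, source_formants, target_formants):
    """Map a source frequency lying between the first and last formant.

    Assumption: within band i the frequency is scaled by the band's ratio
    and offset so that source formant i maps onto target formant i.
    """
    ratios = warping_ratios(source_formants, target_formants)
    for i, alpha in enumerate(ratios):
        if source_formants[i] <= f <= source_formants[i + 1]:
            return target_formants[i] + alpha * (f - source_formants[i])
    raise ValueError("frequency outside the formant-bounded bands")


# Hypothetical formants (Hz) of one vowel in a source and a target accent.
source = [600.0, 1500.0, 2600.0]
target = [700.0, 1300.0, 2700.0]
print(warping_ratios(source, target))
print(warp_frequency(1050.0, source, target))
```

With this anchoring the map is continuous across band boundaries, which avoids discontinuities in the warped spectrum.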
Pitch Modification Using Time Domain PSOLA (TD-PSOLA)

• TD-PSOLA is applied to each corresponding voiced speech segment to modify the pitch slope and duration of the segment.
Figure: Illustration of the mapping of pitch periods of a source speech to a target (source speech pitch marks mapped to target speech pitch marks).
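The pitch-mark bookkeeping behind TD-PSOLA can be sketched as two steps: place target (synthesis) pitch marks with the modified period, then map each target mark to the nearest source (analysis) mark, whose windowed pitch period would be copied to that position. The sketch covers only this mapping, not the windowing and overlap-add. All function names are hypothetical, and a constant pitch-scale factor is assumed for simplicity.

```python
def nearest_index(marks, t):
    """Index of the pitch mark closest to time t (first one on ties)."""
    return min(range(len(marks)), key=lambda i: abs(marks[i] - t))


def synthesis_marks(source_marks, pitch_scale):
    """Target pitch marks covering the same time span as the source.

    The local target period is the local source period divided by
    pitch_scale, so pitch_scale > 1 raises the pitch.
    """
    marks = [float(source_marks[0])]
    end = source_marks[-1]
    while True:
        idx = min(nearest_index(source_marks, marks[-1]), len(source_marks) - 2)
        period = source_marks[idx + 1] - source_marks[idx]
        next_mark = marks[-1] + period / pitch_scale
        if next_mark > end:
            break
        marks.append(next_mark)
    return marks


def map_marks(source_marks, target_marks):
    """For each target mark, the index of the source period to copy there."""
    return [nearest_index(source_marks, t) for t in target_marks]


# Source marks every 80 samples (100 Hz at an assumed 8 kHz sampling rate).
source_marks = [0, 80, 160, 240, 320]
raised = synthesis_marks(source_marks, 2.0)
print(raised)
print(map_marks(source_marks, raised))
```

Raising the pitch by a factor of two while keeping the duration makes each source period appear roughly twice in the mapping. Lowering the pitch would instead skip some source periods.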
Examples of changes in accent/duration modulation of pitch
(a) ‘article’ in Australian; (b) Australian-accent ‘article’ transformed to British accent;
(c) ‘asked’ in Australian; (d) Australian-accent ‘asked’ transformed to British accent.
An Outline of Voice-Morph: A system for Voice and Accent Conversion
Figure: Block diagram of Voice-Morph (blocks: Source Speech; Target Speech; LP Model; Formant Tracking; Formant Trajectory; Source Speaker HMM Model; Target Speaker HMM Model; Model Estimation; Warping Factors; Formant Mapping; LPC-Spectrum Warping / Pole Rotation; Speech Reconstruction; Mapped Speech).
An example of voice conversion: American male; American female; Transformed (AM m->f).
Accent Conversion Demonstration
Audio samples are given for the source accent, the target accent and the transformed speech, for two conversions: Australian to British, and British to American.
Spoken words: ‘Article’, ‘Claim’, ‘Cooperation’, ‘Beige’, ‘Boston’, ‘Opposition’, ‘The occupied’.
The End
