
End-to-End Automatic Speech Recognition

A thesis submitted in partial fulfillment of


the requirements for the degree of

Bachelor of Technology

by

Kunal Dhawan and Kumar Priyadarshi


(Roll Nos. 150102030, 150102074)

Under the guidance of


Prof. Rohit Sinha

DEPARTMENT OF ELECTRONICS & ELECTRICAL ENGINEERING


INDIAN INSTITUTE OF TECHNOLOGY GUWAHATI
November 2018
Abstract

The goal of an Automatic Speech Recognition (ASR) system is to accurately and efficiently convert a given speech signal into a text transcription of the spoken words, independent of the recording device, the speaker's accent, or the acoustic environment. To achieve this, several models such as Dynamic Time Warping (DTW), Hidden Markov Models (HMM) and Deep Neural Networks (DNN) have been proposed over time by researchers. These conventional models give rise to very complicated systems consisting of separate modules such as acoustic, lexicon and language models. In addition, they require linguistic resources such as a pronunciation dictionary, tokenization rules and phonetic context-dependency trees. An End-to-End Automatic Speech Recognition system greatly simplifies this pipeline by replacing these modules with a single deep neural network architecture that learns linguistic knowledge directly from data. In this report, we explore various nuances of building end-to-end ASR systems, explain how and why they work, and demonstrate their performance on a popular speech corpus.

Keywords: End-to-End, HMM, CTC, attention, beam search, TIMIT

Contents

Abstract

List of Figures

Nomenclature

1 Introduction

2 Literature Survey
  2.1 Traditional ASR Methods
    2.1.1 Hidden Markov Models
    2.1.2 Deep Neural Network - Hidden Markov Model (DNN-HMM)
  2.2 End-to-End ASR Models
    2.2.1 Connectionist Temporal Classification (CTC)
    2.2.2 Sequence to Sequence Attention Mechanism
  2.3 Implementation
    2.3.1 TensorFlow

3 Baseline Model
  3.1 Encoder (Listener)
  3.2 Decoder (Speller)

4 Results

5 Conclusion and Future Work
List of Figures

2.1 Components of an ASR system using HMM
2.2 Recurrent Neural Network architecture
2.3 A Seq2Seq Model
2.4 A Seq2Seq Model with Attention
3.1 Listener network
3.2 Attend and Speller network

Nomenclature

ASR Automatic Speech Recognition

HMM Hidden Markov Model

DNN Deep Neural Network

CTC Connectionist Temporal Classification

RNN Recurrent Neural Network

LAS Listen, Attend and Spell

BLSTM Bidirectional Long Short-Term Memory

Chapter 1

Introduction

Automatic Speech Recognition aims at enabling devices to correctly and efficiently identify
spoken language and convert it into text. Some of the most important applications of speech
recognition include speech-to-text processing, audio information retrieval, keyword search and
generating streaming captions in videos. From the technology perspective, speech recognition
has a long history with several waves of major innovations. Traditional general-purpose speech recognition systems are based on Hidden Markov Models, which were later combined with Deep Neural Networks to improve various components of the speech recognition pipeline. However, such a conventional DNN-HMM ASR system is quite complicated, consisting of separate modules for acoustic, pronunciation and language modelling. Several factors make this approach sub-optimal with regard to speech recognition performance:
1) Many module-specific processes are required for the final model to work efficiently.
2) Curating a pronunciation lexicon and defining phoneme sets for a particular language require expert linguistic knowledge and are time-consuming. This is particularly difficult for so-called low-resource languages.
3) These systems make conditional independence assumptions in order to use GMM, DNN and n-gram models; real-world speech does not necessarily follow these assumptions.
4) The different modules are optimized separately with different objectives, which results in a sub-optimal model as a whole, since no module is trained to match the others. In addition, the training objectives and the final evaluation metric are very different from each other.

End-to-End Automatic Speech Recognition aims to simplify this module-based architecture by replacing the entire traditional pipeline with a single deep neural network trained in an end-to-end fashion, thereby eliminating the issues outlined above. This approach has benefited from recent advances in Deep Learning and Big Data, as evidenced by a surge in the number of academic and industrial papers employing these architectures. These deep learning methods have been widely adopted by industry for designing and deploying Automatic Speech Recognition engines.

In this report, we explore various nuances of designing end-to-end Automatic Speech Recognition systems, explain how and why they work, and report their performance on the popular TIMIT speech corpus. In particular, we build an attention-based end-to-end neural network consisting of an encoder, location-based attention and a decoder, thus replacing the traditional DNN-HMM pipeline with a single network. This model learns all the sub-components of a speech recognizer jointly and produces text transcriptions directly from the speech input in an end-to-end fashion. We train the neural network on the widely used TIMIT database with phoneme targets and achieve phoneme error rates as low as 22%. The following sections present background and current trends in end-to-end speech recognition, the model description, and the experimental setup with preliminary results.

Chapter 2

Literature Survey

2.1 Traditional ASR Methods


Historically, most speech recognition systems have been based on statistical models represent-
ing the various sounds/phones of the language to be recognized. In this section, we review some
traditional ASR methods and their submodules.

2.1.1 Hidden Markov Models

The Hidden Markov Model (HMM) is a good framework for constructing speech recognition models because speech has temporal structure and can be efficiently encoded as a sequence of spectral vectors spanning the audio frequency range [1]. A continuous speech recognition system built around the HMM has the principal components depicted in figure 2.1.

Figure 2.1: Components of an ASR system using HMM
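To make this concrete, the sketch below implements the forward algorithm of a small discrete HMM in Python, which is the quantity an HMM-based recognizer scores for each candidate word sequence. All probabilities and the observation sequence are toy values chosen purely for illustration, not parameters of any system described in this report.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Forward algorithm: P(observation sequence | HMM) for a discrete HMM.

    pi  : (N,)   initial state probabilities
    A   : (N, N) transition probabilities, A[i, j] = P(state j | state i)
    B   : (N, M) emission probabilities, B[i, k] = P(symbol k | state i)
    obs : list of observation symbol indices
    """
    alpha = pi * B[:, obs[0]]             # initialise with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]     # propagate one step and absorb the next observation
    return alpha.sum()                    # sum over all possible final states

# Toy 2-state, 3-symbol HMM (illustrative numbers only)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(forward_likelihood(pi, A, B, [0, 1, 2]))
```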

2.1.2 Deep Neural Network - Hidden Markov Model (DNN-HMM)

HMMs have traditionally been combined with Gaussian Mixture Models (GMM) for acoustic modelling in speech recognition, but Deep Neural Network - Hidden Markov Models (DNN-HMM) have shown better results and are now widely used [4]. A DNN is a feedforward artificial neural network with more than one hidden layer. Each hidden unit applies a non-linear function such as tanh to map the output of the previous layer to the input of the next layer, and the outputs of the DNN are then fed to the HMM. DNNs can exploit long acoustic context windows, or be replaced by recurrent neural networks that make use of context in speech, thereby mitigating the issues caused by conditional independence assumptions.
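As a rough illustration of the DNN half of this hybrid, the sketch below implements a small feedforward network in NumPy with tanh hidden layers that maps each acoustic feature frame to a softmax distribution over HMM states; in a DNN-HMM system these per-frame posteriors (suitably scaled) would be passed to the HMM decoder. The layer sizes and random weights are illustrative assumptions, not the configuration of any particular system.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(n_in, n_out):
    # Randomly initialised weights; a real acoustic model would be trained on aligned frames.
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# 40-dim filterbank frame -> two tanh hidden layers -> posteriors over (say) 100 HMM states
W1, b1 = dense(40, 256)
W2, b2 = dense(256, 256)
W3, b3 = dense(256, 100)

def frame_posteriors(frames):
    h = np.tanh(frames @ W1 + b1)         # first hidden layer
    h = np.tanh(h @ W2 + b2)              # second hidden layer
    return softmax(h @ W3 + b3)           # per-frame posteriors over HMM states

frames = rng.normal(size=(5, 40))         # 5 dummy feature frames
print(frame_posteriors(frames).shape)     # (5, 100): these posteriors feed the HMM
```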

2.2 End-to-End ASR Models


As mentioned earlier, the traditional methods for speech recognition require separate submodules that are independently trained and optimized. For example, the acoustic model takes acoustic features as input and predicts a set of phonemes as output. The pronunciation model maps each sequence of phonemes to the corresponding word in the dictionary. Finally, the language model assigns probabilities to word sequences and determines whether a given sequence of words is probable or not. End-to-End systems aim at learning all of these components jointly, as a single system, given a sufficient amount of data [5]. Connectionist Temporal Classification (CTC) [6] and sequence-to-sequence (seq2seq) attention [7] are the two main paradigms around which end-to-end ASR revolves. These approaches have advantages in terms of training and deployment, in addition to the modelling advantages described in the remainder of this section.

2.2.1 Connectionist Temporal Classification (CTC)

Recurrent Neural Networks (figure 2.2) are a natural choice for modelling sequential data, where we make use of context and sequential information [8]. RNNs are widely used in Language Modelling, Natural Language Processing and Machine Translation tasks. In speech recognition, RNNs are typically trained as frame classifiers, which requires a separate training target for every frame. To achieve this, we need an alignment between the speech and its text transcription, which is traditionally determined by the HMM. Connectionist Temporal Classification (CTC) is a technique that allows an RNN to be trained for sequence transcription tasks without requiring any prior alignment between the input and target sequences [6].

Figure 2.2: Recurrent Neural Network architecture

When using CTC, the output layer of the RNN contains one unit for each of the transcription labels (for example, phonemes), plus one extra unit referred to as the 'blank', which corresponds to the emission of 'nothing' - a null emission.
For a given training speech example, there are as many possible alignments as there are ways of separating the labels with blanks. For example, if we use φ to denote a null emission, the alignments (φ,c,φ,a,t) and (c,φ,a,a,a,φ,φ,t,t) correspond to the same transcription 'cat'. While decoding, we merge labels that repeat in successive time-steps and then remove the blanks. At every time-step, the network decides whether to emit a symbol or not; as a result, we obtain a distribution over all possible alignments between the input and target sequences [9]. Finally, CTC uses a forward-backward algorithm to sum over all possible alignments and determine the probability of the target sequence given the input speech sequence [6]. CTC does not fully realize the benefits of end-to-end ASR, as it assumes that the outputs at different time-steps are conditionally independent of each other. As a result, it is incapable of learning a language model on its own. However, it enables the RNN to learn the acoustic and pronunciation models jointly and omits the HMM/GMM construction step [10].
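The decoding rule described above (merge labels repeated across successive time-steps, then drop blanks) can be written in a few lines. The sketch below is only an illustration of that collapse rule, using the string "phi" for the blank symbol; it shows that both example alignments from the text map to the same transcription 'cat'.

```python
def collapse_ctc(alignment, blank="phi"):
    """Collapse a CTC alignment: merge repeated labels, then remove blanks."""
    out = []
    prev = None
    for label in alignment:
        if label != prev and label != blank:   # skip repeats and blank emissions
            out.append(label)
        prev = label                            # track the previous frame's label
    return "".join(out)

# Both alignments from the text collapse to the same transcription 'cat'
print(collapse_ctc(["phi", "c", "phi", "a", "t"]))                        # -> cat
print(collapse_ctc(["c", "phi", "a", "a", "a", "phi", "phi", "t", "t"]))  # -> cat
```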

2.2.2 Sequence to Sequence Attention Mechanism

In Sequence to Sequence learning, an RNN is trained to map an input sequence to an output sequence that is not necessarily of the same length. It has been used in sequence transduction tasks such as Neural Machine Translation, Image Captioning and Question Answering. Since both speech and its transcription are sequential in nature, seq2seq is a viable proposition for speech recognition as well. In this paradigm, the seq2seq model (figure 2.3) learns to map between variable-length input and output sequences using an encoder RNN that compresses the variable-length input into a fixed-length context vector representation. This vector representation is then used by a decoder to produce the variable-length output sequence.

Figure 2.3: A Seq2Seq Model
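To illustrate the structure just described (and shown in figure 2.3), the sketch below builds a bare-bones seq2seq model in NumPy with simple tanh recurrences: the encoder compresses the input frames into a single fixed-length context vector, and the decoder generates an output sequence from that vector alone. The dimensions, the greedy decoding loop and the random (untrained) weights are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D_out, H = 13, 30, 32      # illustrative feature, vocabulary, and hidden sizes

# Simple tanh RNN cells for encoder and decoder (random weights, untrained)
We_x, We_h = rng.normal(scale=0.1, size=(D_in, H)), rng.normal(scale=0.1, size=(H, H))
Wd_x, Wd_h = rng.normal(scale=0.1, size=(D_out, H)), rng.normal(scale=0.1, size=(H, H))
W_out = rng.normal(scale=0.1, size=(H, D_out))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def seq2seq_decode(inputs, max_len=5, sos=0):
    # Encoder: compress the variable-length input into one fixed-length context vector
    h = np.zeros(H)
    for x in inputs:
        h = np.tanh(x @ We_x + h @ We_h)
    context = h

    # Decoder: generate the output sequence conditioned only on that context vector
    y, s, outputs = np.eye(D_out)[sos], context, []
    for _ in range(max_len):
        s = np.tanh(y @ Wd_x + s @ Wd_h)
        probs = softmax(s @ W_out)
        idx = int(probs.argmax())        # greedy choice of the next output symbol
        outputs.append(idx)
        y = np.eye(D_out)[idx]           # feed the prediction back as the next input
    return outputs

print(seq2seq_decode(rng.normal(size=(20, D_in))))   # e.g. 20 frames of 13-dim features
```

The single context vector is the bottleneck that the attention mechanism described next removes.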

An attention mechanism can be used to produce a sequence of vectors from the encoder RNN, one for each time step of the input sequence, and the decoder learns to pay selective attention to these vectors when producing the output at each time step. Attention is a dominant technique in Computer Vision for visual recognition tasks, originally inspired by human visual attention, in which the neural network model focuses on a specific region of the image in 'high resolution' while perceiving the rest of the image in 'low resolution'. It has been successfully applied to object recognition and image captioning tasks in the domain of Computer Vision [11].

Figure 2.4: A Seq2Seq Model with Attention

Sequence to Sequence models have been shown to improve with the use of an attention mechanism (figure 2.4) that allows the decoder to focus on selected inputs when producing the output [7]. At each output time-step, the last hidden state of the decoder RNN is used to generate an attention vector over the entire input sequence of the encoder. The attention vector is used to propagate information from the encoder to the decoder at every time step, instead of just once, as in the original sequence to sequence model [12].
The attention mechanism differs from CTC in that it does not make any conditional independence assumptions. As a result, it can learn the language model jointly with the acoustic and pronunciation models in a single network.

2.3 Implementation

2.3.1 TensorFlow

We used TensorFlow, the Deep Learning toolbox from Google, to implement the 'Listen, Attend and Spell' architecture. According to the TensorFlow authors [2]:
TensorFlow is an interface for expressing machine learning algorithms, and an implementation for executing such algorithms.

In TensorFlow, the operations supplied by the user are assembled into a directed graph that defines a computational model. The graph represents how data, i.e. tensors, flow through the model, hence the name TensorFlow. Each node of the graph represents an operation and has zero or more inputs and zero or more outputs; tensors flow along the edges of the graph. Graphs are executed in sessions. When running a graph, the desired output variables or operations must be specified. TensorFlow then checks whether all inputs required to compute the requested output values or perform the desired operation are present, and afterwards it evaluates only those operations needed to produce the requested outputs.
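The fragment below is a minimal sketch of this graph-and-session workflow, written against the TensorFlow 1.x API that was current at the time of this report (modern TensorFlow 2.x uses eager execution instead); the small softmax layer is purely illustrative and is not part of the LAS model.

```python
import tensorflow as tf  # TensorFlow 1.x graph/session API assumed

# Building the graph: nodes are operations, tensors flow along the edges.
x = tf.placeholder(tf.float32, shape=[None, 3], name="x")     # input tensor
W = tf.Variable(tf.random_normal([3, 2]), name="W")           # trainable weights
b = tf.Variable(tf.zeros([2]), name="b")
y = tf.nn.softmax(tf.matmul(x, W) + b, name="y")              # requested output operation

# Graphs are executed in a session; only the operations needed to
# produce the requested output (here, y) are evaluated.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(y, feed_dict={x: [[1.0, 2.0, 3.0]]}))
```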

Chapter 3

Baseline Model

For a strong baseline, we chose the Listen, Attend and Spell (LAS) architecture by Chan et al. [3], as it has been a popular approach in the end-to-end speech recognition community over the last few years. According to the authors, LAS is a neural network that learns to transcribe speech utterances to character sequences without making any independence assumptions between the characters. The system combines a seq2seq encoder-decoder architecture with an attention mechanism. Being an end-to-end model, LAS learns all the components of a speech recognition system jointly. It consists of an encoder recurrent neural network (RNN), called the listener, and a decoder RNN, called the speller. The listener network is a pyramidal RNN that converts low-level speech signals into a higher-level feature representation, and the speller converts this representation into output symbols using the attention mechanism. The listener (encoder) and the speller (decoder) are trained together, making it a complete end-to-end system. We describe the baseline architecture and its working in the following subsections.

3.1 Encoder (Listener)


The listener network (figure 3.1) uses Bidirectional Long Short-Term Memory (BLSTM) layers [13] arranged in a pyramidal structure to reduce the length of the encoded sequence. The pyramidal structure is needed because an utterance can be hundreds to thousands of frames long, which would otherwise lead to slow convergence. The pyramidal BLSTM reduces the time resolution by a factor of 2 in every layer, which lowers the computational complexity and helps the attention mechanism converge faster.
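The sketch below shows only the time-reduction step between pyramidal layers, assuming NumPy arrays of BLSTM outputs: consecutive pairs of frames are concatenated before being passed to the next layer, so each level halves the time resolution. The BLSTM layers themselves are omitted, and the frame and feature dimensions are illustrative.

```python
import numpy as np

def pyramid_reduce(h):
    """Concatenate consecutive pairs of frames, halving the time dimension.

    h : (T, D) outputs of one BLSTM layer
    returns : (T // 2, 2 * D) inputs for the next pyramidal layer
    """
    T, D = h.shape
    if T % 2 == 1:                    # drop an odd trailing frame for simplicity
        h = h[:-1]
    return h.reshape(-1, 2 * D)

h = np.random.randn(1000, 256)        # e.g. 1000 frames of 256-dim BLSTM outputs
for _ in range(3):                    # three pyramidal levels -> 8x shorter sequence
    h = pyramid_reduce(h)
print(h.shape)                        # (125, 2048)
```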

Figure 3.1: Listener network

3.2 Decoder (Speller)


The Attend and Spell network (figure 3.2) uses an attention-based LSTM transducer [14]. For
every output time-step, this network produces a probability distribution over the next character
conditioned on all the characters emitted previously.

Figure 3.2: Attend and Speller network

Let the decoder state at time step $i$ be $s_i$, which is a function of the previous state $s_{i-1}$, the previously emitted character $y_{i-1}$ and the context $c_{i-1}$. The context vector $c_i$ is produced by an attention mechanism:

$$c_i = \mathrm{AttentionContext}(s_i, \mathbf{h})$$

$$s_i = \mathrm{RNN}(s_{i-1}, y_{i-1}, c_{i-1})$$

$$P(y_i \mid x, y_{<i}) = \mathrm{CharacterDistribution}(s_i, c_i)$$

where CharacterDistribution is an MLP with softmax outputs over characters and RNN is a 2-layer LSTM. At each time step $i$, the AttentionContext function computes a scalar energy $e_{i,u}$ for each encoder time step $u$, using the listener feature vector $h_u$ and the decoder state $s_i$. The scalar energies $e_{i,u}$ are converted into a probability distribution $\alpha_i$ over time steps using a softmax function, which is then used to create the context vector $c_i$ by linearly combining the listener features $h_u$ at different time steps:

$$e_{i,u} = \langle \phi(s_i), \psi(h_u) \rangle$$

$$\alpha_{i,u} = \frac{\exp(e_{i,u})}{\sum_{u'} \exp(e_{i,u'})}$$

$$c_i = \sum_{u} \alpha_{i,u}\, h_u$$
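The sketch below works through the AttentionContext computation in NumPy, with single linear projections standing in for the φ and ψ networks; the dimensions and random weights are illustrative assumptions rather than the configuration used in our experiments.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_i, h, W_phi, W_psi):
    """Dot-product attention following the equations above.

    s_i : (Ds,)    current decoder state
    h   : (U, Dh)  listener features h_u for all encoder time steps u
    """
    e = (h @ W_psi) @ (W_phi @ s_i)   # scalar energies e_{i,u} = <phi(s_i), psi(h_u)>
    alpha = softmax(e)                # attention weights over encoder time steps
    c_i = alpha @ h                   # context vector: weighted sum of listener features
    return c_i, alpha

# Toy dimensions (illustrative only)
Ds, Dh, Da, U = 64, 128, 32, 50
rng = np.random.default_rng(0)
W_phi = rng.normal(size=(Da, Ds))     # stands in for the phi network
W_psi = rng.normal(size=(Dh, Da))     # stands in for the psi network
c_i, alpha = attention_context(rng.normal(size=Ds), rng.normal(size=(U, Dh)), W_phi, W_psi)
print(c_i.shape, alpha.shape)         # (128,) (50,)
```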

Chapter 4

Results

Chapter 5

Conclusion and Future Work

Bibliography

[1] Mark Gales and Steve Young, "The application of hidden Markov models in speech recognition," Foundations and Trends in Signal Processing, 1(3), pp. 195-304, 2008.

[2] Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org.

[3] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals, "Listen, attend and spell," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.

[4] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, 29(6), pp. 82-97, 2012.

[5] Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio, "End-to-end attention-based large vocabulary speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4945-4949, 2016.

[6] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proceedings of the 23rd International Conference on Machine Learning, pp. 369-376. ACM, 2006.

[7] Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio, "Attention-based models for speech recognition," in Advances in Neural Information Processing Systems, pp. 577-585, 2015.

[8] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, "Speech recognition with deep recurrent neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645-6649, 2013.

[9] Alex Graves and Navdeep Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1764-1772, 2014.

[10] Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R. Hershey, and Tomoki Hayashi, "Hybrid CTC/Attention Architecture for End-to-End Speech Recognition," IEEE Journal of Selected Topics in Signal Processing, 11(8), Dec. 2017.

[11] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention," in Proceedings of the 32nd International Conference on Machine Learning, PMLR 37, pp. 2048-2057, 2015.

[12] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate," in International Conference on Learning Representations (ICLR), 2015.

[13] Sepp Hochreiter and Jürgen Schmidhuber, "Long Short-Term Memory," Neural Computation, 9(8), pp. 1735-1780, November 1997.

[14] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to Sequence Learning with Neural Networks," in Advances in Neural Information Processing Systems, 2014.

