
ISSN: 2454-2415, Vol. 7, Special Issue 1, 2019

Survey on Speech Imitation Using Machine Learning

Rahul Kumar, Jaybrata Chakraborty and Bappaditya Chakraborty


Department of Computational Science, Brainware University, Kolkata
{kumar.rahul0218,jaybrata1411,bappa.chakraborty84}@gmail.com

Abstract

Listening to speech (Hark), spell checking, and imitating the sentences is a new challenge in the domain of machine learning, specifically in generative models. Many researchers have used the Hidden Markov Model (HMM) and Connectionist Temporal Classification (CTC) extensively to achieve this. Usually, the sound wave at the hark end passes through a filter bank, an array of band-pass filters that separates the input signal into multiple components, each carrying a single frequency sub-band of the original signal. On the speller and imitator end, the encoder-decoder architecture built on recurrent neural networks (RNNs, a powerful model for sequential data) has become an effective and standard approach for both neural machine translation and sequence-to-sequence prediction. As observed, recurrent neural network architectures are highly capable of handling uneven sequences of data and have thus become a powerful tool for sequential data processing. However, a systematic analysis of the state of the art in such research is a need of the time. With this motivation, we present a survey of recent trends and progress in speech imitation and related work, available datasets, and limitations.

Keywords: Recurrent Neural Networks, Speech Recognition, Neural Attention
I. INTRODUCTION

On examining various research papers, we found that many researchers state that the RNN encoder-decoder model is the sole reflection of the HSI (Hark, Spell and Imitate) machine. This model forms the basis of the speech recognition machine. The main target of the HSI machine is for sentences to be processed and then spelled with imitation. Furthermore, to implement the basic idea of a spelling and imitating machine, many researchers use a combination of an HMM and a neural network architecture. The DNN-HMM, CTC, and RNN encoder-decoder models have given researchers promising results for the HSI machine we are looking for. The DNN-HMM hybrid has led to many advancements in various fields of speech recognition, and this hybrid form of the model has proven useful for developing an imitation machine that portrays word sequences. For modelling the HSI system, a popular choice is to work with an RNN encoder-decoder in order to improve speech recognition accuracy. Each of the models in Hark, Spell and Imitate has been trained separately, with a different motive. The two main approaches for this are CTC (connectionist temporal classification) and the sequence-to-sequence model with attention, as stated in [1]. CTC assumes that the output labels are conditionally independent of each other, whereas sequence-to-sequence models had so far only been applied to phoneme sequences [2] [3]. In this survey paper we examined recent research papers on Harking, Spelling and Imitating an acoustic signal, and we observed that in most cases the basic idea is that a neural network transcribes an acoustic signal into a word sequence, one character at a time, and finally imitates the whole spoken utterance at once. We also surveyed earlier assumptions of the model's complete dependency on HMMs. The approach is based on sequence-to-sequence learning with attention [2] [3] [1] [4] [5]. It consists of an encoder neural network (RNN) known as Hark, a decoder RNN known as Spell, and a CLDNN-HMM known as Imitate, together with a filter. The two factorizations are sketched below.
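To make the contrast concrete, the two factorizations can be written as follows (a standard formulation supplied here for illustration, not reproduced from the surveyed papers), where x is the acoustic input, y the output label sequence, T the number of frames, and B the CTC mapping that collapses a frame-level alignment a onto y:

% CTC: per-frame labels are conditionally independent given the acoustics x.
P_{\mathrm{CTC}}(y \mid x) = \sum_{a \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} P(a_t \mid x)

% Attention-based sequence to sequence: each character is conditioned on the
% full acoustic signal and on all previously emitted characters.
P_{\mathrm{s2s}}(y \mid x) = \prod_{i=1}^{|y|} P(y_i \mid x, y_{<i})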
II. REVIEW ON CLASSIFIERS

We observe that the choice of a proper classifier is important for building an HSI model. Three types of classifier are popular for this type of work: the Hidden Markov Model, the RNN encoder-decoder, and the Deep Neural Network. A brief description of each classifier is given below.
A. Hidden Markov Model

The HMM is a pervasive tool for modelling time-series data. An HMM is composed of numerous hidden state variables, a multi-scale representation, and a mixture of discrete and continuous variables. The model is an abstract form of sequence emission: hidden states evolve step by step, and each state emits an observation. We analyse this abstraction in detail by examining how such sequences are generated and emitted. The working of this speech recognizer model is explained with the help of an algorithm stated by the researchers in [6]. A generation sketch follows below.
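As a minimal illustration of this generate-and-emit view (a toy example with invented probabilities, not taken from [6]), the following Python sketch samples a hidden state path and its emitted observation sequence:

# A minimal sketch of how an HMM "generates and emits" a sequence: hidden
# states evolve by a transition matrix, and each state emits an observable
# symbol from its own distribution. All probabilities here are made up.
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.6, 0.4])                # initial state distribution
A  = np.array([[0.7, 0.3],               # state transition probabilities
               [0.2, 0.8]])
B  = np.array([[0.5, 0.4, 0.1],          # per-state emission probabilities
               [0.1, 0.3, 0.6]])         # over 3 discrete observation symbols

def generate(T):
    """Sample a hidden state path and its emitted observation sequence."""
    states, obs = [], []
    s = rng.choice(2, p=pi)
    for _ in range(T):
        states.append(s)
        obs.append(rng.choice(3, p=B[s]))
        s = rng.choice(2, p=A[s])
    return states, obs

print(generate(10))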


B. RNN Encoder-Decoder

This architecture is very new, having been pioneered only in 2014, and is the core technology behind the Google translator. The LSTM encoder-decoder is a recurrent neural network that works on sequence-to-sequence problems. How does this RNN encoder-decoder work? The LSTM network consists of 8 encoder and 8 decoder layers using attention and residual connections. To increase parallelism and decrease training time, the researchers connected the bottom layer of the decoder to the top layer of the encoder. Complex sentences are handled by dividing them into sub-sentences for both input and output. The beam search technique then employs a length normalization procedure together with a coverage penalty, which favours output that covers all the words from the source sentence. This technique thereby reduces error by an average of 60%. A sketch of this hypothesis scoring is given below.
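The following Python sketch shows one common form of this hypothesis scoring, with length normalization and a coverage penalty; the constants (alpha, beta, the 5-offset) are typical values assumed for illustration rather than taken from the surveyed papers:

# Re-rank a beam hypothesis by its log-probability divided by a
# length-normalization term, plus a coverage penalty that rewards
# attending to every source position at least once.
import math

def score(log_prob, hyp_len, attn, alpha=0.6, beta=0.2):
    """attn[i][j]: attention weight on source position j at output step i."""
    length_norm = ((5.0 + hyp_len) ** alpha) / ((5.0 + 1.0) ** alpha)
    n_src = len(attn[0])
    coverage = sum(
        math.log(min(sum(attn[i][j] for i in range(len(attn))), 1.0))
        for j in range(n_src)
    )
    return log_prob / length_norm + beta * coverage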
C. Deep Neural Network

A deep neural network (DNN) is a feed-forward neural network with many hidden layers between its input and output. Having many hidden layers makes the network highly flexible, with a huge number of parameters, and this makes it capable of fitting more complex data. Many researchers state that this quality helps produce high-quality acoustic output modelling. A prominent example is DNN phone recognition on the TIMIT database, where the input is the acoustic signal of an utterance and the output is the spoken phonetic sequence; such complexity is handled without distortion. The first successful use of acoustic models based on DBN-DNNs for a large vocabulary task used data collected from the Bing mobile voice search application. Researchers have thus worked with this model to successfully process convoluted datasets without losing their distinctiveness. A sketch of such an acoustic DNN appears below.
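A minimal PyTorch sketch of such a feed-forward acoustic model is given below; the layer sizes, the 11-frame input window and the 61-phone output are illustrative assumptions in the spirit of TIMIT-style systems:

# A feed-forward acoustic model: a stack of hidden layers mapping a window
# of filter-bank frames to per-frame phone posteriors.
import torch
import torch.nn as nn

class AcousticDNN(nn.Module):
    def __init__(self, n_in=40 * 11, n_hidden=2048, n_layers=4, n_phones=61):
        super().__init__()
        layers, d = [], n_in
        for _ in range(n_layers):
            layers += [nn.Linear(d, n_hidden), nn.ReLU()]
            d = n_hidden
        layers.append(nn.Linear(d, n_phones))    # phone logits per frame
        self.net = nn.Sequential(*layers)

    def forward(self, x):                         # x: (batch, 11 frames * 40 dims)
        return self.net(x)

model = AcousticDNN()
posteriors = torch.softmax(model(torch.randn(8, 440)), dim=-1)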
III. REVIEW ON FEATURE EXTRACTION

The proper choice of a feature extraction technique for a speech signal is always a challenging part of building a model. It is observed that a sliding-window technique such as the mel frequency cepstral coefficient (MFCC) is advantageous for extracting temporal features from a speech signal. This type of model consists of filter banks, which are responsible for extracting frequency-oriented features from small time frames across a long sequence of speech. To apply this sliding-window feature extraction, a speech signal is sliced into small time frames and a set of features is extracted from each frame. Filter banks represent a group of signal processing techniques that decompose signals into frequency sub-bands. This decomposition is useful because frequency-domain processing (also called sub-band processing) has advantages over time-domain processing. A filter bank comprises low-pass, band-pass, and high-pass filters used for the spectral decomposition and recombination of signals. The decomposition performed by the filter bank is called analysis (meaning analysis of the signal in terms of its components in each sub-band); the output of analysis is referred to as a sub-band signal, with as many sub-bands as there are filters in the filter bank. The restoration process is called synthesis, meaning restoring the complete signal resulting from the filtering process. Another significant practical use of filter banks in this model is signal compression, when some frequencies are more important than others. After decomposition, the important frequencies can be coded with fine resolution: small differences at these frequencies are significant, and a coding scheme that preserves these differences must be used. On the other hand, less important frequencies do not have to be exact. Sound quality is enhanced by a graphic equalizer, whose purpose is to improve an instrument's sound or to make certain instruments and sounds more prominent, as elucidated by the researchers. A sketch of the sliding-window filter-bank analysis is given below.


IV. REVIEW ON HSI MODELS

In this section we briefly describe the HSI (Hark, Spell and Imitate) model, dealing with its working, its classification, and its component models. Various researchers assume a set of acoustic signals A = (a1, ..., an) with a frequency range between 20 Hz and 20,000 Hz (the normal human hearing range) passed through filter bank spectra. The output sequence is a set B = (<sos>, b1, ..., bn, <eos>), with bi ∈ {a, ..., z, 0, ..., 9, <space>, <comma>, <period>, <apostrophe>, <unk>} [7], i.e., the processed input acoustic signal. Here <sos>, <eos> and <unk> are the special start-of-sentence, end-of-sentence, and unknown tokens, respectively. The HSI model consists of three components, namely Hark, Spell and Imitate. Hark is an acoustic encoder model that performs the operation called hark: the Hark end translates the original acoustic signal A into a high-level representation H = (h1, ..., hL) with L <= n. The speller and imitator form an attentive character decoder that performs the operations called spell and imitate.

The Hark, Spell and Imitate (HSI) model is thus a pyramidal BLSTM encoding the input sequence A into high-level features h, while the speller is an attention-based decoder generating the B characters from h. The idea is described in Fig. 1 and Fig. 2.

Fig. 1. Hark, Spell and Imitate (HSI) Model.

The Hark end uses a bidirectional Long Short-Term Memory RNN (BLSTM) [8][9][10] in a cone-shaped framework. It is shaped as a conic (pyramidal) structure in order to decrease the length L of H relative to n, the length of the input a, because the input acoustic signal can be very long (a sketch of this reduction is given below).

Fig. 2. A long input sequence a is encoded with the pyramidal BLSTM Hark into a shorter sequence h.
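A PyTorch sketch of the pyramidal reduction follows (our reconstruction; the layer sizes follow the figures quoted later in this section): each layer concatenates adjacent frame pairs before its BLSTM, so three layers shorten the sequence by 2^3 = 8 times.

# Pyramidal BLSTM: every layer halves the time resolution by merging
# adjacent frame pairs before running a bidirectional LSTM over them.
import torch
import torch.nn as nn

class PyramidalBLSTM(nn.Module):
    def __init__(self, n_in=40, n_hidden=256, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        d = n_in
        for _ in range(n_layers):
            self.layers.append(nn.LSTM(2 * d, n_hidden, bidirectional=True,
                                       batch_first=True))
            d = 2 * n_hidden                       # BLSTM output size

    def forward(self, x):                          # x: (batch, T, features)
        for lstm in self.layers:
            b, t, f = x.shape
            x = x[:, : t - (t % 2)]                # drop an odd trailing frame
            x = x.reshape(b, t // 2, 2 * f)        # merge adjacent frame pairs
            x, _ = lstm(x)
        return x                                   # (batch, T // 8, 512)

h = PyramidalBLSTM()(torch.randn(4, 800, 40))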
The Hark end uses the HMM as a pervasive tool for working on time-series data. The model, being composed of numerous hidden state variables, a multi-scale representation, and a mixture of discrete and continuous variables, is an abstract form of sequence emission. This abstraction is analysed by modelling how the sequence is generated and emitted. The system analyses the particular acoustic signal from the hark end and then fine-tunes the hark's input, resulting in increased accuracy. However, the HMM speech recognition approach is reported to be carried out 49% through a CTC-trained LSTM [11]. In the context of HMMs, Soviet researchers invented the dynamic time warping (DTW) algorithm and used it to recognize sentences or speech over a 200-word vocabulary [12]. This algorithm refines the acoustic signal by breaking it into short frames, and this breakdown into frames is reported as the best technique for handling complex speech, with 49% accuracy. A sketch of DTW is given below.
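A minimal Python sketch of the DTW idea follows (a textbook formulation, not the exact algorithm of [12]): two feature sequences are aligned by minimizing the cumulative frame-to-frame distance, so the same word spoken at different speeds can still be matched against a template from a small fixed vocabulary.

# Dynamic time warping via dynamic programming over a cost matrix.
import numpy as np

def dtw(x, y):
    """x: (n, d) and y: (m, d) feature sequences; returns alignment cost."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]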
We found that the researchers trained on the given input set of sequences to increase the probability of producing the correct output sequences.

After the acoustic signal has been received and processed by the HMM model, the next model that comes into action is the LSTM encoder-decoder, a recurrent neural network (RNN) that works on sequence-to-sequence problems. Researchers used a simple left-to-right beam search technique similar to [4], with the LSTM network consisting of 8 encoder and 8 decoder layers. Complex sentences are handled by dividing them into sub-sequences for both input and output. The beam search technique then employs a length normalization procedure with a coverage penalty, which favours output that covers all the words from the source sentence, thereby reducing error by an average of 60%.
The spell function is computed using attention-based LSTM transducers [13] [14]. At each time step, the attention mechanism generates a context vector ci encapsulating the information in the acoustic signal needed to produce the next character. The attention model is content based: the contents of the decoder state si are matched against the encoder contents to produce the attention vector. The decoder thereby dictates each character of the output sequence one by one, consistent with the input character sequence; this is how the model harks, attends, and finally spells, in contrast to the state-of-the-art DNN-HMM system. For the hark function, the researchers used three layers of 512-node pBLSTMs (that is, 256 nodes per direction) on top of a BLSTM that operates on the input sequence, thereby reducing the time resolution by a factor of 2^3 = 8. The spell function uses a 2-layer LSTM with 512 nodes each. A sketch of the content-based attention step appears below.
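The following PyTorch sketch shows one standard content-based attention step of this kind (the MLP energy function and the dimensions are assumptions for illustration): the decoder state si is matched against every encoder feature, and the context vector ci is the weighted sum of the acoustic features.

# Content-based attention: score each encoder feature against the decoder
# state, normalize the scores, and form the context as a weighted sum.
import torch
import torch.nn as nn

class ContentAttention(nn.Module):
    def __init__(self, dec_dim=512, enc_dim=512, att_dim=256):
        super().__init__()
        self.phi = nn.Linear(dec_dim, att_dim)    # transform decoder state
        self.psi = nn.Linear(enc_dim, att_dim)    # transform encoder features
        self.v = nn.Linear(att_dim, 1)

    def forward(self, s_i, h):                     # s_i: (B, dec), h: (B, U, enc)
        e = self.v(torch.tanh(self.phi(s_i).unsqueeze(1) + self.psi(h)))
        alpha = torch.softmax(e.squeeze(-1), dim=-1)        # (B, U) weights
        c_i = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)   # (B, enc) context
        return c_i, alpha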
Asynchronous Stochastic Gradient Descent (ASGD) was used to train the model [15].


A learning rate of 0.2 was used with a geometric decay of 0.98 per 3M utterances (i.e., 1/20th of an epoch). They used the DistBelief framework [15] with 32 replicas, each with a mini-batch of 32 utterances. To further speed up training, the sequences were grouped into buckets based on their frame length [4]. The model was trained until the results on the validation set stopped improving, which took approximately two weeks. The model was decoded using N-best list decoding with a beam size of N = 32. A sketch of the decay schedule is given below.
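A small Python sketch of this schedule (assuming a stepwise decay; the exact interpolation between steps is not spelled out in the surveyed papers):

# Base learning rate of 0.2, decayed geometrically by 0.98 for every
# 3M utterances processed, as the quoted figures describe.
def learning_rate(utterances_seen, base=0.2, decay=0.98, period=3_000_000):
    return base * decay ** (utterances_seen // period)

assert learning_rate(0) == 0.2
print(learning_rate(30_000_000))   # after 10 decay periods: 0.2 * 0.98**10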
After the LSTM model has processed the spell step, the filter bank comes into action to provide high-quality sound imitation, or emission. This enhancement of sound quality is done by a graphic equalizer. It is highly necessary to restore the original input sequence, which is done by the process called synthesis, meaning restoring the complete signal from the filtering process (as mentioned in all the research papers we surveyed). Then the graphic equalizer comes into action: its purpose is to equalize the sound signal in order to improve the machine's sound quality and make the sound more prominent, hence imitating whatever has been uttered with the best quality.
V. FUTURE SCOPE

The attention-based sequence-to-sequence model is another end-to-end model. It has its roots in the successful attention models in machine learning, which extend the encoder-decoder framework with an RNN decoder. A windowing method can be used to reduce the number of candidates considered by the attention decoder. A pyramidal structure is used in the encoder network so that only L high-level hidden vectors are generated instead of T hidden vectors from all the input time steps. Due to the high complexity and slow speed of training, the majority of attention-based work has been done at Google, compared with the CTC work reported by many other groups. Recently, however, Google significantly advanced the research on attention-based models with the HSI-style model.
VI. CONCLUSION

As experimented by [7], a dataset of three million Google voice search utterances (representing 2000 hours of data) was used for the experiment, and approximately 10 hours of utterances were randomly selected as a held-out set. Data augmentation was performed using a room simulator, adding different types of noise and reverberation; the noise sources were obtained from YouTube and from environmental recordings of daily events [16]. This increased the amount of audio data 20 times, with signal-to-noise ratios (SNR) between 5 dB and 30 dB [16]. They used 40-dimensional log-mel filter bank features, computed every 10 ms, as the acoustic inputs to the hark. A separate set of 22K utterances, representing approximately 16 hours of data, was used as the test set. A noisy test set was also created using the same strategy that was applied to the training data. The transcripts were normalized by converting each character to lower-case English alphanumerics (including digits). The punctuation marks space, comma, period and apostrophe were kept, while all other tokens were converted to the unknown <unk> token, and every utterance was padded with the start-of-sentence <sos> and end-of-sentence <eos> tokens, as sketched below.
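A Python sketch of this normalization (our illustration of the stated rules):

# Lower-case the text, keep letters, digits, space, comma, period and
# apostrophe, map every other symbol to <unk>, and pad with <sos>/<eos>.
def normalize(transcript):
    kept = set("abcdefghijklmnopqrstuvwxyz0123456789 ,.'")
    tokens = [ch if ch in kept else "<unk>" for ch in transcript.lower()]
    return ["<sos>"] + tokens + ["<eos>"]

print(normalize("Hark & spell!"))
# ['<sos>', 'h', 'a', 'r', 'k', ' ', '<unk>', ' ', 's', 'p', 'e', 'l', 'l', '<unk>', '<eos>']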
We have surveyed the HSI model, Hark, Spell and Imitate: a neural speech recognizer that converts speech to a character-by-character transcription while imitating whatever has been uttered. This goal is accomplished by using various models: an HMM as a speech recognizer model performing the hark operation; DNN-HMM, CTC, and RNN encoder-decoder models for decomposing complex sequences into sub-frames for better output; and filter banks for better-quality sound production. We are optimistic that this kind of model will pave the way for better speech recognition systems. The state-of-the-art model on this dataset is a CLDNN-HMM system described in [16], which achieves a WER of 8.0% on the clean set and 8.9% on the noisy set.
REFERENCES
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and
translate. arXiv preprint arXiv:1409.0473, 2014.
[2] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. End-to-end continuous speech recognition
using attention-based recurrent nn: First results. arXiv preprint arXiv:1412.1602, 2014.
[3] Dzmitry Bahdanau, Dmitriy Serdyuk, Philémon Brakel, Nan Rosemary Ke, Jan Chorowski, Aaron Courville, and
Yoshua Bengio. Task loss estimation for sequence prediction. arXiv preprint arXiv:1511.06456, 2015.
[4] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in
neural information processing systems, pages 3104–3112, 2014.


[5] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv
preprint arXiv:1406.1078, 2014.
[6] Tin Lay Nwe, Say Wei Foo, and Liyanage C De Silva. Speech emotion recognition using hidden markov models. Speech
communication, 41(4):603–623, 2003.
[7] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, attend and spell: A neural network for large
vocabulary conversational speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE
International Conference on, pages 4960–4964. IEEE, 2016.
[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[9] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid speech recognition with deep bidirectional lstm. In
Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 273–278. IEEE, 2013.
[10] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In
International Conference on Machine Learning, pages 1764–1772, 2014.
[11] Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays, and Johan Schalkwyk. Google voice search: faster
and more accurate. Google Research blog, 2015.
[12] Jacob Benesty, M Mohan Sondhi, and Yiteng Huang. Springer handbook of speech processing. Springer, 2007.
[13] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua
Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on
Machine Learning, pages 2048–2057, 2015.
[14] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models
for speech recognition. In Advances in neural information processing systems, pages 577–585, 2015.
[15] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke
Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems,
pages 1223–1231, 2012.
[16] Tara N Sainath, Oriol Vinyals, Andrew Senior, and Haşim Sak. Convolutional, long short-term memory, fully
connected deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International
Conference on, pages 4580–4584. IEEE, 2015.

