International Journal of Innovative Knowledge Concepts, Vol. 7, Special Issue 1, 2019
ISSN: 2454-2415
a recurrent neural network that operates on sequence-to-sequence problems. The question then arises: how does this RNN encoder-decoder work? The LSTM network consists of 8 encoder and 8 decoder layers, using attention and residual connections. To increase parallelism and decrease training time, the surveyed authors connected the bottom layer of the decoder to the top layer of the encoder. Complex sentences are handled by dividing them into sub-sentences for both input and output. The beam-search technique then employs a length-normalization procedure with a coverage penalty, which favours output that covers all the words of the source sentence. This technique reduces errors by an average of 60%.

C. Deep Neural Network

A deep neural network (DNN) is a feed-forward neural network with many hidden layers between its input and output. Having many hidden layers makes the network highly flexible, with a huge number of parameters, and therefore capable of modelling more complex data. Many researchers have reported that this quality yields high-quality acoustic modelling. One notable example is DNN phone recognition on the TIMIT database, where the input is the acoustic signal of an utterance and the output is the spoken phonetic sequence; such complexity is handled without distortion. The first successful use of DBN-DNN acoustic models for a large-vocabulary task used data collected from the Bing mobile voice search application. Researchers have thus used this model to successfully process convoluted data sets without losing their distinctiveness.

III. REVIEW ON FEATURE EXTRACTION

The proper choice of feature extraction technique for a speech signal is always a challenging part of building a model. It is observed that a sliding-window technique such as mel frequency cepstral coefficients (MFCC) is advantageous for extracting temporal features from a speech signal. This type of model uses filter banks, which are responsible for extracting frequency-oriented features from small time frames over a long speech sequence. To apply this sliding-window feature extraction, the speech signal is sliced into small time frames and a set of features is extracted from each frame. Filter banks stand in for a group of signal-processing techniques that decompose signals into frequency sub-bands. This decomposition is useful because frequency-domain processing (also called sub-band processing) has advantages over time-domain processing. Filter banks comprise low-pass, band-pass, and high-pass filters used for the spectral decomposition and recombination of signals. The decomposition performed by the filter bank is called analysis (meaning analysis of the signal in terms of its components in each sub-band); the output of analysis is a set of sub-band signals, with as many sub-bands as there are filters in the filter bank. The restoration process is called synthesis, meaning reconstruction of the complete signal from the filtered components. Another significant and practical use of filter banks in this model is signal compression, when some frequencies are more important than others. After decomposition, the important frequencies can be coded with fine resolution; small differences at these frequencies are significant, so a coding scheme that preserves them must be used. Less important frequencies, on the other hand, do not have to be exact. Sound quality is enhanced by a graphic equalizer, whose purpose is to improve an instrument's sound or make certain instruments and sounds more prominent, as elucidated by researchers.

IV. REVIEW ON HSI MODELS

In this section we briefly describe the HSI (Hark, Spell and Imitate) model: how it works, how it is classified, and its component models. Various researchers assume a set of acoustic signals A = (a1, …, an), with a frequency range between 20 and 20,000 Hz (the normal range of human hearing), passed through filter-bank spectra. The output signal sequence is a set B = <sos>, (b1, …, bn), <eos>, with bi ∈ {a, …, z}, {0, …, 9}, <space>, <comma>, <period>, <apostrophe>, <unk> [7], i.e., the processed input acoustic signal. Here <sos>, <eos> and <unk> are the special start-of-sentence, end-of-sentence and unknown tokens, respectively. The HSI model consists of three components, namely Hark, Spell and Imitate. Hark is an acoustic-model encoder that performs the operation called hark: it translates the original acoustic signal A into a high-level representation H = (h1, …, hL), with L ≤ n. The speller and imitator together form an attentive character decoder that performs the operations called spell and imitate.

In the HSI model, Hark is a pyramidal BLSTM that encodes the input sequence A into the high-level features H, and the speller is an attention-based decoder generating the output B.
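The pyramidal encoder's time reduction can be sketched numerically. The sketch below is a minimal illustration, not the surveyed authors' implementation: each pyramidal layer halves the time axis by concatenating adjacent frame pairs, so three stacked layers reduce n input frames to roughly L = n/8 high-level vectors. The function name and the dimensions are our own choices; a real encoder would run a BLSTM over the frames between reduction steps.

```python
import numpy as np

def pyramid_step(frames):
    """Halve the time axis by concatenating each pair of adjacent frames."""
    t, d = frames.shape
    if t % 2:                       # pad odd-length sequences with a zero frame
        frames = np.vstack([frames, np.zeros((1, d))])
    return frames.reshape(-1, 2 * d)

# 80 frames of 40-dim filter-bank features -> 10 vectors after 3 pyramid layers
h = np.random.randn(80, 40)
for _ in range(3):
    h = pyramid_step(h)             # a BLSTM layer would process h here
print(h.shape)                      # (10, 320)
```

The reduction is what makes L ≤ n in the Hark output: the decoder then attends over 10 vectors instead of 80 frames, cutting the attention cost per output character.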
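The spell and imitate operations rely on attention over the encoder's high-level vectors. As a hedged sketch in plain numpy (using dot-product scoring rather than the learned energy function an actual speller would train), one decoding step computes an alignment distribution over the L encoder states and a context vector for the next character:

```python
import numpy as np

def attention_context(query, enc_states):
    """One attention step: softmax-normalized dot-product alignment."""
    scores = enc_states @ query              # (L,) similarity of each state
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    context = weights @ enc_states           # weighted sum, shape (d,)
    return context, weights

enc = np.random.randn(10, 320)   # L = 10 high-level vectors from the encoder
q = np.random.randn(320)         # current decoder state
ctx, w = attention_context(q, enc)
print(round(w.sum(), 6))         # 1.0 -- the alignment is a distribution over L
```

At each step the decoder would feed the context vector, its own state, and the previously emitted character into a softmax over the output alphabet (a-z, digits, and the special tokens).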
A learning rate of 0.2 was used with a geometric decay of 0.98 per 3M utterances (i.e., 1/20th of an epoch). They used the DistBelief framework [15] with 32 replicas, each with a mini-batch of 32 utterances. To further speed up training, the sequences were grouped into buckets based on their frame length [4]. The model was trained until the results on the validation set stopped improving, taking approximately two weeks. The model was decoded using N-best list decoding with a beam size of N = 32.

After the LSTM model completes the spell operation, the filter bank comes into action to provide high-quality sound imitation, or emission. This enhancement of sound quality is done by a graphic equalizer. It is essential to restore the original input sequence, which is done by the process called synthesis, meaning restoring the complete signal from the filtered components (as mentioned in all the research papers we surveyed). Then the graphic equalizer comes into action: its purpose is to equalize the sound signal, improving the sound quality of the machine and making the sound more prominent, hence imitating whatever has been uttered with the best quality.

V. FUTURE SCOPE

The attention-based sequence-to-sequence model is another end-to-end model. It has its roots in the successful attention model in machine learning, which extends the encoder-decoder framework using an RNN decoder. A windowing method is therefore used to reduce the number of candidates considered by the attention decoder. A pyramidal structure is used in the encoder network so that only L high-level hidden vectors are generated, instead of T hidden vectors from all the input time steps. Due to the high complexity and slow speed of training, the majority of attention-based work was done at Google, compared with the CTC work reported by many sites. Recently, however, Google significantly advanced research on attention-based models with the HSI model.

VI. CONCLUSION

As experimented by [7], the data set used for the experiment contained three million Google voice search utterances (representing 2000 hours of data). Approximately 10 hours of utterances were selected randomly as a held-out set. Data augmentation was performed using a room simulator, adding different types of noise and reverberation; the noise sources were obtained from YouTube and environmental recordings of daily events [16]. This increased the amount of audio data 20 times, with a signal-to-noise ratio between 5 dB and 30 dB [16]. They used 40-dimensional log-mel filter bank features, computed every 10 ms, as the acoustic inputs to the hark. A separate set of 22K utterances, representing approximately 16 hours of data, was used as the test set. A noisy test set was also created using the same strategy that was applied to the training data. The transcripts were normalized by converting each character to lower-case English alphanumerics, including digits. The punctuation marks space, comma, period and apostrophe were kept, while all other tokens were converted to the unknown <unk> token, and every utterance was padded with the start-of-sentence <sos> and end-of-sentence <eos> tokens.

We surveyed the HSI (Hark, Spell and Imitate) model, a neural speech recognizer that transcribes speech character by character while imitating whatever has been uttered. This goal is accomplished by using various models: an HMM speech-recognizer model performing the hark operation; DNN-HMM, CTC and RNN encoder-decoders for decomposing complex sequences into sub-frames for better output; and filter banks for better-quality sound production. We are optimistic that this kind of model will pave the way for better speech recognition systems. The state-of-the-art model on this data set is a CLDNN-HMM system described in [16]. The CLDNN-HMM system achieves a WER of 8.0% on the clean set and 8.9% on the noisy set.
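The WER figures quoted above are computed in the standard way: word-level edit distance between reference and hypothesis, divided by the reference length. A minimal sketch of this metric (our own helper, not code from [16]):

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                          # deletions
    for j in range(len(h) + 1):
        d[0][j] = j                          # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat", "the cat sat"))     # 0.0
```

A WER of 8.0% thus means that, on average, 8 word-level substitutions, insertions or deletions are needed per 100 reference words.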
REFERENCES
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and
translate. arXiv preprint arXiv:1409.0473, 2014.
[2] Jan Chorowski, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. End-to-end continuous speech recognition
using attention-based recurrent nn: First results. arXiv preprint arXiv:1412.1602, 2014.
[3] Dzmitry Bahdanau, Dmitriy Serdyuk, Philémon Brakel, Nan Rosemary Ke, Jan Chorowski, Aaron Courville, and
Yoshua Bengio. Task loss estimation for sequence prediction. arXiv preprint arXiv:1511.06456, 2015.
[4] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in
neural information processing systems, pages 3104–3112, 2014.
[5] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv
preprint arXiv:1406.1078, 2014.
[6] Tin Lay Nwe, Say Wei Foo, and Liyanage C De Silva. Speech emotion recognition using hidden markov models. Speech
communication, 41(4):603–623, 2003.
[7] William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals. Listen, attend and spell: A neural network for large
vocabulary conversational speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE
International Conference on, pages 4960–4964. IEEE, 2016.
[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[9] Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid speech recognition with deep bidirectional lstm. In
Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, pages 273–278. IEEE, 2013.
[10] Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In
International Conference on Machine Learning, pages 1764–1772, 2014.
[11] Haşim Sak, Andrew Senior, Kanishka Rao, Françoise Beaufays, and Johan Schalkwyk. Google voice search: faster
and more accurate. Google Research blog, 2015.
[12] Jacob Benesty, M Mohan Sondhi, and Yiteng Huang. Springer handbook of speech processing. Springer, 2007.
[13] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua
Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on
Machine Learning, pages 2048–2057, 2015.
[14] Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models
for speech recognition. In Advances in neural information processing systems, pages 577–585, 2015.
[15] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew Senior, Paul Tucker, Ke
Yang, Quoc V Le, et al. Large scale distributed deep networks. In Advances in neural information processing systems,
pages 1223–1231, 2012.
[16] Tara N Sainath, Oriol Vinyals, Andrew Senior, and Haşim Sak. Convolutional, long short-term memory, fully
connected deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International
Conference on, pages 4580–4584. IEEE, 2015.