11 views

Uploaded by Richard Y. Alcantara

d

- 1701.05923
- dawnbench-nips17
- assignment 3
- Neural Network for QAC
- 1603.03827
- Voice Controlled Device to Read out Files and Control its Pace
- Long Short Term Memory in MLP Pair
- Paper
- White Paper Neural Networks
- Bag of Words
- Ieee Argencon 2016 Paper 57
- JournalNX-Language Identification
- Challenges in Automatic Speech Recognition
- ppm_en_2004_02_Aggarwal.pdf
- 10.1.1.41.8377
- BitVoicer Manual
- Neural Network for Consumer Choice Prediction
- Neural Networks
- M.E(E.C) 1st Sem Syllabus
- Art Neural Nw

You are on page 1of 5

Haşim Sak, Andrew Senior, Françoise Beaufays

Google, USA

{hasim,andrewsenior,fsb@google.com}

LSTM and conventional RNNs have been successfully ap-

Long Short-Term Memory (LSTM) is a specific recurrent neu- plied to various sequence prediction and sequence labeling

ral network (RNN) architecture that was designed to model tem- tasks. In language modeling, a conventional RNN has ob-

poral sequences and their long-range dependencies more accu- tained significant reduction in perplexity over standard n-gram

rately than conventional RNNs. In this paper, we explore LSTM models [6] and an LSTM RNN model has shown improve-

RNN architectures for large scale acoustic modeling in speech ments over conventional RNN LMs [7]. LSTM models have

recognition. We recently showed that LSTM RNNs are more been shown to perform better than RNNs on learning context-

effective than DNNs and conventional RNNs for acoustic mod- free and context-sensitive languages [8]. Bidirectional LSTM

eling, considering moderately-sized models trained on a single (BLSTM) networks that operate on the input sequence in both

machine. Here, we introduce the first distributed training of directions to make a decision for the current input have been

LSTM RNNs using asynchronous stochastic gradient descent proposed for phonetic labeling of acoustic frames on the TIMIT

optimization on a large cluster of machines. We show that a speech database [9]. For online and offline handwriting recog-

two-layer deep LSTM RNN where each LSTM layer has a lin- nition, BLSTM networks used together with a Connectionist

ear recurrent projection layer can exceed state-of-the-art speech Temporal Classification (CTC) layer and trained from unseg-

recognition performance. This architecture makes more effec- mented sequence data, have been shown to outperform a state-

tive use of model parameters than the others considered, con- of-the-art Hidden-Markov-Model (HMM) based system [10].

verges quickly, and outperforms a deep feed forward neural net- Similar techniques with a deep BLSTM network have been

work having an order of magnitude more parameters. proposed to perform grapheme-based speech recognition [11].

Index Terms: Long Short-Term Memory, LSTM, recurrent BLSTM networks have also been proposed for phoneme predic-

neural network, RNN, speech recognition, acoustic modeling. tion in a multi-stream framework for continuous conversational

speech recognition [12]. In terms of architectures, following

1. Introduction the success of DNNs for acoustic modeling [13, 14, 15, 16], a

deep BLSTM RNN combined with a CTC output layer and an

Speech is a complex time-varying signal with complex cor- RNN transducer predicting phone sequences has been shown to

relations at a range of different timescales. Recurrent neural reach state-of-the-art phone recognition accuracy on the TIMIT

networks (RNNs) contain cyclic connections that make them database [17].

a more powerful tool to model such sequence data than feed-

Deep BLSTM RNNs have recently been shown to per-

forward neural networks. RNNs have demonstrated great suc-

form better than DNNs in the hybrid speech recognition ap-

cess in sequence labeling and prediction tasks such as handwrit-

proach [2]. Using the hybrid approach, we have recently shown

ing recognition and language modeling. In acoustic modeling

that an LSTM architecture with a recurrent projection layer

for speech recognition, however, where deep neural networks

outperforms DNNs and conventional RNNs for large vocabu-

(DNNs) are the established state-of-the-art, recently RNNs have

lary speech recognition, considering moderately-sized models

received little attention beyond small scale phone recognition

trained on a single machine [3]. In this paper, we explore LSTM

tasks, notable exceptions being the work of Robinson [1],

RNN architectures for large-scale acoustic modeling using dis-

Graves [2], and Sak [3].

tributed training. We show that a two-layer deep LSTM RNN

DNNs can provide only limited temporal modeling by op- where each LSTM layer has a linear recurrent projection layer

erating on a fixed-size sliding window of acoustic frames. They outperforms a strong baseline system using a deep feed-forward

can only model the data within the window and are unsuited to neural network having an order of magnitude more parameters.

handle different speaking rates and longer term dependencies.

By contrast, recurrent neural networks contain cycles that feed

the network activations from a previous time step as inputs to 2. LSTM Network Architectures

the network to influence predictions at the current time step. 2.1. Conventional LSTM

These activations are stored in the internal states of the network

which can in principle hold long-term temporal contextual in- The LSTM contains special units called memory blocks in the

formation. This mechanism allows RNNs to exploit a dynami- recurrent hidden layer. The memory blocks contain memory

cally changing contextual window over the input sequence his- cells with self-connections storing the temporal state of the net-

tory rather than a static one as in the fixed-sized window used work in addition to special multiplicative units called gates to

with feed-forward networks. In particular, the Long Short-Term control the flow of information. Each memory block in the orig-

Memory (LSTM) architecture [4], which overcomes some mod- inal architecture contained an input gate and an output gate.

eling weaknesses of RNNs [5], is conceptually attractive for the The input gate controls the flow of input activations into the

rt−1

input input input input

ft

it ot recurrent recurrent

recurrent

output

output

ct mt yt LSTM

input

g cell h

xt rt (a) LSTM

output LSTM

output (c) LSTMP

(b) DLSTM recurrent

LSTM memory blocks

output

Figure 1: LSTMP RNN architecture. A single memory block is

(d) DLSTMP

shown for clarity.

memory cell. The output gate controls the output flow of cell

activations into the rest of the network. Later, the forget gate

was added to the memory block [18]. This addressed a weak- instant. Therefore, the depth in deep LSTM RNNs has an ad-

ness of LSTM models preventing them from processing contin- ditional meaning. The input to the network at a given time step

uous input streams that are not segmented into subsequences. goes through multiple LSTM layers in addition to propagation

The forget gate scales the internal state of the cell before adding through time and LSTM layers. It has been argued that deep

it as input to the cell through the self-recurrent connection of layers in RNNs allow the network to learn at different time

the cell, therefore adaptively forgetting or resetting the cell’s scales over the input [20]. Deep LSTM RNNs offer another

memory. In addition, the modern LSTM architecture contains benefit over standard LSTM RNNs: They can make better use

peephole connections from its internal cells to the gates in the of parameters by distributing them over the space through mul-

same cell to learn precise timing of the outputs [19]. tiple layers. For instance, rather than increasing the memory

An LSTM network computes a mapping from an input size of a standard model by a factor of 2, one can have 4 lay-

sequence x = (x1 , ..., xT ) to an output sequence y = ers with approximately the same number of parameters. This

(y1 , ..., yT ) by calculating the network unit activations using results in inputs going through more non-linear operations per

the following equations iteratively from t = 1 to T : time step.

it = σ(Wix xt + Wim mt−1 + Wic ct−1 + bi ) (1) 2.3. LSTMP - LSTM with Recurrent Projection Layer

ft = σ(Wf x xt + Wf m mt−1 + Wf c ct−1 + bf ) (2)

The standard LSTM RNN architecture has an input layer, a re-

ct = ft ct−1 + it g(Wcx xt + Wcm mt−1 + bc ) (3) current LSTM layer and an output layer. The input layer is con-

ot = σ(Wox xt + Wom mt−1 + Woc ct + bo ) (4) nected to the LSTM layer. The recurrent connections in the

mt = ot h(ct ) (5) LSTM layer are directly from the cell output units to the cell

input units, input gates, output gates and forget gates. The cell

yt = φ(Wym mt + by ) (6) output units are also connected to the output layer of the net-

work. The total number of parameters N in a standard LSTM

where the W terms denote weight matrices (e.g. Wix is the ma-

network with one cell in each memory block, ignoring the bi-

trix of weights from the input gate to the input), Wic , Wf c , Woc

ases, can be calculated as N = nc × nc × 4 + ni × nc × 4 +

are diagonal weight matrices for peephole connections, the b

nc × no + nc × 3, where nc is the number of memory cells

terms denote bias vectors (bi is the input gate bias vector), σ is

(and number of memory blocks in this case), ni is the number

the logistic sigmoid function, and i, f , o and c are respectively

of input units, and no is the number of output units. The com-

the input gate, forget gate, output gate and cell activation vec-

putational complexity of learning LSTM models per weight and

tors, all of which are the same size as the cell output activation

time step with the stochastic gradient descent (SGD) optimiza-

vector m, is the element-wise product of the vectors, g and h

tion technique is O(1). Therefore, the learning computational

are the cell input and cell output activation functions, generally

complexity per time step is O(N ). The learning time for a net-

and in this paper tanh, and φ is the network output activation

work with a moderate number of inputs is dominated by the

function, softmax in this paper.

nc × (4 × nc + no ) factor. For the tasks requiring a large

number of output units and a large number of memory cells to

2.2. Deep LSTM store temporal contextual information, learning LSTM models

As with DNNs with deeper architectures, deep LSTM RNNs become computationally expensive.

have been successfully used for speech recognition [11, 17, 2]. As an alternative to the standard architecture, we proposed

Deep LSTM RNNs are built by stacking multiple LSTM lay- the Long Short-Term Memory Projected (LSTMP) architec-

ers. Note that LSTM RNNs are already deep architectures in ture to address the computational complexity of learning LSTM

the sense that they can be considered as a feed-forward neu- models [3]. This architecture, shown in Figure 1 has a separate

ral network unrolled in time where each layer shares the same linear projection layer after the LSTM layer. The recurrent con-

model parameters. One can see that the inputs to the model nections now connect from this recurrent projection layer to the

go through multiple non-linear layers as in DNNs, however the input of the LSTM layer. The network output units are con-

features from a given time instant are only processed by a sin- nected to this recurrent layer. The number of parameters in this

gle nonlinear layer before contributing the output for that time model is nc ×nr ×4+ni ×nc ×4+nr ×no +nc ×nr +nc ×3,

instructions. We implemented activation functions and gradi-

Table 1: Experiments with LSTM and LSTMP RNN architec- ent calculations on matrices using SIMD instructions to benefit

tures showing test set WERs and frame accuracies on devel- from parallelization.

opment and training sets. L indicates the number of layers,

We use the truncated backpropagation through time (BPTT)

for shallow (1L) and deep (2,4,5,7L) networks. C indicates the

learning algorithm [22] to compute parameter gradients on short

number of memory cells, P the number of recurrent projection

subsequences of the training utterances. Activations are for-

units, and N the total number of parameters.

ward propagated for a fixed step time Tbptt (e.g. 20). Cross

C P Depth N Dev Train WER entropy gradients are computed for this subsequence and back-

(%) (%) (%) propagated to its start. For computational efficiency each thread

840 - 5L 37M 67.7 70.7 10.9 operates on subsequences of four utterances at a time, so matrix

440 - 5L 13M 67.6 70.1 10.8 multiplies can operate in parallel on four frames at a time. We

600 - 2L 13M 66.4 68.5 11.3 use asynchronous stochastic gradient descent (ASGD) [23] to

385 - 7L 13M 66.2 68.5 11.2 optimize the network parameters, updating the parameters asyn-

750 - 1L 13M 63.3 65.5 12.4 chronously from multiple threads on a multi-core machine. This

effectively increases the batch size and reduces the correlation

6000 800 1L 36M 67.3 74.9 11.8 of the frames in a given batch. After a thread has updated the

2048 512 2L 22M 68.8 72.0 10.8 parameters, it continues with the next subsequence in each utter-

1024 512 3L 20M 69.3 72.5 10.7 ance, preserving the LSTM state, or starts new utterances with

1024 512 2L 15M 69.0 74.0 10.7 reset state when one finishes. Note that the last subsequence

800 512 2L 13M 69.0 72.7 10.7 of each utterance can be shorter than Tbptt but is padded to the

2048 512 1L 13M 67.3 71.8 11.3 full length, though no gradient is generated for these padding

frames.

This highly parallel single machine ASGD framework de-

where nr is the number of units in the recurrent projection layer. scribed in [3] proved slow for training models of the scale we

In this case, the model size and the learning computational com- have used for large scale ASR with DNNs (many millions of

plexity are dominated by the nr × (4 × nc + no ) factor. Hence, parameters). To scale further, we replicate the single-machine

this allows us to reduce the number of parameters by the ratio workers on many (e.g. 500) separate machines, each with three,

nr

nc

. By setting nr < nc we can increase the model memory synchronized, computation threads. Each worker communi-

(nc ) and still be able to control the number of parameters in the cates with a shared, distributed parameter server [23] which

recurrent connections and output layer. stores the LSTM parameters. When a worker has computed the

With the proposed LSTMP architecture, the equations for parameter gradient on a minibatch (of 3 × 4 × Tbptt frames), the

the activations of network units change slightly, the mt−1 acti- gradient vector is partitioned and sent to the parameter server

vation vector is replaced with rt−1 and the following is added: shards which each add the gradients to their parameters and re-

rt = Wrm mt (7) spond with the new parameters. The parameter server shards

aggregate parameter updates completely asynchronously. For

yt = φ(Wyr rt + by ) (8) instance, gradient updates from workers may arrive in different

where the r denote the recurrent unit activations. orders at different shards of the parameter server. Despite the

asynchrony, we observe stable convergence, though the learn-

2.4. Deep LSTMP ing rate must be reduced, as would be expected because of the

increase in the effective batch size from the greater parallelism.

Similar to deep LSTM, we propose deep LSTMP where multi-

ple LSTM layers each with a separate recurrent projection layer

are stacked. LSTMP allows the memory of the model to be in- 4. Experiments

creased independently from the output layer and recurrent con- We evaluate and compare the performance of LSTM RNN ar-

nections. However, we noticed that increasing the memory size chitectures on a large vocabulary speech recognition task – the

makes the model more prone to overfitting by memorizing the Google Voice Search task. We use a hybrid approach [24]

input sequence data. We know that DNNs generalize better to for acoustic modeling with LSTM RNNs, wherein the neural

unseen examples with increasing depth. The depth makes the networks estimate hidden Markov model (HMM) state posteri-

models harder to overfit to the training data since the inputs ors. We scale the state posteriors by the state priors estimated

to the network need to go through many non-linear functions. as the relative state frequency from the training data to obtain

With this motivation, we have experimented with deep LSTMP the acoustic frame likelihoods. We deweight the silence state

architectures, where the aim is increasing the memory size and counts by a factor of 2.7 when estimating the state frequencies.

generalization power of the model.

4.1. Systems & Evaluation

3. Distributed Training: Scaling up to

All the networks are trained on a 3 million utterance (about

Large Models with Parallelization 1900 hours) dataset consisting of anonymized and hand-

We chose to implement the LSTM RNN architectures on multi- transcribed utterances. The dataset is represented with 25ms

core CPU rather than on GPU. The decision was based on frames of 40-dimensional log-filterbank energy features com-

CPU’s relatively simpler implementation complexity, ease of puted every 10ms. The utterances are aligned with a 85 million

debugging and the ability to use clusters made from commod- parameter DNN with 14247 CD states. The weights in all the

ity hardware. For matrix operations, we used the Eigen matrix networks are initialized to the range (-0.02, 0.02) with a uni-

library [21]. This templated C++ library provides efficient im- form distribution. We try to set the learning rate specific to a

plementations for matrix operations on CPU using vectorized network architecture and its configuration to the largest value

70 c800_r512x2 c800_r512x2

c440x5

c2048_r512

72 c440x5

c2048_r512

c600x2 c600x2

68 c385x7 c385x7

c750 c750

70

Frame Accuracy (%)

66

68

64 66

62 64

Training Time (hours) Training Time (hours)

(a) Held-out data. (b) Training data.

rates for stable convergence of different networks range from Figure 3 compares the frame accuracies on training and

5e-06 to 1e-05. The learning rates are exponentially decayed held-out sets for various LSTM and LSTMP architectures. The

during training. With the correct learning rate, the training of overfitting problem with LSTMP RNN architecture with large

LSTM RNNs results in a stable convergence. Apart from clip- number of memory cells (2048) can be seen clearly. We observe

ping the activations of memory cells to range [-50, 50], we do that LSTMP RNN architectures converges faster than LSTM

not limit the activations of other units, the weights or the esti- RNN architectures. It is clear that having more layers helps gen-

mated gradients. eralization but makes training harder and convergence slower.

During training, we evaluate frame accuracies (i.e. phone Table 2 shows how the performance of networks with deep

state labeling accuracy of acoustic frames) on a held-out devel- LSTMP RNN architecture changes with the depth and number

opment set of 200,000 frames. The trained models are eval- of model parameters. We see that increasing the number of pa-

uated in a speech recognition system on a test set of 22,500 rameters over 13M does not improve performance. We can also

hand-transcribed, anonymized utterances. For all the decoding decrease the number of parameters substantially without hurt-

experiments, we use a wide beam to avoid search errors. After a ing performance much. The deep LSTMP RNN architecture

first pass of decoding using the LSTM RNNs with a 5-gram lan- with two layers each with 800 cells and 512 recurrent projec-

guage model heavily pruned to 23 million n-grams, lattices are tion units converges mostly in 48 hours and gives 10.9% WER

rescored using a 5-gram language model with 1 billion n-grams. on the independent test set. Training this model for 100 hours

The input to the LSTM RNNs is log-filterbank energy fea- improves the WER to 10.7% and for 200 hours to 10.5%. In

tures, with no frame stacking. Since the information from comparison, our best DNN models with 85M parameters gives

the future frames helps making better decisions for the current 11.3% at the same beam and training takes a few weeks.

frame, we delay the output HMM state label by 5 frames. We

do not calculate the errors for the first 5 frames of each utter-

ance during backpropagation and we repeat the last frame of Table 2: Experiments with LSTMP RNN architectures.

each utterance for 5 more time steps. C P Depth N WER (%)

1024 512 3L 20M 10.7

4.2. Results 1024 512 2L 15M 10.7

In Table 1, we summarize the results for various LSTM and 800 512 2L 13M 10.7

LSTMP RNN architectures. We observe that the conventional 700 400 2L 10M 10.8

LSTM RNNs with a single layer do not perform very well for 600 350 2L 8M 10.9

this large scale acoustic modeling task. With two layers of

LSTM RNNs, the performance improves but still it is not very

good. The LSTM RNN with five layers approaches the perfor- 5. Conclusions

mance of the best model. We see that training an LSTM RNN

with seven layers is hard, the model starts converging after a We showed that deep LSTM RNN architectures achieve state-

day of training. From the table, one can see that the LSTMP of-the-art performance for large scale acoustic modeling. The

RNN models with a single layer and a large number of memory proposed deep LSTMP RNN architecture outperforms standard

cells tends to overfit the training data. Increasing the number of LSTM networks and DNNs and makes more effective use of the

LSTMP RNN layers seems to alleviate this problem of mem- model parameters by addressing the computational efficiency

orization and to result in better generalization to held-out data. needed for training large networks. We also show for the first

The LSTMP RNN models give slightly better results than the time that LSTM RNN models can be quickly trained using

LSTM RNN model with 5 layers. We see that increasing the ASGD distributed training.

number of parameters in the LSTMP RNN models more than

13M by having more layers or more memory cells does not give

6. References [20] M. Hermans and B. Schrauwen, “Training and analysing deep re-

current neural networks,” in Advances in Neural Information Pro-

[1] A. J. Robinson, G. D. Cook, D. P. W. Ellis, E. Fosler-Lussier, S. J.

cessing Systems, 2013, pp. 190–198.

Renals, and D. A. G. Williams, “Connectionist speech recognition

of broadcast news,” Speech Commun., vol. 37, no. 1-2, pp. 27–45, [21] G. Guennebaud, B. Jacob et al., “Eigen v3,” 2010. [Online].

May 2002. Available: http://eigen.tuxfamily.org

[2] A. Graves, N. Jaitly, and A. Mohamed, “Hybrid speech recogni- [22] R. J. Williams and J. Peng, “An efficient gradient-based algorithm

tion with deep bidirectional LSTM,” in Automatic Speech Recog- for online training of recurrent network trajectories,” Neural Com-

nition and Understanding (ASRU), 2013 IEEE Workshop on. putation, vol. 2, pp. 490–501, 1990.

IEEE, 2013, pp. 273–278.

[23] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le,

[3] H. Sak, A. Senior, and F. Beaufays, “Long Short-Term Memory M. Z. Mao, M. Ranzato, A. W. Senior, P. A. Tucker, K. Yang, and

Based Recurrent Neural Network Architectures for Large Vocab- A. Y. Ng, “Large scale distributed deep networks.” in NIPS, 2012,

ulary Speech Recognition,” ArXiv e-prints, Feb. 2014. pp. 1232–1240.

[4] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” [24] H. Bourlard and N. Morgan, Connectionist speech recognition.

Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997. Kluwer Academic Publishers, 1994.

[5] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term de-

pendencies with gradient descent is difficult,” Neural Networks,

IEEE Transactions on, vol. 5, no. 2, pp. 157–166, 1994.

[6] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudan-

pur, “Recurrent neural network based language model,” in Pro-

ceedings of INTERSPEECH, vol. 2010, no. 9. International

Speech Communication Association, 2010, pp. 1045–1048.

[7] M. Sundermeyer, R. Schlüter, and H. Ney, “Lstm neural networks

for language modeling.” in INTERSPEECH, 2012, pp. 194–197.

[8] F. A. Gers and J. Schmidhuber, “LSTM recurrent networks learn

simple context free and context sensitive languages,” IEEE Trans-

actions on Neural Networks, vol. 12, no. 6, pp. 1333–1340, 2001.

[9] A. Graves and J. Schmidhuber, “Framewise phoneme classifica-

tion with bidirectional LSTM and other neural network architec-

tures,” Neural Networks, vol. 12, pp. 5–6, 2005.

[10] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke,

and J. Schmidhuber, “A novel connectionist system for uncon-

strained handwriting recognition,” Pattern Analysis and Machine

Intelligence, IEEE Transactions on, vol. 31, no. 5, pp. 855–868,

2009.

[11] F. Eyben, M. Wollmer, B. Schuller, and A. Graves, “From speech

to letters using a novel neural network architecture for grapheme

based ASR,” in Automatic Speech Recognition & Understanding,

2009. ASRU 2009. IEEE Workshop on. IEEE, 2009, pp. 376–380.

[12] M. Wollmer, F. Eyben, B. Schuller, and G. Rigoll, “A multi-stream

asr framework for blstm modeling of conversational speech,” in

Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE

International Conference on, May 2011, pp. 4860–4863.

[13] A. Rahman Mohamed, G. E. Dahl, and G. E. Hinton, “Acoustic

modeling using deep belief networks,” IEEE Transactions on Au-

dio, Speech & Language Processing, vol. 20, no. 1, pp. 14–22,

2012.

[14] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent

pre-trained deep neural networks for large-vocabulary speech

recognition,” IEEE Transactions on Audio, Speech & Language

Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012. [Online].

Available: http://dx.doi.org/10.1109/TASL.2011.2134090

[15] N. Jaitly, P. Nguyen, A. Senior, and V. Vanhoucke, “Application of

pretrained deep neural networks to large vocabulary speech recog-

nition,” in Proceedings of INTERSPEECH, 2012.

[16] G. Hinton, L. Deng, D. Yu, A.-R. Mohamed, N. Jaitly, A. Senior,

V. Vanhoucke, P. Nguyen, T. Sainath, G. Dahl, and B. Kingsbury,

“Deep neural networks for acoustic modeling in speech recogni-

tion,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–

97, November 2012.

[17] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with

deep recurrent neural networks,” in Proceedings of ICASSP, 2013.

[18] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget:

Continual prediction with LSTM,” Neural Computation, vol. 12,

no. 10, pp. 2451–2471, 2000.

[19] F. A. Gers, N. N. Schraudolph, and J. Schmidhuber, “Learning

precise timing with LSTM recurrent networks,” Journal of Ma-

chine Learning Research, vol. 3, pp. 115–143, Mar. 2003.

- 1701.05923Uploaded byGalina Alexeeva
- dawnbench-nips17Uploaded byRaden Eka G.
- assignment 3Uploaded byapi-386491919
- Neural Network for QACUploaded byNabil Ziad Kuddah
- 1603.03827Uploaded bytholc
- Voice Controlled Device to Read out Files and Control its PaceUploaded byAJER JOURNAL
- Long Short Term Memory in MLP PairUploaded byTodor Balabanov
- PaperUploaded byAvinash Barnwal
- White Paper Neural NetworksUploaded bympgent
- Bag of WordsUploaded bymandar.hiremath@gmail.com
- Ieee Argencon 2016 Paper 57Uploaded byademargcjunior
- JournalNX-Language IdentificationUploaded byJournalNX - a Multidisciplinary Peer Reviewed Journal
- Challenges in Automatic Speech RecognitionUploaded byToaster97
- ppm_en_2004_02_Aggarwal.pdfUploaded byHarisanto As
- 10.1.1.41.8377Uploaded byCodecom Sisge
- BitVoicer ManualUploaded byalvarado02
- Neural Network for Consumer Choice PredictionUploaded byBradley
- Neural NetworksUploaded byRishi Jha
- M.E(E.C) 1st Sem SyllabusUploaded bykrish_320786
- Art Neural NwUploaded byDhivhya
- An Introduction to BackUploaded byDinesh Sundar
- nn_sas.pdfUploaded byRatna Sulistyowati
- Neural Networks and Deep LearningUploaded byAndres Tuells Jansson
- Performance Evaluation of Ann Based Plasma Position Controllers for Aditya TokamakUploaded byIAEME Publication
- Machine LearningUploaded byTiwari Pranjal
- Jd 3416591661Uploaded byAnonymous 7VPPkWS8O
- Designing_An_Equalizer_Structure_Using__Gradient.pdfUploaded bySowjanya Modalavalasa
- Minin HandoutUploaded byDimas Bagus Cahyaningrat. W
- Neural Network For The Estimation Of Ammonia Concentration In Breath Of Kidney Dialysis PatientsUploaded byInternational Organization of Scientific Research (IOSR)
- Project Milestone FinalUploaded byAnonymous IZSS6CRQs

- Tema_3Uploaded byRichard Y. Alcantara
- Tema_3Uploaded byRichard Y. Alcantara
- Extra_DPUploaded byRichard Y. Alcantara
- Herless, Gabriela, Richard, NicolasUploaded byRichard Y. Alcantara
- diss.pdfUploaded byRichard Y. Alcantara
- Variables Aleatorias y Procesos EstocasticosUploaded bymdcarraro
- estocasticoUploaded byJenipher Ascencio
- foo.txtUploaded byRichard Y. Alcantara
- Robot BrazoUploaded byLuis Alfaro
- Probability Theory and Stochastic Processes with ApplicationsUploaded bynogarycms
- competitive-programming1.pdfUploaded byརྟ་མགྲིན་ དབང་ཕྱུག།
- An algorithmic introduction numerical simulation of stochastic differential equationsUploaded byjustinlfang
- The Hitchhiker’s Guide to the Programming ContestsUploaded bydpiklu
- PREVIO NO2Uploaded byRichard Y. Alcantara
- estocasticoUploaded byJenipher Ascencio
- AUTOTRÓNICA _ AcadémicaUploaded byRichard Y. Alcantara

- microcomputer-architecture-01.pdfUploaded byThành Trường
- Big O MIT.pdfUploaded byJoan
- CD&Oass1234Uploaded bySisay AD
- COMP248 Lab ManualUploaded byAnonymous xjNcJKK
- 1.3 — a First Look at Variables, Initialization, And AssignmentUploaded byoifesobi
- Lecture 24Uploaded bywilliamsock
- machine learningUploaded bygalaxystar
- Out Put PrimitivesUploaded bySharma Divya
- Software Testing Made EasyUploaded byvijay.er123
- Heru Mulyana CVUploaded byHeru Mulyana
- 3 Esrtos IntroUploaded byVijayaraghavan V
- Lecture05 - 8086 AssemblyUploaded bytesfu zewdu
- Perforce Best PracticesUploaded byJeanCV
- 7 Assembly Language LevelUploaded byLuisa Grigorescu
- Spring 2015Uploaded byAnonymous 7d2WetVROe
- uom coursesUploaded bydeepti_555
- Vitae BleeUploaded byramesh
- 2018學位生各院教授名單List of Professors for Degree-seeking Students-201804updatedUploaded byMuhammad Zaka Ud Din
- pierre sutherland resume september 2017 - copyUploaded byapi-351961844
- Intro to Comp Csc 111_Intensive 2Uploaded byVictor Oluwaseyi Olowonisi
- Lecture 2 High Level vs Low Level LanguagesUploaded byapi-3739389
- Programming ParadigmsUploaded byShefali Gupta
- HEC Computer ScienceUploaded bymuradtariq_tk
- Efficient Dispatching Rules Based on Data Mining for the Single Machine Scheduling ProblemUploaded byCS & IT
- NICE Network Intrusion Detection and CountermeasureUploaded byVivek Ranjan
- akira demoss resume 10-10-2018Uploaded byapi-396402491
- unit 1Uploaded byAnitha Joseph
- Department of Computer Science and EngineeringUploaded byManikandan Subbaraja
- Rhetorical AnalysisUploaded byKunal Goyal
- 02 LT-2BUploaded byBilly