
Table of Contents

1. Introduction
   1.1 Background
   1.2 Problem Statement
2. Literature Survey
3. System Requirements Specification
4. Proposed Methodology
5. Implementation Details
   5.1 Software and Tools
   5.2 Language Translation Model
      5.2.1 Without Attention Mechanism
      5.2.2 With Attention Mechanism
6. Intermediate Results and Discussion
   6.1 Datasets
   6.2 Results
      6.2.1 Model Translation Accuracy with and without Attention
      6.2.2 Application Pipeline Results
7. Conclusions and Future Work
   7.1 Conclusions
   7.2 Future Work
References
1 INTRODUCTION

Neural Machine Translation (NMT) is a recently proposed approach to the language translation task (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014a). From a probabilistic point of view, translation amounts to finding the target sentence y that maximizes the conditional probability of y given the source sentence x, i.e. p(y|x) (Bahdanau et al., 2014). Phrase-based Statistical Machine Translation (PBSMT) (Koehn et al., 2003) solves the translation task by dividing the model into separately trained components. In contrast to PBSMT, Neural Machine Translation trains a single large neural network end-to-end to achieve the same goal. Given an input sentence, the encoder network produces a context vector, and the decoder network then uses that context vector to generate the translated sentence.
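
As a brief illustration of this probabilistic view (a standard formulation, not text taken from the cited papers), the conditional probability that the decoder models is usually factorized over the target words:

p(y \mid x) = \prod_{t=1}^{T_y} p\left(y_t \mid y_1, \ldots, y_{t-1}, x\right)

so that the network learns to predict each target word given the previously generated words and the source sentence.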

1.1 Background

Several earlier models for language translation use convolutional neural networks as the basic building block, computing hidden representations in parallel for all input and output positions. In such models, the number of operations needed to relate two positions grows with the distance between them, which makes it harder to learn dependencies between distant positions. In the Transformer this is reduced to a constant number of operations, at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect that is counteracted with multi-head attention (Vaswani et al., 2017). Self-attention, sometimes called intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. Self-attention has been used successfully in a variety of tasks, including reading comprehension, abstractive summarization, textual entailment, and task-independent sentence representation [4].

This is the focus of the study described in this thesis.

A concise statement of the problem is provided in the following section.

1.2 Problem Statement

• This project is a technology demonstrator: a YouTube Subtitle Translator application in which I am building a language translation model from scratch that translates English to Spanish, with the aim of achieving a translation accuracy of more than 80%.

• The subtitles are scraped from YouTube using the link of the particular video.
• The subtitles are translated from English into Spanish.
• The translated subtitles are then synchronized with the video timeline.
2 LITERATURE SURVEY

For this study as well, NMT is used for language translation. The literature survey presented here is based on two papers covering recent advances in the area.

[1] Verma, Ajay Anand, and Pushpak Bhattacharyya. "Literature survey: Neural machine translation."
CFILT, Indian Institute of Technology Bombay, India (2017).

In this paper the authors explore the NMT approach in detail and conclude that NMT has only a few limitations:
1. Translation accuracy drops as the sentence length increases; this issue is largely handled by attention-based NMT with the LSTM as the basic unit.
2. Rare or out-of-vocabulary words; this issue is handled by subword segmentation with Byte Pair Encoding (BPE).
They also touch on the factored approach, which makes use of linguistic information about the sentence, and observe that it too helps the performance of NMT models.

[2] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information
processing systems 30 (2017).

1. In this paper the authors explore the Transformer, which internally relies only on attention mechanisms: the recurrent layers that are common in encoder-decoder architectures are replaced with multi-headed self-attention.
2. For translation, the advantages of the Transformer are substantial: it can be trained significantly faster than architectures based on recurrent or convolutional layers. On both the WMT 2014 English-to-German and the WMT 2014 English-to-French translation tasks, the authors achieve a new state of the art; the Transformer gives a large improvement and performs better than their own earlier attention-based models. They are excited about the future of attention-based models and plan to apply them to other tasks, to extend the Transformer to problems involving input and output modalities other than text, and to investigate local, restricted attention mechanisms to efficiently handle large inputs and outputs such as images, audio and video.

3 SYSTEM REQUIREMENTS SPECIFICATION

Minimum Requirements:
• 32 GB RAM
• 9th Gen Octa-Core Processor
• High-end GPU with CUDA support

Operating System (Development)        Ubuntu 18.04, 64-bit
Language                              Python 3.7
Deep Learning Framework               TensorFlow (Keras backend)
Third-party software and libraries    NumPy, SciKit, Matplotlib, MoviePy, YouTubeTranscriptApi
4 PROPOSED METHODOLOGY

NMT-based models follow a sequence-to-sequence architecture built on two fundamental components, the encoder and the decoder. The task of the encoder is to generate a real-valued vector representation of the sentence, called the summary vector or context vector, that gives a meaningful, relation-based representation of the input sentence. Ideally, the context vector should represent all the information present in the source sentence.

The task of the decoder is to use this context vector to generate the translated text: the decoder processes the context vector to generate the target-language sentence word by word, such that all the meaning present in the source sentence is transferred to it. The decoder also has two variants that differ slightly between training and inference. We have used the LSTM as the basic recurrent unit because of its advantages over a simple RNN: it is able to remember the context of long sentences more easily.
5 IMPLEMENTATION DETAILS

This chapter describes the details of the implementation of the YouTube Subtitle Translator application.

The software and tools used are described in Section 5.1. The language translation model used in the project is described in Section 5.2.

5.1 Software and Tools

The project runs on a laptop with an Intel Core i5 processor, configured with Windows 10, 64-bit. The system has 8 GB of RAM and a 6 GB GPU. The entire project has been developed in Python 3.7. The tools and libraries used in Python for the various tasks are summarized in Table 5-1.

Task                              Library / Framework       Version
YouTube Subtitle Extraction       YouTubeTranscriptApi      4.1
Language Translation Framework    TensorFlow                2.0
Vector Operations                 NumPy                     1.17
Adding Subtitles to Video         MoviePy                   2.0

Table 5-1: Python libraries and their versions for the various tasks used in the project

5.2 Language Translation Model

5.2.1 Without Attention Mechanism

The sequence-to-sequence encoder-decoder architecture works by mapping a source-language sentence to a target-language sentence. The source-language sentence is the input to the encoder, and the target-language sentence is the final output that the model gives after training. The steps of the algorithm used to achieve the translation task are listed below; a small sketch of the data-preparation steps follows the list. During training we use a technique called teacher forcing: since we have all the source and target data, the ground-truth target sequence is fed to the decoder during the training phase.

• Load the data from the file downloaded from the manythings.org website, containing the source and target sentences.

• Perform data pre-processing, such as lower-casing the sentences and removing extra spaces, digits and quotes.

• Add start and end tags to the target sentences, so that the decoder knows when to start and stop decoding, in both the training and the translation phase.

• Build word-to-index and index-to-word dictionaries for both the source and target languages: the LSTM does not understand text but vectors, so every word needs a numeric representation.

• Shuffle the data for a random train/test split.

• Split the dataset into training and test data.

• Build the encoder and decoder architecture and then compile the model.

• Fit the data to the encoder and decoder using the fit_generator function.

• Make predictions from the model.
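
The following is a minimal sketch of these data-preparation steps using standard Python libraries; the file name (spa.txt), the start/end tag strings, the exact cleaning rules and the split ratio are illustrative assumptions rather than the project's actual code.

import re
import numpy as np
from sklearn.model_selection import train_test_split

# Load the tab-separated English-Spanish pairs (assumed file: spa.txt from manythings.org)
pairs = []
with open("spa.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().split("\t")
        if len(parts) >= 2:
            pairs.append((parts[0], parts[1]))

def clean(sentence):
    # Lower-case the sentence and remove digits, quotes and extra spaces
    sentence = sentence.lower()
    sentence = re.sub(r"[\d'\"]", "", sentence)
    return re.sub(r"\s+", " ", sentence).strip()

# Tag the target sentences so the decoder knows where to start and stop decoding
source = [clean(src) for src, _ in pairs]
target = ["<start> " + clean(tgt) + " <end>" for _, tgt in pairs]

# Word-to-index and index-to-word dictionaries for both languages
def build_vocab(sentences):
    word2idx = {"<pad>": 0}
    for sent in sentences:
        for word in sent.split():
            word2idx.setdefault(word, len(word2idx))
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

src_word2idx, src_idx2word = build_vocab(source)
tgt_word2idx, tgt_idx2word = build_vocab(target)

# Shuffle and split the dataset into training and test data
src_train, src_test, tgt_train, tgt_test = train_test_split(
    source, target, test_size=0.2, random_state=42)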
Building the Model

Building the model is divided into two parts, the encoder and the decoder, both using LSTMs. The hidden state and cell state of the encoder's last timestep are passed, as the context vector, to the decoder's first timestep; during training the actual target sentence is also fed to the decoder, while during inference the decoder generates the output itself.

Build the Encoder

While building the encoder there are some preliminary steps. We create a TensorFlow input layer through which we pass the word IDs of the sentence (this is why the word-to-index dictionary was created), and then attach an embedding layer, which becomes the first hidden layer of the encoder network.
The task of the embedding layer is to give a contextual vector representation for every unique word of the source language in the training data; a further advantage of the embedding layer is that it also preserves semantic relationships.

We then create an LSTM layer with the parameter return_state set to True, because in the encoder we want to generate the context vector, which we obtain at the last timestep of the encoder. A sketch of this encoder, together with the decoder used during training, is shown below.
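
The following is a minimal sketch of the training-time encoder-decoder in Keras; the vocabulary sizes, embedding and LSTM dimensions, optimizer and loss are assumed values for illustration, not the project's actual hyper-parameters.

from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

src_vocab, tgt_vocab = 10000, 12000   # assumed vocabulary sizes
embed_dim, latent_dim = 256, 512      # assumed embedding / LSTM sizes

# Encoder: word IDs -> embeddings -> LSTM; only the final states are kept as the context vector
enc_inputs = Input(shape=(None,), name="encoder_inputs")
enc_embed = Embedding(src_vocab, embed_dim, mask_zero=True)(enc_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True, name="encoder_lstm")(enc_embed)
encoder_states = [state_h, state_c]

# Decoder (training, teacher forcing): the ground-truth target sequence is the input,
# and the encoder's final states initialise the decoder LSTM
dec_inputs = Input(shape=(None,), name="decoder_inputs")
dec_embed_layer = Embedding(tgt_vocab, embed_dim, mask_zero=True)
dec_embed = dec_embed_layer(dec_inputs)
dec_lstm = LSTM(latent_dim, return_sequences=True, return_state=True, name="decoder_lstm")
dec_outputs, _, _ = dec_lstm(dec_embed, initial_state=encoder_states)
dec_dense = Dense(tgt_vocab, activation="softmax", name="decoder_dense")
dec_outputs = dec_dense(dec_outputs)

model = Model([enc_inputs, dec_inputs], dec_outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy", metrics=["accuracy"])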
Steps for inference

• In inference, the architecture and processing of the encoder stay the same: it produces the context vector when we pass it the input sequence. The decoder phase changes slightly: it accepts the context vector together with the start-tag ID and generates the target-language sentence word by word, until the end tag is produced or the maximum target sentence length is reached. A sketch of this greedy decoding loop is given below.
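
A minimal sketch of the inference setup, assuming the layers defined in the training sketch above and the dictionaries from the data-preparation sketch; the tag strings and the maximum length are illustrative assumptions.

import numpy as np
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

# Stand-alone encoder: input sequence -> context vector (final hidden and cell states)
encoder_model = Model(enc_inputs, encoder_states)

# Stand-alone decoder: previous word + previous states -> next-word distribution + new states
dec_state_h = Input(shape=(latent_dim,))
dec_state_c = Input(shape=(latent_dim,))
dec_states_in = [dec_state_h, dec_state_c]
dec_embed_inf = dec_embed_layer(dec_inputs)
dec_out_inf, h_out, c_out = dec_lstm(dec_embed_inf, initial_state=dec_states_in)
dec_probs_inf = dec_dense(dec_out_inf)
decoder_model = Model([dec_inputs] + dec_states_in, [dec_probs_inf, h_out, c_out])

def translate(input_seq, max_len=30):
    states = encoder_model.predict(input_seq)             # context vector from the encoder
    target_word = np.array([[tgt_word2idx["<start>"]]])   # start decoding from the start tag
    decoded = []
    for _ in range(max_len):                               # stop at max target sentence length
        probs, h, c = decoder_model.predict([target_word] + states)
        next_id = int(np.argmax(probs[0, -1, :]))
        word = tgt_idx2word[next_id]
        if word == "<end>":                                # or stop when the end tag is produced
            break
        decoded.append(word)
        target_word = np.array([[next_id]])
        states = [h, c]
    return " ".join(decoded)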
5.2.2 With Attention Mechanism

A Seq2Seq model with an attention mechanism consists of an encoder, a decoder and an attention layer.
There are three important parts in the attention mechanism: the alignment layer, the attention weights and the context vector.

The attention weights are defined so that, when predicting a word, the model focuses on (gives attention to) the relevant parts of the source: each weight is computed by comparing the decoder hidden state from the timestep before the one at which we are predicting the word with the encoder hidden states corresponding to that weight.

Finally, we normalize the weights with the softmax function so that they lie between 0 and 1.
In this way the neural network between the encoder and the decoder learns these weights such that, when predicting a word, it learns which source words to focus on and gives more attention to those words.

At each decoding step the decoder therefore uses:

• the context vector (cᵢ),
• the decoder's output from the previous time step (yᵢ₋₁), and
• the previous decoder hidden state (sᵢ₋₁).

A sketch of such an attention layer is given below.
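
The following is a minimal sketch of a Bahdanau-style (additive) attention layer in Keras, as an illustration of the idea described above rather than the project's exact code; the unit size and variable names are assumptions.

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: scores the encoder hidden states against the previous decoder state."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the encoder hidden states
        self.W2 = tf.keras.layers.Dense(units)  # projects the previous decoder state
        self.V = tf.keras.layers.Dense(1)       # alignment score per source position

    def call(self, decoder_state, encoder_outputs):
        # decoder_state: (batch, dec_units); encoder_outputs: (batch, src_len, enc_units)
        decoder_state = tf.expand_dims(decoder_state, 1)           # (batch, 1, dec_units)
        score = self.V(tf.nn.tanh(self.W1(encoder_outputs) + self.W2(decoder_state)))
        attention_weights = tf.nn.softmax(score, axis=1)           # normalized to 0-1 over source positions
        context_vector = tf.reduce_sum(attention_weights * encoder_outputs, axis=1)
        return context_vector, attention_weights                   # (batch, enc_units), (batch, src_len, 1)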

Model Building & Fitting without attention

6 INTERMEDIATE RESULTS AND DISCUSSION


This chapter presents the intermediate results obtained with the procedures elaborated in the previous chapter, and also describes the dataset used for the project and the evaluation metrics considered.

6.1 Datasets

The dataset used for training the language translation model is taken from a website that provides parallel translation corpora for almost all language pairs.

Data Source:
http://www.manythings.org/anki/

6.2 Results

This section presents the results of running the model with and without attention on English-to-Spanish translation, and of merging the subtitles back into the video using the MoviePy library.

6.2.1 Model Translation Accuracy With and Without Attention

Model                Dataset       Accuracy
Without Attention    Train         0.7750
                     Validation    0.6126
With Attention       Train         0.9684
                     Validation    0.9690

6.2.2 Application Pipeline Results

Transcript Scraping
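
For reference, the following is a minimal sketch of how the pipeline stages could be wired together: transcript scraping with youtube_transcript_api, translation of each line, and overlaying the translated subtitles with MoviePy. The function translate_sentence, the styling values and the MoviePy 1.x-style calls shown here are illustrative assumptions rather than the project's verified code (MoviePy's API differs between versions).

from youtube_transcript_api import YouTubeTranscriptApi
from moviepy.editor import VideoFileClip, TextClip, CompositeVideoClip

def build_subtitled_video(video_id, video_path, out_path):
    # 1. Scrape the English subtitles for the given YouTube video ID
    transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])

    video = VideoFileClip(video_path)
    clips = []
    for entry in transcript:
        # 2. Translate each subtitle line (assumed wrapper that tokenizes the text
        #    and calls the trained seq2seq model)
        spanish_text = translate_sentence(entry["text"])
        # 3. Synchronize the translated line with the video timeline
        subtitle = (TextClip(spanish_text, fontsize=32, color="white")
                    .set_start(entry["start"])
                    .set_duration(entry["duration"])
                    .set_position(("center", "bottom")))
        clips.append(subtitle)

    # 4. Merge the translated subtitles back into the video
    final = CompositeVideoClip([video] + clips)
    final.write_videofile(out_path)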

7 CONCLUSIONS AND FUTURE WORK

7.1 Conclusions

Although the project is still in its initial stage, some conclusions can be drawn from the work carried out so far. The overall system architecture, along with each component, the modules and their functionalities, is clearly defined. Each has a specific role to play in the process and has been designed to be optimal in its functionality.

Computationally, the pipeline described in this project meets the requirements quite well. All parts of the pipeline are built to consume few resources and to run fast. The expectation is that even if the same pipeline is moved to an embedded system, there would not be a significant increase in the time taken.

The language translation has been done from English to Spanish using the seq2seq encoder-decoder technique, achieving a training accuracy of 77% and a testing accuracy of 61%. As per the suggestions from the third review, work is in progress on an advanced language translation model with an attention mechanism.
In future, we plan to train the best model on English to German as well, and to build an application that gives users the option of translating their YouTube subtitles into Spanish or German.

7.2 Future Work

Although there are still some open questions, a few of the targeted goals have been met. The software implementation is nearing completion, and the work for the next phase can begin. The following items are being targeted for the subsequent stage:

a) Improving the accuracy of language translation by using a pre-trained sentence transformer through transfer learning.
b) Making the application work with any video, even one without existing subtitles, by using speech recognition.

REFERENCES

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. http://arxiv.org/abs/1409.0473.

[2] Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. Association for Computational Linguistics, Seattle.

[3] Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), pp. 263-311.

[4] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation.

[5] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation.
