Second Progress Report
On
Text Summarization
Bachelor of Technology
(Computer Science and Engineering)
Submitted By
To develop a Deep Learning Model using Transformer architecture for Abstractive Text Summarization
PROGRESS:
OBJECTIVE ACHIEVED:
1. We implemented Multi-Headed Attention and a Feed-Forward Neural Network. The input is split into
multiple heads, and after processing, the outputs of all the heads are concatenated.
2. We built the fundamental units of the encoder and decoder, and expanded these into 4 encoder/decoder
layers.
3. We stacked all the intermediate layers in a custom Model class.
4. We applied a custom learning rate scheduler that helps the model converge faster.
5. We trained the model with Sparse Categorical Cross-Entropy loss and the Adam optimizer.
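The head split and concatenation described in point 1 can be sketched in NumPy. The head count (4) and model dimension (128) below are illustrative assumptions, not the report's actual hyperparameters.

```python
import numpy as np

def split_heads(x, num_heads):
    # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_model // num_heads)
    batch, seq_len, d_model = x.shape
    x = x.reshape(batch, seq_len, num_heads, d_model // num_heads)
    return x.transpose(0, 2, 1, 3)

def combine_heads(x):
    # Inverse of split_heads: concatenate the per-head outputs back into d_model.
    batch, num_heads, seq_len, depth = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch, seq_len, num_heads * depth)

x = np.random.rand(2, 10, 128)        # (batch, seq_len, d_model)
heads = split_heads(x, num_heads=4)   # (2, 4, 10, 32)
restored = combine_heads(heads)       # (2, 10, 128)
```

Splitting and recombining are exact inverses, so no information is lost; each head simply attends over a lower-dimensional slice of the representation.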
ADDITIONAL WORK:
PROPOSED ARCHITECTURE
Most competitive neural sequence transduction models have an encoder-decoder structure [5, 2, 35]. Here, the
encoder maps an input sequence of symbol representations (x1, ..., xn) to a sequence of continuous representations z
= (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) of symbols one element at a time.
At each step the model is auto-regressive [10], consuming the previously generated symbols as additional input
when generating the next. The Transformer follows this overall architecture using stacked self-attention and point-
wise, fully connected layers for both the encoder and decoder.
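The auto-regressive generation loop described above can be sketched as follows. Here `encode` and `decode_step` are hypothetical stand-ins for the trained encoder and decoder; in the real model both are neural networks.

```python
def greedy_decode(encode, decode_step, src_tokens, start_id, end_id, max_len=50):
    z = encode(src_tokens)             # continuous representations (z1, ..., zn)
    out = [start_id]
    for _ in range(max_len):
        next_id = decode_step(z, out)  # consumes previously generated symbols
        out.append(next_id)
        if next_id == end_id:          # stop once the end token is emitted
            break
    return out
```

The key property is that each step feeds all previously generated symbols back into the decoder, which is what makes the model auto-regressive.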
Fig. 1. Architecture of the model

Encoder and Decoder Blocks

1. The first step in calculating self-attention is to create three vectors from each of the encoder's input
vectors (in this case, the embedding of each word). So for each word, we create a Query vector, a Key
vector, and a Value vector. These vectors are created by multiplying the embedding by three matrices that
we trained during the training process.
2. The second step in calculating self-attention is to calculate a score. Say we’re calculating the self-attention
for the first word in this example, “Thinking”. We need to score each word of the input sentence against this
word. The score determines how much focus to place on other parts of the input sentence as we encode a
word at a certain position.
3. The third and fourth steps are to divide the scores by 8 (the square root of the dimension of the key
vectors used in the paper, 64; this leads to more stable gradients, and while other values are possible,
this is the default), then pass the result through a softmax operation. Softmax normalizes the scores so
they are all positive and add up to 1.
Custom Learning Rate
The Transformer paper also suggests training with a custom learning rate scheduler, which helps the model
converge faster.
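The schedule from the Transformer paper is lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5): the rate rises linearly during warmup, then decays with the inverse square root of the step. A minimal sketch (the d_model and warmup values below are illustrative assumptions):

```python
def custom_lr(step, d_model=128, warmup_steps=4000):
    # Linear warmup for the first warmup_steps, then inverse-sqrt decay.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The two branches of the `min` intersect exactly at `step == warmup_steps`, which is where the learning rate peaks.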
EXPECTED RESULT:
Our model is expected to take less computation time and fewer resources than other approaches, and to
produce more accurate summaries than those approaches.
REFERENCES:
4. https://machinelearningmastery.com/how-does-attention-work-in-encoder-decoder-recurrent-neural-networks/
5. https://medium.com/analytics-vidhya/https-medium-com-understanding-attention-mechanism-natural-language-processing-9744ab6aed6a