Attention Mechanism for Image Processing
Table of Contents
1. Introduction
3. Working of Bahdanau Attention Model
4. Image Captioning
Image Captioning using Attention Mechanism

The encoder-decoder image captioning system encodes the image into a hidden state with a pre-trained Convolutional Neural Network. An LSTM then decodes this hidden state to generate a caption.
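At a high level, the pipeline has just these two stages. Here is a minimal sketch of that shape (the layer sizes and dummy tensors are illustrative, not the article's actual model, which is built step by step below):

import tensorflow as tf

# encoder: a pre-trained CNN compresses the image into a single feature vector
cnn = tf.keras.applications.VGG16(include_top=False, pooling='avg', weights='imagenet')
img = tf.zeros((1, 224, 224, 3))            # stand-in for a preprocessed image
img_vec = cnn(img)                          # shape (1, 512)

# decoder: an LSTM seeded with that vector emits the caption one word at a time
units = 256
h0 = tf.keras.layers.Dense(units)(img_vec)  # project the image vector to the LSTM size
lstm = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
words = tf.zeros((1, 5, 128))               # stand-in for embedded caption words
out, h, c = lstm(words, initial_state=[h0, tf.zeros_like(h0)])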
Working of Bahdanau Attention Model

Bahdanau, or local, attention focuses on only a few source positions. Global attention is computationally expensive because it attends to every source-side word for every target word. To compensate for this shortcoming, local attention chooses to focus on only a small subset of the encoder's hidden states per target word.
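Concretely, Bahdanau attention scores each encoder state h_i against the decoder state s with score = v^T tanh(W s + U h_i), then softmax-normalizes the scores into weights and takes a weighted sum. A minimal TensorFlow sketch of that scoring step (the class and layer names here are illustrative, not the article's model):

import tensorflow as tf

class AdditiveAttention(tf.keras.layers.Layer):
    # Bahdanau-style additive attention: score = v^T tanh(W s + U h)
    def __init__(self, units):
        super().__init__()
        self.W = tf.keras.layers.Dense(units)  # projects the decoder state s
        self.U = tf.keras.layers.Dense(units)  # projects the encoder states h_i
        self.v = tf.keras.layers.Dense(1)      # reduces each position to a scalar score

    def call(self, features, hidden):
        # features: (batch, num_positions, depth); hidden: (batch, units)
        hidden = tf.expand_dims(hidden, 1)                          # broadcast over positions
        score = self.v(tf.nn.tanh(self.W(hidden) + self.U(features)))
        weights = tf.nn.softmax(score, axis=1)                      # one weight per position
        context = tf.reduce_sum(weights * features, axis=1)         # weighted sum of features
        return context, weights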
Image Captioning

I will be implementing the model using the Flickr8k dataset. The dataset has 8,000 different images, and each image has five different captions.
Importing libraries

import numpy as np
import pandas as pd
import string
from numpy import array
from PIL import Image
from pickle import load
import pickle
import matplotlib.pyplot as plt
from collections import Counter
import sys, time, os, warnings
warnings.filterwarnings("ignore")
from tqdm import tqdm
import re
import keras
from nltk.translate.bleu_score import sentence_bleu
import tensorflow as tf
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.image import load_img  # needed below to display images
from tensorflow.keras.utils import to_categorical, plot_model
from keras.models import Model
from keras.layers import Input, Dense, BatchNormalization, LSTM, Embedding, Dropout
from keras.layers.merge import add
from keras.applications.vgg16 import VGG16, preprocess_input
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
Loading Data

I am using Google Colab for training the model, and I am loading the dataset directly from Google Drive.

image_path = "/content/drive/MyDrive/Datasets/archive/Images"
dir_Flickr_text = "/content/drive/MyDrive/Datasets/archive/captions.txt"
jpgs = os.listdir(image_path)
Data Preprocessing

First, we will make a dataframe of image names associated with their captions.

file = open(dir_Flickr_text, 'r')
text = file.read()
file.close()

datatxt = []
i = 0
for line in text.split('\n'):
    try:
        col = line.split('\t')
        col = col[0].split(',')
        w = col[0].split("#")
        if i == 0:   # skip the header row of captions.txt
            i += 1
            continue
        i += 1
        datatxt.append(w + [col[1].lower()])
    except:
        continue

data = pd.DataFrame(datatxt, columns=["filename", "caption"])
data = data[data.filename != '2258277193_586949ec62.jpg.1']
uni_filenames = np.unique(data.filename.values)

data.head()
Next, we will display a few images along with their respective 5 captions.
npc = 5
npx = 224
t_sz = (npx, npx, 3)
count = 1

Figure = plt.figure(figsize=(10, 20))
for i in uni_filenames[10:14]:
    fname = image_path + '/' + i
    captions = list(data["caption"].loc[data["filename"] == i].values)
    image_load = load_img(fname, target_size=t_sz)
    axs = Figure.add_subplot(npc, 2, count, xticks=[], yticks=[])
    axs.imshow(image_load)
    count += 1
    axs = Figure.add_subplot(npc, 2, count)
    plt.axis('off')
    axs.plot()
    axs.set_xlim(0, 1)
    axs.set_ylim(0, len(captions))
    for j, caption in enumerate(captions):  # j, so the outer loop variable i is not clobbered
        axs.text(0, j, caption, fontsize=20)
    count += 1
plt.show()
vocabulary = []
for txt in data.caption.values:
    vocabulary.extend(txt.split())
print('Vocabulary is of size: %d' % len(set(vocabulary)))
def punctuation_removal(text_original):
    # strip punctuation characters; this helper's definition was lost in the
    # source listing, so a standard string.punctuation implementation is assumed
    return text_original.translate(str.maketrans('', '', string.punctuation))

def single_character_removal(text):
    tlmt1 = ""
    for word in text.split():
        if len(word) > 1:
            tlmt1 += " " + word
    return tlmt1

def number_removal(text):
    tnn = ""
    for word in text.split():
        if word.isalpha():
            tnn += " " + word
    return tnn

def text_cleaner(text_original):
    text = punctuation_removal(text_original)
    text = single_character_removal(text)
    text = number_removal(text)
    return text

clean = []
for txt in data.caption.values:
    clean.extend(text_cleaner(txt).split())  # apply the cleaner so the count reflects the cleaned vocabulary
print('Clean Vocabulary Size: %d' % len(set(clean)))
PATH = "/content/drive/MyDrive/Datasets/archive/Images/"

total_captions = []
for cp in data["caption"].astype(str):
    cp = '<start> ' + cp + ' <end>'
    total_captions.append(cp)
total_captions[:10]

img_vectors = []
for annotations in data["filename"]:
    image_paths = PATH + annotations
    img_vectors.append(image_paths)
img_vectors[:10]

print(f"len(img_vectors) : {len(img_vectors)}")
print(f"len(total_captions) : {len(total_captions)}")

len(img_vectors) : 40455
len(total_captions) : 40455

def data_limiter(nums, tc, imv):
    training_captions, image_vector = shuffle(tc, imv, random_state=1)
    training_captions = training_captions[:nums]
    image_vector = image_vector[:nums]
    return training_captions, image_vector

train_captions, img_name_vector = data_limiter(40000, total_captions, img_vectors)
Model Making

Let's use VGG16 to define the image feature extraction model. It's important to note that we don't need to classify the images here; all we need to do is extract an image vector. As a result, the softmax layer is removed from the model. Before feeding the photos into the model, we must preprocess them all to the same size, 224×224.
def load_image(path):
    image = tf.io.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize(image, (224, 224))
    image = preprocess_input(image)
    return image, path

image_model = tf.keras.applications.VGG16(include_top=False, weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output

image_features_extract_model = tf.keras.Model(new_input, hidden_layer)
image_features_extract_model.summary()
encode_train = sorted(set(img_name_vector))
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(load_image,
    num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(64)

%%time
for img, path in tqdm(image_dataset):
    batch_features = image_features_extract_model(img)
    batch_features = tf.reshape(batch_features,
                                (batch_features.shape[0], -1, batch_features.shape[3]))
    for bf, p in zip(batch_features, path):
        path_of_feature = p.numpy().decode("utf-8")
        np.save(path_of_feature, bf.numpy())
Now we will tokenize the captions and build a vocabulary of the 5,000 most frequent words in the data. Words that are not in the vocabulary will be marked as <unk>.

topk = 5000
tkn = tf.keras.preprocessing.text.Tokenizer(num_words=topk,
                                            oov_token="<unk>",
                                            filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')
tkn.fit_on_texts(train_captions)
train_seqs = tkn.texts_to_sequences(train_captions)

tkn.word_index['<pad>'] = 0
tkn.index_word[0] = '<pad>'

train_seqs = tkn.texts_to_sequences(train_captions)
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')

train_captions[:3]
train_seqs[:3]

def max_sz(tensor):
    return max(len(t) for t in tensor)
mx_l = max_sz(train_seqs)

def min_sz(tensor):
    return min(len(t) for t in tensor)
min_l = min_sz(train_seqs)
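To see what the tokenizer is doing, here is a quick sanity check (the sample caption is hypothetical; any caption string works):

sample = '<start> a dog runs through the grass <end>'
seq = tkn.texts_to_sequences([sample])[0]      # words mapped to integer ids
print(seq)
print([tkn.index_word[i] for i in seq])        # ids mapped back; rare words become <unk>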
Training Model

Now we will split the data using train_test_split.

img_name_train, img_name_val, cap_train, cap_val = train_test_split(
    img_name_vector, cap_vector, test_size=0.2, random_state=0)
Next, let's create a tf.data dataset to use for training our model.

def map_function(img_name, cap):
    tensor_img = np.load(img_name.decode('utf-8') + '.npy')
    return tensor_img, cap

BATCH_SIZE = 64     # these two hyperparameters are not defined in the source
BUFFER_SIZE = 1000  # listing, so typical values are assumed here

dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))
# load the saved .npy features through map_function; this step is implied by the
# definition above but was missing from the source listing
dataset = dataset.map(lambda item1, item2: tf.numpy_function(
        map_function, [item1, item2], [tf.float32, tf.int32]),
    num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
class VGG16_Encoder(tf.keras.Model):
    # passes the extracted image features through a fully connected layer
    def __init__(self, embedding_dim):
        super(VGG16_Encoder, self).__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim)
        self.dropout = tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)

    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x
Defining RNN

def rnn_type(units):
    # use the CuDNN-accelerated LSTM when a GPU is available, otherwise a GRU
    if tf.test.is_gpu_available():
        return tf.compat.v1.keras.layers.CuDNNLSTM(units,
                                                   return_state=True,
                                                   return_sequences=True,
                                                   recurrent_initializer='glorot_uniform')
    else:
        return tf.keras.layers.GRU(units,
                                   return_state=True,
                                   return_sequences=True,
                                   recurrent_activation='sigmoid',
                                   recurrent_initializer='glorot_uniform')
Defining Decoder

Next, we define the decoder, which applies Bahdanau-style additive attention over the image features. Only the attention layers, the output layer, and reset_state survive in the source listing; the rest of the class is reconstructed as the standard Bahdanau decoder those layers imply.

class Rnn_Local_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(Rnn_Local_Decoder, self).__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
        self.fc1 = tf.keras.layers.Dense(self.units)
        self.fc2 = tf.keras.layers.Dense(vocab_size)

        # Attention Mechanism
        self.Uattn = tf.keras.layers.Dense(units)
        self.Wattn = tf.keras.layers.Dense(units)
        self.Vattn = tf.keras.layers.Dense(1)

    def call(self, x, features, hidden):
        # additive attention: score = Vattn(tanh(Uattn(features) + Wattn(hidden)))
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        score = self.Vattn(tf.nn.tanh(
            self.Uattn(features) + self.Wattn(hidden_with_time_axis)))
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)

        # feed the previous word together with the attention context to the GRU
        x = self.embedding(x)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        x = self.fc1(output)
        x = tf.reshape(x, (-1, x.shape[2]))
        x = self.fc2(x)
        return x, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))
embedding_dim = 256     # these three values are assumed (typical choices);
units = 512             # their definitions were lost in the source listing
vocab_size = topk + 1

encoder = VGG16_Encoder(embedding_dim)
decoder = Rnn_Local_Decoder(embedding_dim, units, vocab_size)

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')
Training Loop
loss_plot = []

def loss_function(real, pred):
    # mask out <pad> positions so they do not contribute to the loss
    # (this helper was lost in the source; a standard masked loss is assumed)
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    loss_ *= tf.cast(mask, dtype=loss_.dtype)
    return tf.reduce_mean(loss_)

@tf.function
def train_step(img_tensor, target):
    loss = 0
    hidden = decoder.reset_state(batch_size=target.shape[0])
    dec_input = tf.expand_dims([tkn.word_index['<start>']] * BATCH_SIZE, 1)
    # the body below was cut off in the source; a standard teacher-forcing step
    # consistent with the calls made in the training loop is assumed
    with tf.GradientTape() as tape:
        features = encoder(img_tensor)
        for i in range(1, target.shape[1]):
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_function(target[:, i], predictions)
            dec_input = tf.expand_dims(target[:, i], 1)  # teacher forcing
    total_loss = loss / int(target.shape[1])
    trainable_variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, trainable_variables)
    optimizer.apply_gradients(zip(gradients, trainable_variables))
    return loss, total_loss
EPOCHS = 20
num_steps = len(img_name_train) // BATCH_SIZE  # assumed; not defined in the source listing

for epoch in range(0, EPOCHS):
    start = time.time()
    total_loss = 0
    for (batch, (img_tensor, target)) in enumerate(dataset):
        batch_loss, t_loss = train_step(img_tensor, target)
        total_loss += t_loss
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(
                epoch + 1, batch, batch_loss.numpy() / int(target.shape[1])))
    loss_plot.append(total_loss / num_steps)
    print('Epoch {} Loss {:.6f}'.format(epoch + 1, total_loss / num_steps))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))
Once the model is trained, we can generate a caption for a new image with greedy decoding. The opening of evaluate() was cut off in the source; it is reconstructed here around the surviving decoding loop, so treat the first few lines as assumptions.

def evaluate(image):
    attention_features_shape = 49   # 7x7 VGG16 feature map, flattened (assumed)
    max_length = mx_l               # longest training caption (assumed)
    ap = np.zeros((max_length, attention_features_shape))
    hdn = decoder.reset_state(batch_size=1)

    # extract and encode the image features, exactly as during training
    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val,
                                (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))
    ftrs = encoder(img_tensor_val)

    dec_input = tf.expand_dims([tkn.word_index['<start>']], 0)
    result = []

    for i in range(max_length):
        predictions, hdn, attention_weights = decoder(dec_input, ftrs, hdn)
        ap[i] = tf.reshape(attention_weights, (-1, )).numpy()
        predicted_id = tf.argmax(predictions[0]).numpy()
        result.append(tkn.index_word[predicted_id])
        if tkn.index_word[predicted_id] == '<end>':
            return result, ap
        dec_input = tf.expand_dims([predicted_id], 0)

    ap = ap[:len(result), :]
    return result, ap
Now let's pick a random validation image and compare the real caption with the prediction.

r = np.random.randint(0, len(img_name_val))
photo = img_name_val[r]
start = time.time()
real_caption = ' '.join([tkn.index_word[i] for i in cap_val[r] if i not in [0]])
result, attention_plot = evaluate(photo)

real_appn = []
real_appn.append(real_caption.split())
reference = real_appn
candidate = result   # the predicted caption tokens

print('Real Caption:', real_caption)
Image.open(img_name_val[r])
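sentence_bleu was imported at the top but the scoring line itself did not survive the extract; a sketch of how the reference/candidate pair above would typically be scored:

from nltk.translate.bleu_score import sentence_bleu

# reference: a list of tokenized reference captions; candidate: predicted tokens
score = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(f"BLEU-4 score * 100: {score * 100:.2f}")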
FAQs

1. What are Attention Models?
Attention models, also known as attention mechanisms, are input-processing techniques for neural networks that allow the network to focus on specific parts of a complex input, one at a time, until the entire dataset is categorized.
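As a toy illustration of the idea, attention boils down to softmax-normalized relevance scores used to weight the input (the numbers below are made up):

import numpy as np

scores = np.array([2.0, 0.5, 0.1])                         # relevance of three input regions
weights = np.exp(scores) / np.exp(scores).sum()            # softmax: weights sum to 1
regions = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # toy feature vectors
context = weights @ regions                                # attention-weighted summary
print(weights.round(3), context.round(3))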
Takeaways

In this article, we have discussed the following topics:

Attention mechanism
Working of Bahdanau attention model
Implementation of an image captioning model

Want to learn more about Machine Learning? Here is an excellent course that can guide you in learning.

Happy Coding!