
Getting Hands-On with BERT

In this chapter, we will learn how to use the pre-trained BERT model in
detail. First, we will look at the different configurations of the pre-trained
BERT model open sourced by Google. Then, we will learn how to use the
pre-trained BERT model as a feature extractor. We will also explore
Hugging Face's transformers library and learn how to use it to extract em-
beddings from the pre-trained BERT.

Moving on, we will understand how to extract embeddings from all en-
coder layers of BERT. Next, we will learn how to fine-tune the pre-trained
BERT model for the downstream tasks. First, we will learn to fine-tune the
pre-trained BERT model for a text classification task. Next, we will learn
to fine-tune BERT for sentiment analysis tasks using the transformers library.
Then, we will look into fine-tuning the pre-trained BERT model for natu-
ral language inference, question answering tasks, and named entity
recognition tasks.

In this chapter, we will learn the following topics:

Exploring the pre-trained BERT model


Extracting embeddings from pre-trained BERT
Extracting embeddings from all encoder layers of BERT
Fine-tuning BERT for downstream tasks

Exploring the pre-trained BERT model

In Chapter 2, Understanding the BERT Model, we learned how to pre-train


BERT using masked language modeling and next-sentence prediction
tasks. But pre-training BERT from scratch is computationally expensive.
So, we can download the pre-trained BERT model and use it. Google has
open sourced the pre-trained BERT model and we can download it from
Google Research's GitHub repository – https://github.com/google-research/
bert. They have released the pre-trained BERT model with various config-
urations, as shown in the following figure, where L denotes the number of encoder layers and H denotes the size of the hidden unit (representation size):

Figure 3.1 – Different configurations of pre-trained BERT as provided by Google (https://github.com/google-research/bert)

The pre-trained model is also available in the BERT-uncased and BERT-


cased formats. In BERT-uncased, all the tokens are lowercased, but in
BERT-cased, the tokens are not lowercased and are used directly for train-
ing. Okay, which pre-trained BERT model should we use? BERT-cased or
BERT-uncased? The BERT-uncased model is the one that is most com-
monly used, but if we are working on certain tasks such as Named Entity
Recognition (NER) where we have to preserve the case, then we should
use the BERT-cased model. Along with these, Google also released pre-
trained BERT models trained using the whole word masking
method. Okay, but how exactly can we use the pre-trained BERT model?

We can use the pre-trained model in the following two ways:

As a feature extractor by extracting embeddings


By fine-tuning the pre-trained BERT model on downstream tasks such
as text classification, question-answering, and more

In the upcoming sections, we will learn how to use the pre-trained BERT
model as a feature extractor by extracting embeddings, and we will also
learn how to fine-tune the pre-trained BERT model for downstream tasks
in detail.

Extracting embeddings from pre-trained BERT

Let's learn how to extract embeddings from pre-trained BERT with an ex-
ample. Consider a sentence – I love Paris. Say we need to extract the con-
textual embedding of each word in the sentence. To do this, first, we tok-
enize the sentence and feed the tokens to the pre-trained BERT
model, which will return the embeddings for each of the tokens. Apart
from obtaining the token-level (word-level) representation, we can also
obtain the sentence-level representation.

In this section, let's learn how exactly we can extract the word-level and
sentence-level embedding from the pre-trained BERT model in detail.

Let's suppose we want to perform a sentiment analysis task, and say we


have the dataset shown in the following figure:

Figure 3.2 – Sample dataset

As we can observe from the preceding table, we have sentences and their
corresponding labels, where 1 indicates positive sentiment and 0 indi-
cates negative sentiment. We can train a classifier to classify the senti-
ment of a sentence using the given dataset.
But we can't feed the given dataset directly to a classifier, since it contains raw text. So first, we need to vectorize the text. We can vectorize the text using methods such as TF-IDF, word2vec, and others. In the previous chapter, we learned that BERT produces contextual embeddings, unlike context-free embedding models such as word2vec. Now, we will see how to
use the pre-trained BERT model to vectorize the sentences in our dataset.

Let's take the first sentence in our dataset – I love Paris. First, we tokenize
the sentence using the WordPiece tokenizer and get the tokens (words).
After tokenizing the sentence, we have the following:

tokens = [I, love, Paris]

Now, we add the [CLS] token at the beginning and the [SEP] token at the end.
Thus, our tokens list becomes this:

tokens = [ [CLS], I, love, Paris, [SEP] ]

Similarly, we can tokenize all the sentences in our training set. But the
length of each sentence varies, right? Yes, and so does the length of the to-
kens. We need to keep the length of all the tokens the same. Say we keep
the length of the tokens to 7 for all the sentences in our dataset. If we look
at our preceding tokens list, the tokens length is 5. To make the tokens length 7,
we add a new token called [PAD]. Thus, now our tokens are as follows:

tokens = [ [CLS], I, love, Paris, [SEP], [PAD], [PAD] ]

As we can observe, now our tokens length is 7, as we have added two [PAD]
tokens. The next step is to make our model understand that the [PAD] token
is added only to match the tokens length and it is not part of the actual to-
kens. To do this, we introduce an attention mask. We set the attention
mask value to 1 in all positions and 0 to the position where we have a [PAD]
token, as shown here:

attention_mask = [ 1,1,1,1,1,0,0]

Next, we map all the tokens to unique token IDs. Suppose the following are the mapped token IDs:

token_ids = [101, 1045, 2293, 3000, 102, 0, 0]

This implies that ID 101 indicates the token [CLS], 1045 indicates the token I, 2293 indicates the token love, 3000 indicates the token Paris, and so on.

Now, we feed token_ids along with attention_mask as input to the pre-trained


BERT model and obtain the vector representation (embedding) of each of
the tokens. This will become clearer once we look into the code.

The following figure shows how we use the pre-trained BERT model to ob-
tain the embedding. For clarity, the tokens are shown instead of token
IDs. As we can see, once we feed the tokens as the input, encoder 1 com-
putes the representation of all the tokens and sends it to the next encoder,
which is encoder 2. Encoder 2 takes the representation computed by en-
coder 1 as input, computes its representation, and sends it to the next en-
coder, which is encoder 3. In this way, each encoder sends its representa-
tion to the next encoder above it. The final encoder, which is encoder 12,
returns the final representation (embedding) of all the tokens in our
sentence:

Figure 3.3 – Pre-trained BERT

As shown in the preceding figure, R_[CLS] is the embedding of the token [CLS], R_I is the embedding of the token I, R_love is the embedding of the token love, and so on. Thus, in this way, we can obtain the representation
of each of the tokens. These representations are basically the contextual-
ized word (token) embeddings. Say we are using the pre-trained BERT-
base model; in that case, the representation size of each token is 768.

We learned how to obtain the representation for each word in the sen-
tence I love Paris. But how do we obtain the representation of the com-
plete sentence?

We learned that we have prepended the [CLS] token to the beginning of


our sentence. The representation of the [CLS] token will hold the aggre-
gate representation of the complete sentence. So, we can ignore the em-
beddings of all other tokens and take the embedding of the [CLS] token
and assign it as a representation of our sentence. Thus, the representation
of our sentence I love Paris is just the representation of the [CLS] token, R_[CLS].

In a very similar fashion, we can compute the vector representation of all


the sentences in our training set. Once we have the sentence representa-
tion of all the sentences in our training set, we can feed those representa-
tions as input and train a classifier to perform a sentiment analysis task.

Note that using the representation of the [CLS] token as a sentence repre-
sentation is not always a good idea. A more effective way to obtain the representation of a sentence is to average or pool the representations of all the tokens. We will learn more about this in the upcoming chapters.

Now that we have learned how to use the pre-trained BERT model to ex-
tract an embedding (representation), in the next section, we will learn
how to do this using a library known as transformers.

Hugging Face transformers

Hugging Face is an organization that is on the path of democratizing AI


through natural language processing. Their open source transformers library is very
popular among the Natural Language Processing (NLP) community. It is
very useful and powerful for several NLP and Natural Language
Understanding (NLU) tasks. It includes thousands of pre-trained models
in more than 100 languages. One of the many advantages of the
transformers library is that it is compatible with both PyTorch and
TensorFlow.

We can install transformers directly using pip as shown here:

pip install transformers==3.5.1

As we can see, in this book, we use transformers version 3.5.1. Now that
we have installed transformers, let's get started.

Generating BERT embeddings

In this section, we will learn how to extract embeddings from the pre-
trained BERT model. Consider the sentence I love Paris. Let's see how to
obtain the contextualized word embedding of all the words in the sen-
tence using the pre-trained BERT model with Hugging Face's transformers
library. We can also access the complete code from the GitHub repository
of the book. In order to run the code smoothly, clone the GitHub reposi-
tory of the book and run the code using Google Colab.

First, let's import the necessary modules:

from transformers import BertModel, BertTokenizer


import torch

Next, we download the pre-trained BERT model. We can check all the
available pre-trained BERT models here – https://huggingface.co/transfor
mers/pre-trained_models.html. We use the 'bert-base-uncased' model. As the
name suggests, it is the BERT-base model with 12 encoders and it is
trained with uncased tokens. Since we are using BERT-base, the represen-
tation size will be 768.

Download and load the pre-trained bert-base-uncased model:

model = BertModel.from_pretrained('bert-base-uncased')

Next, we download and load the tokenizer that was used to pre-train
the bert-base-uncased model:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Now, let's see how to preprocess the input before feeding it to BERT.

Preprocessing the input

Define the sentence:

sentence = 'I love Paris'


Tokenize the sentence and obtain the tokens:

tokens = tokenizer.tokenize(sentence)

Let's print the tokens:

print(tokens)

The preceding code will print the following:

['i', 'love', 'paris']

Now, we will add the [CLS] token at the beginning and the [SEP] token at the
end of the tokens list:

tokens = ['[CLS]'] + tokens + ['[SEP]']

Let's look at our updated tokens list:

print(tokens)

The previous code will print the following:

['[CLS]', 'i', 'love', 'paris', '[SEP]']

As we can observe, we have a [CLS] token at the beginning and an [SEP] to-
ken at the end of our tokens list. We can also see that the length of our tokens list
is 5.

Say we need to keep the length of our tokens list to 7; in that case, we add
two [PAD] tokens at the end as shown in the following snippet:

tokens = tokens + ['[PAD]'] + ['[PAD]']

Let's print our updated tokens list:

print(tokens)

The preceding code will print the following:

['[CLS]', 'i', 'love', 'paris', '[SEP]', '[PAD]', '[PAD]']

As we can see, now we have the tokens list with [PAD] tokens and the length of
our tokens list is 7.

Next, we create the attention mask. We set the attention mask value to 1 if
the token is not a [PAD] token, else we set the attention mask to 0, as shown
here:
attention_mask = [1 if i != '[PAD]' else 0 for i in tokens]

Let's print attention_mask:

print(attention_mask)

The preceding code will print this:

[1, 1, 1, 1, 1, 0, 0]

As we can see, we have attention mask values 0 at positions where we have a


[PAD] token and 1 at other positions.

Next, we convert all the tokens to their token IDs as follows:

token_ids = tokenizer.convert_tokens_to_ids(tokens)

Let's have a look at token_ids:

print(token_ids)

The preceding code will print the following:

[101, 1045, 2293, 3000, 102, 0, 0]

From the output, we can observe that each token is mapped to a unique
token ID.

Now, we convert token_ids and attention_mask to tensors as shown in the follow-


ing code:

token_ids = torch.tensor(token_ids).unsqueeze(0)
attention_mask = torch.tensor(attention_mask).unsqueeze(0)

That's it. Next, we feed token_ids and attention_mask to the pre-trained BERT
model and get the embedding.

Getting the embedding

As shown in the following code, we feed token_ids and attention_mask to model and
get the embeddings. Note that model returns the output as a tuple with two
values. The first value indicates the hidden state representation, hidden_rep,
and it consists of the representation of all the tokens obtained from the fi-
nal encoder (encoder 12), and the second value, cls_head, consists of the rep-
resentation of the [CLS] token:

hidden_rep, cls_head = model(token_ids, attention_mask = attention_mask)

In the preceding code, hidden_rep contains the embedding (representation) of


all the tokens in our input. Let's print the shape of hidden_rep:
print(hidden_rep.shape)

The preceding code will print the following:

torch.Size([1, 7, 768])

The size [1,7,768] indicates [batch_size, sequence_length, hidden_size] .

Our batch size is 1. The sequence length is the token length. Since we
have 7 tokens, the sequence length is 7. The hidden size is the representa-
tion (embedding) size and it is 768 for the BERT-base model.

We can obtain the representation of each token as follows:

hidden_rep[0][0] gives the representation of the first token, which is [CLS].


hidden_rep[0][1] gives the representation of the second token, which is I.
hidden_rep[0][2] gives the representation of the third token, which is love.

In this way, we can obtain the contextual representation of all the tokens.
This is basically the contextualized word embeddings of all the words in
the given sentence.

Now, let's take a look at cls_head. It contains the representation of the [CLS]
token. Let's print the shape of cls_head :

print(cls_head.shape)

The preceding code will print the following:

torch.Size([1, 768])

The size [1,768] indicates [batch_size, hidden_size].

We learned that cls_head holds the aggregate representation of the sentence,


so we can use cls_head as the representation of the sentence I love Paris.
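
As noted earlier, averaging (mean pooling) the representations of all the tokens often gives a better sentence representation than the [CLS] embedding alone. The following is a minimal sketch of mean pooling that ignores the [PAD] positions, assuming the hidden_rep and attention_mask tensors obtained in the preceding code:

mask = attention_mask.unsqueeze(-1).float()         # shape [1, 7, 1]
sum_embeddings = (hidden_rep * mask).sum(dim=1)     # sum of the embeddings of the non-[PAD] tokens
mean_embedding = sum_embeddings / mask.sum(dim=1)   # shape [1, 768]

The resulting mean_embedding can then be used as the sentence representation in place of cls_head.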

We learned how to extract embeddings from the pre-trained BERT model.


But these are the embeddings obtained only from the topmost encoder
layer of BERT, which is encoder 12. Can we also extract the embeddings
from all the encoder layers of BERT? Yes! We will find out how to do that
in the next section.

Extracting embeddings from all encoder layers of BERT

We learned how to extract the embedding from the pre-trained BERT


model in the previous section. We learned that they are the embeddings
obtained from the final encoder layer. Now the question is, should we
consider the embeddings obtained only from the final encoder layer (fi-
nal hidden state), or should we also consider the embeddings obtained
from all the encoder layers (all hidden states)? Let's explore this.
Let's represent the input embedding layer with h_0, the first encoder layer (first hidden layer) with h_1, the second encoder layer (second hidden layer) with h_2, and so on up to the final (twelfth) encoder layer, h_12, as shown in the following figure:

Figure 3.4 – Pre-trained BERT

Instead of taking the embeddings (representations) only from the final


encoder layer, the researchers of BERT have experimented with taking
embeddings from different encoder layers.

For instance, for the NER task, the researchers used the pre-trained BERT model to extract features. Instead of using the embedding only from the final encoder layer (final hidden layer) as a feature, they experimented with using embeddings from other encoder layers (other hidden layers) as features and obtained the following F1 scores:

Figure 3.5 – F1 score of using embeddings from different layers

As we can observe from the preceding table, concatenating the embed-


dings of the last four encoder layers (last four hidden layers) gives us the highest F1 score of 96.1%. Thus, instead of taking embeddings only from
the final encoder layer (final hidden layer), we can also use embeddings
from the other encoder layers.

Now, we will learn how to extract the embeddings from all the encoder
layers using the transformers library.

Extracting the embeddings

First, let's import the necessary modules:

from transformers import BertModel, BertTokenizer


import torch

Next, download the pre-trained BERT model and tokenizer. As we can see,
while downloading the pre-trained BERT model, we need to set
output_hidden_states = True . Setting this to True helps us to obtain embeddings from
all the encoder layers:

model = BertModel.from_pretrained('bert-base-uncased',
output_hidden_states = True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

Next, we preprocess the input before feeding it to the model.

Preprocessing the input

Let's consider the sentence we saw in the previous section. First, we tok-
enize the sentence and add a [CLS] token at the beginning and an [SEP] token
at the end:

sentence = 'I love Paris'


tokens = tokenizer.tokenize(sentence)
tokens = ['[CLS]'] + tokens + ['[SEP]']

Suppose we need to keep the token length to 7. So, we add the [PAD] tokens
and also define the attention mask:

tokens = tokens + ['[PAD]'] + ['[PAD]']


attention_mask = [1 if i!= '[PAD]' else 0 for i in tokens]

Next, we convert tokens to their token IDs:

token_ids = tokenizer.convert_tokens_to_ids(tokens)

Now, we convert token_ids and attention_mask to tensors:

token_ids = torch.tensor(token_ids).unsqueeze(0)
attention_mask = torch.tensor(attention_mask).unsqueeze(0)

Now that we have preprocessed the input, let's get the embeddings.

Getting the embeddings

Since we set output_hidden_states = True while defining the model to get the em-
beddings from all the encoder layers, now the model returns an output
tuple with three values, as shown in the following code:

last_hidden_state, pooler_output, hidden_states = \
    model(token_ids, attention_mask = attention_mask)

In the preceding code, the following applies:

The first value, last_hidden_state, contains the representation of all the to-
kens obtained only from the final encoder layer (encoder 12).
Next, pooler_output indicates the representation of the [CLS] token from
the final encoder layer, which is further processed by a linear and
tanh activation function.
hidden_states contains the representation of all the tokens obtained from
all the encoder layers.

Now, let's take a look at each of these values and understand them in
more detail.

First, let's look at last_hidden_state. As we learned, it holds the representation


of all the tokens obtained only from the final encoder layer (encoder 12).
Let's print the shape of last_hidden_state:

last_hidden_state.shape

The preceding code will print the following:

torch.Size([1, 7, 768])

The size [1,7,768] indicates [batch_size, sequence_length, hidden_size] .

Our batch size is 1. The sequence length is the token length. Since we
have 7 tokens, the sequence length is 7. The hidden size is the representa-
tion (embedding) size and it is 768 for the BERT-base model.

We can obtain the embedding of each token as follows:

last_hidden_state[0][0] gives the representation of the first token, which is [CLS].
last_hidden_state[0][1] gives the representation of the second token, which is I.
last_hidden_state[0][2] gives the representation of the third token, which is love.

Similarly, we can obtain the representation of all the tokens from the fi-
nal encoder layer.

Next, we have pooler_output, which contains the representation of the [CLS]


token from the final encoder layer, which is further processed by a linear
and tanh activation function. Let's print the shape of pooler_output:

pooler_output.shape

The preceding code will print the following:

torch.Size([1, 768])

The size [1,768] indicates [batch_size, hidden_size].

We learned that the [CLS] token holds the aggregate representation of the
sentence. Thus, we can use pooler_output as the representation of the
sentence I love Paris.

Finally, we have hidden_states, which contains the representation of all the to-
kens obtained from all the encoder layers. It is a tuple containing 13 val-
ues holding the representations of all the layers (hidden layers), from the input embedding layer, h_0, to the final encoder layer, h_12:

len(hidden_states)

The preceding code will print the following:

13
As we can see, it contains 13 values holding the representation of all
layers:

hidden_states[0] contains the representation of all the tokens obtained from the input embedding layer, h_0.
hidden_states[1] contains the representation of all the tokens obtained from the first encoder layer, h_1.
hidden_states[2] contains the representation of all the tokens obtained from the second encoder layer, h_2.
hidden_states[12] contains the representation of all the tokens obtained from the final encoder layer, h_12.

Let's explore this more. First, let's print the shape of hidden_states[0], which
contains the representation of all the tokens obtained from the input em-
bedding layer, h_0:

hidden_states[0].shape

The preceding code will print the following:

torch.Size([1, 7, 768])

The size [1, 7, 768] indicates [batch_size, sequence_length, hidden_size] .

Now, let's print the shape of hidden_states[1], which contains the representa-
tion of all the tokens obtained from the first encoder layer, h_1:

hidden_states[1].shape

The preceding code will print the following:

torch.Size([1, 7, 768])
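
Earlier, we saw that concatenating the embeddings from the last four encoder layers gave the best F1 score in the feature-based NER experiment. The following is a minimal sketch of how we could build such a concatenated representation, assuming the hidden_states tuple obtained in the preceding code:

# hidden_states[-4:] holds the outputs of encoder layers 9 to 12, each of shape [1, 7, 768]
concatenated_rep = torch.cat(hidden_states[-4:], dim=-1)
print(concatenated_rep.shape)

The preceding code will print torch.Size([1, 7, 3072]), since the four 768-dimensional representations of each token are concatenated.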

Thus, in this way, we can obtain the embedding of tokens from all the en-
coder layers. We learned how to use the pre-trained BERT model to ex-
tract embeddings; can we also use pre-trained BERT for a downstream
task such as sentiment analysis? Yes! We will learn about this in the next
section.

Fine-tuning BERT for downstream tasks

So far, we have learned how to use the pre-trained BERT model. Now, let's
learn how to fine-tune the pre-trained BERT model for downstream
tasks. Note that fine-tuning implies that we are not training BERT from
scratch; instead, we are using the pre-trained BERT and updating its
weights according to our task.

In this section, we will learn how to fine-tune the pre-trained BERT model
for the following downstream tasks:

Text classification
Natural language inference
NER
Question-answering

Text classification

Let's learn how to fine-tune the pre-trained BERT model for a text classifi-
cation task. Say we are performing sentiment analysis. In the sentiment
analysis task, our goal is to classify whether a sentence is positive or neg-
ative. Suppose we have a dataset containing sentences along with their
labels.

Consider a sentence: I love Paris. First, we tokenize the sentence, add the
[CLS] token at the beginning, and add the [SEP] token at the end of the sen-
tence. Then, we feed the tokens as an input to the pre-trained BERT model
and get the embeddings of all the tokens.

Next, we ignore the embedding of all other tokens and take only the em-
bedding of the [CLS] token, which is R_[CLS]. The embedding of the [CLS] token will hold the aggregate representation of the sentence. We feed R_[CLS] to a classi-
fier (feed-forward network with softmax function) and train the classifier
to perform sentiment analysis.

Wait! How does this differ from what we saw at the beginning of the sec-
tion? How does fine-tuning the pre-trained BERT model differ from using
the pre-trained BERT model as a feature extractor?

In the Extracting embeddings from pre-trained BERT section, we learned


that after extracting the embedding R_[CLS] of a sentence, we feed it to a classifier and train the classifier to perform classification. Similarly, during fine-tuning, we feed the embedding R_[CLS] to a classifier and train
the classifier to perform classification.

The difference is that when we fine-tune the pre-trained BERT model, we


update the weights of the model along with a classifier. But when we use
the pre-trained BERT model as a feature extractor, we update only the
weights of the classifier and not the pre-trained BERT model.

During fine-tuning, we can adjust the weights of the model in the follow-
ing two ways:

Update the weights of the pre-trained BERT model along with the clas-
sification layer.
Update only the weights of the classification layer and not the pre-trained BERT model. When we do this, it becomes the same as using the pre-trained BERT model as a feature extractor, as shown in the sketch after this list.
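
The following is a minimal sketch of the second option. It assumes that model is a BERT classification model, such as the BertForSequenceClassification instance used later in this chapter, and it freezes the pre-trained BERT weights so that only the classification layer is updated during training:

# Freeze all the parameters of the pre-trained BERT encoder
for param in model.bert.parameters():
    param.requires_grad = False

Since the BERT weights are no longer updated, training now adjusts only the classification layer, which is equivalent to using the pre-trained BERT model as a feature extractor.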

The following figure shows how we fine-tune the pre-trained BERT


model for a sentiment analysis task:
Figure 3.6 – Fine-tuning the pre-trained BERT model for text classification

As we can observe from the preceding figure, we feed the tokens to the
pre-trained BERT model and get the embeddings of all the tokens. We take
the embedding of the [CLS] token and feed it to a feedforward network
with a softmax function and perform classification.
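
To make this more concrete, the following is a minimal, illustrative sketch of such a classifier: a linear (feedforward) layer placed on top of the [CLS] embedding returned by BertModel. The class name and the number of classes are only assumptions for illustration; in the next section, we will use the ready-made BertForSequenceClassification class instead:

import torch.nn as nn
from transformers import BertModel

class BERTSentimentClassifier(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        # maps the 768-dimensional [CLS] embedding to the class scores
        self.classifier = nn.Linear(768, num_classes)

    def forward(self, token_ids, attention_mask):
        hidden_rep, _ = self.bert(token_ids, attention_mask=attention_mask)
        cls_embedding = hidden_rep[:, 0]        # representation of the [CLS] token
        logits = self.classifier(cls_embedding)
        return logits                           # apply softmax to obtain class probabilities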

Let's get a better understanding of how fine-tuning works by getting


hands-on with fine-tuning the pre-trained BERT model for a sentiment analysis task in the next section.

Fine-tuning BERT for sentiment analysis

Let's explore how to fine-tune the pre-trained BERT model for a sentiment
analysis task with the IMDB dataset. The IMDB dataset consists of movie
reviews along with the respective sentiment of the review. We can also
access the complete code from the GitHub repository of the book. In order
to run the code smoothly, clone the GitHub repository of the book and run
the code using Google Colab.

Importing the dependencies

First, let's install the necessary libraries:

!pip install nlp==0.4.0


!pip install transformers==3.5.1

Import the necessary modules:

from transformers import BertForSequenceClassification, BertTokenizerFast, Trainer, TrainingArguments


from nlp import load_dataset
import torch
import numpy as np

Loading the model and dataset

Load the model and dataset. First, let's download and load the dataset us-
ing the nlp library:

!gdown https://drive.google.com/uc?id=11_M4ootuT7I1G0RlihcC0cA3Elqotlc-
dataset = load_dataset('csv', data_files='./imdbs.csv', split='train')

Let's check the datatype:


type(dataset)

Here is the output:

nlp.arrow_dataset.Dataset

Next, let's split the dataset into train and test sets:

dataset = dataset.train_test_split(test_size=0.3)

Let's print the dataset:

dataset

The preceding code will print the following:

{
'test': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, num_rows: ...),
'train': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, num_rows: ...)
}

Now, we create the train and test sets:

train_set = dataset['train']
test_set = dataset['test']

Next, let's download and load the pre-trained BERT model. In this exam-
ple, we use the pre-trained bert-base-uncased model. As we can see, since we
are performing sequence classification, we use the BertForSequenceClassification
class:

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

Next, we download and load the tokenizer that was used to pre-train
the bert-base-uncased model.

As we can see, we create the tokenizer using the BertTokenizerFast class instead
of BertTokenizer. The BertTokenizerFast class has many advantages compared
to BertTokenizer. We will learn about this in the next section:

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

Now that we have loaded the dataset and model, let's preprocess the
dataset.

Preprocessing the dataset

We can preprocess the dataset quickly using our tokenizer. For example,
consider the sentence I love Paris.
First, we tokenize the sentence and add a [CLS] token at the beginning and
a [SEP] token at the end, as shown here:

tokens = [ [CLS], I, love, Paris, [SEP] ]

Next, we map the tokens to the unique input IDs (token IDs). Suppose the
following are the unique input IDs (token IDs):

input_ids = [101, 1045, 2293, 3000, 102]

Then, we need to add the segment IDs (token type IDs). Wait, what are
segment IDs? Suppose we have two sentences in the input. In that case,
segment IDs are used to distinguish one sentence from the other. All the
tokens from the first sentence will be mapped to 0 and all the tokens from
the second sentence will be mapped to 1. Since here we have only one
sentence, all the tokens will be mapped to 0 as shown here:

token_type_ids = [0, 0, 0, 0, 0]

Now, we need to create the attention mask. We know that an attention


mask is used to differentiate the actual tokens and [PAD] tokens. It will map
all the actual tokens to 1 and the [PAD] tokens to 0. Suppose our tokens length
should be 5. Our tokens list already has five tokens, so we don't have to add
a [PAD] token. Our attention mask will become the following:

attention_mask = [1, 1, 1, 1, 1]

That's it. But instead of doing all the aforementioned steps manually, our
tokenizer will do these steps for us. We just need to pass the sentence to
the tokenizer as shown in the following code:

tokenizer('I love Paris')

The preceding code will return the following. As we can see, our input
sentence is tokenized and mapped to input_ids, token_type_ids, and also
attention_mask :

{
'input_ids': [101, 1045, 2293, 3000, 102],
'token_type_ids': [0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1]
}

With the tokenizer, we can also pass any number of sentences and per-
form padding dynamically. To do that, we need to set padding to True and also
the maximum sequence length. For instance, as shown in the following
code, we pass three sentences and we set the maximum sequence length,
max_length , to 5:

tokenizer(['I love Paris', 'birds fly', 'snow fall'], padding=True, max_length=5)
The preceding code will return the following. As we can see, all the sen-
tences are mapped to input_ids, token_type_ids, and attention_mask. The second and
third sentences have only two tokens, and after adding [CLS] and [SEP], they
will have four tokens. Since we set padding to True and max_length to 5, an addi-
tional [PAD] token is added to the second and third sentences, and that's
why we have 0 in the attention mask of the second and third sentences:

{
'input_ids': [[101, 1045, 2293, 3000, 102], [101, 5055, 4875, 102, 0], [101, 4586, 2991, 102, 0]],
'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0], [1, 1, 1, 1, 0]]
}

That's it – with the tokenizer, we can easily preprocess our dataset. So, we
define a function called preprocess to process the dataset as follows:

def preprocess(data):
return tokenizer(data['text'], padding=True, truncation=True)

Now, we preprocess the train and test sets using the preprocess function:

train_set = train_set.map(preprocess, batched=True, batch_size=len(train_set))
test_set = test_set.map(preprocess, batched=True, batch_size=len(test_set))

Next, we use the set_format function and select the columns that we need in
our dataset and the format we need them in as shown in the following
code:

train_set.set_format('torch',
columns=['input_ids', 'attention_mask', 'label'])
test_set.set_format('torch',
columns=['input_ids', 'attention_mask', 'label'])

That's it. Now that we have the dataset ready, let's train the model.

Training the model

Define the batch size and epoch size:

batch_size = 8
epochs = 2

Define the warmup steps and weight decay:

warmup_steps = 500
weight_decay = 0.01

Define the training arguments:

training_args = TrainingArguments(
output_dir='./results',
num_train_epochs=epochs,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
warmup_steps=warmup_steps,
weight_decay=weight_decay,
evaluate_during_training=True,
logging_dir='./logs',
)

Now, define the trainer:

trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_set,
eval_dataset=test_set
)

Start training the model:

trainer.train()

After training, we can evaluate the model using the evaluate function:

trainer.evaluate()

The preceding code will print the following:

{'epoch': 1.0, 'eval_loss': 0.68}


{'epoch': 2.0, 'eval_loss': 0.50}

In this way, we can fine-tune the pre-trained BERT model. Now that we
have learned how to fine-tune BERT for a text classification task, in the
next section, let's see how to fine-tune the BERT model for Natural
Language Inference (NLI).

Natural language inference

In NLI, the goal of our model is to determine whether a hypothesis is an


entailment (true), a contradiction (false), or undetermined (neutral) given
a premise. Let's learn how to perform NLI by fine-tuning BERT.

Consider the sample dataset shown in the following figure; as we can see,
we have a premise and a hypothesis with a label indicating whether they
are entailment, contradiction, or undetermined:
Figure 3.7 – Sample NLI dataset

Now, the goal of our model is to determine whether a sentence pair


(premise-hypothesis pair) is an entailment, a contradiction, or undeter-
mined. Let's understand how to do this with an example. Consider the fol-
lowing premise-hypothesis pair:

Premise: He is playing

Hypothesis: He is sleeping

First, we tokenize the sentence pair, then add a [CLS] token at the begin-
ning of the first sentence and an [SEP] token at the end of every sentence.
The tokens are as follows:

tokens = [ [CLS], He, is, playing, [SEP], He, is, sleeping, [SEP] ]

Now, we feed the tokens to the pre-trained BERT model and get the em-
bedding of each token. We learned that the representation of the [CLS] to-
ken holds the aggregate representation.

So, we take the representation of the [CLS] token, which is R_[CLS], and feed it to a classifier (feedforward network + softmax), which returns the probability of
the sentence being a contradiction, an entailment, or neutral. Our results
will not be accurate in the initial iterations, but over the course of multiple
iterations, we will get better results:

Figure 3.8 – Fine-tuning the pre-trained BERT model for NLI
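
The following is a minimal sketch of how such a setup could look with the transformers library. The three-class head and the example sentence pair are only for illustration; the model still has to be fine-tuned on an NLI dataset (for instance, with the Trainer API shown earlier) before its predictions are meaningful:

from transformers import BertForSequenceClassification, BertTokenizerFast
import torch

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=3)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Passing a sentence pair makes the tokenizer add [CLS] and both [SEP] tokens
# and set token_type_ids to 0 for the premise and 1 for the hypothesis
inputs = tokenizer('He is playing', 'He is sleeping', return_tensors='pt')
logits = model(**inputs)[0]               # shape [1, 3]
probs = torch.softmax(logits, dim=-1)     # probabilities for the three classes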

Now that we have learned how to fine-tune BERT for NLI, in the next sec-
tion, we will learn how to fine-tune BERT for question-answering.

Question-answering

In a question-answering task, we are given a question along with a para-


graph containing an answer to the question. Our goal is to extract the an-
swer from the paragraph for the given question. Now, let's learn how to
fine-tune the pre-trained BERT model to perform a question-answering
task.
The input to the BERT model will be a question-paragraph pair. That is,
we feed a question and a paragraph containing the answer to the ques-
tion to BERT and it has to extract the answer from the paragraph. So, es-
sentially, BERT has to return the text span that contains the answer from
the paragraph. Let's understand this with an example – consider the fol-
lowing question-paragraph pair:

Question = "What is the immune system?"

Paragraph = "The immune system is a system of many biological structures


and processes within an organism that protects against disease. To func-
tion properly, an immune system must detect a wide variety of agents,
known as pathogens, from viruses to parasitic worms, and distinguish
them from the organism's own healthy tissue."

Now, our model has to extract an answer from the paragraph; it essen-
tially has to return the text span containing the answer. So, it should re-
turn the following:

Answer = "a system of many biological structures and processes within an


organism that protects against disease"

Okay, how can we fine-tune the BERT model to do this task? To do this,
our model has to understand the starting and ending index of the text
span containing the answer in the given paragraph. For example, take the
question, "What is the immune system?" If our model understands that the
answer to this question starts from index 4 ("a") and ends at index 21
("disease"), then we can get the answer as shown here:

Paragraph = "The immune system is a system of many system of many


biological structures and processes within an organism that protects
against disease" biological structures and processes within an organism
that protects against disease. To function properly, an immune system must
detect a wide variety of agents, known as pathogens, from viruses to para-
sitic worms, and distinguish them from the organism's own healthy tissue."

Now, how do we find the starting and ending index of the text span con-
taining the answer? If we get the probability of each token (word) in the
paragraph being the starting and ending token (word) of the answer, then we can easily extract the answer, right? Yes, but how can we achieve this? To do this, we use two vectors called the start vector, S, and the end vector, E. The values of the start and end vectors will be learned during
training.

First, we compute the probability of each token (word) in the paragraph


being the starting token of the answer.

To compute this probability, for each token i, we compute the dot product between the representation of the token, R_i, and the start vector, S. Next, we apply the softmax function to the dot products and obtain the probability of each token being the start token:

P_start(i) = exp(S · R_i) / Σ_j exp(S · R_j)

Next, we compute the starting index by selecting the index of the token that has the highest probability of being the starting token.
In a very similar fashion, we compute the probability of each token (word) in the paragraph being the ending token of the answer. To compute this probability, for each token i, we compute the dot product between the representation of the token, R_i, and the end vector, E. Next, we apply the softmax function to the dot products and obtain the probability of each token being the end token:

P_end(i) = exp(E · R_i) / Σ_j exp(E · R_j)

Next, we compute the ending index by selecting the index of the token that has the highest probability of being the ending token. Now, we can select
the text span that contains the answer using the starting and ending
index.
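
The following is a minimal numerical sketch of these two computations. The tensors are random and only illustrate the shapes involved; in practice, R comes from the pre-trained BERT model and the start and end vectors are learned during fine-tuning:

import torch

seq_len, hidden_size = 20, 768
R = torch.randn(seq_len, hidden_size)   # representations of the paragraph tokens
S = torch.randn(hidden_size)            # start vector
E = torch.randn(hidden_size)            # end vector

start_probs = torch.softmax(R @ S, dim=0)   # probability of each token being the start token
end_probs = torch.softmax(R @ E, dim=0)     # probability of each token being the end token

start_index = torch.argmax(start_probs)
end_index = torch.argmax(end_probs)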

As shown in the following figure, first, we tokenize the question-para-


graph pair and feed the tokens to the pre-trained BERT model, which re-
turns the embeddings of all the tokens; the figure shows the embeddings of the question tokens followed by the embeddings of the paragraph tokens.

After computing the embedding, we compute the dot product with the
start/end vectors, apply the softmax function, and obtain the probabilities
of each token in the paragraph being the start/end word as shown here:

Figure 3.9 - Fine-tuning the pre-trained BERT for question-answering

From the preceding figure, we can see how we compute the probability of
each token in the paragraph being the start/end word. Next, we select the
text span containing the answer using the starting and ending indexes
with the highest probability. To get a better understanding of how this
works, let's see how to use the fine-tuned question-answering BERT
model in the next section.

Performing question-answering with fine-tuned BERT

In this section, let's learn how to perform question answering with a fine-
tuned question-answering BERT model. First, let's import the necessary
modules:
from transformers import BertForQuestionAnswering, BertTokenizer
import torch

Now, we download and load the model. We use the bert-large-uncased-whole-word-


masking-finetuned-squad model, which is fine-tuned on the Stanford Question Answering Dataset (SQuAD):

model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Next, we download and load the tokenizer:

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

Now that we have downloaded the model and tokenizer, let's preprocess
the input.

Preprocessing the input

First, we define the input to BERT, which is the question and paragraph
text:

question = "What is the immune system?"


paragraph = "The immune system is a system of many biological structures and processes with

Add a [CLS] token to the beginning of the question and an [SEP] token to the
end of both the question and the paragraph:

question = '[CLS] ' + question + '[SEP]'


paragraph = paragraph + '[SEP]'

Now, tokenize the question and paragraph:

question_tokens = tokenizer.tokenize(question)
paragraph_tokens = tokenizer.tokenize(paragraph)

Combine the question and paragraph tokens and convert them to input_ids:

tokens = question_tokens + paragraph_tokens


input_ids = tokenizer.convert_tokens_to_ids(tokens)

Next, we define segment_ids. Now, segment_ids will be 0 for all the tokens of the
question and 1 for all the tokens of the paragraph:

segment_ids = [0] * len(question_tokens)
segment_ids += [1] * len(paragraph_tokens)

Now we convert input_ids and segment_ids to tensors:

input_ids = torch.tensor([input_ids])
segment_ids = torch.tensor([segment_ids])
Now that we have processed the input, let's feed it to the model and get
the result.

Getting the answer

We feed input_ids and segment_ids to the model, which returns the start score
and end score for all of the tokens:

start_scores, end_scores = model(input_ids, token_type_ids = segment_ids)

Now, we select start_index, which is the index of the token that has the high-
est start score, and end_index, which is the index of the token that has the
highest end score:

start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)

That's it! Now, we print the text span between the start and end indexes
as our answer:

print(' '.join(tokens[start_index:end_index+1]))

The preceding code will print the following:

a system of many biological structures and processes within an organism that protects against disease

Now that we have learned how to fine-tune BERT for question-answering,


in the next section, we will learn how to fine-tune BERT for NER.

Named entity recognition

In NER, our goal is to classify named entities into predefined categories.


For instance, consider the sentence Jeremy lives in Paris. In this sentence,
"Jeremy" should be categorized as a person, and "Paris" should be catego-
rized as a location.

Now, let's learn how to fine-tune the pre-trained BERT model to perform
NER. First, we tokenize the sentence, then we add the [CLS] token at the
beginning and the [SEP] token at the end. Then, we feed the tokens to the
pre-trained BERT model and obtain the representation of every
token. Next, we feed those token representations to a classifier (feedfor-
ward network + softmax function). Then, the classifier returns the cate-
gory to which the named entity belongs. This is shown in the following
figure:
Figure 3.10 - Fine-tuning the pre-trained BERT model for NER
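
The following is a minimal sketch of how we could set this up with the transformers library using the BertForTokenClassification class. The number of entity tags (nine, as in the CoNLL-2003 tagging scheme) is an assumption, and the model has to be fine-tuned on a labeled NER dataset before its predictions become meaningful:

from transformers import BertForTokenClassification, BertTokenizerFast
import torch

model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=9)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

inputs = tokenizer('Jeremy lives in Paris', return_tensors='pt')
logits = model(**inputs)[0]                  # shape [1, sequence_length, 9]
predictions = torch.argmax(logits, dim=-1)   # predicted tag index for every token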

We can fine-tune the pre-trained BERT model for several downstream


tasks. So far, we have learned how BERT works and also how to use the
pre-trained BERT model. In the next chapter, we will learn about differ-
ent variants of BERT.

Summary

We started the chapter by looking at different configurations of the pre-


trained BERT model provided by Google. Then, we learned that we can
use the pre-trained BERT model in two ways: as a feature extractor by ex-
tracting embeddings, and by fine-tuning the pre-trained BERT model for
downstream tasks such as text classification, question-answering, and
more.

Then, we learned how to extract embeddings from the pre-trained BERT


model in detail. We also learned how to use Hugging Face's transformers
library to generate embeddings. Then, we learned how to extract embed-
dings from all the encoder layers of BERT in detail.

Moving on, we learned how to fine-tune pre-trained BERT for down-


stream tasks. We learned how to fine-tune BERT for text classification,
NLI, NER, and question-answering in detail. In the next chapter, we will
explore several interesting variants of BERT.

Questions

Let's put our knowledge to the test. Try answering the following
questions:

1. How do you use the pre-trained BERT model?


2. What is the use of the [PAD] token?
3. What is an attention mask?
4. What is fine-tuning?
5. How do you compute the starting index of an answer in question-
answering?
6. How do you compute the ending index of an answer in question-
answering?
7. How do you use BERT for NER?
Further reading

To learn more, refer to the following resources:

Check out the Hugging Face transformers documentation, available


at https://huggingface.co/transformers/model_doc/bert.html.
BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova available at https://arxiv.org/pdf/1810.04805.pdf.
