In this chapter, we will learn how to use the pre-trained BERT model in
detail. First, we will look at the different configurations of the pre-trained
BERT model open sourced by Google. Then, we will learn how to use the
pre-trained BERT model as a feature extractor. We will also explore
Hugging Face's transformers library and learn how to use it to extract em-
beddings from the pre-trained BERT.
Moving on, we will understand how to extract embeddings from all en-
coder layers of BERT. Next, we will learn how to fine-tune the pre-trained
BERT model for the downstream tasks. First, we will learn to fine-tune the
pre-trained BERT model for a text classification task. Next, we will learn
to fine-tune BERT for sentiment analysis tasks using the transformers library.
Then, we will look into fine-tuning the pre-trained BERT model for natu-
ral language inference, question answering tasks, and named entity
recognition tasks.
In the upcoming sections, we will learn how to use the pre-trained BERT
model as a feature extractor by extracting embeddings, and we will also
learn how to fine-tune the pre-trained BERT model for downstream tasks
in detail.
Let's learn how to extract embeddings from pre-trained BERT with an ex-
ample. Consider a sentence – I love Paris. Say we need to extract the con-
textual embedding of each word in the sentence. To do this, first, we tok-
enize the sentence and feed the tokens to the pre-trained BERT
model, which will return the embeddings for each of the tokens. Apart
from obtaining the token-level (word-level) representation, we can also
obtain the sentence-level representation.
In this section, let's learn how exactly we can extract the word-level and
sentence-level embedding from the pre-trained BERT model in detail.
As we can observe from the preceding table, we have sentences and their
corresponding labels, where 1 indicates positive sentiment and 0 indi-
cates negative sentiment. We can train a classifier to classify the senti-
ment of a sentence using the given dataset.
But we can't feed the given dataset directly to a classifier, since it has text.
So first, we need to vectorize the text. We can vectorize the text using
methods such as TF-IDF, word2vec, and others. In the previous chapter,
we learned that BERT learns the contextual embedding, unlike other con-
text-free embedding models such as word2vec. Now, we will see how to
use the pre-trained BERT model to vectorize the sentences in our dataset.
Let's take the first sentence in our dataset – I love Paris. First, we tokenize
the sentence using the WordPiece tokenizer and get the tokens (words).
After tokenizing the sentence, we have the following:
tokens = [i, love, paris]
Now, we add the [CLS] token at the beginning and the [SEP] token at the end.
Thus, our tokens list becomes this:
tokens = [ [CLS], i, love, paris, [SEP] ]
Similarly, we can tokenize all the sentences in our training set. But the
length of each sentence varies, right? Yes, and so does the length of the to-
kens. We need to keep the length of all the tokens the same. Say we keep
the length of the tokens to 7 for all the sentences in our dataset. If we look
at our preceding tokens list, the tokens length is 5. To make the tokens length 7,
we add a new token called [PAD]. Thus, now our tokens are as follows:
tokens = [ [CLS], i, love, paris, [SEP], [PAD], [PAD] ]
As we can observe, now our tokens length is 7, as we have added two [PAD]
tokens. The next step is to make our model understand that the [PAD] token
is added only to match the tokens length and is not part of the actual tokens. To do this, we introduce an attention mask. We set the attention mask value to 1 for all the actual tokens and 0 for the positions that hold a [PAD] token, as shown here:
attention_mask = [ 1,1,1,1,1,0,0]
Next, we map all the tokens to a unique token ID. Suppose the following are the mapped token IDs:
token_ids = [101, 1045, 2293, 3000, 102, 0, 0]
This implies that ID 101 indicates the token [CLS], 1045 indicates the token i, 2293 indicates the token love, and so on.
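The preprocessing steps described above (adding special tokens, padding, building the attention mask, and mapping to IDs) can be sketched in plain Python. Note that the vocabulary below is a made-up toy mapping for illustration, not BERT's real WordPiece vocabulary:

```python
# A minimal sketch of the preprocessing steps described above.
# toy_vocab is a made-up stand-in for BERT's WordPiece vocabulary.
toy_vocab = {'[PAD]': 0, '[CLS]': 101, '[SEP]': 102,
             'i': 1045, 'love': 2293, 'paris': 3000}

def preprocess(words, max_length):
    # Add the special tokens around the (already tokenized) sentence.
    tokens = ['[CLS]'] + words + ['[SEP]']
    # Pad the token list up to max_length with [PAD] tokens.
    tokens = tokens + ['[PAD]'] * (max_length - len(tokens))
    # Attention mask: 1 for actual tokens, 0 for [PAD] tokens.
    attention_mask = [1 if t != '[PAD]' else 0 for t in tokens]
    # Map every token to its ID.
    token_ids = [toy_vocab[t] for t in tokens]
    return tokens, token_ids, attention_mask

tokens, token_ids, attention_mask = preprocess(['i', 'love', 'paris'], 7)
print(token_ids)       # [101, 1045, 2293, 3000, 102, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 1, 0, 0]
```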
The following figure shows how we use the pre-trained BERT model to ob-
tain the embedding. For clarity, the tokens are shown instead of token
IDs. As we can see, once we feed the tokens as the input, encoder 1 com-
putes the representation of all the tokens and sends it to the next encoder,
which is encoder 2. Encoder 2 takes the representation computed by en-
coder 1 as input, computes its representation, and sends it to the next en-
coder, which is encoder 3. In this way, each encoder sends its representa-
tion to the next encoder above it. The final encoder, which is encoder 12,
returns the final representation (embedding) of all the tokens in our
sentence:
We learned how to obtain the representation for each word in the sen-
tence I love Paris. But how do we obtain the representation of the com-
plete sentence?
Note that using the representation of the [CLS] token as a sentence representation is not always a good idea. A more effective way to obtain the representation of a sentence is to average or pool the representations of all the tokens. We will learn more about this in the upcoming chapters.
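The averaging approach can be sketched as follows, using the attention mask to exclude [PAD] positions; toy 3-dimensional vectors stand in for BERT's 768-dimensional token embeddings:

```python
# Sketch: sentence embedding by averaging token embeddings, skipping
# [PAD] positions via the attention mask. Toy 3-dimensional vectors
# stand in for BERT's 768-dimensional representations.
def mean_pool(token_embeddings, attention_mask):
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for emb, mask in zip(token_embeddings, attention_mask):
        if mask == 1:               # only average over actual tokens
            count += 1
            for d in range(dim):
                total[d] += emb[d]
    return [t / count for t in total]

embeddings = [[1.0, 2.0, 3.0],      # [CLS]
              [3.0, 4.0, 5.0],      # a word token
              [0.0, 0.0, 0.0]]      # [PAD]
mask = [1, 1, 0]
print(mean_pool(embeddings, mask))  # [2.0, 3.0, 4.0]
```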
Now that we have learned how to use the pre-trained BERT model to ex-
tract an embedding (representation), in the next section, we will learn
how to do this using a library known as transformers.
We can install transformers using pip as follows:
!pip install transformers==3.5.1
In this book, we use transformers version 3.5.1. Now that we have installed transformers, let's get started.
In this section, we will learn how to extract embeddings from the pre-
trained BERT model. Consider the sentence I love Paris. Let's see how to
obtain the contextualized word embedding of all the words in the sen-
tence using the pre-trained BERT model with Hugging Face's transformers
library. We can also access the complete code from the GitHub repository
of the book. In order to run the code smoothly, clone the GitHub reposi-
tory of the book and run the code using Google Colab.
First, let's import the necessary modules:
from transformers import BertModel, BertTokenizer
import torch
Next, we download the pre-trained BERT model. We can check all the available pre-trained BERT models here – https://huggingface.co/transformers/pre-trained_models.html. We use the 'bert-base-uncased' model. As the name suggests, it is the BERT-base model with 12 encoders, and it is trained with uncased tokens. Since we are using BERT-base, the representation size will be 768.
model = BertModel.from_pretrained('bert-base-uncased')
Next, we download and load the tokenizer that was used to pre-train
the bert-base-uncased model:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
Now, let's see how to preprocess the input before feeding it to BERT. First, we define the sentence:
sentence = 'I love Paris'
Next, we tokenize the sentence and print the tokens:
tokens = tokenizer.tokenize(sentence)
print(tokens)
['i', 'love', 'paris']
Now, we will add the [CLS] token at the beginning and the [SEP] token at the end of the tokens list:
tokens = ['[CLS]'] + tokens + ['[SEP]']
print(tokens)
['[CLS]', 'i', 'love', 'paris', '[SEP]']
As we can observe, we have a [CLS] token at the beginning and an [SEP] token at the end of our tokens list. We can also see that the length of our tokens list is 5.
Say we need to keep the length of our tokens list to 7; in that case, we add two [PAD] tokens at the end as shown in the following snippet:
tokens = tokens + ['[PAD]'] + ['[PAD]']
print(tokens)
['[CLS]', 'i', 'love', 'paris', '[SEP]', '[PAD]', '[PAD]']
As we can see, now we have the tokens list with [PAD] tokens and the length of
our tokens list is 7.
Next, we create the attention mask. We set the attention mask value to 1 if
the token is not a [PAD] token, else we set the attention mask to 0, as shown
here:
attention_mask = [1 if i != '[PAD]' else 0 for i in tokens]
print(attention_mask)
[1, 1, 1, 1, 1, 0, 0]
Next, we convert all the tokens to their token IDs as follows:
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)
[101, 1045, 2293, 3000, 102, 0, 0]
From the output, we can observe that each token is mapped to a unique token ID.
Now, we convert token_ids and attention_mask to tensors and add an extra batch dimension using unsqueeze(0):
token_ids = torch.tensor(token_ids).unsqueeze(0)
attention_mask = torch.tensor(attention_mask).unsqueeze(0)
That's it. Next, we feed token_ids and attention_mask to the pre-trained BERT
model and get the embedding.
As shown in the following code, we feed token_ids and attention_mask to the model and get the embeddings. Note that the model returns the output as a tuple with two values. The first value, hidden_rep, is the hidden state representation, which consists of the representation of all the tokens obtained from the final encoder (encoder 12); the second value, cls_head, consists of the representation of the [CLS] token:
hidden_rep, cls_head = model(token_ids, attention_mask = attention_mask)
Let's check the shape of hidden_rep:
hidden_rep.shape
torch.Size([1, 7, 768])
Our batch size is 1. The sequence length is the token length. Since we
have 7 tokens, the sequence length is 7. The hidden size is the representa-
tion (embedding) size and it is 768 for the BERT-base model.
In this way, we can obtain the contextual representation of all the tokens.
This is basically the contextualized word embeddings of all the words in
the given sentence.
Now, let's take a look at cls_head. It contains the representation of the [CLS]
token. Let's print the shape of cls_head :
print(cls_head.shape)
torch.Size([1, 768])
For instance, for the NER task, researchers have used the pre-trained
BERT model to extract features. Instead of using the embedding only from
the final encoder layer (final hidden layer) as a feature, they have experi-
mented with using embeddings from other encoder layers (other hidden
layers) as features and obtained the following F1 score:
Now, we will learn how to extract the embeddings from all the encoder
layers using the transformers library.
Next, download the pre-trained BERT model and tokenizer. As we can see,
while downloading the pre-trained BERT model, we need to set
output_hidden_states = True . Setting this to True helps us to obtain embeddings from
all the encoder layers:
model = BertModel.from_pretrained('bert-base-uncased',
output_hidden_states = True)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
Let's consider the sentence we saw in the previous section. First, we tokenize the sentence and add a [CLS] token at the beginning and an [SEP] token at the end:
sentence = 'I love Paris'
tokens = tokenizer.tokenize(sentence)
tokens = ['[CLS]'] + tokens + ['[SEP]']
Suppose we need to keep the token length to 7. So, we add the [PAD] tokens and also define the attention mask:
tokens = tokens + ['[PAD]'] + ['[PAD]']
attention_mask = [1 if i != '[PAD]' else 0 for i in tokens]
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids = torch.tensor(token_ids).unsqueeze(0)
attention_mask = torch.tensor(attention_mask).unsqueeze(0)
Now that we have preprocessed the input, let's get the embeddings. Since we set output_hidden_states = True while defining the model to get the embeddings from all the encoder layers, the model now returns an output tuple with three values, as shown in the following code:
last_hidden_state, pooler_output, hidden_states = model(token_ids, attention_mask = attention_mask)
The first value, last_hidden_state, contains the representation of all the to-
kens obtained only from the final encoder layer (encoder 12).
Next, pooler_output indicates the representation of the [CLS] token from
the final encoder layer, which is further processed by a linear and
tanh activation function.
hidden_states contains the representation of all the tokens obtained from
all the encoder layers.
Now, let's take a look at each of these values and understand them in
more detail.
last_hidden_state.shape
torch.Size([1, 7, 768])
Our batch size is 1. The sequence length is the token length. Since we
have 7 tokens, the sequence length is 7. The hidden size is the representa-
tion (embedding) size and it is 768 for the BERT-base model.
Similarly, we can obtain the representation of all the tokens from the fi-
nal encoder layer.
pooler_output.shape
torch.Size([1, 768])
We learned that the [CLS] token holds the aggregate representation of the
sentence. Thus, we can use pooler_output as the representation of the
sentence I love Paris.
Finally, we have hidden_states, which contains the representation of all the to-
kens obtained from all the encoder layers. It is a tuple containing 13 val-
ues holding the representation of all encoder layers (hidden layers), from
the input embedding layer to the final encoder layer:
len(hidden_states)
13
As we can see, it contains 13 values holding the representation of all
layers:
Let's explore this more. First, let's print the shape of hidden_states[0], which
contains the representation of all the tokens obtained from the input embedding layer:
hidden_states[0].shape
torch.Size([1, 7, 768])
Now, let's print the shape of hidden_states[1], which contains the representation of all tokens obtained from the first encoder layer:
hidden_states[1].shape
torch.Size([1, 7, 768])
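Given the hidden_states tuple, the feature-based approach mentioned earlier (concatenating a token's representation from the last four encoder layers) can be sketched as follows. The hidden_states stand-in here is a nested list of toy 2-dimensional vectors, not real model output:

```python
# Sketch: building a feature vector for a token by concatenating its
# representation from the last four encoder layers, as in the
# feature-based NER experiment mentioned earlier. hidden_states is a
# stand-in: 13 layers (embedding layer + 12 encoders) x 3 tokens x
# 2-dimensional toy vectors instead of 768-dimensional real ones.
num_layers, num_tokens = 13, 3
hidden_states = [[[float(layer), float(layer + tok)] for tok in range(num_tokens)]
                 for layer in range(num_layers)]

def concat_last_four(hidden_states, token_index):
    feature = []
    for layer in hidden_states[-4:]:        # encoder layers 9 to 12
        feature.extend(layer[token_index])  # concatenate the representations
    return feature

features = concat_last_four(hidden_states, 0)
print(len(features))  # 8  (4 layers x 2 dimensions each)
```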
Thus, in this way, we can obtain the embedding of tokens from all the en-
coder layers. We learned how to use the pre-trained BERT model to ex-
tract embeddings; can we also use pre-trained BERT for a downstream
task such as sentiment analysis? Yes! We will learn about this in the next
section.
So far, we have learned how to use the pre-trained BERT model. Now, let's
learn how to fine-tune the pre-trained BERT model for downstream
tasks. Note that fine-tuning implies that we are not training BERT from
scratch; instead, we are using the pre-trained BERT and updating its
weights according to our task.
In this section, we will learn how to fine-tune the pre-trained BERT model
for the following downstream tasks:
Text classification
Natural language inference
NER
Question-answering
Text classification
Let's learn how to fine-tune the pre-trained BERT model for a text classifi-
cation task. Say we are performing sentiment analysis. In the sentiment
analysis task, our goal is to classify whether a sentence is positive or neg-
ative. Suppose we have a dataset containing sentences along with their
labels.
Consider a sentence: I love Paris. First, we tokenize the sentence, add the
[CLS] token at the beginning, and add the [SEP] token at the end of the sen-
tence. Then, we feed the tokens as an input to the pre-trained BERT model
and get the embeddings of all the tokens.
Next, we ignore the embeddings of all other tokens and take only the embedding of the [CLS] token, denoted R_[CLS]. The embedding of the [CLS] token will hold the aggregate representation of the sentence. We feed R_[CLS] to a classifier (a feedforward network with a softmax function) and train the classifier to perform sentiment analysis.
Wait! How does this differ from what we saw at the beginning of the sec-
tion? How does fine-tuning the pre-trained BERT model differ from using
the pre-trained BERT model as a feature extractor?
During fine-tuning, we can adjust the weights of the model in the follow-
ing two ways:
Update the weights of the pre-trained BERT model along with the clas-
sification layer.
Update only the weights of the classification layer and not the pre-
trained BERT model. When we do this, it becomes the same as using
the pre-trained BERT model as a feature extractor.
As we can observe from the preceding figure, we feed the tokens to the
pre-trained BERT model and get the embeddings of all the tokens. We take
the embedding of the [CLS] token and feed it to a feedforward network
with a softmax function and perform classification.
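The classification head described above can be sketched in plain Python. The weights here are random stand-ins (in real fine-tuning they are learned, optionally together with BERT's weights), and the embedding size is shrunk to 4 for brevity:

```python
import math
import random

# Sketch of the classification head: the [CLS] embedding is fed through
# a single feedforward (linear) layer followed by softmax. The weights
# below are random stand-ins; during fine-tuning they would be learned.
random.seed(0)
hidden_size, num_classes = 4, 2          # 4 instead of BERT's 768, for brevity
W = [[random.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(num_classes)]
b = [0.0] * num_classes

def classify(cls_embedding):
    # Linear layer: one logit per class.
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, cls_embedding)) + b_k
              for row, b_k in zip(W, b)]
    # Softmax over the logits.
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = classify([0.1, 0.2, 0.3, 0.4])   # stand-in [CLS] embedding
print(probs)                              # two probabilities summing to 1
```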
Let's explore how to fine-tune the pre-trained BERT model for a sentiment
analysis task with the IMDB dataset. The IMDB dataset consists of movie
reviews along with the respective sentiment of the review. We can also
access the complete code from the GitHub repository of the book. In order
to run the code smoothly, clone the GitHub repository of the book and run
the code using Google Colab.
Load the model and dataset. First, let's download and load the dataset using the nlp library:
!gdown https://drive.google.com/uc?id=11_M4ootuT7I1G0RlihcC0cA3Elqotlc-
dataset = load_dataset('csv', data_files='./imdbs.csv', split='train')
Let's check the type of our dataset:
type(dataset)
nlp.arrow_dataset.Dataset
Next, let's split the dataset into train and test sets:
dataset = dataset.train_test_split(test_size=0.3)
dataset
{
'test': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, ...),
'train': Dataset(features: {'text': Value(dtype='string', id=None), 'label': Value(dtype='int64', id=None)}, ...)
}
train_set = dataset['train']
test_set = dataset['test']
Next, let's download and load the pre-trained BERT model. In this exam-
ple, we use the pre-trained bert-base-uncased model. As we can see, since we
are performing sequence classification, we use the BertForSequenceClassification
class:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
Next, we download and load the tokenizer that was used to pre-train
the bert-base-uncased model.
As we can see, we create the tokenizer using the BertTokenizerFast class instead
of BertTokenizer. The BertTokenizerFast class has many advantages compared
to BertTokenizer. We will learn about this in the next section:
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
Now that we have loaded the dataset and model, let's preprocess the
dataset.
We can preprocess the dataset quickly using our tokenizer. For example,
consider the sentence I love Paris.
First, we tokenize the sentence and add a [CLS] token at the beginning and a [SEP] token at the end, as shown here:
tokens = [ [CLS], i, love, paris, [SEP] ]
Next, we map the tokens to the unique input IDs (token IDs). Suppose the following are the unique input IDs (token IDs):
input_ids = [101, 1045, 2293, 3000, 102]
Then, we need to add the segment IDs (token type IDs). Wait, what are
segment IDs? Suppose we have two sentences in the input. In that case,
segment IDs are used to distinguish one sentence from the other. All the
tokens from the first sentence will be mapped to 0 and all the tokens from
the second sentence will be mapped to 1. Since here we have only one
sentence, all the tokens will be mapped to 0 as shown here:
token_type_ids = [0, 0, 0, 0, 0]
attention_mask = [1, 1, 1, 1, 1]
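For a sentence pair, building the segment IDs can be sketched as follows (a minimal illustration, not the tokenizer's internal code):

```python
# Sketch: building segment IDs (token type IDs) for a sentence pair.
# Tokens of the first sentence (including [CLS] and its trailing [SEP])
# are mapped to 0, and tokens of the second sentence are mapped to 1.
def build_segment_ids(first_segment_tokens, second_segment_tokens):
    return [0] * len(first_segment_tokens) + [1] * len(second_segment_tokens)

first = ['[CLS]', 'he', 'is', 'playing', '[SEP]']
second = ['he', 'is', 'sleeping', '[SEP]']
print(build_segment_ids(first, second))
# [0, 0, 0, 0, 0, 1, 1, 1, 1]
```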
That's it. But instead of doing all the aforementioned steps manually, our tokenizer will do these steps for us. We just need to pass the sentence to the tokenizer as shown in the following code:
tokenizer('I love Paris')
The preceding code will return the following. As we can see, our input
sentence is tokenized and mapped to input_ids, token_type_ids, and also
attention_mask :
{
'input_ids': [101, 1045, 2293, 3000, 102],
'token_type_ids': [0, 0, 0, 0, 0],
'attention_mask': [1, 1, 1, 1, 1]
}
With the tokenizer, we can also pass any number of sentences and perform padding dynamically. To do that, we need to set padding to True and also set the maximum sequence length. For instance, the following output shows three sentences tokenized with the maximum sequence length, max_length, set to 5:
{
'input_ids': [[101, 1045, 2293, 3000, 102], [101, 5055, 4875, 102, 0], [101, 4586, 2991, 102, 0]],
'token_type_ids': [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]],
'attention_mask': [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0], [1, 1, 1, 1, 0]]
}
That's it – with the tokenizer, we can easily preprocess our dataset. So, we
define a function called preprocess to process the dataset as follows:
def preprocess(data):
return tokenizer(data['text'], padding=True, truncation=True)
Now, we preprocess the train and test sets using the preprocess function:
train_set = train_set.map(preprocess, batched=True, batch_size=len(train_set))
test_set = test_set.map(preprocess, batched=True, batch_size=len(test_set))
Next, we use the set_format function and select the columns that we need in
our dataset and the format we need them in as shown in the following
code:
train_set.set_format('torch',
columns=['input_ids', 'attention_mask', 'label'])
test_set.set_format('torch',
columns=['input_ids', 'attention_mask', 'label'])
That's it. Now that we have the dataset ready, let's train the model.
batch_size = 8
epochs = 2
warmup_steps = 500
weight_decay = 0.01
training_args = TrainingArguments(
output_dir='./results',
num_train_epochs=epochs,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
warmup_steps=warmup_steps,
weight_decay=weight_decay,
evaluate_during_training=True,
logging_dir='./logs',
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_set,
eval_dataset=test_set
)
trainer.train()
After training, we can evaluate the model using the evaluate function:
trainer.evaluate()
In this way, we can fine-tune the pre-trained BERT model. Now that we
have learned how to fine-tune BERT for a text classification task, in the
next section, let's see how to fine-tune the BERT model for Natural
Language Inference (NLI).
Consider the sample dataset shown in the following figure; as we can see,
we have a premise and a hypothesis with a label indicating whether they
are entailment, contradiction, or undetermined:
Figure 3.7 – Sample NLI dataset
Premise: He is playing
Hypothesis: He is sleeping
First, we tokenize the sentence pair, then add a [CLS] token at the begin-
ning of the first sentence and an [SEP] token at the end of every sentence.
The tokens are as follows:
tokens = [ [CLS], He, is, playing, [SEP], He, is, sleeping, [SEP] ]
Now, we feed the tokens to the pre-trained BERT model and get the em-
bedding of each token. We learned that the representation of the [CLS] to-
ken holds the aggregate representation.
So, we take the representation of the [CLS] token, denoted R_[CLS], and feed it to a classifier (feedforward network + softmax), which returns the probability of the sentence pair being an entailment, a contradiction, or neutral. Our results will not be accurate in the initial iterations, but over the course of multiple iterations, we will get better results:
Now that we have learned how to fine-tune BERT for NLI, in the next sec-
tion, we will learn how to fine-tune BERT for question-answering.
Question-answering
Now, our model has to extract an answer from the paragraph; it essen-
tially has to return the text span containing the answer. So, it should re-
turn the following:
Okay, how can we fine-tune the BERT model to do this task? To do this,
our model has to understand the starting and ending index of the text
span containing the answer in the given paragraph. For example, take the
question, "What is the immune system?" If our model understands that the
answer to this question starts from index 4 ("a") and ends at index 21
("disease"), then we can get the answer as shown here:
Now, how do we find the starting and ending index of the text span containing the answer? If we get the probability of each token (word) in the paragraph being the starting and ending token (word) of the answer, then we can easily extract the answer, right? Yes, but how can we achieve this? To do this, we use two vectors called the start vector, S, and the end vector, E. The values of the start and end vectors will be learned during training.
To compute this probability, for each token i, we compute the dot product between the representation of the token, R_i, and the start vector, S. Next, we apply the softmax function over the dot products of all the tokens in the paragraph and obtain the probability of each token being the starting token:
P_start(i) = e^(S·R_i) / Σ_j e^(S·R_j)
Next, we compute the starting index by selecting the index of the token that has the highest probability of being the starting token.
In a very similar fashion, we compute the probability of each token (word) in the paragraph being the ending token of the answer. For each token i, we compute the dot product between the representation of the token, R_i, and the end vector, E. Next, we apply the softmax function over the dot products of all the tokens and obtain the probability:
P_end(i) = e^(E·R_i) / Σ_j e^(E·R_j)
Next, we compute the ending index by selecting the index of the token that has the highest probability of being the ending token. Now, we can select the text span that contains the answer using the starting and ending index.
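The span-extraction step can be sketched as follows; the scores here are made-up numbers standing in for the dot products with the learned start and end vectors:

```python
# Sketch: extracting the answer span from per-token start and end
# scores, as described above. The scores are made-up stand-ins for the
# dot products with the learned start and end vectors.
def extract_span(tokens, start_scores, end_scores):
    start_index = start_scores.index(max(start_scores))  # argmax of start scores
    end_index = end_scores.index(max(end_scores))        # argmax of end scores
    return ' '.join(tokens[start_index:end_index + 1])

tokens = ['the', 'immune', 'system', 'is', 'a', 'protective', 'system']
start_scores = [0.1, 0.2, 0.1, 0.1, 2.5, 0.3, 0.2]   # highest at index 4
end_scores   = [0.1, 0.1, 0.2, 0.1, 0.2, 0.4, 3.0]   # highest at index 6
print(extract_span(tokens, start_scores, end_scores))  # a protective system
```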
After computing the embedding, we compute the dot product with the
start/end vectors, apply the softmax function, and obtain the probabilities
of each token in the paragraph being the start/end word as shown here:
From the preceding figure, we can see how we compute the probability of
each token in the paragraph being the start/end word. Next, we select the
text span containing the answer using the starting and ending indexes
with the highest probability. To get a better understanding of how this
works, let's see how to use the fine-tuned question-answering BERT
model in the next section.
In this section, let's learn how to perform question answering with a fine-
tuned question-answering BERT model. First, let's import the necessary
modules:
from transformers import BertForQuestionAnswering, BertTokenizer
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
Now that we have downloaded the model and tokenizer, let's preprocess
the input.
First, we define the input to BERT, which is the question and paragraph text:
question = "What is the immune system?"
paragraph = "The immune system is a system of many biological structures and processes within an organism that protects against disease"
Add a [CLS] token to the beginning of the question and an [SEP] token to the end of both the question and the paragraph:
question = '[CLS] ' + question + '[SEP]'
paragraph = paragraph + '[SEP]'
Now, we tokenize the question and the paragraph:
question_tokens = tokenizer.tokenize(question)
paragraph_tokens = tokenizer.tokenize(paragraph)
Combine the question and paragraph tokens and convert them to input_ids:
tokens = question_tokens + paragraph_tokens
input_ids = tokenizer.convert_tokens_to_ids(tokens)
Next, we define segment_ids. Here, segment_ids will be 0 for all the tokens of the question and 1 for all the tokens of the paragraph:
segment_ids = [0] * len(question_tokens) + [1] * len(paragraph_tokens)
Now, we convert input_ids and segment_ids to tensors:
input_ids = torch.tensor([input_ids])
segment_ids = torch.tensor([segment_ids])
Now that we have processed the input, let's feed it to the model and get
the result.
We feed input_ids and segment_ids to the model, which returns the start score and end score for all of the tokens:
start_scores, end_scores = model(input_ids, token_type_ids = segment_ids)
Now, we select start_index, which is the index of the token that has the high-
est start score, and end_index, which is the index of the token that has the
highest end score:
start_index = torch.argmax(start_scores)
end_index = torch.argmax(end_scores)
That's it! Now, we print the text span between the start and end indexes
as our answer:
print(' '.join(tokens[start_index:end_index+1]))
a system of many biological structures and processes within an organism that protects against disease
Now, let's learn how to fine-tune the pre-trained BERT model to perform
NER. First, we tokenize the sentence, then we add the [CLS] token at the
beginning and the [SEP] token at the end. Then, we feed the tokens to the
pre-trained BERT model and obtain the representation of every
token. Next, we feed those token representations to a classifier (feedfor-
ward network + softmax function). Then, the classifier returns the cate-
gory to which the named entity belongs. This is shown in the following
figure:
Figure 3.10 - Fine-tuning the pre-trained BERT model for NER
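The per-token classification step can be sketched as follows, with made-up scores standing in for the feedforward network's output over each token's representation:

```python
# Sketch: NER as token classification. Each token's representation is
# scored against the entity categories and we take the argmax per
# token. The scores are made-up stand-ins for the classifier's output.
labels = ['O', 'person', 'city']

def classify_tokens(tokens, score_matrix):
    # score_matrix[i][k]: score of token i for label k
    return {tok: labels[scores.index(max(scores))]
            for tok, scores in zip(tokens, score_matrix)}

tokens = ['jeremy', 'lives', 'in', 'paris']
scores = [[0.1, 2.0, 0.3],   # jeremy -> person
          [1.5, 0.2, 0.1],   # lives  -> O
          [1.7, 0.1, 0.2],   # in     -> O
          [0.2, 0.3, 2.2]]   # paris  -> city
print(classify_tokens(tokens, scores))
# {'jeremy': 'person', 'lives': 'O', 'in': 'O', 'paris': 'city'}
```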
Summary
Questions
Let's put our knowledge to the test. Try answering the following
questions: