
Google Cloud TPUv2

Pre-training BERT from scratch with cloud TPU


Denis Antyukhov
May 9, 2019 · 7 min read

In this experiment, we will be pre-training a state-of-the-art Natural Language
Understanding model, BERT, on arbitrary text data using Google Cloud infrastructure.

This guide covers all stages of the procedure, including:

1. Setting up the training environment

2. Downloading raw text data

3. Preprocessing text data

4. Learning a new vocabulary

5. Creating sharded pre-training data

6. Setting up GCS storage for data and model

7. Training the model on a cloud TPU


A few questions
What is this guide useful for?
With this guide, you will be able to train a BERT model on arbitrary text data. This is
useful if a pre-trained model for your language or use case is not available in open
source.

For whom is this guide?


This guide is intended for NLP researchers who are excited about BERT but are not
satisfied with the performance of the available open-source models.

How do I get started?


For persistent storage of the training data and model, you will need a Google Cloud
Storage bucket. Please follow the Google Cloud TPU quickstart to create a GCP account
and a GCS bucket. New Google Cloud users get $300 in free credit to get started with any
GCP product.

Steps 1–5 of this tutorial can be run without a GCS bucket for demonstration purposes.
In that case, however, you will not be able to train the model.

What would it take?


Pre-training a BERT-Base model on a TPUv2 will take about 54 hours. Google Colab is
not designed for executing such long-running jobs and will interrupt the training
process every 8 hours or so. For uninterrupted training, consider using a paid
preemptible TPUv2 instance.

That said, at the time of writing (09.05.2019), with a Colab TPU, pre-training a BERT
model from scratch can be achieved at a negligible cost: essentially just the cost of
storing the model and data in GCS (~1 USD).

How do I follow the guide?


The code below is a combination of Python and Bash. It was designed to run in a Colab
Jupyter environment. Therefore, it would be most convenient to simply run it there.

However, all steps of the guide, except for the actual training part, might be run on a
separate machine. This could be useful if your dataset is too large
(or too private) to preprocess inside a Colab environment.

OK, show me the code.


Here you go.

Do I need to change anything in the code?


The only parameter you really have to set is your GCS BUCKET_NAME. Everything else
has default values that should work for most use cases.

Now, let’s get to business.

. . .

Step 1: setting up training environment


First and foremost, we get the packages required to train the model.
The Jupyter environment allows executing bash commands directly from the notebook
by using an exclamation mark ‘!’, like this:

!pip install sentencepiece


!git clone https://github.com/google-research/bert

I will be exploiting this approach to make use of several other bash commands
throughout the experiment.
Now, let’s import the packages and authorize ourselves in Google Cloud.

import os
import sys
import json
import nltk
import random
import logging
import tensorflow as tf
import sentencepiece as spm

from glob import glob
from google.colab import auth, drive
from tensorflow.keras.utils import Progbar

sys.path.append("bert")

from bert import modeling, optimization, tokenization
from bert.run_pretraining import input_fn_builder, model_fn_builder

auth.authenticate_user()

# configure logging
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

# create formatter and add it to the handlers
formatter = logging.Formatter('%(asctime)s : %(message)s')
sh = logging.StreamHandler()
sh.setLevel(logging.INFO)
sh.setFormatter(formatter)
log.handlers = [sh]

if 'COLAB_TPU_ADDR' in os.environ:
  log.info("Using TPU runtime")
  USE_TPU = True
  TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']

  with tf.Session(TPU_ADDRESS) as session:
    log.info('TPU address is ' + TPU_ADDRESS)
    # Upload credentials to TPU.
    with open('/content/adc.json', 'r') as f:
      auth_info = json.load(f)
    tf.contrib.cloud.configure_gcs(session, credentials=auth_info)

else:
  log.warning('Not connected to TPU runtime')
  USE_TPU = False


Setting up BERT training environment

Step 2: getting the data


We proceed with obtaining a corpus of text data. For this experiment, we will be using
the OpenSubtitles dataset, which is available for 65 languages here.

Unlike more commonly used text datasets (like Wikipedia), it does not require any
complex pre-processing. It also comes pre-formatted with one sentence per line, which
is a requirement for the further processing steps.


Feel free to use the dataset for your language instead by setting the corresponding
language code.

AVAILABLE = {'af','ar','bg','bn','br','bs','ca','cs',
             'da','de','el','en','eo','es','et','eu',
             'fa','fi','fr','gl','he','hi','hr','hu',
             'hy','id','is','it','ja','ka','kk','ko',
             'lt','lv','mk','ml','ms','nl','no','pl',
             'pt','pt_br','ro','ru','si','sk','sl','sq',
             'sr','sv','ta','te','th','tl','tr','uk',
             'ur','vi','ze_en','ze_zh','zh','zh_cn',
             'zh_en','zh_tw','zh_zh'}

LANG_CODE = "en" #@param {type:"string"}

assert LANG_CODE in AVAILABLE, "Invalid language code selected"

!wget http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2016/mono/OpenSubtitles.raw.'$LANG_CODE'.gz -O dataset.txt.gz
!gzip -d dataset.txt.gz
!tail dataset.txt


Download OPUS data

For demonstration purposes, we will only use a small fraction of the whole corpus by
default.

When training the real model, make sure to uncheck the DEMO_MODE checkbox to
use a 100x larger dataset.

Rest assured, 100M lines are perfectly sufficient to train a reasonably good BERT-base
model.

DEMO_MODE = True #@param {type:"boolean"}

if DEMO_MODE:
  CORPUS_SIZE = 1000000
else:
  CORPUS_SIZE = 100000000 #@param {type: "integer"}

!(head -n $CORPUS_SIZE dataset.txt) > subdataset.txt
!mv subdataset.txt dataset.txt



Truncate dataset

Step 3: preprocessing text


The raw text data we have downloaded contains punctuation, uppercase letters and
non-UTF symbols which we will remove before proceeding. During inference, we will
apply the same procedure to new data.

If your use case requires different preprocessing (e.g. if uppercase letters or
punctuation are expected during inference), feel free to modify the function below to
accommodate your needs.

Define preprocessing routine
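
The exact routine is up to you. A minimal sketch, assuming we only need to lowercase the
text and strip punctuation (the nltk regex tokenizer used here is one possible choice,
not a requirement), might look like this:

import nltk

# keep word characters only; everything else is treated as punctuation
regex_tokenizer = nltk.RegexpTokenizer(r"\w+")

def normalize_text(text):
    # lowercase the text
    text = str(text).lower()
    # drop punctuation and other non-word symbols
    text = " ".join(regex_tokenizer.tokenize(text))
    return text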

Now let’s preprocess the whole dataset.

Apply preprocessing
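
A sketch of applying the routine above line by line; the file names RAW_DATA_FPATH and
PRC_DATA_FPATH are placeholders assumed here, and Progbar comes from the imports in Step 1:

RAW_DATA_FPATH = "dataset.txt"       # raw corpus downloaded in Step 2
PRC_DATA_FPATH = "proc_dataset.txt"  # where the processed corpus will be written

total_lines = sum(1 for _ in open(RAW_DATA_FPATH, encoding="utf-8"))
bar = Progbar(total_lines)

with open(RAW_DATA_FPATH, encoding="utf-8") as fi, \
     open(PRC_DATA_FPATH, "w", encoding="utf-8") as fo:
    for line in fi:
        processed = normalize_text(line)
        if processed:  # skip lines that became empty after cleaning
            fo.write(processed + "\n")
        bar.add(1)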

Step 4: building the vocabulary


For the next step, we will learn a new vocabulary that we will use to represent our
dataset.

The BERT paper uses a WordPiece tokenizer, which is not available in open source.
Instead, we will be using the SentencePiece tokenizer in unigram mode. While it is not
directly compatible with BERT, with a small hack we can make it work.

SentencePiece requires quite a lot of RAM, so running it on the full dataset in Colab will
crash the kernel. To avoid this, we will randomly subsample a fraction of the dataset for
building the vocabulary. Another option would be to use a machine with more RAM for
this step — that decision is up to you.

Also, SentencePiece adds BOS and EOS control symbols to the vocabulary by default.
We disable them explicitly by setting their indices to -1.

The typical values for VOC_SIZE are somewhere in between 32000 and 128000. We
reserve NUM_PLACEHOLDERS tokens in case one wants to update the vocabulary and
fine-tune the model after the pre-training phase is finished.

Learn SentencePiece vocabulary
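
A sketch of the training call: --input_sentence_size together with
--shuffle_input_sentence takes care of the random subsampling, and --bos_id=-1
--eos_id=-1 disables the BOS/EOS control symbols. PRC_DATA_FPATH is the preprocessed
corpus from Step 3; the subsample size and placeholder count below are illustrative
values rather than prescriptions.

PRC_DATA_FPATH = "proc_dataset.txt"  # preprocessed corpus from Step 3
MODEL_PREFIX = "tokenizer"           # produces tokenizer.model and tokenizer.vocab

VOC_SIZE = 32000          #@param {type:"integer"}
SUBSAMPLE_SIZE = 12800000 #@param {type:"integer"}
NUM_PLACEHOLDERS = 256    #@param {type:"integer"}

SPM_COMMAND = ('--input={} --model_prefix={} '
               '--vocab_size={} --input_sentence_size={} '
               '--shuffle_input_sentence=true '
               '--bos_id=-1 --eos_id=-1 '
               '--model_type=unigram').format(
                   PRC_DATA_FPATH, MODEL_PREFIX,
                   VOC_SIZE - NUM_PLACEHOLDERS, SUBSAMPLE_SIZE)

spm.SentencePieceTrainer.Train(SPM_COMMAND)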


Now, let’s see how we can make SentencePiece work for the BERT model.

Below is a sentence tokenized using the WordPiece vocabulary from a pre-trained
English BERT-base model from the official repo.

>>> wordpiece.tokenize("Colorless geothermal substations are generating furiously")

['color',
'##less',
'geo',
'##thermal',
'sub',
'##station',
'##s',
'are',
'generating',
'furiously']

As we can see, the WordPiece tokenizer prepends the subwords which occur in the
middle of words with ‘##’. The subwords occurring at the beginning of words are
unchanged. If the subword occurs both in the beginning and in the middle of words,
both versions (with and without ‘##’) are added to the vocabulary.

SentencePiece has created two files: tokenizer.model and tokenizer.vocab. Let’s have a
look at the learned vocabulary:

Read the learned SentencePiece vocabulary
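
One way to do that: tokenizer.vocab has one token per line with a tab-separated score,
so we keep only the first column and drop the leading <unk> entry (MODEL_PREFIX and
random come from the earlier cells):

def read_sentencepiece_vocab(filepath):
    voc = []
    with open(filepath, encoding='utf-8') as fi:
        for line in fi:
            # each line is "<token>\t<score>"
            voc.append(line.split("\t")[0])
    # skip the first <unk> token
    return voc[1:]

snt_vocab = read_sentencepiece_vocab("{}.vocab".format(MODEL_PREFIX))
print("Learnt vocab size: {}".format(len(snt_vocab)))
print("Sample tokens: {}".format(random.sample(snt_vocab, 10)))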

This gives:

Learnt vocab size: 31743


Sample tokens: ['▁cafe', '▁slippery', 'xious', '▁resonate',
'▁terrier', '▁feat', '▁frequencies', 'ainty', '▁punning', 'modern']

As we may observe, SentencePiece does quite the opposite of WordPiece. From the
documentation:

SentencePiece first escapes the whitespace with a meta-symbol “▁” (U+2581) as follows:

Hello▁World.

Then, this text is segmented into small pieces, for example:

[Hello] [▁Wor] [ld] [.]

Subwords which occur after whitespace (which are also those that most words begin
with) are prepended with ‘▁’, while others are unchanged. This excludes subwords
which only occur at the beginning of sentences and nowhere else. These cases should
be quite rare, however.

So, in order to obtain a vocabulary analogous to WordPiece, we need to perform a
simple conversion, removing “▁” from the tokens that contain it and adding “##” to
the ones that don’t.

We also add some special control symbols which are required by the BERT architecture.
By convention, we put those at the beginning of the vocabulary.

Additionally, we append some placeholder tokens to the vocabulary. Those are useful if
one wishes to update the pre-trained model with new, task-specific tokens. In that case,
the placeholder tokens are replaced with new real ones, the pre-training data is
re-generated, and the model is fine-tuned on new data.

Convert the vocabulary to use for BERT
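
A sketch of that conversion, reusing snt_vocab and VOC_SIZE from the cells above. Only
the five control symbols are mandated by BERT; the [UNUSED_N] naming of the placeholder
tokens is an assumption.

def parse_sentencepiece_token(token):
    # "▁xxx" marks a word-initial subword in SentencePiece;
    # WordPiece instead marks word-internal subwords with "##"
    if token.startswith("▁"):
        return token[1:]
    else:
        return "##" + token

bert_vocab = list(map(parse_sentencepiece_token, snt_vocab))

# control symbols required by BERT go at the beginning of the vocabulary
ctrl_symbols = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
bert_vocab = ctrl_symbols + bert_vocab

# pad the vocabulary with placeholder tokens up to VOC_SIZE
bert_vocab += ["[UNUSED_{}]".format(i) for i in range(VOC_SIZE - len(bert_vocab))]
print(len(bert_vocab))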

Finally, we write the obtained vocabulary to file.

Dump vocabulary to file
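
For example (VOC_FNAME is the file name assumed throughout the rest of this guide):

VOC_FNAME = "vocab.txt" #@param {type:"string"}

with open(VOC_FNAME, "w", encoding="utf-8") as fo:
    for token in bert_vocab:
        fo.write(token + "\n")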

Now let’s see how the new vocabulary works in practice:

>>> testcase = "Colorless geothermal substations are generating furiously"
>>> bert_tokenizer = tokenization.FullTokenizer(VOC_FNAME)
>>> bert_tokenizer.tokenize(testcase)

['color',
'##less',
'geo',
'##ther',
'##mal',
'sub',
'##station',
'##s',
'are',
'generat',
'##ing',
'furious',
'##ly']

Looking good!

Step 5: generating pre-training data


With the vocabulary at hand, we are ready to generate pre-training data for the BERT
model.
Since our dataset might be quite large, we will split it into shards:

Split the dataset
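
A sketch using the standard split utility; the shard size of 256,000 sentences per file
is an arbitrary choice:

!mkdir ./shards
!split -a 4 -l 256000 -d $PRC_DATA_FPATH ./shards/shard_
!ls ./shards/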

Now, for each shard we need to call the create_pretraining_data.py script from the BERT
repo. To that end, we will employ the xargs command.

Before we start generating, we need to set some parameters to pass to the script. You
can find out more about their meaning in the README.

Define parameters for pre-training data
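
These roughly mirror the flags of create_pretraining_data.py; the values below are
common choices for a first pre-training phase with sequence length 128, but treat them
as illustrative:

MAX_SEQ_LENGTH = 128  #@param {type:"integer"}
MASKED_LM_PROB = 0.15 #@param
MAX_PREDICTIONS = 20  #@param {type:"integer"}
DO_LOWER_CASE = True  #@param {type:"boolean"}

PRETRAINING_DIR = "pretraining_data" #@param {type:"string"}
# controls how many parallel processes xargs is allowed to spawn
PROCESSES = 2 #@param {type:"integer"}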

Running this might take quite some time depending on the size of your dataset.

Create pre-training data
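
A sketch of the xargs invocation, assuming the shard directory and the parameters from
the previous cells; xargs substitutes each shard name for the {} placeholders, and the
random seed and dupe factor below are arbitrary:

XARGS_CMD = ("ls ./shards/ | "
             "xargs -n 1 -P {} -I{} "
             "python3 bert/create_pretraining_data.py "
             "--input_file=./shards/{} "
             "--output_file={}/{}.tfrecord "
             "--vocab_file={} "
             "--do_lower_case={} "
             "--max_predictions_per_seq={} "
             "--max_seq_length={} "
             "--masked_lm_prob={} "
             "--random_seed=34 "
             "--dupe_factor=5")

# the literal '{}' arguments keep the xargs placeholders intact
XARGS_CMD = XARGS_CMD.format(PROCESSES, '{}', '{}', PRETRAINING_DIR, '{}',
                             VOC_FNAME, DO_LOWER_CASE,
                             MAX_PREDICTIONS, MAX_SEQ_LENGTH, MASKED_LM_PROB)

tf.gfile.MkDir(PRETRAINING_DIR)
!$XARGS_CMD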

Step 6: setting up persistent storage


To preserve our hard-earned assets, we will persist them to Google Cloud Storage.
Provided that you have created the GCS bucket, this should be easy.

We will create two directories in GCS, one for the data and one for the model. In the
model directory, we will put the model vocabulary and configuration file.

Configure your BUCKET_NAME variable here before proceeding; otherwise, you will not be
able to train the model.


Configure GCS bucket name
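
A sketch; the bucket name below is just a placeholder and must be replaced with your own:

BUCKET_NAME = "bert_resources" #@param {type:"string"}
MODEL_DIR = "bert_model"       #@param {type:"string"}

tf.gfile.MkDir(MODEL_DIR)

if not BUCKET_NAME:
    log.warning("WARNING: BUCKET_NAME is not set. "
                "You will not be able to train the model.")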

Below is the sample hyperparameter configuration for BERT-base. Change at your own
risk!

Configure BERT hyperparameters and save to disk
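
A sketch that writes a standard BERT-Base configuration next to the vocabulary, reusing
MODEL_DIR, VOC_FNAME and bert_vocab from the earlier cells; the only value tied to our
experiment is vocab_size:

# standard BERT-Base hyperparameters
bert_base_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": VOC_SIZE
}

with open("{}/bert_config.json".format(MODEL_DIR), "w") as fo:
    json.dump(bert_base_config, fo, indent=2)

# keep a copy of the vocabulary next to the config
with open("{}/{}".format(MODEL_DIR, VOC_FNAME), "w", encoding="utf-8") as fo:
    for token in bert_vocab:
        fo.write(token + "\n")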

Now, we are ready to push our assets to GCS.

Upload assets to GCS
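
A minimal sketch using gsutil, with the directory names assumed above:

if BUCKET_NAME:
  !gsutil -m cp -r $MODEL_DIR $PRETRAINING_DIR gs://$BUCKET_NAME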

Step 7: training the model


We are almost ready to begin training our model.

Be advised that some of the parameters from previous steps are duplicated here, so as
to allow for convenient restart of the training procedure.

Make sure that the parameters set are exactly the same across the experiment.

Configure training run
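
A sketch of the configuration cell. The learning rate and the 1M-step budget follow the
values reported in the BERT paper; the batch size, checkpoint frequency and directory
names are illustrative assumptions that should match whatever you used in the previous
steps.

BUCKET_NAME = "bert_resources"       #@param {type:"string"}
MODEL_DIR = "bert_model"             #@param {type:"string"}
PRETRAINING_DIR = "pretraining_data" #@param {type:"string"}
VOC_FNAME = "vocab.txt"              #@param {type:"string"}

# input data pipeline config
TRAIN_BATCH_SIZE = 128 #@param {type:"integer"}
MAX_PREDICTIONS = 20   #@param {type:"integer"}
MAX_SEQ_LENGTH = 128   #@param {type:"integer"}
MASKED_LM_PROB = 0.15  #@param

# training procedure config
EVAL_BATCH_SIZE = 64
LEARNING_RATE = 1e-4
TRAIN_STEPS = 1000000          #@param {type:"integer"}
SAVE_CHECKPOINTS_STEPS = 2500  #@param {type:"integer"}
NUM_TPU_CORES = 8

BUCKET_PATH = "gs://{}".format(BUCKET_NAME) if BUCKET_NAME else "."
BERT_GCS_DIR = "{}/{}".format(BUCKET_PATH, MODEL_DIR)
DATA_GCS_DIR = "{}/{}".format(BUCKET_PATH, PRETRAINING_DIR)

VOCAB_FILE = os.path.join(BERT_GCS_DIR, VOC_FNAME)
CONFIG_FILE = os.path.join(BERT_GCS_DIR, "bert_config.json")

# pick up the latest checkpoint, if any, so training can be restarted
INIT_CHECKPOINT = tf.train.latest_checkpoint(BERT_GCS_DIR)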

Prepare the training run configuration, build the estimator and input function, power
up the bass cannon.

Build estimator model and input function
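
A sketch built from the helpers imported in Step 1 (model_fn_builder and
input_fn_builder from bert.run_pretraining) and the constants assumed in the
configuration cell above; the warmup step count is the value used in the BERT paper:

bert_config = modeling.BertConfig.from_json_file(CONFIG_FILE)
input_files = tf.gfile.Glob(os.path.join(DATA_GCS_DIR, '*tfrecord'))

log.info("Using checkpoint: {}".format(INIT_CHECKPOINT))
log.info("Using {} data shards".format(len(input_files)))

model_fn = model_fn_builder(
    bert_config=bert_config,
    init_checkpoint=INIT_CHECKPOINT,
    learning_rate=LEARNING_RATE,
    num_train_steps=TRAIN_STEPS,
    num_warmup_steps=10000,
    use_tpu=USE_TPU,
    use_one_hot_embeddings=True)

tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)

run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=BERT_GCS_DIR,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=SAVE_CHECKPOINTS_STEPS,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=USE_TPU,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE)

train_input_fn = input_fn_builder(
    input_files=input_files,
    max_seq_length=MAX_SEQ_LENGTH,
    max_predictions_per_seq=MAX_PREDICTIONS,
    is_training=True)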

Execute!

Execute BERT training procedure
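
With the estimator and input function from the previous cell, the training call itself
is a one-liner:

estimator.train(input_fn=train_input_fn, max_steps=TRAIN_STEPS)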

Training the model with the default parameters for 1 million steps
will take ~54 hours of run time. In case the kernel restarts for some reason, you may
always continue training from the latest checkpoint.

This concludes the guide to pre-training BERT from scratch on a cloud TPU.

Next steps

Okay, we’ve trained the model, now what?

That is a topic for a whole new discussion. There are a couple of things you could do:

1. Use the pre-trained model as a general-purpose NLU module

2. Fine-tune the model for some specific classification task

3. Create another DL model using BERT as a building block

4. ???

The really fun stuff is still to come, so stay woke. Meanwhile, check out the awesome
bert-as-service project and start serving your newly trained model in production.

Keep learning!

Other guides in this series


1. Pre-training BERT from scratch with cloud TPU [you are here]

2. Building a Search Engine with BERT and Tensorflow

3. Fine-tuning BERT with Keras and tf.Module

4. Improving sentence embeddings with BERT and Representation Learning
