
Google Cloud TPUv2

Pre-training BERT from scratch with cloud TPU


Denis Antyukhov
May 9, 2019 · 7 min read

In this experiment, we will be pre-training a state-of-the-art Natural Language
Understanding model, BERT, on arbitrary text data using Google Cloud infrastructure.

This guide covers all stages of the procedure, including:

1. Setting up the training environment

2. Downloading raw text data

3. Preprocessing text data

4. Learning a new vocabulary

5. Creating sharded pre-training data

6. Setting up GCS storage for data and model

7. Training the model on a cloud TPU


A few questions
What is this guide useful for?
With this guide, you will be able to train a BERT model on arbitrary text data. This is
useful if a pre-trained model for your language or use case is not available in open
source.

For whom is this guide?


This guide is intended for NLP researchers who are excited about BERT but are not
satisfied with the performance of the available open-source models.

How do I get started?


For persistent storage of the training data and model, you will need a Google Cloud
Storage bucket. Please follow the Google Cloud TPU quickstart to create a GCP account
and a GCS bucket. New Google Cloud users get $300 in free credit to get started with any
GCP product.

Steps 1–5 of this tutorial can be run without a GCS bucket for demonstration purposes.
In that case, however, you will not be able to train the model.

What would it take?


Pre-training a BERT-Base model on a TPUv2 will take about 54 hours. Google Colab is
not designed for executing such long-running jobs and will interrupt the training
process every 8 hours or so. For uninterrupted training, consider using a paid
preemptible TPUv2 instance.

That said, at the time of writing (09.05.2019), with a Colab TPU, pre-training a BERT
model from scratch can be achieved at a negligible cost: essentially just the cost of
storing the model and data in GCS (~1 USD).

How do I follow the guide?


The code below is a combination of Python and Bash. It was designed to run in a Colab
Jupyter environment. Therefore, it would be most convenient to simply run it there.

However, all steps of the guide, except for the actual training part, might be run on a
separate machine. This could be useful if your dataset is too large
(or too private) to preprocess inside a Colab environment.

OK, show me the code.


Here you go.

Do I need to change anything in the code?


The only parameter you really have to set is your GCS BUCKET_NAME. Everything else
has default values that should work for most use cases.

Now, let’s get to business.

. . .

Step 1: setting up training environment


First and foremost, we get the packages required to train the model.
The Jupyter environment allows executing bash commands directly from the notebook
by using an exclamation mark ‘!’, like this:

!pip install sentencepiece


!git clone https://github.com/google-research/bert

I will be exploiting this approach to make use of several other bash commands
throughout the experiment.
Now, let’s import the packages and authorize ourselves in Google Cloud.

import os
import sys
import json
import nltk
import random
import logging
import tensorflow as tf
import sentencepiece as spm

from glob import glob
from google.colab import auth, drive
from tensorflow.keras.utils import Progbar

sys.path.append("bert")

from bert import modeling, optimization, tokenization
from bert.run_pretraining import input_fn_builder, model_fn_builder

auth.authenticate_user()

# configure logging
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

# create formatter and add it to the handlers
formatter = logging.Formatter('%(asctime)s : %(message)s')
sh = logging.StreamHandler()
sh.setLevel(logging.INFO)
sh.setFormatter(formatter)
log.handlers = [sh]

if 'COLAB_TPU_ADDR' in os.environ:
  log.info("Using TPU runtime")
  USE_TPU = True
  TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']

  with tf.Session(TPU_ADDRESS) as session:
    log.info('TPU address is ' + TPU_ADDRESS)
    # Upload credentials to TPU.
    with open('/content/adc.json', 'r') as f:
      auth_info = json.load(f)
    tf.contrib.cloud.configure_gcs(session, credentials=auth_info)

else:
  log.warning('Not connected to TPU runtime')
  USE_TPU = False


Setting up BERT training environment

Step 2: getting the data


We proceed with obtaining a corpus of text data. For this experiment, we will be using
the OpenSubtitles dataset, which is available for 65 languages here.

Unlike more commonly used text datasets (like Wikipedia), it does not require any
complex pre-processing. It also comes pre-formatted with one sentence per line, which
is a requirement for the further processing steps.


Feel free to use the dataset for your language instead by setting the corresponding
language code.

AVAILABLE = {'af','ar','bg','bn','br','bs','ca','cs',
             'da','de','el','en','eo','es','et','eu',
             'fa','fi','fr','gl','he','hi','hr','hu',
             'hy','id','is','it','ja','ka','kk','ko',
             'lt','lv','mk','ml','ms','nl','no','pl',
             'pt','pt_br','ro','ru','si','sk','sl','sq',
             'sr','sv','ta','te','th','tl','tr','uk',
             'ur','vi','ze_en','ze_zh','zh','zh_cn',
             'zh_en','zh_tw','zh_zh'}

LANG_CODE = "en" #@param {type:"string"}

assert LANG_CODE in AVAILABLE, "Invalid language code selected"

!wget http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2016/mono/OpenSubtitles.raw.'$LANG_CODE'.gz -O dataset.txt.gz
!gzip -d dataset.txt.gz
!tail dataset.txt


Download OPUS data

For demonstration purposes, we will only use a small fraction of the whole corpus by
default.

When training the real model, make sure to uncheck the DEMO_MODE checkbox to
use a 100x larger dataset.

Rest assured, 100M lines are perfectly sufficient to train a reasonably good BERT-base
model.

DEMO_MODE = True #@param {type:"boolean"}

if DEMO_MODE:
  CORPUS_SIZE = 1000000
else:
  CORPUS_SIZE = 100000000 #@param {type: "integer"}

!(head -n $CORPUS_SIZE dataset.txt) > subdataset.txt
!mv subdataset.txt dataset.txt



Truncate dataset

Step 3: preprocessing text


The raw text data we have downloaded contains punctuation, uppercase letters and
non-UTF symbols which we will remove before proceeding. During inference, we will
apply the same procedure to new data.

If your use case requires different preprocessing (e.g. if uppercase letters or
punctuation are expected during inference), feel free to modify the function below to
accommodate your needs.

Define preprocessing routine
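
The exact routine is up to you. A minimal sketch, assuming we only need to lowercase the
text and strip punctuation (the nltk regex tokenizer used here is one possible choice,
not a requirement), might look like this:

import nltk

# keep word characters only; everything else is treated as punctuation
regex_tokenizer = nltk.RegexpTokenizer(r"\w+")

def normalize_text(text):
    # lowercase the text
    text = str(text).lower()
    # drop punctuation and other non-word symbols
    text = " ".join(regex_tokenizer.tokenize(text))
    return text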

Now let’s preprocess the whole dataset.

Apply preprocessing
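
A sketch of applying the routine above line by line; the file names RAW_DATA_FPATH and
PRC_DATA_FPATH are placeholders assumed here, and Progbar comes from the imports in Step 1:

RAW_DATA_FPATH = "dataset.txt"       # raw corpus downloaded in Step 2
PRC_DATA_FPATH = "proc_dataset.txt"  # where the processed corpus will be written

total_lines = sum(1 for _ in open(RAW_DATA_FPATH, encoding="utf-8"))
bar = Progbar(total_lines)

with open(RAW_DATA_FPATH, encoding="utf-8") as fi, \
     open(PRC_DATA_FPATH, "w", encoding="utf-8") as fo:
    for line in fi:
        processed = normalize_text(line)
        if processed:  # skip lines that became empty after cleaning
            fo.write(processed + "\n")
        bar.add(1)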

Step 4: building the vocabulary


For the next step, we will learn a new vocabulary that we will use to represent our
dataset.

The BERT paper uses a WordPiece tokenizer, which is not available in open source.
Instead, we will be using the SentencePiece tokenizer in unigram mode. While it is not
directly compatible with BERT, with a small hack we can make it work.

SentencePiece requires quite a lot of RAM, so running it on the full dataset in Colab will
crash the kernel. To avoid this, we will randomly subsample a fraction of the dataset for
building the vocabulary. Another option would be to use a machine with more RAM for
this step — that decision is up to you.

Also, SentencePiece adds BOS and EOS control symbols to the vocabulary by default.
We disable them explicitly by setting their indices to -1.

The typical values for VOC_SIZE are somewhere in between 32000 and 128000. We
reserve NUM_PLACEHOLDERS tokens in case one wants to update the vocabulary and
fine-tune the model after the pre-training phase is finished.

Learn SentencePiece vocabulary
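
A sketch of the training call: --input_sentence_size together with
--shuffle_input_sentence takes care of the random subsampling, and --bos_id=-1
--eos_id=-1 disables the BOS/EOS control symbols. PRC_DATA_FPATH is the preprocessed
corpus from Step 3; the subsample size and placeholder count below are illustrative
values rather than prescriptions.

PRC_DATA_FPATH = "proc_dataset.txt"  # preprocessed corpus from Step 3
MODEL_PREFIX = "tokenizer"           # produces tokenizer.model and tokenizer.vocab

VOC_SIZE = 32000          #@param {type:"integer"}
SUBSAMPLE_SIZE = 12800000 #@param {type:"integer"}
NUM_PLACEHOLDERS = 256    #@param {type:"integer"}

SPM_COMMAND = ('--input={} --model_prefix={} '
               '--vocab_size={} --input_sentence_size={} '
               '--shuffle_input_sentence=true '
               '--bos_id=-1 --eos_id=-1 '
               '--model_type=unigram').format(
                   PRC_DATA_FPATH, MODEL_PREFIX,
                   VOC_SIZE - NUM_PLACEHOLDERS, SUBSAMPLE_SIZE)

spm.SentencePieceTrainer.Train(SPM_COMMAND)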


Now, let’s see how we can make SentencePiece work for the BERT model.

Below is a sentence tokenized using the WordPiece vocabulary from a pre-trained
English BERT-base model from the official repo.

>>> wordpiece.tokenize("Colorless geothermal substations are generating furiously")

['color',
'##less',
'geo',
'##thermal',
'sub',
'##station',
'##s',
'are',
'generating',
'furiously']

As we can see, the WordPiece tokenizer prepends the subwords which occur in the
middle of words with ‘##’. The subwords occurring at the beginning of words are
unchanged. If the subword occurs both in the beginning and in the middle of words,
both versions (with and without ‘##’) are added to the vocabulary.

SentencePiece has created two files: tokenizer.model and tokenizer.vocab. Let’s have a
look at the learned vocabulary:

Read the learned SentencePiece vocabulary
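
One way to do that: tokenizer.vocab has one token per line with a tab-separated score,
so we keep only the first column and drop the leading <unk> entry (MODEL_PREFIX and
random come from the earlier cells):

def read_sentencepiece_vocab(filepath):
    voc = []
    with open(filepath, encoding='utf-8') as fi:
        for line in fi:
            # each line is "<token>\t<score>"
            voc.append(line.split("\t")[0])
    # skip the first <unk> token
    return voc[1:]

snt_vocab = read_sentencepiece_vocab("{}.vocab".format(MODEL_PREFIX))
print("Learnt vocab size: {}".format(len(snt_vocab)))
print("Sample tokens: {}".format(random.sample(snt_vocab, 10)))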

This gives:

Learnt vocab size: 31743


Sample tokens: ['▁cafe', '▁slippery', 'xious', '▁resonate',
'▁terrier', '▁feat', '▁frequencies', 'ainty', '▁punning', 'modern']

As we may observe, SentencePiece does quite the opposite of WordPiece. From the
documentation:

SentencePiece first escapes the whitespace with a meta-symbol “▁” (U+2581) as follows:

Hello▁World.

Then, this text is segmented into small pieces, for example:

[Hello] [▁Wor] [ld] [.]

Subwords which occur after whitespace (which are also those that most words begin
with) are prepended with ‘▁’, while others are unchanged. This excludes subwords
which only occur at the beginning of sentences and nowhere else. These cases should
be quite rare, however.

So, in order to obtain a vocabulary analogous to WordPiece, we need to perform a
simple conversion, removing “▁” from the tokens that contain it and adding “##” to
the ones that don’t.

We also add some special control symbols which are required by the BERT architecture.
By convention, we put those at the beginning of the vocabulary.

Additionally, we append some placeholder tokens to the vocabulary. Those are useful if
one wishes to update the pre-trained model with new, task-specific tokens. In that case,
the placeholder tokens are replaced with new real ones, the pre-training data is
re-generated, and the model is fine-tuned on new data.

Convert the vocabulary to use for BERT
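
A sketch of that conversion, reusing snt_vocab and VOC_SIZE from the cells above. Only
the five control symbols are mandated by BERT; the [UNUSED_N] naming of the placeholder
tokens is an assumption.

def parse_sentencepiece_token(token):
    # "▁xxx" marks a word-initial subword in SentencePiece;
    # WordPiece instead marks word-internal subwords with "##"
    if token.startswith("▁"):
        return token[1:]
    else:
        return "##" + token

bert_vocab = list(map(parse_sentencepiece_token, snt_vocab))

# control symbols required by BERT go at the beginning of the vocabulary
ctrl_symbols = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
bert_vocab = ctrl_symbols + bert_vocab

# pad the vocabulary with placeholder tokens up to VOC_SIZE
bert_vocab += ["[UNUSED_{}]".format(i) for i in range(VOC_SIZE - len(bert_vocab))]
print(len(bert_vocab))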

Finally, we write the obtained vocabulary to file.

Dump vocabulary to file
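
For example (VOC_FNAME is the file name assumed throughout the rest of this guide):

VOC_FNAME = "vocab.txt" #@param {type:"string"}

with open(VOC_FNAME, "w", encoding="utf-8") as fo:
    for token in bert_vocab:
        fo.write(token + "\n")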

Now let’s see how the new vocabulary works in practice:

>>> testcase = "Colorless geothermal substations are generating furiously"
>>> bert_tokenizer = tokenization.FullTokenizer(VOC_FNAME)
>>> bert_tokenizer.tokenize(testcase)

['color',
'##less',
'geo',
'##ther',
'##mal',
'sub',
'##station',
'##s',
'are',
'generat',
'##ing',
'furious',
'##ly']

Looking good!

Step 5: generating pre-training data


With the vocabulary at hand, we are ready to generate pre-training data for the BERT
model.
Since our dataset might be quite large, we will split it into shards:

Split the dataset
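
A sketch using the standard split utility; the shard size of 256,000 sentences per file
is an arbitrary choice:

!mkdir ./shards
!split -a 4 -l 256000 -d $PRC_DATA_FPATH ./shards/shard_
!ls ./shards/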

Now, for each shard we need to call the create_pretraining_data.py script from the BERT
repo. To that end, we will employ the xargs command.

Before we start generating, we need to set some parameters to pass to the script. You
can find out more about their meaning in the README.

Define parameters for pre-training data
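
These roughly mirror the flags of create_pretraining_data.py; the values below are
common choices for a first pre-training phase with sequence length 128, but treat them
as illustrative:

MAX_SEQ_LENGTH = 128  #@param {type:"integer"}
MASKED_LM_PROB = 0.15 #@param
MAX_PREDICTIONS = 20  #@param {type:"integer"}
DO_LOWER_CASE = True  #@param {type:"boolean"}

PRETRAINING_DIR = "pretraining_data" #@param {type:"string"}
# controls how many parallel processes xargs is allowed to spawn
PROCESSES = 2 #@param {type:"integer"}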

Running this might take quite some time depending on the size of your dataset.

Create pre-training data
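
A sketch of the xargs invocation, assuming the shard directory and the parameters from
the previous cells; xargs substitutes each shard name for the {} placeholders, and the
random seed and dupe factor below are arbitrary:

XARGS_CMD = ("ls ./shards/ | "
             "xargs -n 1 -P {} -I{} "
             "python3 bert/create_pretraining_data.py "
             "--input_file=./shards/{} "
             "--output_file={}/{}.tfrecord "
             "--vocab_file={} "
             "--do_lower_case={} "
             "--max_predictions_per_seq={} "
             "--max_seq_length={} "
             "--masked_lm_prob={} "
             "--random_seed=34 "
             "--dupe_factor=5")

# the literal '{}' arguments keep the xargs placeholders intact
XARGS_CMD = XARGS_CMD.format(PROCESSES, '{}', '{}', PRETRAINING_DIR, '{}',
                             VOC_FNAME, DO_LOWER_CASE,
                             MAX_PREDICTIONS, MAX_SEQ_LENGTH, MASKED_LM_PROB)

tf.gfile.MkDir(PRETRAINING_DIR)
!$XARGS_CMD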

Step 6: setting up persistent storage


To preserve our hard-earned assets, we will persist them to Google Cloud Storage.
Provided that you have created the GCS bucket, this should be easy.

We will create two directories in GCS, one for the data and one for the model. In the
model directory, we will put the model vocabulary and configuration file.

Configure your BUCKET_NAME variable here before proceeding; otherwise, you will not be
able to train the model.


Configure GCS bucket name
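
A sketch; the bucket name below is just a placeholder and must be replaced with your own:

BUCKET_NAME = "bert_resources" #@param {type:"string"}
MODEL_DIR = "bert_model"       #@param {type:"string"}

tf.gfile.MkDir(MODEL_DIR)

if not BUCKET_NAME:
    log.warning("WARNING: BUCKET_NAME is not set. "
                "You will not be able to train the model.")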

Below is the sample hyperparameter configuration for BERT-base. Change at your own
risk!

Configure BERT hyperparameters and save to disk
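
A sketch that writes a standard BERT-Base configuration next to the vocabulary, reusing
MODEL_DIR, VOC_FNAME and bert_vocab from the earlier cells; the only value tied to our
experiment is vocab_size:

# standard BERT-Base hyperparameters
bert_base_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": VOC_SIZE
}

with open("{}/bert_config.json".format(MODEL_DIR), "w") as fo:
    json.dump(bert_base_config, fo, indent=2)

# keep a copy of the vocabulary next to the config
with open("{}/{}".format(MODEL_DIR, VOC_FNAME), "w", encoding="utf-8") as fo:
    for token in bert_vocab:
        fo.write(token + "\n")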

Now, we are ready to push our assets to GCS.

Upload assets to GCS
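
A minimal sketch using gsutil, with the directory names assumed above:

if BUCKET_NAME:
  !gsutil -m cp -r $MODEL_DIR $PRETRAINING_DIR gs://$BUCKET_NAME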

Step 7: training the model


We are almost ready to begin training our model.

Be advised that some of the parameters from previous steps are duplicated here, so as
to allow for convenient restart of the training procedure.

Make sure that the parameters set are exactly the same across the experiment.

Configure training run
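
A sketch of the configuration cell. The learning rate and the 1M-step budget follow the
values reported in the BERT paper; the batch size, checkpoint frequency and directory
names are illustrative assumptions that should match whatever you used in the previous
steps.

BUCKET_NAME = "bert_resources"       #@param {type:"string"}
MODEL_DIR = "bert_model"             #@param {type:"string"}
PRETRAINING_DIR = "pretraining_data" #@param {type:"string"}
VOC_FNAME = "vocab.txt"              #@param {type:"string"}

# input data pipeline config
TRAIN_BATCH_SIZE = 128 #@param {type:"integer"}
MAX_PREDICTIONS = 20   #@param {type:"integer"}
MAX_SEQ_LENGTH = 128   #@param {type:"integer"}
MASKED_LM_PROB = 0.15  #@param

# training procedure config
EVAL_BATCH_SIZE = 64
LEARNING_RATE = 1e-4
TRAIN_STEPS = 1000000          #@param {type:"integer"}
SAVE_CHECKPOINTS_STEPS = 2500  #@param {type:"integer"}
NUM_TPU_CORES = 8

BUCKET_PATH = "gs://{}".format(BUCKET_NAME) if BUCKET_NAME else "."
BERT_GCS_DIR = "{}/{}".format(BUCKET_PATH, MODEL_DIR)
DATA_GCS_DIR = "{}/{}".format(BUCKET_PATH, PRETRAINING_DIR)

VOCAB_FILE = os.path.join(BERT_GCS_DIR, VOC_FNAME)
CONFIG_FILE = os.path.join(BERT_GCS_DIR, "bert_config.json")

# pick up the latest checkpoint, if any, so training can be restarted
INIT_CHECKPOINT = tf.train.latest_checkpoint(BERT_GCS_DIR)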

Prepare the training run configuration, build the estimator and input function, power
up the bass cannon.

Build estimator model and input function
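
A sketch built from the helpers imported in Step 1 (model_fn_builder and
input_fn_builder from bert.run_pretraining) and the constants assumed in the
configuration cell above; the warmup step count is the value used in the BERT paper:

bert_config = modeling.BertConfig.from_json_file(CONFIG_FILE)
input_files = tf.gfile.Glob(os.path.join(DATA_GCS_DIR, '*tfrecord'))

log.info("Using checkpoint: {}".format(INIT_CHECKPOINT))
log.info("Using {} data shards".format(len(input_files)))

model_fn = model_fn_builder(
    bert_config=bert_config,
    init_checkpoint=INIT_CHECKPOINT,
    learning_rate=LEARNING_RATE,
    num_train_steps=TRAIN_STEPS,
    num_warmup_steps=10000,
    use_tpu=USE_TPU,
    use_one_hot_embeddings=True)

tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)

run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=BERT_GCS_DIR,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=SAVE_CHECKPOINTS_STEPS,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=USE_TPU,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE)

train_input_fn = input_fn_builder(
    input_files=input_files,
    max_seq_length=MAX_SEQ_LENGTH,
    max_predictions_per_seq=MAX_PREDICTIONS,
    is_training=True)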

Execute!

Execute BERT training procedure
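
With the estimator and input function from the previous cell, the training call itself
is a one-liner:

estimator.train(input_fn=train_input_fn, max_steps=TRAIN_STEPS)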

Training the model with the default parameters for 1 million steps
will take ~54 hours of run time. In case the kernel restarts for some reason, you may
always continue training from the latest checkpoint.

This concludes the guide to pre-training BERT from scratch on a cloud TPU.

Next steps

Okay, we’ve trained the model, now what?

That is a topic for a whole new discussion. There are a couple of things you could do:

1. Use the pre-trained model as a general-purpose NLU module

2. Fine-tune the model for some specific classification task

3. Create another DL model using BERT as a building block

4. ???

The really fun stuff is still to come, so stay woke. Meanwhile, check out the awesome
bert-as-service project and start serving your newly trained model in production.

Keep learning!

Other guides in this series


1. Pre-training BERT from scratch with cloud TPU [you are here]

2. Building a Search Engine with BERT and Tensorflow

3. Fine-tuning BERT with Keras and tf.Module

4. Improving sentence embeddings with BERT and Representation Learning
