Pre-training BERT from scratch with cloud TPU
A few questions
What is this guide useful for?
With this guide, you will be able to train a BERT model on arbitrary text data. This is
useful if a pre-trained model for your language or use case is not available in open
source.
Steps 1–5 of this tutorial can be run without a GCS bucket for demonstration purposes.
In that case, however, you will not be able to train the model.
That said, at the time of writing (09.05.2019), pre-training a BERT model from scratch on a Colab TPU can be achieved at essentially just the cost of storing the model and data in GCS (~1 USD).
However, all steps of the guide, except for the actual training part, may be run on a separate machine. This could be useful if your dataset is too large to be pre-processed in the Colab runtime.
. . .
I will be exploiting this approach to make use of several other bash commands
throughout the experiment.
Now, let’s import the packages and authorize ourselves in Google Cloud.
import os
import sys
import json
import nltk
import random
import logging
import tensorflow as tf
import sentencepiece as spm

from glob import glob
from google.colab import auth, drive
from tensorflow.keras.utils import Progbar

sys.path.append("bert")

from bert import modeling, optimization, tokenization
from bert.run_pretraining import input_fn_builder, model_fn_builder

auth.authenticate_user()

# configure logging
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

# create formatter and add it to the handlers
formatter = logging.Formatter('%(asctime)s : %(message)s')
sh = logging.StreamHandler()
sh.setLevel(logging.INFO)
sh.setFormatter(formatter)
log.handlers = [sh]

if 'COLAB_TPU_ADDR' in os.environ:
  log.info("Using TPU runtime")
  USE_TPU = True
  TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']

  with tf.Session(TPU_ADDRESS) as session:
    log.info('TPU address is ' + TPU_ADDRESS)
    # Upload credentials to TPU.
    with open('/content/adc.json', 'r') as f:
      auth_info = json.load(f)
    tf.contrib.cloud.configure_gcs(session, credentials=auth_info)

else:
  log.warning('Not connected to TPU runtime')
  USE_TPU = False
Unlike more commonly used text datasets (like Wikipedia), the OpenSubtitles dataset does not require any complex pre-processing. It also comes pre-formatted with one sentence per line, which is a requirement for further processing steps.
Feel free to use the dataset for your language instead by setting the corresponding
language code.
AVAILABLE = {'af','ar','bg','bn','br','bs','ca','cs',
             'da','de','el','en','eo','es','et','eu',
             'fa','fi','fr','gl','he','hi','hr','hu',
             'hy','id','is','it','ja','ka','kk','ko',
             'lt','lv','mk','ml','ms','nl','no','pl',
             'pt','pt_br','ro','ru','si','sk','sl','sq',
             'sr','sv','ta','te','th','tl','tr','uk',
             'ur','vi','ze_en','ze_zh','zh','zh_cn',
             'zh_en','zh_tw','zh_zh'}

LANG_CODE = "en" #@param {type:"string"}

assert LANG_CODE in AVAILABLE, "Invalid language code selected"
!wget http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2016/mono/OpenSubtitles.raw.'$LANG_CODE'.gz -O dataset.txt.gz
!gzip -d dataset.txt.gz
!tail dataset.txt
For demonstration purposes, we will only use a small fraction of the whole corpus by
default.
When training the real model, make sure to uncheck the DEMO_MODE checkbox to
use a 100x larger dataset.
Rest assured, 100M lines are perfectly sufficient to train a reasonably good BERT-base
model.
Truncate dataset
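A minimal sketch of the truncation step, assuming the raw corpus sits in dataset.txt; the DEMO_MODE and CORPUS_SIZE names are illustrative parameters, not taken verbatim from the original notebook:

DEMO_MODE = True #@param {type:"boolean"}

if DEMO_MODE:
  CORPUS_SIZE = 1000000
else:
  CORPUS_SIZE = 100000000 #@param {type:"integer"}

# keep only the first CORPUS_SIZE lines of the raw corpus
!(head -n $CORPUS_SIZE dataset.txt) > subdataset.txt
!mv subdataset.txt dataset.txt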
Define preprocessing routine
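A plausible minimal routine (lowercasing, dropping characters that cannot be encoded as UTF-8, and stripping punctuation); this is a sketch of what such a routine could do, not the author's exact code:

regex_tokenizer = nltk.RegexpTokenizer(r"\w+")

def normalize_text(text):
  # lowercase the line
  text = str(text).lower()
  # drop characters that cannot be encoded as UTF-8
  text = text.encode("utf-8", "ignore").decode()
  # strip punctuation by keeping only word tokens
  return " ".join(regex_tokenizer.tokenize(text))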
Apply preprocessing
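Applying the routine line by line with a Keras progress bar; PRC_DATA_FPATH is a hypothetical name for the processed corpus, reused in the sketches below:

RAW_DATA_FPATH = "dataset.txt"
PRC_DATA_FPATH = "proc_dataset.txt"

def count_lines(filename):
  with open(filename, encoding="utf-8") as fi:
    return sum(1 for _ in fi)

# stream the corpus through normalize_text, writing the result to a new file
bar = Progbar(count_lines(RAW_DATA_FPATH))

with open(RAW_DATA_FPATH, encoding="utf-8") as fi, \
     open(PRC_DATA_FPATH, "w", encoding="utf-8") as fo:
  for line in fi:
    fo.write(normalize_text(line) + "\n")
    bar.add(1)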
The BERT paper uses a WordPiece tokenizer, whose vocabulary-learning code is not available in open source. Instead, we will be using the SentencePiece tokenizer in unigram mode. While it is not directly compatible with BERT, with a small hack we can make it work.
SentencePiece requires quite a lot of RAM, so running it on the full dataset in Colab will
crash the kernel. To avoid this, we will randomly subsample a fraction of the dataset for
building the vocabulary. Another option would be to use a machine with more RAM for
this step — that decision is up to you.
Also, SentencePiece adds BOS and EOS control symbols to the vocabulary by default.
We disable them explicitly by setting their indices to -1.
Typical values for VOC_SIZE are somewhere between 32,000 and 128,000. We reserve NUM_PLACEHOLDERS tokens in case one wants to update the vocabulary and fine-tune the model after the pre-training phase is finished.
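A sketch of the vocabulary-building step under those constraints; the parameter names and default values below (MODEL_PREFIX, SUBSAMPLE_SIZE and so on) are assumptions for illustration:

PRC_DATA_FPATH = "proc_dataset.txt"  # the preprocessed corpus from the previous step
MODEL_PREFIX = "tokenizer" #@param {type:"string"}
VOC_SIZE = 32000 #@param {type:"integer"}
SUBSAMPLE_SIZE = 12800000 #@param {type:"integer"}
NUM_PLACEHOLDERS = 256 #@param {type:"integer"}

SPM_COMMAND = ('--input={} --model_prefix={} '
               '--vocab_size={} --input_sentence_size={} '
               '--shuffle_input_sentence=true '
               '--bos_id=-1 --eos_id=-1').format(
                   PRC_DATA_FPATH, MODEL_PREFIX,
                   VOC_SIZE - NUM_PLACEHOLDERS, SUBSAMPLE_SIZE)

# the unigram model is the SentencePiece default, so no --model_type flag is needed;
# --input_sentence_size with shuffling implements the random subsampling described above
spm.SentencePieceTrainer.Train(SPM_COMMAND)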
Now, let’s see how we can make SentencePiece work for the BERT model.
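For comparison, the listing below shows how BERT's reference WordPiece tokenizer splits a test sentence. A minimal way to produce this kind of output, assuming the vocab file of the public uncased_L-12_H-768_A-12 checkpoint has been downloaded locally (an assumption, not part of the original text), would be:

testcase = "Colorless geothermal substations are generating furiously"

# FullTokenizer lower-cases its input by default, matching the listing below
wordpiece_tokenizer = tokenization.FullTokenizer("uncased_L-12_H-768_A-12/vocab.txt")
wordpiece_tokenizer.tokenize(testcase)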
['color',
'##less',
'geo',
'##thermal',
'sub',
'##station',
'##s',
'are',
'generating',
'furiously']
As we can see, the WordPiece tokenizer prepends subwords which occur in the middle of words with '##'. Subwords occurring at the beginning of words are unchanged. If a subword occurs both at the beginning and in the middle of words, both versions (with and without '##') are added to the vocabulary.
SentencePiece has created two files: tokenizer.model and tokenizer.vocab. Let’s have a
look at the learned vocabulary:
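A minimal sketch of inspecting it, assuming the MODEL_PREFIX name used above; each line of the .vocab file holds a token and its score:

def read_sentencepiece_vocab(filepath):
  voc = []
  with open(filepath, encoding="utf-8") as fi:
    for line in fi:
      # each line is "<token>\t<score>"
      voc.append(line.split("\t")[0])
  # skip the first <unk> token
  return voc[1:]

snt_vocab = read_sentencepiece_vocab("{}.vocab".format(MODEL_PREFIX))
print("Learnt vocab size: {}".format(len(snt_vocab)))
print(random.sample(snt_vocab, 10))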
As we may observe from the learned vocabulary, SentencePiece does quite the opposite to WordPiece. From the documentation:
SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581) as follows:
Hello▁World.
Subwords which occur after whitespace (that is, those with which most words begin) are prepended with '▁', while others are left unchanged. The exception is subwords which only ever occur at the beginning of sentences and nowhere else, but such cases should be quite rare.
We also add some special control symbols which are required by the BERT architecture.
By convention, we put those at the beginning of the vocabulary.
Dump vocabulary to file
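A sketch of converting the SentencePiece vocabulary to the BERT convention and dumping it to a file, reusing the snt_vocab and VOC_SIZE assumed above; the helper name parse_sentencepiece_token and the file name VOC_FNAME are illustrative:

def parse_sentencepiece_token(token):
  # word-initial pieces lose the '▁' marker, all other pieces gain '##'
  if token.startswith("▁"):
    return token[1:]
  return "##" + token

bert_vocab = list(map(parse_sentencepiece_token, snt_vocab))

# BERT's control symbols go at the beginning of the vocabulary
ctrl_symbols = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
bert_vocab = ctrl_symbols + bert_vocab

# pad the vocabulary with placeholder tokens up to VOC_SIZE
bert_vocab += ["[UNUSED_{}]".format(i) for i in range(VOC_SIZE - len(bert_vocab))]

VOC_FNAME = "vocab.txt" #@param {type:"string"}

with open(VOC_FNAME, "w", encoding="utf-8") as fo:
  for token in bert_vocab:
    fo.write(token + "\n")

# sanity check: tokenize the test sentence with BERT's own tokenizer
bert_tokenizer = tokenization.FullTokenizer(VOC_FNAME)
bert_tokenizer.tokenize("colorless geothermal substations are generating furiously")

Run against the vocabulary learned above, the test sentence comes out as: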
['color',
'##less',
'geo',
'##ther',
'##mal',
'sub',
'##station',
'##s',
'are',
'generat',
'##ing',
'furious',
'##ly']
Looking good!
Now, for each shard we need to call the create_pretraining_data.py script from the BERT repo. To that end, we will employ the xargs command.
Before we start generating, we need to set some parameters to pass to the script. You can find out more about their meaning in the BERT repo's README.
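A sketch of that step; the parameter names and values are illustrative defaults, and the processed corpus is assumed to have been split into a ./shards/ directory beforehand (for example with the split utility):

MAX_SEQ_LENGTH = 128 #@param {type:"integer"}
MASKED_LM_PROB = 0.15 #@param
MAX_PREDICTIONS = 20 #@param {type:"integer"}
DO_LOWER_CASE = True #@param {type:"boolean"}
PROCESSES = 2 #@param {type:"integer"}
PRETRAINING_DIR = "pretraining_data" #@param {type:"string"}

# run create_pretraining_data.py on each shard, PROCESSES shards in parallel
XARGS_CMD = ("ls ./shards/ | "
             "xargs -n 1 -P {} -I{} "
             "python3 bert/create_pretraining_data.py "
             "--input_file=./shards/{} "
             "--output_file={}/{}.tfrecord "
             "--vocab_file={} "
             "--do_lower_case={} "
             "--max_predictions_per_seq={} "
             "--max_seq_length={} "
             "--masked_lm_prob={} "
             "--random_seed=34 "
             "--dupe_factor=5")

XARGS_CMD = XARGS_CMD.format(PROCESSES, '{}', '{}', PRETRAINING_DIR, '{}',
                             VOC_FNAME, DO_LOWER_CASE,
                             MAX_PREDICTIONS, MAX_SEQ_LENGTH, MASKED_LM_PROB)

tf.gfile.MkDir(PRETRAINING_DIR)

!$XARGS_CMD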
Running this might take quite some time depending on the size of your dataset.
We will create two directories in GCS, one for the data and one for the model. In the
model directory, we will put the model vocabulary and configuration file.
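A sketch of setting this up; the bucket and directory names are placeholders for your own:

BUCKET_NAME = "my_bert_bucket" #@param {type:"string"}
MODEL_DIR = "bert_model" #@param {type:"string"}

# stage the model assets locally first; they are pushed to GCS after the
# configuration file has been written (see the next step)
tf.gfile.MkDir(MODEL_DIR)

if not BUCKET_NAME:
  log.warning("WARNING: BUCKET_NAME is not set. "
              "You will not be able to train the model.")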
Below is the sample hyperparameter configuration for BERT-base. Change at your own
risk!
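A sketch of such a configuration, using the standard BERT-base values and the VOC_SIZE assumed earlier; writing it (together with a copy of the vocabulary) into MODEL_DIR and pushing everything to the bucket is one way to lay things out:

bert_base_config = {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": VOC_SIZE
}

with open("{}/bert_config.json".format(MODEL_DIR), "w") as fo:
  json.dump(bert_base_config, fo, indent=2)

# keep a copy of the vocabulary next to the config
with open("{}/{}".format(MODEL_DIR, VOC_FNAME), "w", encoding="utf-8") as fo:
  for token in bert_vocab:
    fo.write(token + "\n")

# push the model assets and the pre-training data to GCS
if BUCKET_NAME:
  !gsutil -m cp -r $MODEL_DIR $PRETRAINING_DIR gs://$BUCKET_NAME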
Be advised that some of the parameters from previous steps are duplicated here, so as to allow for a convenient restart of the training procedure.
Make sure that these parameters are set to exactly the same values throughout the experiment.
Execute!
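A sketch of what the training cell might look like, reusing model_fn_builder and input_fn_builder imported from bert.run_pretraining at the top of the notebook; the batch size, learning rate and other values below are illustrative assumptions, not the author's exact settings:

BUCKET_PATH = "gs://{}".format(BUCKET_NAME)
BERT_GCS_DIR = "{}/{}".format(BUCKET_PATH, MODEL_DIR)
DATA_GCS_DIR = "{}/{}".format(BUCKET_PATH, PRETRAINING_DIR)

CONFIG_FILE = os.path.join(BERT_GCS_DIR, "bert_config.json")

TRAIN_BATCH_SIZE = 128 #@param {type:"integer"}
TRAIN_STEPS = 1000000 #@param {type:"integer"}
SAVE_CHECKPOINTS_STEPS = 2500 #@param {type:"integer"}
LEARNING_RATE = 2e-5 #@param
NUM_TPU_CORES = 8

bert_config = modeling.BertConfig.from_json_file(CONFIG_FILE)
input_files = tf.gfile.Glob(os.path.join(DATA_GCS_DIR, '*tfrecord'))

log.info("Using {} data shards".format(len(input_files)))

model_fn = model_fn_builder(
    bert_config=bert_config,
    init_checkpoint=None,
    learning_rate=LEARNING_RATE,
    num_train_steps=TRAIN_STEPS,
    num_warmup_steps=10,
    use_tpu=USE_TPU,
    use_one_hot_embeddings=True)

tpu_cluster_resolver = (tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)
                        if USE_TPU else None)

run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=BERT_GCS_DIR,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=SAVE_CHECKPOINTS_STEPS,
        num_shards=NUM_TPU_CORES))

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=USE_TPU,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE)

train_input_fn = input_fn_builder(
    input_files=input_files,
    max_seq_length=MAX_SEQ_LENGTH,
    max_predictions_per_seq=MAX_PREDICTIONS,
    is_training=True)

# resumes automatically from the latest checkpoint in BERT_GCS_DIR if one exists
estimator.train(input_fn=train_input_fn, max_steps=TRAIN_STEPS)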
Training the model with the default parameters for 1 million steps
will take ~54 hours of run time. In case the kernel restarts for some reason, you may
always continue training from the latest checkpoint.
This concludes the guide to pre-training BERT from scratch on a cloud TPU.
Next steps
That is a topic for a whole new discussion. There are a couple of things you could do:
4. ???
The really fun stuff is still to come, so stay woke. Meanwhile, check out the awesome
bert-as-service project and start serving your newly trained model in production.
Keep learning!