With GPT-2
September 4, 2019 10 min read AI, Text Generation
At the same time, the Python code that allowed anyone to download the model (albeit only smaller versions, out of concern that the full model could be abused to mass-generate fake news) and the TensorFlow code to load the downloaded model and generate predictions were open-sourced on GitHub.
Neil Shepperd created a fork of OpenAI's repo containing additional code that allows finetuning the existing OpenAI model on custom datasets. Soon after, a notebook was created which can be copied into Google Colaboratory; it clones Shepperd's repo and finetunes GPT-2 backed by a free GPU. From there, the proliferation of GPT-2-generated text took off: researchers such as Gwern Branwen made GPT-2 Poetry and Janelle Shane made GPT-2 Dungeons and Dragons character bios.
I waited to see if anyone would make a tool to help streamline this finetuning and text generation workflow, à la textgenrnn, which I had made for recurrent neural network-based text generation. Months later, no one did. So I did it myself. Enter gpt-2-simple, a Python package
which wraps Shepperd’s finetuning code in a functional interface and adds many utilities for
model management and generation control.
Thanks to gpt-2-simple and this Colaboratory Notebook, you can easily finetune GPT-2 on your
own dataset with a simple function, and generate text to your own specifications!
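A minimal sketch of that workflow, following the gpt-2-simple README (the dataset filename and step count here are placeholders, not recommendations):

```python
def finetune_and_generate(dataset="my_corpus.txt", steps=1000):
    # Deferred import so this sketch can be read without TensorFlow installed.
    import gpt_2_simple as gpt2

    # Download the 124M pretrained model (cached under ./models/124M).
    gpt2.download_gpt2(model_name="124M")

    # Finetune on a plain-text dataset; checkpoints go to ./checkpoint/run1.
    sess = gpt2.start_tf_sess()
    gpt2.finetune(sess, dataset, model_name="124M", steps=steps)

    # Generate text from the finetuned model to your own specifications.
    gpt2.generate(sess, length=250, temperature=0.7, prefix="The")

# Usage (requires TensorFlow and a GPU for reasonable speed):
# finetune_and_generate(dataset="shakespeare.txt", steps=1000)
```

Once finetuned, the checkpoint can be reloaded in a later session with `gpt2.load_gpt2(sess)` and generated from without retraining.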
The actual Transformer architecture GPT-2 uses is very complicated to explain (here’s a great
lecture). For the purposes of finetuning, since we can’t modify the architecture, it’s easier to
think of GPT-2 as a black box, taking in inputs and providing outputs. As with previous text generators, the inputs are a sequence of tokens, and the outputs are the probabilities for the next token in the sequence, with these probabilities serving as weights for the AI to pick the next token. In this case, both the input and output tokens are byte pair encodings. Instead of using character tokens (slower to train, but case/formatting is preserved) or word tokens (faster to train, but case/formatting is lost) as most RNN approaches do, byte pair encoding "compresses" the inputs to the shortest combinations of bytes that still preserve case/formatting, a compromise between the two approaches that unfortunately adds randomness to the final generation length. The byte pair encodings are later decoded into human-readable text.
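The "probabilities as weights" step can be illustrated with a toy sketch. The five-token vocabulary and the probability values below are invented for illustration; the real model outputs a distribution over its full BPE vocabulary of roughly 50,000 tokens:

```python
import random

# Invented vocabulary and next-token distribution for one position.
vocab = ["The", "cat", "sat", "down", "."]
next_token_probs = [0.02, 0.08, 0.70, 0.15, 0.05]

def pick_next_token(tokens, probs):
    # The probabilities act as sampling weights: high-probability tokens are
    # chosen most of the time, but unlikely tokens still appear occasionally,
    # which keeps the generated text varied rather than deterministic.
    return random.choices(tokens, weights=probs, k=1)[0]

token = pick_next_token(vocab, next_token_probs)
# The chosen token is appended to the sequence and the model is run again,
# producing a fresh distribution for the next position.
```

Sampling parameters like temperature reshape these weights before the draw, trading off predictability against variety.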
The pretrained GPT-2 models were trained on websites linked from Reddit. As a result, the model has a very strong grasp of the English language, and that knowledge transfers to other datasets, so the model performs well with only a small amount of additional finetuning. However, due to the English bias in the encoder's construction, languages with non-Latin alphabets, such as Russian and CJK, will perform poorly when finetuned.
When finetuning GPT-2, I recommend using the 124M model (the default) as it’s the best
balance of speed, size, and creativity. If you have large amounts of training data (>10 MB), then
the 355M model may work better.