
Have you noticed the ‘Smart Compose’ feature in Gmail that offers suggestions to complete sentences while you write an email? This is one of the many use cases of language models in Natural Language Processing (NLP).

A language model is the core component of modern Natural Language Processing (NLP).
It’s a statistical tool that analyzes patterns in human language in order to predict words.

NLP-based applications use language models for a variety of tasks, such as audio to text
conversion, speech recognition, sentiment analysis, summarization, spell correction, etc. 

Let’s understand how language models help in processing these NLP tasks: 

 Speech Recognition: Smart speakers, such as Alexa, use automatic speech recognition (ASR) to convert speech into text. During this conversion, the language model helps the system resolve the user’s intent by distinguishing between words that sound alike, for example homophone phrases such as “Let her” and “Letter”, or “But her” and “Butter”.

 Machine Translation: When translating the Chinese phrase “我在吃” into English, a translation system may consider several candidate outputs:

I eat lunch
I am eating
Me am eating
Eating am I

Here, the language model indicates that “I am eating” sounds the most natural and ranks it first as the output.

Challenges with Language Modeling

Formal languages (like programming languages) are precisely defined: all the words and the valid ways they can be used are specified in advance, so anyone who knows a given programming language can understand exactly what is written.

Natural language, on the other hand, isn’t designed; it evolves through the way individuals use and learn it. Many terms in natural language can be used in a number of ways, which introduces ambiguity, yet humans can still understand them.

Machines only understand the language of numbers. To create a language model, all of the words must be converted into sequences of numbers; this step is known as encoding.

Encodings can be simple or complex. The simplest scheme assigns an integer to every word; this is called label encoding. In the sentence “I love to play cricket on weekends”, the words are mapped to the integers [1, 2, 3, 4, 5, 6, 7]. A related scheme, one-hot encoding, represents each word as a vector of zeros with a single 1 at that word’s position.
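To make the distinction concrete, here is a minimal sketch of both encodings for the example sentence (plain Python, an illustration only, not code from the article):

# A toy sketch of the two encodings described above.
words = "I love to play cricket on weekends".lower().split()

# Label encoding: assign every word an integer.
label_encoding = {word: i for i, word in enumerate(words, start=1)}
# {'i': 1, 'love': 2, 'to': 3, 'play': 4, 'cricket': 5, 'on': 6, 'weekends': 7}

# One-hot encoding: each word becomes a vector with a single 1 at its position.
vocab_size = len(label_encoding)
one_hot = {word: [1 if position == index - 1 else 0 for position in range(vocab_size)]
           for word, index in label_encoding.items()}
# one_hot['love'] -> [0, 1, 0, 0, 0, 0, 0]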

How Does a Language Model Work?

Language models determine the probability of the next word by analyzing patterns in text data. By learning the features and characteristics of a language from this data, a model prepares itself to understand phrases and to predict the next word in a sentence.

For training a language model, a number of probabilistic approaches are used. These
approaches vary on the basis of the purpose for which a language model is created. The
amount of text data to be analyzed and the math applied for analysis makes a difference
in the approach followed for creating and training a language model.

For example, a language model used for predicting the next word in a search query will be quite different from one used to predict the next word in a long document (such as Google Docs), and the approach followed to train each model will differ accordingly.

Types of Language Models: 

There are primarily two types of language models: 

1. Statistical Language Models


Statistical models include the development of probabilistic models that are able to predict
the next word in the sequence, given the words that precede it. A number of statistical
language models are in use already. Let’s take a look at some of those popular models: 

N-Gram: This is one of the simplest approaches to language modelling. A probability distribution is estimated over sequences of ‘n’ words, where ‘n’ defines the size of the gram (the number of words treated as one sequence). If n=4, a gram may look like: “can you help me”. In effect, ‘n’ is the amount of context that the model is trained to consider. There are different types of N-gram models, such as unigrams, bigrams, and trigrams; a toy example follows below.
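As a toy illustration, a bigram (n=2) model can be estimated from raw counts like this; the corpus is made up and no smoothing is applied, unlike in a real model:

from collections import Counter, defaultdict

# A toy bigram model: estimate P(next word | previous word) from counts.
corpus = "can you help me can you hear me".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def bigram_probability(prev, nxt):
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

print(bigram_probability('can', 'you'))   # 1.0 -- 'you' always follows 'can' here
print(bigram_probability('you', 'help'))  # 0.5 -- 'help' follows 'you' half the time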

Unigram: The unigram is the simplest type of language model. It doesn’t use any conditioning context in its calculations; it evaluates each word or term independently. Unigram models are commonly used in language processing tasks such as information retrieval. The unigram is the foundation of a more specific variant called the query likelihood model, which scores a pool of documents by how likely each one is to generate a specific query and returns the most relevant match, as sketched below.
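For illustration, a minimal (unsmoothed) sketch of query-likelihood scoring with unigram document models; the documents and query below are made up:

from collections import Counter

# Toy query-likelihood scoring: rank documents by how likely each document's
# unigram model is to generate the query.
documents = {
    'doc1': "the cat sat on the mat".split(),
    'doc2': "the dog chased the cat".split(),
}
query = "cat mat".split()

def query_likelihood(doc_tokens, query_tokens):
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    score = 1.0
    for term in query_tokens:
        score *= counts[term] / total
    return score

ranked = sorted(documents, key=lambda d: query_likelihood(documents[d], query), reverse=True)
print(ranked)  # ['doc1', 'doc2'] -- doc1 mentions both query terms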

Bidirectional: Unlike n-gram models, which only condition on the words that come before a position, bidirectional models analyze text in both directions, backwards and forwards. These models can predict any word in a sentence or body of text by using all of the other words around it. Examining text bidirectionally increases result accuracy. This type is often used in machine learning and speech-generation applications; for example, Google uses a bidirectional model to process search queries.

Exponential: This type of statistical model evaluates text using an equation that combines n-grams and feature functions, where the features and parameters of the desired results are specified in advance. The model is based on the principle of maximum entropy, which states that the probability distribution with the greatest entropy consistent with the constraints is the best choice. Exponential models make fewer statistical assumptions, which tends to make their results more accurate.

Continuous Space: In this type of model, words are represented as a non-linear combination of weights in a neural network; the process of mapping each word to a vector of weights is known as word embedding. This type of model proves helpful as the vocabulary grows and comes to include many rare or unique words.
In such cases, discrete models such as n-grams struggle: as the vocabulary grows, the number of possible word sequences grows with it, and the patterns that predict the next word become weaker and sparser.

2. Neural Language Models

These language models are based on neural networks and are often considered the more advanced approach to NLP tasks. Neural language models overcome the shortcomings of classical models such as n-grams and are used for complex tasks such as speech recognition or machine translation.

Language is complex and keeps evolving, so a more expressive language model tends to perform better at NLP tasks. Compared to the n-gram model, an exponential or continuous space model is usually a better option because these models are designed to handle ambiguity and language variation.

Language models should also be able to manage dependencies; for example, a model should be able to handle words derived from different languages.

Some Common Examples of Language Models

Language models are the cornerstone of Natural Language Processing (NLP) technology. We have been benefiting from language models in our daily routine without even realizing it. Let’s take a look at some examples of language models.

1. Speech Recognition

Voice assistants such as Siri and Alexa are examples of how language models help
machines in processing speech audio. 

2. Machine Translation

Google Translate and Microsoft Translator are examples of how NLP models can help in translating one language to another.

3. Sentiment Analysis 
This helps in analyzing the sentiments behind a phrase. This use case of NLP models is
used in products that allow businesses to understand a customer’s intent behind opinions
or attitudes expressed in the text. Hubspot’s Service Hub is an example of how language
models can help in sentiment analysis.

4. Text Suggestions

Google services such as Gmail or Google Docs use language models to help users get
text suggestions while they compose an email or create long text documents,
respectively.  

5. Parsing Tools

Parsing involves analyzing whether sentences or words comply with syntax or grammar rules. Spell-checking tools are perfect examples of language modelling and parsing.

How do you plan to use Language Models? 

There are several innovative ways in which language models can support NLP tasks. If
you have any idea in mind, then our AI experts can help you in creating language models
for executing simple to complex NLP tasks. As a part of our AI application development
services, we provide a free, no-obligation consultation session that allows our prospects
to share their ideas with AI experts and talk about its execution. 

Problem of Modeling Language


Formal languages, like programming languages, can be fully specified.
All the reserved words can be defined and the valid ways that they can be used can be
precisely defined.
We cannot do this with natural language. Natural languages are not designed; they emerge,
and therefore there is no formal specification.
There may be formal rules for parts of the language, and heuristics, but natural language that does not conform to them is often used. Natural languages involve vast numbers of terms that can be used in ways that introduce all kinds of ambiguities, yet can still be understood by other humans.
Further, languages change, word usages change: it is a moving target.
Nevertheless, linguists try to specify the language with formal grammars and structures. It
can be done, but it is very difficult and the results can be fragile.
An alternative approach to specifying the model of the language is to learn it from
examples.
Neural Language Models
Recently, the use of neural networks in the development of language models has become
very popular, to the point that it may now be the preferred approach.

The use of neural networks in language modeling is often called Neural Language
Modeling, or NLM for short.

Neural network approaches are achieving better results than classical methods both on
standalone language models and when models are incorporated into larger models on
challenging tasks like speech recognition and machine translation.

A key reason for the leaps in improved performance may be the method’s ability to
generalize.

Nonlinear neural network models solve some of the shortcomings of traditional language
models: they allow conditioning on increasingly large context sizes with only a linear
increase in the number of parameters, they alleviate the need for manually designing
backoff orders, and they support generalization across different contexts.

Specifically, a word embedding is adopted that uses a real-valued vector to represent each word in a projected vector space. This learned representation of words based on their usage allows words with a similar meaning to have a similar representation.

Neural Language Models (NLM) address the n-gram data sparsity issue through
parameterization of words as vectors (word embeddings) and using them as inputs to a
neural network. The parameters are learned as part of the training process. Word
embeddings obtained through NLMs exhibit the property whereby semantically close words
are likewise close in the induced vector space.
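This property can be measured with cosine similarity between word vectors. The vectors below are made up purely for illustration; learned embeddings would come from a trained model and have many more dimensions:

import numpy as np

# Hand-made 3-dimensional "embeddings" to illustrate the idea that
# semantically close words end up close in the vector space.
embeddings = {
    'eat':    np.array([0.9, 0.1, 0.0]),
    'eating': np.array([0.8, 0.2, 0.1]),
    'letter': np.array([0.0, 0.9, 0.7]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings['eat'], embeddings['eating']))  # close to 1.0
print(cosine_similarity(embeddings['eat'], embeddings['letter']))  # much lower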

This generalization is something that the representation used in classical statistical language models cannot easily achieve.

“True generalization” is difficult to obtain in a discrete space of word indices, since there is no obvious relation between the word indices.
Further, the distributed representation approach allows the embedding representation to
scale better with the size of the vocabulary. Classical methods that have one discrete
representation per word fight the curse of dimensionality with larger and larger vocabularies
of words that result in longer and more sparse representations.

The neural network approach to language modeling can be described using the three
following model properties, taken from “A Neural Probabilistic Language Model“, 2003.
1. Associate each word in the vocabulary with a distributed word feature vector.
2. Express the joint probability function of word sequences in terms of the feature
vectors of these words in the sequence.
3. Learn simultaneously the word feature vector and the parameters of the probability
function.
This represents a relatively simple model where both the representation and probabilistic
model are learned together directly from raw text data.
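A minimal Keras sketch in the spirit of these three properties (a fixed window of word indices, a learned embedding, and a softmax over the vocabulary); the sizes and random data below are placeholders, not taken from the paper:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

vocab_size = 1000   # placeholder vocabulary size
context_length = 4  # predict word 5 from the previous 4 words

model = Sequential()
# 1. each word index is mapped to a learned feature vector
model.add(Embedding(vocab_size, 32, input_length=context_length))
# 2. the probability of the next word is a function of the context's feature vectors
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
# 3. the feature vectors and the probability function are learned together
model.compile(loss='categorical_crossentropy', optimizer='adam')

# toy random data, only to show the expected input/output shapes
X = np.random.randint(1, vocab_size, size=(8, context_length))
y = np.eye(vocab_size)[np.random.randint(1, vocab_size, size=8)]
model.fit(X, y, epochs=1, verbose=0)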

Recently, neural-based approaches started to match, and have since consistently come to outperform, the classical statistical approaches.

We provide ample empirical evidence to suggest that connectionist language models are
superior to standard n-gram techniques, except their high computational (training)
complexity.

Initially, feed-forward neural network models were used to introduce the approach.

More recently, recurrent neural networks and then networks with a longer-term memory, such as the Long Short-Term Memory (LSTM) network, have allowed models to learn the relevant context over much longer input sequences than the simpler feed-forward networks.

[an RNN language model] provides further generalization: instead of considering just
several preceding words, neurons with input from recurrent connections are assumed to
represent short term memory. The model learns itself from the data how to represent
memory. While shallow feedforward neural networks (those with just one hidden layer) can
only cluster similar words, recurrent neural network (which can be considered as a deep
architecture) can perform clustering of similar histories. This allows for instance efficient
representation of patterns with variable length.

— Extensions of recurrent neural network language model, 2011.


Recently, researchers have been seeking the limits of these language models. In the paper “Exploring the Limits of Language Modeling“, which evaluates language models over large datasets such as the One Billion Word Benchmark, the authors find that LSTM-based neural language models outperform the classical methods.
… we have shown that RNN LMs can be trained on large amounts of data, and outperform
competing models including carefully tuned N-grams.

— Exploring the Limits of Language Modeling, 2016.


Further, they propose some heuristics for developing high-performing neural language
models in general:

 Size matters. The best models were the largest models, specifically in the number of memory units.
 Regularization matters. Use of regularization like dropout on input connections
improves results.
 CNNs vs Embeddings. Character-level Convolutional Neural Network (CNN)
models can be used on the front-end instead of word embeddings, achieving similar and
sometimes better results.
 Ensembles matter. Combining the prediction from multiple models can offer large
improvements in model performance.

How to Develop a Word-Level Neural Language Model and Use it to Generate Text

A language model can predict the probability of the next word in the sequence, based on
the words already observed in the sequence.
Neural network models are a preferred method for developing statistical language models
because they can use a distributed representation where different words with similar
meanings have similar representation and because they can use a large context of recently
observed words when making predictions.

In this tutorial, you will discover how to develop a statistical language model using deep
learning in Python.

After completing this tutorial, you will know:

 How to prepare text for developing a word-based language model.


 How to design and fit a neural language model with a learned embedding and an
LSTM hidden layer.
 How to use the learned language model to generate new text with similar statistical
properties as the source text.
This tutorial is divided into 4 parts; they are:
1. The Republic by Plato
2. Data Preparation
3. Train Language Model
4. Use Language Model
The Republic by Plato
The Republic is the classical Greek philosopher Plato’s most famous work.
It is structured as a dialog (i.e. a conversation) on the topic of order and justice within a city state.

The entire text is available for free in the public domain. It is available on the Project
Gutenberg website in a number of formats.
You can download the ASCII text version of the entire book (or books) here:

 Download The Republic by Plato (republic.txt)


Download the book text and place it in your current working directory with the filename ‘republic.txt‘.
Open the file in a text editor and delete the front and back matter. This includes details
about the book at the beginning, a long analysis, and license information at the end.

The text should begin with:

BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,

And end with


And it shall be well with us both in this life and in the pilgrimage of a thousand years which
we have been describing.

Here is a direct link to the clean version of the data file:

Data Preparation
We will start by preparing the data for modeling.

The first step is to look at the data.


Review the Text
Open the text in an editor and just look at the text data.

For example, here is the first piece of dialog:

BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what manner they would
celebrate the festival, which was a new thing. I was delighted with the
procession of the inhabitants; but that of the Thracians was equally,
if not more, beautiful. When we had finished our prayers and viewed the
spectacle, we turned in the direction of the city; and at that instant
Polemarchus the son of Cephalus chanced to catch sight of us from a
distance as we were starting on our way home, and told his servant to
run and bid us wait for him. The servant took hold of me by the cloak
behind, and said: Polemarchus desires you to wait.

I turned round, and asked him where his master was.

There he is, said the youth, coming after you, if you will only wait.

Certainly we will, said Glaucon; and in a few minutes Polemarchus


appeared, and with him Adeimantus, Glaucon’s brother, Niceratus the son
of Nicias, and several others who had been at the procession.

Polemarchus said to me: I perceive, Socrates, that you and your


companion are already on your way to the city.

You are not far wrong, I said.

What do you see that we will need to handle in preparing the data?

Here’s what I see from a quick look:


 Book/Chapter headings (e.g. “BOOK I.”).
 British English spelling (e.g. “honoured”)
 Lots of punctuation (e.g. “–“, “;–“, “?–“, and more)
 Strange names (e.g. “Polemarchus”).
 Some long monologues that go on for hundreds of lines.
 Some quoted dialog (e.g. ‘…’)
These observations, and more, suggest ways that we may wish to prepare the text data.

The specific way we prepare the data really depends on how we intend to model it, which in
turn depends on how we intend to use it.

Language Model Design


In this tutorial, we will develop a model of the text that we can then use to generate new
sequences of text.

The language model will be statistical and will predict the probability of each word given an input sequence of text. The predicted word will then be fed back in as input to generate the next word in turn.

A key design decision is how long the input sequences should be. They need to be long
enough to allow the model to learn the context for the words to predict. This input length will
also define the length of seed text used to generate new sequences when we use the
model.

There is no correct answer. With enough time and resources, we could explore the ability of
the model to learn with differently sized input sequences.

Instead, we will pick a length of 50 words for the length of the input sequences, somewhat
arbitrarily.

We could process the data so that the model only ever deals with self-contained sentences
and pad or truncate the text to meet this requirement for each input sequence. You could
explore this as an extension to this tutorial.

Instead, to keep the example brief, we will let all of the text flow together and train the model
to predict the next word across sentences, paragraphs, and even books or chapters in the
text.
Now that we have a model design, we can look at transforming the raw text into sequences
of 50 input words to 1 output word, ready to fit a model.

Load Text
The first step is to load the text into memory.

We can develop a small function to load the entire text file into memory and return it. The
function is called load_doc() and is listed below. Given a filename, it returns a sequence of
loaded text.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

Using this function, we can load the cleaner version of the document in the file
‘republic_clean.txt‘ as follows:
# load document
in_filename = 'republic_clean.txt'
doc = load_doc(in_filename)
print(doc[:200])

Running this snippet loads the document and prints the first 200 characters as a sanity
check.

BOOK I.

I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what
So far, so good. Next, let’s clean the text.

Clean Text
We need to transform the raw text into a sequence of tokens or words that we can use as a
source to train the model.

Based on reviewing the raw text (above), below are some specific operations we will
perform to clean the text. You may want to explore more cleaning operations yourself as an
extension.

 Replace ‘--‘ with a white space so we can split words better.
 Split words based on white space.
 Remove all punctuation from words to reduce the vocabulary size (e.g. ‘What?’ becomes ‘What’).
 Remove all words that are not alphabetic to remove standalone punctuation tokens.
 Normalize all words to lowercase to reduce the vocabulary size.
Vocabulary size is a big deal with language modeling. A smaller vocabulary results in a
smaller model that trains faster.

We can implement each of these cleaning operations in this order in a function. Below is the
function clean_doc() that takes a loaded document as an argument and returns an array of
clean tokens.
import string

# turn a doc into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens

We can run this cleaning operation on our loaded document and print out some of the
tokens and statistics as a sanity check.

# clean document
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

First, we can see a nice list of tokens that look cleaner than the raw text. We could remove
the ‘Book I‘ chapter markers and more, but this is a good start.
['book', 'i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might', 'offer', 'up',
'my', 'prayers', 'to', 'the', 'goddess', 'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i', 'wanted', 'to', 'see', 'in', 'what',
'manner', 'they', 'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new', 'thing', 'i', 'was', 'delighted', 'with', 'the', 'procession',
'of', 'the', 'inhabitants', 'but', 'that', 'of', 'the', 'thracians', 'was', 'equally', 'if', 'not', 'more', 'beautiful', 'when', 'we', 'had', 'finished',
'our', 'prayers', 'and', 'viewed', 'the', 'spectacle', 'we', 'turned', 'in', 'the', 'direction', 'of', 'the', 'city', 'and', 'at', 'that', 'instant',
'polemarchus', 'the', 'son', 'of', 'cephalus', 'chanced', 'to', 'catch', 'sight', 'of', 'us', 'from', 'a', 'distance', 'as', 'we', 'were', 'starting',
'on', 'our', 'way', 'home', 'and', 'told', 'his', 'servant', 'to', 'run', 'and', 'bid', 'us', 'wait', 'for', 'him', 'the', 'servant', 'took', 'hold', 'of',
'me', 'by', 'the', 'cloak', 'behind', 'and', 'said', 'polemarchus', 'desires', 'you', 'to', 'wait', 'i', 'turned', 'round', 'and', 'asked', 'him',
'where', 'his', 'master', 'was', 'there', 'he', 'is', 'said', 'the', 'youth', 'coming', 'after', 'you', 'if', 'you', 'will', 'only', 'wait', 'certainly',
'we', 'will', 'said', 'glaucon', 'and', 'in', 'a', 'few', 'minutes', 'polemarchus', 'appeared', 'and', 'with', 'him', 'adeimantus', 'glaucons',
'brother', 'niceratus', 'the', 'son', 'of', 'nicias', 'and', 'several', 'others', 'who', 'had', 'been', 'at', 'the', 'procession', 'polemarchus',
'said']

We also get some statistics about the clean document.

We can see that there are just under 120,000 words in the clean text and a vocabulary of
just under 7,500 words. This is smallish and models fit on this data should be manageable
on modest hardware.

Total Tokens: 118684
Unique Tokens: 7409

Next, we can look at shaping the tokens into sequences and saving them to file.
Save Clean Text
We can organize the long list of tokens into sequences of 50 input words and 1 output word.

That is, sequences of 51 words.

We can do this by iterating over the list of tokens from token 51 onwards and taking the
prior 50 tokens as a sequence, then repeating this process to the end of the list of tokens.

We will transform the tokens into space-separated strings for later storage in a file.

The code to split the list of clean tokens into sequences with a length of 51 tokens is listed
below.

# organize into sequences of tokens
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))

Running this piece creates a long list of lines.

Printing statistics on the list, we can see that we will have exactly 118,633 training patterns
to fit our model.

Total Sequences: 118633

Next, we can save the sequences to a new file for later loading.
We can define a new function for saving lines of text to a file. This new function is
called save_doc() and is listed below. It takes as input a list of lines and a filename. The
lines are written, one per line, in ASCII format.
# save tokens to file, one dialog per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

We can call this function and save our training sequences to the file
‘republic_sequences.txt‘.
# save sequences to file
out_filename = 'republic_sequences.txt'
save_doc(sequences, out_filename)

Take a look at the file with your text editor.

You will see that each line is shifted along one word, with a new word at the end to be
predicted; for example, here are the first 3 lines in truncated form:

book i i … catch sight of


i i went … sight of us
i went down … of us from

Complete Example
Tying all of this together, the complete code listing is provided below.

import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens

# save tokens to file, one dialog per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# load document
in_filename = 'republic_clean.txt'
doc = load_doc(in_filename)
print(doc[:200])

# clean document
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

# organize into sequences of tokens
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))

# save sequences to file
out_filename = 'republic_sequences.txt'
save_doc(sequences, out_filename)

You should now have training data stored in the file ‘republic_sequences.txt‘ in your current
working directory.
Next, let’s look at how to fit a language model to this data.

Train Language Model


We can now train a statistical language model from the prepared data.

The model we will train is a neural language model. It has a few unique characteristics:

 It uses a distributed representation for words so that different words with similar meanings
will have a similar representation.
 It learns the representation at the same time as learning the model.
 It learns to predict the probability for the next word using the context of the last 50 input words.
Specifically, we will use an Embedding Layer to learn the representation of words, and a
Long Short-Term Memory (LSTM) recurrent neural network to learn to predict words based
on their context.
Let’s start by loading our training data.

Load Sequences
We can load our training data using the load_doc() function we developed in the previous
section.
Once loaded, we can split the data into separate training sequences by splitting based on
new lines.

The snippet below will load the ‘republic_sequences.txt‘ data file from the current working
directory.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

Next, we can encode the training data.

Encode Sequences
The word embedding layer expects input sequences to be comprised of integers.

We can map each word in our vocabulary to a unique integer and encode our input
sequences. Later, when we make predictions, we can convert the prediction to numbers
and look up their associated words in the same mapping.

To do this encoding, we will use the Tokenizer class in the Keras API.


First, the Tokenizer must be trained on the entire training dataset, which means it finds all of
the unique words in the data and assigns each a unique integer.

We can then use the fit Tokenizer to encode all of the training sequences, converting each
sequence from a list of words to a list of integers.

# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

We can access the mapping of words to integers as a dictionary attribute called word_index
on the Tokenizer object.

We need to know the size of the vocabulary for defining the embedding layer later. We can determine the vocabulary size by calculating the size of the mapping dictionary.

Words are assigned values from 1 to the total number of words (e.g. 7,409). The Embedding layer needs to allocate a vector representation for each word in this vocabulary, from index 1 up to the largest index. Because array indexing is zero-offset, the index of the word at the end of the vocabulary will be 7,409, which means the array must be 7,409 + 1 in length.

Therefore, when specifying the vocabulary size to the Embedding layer, we specify it as 1
larger than the actual vocabulary.
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1

Sequence Inputs and Output


Now that we have encoded the input sequences, we need to separate them into input (X)
and output (y) elements.
We can do this with array slicing.

After separating, we need to one hot encode the output word. This means converting it from an integer to a vector of 0 values, one for each word in the vocabulary, with a 1 to indicate the specific word at the index of the word’s integer value.

This is so that the model learns to predict the probability distribution for the next word, where the ground truth to learn from is 0 for all words except the actual word that comes next, which is 1.

Keras provides the to_categorical() function that can be used to one hot encode the output words for each input-output sequence pair.
Finally, we need to specify to the Embedding layer how long input sequences are. We know
that there are 50 words because we designed the model, but a good generic way to specify
that is to use the second dimension (number of columns) of the input data’s shape. That
way, if you change the length of sequences when preparing data, you do not need to
change this data loading code; it is generic.

# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]

Fit Model
We can now define and fit our language model on the training data.

The learned embedding needs to know the size of the vocabulary and the length of input
sequences as previously discussed. It also has a parameter to specify how many
dimensions will be used to represent each word. That is, the size of the embedding vector
space.

Common values are 50, 100, and 300. We will use 50 here, but consider testing smaller or
larger values.

We will use two LSTM hidden layers with 100 memory cells each. More memory cells and a deeper network may achieve better results.

A dense fully connected layer with 100 neurons connects to the LSTM hidden layers to
interpret the features extracted from the sequence. The output layer predicts the next word
as a single vector the size of the vocabulary with a probability for each word in the
vocabulary. A softmax activation function is used to ensure the outputs have the
characteristics of normalized probabilities.

# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

A summary of the defined network is printed as a sanity check to ensure we have constructed what we intended.

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 50, 50)            370500
_________________________________________________________________
lstm_1 (LSTM)                (None, 50, 100)           60400
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               80400
_________________________________________________________________
dense_1 (Dense)              (None, 100)               10100
_________________________________________________________________
dense_2 (Dense)              (None, 7410)              748410
=================================================================
Total params: 1,269,810
Trainable params: 1,269,810
Non-trainable params: 0
_________________________________________________________________

Next, the model is compiled, specifying the categorical cross-entropy loss needed to fit the model. Technically, the model is learning a multi-class classification, and this is the suitable loss function for this type of problem. The efficient Adam implementation of mini-batch gradient descent is used, and the accuracy of the model is evaluated during training.

Finally, the model is fit on the data for 100 training epochs with a modest batch size of 128
to speed things up.

Training may take a few hours on modern hardware without GPUs. You can speed it up
with a larger batch size and/or fewer training epochs.

# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, batch_size=128, epochs=100)

During training, you will see a summary of performance, including the loss and accuracy
evaluated from the training data at the end of each batch update.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few times
and compare the average outcome.
You will get different results, but perhaps an accuracy of just over 50% of predicting the
next word in the sequence, which is not bad. We are not aiming for 100% accuracy (e.g. a
model that memorized the text), but rather a model that captures the essence of the text.

...
Epoch 96/100
118633/118633 [==============================] - 265s - loss: 2.0324 - acc: 0.5187
Epoch 97/100
118633/118633 [==============================] - 265s - loss: 2.0136 - acc: 0.5247
Epoch 98/100
118633/118633 [==============================] - 267s - loss: 1.9956 - acc: 0.5262
Epoch 99/100
118633/118633 [==============================] - 266s - loss: 1.9812 - acc: 0.5291
Epoch 100/100
118633/118633 [==============================] - 270s - loss: 1.9709 - acc: 0.5315

Save Model
At the end of the run, the trained model is saved to file.
Here, we use the Keras model API to save the model to the file ‘model.h5‘ in the current
working directory.
Later, when we load the model to make predictions, we will also need the mapping of words
to integers. This is in the Tokenizer object, and we can save that too using Pickle.

# save the model to file
model.save('model.h5')
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))

Complete Example
We can put all of this together; the complete example for fitting the language model is listed
below.

from numpy import array
from pickle import dump
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1

# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]

# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, batch_size=128, epochs=100)

# save the model to file
model.save('model.h5')
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))

Use Language Model


Now that we have a trained language model, we can use it.

In this case, we can use it to generate new sequences of text that have the same statistical
properties as the source text.
This is not practical, at least not for this example, but it gives a concrete example of what
the language model has learned.

We will start by loading the training sequences again.

Load Data
We can use the same code from the previous section to load the training data sequences of
text.

Specifically, the load_doc() function.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load cleaned text sequences
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

We need the text so that we can choose a source sequence as input to the model for
generating a new sequence of text.

The model will require 50 words as input.

Later, we will need to specify the expected length of input. We can determine this from the
input sequences by calculating the length of one line of the loaded data and subtracting 1
for the expected output word that is also on the same line.

seq_length = len(lines[0].split()) - 1
Load Model
We can now load the model from file.

Keras provides the load_model() function for loading the model, ready for use.
# load the model
model = load_model('model.h5')

We can also load the tokenizer from file using the Pickle API.

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))

We are ready to use the loaded model.

Generate Text
The first step in generating text is preparing a seed input.

We will select a random line of text from the input text for this purpose. Once selected, we
will print it so that we have some idea of what was used.

# select a seed text
seed_text = lines[randint(0,len(lines))]
print(seed_text + '\n')

Next, we can generate new words, one at a time.

First, the seed text must be encoded to integers using the same tokenizer that we used
when training the model.

encoded = tokenizer.texts_to_sequences([seed_text])[0]

The model can predict the next word directly by calling model.predict_classes() that will
return the index of the word with the highest probability.
# predict probabilities for each word
yhat = model.predict_classes(encoded, verbose=0)

We can then look up the index in the Tokenizer’s mapping to get the associated word.
out_word = ''
for word, index in tokenizer.word_index.items():
    if index == yhat:
        out_word = word
        break

We can then append this word to the seed text and repeat the process.

Importantly, as we append words, the input sequence will grow too long. We can truncate it to the desired length after the input sequence has been encoded to integers. Keras provides the pad_sequences() function that we can use to perform this truncation.
encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')

We can wrap all of this into a function called generate_seq() that takes as input the model,
the tokenizer, input sequence length, the seed text, and the number of words to generate. It
then returns a sequence of words generated by the model.
# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict probabilities for each word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

We are now ready to generate a sequence of new words given some seed text.

# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)

Putting this all together, the complete code listing for generating text from the learned language model is listed below.

from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict probabilities for each word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

# load cleaned text sequences
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
seq_length = len(lines[0].split()) - 1

# load the model
model = load_model('model.h5')

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))

# select a seed text
seed_text = lines[randint(0,len(lines))]
print(seed_text + '\n')

# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)

Running the example first prints the seed text.

when he said that a man when he grows old may learn many things for he can no more
learn much than he can run much youth is the time for any extraordinary toil of course and
therefore calculation and geometry and all the other elements of instruction which are a

Then 50 words of generated text are printed.

preparation for dialectic should be presented to the name of idle spendthrifts of whom the
other is the manifold and the unjust and is the best and the other which delighted to be the
opening of the soul of the soul and the embroiderer will have to be said at

Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few times
and compare the average outcome.
You can see that the text seems reasonable. In fact, concatenating the seed text with the generated text when printing would make the result easier to interpret. Nevertheless, the generated text gets the right kind of words in the right kind of order.

Try running the example a few times to see other examples of generated text. Let me know
in the comments below if you see anything interesting.

Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.

 Sentence-Wise Model. Split the raw data based on sentences and pad each sentence to
a fixed length (e.g. the longest sentence length).
 Simplify Vocabulary. Explore a simpler vocabulary, perhaps with stemmed words or stop
words removed.
 Tune Model. Tune the model, such as the size of the embedding or number of memory cells
in the hidden layer, to see if you can develop a better model.
 Deeper Model. Extend the model to have multiple LSTM hidden layers, perhaps with
dropout to see if you can develop a better model.
 Pre-Trained Word Embedding. Extend the model to use pre-trained word2vec or GloVe vectors to see if it results in a better model (see the sketch below).
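For the pre-trained embedding extension, here is a hedged sketch of the usual pattern, reusing tokenizer, vocab_size, and seq_length from the tutorial; the pretrained dictionary stands in for vectors parsed from a GloVe or word2vec file (loading not shown):

import numpy as np
from keras.layers import Embedding

embedding_dim = 100
pretrained = {}  # word -> vector, e.g. parsed from a GloVe text file (loading not shown)

# build a weight matrix aligned with the Tokenizer's word indices
weight_matrix = np.zeros((vocab_size, embedding_dim))
for word, index in tokenizer.word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        weight_matrix[index] = vector

# use it to initialize the Embedding layer; set trainable=True to fine-tune
embedding_layer = Embedding(vocab_size, embedding_dim,
                            weights=[weight_matrix],
                            input_length=seq_length,
                            trainable=False)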
Further Reading
This section provides more resources on the topic if you are looking to go deeper.

 Project Gutenberg
 The Republic by Plato on Project Gutenberg
 Republic (Plato) on Wikipedia
 Language model on Wikipedia
Summary
In this tutorial, you discovered how to develop a word-based language model using a word
embedding and a recurrent neural network.

Specifically, you learned:

 How to prepare text for developing a word-based language model.


 How to design and fit a neural language model with a learned embedding and an LSTM
hidden layer.
 How to use the learned language model to generate new text with similar statistical
properties as the source text.
Do you have any questions?
Ask your questions in the comments below and I will do my
best to answer.




288 Responses to How to Develop a Word-Level Neural Language Model and Use it to Generate Text

1.
nike November 10, 2017 at 12:59 pm #

Thank you for providing this blog. Have you used RNNs for recommendation? For example, using an RNN to recommend movies based on the sequences of movies a user has watched?

Jason Brownlee November 11, 2017 at 9:15 am #

I have not used RNNs for recommender systems, sorry.




Mike April 29, 2020 at 10:49 pm #

out_word = ''
for word, index in tokenizer.word_index.items():
    if index == yhat:
        out_word = word
        break

Why are you doing sequential search on a dictionary?




Jason Brownlee April 30, 2020 at 6:44 am #

It is a reverse lookup, by value not key.




Mike April 30, 2020 at 7:14 pm #

I see. But isn’t there a tokenizer.index_word (:: index -> word) dictionary for this
purpose?


Jason Brownlee May 1, 2020 at 6:34 am #

It might be, great tip! Perhaps it wasn’t around back when I wrote this, or I didn’t
notice it.
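A minimal sketch of that reverse lookup, assuming a Tokenizer version that exposes the index_word mapping and reusing yhat from the tutorial:

# reverse lookup by index, avoiding the linear scan over word_index
out_word = tokenizer.index_word.get(int(yhat[0]), '')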


Mike May 5, 2020 at 3:08 am #

Hey Jason, I have a question. I do not understand why the Embedding (input) layer and
output layer are +1 larger than the number of words in our vocabulary.

I am pretty certain that the output layer (and the input layer of one-hot vectors) must be the
exact size of our vocabulary so that each output value maps 1-1 with each of our vocabulary
word. If we add +1 to this size, where (to which word) does the extra output value map to?


Jason Brownlee May 5, 2020 at 6:34 am #

We add one “word” for “none” at index 0. This is for all words that we don’t know or that
we want to map to “don’t know”.

2.
Stuart November 11, 2017 at 6:01 am #

Outstanding article, thank you! Two questions:

1. How much benefit is gained by removing punctuation? Wouldn’t a few more punctuation-based
tokens be a fairly trivial addition to several thousand word tokens?

2. Based on your experience, which method would you say is better at generating meaningful text on
modest hardware (the cheaper gpu AWS options) in a reasonable amount of time (within several
hours): word-level or character-level generation?

Also it would be great if you could include your hardware setup, python/keras versions, and how long
it took to generate your example text output.
Jason Brownlee November 11, 2017 at 9:26 am #

Good questions.

Test and see, I expect the vocab drops by close to half.

AWS is good value for money for one-off models. I strongly recommend it:
https://machinelearningmastery.com/develop-evaluate-large-deep-learning-models-keras-amazon-web-services/
I have a big i7 imac for local dev and run large models/experiments on AWS.


Stuart November 11, 2017 at 10:49 am #

what do you mean “vocab drops by close to half” ?




Jason Brownlee November 12, 2017 at 8:59 am #

Sorry, I mean that removing punctuation such as full stops and commas from the end of
words will mean that we have fewer tokens to model.


Mark July 23, 2020 at 10:09 pm #

part of the issue is that “Party” and “Party; ” won’t be evaluated as the same

3.
Raoul November 12, 2017 at 11:55 pm #

Great tutorial and thank you for sharing!


Am I correct in assuming that the model always spits out the same output text given a specific seed
text?

Jason Brownlee November 13, 2017 at 10:17 am #

A trained model will, yes. The model will be different each time it is trained though:
https://machinelearningmastery.com/randomness-in-machine-learning/


Gustavo February 15, 2019 at 3:27 am #

Not always, you can generate new text by fitting a multinomial distribution, where you take
the probability of a character occurring and not the maximum probability of a character. This
allows more diversity to the generated text, and you can combine with “temperature”
parameters to control this diversity.
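A minimal sketch of that sampling idea, reusing model and the integer-encoded, padded input encoded from the tutorial (an illustration of the comment, not code from the post):

import numpy as np

# Sample the next word index from the predicted distribution, with temperature.
def sample_next_index(model, encoded, temperature=1.0):
    probs = model.predict(encoded, verbose=0)[0]        # distribution over the vocabulary
    logits = np.log(probs + 1e-10) / temperature        # temperature < 1 sharpens, > 1 flattens
    probs = np.exp(logits)
    probs = probs / probs.sum()
    return int(np.random.choice(len(probs), p=probs))   # sample instead of taking the argmax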


Jason Brownlee February 15, 2019 at 8:15 am #

Okay, thanks for sharing.



4.
Sarang November 13, 2017 at 1:15 am #

Jason,

Thanks for the excellent blogs, what do you think about the future of deep learning?
Do you think deep learning is here to stay for another 10 years?

Jason Brownlee November 13, 2017 at 10:18 am #
Yes. The results are too valuable across a wide set of domains.

5.
Roger January 3, 2018 at 8:40 pm #

Hi Jason, great blog!


Still, I got a question when running “model.add(LSTM(100, return_sequences=True))”.
TypeError: Expected int32, got list containing Tensors of type ‘_Message’ instead.
Could you please help? Thanks.

Roger January 3, 2018 at 9:15 pm #

by the way I am using os system, py3.5




Kirstie January 5, 2018 at 1:03 am #

Hi Roger. I had the same issue; updating TensorFlow with pip install --upgrade tensorflow worked for me.

Jason Brownlee January 4, 2018 at 8:10 am #

Sorry I have not seen this error before. Perhaps try posting to stackoverflow?

It may be the version of your libraries? Ensure everything is up to date.



6.
Roger January 8, 2018 at 6:14 pm #
Hi Jason,
Is it possible to use your codes above as a language model instead of predicting next word?
What I want is to judge “I am eating an apple” is more commonly used than “I an apple am eating”.
For short sentence, may be I don’t have 50 words as input.
Also, is it possible for Keras to output the probability with its index like the biggest probability for next
word is “republic” and I want to get the index for “republic” which can be matched in
tokenizer.word_index.
Thanks!

Roger January 8, 2018 at 8:53 pm #

do you have any suggestions if I want to use 3 previous words as input to predict next word?
Thanks.


Jason Brownlee January 9, 2018 at 5:28 am #

You can frame your problem, prepare the data and train the model.

3 words as input might not be enough.




Mike April 30, 2020 at 8:09 pm #

This is a Markov Assumption. The point of a recurrent NN model is to avoid that. If you ‘re
only going to use 3 words to predict the next then use an n-gram or a feedforward model
(like Bengio’s). No need for a recurrent model.
REPLY

o
Jason Brownlee January 9, 2018 at 5:25 am #

Not sure about your application sorry.


Keras can predict probabilities across the vocabulary and you can use argmax() to get the index
of the word with the largest probability.
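For example, a minimal sketch (assuming the tutorial's model and tokenizer, and an encoded input of shape (1, 50)):

import numpy as np

# probabilities across the whole vocabulary for the next word
probs = model.predict(encoded, verbose=0)[0]

# index of the most likely next word
yhat = np.argmax(probs)
print('predicted index:', yhat, 'probability:', probs[yhat])

# map the index back to a word
for word, index in tokenizer.word_index.items():
    if index == yhat:
        print('predicted word:', word)
        break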
REPLY

7.
Cid February 14, 2018 at 3:47 am #

Hey Jason,
Thanks for the post. I noticed in the extensions part you mention Sentence-Wise Modelling.
I understand the technique of padding (after reading your other blog post). But how does it
incorporate a full stop when generating text. Is it a matter of post-processing the text? Could it be
possible/more convenient to tokenize a full stop prior to embedding?
REPLY

o
Jason Brownlee February 14, 2018 at 8:23 am #

I generally recommend removing punctuation. It balloons the size of the vocab and in turn slows
down modeling.
REPLY


Cid February 14, 2018 at 8:00 pm #

OK thanks for the advice, how could I incorporate a sentence structure into my model then?
REPLY


Jason Brownlee February 15, 2018 at 8:40 am #

Each sentence could be one “sample” or sequence of words as input.


REPLY

8.
Maria February 16, 2018 at 10:32 pm #
Hi Jason, I tried to use your model and train it with a corpus I had. Everything seemed to work fine,
but at the end I have this error:
34 sequences = array(sequences)
35 #print(sequences)
—> 36 X, y = sequences[:,:-1], sequences[:,-1]
37 # print(sequences[:,:-1])
38 # X, y = sequences[:-1], sequences[-1]

IndexError: too many indices for array

regarding the sliding of the sequences. Do you know how to fix it?
Thanks so much!
REPLY

o
Jason Brownlee February 17, 2018 at 8:45 am #

Perhaps double check your loaded data has the shape that you expect?
REPLY


Ray March 1, 2018 at 7:31 am #

Hi Jason and Maria.


I am having the exact same problem too.
I was hoping Jason might have better suggestion
here is the error:

IndexError Traceback (most recent call last)


in ()
1 # separate into input and output
2 sequences = array(sequences)
—-> 3 X, y = sequences[:,:-1], sequences[:,-1]
4 y = to_categorical(y, num_classes=vocab_size)
5 seq_length = X.shape[1]

IndexError: too many indices for array


REPLY

Jason Brownlee March 1, 2018 at 3:05 pm #

Are you able to confirm that your Python 3 environment is up to date, including Keras
and tensorflow?

For example, here are the current versions I’m running with:
python: 3.6.4
scipy: 1.0.0
numpy: 1.14.0
matplotlib: 2.1.1
pandas: 0.22.0
statsmodels: 0.8.0
sklearn: 0.19.1
nltk: 3.2.5
gensim: 3.3.0
xgboost: 0.6
tensorflow: 1.5.0
theano: 1.0.1
keras: 2.1.4

REPLY


Doron Ben Elazar March 20, 2018 at 8:14 am #

It means that your input sequences are not all the same length, so np.array does not parse them into a
rectangular 2D array (the author created sequences of 50 tokens each). A possible fix is to zero-pad
every sequence up to a common length, for example:

import numpy as np

original_sequences = tokenizer.texts_to_sequences(text_chunk)
vocab_size = len(tokenizer.word_index) + 1

aligned_sequences = []
for sequence in original_sequences:
    # pad each sequence with zeros up to max_len (the length of the longest sequence)
    aligned_sequence = np.zeros(max_len, dtype=np.int64)
    aligned_sequence[:len(sequence)] = np.array(sequence, dtype=np.int64)
    aligned_sequences.append(aligned_sequence)

sequences = np.array(aligned_sequences)


Cathy October 18, 2019 at 10:30 pm #

Do you use gensim for generating this code?


If you use gensim, where are the gensim commands you used in your code?


Jason Brownlee October 19, 2019 at 6:37 am #

Gensim is not used in this tutorial.


Basil June 28, 2018 at 6:36 pm #

Don’t know if it too late to respond. the issue arises because u have by mistake typed

tokenizer.fit_on_sequences before instead of tokenizer.texts_to_sequences.

Hope this helps others who come to this page in the future!

Thanks Jason.
REPLY


Jason Brownlee June 29, 2018 at 5:52 am #

Thanks for the note.


JC Ulundo August 7, 2018 at 12:27 pm #

Basil I tried your suggestion but still encountering the same error.
Did anyone go through this error and got it fixed?

Jason Brownlee August 7, 2018 at 2:32 pm #

I have some suggestions here:


https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-
tutorial-not-work-for-me

o
Naetmul October 28, 2018 at 1:06 pm #

The main problem is that tokenizer.texts_to_sequences(lines) returns List of List, not


a guaranteed rectangular 2D list,
which means that sequences may have this form:
[[1, 2, 3], [2, 3, 4, 5]]
but the example in this article assumes that sequences should be a rectangular shaped list like:
[[1, 2, 3, 4], [2, 3, 4, 5]]
If you used a custom clean_doc function, you may need to use custom filter parameter for the
Tokenizer(), like tokenizer = Tokenizer(filters='\n').
The constructor of Tokenizer() has an optional parameter
filters: a string where each element is a character that will be filtered from the texts. The default
is all punctuation, plus tabs and line breaks, minus the ‘ character.

The example already removed all the punctuation and whitespaces, so it will be not a problem in
the example.
However, if you used a custom one, then it can be a problem.
REPLY

9.
vikas dixit March 5, 2018 at 9:26 pm #

Sir, I have a vocabulary size of 12000 and when I use to_categorical the system throws a MemoryError as
shown below:

/usr/local/lib/python2.7/dist-packages/keras/utils/np_utils.pyc in to_categorical(y, num_classes)


22 num_classes = np.max(y) + 1
23 n = y.shape[0]
—> 24 categorical = np.zeros((n, num_classes))
25 categorical[np.arange(n), y] = 1
26 return categorical

MemoryError:

How to solve this error??


REPLY

o
Jason Brownlee March 6, 2018 at 6:12 am #

Perhaps try running the code on a machine with more RAM, such as an AWS EC2 instance?

Perhaps try mapping the vocab to integers manually with more memory efficient code?
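One memory-friendly alternative (not used in the tutorial) is to skip the one-hot to_categorical step entirely and keep the targets as integers with a sparse loss; a minimal sketch, assuming the model is defined as in the tutorial:

# keep y as integer word indices instead of one-hot vectors
X, y = sequences[:, :-1], sequences[:, -1]

# sparse_categorical_crossentropy accepts integer targets directly,
# avoiding the (n_sequences, vocab_size) one-hot matrix in memory
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, batch_size=128, epochs=100)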
REPLY

10.
Eli Mshomi March 7, 2018 at 9:40 am #

Can you generate an article, based on other related articles, that would be human readable?
REPLY

o
Jason Brownlee March 7, 2018 at 3:05 pm #

Yes, I believe so. The model will have to be large and carefully trained.
REPLY

11.
Adam March 12, 2018 at 9:52 pm #

Hi Jason,
I just tried out your code here with my own text sample (blog posts from a tumblr blog) and trained it
and it’s now gotten to the point where text is no longer “generated”, but rather just sent back
verbatim.
The sample set I’m using is rather small (each individual post is fairly small, so I made each
sequence 10 input 1 output) giving me around 25,000 sequences. Everything else was kept the
same code from your tutorial – after around 150 epochs, the accuracy was around 0.99 so I cut it off
to try generation.

When I change the seed text from something to the sample to something else from the vocabulary
(ie not a full line but a “random” line) then the text is fairly random which is what I wanted. When the
seed text is changed to something outside of the vocabulary, the same text is generated each time.

What should I do if I want something more random? Should I change something like the memory
cells in the LSTM layers? Reduce the sequence length? Just stop training at a higher loss/lower
accuracy?

Thanks a ton for your tutorials, I’ve learned a lot through them.
REPLY

o
Jason Brownlee March 13, 2018 at 6:28 am #

Perhaps the model is overfit and could be trained less?

Perhaps a larger dataset is required?


REPLY


Adam March 13, 2018 at 9:17 am #

I think it might just be overfit. Sadly, there isn't more data that I can grab (at least that I know of
currently), which is a shame; that's why I reduced the
sequence length to 10.

I checkpointed every epoch so that I can play around with what gives the best results.
Thanks for your advice!
REPLY


Jason Brownlee March 13, 2018 at 3:04 pm #

Perhaps also try a blend/ensemble of some of the checkpointed models or models from
multiple runs to see if it can give a small lift.
REPLY
12.
Prateek March 30, 2018 at 4:05 pm #

Hi Jason,

I check the amount of accuracy and loss on Tensorflow. I would like to know what exactly do you
means in accuracy in NLP?

In computer vision, if we wish to predict cat and the predicted out of the model is cat then we can
say that the accuracy of the model is greater than 95%.

I want to understand physically what do we mean by accuracy in NLP models. Can you please
explain?
REPLY

o
Jason Brownlee March 31, 2018 at 6:34 am #

I would recommend using BLEU instead of accuracy. Accuracy does not have any useful
meaning.

Learn more about BLEU here:


https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
REPLY

13.
Jin Zhou April 14, 2018 at 12:10 pm #

Hi, I have a question about evaluating this model. As I know, perplexity is a very popular method.
For BLEU and perplexity, which one do you think is better? Could you give an example about how to
evaluate this model in keras.
REPLY

o
Jason Brownlee April 15, 2018 at 6:21 am #

Both are good. I don’t have a favorite.


I cover how to calculate BLEU here:
https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
REPLY

14.
Jin April 23, 2018 at 7:11 pm #

Hi Jason, I am a little confused, you told us that we need 100 words as input, but your X_train is only
50 words per line. Could you explain that a little?
REPLY

o
Jason Brownlee April 24, 2018 at 6:29 am #

Thanks, that was a typo. Fixed.


REPLY

15.
jason May 1, 2018 at 9:54 am #

Jason,

You have an embedding layer as the part of the model. The embedding weights will be trained along
with other layers. Why not separate and train it independently? In other words, first using Embedding
alone to train word vectors/embedding weights. Then run your model with the embedding layer initialized
with those embedding weights and trainable=False set. I see most people use your approach to train a
model. But this is kind of against the purpose of embedding because the output is not word context
but 0/1 labels. Why not replace embedding with an ordinary layer with linear activation?

Another jason
REPLY

o
Jason Brownlee May 2, 2018 at 5:36 am #

You can train them separately. I give examples on the blog.

Often, I find the model has better skill when the embedding is trained with the net.
REPLY


jason May 2, 2018 at 9:49 am #

“I give examples on the blog.” I guess you mean there are other posts in your blog talking
about “train them separately”.
REPLY


Jason Brownlee May 3, 2018 at 6:28 am #

Yes, you can search here:


https://machinelearningmastery.com/site-search/
REPLY

16.
johhnybravo May 19, 2018 at 2:33 am #

ValueError: Error when checking : expected embedding_1_input to have shape (50,) but got array
with shape (1,) while predicting output
REPLY

o
Jason Brownlee May 19, 2018 at 7:45 am #

This might help:


https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-
for-me
REPLY

o
Tanaya June 2, 2018 at 11:19 pm #

Hi. Did you get a solution to the problem? I am getting the same error.
REPLY

Ervin October 19, 2018 at 9:08 am #

I solved it by following this post:

https://stackoverflow.com/questions/39950311/keras-error-on-predict
REPLY


fanzhh January 11, 2019 at 12:36 pm #

thank you.
REPLY

17.
Nitin Mukesh May 30, 2018 at 6:33 pm #

Hi, I want to develop Image Captioning in keras. What are the pre requisite for this? I have done
your tutorials for object detection using CNN. What should I do next?
REPLY

o
Jason Brownlee May 31, 2018 at 6:15 am #

You can follow this tutorial:


https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-
python/
Or I lay out all the required prior knowledge in this book:
https://machinelearningmastery.com/deep-learning-for-nlp/
REPLY

18.
mel_dagh June 1, 2018 at 1:59 am #

Hi Jason,

1.) Can we use this approach to predict if a word in a given sequence of the training data is highly
odd..i.e. that it does not occur in that position of the sequence with high probability.
2.) Is there a way to integrate pre-trained word embeddings (glove/word2vec) in the embedding
layer?
REPLY

o
Jason Brownlee June 1, 2018 at 8:24 am #

It is more about generating new sequences than predicting words.

Here are examples of working with pre-trained word embeddings:


https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
REPLY

19.
Tanaya June 8, 2018 at 7:29 pm #

Hello, this is simply an amazing post for beginners in NLP like me.

I have generated a language model, which is further generating text like I want it to.
My question is, after this generation, how do I filter out all the text that does not make sense,
syntactically or semantically?
REPLY

o
Jason Brownlee June 9, 2018 at 6:49 am #

Thanks.

You might need another model to classify text, correct text, or filter text prior to using it.
REPLY


Tanaya June 12, 2018 at 4:40 pm #

Will this also be using Keras? Do you recommend using nltk or SpaCy?
REPLY


Jason Brownlee June 13, 2018 at 6:15 am #

Perhaps, I was trying not to be specific.


REPLY

20.
Venkat June 12, 2018 at 4:38 am #

Hi Jason,

Thanks for the amazing post.

I’m working on words correction in a sentence. Ideally the model should generate number of output
words equal to input words with correction of error word in the sentence.

Does language model help me for the same.? if it does please leave some hints on the model.
REPLY

o
Jason Brownlee June 12, 2018 at 6:48 am #

Sounds like a fun project. A language model might be useful. Sorry, I don’t have a worked
example. I recommend checking the literature.
REPLY

21.
Amar June 20, 2018 at 8:53 pm #

Hi Jason,

I am working on a highly imbalanced text classification problem.


Ex : Classify a “Genuine Email ” vs “SPAM ” based on text in body of email.

Genuine email text = 30 k lines


Spam email text = 1k lines

I need to classify whether the next email is Genuine or SPAM.

I train the model using the same example as above.


The training data I feed to the model is only “Genuine text”.
Will I be able to classify the next sentence I feed to the model, from the generated text and
probability of words : as “GENUINE Email” vs “SPAM”?

(I am assuming that the model has never seen SPAM data, and hence the probability of the
generated text will be very less.)

Do you see any way I can achieve this with Language model? What would be an alternative
otherwise when it comes to rare event scenario for NLP use cases.

Thank you!!
REPLY

o
Jason Brownlee June 21, 2018 at 6:17 am #

I would recommend using a CNN model for text classification, for example:
https://machinelearningmastery.com/best-practices-document-classification-deep-learning/
REPLY


Amar June 21, 2018 at 4:25 pm #

Thank you Jason!!


REPLY


Jason Brownlee June 21, 2018 at 5:00 pm #

No problem.
REPLY

22.
musa June 29, 2018 at 2:36 am #

Hi Jason,

Could you comment on overfitting when training language models. I’ve built a sentence based LSTM
language model and have split training:validation into an 80:20 split. I'm not seeing any improvements
in my validation data whilst the accuracy of the model seems to be improving.
Thanks
REPLY

o
Jason Brownlee June 29, 2018 at 6:13 am #

Overfitting a language model really only matters in the context of the problem for which you are
using the model.

A language model used alone is not really that useful, so overfitting doesn’t matter. It may be
purely a descriptive model rather than predictive.

Often, they are used in the input or output of another model, in which case, overfitting may limit
predictive skill.
REPLY

23.
Marc June 30, 2018 at 8:42 am #

What would the implications of returning the hidden and cell states be here?

I’d imagine if the input sequence was long enough it wouldn’t matter too much as the temporal
relationships would be captured, but if we had shorter sequences or really long documents we would
consider doing this to improve the model’s ability to learn.. am I thinking about this correctly?

What would the drawback be to returning the hidden sequence and cell state and piping that into the
next observation?
REPLY

o
Jason Brownlee July 1, 2018 at 6:21 am #

I don’t follow, why would you return the cell state externally at all? It has no meaning outside of
the network.
REPLY


star July 5, 2018 at 2:25 am #
Hope you are doing well. I have a question which returns to my understanding from
embedding vectors.
For example if I have this sentence “ the weather is nice” and the goal of my model is
predicting “nice”, when I want to use pre trained google word embedding model, I must
search embedding google matrix and find the embedding vector related to words “the”
“weather” “is” “nice” and feed them as input to my model? Am I right?
REPLY


Jason Brownlee July 5, 2018 at 8:01 am #

Why would you do that?


REPLY

24.
Fatemeh July 4, 2018 at 2:34 am #

Hello, Thank you for nice description. I want to use pre-trained google word embedding vectors, So I
think I don’t need to do sequence encoding, for example if I want to create sequences with the
length 10, I have to search embedding matrix and find the equivalence embedding vector for each of
the 10 words, right?
REPLY

o
Jason Brownlee July 4, 2018 at 8:28 am #

Correct. You must map each word to its distributed representation when preparing the
embedding or the encoding.
REPLY


Fatemeh July 5, 2018 at 2:43 am #

Thank you, do you have sample code for that in your book? I purchased your professional
package
REPLY

Jason Brownlee July 5, 2018 at 8:01 am #

All sample code is provided with the PDF in the code/ directory.
REPLY


fatemeh July 6, 2018 at 1:57 am #

Thank you. when I want to use model.fit , I have to specify X and y and I have used
pre trained google embedidng matrix. so every word has mapped to a a vector, and
my inputs are actually the sentences (with the length 4). now I don’t understand the
equivalent values for X. for example imagine the first sentence is “the weather is
nice” so the X will be “the weather is” and the y is “nice”. When I want to convert X to
integers, every word in X will be mapped to one vector? for example if the equivalent
vectors for the words in sentence in google model are :”the”= 0.9,0.6,0.8 and
“weather”=0.6,0.5,0.2 and “is”=0.3,0.1,0.5 , and “nice”=0.4,0.3,0.5 the input X will be
:[[0.9,0.6,0.8],[0.6,0.5,0.2],[0.3,0.1,0.5]] and the output y will be [0.4,0.3,0.5]?


Jason Brownlee July 6, 2018 at 6:45 am #

You can specify or learn the mapping, but after that the model will map integers to
their vectors.

25.
mh July 5, 2018 at 4:19 am #

Hello Jason, why didn't you convert the X input to one-hot vector format? You only did this for the y output.
REPLY

o
Jason Brownlee July 5, 2018 at 8:02 am #

We don’t need to as we use a word embedding.
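For example, a minimal sketch of the shapes involved (toy values, not from the tutorial); the Embedding layer consumes integer word indices directly, so only y needs one-hot encoding:

import numpy as np
from keras.utils import to_categorical

vocab_size = 7                                  # made-up toy vocabulary size
X = np.array([[1, 2, 3], [2, 3, 4]])            # integer-encoded input words, shape (2, 3)
y = np.array([4, 5])                            # integer-encoded next words
y = to_categorical(y, num_classes=vocab_size)   # one-hot targets, shape (2, 7)
# X is passed to the Embedding layer as-is; no one-hot encoding of the input is needed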


REPLY
26.
Anshuman Mahapatra July 5, 2018 at 10:43 pm #

Hi..First of all would like to thank for the detailed explaination on the concept. I was executing this
model step by step to have a better understanding but am stuck while doing the predictions
yhat = model.predict_classes(encoded, verbose=0)
I am getting the below error:-
ValueError: Error when checking input: expected embedding_1_input to have shape (50,) but got
array with shape (1,)
REPLY

o
Jason Brownlee July 6, 2018 at 6:42 am #

I have some suggestions here:


https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-
for-me
REPLY

o
ervin October 23, 2018 at 9:00 am #

i fixed this problem following this post

https://stackoverflow.com/questions/39950311/keras-error-on-predict
REPLY

27.
NLP_enthusiast August 3, 2018 at 5:08 pm #

Hello, nice tutorial!

When callling:
model.fit(X, y, batch_size=128, epochs=100)
what is X.shape and y.shape at this point?
I’m getting the error:
Error when checking input: expected embedding_1_input to have shape (500,) but got array with
shape (1,)

In my case:
X.shape = (500,) # 500 samples i my case
y.shape = (500, 200) # this is after y= to_categorical(y, num_classes=vocab_size)

Using Theano backend.


REPLY

o
Jason Brownlee August 4, 2018 at 6:01 am #

I have some suggestions here:


https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-
for-me
REPLY


NLP_enthusiast August 4, 2018 at 7:46 am #

thank you. Potentially useful to others: X.shape should be of the form (a, b) where a is the
length of “sequences” and b is the input sequence length to make forward predictions. Note
that modification/omission of string.maketrans() is likely necessary if using Python 2.x
(instead of Python 3.x) and that Theanos backend may also alleviate potential dimension
errors from Tensorflow.
REPLY


Jason Brownlee August 5, 2018 at 5:21 am #

Thanks.
REPLY


Mike April 30, 2020 at 8:15 pm #
I don’t understand why the length of each sequence must be the same (i.e. fixed)? Isn’t
the point of RNNs to handle variable length inputs by taking as input one word at a time
and have the rest represented in the hidden state.

Is it used for optimization when we happen to use the same size for all but it’s not
actually necessary for us to do so? That would make more sense.
REPLY


Jason Brownlee May 1, 2020 at 6:36 am #

You are correct, and a dynamic RNN can do this.

Most of my implementations use a fixed length input/output for efficiency and zero
padding with masking to ignore the padding.
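A minimal sketch of that padding-plus-masking approach (not code from the tutorial), assuming sequences is a list of variable-length lists of word indices and vocab_size is defined as in the tutorial; the lengths and layer sizes are arbitrary:

from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# zero-pad variable-length sequences up to a fixed length (51 = 50 inputs + 1 output here)
padded = pad_sequences(sequences, maxlen=51, padding='pre')
X, y = padded[:, :-1], padded[:, -1]

model = Sequential()
# mask_zero=True tells downstream layers to ignore the zero-padded timesteps
model.add(Embedding(vocab_size, 50, input_length=X.shape[1], mask_zero=True))
model.add(LSTM(100))
model.add(Dense(vocab_size, activation='softmax'))
# y can then be one-hot encoded (or used with a sparse loss) exactly as in the tutorial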

o
usman September 4, 2018 at 5:35 pm #

@NLP_enthusiast How did you solve this error?


REPLY


Carlos Aguayo November 20, 2018 at 10:49 am #

You can just replace these:

table = string.maketrans(string.punctuation, ' ')
tokens = [w.translate(table) for w in tokens]

with this:

tokens = [' ' if w in string.punctuation else w for w in tokens]


REPLY


Jason Brownlee November 20, 2018 at 2:04 pm #

Very nice!
REPLY

28.
Eric August 20, 2018 at 9:15 pm #

Does anyone have an example of how predict based on a user provided text string instead of
random sample data. I’m kind of new to this so my apologies in advance for such a simple question.
Thank you!
REPLY

o
Jason Brownlee August 21, 2018 at 6:15 am #

The new input data must be prepared in the same way as the training data for the model.

E.g. the same data preparation steps and tokenizer.
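For example, a minimal sketch assuming the tutorial's tokenizer, model and 50-word input length; the seed string here is made up:

from keras.preprocessing.sequence import pad_sequences

seed_text = 'down yesterday to the piraeus with glaucon'   # any user-provided text
# encode with the SAME tokenizer that was fit on the training data
encoded = tokenizer.texts_to_sequences([seed_text])[0]
# truncate/pad to the fixed 50-word input length expected by the model
encoded = pad_sequences([encoded], maxlen=50, truncating='pre')
# index of the predicted next word
yhat = model.predict_classes(encoded, verbose=0)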


REPLY

29.
Santosh August 21, 2018 at 6:28 am #

Can we implement this thing in Android platform to run trained model for given set of words from
user.
REPLY

o
Jason Brownlee August 21, 2018 at 6:38 am #

Maybe, I don’t have any examples on Android, sorry.


REPLY

30.
anirban August 31, 2018 at 11:29 pm #

Hi,
One of the extensions suggested in the blog is
Sentence-Wise Model. Split the raw data based on sentences and pad each sentence to a fixed
length (e.g. the longest sentence length).
So if I do a sentence wise splitting then do I retain the punctuations or remove it?
REPLY

o
Jason Brownlee September 1, 2018 at 6:21 am #

Sounds like a good start.


REPLY

31.
wu.zheng September 6, 2018 at 1:40 pm #

# separate into input and output


sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]

is this model only use the 50 words before to predict last world ?

the context length is fixed ?


REPLY

o
Jason Brownlee September 6, 2018 at 2:13 pm #

In this example we use 50 words as input.


REPLY


riyaz September 5, 2020 at 4:56 am #

Why are the input and output the same? The output should be shifted one step to the left, right?
REPLY


Jason Brownlee September 5, 2020 at 6:53 am #
The output is not the same as the input.
REPLY


riyaz September 5, 2020 at 3:47 pm #

sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
then what is meant by this?


Jason Brownlee September 6, 2020 at 6:03 am #

If you are new to array slices and indexes in Python, this will help you get started:
https://machinelearningmastery.com/index-slice-reshape-numpy-arrays-machine-
learning-python/

32.
Anshul Patel October 3, 2018 at 5:55 pm #

Hello Jason, I am working on word predictor using RNN’s too. However I have been encountering
the same problem faced by many others i.e. INDEXERROR : Too many Indices

lines = training_set.split('\n')
tokenizer = Tokenizer()
tokenizer.fit_on_sequences(lines)
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)

vocab_size = len(tokenizer.word_index) + 1

sequences = array(sequences)
X_train = sequences[:, :-1]
y_train = sequences[:,-1]

I have went through all the comments related to this error, However none of them solve my issue. I
wonder if there is any problem with the text I imported it is Pride and Prejudice book from Gutenberg.

Please help me : THANKS in Advance !!!


REPLY

o
Jason Brownlee October 4, 2018 at 6:12 am #

I have not seen this error, are you able to confirm that your libraries are up to date and that you
copied all of the code from the tutorial?
REPLY

33.
Andrew October 16, 2018 at 4:51 pm #

Very nice article.

How long did it take to train model like this on 118633 sequences of 50 symbols from 7410 elements
dictionary?

Can you share on what machine did you train the model? ram, cpu, os?
REPLY

o
Jason Brownlee October 17, 2018 at 6:46 am #

Just a normal workstation without GPU.

If you have machine problems, perhaps try an EC2 instance, I show how here:
https://machinelearningmastery.com/develop-evaluate-large-deep-learning-models-keras-
amazon-web-services/
REPLY

34.
ervin October 20, 2018 at 3:36 am #

Great article!

2 questions:

1. In your ‘Extension’ section — you mentioned to try dropout. Since there is not validation/holdout
set, why should we use dropout? Isn’t that just a regularization technique and doesn’t help with
training data accuracy?
2. Speaking of accuracy — I trained my model to 99% accuracy. When I generate text and use exact
lines from PLATO as seed text, I should be getting almost an exact replica of PLATO right? Since
my model has 99% of getting the next word right in my seed text. I’m finding that this is not the case.
Am I interpreting this 99% accuracy right? What’s keeping it from making a replica.
REPLY

o
Jason Brownlee October 20, 2018 at 5:59 am #

Yes, it is a regularization method. Yes, it can help as the model is trained using supervised
learning.

Accuracy is not a valid measure for a language model.

We don’t want the model to memorize Plato, we want the model to learn how to make Plato-like
text.
REPLY

35.
OkRo November 29, 2018 at 9:17 pm #

Hi , Great article!
I have a question – How can I use the mode to get probability of a word in a sentence?
e.g. P(Piraeus| went down yesterday to the?
REPLY

o
Jason Brownlee November 30, 2018 at 6:31 am #

I don’t think the model can be used in this way.


REPLY

36.
Emi December 10, 2018 at 12:36 pm #

I am a big fan of your tutorials and I used to search your tutorials first when I want to learn things in
deep learning.
I currently have a problem as follows.

I have a huge corpus of unstructured text (where I have already cleaned and tokenised as; word1,
word2, word3, … wordN). I also have a target word list which should be outputted by analysing these
cleaned text (e.g., word88, word34, word 48, word10002, … , word8).

I want to build a language model, that correcly output these target words using my cleaned text. I am
just wondering if it is possible to do using deep learning techniques? Please let me know your
thoughts.

If you have tutorial related to above senario please share it with me (as I could not find).

Thank you for the great tutorials.

-Emi
REPLY

o
Jason Brownlee December 10, 2018 at 2:19 pm #

Almost sounds like a translation or text summarization task?

Perhaps models used for those problems would be a good starting point?
REPLY


Emi December 10, 2018 at 4:36 pm #

Thank you for your reply. No, it is not translation or summarisation. I have given an example
below with more details 🙂

—————————————————————————————————————-

My current input preparation process is as follows:


Unstructured text -> cleaning the data -> get only the informative words -> calculate different
features

Example (consider we have only 5 words):


Informative words = {“Deep Learning”, “SVM”, “LSTM”, “Data Mining”, ‘Python’}
For each word I also have features (consider we have only 3 words)
Features = {Frequency, TF-IDF, MI}
However, I am not sure if I need these feautres when constructing deep learning model.

——————————————————————————————————————–

My output is a ranked list of informative words.


Target output = {‘SVM’, ‘Data Mining’, ‘Deep Learning’, ‘Python’, ‘LSTM’}
——————————————————————————————————————–

I am just trying to figure out if there is a way to obtain my target output using deep learning
or ML model? Please let me know your thoughts.

-Emi
REPLY


Jason Brownlee December 11, 2018 at 7:39 am #

Sounds like you want a model to output the same input, but ordered in some way.

Perhaps try an encoder-decoder?


REPLY


Emi December 11, 2018 at 12:48 pm #

Hi Jason,

Thank you very much for your suggestion. I followed the following tutorial of yours
today related to encoder-decoder: https://machinelearningmastery.com/develop-
encoder-decoder-model-sequence-sequence-prediction-keras/
It is very well explained and I would like to use it for my task. However, I got one
small problem.

In your example, you are using 100000 trainging examples as mentioned below.

X=[22, 17, 23, 5, 29, 11] y=[23, 17, 22]


X=[28, 2, 46, 12, 21, 6] y=[46, 2, 28]
X=[12, 20, 45, 28, 18, 42] y=[45, 20, 12]
X=[3, 43, 45, 4, 33, 27] y=[45, 43, 3]

X=[34, 50, 21, 20, 11, 6] y=[21, 50, 34]

However, in my task I only have one input sequence and target sequence as shown
below (same input, but ordered in different way).

Informative words = {“Deep Learning”, “SVM”, “LSTM”, “Data Mining”, ‘Python’}


Target output = {‘SVM’, ‘Data Mining’, ‘Deep Learning’, ‘Python’, ‘LSTM’}

Would it be a problem?

PS: However, my input and target sequence are very long in my real dataset
(around 10000 words of length).


Jason Brownlee December 11, 2018 at 2:33 pm #

You will have to adapt the model to your problem and run tests in order to
_discover_ whether the model is suitable or not.

37.
Emi December 11, 2018 at 10:38 am #

Thank you very much for your valuable suggestion. I truly appreciate it 🙂

I have followed your tutorial of 'Encoder-Decoder LSTM' for time-series analysis. Do you have any
tutorial of 'encoder-decoder' that is close to my task? If so, can you please share the link with me?
REPLY

38.
Jin December 14, 2018 at 8:23 pm #

Hi Jason, I have a question about the pre-trained word vectors. I know I should set the embedding
layer with weights=[pre_embedding], but how should decide the order of pre_embedding? Like,
which word does the vector represent in a certain row. Also, should the first row always be all zeros?
REPLY

o
Jason Brownlee December 15, 2018 at 6:11 am #

The index of each vector must match the encoded integer of each word in the vocabulary.

That is why we collect the vectors needed for each word in the vocab incrementally.
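A minimal sketch of lining up pre-trained vectors with the tokenizer's integer encoding (assuming embeddings_index is a dict mapping words to vectors, e.g. loaded from a GloVe file, and tokenizer is the one fit on the training text):

import numpy as np

embedding_dim = 100  # must match the dimensionality of the pre-trained vectors
vocab_size = len(tokenizer.word_index) + 1

# row i holds the vector for the word encoded as integer i; row 0 (padding) stays all zeros
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# then initialise the Embedding layer with it, e.g.:
# Embedding(vocab_size, embedding_dim, weights=[embedding_matrix],
#           input_length=seq_length, trainable=False)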
REPLY

39.
Lee January 9, 2019 at 1:42 pm #

I am running the model as described here and in the book, but the loss goes to nan frequently. My
computer (18 cores + Vega 64 graphics card) also takes much longer to run an epoch than shown
here. All cpu leads to 1 hour finishing time. Encoding as int8 and using the GPU via PlaidML speeds
it up to ~376 seconds, but nan’s out.

Any advice? The code is exactly as is used both here and the book, but I just can’t get it to finish a
run.
REPLY

o
Jason Brownlee January 10, 2019 at 7:46 am #

I have two ideas:

Did you try running the code file provided with the book?
Are you able to confirm that your libraries are up to date?
REPLY

40.
Palani January 20, 2019 at 6:11 pm #

Great tutorial for a beginner Jason! Thanks!


REPLY

o
Jason Brownlee January 21, 2019 at 5:30 am #

Thanks, I’m glad it helped.


REPLY
41.
Matt Lust February 5, 2019 at 9:17 am #

Im running this in Google Colab (albeit with a different and larger data set), The Colab System
crashes and my runtimes are basically reset.

I broke down each section of as you do in this example and found that it crashes at this code

# separate into input and output


sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]

I think the issue is that my dataset might be too large but I’m not sure.

heres a link to the Colab notebook.

https://colab.research.google.com/drive/1iTXV_iC-aBTSlQ8BtiwM7zSszmIrlB3k
REPLY

o
Jason Brownlee February 5, 2019 at 2:20 pm #

Sorry, I don’t have any experience with that environment.

I recommend running all code on your own workstation.


REPLY

o
Valerie May 2, 2019 at 11:06 pm #

Same problem(((( google colab doesnt have enough RAM for such a big matrix.
REPLY


Jason Brownlee May 3, 2019 at 6:21 am #

Perhaps try running on your workstation or AWS EC2?


REPLY
42.
vijay February 13, 2019 at 4:49 pm #

Hey Jason,
I have two questions,

1.what will happen when we test with new sequence,instead of trying out same sequence already
there in training data?

2.why could you use voc_size-1?


REPLY

o
Jason Brownlee February 14, 2019 at 8:38 am #

Good question, not sure. The model is tailored to the specific dataset, it might just generate
garbage or iterate back to what it knows.

Not sure about your second question, what are you referring to exactly?
REPLY

43.
saria March 13, 2019 at 5:19 am #

Hi Jason, I hope you can help me with my confusion.


So when we feed the data into an LSTM, one dimension is the features and another is the timesteps.
If I am interested in keeping the context as one paragraph, and the longest paragraph I have is 200
words, then I should set the timesteps to 200.
But what will happen to the features?
What will be the features fed to the model?
Are the features here words, or paragraphs?
Sorry, this confused me a lot; I am not sure how to prepare my text data.
If my features here are words, then why do I even need to split by paragraph? In that case I can split by
words and use 200 timesteps, so a context of 200 will be kept for my case.
Am I right?

Thanks for taking the time:)


REPLY
o
Jason Brownlee March 13, 2019 at 8:01 am #

If you are feeding words in, a feature will be one word, either one hot encoded or encoded using
a word embedding.
REPLY

44.
islam March 29, 2019 at 11:23 pm #

Thanks for every other informative website. Where else may


I get that kind of info written in such a perfect means? I
have a mission that I am just now working on, and I’ve been on the look
out for such information.
REPLY

o
Jason Brownlee March 30, 2019 at 6:28 am #

Glad it helped.
REPLY

45.
torr March 31, 2019 at 3:51 pm #

Can you help me with code and good article for grammar checker.
REPLY

o
Jason Brownlee April 1, 2019 at 7:47 am #

Sorry, I don’t have an example of a grammar checker.


REPLY

46.
Aksha April 6, 2019 at 3:45 am #
I am new to NLP realm. If you have an input text “The price of orange has increased” and output text
“Increase the production of orange”. Can we make our RNN model to predict the output text? Or
what algorithm should I use? Could you please let me know what algorithm to use for mapping input
sentence to output sentence.
REPLY

o
Jason Brownlee April 6, 2019 at 6:52 am #

Yes, perhaps start with a model for text translation:


https://machinelearningmastery.com/?s=translation&post_type=post&submit=Search
REPLY

47.
Thomas L. Packer April 27, 2019 at 7:48 am #

For those who want to use a neural language model to calculate probabilities of sentences, look
here:

https://stackoverflow.com/questions/51123481/how-to-build-a-language-model-using-lstm-that-
assigns-probability-of-occurence-f
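A minimal sketch of the idea (not the code from the linked answer), assuming the tutorial's model, tokenizer and 50-word context window: the log-probability of a sentence is the sum of the log-probabilities the model assigns to each word given the words before it.

import numpy as np
from keras.preprocessing.sequence import pad_sequences

def sentence_log_prob(model, tokenizer, sentence, seq_length=50):
    encoded = tokenizer.texts_to_sequences([sentence])[0]
    log_prob = 0.0
    # chain rule: sum log P(word_t | words before t) for each word after the first
    for t in range(1, len(encoded)):
        context = pad_sequences([encoded[:t]], maxlen=seq_length)
        probs = model.predict(context, verbose=0)[0]
        log_prob += np.log(probs[encoded[t]] + 1e-10)
    return log_prob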
REPLY

o
Jason Brownlee April 28, 2019 at 6:51 am #

Thanks for sharing.


REPLY

48.
Thomas L. Packer April 30, 2019 at 3:51 am #

Instead of writing a loop:


out_word = ''
for word, index in tokenizer.word_index.items():
    if index == yhat:
        out_word = word
        break

why not use the tokenizer’s other dict:


tokenizer.index_word[yhat[0]]

REPLY

o
Thomas L. Packer April 30, 2019 at 3:52 am #

I'm new to this website. How do you mark a code block in a comment? How do you add a profile
picture?
REPLY


Jason Brownlee April 30, 2019 at 7:03 am #

You can use pre HTML tags (I fixed up your prior comment for you).

Profile pictures are based on gravatar, like any wordpress blog you might come across:
https://en.gravatar.com/
REPLY

o
Jason Brownlee April 30, 2019 at 7:02 am #

Sounds good off the cuff, does it work?


REPLY


Thomas L. Packer May 3, 2019 at 3:13 am #

Thanks. I did try the other dict and it seemed to both work and run faster.
REPLY

49.
Shivam Bhati June 5, 2019 at 10:37 pm #
Hey
First of all, thank you for such a great project.
I have worked on this project and I got stuck at predicting the values.

the error says,


ValueError: Error when checking input: expected embedding_1_input to have shape (50,) but got
array with shape (1,)

Can you please help.


REPLY

o
Jason Brownlee June 6, 2019 at 6:30 am #

The error suggests a mismatch between your data and the model’s expectation.

You can change your data or change the expectations of the model.
REPLY

50.
Sidharth June 18, 2019 at 6:11 pm #

Hi! Thanks for your code


Is there a way to convert a Keras model to a tflite or TensorFlow model, as the official documentation
shows that tflite does not support LSTM layers?

Thanks !
REPLY

o
Jason Brownlee June 19, 2019 at 7:51 am #

I don’t know, sorry.


REPLY

51.
Shashank June 24, 2019 at 10:35 pm #
sir please please help me . I’m working on Text Summarization . Can I do it using Language
modelling because I dont have much knowledge about Neural Networks , or if you have any
suggestions , ideas please tell me . I have around 20 days to complete the project .
Thanks a lot!
REPLY

o
Jason Brownlee June 25, 2019 at 6:21 am #

This will help:


https://machinelearningmastery.com/best-practices-document-classification-deep-learning/
And this:
https://machinelearningmastery.com/develop-word-embedding-model-predicting-movie-review-
sentiment/
REPLY

52.
Shashank June 24, 2019 at 11:40 pm #

Sir, how does the language model handle numeric data like money, dates and so on?
If there's a problem, how should I approach it, sir? I'm working on text summarization and such
numeric data may be important for summarization. How can I address it?
Thanks a lot for the blog! Love your posts. Seriously, very, very helpful!
REPLY

o
Jason Brownlee June 25, 2019 at 6:22 am #

For small datasets, it is a good idea to normalize these to a single form, e.g. words.
REPLY

53.
Kakoli August 1, 2019 at 4:12 am #

Thanks again for the wonderful post.


I tried word-level modelling as given here in Alice’s Adventures in Wonderland from Project
Gutenberg. But the loss was too big starting at 6.39 and did not reduce much. Was 6.18-6.19 for the
first 10 epochs. Any suggestions?
REPLY

o
Jason Brownlee August 1, 2019 at 6:56 am #

Yes, I have suggestions for diagnosing and improving deep learning model performance here:
https://machinelearningmastery.com/start-here/#better
REPLY

54.
Kuro August 29, 2019 at 6:33 am #

I modified clean_doc so that it generates a stand-alone tokens for punctuations, except when a
single quote is used as apostrophe as in “don’t”, “Tom’s”.

On my first run of model making, I changed the batch_size and epochs parameters to 512 and 25,
thinking that it might speed up the process. It ended in 2 hours on MacBook Pro but running the
sequence generation program generated the text that mostly repeats “and the same, ” like this:

to be the best , and the same , and the same , and the same , and the same , and the same , and
the same , and the same , and the same , and the same , and the same , and the same , and

I changed back batch_size and epochs to the original values (125, 100) and ran the model building
program over night. Then the generated sequences looks more reasonable. For example:

be a fortress to be justice and not yours , as we were saying that the just man will be enough to
support the laws and loves among nature which we have been describing , and the disregard which
he saw the concupiscent and be the truest pleasures , and

Is there an intuitive interpretation of the bad result of my first try? The loss value at the last epoch
was 4.125 for the first try and 2.2130 for the second (good result). I forgot to record the accuracy.
REPLY

o
Jason Brownlee August 29, 2019 at 1:30 pm #
Hmmm, off the cuff, adding a lot more tokens may require an order of magnitude more data,
training and a larger model.
REPLY

55.
Kuro August 31, 2019 at 8:36 am #

I’m trying to apply these programs to a collection of song lyrics of a certain genre. A song typically
made of 50 to 200 words. Since they are from the same genre, the vocabulary size is relatively small
(talking about lost loves, soul etc.). In this case, would it make sense to reduce the sequence size
from 50 ? I’m thinking of something like 20.
The goal of my experiment is to generate a lyric by giving first 5 – 10 words, just for fun.
REPLY

o
Jason Brownlee September 1, 2019 at 5:33 am #

Sounds fun.

Perhaps experiment and see what works best for your specific dataset.

I’d love to see what you discover.


REPLY

56.
Aly September 1, 2019 at 5:59 am #

I’m implementing this on a corpus of Arabic text, but whenever it runs I am getting the same word
repeated for the entire text generation process. I’m training on ~50,000 words with ~16,000 unique
words. I’ve messed around with number of layers, more data (but an issue as the number of unique
words also increases, an interesting find which feels like an error between Tokenizer and not
English), and epochs

Any way to fix this?


REPLY

o
Jason Brownlee September 2, 2019 at 5:23 am #
Sorry to hear that, you may have to adjust the capacity of the model and training parameters of
the model for your new dataset.

Perhaps some of the tutorials here will help:


https://machinelearningmastery.com/start-here/#better
REPLY

57.
Irfan Danish October 2, 2019 at 1:51 am #

is there any way we can generate 2 or 3 different sample text from a single seed.
For example we input “Hello I’m” and model gives us
Hello I’m interested in
Hello I’m a bit confused
Hello I’m not sure
Instead of generating just one output it gives 2 to 3 best outputs.
Actually I trained your model but, instead of 50, I used a sequence length of three words. Now, when I
input a seed of three words, instead of generating just one sequence I want to
generate 2 to 3 sequences that are correlated to that seed. Is it possible? Please let me know, I
would be very thankful to you!
REPLY

o
Jason Brownlee October 2, 2019 at 8:02 am #

Yes, you can take the predicted probabilities and run a beam search to get multiple likely
sequences.
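A minimal beam-search sketch (not from the tutorial), assuming a tutorial-style model and tokenizer with the three-word input length described above; the beam width and number of generated words are arbitrary:

import numpy as np
from keras.preprocessing.sequence import pad_sequences

def beam_search(model, tokenizer, seed_text, n_words=3, beam_width=3, seq_length=3):
    seed = tokenizer.texts_to_sequences([seed_text])[0]
    beams = [(seed, 0.0)]  # (sequence of word indices, cumulative log-probability)
    for _ in range(n_words):
        candidates = []
        for seq, score in beams:
            context = pad_sequences([seq], maxlen=seq_length, truncating='pre')
            probs = model.predict(context, verbose=0)[0]
            # extend each beam with its top beam_width next words
            for idx in np.argsort(probs)[-beam_width:]:
                candidates.append((seq + [int(idx)], score + np.log(probs[idx] + 1e-10)))
        # keep only the best beam_width candidates overall
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return [' '.join(tokenizer.index_word.get(i, '') for i in seq) for seq, _ in beams]

Calling beam_search(model, tokenizer, "Hello I'm") would then return up to three likely continuations instead of a single one.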
REPLY


Irfan Danish October 3, 2019 at 3:29 am #

Can you please explain a little bit more how I can implement it, or give me an example? That would
be nice!
REPLY


Jason Brownlee October 3, 2019 at 6:52 am #

Try the search box.

Perhaps this will help:


https://machinelearningmastery.com/beam-search-decoder-natural-language-
processing/
REPLY

58.
Arjun October 17, 2019 at 4:38 pm #

hi,
what would the X and y be like?
and could it be done by splitting the X and y into training and testing?
also when the model is created what would be the inputs for embedding layer?
then fit the model on X_train?
REPLY

o
Jason Brownlee October 18, 2019 at 5:46 am #

Sorry, I don’t understand. What are you referring to exactly?


REPLY


Arjun October 18, 2019 at 3:26 pm #

while fitting the model i seem to get an error:


ValueError: Error when checking input: expected embedding_input to have 2 dimensions,
but got array with shape (264, 5, 1)
but the input should be 3D right?
so i had a doubt maybe something was wrong with the input i had given in the embedding
layer?
if so what are all the inputs to be given in the embedding layer?
REPLY

Jason Brownlee October 19, 2019 at 6:26 am #

Input to the embedding is 2d, each sample is a vector of integers that map to a word.
REPLY


Arjun October 21, 2019 at 4:52 pm #

ok, so after the model is formed then we make the X_train 3D? when fitting?


Jason Brownlee October 22, 2019 at 5:42 am #

No. Input to an embedding is 2d.

59.
Arjun October 21, 2019 at 5:00 pm #

hi, i would like to know if you have any idea about neural melody composition from lyrics using
RNN?
from a paperwork published it says that two encoders are used and given to a single decoder.
i wonder if you could provide any insight on it?
this is the paperwork:
https://arxiv.org/pdf/1809.04318.pdf
REPLY

o
Jason Brownlee October 22, 2019 at 5:42 am #

Sorry, I am not familiar with that paper, perhaps try contacting the authors?
REPLY


Arjun October 22, 2019 at 3:18 pm #

ok.. thank you..


REPLY


Arjun October 22, 2019 at 4:15 pm #

Do you know how to give two sequences as input to a single decoder? Both encoders
and decoders are RNN structures. So how is it possible to have two encoders for a
single decoder?
REPLY


Jason Brownlee October 23, 2019 at 6:31 am #

You could try a multiple-input model:


https://machinelearningmastery.com/keras-functional-api-deep-learning/

60.
Arjun October 23, 2019 at 3:50 pm #

It was not completely specific to my doubt, but even though thank you for helping.
REPLY

61.
Arjun October 24, 2019 at 4:23 pm #

hi, if i had two sequences as input and i have training and testing for both sequence inputs.
i managed to concatenate both the inputs and create a model.
but when it comes to fitting the model, how is it possible to give X and y of two sequences ?
REPLY

o
Jason Brownlee October 25, 2019 at 6:35 am #

If you have a multi-input model, then the fit() function will take a list with each array of samples.

E.g.
X1 = …
X2 = …
X = [X1, X2]

model.fit(X, y, ….)
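A minimal functional-API sketch of a two-input model (shapes, layer sizes and the merge strategy are made up for illustration):

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, concatenate

# two separate sequence inputs
in1 = Input(shape=(20,))
in2 = Input(shape=(20,))
lstm1 = LSTM(100)(Embedding(vocab_size, 50)(in1))
lstm2 = LSTM(100)(Embedding(vocab_size, 50)(in2))
merged = concatenate([lstm1, lstm2])
out = Dense(vocab_size, activation='softmax')(merged)

model = Model(inputs=[in1, in2], outputs=out)
model.compile(loss='categorical_crossentropy', optimizer='adam')

# fit() takes a list with one array per input
model.fit([X1, X2], y, epochs=10, batch_size=128)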
REPLY


Arjun October 25, 2019 at 4:03 pm #

So when we are training an RNN we should have the equal number of outputs as well?
REPLY


Arjun October 25, 2019 at 6:03 pm #

ValueError: Error when checking model target: the list of Numpy arrays that you are
passing to your model is not the size the model expected. Expected to see 1 array(s),
but instead got the following list of 2 arrays.
This was what I got when i gave multiple inputs to fit.
REPLY


Jason Brownlee October 26, 2019 at 4:36 am #

Perhaps this tutorial will help you to get started with a multiple input model:
https://machinelearningmastery.com/keras-functional-api-deep-learning/


Jason Brownlee October 26, 2019 at 4:34 am #

Sorry, I don’t understand?

Equal in what sense?


REPLY


Arjun October 28, 2019 at 3:09 pm #

equal in the sense, the number of inputs must be equal to the number of outputs?


Jason Brownlee October 29, 2019 at 5:18 am #

Typically the number of input and output timesteps must be the same.

They can differ, but either they must be fixed, or you can use an alternate
architecture such as an encoder-decoder:
https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-
networks/

62.
Arjun October 28, 2019 at 3:23 pm #

What if we had two different inputs and we need a model with both these inputs aligned?
Could we give both these inputs in a single model or create two model with corresponding inputs
and then combine both models at the end?
REPLY

o
Jason Brownlee October 29, 2019 at 5:19 am #

Sure, there are many different ways to solve a problem. Perhaps explore a few framings and see
what works best for your dataset?
REPLY

63.
Arjun October 29, 2019 at 3:12 pm #

sure..
Thank you..
REPLY

64.
Arjun October 31, 2019 at 3:30 pm #
how can we know if two inputs have been aligned ?
can we do it by merging two models?
either way we can know the result only after testing it right?
REPLY

o
Jason Brownlee November 1, 2019 at 5:25 am #

Yes.

The idea is to build trust in your model beforehand using verification.


REPLY


Arjun November 1, 2019 at 3:47 pm #

verification in the sense?


REPLY


Jason Brownlee November 2, 2019 at 6:38 am #

Confirming the model produces sensible outputs for a test set.


REPLY

65.
Arjun November 4, 2019 at 4:46 pm #

Hi Jason,
I was working on a text generation problem.
Seems I had a problem while I was fitting X_train and y_train.
It was related to incompatible shapes. But then I set the batch size to 1 and it ran.
But when it reached the evaluation part, it showed the previous error.

InvalidArgumentError: 2 root error(s) found.


(0) Invalid argument: Incompatible shapes: [32,5,5] vs. [32,5]
[[{{node metrics/mean_absolute_error/sub}}]]
(1) Invalid argument: Incompatible shapes: [32,5,5] vs. [32,5]
[[{{node metrics/mean_absolute_error/sub}}]]
[[metrics/mean_absolute_error/Identity/_153]]
0 successful operations.
0 derived errors ignored.
What could be the possible reason behind this?
REPLY

o
Jason Brownlee November 5, 2019 at 6:49 am #

Sorry to hear that, I have some suggestions here:


https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-
for-me
REPLY


Arjun November 5, 2019 at 5:06 pm #

also I haven’t exactly copied your code as whole. I tried doing it on my own just taking
snippets. Even then these are errors which I have never seen before.
REPLY

66.
Arjun November 5, 2019 at 3:25 pm #

I am just confused why it would run while fitting but does not run while evaluating?
I mean we can’t do much tweaking with the arguments in evaluation?
REPLY

o
Jason Brownlee November 6, 2019 at 6:30 am #

I don’t undertand your question sorry, perhaps you can elaborate?


REPLY
67.
Arjun November 6, 2019 at 3:09 pm #

ooh sorry my reply was based on the previous comment.

InvalidArgumentError: 2 root error(s) found.


(0) Invalid argument: Incompatible shapes: [32,5,5] vs. [32,5]
[[{{node metrics/mean_absolute_error/sub}}]]
(1) Invalid argument: Incompatible shapes: [32,5,5] vs. [32,5]
[[{{node metrics/mean_absolute_error/sub}}]]
[[metrics/mean_absolute_error/Identity/_153]]
0 successful operations.
0 derived errors ignored.

This error was found when i was fitting the model. But when I passed the batch size as 1, the model
fitted without any problem.
But when I tried to evaluate it the same previous error showed up.
Do you have any idea why it would work while fitting but not while evaluating..?
REPLY

o
Jason Brownlee November 7, 2019 at 6:34 am #

Sorry, I have not seen this error.

Perhaps try posting your code and error to stackoverflow?


REPLY


Arjun November 7, 2019 at 3:25 pm #

ok sure..
Thank you
REPLY

68.
Arjun November 6, 2019 at 5:13 pm #
Hi jason,
Could you please give some insight on attention mechanism in keras?
REPLY

o
Jason Brownlee November 7, 2019 at 6:35 am #

Keras does not support attention at this stage.


REPLY


Arjun November 7, 2019 at 3:17 pm #

Is there any other way to implement attention mechanism?


REPLY


Jason Brownlee November 8, 2019 at 6:38 am #

Yes:

– implement it manually
– use a 3rd party implementation
– use tensorflow directly
– use pytorch
REPLY

69.
Arjun November 7, 2019 at 8:50 pm #

Is there any reason why the validation accuracy decreases while the training accuracy increases?
Is that a case of overfitting?
REPLY

o
Jason Brownlee November 8, 2019 at 6:40 am #
Accuracy is noisy.

Look at train and validation loss.


REPLY


Arjun November 8, 2019 at 3:19 pm #

even the validation loss seem to be fluctuating.


So that might be a case of overfitting, right?
If so, how can we solve this problem?
REPLY


Jason Brownlee November 9, 2019 at 6:10 am #

Or the case that the validation dataset is too small and/or not representative of the
training dataset.
REPLY


Arjun November 11, 2019 at 3:20 pm #

I was training on the IMDB dataset for sentiment analysis, using 100,000
words.
Everything seems to be going okay until the training part, where the loss and
accuracy keep fluctuating.
The validation dataset is split from the whole dataset, so I don't think that's the issue.


Jason Brownlee November 12, 2019 at 6:33 am #

Perhaps try an alternate model?


Perhaps try an alternate data preparation?
Perhaps try an alternate configuration?

70.
Arjun November 13, 2019 at 11:41 pm #

how can we know the total number of words in the imdb dataset?
not the vocabulary but the size of the dataset?
REPLY

o
Jason Brownlee November 14, 2019 at 8:03 am #

Sum the length of all samples.


REPLY


Arjun November 14, 2019 at 3:36 pm #

One sample in this dataset is one review. And each review contains different number of
words. We are considering like 10000 or 100000 words for the dataset and splitting it into
training and testing. So i need to like get the total number of words.
REPLY

71.
Augusto December 28, 2019 at 9:07 am #

Hi Jason,

I did the exercise from your post “Text Generation With LSTM Recurrent Neural Networks in Python
with Keras”, but the alternative you are describing here by using a Language Model produces text
with more coherence, so could you please elaborate on when to use one technique over the other.

Thanks in advance,
REPLY

o
Jason Brownlee December 29, 2019 at 5:59 am #

Good question.
There are no good heuristics. Perhaps follow preference, or model skill for a specific dataset and
metric.
REPLY

72.
Fred January 3, 2020 at 6:48 am #

Hi! I'm trying to convert this example to make a simple proof-of-concept model to do word
prediction that can do inference both backwards and forwards using the same trained model
(without duplicating the data).

I want to try split the text lines in the middle and have my target word there. Like this:

X1 y X2
1. [down yesterday to the piraeus with glaucon the son of ariston]
2. [yesterday to the piraeus with glaucon the son of ariston that]
3. [to the piraeus with glaucon the son of ariston that i]
etc
(X1 and X2 are actually 20 words each)

I keep getting various data formatting errors and I feel like I have tried so many things, but obviously
there are still plenty of permutations left and the correct way to do this still eludes me.

This is roughly my code,

X1 = X1.reshape((n_lines+1, 20))
X2 = X2.reshape((n_lines+1, 20))
y = y.reshape((n_lines+1, vocab_size))

model = Sequential()
model.add(Embedding(vocab_size, 40, input_length=seq_length))
model.add(LSTM(100, return_sequences=True, input_shape=(20, 1)))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(np.array([X1, X2]), np.array(y), batch_size=128, epochs=10)


Do you think this should be close to working and do you know why I can’t seem to be able to feed
both the ‘X’ features?

Cheers, Fred
REPLY

o
Jason Brownlee January 3, 2020 at 7:37 am #

I’m not sure it is feasible.


REPLY


Fred January 3, 2020 at 9:25 am #

It was just suggested on the Google group that I try the Functional API, so I’m figuring out
how to do that now.
REPLY


Jason Brownlee January 4, 2020 at 8:14 am #

This might help:


https://machinelearningmastery.com/keras-functional-api-deep-learning/
REPLY

73.
Wei Jiang January 25, 2020 at 8:06 pm #

I want to take everything into account, including punctuatioins, so that I comment out the following
line:

tokens = [word for word in tokens if word.isalpha()]

But when I run the training, I get the following error:


Traceback (most recent call last):
File “lm2.py”, line 34, in
X, y = sequences[:,:-1], sequences[:,-1]
IndexError: too many indices for array

Any idea?
REPLY

o
Jason Brownlee January 26, 2020 at 5:18 am #

Perhaps confirm the shape and type of the data matches what we expect after your change?
REPLY


Wei Jiang January 26, 2020 at 11:45 am #

I am not sure about that. You can download the files that I have created/used from the
following OneDrive link:
https://1drv.ms/u/s!AqMx36ZH6wJMhINbfIy1INrq5onhzg?e=UmUe4V
REPLY


Jason Brownlee January 27, 2020 at 7:00 am #

Sorry, I don’t have the capacity to debug your code for you:
https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-
code
REPLY

74.
sachin February 27, 2020 at 1:52 pm #

Sir, when we are considering the context of a sentence to classify it into a class, which neural network
architecture should I use?

For eg: I want to classify like this,

They killed many people: Non-Toxic

I will kill them: Toxic


REPLY

o
Jason Brownlee February 28, 2020 at 5:55 am #

Good question, see this:


https://machinelearningmastery.com/best-practices-document-classification-deep-learning/
REPLY

75.
Dylan Lunde March 12, 2020 at 8:05 am #

Was the too many indices for array issue ever explicitly solved by anyone?

IndexError Traceback (most recent call last)


in ()
1 sequences = np.array(sequences)
—-> 2 X, y = sequences[:,:-1], sequences[:,-1]
3 y = to_categorical(y, num_classes=vocab_size)
4 seq_length = X.shape[1]

IndexError: too many indices for array


REPLY

o
Jason Brownlee March 12, 2020 at 8:54 am #

This will help:


https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-
for-me
REPLY

76.
Carson March 27, 2020 at 2:21 pm #

I have a question: you input 50 words into your neural net and get one output word, if I am not
wrong, but how can you get a 50-word text out when you only put in a 50-word text?
REPLY
o
Jason Brownlee March 28, 2020 at 6:11 am #

You can use the same model recursively with output passed as input, or you can use a seq2seq
model implemented using an encoder-decoder model.

You can find many examples of encoder-decoder for NLP on this blog, perhaps start here:
https://machinelearningmastery.com/start-here/#nlp
REPLY

77.
Arsal April 26, 2020 at 8:43 am #

Can you make a tutorial on text generation using GANs?


REPLY

o
Jason Brownlee April 27, 2020 at 5:23 am #

Thanks for the suggestion.

Language models are used for text generation, GANs are used for generating images.
REPLY

78.
Efstathios Chatzikyriakidis May 15, 2020 at 3:35 am #

Hi Jason,

Two issues:

“The model will require 100 words as input.”

I think it is 50.

Also:

“Simplify Vocabulary. Explore a simpler vocabulary, perhaps with stemmed words or stop words
removed.”
This is usually done in text classification. Doing such a thing in a language model that you then use
for text generation will lead to bad results. Stop words are important for capturing basic groups of
words, e.g. “I went to the”.
REPLY

o
Jason Brownlee May 15, 2020 at 6:12 am #

Thanks for the typo.

Sure, change it anyway you like!


REPLY

79.
Vipul May 30, 2020 at 5:03 pm #

I need a deep neural network which selects a word out of predefined candidates. Please suggest
a solution.
REPLY

o
Jason Brownlee May 31, 2020 at 6:19 am #

Perhaps model it as text classification:


https://machinelearningmastery.com/best-practices-document-classification-deep-learning/
REPLY

80.
Prem June 1, 2020 at 9:58 am #

For sentence-wise training, does model 2 from the following post essentially show it?
https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/
REPLY

81.
Laura June 4, 2020 at 8:29 pm #
Hi Jason! Thanks for your post.
I need to build a neural network which detects anomalies in syscall execution, as well as in the
arguments these syscalls receive. Which solution could be suitable for this problem?
Thanks in advance!
REPLY

o
Jason Brownlee June 5, 2020 at 8:09 am #

Perhaps start with a text input and class label output, e.g. text classification models. Test bag of
words and embedding representations to see what works well.
REPLY

82.
Laura June 5, 2020 at 12:04 am #

Hi Jason!
Thanks for your post!
I need to build a neural network to detect anomalies in syscall execution as well as in the arguments
they receive. Which solution would you recommend for this purpose?
Thanks in advance!
REPLY

o
Jason Brownlee June 5, 2020 at 8:14 am #

I recommend prototyping and systematically evaluating a suite of different models and discover
what works well for your dataset.
REPLY

83.
Neha June 10, 2020 at 1:32 pm #

Hello sir, thank you for such a nice post. How do I work with CSV files, i.e. how do I load and save
them? I am so new to deep learning, can you give me an idea of the syntax?
REPLY
o
Jason Brownlee June 11, 2020 at 5:48 am #

This will show you how to load CSV files:


http://machinelearningmastery.com/load-machine-learning-data-python/
REPLY

84.
Kaalu June 16, 2020 at 12:15 pm #

Hi Jason,

Thanks for your step by step tutorial with relevant explanations. I am trying to use this technique in
generating snort rules for a specific malware family/ type (somehow similar to firewall rules /
Intrusion detection rules). Do you think this is possible? can you give me any pointers to consider?
Will it be possible, given that such rules need to follow a specific format or sequence of keywords?

This is how a sample rule looks:

“alert tcp $HOME_NET any -> $EXTERNAL_NET $HTTP_PORTS (msg:”MALWARE-BACKDOOR


Win.Backdoor.Demtranc variant outbound connection”; flow:to_server,established; content:”GET”;
nocase; http_method; content:”/AES”; fast_pattern; nocase; http_uri; pcre:”/\/AES\d+O\d+\.jsp\?[a-
z0-9=\x2b\x2f]{20}/iU”; metadata:policy balanced-ips drop, policy max-detect-ips drop, policy
security-ips drop, service http;
reference:url,www.virustotal.com/file/b3a97be4160fb261e138888df276f9076ed76fe2efca3c71b3ebf
7aa8713f4a4/analysis/; classtype:trojan-activity; sid:24115; rev:3;)

Your advice would be highly appreciated.



REPLY

o
Jason Brownlee June 16, 2020 at 1:40 pm #

You’re welcome.

Perhaps try it and see?

My best advice is to review the literature and see how others that have come before addressed
the same type of problem. It will save a ton of time and likely give good results quickly.
REPLY


Kaalu June 16, 2020 at 11:37 pm #

Hi Jason,

Thanks very much. Sadly, I haven’t found any literature where they have done anything similar.
That’s why I reached out to you.

Will keep searching..


REPLY


Jason Brownlee June 17, 2020 at 6:24 am #

Hang in there, perhaps search for another project that is “close enough” and mine it for
ideas.
REPLY


Ebenezer A. Laryea June 19, 2020 at 1:36 am #

thanks

85.
Hilal Ozer August 26, 2020 at 7:54 am #

Hi Jason,

Thanks for the great post. I used your code for morpheme prediction. At first I implemented it with
fixed sequence length correctly but then I have to make it with variable sequence length. So, I used
stateful LSTM with batch size 1 and set sequence length None.

I tried to fit the model one sample at a time. However I got the “ValueError: Input arrays should have
the same number of samples as target arrays. Found 1 input samples and 113 target samples.”

The input and output sample sizes are actually equal, and “113” is the size of the output’s one hot
vector. The target output implementation is exactly the same as your code and runs correctly in my
first implementation with a fixed sequence length.
Do you have any idea why the model does not recognize one hot encoding?

Thanks in advance.
REPLY

o
Jason Brownlee August 26, 2020 at 1:42 pm #

You’re welcome.

If you are using a stateful LSTM you may need to make the target 3d instead of 2d, e.g.
[samples, timesteps, features]. I could be wrong, but I recall that might be an issue to consider.
REPLY

86.
Hilal Ozer August 27, 2020 at 7:14 am #

Thank you for your response. When I made it 2D, it ran successfully. It was 1D by mistake.
REPLY

o
Jason Brownlee August 27, 2020 at 7:43 am #

No problem.
REPLY

87.
riyaz September 5, 2020 at 9:59 pm #

If you want to learn more, you can also check out the Keras Team’s text generation implementation
on GitHub: https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py.
Have a look at this code; it’s well presented.
REPLY

o
Jason Brownlee September 6, 2020 at 6:04 am #
Thanks for sharing.
REPLY

88.
Minura Punchihewa September 13, 2020 at 4:09 am #

Hi Jason,

I have used a similar RNN architecture to develop a language model. What I would like to do now is,
when a complete sentence is provided to the model, to be able to generate the probability of it.

Note: This is the probability of the entire sentence that I am referring to, not just the next word.

Do you happen to know how to do this?

From what I have gathered, this mechanism is used in the implementation of speech recognition
software.
REPLY

o
Jason Brownlee September 13, 2020 at 6:12 am #

Good question, I don’t have an example.

Perhaps calculate how to do this manually then implement it?


Perhaps check the literature to see if anyone has done this before and if so how?
Perhaps check open source libraries to see if they offer this capability and see how they did it?
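
As a rough sketch of the manual approach (one possible way to do it, not a tested recipe): the
probability of a sentence under a next-word model can be taken as the product of the model’s
predicted probability for each word given the words before it, usually summed in log space to avoid
underflow:

# Hedged sketch: score a sentence with a next-word model by accumulating the
# log probability of each actual word given its (padded) prefix.
# Assumes 'model', 'tokenizer' and 'seq_length' as prepared in the tutorial.
import numpy as np
from keras.preprocessing.sequence import pad_sequences

def sentence_log_prob(model, tokenizer, seq_length, sentence):
    ids = tokenizer.texts_to_sequences([sentence])[0]
    log_prob = 0.0
    for i in range(1, len(ids)):
        prefix = pad_sequences([ids[:i]], maxlen=seq_length, truncating='pre')
        probs = model.predict(prefix, verbose=0)[0]
        log_prob += np.log(probs[ids[i]])
    return log_prob  # higher (less negative) means more probable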
REPLY

89.
Minura Punchihewa September 14, 2020 at 5:34 am #

I’ve been checking, still struggling to find an answer.


REPLY

90.
Pranav September 28, 2020 at 1:04 am #
Hello Jason Sir,
Thank you for providing such an amazing and informative blog on text generation. On first reading
the title I thought it was going to be difficult, but the explanations as well as the code were concise
and easy to grasp. Looking forward to reading more blogs from you!!
REPLY

o
Jason Brownlee September 28, 2020 at 6:21 am #

Thanks!
REPLY

91.
John Bueno October 7, 2020 at 2:43 pm #

I’ve followed the steps and am almost finished but am stuck on this error

Traceback (most recent call last):


File “C:/Users/andya/PycharmProjects/finalfinalFINALCHAIN/venv/Scripts/Monster.py”, line 45, in
yhat = model.predict_classes(encoded)
File “C:\Users\andya\PycharmProjects\finalfinalFINALCHAIN\venv\lib\site-
packages\keras\models.py”, line 1138, in predict_classes
steps=steps)
File “C:\Users\andya\PycharmProjects\finalfinalFINALCHAIN\venv\lib\site-
packages\keras\models.py”, line 1025, in predict
steps=steps)
File “C:\Users\andya\PycharmProjects\finalfinalFINALCHAIN\venv\lib\site-
packages\keras\engine\training.py”, line 1830, in predict
check_batch_axis=False)
File “C:\Users\andya\PycharmProjects\finalfinalFINALCHAIN\venv\lib\site-
packages\keras\engine\training.py”, line 129, in _standardize_input_data
str(data_shape))
ValueError: Error when checking : expected embedding_1_input to have shape (50,) but got array
with shape (51, 1)

For reference, I explicitly used the same versions of just about everything that you did. Everything
works except for the first line that states
yhat = model.predict_classes(encoded, verbose=0)

I’ve tinkered with the code but sadly I am not quite mathematically and software inclined enough to
find a proper solution. You may want to keep in mind that I have altered the text cleaner to keep
numbers and punctuation, although reverting it back to normal doesn’t appear to fix anything. It
may also be worth noting that for testing purposes I’ve set the epoch count to 1, but I doubt that
should affect anything. Outside of that there shouldn’t be any important deviations.
REPLY

o
Jason Brownlee October 8, 2020 at 8:27 am #

Sorry to hear that, I can confirm the example works with the latest version of the libraries, e.g.
Keras 2.4 and TensorFlow 2.3, ensure your libs are up to date.

Also, it looks like you are running from an IDE, perhaps try running from the command line:
https://machinelearningmastery.com/faq/single-faq/how-do-i-run-a-script-from-the-command-line
More suggestions here:
https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-
for-me
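
One more hedged suggestion, based on the error message alone: the model expects 50 integers per
sample but received 51, so truncating or padding the encoded seed to exactly seq_length tokens
before predicting may resolve the mismatch:

# Sketch: force the encoded seed text to exactly seq_length (50) tokens.
from keras.preprocessing.sequence import pad_sequences

encoded = tokenizer.texts_to_sequences([seed_text])[0]
encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
yhat = model.predict_classes(encoded, verbose=0)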
REPLY

92.
Payton F October 27, 2020 at 9:18 am #

Hi Jason,

What would your approach be to building a model trained on multiple different sources of text. For
example, if I want to train a model on speech transcripts so I can generate text in the style of a
certain speaker, would I store all the speeches in a single .txt file? I worry that if I do this, I will have
some misleading sequences such as when the sequence begins with words from one speech and
ends with words from the beginning of the next speech. Would it be better to somehow train the
model on one speech at a time rather than on a larger file of all speeches combined?
REPLY

o
Jason Brownlee October 27, 2020 at 1:01 pm #
Perhaps fit a separate model on each source then use an ensemble of the models / stacking to
combine.
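
A very simple version of that ensemble idea, sketched under the assumption that all per-speech
models share the same tokenizer and vocabulary, is to average their predicted next-word
distributions:

# Hedged sketch: average the next-word distributions of several fitted models.
import numpy as np

def ensemble_next_word(models, encoded_input):
    probs = [m.predict(encoded_input, verbose=0)[0] for m in models]
    avg = np.mean(probs, axis=0)
    return np.argmax(avg)  # index of the most likely next word across models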
REPLY

93.
123 November 6, 2020 at 6:43 am #

I am now not sure where you’re getting your information, but good topic.
I must spend a while learning much more or understanding more.
Thanks for wonderful info I used to be in search of this information for my mission.
REPLY

o
Jason Brownlee November 6, 2020 at 7:31 am #

Thanks.
REPLY

94.
cnsn8 April 6, 2021 at 12:43 am #

Thanks for the great tutorials, Jason. How can I add a simple control to this language model? Such as
positive/negative text generation.
REPLY

o
Jason Brownlee April 6, 2021 at 5:18 am #

You’re welcome.

Perhaps develop one model for each?


REPLY

95.
Eric April 27, 2021 at 8:35 pm #

Hi Jason,
At this step, I run this code and get the following output.

# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

————————————————
Total Sequences: 118633
2021-04-27 06:24:25.190966: W tensorflow/stream_executor/platform/default/dso_loader.cc:60]
Could not load dynamic library ‘cudart64_110.dll’; dlerror: cudart64_110.dll not found
2021-04-27 06:24:25.191304: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above
cudart dlerror if you do not have a GPU set up on your machine.
2021-04-27 06:24:33.866815: I tensorflow/stream_executor/platform/default/dso_loader.cc:49]
Successfully opened dynamic library nvcuda.dll
2021-04-27 06:24:34.937609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found
device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1050 Ti computeCapability: 6.1
coreClock: 1.62GHz coreCount: 6 deviceMemorySize: 4.00GiB deviceMemoryBandwidth:
104.43GiB/s
2021-04-27 06:24:34.940037: W tensorflow/stream_executor/platform/default/dso_loader.cc:60]
Could not load dynamic library ‘cudart64_110.dll’; dlerror: cudart64_110.dll not found
2021-04-27 06:24:34.941955: W tensorflow/stream_executor/platform/default/dso_loader.cc:60]
Could not load dynamic library ‘cublas64_11.dll’; dlerror: cublas64_11.dll not found
2021-04-27 06:24:34.943931: W tensorflow/stream_executor/platform/default/dso_loader.cc:60]
Could not load dynamic library ‘cublasLt64_11.dll’; dlerror: cublasLt64_11.dll not found
2021-04-27 06:24:34.945872: W tensorflow/stream_executor/platform/default/dso_loader.cc:60]
Could not load dynamic library ‘cufft64_10.dll’; dlerror: cufft64_10.dll not found
2021-04-27 06:24:34.947770: W tensorflow/stream_executor/platform/default/dso_loader.cc:60]
Could not load dynamic library ‘curand64_10.dll’; dlerror: curand64_10.dll not found
2021-04-27 06:24:34.949522: W tensorflow/stream_executor/platform/default/dso_loader.cc:60]
Could not load dynamic library ‘cusolver64_11.dll’; dlerror: cusolver64_11.dll not found
2021-04-27 06:24:34.951167: W tensorflow/stream_executor/platform/default/dso_loader.cc:60]
Could not load dynamic library ‘cusparse64_11.dll’; dlerror: cusparse64_11.dll not found
2021-04-27 06:24:34.952449: W tensorflow/stream_executor/platform/default/dso_loader.cc:60]
Could not load dynamic library ‘cudnn64_8.dll’; dlerror: cudnn64_8.dll not found
2021-04-27 06:24:34.952766: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot
dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed
properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for
how to download and setup the required libraries for your platform.
Skipping registering GPU devices…
2021-04-27 06:24:34.954395: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow
binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU
instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-27 06:24:34.955743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device
interconnect StreamExecutor with strength 1 edge matrix:
2021-04-27 06:24:34.956035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]
Traceback (most recent call last):
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\LANGUAGE MODEL
BUILDING\LANGUAGE MODEL TEST.py”, line 105, in
model.add(LSTM(100, return_sequences=True))
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\tensorflow\python\training\tracking\base.py”, line 522, in _method_wrapper
result = method(self, *args, **kwargs)
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\keras\engine\sequential.py”, line 223, in add
output_tensor = layer(self.outputs[0])
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\keras\layers\recurrent.py”, line 660, in __call__
return super(RNN, self).__call__(inputs, **kwargs)
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\keras\engine\base_layer.py”, line 945, in __call__
return self._functional_construction_call(inputs, args, kwargs,
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\keras\engine\base_layer.py”, line 1083, in _functional_construction_call
outputs = self._keras_tensor_symbolic_call(
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\keras\engine\base_layer.py”, line 816, in _keras_tensor_symbolic_call
return self._infer_output_signature(inputs, args, kwargs, input_masks)
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\keras\engine\base_layer.py”, line 856, in _infer_output_signature
outputs = call_fn(inputs, *args, **kwargs)
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\keras\layers\recurrent_v2.py”, line 1139, in call
inputs, initial_state, _ = self._process_inputs(inputs, initial_state, None)
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\keras\layers\recurrent.py”, line 860, in _process_inputs
initial_state = self.get_initial_state(inputs)
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\keras\layers\recurrent.py”, line 642, in get_initial_state
init_state = get_initial_state_fn(
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\keras\layers\recurrent.py”, line 2508, in get_initial_state
return list(_generate_zero_filled_state_for_cell(
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\keras\layers\recurrent.py”, line 2990, in _generate_zero_filled_state_for_cell
return _generate_zero_filled_state(batch_size, cell.state_size, dtype)
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\keras\layers\recurrent.py”, line 3006, in _generate_zero_filled_state
return tf.nest.map_structure(create_zeros, state_size)
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\tensorflow\python\util\nest.py”, line 867, in map_structure
structure[0], [func(*x) for x in entries],
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\tensorflow\python\util\nest.py”, line 867, in
structure[0], [func(*x) for x in entries],
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\keras\layers\recurrent.py”, line 3003, in create_zeros
return tf.zeros(init_state_size, dtype=dtype)
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\tensorflow\python\util\dispatch.py”, line 206, in wrapper
return target(*args, **kwargs)
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\tensorflow\python\ops\array_ops.py”, line 2911, in wrapped
tensor = fun(*args, **kwargs)
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\tensorflow\python\ops\array_ops.py”, line 2960, in zeros
output = _constant_if_small(zero, shape, dtype, name)
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\tensorflow\python\ops\array_ops.py”, line 2896, in _constant_if_small
if np.prod(shape) < 1000:
File "”, line 5, in prod
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\numpy\core\fromnumeric.py”, line 3030, in prod
return _wrapreduction(a, np.multiply, ‘prod’, axis, dtype, out,
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\numpy\core\fromnumeric.py”, line 87, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
File “D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-
packages\tensorflow\python\framework\ops.py”, line 867, in __array__
raise NotImplementedError(
NotImplementedError: Cannot convert a symbolic Tensor (lstm/strided_slice:0) to a numpy array.
This error may indicate that you’re trying to pass a Tensor to a NumPy call, which is not supported
My libraries are :
Python 3.9
Theano 1.0.5
Numpy 1.20.2

How to Develop Word-Based Neural Language Models in Python with Keras
Language modeling involves predicting the next word in a sequence given the sequence of
words already present.

A language model is a key element in many natural language processing models such as
machine translation and speech recognition. The choice of how the language model is
framed must match how the language model is intended to be used.
In this tutorial, you will discover how the framing of a language model affects the skill of the
model when generating short sequences from a nursery rhyme.
After completing this tutorial, you will know:

 The challenge of developing a good framing of a word-based language model for a given application.
 How to develop one-word, two-word, and line-based framings for word-based language models.
 How to generate sequences using a fit language model.

Tutorial Overview
This tutorial is divided into 5 parts; they are:

1. Framing Language Modeling
2. Jack and Jill Nursery Rhyme
3. Model 1: One-Word-In, One-Word-Out Sequences
4. Model 2: Line-by-Line Sequence
5. Model 3: Two-Words-In, One-Word-Out Sequence

Framing Language Modeling


A statistical language model is learned from raw text and predicts the probability of the next
word in the sequence given the words already present in the sequence.

Language models are a key component in larger models for challenging natural language
processing problems, like machine translation and speech recognition. They can also be
developed as standalone models and used for generating new sequences that have the
same statistical properties as the source text.

Language models both learn and predict one word at a time. The training of the network
involves providing sequences of words as input that are processed one at a time where a
prediction can be made and learned for each input sequence.
Similarly, when making predictions, the process can be seeded with one or a few words,
then predicted words can be gathered and presented as input on subsequent predictions in
order to build up a generated output sequence.

Therefore, each model will involve splitting the source text into input and output sequences,
such that the model can learn to predict words.

There are many ways to frame the sequences from a source text for language modeling.

In this tutorial, we will explore 3 different ways of developing word-based language models
in the Keras deep learning library.

There is no single best approach, just different framings that may suit different applications.

Jack and Jill Nursery Rhyme


Jack and Jill is a simple nursery rhyme.

It is comprised of 4 lines, as follows:

Jack and Jill went up the hill


To fetch a pail of water
Jack fell down and broke his crown
And Jill came tumbling after

We will use this as our source text for exploring different framings of a word-based
language model.

We can define this text in Python as follows:

# source text
data = """ Jack and Jill went up the hill\n
To fetch a pail of water\n
Jack fell down and broke his crown\n
And Jill came tumbling after\n """

Model 1: One-Word-In, One-Word-Out Sequences


We can start with a very simple model.

Given one word as input, the model will learn to predict the next word in the sequence.

For example:

1 X, y

2 Jack, and

3 and, Jill

4 Jill, went

5 ...

The first step is to encode the text as integers.

Each lowercase word in the source text is assigned a unique integer and we can convert
the sequences of words to sequences of integers.

Keras provides the Tokenizer class that can be used to perform this encoding. First, the
Tokenizer is fit on the source text to develop the mapping from words to unique integers.
Then sequences of text can be converted to sequences of integers by calling
the texts_to_sequences() function.
# integer encode text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]

We will need to know the size of the vocabulary later for both defining the word embedding
layer in the model, and for encoding output words using a one hot encoding.
The size of the vocabulary can be retrieved from the trained Tokenizer by accessing
the word_index attribute.
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

Running this example, we can see that the size of the vocabulary is 21 words.
We add one, because we will need to specify the integer for the largest encoded word as an
array index, e.g. words encoded 1 to 21 need array indices 0 to 21, i.e. 22 positions.
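
To make the off-by-one concrete, a small check (not part of the tutorial code) shows that an encoded
word index of 21 requires 22 classes in the one hot encoding:

from keras.utils import to_categorical

# index 21 is only valid if there are 22 classes (indices 0 to 21)
print(to_categorical([21], num_classes=22).shape)  # (1, 22)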

Next, we need to create sequences of words to fit the model with one word as input and one
word as output.

# create word -> word sequences
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))

Running this piece shows that we have a total of 24 input-output pairs to train the network.

Total Sequences: 24

We can then split the sequences into input (X) and output elements (y). This is
straightforward as we only have two columns in the data.
# split into X and y elements
sequences = array(sequences)
X, y = sequences[:,0], sequences[:,1]

We will fit our model to predict a probability distribution across all words in the vocabulary. That
means that we need to turn the output element from a single integer into a one hot encoding with
a 0 for every word in the vocabulary and a 1 for the actual word. This gives the network a ground
truth to aim for, from which we can calculate error and update the model.

Keras provides the to_categorical() function that we can use to convert the integer to a one
hot encoding while specifying the number of classes as the vocabulary size.
# one hot encode outputs
y = to_categorical(y, num_classes=vocab_size)

We are now ready to define the neural network model.

The model uses a learned word embedding in the input layer. This has one real-valued
vector for each word in the vocabulary, where each word vector has a specified length. In
this case we will use a 10-dimensional projection. The input sequence contains a single
word, therefore the input_length=1.
The model has a single hidden LSTM layer with 50 units. This is far more than is needed.
The output layer is comprised of one neuron for each word in the vocabulary and uses a
softmax activation function to ensure the output is normalized to look like a probability.

# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

The structure of the network can be summarized as follows:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 1, 10)             220
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                12200
_________________________________________________________________
dense_1 (Dense)              (None, 22)                1122
=================================================================
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________

We will use this same general network structure for each example in this tutorial, with minor
changes to the learned embedding layer.

Next, we can compile and fit the network on the encoded text data. Technically, we are
modeling a multi-class classification problem (predict the word in the vocabulary), therefore
using the categorical cross entropy loss function. We use the efficient Adam implementation
of gradient descent and track accuracy at the end of each epoch. The model is fit for 500
training epochs, again, perhaps more than is needed.

The network configuration was not tuned for this and later experiments; an over-prescribed
configuration was chosen to ensure that we could focus on the framing of the language
model.

# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)

After the model is fit, we test it by passing it a given word from the vocabulary and having
the model predict the next word. Here we pass in ‘Jack‘ by encoding it and
calling model.predict_classes() to get the integer output for the predicted word. This is then
looked up in the vocabulary mapping to give the associated word.
# evaluate
in_text = 'Jack'
print(in_text)
encoded = tokenizer.texts_to_sequences([in_text])[0]
encoded = array(encoded)
yhat = model.predict_classes(encoded, verbose=0)
for word, index in tokenizer.word_index.items():
    if index == yhat:
        print(word)

This process could then be repeated a few times to build up a generated sequence of
words.

To make this easier, we wrap up the behavior in a function that we can call by passing in
our model and the seed word.

# generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words):
    in_text, result = seed_text, seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = array(encoded)
        # predict a word in the vocabulary
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text, result = out_word, result + ' ' + out_word
    return result

We can tie all of this together. The complete code listing is provided below.

from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words):
    in_text, result = seed_text, seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = array(encoded)
        # predict a word in the vocabulary
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text, result = out_word, result + ' ' + out_word
    return result

# source text
data = """ Jack and Jill went up the hill\n
To fetch a pail of water\n
Jack fell down and broke his crown\n
And Jill came tumbling after\n """
# integer encode text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# create word -> word sequences
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# split into X and y elements
sequences = array(sequences)
X, y = sequences[:,0], sequences[:,1]
# one hot encode outputs
y = to_categorical(y, num_classes=vocab_size)
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
# evaluate
print(generate_seq(model, tokenizer, 'Jack', 6))

Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few times
and compare the average outcome.
Running the example prints the loss and accuracy each training epoch.

...
Epoch 496/500
0s - loss: 0.2358 - acc: 0.8750
Epoch 497/500
0s - loss: 0.2355 - acc: 0.8750
Epoch 498/500
0s - loss: 0.2352 - acc: 0.8750
Epoch 499/500
0s - loss: 0.2349 - acc: 0.8750
Epoch 500/500
0s - loss: 0.2346 - acc: 0.8750

We can see that the model does not memorize the source sequences, likely because there
is some ambiguity in the input sequences, for example:

jack => and
jack => fell

And so on.

At the end of the run, ‘Jack‘ is passed in and a prediction or new sequence is generated.
We get a reasonable sequence as output that has some elements of the source.

Jack and jill came tumbling after down

This is a good first cut language model, but does not take full advantage of the LSTM’s
ability to handle sequences of input and disambiguate some of the ambiguous pairwise
sequences by using a broader context.

Model 2: Line-by-Line Sequence


Another approach is to split up the source text line-by-line, then break each line down into a
series of words that build up.

For example:

X, y
_, _, _, _, _, Jack, and
_, _, _, _, Jack, and, Jill
_, _, _, Jack, and, Jill, went
_, _, Jack, and, Jill, went, up
_, Jack, and, Jill, went, up, the
Jack, and, Jill, went, up, the, hill

This approach may allow the model to use the context of each line to help the model in
those cases where a simple one-word-in-and-out model creates ambiguity.

In this case, this comes at the cost of predicting words across lines, which might be fine for
now if we are only interested in modeling and generating lines of text.

Note that in this representation, we will require a padding of sequences to ensure they meet
a fixed length input. This is a requirement when using Keras.

First, we can create the sequences of integers, line-by-line by using the Tokenizer already
fit on the source text.

# create line-based sequences
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))

Next, we can pad the prepared sequences. We can do this using the pad_sequences() function
provided in Keras. This first involves finding the longest sequence, then using that as the length by
which to pad-out all other sequences.

# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)

Next, we can split the sequences into input and output elements, much like before.
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)

The model can then be defined as before, except the input sequences are now longer than
a single word. Specifically, they are max_length-1 in length, -1 because when we calculated
the maximum length of sequences, they included the input and output elements.
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)

We can use the model to generate new sequences as before. The generate_seq() function can be
updated to build up an input sequence by adding predictions to the list of input words each iteration.
# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict probabilities for each word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text

Tying all of this together, the complete code example is provided below.

from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict probabilities for each word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text

# source text
data = """ Jack and Jill went up the hill\n
To fetch a pail of water\n
Jack fell down and broke his crown\n
And Jill came tumbling after\n """
# prepare the tokenizer on the source text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# create line-based sequences
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Jack', 4))
print(generate_seq(model, tokenizer, max_length-1, 'Jill', 4))

Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few times
and compare the average outcome.
Running the example achieves a better fit on the source data. The added context has
allowed the model to disambiguate some of the examples.

There are still two lines of text that start with ‘Jack‘ that may still be a problem for the
network.
...
Epoch 496/500
0s - loss: 0.1039 - acc: 0.9524
Epoch 497/500
0s - loss: 0.1037 - acc: 0.9524
Epoch 498/500
0s - loss: 0.1035 - acc: 0.9524
Epoch 499/500
0s - loss: 0.1033 - acc: 0.9524
Epoch 500/500
0s - loss: 0.1032 - acc: 0.9524

At the end of the run, we generate two sequences with different seed words: ‘Jack‘ and ‘Jill‘.
The first generated line looks good, directly matching the source text. The second is a bit
strange. This makes sense, because the network only ever saw ‘Jill‘ within an input
sequence, not at the beginning of the sequence, so it has forced an output to use the word
‘Jill‘, i.e. the last line of the rhyme.
Jack fell down and broke
Jill jill came tumbling after

This was a good example of how this framing may result in better new lines, but it is not good at
handling partial lines of input.

Model 3: Two-Words-In, One-Word-Out Sequence


We can use an intermediate between the one-word-in and the whole-sentence-in approaches and
pass in a sub-sequence of words as input.

This will provide a trade-off between the two framings, allowing new lines to be generated and for
generation to be picked up mid line.

We will use 2 words as input to predict one word as output. The preparation of the sequences is
much like the first example, except with different offsets in the source sequence arrays, as follows:

# encode 2 words -> 1 word
sequences = list()
for i in range(2, len(encoded)):
    sequence = encoded[i-2:i+1]
    sequences.append(sequence)

The complete example is listed below.

from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict probabilities for each word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text

# source text
data = """ Jack and Jill went up the hill\n
To fetch a pail of water\n
Jack fell down and broke his crown\n
And Jill came tumbling after\n """
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
# retrieve vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# encode 2 words -> 1 word
sequences = list()
for i in range(2, len(encoded)):
    sequence = encoded[i-2:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# pad sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Jack and', 5))
print(generate_seq(model, tokenizer, max_length-1, 'And Jill', 3))
print(generate_seq(model, tokenizer, max_length-1, 'fell down', 5))
print(generate_seq(model, tokenizer, max_length-1, 'pail of', 5))

Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few times
and compare the average outcome.
Running the example again gets a good fit on the source text at around 95% accuracy.

...
Epoch 496/500
0s - loss: 0.0685 - acc: 0.9565
Epoch 497/500
0s - loss: 0.0685 - acc: 0.9565
Epoch 498/500
0s - loss: 0.0684 - acc: 0.9565
Epoch 499/500
0s - loss: 0.0684 - acc: 0.9565
Epoch 500/500
0s - loss: 0.0684 - acc: 0.9565

We look at 4 generation examples: two start-of-line cases and two starting mid-line.

Jack and jill went up the hill
And Jill went up the
fell down and broke his crown and
pail of water jack fell down and

The first start of line case generated correctly, but the second did not. The second case was
an example from the 4th line, which is ambiguous with content from the first line. Perhaps a
further expansion to 3 input words would be better.

The two mid-line generation examples were generated correctly, matching the source text.

We can see that the choice of how the language model is framed and the requirements on how the
model will be used must be compatible. Careful design is required when using language models in
general, perhaps followed up by spot testing with sequence generation to confirm that the model's
requirements have been met.

Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.

 Whole Rhyme as Sequence. Consider updating one of the above examples to build up the entire rhyme as an input sequence. The model should be able to generate the entire thing given the seed of the first word; demonstrate this.
 Pre-Trained Embeddings. Explore using pre-trained word vectors in the embedding instead of learning the embedding as part of the model. This would not be required on such a small source text, but could be good practice (a sketch is shown after this list).
 Character Models. Explore the use of a character-based language model for the source text instead of the word-based approach demonstrated in this tutorial.
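
As a rough sketch of the pre-trained embeddings idea (the GloVe file name and 100-dimension size
are hypothetical; vocab_size, tokenizer and max_length are assumed from the examples above):

# Hedged sketch: seed the Embedding layer with pre-trained vectors instead of
# learning them from scratch, then drop it in place of the learned Embedding.
from numpy import zeros, asarray
from keras.layers import Embedding

embeddings_index = dict()
with open('glove.6B.100d.txt', encoding='utf8') as f:  # hypothetical local file
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = asarray(values[1:], dtype='float32')

embedding_matrix = zeros((vocab_size, 100))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

embedding_layer = Embedding(vocab_size, 100, weights=[embedding_matrix],
                            input_length=max_length-1, trainable=False)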
How to Develop a Character-Based Neural Language Model in Keras
A language model predicts the next word in the sequence based on the specific words that
have come before it in the sequence.
It is also possible to develop language models at the character level using neural networks.
The benefit of character-based language models is their small vocabulary and flexibility in
handling any words, punctuation, and other document structure. This comes at the cost of
requiring larger models that are slower to train.

Nevertheless, in the field of neural language models, character-based models offer a lot of
promise for a general, flexible and powerful approach to language modeling.

In this tutorial, you will discover how to develop a character-based neural language model.

After completing this tutorial, you will know:

 How to prepare text for character-based language modeling.
 How to develop a character-based language model using LSTMs.
 How to use a trained character-based language model to generate text.
Tutorial Overview
This tutorial is divided into 4 parts; they are:

1. Sing a Song of Sixpence
2. Data Preparation
3. Train Language Model
4. Generate Text

Sing a Song of Sixpence


The nursery rhyme “Sing a Song of Sixpence” is well known in the west.
The first verse is common, but there is also a 4 verse version that we will use to develop our
character-based language model.

It is short, so fitting the model will be fast, but not so short that we won’t see anything
interesting.

The complete 4 verse version we will use as source text is listed below.

Sing a song of sixpence,
A pocket full of rye.
Four and twenty blackbirds,
Baked in a pie.

When the pie was opened
The birds began to sing;
Wasn't that a dainty dish,
To set before the king.

The king was in his counting house,
Counting out his money;
The queen was in the parlour,
Eating bread and honey.

The maid was in the garden,
Hanging out the clothes,
When down came a blackbird
And pecked off her nose.

Copy the text and save it in a new file in your current working directory with the file name
‘rhyme.txt‘.
Data Preparation
The first step is to prepare the text data.
We will start by defining the type of language model.

Language Model Design


A language model must be trained on the text, and in the case of a character-based
language model, the input and output sequences must be characters.

The number of characters used as input will also define the number of characters that will
need to be provided to the model in order to elicit the first predicted character.

After the first character has been generated, it can be appended to the input sequence and
used as input for the model to generate the next character.

Longer sequences offer more context for the model to learn what character to output next
but take longer to train and impose more burden on seeding the model when generating
text.
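
To make that append-and-repeat process concrete, a rough sketch is shown below (hypothetical
helper name; the trained model and the character-to-integer mapping are built later in this tutorial,
and the full generation code is developed in the Generate Text section):

# Hedged sketch of the generation loop described above: predict one character,
# append it to the seed, and keep only the last 'length' characters as input.
from numpy import argmax
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

def generate_chars(model, mapping, length, seed_text, n_chars):
    in_text = seed_text
    for _ in range(n_chars):
        # integer encode and trim/pad the current text to the input length
        encoded = [mapping[char] for char in in_text]
        encoded = pad_sequences([encoded], maxlen=length, truncating='pre')
        # one hot encode and predict the next character
        encoded = to_categorical(encoded, num_classes=len(mapping))
        yhat = argmax(model.predict(encoded, verbose=0), axis=-1)[0]
        out_char = [c for c, i in mapping.items() if i == yhat][0]
        in_text += out_char
    return in_text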

We will use an arbitrary length of 10 characters for this model.

There is not a lot of text, and 10 characters is a few words.

We can now transform the raw text into a form that our model can learn; specifically, input
and output sequences of characters.

Load Text
We must load the text into memory so that we can work with it.

Below is a function named load_doc() that will load a text file given a filename and return
the loaded text.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

We can call this function with the filename of the nursery rhyme ‘rhyme.txt‘ to load the text
into memory. The contents of the file are then printed to screen as a sanity check.
# load text
raw_text = load_doc('rhyme.txt')
print(raw_text)

Clean Text
Next, we need to clean the loaded text.

We will not do much to it here. Specifically, we will strip all of the new line characters so that
we have one long sequence of characters separated only by white space.

# clean
tokens = raw_text.split()
raw_text = ' '.join(tokens)

You may want to explore other methods for data cleaning, such as normalizing the case to
lowercase or removing punctuation in an effort to reduce the final vocabulary size and
develop a smaller and leaner model.

Create Sequences
Now that we have a long list of characters, we can create our input-output sequences used
to train the model.

Each input sequence will be 10 characters with one output character, making each
sequence 11 characters long.

We can create the sequences by enumerating the characters in the text, starting at the 11th
character at index 10.

# organize into sequences of characters
length = 10
sequences = list()
for i in range(length, len(raw_text)):
    # select sequence of tokens
    seq = raw_text[i-length:i+1]
    # store
    sequences.append(seq)
print('Total Sequences: %d' % len(sequences))

Running this snippet, we can see that we end up with just under 400 sequences of
characters for training our language model.

Total Sequences: 399

Save Sequences
Finally, we can save the prepared data to file so that we can load it later when we develop
our model.

Below is a function save_doc() that, given a list of strings and a filename, will save the
strings to file, one per line.
# save tokens to file, one dialog per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

We can call this function and save our prepared sequences to the filename
‘char_sequences.txt‘ in our current working directory.
# save sequences to file
out_filename = 'char_sequences.txt'
save_doc(sequences, out_filename)

Complete Example
Tying all of this together, the complete code listing is provided below.

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# save tokens to file, one dialog per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# load text
raw_text = load_doc('rhyme.txt')
print(raw_text)

# clean
tokens = raw_text.split()
raw_text = ' '.join(tokens)

# organize into sequences of characters
length = 10
sequences = list()
for i in range(length, len(raw_text)):
    # select sequence of tokens
    seq = raw_text[i-length:i+1]
    # store
    sequences.append(seq)
print('Total Sequences: %d' % len(sequences))

# save sequences to file
out_filename = 'char_sequences.txt'
save_doc(sequences, out_filename)

Run the example to create the ‘char_sequences.txt‘ file.


Take a look inside; you should see something like the following:

Sing a song
ing a song
ng a song o
g a song of
a song of
a song of s
song of si
song of six
ong of sixp
ng of sixpe
...

We are now ready to train our character-based neural language model.

Train Language Model


In this section, we will develop a neural language model for the prepared sequence data.

The model will read encoded characters and predict the next character in the sequence. A
Long Short-Term Memory recurrent neural network hidden layer will be used to learn the
context from the input sequence in order to make the predictions.

Load Data
The first step is to load the prepared character sequence data from ‘char_sequences.txt‘.
We can use the same load_doc() function developed in the previous section. Once loaded,
we split the text by new line to give a list of sequences ready to be encoded.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

Encode Sequences
The sequences of characters must be encoded as integers.

This means that each unique character will be assigned a specific integer value and each
sequence of characters will be encoded as a sequence of integers.

We can create the mapping given a sorted set of unique characters in the raw input data.
The mapping is a dictionary of character values to integer values.

chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))

Next, we can process each sequence of characters one at a time and use the dictionary
mapping to look up the integer value for each character.

sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)

The result is a list of integer lists.

We need to know the size of the vocabulary later. We can retrieve this as the size of the
dictionary mapping.

# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)

Running this piece, we can see that there are 38 unique characters in the input sequence
data.

Vocabulary Size: 38

Split Inputs and Output


Now that the sequences have been integer encoded, we can separate the columns into
input and output sequences of characters.

We can do this using a simple array slice.

sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]

Next, we need to one hot encode each character. That is, each character becomes a vector
as long as the vocabulary (38 elements) with a 1 marked for the specific character. This
provides a more precise input representation for the network. It also provides a clear
objective for the network to predict, where a probability distribution over characters can be
output by the model and compared to the ideal case of all 0 values with a 1 for the actual
next character.
We can use the to_categorical() function in the Keras API to one hot encode the input and
output sequences.
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = array(sequences)
y = to_categorical(y, num_classes=vocab_size)
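As an optional sanity check (a sketch, not part of the original listing), you can print the array shapes; the first dimension is the number of training sequences, which depends on the length of the rhyme text:

# expected: X -> (num_sequences, 10, 38), y -> (num_sequences, 38)
print(X.shape, y.shape)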

We are now ready to fit the model.

Fit Model
The model is defined with an input layer that takes sequences that have 10 time steps and
38 features for the one hot encoded input sequences.

Rather than specify these numbers, we use the second and third dimensions on the X input
data. This is so that if we change the length of the sequences or size of the vocabulary, we
do not need to change the model definition.

The model has a single LSTM hidden layer with 75 memory cells, chosen with a little trial
and error.

The model has a fully connected output layer that outputs one vector with a probability
distribution across all characters in the vocabulary. A softmax activation function is used on
the output layer to ensure the output has the properties of a probability distribution.

# define model
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

Running this prints a summary of the defined network as a sanity check.

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_1 (LSTM)                (None, 75)                34200
_________________________________________________________________
dense_1 (Dense)              (None, 38)                2888
=================================================================
Total params: 37,088
Trainable params: 37,088
Non-trainable params: 0
_________________________________________________________________

The model is learning a multi-class classification problem, therefore we use the categorical
log loss intended for this type of problem. The efficient Adam implementation of gradient
descent is used to optimize the model and accuracy is reported at the end of each batch
update.

The model is fit for 100 training epochs, again found with a little trial and error.

# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, epochs=100, verbose=2)

Save Model
After the model is fit, we save it to file for later use.

The Keras model API provides the save() function that we can use to save the model to a
single file, including weights and topology information.
# save the model to file
model.save('model.h5')

We also save the mapping from characters to integers that we will need to encode any input
when using the model and decode any output from the model.

# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))

Complete Example
Tying all of this together, the complete code listing for fitting the character-based neural
language model is listed below.

from numpy import array
from pickle import dump
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)

# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)

# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = array(sequences)
y = to_categorical(y, num_classes=vocab_size)

# define model
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, epochs=100, verbose=2)

# save the model to file
model.save('model.h5')
# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))

Running the example might take one minute.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few times
and compare the average outcome.
You will see that the model learns the problem well, perhaps too well for generating
surprising sequences of characters.

...
Epoch 96/100
0s - loss: 0.2193 - acc: 0.9950
Epoch 97/100
0s - loss: 0.2124 - acc: 0.9950
Epoch 98/100
0s - loss: 0.2054 - acc: 0.9950
Epoch 99/100
0s - loss: 0.1982 - acc: 0.9950
Epoch 100/100
0s - loss: 0.1910 - acc: 0.9950

At the end of the run, you will have two files saved to the current working directory,
specifically model.h5 and mapping.pkl.
Next, we can look at using the learned model.

Generate Text
We will use the learned language model to generate new sequences of text that have the
same statistical properties.

Load Model
The first step is to load the model saved to the file ‘model.h5‘.
We can use the load_model() function from the Keras API.
# load the model
model = load_model('model.h5')

We also need to load the pickled dictionary for mapping characters to integers from the file
‘mapping.pkl‘. We will use the Pickle API to load the object.
# load the mapping
mapping = load(open('mapping.pkl', 'rb'))

We are now ready to use the loaded model.


Generate Characters
We must provide sequences of 10 characters as input to the model in order to start the
generation process. We will pick these manually.

A given input sequence will need to be prepared in the same way as preparing the training
data for the model.

First, the sequence of characters must be integer encoded using the loaded mapping.

# encode the characters as integers
encoded = [mapping[char] for char in in_text]

Next, the sequences need to be one hot encoded using the to_categorical() Keras function.
# one hot encode
encoded = to_categorical(encoded, num_classes=len(mapping))

We can then use the model to predict the next character in the sequence.

We use predict_classes() instead of predict() to directly select the integer for the character with the highest probability, rather than getting the full probability distribution across the entire set of characters.
# predict character
yhat = model.predict_classes(encoded, verbose=0)
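Note that predict_classes() has been removed from more recent Keras/TensorFlow releases. If your installed version no longer provides it, an equivalent sketch takes the argmax of predict(), which yields the same integer array:

from numpy import argmax

# same result as predict_classes(): index of the highest-probability character
yhat = argmax(model.predict(encoded, verbose=0), axis=-1)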

We can then decode this integer by looking up the mapping to see the character to which it
maps.

out_char = ''
for char, index in mapping.items():
    if index == yhat:
        out_char = char
        break
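An alternative to this lookup loop (not used in the tutorial, but equivalent) is to invert the dictionary once and index it directly:

# build the reverse mapping once: integer -> character
rev_mapping = dict((i, c) for c, i in mapping.items())
out_char = rev_mapping[yhat[0]]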

This character can then be added to the input sequence. We then need to make sure that
the input sequence is 10 characters by truncating the first character from the input
sequence text.
We can use the pad_sequences() function from the Keras API to perform this truncation operation.
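For example (a minimal sketch with hypothetical values), truncating an encoded sequence of 11 integers down to the last 10:

from keras.preprocessing.sequence import pad_sequences

# hypothetical encoded input, one integer too long
encoded = [[3, 5, 1, 0, 7, 2, 9, 4, 6, 8, 2]]
trimmed = pad_sequences(encoded, maxlen=10, truncating='pre')
print(trimmed)  # [[5 1 0 7 2 9 4 6 8 2]]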
Putting all of this together, we can define a new function named generate_seq() for using
the loaded model to generate new sequences of text.
# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one hot encode
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # predict character
        yhat = model.predict_classes(encoded, verbose=0)
        # reverse map integer to character
        out_char = ''
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
                break
        # append to input
        in_text += out_char
    return in_text

Complete Example
Tying all of this together, the complete example for generating text using the fit neural
language model is listed below.

from pickle import load
from keras.models import load_model
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one hot encode
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # predict character
        yhat = model.predict_classes(encoded, verbose=0)
        # reverse map integer to character
        out_char = ''
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
                break
        # append to input
        in_text += out_char
    return in_text

# load the model
model = load_model('model.h5')
# load the mapping
mapping = load(open('mapping.pkl', 'rb'))

# test start of rhyme
print(generate_seq(model, mapping, 10, 'Sing a son', 20))
# test mid-line
print(generate_seq(model, mapping, 10, 'king was i', 20))
# test not in original
print(generate_seq(model, mapping, 10, 'hello worl', 20))

Running the example generates three sequences of text.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few times
and compare the average outcome.
The first is a test to see how the model does at starting from the beginning of the rhyme.
The second is a test to see how well it does at beginning in the middle of a line. The final
example is a test to see how well it does with a sequence of characters never seen before.

Sing a song of sixpence, A poc
king was in his counting house
hello worls e pake wofey. The

We can see that the model did very well with the first two examples, as we would expect.
We can also see that the model still generated something for the new text, but it is
nonsense.

Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.

 Padding. Update the example to provide sequences line by line only and use
padding to fill out each sequence to the maximum line length (see the sketch after this list).
 Sequence Length. Experiment with different sequence lengths and see how they
impact the behavior of the model.
 Tune Model. Experiment with different model configurations, such as the number of
memory cells and epochs, and try to develop a better model for fewer resources.
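A minimal sketch of the Padding idea (an assumed approach, not code from the tutorial), reusing lines and mapping from the training script:

from keras.preprocessing.sequence import pad_sequences

# encode each line separately and pad to the longest line
encoded_lines = [[mapping[char] for char in line] for line in lines]
max_length = max(len(seq) for seq in encoded_lines)
# note: 0 is also a real character index here, so a dedicated padding token would be cleaner
padded = pad_sequences(encoded_lines, maxlen=max_length, padding='pre')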

Language modeling (LM) is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models analyze bodies of text data to provide a basis for their word predictions. They are used in natural language processing (NLP) applications, particularly ones that generate text as an output. Some of these applications include machine translation and question answering.
