Have you ever noticed how your phone or email client suggests complete sentences while writing an email? This is one of the many use cases of language models in Natural Language Processing (NLP).
A language model is the core component of modern Natural Language Processing (NLP).
It’s a statistical tool that analyzes the patterns of human language in order to predict words.
NLP-based applications use language models for a variety of tasks, such as audio to text
conversion, speech recognition, sentiment analysis, summarization, spell correction, etc.
Let’s understand how language models help in processing these NLP tasks:
Machine Translation: When translating a Chinese phrase “我在吃” into English, the
translator can give several choices as output:
I eat lunch
I am eating
Me am eating
Eating am I
Here, the language model determines that the translation “I am eating” sounds the most natural and suggests it as the output.
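This ranking can be sketched with a toy bigram language model that scores each candidate translation. The corpus, the candidates, and the add-one smoothing below are invented for illustration; a real translation system would use a far larger model.

```python
from collections import Counter

# toy English corpus (invented for illustration)
corpus = "i am eating . i am running . you are eating . i eat lunch .".split()

# count bigrams and unigrams in the corpus
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def score(sentence):
    """Product of bigram probabilities with add-one smoothing."""
    words = sentence.lower().split()
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(unigrams))
    return p

candidates = ["I am eating", "Me am eating", "Eating am I"]
best = max(candidates, key=score)  # "I am eating" scores highest
```

Because “i am” and “am eating” occur in the corpus while “me am” and “eating am” do not, the fluent candidate receives the highest probability.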
Formal languages (like programming languages) are precisely defined: all the words
and their usage are predefined in the system, so anyone who knows a specific programming
language can interpret what’s written without ambiguity.
Natural language, on the other hand, isn’t designed; it evolves according to the
convenience and learning of an individual. There are several terms in natural language
that can be used in a number of ways. This introduces ambiguity but can still be
understood by humans.
Machines only understand the language of numbers. For creating language models, it is
necessary to convert all the words into a sequence of numbers. This process is known as
encoding.
Encodings can be simple or complex. In the simplest scheme, a unique number is assigned to
every word; this is called label encoding. In the sentence “I love to play cricket on
weekends”, each word is assigned a number, giving [1, 2, 3, 4, 5, 6, 7]. A related scheme,
one-hot encoding, represents each word instead as a binary vector with a single 1 at the
word’s index.
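A minimal sketch of the two schemes; the mapping and array shapes here are illustrative only:

```python
import numpy as np

sentence = "I love to play cricket on weekends".lower().split()

# label encoding: assign each word a unique integer
word_to_id = {word: i for i, word in enumerate(sentence, start=1)}
labels = [word_to_id[w] for w in sentence]   # [1, 2, 3, 4, 5, 6, 7]

# one-hot encoding: each integer becomes a binary vector with a single 1
# (index 0 is reserved, so the vectors have length vocabulary + 1)
one_hot = np.zeros((len(labels), len(word_to_id) + 1), dtype=int)
for row, idx in enumerate(labels):
    one_hot[row, idx] = 1
```

Label encoding is compact but implies a spurious ordering between words; one-hot encoding avoids that at the cost of long, sparse vectors.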
Language Models determine the probability of the next word by analyzing the text in
data. These models interpret the data by feeding it through algorithms.
The algorithms are responsible for creating rules for the context in natural language. The
models are prepared for the prediction of words by learning the features and
characteristics of a language. With this learning, the model prepares itself for
understanding phrases and predicting the next words in sentences.
For training a language model, a number of probabilistic approaches are used. These
approaches vary on the basis of the purpose for which a language model is created. The
amount of text data to be analyzed and the math applied for analysis makes a difference
in the approach followed for creating and training a language model.
For example, a language model used for predicting the next word in a search query will
be absolutely different from those used in predicting the next word in a long document
(such as Google Docs). The approach followed to train the model would be unique in
both cases.
Unigram: The unigram is the simplest type of language model. It doesn't look at any
conditioning context in its calculations. It evaluates each word or term independently.
Unigram models commonly handle language processing tasks such as information
retrieval. The unigram is the foundation of a more specific model variant called the query
likelihood model, which uses information retrieval to examine a pool of documents and
match the most relevant one to a specific query.
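As a rough sketch of the query likelihood idea, the snippet below scores toy documents by the smoothed unigram probability of “generating” a query. The documents, the query, and the add-one smoothing are invented for illustration.

```python
from collections import Counter

# two toy "documents" (invented for illustration)
docs = {
    "d1": "the cricket match on the weekend".split(),
    "d2": "stock prices fell on monday".split(),
}

def query_likelihood(query, doc):
    """Unigram query likelihood: product of smoothed P(term | document)."""
    counts = Counter(doc)
    p = 1.0
    for term in query.split():
        # add-one smoothing so unseen terms do not zero out the score
        p *= (counts[term] + 1) / (len(doc) + len(counts))
    return p

# rank documents by how likely each is to "generate" the query
best = max(docs, key=lambda name: query_likelihood("cricket weekend", docs[name]))
```

The document about cricket is ranked first because its unigram distribution assigns higher probability to the query terms.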
Exponential: This type of statistical model evaluates text by using an equation which is a
combination of n-grams and feature functions. Here the features and parameters of the
desired results are already specified. The model is based on the principle of maximum
entropy, which states that the probability distribution with the most entropy is the best
choice. Exponential models make fewer statistical assumptions, which means the results
tend to be more accurate.
Continuous Space: In this type of statistical model, words are arranged as a non-linear
combination of weights in a neural network. The process of assigning weight to a word is
known as word embedding. This type of model proves helpful in scenarios where the data
set of words continues to become large and include unique words.
In cases where the data set is large and consists of rarely used or unique words, linear
models such as n-gram do not work. This is because, with increasing words, the possible
word sequences increase, and thus the patterns predicting the next word become weaker.
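A small experiment illustrates why: even in a short text, longer n-grams are almost never repeated, so their counts carry no statistical signal. The sentence below is chosen only for illustration.

```python
from collections import Counter

text = "to be or not to be that is the question".split()

def ngrams(tokens, n):
    # all contiguous n-length sequences of tokens
    return list(zip(*(tokens[i:] for i in range(n))))

# count how many distinct n-grams are ever seen more than once
repeats = {n: sum(1 for c in Counter(ngrams(text, n)).values() if c > 1)
           for n in (1, 2, 3)}
# repeats == {1: 2, 2: 1, 3: 0}: by n=3 every context is unique,
# so count-based probability estimates break down
```

This is the data sparsity problem: as n grows, almost every observed sequence is a singleton, and estimates for unseen sequences fall to zero.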
These language models are based on neural networks and are often considered an
advanced approach to executing NLP tasks. Neural language models overcome the
shortcomings of classical models such as n-grams and are used for complex tasks such as
speech recognition or machine translation.
Language is significantly complex and keeps on evolving. Therefore, the more capable
the language model is, the better it will be at performing NLP tasks. Compared to the
n-gram model, exponential and continuous space models prove to be better options for
NLP tasks because they are designed to handle ambiguity and language variation.
Language models are the cornerstone of Natural Language Processing (NLP) technology.
We have been using language models in our daily routines without even realizing it. Let’s
take a look at some examples of language models in action.
1. Speech Recognition
Voice assistants such as Siri and Alexa are examples of how language models help
machines in processing speech audio.
2. Machine Translation
Google Translator and Microsoft Translate are examples of how NLP models can help in
translating one language to another.
3. Sentiment Analysis
This helps in analyzing the sentiments behind a phrase. This use case of NLP models is
used in products that allow businesses to understand a customer’s intent behind opinions
or attitudes expressed in the text. Hubspot’s Service Hub is an example of how language
models can help in sentiment analysis.
4. Text Suggestions
Google services such as Gmail or Google Docs use language models to help users get
text suggestions while they compose an email or create long text documents,
respectively.
5. Parsing Tools
Parsing involves analyzing sentences or words that comply with syntax or grammar rules.
Spell checking tools are perfect examples of language modelling and parsing.
There are several innovative ways in which language models can support NLP tasks. If
you have an idea in mind, our AI experts can help you create language models for
executing simple to complex NLP tasks. As part of our AI application development
services, we provide a free, no-obligation consultation session that allows our prospects
to share their ideas with AI experts and discuss their execution.
The use of neural networks in language modeling is often called Neural Language
Modeling, or NLM for short.
Neural network approaches are achieving better results than classical methods both on
standalone language models and when models are incorporated into larger models on
challenging tasks like speech recognition and machine translation.
A key reason for the leaps in improved performance may be the method’s ability to
generalize.
Nonlinear neural network models solve some of the shortcomings of traditional language
models: they allow conditioning on increasingly large context sizes with only a linear
increase in the number of parameters, they alleviate the need for manually designing
backoff orders, and they support generalization across different contexts.
Specifically, a word embedding is adopted that uses a real-valued vector to represent each
word in a projected vector space. This learned representation of words based on their usage
allows words with a similar meaning to have a similar representation.
Neural Language Models (NLM) address the n-gram data sparsity issue through
parameterization of words as vectors (word embeddings) and using them as inputs to a
neural network. The parameters are learned as part of the training process. Word
embeddings obtained through NLMs exhibit the property whereby semantically close words
are likewise close in the induced vector space.
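This property can be sketched with hand-picked toy vectors; real embeddings are learned from data and typically have 50-300 dimensions, so the numbers below are illustrative only.

```python
import numpy as np

# toy 3-dimensional "embeddings" (invented for illustration)
emb = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.82, 0.12]),
    "apple": np.array([0.10, 0.20, 0.95]),
}

def cosine(a, b):
    # cosine similarity: 1.0 for identical directions, 0.0 for orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# semantically related words end up closer in the induced vector space
assert cosine(emb["king"], emb["queen"]) > cosine(emb["king"], emb["apple"])
```

Closeness in this space is what lets the model generalize: evidence about one word transfers to its neighbors.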
“True generalization” is difficult to obtain in a discrete word-index space, since there is no
obvious relation between the word indices.
Further, the distributed representation approach allows the embedding representation to
scale better with the size of the vocabulary. Classical methods that have one discrete
representation per word suffer from the curse of dimensionality: larger and larger vocabularies
of words result in longer and sparser representations.
The neural network approach to language modeling can be described using the three
following model properties, taken from “A Neural Probabilistic Language Model“, 2003.
1. Associate each word in the vocabulary with a distributed word feature vector.
2. Express the joint probability function of word sequences in terms of the feature
vectors of these words in the sequence.
3. Learn simultaneously the word feature vector and the parameters of the probability
function.
This represents a relatively simple model where both the representation and probabilistic
model are learned together directly from raw text data.
Recently, neural-based approaches have begun to consistently outperform the classical
statistical approaches.
We provide ample empirical evidence to suggest that connectionist language models are
superior to standard n-gram techniques, except their high computational (training)
complexity.
Initially, feed-forward neural network models were used to introduce the approach.
More recently, recurrent neural networks and then networks with a long-term memory like
the Long Short-Term Memory network, or LSTM, allow the models to learn the relevant
context over much longer input sequences than the simpler feed-forward networks.
[an RNN language model] provides further generalization: instead of considering just
several preceding words, neurons with input from recurrent connections are assumed to
represent short term memory. The model learns itself from the data how to represent
memory. While shallow feedforward neural networks (those with just one hidden layer) can
only cluster similar words, recurrent neural network (which can be considered as a deep
architecture) can perform clustering of similar histories. This allows for instance efficient
representation of patterns with variable length.
Size matters. The best models were the largest models, specifically number of
memory units.
Regularization matters. Use of regularization like dropout on input connections
improves results.
CNNs vs Embeddings. Character-level Convolutional Neural Network (CNN)
models can be used on the front-end instead of word embeddings, achieving similar and
sometimes better results.
Ensembles matter. Combining the prediction from multiple models can offer large
improvements in model performance.
How to Develop a Word-Level Neural Language Model and Use it to Generate Text
A language model can predict the probability of the next word in the sequence, based on
the words already observed in the sequence.
Neural network models are a preferred method for developing statistical language models
because they can use a distributed representation where different words with similar
meanings have similar representation and because they can use a large context of recently
observed words when making predictions.
In this tutorial, you will discover how to develop a statistical language model using deep
learning in Python.
The entire text is available for free in the public domain. It is available on the Project
Gutenberg website in a number of formats.
You can download the ASCII text version of the entire book (or books) here:
BOOK I.
I went down yesterday to the Piraeus with Glaucon the son of Ariston,
…
…
And it shall be well with us both in this life and in the pilgrimage of a thousand years which
we have been describing.
Data Preparation
We will start by preparing the data for modeling.
BOOK I.
I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what manner they would
celebrate the festival, which was a new thing. I was delighted with the
procession of the inhabitants; but that of the Thracians was equally,
if not more, beautiful. When we had finished our prayers and viewed the
spectacle, we turned in the direction of the city; and at that instant
Polemarchus the son of Cephalus chanced to catch sight of us from a
distance as we were starting on our way home, and told his servant to
run and bid us wait for him. The servant took hold of me by the cloak
behind, and said: Polemarchus desires you to wait.
There he is, said the youth, coming after you, if you will only wait.
What do you see that we will need to handle in preparing the data?
The specific way we prepare the data really depends on how we intend to model it, which in
turn depends on how we intend to use it.
The language model will be statistical and will predict the probability of each word given an
input sequence of text. The predicted word will be fed in as input to in turn generate the next
word.
A key design decision is how long the input sequences should be. They need to be long
enough to allow the model to learn the context for the words to predict. This input length will
also define the length of seed text used to generate new sequences when we use the
model.
There is no correct answer. With enough time and resources, we could explore the ability of
the model to learn with differently sized input sequences.
Instead, we will pick a length of 50 words for the length of the input sequences, somewhat
arbitrarily.
We could process the data so that the model only ever deals with self-contained sentences
and pad or truncate the text to meet this requirement for each input sequence. You could
explore this as an extension to this tutorial.
Instead, to keep the example brief, we will let all of the text flow together and train the model
to predict the next word across sentences, paragraphs, and even books or chapters in the
text.
Now that we have a model design, we can look at transforming the raw text into sequences
of 50 input words to 1 output word, ready to fit a model.
Load Text
The first step is to load the text into memory.
We can develop a small function to load the entire text file into memory and return it. The
function is called load_doc() and is listed below. Given a filename, it returns a sequence of
loaded text.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
Using this function, we can load the cleaner version of the document in the file
‘republic_clean.txt‘ as follows:
# load document
in_filename = 'republic_clean.txt'
doc = load_doc(in_filename)
print(doc[:200])
Running this snippet loads the document and prints the first 200 characters as a sanity
check.
BOOK I.
I went down yesterday to the Piraeus with Glaucon the son of Ariston,
that I might offer up my prayers to the goddess (Bendis, the Thracian
Artemis.); and also because I wanted to see in what
So far, so good. Next, let’s clean the text.
Clean Text
We need to transform the raw text into a sequence of tokens or words that we can use as a
source to train the model.
Based on reviewing the raw text (above), below are some specific operations we will
perform to clean the text: replace ‘--’ with white space, split words on white space, remove
all punctuation from words, remove all words that are not purely alphabetic, and normalize
all words to lowercase. You may want to explore more cleaning operations yourself as an
extension.
We can implement each of these cleaning operations in this order in a function. Below is the
function clean_doc() that takes a loaded document as an argument and returns an array of
clean tokens.
import string

# turn a doc into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens
We can run this cleaning operation on our loaded document and print out some of the
tokens and statistics as a sanity check.
# clean document
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))
First, we can see a nice list of tokens that look cleaner than the raw text. We could remove
the ‘Book I‘ chapter markers and more, but this is a good start.
['book', 'i', 'i', 'went', 'down', 'yesterday', 'to', 'the', 'piraeus', 'with', 'glaucon', 'the', 'son', 'of', 'ariston', 'that', 'i', 'might', 'offer', 'up',
'my', 'prayers', 'to', 'the', 'goddess', 'bendis', 'the', 'thracian', 'artemis', 'and', 'also', 'because', 'i', 'wanted', 'to', 'see', 'in', 'what',
'manner', 'they', 'would', 'celebrate', 'the', 'festival', 'which', 'was', 'a', 'new', 'thing', 'i', 'was', 'delighted', 'with', 'the', 'procession',
'of', 'the', 'inhabitants', 'but', 'that', 'of', 'the', 'thracians', 'was', 'equally', 'if', 'not', 'more', 'beautiful', 'when', 'we', 'had', 'finished',
'our', 'prayers', 'and', 'viewed', 'the', 'spectacle', 'we', 'turned', 'in', 'the', 'direction', 'of', 'the', 'city', 'and', 'at', 'that', 'instant',
'polemarchus', 'the', 'son', 'of', 'cephalus', 'chanced', 'to', 'catch', 'sight', 'of', 'us', 'from', 'a', 'distance', 'as', 'we', 'were', 'starting',
'on', 'our', 'way', 'home', 'and', 'told', 'his', 'servant', 'to', 'run', 'and', 'bid', 'us', 'wait', 'for', 'him', 'the', 'servant', 'took', 'hold', 'of',
'me', 'by', 'the', 'cloak', 'behind', 'and', 'said', 'polemarchus', 'desires', 'you', 'to', 'wait', 'i', 'turned', 'round', 'and', 'asked', 'him',
'where', 'his', 'master', 'was', 'there', 'he', 'is', 'said', 'the', 'youth', 'coming', 'after', 'you', 'if', 'you', 'will', 'only', 'wait', 'certainly',
'we', 'will', 'said', 'glaucon', 'and', 'in', 'a', 'few', 'minutes', 'polemarchus', 'appeared', 'and', 'with', 'him', 'adeimantus', 'glaucons',
'brother', 'niceratus', 'the', 'son', 'of', 'nicias', 'and', 'several', 'others', 'who', 'had', 'been', 'at', 'the', 'procession', 'polemarchus',
'said']
We can see that there are just under 120,000 words in the clean text and a vocabulary of
just under 7,500 words. This is smallish and models fit on this data should be manageable
on modest hardware.
Next, we can look at shaping the tokens into sequences and saving them to file.
Save Clean Text
We can organize the long list of tokens into sequences of 50 input words and 1 output word.
We can do this by iterating over the list of tokens from token 51 onwards and taking the
prior 50 tokens as a sequence, then repeating this process to the end of the list of tokens.
We will transform the tokens into space-separated strings for later storage in a file.
The code to split the list of clean tokens into sequences with a length of 51 tokens is listed
below.
# organize into sequences of tokens
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # store
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))
Printing statistics on the list, we can see that we will have exactly 118,633 training patterns
to fit our model.
Next, we can save the sequences to a new file for later loading.
We can define a new function for saving lines of text to a file. This new function is
called save_doc() and is listed below. It takes as input a list of lines and a filename. The
lines are written, one per line, in ASCII format.
# save tokens to file, one dialog per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()
We can call this function and save our training sequences to the file
‘republic_sequences.txt‘.
1 # save sequences to file
2 out_filename = 'republic_sequences.txt'
3 save_doc(sequences, out_filename)
If you open the file, you will see that each line is shifted along one word, with a new word
at the end to be predicted.
Complete Example
Tying all of this together, the complete code listing is provided below.
import string

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# turn a doc into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens

# save tokens to file, one dialog per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# load document
in_filename = 'republic_clean.txt'
doc = load_doc(in_filename)
print(doc[:200])

# clean document
tokens = clean_doc(doc)
print(tokens[:200])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

# organize into sequences of tokens
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
    seq = tokens[i-length:i]
    line = ' '.join(seq)
    sequences.append(line)
print('Total Sequences: %d' % len(sequences))

# save sequences to file
out_filename = 'republic_sequences.txt'
save_doc(sequences, out_filename)
You should now have training data stored in the file ‘republic_sequences.txt‘ in your current
working directory.
Next, let’s look at how to fit a language model to this data.
The model we will train is a neural language model. It has a few unique characteristics:
It uses a distributed representation for words so that different words with similar meanings
will have a similar representation.
It learns the representation at the same time as learning the model.
It learns to predict the probability of the next word using the context of the last 50 words.
Specifically, we will use an Embedding Layer to learn the representation of words, and a
Long Short-Term Memory (LSTM) recurrent neural network to learn to predict words based
on their context.
Let’s start by loading our training data.
Load Sequences
We can load our training data using the load_doc() function we developed in the previous
section.
Once loaded, we can split the data into separate training sequences by splitting based on
new lines.
The snippet below will load the ‘republic_sequences.txt‘ data file from the current working
directory.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
Encode Sequences
The word embedding layer expects input sequences to be comprised of integers.
We can map each word in our vocabulary to a unique integer and encode our input
sequences. Later, when we make predictions, we can convert the predicted integers back
into words by looking them up in the same mapping.
We can use the Tokenizer class from the Keras API for this. First, the Tokenizer must be fit
on the training data to build the word-to-integer mapping; we can then use the fit Tokenizer
to encode all of the training sequences, converting each sequence from a list of words to a
list of integers.
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
We can access the mapping of words to integers as a dictionary attribute called word_index
on the Tokenizer object.
We need to know the size of the vocabulary for defining the embedding layer later. We can
determine the vocabulary by calculating the size of the mapping dictionary.
Words are assigned values from 1 to the total number of words (e.g. 7,409). The
Embedding layer needs to allocate a vector representation for each word in this vocabulary,
from index 1 to the largest index. Because array indexing is zero-offset, the word at the end
of the vocabulary has index 7,409, which means the array must be 7,409 + 1 in length.
Therefore, when specifying the vocabulary size to the Embedding layer, we specify it as 1
larger than the actual vocabulary.
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1
Once encoded, we can separate the sequences into input and output elements by splitting
off the last word of each sequence. We then need to one hot encode the output word. This
means converting it from an integer to a vector of 0 values, one for each word in the
vocabulary, with a 1 to indicate the specific word at the index of the word’s integer value.
This is so that the model learns to predict the probability distribution for the next word, and
the ground truth from which to learn is 0 for all words except the actual word that comes
next.
Keras provides the to_categorical() function that can be used to one hot encode the output
words for each input-output sequence pair.
Finally, we need to specify to the Embedding layer how long input sequences are. We know
that there are 50 words because we designed the model, but a good generic way to specify
that is to use the second dimension (number of columns) of the input data’s shape. That
way, if you change the length of sequences when preparing data, you do not need to
change this data loading code; it is generic.
# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]
Fit Model
We can now define and fit our language model on the training data.
The learned embedding needs to know the size of the vocabulary and the length of input
sequences as previously discussed. It also has a parameter to specify how many
dimensions will be used to represent each word. That is, the size of the embedding vector
space.
Common values are 50, 100, and 300. We will use 50 here, but consider testing smaller or
larger values.
We will use two LSTM hidden layers with 100 memory cells each. More memory cells and
a deeper network may achieve better results.
A dense fully connected layer with 100 neurons connects to the LSTM hidden layers to
interpret the features extracted from the sequence. The output layer predicts the next word
as a single vector the size of the vocabulary with a probability for each word in the
vocabulary. A softmax activation function is used to ensure the outputs have the
characteristics of normalized probabilities.
# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 50, 50)            370500
_________________________________________________________________
lstm_1 (LSTM)                (None, 50, 100)           60400
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               80400
_________________________________________________________________
dense_1 (Dense)              (None, 100)               10100
_________________________________________________________________
dense_2 (Dense)              (None, 7410)              748410
=================================================================
Total params: 1,269,810
Trainable params: 1,269,810
Non-trainable params: 0
_________________________________________________________________
Next, the model is compiled, specifying the categorical cross-entropy loss needed to fit the
model. Technically, the model is learning a multi-class classification, and this is the suitable
loss function for this type of problem. The efficient Adam implementation of mini-batch
gradient descent is used, and the accuracy of the model is evaluated during training.
Finally, the model is fit on the data for 100 training epochs with a modest batch size of 128
to speed things up.
Training may take a few hours on modern hardware without GPUs. You can speed it up
with a larger batch size and/or fewer training epochs.
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, batch_size=128, epochs=100)
During training, you will see a summary of performance, including the loss and accuracy
evaluated from the training data at the end of each batch update.
Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few times
and compare the average outcome.
You will get different results, but perhaps an accuracy of just over 50% of predicting the
next word in the sequence, which is not bad. We are not aiming for 100% accuracy (e.g. a
model that memorized the text), but rather a model that captures the essence of the text.
...
Epoch 96/100
Epoch 97/100
Epoch 98/100
Epoch 99/100
Epoch 100/100
Save Model
At the end of the run, the trained model is saved to file.
Here, we use the Keras model API to save the model to the file ‘model.h5‘ in the current
working directory.
Later, when we load the model to make predictions, we will also need the mapping of words
to integers. This is in the Tokenizer object, and we can save that too using Pickle.
from pickle import dump

# save the model to file
model.save('model.h5')
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))
Complete Example
We can put all of this together; the complete example for fitting the language model is listed
below.
from numpy import array
from pickle import dump
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

# load doc into memory
def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

# load
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')

# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
# vocabulary size
vocab_size = len(tokenizer.word_index) + 1

# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]

# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, batch_size=128, epochs=100)

# save the model to file
model.save('model.h5')
# save the tokenizer
dump(tokenizer, open('tokenizer.pkl', 'wb'))
Now that we have a trained language model, we can use it. In this case, we can use it to
generate new sequences of text that have the same statistical properties as the source text.
This is not practical, at least not for this example, but it gives a concrete example of what
the language model has learned.
Load Data
We can use the same code from the previous section to load the training data sequences of
text.
Specifically, the load_doc() function.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load cleaned text sequences
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
We need the text so that we can choose a source sequence as input to the model for
generating a new sequence of text.
Later, we will need to specify the expected length of input. We can determine this from the
input sequences by calculating the length of one line of the loaded data and subtracting 1
for the expected output word that is also on the same line.
seq_length = len(lines[0].split()) - 1
Load Model
We can now load the model from file.
Keras provides the load_model() function for loading the model, ready for use.
# load the model
model = load_model('model.h5')
We can also load the tokenizer from file using the Pickle API.
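A minimal round-trip sketch of the Pickle API, using an in-memory buffer and a plain dict as a stand-in for the Tokenizer object; loading from the file saved earlier would be `tokenizer = load(open('tokenizer.pkl', 'rb'))`.

```python
from io import BytesIO
from pickle import dump, load

# a plain dict stands in for the Tokenizer's word-to-integer mapping
mapping = {'the': 1, 'of': 2}

# dump to an in-memory buffer instead of a file, then load it back
buffer = BytesIO()
dump(mapping, buffer)
buffer.seek(0)
restored = load(buffer)   # identical object contents after the round trip
```

The same dump/load pair works unchanged on a file handle, which is how the tokenizer travels from the training script to the generation script.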
Generate Text
The first step in generating text is preparing a seed input.
We will select a random line of text from the input text for this purpose. Once selected, we
will print it so that we have some idea of what was used.
# select a seed text
seed_text = lines[randint(0,len(lines))]
print(seed_text + '\n')
First, the seed text must be encoded to integers using the same tokenizer that we used
when training the model.
encoded = tokenizer.texts_to_sequences([seed_text])[0]
The model can predict the next word directly by calling model.predict_classes(), which will
return the index of the word with the highest probability.
# predict probabilities for each word
yhat = model.predict_classes(encoded, verbose=0)
We can then look up the index in the Tokenizer’s mapping to get the associated word.
out_word = ''
for word, index in tokenizer.word_index.items():
    if index == yhat:
        out_word = word
        break
We can then append this word to the seed text and repeat the process.
Importantly, the input sequence is going to get too long. We can truncate it to the desired
length after the input sequence has been encoded to integers. Keras provides
the pad_sequences() function that we can use to perform this truncation.
encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
We can wrap all of this into a function called generate_seq() that takes as input the model,
the tokenizer, input sequence length, the seed text, and the number of words to generate. It
then returns a sequence of words generated by the model.
# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict probabilities for each word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)
We are now ready to generate a sequence of new words given some seed text.
# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)
Putting this all together, the complete code listing for generating text from the learned
language model is listed below.
from random import randint
from pickle import load
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

# load doc into memory
def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

# generate a sequence from a language model
def generate_seq(model, tokenizer, seq_length, seed_text, n_words):
    result = list()
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict probabilities for each word
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
        result.append(out_word)
    return ' '.join(result)

# load cleaned text sequences
in_filename = 'republic_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
seq_length = len(lines[0].split()) - 1

# load the model
model = load_model('model.h5')

# load the tokenizer
tokenizer = load(open('tokenizer.pkl', 'rb'))

# select a seed text
seed_text = lines[randint(0,len(lines))]
print(seed_text + '\n')

# generate new text
generated = generate_seq(model, tokenizer, seq_length, seed_text, 50)
print(generated)
when he said that a man when he grows old may learn many things for he can no more
learn much than he can run much youth is the time for any extraordinary toil of course and
therefore calculation and geometry and all the other elements of instruction which are a
preparation for dialectic should be presented to the name of idle spendthrifts of whom the
other is the manifold and the unjust and is the best and the other which delighted to be the
opening of the soul of the soul and the embroiderer will have to be said at
Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few times
and comparing the average outcome.
You can see that the text seems reasonable. Concatenating the seed text and the generated
text would make the output easier to interpret. Nevertheless, the generated text gets
the right kind of words in the right kind of order.
Try running the example a few times to see other examples of generated text. Let me know
in the comments below if you see anything interesting.
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
Sentence-Wise Model. Split the raw data based on sentences and pad each sentence to
a fixed length (e.g. the longest sentence length).
Simplify Vocabulary. Explore a simpler vocabulary, perhaps with stemmed words or stop
words removed.
Tune Model. Tune the model, such as the size of the embedding or number of memory cells
in the hidden layer, to see if you can develop a better model.
Deeper Model. Extend the model to have multiple LSTM hidden layers, perhaps with
dropout to see if you can develop a better model.
Pre-Trained Word Embedding. Extend the model to use pre-trained word2vec or GloVe
vectors to see if it results in a better model.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Project Gutenberg
The Republic by Plato on Project Gutenberg
Republic (Plato) on Wikipedia
Language model on Wikipedia
Summary
In this tutorial, you discovered how to develop a word-based language model using a word
embedding and a recurrent neural network.
How to Automatically Generate Textual Descriptions for Photographs with Deep Learning
1.
nike November 10, 2017 at 12:59 pm #
Thank you for providing this blog, have u use rnn to do recommend ? like use rnn recommend
movies ,use the user consume movies sequences
Jason Brownlee November 11, 2017 at 9:15 am #
Mike April 29, 2020 at 10:49 pm #
out_word = ''
for word, index in tokenizer.word_index.items():
if index == yhat:
out_word = word
break
Jason Brownlee April 30, 2020 at 6:44 am #
Mike April 30, 2020 at 7:14 pm #
I see. But isn’t there a tokenizer.index_word (:: index -> word) dictionary for this
purpose?
Jason Brownlee May 1, 2020 at 6:34 am #
It might be, great tip! Perhaps it wasn’t around back when I wrote this, or I didn’t
notice it.
Mike May 5, 2020 at 3:08 am #
Hey Jason, I have a question. I do not understand why the Embedding (input) layer and
output layer are +1 larger than the number of words in our vocabulary.
I am pretty certain that the output layer (and the input layer of one-hot vectors) must be the
exact size of our vocabulary so that each output value maps 1-1 with each of our vocabulary
word. If we add +1 to this size, where (to which word) does the extra output value map to?
Jason Brownlee May 5, 2020 at 6:34 am #
We add one “word” for “none” at index 0. This is for all words that we don’t know or that
we want to map to “don’t know”.
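In other words, the Tokenizer assigns word indices starting from 1, so index 0 never maps to a real word; sizing the layers as vocabulary + 1 leaves room for it. A plain-Python sketch of why the +1 is needed (word_index here is a toy stand-in for the fitted Tokenizer's attribute):

```python
word_index = {'the': 1, 'of': 2, 'and': 3}   # Tokenizer indices start at 1

vocab_size = len(word_index) + 1             # = 4: room for indices 0..3

# an output row needs one slot per possible index, including the unused 0
output_row = [0.0] * vocab_size
for word, index in word_index.items():
    output_row[index] = 0.1                  # the highest index (3) must fit
print(len(output_row))  # -> 4
```

Without the +1, the highest word index would fall one slot past the end of the output layer.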
2.
Stuart November 11, 2017 at 6:01 am #
1. How much benefit is gained by removing punctuation? Wouldn’t a few more punctuation-based
tokens be a fairly trivial addition to several thousand word tokens?
2. Based on your experience, which method would you say is better at generating meaningful text on
modest hardware (the cheaper gpu AWS options) in a reasonable amount of time (within several
hours): word-level or character-level generation?
Also it would be great if you could include your hardware setup, python/keras versions, and how long
it took to generate your example text output.
Jason Brownlee November 11, 2017 at 9:26 am #
Good questions.
AWS is good value for money for one-off models. I strongly recommend it:
https://machinelearningmastery.com/develop-evaluate-large-deep-learning-models-keras-
amazon-web-services/
I have a big i7 imac for local dev and run large models/experiments on AWS.
Stuart November 11, 2017 at 10:49 am #
Jason Brownlee November 12, 2017 at 8:59 am #
Sorry, I mean that removing punctuation such as full stops and commas from the end of
words will mean that we have fewer tokens to model.
Mark July 23, 2020 at 10:09 pm #
part of the issue is that “Party” and “Party; ” won’t be evaluated as the same
3.
Raoul November 12, 2017 at 11:55 pm #
o
Jason Brownlee November 13, 2017 at 10:17 am #
A trained model will, yes. The model will be different each time it is trained though:
https://machinelearningmastery.com/randomness-in-machine-learning/
Gustavo February 15, 2019 at 3:27 am #
Not always, you can generate new text by fitting a multinomial distribution, where you take
the probability of a character occurring and not the maximum probability of a character. This
allows more diversity to the generated text, and you can combine with “temperature”
parameters to control this diversity.
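A minimal sketch of that idea, assuming we already have a probability distribution over the vocabulary (the softmax-with-temperature reweighting and random.choices sampling below are generic techniques, not code from the tutorial):

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    """Sample an index from probs after reweighting by temperature.

    Lower temperature sharpens the distribution (closer to argmax);
    higher temperature flattens it, giving more diverse output.
    """
    logits = [math.log(max(p, 1e-10)) / temperature for p in probs]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    weights = [w / total for w in weights]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]

rng = random.Random(42)
probs = [0.7, 0.2, 0.1]
picks = [sample_with_temperature(probs, temperature=1.0, rng=rng) for _ in range(1000)]
print(picks.count(0) / 1000)  # roughly 0.7
```

At temperature 1.0 this reproduces the model's own distribution; as temperature approaches 0 it degenerates into the argmax used in the tutorial.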
Jason Brownlee February 15, 2019 at 8:15 am #
4.
Sarang November 13, 2017 at 1:15 am #
Jason,
Thanks for the excellent blogs, what do you think about the future of deep learning?
Do you think deep learning is here to stay for another 10 years?
Jason Brownlee November 13, 2017 at 10:18 am #
Yes. The results are too valuable across a wide set of domains.
5.
Roger January 3, 2018 at 8:40 pm #
o
Roger January 3, 2018 at 9:15 pm #
Kirstie January 5, 2018 at 1:03 am #
Hi Roger. I had the same issue, updating TensorFlow with pip install --upgrade tensorflow
worked for me.
Jason Brownlee January 4, 2018 at 8:10 am #
Sorry I have not seen this error before. Perhaps try posting to stackoverflow?
6.
Roger January 8, 2018 at 6:14 pm #
Hi Jason,
Is it possible to use your codes above as a language model instead of predicting next word?
What I want is to judge “I am eating an apple” is more commonly used than “I an apple am eating”.
For short sentence, may be I don’t have 50 words as input.
Also, is it possible for Keras to output the probability with its index like the biggest probability for next
word is “republic” and I want to get the index for “republic” which can be matched in
tokenizer.word_index.
Thanks!
Roger January 8, 2018 at 8:53 pm #
do you have any suggestions if I want to use 3 previous words as input to predict next word?
Thanks.
Jason Brownlee January 9, 2018 at 5:28 am #
You can frame your problem, prepare the data and train the model.
Mike April 30, 2020 at 8:09 pm #
This is a Markov assumption. The point of a recurrent NN model is to avoid that. If you're
only going to use 3 words to predict the next then use an n-gram or a feedforward model
(like Bengio’s). No need for a recurrent model.
Jason Brownlee January 9, 2018 at 5:25 am #
7.
Cid February 14, 2018 at 3:47 am #
Hey Jason,
Thanks for the post. I noticed in the extensions part you mention Sentence-Wise Modelling.
I understand the technique of padding (after reading your other blog post). But how does it
incorporate a full stop when generating text. Is it a matter of post-processing the text? Could it be
possible/more convenient to tokenize a full stop prior to embedding?
Jason Brownlee February 14, 2018 at 8:23 am #
I generally recommend removing punctuation. It balloons the size of the vocab and in turn slows
down modeling.
Cid February 14, 2018 at 8:00 pm #
OK thanks for the advice, how could I incorporate a sentence structure into my model then?
Jason Brownlee February 15, 2018 at 8:40 am #
8.
Maria February 16, 2018 at 10:32 pm #
Hi Jason, I tried to use your model and train it with a corpus I had, everything seemed to work fine,
but at the and I have this error:
34 sequences = array(sequences)
35 #print(sequences)
—> 36 X, y = sequences[:,:-1], sequences[:,-1]
37 # print(sequences[:,:-1])
38 # X, y = sequences[:-1], sequences[-1]
regarding the sliding of the sequences. Do you know how to fix it?
Thanks so much!
Jason Brownlee February 17, 2018 at 8:45 am #
Perhaps double check your loaded data has the shape that you expect?
Ray March 1, 2018 at 7:31 am #
Are you able to confirm that your Python 3 environment is up to date, including Keras
and tensorflow?
For example, here are the current versions I’m running with:
1 python: 3.6.4
2 scipy: 1.0.0
3 numpy: 1.14.0
4 matplotlib: 2.1.1
5 pandas: 0.22.0
6 statsmodels: 0.8.0
7 sklearn: 0.19.1
8 nltk: 3.2.5
9 gensim: 3.3.0
10 xgboost 0.6
11 tensorflow: 1.5.0
12 theano: 1.0.1
13 keras: 2.1.4
Doron Ben Elazar March 20, 2018 at 8:14 am #
It means that your input is not even and np.array doesn’t parse it properly (the
author created paragraphs of 50 tokens each), a possible fix would be:
original_sequences = tokenizer.texts_to_sequences(text_chunk)
vocab_size = len(tokenizer.word_index) + 1
aligned_sequneces = []
for sequence in original_sequences:
aligned_sequence = np.zeros(max_len, dtype=np.int64)
aligned_sequence[:len(sequence)] = np.array(sequence, dtype=np.int64)
aligned_sequneces.append(aligned_sequence)
sequences = np.array(aligned_sequneces)
Cathy October 18, 2019 at 10:30 pm #
Jason Brownlee October 19, 2019 at 6:37 am #
Basil June 28, 2018 at 6:36 pm #
Don’t know if it’s too late to respond. The issue arises because you have by mistake typed
Hope this helps others who come to this page in the future!
Thanks Jason.
Jason Brownlee June 29, 2018 at 5:52 am #
JC Ulundo August 7, 2018 at 12:27 pm #
Basil I tried your suggestion but still encountering the same error.
Did anyone go through this error and got it fixed?
Jason Brownlee August 7, 2018 at 2:32 pm #
o
Naetmul October 28, 2018 at 1:06 pm #
The example already removed all the punctuation and whitespaces, so it will be not a problem in
the example.
However, if you used a custom one, then it can be a problem.
9.
vikas dixit March 5, 2018 at 9:26 pm #
Sir, i have a vocabulary size of 12000 and when i use to_categorical system throws Memory Error as
shown below:
MemoryError:
o
Jason Brownlee March 6, 2018 at 6:12 am #
Perhaps try running the code on a machine with more RAM, such as on S3?
Perhaps try mapping the vocab to integers manually with more memory efficient code?
10.
Eli Mshomi March 7, 2018 at 9:40 am #
Can you generate an article, based on other related articles, that is human-readable?
Jason Brownlee March 7, 2018 at 3:05 pm #
Yes, I believe so. The model will have to be large and carefully trained.
11.
Adam March 12, 2018 at 9:52 pm #
Hi Jason,
I just tried out your code here with my own text sample (blog posts from a tumblr blog) and trained it
and it’s now gotten to the point where text is no longer “generated”, but rather just sent back
verbatim.
The sample set I’m using is rather small (each individual post is fairly small, so I made each
sequence 10 input 1 output) giving me around 25,000 sequences. Everything else was kept the
same code from your tutorial – after around 150 epochs, the accuracy was around 0.99 so I cut it off
to try generation.
When I change the seed text from something to the sample to something else from the vocabulary
(ie not a full line but a “random” line) then the text is fairly random which is what I wanted. When the
seed text is changed to something outside of the vocabulary, the same text is generated each time.
What should I do if I want something more random? Should I change something like the memory
cells in the LSTM layers? Reduce the sequence length? Just stop training at a higher loss/lower
accuracy?
Thanks a ton for your tutorials, I’ve learned a lot through them.
Jason Brownlee March 13, 2018 at 6:28 am #
Adam March 13, 2018 at 9:17 am #
I think it might just be overfit. Sadly, there isn’t more data that I can grab (at least that i know
of currently) so I can’t grab much more data which sucks – that’s why I reduced the
sequence length to 10.
I checkpointed every epoch so that I can play around with what gives the best results.
Thanks for your advice!
Jason Brownlee March 13, 2018 at 3:04 pm #
Perhaps also try a blend/ensemble of some of the checkpointed models or models from
multiple runs to see if it can give a small lift.
12.
Prateek March 30, 2018 at 4:05 pm #
Hi Jason,
I check the amount of accuracy and loss on Tensorflow. I would like to know what exactly do you
means in accuracy in NLP?
In computer vision, if we wish to predict cat and the predicted out of the model is cat then we can
say that the accuracy of the model is greater than 95%.
I want to understand physically what do we mean by accuracy in NLP models. Can you please
explain?
Jason Brownlee March 31, 2018 at 6:34 am #
I would recommend using BLEU instead of accuracy. Accuracy does not have any useful
meaning.
13.
Jin Zhou April 14, 2018 at 12:10 pm #
Hi, I have a question about evaluating this model. As I know, perplexity is a very popular method.
For BLEU and perplexity, which one do you think is better? Could you give an example about how to
evaluate this model in keras.
Jason Brownlee April 15, 2018 at 6:21 am #
14.
Jin April 23, 2018 at 7:11 pm #
Hi Jason, I am a little confused, you told us that we need 100 words as input, but your X_train is only
50 words per line. Could you explain that a little?
Jason Brownlee April 24, 2018 at 6:29 am #
15.
jason May 1, 2018 at 9:54 am #
Jason,
You have an embedding layer as the part of the model. The embedding weights will be trained along
with other layers. Why not separate and train it independently? In other words, first using Embedding
alone to train word vectors/embedding weights. Then run your model with embedding layer initializer
with embedding weights and setting trainable=false. I see most people use your approach to train a
model. But this is kind of against the purpose of embedding because the output is not word context
but 0/1 labels. Why not replace embedding with an ordinary layer with linear activation?
Another jason
Jason Brownlee May 2, 2018 at 5:36 am #
Often, I find the model has better skill when the embedding is trained with the net.
jason May 2, 2018 at 9:49 am #
“I give examples on the blog.” I guess you mean there are other posts in your blog talking
about “train them separately”.
Jason Brownlee May 3, 2018 at 6:28 am #
16.
johhnybravo May 19, 2018 at 2:33 am #
ValueError: Error when checking : expected embedding_1_input to have shape (50,) but got array
with shape (1,) while predicting output
Jason Brownlee May 19, 2018 at 7:45 am #
o
Tanaya June 2, 2018 at 11:19 pm #
Hi. Did you get a solution to the problem? I am getting the same error.
Ervin October 19, 2018 at 9:08 am #
https://stackoverflow.com/questions/39950311/keras-error-on-predict
fanzhh January 11, 2019 at 12:36 pm #
thank you.
17.
Nitin Mukesh May 30, 2018 at 6:33 pm #
Hi, I want to develop Image Captioning in keras. What are the pre requisite for this? I have done
your tutorials for object detection using CNN. What should I do next?
Jason Brownlee May 31, 2018 at 6:15 am #
18.
mel_dagh June 1, 2018 at 1:59 am #
Hi Jason,
1.) Can we use this approach to predict if a word in a given sequence of the training data is highly
odd..i.e. that it does not occur in that position of the sequence with high probability.
2.) Is there a way to integrate pre-trained word embeddings (glove/word2vec) in the embedding
layer?
Jason Brownlee June 1, 2018 at 8:24 am #
19.
Tanaya June 8, 2018 at 7:29 pm #
Hello, this is simply an amazing post for beginners in NLP like me.
I have generated a language model, which is further generating text like I want it to.
My question is, after this generation, how do I filter out all the text that does not make sense,
syntactically or semantically?
Jason Brownlee June 9, 2018 at 6:49 am #
Thanks.
You might need another model to classify text, correct text, or filter text prior to using it.
Tanaya June 12, 2018 at 4:40 pm #
Will this also be using Keras? Do you recommend using nltk or SpaCy?
Jason Brownlee June 13, 2018 at 6:15 am #
20.
Venkat June 12, 2018 at 4:38 am #
Hi Jason,
I’m working on word correction in a sentence. Ideally the model should generate the same
number of output words as input words, with the erroneous word in the sentence corrected.
Does a language model help me with this? If it does, please leave some hints on the model.
Jason Brownlee June 12, 2018 at 6:48 am #
Sounds like a fun project. A language model might be useful. Sorry, I don’t have a worked
example. I recommend checking the literature.
21.
Amar June 20, 2018 at 8:53 pm #
Hi Jason,
(I am assuming that the model has never seen SPAM data, and hence the probability of the
generated text will be very less.)
Do you see any way I can achieve this with Language model? What would be an alternative
otherwise when it comes to rare event scenario for NLP use cases.
Thank you!!
Jason Brownlee June 21, 2018 at 6:17 am #
I would recommend using a CNN model for text classification, for example:
https://machinelearningmastery.com/best-practices-document-classification-deep-learning/
Amar June 21, 2018 at 4:25 pm #
Jason Brownlee June 21, 2018 at 5:00 pm #
No problem.
22.
musa June 29, 2018 at 2:36 am #
Hi Jason,
Could you comment on overfitting when training language models. I’ve built a sentence based LSTM
language model and have split training:validation into an 80:20 split. I’m not seeing any improvements
in my validation data whilst the accuracy of the model seems to be improving.
Thanks
Jason Brownlee June 29, 2018 at 6:13 am #
Overfitting a language model really only matters in the context of the problem for which you are
using the model.
A language model used alone is not really that useful, so overfitting doesn’t matter. It may be
purely a descriptive model rather than predictive.
Often, they are used in the input or output of another model, in which case, overfitting may limit
predictive skill.
23.
Marc June 30, 2018 at 8:42 am #
Would would the implications of returning the hidden and cell states be here?
I’d imagine if the input sequence was long enough it wouldn’t matter too much as the temporal
relationships would be captured, but if we had shorter sequences or really long documents we would
consider doing this to improve the model’s ability to learn.. am I thinking about this correctly?
What would the drawback be to returning the hidden sequence and cell state and piping that into the
next observation?
Jason Brownlee July 1, 2018 at 6:21 am #
I don’t follow, why would you return the cell state externally at all? It has no meaning outside of
the network.
star July 5, 2018 at 2:25 am #
Hope you are doing well. I have a question which returns to my understanding from
embedding vectors.
For example if I have this sentence “ the weather is nice” and the goal of my model is
predicting “nice”, when I want to use pre trained google word embedding model, I must
search embedding google matrix and find the embedding vector related to words “the”
“weather” “is” “nice” and feed them as input to my model? Am I right?
Jason Brownlee July 5, 2018 at 8:01 am #
24.
Fatemeh July 4, 2018 at 2:34 am #
Hello, Thank you for nice description. I want to use pre-trained google word embedding vectors, So I
think I don’t need to do sequence encoding, for example if I want to create sequences with the
length 10, I have to search embedding matrix and find the equivalence embedding vector for each of
the 10 words, right?
Jason Brownlee July 4, 2018 at 8:28 am #
Correct. You must map each word to its distributed representation when preparing the
embedding or the encoding.
Fatemeh July 5, 2018 at 2:43 am #
Thank you, do you have sample code for that in your book? I purchased your professional
package
Jason Brownlee July 5, 2018 at 8:01 am #
All sample code is provided with the PDF in the code/ directory.
fatemeh July 6, 2018 at 1:57 am #
Thank you. when I want to use model.fit , I have to specify X and y and I have used
pre trained google embedidng matrix. so every word has mapped to a a vector, and
my inputs are actually the sentences (with the length 4). now I don’t understand the
equivalent values for X. for example imagine the first sentence is “the weather is
nice” so the X will be “the weather is” and the y is “nice”. When I want to convert X to
integers, every word in X will be mapped to one vector? for example if the equivalent
vectors for the words in sentence in google model are :”the”= 0.9,0.6,0.8 and
“weather”=0.6,0.5,0.2 and “is”=0.3,0.1,0.5 , and “nice”=0.4,0.3,0.5 the input X will be
:[[0.9,0.6,0.8],[0.6,0.5,0.2],[0.3,0.1,0.5]] and the output y will be [0.4,0.3,0.5]?
Jason Brownlee July 6, 2018 at 6:45 am #
You can specify or learn the mapping, but after that the model will map integers to
their vectors.
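Concretely, the embedding layer is just a lookup table: each word's integer index selects one row of the weight matrix. A plain-Python sketch (the 3-dimensional vectors are made up for illustration, following the toy values in the question above):

```python
# toy embedding matrix: row i holds the vector for word index i
embedding = [
    [0.0, 0.0, 0.0],   # index 0: reserved, no real word
    [0.9, 0.6, 0.8],   # index 1: 'the'
    [0.6, 0.5, 0.2],   # index 2: 'weather'
    [0.3, 0.1, 0.5],   # index 3: 'is'
]

sentence = [1, 2, 3]   # 'the weather is', already integer-encoded
vectors = [embedding[i] for i in sentence]
print(vectors[0])  # -> [0.9, 0.6, 0.8]
```

So X stays as integers; the model itself performs this row lookup inside the Embedding layer during fit() and predict().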
25.
mh July 5, 2018 at 4:19 am #
Hello Jason, why you didn’t convert X input to hot vector format? you only did this dot y output.
Jason Brownlee July 5, 2018 at 8:02 am #
Hi..First of all would like to thank for the detailed explaination on the concept. I was executing this
model step by step to have a better understanding but am stuck while doing the predictions
yhat = model.predict_classes(encoded, verbose=0)
I am getting the below error:-
ValueError: Error when checking input: expected embedding_1_input to have shape (50,) but got
array with shape (1,)
Jason Brownlee July 6, 2018 at 6:42 am #
o
ervin October 23, 2018 at 9:00 am #
https://stackoverflow.com/questions/39950311/keras-error-on-predict
27.
NLP_enthusiast August 3, 2018 at 5:08 pm #
When callling:
model.fit(X, y, batch_size=128, epochs=100)
what is X.shape and y.shape at this point?
I’m getting the error:
Error when checking input: expected embedding_1_input to have shape (500,) but got array with
shape (1,)
In my case:
X.shape = (500,) # 500 samples i my case
y.shape = (500, 200) # this is after y= to_categorical(y, num_classes=vocab_size)
o
Jason Brownlee August 4, 2018 at 6:01 am #
NLP_enthusiast August 4, 2018 at 7:46 am #
thank you. Potentially useful to others: X.shape should be of the form (a, b) where a is the
length of “sequences” and b is the input sequence length to make forward predictions. Note
that modification/omission of string.maketrans() is likely necessary if using Python 2.x
(instead of Python 3.x) and that Theanos backend may also alleviate potential dimension
errors from Tensorflow.
Jason Brownlee August 5, 2018 at 5:21 am #
THanks.
Mike April 30, 2020 at 8:15 pm #
I don’t understand why the length of each sequence must be the same (i.e. fixed)? Isn’t
the point of RNNs to handle variable length inputs by taking as input one word at a time
and have the rest represented in the hidden state.
Is it used for optimization when we happen to use the same size for all but it’s not
actually necessary for us to do so? That would make more sense.
Jason Brownlee May 1, 2020 at 6:36 am #
Most of my implementations use a fixed length input/output for efficiency and zero
padding with masking to ignore the padding.
o
usman September 4, 2018 at 5:35 pm #
Carlos Aguayo November 20, 2018 at 10:49 am #
table = string.maketrans(string.punctuation, ' ')
tokens = [w.translate(table) for w in tokens]
with this:
Jason Brownlee November 20, 2018 at 2:04 pm #
Very nice!
28.
Eric August 20, 2018 at 9:15 pm #
Does anyone have an example of how predict based on a user provided text string instead of
random sample data. I’m kind of new to this so my apologies in advance for such a simple question.
Thank you!
Jason Brownlee August 21, 2018 at 6:15 am #
The new input data must be prepared in the same way as the training data for the model.
29.
Santosh August 21, 2018 at 6:28 am #
Can we implement this thing in Android platform to run trained model for given set of words from
user.
Jason Brownlee August 21, 2018 at 6:38 am #
30.
anirban August 31, 2018 at 11:29 pm #
Hi,
One of the extensions suggested in the blog is
Sentence-Wise Model. Split the raw data based on sentences and pad each sentence to a fixed
length (e.g. the longest sentence length).
So if I do a sentence wise splitting then do I retain the punctuations or remove it?
Jason Brownlee September 1, 2018 at 6:21 am #
31.
wu.zheng September 6, 2018 at 1:40 pm #
does this model only use the previous 50 words to predict the last word?
o
Jason Brownlee September 6, 2018 at 2:13 pm #
riyaz September 5, 2020 at 4:56 am #
why input and output are same ? output must be one shift towards left . is it right?
Jason Brownlee September 5, 2020 at 6:53 am #
The output is not the same as the input.
riyaz September 5, 2020 at 3:47 pm #
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
then what is mean by this
Jason Brownlee September 6, 2020 at 6:03 am #
If you are new to array slices and indexes in Python, this will help you get started:
https://machinelearningmastery.com/index-slice-reshape-numpy-arrays-machine-
learning-python/
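In plain-Python terms (without NumPy), that slice splits each encoded line into everything but the last integer (the input words X) and the last integer (the target word y):

```python
# three toy sequences of 4 integers each (3 input words + 1 target word)
sequences = [
    [5, 2, 9, 1],
    [2, 9, 1, 7],
    [9, 1, 7, 3],
]

X = [seq[:-1] for seq in sequences]   # all but the last column
y = [seq[-1] for seq in sequences]    # just the last column
print(X[0], y[0])  # -> [5, 2, 9] 1
```

So the input and output are not the same: y is the one-word continuation of each X, which is exactly the "one shift to the left" framing from the question above.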
32.
Anshul Patel October 3, 2018 at 5:55 pm #
Hello Jason, I am working on word predictor using RNN’s too. However I have been encountering
the same problem faced by many others i.e. INDEXERROR : Too many Indices
lines = training_set.split(‘\n’)
tokenizer = Tokenizer()
tokenizer.fit_on_sequences(lines)
tokenizer.fit_on_texts(lines)
sequences = tokenizer.texts_to_sequences(lines)
vocab_size = len(tokenizer.word_index) + 1
sequences = array(sequences)
X_train = sequences[:, :-1]
y_train = sequences[:,-1]
I have gone through all the comments related to this error, however none of them solves my issue. I
wonder if there is any problem with the text I imported it is Pride and Prejudice book from Gutenberg.
o
Jason Brownlee October 4, 2018 at 6:12 am #
I have not seen this error, are you able to confirm that your libraries are up to date and that you
copied all of the code from the tutorial?
33.
Andrew October 16, 2018 at 4:51 pm #
How long did it take to train model like this on 118633 sequences of 50 symbols from 7410 elements
dictionary?
Can you share on what machine did you train the model? ram, cpu, os?
Jason Brownlee October 17, 2018 at 6:46 am #
If you have machine problems, perhaps try an EC2 instance, I show how here:
https://machinelearningmastery.com/develop-evaluate-large-deep-learning-models-keras-
amazon-web-services/
34.
ervin October 20, 2018 at 3:36 am #
Great article!
2 questions:
1. In your ‘Extension’ section — you mentioned to try dropout. Since there is not validation/holdout
set, why should we use dropout? Isn’t that just a regularization technique and doesn’t help with
training data accuracy?
2. Speaking of accuracy — I trained my model to 99% accuracy. When I generate text and use exact
lines from PLATO as seed text, I should be getting almost an exact replica of PLATO right? Since
my model has 99% of getting the next word right in my seed text. I’m finding that this is not the case.
Am I interpreting this 99% accuracy right? What’s keeping it from making a replica.
Jason Brownlee October 20, 2018 at 5:59 am #
Yes, it is a regularization method. Yes, it can help as the model is trained using supervised
learning.
We don’t want the model to memorize Plato, we want the model to learn how to make Plato-like
text.
35.
OkRo November 29, 2018 at 9:17 pm #
Hi , Great article!
I have a question – How can I use the mode to get probability of a word in a sentence?
e.g. P(Piraeus| went down yesterday to the?
Jason Brownlee November 30, 2018 at 6:31 am #
36.
Emi December 10, 2018 at 12:36 pm #
I am a big fan of your tutorials and I used to search your tutorials first when I want to learn things in
deep learning.
I currently have a problem as follows.
I have a huge corpus of unstructured text (where I have already cleaned and tokenised as; word1,
word2, word3, … wordN). I also have a target word list which should be outputted by analysing these
cleaned text (e.g., word88, word34, word 48, word10002, … , word8).
I want to build a language model, that correcly output these target words using my cleaned text. I am
just wondering if it is possible to do using deep learning techniques? Please let me know your
thoughts.
If you have tutorial related to above senario please share it with me (as I could not find).
-Emi
REPLY
o
Jason Brownlee December 10, 2018 at 2:19 pm #
Perhaps models used for those problems would be a good starting point?
REPLY
Emi December 10, 2018 at 4:36 pm #
Thank you for your reply. No, it is not translation or summarisation. I have given an example
below with more details 🙂
I am just trying to figure out if there is a way to obtain my target output using deep learning
or ML model? Please let me know your thoughts.
-Emi
REPLY
Jason Brownlee December 11, 2018 at 7:39 am #
Sounds like you want a model to output the same input, but ordered in some way.
Emi December 11, 2018 at 12:48 pm #
Hi Jason,
Thank you very much for your suggestion. I followed the following tutorial of yours
today related to encoder-decorder: https://machinelearningmastery.com/develop-
encoder-decoder-model-sequence-sequence-prediction-keras/
It is very well explained and I would like to use it for my task. However, I got one
small problem.
In your example, you are using 100000 trainging examples as mentioned below.
However, in my task I only have one input sequence and target sequence as shown
below (same input, but ordered in different way).
Would it be a problem?
PS: However, my input and target sequence are very long in my real dataset
(around 10000 words of length).
Jason Brownlee December 11, 2018 at 2:33 pm #
You will have to adapt the model to your problem and run tests in order to
_discover_ whether the model is suitable or not.
37.
Emi December 11, 2018 at 10:38 am #
Thank you very much for your valuable suggestion. I truly appreciate it 🙂
I have followed your tutorial on the Encoder-Decoder LSTM for time-series analysis. Do you have any
tutorial on encoder-decoder models that is close to my task? If so, can you please share the link with me?
38.
Jin December 14, 2018 at 8:23 pm #
Hi Jason, I have a question about the pre-trained word vectors. I know I should set the embedding
layer with weights=[pre_embedding], but how should decide the order of pre_embedding? Like,
which word does the vector represent in a certain row. Also, should the first row always be all zeros?
Jason Brownlee December 15, 2018 at 6:11 am #
The index of each vector must match the encoded integer of each word in the vocabulary.
That is why we collect the vectors needed for each word in the vocab incrementally.
39.
Lee January 9, 2019 at 1:42 pm #
I am running the model as described here and in the book, but the loss goes to nan frequently. My
computer (18 cores + Vega 64 graphics card) also takes much longer to run an epoch than shown
here. All cpu leads to 1 hour finishing time. Encoding as int8 and using the GPU via PlaidML speeds
it up to ~376 seconds, but nan’s out.
Any advice? The code is exactly as used both here and in the book, but I just can't get it to finish a run.
Jason Brownlee January 10, 2019 at 7:46 am #
Did you try running the code file provided with the book?
Are you able to confirm that your libraries are up to date?
40.
Palani January 20, 2019 at 6:11 pm #
Jason Brownlee January 21, 2019 at 5:30 am #
I'm running this in Google Colab (albeit with a different and larger dataset). The Colab system crashes and my runtimes are basically reset.
I broke down each section as you do in this example and found that it crashes at this code.
I think the issue is that my dataset might be too large, but I'm not sure.
https://colab.research.google.com/drive/1iTXV_iC-aBTSlQ8BtiwM7zSszmIrlB3k
Jason Brownlee February 5, 2019 at 2:20 pm #
Valerie May 2, 2019 at 11:06 pm #
Same problem (((( Google Colab doesn't have enough RAM for such a big matrix.
Jason Brownlee May 3, 2019 at 6:21 am #
Hey Jason,
I have two questions,
1. What will happen when we test with a new sequence, instead of trying out a sequence already in the training data?
Jason Brownlee February 14, 2019 at 8:38 am #
Good question, not sure. The model is tailored to the specific dataset, it might just generate
garbage or iterate back to what it knows.
Not sure about your second question, what are you referring to exactly?
43.
saria March 13, 2019 at 5:19 am #
If you are feeding words in, a feature will be one word, either one hot encoded or encoded using
a word embedding.
44.
islam March 29, 2019 at 11:23 pm #
Jason Brownlee March 30, 2019 at 6:28 am #
Glad it helped.
45.
torr March 31, 2019 at 3:51 pm #
Can you help me with code and good article for grammar checker.
Jason Brownlee April 1, 2019 at 7:47 am #
46.
Aksha April 6, 2019 at 3:45 am #
I am new to the NLP realm. If you have an input text “The price of orange has increased” and output text “Increase the production of orange”, can we make our RNN model predict the output text? Or what algorithm should I use? Could you please let me know what algorithm to use for mapping an input sentence to an output sentence?
Jason Brownlee April 6, 2019 at 6:52 am #
47.
Thomas L. Packer April 27, 2019 at 7:48 am #
For those who want to use a neural language model to calculate probabilities of sentences, look
here:
https://stackoverflow.com/questions/51123481/how-to-build-a-language-model-using-lstm-that-
assigns-probability-of-occurence-f
Jason Brownlee April 28, 2019 at 6:51 am #
48.
Thomas L. Packer April 30, 2019 at 3:51 am #
if index == yhat:
    out_word = word
    break
Thomas L. Packer April 30, 2019 at 3:52 am #
I'm new to this website. How do you mark a code block in a comment? How do you add a profile picture?
Jason Brownlee April 30, 2019 at 7:03 am #
You can use <pre> HTML tags (I fixed up your prior comment for you).
Profile pictures are based on gravatar, like any wordpress blog you might come across:
https://en.gravatar.com/
Jason Brownlee April 30, 2019 at 7:02 am #
Thomas L. Packer May 3, 2019 at 3:13 am #
Thanks. I did try the other dict and it seemed to both work and run faster.
49.
Shivam Bhati June 5, 2019 at 10:37 pm #
Hey
First of all, thank you for such a great project.
I have worked on this project and I got stuck at predicting the values.
Jason Brownlee June 6, 2019 at 6:30 am #
The error suggests a mismatch between your data and the model’s expectation.
You can change your data or change the expectations of the model.
50.
Sidharth June 18, 2019 at 6:11 pm #
Thanks !
Jason Brownlee June 19, 2019 at 7:51 am #
51.
Shashank June 24, 2019 at 10:35 pm #
Sir, please help me. I'm working on text summarization. Can I do it using language modelling? I don't have much knowledge about neural networks, so if you have any suggestions or ideas please tell me. I have around 20 days to complete the project.
Thanks a lot!
Jason Brownlee June 25, 2019 at 6:21 am #
52.
Shashank June 24, 2019 at 11:40 pm #
Sir, how does the language model handle numeric data like money, dates and so on?
If there's a problem, how do I approach it, sir? I'm working on text summarization and such numeric data may be important for the summary. How can I address them?
Thanks a lot for the blog! Love your posts. Seriously, very, very helpful!
Jason Brownlee June 25, 2019 at 6:22 am #
For small datasets, it is a good idea to normalize these to a single form, e.g. words.
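As a sketch of such normalization (the patterns and placeholder words here are illustrative, not from the tutorial):

```python
import re

def normalize_numbers(text):
    # Map money amounts and bare numbers to placeholder words so the
    # vocabulary stays small on a limited dataset.
    text = re.sub(r"\$\d+(?:\.\d+)?", "MONEY", text)
    text = re.sub(r"\b\d+(?:\.\d+)?\b", "NUMBER", text)
    return text

print(normalize_numbers("paid $19.99 on 25 December 2018"))
# prints "paid MONEY on NUMBER December NUMBER"
```

A real pipeline might instead spell numbers out as words; the key point is collapsing many rare tokens into a few frequent ones.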
53.
Kakoli August 1, 2019 at 4:12 am #
Jason Brownlee August 1, 2019 at 6:56 am #
Yes, I have suggestions for diagnosing and improving deep learning model performance here:
https://machinelearningmastery.com/start-here/#better
54.
Kuro August 29, 2019 at 6:33 am #
I modified clean_doc so that it generates stand-alone tokens for punctuation, except when a single quote is used as an apostrophe as in “don’t” or “Tom’s”.
On my first run of model making, I changed the batch_size and epochs parameters to 512 and 25, thinking that it might speed up the process. It ended in 2 hours on a MacBook Pro, but running the sequence generation program produced text that mostly repeats “and the same,” like this:
to be the best , and the same , and the same , and the same , and the same , and the same , and the same , and the same , and the same , and the same , and the same , and the same , and
I changed batch_size and epochs back to the original values (125, 100) and ran the model building program overnight. Then the generated sequences look more reasonable. For example:
be a fortress to be justice and not yours , as we were saying that the just man will be enough to
support the laws and loves among nature which we have been describing , and the disregard which
he saw the concupiscent and be the truest pleasures , and
Is there an intuitive interpretation of the bad result of my first try? The loss value at the last epoch
was 4.125 for the first try and 2.2130 for the second (good result). I forgot to record the accuracy.
Jason Brownlee August 29, 2019 at 1:30 pm #
Hmmm, off the cuff, adding a lot more tokens may require an order of magnitude more data,
training and a larger model.
55.
Kuro August 31, 2019 at 8:36 am #
I'm trying to apply these programs to a collection of song lyrics of a certain genre. A song is typically made of 50 to 200 words. Since they are from the same genre, the vocabulary size is relatively small (talking about lost loves, soul, etc.). In this case, would it make sense to reduce the sequence size from 50? I'm thinking of something like 20.
The goal of my experiment is to generate a lyric by giving the first 5 to 10 words, just for fun.
Jason Brownlee September 1, 2019 at 5:33 am #
Sounds fun.
Perhaps experiment and see what works best for your specific dataset.
56.
Aly September 1, 2019 at 5:59 am #
I'm implementing this on a corpus of Arabic text, but whenever it runs I get the same word repeated for the entire text generation process. I'm training on ~50,000 words with ~16,000 unique words. I've messed around with the number of layers, more data (an issue, as the number of unique words also increases; an interesting find which feels like an error between the Tokenizer and non-English text), and epochs.
Jason Brownlee September 2, 2019 at 5:23 am #
Sorry to hear that, you may have to adjust the capacity of the model and training parameters of
the model for your new dataset.
57.
Irfan Danish October 2, 2019 at 1:51 am #
Is there any way we can generate 2 or 3 different sample texts from a single seed?
For example we input “Hello I’m” and model gives us
Hello I’m interested in
Hello I’m a bit confused
Hello I’m not sure
Instead of generating just one output it gives the 2 to 3 best outputs.
Actually, I trained your model with a sequence length of three words instead of 50. Now, when I input a seed of three words, instead of just one sequence I want to generate 2 to 3 sequences which are correlated to that seed. Is that possible? Please let me know, I would be very thankful to you!
Jason Brownlee October 2, 2019 at 8:02 am #
Yes, you can take the predicted probabilities and run a beam search to get multiple likely
sequences.
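A minimal beam search might look like the sketch below; the probs table is a made-up stand-in for the next-word probabilities a trained model would return from its softmax:

```python
import math

# Stub next-word distribution keyed by the previous word; a real model
# would produce these probabilities from its softmax output.
probs = {
    "i":      {"am": 0.6, "eat": 0.4},
    "am":     {"eating": 0.7, "happy": 0.3},
    "eat":    {"lunch": 0.9, "now": 0.1},
    "eating": {}, "happy": {}, "lunch": {}, "now": {},
}

def beam_search(seed, width=2, steps=2):
    # Each candidate is (log probability, sequence of words).
    beams = [(0.0, [seed])]
    for _ in range(steps):
        candidates = []
        for score, seq in beams:
            options = probs.get(seq[-1], {})
            if not options:          # dead end: carry the beam forward
                candidates.append((score, seq))
                continue
            for word, p in options.items():
                candidates.append((score + math.log(p), seq + [word]))
        # Keep only the `width` most probable sequences.
        beams = sorted(candidates, reverse=True)[:width]
    return [" ".join(seq) for _, seq in beams]
```

With width=2 the search keeps the two most likely continuations at every step, so you get 2 alternative sequences instead of one greedy sequence.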
Irfan Danish October 3, 2019 at 3:29 am #
Can you please give me a little bit more explanation of how I can implement it, or give me an example? That would be nice!
Jason Brownlee October 3, 2019 at 6:52 am #
58.
Arjun October 17, 2019 at 4:38 pm #
Hi,
What would the X and y be like?
And could it be done by splitting the X and y into training and testing?
Also, when the model is created, what would be the inputs for the embedding layer?
Then fit the model on X_train?
Jason Brownlee October 18, 2019 at 5:46 am #
Arjun October 18, 2019 at 3:26 pm #
Input to the embedding is 2d, each sample is a vector of integers that map to a word.
Arjun October 21, 2019 at 4:52 pm #
OK, so after the model is formed, do we make X_train 3D when fitting?
Jason Brownlee October 22, 2019 at 5:42 am #
59.
Arjun October 21, 2019 at 5:00 pm #
Hi, I would like to know if you have any idea about neural melody composition from lyrics using an RNN?
A published paper says that two encoders are used and fed to a single decoder.
I wonder if you could provide any insight on it?
This is the paper:
https://arxiv.org/pdf/1809.04318.pdf
Jason Brownlee October 22, 2019 at 5:42 am #
Sorry, I am not familiar with that paper, perhaps try contacting the authors?
Arjun October 22, 2019 at 3:18 pm #
Arjun October 22, 2019 at 4:15 pm #
Do you know how to give two sequences as input to a single decoder? Both encoders
and decoders are RNN structures. So how is it possible to have two encoders for a
single decoder?
Jason Brownlee October 23, 2019 at 6:31 am #
60.
Arjun October 23, 2019 at 3:50 pm #
It was not completely specific to my doubt, but even though thank you for helping.
61.
Arjun October 24, 2019 at 4:23 pm #
Hi, I have two sequences as input, with training and testing sets for both. I managed to concatenate both inputs and create a model, but when it comes to fitting the model, how is it possible to give the X and y of two sequences?
Jason Brownlee October 25, 2019 at 6:35 am #
If you have a multi-input model, then the fit() function will take a list with each array of samples.
E.g.
X1 = …
X2 = …
X = [X1, X2]
model.fit(X, y, ….)
Arjun October 25, 2019 at 4:03 pm #
So when we are training an RNN, should we have an equal number of outputs as well?
Arjun October 25, 2019 at 6:03 pm #
ValueError: Error when checking model target: the list of Numpy arrays that you are
passing to your model is not the size the model expected. Expected to see 1 array(s),
but instead got the following list of 2 arrays.
This was what I got when I gave multiple inputs to fit.
Jason Brownlee October 26, 2019 at 4:36 am #
Perhaps this tutorial will help you to get started with a multiple input model:
https://machinelearningmastery.com/keras-functional-api-deep-learning/
Jason Brownlee October 26, 2019 at 4:34 am #
Arjun October 28, 2019 at 3:09 pm #
Equal in the sense that the number of inputs must be equal to the number of outputs?
Jason Brownlee October 29, 2019 at 5:18 am #
Typically the number of input and output timesteps must be the same.
They can differ, but either they must be fixed, or you can use an alternate
architecture such as an encoder-decoder:
https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-
networks/
62.
Arjun October 28, 2019 at 3:23 pm #
What if we had two different inputs and we need a model with both these inputs aligned?
Could we give both these inputs in a single model or create two model with corresponding inputs
and then combine both models at the end?
Jason Brownlee October 29, 2019 at 5:19 am #
Sure, there are many different ways to solve a problem. Perhaps explore a few framings and see
what works best for your dataset?
63.
Arjun October 29, 2019 at 3:12 pm #
sure..
Thank you..
64.
Arjun October 31, 2019 at 3:30 pm #
How can we know if two inputs have been aligned?
Can we do it by merging two models?
Either way, we can know the result only after testing it, right?
Jason Brownlee November 1, 2019 at 5:25 am #
Yes.
Arjun November 1, 2019 at 3:47 pm #
Jason Brownlee November 2, 2019 at 6:38 am #
65.
Arjun November 4, 2019 at 4:46 pm #
Hi Jason,
I was working on a text generation problem and had a problem while fitting X_train and y_train: it was related to incompatible shapes. When I set the batch size to 1 it ran, but when it reached the evaluation part, it showed the previous error.
o
Jason Brownlee November 5, 2019 at 6:49 am #
Arjun November 5, 2019 at 5:06 pm #
Also, I haven't exactly copied your code as a whole; I tried doing it on my own, just taking snippets. Even then, these are errors which I have never seen before.
66.
Arjun November 5, 2019 at 3:25 pm #
I am just confused why it would run while fitting but not while evaluating? I mean, we can't do much tweaking with the arguments in evaluation.
Jason Brownlee November 6, 2019 at 6:30 am #
This error was found when I was fitting the model, but when I passed the batch size as 1, the model fitted without any problem.
But when I tried to evaluate it, the same error showed up.
Do you have any idea why it would work while fitting but not while evaluating?
Jason Brownlee November 7, 2019 at 6:34 am #
Arjun November 7, 2019 at 3:25 pm #
ok sure..
Thank you
68.
Arjun November 6, 2019 at 5:13 pm #
Hi jason,
Could you please give some insight on attention mechanism in keras?
Jason Brownlee November 7, 2019 at 6:35 am #
Arjun November 7, 2019 at 3:17 pm #
Jason Brownlee November 8, 2019 at 6:38 am #
Yes:
– implement it manually
– use a 3rd party implementation
– use tensorflow directly
– use pytorch
69.
Arjun November 7, 2019 at 8:50 pm #
Is there any reason why the validation accuracy decreases while the training accuracy increases?
Is that a case of overfitting?
Jason Brownlee November 8, 2019 at 6:40 am #
Accuracy is noisy.
Arjun November 8, 2019 at 3:19 pm #
Jason Brownlee November 9, 2019 at 6:10 am #
Or the case that the validation dataset is too small and/or not representative of the
training dataset.
Arjun November 11, 2019 at 3:20 pm #
I was training on the IMDB dataset for sentiment analysis, with 100,000 words.
Everything seems to be going okay until the training part, where the loss and accuracy keep fluctuating.
The validation dataset is split from the whole dataset, so I don't think that's the issue.
Jason Brownlee November 12, 2019 at 6:33 am #
How can we know the total number of words in the IMDB dataset? Not the vocabulary, but the size of the dataset?
Jason Brownlee November 14, 2019 at 8:03 am #
Arjun November 14, 2019 at 3:36 pm #
One sample in this dataset is one review, and each review contains a different number of words. We are considering 10,000 or 100,000 words for the dataset and splitting it into training and testing, so I need to get the total number of words.
71.
Augusto December 28, 2019 at 9:07 am #
Hi Jason,
I did the exercise from your post “Text Generation With LSTM Recurrent Neural Networks in Python with Keras”, but the alternative you describe here using a language model produces text with more coherence, so could you please elaborate on when to use one technique over the other?
Thanks in advance,
Jason Brownlee December 29, 2019 at 5:59 am #
Good question.
There are no good heuristics. Perhaps follow preference, or model skill for a specific dataset and
metric.
72.
Fred January 3, 2020 at 6:48 am #
Hi! I'm trying to convert this example to make a simple proof-of-concept model for word prediction that can do inference both backwards and forwards using the same trained model (without duplicating the data).
I want to try split the text lines in the middle and have my target word there. Like this:
X1 y X2
1. [down yesterday to the piraeus with glaucon the son of ariston]
2. [yesterday to the piraeus with glaucon the son of ariston that]
3. [to the piraeus with glaucon the son of ariston that i]
etc
(X1 and X2 are actually 20 words each)
I keep getting various data formatting errors and I feel like I have tried so many things, but obviously there are still plenty of permutations, and the correct way to do this still eludes me.
X1 = X1.reshape((n_lines+1, 20))
X2 = X2.reshape((n_lines+1, 20))
y = y.reshape((n_lines+1, vocab_size))
model = Sequential()
model.add(Embedding(vocab_size, 40, input_length=seq_length))
model.add(LSTM(100, return_sequences=True, input_shape=(20, 1)))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Cheers, Fred
Jason Brownlee January 3, 2020 at 7:37 am #
Fred January 3, 2020 at 9:25 am #
It was just suggested on the Google group that I try the Functional API, so I’m figuring out
how to do that now.
Jason Brownlee January 4, 2020 at 8:14 am #
73.
Wei Jiang January 25, 2020 at 8:06 pm #
I want to take everything into account, including punctuation, so I commented out the following line:
Any idea?
Jason Brownlee January 26, 2020 at 5:18 am #
Perhaps confirm the shape and type of the data matches what we expect after your change?
Wei Jiang January 26, 2020 at 11:45 am #
I am not sure about that. You can download the files that I have created/used from the
following OneDrive link:
https://1drv.ms/u/s!AqMx36ZH6wJMhINbfIy1INrq5onhzg?e=UmUe4V
Jason Brownlee January 27, 2020 at 7:00 am #
Sorry, I don’t have the capacity to debug your code for you:
https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-
code
74.
sachin February 27, 2020 at 1:52 pm #
Sir, when we are considering the context of a sentence to classify it into a class, which neural network architecture should I use?
Jason Brownlee February 28, 2020 at 5:55 am #
75.
Dylan Lunde March 12, 2020 at 8:05 am #
Was the too many indices for array issue ever explicitly solved by anyone?
Jason Brownlee March 12, 2020 at 8:54 am #
76.
Carson March 27, 2020 at 2:21 pm #
I have a question: you input 50 words into your neural net and get one output word, if I am not wrong, but how can you get a 50-word text when you only put in a 50-word text?
Jason Brownlee March 28, 2020 at 6:11 am #
You can use the same model recursively with output passed as input, or you can use a seq2seq
model implemented using an encoder-decoder model.
You can find many examples of encoder-decoder for NLP on this blog, perhaps start here:
https://machinelearningmastery.com/start-here/#nlp
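The recursive approach can be sketched as below, with a stub standing in for the trained model's predict step: each predicted word is appended to the running text and the input window slides forward by one word:

```python
def predict_next(window):
    # Stand-in for the trained model: deterministically picks a word
    # from a toy vocabulary based on the input window.
    vocab = ["the", "just", "man", "will", "be"]
    return vocab[sum(len(w) for w in window) % len(vocab)]

def generate(seed_words, n_words, window_size=3):
    words = list(seed_words)
    for _ in range(n_words):
        window = words[-window_size:]   # last `window_size` words as input
        words.append(predict_next(window))
    return words
```

In the real tutorial the window is 50 words and predict_next is the model's softmax prediction, but the feedback loop is the same.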
77.
Arsal April 26, 2020 at 8:43 am #
Jason Brownlee April 27, 2020 at 5:23 am #
Language models are used for text generation, GANs are used for generating images.
78.
Efstathios Chatzikyriakidis May 15, 2020 at 3:35 am #
Hi Jason,
Two issues:
I think it is 50.
Also:
“Simplify Vocabulary. Explore a simpler vocabulary, perhaps with stemmed words or stop words
removed.”
This is usually done in text classification. Doing such a thing in a language model used for text generation will lead to bad results. Stop words are important for capturing basic groups of words, e.g.: “I went to the”.
Jason Brownlee May 15, 2020 at 6:12 am #
79.
Vipul May 30, 2020 at 5:03 pm #
I need a deep neural network which selects a word out of predefined candidates. Please suggest a solution.
Jason Brownlee May 31, 2020 at 6:19 am #
80.
Prem June 1, 2020 at 9:58 am #
For sentence-wise training, does model 2 from the following post essentially show it?
https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/
81.
Laura June 4, 2020 at 8:29 pm #
Hi Jason! Thanks for your post.
I need to build a neural network which detects anomalies in syscall execution as well as in the arguments these syscalls receive. Which solution could be suitable for this problem?
Thanks in advance!
Jason Brownlee June 5, 2020 at 8:09 am #
Perhaps start with a text input and class label output, e.g. text classification models. Test bag of
words and embedding representations to see what works well.
82.
Laura June 5, 2020 at 12:04 am #
Hi Jason!
Thanks for your post!
I need to build a neural network to detect anomalies in syscall execution as well as in the arguments they receive. Which solution would you recommend for this purpose?
Thanks in advance!
Jason Brownlee June 5, 2020 at 8:14 am #
I recommend prototyping and systematically evaluating a suite of different models and discover
what works well for your dataset.
83.
Neha June 10, 2020 at 1:32 pm #
Hello sir, thank you for such a nice post. But how do I work with CSV files, how do I load and save them? I am so new to deep learning; can you give me an idea of the syntax?
Jason Brownlee June 11, 2020 at 5:48 am #
84.
Kaalu June 16, 2020 at 12:15 pm #
Hi Jason,
Thanks for your step-by-step tutorial with relevant explanations. I am trying to use this technique to generate snort rules for a specific malware family/type (somewhat similar to firewall rules or intrusion detection rules). Do you think this is possible? Can you give me any pointers to consider? Will it be possible, since such rules need to follow a specific format or sequence with keywords?
Jason Brownlee June 16, 2020 at 1:40 pm #
You’re welcome.
My best advice is to review the literature and see how others that have come before addressed
the same type of problem. It will save a ton of time and likely give good results quickly.
Kaalu June 16, 2020 at 11:37 pm #
Hi Jason,
Thanks very much. Sadly haven’t found any literature where they have anything similar .
That’s why I reached out to you.
Jason Brownlee June 17, 2020 at 6:24 am #
Hang in there, perhaps search for another project that is “close enough” and mine it for
ideas.
Ebenezer A. Laryea June 19, 2020 at 1:36 am #
thanks
85.
Hilal Ozer August 26, 2020 at 7:54 am #
Hi Jason,
Thanks for the great post. I used your code for morpheme prediction. At first I implemented it with a fixed sequence length correctly, but then I had to make it work with variable sequence lengths. So, I used a stateful LSTM with batch size 1 and set the sequence length to None.
I tried to fit the model one sample at a time. However I got the “ValueError: Input arrays should have
the same number of samples as target arrays. Found 1 input samples and 113 target samples.”
The input and output sample sizes are actually equal and “113” is the one hot vector’s size of the
output. The target output implementation is totally same with your code and runs correctly in my first
implementation with fixed sequence.
Do you have any idea why the model does not recognize one hot encoding?
Thanks in advance.
Jason Brownlee August 26, 2020 at 1:42 pm #
You’re welcome.
If you are using a stateful LSTM you may need to make the target 3d instead of 2d, e.g.
[samples, timesteps, features]. I could be wrong, but I recall that might be an issue to consider.
86.
Hilal Ozer August 27, 2020 at 7:14 am #
Thank you for your response. When I made it 2d, it ran successfully. It was 1d by mistake.
Jason Brownlee August 27, 2020 at 7:43 am #
No problem.
87.
riyaz September 5, 2020 at 9:59 pm #
If you want to learn more, you can also check out the Keras Team's text generation implementation on GitHub: https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py. Have a look at this code; it's well presented.
Jason Brownlee September 6, 2020 at 6:04 am #
Thanks for sharing.
88.
Minura Punchihewa September 13, 2020 at 4:09 am #
Hi Jason,
I have used a similar RNN architecture to develop a language model. What I would like to do now is,
when a complete sentence is provided to the model, to be able to generate the probability of it.
Note: This is the probability of the entire sentence that I am referring to, not just the next word.
From what I have gathered, this mechanism is used in the implementation of speech recognition
software.
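Conceptually, a word-level model scores a sentence as the product of its next-word probabilities, usually summed in log space to avoid underflow; a toy sketch with made-up conditional probabilities standing in for the model's softmax output:

```python
import math

# Stub conditional probabilities P(word | previous word); a trained model
# would supply these from its softmax over the vocabulary.
cond = {
    ("<s>", "i"): 0.5,
    ("i", "am"): 0.6,
    ("am", "eating"): 0.7,
}

def sentence_log_prob(words):
    # Sum of log P(w_t | w_{t-1}), starting from a start-of-sentence token.
    prev, total = "<s>", 0.0
    for w in words:
        total += math.log(cond[(prev, w)])
        prev = w
    return total
```

With an RNN language model the conditioning context is the whole prefix rather than just the previous word, but the scoring loop is the same idea.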
Jason Brownlee September 13, 2020 at 6:12 am #
89.
Minura Punchihewa September 14, 2020 at 5:34 am #
90.
Pranav September 28, 2020 at 1:04 am #
Hello Jason Sir,
Thank you for providing such an amazing and informative blog on text generation. On first reading the title I thought it was going to be difficult, but the explanations as well as the code were concise and easy to grasp. Looking forward to reading more blogs from you!!
Jason Brownlee September 28, 2020 at 6:21 am #
Thanks!
91.
John Bueno October 7, 2020 at 2:43 pm #
I've followed the steps and am almost finished but am stuck on this error.
For reference, I explicitly used the same versions of just about everything that you did. Everything works except for this line:
yhat = model.predict_classes(encoded, verbose=0)
I've tinkered with the code but sadly I am not quite mathematically and software inclined enough to find a proper solution. You may want to keep in mind that I have altered the text cleaner to keep numbers and punctuation, although reverting it back doesn't appear to fix anything. It may also be worth noting that for testing purposes I've set the epoch count to 1, but I doubt that should affect anything. Outside of that there shouldn't be any important deviations.
Jason Brownlee October 8, 2020 at 8:27 am #
Sorry to hear that, I can confirm the example works with the latest version of the libraries, e.g.
Keras 2.4 and TensorFlow 2.3, ensure your libs are up to date.
Also, it looks like you are running from an IDE, perhaps try running from the command line:
https://machinelearningmastery.com/faq/single-faq/how-do-i-run-a-script-from-the-command-line
More suggestions here:
https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-
for-me
92.
Payton F October 27, 2020 at 9:18 am #
Hi Jason,
What would your approach be to building a model trained on multiple different sources of text. For
example, if I want to train a model on speech transcripts so I can generate text in the style of a
certain speaker, would I store all the speeches in a single .txt file? I worry that if I do this, I will have
some misleading sequences such as when the sequence begins with words from one speech and
ends with words from the beginning of the next speech. Would it be better to somehow train the
model on one speech at a time rather than on a larger file of all speeches combined?
Jason Brownlee October 27, 2020 at 1:01 pm #
Perhaps fit a separate model on each source then use an ensemble of the models / stacking to
combine.
93.
123 November 6, 2020 at 6:43 am #
I am now not sure where you’re getting your information, but good topic.
I must spend a while learning much more or understanding more.
Thanks for wonderful info I used to be in search of this information for my mission.
Jason Brownlee November 6, 2020 at 7:31 am #
Thanks.
94.
cnsn8 April 6, 2021 at 12:43 am #
Thanks for the great tutorials, Jason. How can I add a simple control to this language model, such as positive/negative text generation?
Jason Brownlee April 6, 2021 at 5:18 am #
You’re welcome.
95.
Eric April 27, 2021 at 8:35 pm #
Hi Jason,
At this step, I receive the following errors with this code:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
————————————————
Total Sequences: 118633
2021-04-27 06:24:25.190966: W tensorflow/stream_executor/platform/default/dso_loader.cc:60]
Could not load dynamic library ‘cudart64_110.dll’; dlerror: cudart64_110.dll not found
2021-04-27 06:24:25.191304: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above
cudart dlerror if you do not have a GPU set up on your machine.
2021-04-27 06:24:33.866815: I tensorflow/stream_executor/platform/default/dso_loader.cc:49]
Successfully opened dynamic library nvcuda.dll
2021-04-27 06:24:34.937609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found
device 0 with properties:
pciBusID: 0000:01:00.0 name: GeForce GTX 1050 Ti computeCapability: 6.1
coreClock: 1.62GHz coreCount: 6 deviceMemorySize: 4.00GiB deviceMemoryBandwidth:
104.43GiB/s
2021-04-27 06:24:34.940037: W tensorflow/stream_executor/platform/default/dso_loader.cc:60]
Could not load dynamic library ‘cudart64_110.dll’; dlerror: cudart64_110.dll not found
2021-04-27 06:24:34.941955: W tensorflow/stream_executor/platform/default/dso_loader.cc:60]
Could not load dynamic library ‘cublas64_11.dll’; dlerror: cublas64_11.dll not found
2021-04-27 06:24:34.943931: W tensorflow/stream_executor/platform/default/dso_loader.cc:60]
Could not load dynamic library ‘cublasLt64_11.dll’; dlerror: cublasLt64_11.dll not found
2021-04-27 06:24:34.945872: W tensorflow/stream_executor/platform/default/dso_loader.cc:60]
Could not load dynamic library ‘cufft64_10.dll’; dlerror: cufft64_10.dll not found
2021-04-27 06:24:34.947770: W tensorflow/stream_executor/platform/default/dso_loader.cc:60]
Could not load dynamic library ‘curand64_10.dll’; dlerror: curand64_10.dll not found
2021-04-27 06:24:34.949522: W tensorflow/stream_executor/platform/default/dso_loader.cc:60]
Could not load dynamic library ‘cusolver64_11.dll’; dlerror: cusolver64_11.dll not found
2021-04-27 06:24:34.951167: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cusparse64_11.dll'; dlerror: cusparse64_11.dll not found
2021-04-27 06:24:34.952449: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found
2021-04-27 06:24:34.952766: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1766] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2021-04-27 06:24:34.954395: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-04-27 06:24:34.955743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
Traceback (most recent call last):
  File "D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\LANGUAGE MODEL BUILDING\LANGUAGE MODEL TEST.py", line 105, in <module>
    model.add(LSTM(100, return_sequences=True))
  ...
  File "D:\PYTHON SOFTWARE\Doc\LEARN PYTHON\venv\lib\site-packages\tensorflow\python\framework\ops.py", line 867, in __array__
    raise NotImplementedError(
NotImplementedError: Cannot convert a symbolic Tensor (lstm/strided_slice:0) to a numpy array. This error may indicate that you're trying to pass a Tensor to a NumPy call, which is not supported.
My libraries are:
Python 3.9
Theano 1.0.5
Numpy 1.20.2
A language model is a key element in many natural language processing models such as
machine translation and speech recognition. The choice of how the language model is
framed must match how the language model is intended to be used.
In this tutorial, you will discover how the framing of a language model affects the skill of the
model when generating short sequences from a nursery rhyme.
After completing this tutorial, you will know:
Tutorial Overview
This tutorial is divided into 5 parts; they are:
Language models are a key component in larger models for challenging natural language
processing problems, like machine translation and speech recognition. They can also be
developed as standalone models and used for generating new sequences that have the
same statistical properties as the source text.
Language models both learn and predict one word at a time. The training of the network
involves providing sequences of words as input that are processed one at a time where a
prediction can be made and learned for each input sequence.
Similarly, when making predictions, the process can be seeded with one or a few words,
then predicted words can be gathered and presented as input on subsequent predictions in
order to build up a generated output sequence
Therefore, each model will involve splitting the source text into input and output sequences,
such that the model can learn to predict words.
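As a minimal sketch of this splitting step (using an arbitrary made-up sentence, not the tutorial's source text), the one-word-in, one-word-out pairs can be built in plain Python:

```python
# Sketch: turn a source text into one-word-in, one-word-out training pairs.
# The sentence below is an arbitrary example, not the tutorial's source text.
text = "the quick brown fox jumps over the lazy dog"
words = text.split()

# Each pair is (current word, next word) for the model to learn from.
pairs = [(words[i - 1], words[i]) for i in range(1, len(words))]
print(pairs[:3])
```

Each framing below is a different way of carving such pairs (or longer input sequences) out of the same text.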
There are many ways to frame the sequences from a source text for language modeling.
In this tutorial, we will explore 3 different ways of developing word-based language models
in the Keras deep learning library.
There is no single best approach, just different framings that may suit different applications.
We will use this as our source text for exploring different framings of a word-based
language model.
# source text
data = """Jack and Jill went up the hill\nTo fetch a pail of water\nJack fell down and broke his crown\nAnd Jill came tumbling after\n"""
Given one word as input, the model will learn to predict the next word in the sequence.
For example:
X, y
Jack, and
and, Jill
Jill, went
...
Each lowercase word in the source text is assigned a unique integer and we can convert
the sequences of words to sequences of integers.
Keras provides the Tokenizer class that can be used to perform this encoding. First, the
Tokenizer is fit on the source text to develop the mapping from words to unique integers.
Then sequences of text can be converted to sequences of integers by calling
the texts_to_sequences() function.
# integer encode text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
We will need to know the size of the vocabulary later for both defining the word embedding
layer in the model, and for encoding output words using a one hot encoding.
The size of the vocabulary can be retrieved from the trained Tokenizer by accessing
the word_index attribute.
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
Running this example, we can see that the size of the vocabulary is 21 words.
We add one, because we will need to specify the integer for the largest encoded word as an
array index, e.g. words encoded 1 to 21 with array indices 0 to 21, or 22 positions.
Next, we need to create sequences of words to fit the model with one word as input and one
word as output.
# create word -> word sequences
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
Running this piece shows that we have a total of 24 input-output pairs to train the network.
Total Sequences: 24
We can then split the sequences into input (X) and output elements (y). This is
straightforward as we only have two columns in the data.
# split into X and y elements
sequences = array(sequences)
X, y = sequences[:,0], sequences[:,1]
We will fit our model to predict a probability distribution across all words in the vocabulary.
That means that we need to turn the output element from a single integer into a one hot
encoding, with a 0 for every word in the vocabulary and a 1 for the word that the integer
value encodes. This gives the network a ground truth to aim for, from which we can calculate
error and update the model.
Keras provides the to_categorical() function that we can use to convert the integer to a one
hot encoding while specifying the number of classes as the vocabulary size.
# one hot encode outputs
y = to_categorical(y, num_classes=vocab_size)
The model uses a learned word embedding in the input layer. This has one real-valued
vector for each word in the vocabulary, where each word vector has a specified length. In
this case we will use a 10-dimensional projection. The input sequence contains a single
word, therefore the input_length=1.
The model has a single hidden LSTM layer with 50 units. This is far more than is needed.
The output layer is comprised of one neuron for each word in the vocabulary and uses a
softmax activation function to ensure the output is normalized to look like a probability.
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 1, 10)             220
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                12200
_________________________________________________________________
dense_1 (Dense)              (None, 22)                1122
=================================================================
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________
We will use this same general network structure for each example in this tutorial, with minor
changes to the learned embedding layer.
Next, we can compile and fit the network on the encoded text data. Technically, we are
modeling a multi-class classification problem (predict the word in the vocabulary), therefore
using the categorical cross entropy loss function. We use the efficient Adam implementation
of gradient descent and track accuracy at the end of each epoch. The model is fit for 500
training epochs, again, perhaps more than is needed.
The network configuration was not tuned for this and later experiments; an over-prescribed
configuration was chosen to ensure that we could focus on the framing of the language
model.
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
After the model is fit, we test it by passing it a given word from the vocabulary and having
the model predict the next word. Here we pass in ‘Jack‘ by encoding it and
calling model.predict_classes() to get the integer output for the predicted word. This is then
looked up in the vocabulary mapping to give the associated word.
# evaluate
in_text = 'Jack'
print(in_text)
encoded = tokenizer.texts_to_sequences([in_text])[0]
encoded = array(encoded)
yhat = model.predict_classes(encoded, verbose=0)
for word, index in tokenizer.word_index.items():
    if index == yhat:
        print(word)
This process could then be repeated a few times to build up a generated sequence of
words.
To make this easier, we wrap up the behavior in a function that we can call by passing in
our model and the seed word.
# generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words):
    in_text, result = seed_text, seed_text
    for _ in range(n_words):
        # encode the text as an integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = array(encoded)
        # predict a word in the vocabulary
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text, result = out_word, result + ' ' + out_word
    return result
We can tie all of this together. The complete code listing is provided below.
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

# generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words):
    in_text, result = seed_text, seed_text
    for _ in range(n_words):
        # encode the text as an integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = array(encoded)
        # predict a word in the vocabulary
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text, result = out_word, result + ' ' + out_word
    return result

# source text
data = """Jack and Jill went up the hill\nTo fetch a pail of water\nJack fell down and broke his crown\nAnd Jill came tumbling after\n"""
# integer encode text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# create word -> word sequences
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# split into X and y elements
sequences = array(sequences)
X, y = sequences[:,0], sequences[:,1]
# one hot encode outputs
y = to_categorical(y, num_classes=vocab_size)
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
# evaluate
print(generate_seq(model, tokenizer, 'Jack', 6))
Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few times
and compare the average outcome.
Running the example prints the loss and accuracy each training epoch.
...
Epoch 496/500
0s - loss: 0.2358 - acc: 0.8750
...
Epoch 500/500
We can see that the model does not memorize the source sequences, likely because there
is some ambiguity in the input sequences, for example:

Jack => and
Jack => fell

And so on.
At the end of the run, ‘Jack‘ is passed in and a prediction or new sequence is generated.
We get a reasonable sequence as output that has some elements of the source.
This is a good first cut language model, but does not take full advantage of the LSTM’s
ability to handle sequences of input and disambiguate some of the ambiguous pairwise
sequences by using a broader context.
For example:
X, y
_, _, _, _, _, Jack, and
This approach may allow the model to use the context of each line to help the model in
those cases where a simple one-word-in-and-out model creates ambiguity.
In this case, this comes at the cost of predicting words across lines, which might be fine for
now if we are only interested in modeling and generating lines of text.
Note that in this representation, we will require a padding of sequences to ensure they meet
a fixed length input. This is a requirement when using Keras.
First, we can create the sequences of integers, line-by-line by using the Tokenizer already
fit on the source text.
# create line-based sequences
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
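Before these variable-length sequences can be stacked into an array, they must be padded. As a rough illustration of what Keras's pad_sequences() does with padding='pre' (the integer sequences below are made up for illustration, not taken from the fitted Tokenizer):

```python
# Sketch of pre-padding: shorter sequences are filled with zeros on the left
# so that every sequence reaches the length of the longest one.
# These integer sequences are illustrative only.
sequences = [[2, 1], [2, 1, 3], [2, 1, 3, 4]]
max_length = max(len(seq) for seq in sequences)
padded = [[0] * (max_length - len(seq)) + seq for seq in sequences]
print(padded)
```

Zero is used because the Tokenizer reserves index 0, which is why we added one to the vocabulary size earlier.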
Next, we can pad the sequences to the length of the longest, then split them into input and
output elements, much like before.

# pad sequences to the length of the longest
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
The model can then be defined as before, except the input sequences are now longer than
a single word. Specifically, they are max_length-1 in length, -1 because when we calculated
the maximum length of sequences, they included the input and output elements.
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    for _ in range(n_words):
        # encode the text as an integer sequence
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text
Tying all of this together, the complete code example is provided below.
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    for _ in range(n_words):
        # encode the text as an integer sequence
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text

# source text
data = """Jack and Jill went up the hill\nTo fetch a pail of water\nJack fell down and broke his crown\nAnd Jill came tumbling after\n"""
# prepare the tokenizer on the source text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# create line-based sequences
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Jack', 4))
print(generate_seq(model, tokenizer, max_length-1, 'Jill', 4))
Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few times
and compare the average outcome.
Running the example achieves a better fit on the source data. The added context has
allowed the model to disambiguate some of the examples.
There are still two lines of text that start with ‘Jack‘ that may still be a problem for the
network.
...
Epoch 496/500
...
Epoch 500/500
At the end of the run, we generate two sequences with different seed words: ‘Jack‘ and ‘Jill‘.
The first generated line looks good, directly matching the source text. The second is a bit
strange. This makes sense, because the network only ever saw ‘Jill‘ within an input
sequence, not at the beginning of the sequence, so it has forced an output to use the word
‘Jill‘, i.e. the last line of the rhyme.
Jack fell down and broke
This was a good example of how the framing may result in better new lines, but not good
partial lines of input.
Using a fixed number of input words provides a trade-off between the two framings, allowing
new lines to be generated and for generation to be picked up mid line.
We will use two words as input to predict one word as output. The preparation of the
sequences is much like the first example, except with different offsets in the source
sequence arrays, as follows:
# create two words -> one word sequences
sequences = list()
for i in range(2, len(encoded)):
    sequence = encoded[i-2:i+1]
    sequences.append(sequence)
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

# generate a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    for _ in range(n_words):
        # encode the text as an integer sequence
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' ' + out_word
    return in_text

# source text
data = """Jack and Jill went up the hill\nTo fetch a pail of water\nJack fell down and broke his crown\nAnd Jill came tumbling after\n"""
# integer encode text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# create two words -> one word sequences
sequences = list()
for i in range(2, len(encoded)):
    sequence = encoded[i-2:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# pad sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2)
# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'Jack and', 5))
print(generate_seq(model, tokenizer, max_length-1, 'And Jill', 3))
print(generate_seq(model, tokenizer, max_length-1, 'fell down', 5))
print(generate_seq(model, tokenizer, max_length-1, 'pail of', 5))
Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few times
and compare the average outcome.
Running the example again gets a good fit on the source text at around 95% accuracy.
...
Epoch 496/500
...
Epoch 500/500
We look at 4 generation examples, two start of line cases and two starting mid line.
The first start of line case generated correctly, but the second did not. The second case was
an example from the 4th line, which is ambiguous with content from the first line. Perhaps a
further expansion to 3 input words would be better.
The two mid-line generation examples were generated correctly, matching the source text.
We can see that the choice of how the language model is framed and the requirements on
how the model will be used must be compatible. This careful design is required when using
language models in general, perhaps followed up by spot testing with sequence generation
to confirm that model requirements have been met.
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
Whole Rhyme as Sequence. Consider updating one of the above examples to build
up the entire rhyme as an input sequence. The model should be able to generate the entire
thing given the seed of the first word; demonstrate this.
Pre-Trained Embeddings. Explore using pre-trained word vectors in the embedding
instead of learning the embedding as part of the model. This would not be required on such
a small source text, but could be good practice.
Character Models. Explore the use of a character-based language model for the
source text instead of the word-based approach demonstrated in this tutorial.
How to Develop a Character-Based Neural
Language Model in Keras
A language model predicts the next word in the sequence based on the specific words that
have come before it in the sequence.
It is also possible to develop language models at the character level using neural networks.
The benefit of character-based language models is their small vocabulary and flexibility in
handling any words, punctuation, and other document structure. This comes at the cost of
requiring larger models that are slower to train.
Nevertheless, in the field of neural language models, character-based models offer a lot of
promise for a general, flexible and powerful approach to language modeling.
In this tutorial, you will discover how to develop a character-based neural language model.
It is short, so fitting the model will be fast, but not so short that we won’t see anything
interesting.
The complete 4 verse version we will use as source text is listed below.
Sing a song of sixpence,
A pocket full of rye.
Four and twenty blackbirds,
Baked in a pie.

When the pie was opened
The birds began to sing;
Wasn't that a dainty dish,
To set before the king.

The king was in his counting house,
Counting out his money;
The queen was in the parlour,
Eating bread and honey.

The maid was in the garden,
Hanging out the clothes,
When down came a blackbird
And pecked off her nose.
Copy the text and save it in a new file in your current working directory with the file name
‘rhyme.txt‘.
Data Preparation
The first step is to prepare the text data.
We will start by defining the type of language model.
The number of characters used as input will also define the number of characters that will
need to be provided to the model in order to elicit the first predicted character.
After the first character has been generated, it can be appended to the input sequence and
used as input for the model to generate the next character.
Longer sequences offer more context for the model to learn what character to output next
but take longer to train and impose more burden on seeding the model when generating
text.
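The seed-then-append loop described above can be sketched independently of any trained network; here a hypothetical predict_char() that always returns 'x' stands in for the model:

```python
# Sketch: generate text by repeatedly feeding the last `length` characters
# back in as input. predict_char is a stand-in for the trained network.
def predict_char(context):
    return 'x'  # a real model would return its most probable next character

length = 10              # number of input characters the model expects
seed = 'Sing a son'      # the seed must supply `length` characters
generated = seed
for _ in range(5):
    context = generated[-length:]       # most recent `length` characters
    generated += predict_char(context)  # append the predicted character
print(generated)
```

The sliding window over the generated text is what lets a fixed-input model produce arbitrarily long output.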
We can now transform the raw text into a form that our model can learn; specifically, input
and output sequences of characters.
Load Text
We must load the text into memory so that we can work with it.
Below is a function named load_doc() that will load a text file given a filename and return
the loaded text.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text
We can call this function with the filename of the nursery rhyme ‘rhyme.txt‘ to load the text
into memory. The contents of the file are then printed to screen as a sanity check.
# load text
raw_text = load_doc('rhyme.txt')
print(raw_text)
Clean Text
Next, we need to clean the loaded text.
We will not do much to it here. Specifically, we will strip all of the new line characters so that
we have one long sequence of characters separated only by white space.
# clean
tokens = raw_text.split()
raw_text = ' '.join(tokens)
You may want to explore other methods for data cleaning, such as normalizing the case to
lowercase or removing punctuation in an effort to reduce the final vocabulary size and
develop a smaller and leaner model.
Create Sequences
Now that we have a long list of characters, we can create our input-output sequences used
to train the model.
Each input sequence will be 10 characters with one output character, making each
sequence 11 characters long.
We can create the sequences by enumerating the characters in the text, starting at the 11th
character at index 10.
# organize into sequences of characters
length = 10
sequences = list()
for i in range(length, len(raw_text)):
    # select sequence of tokens
    seq = raw_text[i-length:i+1]
    # store
    sequences.append(seq)
print('Total Sequences: %d' % len(sequences))
Running this snippet, we can see that we end up with just under 400 sequences of
characters for training our language model.
Save Sequences
Finally, we can save the prepared data to file so that we can load it later when we develop
our model.
Below is a function save_doc() that, given a list of strings and a filename, will save the
strings to file, one per line.
# save tokens to file, one dialog per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()
We can call this function and save our prepared sequences to the filename
‘char_sequences.txt‘ in our current working directory.
# save sequences to file
out_filename = 'char_sequences.txt'
save_doc(sequences, out_filename)
Complete Example
Tying all of this together, the complete code listing is provided below.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# save tokens to file, one dialog per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    file = open(filename, 'w')
    file.write(data)
    file.close()

# load text
raw_text = load_doc('rhyme.txt')
print(raw_text)

# clean
tokens = raw_text.split()
raw_text = ' '.join(tokens)

# organize into sequences of characters
length = 10
sequences = list()
for i in range(length, len(raw_text)):
    # select sequence of tokens
    seq = raw_text[i-length:i+1]
    # store
    sequences.append(seq)
print('Total Sequences: %d' % len(sequences))

# save sequences to file
out_filename = 'char_sequences.txt'
save_doc(sequences, out_filename)
Sing a song
ing a song
ng a song o
g a song of
 a song of
a song of s
 song of si
song of six
ong of sixp
ng of sixpe
...
The model will read encoded characters and predict the next character in the sequence. A
Long Short-Term Memory recurrent neural network hidden layer will be used to learn the
context from the input sequence in order to make the predictions.
Load Data
The first step is to load the prepared character sequence data from ‘char_sequences.txt‘.
We can use the same load_doc() function developed in the previous section. Once loaded,
we split the text by new line to give a list of sequences ready to be encoded.
# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')
Encode Sequences
The sequences of characters must be encoded as integers.
This means that each unique character will be assigned a specific integer value and each
sequence of characters will be encoded as a sequence of integers.
We can create the mapping given a sorted set of unique characters in the raw input data.
The mapping is a dictionary of character values to integer values.
# create the character-to-integer mapping
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
Next, we can process each sequence of characters one at a time and use the dictionary
mapping to look up the integer value for each character.
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)
We need to know the size of the vocabulary later. We can retrieve this as the size of the
dictionary mapping.
# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)
Running this piece, we can see that there are 38 unique characters in the input sequence
data.
Vocabulary Size: 38
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
Next, we need to one hot encode each character. That is, each character becomes a vector
as long as the vocabulary (38 elements) with a 1 marked for the specific character. This
provides a more precise input representation for the network. It also provides a clear
objective for the network to predict, where a probability distribution over characters can be
output by the model and compared to the ideal case of all 0 values with a 1 for the actual
next character.
We can use the to_categorical() function in the Keras API to one hot encode the input and
output sequences.
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = array(sequences)
y = to_categorical(y, num_classes=vocab_size)
Fit Model
The model is defined with an input layer that takes sequences that have 10 time steps and
38 features for the one hot encoded input sequences.
Rather than specify these numbers, we use the second and third dimensions on the X input
data. This is so that if we change the length of the sequences or size of the vocabulary, we
do not need to change the model definition.
The model has a single LSTM hidden layer with 75 memory cells, chosen with a little trial
and error.
The model has a fully connected output layer that outputs one vector with a probability
distribution across all characters in the vocabulary. A softmax activation function is used on
the output layer to ensure the output has the properties of a probability distribution.
# define model
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_1 (LSTM)                (None, 75)                34200
_________________________________________________________________
dense_1 (Dense)              (None, 38)                2888
=================================================================
Total params: 37,088
Trainable params: 37,088
Non-trainable params: 0
_________________________________________________________________
The model is learning a multi-class classification problem, therefore we use the categorical
log loss intended for this type of problem. The efficient Adam implementation of gradient
descent is used to optimize the model and accuracy is reported at the end of each batch
update.
The model is fit for 100 training epochs, again found with a little trial and error.
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, epochs=100, verbose=2)
Save Model
After the model is fit, we save it to file for later use.
The Keras model API provides the save() function that we can use to save the model to a
single file, including weights and topology information.
# save the model to file
model.save('model.h5')
We also save the mapping from characters to integers that we will need to encode any input
when using the model and decode any output from the model.
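A minimal sketch of saving (and re-loading) the mapping with the standard-library pickle module; the dictionary below is illustrative, while the filename matches the 'mapping.pkl' referenced later in this tutorial:

```python
# Save a character-to-integer mapping with pickle, then load it back.
from pickle import dump, load

mapping = {'a': 0, 'b': 1, 'c': 2}  # an illustrative mapping, not the real one
dump(mapping, open('mapping.pkl', 'wb'))

# Loading returns an equal dictionary, ready for encoding and decoding.
restored = load(open('mapping.pkl', 'rb'))
print(restored == mapping)
```

Persisting the mapping matters because the integers assigned to characters must be identical at training time and at generation time.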
Complete Example
Tying all of this together, the complete code listing for fitting the character-based neural
language model is listed below.
from numpy import array
from pickle import dump
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, LSTM

# load doc into memory
def load_doc(filename):
    # open the file as read only
    file = open(filename, 'r')
    # read all text
    text = file.read()
    # close the file
    file.close()
    return text

# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
    # integer encode line
    encoded_seq = [mapping[char] for char in line]
    # store
    sequences.append(encoded_seq)

# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)

# separate into input and output
sequences = array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = array(sequences)
y = to_categorical(y, num_classes=vocab_size)

# define model
model = Sequential()
model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, epochs=100, verbose=2)

# save the model to file
model.save('model.h5')
# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))
Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few times
and compare the average outcome.
You will see that the model learns the problem well, perhaps too well for generating
surprising sequences of characters.
...
Epoch 96/100
...
Epoch 100/100
At the end of the run, you will have two files saved to the current working directory,
specifically model.h5 and mapping.pkl.
Next, we can look at using the learned model.
Generate Text
We will use the learned language model to generate new sequences of text that have the
same statistical properties.
Load Model
The first step is to load the model saved to the file ‘model.h5‘.
We can use the load_model() function from the Keras API.
# load the model
model = load_model('model.h5')
We also need to load the pickled dictionary for mapping characters to integers from the file
‘mapping.pkl‘. We will use the Pickle API to load the object.
# load the mapping
mapping = load(open('mapping.pkl', 'rb'))
A given input sequence will need to be prepared in the same way as preparing the training
data for the model.
First, the sequence of characters must be integer encoded using the loaded mapping.
Next, the sequences need to be one hot encoded using the to_categorical() Keras function.
# integer encode the characters
encoded = [mapping[char] for char in in_text]
# one hot encode
encoded = to_categorical([encoded], num_classes=len(mapping))
We can then use the model to predict the next character in the sequence.
We can then decode this integer by looking up the mapping to see the character to which it
maps.
# predict character
yhat = model.predict_classes(encoded, verbose=0)
# reverse map the integer to a character
out_char = ''
for char, index in mapping.items():
    if index == yhat:
        out_char = char
        break
This character can then be added to the input sequence. We then need to make sure that
the input sequence is 10 characters by truncating the first character from the input
sequence text.
We can use the pad_sequences() function from the Keras API that can perform this
truncation operation.
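As a rough illustration of the truncating='pre' behavior (with made-up integers), only the last maxlen items are kept and the oldest are dropped:

```python
# Sketch of pre-truncation: when a sequence is longer than maxlen,
# the oldest (leftmost) items are dropped. These integers are illustrative.
seq = [5, 3, 7, 1, 9, 2, 8, 4, 6, 0, 3]  # 11 items, one too many
maxlen = 10
truncated = seq[-maxlen:]
print(truncated)
```

This is exactly the sliding-window effect we want: the input always holds the 10 most recent characters.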
Putting all of this together, we can define a new function named generate_seq() for using
the loaded model to generate new sequences of text.
# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one hot encode
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # predict character
        yhat = model.predict_classes(encoded, verbose=0)
        # reverse map the integer to a character
        out_char = ''
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
                break
        # append to input
        in_text += out_char
    return in_text
Complete Example
Tying all of this together, the complete example for generating text using the fit neural
language model is listed below.
from pickle import load
from keras.models import load_model
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one hot encode
        encoded = to_categorical(encoded, num_classes=len(mapping))
        # predict character
        yhat = model.predict_classes(encoded, verbose=0)
        # reverse map the integer to a character
        out_char = ''
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
                break
        # append to input
        in_text += out_char
    return in_text

# load the model
model = load_model('model.h5')
# load the mapping
mapping = load(open('mapping.pkl', 'rb'))

# test start of rhyme
print(generate_seq(model, mapping, 10, 'Sing a son', 20))
# test mid line
print(generate_seq(model, mapping, 10, 'king was i', 20))
# test not in original
print(generate_seq(model, mapping, 10, 'hello worl', 20))
Note: Your results may vary given the stochastic nature of the algorithm or evaluation
procedure, or differences in numerical precision. Consider running the example a few times
and compare the average outcome.
The first is a test to see how the model does at starting from the beginning of the rhyme.
The second is a test to see how well it does at beginning in the middle of a line. The final
example is a test to see how well it does with a sequence of characters never seen before.
We can see that the model did very well with the first two examples, as we would expect.
We can also see that the model still generated something for the new text, but it is
nonsense.
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
Padding. Update the example to provide sequences line by line only and use
padding to fill out each sequence to the maximum line length.
Sequence Length. Experiment with different sequence lengths and see how they
impact the behavior of the model.
Tune Model. Experiment with different model configurations, such as the number of
memory cells and epochs, and try to develop a better model for fewer resources.