Why don't my Keras text generation results reproduce?
on NLP, Keras, Deep Learning, Text Generation, Python
Building a simple, decent text generator in Keras is not a difficult task, yet there are a few mistakes in the example code that can prevent you from succeeding.
Today we will discuss one of the most popular LSTM examples in Python, written by Trung Tran. In his post, he provides a simple 2-layer char-LSTM architecture that learns rather fast and reproduces simple phrases:
“Albus Dumbledore, I should, do you? But he doesn’t want to adding the thing that you are at Hogwarts, so we can run and get more than one else, you see you, Harry.”
“What about this thing, you shouldn’t,” Harry said to Ron and Hermione. “I have no furious test,” said Hermione in a small voice.
“Well, you can’t be the baby way?” said Harry. “He was a great Beater, he didn’t want to ask for more time.”
The main problem with this code is… the resulting model does not reproduce its results after saving: the output looks random even after resuming training!
Later on, a very similar architecture was added as an official example to the Keras repository.
There are a lot of issues opened discussing why the model seems “untrained”.
The original code can be found on GitHub, but here I will provide both the code and my own view on how to fix the errors.
Fixing reproducibility
First of all, spoiler: the error is not in the main code but in the load_data function, which builds a set of the characters in your data:
# method for preparing the training data
def load_data(data_dir, seq_length):
    data = open(data_dir, 'r').read()
    chars = list(set(data))  # <---- HERE IT IS!
    VOCAB_SIZE = len(chars)
    ix_to_char = {ix:char for ix, char in enumerate(chars)}
    char_to_ix = {char:ix for ix, char in enumerate(chars)}
The chars list comes out in a different order every time you read the text file (a Python set has no fixed ordering), so the dictionary of symbol indexes ({0:’a’, 1:’b’, 2:’c’…}) you build at generation time does not match the one the model was trained on. To avoid that error, just save ix_to_char explicitly as pickle, JSON or plain text, and pass it to the generate_text function when loading your model.
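For instance, a minimal sketch of such save/load helpers using JSON (the helper names and the file name are mine, not part of the original code):
import json

def save_char_mapping(ix_to_char, path='char_mapping.json'):
    # persist the index -> character mapping next to the model weights
    with open(path, 'w') as f:
        json.dump(ix_to_char, f)

def load_char_mapping(path='char_mapping.json'):
    with open(path) as f:
        raw = json.load(f)
    # JSON keys are always strings, so convert them back to integers
    ix_to_char = {int(ix): char for ix, char in raw.items()}
    char_to_ix = {char: ix for ix, char in ix_to_char.items()}
    return ix_to_char, char_to_ix
At generation time, call load_char_mapping instead of rebuilding the dictionaries from set(data).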
Adding more info
When you feed little data to a generator like this one (and by little data I mean a few books, not a big web corpus), you are in fact expecting your model to overfit. This way it can learn to reproduce words from the texts, but not to produce new ones, unless you are working with an agglutinative language (and English is not one of those!).
So, if you are not constrained by training speed, you had better provide more complete information about your texts. As you can see in the original post, the model gets each n-gram of symbols with a certain step:
for i in range(0, len(data)/seq_length):
    X_sequence = data[i*seq_length:(i+1)*seq_length]
    X_sequence_ix = [char_to_ix[value] for value in X_sequence]
    input_sequence = np.zeros((seq_length, VOCAB_SIZE))
    for j in range(seq_length):
        input_sequence[j][X_sequence_ix[j]] = 1.
    X[i] = input_sequence
    y_sequence = data[i*seq_length+1:(i+1)*seq_length+1]
    y_sequence_ix = [char_to_ix[value] for value in y_sequence]
Let’s see how the sequences are formed: for each step i, which takes values from 0 to (length of the data in symbols)/(sequence length), we get sequences with no character overlap: 0-80, 80-160, 160-240 and so on.
This is sheer barbarism when working with language data!
With this kind of splitting you are using only 1/(sequence length) of the possible training sequences - 1/80 in our case. Anyone who has ever worked with n-grams knows that a language model performs better if you give it the full sequential information: 0-80, 1-81, 2-82, etc.
To avoid the excessive training slowdown you will inevitably face with a step of 1 symbol, you can declare a ‘step’ variable that stands for the stride you want:
step = 3
l = [i for i in range(0, len(data) - seq_length, step)]
for i in range(len(l)):
    X_sequence = data[l[i]:l[i]+seq_length]
    X_sequence_ix = [char_to_ix[value] for value in X_sequence]
    input_sequence = np.zeros((seq_length, VOCAB_SIZE))
    for j in range(seq_length):
        input_sequence[j][X_sequence_ix[j]] = 1.
    X[i] = input_sequence
    y_sequence = data[l[i]+1:l[i]+seq_length+1]
With this example, you are getting overlapping sequences 0-80, 3-83, 6-86, etc.
As this also changes the number of sequences you get, don’t forget to change the first dimension of X and y:
from:
X = np.zeros((len(data)/seq_length, seq_length, VOCAB_SIZE))
y = np.zeros((len(data)/seq_length, seq_length, VOCAB_SIZE))
for i in range(0, len(data)/seq_length):
    X_sequence = data[i*seq_length:(i+1)*seq_length]
to:
X = np.zeros((len(l), seq_length, VOCAB_SIZE))  # (len(data) - seq_length)//step sequences
y = np.zeros((len(l), seq_length, VOCAB_SIZE))
for i in range(len(l)):
    X_sequence = data[l[i]:l[i]+seq_length]
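To get a feel for the difference, here is a quick back-of-the-envelope comparison (the corpus size of one million characters is invented purely for illustration):
data_len, seq_length, step = 1000000, 80, 3

non_overlapping = data_len // seq_length               # original scheme: 12500 sequences
strided = len(range(0, data_len - seq_length, step))   # step of 3: 333307 sequences

print(non_overlapping, strided)
That is roughly 27 times more training sequences squeezed out of exactly the same text.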
Further tuning
The better you know your data, the better the model. As usual in deep learning, you should check the quality of the model every N iterations - for example, print the generated output every 10 epochs:
if nb_epoch % 10 == 0:
    generate_text(model, 20, VOCAB_SIZE, ix_to_char)
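If you prefer a single model.fit call to the manual epoch loop used in the full script below, the same periodic check can be expressed as a callback. This is only a sketch: the epoch count and the checkpoint file name are placeholders, and depending on your Keras version the fit argument is nb_epoch or epochs:
from keras.callbacks import LambdaCallback, ModelCheckpoint

def sample_on_epoch_end(epoch, logs):
    # print a short sample every 10 epochs to see how training is going
    if epoch % 10 == 0:
        generate_text(model, 100, VOCAB_SIZE, ix_to_char)

model.fit(X, y, batch_size=BATCH_SIZE, nb_epoch=200,
          callbacks=[LambdaCallback(on_epoch_end=sample_on_epoch_end),
                     ModelCheckpoint('model_epoch_{epoch}.hdf5')])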
You can also adjust the length of the context your model looks at - for English, typical values range from 40 to 80 symbols.
Preliminary results
With this architecture, a fairly human-like result can be achieved even on small data. Now that we have fixed the main issues, you can save the model and use it in production.
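As a rough sketch, loading everything back for generation could look like this (the file names simply mirror the ones used in the full script below and are examples, including the epoch number in the weights file):
from keras.models import model_from_json

# restore the architecture and the trained weights
model = model_from_json(open('/home/workdir/models/model.json').read())
model.load_weights('/home/workdir/models/model_checkpoint_layer_2_hidden_500_epoch_100.hdf5')

# rebuild exactly the same index <-> character mapping the model was trained with
chars = open('/home/workdir/models/charset.txt', 'r').read()
ix_to_char = {ix: char for ix, char in enumerate(chars)}

generate_text(model, 500, len(chars), ix_to_char)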
I tried to make a simple Telegram bot that generates proverbs “of different cultures” - Armenian, Indian, Sufi, Hasidic and Jewish (all in Russian) - you can find all the source code in my repository.
Have fun!
# coding: utf-8
from __future__ import print_function
import matplotlib.pyplot as plt
import numpy as np
import time
import csv
from keras.models import Sequential
from keras.layers.core import Dense, Activation, Dropout
from keras.layers.recurrent import LSTM, SimpleRNN
from keras.layers.wrappers import TimeDistributed
# method for generating text
def generate_text(model, length, vocab_size, ix_to_char):
    # starting with random character
    ix = [np.random.randint(vocab_size)]
    y_char = [ix_to_char[ix[-1]]]
    X = np.zeros((1, length, vocab_size))
    for i in range(length):
        # appending the last predicted character to sequence
        X[0, i, :][ix[-1]] = 1
        print(ix_to_char[ix[-1]], end="")
        ix = np.argmax(model.predict(X[:, :i+1, :])[0], 1)
        y_char.append(ix_to_char[ix[-1]])
    return ('').join(y_char)
# method for preparing the training data
def load_data(data_dir, seq_length):
    data = open(data_dir, 'r').read()
    chars = list(set(data))
    VOCAB_SIZE = len(chars)
    print('Data length: {} characters'.format(len(data)))
    print('Vocabulary size: {} characters'.format(VOCAB_SIZE))
    ix_to_char = {ix:char for ix, char in enumerate(chars)}
    char_to_ix = {char:ix for ix, char in enumerate(chars)}
    step = 3
    l = [i for i in range(0, len(data) - seq_length, step)]
    X = np.zeros((len(l), seq_length, VOCAB_SIZE))  # (len(data) - seq_length)//step sequences
    y = np.zeros((len(l), seq_length, VOCAB_SIZE))
    for i in range(len(l)):
        X_sequence = data[l[i]:l[i]+seq_length]
        X_sequence_ix = [char_to_ix[value] for value in X_sequence]
        input_sequence = np.zeros((seq_length, VOCAB_SIZE))
        for j in range(seq_length):
            input_sequence[j][X_sequence_ix[j]] = 1.
        X[i] = input_sequence
        y_sequence = data[l[i]+1:l[i]+seq_length+1]
        y_sequence_ix = [char_to_ix[value] for value in y_sequence]
        target_sequence = np.zeros((seq_length, VOCAB_SIZE))
        for j in range(seq_length):
            target_sequence[j][y_sequence_ix[j]] = 1.
        y[i] = target_sequence
    return X, y, VOCAB_SIZE, ix_to_char
DATA_DIR ="/home/yourdir/text.txt"
BATCH_SIZE = 128
HIDDEN_DIM = 500
SEQ_LENGTH = 80
WEIGHTS = ""
MODE = 'train'
GENERATE_LENGTH = 500
LAYER_NUM = 2
# Creating training data
X, y, VOCAB_SIZE, ix_to_char = load_data(DATA_DIR, SEQ_LENGTH)
# Creating and compiling the Network
model = Sequential()
model.add(LSTM(HIDDEN_DIM, input_shape=(None, VOCAB_SIZE), return_sequences=True))
for i in range(LAYER_NUM - 1):
    model.add(LSTM(HIDDEN_DIM, return_sequences=True))
model.add(TimeDistributed(Dense(VOCAB_SIZE)))
model.add(Activation('softmax'))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
# Generate some sample before training to know how bad it is!
generate_text(model, 150, VOCAB_SIZE, ix_to_char)
if WEIGHTS != '':
    model.load_weights(WEIGHTS)
    nb_epoch = int(WEIGHTS[WEIGHTS.rfind('_') + 1:WEIGHTS.find('.')])
else:
    nb_epoch = 0
# Training if there are no trained weights specified
if MODE == 'train' or WEIGHTS == '':
    while True:
        print('\n\nEpoch: {}\n'.format(nb_epoch))
        model.fit(X, y, batch_size=BATCH_SIZE, verbose=1, nb_epoch=1)
        nb_epoch += 1
        generate_text(model, 100, VOCAB_SIZE, ix_to_char)
        if nb_epoch % 10 == 0:
            model.save_weights('/home/workdir/models/model_checkpoint_layer_{}_hidden_{}_epoch_{}.hdf5'.format(LAYER_NUM, HIDDEN_DIM, nb_epoch))
            json_string = model.to_json()
            with open(r"/home/workdir/models/model.json", "w") as text_file:
                text_file.write(json_string)
            # Save the character set in index order so that generation can rebuild ix_to_char
            # (the file name is an example; the original left it blank)
            with open('/home/workdir/models/charset.txt', 'w', encoding='utf-8') as charset_file:
                charset_file.write(''.join([ix_to_char[ix] for ix in sorted(ix_to_char)]))
# Else, loading the trained weights and performing generation only
elif WEIGHTS != '':
    # Loading the trained weights
    model.load_weights(WEIGHTS)
    generate_text(model, GENERATE_LENGTH, VOCAB_SIZE, ix_to_char)
    print('\n\n')
else:
    print('\n\nNothing to do!')
# Generate some sample to know how bad it is!
generate_text(model, 15, VOCAB_SIZE, ix_to_char)