NLP for learners – Split and train data that exceeds the memory
In a previous post we learned how to build a model and train it using fit(). In this post, we will learn how to train on the data in pieces using a generator function.
The input values you give to an LSTM are third-order tensors (3-dimensional arrays), which take up much more space than the original text. If the text used for training is long, the data will exceed the memory capacity and cause an error.
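To get a feel for the sizes involved, here is a rough back-of-the-envelope sketch. It uses the 218 lines, the padded length of 30 and the window of 5 words that appear later in this post. The vocabulary size of 10,000 is a made-up figure, and the byte sizes assume 8-byte floats for x and 4-byte floats for the one-hot y.

```python
# Rough memory estimate for the training tensors (a sketch, not a measurement).
n_lines = 218        # number of sentences, as in the example later in this post
maxlen = 30          # padded sentence length
seq_length = 5       # words fed to the LSTM per sample
vocab_size = 10_000  # hypothetical vocabulary size (assumption)

samples_per_line = maxlen - seq_length      # 25 windows per sentence
n_samples = n_lines * samples_per_line      # windows in the whole text

# x has shape (n_samples, seq_length, 1); assume 8 bytes per value.
x_bytes = n_samples * seq_length * 1 * 8
# y is one-hot with shape (n_samples, vocab_size + 1); assume 4 bytes per value.
y_bytes = n_samples * (vocab_size + 1) * 4

print(f"samples: {n_samples}")
print(f"x: {x_bytes / 1e6:.2f} MB, y: {y_bytes / 1e6:.2f} MB")
```

Under these assumptions most of the space goes to the one-hot y, and both arrays grow in proportion to the number of lines.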
Here, we read a text that consists of multiple sentences and train on it in data sets of 10 sentences (lines) each. The word vector is one-dimensional; that is, each word is represented by a single number that corresponds to it. The model predicts the one word that follows each sequence of five words.
fit() and generator functions
When training, there are two ways to supply the input x and the answer y: pass them directly, or call a generator function.
If you pass x and y directly, you can train on only one data set.
A generator function, on the other hand, is called many times during training and returns different x and y values to fit() each time.
model.fit(x, y, ......)
model.fit(train_generator(), ......)
Previously, fit_generator() was used to call generator functions, but fit() now accepts a generator function directly. fit_generator() has been deprecated, and newer versions of TensorFlow display a warning when it is used.
texts = sequence.pad_sequences(texts, maxlen=30, padding="pre", truncating="post")
texts contains a vector of word indices for each sentence, and each sentence has a different length. However, the generator function used here expects every sentence to have the same length, so we align the lengths by inserting zeros before or after each sentence.
texts = [[ 1 2 3 4 ]
         [ 1 2 3 4 5 6 ]
         [ 1 2 ]]
---->
        [[ 0 0 0 0 0 1 2 3 4 ]
         [ 0 0 0 1 2 3 4 5 6 ]
         [ 0 0 0 0 0 0 0 1 2 ]]
maxlen=30 means that every sentence is padded or truncated to 30 words (the example above uses a shorter length for readability).
When padding, you can insert zeros either at the beginning or at the end of a sentence. If you insert zeros after a sentence, the model is trained to predict zero from the preceding five words. This is equivalent to not predicting a word and is not desirable, so we pad at the beginning with padding="pre".
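The difference between the two padding directions can be sketched in plain Python. The helpers pad() and zero_targets() below are hypothetical and only imitate the idea for a single sequence; they are not Keras functions. The sentence and the padded length of 12 are made up for the illustration.

```python
def pad(seq, maxlen, padding="pre"):
    """Plain-Python imitation of zero padding for one sequence (illustration only)."""
    seq = list(seq)[:maxlen]              # keep at most maxlen items from the front
    zeros = [0] * (maxlen - len(seq))
    return zeros + seq if padding == "pre" else seq + zeros


def zero_targets(padded, seq_length):
    """Count windows whose target (the value right after the window) is 0."""
    targets = padded[seq_length:]
    return sum(1 for t in targets if t == 0)


sentence = [1, 2, 3, 4, 5, 6, 7, 8]
pre = pad(sentence, 12, "pre")    # [0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8]
post = pad(sentence, 12, "post")  # [1, 2, 3, 4, 5, 6, 7, 8, 0, 0, 0, 0]

print(zero_targets(pre, 5), zero_targets(post, 5))   # 0 4
```

For this sentence, padding at the end produces four training samples whose answer is the padding value 0, while padding at the beginning produces none.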
We define train_generator() as a generator function.
You can choose the size of the batch that the generator produces to suit your needs.
fit() requests data from the generator function many times, so while True: loops indefinitely and keeps returning data sets. In this case, the number of iterations of the while loop is equal to the number of epochs.
for step in range(len(texts)//batch_size):
batch_size=10. That is, a batch consists of 10 sentences (lines), each padded to a fixed length of 30 values.
len(texts) is the number of lines of text, and it is divided by batch_size.
// is integer division: it divides and rounds the result down to a whole number.
For example, if you have 218 lines of text, then 218 // 10 = 21. This means that the whole text is divided into 21 steps, and the last 8 lines are cut off. Therefore, this for statement is repeated 21 times.
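The arithmetic can be checked with a few lines of Python. This is only a sketch using the 218 lines from the example; the variable names are chosen for the illustration.

```python
n_lines = 218
batch_size = 10

steps = n_lines // batch_size          # full batches per epoch
dropped = n_lines % batch_size         # lines left over at the end

for step in range(steps):
    first = step * batch_size
    last = first + batch_size - 1
    if step in (0, steps - 1):         # show only the first and last step
        print(f"step {step}: lines {first} to {last}")

print(f"{steps} steps, {dropped} lines dropped")
```

The last step covers lines 200 to 209, so lines 210 to 217 are never used for training.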
for line in range(batch_size):
This for statement is repeated 10 times, once for each line in the batch.
dataset = TimeseriesGenerator(
    texts[step*batch_size+line],
    texts[step*batch_size+line],
    length=seq_length,
    batch_size=1)
for batch in dataset:
    X, Y = batch
    x.extend(X)
    y.extend(Y)
TimeseriesGenerator() generates time series data. See the previous article for details.
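As a rough picture of what TimeseriesGenerator() yields here, the following plain-Python sketch imitates it for the case used in this post: the same list as data and targets, batch_size=1 and the default stride. The helper sliding_windows() is hypothetical and is not part of Keras.

```python
def sliding_windows(data, length):
    """Return (X, Y) pairs: each X is `length` consecutive values, Y is the next one."""
    pairs = []
    for i in range(length, len(data)):
        pairs.append((data[i - length:i], data[i]))
    return pairs


sentence = list(range(1, 31))          # stand-in for one padded sentence of 30 values
pairs = sliding_windows(sentence, 5)

print(len(pairs))                      # 25
print(pairs[0])                        # ([1, 2, 3, 4, 5], 6)
print(pairs[-1])                       # ([25, 26, 27, 28, 29], 30)
```

A sentence of 30 values with a window of 5 gives 30 - 5 = 25 samples, which is where the 25 in the reshape step comes from.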
x = np.reshape(x,(25*batch_size,seq_length,1))
This converts the data into a third-order tensor. Each sentence consists of 30 values, from which 25 samples of 5 values each are generated (30 - 5 = 25). For a batch of 10 lines there are 10*25*5=1250 values, so we convert them into an array of shape (250, 5, 1).
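[ 1 2 3 4 5 .... 30 ]

X = [ 1 2 3 4 5 ]          Y = [ 6 ]
    [ 2 3 4 5 6 ]              [ 7 ]
    [ 3 4 5 6 7 ]              [ 8 ]
    ......                     ......
    [ 25 26 27 28 29 ]         [ 30 ]

reshape ---> x = [[[ 1 ] [ 2 ] [ 3 ] [ 4 ] [ 5 ]]
                  [[ 2 ] [ 3 ] [ 4 ] [ 5 ] [ 6 ]]
                  ......
                  [[ 25 ] [ 26 ] [ 27 ] [ 28 ] [ 29 ]]]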
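The shape after the reshape can be verified with dummy data. In this sketch every "sentence" is simply the numbers 1 to 30, and the windows are cut with plain slicing instead of TimeseriesGenerator(); both are assumptions made to keep the example self-contained.

```python
import numpy as np

batch_size = 10
seq_length = 5
maxlen = 30
windows_per_line = maxlen - seq_length           # 25

x = []
for line in range(batch_size):
    sentence = np.arange(1, maxlen + 1)          # dummy sentence: 1..30
    for i in range(windows_per_line):
        x.append(sentence[i:i + seq_length])     # one window of 5 values

x = np.reshape(x, (windows_per_line * batch_size, seq_length, 1))
print(x.shape)        # (250, 5, 1)
print(x[0].ravel())   # [1 2 3 4 5]
```

The three axes are samples, time steps and features, which matches input_shape=(seq_length, 1) in the model.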
x = x / float(len(char_indices)+1)
This normalizes each value to the range 0 to 1 by dividing it by the number of words plus one.
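A tiny example makes the effect visible. The vocabulary of four words and the sample window are made up for the illustration.

```python
vocab_size = 4                       # hypothetical: four distinct words
divisor = float(vocab_size + 1)      # index 0 is reserved for padding

window = [0, 0, 1, 3, 4]             # a padded window of word indices
normalized = [v / divisor for v in window]

print(normalized)                    # [0.0, 0.0, 0.2, 0.6, 0.8]
```

Because the divisor is one larger than the highest word index, every value stays below 1, and padding stays at exactly 0.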
y = np_utils.to_categorical(y, len(char_indices)+1)
This converts y into one-hot format.
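To show what one-hot format looks like without needing Keras, the sketch below builds the same kind of array with NumPy's identity matrix. This is a stand-in for to_categorical(), not its implementation, and the class count of 5 is a made-up value in place of len(char_indices)+1.

```python
import numpy as np

num_classes = 5                          # hypothetical: len(char_indices) + 1
y = np.array([3, 0, 4])                  # target word indices

one_hot = np.eye(num_classes, dtype="float32")[y]   # row i has a 1 at column y[i]
print(one_hot)

decoded = np.argmax(one_hot, axis=1)     # back to indices
print(decoded)                           # [3 0 4]
```

Each row has as many columns as there are classes, which is why y grows quickly as the vocabulary gets larger.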
yield x, y
yield returns x and y to fit(). Typically, a function returns a value with return; in that case, calling the function again starts over from the beginning of the iteration. With yield, however, execution resumes in the middle of the iteration, right where it left off.
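The difference is easy to see in a minimal sketch. The two functions below are hypothetical and exist only for this comparison.

```python
def with_return():
    for n in [1, 2, 3]:
        return n          # exits the function; the loop never gets past 1


def with_yield():
    for n in [1, 2, 3]:
        yield n           # pauses here and resumes on the next request


returned = [with_return(), with_return(), with_return()]

gen = with_yield()
yielded = [next(gen), next(gen), next(gen)]

print(returned)   # [1, 1, 1]
print(yielded)    # [1, 2, 3]
```

Note that with_yield() is called only once to create the generator; each next() then continues from the previous yield.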
On each iteration, the generator function returns a pair of x and y to fit(), and fit() trains on it. Once every step has been processed in this way, fit() has trained on the entire data set and one epoch is complete.
model.fit(
    train_generator(),
    steps_per_epoch=len(texts) // batch_size,
    epochs=100,
    verbose=1)
train_generator() is passed to fit() for training.
steps_per_epoch indicates how many batches fit() should take from train_generator() in each epoch. In this case, the generator yields 21 batches per epoch.
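The relationship between the endless loop and steps_per_epoch can be simulated without a model. In this sketch consume() is a hypothetical stand-in for the way fit() pulls batches; the real fit() does more, so treat this only as an illustration of the counting.

```python
def toy_generator(n_steps):
    """Endless generator: each pass of the while loop covers one epoch."""
    while True:
        for step in range(n_steps):
            yield step


def consume(generator, steps_per_epoch, epochs):
    """Hypothetical stand-in for the way fit() pulls batches from a generator."""
    history = []
    for epoch in range(epochs):
        seen = [next(generator) for _ in range(steps_per_epoch)]
        history.append(seen)
    return history


history = consume(toy_generator(3), steps_per_epoch=3, epochs=2)
print(history)    # [[0, 1, 2], [0, 1, 2]]

shifted = consume(toy_generator(3), steps_per_epoch=2, epochs=2)
print(shifted)    # [[0, 1], [2, 0]]
```

When steps_per_epoch matches the number of steps in the for loop, every epoch sees the whole data set once. In the second call it does not match, and the epochs drift out of step with the data.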
In general, the larger the batch size, the faster the processing speed. However, a larger batch size means a larger memory requirement.
Here is the whole code.
import numpy as np
import sys
import io
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
from tensorflow import keras
from keras.models import Sequential
#from keras.models import Model
from keras.layers import Embedding, Dense, LSTM
from keras.optimizers import Adam
from keras.utils import np_utils
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import TimeseriesGenerator

#read the text
with io.open('articles.txt', encoding='utf-8') as f:
    text = f.read()
texts = text.replace('eos', 'eos\n').splitlines()

#make the dictionary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
char_indices = tokenizer.word_index

#make the inverted dictionary
indices_char = dict([(value, key) for (key, value) in char_indices.items()])
np.save('voa_char_indices', char_indices)
np.save('voa_indices_char', indices_char)

#vectorization
texts = tokenizer.texts_to_sequences(texts)
texts = sequence.pad_sequences(texts, maxlen=30, padding="pre", truncating="post")

#make dataset
batch_size = 10
seq_length = 5

def train_generator():
    while True:
        for step in range(len(texts)//batch_size):
            #start each batch with empty lists
            x = []
            y = []
            for line in range(batch_size):
                dataset = TimeseriesGenerator(
                    texts[step*batch_size+line],
                    texts[step*batch_size+line],
                    length=seq_length,
                    batch_size=1)
                for batch in dataset:
                    X, Y = batch
                    x.extend(X)
                    y.extend(Y)
            #25 samples per line (30 - seq_length)
            x = np.reshape(x,(25*batch_size,seq_length,1))
            x = x / float(len(char_indices)+1)
            y = np_utils.to_categorical(y, len(char_indices)+1)
            yield x, y

#build the model
print('build the model....')
model = Sequential()
model.add(LSTM(128,input_shape=(seq_length, 1)))
model.add(Dense(len(char_indices)+1, activation='softmax'))
optimizer = Adam(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])

#training
model.fit(
    train_generator(),
    steps_per_epoch=len(texts) // batch_size,
    epochs=100,
    verbose=1)

#save the model
model.save('model_voa.h5')
Next, we make predictions using the trained model. The following sentences were generated after training on a text file of approximately 30 KB.
the european union eu started sending millions of dollars services aid australia’s for teams raised . the money came with promises to improve migrant of centers state lawmaker will coast . the centers are paid for millennials people did horrible divide a single countries .
We generated sentences by making predictions in sequence. However, the model failed to generate meaningful sentences.
the european union eu started . the money came with promises . the centers are paid for the modern .
This is the result of increasing the text to approximately 500 KB. The sentences are now shorter. This is due to the eos symbol that is included at the end of each sentence in the data. As the number of eos symbols in the data has increased, the model now predicts eos with a high probability. A word that appears with such a high frequency is called a stop word.
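How often eos appears compared with other words can be checked by counting tokens. The sketch below uses a made-up toy text, not the article data, so the numbers only illustrate the idea.

```python
from collections import Counter

# Hypothetical toy corpus; every sentence ends with the eos marker.
text = "the money came eos centers are paid eos the union started eos"
counts = Counter(text.split())

top_word, top_count = counts.most_common(1)[0]
eos_share = counts["eos"] / sum(counts.values())

print(top_word, top_count)      # eos 3
print(eos_share)                # 0.25
```

Since every sentence contributes one eos, its count grows with the number of sentences, while most ordinary words appear far less often.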
model.add(LSTM(512,input_shape=(seq_length, 1)))
---->
the european union eu started sending millions this dollars . the money came with promises . the centers are paid for own .
This is the result of increasing the number of units in the LSTM layer to 512. However, there is not much change.
As a result of increasing the size of the data used for training, a new problem seems to have arisen.
Here is the whole code for prediction.
import numpy as np
import sys
import io
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
from keras.models import load_model
from keras.utils import np_utils
from tensorflow.keras.preprocessing.text import Tokenizer

#read the text
with io.open('articles.txt', encoding='utf-8') as f:
#with io.open('voacorpus/asitis01.txt', encoding='utf-8') as f:
    text = f.read()
texts = text.replace('eos', 'eos\n').splitlines()

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
char_indices = np.load('voa_char_indices.npy', allow_pickle=True).tolist()
indices_char = np.load('voa_indices_char.npy', allow_pickle=True).tolist()
#index 0 is the padding value and has no word
indices_char[0] = '<null>'

pre_vec = tokenizer.texts_to_sequences(texts)
#keep only sentences with at least 10 words
texts_vec = []
for line in range(len(pre_vec)):
    if len(pre_vec[line]) > 9:
        texts_vec.append(pre_vec[line])
line = 0

#load the model
print('load the model....')
model = load_model('model_voa.h5')

#prediction
x = np.zeros(5)
for line in range(10):
    chars = ''
    for i in range(30):
        if i == 0:
            #seed the window with the first five words of the sentence
            for j in range(5):
                x[j] = texts_vec[line][j]
                chars += indices_char[int(x[j])] + ' '
        else:
            #shift the window left and append the last predicted word
            for j in range(4):
                x[j] = x[j+1]
            x[4] = index
        x_pred = np.reshape(x,[1, 5, 1])
        x_pred = x_pred / float(len(char_indices)+1)
        prediction = model.predict(x_pred)
        index = np.argmax(prediction)
        result = indices_char[index] + ' '
        chars += result
        if indices_char[index] == 'eos':
            break
    chars = chars.replace('eos', '.')
    print(chars)
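The core of the prediction loop, shifting the window by one word and feeding the prediction back in, can be sketched without a trained model. Here fake_predict() is a hypothetical stand-in for model.predict() followed by argmax, and the eos index of 0 is also an assumption made only for this toy example.

```python
from collections import deque


def fake_predict(window):
    """Hypothetical stand-in for model.predict + argmax: returns the next index."""
    return (window[-1] + 1) % 10      # toy rule, not a trained model


EOS = 0                                # hypothetical index of the eos token
window = deque([3, 4, 5, 6, 7], maxlen=5)   # seed: the first five word indices
generated = list(window)

for _ in range(30):                    # at most 30 predictions per sentence
    index = fake_predict(list(window))
    generated.append(index)
    if index == EOS:
        break
    window.append(index)               # oldest index drops out, newest is added

print(generated)   # [3, 4, 5, 6, 7, 8, 9, 0]
```

A deque with maxlen=5 does the same job as the manual shift in the code above: appending a new index pushes the oldest one out.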