NLP for learners – Splitting and training data that exceeds memory

In a previous post we learned how to build a model and train it with fit(). In this post, we will learn how to split up the data and train on it using a generator function.

The input values you give to the LSTM are third-order tensors (3-dimensional arrays), which are much larger than the original text. If the text used for training is long, the data will exceed the available memory and cause an error.

Here, we read a text that consists of multiple sentences and train on it in batches of 10 lines. Each word vector is one-dimensional; that is, each word is represented by the single number assigned to it. The model predicts the one word that follows each sequence of five words.

fit() and generator functions

When training, there are two ways to supply the data: pass the input x and the answer y to fit() directly, or have fit() call a generator function.

If you give fit() x and y directly, you can train on only a single data set, and all of it has to fit in memory at once.

A generator function, on the other hand, is called many times during training and returns different x, y values to fit() each time.

model.fit(x, y, ......)
model.fit(train_generator(), ......)

Previously, fit_generator() was used to train from a generator, but fit() now accepts a generator directly, so fit_generator() has been deprecated. Newer versions of TensorFlow display a warning if you use it.

Padding

texts = sequence.pad_sequences(texts, maxlen=30, padding="pre", truncating="post")

The list texts contains a vector of word indices for each sentence. Each sentence has a different length, but to build batches in the generator every list must have the same length. So we align the lengths by inserting zeros before or after each sentence.

texts =
[[ 1 2 3 4 ]
 [ 1 2 3 4 5 6 ]
 [ 1 2 ]]

---->

[[ 0 0 0 0 0 1 2 3 4 ]
 [ 0 0 0 1 2 3 4 5 6 ]
 [ 0 0 0 0 0 0 0 1 2 ]]

Setting maxlen=30 fixes the length of every sentence at 30 words (the example above is padded to a length of 9 only for illustration).

When padding, you can insert the zeros either at the beginning of a sentence or at the end. If you insert zeros after the sentence, the model ends up predicting a zero from the preceding five words, which is the same as predicting no word at all and is not desirable. That is why padding="pre" is used here.
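
Here is what the difference looks like on a toy list of word indices (a small sketch, separate from the actual data):

from keras.preprocessing import sequence

toy = [[1, 2, 3, 4], [1, 2, 3, 4, 5, 6], [1, 2]]
#padding="pre" puts the zeros in front, keeping the real words at the end
print(sequence.pad_sequences(toy, maxlen=9, padding="pre"))
#padding="post" puts the zeros after the words instead
print(sequence.pad_sequences(toy, maxlen=9, padding="post"))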

Generator function

def train_generator():

Defines train_generator() as a generator function.

The size of the batch the generator returns can be chosen as needed, for example to keep it within the available memory.

    while True:

fit() requests batches from the generator over and over, and while True: keeps the generator looping indefinitely so it never runs out of data. In this case, one pass through the inner for loop corresponds to one epoch, so the outer loop repeats once per epoch.

        for step in range(len(texts)//batch_size):

Set batch_size=10. That is, a batch consists of 10 lines of text, each padded to a fixed length of 30. len(texts) is the number of lines of text, and it is divided by batch_size.

// is integer division: it divides and rounds down to the nearest whole number.

For example, if you have 218 lines of text, then 218 // 10 = 21, so the whole text is divided into 21 steps and the last 8 lines are cut off. The for statement is therefore repeated 21 times.
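
In Python, the floor division and the remainder look like this:

print(218 // 10)   #21 full steps
print(218 % 10)    #8 lines that are never used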

            x = []
            y = []
            for line in range(batch_size):

At the start of each step, x and y are initialized as empty lists. Since batch_size=10, the for statement is repeated 10 times.

                dataset = TimeseriesGenerator(
                    texts[step*batch_size+line],
                    texts[step*batch_size+line],
                    length=seq_length,
                    batch_size=1)
                for batch in dataset:
                    X, Y = batch
                    x.extend(X[0])
                    y.extend(Y)

TimeseriesGenerator() slices each sentence into time-series samples: sequences of five consecutive words, each paired with the word that follows. See the previous article for details.
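
As a reminder, here is roughly what it yields for a toy sequence of eight values with length=5 (a small sketch, not the actual training data):

import numpy as np
from keras.preprocessing.sequence import TimeseriesGenerator

data = np.array([1, 2, 3, 4, 5, 6, 7, 8])
#each sample is five consecutive values; the target is the value that follows
gen = TimeseriesGenerator(data, data, length=5, batch_size=1)
for X, Y in gen:
    print(X[0], Y)
#[1 2 3 4 5] [6]
#[2 3 4 5 6] [7]
#[3 4 5 6 7] [8]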

            x = np.reshape(x,(25*batch_size,seq_length,1))

Converts the data into a third-order tensor. Each sentence consists of 30 words, so TimeseriesGenerator produces 25 samples per sentence. With 10 lines there are 10*25*5 = 1250 values in x, which we reshape into a tensor of shape (250, 5, 1).

[ 1 2 3 4 5 .... 30]

X = [  1  2  3  4  5 ]  Y = [6]
    [  2  3  4  5  6 ]      [7]
    [  3  4  5  6  7 ]      [8]

    ......             ......

    [ 25 26 27 28 29 ]      [30]

reshape --->

x = [[[1]
      [2]
      [3]
      [4]
      [5]]
     [[2]
      [3]
      [4]
      [5]
      [6]]
     [[3]
      [4]
      [5]
      [6]
      [7]]

     ......

     [[25]
      [26]
      [27]
      [28]
      [29]]]

            x = x / float(len(char_indices)+1)

Normalize each value to between 0 and 1 by dividing it by the vocabulary size (the number of distinct words plus one for the padding index).

            y = np_utils.to_categorical(y, len(char_indices)+1)

Convert y into one-hot format.

            yield x, y

Return x and y to fit(). An ordinary function returns a value with return, and calling it again starts over from the beginning. A function that uses yield instead becomes a generator: the next call resumes from where the previous one stopped.
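
A toy generator, unrelated to the model, makes the difference easy to see:

def counter():
    n = 0
    while True:
        n += 1
        yield n   #execution pauses here and resumes on the next call

g = counter()
print(next(g))   #1
print(next(g))   #2: the function continued from where it stopped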

On each iteration the generator returns a batch of shape (250, 5, 1) to fit(), which trains on it. Once all the steps have been processed, the whole data set has been seen and an epoch is complete.
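
You can sanity-check these shapes by drawing a single batch from the generator yourself (assuming train_generator() and char_indices from the full script below are already defined):

x, y = next(train_generator())
print(x.shape)   #(250, 5, 1)
print(y.shape)   #(250, len(char_indices)+1)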

Training

model.fit(
    train_generator(),
    steps_per_epoch=len(texts) // batch_size,
    epochs=100,
    verbose=1)

fit() trains by drawing batches from train_generator(). steps_per_epoch indicates how many batches should be drawn per epoch; in this case, 21.

In general, the larger the batch size, the faster the processing speed. However, a larger batch size means a larger memory requirement.

Here is the whole code.

import numpy as np
import sys
import io
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
from tensorflow import keras
from keras.models import Sequential
#from keras.models import Model
from keras.layers import Embedding, Dense, LSTM
from keras.optimizers import Adam
from keras.utils import np_utils
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import TimeseriesGenerator
#read the text
with io.open('articles.txt', encoding='utf-8') as f:
    text = f.read()
texts = text.replace('eos', 'eos\n').splitlines()
#make the dictionary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
char_indices = tokenizer.word_index
#make the inverted dictionary
indices_char = dict([(value, key) for (key, value) in char_indices.items()])
np.save('voa_char_indices', char_indices)
np.save('voa_indices_char', indices_char)
#vectorization
texts = tokenizer.texts_to_sequences(texts)
texts = sequence.pad_sequences(texts, maxlen=30, padding="pre", truncating="post")
#make dataset
batch_size = 10
seq_length = 5
def train_generator():
    while True:
        for step in range(len(texts)//batch_size):
            x = []
            y = []
            for line in range(batch_size):
                dataset = TimeseriesGenerator(
                    texts[step*batch_size+line],
                    texts[step*batch_size+line],
                    length=seq_length,
                    batch_size=1)
                for batch in dataset:
                    X, Y = batch
                    x.extend(X[0])
                    y.extend(Y)
            x = np.reshape(x,(25*batch_size,seq_length,1))
            x = x / float(len(char_indices)+1)
            y = np_utils.to_categorical(y, len(char_indices)+1)
            yield x, y
#build the model
print('build the model....')
model = Sequential()
model.add(LSTM(128,input_shape=(seq_length, 1)))
model.add(Dense(len(char_indices)+1, activation='softmax'))
optimizer = Adam(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
#training
model.fit(
    train_generator(),
    steps_per_epoch=len(texts) // batch_size,
    epochs=100,
    verbose=1)
#save the model
model.save('model_voa.h5')

Prediction

Make predictions using the trained model. The sentences below were generated by a model trained on a text file of approximately 30 KB.

the european union eu started sending millions of dollars services aid australia’s for teams raised .
the money came with promises to improve migrant of centers state lawmaker will coast .
the centers are paid for millennials people did horrible divide a single countries .

We generated sentences by predicting one word at a time in sequence. However, the model failed to generate meaningful sentences.

the european union eu started .
the money came with promises .
the centers are paid for the modern .

This is the result of increasing the training text to approximately 500 KB. The sentences are now shorter. This is because every sentence in the data ends with the eos symbol: as the number of eos tokens has increased, the model now predicts eos with a high probability. A word that appears with such a high frequency is called a stop word.
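
One way to confirm how dominant eos has become is to look at the frequency counts the Tokenizer keeps (a quick check, assuming tokenizer has already been fit on texts):

counts = tokenizer.word_counts
total = sum(counts.values())
print(counts['eos'], counts['eos'] / total)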

model.add(LSTM(512,input_shape=(seq_length, 1)))
 ---->

the european union eu started sending millions this dollars .
the money came with promises .
the centers are paid for own .

This is the result of increasing the LSTM layer to 512 units. However, there is not much change.

As a result of increasing the size of the data used for training, a new problem seems to have arisen.

Here is the whole code.

import numpy as np
import sys
import io
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
from keras.models import load_model
from keras.utils import np_utils
from tensorflow.keras.preprocessing.text import Tokenizer
#read the text
with io.open('articles.txt', encoding='utf-8') as f:
#with io.open('voacorpus/asitis01.txt', encoding='utf-8') as f:
    text = f.read()
texts = text.replace('eos', 'eos\n').splitlines()
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
char_indices = np.load('voa_char_indices.npy', allow_pickle=True).tolist()
indices_char = np.load('voa_indices_char.npy', allow_pickle=True).tolist()
indices_char[0] = '<null>'
pre_vec = tokenizer.texts_to_sequences(texts)
#keep only sentences longer than nine words to use as prediction seeds
texts_vec = []
for line in range(len(pre_vec)):
    if len(pre_vec[line]) > 9:
        texts_vec.append(pre_vec[line])
line = 0
#load the model
print('load the model....')
model = load_model('model_voa.h5')
#prediction
x = np.zeros(5)
for line in range(10):
    chars = ''
    for i in range(30):
        if i == 0:
            #seed x with the first five words of the sentence
            for j in range(5):
                x[j] = texts_vec[line][j]
                chars += indices_char[x[j]] + ' '
        else:
            #shift the window and append the word predicted in the previous step
            for j in range(4):
                x[j] = x[j+1]
            x[4] = index
        x_pred = np.reshape(x,[1, 5, 1])
        x_pred = x_pred / float(len(char_indices)+1)
        prediction = model.predict(x_pred)
        index = np.argmax(prediction)
        result = indices_char[index] + ' '
        chars += result
        if indices_char[index] == 'eos':
            break
    chars = chars.replace('eos', '.')
    print(chars)