NLP for learners – Separate training data and validation data

In the previous post, we learned how to use generator functions. Here, we will learn how to separate data into training data and validation data.

Training and validation data

Ultimately, the goal of machine learning is for a trained model to make good predictions on unknown data. Therefore, part of the data is used for training and the rest is used for validation.

The validation data are unknown to the model; val_loss and val_accuracy indicate how accurately the trained model can predict them.

Setting up validation data

For further information on text preprocessing and generator functions, see the previous post.

train_val_rate = 0.8

Here, we use a simple method to divide the data into training data and validation data. The first 80 percent of the loaded text is used for the training data and the remaining 20 percent is used for the validation data.

train_start = 0
train_end = round(len(texts) * train_val_rate)

The text consists of 6672 sentences, about 640 KB. round() rounds the value to an integer, so round(len(texts) * train_val_rate) = 5337 and train_end = 5337.

val_start = train_end + 1
val_end = len(texts)

On the other hand, the validation data are lines val_start = 5338 to val_end = 6672 of the text.
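
For intuition, an 80/20 split like this could also be sketched with plain list slicing; train_texts and val_texts below are hypothetical names used only in this sketch, since the generator in this post works with start and end indices instead:

split = round(len(texts) * train_val_rate)
train_texts = texts[:split]   # first 80 percent of the sentences
val_texts = texts[split:]     # remaining 20 percent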

model.fit(
    train_generator(train_start, train_end),
    steps_per_epoch=(train_end - train_start) // batch_size,
    validation_data=train_generator(val_start, val_end),
    validation_steps=(val_end - val_start) // batch_size,
    epochs=100,
    verbose=2)

train_generator(train_start, train_end) supplies lines 0 to 5337 as training data.

Likewise, validation_data=train_generator(val_start, val_end) supplies lines 5338 to 6672 as validation data.

Since batch_size=100, (train_end - train_start) // batch_size = (5337-0)//100 = 53. This means that fit() draws 53 batches from the training generator in each epoch.

On the other hand, validation_steps=(val_end - val_start) // batch_size = (6672-5338)//100 = 13, so 13 batches are drawn from the validation generator in each epoch.
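
As a quick sanity check, the two step counts can be reproduced directly from the numbers above (plain Python, not part of the training script):

batch_size = 100
train_start, train_end = 0, 5337
val_start, val_end = 5338, 6672
print((train_end - train_start) // batch_size)  # 53 training batches per epoch
print((val_end - val_start) // batch_size)      # 13 validation batches per epoch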

Now let's train the model.

Epoch 100/100
53/53 - 5s - loss: 3.2926 - accuracy: 0.4244 - val_loss: 7.3059 - val_accuracy: 0.2865

Two new values, val_loss and val_accuracy, are displayed: val_loss is the loss on the validation data and val_accuracy is the accuracy on the validation data.

On the other hand, loss represents the loss on the training data and accuracy represents the accuracy on the training data, that is, the accuracy of predictions made on the same data the model was trained on.

Unfortunately, the accuracy on the validation data is only about 28 percent, while the training accuracy is about 42 percent: the model performs poorly on unknown data, which is a typical sign of overfitting.
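
To read these two metrics programmatically, fit() returns a History object whose history attribute stores the per-epoch metrics. A minimal sketch, assuming the fit() call above is changed to history = model.fit(...) so that its return value is kept (the full script below does not do this):

acc = history.history['accuracy'][-1]          # training accuracy of the final epoch
val_acc = history.history['val_accuracy'][-1]  # validation accuracy of the final epoch
print(f'accuracy: {acc:.4f}  val_accuracy: {val_acc:.4f}')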

Here is the overall code.

import numpy as np
import sys
import io
import os
import stanza
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense, LSTM
from keras.optimizers import Adam
from keras.utils import np_utils
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import TimeseriesGenerator
#read the text
with io.open('articles_u.txt', encoding='utf-8') as f:
    text = f.read()
texts = text.replace('eos', 'eos\n').splitlines()
#make the dictionary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
char_indices = tokenizer.word_index
#make the inverted dictionary
indices_char = dict([(value, key) for (key, value) in char_indices.items()])
np.save('voa_char_indices', char_indices)
np.save('voa_indices_char', indices_char)
#vectorization
texts = tokenizer.texts_to_sequences(texts)
texts = sequence.pad_sequences(texts, maxlen=30, padding="pre", truncating="post")
#make dataset
batch_size = 100
seq_length = 5
def train_generator(start, end):
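    # yields (x, y) batches built from lines in the range [start, end) of texts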
    while True:
        for step in range((end - start) // batch_size):
            x = []
            y = []
            for line in range(batch_size):
                dataset = TimeseriesGenerator(
                    texts[start+step*batch_size+line],
                    texts[start+step*batch_size+line],
                    length=seq_length,
                    batch_size=1)
                for batch in dataset:
                    X, Y = batch
                    x.extend(X[0])
                    y.extend(Y)
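            # each padded line (maxlen=30) yields 30 - seq_length = 25 samples, hence 25*batch_size rows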
            x = np.reshape(x,(25*batch_size,seq_length,1))
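            # scale the integer word ids into the 0-1 range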
            x = x / float(len(char_indices)+1)
            y = np_utils.to_categorical(y, len(char_indices)+1)
            yield x, y
#build the model
print('build the model....')
model = Sequential()
model.add(LSTM(128, input_shape=(seq_length, 1)))
model.add(Dense(len(char_indices)+1, activation='softmax'))
optimizer = Adam(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
#training
train_val_rate = 0.8
train_start = 0
train_end = round(len(texts) * train_val_rate)
val_start = train_end + 1
val_end = len(texts)
model.fit(
    train_generator(train_start, train_end),
    steps_per_epoch=(train_end - train_start) // batch_size,
    validation_data=train_generator(val_start, val_end),
    validation_steps=(val_end - val_start) // batch_size,
    epochs=100,
    verbose=2)
    #callbacks=[early_stopping, reduce_lr])
#save the model
model.save('u_model.h5')