NLP for learners – Embed trained word vectors in the Embedding layer

In the previous article, we built a model that uses an LSTM to predict the next word from the previous five words.

However, as the text used for training becomes larger, the accuracy drops dramatically. This is because each word is represented by a single value (its word id). With so few clues per word, the model cannot make correct predictions.

The same is true of how the human brain processes language. The brain perceives each word in a sentence as carrying more information than a mere symbol. However, much about what this “more information” is remains a mystery.

Here we will train with word vectors, which give the model more information about each word.

Word2Vec word vector

A word vector is based on the hypothesis that the meaning of each word can be represented by a vector of a few hundred dimensions.

Here we extract and use information from pre-trained word vectors. We will use GoogleNews-vectors-negative300.bin as the word vectors. You can get the file from this page. It is a large file, so downloading it will take some time.

Also, to use the word vectors, you need to install gensim beforehand (for example, with pip install gensim).

Reading a text file

with io.open('articles.txt', encoding='utf-8') as f:
    text = f.read().replace('eos', '.\n').splitlines()
text =
 ['i expect all of you to be here five minutes before the test begins without fail .',
  'the poor old woman had her bag stolen again .',
  'a rush-hour traffic jam delayed my arrival by two hours .',
  ...... ]

This loads the text file, which is about 1 MB. In the text, the end of each sentence is marked with the symbol “eos”. The code replaces each “eos” with a period and a line break, and then splits the text into a list of sentences.
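
As a minimal illustration, here is how the replacement works on a single line as it would appear in the raw file (the sample string is reconstructed from the description above):

sample = 'the poor old woman had her bag stolen again eos'
print(sample.replace('eos', '.\n'))
# the poor old woman had her bag stolen again .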

texts = []
for line in range(len(text)):
    chars = text[line].split()
    if len(chars) >= 7 and len(chars) <= 40:
        texts.append(text[line])
texts =
 ['i expect all of you to be here five minutes before the test begins without fail .',
  'the poor old woman had her bag stolen again .',
  ...... ]

.split() splits a sentence on the spaces between words. The resulting word list is stored in chars, and only sentences of 7 to 40 words are appended to texts, because sentences that are too short or too long are not useful for training.
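
For example, splitting the second sentence above gives:

print('the poor old woman had her bag stolen again .'.split())
# ['the', 'poor', 'old', 'woman', 'had', 'her', 'bag', 'stolen', 'again', '.']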

Creating a dictionary

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
char_indices = tokenizer.word_index
indices_char = dict([(value, key) for (key, value) in char_indices.items()])
char_indices =
 {'the': 1, 'to': 2, 'of': 3, 'and': 4, 'a': 5, 'in': 6, 'that': 7, 'is': 8,
  'i': 9, 'it': 10, 'for': 11, 'was': 12, ......}
indices_char =
 {1: 'the', 2: 'to', 3: 'of', 4: 'and', 5: 'a', 6: 'in', 7: 'that', 8: 'is',
  9: 'i', 10: 'it', 11: 'for', 12: 'was', ......}

This creates a dictionary and a reverse-lookup dictionary using Tokenizer. The vocabulary contains 10,499 words. char_indices is used to find the id of a word, and indices_char is used to find the word from its id.

char_indices['the']  ---> 1
char_indices['to']  ---> 2

indices_char[1] ---> 'the'
indices_char[2] ---> 'to'

Extracting word vectors

word2vec = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

This extracts the word vectors and stores them in the word2vec object. Since the file is very large, loading it may fail if your machine has little memory.
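
If loading does fail for lack of memory, one possible workaround (only a sketch, assuming a reduced vocabulary is acceptable for your data) is the optional limit argument of load_word2vec_format, which reads only the most frequent vectors:

# Sketch: load only the first 500,000 vectors to save memory (the value is illustrative).
word2vec = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True, limit=500000)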

The extracted word vector is a 300-dimensional vector. That is, each word is represented by a collection of 300 values.

Let’s try to get the word vector of “take”.

print(word2vec['take'])

[-0.05102539  0.00415039  0.02490234 -0.03515625  0.05444336 -0.08496094
  0.14160156 -0.02966309  0.02697754  0.04003906  0.07666016 -0.07421875
  0.14160156  0.27539062 -0.16503906  0.10644531  0.18847656  0.0267334
 -0.06396484 -0.04370117 -0.34179688 -0.02148438  0.3046875   0.17382812
 -0.07910156 -0.01757812  0.01599121 -0.078125   -0.08007812  0.14160156
 -0.04711914 -0.0098877  -0.36523438  0.11328125  0.0222168  -0.24609375
  0.07373047 -0.06982422  0.07275391  0.15332031  0.15429688  0.02600098
  0.23046875 -0.0625     -0.01928711 -0.09716797 -0.10107422 -0.07421875
  0.1328125  -0.006073   -0.09619141  0.22851562  0.0559082  -0.00263977
 -0.03198242 -0.20898438 -0.11035156  0.00823975 -0.07373047 -0.05493164
 -0.10058594 -0.10058594 -0.03491211 -0.15234375  0.09228516 -0.30273438
 -0.09667969  0.18359375 -0.09570312  0.14746094 -0.03344727  0.07226562
  0.22070312  0.06982422 -0.12109375 -0.14550781 -0.14355469  0.07763672
  0.03686523  0.09765625  0.02966309 -0.02856445  0.08984375 -0.13964844
  0.15820312  0.09960938  0.08984375  0.02600098  0.05712891  0.08837891
  0.03076172  0.08789062  0.01019287 -0.20410156 -0.06835938 -0.26171875
  0.03735352  0.18457031 -0.01367188  0.04418945 -0.03808594 -0.07861328
  0.09179688 -0.14355469 -0.10498047 -0.21972656  0.08837891 -0.22558594
  0.07373047  0.06884766 -0.01422119  0.00250244 -0.04760742 -0.28710938
  0.11914062  0.02941895  0.00332642 -0.10009766  0.03662109  0.10253906
 -0.01428223 -0.03173828 -0.14648438  0.18066406 -0.03222656  0.03613281
 -0.03027344 -0.15820312  0.05444336  0.08740234  0.12011719  0.08007812
 -0.01098633  0.02258301  0.03295898 -0.0859375   0.171875    0.09130859
 -0.20019531  0.0559082  -0.01226807 -0.02636719 -0.16699219 -0.0612793
 -0.09423828  0.05053711 -0.15234375 -0.13183594 -0.04199219 -0.15722656
 -0.06689453  0.0456543  -0.234375   -0.03564453 -0.03613281 -0.02636719
  0.02856445 -0.29492188 -0.05224609  0.1875      0.13964844  0.10742188
 -0.12451172 -0.23339844  0.0859375  -0.14160156  0.16210938 -0.01953125
  0.06982422 -0.12890625 -0.20703125 -0.07373047  0.08447266  0.01879883
  0.07714844 -0.07275391  0.07226562  0.11816406 -0.18652344  0.12207031
 -0.19628906 -0.15332031 -0.04125977  0.08691406 -0.09863281 -0.15917969
 -0.03613281  0.05419922  0.078125   -0.03491211  0.14746094  0.10498047
  0.24707031  0.13085938 -0.03015137 -0.13964844  0.1640625   0.0135498
  0.03466797 -0.17675781 -0.00239563  0.04736328 -0.07910156  0.16210938
 -0.13867188  0.04418945  0.0255127   0.05371094  0.01239014 -0.23828125
 -0.00512695  0.2578125   0.15429688  0.16503906 -0.10644531 -0.171875
  0.13378906 -0.10302734  0.05712891 -0.18261719  0.02197266 -0.07519531
 -0.11669922 -0.12890625  0.04663086 -0.078125   -0.04638672 -0.16796875
  0.19921875  0.04223633  0.08789062 -0.140625    0.05981445  0.02050781
  0.11376953  0.11572266  0.12988281 -0.11474609  0.04370117 -0.18652344
 -0.04785156  0.07275391 -0.04174805 -0.18359375  0.03979492 -0.0859375
 -0.11523438 -0.012146    0.14746094 -0.26367188  0.09765625  0.11865234
 -0.16015625  0.03491211  0.04150391  0.01031494  0.00460815  0.03808594
 -0.16308594  0.03686523 -0.02307129 -0.14746094 -0.03735352  0.11572266
  0.20117188  0.10498047 -0.00689697  0.31445312 -0.22167969  0.10107422
  0.2578125   0.359375    0.24804688  0.03173828  0.07373047 -0.10791016
  0.16796875 -0.23632812 -0.07714844  0.05761719  0.22265625 -0.23046875
  0.18359375  0.1015625   0.20898438 -0.01348877 -0.09765625 -0.09960938
 -0.20507812 -0.10009766  0.203125    0.26757812  0.01306152  0.23535156
  0.00576782  0.06640625  0.1796875  -0.09814453  0.05151367  0.0123291 ]

Each word is represented by 300 values, and words that are close in meaning have similar vectors.
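
You can check this with gensim itself; most_similar and similarity are standard KeyedVectors methods (the outputs are omitted here):

# Words whose vectors are closest to 'take'.
print(word2vec.most_similar('take', topn=5))
# Cosine similarity between two related words.
print(word2vec.similarity('take', 'give'))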

We then extract the word vector for each id.

null_word = np.zeros(300)
null_word = [0 0 0 0 .... 0]

np.zeros() creates a NumPy array with every element set to 0; np.zeros(300) creates an array with 300 elements. null_word is used for id 0 and for words whose vector does not exist in word2vec.

embedding_matrix = np.zeros((len(char_indices)+1, 300))

Create an array embedding_matrix to hold the extracted word vectors. Note that it is not a dictionary but a two-dimensional array of shape (number of words + 1, 300). The extra row is needed because the array must also cover id 0.
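
As a quick check, the array has one row per word id plus the extra row for id 0; with the 10,499-word vocabulary used here, its shape should be:

print(embedding_matrix.shape)
# (10500, 300)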

for id, word in indices_char.items():
    try:
        embedding_matrix[id] = word2vec[word]
    except KeyError:
        # the word does not exist in the pre-trained vocabulary
        embedding_matrix[id] = null_word

Extract the id and word from the dictionary one by one.

For beginners, this for statement can be confusing. It starts with id=1, word='the', then id=2, word='to', and so on, taking each entry of the dictionary in turn.
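
A minimal sketch with a two-entry dictionary (made up for illustration) shows what .items() does:

small_dict = {1: 'the', 2: 'to'}
for id, word in small_dict.items():
    print(id, word)
# 1 the
# 2 to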

word2vec raises a KeyError if the requested word does not exist in the pre-trained vocabulary. Therefore, the code tries to extract the word vector inside try:, and when the error occurs, it stores an all-zero array inside except:.
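
An equivalent way to write the same loop, if you prefer not to rely on the exception, is to test membership first; this is only a sketch (KeyedVectors supports the in operator):

for id, word in indices_char.items():
    if word in word2vec:
        embedding_matrix[id] = word2vec[word]
    # otherwise the row keeps its initial value of all zeros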

embedding_matrix[0] = null_word

Id 0 is not assigned to any word, so its row is also set to all zeros.

The list embedding_matrix has a 300-dimensional vector corresponding to the id of the word.

For example, indices_char[1]='the', so embedding_matrix[1] represents the word vector of ‘the’.
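
You can verify this directly (assuming 'the' exists in the pre-trained vocabulary, as the non-zero row below suggests):

print(np.allclose(embedding_matrix[char_indices['the']], word2vec['the']))
# True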

embedding_matrix =
[[0. 0. 0. 0. ....]
 [ 0.08007812  0.10498047  0.04980469  0.0534668 ....]
 [0. 0. 0. 0. ....]
 [0. 0. 0. 0. ....]
 [0. 0. 0. 0. ....]
 [0. 0. 0. 0. ....]
 [ 0.0703125   0.08691406  0.08789062  0.0625 ....]
 [-1.57470703e-02 -2.83203125e-02  8.34960938e-02  5.02929688e-02
 ....]
 [ 0.00704956 -0.07324219  0.171875    0.02258301 ....]
 [-2.25585938e-01 -1.95312500e-02  9.08203125e-02  2.37304688e-01 ....]
 ....]

Embedding layer

To embed a word vector, use the Embedding layer.

model.add(Embedding(len(char_indices)+1,
                    300,
                    weights=[embedding_matrix],
                    mask_zero=True,
                    trainable=False))

The first two arguments of the Embedding layer are the vocabulary size (the number of words + 1, to include id 0) and the output dimension.

Here, we are using a 300-dimensional word vector, so the output dimension is 300.

In weights, you pass the array of word vectors created above, embedding_matrix.

Also, trainable=False excludes this layer from training, so the pre-trained vectors are not updated.

When the Embedding layer is given a word id as an input value, it converts it to a word vector and passes it to the next LSTM layer.

One useful feature of the Embedding layer is mask_zero. With mask_zero=True, an input value of 0 is treated as padding and masked out, so padded time steps are ignored by the following layers. If you have padded the input sequences, this can improve training accuracy.
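
As a minimal sketch of what the layer does (the three-id input below is made up for illustration), a model consisting of only this Embedding layer turns word ids into 300-dimensional vectors, with id 0 treated as padding:

# Sketch: an Embedding-only model converts word ids into 300-dimensional vectors.
demo = Sequential()
demo.add(Embedding(len(char_indices)+1, 300,
                   weights=[embedding_matrix],
                   mask_zero=True,
                   trainable=False))
ids = np.array([[0, char_indices['the'], char_indices['test']]])  # 0 is padding
print(demo.predict(ids).shape)
# (1, 3, 300)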

Training

Here is the complete working code.

For more information on sample_generator, see NLP for learners – Split and train data which exceed the memory.

For more information on EarlyStopping and ReduceLROnPlateau, see NLP for learners – Changing learning rates and stopping early(ReduceLROnPlateau/EarlyStopping).

For more information on validation_data, see NLP for learners – Separate training data and validation data.

For information on cp_callback, see NLP for learners – Interrupt and resume training(ModelCheckpoint).

If your machine has little memory, this code may cause an error. In that case, reduce the value of batch_size.

import numpy as np
import io
import os
import sys
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import gensim
from keras.models import Model, load_model, Sequential
from keras.layers import Embedding, Dense, LSTM
from keras.optimizers import Adam
from keras.utils import np_utils
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import TimeseriesGenerator
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
#read the text
with io.open('articles_u2.txt', encoding='utf-8') as f:
    text = f.read().replace('eos', '.\n').splitlines()
texts = []
for line in range(len(text)):
    chars = text[line].split()
    if len(chars) >= 7 and len(chars) <= 40:
        texts.append(text[line])
#make the dictionary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
char_indices = tokenizer.word_index
indices_char = dict([(value, key) for (key, value) in char_indices.items()])
#word2vec
print('load the word vectors....')
word2vec = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
null_word = np.zeros(300)
embedding_matrix = np.zeros((len(char_indices)+1, 300))
for id, word in indices_char.items():
    try:
        embedding_matrix[id] = word2vec[word]
    except KeyError:
        embedding_matrix[id] = null_word
embedding_matrix[0] = null_word
texts = tokenizer.texts_to_sequences(texts)
seq_length = 40
texts = sequence.pad_sequences(texts, maxlen=seq_length, padding='pre', truncating='post')
#make dataset
batch_size = 200
time_step = 5
def sample_generator(start, end):
    while True:
        for step in range((end - start) // batch_size):
            x = []
            y = []
            for line in range(batch_size):
                dataset = TimeseriesGenerator(
                    texts[start+step*batch_size+line],
                    texts[start+step*batch_size+line],
                    length=time_step,
                    batch_size=1)
                for batch in dataset:
                    X, Y = batch
                    x.extend(X[0])
                    y.extend(Y)
            x = np.reshape(x,((seq_length-time_step)*batch_size,time_step))
            y = np_utils.to_categorical(y, len(char_indices)+1)
            yield x, y
#build the model
print('build the model....')
model = Sequential()
model.add(Embedding(len(char_indices)+1,
                    300,
                    weights=[embedding_matrix],
                    mask_zero=True,
                    trainable=False))
model.add(LSTM(512))
model.add(Dense(len(char_indices)+1, activation='softmax'))
optimizer = Adam(lr=0.01)
model.compile(loss='categorical_crossentropy',
    optimizer=optimizer,
    metrics=['accuracy'])
early_stopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=10, mode='auto')
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2, min_lr=0.0001)
cp_callback = ModelCheckpoint(
    filepath="u_model.h5",
    verbose=0,
    save_weights_only=False,
    save_freq="epoch")
#training
train_val_rate = 0.8
train_start = 0
train_end = round(len(texts) * train_val_rate)
val_start = train_end + 1
val_end = len(texts)
model.fit(
    sample_generator(train_start, train_end),
    steps_per_epoch=(train_end - train_start) // batch_size,
    validation_data=sample_generator(val_start, val_end),
    validation_steps=(val_end - val_start) // batch_size,
    initial_epoch=0,
    epochs=100,
    verbose=1,
    callbacks=[cp_callback, early_stopping, reduce_lr])

EarlyStopping stopped training at epoch 40. At that point, the generated English text looked like this:

i expect all of you can see the benefits of the different types of blogging
the poor old woman had a stay in the real world is a similar importance
a rush hour traffic jam delayed cars and the same brand is not a good

The results are much better than without word vector embedding. However, while these sentences are partially understandable, the model still produces sentences that do not make sense as a whole.

i expect all of you are going to be able to live alone for the
the poor old woman had taken care of the government 's the farms were smaller
a rush hour traffic jam out to the fog on the web and the international

These are the results after training for 200 epochs without EarlyStopping. There was no improvement.
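
For reference, the sentences above are produced by repeatedly predicting the next word from the last five words. Here is a minimal sketch of that loop, assuming greedy (argmax) selection and using the opening words of the first sample sentence as the seed:

# Sketch: generate 15 words from a 5-word seed with the trained model.
seed = ['i', 'expect', 'all', 'of', 'you']
ids = [char_indices[w] for w in seed]
for _ in range(15):
    x = np.array([ids[-time_step:]])           # the last 5 word ids
    next_id = np.argmax(model.predict(x))      # the most probable next word
    ids.append(next_id)
print(' '.join(indices_char.get(i, '') for i in ids))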