NLP (Natural Language Processing): Loading Pretrained Word Vectors into an Embedding Layer

Published: 2020/09/28

Last updated: 2023/01/04

In an earlier article, "Training and Predicting with Keras and LSTM," I built an LSTM model that predicts the next word from the five words before it.

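However, accuracy drops sharply as the training text grows. This is because each word is represented by a single value (its integer id, in effect a one-dimensional vector). With so few clues to work from, the model cannot make correct predictions.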

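The same holds when the human brain processes language. The brain does not treat each word in a sentence as a mere symbol; it recognizes the word as carrying more information than that. What exactly that "additional information" consists of, however, is still largely an unsolved mystery.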

Here we use word vectors to give each word more information before training.

Word2Vec word vectors

A word vector rests on the hypothesis that the meaning of each word can be expressed as a vector of a few hundred dimensions (300 in the file used here).

Here we extract and use vectors from a pretrained set, GoogleNews-vectors-negative300.bin. The file should be available from this page. It is a large file, so the download takes a while.

To use the word vectors, you also need to install gensim beforehand.

Reading the text

with io.open('articles.txt', encoding='utf-8') as f:
    text = f.read().replace('eos', '.\n').splitlines()

text = 
 ['i expect all of you to be here five minutes before the test begins without fail .',
  'the poor old woman had her bag stolen again .',
  'a rush-hour traffic jam delayed my arrival by two hours .',
  ...... ]

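Read the text file, which is about 1 MB. The text marks the end of each sentence with the symbol eos, so we convert each one to a period plus a newline and then split the result into a list of lines. Note that str.replace also matches eos inside longer words (for example "videos"), so this assumes the corpus does not contain such words.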

texts = []
for line in text:
    chars = line.split()  # split the sentence into words
    if 7 <= len(chars) <= 40:
        texts.append(line)  # keep the sentence itself, as a string

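chars = ['i', 'expect', 'all', 'of', 'you', 'to', 'be', 'here', 'five',
         'minutes', 'before', 'the', 'test', 'begins', 'without',
         'fail', '.']

.split() splits a sentence at the whitespace between words. The resulting list is stored temporarily in chars (shown above for the first sentence) and is used only to count the words. Sentences with between 7 and 40 words are appended to texts as the original strings. Sentences that are too short or too long are left out because we do not want them in the training data. This step is optional.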
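The length filter described above can be sketched as a small standalone function. This is only an illustration: the name filter_by_length and the sample sentences are assumptions made here, not part of the original code.

```python
def filter_by_length(lines, min_words=7, max_words=40):
    """Keep only the lines whose whitespace-separated word count is in range."""
    kept = []
    for line in lines:
        n_words = len(line.split())
        if min_words <= n_words <= max_words:
            kept.append(line)  # keep the original string, not the split list
    return kept


sample = [
    'i expect all of you to be here five minutes before the test begins without fail .',
    'too short .',
]
filtered = filter_by_length(sample)
print(filtered)
```

Only the first sample sentence survives, because the second one has fewer than seven words.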

Building the dictionary

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
char_indices = tokenizer.word_index
indices_char = dict([(value, key) for (key, value) in char_indices.items()])

char_indices = 
 {'the': 1, 'to': 2, 'of': 3, 'and': 4, 'a': 5, 'in': 6, 'that': 7, 'is': 8,
  'i': 9, 'it': 10, 'for': 11, 'was': 12, ......}
indices_char = 
 {1: 'the', 2: 'to', 3: 'of', 4: 'and', 5: 'a', 6: 'in', 7: 'that', 8: 'is',
  9: 'i', 10: 'it', 11: 'for', 12: 'was', ......}

Use Tokenizer to build a dictionary and a reverse dictionary. The vocabulary contains 10,499 words. char_indices looks up the id for a word, and indices_char looks up the word for an id.

char_indices['the']  ---> 1
char_indices['to']  ---> 2

indices_char[1] ---> 'the'
indices_char[2] ---> 'to'
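To see roughly what the dictionary contains, here is a simplified pure-Python sketch that assigns ids in descending order of word frequency, starting from 1. It is an assumption-laden stand-in, not the Keras implementation: the real Tokenizer also strips punctuation, and the function name build_word_index and the sample sentences are made up for illustration.

```python
from collections import Counter


def build_word_index(sentences):
    """Assign ids 1, 2, 3, ... to words in descending order of frequency."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    # id 0 is left unused so that it can serve as the padding value
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


sentences = ['the cat saw the dog', 'the dog ran']
word_to_id = build_word_index(sentences)
id_to_word = {i: w for w, i in word_to_id.items()}  # reverse dictionary
print(word_to_id)
print(id_to_word[1])
```

The most frequent word gets id 1, which is why 'the' comes first in the real dictionary as well.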

Loading the word vectors

word2vec = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

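Load the word vectors into the object word2vec. Because the file is very large, loading may fail on a machine with little memory. If that happens, passing the limit argument to load_word2vec_format, which reads only the first N vectors in the file, may help.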

The loaded vectors have 300 dimensions; that is, each word is represented by a set of 300 values.

As a test, let's fetch the word vector for "take".

print(word2vec['take'])

[-0.05102539  0.00415039  0.02490234 -0.03515625  0.05444336 -0.08496094
  0.14160156 -0.02966309  0.02697754  0.04003906  0.07666016 -0.07421875
  0.14160156  0.27539062 -0.16503906  0.10644531  0.18847656  0.0267334
 -0.06396484 -0.04370117 -0.34179688 -0.02148438  0.3046875   0.17382812
 -0.07910156 -0.01757812  0.01599121 -0.078125   -0.08007812  0.14160156
 -0.04711914 -0.0098877  -0.36523438  0.11328125  0.0222168  -0.24609375
  0.07373047 -0.06982422  0.07275391  0.15332031  0.15429688  0.02600098
  0.23046875 -0.0625     -0.01928711 -0.09716797 -0.10107422 -0.07421875
  0.1328125  -0.006073   -0.09619141  0.22851562  0.0559082  -0.00263977
 -0.03198242 -0.20898438 -0.11035156  0.00823975 -0.07373047 -0.05493164
 -0.10058594 -0.10058594 -0.03491211 -0.15234375  0.09228516 -0.30273438
 -0.09667969  0.18359375 -0.09570312  0.14746094 -0.03344727  0.07226562
  0.22070312  0.06982422 -0.12109375 -0.14550781 -0.14355469  0.07763672
  0.03686523  0.09765625  0.02966309 -0.02856445  0.08984375 -0.13964844
  0.15820312  0.09960938  0.08984375  0.02600098  0.05712891  0.08837891
  0.03076172  0.08789062  0.01019287 -0.20410156 -0.06835938 -0.26171875
  0.03735352  0.18457031 -0.01367188  0.04418945 -0.03808594 -0.07861328
  0.09179688 -0.14355469 -0.10498047 -0.21972656  0.08837891 -0.22558594
  0.07373047  0.06884766 -0.01422119  0.00250244 -0.04760742 -0.28710938
  0.11914062  0.02941895  0.00332642 -0.10009766  0.03662109  0.10253906
 -0.01428223 -0.03173828 -0.14648438  0.18066406 -0.03222656  0.03613281
 -0.03027344 -0.15820312  0.05444336  0.08740234  0.12011719  0.08007812
 -0.01098633  0.02258301  0.03295898 -0.0859375   0.171875    0.09130859
 -0.20019531  0.0559082  -0.01226807 -0.02636719 -0.16699219 -0.0612793
 -0.09423828  0.05053711 -0.15234375 -0.13183594 -0.04199219 -0.15722656
 -0.06689453  0.0456543  -0.234375   -0.03564453 -0.03613281 -0.02636719
  0.02856445 -0.29492188 -0.05224609  0.1875      0.13964844  0.10742188
 -0.12451172 -0.23339844  0.0859375  -0.14160156  0.16210938 -0.01953125
  0.06982422 -0.12890625 -0.20703125 -0.07373047  0.08447266  0.01879883
  0.07714844 -0.07275391  0.07226562  0.11816406 -0.18652344  0.12207031
 -0.19628906 -0.15332031 -0.04125977  0.08691406 -0.09863281 -0.15917969
 -0.03613281  0.05419922  0.078125   -0.03491211  0.14746094  0.10498047
  0.24707031  0.13085938 -0.03015137 -0.13964844  0.1640625   0.0135498
  0.03466797 -0.17675781 -0.00239563  0.04736328 -0.07910156  0.16210938
 -0.13867188  0.04418945  0.0255127   0.05371094  0.01239014 -0.23828125
 -0.00512695  0.2578125   0.15429688  0.16503906 -0.10644531 -0.171875
  0.13378906 -0.10302734  0.05712891 -0.18261719  0.02197266 -0.07519531
 -0.11669922 -0.12890625  0.04663086 -0.078125   -0.04638672 -0.16796875
  0.19921875  0.04223633  0.08789062 -0.140625    0.05981445  0.02050781
  0.11376953  0.11572266  0.12988281 -0.11474609  0.04370117 -0.18652344
 -0.04785156  0.07275391 -0.04174805 -0.18359375  0.03979492 -0.0859375
 -0.11523438 -0.012146    0.14746094 -0.26367188  0.09765625  0.11865234
 -0.16015625  0.03491211  0.04150391  0.01031494  0.00460815  0.03808594
 -0.16308594  0.03686523 -0.02307129 -0.14746094 -0.03735352  0.11572266
  0.20117188  0.10498047 -0.00689697  0.31445312 -0.22167969  0.10107422
  0.2578125   0.359375    0.24804688  0.03173828  0.07373047 -0.10791016
  0.16796875 -0.23632812 -0.07714844  0.05761719  0.22265625 -0.23046875
  0.18359375  0.1015625   0.20898438 -0.01348877 -0.09765625 -0.09960938
 -0.20507812 -0.10009766  0.203125    0.26757812  0.01306152  0.23535156
  0.00576782  0.06640625  0.1796875  -0.09814453  0.05151367  0.0123291 ]

Each word is represented by 300 values, and words with similar meanings have vectors that lie close to each other.
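"Close" is usually measured with cosine similarity. The sketch below uses three made-up 3-dimensional vectors (toy_vectors is purely hypothetical, not values from the GoogleNews file) to show the calculation with NumPy.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy vectors; real word2vec vectors have 300 dimensions
toy_vectors = {
    'cat': np.array([0.9, 0.8, 0.1]),
    'dog': np.array([0.8, 0.9, 0.2]),
    'car': np.array([0.1, 0.2, 0.9]),
}
sim_cat_dog = cosine_similarity(toy_vectors['cat'], toy_vectors['dog'])
sim_cat_car = cosine_similarity(toy_vectors['cat'], toy_vectors['car'])
print(sim_cat_dog > sim_cat_car)
```

With the real vectors loaded, gensim offers the same measure directly through word2vec.similarity(word1, word2).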

Next, fetch the word vector for each id in the dictionary built above.

null_word = np.zeros(300)

null_word = [0 0 0 0 .... 0]

np.zeros() creates a NumPy array whose elements are all 0, so np.zeros(300) is an array of 300 zeros. null_word is used for id 0 and for words that are not present in word2vec.

embedding_matrix = np.zeros((len(char_indices)+1, 300))

Create the array embedding_matrix to hold the extracted word vectors. Note that it is not a dictionary but a two-dimensional array of shape (vocabulary size + 1, 300). The extra row is needed because the array must also cover id 0.

for id, word in indices_char.items():
    try:
        embedding_matrix[id] = word2vec[word]
    except KeyError:  # the word is not in word2vec
        embedding_matrix[id] = null_word

Take the ids and words out of the dictionary one pair at a time.

This style of for loop can be confusing for beginners. The entries of the dictionary are taken out one at a time: first id=1, word='the', then id=2, word='to', and so on.

word2vec raises an error (a KeyError) when the requested word is not in the file. So the try block attempts to fetch the vector, and if that fails the except block stores an all-zero array instead.

embedding_matrix[0] = null_word

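For id=0, store an all-zero array as well. (The row is already zero after np.zeros, so this line only makes the intent explicit.)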

The resulting array embedding_matrix holds a 300-dimensional vector for each word id.

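For example, indices_char[1]='the', so embedding_matrix[1] is the word vector for 'the'. In the output below, the all-zero rows 2 to 5 correspond to 'to', 'of', 'and', and 'a', which were not found in word2vec.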

embedding_matrix =
[[0. 0. 0. 0. ....]
 [ 0.08007812  0.10498047  0.04980469  0.0534668 ....]
 [0. 0. 0. 0. ....]
 [0. 0. 0. 0. ....]
 [0. 0. 0. 0. ....]
 [0. 0. 0. 0. ....]
 [ 0.0703125   0.08691406  0.08789062  0.0625 ....]
 [-1.57470703e-02 -2.83203125e-02  8.34960938e-02  5.02929688e-02
 ....]
 [ 0.00704956 -0.07324219  0.171875    0.02258301 ....]
 [-2.25585938e-01 -1.95312500e-02  9.08203125e-02  2.37304688e-01 ....]
 ....]
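The same construction can be tried without downloading the large file. In this minimal sketch a small dictionary, toy_vectors, stands in for the loaded word2vec object; it, the 4-dimensional vectors, and the missing list are assumptions made for illustration.

```python
import numpy as np

dim = 4  # toy dimensionality; the article uses 300

# Hypothetical stand-in for the loaded word2vec object
toy_vectors = {
    'the': np.array([0.1, 0.2, 0.3, 0.4]),
    'in': np.array([0.5, 0.6, 0.7, 0.8]),
}
indices_char = {1: 'the', 2: 'to', 3: 'in'}

# Row 0 is reserved for padding, hence vocabulary size + 1 rows
embedding_matrix = np.zeros((len(indices_char) + 1, dim))
missing = []
for idx, word in indices_char.items():
    try:
        embedding_matrix[idx] = toy_vectors[word]
    except KeyError:
        missing.append(word)  # the row simply stays all zeros

print(embedding_matrix.shape)
print(missing)
```

Collecting the missing words is optional, but it makes it easy to check how much of the vocabulary ended up without a pretrained vector.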

The Embedding layer

To embed the word vectors, use an Embedding layer.

model.add(Embedding(len(char_indices)+1,
                    300,
                    weights=[embedding_matrix],
                    mask_zero=True,
                    trainable=False))

The Embedding layer takes (vocabulary size, output dimension, ...) as its arguments.

Since we are using 300-dimensional word vectors, the output dimension is 300.

Then pass embedding_matrix, the array of word vectors built above, as weights.

trainable=False excludes this layer from training, so the pretrained vectors stay fixed.

Given word ids as input, the Embedding layer converts them into word vectors and passes them on to the LSTM layer that follows.
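Conceptually, a frozen Embedding layer is just a table lookup: each id selects one row of the weight matrix. The NumPy sketch below imitates that lookup with a made-up 3-row matrix; it is an illustration of the idea, not the Keras implementation.

```python
import numpy as np

# Toy weight matrix: one row per word id, 3 dimensions per word
embedding_matrix = np.array([
    [0.0, 0.0, 0.0],   # id 0: padding
    [0.1, 0.2, 0.3],   # id 1
    [0.4, 0.5, 0.6],   # id 2
])

ids = np.array([[0, 2, 1, 1]])      # a batch of one sequence with 4 time steps
vectors = embedding_matrix[ids]     # pick one row per id
print(vectors.shape)
```

The result has shape (batch size, time steps, vector dimension), which is the 3D input an LSTM layer expects.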

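One handy feature of the Embedding layer is mask_zero. With mask_zero=True, the input value 0 is treated as padding, and the layers that follow skip those time steps. If the input is padded, masking out the zeros should improve training accuracy. For padding, see "The Basics of pad_sequences and TimeseriesGenerator".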
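The following pure-Python sketch shows what padding and the resulting mask look like. pad_pre is a hypothetical helper written for this illustration; it mimics pad_sequences with padding='pre' and truncating='post', and the mask is the "id is not 0" test that mask_zero=True is based on.

```python
def pad_pre(ids, maxlen, value=0):
    """Left-pad (or cut from the end) a list of ids to exactly maxlen items."""
    ids = ids[:maxlen]  # truncating='post' drops ids from the end
    return [value] * (maxlen - len(ids)) + ids


padded = pad_pre([9, 25, 7], maxlen=6)
mask = [token != 0 for token in padded]  # True marks a real word
print(padded)
print(mask)
```

Time steps where the mask is False are the ones the LSTM skips, which is why id 0 must never be assigned to a real word.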

Running the training

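The complete working code is shown below.

For sequence.pad_sequences, see "The Basics of pad_sequences and TimeseriesGenerator".

For sample_generator, see "Training on Large Data That Does Not Fit in Memory by Splitting It".

For EarlyStopping and ReduceLROnPlateau, see "Changing the Learning Rate and Early Stopping (ReduceLROnPlateau/EarlyStopping)".

For validation_data, see "Separating Training Data and Validation Data".

For cp_callback, see "Interrupting and Resuming Training (ModelCheckpoint)".

If your machine has little memory, this code will fail with an error. In that case, reduce the value of batch_size.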

import numpy as np
import io
import os
import sys
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import gensim
from keras.models import Model, load_model, Sequential
from keras.layers import Embedding, Dense, LSTM
from keras.optimizers import Adam
from keras.utils import np_utils
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import TimeseriesGenerator
from keras.callbacks import ModelCheckpoint, EarlyStopping, ReduceLROnPlateau
#read the text
with io.open('articles_u2.txt', encoding='utf-8') as f:
    text = f.read().replace('eos', '.\n').splitlines()
texts = []
for line in range(len(text)):
    chars = text[line].split()
    if len(chars) >= 7 and len(chars) <= 40:
        texts.append(text[line])
#make the dictionary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
char_indices = tokenizer.word_index
indices_char = dict([(value, key) for (key, value) in char_indices.items()])
#word2vec
print('load the word vectors....')
word2vec = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
null_word = np.zeros(300)
embedding_matrix = np.zeros((len(char_indices)+1, 300))
for id, word in indices_char.items():
    try:
        embedding_matrix[id] = word2vec[word]
    except KeyError:  # the word is not in word2vec
        embedding_matrix[id] = null_word
embedding_matrix[0] = null_word
texts = tokenizer.texts_to_sequences(texts)
seq_length = 40
texts = sequence.pad_sequences(texts, maxlen=seq_length, padding='pre', truncating='post')
#make dataset
batch_size = 200
time_step = 5
def sample_generator(start, end):
    # Yield (x, y) batches forever; Keras ends each epoch after steps_per_epoch
    while True:
        for step in range((end - start) // batch_size):
            x = []
            y = []
            for line in range(batch_size):
                # Slide a window of time_step ids over one padded sentence
                dataset = TimeseriesGenerator(
                    texts[start+step*batch_size+line],
                    texts[start+step*batch_size+line],
                    length=time_step,
                    batch_size=1)
                for batch in dataset:
                    X, Y = batch
                    x.extend(X[0])
                    y.extend(Y)
            # Each sentence yields seq_length - time_step windows
            x = np.reshape(x,((seq_length-time_step)*batch_size,time_step))
            # One-hot encode the target word ids
            y = np_utils.to_categorical(y, len(char_indices)+1)
            yield x, y
#build the model
print('build the model....')
model = Sequential()
model.add(Embedding(len(char_indices)+1,
                    300,
                    weights=[embedding_matrix],
                    mask_zero=True,
                    trainable=False))
model.add(LSTM(512))
model.add(Dense(len(char_indices)+1, activation='softmax'))
optimizer = Adam(learning_rate=0.01)
model.compile(loss='categorical_crossentropy',
    optimizer=optimizer,
    metrics=['accuracy'])
# Use a separate variable name so the EarlyStopping class is not overwritten
early_stopping = EarlyStopping(monitor='val_loss', min_delta=0, patience=10, mode='auto')
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2, min_lr=0.0001)
cp_callback = ModelCheckpoint(
    filepath="u_model.h5",
    verbose=0,
    save_weights_only=False,
    save_freq="epoch")
#training
train_val_rate = 0.8
train_start = 0
train_end = round(len(texts) * train_val_rate)
# The generator never reads index train_end, so validation can start there
val_start = train_end
val_end = len(texts)
model.fit(
    sample_generator(train_start, train_end),
    steps_per_epoch=(train_end - train_start) // batch_size,
    validation_data=sample_generator(val_start, val_end),
    validation_steps=(val_end - val_start) // batch_size,
    initial_epoch=0,
    epochs=100,
    verbose=1,
    callbacks=[cp_callback, early_stopping, reduce_lr])
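The sample_generator in the listing above relies on TimeseriesGenerator to turn each padded sentence into (five ids, next id) pairs. The pure-Python sketch below reproduces that windowing for a single sequence; sliding_windows is a hypothetical helper written for illustration, and the sequence 0 to 39 is dummy data.

```python
def sliding_windows(seq, time_step):
    """Pair every run of time_step items with the item that follows it."""
    inputs, targets = [], []
    for i in range(time_step, len(seq)):
        inputs.append(seq[i - time_step:i])  # the time_step preceding ids
        targets.append(seq[i])               # the id to predict
    return inputs, targets


seq = list(range(40))  # stands in for one padded sentence of 40 ids
inputs, targets = sliding_windows(seq, time_step=5)
print(len(inputs))
print(inputs[0], targets[0])
```

A 40-id sentence yields 40 - 5 = 35 pairs, which is where the (seq_length-time_step)*batch_size in the reshape call comes from.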

EarlyStopping halted training at epoch 40. The English sentences generated at that point look like this:

i expect all of you can see the benefits of the different types of blogging
the poor old woman had a stay in the real world is a similar importance
a rush hour traffic jam delayed cars and the same brand is not a good

Compared with training without embedded word vectors, the results improved considerably. Even so, although parts of these sentences can be understood, as a whole they do not make sense.

i expect all of you are going to be able to live alone for the
the poor old woman had taken care of the government 's the farms were smaller
a rush hour traffic jam out to the fog on the web and the international

These are the results of training for 200 epochs without EarlyStopping. No improvement was seen.