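NLP (Natural Language Processing): Retrieving the Results of Syntactic Parsing with Stanza

Published: 2020/09/21

Last updated: 2023/01/04

Stanza is a Python library for the syntactic analysis of text. It supports many languages and provides a variety of information about each sentence.

Stanza must be installed before use (for example, with pip install stanza).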

UPOS

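One of the basic pieces of information you can obtain is the part of speech (UPOS, the universal part-of-speech tag).

UPOS tags are written as abbreviations. Their meanings are as follows.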

1.	ADJ	adjective
2.	ADP	adposition (e.g. preposition)
3.	ADV	adverb
4.	AUX	auxiliary
5.	CCONJ	coordinating conjunction
6.	DET	determiner
7.	INTJ	interjection
8.	NOUN	noun
9.	NUM	numeral
10.	PART	particle
11.	PRON	pronoun
12.	PROPN	proper noun
13.	PUNCT	punctuation
14.	SCONJ	subordinating conjunction
15.	SYM	symbol
16.	VERB	verb
17.	X	other
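The table above can be kept in code as a simple lookup, which is handy when printing parser output in readable form. The following is a minimal sketch; the names UPOS_MEANINGS and describe_upos are made up for this illustration and are not part of Stanza.

```python
# Meanings of the 17 UPOS tags, copied from the table above.
UPOS_MEANINGS = {
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "AUX": "auxiliary",
    "CCONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other",
}


def describe_upos(tags):
    """Return a (tag, meaning) pair for each tag; unknown tags map to 'unknown'."""
    return [(tag, UPOS_MEANINGS.get(tag, "unknown")) for tag in tags]


# UPOS tags of "the poor old woman had", as they appear later in this article.
for tag, meaning in describe_upos(["DET", "ADJ", "ADJ", "NOUN", "VERB"]):
    print(f"{tag:6}{meaning}")
```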

XPOS

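Compared with UPOS, XPOS provides more detailed part-of-speech information. XPOS tags are language-specific; for English they follow the Penn Treebank tag set.

Their meanings are as follows. Punctuation tags such as '.' and 'HYPH', which appear in the output later in this article, are not listed in the table.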

Number	Tag	Description
1.	CC	Coordinating conjunction
2.	CD	Cardinal number
3.	DT	Determiner (modifies a noun or noun phrase)
4.	EX	Existential there
5.	FW	Foreign word
6.	IN	Preposition or subordinating conjunction
7.	JJ	Adjective
8.	JJR	Adjective, comparative
9.	JJS	Adjective, superlative
10.	LS	List item marker
11.	MD	Modal
12.	NN	Noun, singular or mass
13.	NNS	Noun, plural
14.	NNP	Proper noun, singular
15.	NNPS	Proper noun, plural
16.	PDT	Predeterminer
17.	POS	Possessive ending ('s)
18.	PRP	Personal pronoun
19.	PRP$	Possessive pronoun
20.	RB	Adverb
21.	RBR	Adverb, comparative
22.	RBS	Adverb, superlative
23.	RP	Particle
24.	SYM	Symbol
25.	TO	to
26.	UH	Interjection
27.	VB	Verb, base form
28.	VBD	Verb, past tense
29.	VBG	Verb, gerund or present participle
30.	VBN	Verb, past participle
31.	VBP	Verb, non-3rd person singular present
32.	VBZ	Verb, 3rd person singular present
33.	WDT	Wh-determiner (e.g. what, which)
34.	WP	Wh-pronoun (e.g. who, what)
35.	WP$	Possessive wh-pronoun (whose)
36.	WRB	Wh-adverb (e.g. when, where, why)

Lemma

A lemma is the dictionary headword of a word. For example, take inflects as takes, took, taken, and taking, and the lemma of all these forms is take.
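To see what lemmatization changes in practice, it helps to compare surface forms with their lemmas. The sketch below does not call Stanza; it uses the words and lemmas of the first example sentence exactly as they are listed in the results later in this article, and the variable names are chosen for this illustration only.

```python
# Surface forms and lemmas of the first example sentence (from the results below).
words = ['i', 'expect', 'all', 'of', 'you', 'to', 'be', 'here', 'five',
         'minutes', 'before', 'the', 'test', 'begins', 'without', 'fail', '.']
lemmas = ['i', 'expect', 'all', 'of', 'you', 'to', 'be', 'here', 'five',
          'minute', 'before', 'the', 'test', 'begin', 'without', 'fail', '.']

# Keep only the words whose surface form differs from the lemma.
changed = [(w, l) for w, l in zip(words, lemmas) if w != l]
print(changed)  # [('minutes', 'minute'), ('begins', 'begin')]
```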

Dependencies

Stanza also tells you which word each word depends on.

Extracting the information

with io.open('articles_u.txt', encoding='utf-8') as f:
    text = f.read()
texts = text.replace('eos', '.\n').splitlines()

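Read the text. The text file marks the end of each sentence with the token eos, so we replace it with a period followed by a line break and split the text into one sentence per line.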
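One thing to watch with a plain string replace is that it also rewrites eos when it occurs inside a longer word. Whether the author's file contains such words is unknown, so the following is only a sketch on a made-up string (raw is hypothetical data) showing a word-boundary regular expression as one possible alternative.

```python
import re

# Hypothetical file content: sentences terminated by the marker "eos".
raw = "the poor old woman had her bag stolen again eos i watched videos eos"

# Plain replace, as in the article: also rewrites the "eos" inside "videos".
naive = raw.replace('eos', '.\n').splitlines()

# Word-boundary match: only the standalone marker is replaced.
safe = re.sub(r'\beos\b', '.\n', raw).splitlines()
# Drop surrounding whitespace and any empty lines.
safe = [s.strip() for s in safe if s.strip()]

print(naive)
print(safe)
```

If the file never contains eos inside another word, both versions split the text at the same places.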

stanza.download('en')

stanza.download('en') downloads the English models, which are required to analyze text. The models are more than 1 GB in size. Once they have been downloaded, the download is skipped on subsequent runs.

nlp = stanza.Pipeline(lang='en')

Specify English as the language of the pipeline. You can think of the pipeline as a device that takes text as input and returns the results of the analysis.

for line in range(3):

Here we parse the first three sentences in the list texts.

texts[0] = 'i expect all of you to be here five minutes before the test begins without fail .'
texts[1] = 'the poor old woman had her bag stolen again .'
texts[2] = 'a rush-hour traffic jam delayed my arrival by two hours .'

Now let's run the parser.

    doc = nlp(texts[line])

Each sentence is parsed, and the result is stored in the object doc.

doc =
  [
    {
      "id": 1,
      "text": "i",
      "lemma": "i",
      "upos": "PRON",
      "xpos": "PRP",
      "feats": "Case=Nom|Number=Sing|Person=1|PronType=Prs",
      "head": 2,
      "deprel": "nsubj",
      "misc": "start_char=0|end_char=1",
      "ner": "O"
    },
    {
      "id": 2,
      "text": "expect",
      "lemma": "expect",
      "upos": "VERB",
      "xpos": "VBP",
      "feats": "Mood=Ind|Tense=Pres|VerbForm=Fin",
      "head": 0,
      "deprel": "root",
      "misc": "start_char=2|end_char=8",
      "ner": "O"
    },
    {
      "id": 3,
      "text": "all",
      "lemma": "all",
      "upos": "DET",
      "xpos": "DT",
      "head": 2,
      "deprel": "obj",
      "misc": "start_char=9|end_char=12",
      "ner": "O"
    }, ......
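The feats and misc fields in this output pack several key=value pairs into one string separated by |. A small helper can turn them into dictionaries. This is a sketch that works on plain dictionaries copied from the output above rather than on a live Stanza object; parse_feats is a name invented for this illustration. Note that the entry for all has no feats key, which the helper tolerates.

```python
def parse_feats(feats):
    """Split a 'Key=Value|Key=Value' string into a dict; None or '' gives {}."""
    if not feats:
        return {}
    return dict(pair.split('=', 1) for pair in feats.split('|'))


# Two entries from the output above, reduced to the fields used here.
words = [
    {"id": 1, "text": "i",
     "feats": "Case=Nom|Number=Sing|Person=1|PronType=Prs",
     "misc": "start_char=0|end_char=1"},
    {"id": 3, "text": "all",  # this entry has no "feats" key
     "misc": "start_char=9|end_char=12"},
]

for w in words:
    feats = parse_feats(w.get("feats"))
    span = parse_feats(w["misc"])
    print(w["text"], feats.get("Number"), int(span["start_char"]), int(span["end_char"]))
```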

Next, convert the information in the object into lists.

    for word in doc.sentences[0].words:
        char.append(word.text)
        lemma.append(word.lemma)
        pos.append(word.pos)
        xpos.append(word.xpos)
        deprel.append(word.deprel)

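A for loop extracts the information for each word from the object: the surface form (text), the lemma, the UPOS tag (word.pos), the XPOS tag, and the dependency relation (deprel).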

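    for word in doc.sentences[0].words:
        head.append(lemma[word.head-1] if word.head > 0 else "root")

head indicates, by id, which word a given word depends on. See the object doc above. For example, all depends on expect, and its head field holds the id 2.

An id alone does not tell us which word is meant, so a conditional expression looks up the lemma that corresponds to the id. Ids start at 1, so the list index is word.head-1. For example, when the word is 'all', 'expect' is stored in head.

When head is 0, 'root' is stored in head instead. root means that the word does not depend on any other word.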
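The lookup from head id to lemma can be tried without running the parser. The sketch below uses the three entries shown in the doc output above as plain dictionaries and maps ids to lemmas through a dictionary instead of list positions; lemma_by_id and head_lemma are names chosen for this illustration.

```python
# The three entries shown in the output above, reduced to the fields used here.
words = [
    {"id": 1, "lemma": "i", "head": 2, "deprel": "nsubj"},
    {"id": 2, "lemma": "expect", "head": 0, "deprel": "root"},
    {"id": 3, "lemma": "all", "head": 2, "deprel": "obj"},
]

# Map each id to its lemma so the lookup does not rely on list positions.
lemma_by_id = {w["id"]: w["lemma"] for w in words}


def head_lemma(word):
    """Return the lemma of the word's head, or 'root' when head is 0."""
    return lemma_by_id[word["head"]] if word["head"] > 0 else "root"


heads = [head_lemma(w) for w in words]
print(heads)  # ['expect', 'root', 'expect']
```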

    chars.append(char)
    lemmas.append(lemma)
    poses.append(pos)
    xposes.append(xpos)
    heads.append(head)
    deprels.append(deprel)

The results for each sentence are appended to the corresponding outer list.

chars =
  [['i', 'expect', 'all', 'of', 'you', 'to', 'be', 'here', 'five', 'minutes', 'before', 'the', 'test', 'begins', 'without', 'fail', '.'],
   ['the', 'poor', 'old', 'woman', 'had', 'her', 'bag', 'stolen', 'again', '.'],
   ['a', 'rush', '-', 'hour', 'traffic', 'jam', 'delayed', 'my', 'arrival', 'by', 'two', 'hours', '.']]
lemmas =
  [['i', 'expect', 'all', 'of', 'you', 'to', 'be', 'here', 'five', 'minute', 'before', 'the', 'test', 'begin', 'without', 'fail', '.'],
   ['the', 'poor', 'old', 'woman', 'have', 'she', 'bag', 'steal', 'again', '.'],
   ['a', 'rush', '-', 'hour', 'traffic', 'jam', 'delay', 'my', 'arrival', 'by', 'two', 'hour', '.']]
poses =
  [['PRON', 'VERB', 'DET', 'ADP', 'PRON', 'PART', 'AUX', 'ADV', 'NUM', 'NOUN', 'SCONJ', 'DET', 'NOUN', 'VERB', 'ADP', 'NOUN', 'PUNCT'],
   ['DET', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'PRON', 'NOUN', 'VERB', 'ADV', 'PUNCT'],
   ['DET', 'NOUN', 'PUNCT', 'NOUN', 'NOUN', 'NOUN', 'VERB', 'PRON', 'NOUN', 'ADP', 'NUM', 'NOUN', 'PUNCT']]
xposes =
  [['PRP', 'VBP', 'DT', 'IN', 'PRP', 'TO', 'VB', 'RB', 'CD', 'NNS', 'IN', 'DT', 'NN', 'VBZ', 'IN', 'NN', '.'],
   ['DT', 'JJ', 'JJ', 'NN', 'VBD', 'PRP$', 'NN', 'VBN', 'RB', '.'],
   ['DT', 'NN', 'HYPH', 'NN', 'NN', 'NN', 'VBD', 'PRP$', 'NN', 'IN', 'CD', 'NNS', '.']]
heads =
  [['expect', 'root', 'expect', 'you', 'all', 'here', 'here', 'expect', 'minute', 'here', 'begin', 'test', 'begin', 'here', 'fail', 'begin', 'expect'],
   ['woman', 'woman', 'woman', 'have', 'root', 'bag', 'have', 'have', 'steal', 'have'],
   ['jam', 'hour', 'hour', 'jam', 'jam', 'delay', 'root', 'arrival', 'delay', 'hour', 'hour', 'delay', 'delay']]
deprels =
  [['nsubj', 'root', 'obj', 'case', 'nmod', 'mark', 'cop', 'xcomp', 'nummod', 'obl:tmod', 'mark', 'det', 'nsubj', 'advcl', 'case', 'obl', 'punct'],
   ['det', 'amod', 'amod', 'nsubj', 'root', 'nmod:poss', 'obj', 'xcomp', 'advmod', 'punct'],
   ['det', 'compound', 'punct', 'compound', 'compound', 'nsubj', 'root', 'nmod:poss', 'obj', 'case', 'nummod', 'obl', 'punct']]

For the meaning of each deprel label, please refer to the manual. The labels are Universal Dependencies relation names.
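Because the six lists are aligned word by word, they can be zipped into one row per word, which is often easier to read than six separate lists. The sketch below copies the results for the second sentence from the lists above; rows and dependents are names used only in this illustration.

```python
# Results for the second sentence, copied from the lists above.
chars = ['the', 'poor', 'old', 'woman', 'had', 'her', 'bag', 'stolen', 'again', '.']
lemmas = ['the', 'poor', 'old', 'woman', 'have', 'she', 'bag', 'steal', 'again', '.']
poses = ['DET', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'PRON', 'NOUN', 'VERB', 'ADV', 'PUNCT']
xposes = ['DT', 'JJ', 'JJ', 'NN', 'VBD', 'PRP$', 'NN', 'VBN', 'RB', '.']
heads = ['woman', 'woman', 'woman', 'have', 'root', 'bag', 'have', 'have', 'steal', 'have']
deprels = ['det', 'amod', 'amod', 'nsubj', 'root', 'nmod:poss', 'obj', 'xcomp', 'advmod', 'punct']

# One tuple per word: (text, lemma, UPOS, XPOS, head lemma, deprel).
rows = list(zip(chars, lemmas, poses, xposes, heads, deprels))
for row in rows:
    print('\t'.join(row))

# Words that depend directly on the root verb 'have'.
dependents = [c for c, h in zip(chars, heads) if h == 'have']
print(dependents)  # ['woman', 'bag', 'stolen', '.']
```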

The complete code is shown below.

import io
import stanza

# Read the text; the file marks sentence ends with 'eos'.
with io.open('articles_u.txt', encoding='utf-8') as f:
    text = f.read()
texts = text.replace('eos', '.\n').splitlines()

# Download the English models (skipped if already present) and build the pipeline.
stanza.download('en')
nlp = stanza.Pipeline(lang='en')

# One inner list per sentence is appended to each of these.
chars = []
lemmas = []
poses = []
xposes = []
heads = []
deprels = []

n_lines = 3  # analyze only the first three sentences
for line in range(n_lines):
    char = []
    lemma = []
    pos = []
    xpos = []
    head = []
    deprel = []
    print('analyzing: '+str(line+1)+' / '+str(n_lines), end='\r')
    doc = nlp(texts[line])
    # Only the first sentence detected in each line is used.
    for word in doc.sentences[0].words:
        char.append(word.text)
        lemma.append(word.lemma)
        pos.append(word.pos)  # word.pos holds the UPOS tag
        xpos.append(word.xpos)
        deprel.append(word.deprel)
    # Replace each head id (1-based; 0 means root) with the head's lemma.
    for word in doc.sentences[0].words:
        head.append(lemma[word.head-1] if word.head > 0 else "root")
    chars.append(char)
    lemmas.append(lemma)
    poses.append(pos)
    xposes.append(xpos)
    heads.append(head)
    deprels.append(deprel)
