NLP for learners – A simple code of text parsing with Stanza

Stanza allows you to parse sentences with Python. It supports many languages and can provide a variety of information about a sentence.

To use Stanza, you need to install it in advance.
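
Stanza is distributed on PyPI, so a typical installation looks like this (a minimal sketch; adjust it to your own environment):

# install Stanza once in advance, for example from a shell:
#   pip install stanza
import stanza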

UPOS

One of the basic pieces of information available is the universal part-of-speech tag (UPOS).

UPOS tags are represented by abbreviations with the following meanings (a minimal example of reading them follows the list).

1. ADJ: adjective
2. ADP: adposition
3. ADV: adverb
4. AUX: auxiliary
5. CCONJ: coordinating conjunction
6. DET: determiner
7. INTJ: interjection
8. NOUN: noun
9. NUM: numeral
10. PART: particle
11. PRON: pronoun
12. PROPN: proper noun
13. PUNCT: punctuation
14. SCONJ: subordinating conjunction
15. SYM: symbol
16. VERB: verb
17. X: other
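
As a minimal sketch of reading UPOS tags (assuming the English model has already been downloaded, as described later in this article), each word exposes its tag as word.upos:

import stanza

nlp = stanza.Pipeline(lang='en')  # assumes stanza.download('en') has been run once
doc = nlp('the poor old woman had her bag stolen again .')
for word in doc.sentences[0].words:
    print(word.text, word.upos)   # e.g. woman NOUN, had VERB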

XPOS

Compared to UPOS, XPOS provides more detailed, language-specific part-of-speech information; for English, these are the Penn Treebank tags.

The tags have the following meanings (a minimal example of reading them follows the list).

1. CC: Coordinating conjunction
2. CD: Cardinal number
3. DT: Determiner
4. EX: Existential there
5. FW: Foreign word
6. IN: Preposition or subordinating conjunction
7. JJ: Adjective
8. JJR: Adjective, comparative
9. JJS: Adjective, superlative
10. LS: List item marker
11. MD: Modal
12. NN: Noun, singular or mass
13. NNS: Noun, plural
14. NNP: Proper noun, singular
15. NNPS: Proper noun, plural
16. PDT: Predeterminer
17. POS: Possessive ending
18. PRP: Personal pronoun
19. PRP$: Possessive pronoun
20. RB: Adverb
21. RBR: Adverb, comparative
22. RBS: Adverb, superlative
23. RP: Particle
24. SYM: Symbol
25. TO: to
26. UH: Interjection
27. VB: Verb, base form
28. VBD: Verb, past tense
29. VBG: Verb, gerund or present participle
30. VBN: Verb, past participle
31. VBP: Verb, non-3rd person singular present
32. VBZ: Verb, 3rd person singular present
33. WDT: Wh-determiner
34. WP: Wh-pronoun
35. WP$: Possessive wh-pronoun
36. WRB: Wh-adverb
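
Reading XPOS works the same way, only through word.xpos; a minimal sketch reusing the nlp pipeline from the UPOS example above:

doc = nlp('the poor old woman had her bag stolen again .')
for word in doc.sentences[0].words:
    print(word.text, word.xpos)   # e.g. had VBD, stolen VBN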

Lemma

A lemma is the dictionary headword of a word. For example, take inflects to takes, took, taken, and taking, but the lemma of all these forms is take.
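
A minimal sketch of reading lemmas, again reusing the nlp pipeline from above; each word exposes its lemma as word.lemma:

doc = nlp('the poor old woman had her bag stolen again .')
for word in doc.sentences[0].words:
    print(word.text, word.lemma)  # e.g. had have, stolen steal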

Dependency

Dependency information tells you which word a given word depends on (its head) and the type of that relation (deprel).
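
A minimal sketch of reading the dependency information, reusing the nlp pipeline from above: word.head is the 1-based id of the word it depends on (0 for the root of the sentence) and word.deprel names the relation.

doc = nlp('the poor old woman had her bag stolen again .')
words = doc.sentences[0].words
for word in words:
    head_text = words[word.head - 1].text if word.head > 0 else 'root'
    print(word.text, '->', head_text, word.deprel)  # e.g. woman -> had nsubj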

Extraction

with io.open('articles_u.txt', encoding='utf-8') as f:
    text = f.read()
texts = text.replace('eos', '.\n').splitlines()

Reads the text. In the text file, each sentence ends with the marker eos, so we replace it with a period followed by a newline and then split the text into a list of sentences.

stanza.download('en')

Download the English model with stanza.download('en'). The model is large (roughly a gigabyte), so the first download takes a while. Once it has been downloaded, the download is skipped on subsequent runs. The model is needed to parse the text.

nlp = stanza.Pipeline(lang='en')

Specifies English as the language of the pipeline. The pipeline can be thought of as a device that takes text and returns its parse.

for line in range(3):

Here we will parse the first three sentences of the list texts.

texts[0] = 'i expect all of you to be here five minutes before the test begins without fail .'
texts[1] = 'the poor old woman had her bag stolen again .'
texts[2] = 'a rush-hour traffic jam delayed my arrival by two hours .'

Let’s do some parsing.

    doc = nlp(texts[line])

Parses each sentence and stores it in the object doc.

doc =
  [
    {
      "id": 1,
      "text": "i",
      "lemma": "i",
      "upos": "PRON",
      "xpos": "PRP",
      "feats": "Case=Nom|Number=Sing|Person=1|PronType=Prs",
      "head": 2,
      "deprel": "nsubj",
      "misc": "start_char=0|end_char=1",
      "ner": "O"
    },
    {
      "id": 2,
      "text": "expect",
      "lemma": "expect",
      "upos": "VERB",
      "xpos": "VBP",
      "feats": "Mood=Ind|Tense=Pres|VerbForm=Fin",
      "head": 0,
      "deprel": "root",
      "misc": "start_char=2|end_char=8",
      "ner": "O"
    },
    {
      "id": 3,
      "text": "all",
      "lemma": "all",
      "upos": "DET",
      "xpos": "DT",
      "head": 2,
      "deprel": "obj",
      "misc": "start_char=9|end_char=12",
      "ner": "O"
    }, ......
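
For reference, this dump is simply what the parsed document looks like when displayed; a minimal sketch, assuming the doc object from above:

print(doc)                  # prints the parsed document in the JSON-like form shown above
word_dicts = doc.to_dict()  # the same information as nested Python lists and dicts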

Next, the information in the object is converted into plain lists.

    for word in doc.sentences[0].words:
        char.append(word.text)
        lemma.append(word.lemma)
        pos.append(word.pos)
        xpos.append(word.xpos)
        deprel.append(word.deprel)

Extract information about each word from the object using a for loop.

    for word in doc.sentences[0].words:
        head.extend([lemma[word.head-1] if word.head > 0 else "root"])

head is the id of the word that a word depends on; see the object doc above. For example, 'all' depends on 'expect', but head stores only the id 2.

Since the id alone does not tell us which word it refers to, we use a list comprehension to look up the lemma corresponding to that id. For example, if the word is 'all', then head will contain the word 'expect'.

If head is 0, the word is the root of the sentence and does not depend on any other word, so 'root' is stored instead.

    chars.append(char)
    lemmas.append(lemma)
    poses.append(pos)
    xposes.append(xpos)
    heads.append(head)
    deprels.append(deprel)

Add the results obtained for each sentence to the overall lists.

chars =
  [['i', 'expect', 'all', 'of', 'you', 'to', 'be', 'here', 'five', 'minutes', 'before', 'the', 'test', 'begins', 'without', 'fail', '.'],
   ['the', 'poor', 'old', 'woman', 'had', 'her', 'bag', 'stolen', 'again', '.'],
   ['a', 'rush', '-', 'hour', 'traffic', 'jam', 'delayed', 'my', 'arrival', 'by', 'two', 'hours', '.']]
lemmas =
  [['i', 'expect', 'all', 'of', 'you', 'to', 'be', 'here', 'five', 'minute', 'before', 'the', 'test', 'begin', 'without', 'fail', '.'],
   ['the', 'poor', 'old', 'woman', 'have', 'she', 'bag', 'steal', 'again', '.'],
   ['a', 'rush', '-', 'hour', 'traffic', 'jam', 'delay', 'my', 'arrival', 'by', 'two', 'hour', '.']]
poses =
  [['PRON', 'VERB', 'DET', 'ADP', 'PRON', 'PART', 'AUX', 'ADV', 'NUM', 'NOUN', 'SCONJ', 'DET', 'NOUN', 'VERB', 'ADP', 'NOUN', 'PUNCT'],
   ['DET', 'ADJ', 'ADJ', 'NOUN', 'VERB', 'PRON', 'NOUN', 'VERB', 'ADV', 'PUNCT'],
   ['DET', 'NOUN', 'PUNCT', 'NOUN', 'NOUN', 'NOUN', 'VERB', 'PRON', 'NOUN', 'ADP', 'NUM', 'NOUN', 'PUNCT']]
xposes =
  [['PRP', 'VBP', 'DT', 'IN', 'PRP', 'TO', 'VB', 'RB', 'CD', 'NNS', 'IN', 'DT', 'NN', 'VBZ', 'IN', 'NN', '.'],
   ['DT', 'JJ', 'JJ', 'NN', 'VBD', 'PRP$', 'NN', 'VBN', 'RB', '.'],
   ['DT', 'NN', 'HYPH', 'NN', 'NN', 'NN', 'VBD', 'PRP$', 'NN', 'IN', 'CD', 'NNS', '.']]
heads =
  [['expect', 'root', 'expect', 'you', 'all', 'here', 'here', 'expect', 'minute', 'here', 'begin', 'test', 'begin', 'here', 'fail', 'begin', 'expect'],
   ['woman', 'woman', 'woman', 'have', 'root', 'bag', 'have', 'have', 'steal', 'have'],
   ['jam', 'hour', 'hour', 'jam', 'jam', 'delay', 'root', 'arrival', 'delay', 'hour', 'hour', 'delay', 'delay']]
deprels =
  [['nsubj', 'root', 'obj', 'case', 'nmod', 'mark', 'cop', 'xcomp', 'nummod', 'obl:tmod', 'mark', 'det', 'nsubj', 'advcl', 'case', 'obl', 'punct'],
   ['det', 'amod', 'amod', 'nsubj', 'root', 'nmod:poss', 'obj', 'xcomp', 'advmod', 'punct'],
   ['det', 'compound', 'punct', 'compound', 'compound', 'nsubj', 'root', 'nmod:poss', 'obj', 'case', 'nummod', 'obl', 'punct']]

For details on the deprel labels, refer to the Universal Dependencies documentation.

Here is the overall code.

import numpy as np
import sys
import io
import os
import stanza
#read the text
with io.open('articles_u.txt', encoding='utf-8') as f:
    text = f.read()
texts = text.replace('eos', '.\n').splitlines()
#load stanza
stanza.download('en')
nlp = stanza.Pipeline(lang='en')
chars = []
lemmas = []
poses = []
xposes = []
heads = []
deprels = []
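#parse the first three sentences and collect per-word information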
for line in range(3):
    char = []
    lemma = []
    pos = []
    xpos = []
    head = []
    deprel = []
    print('analyzing: '+str(line+1)+' / '+str(len(texts)), end='\r')
    doc = nlp(texts[line])
    for word in doc.sentences[0].words:
        char.append(word.text)
        lemma.append(word.lemma)
        pos.append(word.pos)
        xpos.append(word.xpos)
        deprel.append(word.deprel)
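    #resolve each head id to the lemma of the head word ('root' if head is 0)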
    for word in doc.sentences[0].words:
        head.extend([lemma[word.head-1] if word.head > 0 else "root"])
    chars.append(char)
    lemmas.append(lemma)
    poses.append(pos)
    xposes.append(xpos)
    heads.append(head)
    deprels.append(deprel)