Lecture 3. POS tagging. Key word and phrase extraction

From Wiki - Faculty of Computer Science

Part of speech (POS)

Part of speech [Manning, Schütze, 1999]

Words of a language are grouped into classes that show similar syntactic behavior. These word classes are called parts of speech (POS). Three important parts of speech are noun, verb, and adjective. The major types of morphological process are inflection, derivation, and compounding.

Depending on the grammatical tradition, there are around nine parts of speech. Each group is listed below with the grammatical categories it inflects for:

  • Nouns (NN, NP), pronouns (PN, PRP), adjectives (JJ): number, gender, case
  • Adjectives (JJ): comparative, superlative, short form
  • Verbs (VB): subject number, subject person, tense, aspect, modality, participles, voice
  • Adverbs (RB), prepositions (IN), conjunctions (CS), articles (AT) and particles (RP): no inflection

POS ambiguity

Ship (noun or verb?)

  • a luxury cruise ship
  • Both products are due to ship at the beginning of June
  • A new engine was shipped over from the US
  • The port is closed to all shipping

Contest (noun or verb?)

  • Stone decided to hold a contest to see who could write the best song.
  • She plans to contest a seat in Congress next year.
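
The ambiguity shown in these examples can be detected mechanically once tokens carry tags. The following is a minimal sketch, assuming a small hand-tagged token list: the tags and the helper name `ambiguous_words` are illustrative assumptions, not taken from any corpus or library.

```python
from collections import defaultdict

# Hypothetical hand-tagged tokens loosely based on the example sentences above.
tagged_tokens = [
    ("a", "AT"), ("luxury", "NN"), ("cruise", "NN"), ("ship", "NN"),
    ("due", "JJ"), ("to", "TO"), ("ship", "VB"),
    ("hold", "VB"), ("a", "AT"), ("contest", "NN"),
    ("plans", "VBZ"), ("to", "TO"), ("contest", "VB"),
]

def ambiguous_words(tagged):
    """Map each word seen with more than one tag to its sorted list of tags."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged:
        tags_by_word[word.lower()].add(tag)
    return {w: sorted(tags) for w, tags in tags_by_word.items() if len(tags) > 1}

print(ambiguous_words(tagged_tokens))
```

Words that appear with a single tag, such as "luxury", are filtered out, so only "ship" and "contest" remain.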

POS taggers

  • Corpus- or dictionary-based vs. rule-based
  • N-gram-based taggers:
    • unigram tagging: assign each word its most frequent tag
    • n-gram tagging: choose the tag from the current word and the tags of the n-1 preceding words (requires a lot of training data)
  • Trade-off between accuracy and coverage: combine different taggers, e.g. back off from an n-gram tagger to a unigram tagger
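
The unigram/n-gram distinction and the idea of combining taggers can be sketched in plain Python. This is an illustrative sketch, not NLTK's implementation: the toy training sentences, the function names, the `<s>` start marker and the default tag `NN` are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: sentences as lists of (word, tag) pairs.
train = [
    [("the", "AT"), ("ship", "NN"), ("sails", "VBZ")],
    [("the", "AT"), ("big", "JJ"), ("ship", "NN")],
    [("they", "PPSS"), ("ship", "VB"), ("goods", "NNS")],
    [("we", "PPSS"), ("ship", "VB"), ("the", "AT"), ("goods", "NNS")],
    [("a", "AT"), ("ship", "NN"), ("sank", "VBD")],
]

def train_unigram(sentences):
    """word -> most frequent tag of that word in the training data."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def train_bigram(sentences):
    """(previous tag, word) -> most frequent tag in that context."""
    counts = defaultdict(Counter)
    for sent in sentences:
        prev = "<s>"  # assumed sentence-start marker
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def tag(words, bigram, unigram, default="NN"):
    """Try the bigram model, back off to unigram, then to a default tag."""
    prev, tagged = "<s>", []
    for w in words:
        t = bigram.get((prev, w)) or unigram.get(w) or default
        tagged.append((w, t))
        prev = t
    return tagged

unigram = train_unigram(train)
bigram = train_bigram(train)
print(unigram["ship"])
print(tag(["they", "ship", "the", "ship"], bigram, unigram))
```

The unigram model always tags "ship" as a noun, because that is its most frequent tag in the toy data. The bigram model resolves it to a verb after a pronoun and to a noun after an article, while unseen contexts and unseen words fall through to the simpler models.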

NLTK default POS tagger

In[1]: from nltk.tag import pos_tag

In[2]: pos_tag(['ship'])

Out[2]: [('ship', 'NN')]

In[3]: pos_tag(['shipping'])

Out[3]: [('shipping', 'VBG')]

See [1] for more details on learning taggers.

Exercise 3.1 Genre comparison

Key word and phrase extraction

Supervised methods for key word and phrase extraction

Unsupervised methods for key word and phrase extraction from a single text

Bigram association measures
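
One widely used bigram association measure is pointwise mutual information, PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The sketch below is an assumption-laden illustration in plain Python: the function name `pmi_scores`, the `min_count` threshold and the choice to estimate probabilities from raw relative frequencies are all choices made for this example, and library implementations may normalize differently.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """PMI for every adjacent word pair that occurs at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)       # number of unigram positions
    n_bi = len(tokens) - 1    # number of bigram positions
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Hypothetical toy text.
tokens = "new york is big and new york is old and the city is new".split()
for pair, score in sorted(pmi_scores(tokens).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

A positive score means the two words occur together more often than they would if they were independent. The frequency threshold matters because PMI is known to overrate pairs of very rare words.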

Bigram association measures in NLTK

TextRank: using graph centrality measures for key word and phrase extraction (1) [Mihalcea, Tarau, 2004]

Unsupervised methods for key word and phrase selection from a text in a collection

Variants of TF and IDF weights

TF-IDF in NLTK

TF-IDF alternatives

Using TF-IDF to measure text similarity
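
A common way to use TF-IDF for text similarity is to represent each text as a vector of TF-IDF weights and compare vectors by cosine similarity. The sketch below assumes one particular weighting, relative term frequency multiplied by log(N / df); the function names and the toy documents are hypothetical, and other TF and IDF variants can be substituted.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One sparse {term: weight} vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy collection of tokenized texts.
docs = [
    "the ship sails today".split(),
    "the ship sank yesterday".split(),
    "the contest starts today".split(),
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))
print(cosine(vecs[1], vecs[2]))
```

In this toy collection "the" occurs in every document, so its IDF is zero and it contributes nothing to the similarity. The second and third texts share no other word, so their similarity is zero.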