Lecture 3. POS tagging. Key word and phrase extraction

Contents

1 Part of speech (POS)
2 POS ambiguity
3 POS taggers
- 3.1 NLTK POS default tagger
4 Exercise 3.1 Genre comparison
5 Key word and phrase extraction
6 Supervised methods for key word and phrase extraction
7 Unsupervised methods for key word and phrase extraction from a single text
8 Bigram association measures
- 8.1 Bigram association measures in NLTK
9 TextRank: using graph centrality measures for key word and phrase extraction (1) [Mihalcea, Tarau, 2004]
10 Unsupervised methods for key word and phrase selection from a text in a collection
11 Variants of TF and IDF weights
12 TF-IDF in NLTK
13 TF-IDF alternatives
14 Using TF-IDF to measure text similarity

Part of speech (POS)

Part of speech [Manning, Schütze, 1999]

Words of a language are grouped into classes which show similar syntactic behavior. These word classes are called parts of speech (POS). Three important parts of speech are noun, verb, and adjective. The major types of morphological process are inflection, derivation, and compounding.

Different schools distinguish around nine parts of speech. They are listed here with their tags and the grammatical categories they inflect for:

Nouns (NN, NP), pronouns (PN, PRP), adjectives (JJ): number, gender, case
Adjectives (JJ): comparative, superlative, short form
Verbs (VB): subject number, subject person, tense, aspect, modality, participles, voice
Adverbs (RB), prepositions (IN), conjunctions (CC, CS), articles (AT), and particles (RP): none

POS ambiguity

Ship (noun or verb?)

a luxury cruise ship
Both products are due to ship at the beginning of June
A new engine was shipped over from the US
The port is closed to all shipping

Contest (noun or verb?)

Stone decided to hold a contest to see who could write the best song.
She plans to contest a seat in Congress next year.

POS taggers

Corpus- or dictionary-based VS rule-based
N-gram-based taggers:
- unigram tagging: assign each word its most frequent tag
- n-gram tagging: look at the context of the n previous words (requires a lot of training data)
Trade-off between accuracy and coverage: combine different taggers, for example by backing off from an n-gram tagger to a unigram tagger
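
The idea of combining taggers can be sketched in plain Python. This is a toy illustration, not NLTK's implementation: the training sentences, the tag choices, and the helper names (train_unigram, train_bigram, tag) are all hypothetical, and the "context" here is assumed to be just the tag of the previous word. The bigram model is accurate but covers few cases; when it has no entry, the tagger backs off to the unigram model and then to a default tag.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus (hypothetical, for illustration only).
TRAIN = [
    [("a", "AT"), ("luxury", "NN"), ("cruise", "NN"), ("ship", "NN")],
    [("products", "NNS"), ("are", "VBP"), ("due", "JJ"), ("to", "TO"), ("ship", "VB")],
    [("the", "AT"), ("ship", "NN"), ("sailed", "VBD")],
    [("hold", "VB"), ("a", "AT"), ("contest", "NN")],
    [("plans", "VBZ"), ("to", "TO"), ("contest", "VB"), ("a", "AT"), ("seat", "NN")],
]

def train_unigram(sents):
    """Map each word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for sent in sents:
        for word, tag_ in sent:
            counts[word][tag_] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def train_bigram(sents):
    """Map (previous tag, word) to the most frequent tag in that context."""
    counts = defaultdict(Counter)
    for sent in sents:
        prev = "<s>"  # sentence-start marker
        for word, tag_ in sent:
            counts[(prev, word)][tag_] += 1
            prev = tag_
    return {k: c.most_common(1)[0][0] for k, c in counts.items()}

def tag(words, bigram, unigram, default="NN"):
    """Try the bigram model first, then the unigram model, then a default tag."""
    tagged, prev = [], "<s>"
    for w in words:
        t = bigram.get((prev, w)) or unigram.get(w) or default
        tagged.append((w, t))
        prev = t
    return tagged

UNIGRAM = train_unigram(TRAIN)
BIGRAM = train_bigram(TRAIN)

print(tag(["to", "ship"], BIGRAM, UNIGRAM))   # context decides: verb
print(tag(["the", "ship"], BIGRAM, UNIGRAM))  # context decides: noun
print(tag(["engine"], BIGRAM, UNIGRAM))       # unseen word: default tag
```

On its own, the unigram model would always tag "ship" as a noun, because that is its most frequent tag in the toy corpus; the previous tag is what separates the two readings.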

NLTK POS default tagger

In[1]: from nltk.tag import pos_tag

In[2]: print(pos_tag(['ship']))

Out[2]: [('ship', 'NN')]

In[3]: print(pos_tag(['shipping']))

Out[3]: [('shipping', 'VBG')]

See [1] for more details on learning taggers.

Exercise 3.1 Genre comparison

Text genre [Santini, Sharoff, 2009]

The concept of genre is hard to agree upon. Many interpretations have been proposed since Aristotle's Poetics without reaching any definite conclusions about the inventory or even principles for classifying documents into genres. The lack of an agreed definition of what genre is causes the problem of the loose boundaries between the term "genre" with other neighbouring terms, such as "register", "domain", "topic", and "style".

Exercise 3.1

Input: two texts of different genres (for example, a Wikipedia article and a blog post)
Output: a ranking of all POS tags by frequency for each text

How can you describe the difference between two genres?
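
One possible starting point for the exercise is sketched here. It assumes the texts are already tagged (for instance with pos_tag) and that "ranking" means ordering tags by relative frequency. The helper name rank_tags and the two tiny tagged samples are hypothetical; real input would be whole documents.

```python
from collections import Counter

def rank_tags(tagged_tokens):
    """Return (tag, relative frequency) pairs, most frequent tag first."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return [(tag, n / total) for tag, n in counts.most_common()]

# Hypothetical pre-tagged samples standing in for the two texts.
wiki = [("The", "DT"), ("ship", "NN"), ("carries", "VBZ"),
        ("cargo", "NN"), ("and", "CC"), ("passengers", "NNS")]
blog = [("I", "PRP"), ("really", "RB"), ("think", "VBP"),
        ("I", "PRP"), ("love", "VBP"), ("it", "PRP")]

for name, text in (("wiki", wiki), ("blog", blog)):
    print(name, rank_tags(text))
```

Relative frequencies are used instead of raw counts so that texts of different lengths can be compared directly.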

Key word and phrase extraction

There are many definitions of key word and phrase. Thus there are many methods for their extraction:

supervised VS unsupervised
frequency-based VS more complex
from individual text VS from text collection
word (unigram) VS bigram VS ngram
term VS named entity VS collocation
sequential words VS using window

Supervised methods for key word and phrase extraction

I am a word. Am I a key word? Let us build a classifier.

Am I at the beginning or at the end of the sentence?
Am I capitalized?
How many times do I occur?
Am I used in Wikipedia as a title of a category or an article?
Am I a term?
Am I a named entity (NE)?
etc.

But to train the classifier we need a collection of annotated (marked-up) texts!
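
Some of the questions above can be turned into classifier features roughly as follows. This is only a sketch: the function name keyword_features, the feature names, and the wiki_titles set are hypothetical, and the classifier itself (along with the annotated collection it needs) is left out.

```python
def keyword_features(word, sentence, doc_tokens, wiki_titles):
    """Describe one candidate word as a dictionary of simple features."""
    lower = word.lower()
    return {
        "sentence_initial": sentence[0] == word,
        "sentence_final": sentence[-1] == word,
        "capitalized": word[:1].isupper(),
        # Case-insensitive number of occurrences in the whole document.
        "frequency": sum(1 for t in doc_tokens if t.lower() == lower),
        # wiki_titles is an assumed, precomputed set of lower-cased titles.
        "wikipedia_title": lower in wiki_titles,
    }

# Hypothetical two-sentence document and title set.
sentences = [
    ["TextRank", "ranks", "words", "in", "a", "graph"],
    ["We", "compare", "it", "with", "TextRank"],
]
doc_tokens = [token for sent in sentences for token in sent]
wiki_titles = {"textrank", "graph"}

features = keyword_features("TextRank", sentences[0], doc_tokens, wiki_titles)
print(features)
```

Each annotated word would yield one such dictionary together with a key word / not key word label, which is the input format most off-the-shelf classifiers can be adapted to.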
