Lecture 3. POS tagging. Key word and phrase extraction
Author of this revision: Polidson
Revision as of 00:57, 24 August 2015
Contents
- 1 Part of speech (POS)
- 2 POS ambiguity
- 3 POS taggers
- 4 Exercise 3.1 Genre comparison
- 5 Key word and phrase extraction
- 6 Supervised methods for key word and phrase extraction
- 7 Unsupervised methods for key word and phrase extraction from a single text
- 8 Bigram association measures
- 9 TextRank: using graph centrality measures for key word and phrase extraction (1) [Mihalcea, Tarau, 2004]
- 10 Unsupervised methods for key word and phrase selection from a text in a collection
- 11 Variants of TF and IDF weights
- 12 TF-IDF in NLTK
- 13 TF-IDF alternatives
- 14 Using TF-IDF to measure text similarity
Part of speech (POS)
Part of speech [Manning, Schütze, 1999]
Words of a language are grouped into classes which show similar syntactic behavior. These word classes are called parts of speech (POS). Three important parts of speech are noun, verb, and adjective. The major types of morphological process are inflection, derivation, and compounding.
Depending on the grammatical tradition, about nine parts of speech are distinguished. Each has its own grammatical categories:
- Nouns (NN, NP), pronouns (PN, PRP), adjectives (JJ): number, gender, case
- Adjectives (JJ): comparative, superlative, short form
- Verbs (VB): subject number, subject person, tense, aspect, modality, participles, voice
- Adverbs (RB), prepositions (IN), conjunctions (CS), articles (AT), and particles (RP): no grammatical categories
POS ambiguity
Ship (noun or verb?)
- a luxury cruise ship
- Both products are due to ship at the beginning of June
- A new engine was shipped over from the US
- The port is closed to all shipping
Contest (noun or verb?)
- Stone decided to hold a contest to see who could write the best song.
- She plans to contest a seat in Congress next year.
POS taggers
- Corpus- or dictionary-based VS rule-based
- N-gram-based taggers:
  - unigram tagging: assign each word its most frequent tag
  - n-gram tagging: also use the context of the n-1 preceding tokens (requires a lot of training data)
- Trade-off between accuracy and coverage: combine different taggers
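The idea of a unigram tagger combined with a fallback can be sketched in plain Python. This is a minimal from-scratch illustration, not NLTK's implementation: the function names `train_unigram_tagger` and `tag`, the default tag `NN`, and the hand-tagged training sentences are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(tokens, model, default_tag="NN"):
    """Tag tokens with the unigram model; back off to default_tag for unseen words."""
    return [(tok, model.get(tok.lower(), default_tag)) for tok in tokens]

# Hypothetical toy training data, hand-tagged for illustration only.
train = [
    [("the", "AT"), ("ship", "NN"), ("sails", "VBZ")],
    [("they", "PRP"), ("ship", "VB"), ("the", "AT"), ("engine", "NN")],
    [("a", "AT"), ("luxury", "NN"), ("cruise", "NN"), ("ship", "NN")],
]
model = train_unigram_tagger(train)

# "ship" was seen twice as NN and once as VB, so it is tagged NN;
# "arrives" was never seen, so the default tag is used.
tagged = tag(["the", "ship", "arrives"], model)
print(tagged)
```

The unigram model is accurate on words it has seen but covers nothing else, while the default tag covers every word but is often wrong. Chaining them is one simple way to trade accuracy against coverage.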
NLTK POS default tagger
In[1]: from nltk.tag import pos_tag
In[2]: pos_tag(['ship'])
Out[2]: [('ship', 'NN')]
In[3]: pos_tag(['shipping'])
Out[3]: [('shipping', 'VBG')]
See [1] for more details on learning taggers.
Exercise 3.1 Genre comparison
Text genre [Santini, Sharoff, 2009]
The concept of genre is hard to agree upon. Many interpretations have been proposed since Aristotle's Poetics without reaching any definite conclusions about the inventory or even the principles for classifying documents into genres. The lack of an agreed definition of genre leads to loose boundaries between the term "genre" and neighbouring terms such as "register", "domain", "topic", and "style".
Exercise 3.1
Input: two texts of different genres (for example, a Wikipedia article and a blog post)
Output: a ranking of all POS tags for each text
How can you describe the difference between the two genres?
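As a starting point, ranking the tags can be sketched with a frequency counter over already tagged tokens. The helper name `rank_tags` and the two hand-tagged fragments below are hypothetical. In the exercise the input would come from a POS tagger run over the real texts.

```python
from collections import Counter

def rank_tags(tagged_tokens):
    """Return (tag, relative frequency) pairs, most frequent tag first."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    return [(tag, n / total) for tag, n in counts.most_common()]

# Hypothetical hand-tagged fragments standing in for the two texts.
wiki = [("Moscow", "NP"), ("is", "VBZ"), ("the", "AT"),
        ("capital", "NN"), ("of", "IN"), ("Russia", "NP")]
blog = [("I", "PRP"), ("really", "RB"), ("love", "VB"), ("it", "PRP")]

wiki_rank = rank_tags(wiki)
blog_rank = rank_tags(blog)

for name, rank in (("wiki", wiki_rank), ("blog", blog_rank)):
    print(name, [(tag, round(share, 2)) for tag, share in rank])
```

Relative frequencies are used instead of raw counts so that texts of different length can be compared.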
Key word and phrase extraction
There is no single definition of a key word or key phrase, so extraction methods differ along several dimensions:
- supervised VS unsupervised
- frequency-based VS more complex
- from individual text VS from text collection
- word (unigram) VS bigram VS ngram
- term VS named entity VS collocation
- adjacent words VS words co-occurring within a window
Supervised methods for key word and phrase extraction
I am a word. Am I a key word? Let us build a classifier.
- Am I at the beginning or at the end of the sentence?
- Am I capitalized?
- How many times do I occur?
- Am I used in Wikipedia as a title of a category or an article?
- Am I a term?
- Am I a named entity (NE)?
- etc.
But we need a collection of marked up texts!
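A few of the questions above can be turned into classifier features with a short sketch. The function name `word_features` and the feature names are hypothetical, and only the position, capitalization, and frequency questions are covered. The Wikipedia, term, and named-entity checks need external resources and are left out.

```python
from collections import Counter

def word_features(sentences):
    """Build one feature dict per distinct word from tokenized sentences."""
    freq = Counter(tok.lower() for sent in sentences for tok in sent)
    features = {}
    for sent in sentences:
        for i, tok in enumerate(sent):
            f = features.setdefault(tok.lower(), {
                "count": freq[tok.lower()],
                "sentence_initial": False,
                "sentence_final": False,
                "capitalized": False,
            })
            f["sentence_initial"] |= (i == 0)
            f["sentence_final"] |= (i == len(sent) - 1)
            # Capitalization at the sentence start is uninformative,
            # so it is only counted in other positions.
            f["capitalized"] |= (i > 0 and tok[0].isupper())
    return features

# Hypothetical toy input: two tokenized sentences.
sentences = [["TextRank", "ranks", "words"], ["We", "use", "TextRank"]]
features = word_features(sentences)
print(features["textrank"])
```

Each feature dict would then be paired with a "key word" or "not a key word" label taken from the marked-up collection and passed to a classifier of your choice.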
Unsupervised methods for key word and phrase extraction from a single text
- POS patterns
- Association measures: PMI, T-Score, LLR
- Graph methods: TextRank [Mihalcea, Tarau, 2004]
- Syntactic patterns
Exercise 3.2
Input: sif1.txt (or your own text)
Key words: top <math> n_1 </math> nouns (NN)
Key phrases: top <math> n_2 </math> phrases that satisfy one of the following patterns: JJ + NN, NN + NN, NN + IN + NN
Output: list of key words and phrases
Hint: use nltk.ngrams to get ngrams.
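The pattern matching itself can be sketched without NLTK, assuming the text has already been tokenized and POS-tagged. The names `PATTERNS`, `ngrams`, `key_words`, and `key_phrases` and the toy input are hypothetical. The local `ngrams` helper only stands in for `nltk.ngrams`, and ranking by raw frequency is one possible choice, not the only one.

```python
from collections import Counter

# POS patterns from the exercise, written as tuples of tags.
PATTERNS = {("JJ", "NN"), ("NN", "NN"), ("NN", "IN", "NN")}

def ngrams(seq, n):
    """Yield all contiguous n-grams of seq as tuples."""
    return zip(*(seq[i:] for i in range(n)))

def key_words(tagged_tokens, top_n):
    """Most frequent nouns (tag NN)."""
    counts = Counter(w.lower() for w, t in tagged_tokens if t == "NN")
    return counts.most_common(top_n)

def key_phrases(tagged_tokens, top_n):
    """Most frequent n-grams whose tag sequence matches one of PATTERNS."""
    counts = Counter()
    for n in sorted({len(p) for p in PATTERNS}):
        for gram in ngrams(tagged_tokens, n):
            words, tags = zip(*gram)
            if tags in PATTERNS:
                counts[" ".join(w.lower() for w in words)] += 1
    return counts.most_common(top_n)

# Hypothetical hand-tagged input for illustration only.
tagged = [("natural", "JJ"), ("language", "NN"), ("processing", "NN"),
          ("of", "IN"), ("text", "NN"), ("uses", "VBZ"),
          ("natural", "JJ"), ("language", "NN")]

print(key_words(tagged, 1))
print(key_phrases(tagged, 3))
```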
Bigram association measures
Bigram association measures in NLTK
NLTK BigramCollocationFinder
In[1]: import nltk
In[2]: from nltk.collocations import BigramCollocationFinder
In[3]: bigram_measures = nltk.collocations.BigramAssocMeasures()
In[4]: finder = BigramCollocationFinder.from_words(tokens)
In[5]: finder.apply_freq_filter(3)  # ignore bigrams that occur fewer than 3 times
In[6]: for i in finder.nbest(bigram_measures.pmi, 20):
...
Bigram measures:
- bigram_measures.pmi
- bigram_measures.student_t
- bigram_measures.chi_sq
- bigram_measures.likelihood_ratio
See [2] (http://www.nltk.org/_modules/nltk/metrics/association.html) for more bigram association measures.
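To make two of these measures concrete, here is a from-scratch sketch of PMI and the t-score computed from raw counts. It is an illustration under stated assumptions rather than NLTK's code: the function name `bigram_scores` is hypothetical, bigrams are adjacent word pairs, and the total N is taken to be the number of tokens. With bigram count n12 and expected count E = n1 * n2 / N, PMI is log2(n12 / E) and the t-score is (n12 - E) / sqrt(n12).

```python
import math
from collections import Counter

def bigram_scores(tokens, min_freq=1):
    """Score adjacent word pairs with PMI and the t-score."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), n12 in bigrams.items():
        if n12 < min_freq:
            continue  # same role as a frequency filter: skip rare bigrams
        # Count expected if the two words were independent.
        expected = unigrams[w1] * unigrams[w2] / n
        pmi = math.log2(n12 / expected)
        t_score = (n12 - expected) / math.sqrt(n12)
        scores[(w1, w2)] = (pmi, t_score)
    return scores

# Hypothetical toy text for illustration only.
tokens = "new york is a big city and new york is busy".split()
scores = bigram_scores(tokens, min_freq=2)

for pair, (pmi, t) in sorted(scores.items(), key=lambda kv: -kv[1][0]):
    print(pair, round(pmi, 3), round(t, 3))
```

PMI tends to favour rare pairs, which is why a frequency filter is usually applied before ranking by it. The t-score, in contrast, favours frequent pairs.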