Lecture 8. Suffix trees for NLP
Text (collection) representation model
Vector Space Models
Vector Space Model, VSM [Salton, Wong, Yang, 1975]: every text is a vector in the space of terms. It is usually used together with the cosine similarity measure.
Pro:
- Makes linear algebra operations, and thus machine learning, applicable
- Is simple and based on human intuition
Contra:
- The more words the texts contain, the higher the cosine similarity tends to be
- Word order is lost
- Needs additional hacks to use bigrams and n-grams as terms
Applications:
- Information retrieval
- Topic, genre and maybe sentiment classification
- Extractive summarization
- Spam filtering
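A minimal sketch of the VSM with cosine similarity, in Python; the toy texts and whitespace tokenization below are assumptions for illustration, not from the lecture:

```python
import math
from collections import Counter

def to_vector(text):
    """Bag-of-words: a text becomes a sparse term-count vector."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

d1 = to_vector("suffix trees index all substrings of a text")
d2 = to_vector("a suffix tree is an index over text substrings")
print(cosine(d1, d2))  # higher value = more similar term distributions
```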
Generalized Vector Space Models
Generalized Vector Space Model, GVSM [Wong, Ziarko, Wong, 1985]: introduces term-to-term correlations into the VSM.
Pro:
- Is still simple
- Synonyms or near synonyms are introduced into the model
- Term-to-term correlations may be computed in several ways: co-occurrences in the collection under consideration, co-occurrences in a large corpus, or thesaurus-based relations
Contra:
- A lot of computations: if |V| = n (the number of terms), there are 2^n dimensions in the GVSM
- Word order is lost
- Needs additional hacks to use bigrams and n-grams as terms
Applications: mainly information retrieval
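A hedged sketch of the GVSM idea: document similarity is computed through a term-to-term correlation matrix G, here approximated by co-occurrence counts in the collection itself (one of the options named above); the documents and vocabulary are toy assumptions:

```python
import numpy as np

docs = [
    ["suffix", "tree", "index"],
    ["suffix", "array", "index"],
    ["tree", "array", "structure"],
]
vocab = sorted({t for d in docs for t in d})
idx = {t: i for i, t in enumerate(vocab)}

# Term-document count matrix X: rows are terms, columns are documents.
X = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for t in d:
        X[idx[t], j] += 1

# G[i, k]: how often terms i and k co-occur in the same document,
# so correlated (near-synonymous) terms contribute to the similarity.
G = X @ X.T

def gvsm_sim(u, v):
    """GVSM-style similarity u' G v, normalized by the G-induced norms."""
    return (u @ G @ v) / (np.sqrt(u @ G @ u) * np.sqrt(v @ G @ v))

print(gvsm_sim(X[:, 0], X[:, 1]))  # > 0 even with few shared terms
```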
Vector Space Model of Semantics
Vector Space Model of Semantics [Turney, Pantel, 2010]: terms are vectors in a space of contexts (usually, contexts are also terms).
Pro:
- Is still simple
- Allows finding semantically similar words
- Very popular in the scientific community
Contra:
- Word order is lost
- Needs additional hacks to use bigrams and n-grams as terms
Applications:
- Word similarity detection, clustering, classification
- Automatic thesaurus generation
- Word sense disambiguation
- Context-sensitive spelling correction
- Textual advertising
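A minimal sketch of the word-context model: each word becomes a vector of co-occurrence counts with its neighbors inside a small window, so words that appear in similar contexts ("cat" and "dog" below) get similar vectors. The corpus and window size are toy assumptions:

```python
import math
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 2
contexts = defaultdict(Counter)

# Count, for every word, which words occur within the window around it.
for i, w in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            contexts[w][corpus[j]] += 1

def cosine(u, v):
    dot = sum(u[c] * v[c] for c in u if c in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

print(cosine(contexts["cat"], contexts["dog"]))  # high: similar contexts
```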
Language Model [Ponte, Croft, 1998]
Language Model [Ponte, Croft, 1998]: a probability distribution over sequences of words.
Unigram model: p(t1 t2 t3) = p(t1) p(t2) p(t3)
Bigram model: p(t1 t2 t3) = p(t1 | BOS) p(t2 | t1) p(t3 | t2) p(EOS | t3)
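A toy maximum-likelihood bigram model matching the formula above; the two-sentence corpus is an assumption for illustration:

```python
from collections import Counter

sentences = [["the", "cat", "sat"], ["the", "dog", "sat"]]
bigrams = Counter()
unigrams = Counter()
for s in sentences:
    tokens = ["<BOS>"] + s + ["<EOS>"]
    for a, b in zip(tokens, tokens[1:]):
        bigrams[(a, b)] += 1
        unigrams[a] += 1

def p_bigram(sentence):
    """p(t1...tn) = p(t1|BOS) * p(t2|t1) * ... * p(EOS|tn)."""
    tokens = ["<BOS>"] + sentence + ["<EOS>"]
    p = 1.0
    for a, b in zip(tokens, tokens[1:]):
        p *= bigrams[(a, b)] / unigrams[a]  # MLE; zero for unseen pairs
    return p

print(p_bigram(["the", "cat", "sat"]))  # 0.5: "cat" follows "the" half the time
```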
Pro:
- The word order is not lost
- Makes Markov chain theory applicable
Contra:
- The problem of unseen words (see the smoothing sketch after this list)
- How to choose n (the order of the n-gram model)?
- Long-distance relations between words or sentences (for example, anaphora) are lost
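One standard remedy for the unseen-word problem is smoothing. A self-contained add-one (Laplace) sketch; the counts and vocabulary size are toy assumptions:

```python
from collections import Counter

bigram_counts = Counter({("the", "cat"): 1, ("the", "dog"): 1})
context_counts = Counter({"the": 2})
V = 4  # assumed vocabulary size

def p_add_one(a, b):
    """Add-one smoothed p(b|a): strictly positive even for unseen bigrams."""
    return (bigram_counts[(a, b)] + 1) / (context_counts[a] + V)

print(p_add_one("the", "cat"))  # (1+1)/(2+4) ~ 0.33, seen bigram
print(p_add_one("the", "rug"))  # (0+1)/(2+4) ~ 0.17, unseen yet nonzero
```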
Applications:
- Information retrieval
- Text generation
- Abstractive summarization
- Speech recognition
- Speech generation
To sum up:
- VSM-like models: no word order, algebraic structures, classification of any kind, stemming and lemmatization for dimension reduction
- LM-like models: word order partially preserved, Markov chains, generation of text and speech
- What if a new word appears? What if there is a typo? What if stemming is not always correct? Use edit distance on words or stems, or use character n-grams as terms (both sketched below), or ...
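A sketch of the two escape hatches just named: Levenshtein edit distance between word forms, and character n-grams used as terms; the example words are assumptions:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def char_ngrams(word, n=3):
    """Character n-grams tolerate typos and unseen word forms."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

print(edit_distance("suffix", "sufix"))              # 1: a single typo
print(char_ngrams("suffix") & char_ngrams("sufix"))  # shared trigrams
```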