Lecture 2. Tokenization and word counts


== How many words? ==

"The rain in Spain stays mainly in the plain." 9 tokens: The, rain, in, Spain, stays, mainly, in, the, plain 7 (or 8) types: T = the rain, in, Spain, stays, mainly, plain

=== Type and token ===

A ''type'' is an element of the vocabulary.

A ''token'' is an instance of that type in the text.


''N'' = number of tokens;

''V'' = vocabulary (i.e. the set of all types);

|''V''| = size of the vocabulary (i.e. number of types).
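
A minimal Python sketch (not part of the original notes) that computes ''N'' and |''V''| for the example sentence above, assuming naive whitespace tokenization and lowercasing so that ''The'' and ''the'' collapse into one type:

<pre>
# Count tokens (N) and types (|V|) with naive whitespace tokenization.
text = "The rain in Spain stays mainly in the plain."
tokens = text.rstrip(".").lower().split()

N = len(tokens)      # number of tokens
V = set(tokens)      # vocabulary: the set of distinct types

print(N)             # 9
print(len(V))        # 7
</pre>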

How are ''N'' and |''V''| related?

== Zipf's law ==

=== Zipf's law ([Gelbukh, Sidorov, 2001]) ===

In any sufficiently large text, the frequency of a type is inversely proportional to its rank:

''f'' ∝ 1/''r''

''f'' — frequency of a type;

''r'' — rank of a type (its position in the list of all types, ordered from the most to the least frequent).
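
As an illustration (this code is not in the original notes), one can rank types by frequency and check that the product ''f'' · ''r'' stays roughly constant when Zipf's law holds; the corpus path below is a placeholder:

<pre>
from collections import Counter

# Read any reasonably large plain-text corpus ("corpus.txt" is a placeholder).
with open("corpus.txt", encoding="utf-8") as fh:
    tokens = fh.read().lower().split()   # naive whitespace tokenization

counts = Counter(tokens)

# Types in order of decreasing frequency; print rank, frequency and f * r.
for rank, (word, freq) in enumerate(counts.most_common(20), start=1):
    print(rank, word, freq, freq * rank)
</pre>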

== Heaps' law ==
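
The original notes leave this section empty. The usual formulation of Heaps' law, which answers the question above about how ''N'' and |''V''| are related, is that vocabulary size grows as a power of the number of tokens, |''V''| ≈ k · N^β, with β typically between 0.4 and 0.6 for English text. A minimal sketch (an assumption-laden illustration, not the lecture's own code) that traces the (''N'', |''V''|) curve over a token stream:

<pre>
# Trace vocabulary growth |V| as a function of N ("corpus.txt" is a placeholder path).
with open("corpus.txt", encoding="utf-8") as fh:
    tokens = fh.read().lower().split()

seen = set()
for n, token in enumerate(tokens, start=1):
    seen.add(token)
    if n % 10000 == 0:
        print(n, len(seen))   # one point on the (N, |V|) curve
</pre>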

== Why is tokenization difficult? ==

== Rule-based tokenization ==
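
The original section has no code; a minimal regular-expression tokenizer (the rule below, a run of word characters or a single punctuation mark, is just an example) could look like this:

<pre>
import re

# Rule: a token is a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("The rain in Spain stays mainly in the plain."))
# ['The', 'rain', 'in', 'Spain', 'stays', 'mainly', 'in', 'the', 'plain', '.']
</pre>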

== Sentence segmentation ==

== Natural Language Toolkit (NLTK) ==
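
A short example of using NLTK for the two tasks above, word tokenization and sentence segmentation (a sketch, not the lecture's own code); it assumes the ''punkt'' tokenizer models have been downloaded:

<pre>
import nltk

nltk.download("punkt")   # tokenizer models; needed once

text = "The rain in Spain stays mainly in the plain. It rarely rains elsewhere."

print(nltk.sent_tokenize(text))   # sentence segmentation
print(nltk.word_tokenize(text))   # word tokenization
</pre>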

== Learning to tokenize ==

== Exercise 1.1 Word counts ==

== Lemmatization (Normalization) ==
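
As an illustration (not part of the original notes), NLTK's WordNet-based lemmatizer maps inflected forms to dictionary forms; it needs the WordNet data, and verbs have to be marked with a part-of-speech tag:

<pre>
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")   # WordNet data; newer NLTK versions may also need "omw-1.4"

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("mice"))            # mouse
print(lemmatizer.lemmatize("stays", pos="v"))  # stay
</pre>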

== Stemming ==
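
A matching sketch with NLTK's Porter stemmer, which strips suffixes by rule rather than looking words up, so the output need not be a real word:

<pre>
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["stays", "mainly", "plains", "tokenization"]:
    print(word, "->", stemmer.stem(word))
# typical outputs: stay, mainli, plain, token
</pre>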

== Exercise 1.2 Word counts (continued) ==

== Exercise 1.3 Do we need all words? ==