Lecture 2. Tokenization and word counts

From the Wiki of the Faculty of Computer Science

Revision as of 00:19, 24 August 2015

How many words?

"The rain in Spain stays mainly in the plain."

9 tokens: The, rain, in, Spain, stays, mainly, in, the, plain

7 (or 8) types: the, rain, in, Spain, stays, mainly, plain (8 if "The" and "the" are counted as distinct types)

Type and token

Type is an element of the vocabulary.

Token is an instance of that type in the text.


N = number of tokens;

V = vocabulary (i.e. the set of all types);

|V| = size of vocabulary (i.e. number of types).
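The counts for the example sentence can be reproduced with a few lines of Python. This is only a minimal sketch: the `tokenize` helper and its regular expression are assumptions made for illustration (a naive "runs of letters" rule), not a tokenizer prescribed by the lecture.

```python
import re

def tokenize(text):
    # Naive illustrative tokenizer (an assumption): a token is a run of
    # ASCII letters or apostrophes; punctuation and spaces are dropped.
    return re.findall(r"[A-Za-z']+", text)

sentence = "The rain in Spain stays mainly in the plain."
tokens = tokenize(sentence)

# N: number of tokens (every occurrence counts).
n_tokens = len(tokens)

# |V|: number of types, with and without case folding.
types_cased = set(tokens)                    # "The" and "the" differ
types_folded = {t.lower() for t in tokens}   # "The" and "the" merge

print(n_tokens, len(types_cased), len(types_folded))  # 9 8 7
```

Whether the answer is 7 or 8 types depends only on the case-folding decision, which is one of the normalization choices a tokenizer has to make.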

How are N and |V| related?

Zipf's law

Zipf's law ([Gelbukh, Sidorov, 2001])

In any large enough text, the frequency of a type is approximately inversely proportional to its frequency rank (counting from the most frequent type):

<math> f = C / r </math>

f = frequency of a type;

r = rank of a type (its position in the list of all types sorted by decreasing frequency of occurrence);

C = a constant that depends on the text.
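A rank-frequency table is the usual way to check this relation on a corpus. The sketch below builds one with `collections.Counter`; the `rank_frequency` helper and the toy token list are assumptions for illustration. If the law held exactly, the product of rank and frequency would be the same for every type; a text this small is far too short to show that.

```python
from collections import Counter

def rank_frequency(tokens):
    # Count occurrences of each type, then list the types from most to
    # least frequent; the rank is the 1-based position in that list.
    counts = Counter(tokens)
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]

# Toy input (hypothetical, already lowercased and tokenized).
toy_tokens = "the rain in spain stays mainly in the plain the end".split()
table = rank_frequency(toy_tokens)

for rank, word, freq in table:
    # rank * freq is the quantity Zipf's law predicts to be roughly constant.
    print(rank, word, freq, rank * freq)
```

On a real corpus the same table is usually plotted on log-log axes, where the relation appears as an approximately straight line with slope close to -1.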

Heaps' law

Heaps' law ([Gelbukh, Sidorov, 2001])

The number of different types in a text is roughly proportional to a power of its size: <math> |V| = K \cdot N^b </math>

N = number of tokens;

|V| = size of vocabulary (i.e. number of types);

K, b = free parameters, <math> K \in [10, 100],\ b \in [0.4, 0.6] </math>
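Taking logarithms turns the law into a straight line, log |V| = log K + b log N, so K and b can be estimated by ordinary least squares on (N, |V|) pairs. The sketch below shows the idea under loud assumptions: the helper names are hypothetical, and the data points are synthetic, generated from the formula itself with illustrative values K = 40 and b = 0.5 picked from the ranges above. They are not measurements from any corpus.

```python
import math

def heaps_vocabulary_size(n_tokens, k, b):
    # Heaps' law prediction: |V| = K * N^b.
    return k * n_tokens ** b

def fit_heaps(points):
    # Least-squares fit of log|V| = log K + b * log N.
    # points: iterable of (N, |V|) pairs with N > 0 and |V| > 0.
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope of the line
    k = math.exp(mean_y - b * mean_x)    # intercept is log K
    return k, b

# Synthetic data (assumption): exact values of the law for K=40, b=0.5.
synthetic = [
    (n, heaps_vocabulary_size(n, 40.0, 0.5))
    for n in (1_000, 10_000, 100_000, 1_000_000)
]
k_hat, b_hat = fit_heaps(synthetic)
print(round(k_hat, 3), round(b_hat, 3))
```

With real text, the (N, |V|) pairs would come from counting tokens and types on growing prefixes of the corpus, and the fitted values would depend on the language and on the tokenization used.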

Why is tokenization difficult?

Rule-based tokenization

Sentence segmentation

Natural Language Toolkit (NLTK)

Learning to tokenize

Exercise 1.1 Word counts

Lemmatization (Normalization)

Stemming

Exercise 1.2 Word counts (continued)

Exercise 1.3 Do we need all words?