Lecture 2. Tokenization and word counts

Contents

1 How many words?
- 1.1 Type and token
2 Zipf's law
- 2.1 Zipf's law ([Gelbukh, Sidorov, 2001])
3 Heaps' law
4 Why is tokenization difficult?
5 Rule-based tokenization
6 Sentence segmentation
7 Natural Language Toolkit (NLTK)
8 Learning to tokenize
9 Exercise 1.1 Word counts
10 Lemmatization (Normalization)
11 Stemming
12 Exercise 1.2 Word counts (continued)
13 Exercise 1.3 Do we need all words?

How many words?

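"The rain in Spain stays mainly in the plain."

9 tokens: The, rain, in, Spain, stays, mainly, in, the, plain

7 (or 8) types: the, rain, in, Spain, stays, mainly, plain. The count is 7 if "The" and "the" are treated as the same type, and 8 if they are counted as different types.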

Type and token

A type is an element of the vocabulary.

A token is an instance of a type in the text.

N = number of tokens;

V = vocabulary (i.e. the set of all types);

|V| = size of vocabulary (i.e. number of types).
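The counts for the example sentence can be reproduced with a minimal Python sketch. The tokenization here (stripping the final period and splitting on whitespace) is a simplifying assumption made only for this sentence, and the variable names are illustrative.

```python
text = "The rain in Spain stays mainly in the plain."

# Naive tokenization (an assumption for this example only):
# drop the sentence-final period and split on whitespace.
tokens = text.rstrip(".").split()

n_tokens = len(tokens)                      # N, the number of tokens
types_cased = set(tokens)                   # "The" and "the" are different types
types_folded = {t.lower() for t in tokens}  # "The" and "the" are the same type

print(n_tokens)           # 9
print(len(types_cased))   # 8
print(len(types_folded))  # 7
```

Whether |V| is 7 or 8 depends only on the decision to fold case before counting.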

How are N and |V| related?

Zipf's law

Zipf's law ([Gelbukh, Sidorov, 2001])

In any sufficiently large text, the frequency of a type is inversely proportional to its frequency rank (ranks are counted from the most frequent type):

f = C / r

f is the frequency of a type;

r is the rank of a type (its position in the list of all types sorted by decreasing frequency of occurrence);

C is a constant that depends on the text.
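One way to inspect this relation on real data is to build a rank-frequency table and look at the product of rank and frequency, which Zipf's law predicts to be roughly constant in a large text. The sketch below uses only the Python standard library; the function name `rank_frequency` is hypothetical, and the toy input is far too small for the law to hold, so it only shows the mechanics.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, type, frequency) triples, most frequent type first."""
    counts = Counter(tokens)
    # most_common() sorts types by decreasing frequency.
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

# Toy input (case already folded); substitute the tokens of a large corpus.
tokens = "the rain in spain stays mainly in the plain".split()
table = rank_frequency(tokens)

for rank, word, freq in table:
    # The last column is r * f, the quantity to compare across ranks.
    print(rank, word, freq, rank * freq)
```

On a large corpus, plotting frequency against rank on log-log axes gives a more readable picture than the raw table.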
