Lecture 2. Tokenization and word counts
Contents
- 1 How many words?
- 2 Zipf's law
- 3 Heaps' law
- 4 Why is tokenization difficult?
- 5 Rule-based tokenization
- 6 Sentence segmentation
- 7 Natural Language Toolkit (NLTK)
- 8 Learning to tokenize
- 9 Exercise 1.1 Word counts
- 10 Lemmatization (Normalization)
- 11 Stemming
- 12 Exercise 1.2 Word counts (continued)
- 13 Exercise 1.3 Do we need all words?