Lecture 2. Tokenization and word counts
Contents
- 1 How many words?
- 2 Zipf's law
- 3 Heaps' law
- 4 Why is tokenization difficult?
- 5 Rule-based tokenization
- 6 Sentence segmentation
- 7 Natural Language Toolkit (NLTK)
- 8 Learning to tokenize
- 9 Exercise 1.1 Word counts
- 10 Lemmatization (Normalization)
- 11 Stemming
- 12 Exercise 1.2 Word counts (continued)
- 13 Exercise 1.3 Do we need all words?