Lecture 2. Tokenization and word counts
Contents
- 1 How many words?
- 2 Zipf's law
- 3 Heaps' law
- 4 Why is tokenization difficult?
- 5 Rule-based tokenization
- 6 Sentence segmentation
- 7 Natural Language Toolkit (NLTK)
- 8 Learning to tokenize
- 9 Exercise 1.1 Word counts
- 10 Lemmatization (Normalization)
- 11 Stemming
- 12 Exercise 1.2 Word counts (continued)
- 13 Exercise 1.3 Do we need all words?
How many words?
"The rain in Spain stays mainly in the plain." 9 tokens: The, rain, in, Spain, stays, mainly, in, the, plain 7 (or 8) types: T = the rain, in, Spain, stays, mainly, plain
Type and token
A type is an element of the vocabulary.

A token is an instance of that type in the text.
N = the number of tokens;

V = the vocabulary (the set of all types);

|V| = the size of the vocabulary (the number of types).
How are N and |V| related?
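For concreteness, here is a minimal Python sketch of these counts (a naive regex tokenizer is assumed here, not the rule-based or learned tokenizers discussed later in the lecture):

 import re
 
 text = "The rain in Spain stays mainly in the plain."
 
 # Naive tokenization: maximal runs of word characters (punctuation dropped).
 tokens = re.findall(r"\w+", text)
 print(len(tokens))                       # N = 9 tokens
 
 # Case-sensitive vocabulary: "The" and "the" are distinct types.
 print(len(set(tokens)))                  # |V| = 8 types
 
 # Case-folded vocabulary: "The" and "the" collapse into one type.
 print(len({t.lower() for t in tokens}))  # |V| = 7 types

Whether 7 or 8 is the "right" vocabulary size depends on whether case normalization is applied, which is part of the lemmatization/normalization question taken up later.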