Lecture 2. Tokenization and word counts: difference between revisions

From Wiki - Faculty of Computer Science

Revision as of 23:49, 22 August 2015

How many words?

"The rain in Spain stays mainly in the plain."

9 tokens: The, rain, in, Spain, stays, mainly, in, the, plain

7 (or 8) types: the, rain, in, Spain, stays, mainly, plain (8 if "The" and "the" are counted as separate types)
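The count above can be reproduced with a minimal sketch. It assumes a naive tokenizer that keeps runs of letters and drops punctuation; the regular expression and the names `tokens`, `types_cased` and `types_folded` are illustrative and not part of the lecture.

```python
import re

sentence = "The rain in Spain stays mainly in the plain."

# Naive tokenization (an assumption): keep runs of letters, drop punctuation.
tokens = re.findall(r"[A-Za-z]+", sentence)

# Types are the distinct tokens. Case folding merges "The" and "the".
types_cased = set(tokens)
types_folded = {t.lower() for t in tokens}

print(len(tokens), len(types_cased), len(types_folded))  # 9 8 7
```

Whether the answer is 7 or 8 depends only on the case-folding decision, which is why the type count is given with an alternative.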

Type and token

A type is an element of the vocabulary.

A token is an instance of a type in the text.


N = number of tokens;

V = vocabulary (i.e. the set of all types);

|V| = size of vocabulary (i.e. number of types).

How are N and |V| related?
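One way to explore this question empirically is to record |V| after each token is read. The sketch below is only an illustration: the function name `vocabulary_growth`, the whitespace tokenizer and the case folding are assumptions, and a real experiment would use a much larger text.

```python
def vocabulary_growth(tokens):
    """Return a list of (N, |V|) pairs, one per token read so far."""
    seen = set()
    growth = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token.lower())  # case folding is an assumption
        growth.append((n, len(seen)))
    return growth


# Whitespace splitting is a stand-in for a real tokenizer.
tokens = "The rain in Spain stays mainly in the plain".split()
growth = vocabulary_growth(tokens)
print(growth[-1])  # (9, 7)
```

Plotting the pairs for a long text shows how quickly new types keep appearing as N grows.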

Zipf's law

Heaps' law

Why is tokenization difficult?

Rule-based tokenization

Sentence segmentation

Natural Language Toolkit (NLTK)

Learning to tokenize

Exercise 1.1 Word counts

Lemmatization (Normalization)

Stemming

Exercise 1.2 Word counts (continued)

Exercise 1.3 Do we need all words?