Lecture 2. Tokenization and word counts

== How many words? ==

"The rain in Spain stays mainly in the plain." 9 tokens: The, rain, in, Spain, stays, mainly, in, the, plain 7 (or 8) types: T = the rain, in, Spain, stays, mainly, plain

== Type and token ==

A type is an element of the vocabulary.

A token is an instance of that type in the text.


N = number of tokens;

V = vocabulary (i.e. all types);

|V| = size of vocabulary (i.e. number of types).
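
For the example sentence above, N and |V| can be computed in a few lines of Python. This is a minimal sketch: lowercasing the text and defining a token as a run of letters are simplifying assumptions, not part of any particular tokenizer.

<code>
In[1]: import re

In[2]: from collections import Counter

In[3]: text = "The rain in Spain stays mainly in the plain."

In[4]: tokens = re.findall('[a-z]+', text.lower())  # lowercase, letters-only tokens

In[5]: len(tokens)            # N, the number of tokens
Out[5]: 9

In[6]: len(Counter(tokens))   # |V|, the number of types
Out[6]: 7
</code>

Here the lowercasing merges "The" and "the" into a single type, which gives the count of 7.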

How are N and |V| related?

== Zipf's law ==

Zipf's law ([Gelbukh, Sidorov, 2001])

In any large enough text, the frequency of a type is inversely proportional to its rank (with ranks counted from the most frequent type):

<math> f \propto 1/r </math>

f — frequency of a type;

r — rank of a type (its position in the list of all types ordered by decreasing frequency).
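
The law is easy to inspect empirically. The sketch below is an illustration only: the file name corpus.txt is a placeholder for any sufficiently large plain-text corpus, and the letters-only tokenization is a simplification.

<code>
import re
from collections import Counter

# Placeholder corpus file; any large plain-text file will do.
text = open('corpus.txt', encoding='utf-8').read().lower()
tokens = re.findall('[a-z]+', text)

counts = Counter(tokens)
# Types in order of decreasing frequency; rank r starts at 1.
for r, (word, f) in enumerate(counts.most_common(20), start=1):
    print(r, word, f, f * r)   # under Zipf's law, f * r stays roughly constant
</code>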

== Heaps' law ==

Heaps' law ([Gelbukh, Sidorov, 2001])

The number of different types in a text grows roughly as a power of its size: <math> |V| = K \cdot N^b </math>

N = number of tokens;

|V| = size of vocabulary (i.e. number of types);

K, b — free parameters, <math> K \in [10; 100],\ b \in [0.4; 0.6] </math>
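
Heaps' law can also be checked directly by tracking |V| as a function of N while reading a corpus. As above, corpus.txt and the letters-only tokens are assumptions made for the sake of a short sketch.

<code>
import re

text = open('corpus.txt', encoding='utf-8').read().lower()  # placeholder corpus file
tokens = re.findall('[a-z]+', text)

vocabulary = set()
for n, token in enumerate(tokens, start=1):
    vocabulary.add(token)
    if n % 10000 == 0:
        # Plotting log |V| against log N should give a roughly straight
        # line whose slope estimates b.
        print(n, len(vocabulary))
</code>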

== Why is tokenization difficult? ==

* Easy example: "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks."
** Is "." a token?
** Is "$3.88" a single token?
** Is "New York" a single token?
* Real data may contain noise: code, markup, URLs, faulty punctuation.
* Real data contains misspellings: "an dthen she aksed".
* A period "." does not always mark the end of a sentence: m.p.h., PhD.


Nevertheless, tokenization is important for all subsequent text processing steps. There are rule-based and machine learning-based approaches to building tokenizers.

== Rule-based tokenization ==

For example, define a token as a sequence of upper- and lower-case letters (A-Za-z). Regular expressions are a convenient tool for implementing such rules.

=== RE in Python ===

<code>
In[1]: import re

In[2]: prog = re.compile('[A-Za-z]+')

In[3]: prog.findall("Words, words, words.")
Out[3]: ['Words', 'words', 'words']
</code>
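
Continuing the session above, the letters-only rule breaks down on the "difficult" example from the previous section: it silently drops "$3.88" and all punctuation. A slightly richer pattern is sketched below; it is an illustration, not a complete tokenizer, and it still leaves "New York" as two tokens (handling multi-word names needs rules or resources beyond a single regular expression).

<code>
In[4]: text = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks."

In[5]: prog.findall(text)   # the letters-only rule loses "$3.88" entirely
Out[5]: ['Good', 'muffins', 'cost', 'in', 'New', 'York', 'Please', 'buy', 'me', 'two', 'of', 'them', 'Thanks']

In[6]: richer = re.compile(r'\$?\d+(?:\.\d+)?|[A-Za-z]+|[.,!?]')  # numbers/prices, words, punctuation

In[7]: richer.findall(text)
Out[7]: ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
</code>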

== Sentence segmentation ==

== Natural Language Toolkit (NLTK) ==
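
A minimal sketch of sentence segmentation and word tokenization with NLTK is shown below. The pretrained punkt sentence splitter has to be downloaded once; the resource name and the exact output may differ between NLTK versions.

<code>
In[1]: import nltk

In[2]: nltk.download('punkt')   # one-time download of the pretrained sentence splitter

In[3]: text = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks."

In[4]: nltk.sent_tokenize(text)
Out[4]: ['Good muffins cost $3.88 in New York.', 'Please buy me two of them.', 'Thanks.']

In[5]: nltk.word_tokenize("Please buy me two of them.")
Out[5]: ['Please', 'buy', 'me', 'two', 'of', 'them', '.']
</code>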

== Learning to tokenize ==

== Exercise 1.1 Word counts ==

== Lemmatization (Normalization) ==

== Stemming ==
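
A minimal sketch of the two normalization steps named above, lemmatization and stemming, using NLTK's PorterStemmer and WordNetLemmatizer. The WordNet data needs a one-time download, and the exact outputs may vary between NLTK versions.

<code>
In[1]: import nltk

In[2]: nltk.download('wordnet')   # data for the WordNet lemmatizer

In[3]: from nltk.stem import PorterStemmer, WordNetLemmatizer

In[4]: PorterStemmer().stem('stays')            # stemming strips suffixes by rule
Out[4]: 'stay'

In[5]: WordNetLemmatizer().lemmatize('women')   # lemmatization maps to a dictionary form
Out[5]: 'woman'
</code>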

== Exercise 1.2 Word counts (continued) ==

== Exercise 1.3 Do we need all words? ==