Lecture 2. Tokenization and word counts — различия между версиями
Polidson (обсуждение | вклад) (→Sentence segmentation) |
Polidson (обсуждение | вклад) (→Heaps' law) |
||
Строка 43: | Строка 43: | ||
|V| = size of vocabulary (i.e. number of types); | |V| = size of vocabulary (i.e. number of types); | ||
− | ''K'', ''b'' — free parameters, <math> K | + | ''K'', ''b'' — free parameters, <math> K ∩ [10; 100]; b ∩ [0.4; 0.6] </math> |
== Why tokenization is difficult? == | == Why tokenization is difficult? == |
Версия 00:26, 24 августа 2015
Содержание
How many words?
"The rain in Spain stays mainly in the plain." 9 tokens: The, rain, in, Spain, stays, mainly, in, the, plain 7 (or 8) types: T = the rain, in, Spain, stays, mainly, plain
Type and token
Type is an element of the vocabulary.
Token is an instance of that type in the text.
N = number of tokens;
V - vocabulary (i.e. all types);
|V| = size of vocabulary (i.e. number of types).
How are N and |V| related?
Zipf's law
Zipf's law ([Gelbukh, Sidorov, 2001])
In any large enough text, the frequency ranks (starting from the highest) of types are inversely proportional to the corresponding frequencies:
f = 1/r
f — frequency of a type;
r — rank of a type (its position in the list of all types in order of their frequency of occurrence).
Heaps' law
Heaps' law ([Gelbukh, Sidorov, 2001])
The number of different types in a text is roughly proportional to an exponent of its size: <math> |V| = K * N^b </math>
N = number of tokens;
|V| = size of vocabulary (i.e. number of types);
K, b — free parameters, <math> K ∩ [10; 100]; b ∩ [0.4; 0.6] </math>
Why tokenization is difficult?
- Easy example: "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks."
- is \." a token?
- is $3.88 a single token?
- is \New York" a single token?
- Real data may contain noise in it: code, markup, URLs, faulty punctuation
- Real data contains misspellings: "an dthen she aksed"
- Period "." does not always mean the end of sentence: m.p.h., PhD.
Nevertheless tokenization is important for all other text processing steps.
There are rule-based and machine learning-based approaches to development of tokenizers.
Rule-based tokenization
For example, define a token as a sequence of upper and lower case letters: A-Za-z. Reqular expression is a nice tool for programming such rules.
RE in Python
In[1]: import re
In[2]: prog = re.compile('[A-Za-z]+')
In[3]: prog.findall("Words, words, words.")
Out[1]: ['Words', 'words', 'words']
Sentence segmentation
What are the sentence boundaries?
- ?, ! are usually unambiguous
- Period "." is an issue
- Direct speech is also an issue: She said, "What time will you be home?" and I said, "I don't know!". Even worse in Russian!
Let us learn a classifier for sentence segmentation.
Binary classifier
A binary classifier <math> f : X ⇒ 0; 1 </math> takes input data X (a set of sentences) and decides EndOfSentence (0) or NotEndOfSentence (1).