Lecture 2. Tokenization and word counts: difference between revisions

Revision as of 00:33, 24 August 2015

Contents

1 How many words?
- 1.1 Type and token
2 Zipf's law
- 2.1 Zipf's law ([Gelbukh, Sidorov, 2001])
3 Heaps' law
4 Why is tokenization difficult?
5 Rule-based tokenization
- 5.1 RE in Python
6 Sentence segmentation
- 6.1 Binary classifier
7 Natural Language Toolkit (NLTK)
- 7.1 Learning to tokenize
  - 7.1.1 Punkt tokenizer
8 Exercise 1.1 Word counts
9 Lemmatization (Normalization)
10 Stemming
11 Exercise 1.2 Word counts (continued)
12 Exercise 1.3 Do we need all words?

How many words?

"The rain in Spain stays mainly in the plain."

9 tokens: The, rain, in, Spain, stays, mainly, in, the, plain

7 (or 8) types: the, rain, in, Spain, stays, mainly, plain (8 if "The" and "the" are counted as distinct types)

Type and token

Type is an element of the vocabulary.

Token is an instance of that type in the text.

N = number of tokens;

V = vocabulary (i.e. the set of all types);

|V| = size of vocabulary (i.e. number of types).
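As a minimal sketch of these definitions, the counts for the example sentence can be reproduced in Python. The letters-only regular expression and the comparison of case-sensitive and case-folded types are assumptions made for illustration, not part of the lecture.

```python
import re

text = "The rain in Spain stays mainly in the plain."

# Assumption: a token is a maximal run of ASCII letters.
tokens = re.findall(r"[A-Za-z]+", text)

# Types depend on whether "The" and "the" are treated as the same word.
types_case_sensitive = set(tokens)
types_case_folded = {t.lower() for t in tokens}

N = len(tokens)
print("N =", N)
print("|V| (case-sensitive) =", len(types_case_sensitive))
print("|V| (case-folded)    =", len(types_case_folded))
```

Whether the sentence has 7 or 8 types depends only on whether "The" and "the" are merged.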

How are N and |V| related?

Zipf's law

Zipf's law ([Gelbukh, Sidorov, 2001])

In any sufficiently large text, the frequency of a type is inversely proportional to its frequency rank (ranks are counted from the most frequent type):

<math> f \propto 1/r </math>

f — frequency of a type;

r — rank of a type (its position in the list of all types in order of their frequency of occurrence).
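One rough way to check this relation on your own data is to sort the types by frequency and look at the product of rank and frequency, which should stay roughly constant when frequency is inversely proportional to rank. The helper name `rank_frequency` and the toy token list are hypothetical; a meaningful check needs a large corpus.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, type, frequency) triples, most frequent type first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]

# Toy data, only to show the shape of the output.
toy_tokens = "a a a a b b c d".split()
table = rank_frequency(toy_tokens)

for rank, word, freq in table:
    print(rank, word, freq, rank * freq)
```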

Heaps' law

Heaps' law ([Gelbukh, Sidorov, 2001])

The number of different types in a text is roughly proportional to a power of its size: <math> |V| = K \cdot N^b </math>

N = number of tokens;

|V| = size of vocabulary (i.e. number of types);

K, b = free parameters, <math> K \in [10, 100],\ b \in [0.4, 0.6] </math>
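The formula can be evaluated directly, and the actual growth of the vocabulary can be measured by scanning the tokens once. In the sketch below the function names and the default values K = 50 and b = 0.5 are assumptions picked from inside the ranges above, not fitted parameters.

```python
def heaps_vocabulary_size(n_tokens, k=50.0, b=0.5):
    """Predicted vocabulary size |V| = K * N^b (k and b are illustrative)."""
    return k * n_tokens ** b

def vocabulary_growth(tokens):
    """Observed |V| after each token: growth[i] is the number of types in tokens[:i+1]."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth

predicted = heaps_vocabulary_size(10_000)
observed = vocabulary_growth(["the", "rain", "in", "spain", "in", "the", "plain"])
print(predicted)
print(observed)
```

Fitting K and b to a real corpus would amount to comparing the observed growth curve with the predicted one, for example by a linear fit in log-log space.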

Why is tokenization difficult?

Easy example: "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks."
- is "." a token?
- is "$3.88" a single token?
- is "New York" a single token?
Real data may contain noise: code, markup, URLs, faulty punctuation.
Real data contains misspellings: "an dthen she aksed".
A period "." does not always mean the end of a sentence: m.p.h., PhD.
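Even the easy example shows the problem when the simplest possible rule is applied. The sketch below splits on whitespace only, which is an assumption chosen as a baseline, and lists the tokens that end up with a period glued to them.

```python
s = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks."

# Baseline: split on whitespace only.
whitespace_tokens = s.split()
print(whitespace_tokens)

# Tokens that still carry a sentence-final period.
glued = [t for t in whitespace_tokens if t.endswith(".")]
print(glued)
```

Whitespace splitting keeps "$3.88" together but leaves "York.", "them." and "Thanks." as tokens, so the period is never separated from the word before it.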

Nevertheless, tokenization is important for all subsequent text processing steps. Tokenizers can be built with rule-based or machine learning-based approaches.

Rule-based tokenization

For example, define a token as a sequence of upper- and lower-case letters: A-Za-z. Regular expressions are a convenient tool for programming such rules.

RE in Python

In[1]: import re

In[2]: prog = re.compile(r'[A-Za-z]+')

In[3]: prog.findall("Words, words, words.")

Out[3]: ['Words', 'words', 'words']
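The letters-only rule drops everything that is not a letter, including prices and punctuation. The sketch below compares it with a slightly richer pattern on the muffin sentence; the extended pattern is a hypothetical illustration, not the rule used in the lecture.

```python
import re

s = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks."

# Rule from the text: a token is a run of ASCII letters.
letters_only = re.compile(r"[A-Za-z]+")

# Hypothetical extension: prices or numbers first, then words,
# then any single non-space punctuation character.
extended = re.compile(r"\$?\d+(?:\.\d+)?|\w+|[^\w\s]")

letters_tokens = letters_only.findall(s)
extended_tokens = extended.findall(s)
print(letters_tokens)
print(extended_tokens)
```

The order of the alternatives matters: the price pattern comes first so that "$3.88" is matched as a whole before the more general word and punctuation patterns get a chance.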

Sentence segmentation

What are the sentence boundaries?

? and ! are usually unambiguous.
The period "." is an issue.
Direct speech is also an issue: She said, "What time will you be home?" and I said, "I don't know!". Even worse in Russian!
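To see why the period is the hard case, consider a naive rule: a sentence ends at ".", "?" or "!" followed by whitespace. The function name and the second test sentence below are assumptions made for illustration.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '?' or '!' whenever whitespace follows."""
    return [part for part in re.split(r"(?<=[.?!])\s+", text) if part]

easy = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks."
hard = "He drove at 60 m.p.h. on the road. Then he stopped."

easy_sentences = naive_sentence_split(easy)
hard_sentences = naive_sentence_split(hard)
print(easy_sentences)
print(hard_sentences)
```

The rule handles the first text correctly but cuts the second one after the abbreviation "m.p.h.", producing a spurious sentence boundary.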

Let us learn a classifier for sentence segmentation.

Binary classifier

A binary classifier <math> f : X \to \{0, 1\} </math> takes input data X (a set of sentences) and decides EndOfSentence (0) or NotEndOfSentence (1).

What features can be used for classification? I am a period: am I EndOfSentence?

Lots of blanks after me?
Lots of lower case letters and ? or ! after me?
Do I belong to an abbreviation?
etc.

This requires a lot of hand-annotated data.
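The questions above can be turned into a feature dictionary for each period, which a classifier could then be trained on. The sketch below only extracts features and does not train anything; the feature names, the function name and the small abbreviation list are all hypothetical.

```python
# Hypothetical abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"m.p.h.", "ph.d.", "dr.", "mr.", "mrs."}

def period_features(text, i):
    """Features describing the period at index i of text."""
    if text[i] != ".":
        raise ValueError("index i must point at a period")
    rest = text[i + 1:]
    stripped = rest.lstrip(" ")
    # The whitespace-delimited token that contains this period.
    start = text.rfind(" ", 0, i) + 1
    end = text.find(" ", i)
    if end == -1:
        end = len(text)
    token = text[start:end]
    return {
        "blanks_after": len(rest) - len(stripped),
        "next_is_upper": stripped[:1].isupper(),
        "next_is_lower": stripped[:1].islower(),
        "in_abbreviation": token.lower() in ABBREVIATIONS,
        "is_last_char": rest == "",
    }

text = "He drove at 60 m.p.h. on the road. Then he stopped."
abbr_period = period_features(text, text.index("m.p.h.") + 5)
road_period = period_features(text, text.index("road.") + 4)
print(abbr_period)
print(road_period)
```

Each dictionary would be paired with a hand-assigned label (EndOfSentence or NotEndOfSentence) to form one training example.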

Natural Language Toolkit (NLTK)

Do we need to program this ourselves? No! The Natural Language Toolkit (NLTK) has tools for everything.

NLTK tokenizers

In[1]: from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

In[2]: s = 'Good muffins cost $3.88 in New York. Please buy me two of them. Thanks.'

In[3]: tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')

In[4]: tokenizer.tokenize(s)

In[5]: wordpunct_tokenize(s)

Learning to tokenize

nltk.tokenize.punkt is a tool for learning sentence tokenization from your own data. It includes a pre-trained Punkt tokenizer for English.

Punkt tokenizer

In[1]: import nltk.data

In[2]: sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

In[3]: sent_detector.tokenize(s)

