Lecture 6. Synonyms and near-synonyms detection
Page created and edited by Polidson.
Revision as of 03:23, 3 September 2015
Examples
- Synonyms: Netherlands and Holland; buy and purchase
- Near-synonyms: pants, trousers and slacks; mistake and error
Approaches to synonyms and near-synonyms detection
- Thesaurus-based approach
- Distributional semantics
- Context-based approach
- word2vec
- Web search-based approach
Synonyms in WordNet
Given a word, look up every synset (sense) that contains it; the other lemmas in each synset are its synonyms for that sense.
WordNet NLTK interface
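from nltk.corpus import wordnet as wn  # needs the WordNet data: nltk.download('wordnet')

# Print every sense (synset) of 'error' with its definition and synonyms (lemma names).
for i, synset in enumerate(wn.synsets('error')):
    print("Meaning", i, "NLTK ID:", synset.name())
    print("Definition:", synset.definition())
    print("Synonyms:", ", ".join(synset.lemma_names()))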
WordNet web interface: [1]
Distributional semantics
Exercise 6.1
Calculate PPMI for Table 1.
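For reference, positive pointwise mutual information for a term w and a context c is PPMI(w, c) = max(0, log2(P(w, c) / (P(w) P(c)))), where the probabilities are estimated from the co-occurrence counts. The sketch below applies this definition to a small count matrix. Table 1 is not reproduced on this page, so the counts and the words used here are hypothetical placeholders, and the function name ppmi is only illustrative.

```python
import math

# Hypothetical term-context counts (NOT Table 1 from the lecture).
counts = {
    "apricot": {"sugar": 3, "data": 1},
    "digital": {"sugar": 1, "data": 3},
}

def ppmi(counts):
    """Return a dict-of-dicts PPMI matrix for a dict-of-dicts count matrix."""
    total = sum(sum(row.values()) for row in counts.values())
    row_sums = {w: sum(row.values()) for w, row in counts.items()}
    col_sums = {}
    for row in counts.values():
        for c, n in row.items():
            col_sums[c] = col_sums.get(c, 0) + n
    result = {}
    for w, row in counts.items():
        result[w] = {}
        for c, n in row.items():
            if n == 0:
                # log(0) is undefined; a zero count gets PPMI 0.
                result[w][c] = 0.0
                continue
            p_wc = n / total
            p_w = row_sums[w] / total
            p_c = col_sums[c] / total
            # Negative PMI values are clipped to zero.
            result[w][c] = max(0.0, math.log2(p_wc / (p_w * p_c)))
    return result

ppmi_matrix = ppmi(counts)
for w, row in ppmi_matrix.items():
    print(w, row)
```

With these placeholder counts, pairs that co-occur more often than chance get a positive score, and pairs that co-occur less often than chance are clipped to 0.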
Exercise 6.2
Input: def.txt or your own text
Output 1: term-context matrix
Output 2: term-term similarity matrix (use cosine similarity)
Output 3: 2D visualization by means of LSA
Hint: use cfd = nltk.ConditionalFreqDist((term, context) for ...) for computing conditional frequency dictionary
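One possible way to produce Outputs 1 and 2 is sketched below. To stay dependency-free it uses collections.Counter in place of nltk.ConditionalFreqDist; both are built from (term, context) pairs and are indexed as counts[term][context]. The toy text, the window size of 2, and the helper names cooccurrence_pairs and cosine are assumptions made for illustration, not part of the exercise.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy text standing in for def.txt.
text = "the cat drinks milk . the dog drinks water . the cat chases the dog ."
tokens = text.split()
window = 2  # assumed context window: two tokens on each side

def cooccurrence_pairs(tokens, window):
    """Yield (term, context) pairs for every token and its neighbours."""
    for i, term in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield term, tokens[j]

# Same shape as nltk.ConditionalFreqDist(pairs): counts[term][context] -> count.
counts = defaultdict(Counter)
for term, context in cooccurrence_pairs(tokens, window):
    counts[term][context] += 1

vocab = sorted(counts)

# Output 1: term-context matrix, one row per term, one column per context word.
matrix = [[counts[t][c] for c in vocab] for t in vocab]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Output 2: term-term similarity matrix.
similarity = [[cosine(r1, r2) for r2 in matrix] for r1 in matrix]

i, j = vocab.index("cat"), vocab.index("dog")
print("cosine(cat, dog) =", round(similarity[i][j], 3))
```

Replacing the raw counts with PPMI weights before computing cosine similarity is a common refinement, but the exercise does not require it.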
Hint: use R for SVD and visualization
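The hint suggests R; as an alternative, the same step can be sketched in Python with numpy.linalg.svd. LSA here means taking the singular value decomposition of the term-context matrix and keeping only the two largest singular directions, which gives one 2D point per term. The terms and the count matrix below are hypothetical and only illustrate the shape of the computation.

```python
import numpy as np

# Hypothetical term-context counts; rows are terms, columns are contexts.
terms = ["cat", "dog", "milk", "water"]
X = np.array([
    [0, 2, 1, 0],
    [2, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
], dtype=float)

# Thin SVD: X = U @ diag(s) @ Vt, with singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# LSA keeps the top-k singular directions; k = 2 gives plottable coordinates.
k = 2
coords = U[:, :k] * s[:k]  # one (x, y) point per term

for term, (x, y) in zip(terms, coords):
    print(f"{term}: ({x:.3f}, {y:.3f})")
```

The rows of coords can then be passed to any plotting tool, for example a scatter plot with each point labelled by its term.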
word2vec [Mikolov, Chen, Corrado, Dean, 2013]
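A shallow neural network (a single hidden layer, so not deep learning in the strict sense) that learns dense word vectors by predicting words from their contexts. It uses the same word-context co-occurrence information that a term-context matrix records, but it is trained on a sliding window over the text rather than on an explicit matrix.
There are two model architectures: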
- CBOW predicts the current word based on the context
- Skip-gram predicts surrounding words given the current word
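The difference between the two architectures shows up in how training examples are formed from a sliding window. The sketch below only builds the (input, target) examples; it does not implement the network or its training. The function name training_examples and the window size are assumptions made for illustration.

```python
def training_examples(tokens, window=2):
    """Return (cbow, skipgram) training examples for a list of tokens."""
    cbow, skipgram = [], []
    for i, word in enumerate(tokens):
        # Up to `window` tokens on each side of the current word.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # CBOW: the whole context is the input, the current word is the target.
        cbow.append((context, word))
        # Skip-gram: the current word is the input, each context word is a target.
        skipgram.extend((word, c) for c in context)
    return cbow, skipgram

tokens = "the quick brown fox".split()
cbow, skipgram = training_examples(tokens, window=1)

print(cbow[1])       # (['the', 'brown'], 'quick')
print(skipgram[:3])  # [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

CBOW produces one example per position, while skip-gram produces one example per (word, context word) pair.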
word2vec project page: [2]; demo: [3]
Example: vec(Madrid) - vec(Spain) + vec(France) ≈ vec(Paris), that is, the word vector nearest to the result of the arithmetic is the one for Paris.
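The analogy lookup itself is simple vector arithmetic followed by a nearest-neighbour search. The sketch below uses hand-made 4-dimensional toy vectors, not trained word2vec embeddings, so it only demonstrates the mechanics; the vector values and the helper names cosine and analogy are hypothetical.

```python
import math

# Hand-made toy vectors (hypothetical, NOT trained word2vec embeddings).
# Dimensions: [Spain-ness, France-ness, Germany-ness, capital-ness]
vectors = {
    "Spain":   [1.0, 0.0, 0.0, 0.0],
    "Madrid":  [1.0, 0.0, 0.0, 1.0],
    "France":  [0.0, 1.0, 0.0, 0.0],
    "Paris":   [0.0, 1.0, 0.0, 1.0],
    "Germany": [0.0, 0.0, 1.0, 0.0],
    "Berlin":  [0.0, 0.0, 1.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy(a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("Madrid", "Spain", "France"))  # prints Paris
```

With real embeddings the result is only approximately equal to the target vector, which is why the answer is defined as the nearest neighbour by cosine similarity.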