Lecture 5. Language sources: Difference between revisions
From Wiki - Faculty of Computer Science
Polidson (talk | contribs)
Polidson (talk | contribs) (→Types of language sources)
Revision as of 01:30, 24 August 2015
Types of language sources
- Word list
- Dictionary: definitions for words
- Thesaurus: words grouped together according to similarity of meaning
- Ontology: formal naming and definitions of the types, properties, and interrelationships of the entities that really or fundamentally exist for a particular domain of discourse
- Corpus
  - Text corpus: a large and structured set of texts
  - Speech corpus: a large set of speech audio files
  - Web corpus: a text corpus collected from the Web
- Wikipedia (DBpedia)
- Test datasets
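As a rough illustration of how the source types listed above differ as data structures, here is a minimal Python sketch. All entries, as well as the names `word_list`, `dictionary`, `thesaurus`, `text_corpus`, and `synonyms`, are made up for this example and do not come from any real resource.

```python
# Toy illustration only: every entry below is invented for this sketch.

# Word list: a plain collection of words, no extra information.
word_list = {"cat", "dog", "run"}

# Dictionary: maps each word to a definition.
dictionary = {
    "cat": "a small domesticated carnivorous mammal",
    "run": "to move quickly on foot",
}

# Thesaurus: words grouped together by similarity of meaning.
thesaurus = [
    {"big", "large", "huge"},
    {"run", "sprint", "dash"},
]

# Text corpus: a collection of texts (here, just two sentences).
text_corpus = [
    "The cat sat on the mat.",
    "A large dog can run fast.",
]


def synonyms(word):
    """Return the other members of every thesaurus group containing `word`."""
    result = set()
    for group in thesaurus:
        if word in group:
            result |= group - {word}
    return result
```

For example, `synonyms("run")` collects the remaining words of the group that contains "run", while a word absent from the thesaurus yields an empty set.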
Word lists
- List of stopwords (also available in NLTK)
- Moby words[1]
- List of Wikipedia articles
- Lists of words for language learners
- Lists of German compounds
- Lists of common spam words [2], email-marketing-ebook/spam-words.aspx.
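A typical use of a word list such as a stopword list is filtering tokens before further processing. The sketch below assumes a tiny hand-picked stopword set (`STOPWORDS`) and a hypothetical helper `remove_stopwords`. A real list could be loaded instead with `nltk.corpus.stopwords.words("english")` once the NLTK "stopwords" data has been downloaded.

```python
import re

# Tiny hand-picked stopword set, for illustration only (an assumption).
# A fuller list is available via nltk.corpus.stopwords.words("english").
STOPWORDS = {"a", "an", "the", "is", "on", "of", "and", "in", "to"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stopwords]


print(remove_stopwords("The cat is on the mat"))
```

The same pattern works for the other word lists above (for example, flagging tokens that appear in a spam word list) by swapping the set and inverting the membership test.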
Dictionary
- Wiktionary: collaborative project to produce a free-content multilingual dictionary. It aims to describe all words of all languages using definitions and descriptions in English. [3]
- Wiktionary as a source for automatic pronunciation extraction
- Extracting lexical semantic knowledge from Wikipedia and Wiktionary
- Using Wikipedia and Wiktionary in domain-specific information retrieval
- Wiktionary and NLP: Improving synonymy networks
- FreeLing dictionaries [4]
- English-Spanish large statistical dictionary of inflectional forms
- Exploiting web-based collective knowledge for micropost normalisation