Lecture 5. Language sources
From Wiki - Faculty of Computer Science
Polidson (talk | contribs)
Revision as of 01:28, 24 August 2015
Types of language sources
- Word list
- Dictionary: definitions for words
- Thesaurus: words grouped together according to similarity of meaning
- Ontology: formal naming and definitions of the types, properties, and interrelationships of the entities that really or fundamentally exist for a particular domain of discourse
- Corpus
  - Text corpus: a large and structured set of texts
  - Speech corpus: a large set of speech audio files
  - Web corpus: a text corpus collected from the Web
- Wikipedia (DBpedia)
- Test datasets
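
The first four source types above differ mainly in how much structure they attach to each word. The sketch below shows one possible in-memory shape for each; all entries, the `is_a` relation name, and the helper functions `synonyms` and `ancestors` are made up for illustration and do not come from any of the resources listed in this lecture.

```python
# Toy illustrations of four kinds of language source (all entries are invented).

# Word list: just a collection of word forms.
word_list = {"bank", "river", "money", "loan"}

# Dictionary: word -> list of definitions.
dictionary = {
    "bank": ["a financial institution", "the land alongside a river"],
}

# Thesaurus: groups of words with similar meaning.
thesaurus = [
    {"big", "large", "huge"},
    {"small", "little", "tiny"},
]

# Ontology: typed relations between entities,
# stored here as (subject, relation, object) triples.
ontology = [
    ("dog", "is_a", "mammal"),
    ("mammal", "is_a", "animal"),
]


def synonyms(word):
    """Return the other words that share a thesaurus group with `word`."""
    result = set()
    for group in thesaurus:
        if word in group:
            result |= group - {word}
    return result


def ancestors(entity):
    """Follow is_a links upward, taking the first parent at each step."""
    found = []
    current = entity
    while True:
        parents = [o for s, r, o in ontology if s == current and r == "is_a"]
        if not parents:
            return found
        current = parents[0]
        found.append(current)
```

A word list can only answer membership questions, whereas the thesaurus and ontology support queries such as `synonyms("big")` or `ancestors("dog")`.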
Word lists
- Lists of stopwords (also available in NLTK)
- Moby words[1]
- List of Wikipedia articles
- Lists of words for language learners
- Lists of German compounds
- Lists of common spam words [2] (http://emailmarketing.comm100.com/email-marketing-ebook/spam-words.aspx)
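
As a minimal sketch of how a word list such as a stopword list is typically used, the code below filters stopwords out of a sentence. The `STOPWORDS` set and the `remove_stopwords` helper are assumptions made for illustration; the set is deliberately tiny and is not the NLTK list.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "of", "in", "is", "and", "to"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stopwords]


print(remove_stopwords("The list of words is in the corpus"))
```

With NLTK installed and its `stopwords` resource downloaded via `nltk.download("stopwords")`, the set could instead be built from `nltk.corpus.stopwords.words("english")`.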
Dictionary
- Wiktionary: a collaborative project to produce a free-content multilingual dictionary. It aims to describe all words of all languages using definitions and descriptions in English [3] (https://en.wiktionary.org).
  - Wiktionary as a source for automatic pronunciation extraction
  - Extracting lexical semantic knowledge from Wikipedia and Wiktionary
  - Using Wikipedia and Wiktionary in domain-specific information retrieval
  - Wiktionary and NLP: Improving synonymy networks
- FreeLing dictionaries [4] (http://nlp.lsi.upc.edu/freeling/index.php?option=com_content&task=view&id=23&Itemid=58)
- English-Spanish large statistical dictionary of inflectional forms
- Exploiting web-based collective knowledge for micropost normalisation
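
A dictionary of inflectional forms maps each surface form to its possible lemmas and grammatical tags. The sketch below loads such a mapping from text. The line layout (a form followed by lemma/tag pairs), the tag names, and the `load_form_dictionary` and `lemmas` helpers are all assumptions made for illustration; they do not reproduce the file format of any specific resource listed above, which should be checked against that resource's own documentation.

```python
from collections import defaultdict

# Hypothetical sample: surface form, then one or more (lemma, tag) pairs.
SAMPLE = """\
houses house NOUN-PL house VERB-3SG
ran run VERB-PAST
"""


def load_form_dictionary(text):
    """Parse lines of 'form lemma tag [lemma tag ...]' into a dict."""
    forms = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        form, rest = fields[0], fields[1:]
        # Pair up alternating lemma and tag fields.
        for lemma, tag in zip(rest[0::2], rest[1::2]):
            forms[form].append((lemma, tag))
    return dict(forms)


def lemmas(form, forms):
    """Return the distinct lemmas recorded for a surface form."""
    return sorted({lemma for lemma, _ in forms.get(form, [])})


form_dict = load_form_dictionary(SAMPLE)
print(lemmas("houses", form_dict))
```

Keeping every (lemma, tag) pair, rather than a single lemma per form, preserves ambiguity such as noun versus verb readings for a later disambiguation step.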