Lecture 5. Language sources

From the Wiki of the Faculty of Computer Science

Revision as of 01:28, 24 August 2015

Types of language sources

  • Word list
  • Dictionary: definitions for words
  • Thesaurus: words grouped together according to similarity of meaning
  • Ontology: formal naming and definitions of the types, properties, and interrelationships of the entities that really or fundamentally exist for a particular domain of discourse
  • Corpus
    • Text corpus: a large and structured set of texts
    • Speech corpus: a large set of speech audio files
    • Web corpus: a text corpus collected from the Web
  • Wikipedia (DBpedia)
  • Test datasets

Word lists

  • Lists of stopwords (also available in NLTK)
  • Moby words [1]
  • List of Wikipedia articles
  • Lists of words for language learners
  • Lists of German compounds
  • Lists of common spam words [http://emailmarketing.comm100.com/email-marketing-ebook/spam-words.aspx]
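
As an illustration of how a word list is typically used, the following is a minimal sketch of stopword filtering in Python. The tiny `STOPWORDS` set and the `remove_stopwords` helper are assumptions made for this example, not part of the lecture; in practice the list could be loaded from NLTK with `nltk.corpus.stopwords.words("english")` after downloading the `stopwords` resource.

```python
import re

# Tiny illustrative stopword list (an assumption for this sketch).
# A fuller list is available in NLTK via
# nltk.corpus.stopwords.words("english") once the "stopwords"
# resource has been downloaded.
STOPWORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stopwords]


print(remove_stopwords("The list of words is stored in a file"))
# prints ['list', 'words', 'stored', 'file']
```

The same pattern applies to the other lists above: a spam-word list, for example, could be passed in place of `STOPWORDS` to flag tokens rather than drop them.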

Dictionary

  • Wiktionary: a collaborative project to produce a free-content multilingual dictionary. It aims to describe all words of all languages using definitions and descriptions in English. [https://en.wiktionary.org]
    • Wiktionary as a source for automatic pronunciation extraction
    • Extracting lexical semantic knowledge from Wikipedia and Wiktionary
    • Using Wikipedia and Wiktionary in domain-specific information retrieval
    • Wiktionary and NLP: Improving synonymy networks
  • FreeLing dictionaries [http://nlp.lsi.upc.edu/freeling/index.php?option=com_content&task=view&id=23&Itemid=58]
  • English-Spanish large statistical dictionary of inflectional forms
  • Exploiting web-based collective knowledge for micropost normalisation