Lecture 5. Language sources

From the Wiki of the Faculty of Computer Science

Revision as of 01:25, 24 August 2015

Types of language sources

  • Word list
  • Dictionary: definitions for words
  • Thesaurus: words grouped together according to similarity of meaning
  • Ontology: formal naming and definitions of the types, properties, and interrelationships of the entities that really or fundamentally exist for a particular domain of discourse
  • Corpus
    • Text corpus: a large and structured set of texts
    • Speech corpus: a large set of speech audio files
    • Web corpus: a text corpus collected from the Web
  • Wikipedia (DBpedia)
  • Test datasets
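
The difference between the first three source types can be seen in how each might be held in memory. The sketch below is illustrative only: the entries are toy data invented for the example, and the helper name `synonyms` is hypothetical, not part of any library.

```python
# Toy data, invented for illustration; real sources are far larger.

# Word list: bare membership, no information attached to the words.
word_list = {"cat", "dog", "run"}

# Dictionary: each word maps to a definition.
dictionary = {"cat": "a small domesticated feline"}

# Thesaurus: words grouped by similarity of meaning.
thesaurus = [
    {"big", "large", "huge"},
    {"small", "tiny", "little"},
]


def synonyms(word, groups):
    """Return the other words that share a thesaurus group with `word`."""
    result = set()
    for group in groups:
        if word in group:
            result |= group - {word}
    return result
```

A word list can only answer "is this a known word?", a dictionary answers "what does it mean?", and a thesaurus answers "which words mean something similar?", for example `synonyms("big", thesaurus)`.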

Word lists

  • Lists of stopwords (also available in NLTK)
  • Moby words (http://icon.shef.ac.uk/Moby/mwords.html)
  • List of Wikipedia articles
  • Lists of words for language learners
  • Lists of German compounds
  • Lists of common spam words (http://emailmarketing.comm100.com/email-marketing-ebook/spam-words.aspx)
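
A typical use of a word list is filtering tokens against it. The following is a minimal sketch, assuming a tiny hand-made stopword set (`STOPWORDS` is hypothetical and far smaller than any real list) and a simple regular-expression tokenizer. With NLTK installed and its `stopwords` data downloaded, the set could instead be built from `nltk.corpus.stopwords.words("english")`.

```python
import re

# Hypothetical mini stopword list, for the sketch only.
STOPWORDS = {"a", "an", "the", "of", "and", "is", "in", "to"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stopwords]
```

The same membership test works for the other lists above. Checking tokens against a spam-word list, for instance, only requires passing a different set and keeping the matches instead of discarding them.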