Lecture 5. Language sources

From the Wiki of the Faculty of Computer Science

Types of language sources

  • Word list
  • Dictionary: definitions for words
  • Thesaurus: words grouped together according to similarity of meaning
  • Ontology: formal naming and definitions of the types, properties, and interrelationships of the entities that really or fundamentally exist for a particular domain of discourse
  • Corpus
    • Text corpus: a large and structured set of texts
    • Speech corpus: a large set of speech audio files
    • Web corpus: a text corpus collected from the Web
  • Wikipedia (DBpedia)
  • Test datasets
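
The first three source types above differ mainly in what is stored for each word. A minimal sketch, assuming toy hand-made entries (the words, the definitions, and the helper `similar_words` are all hypothetical and not taken from any real resource):

```python
# Toy, hand-made entries; none of this data comes from a real resource.

# Word list: just the words, with no extra information.
word_list = ["bank", "river", "shore"]

# Dictionary: each word maps to a definition.
dictionary = {
    "bank": "the land alongside a river",
    "shore": "the land along the edge of a sea or lake",
}

# Thesaurus: words grouped together by similarity of meaning.
thesaurus = [
    {"bank", "shore"},
    {"river", "stream"},
]


def similar_words(word, groups):
    """Return the words that share a thesaurus group with `word`, sorted."""
    related = set()
    for group in groups:
        if word in group:
            related |= group - {word}
    return sorted(related)
```

For example, `similar_words("bank", thesaurus)` returns the other members of the group that contains "bank". An ontology would go one step further and record typed relations between entities rather than plain groups.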

Word lists

  • Lists of stopwords (also available in NLTK)
  • Moby Words [1]
  • List of Wikipedia articles
  • Lists of words for language learners
  • Lists of German compounds
  • Lists of common spam words [2] (email-marketing-ebook/spam-words.aspx)
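
A typical use of a word list is filtering: tokens found in a stopword list are dropped before further processing. A minimal sketch, assuming a tiny illustrative stopword set and a simple regex tokenizer (both are assumptions made for this example; `STOPWORDS` and `remove_stopwords` are hypothetical names):

```python
import re

# Illustrative stopword set, an assumption for this sketch.
# With NLTK installed and its stopwords corpus downloaded, a fuller list
# is available via nltk.corpus.stopwords.words("english").
STOPWORDS = {"a", "an", "the", "of", "and", "is", "in", "to"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stopwords]
```

Calling `remove_stopwords("The corpus is a large set of texts")` keeps only the content words. The same pattern applies to the other lists above, for instance flagging a message whose tokens appear in a spam word list.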