Lecture 5. Language sources

From the Wiki of the Faculty of Computer Science

Revision as of 01:38, 24 August 2015

Types of language sources

  • Word list
  • Dictionary: definitions for words
  • Thesaurus: words grouped together according to similarity of meaning
  • Ontology: formal naming and definitions of the types, properties, and interrelationships of the entities that really or fundamentally exist for a particular domain of discourse
  • Corpus
    • Text corpus: a large and structured set of texts
    • Speech corpus: a large set of speech audio files
    • Web corpus: a text corpus collected from the Web
  • Wikipedia (DBpedia)
  • Test datasets

Word lists

  • Lists of stopwords (also available in NLTK)
  • Moby words[1]
  • List of Wikipedia articles
  • Lists of words for language learners
  • Lists of German compounds
  • Lists of common spam words [2], email-marketing-ebook/spam-words.aspx.
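A word list is typically used as a simple membership filter. The following is a minimal sketch of stopword removal, assuming a tiny hand-written stopword set (`STOPWORDS` and `remove_stopwords` are hypothetical names, not part of any library). In practice the set could be swapped for a fuller list, for example the one NLTK exposes as `nltk.corpus.stopwords.words("english")` once its data has been downloaded.

```python
import re

# Hypothetical, deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "in", "is", "and", "to"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok not in stopwords]


print(remove_stopwords("The meaning of a word is in the dictionary"))
```

The regular expression only handles plain ASCII letters and apostrophes, which is an assumption made to keep the sketch short; a real tokenizer would be needed for other scripts.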

Dictionary

  • Wiktionary: a collaborative project to produce a free-content multilingual dictionary. It aims to describe all words of all languages using definitions and descriptions in English. [3]

    • Wiktionary as a source for automatic pronunciation extraction
    • Extracting lexical semantic knowledge from Wikipedia and Wiktionary
    • Using Wikipedia and Wiktionary in domain-specific information retrieval
    • Wiktionary and NLP: Improving synonymy networks
  • FreeLing dictionaries [4]


  • English-Spanish large statistical dictionary of inflectional forms
  • Exploiting web-based collective knowledge for micropost normalisation

Thesaurus

Definition

A thesaurus is a reference work that lists words grouped together according to similarity of meaning.

  • WordNet is a large lexical database of English. Nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. [5]
  • VerbNet is the largest online verb lexicon currently available for English. It is a hierarchical, domain-independent, broad-coverage verb lexicon organized into verb classes. Each verb class in VerbNet is completely described by thematic roles, selectional restrictions on the arguments, and frames consisting of a syntactic description and semantic predicates with a temporal function. [6]
  • The FrameNet project is building a lexical database of English based on annotating examples of how words are used in actual texts. It provides a unique training dataset for semantic role labeling. FrameNet is based on a theory of meaning called Frame Semantics, deriving from the work of Charles J. Fillmore and colleagues. The basic idea is that the meanings of most words can best be understood on the basis of a semantic frame. [7]
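The synset idea described for WordNet can be sketched with plain Python data structures. This is a toy model under stated assumptions: the synset identifiers, the word sets, and the helper names `senses` and `synonyms` are all hypothetical and are not taken from WordNet or its API.

```python
# Hypothetical toy data: each synset id maps to the words that share one meaning.
SYNSETS = {
    "car_vehicle": {"car", "auto", "automobile"},
    "car_railway": {"car", "railcar", "railway car"},
    "motor_vehicle": {"motor vehicle", "automotive vehicle"},
}

# Hypothetical relation between synsets: (source synset, relation) -> target synset.
RELATIONS = {("car_vehicle", "hypernym"): "motor_vehicle"}


def senses(word, synsets=SYNSETS):
    """Return the ids of all synsets that contain the word, sorted by id."""
    return sorted(sid for sid, words in synsets.items() if word in words)


def synonyms(word, synsets=SYNSETS):
    """Collect every other word that shares at least one synset with the word."""
    found = set()
    for sid in senses(word, synsets):
        found |= synsets[sid]
    found.discard(word)
    return found


print(senses("car"))
print(sorted(synonyms("auto")))
print(RELATIONS[("car_vehicle", "hypernym")])
```

An ambiguous word such as "car" belongs to more than one synset, which is why lookups go through synsets rather than a flat word-to-synonyms table.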

Ontology

Definition

An ontology is a formal definition of the types, properties, and interrelationships of the entities that really or fundamentally exist for a particular domain. RDF is a formal model for ontology description, and OWL is one of the main languages for ontology description. A taxonomy is an ontology with a single hierarchical relation.
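To make the definition concrete, here is a minimal sketch that stores a toy ontology as (subject, predicate, object) triples, the statement shape used by RDF, and then extracts a taxonomy by keeping a single hierarchical relation. The entities, the predicate names `is_a` and `has_property`, and the helper functions are hypothetical; the sketch also assumes each entity has at most one parent and that the hierarchy has no cycles.

```python
# Hypothetical toy ontology as (subject, predicate, object) triples.
TRIPLES = [
    ("Dog", "is_a", "Mammal"),
    ("Cat", "is_a", "Mammal"),
    ("Mammal", "is_a", "Animal"),
    ("Dog", "has_property", "barks"),
]


def taxonomy(triples, relation="is_a"):
    """Keep only the single hierarchical relation as a child -> parent map."""
    return {subj: obj for subj, pred, obj in triples if pred == relation}


def ancestors(entity, parents):
    """Walk up the hierarchy from the entity to the root (assumes no cycles)."""
    chain = []
    while entity in parents:
        entity = parents[entity]
        chain.append(entity)
    return chain


parents = taxonomy(TRIPLES)
print(ancestors("Dog", parents))
```

A poly-hierarchical structure, where a concept may have several parents, would need a child-to-list-of-parents map instead of the single-parent dictionary used here.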

  • BabelNet is both a multilingual encyclopedic dictionary, with lexicographic and encyclopedic coverage of terms, and a semantic network that connects concepts and named entities through semantic relations. It is made up of more than 13 million entries, called Babel synsets. Each Babel synset represents a given meaning and contains all the synonyms that express that meaning in a range of different languages. [8] http://babelnet.org/
  • SNOMED CT (Systematized Nomenclature of Medicine Clinical Terms) is a comprehensive clinical terminology, originally created by the College of American Pathologists (CAP) and, as of April 2007, owned, maintained, and distributed by the International Health Terminology Standards Development Organisation (IHTSDO), a not-for-profit association in Denmark. [9] http://www.ihtsdo.org/snomed-ct
  • The ACM Computing Classification System (CCS) has been developed as a poly-hierarchical ontology that can be used in Semantic Web applications. It replaces the traditional 1998 version of the CCS, which has served as the de facto standard classification system for the computing field. It relies on a semantic vocabulary as the single source of categories and concepts that reflect the state of the art of the computing discipline, and it is designed to accommodate structural change as the discipline evolves. [10] http://www.acm.org/about/class/class/2012

Text corpus

Speech corpus

Web corpus