Lecture 5. Language sources: difference between revisions
From Wiki - Faculty of Computer Science
Polidson (talk | contribs) (→Types of language sources)
Polidson (talk | contribs) (→Thesaurus)
Revision as of 01:33, 24 August 2015
Types of language sources
- Word list
- Dictionary: definitions for words
- Thesaurus: words grouped together according to similarity of meaning
- Ontology: formal naming and definitions of the types, properties, and interrelationships of the entities that really or fundamentally exist for a particular domain of discourse
- Corpus
  - Text corpus: a large and structured set of texts
  - Speech corpus: a large set of speech audio files
  - Web corpus: a text corpus collected from the Web
- Wikipedia (DBpedia)
- Test datasets
Word lists
- Lists of stopwords (also available in NLTK)
- Moby Words [1]
- List of Wikipedia articles
- Lists of words for language learners
- Lists of German compounds
- Lists of common spam words [2], email-marketing-ebook/spam-words.aspx.
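A stopword list is typically used as a filter during preprocessing. The sketch below is only an illustration under stated assumptions: `STOPWORDS` is a tiny hand-made set and `remove_stopwords` is a hypothetical helper, neither taken from the lecture. A real list, such as the one distributed with NLTK, is considerably longer.

```python
import re

# Tiny hand-made stopword set, for illustration only (an assumption,
# not the NLTK list mentioned above).
STOPWORDS = frozenset({"a", "an", "the", "is", "of", "and", "to", "in", "that"})


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, extract word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stopwords]


filtered = remove_stopwords("The meaning of a word is grounded in a frame")
print(filtered)
```

Passing a different set as `stopwords` lets the same function work with any word list, for example a list of common spam words.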
Dictionary
- Wiktionary: a collaborative project to produce a free-content multilingual dictionary. It aims to describe all words of all languages using definitions and descriptions in English. [3]
- Wiktionary as a source for automatic pronunciation extraction
- Extracting lexical semantic knowledge from Wikipedia and Wiktionary
- Using Wikipedia and Wiktionary in domain-specific information retrieval
- Wiktionary and NLP: Improving synonymy networks
- FreeLing dictionaries [4]
- English-Spanish large statistical dictionary of inflectional forms
- Exploiting web-based collective knowledge for micropost normalisation
Thesaurus
Definition
A thesaurus is a reference work that lists words grouped together according to similarity of meaning.
- WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations. [5] (https://wordnet.princeton.edu/)
- VerbNet is the largest online verb lexicon currently available for English. It is a hierarchical, domain-independent, broad-coverage verb lexicon organized into verb classes. Each verb class in VerbNet is completely described by thematic roles, selectional restrictions on the arguments, and frames consisting of a syntactic description and semantic predicates with a temporal function. [6] (http://verbs.colorado.edu/~mpalmer/projects/verbnet.html)
- The FrameNet project is building a lexical database of English based on annotated examples of how words are used in actual texts. It provides a unique training dataset for semantic role labeling. FrameNet is based on a theory of meaning called Frame Semantics, deriving from the work of Charles J. Fillmore and colleagues. The basic idea is straightforward: the meanings of most words can best be understood on the basis of a semantic frame. [7] (https://framenet.icsi.berkeley.edu/fndrupal/about)
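The synset idea described for WordNet can be sketched with a small in-memory structure. This is a toy model, not the WordNet API or its data: the `Synset` class, the helper functions, and the three entries are hypothetical and hand-made. It shows how one word can belong to several synsets and how a "kind of" relation links synsets to more general ones.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonymous words expressing one concept (toy model)."""
    name: str
    lemmas: list
    gloss: str
    hypernyms: list = field(default_factory=list)  # more general synsets


def synonyms(word, synsets):
    """Collect every word that shares at least one synset with `word`."""
    found = set()
    for synset in synsets:
        if word in synset.lemmas:
            found.update(synset.lemmas)
    found.discard(word)
    return sorted(found)


def hypernym_chain(synset):
    """Follow the first hypernym link upward and return the names visited."""
    chain = []
    current = synset
    while current.hypernyms:
        current = current.hypernyms[0]
        chain.append(current.name)
    return chain


# Hand-made example entries (assumptions, not copied from WordNet).
vehicle = Synset("vehicle", ["vehicle"], "a conveyance that transports people or objects")
car = Synset("car", ["car", "automobile", "auto"], "a motor vehicle with four wheels", [vehicle])
cable_car = Synset("cable_car", ["car", "cable car"], "a compartment suspended from a cable", [vehicle])

print(synonyms("car", [vehicle, car, cable_car]))
print(hypernym_chain(car))
```

Because "car" appears in two synsets, its synonym set mixes two senses. Telling those senses apart is what the separate synsets make possible.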