Lecture 5. Language sources
Types of language sources
- Word list
- Dictionary: definitions for words
- Thesaurus: words grouped together according to similarity of meaning
- Ontology: formal naming and definitions of the types, properties, and interrelationships of the entities that really or fundamentally exist for a particular domain of discourse
- Text corpus: a large and structured set of texts
- Speech corpus: a large set of speech audio files
- Web corpus: text corpus, collected from Web
- Wikipedia (DBpedia)
- Test datasets
- List of stopwords (in NLTK, too)
- Moby words
- List of Wikipedia articles
- Lists of words for language learners
- Lists of German compounds
- Lists of common spam words (email-marketing-ebook/spam-words.aspx)
- Wiktionary: a collaborative project to produce a free-content multilingual dictionary; it aims to describe all words of all languages using definitions and descriptions in English.
- Wiktionary as a source for automatic pronunciation extraction
- Extracting lexical semantic knowledge from Wikipedia and Wiktionary
- Using Wikipedia and Wiktionary in domain-specific information retrieval
- Wiktionary and NLP: Improving synonymy networks
- FreeLing dictionaries 
- English-Spanish large statistical dictionary of inflectional forms
- Exploiting web-based collective knowledge for micropost normalisation
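Stopword lists such as the one shipped with NLTK are typically applied as a simple filter during tokenization. A minimal sketch, using a small hand-picked list in place of NLTK's `stopwords.words('english')`:

```python
# Minimal stopword filtering; the toy list below stands in for a real
# resource such as NLTK's English stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def remove_stopwords(text):
    """Lowercase, tokenize on whitespace, and drop stopwords."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]

print(remove_stopwords("The corpus is a large and structured set of texts"))
# ['corpus', 'large', 'structured', 'set', 'texts']
```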
A thesaurus is a reference work that lists words grouped together according to similarity of meaning.
- WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
- VerbNet is the largest on-line verb lexicon currently available for English. It is a hierarchical, domain-independent, broad-coverage verb lexicon, organized into verb classes. Each verb class in VerbNet is completely described by thematic roles, selectional restrictions on the arguments, and frames consisting of a syntactic description and semantic predicates with a temporal function.
- FrameNet project is building a lexical database of English, based on annotating examples of how words are used in actual texts. It provides a unique training dataset for semantic role labeling. FrameNet is based on a theory of meaning called Frame Semantics, deriving from the work of Charles J. Fillmore and colleagues. The basic idea is straightforward: that the meanings of most words can best be understood on the basis of a semantic frame. 
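The synset idea behind WordNet can be illustrated with a toy in-memory thesaurus (NLTK exposes the real database as `nltk.corpus.wordnet`; the entries below are invented for illustration):

```python
# Toy synsets: each set groups words that share one meaning.
SYNSETS = [
    {"car", "auto", "automobile", "machine"},
    {"big", "large", "great"},
    {"great", "outstanding"},  # a word may belong to several synsets
]

def synonyms(word):
    """All words sharing at least one synset with `word`, excluding itself."""
    result = set()
    for synset in SYNSETS:
        if word in synset:
            result |= synset
    return result - {word}

print(sorted(synonyms("great")))
# ['big', 'large', 'outstanding']
```

Because "great" sits in two synsets, its synonym set is the union of both, which mirrors how polysemous words behave in WordNet.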
An ontology is a formal definition of the types, properties, and interrelationships of the entities that really or fundamentally exist for a particular domain. RDF is a formal model for ontology description, and OWL is one of the main languages for ontology description. A taxonomy is an ontology with a single hierarchical relation.
- BabelNet is both a multilingual encyclopedic dictionary, with lexicographic and encyclopedic coverage of terms, and a semantic network which connects concepts and named entities in a very large network of semantic relations, made up of more than 13 million entries, called Babel synsets. Each Babel synset represents a given meaning and contains all the synonyms which express that meaning in a range of different languages. 
- SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms) is a comprehensive clinical terminology, originally created by the College of American Pathologists (CAP) and, as of April 2007, owned, maintained, and distributed by the International Health Terminology Standards Development Organisation (IHTSDO), a not-for-profit association in Denmark.
- The ACM Computing Classification System has been developed as a poly-hierarchical ontology that can be utilized in semantic web applications. It replaces the traditional 1998 version of the ACM Computing Classification System (CCS), which has served as the de facto standard classification system for the computing field. It relies on a semantic vocabulary as the single source of categories and concepts that reflect the state of the art of the computing discipline and is receptive to structural change as it evolves in the future.
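A taxonomy, as defined above, reduces to a single hierarchical (is-a) relation; a minimal sketch of subclass checking over such a hierarchy (the categories are invented for illustration):

```python
# Toy taxonomy: child -> parent, the single is-a relation.
IS_A = {
    "poodle": "dog",
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
}

def is_subclass(concept, ancestor):
    """Follow is-a links upward; True if `ancestor` is reachable."""
    while concept in IS_A:
        concept = IS_A[concept]
        if concept == ancestor:
            return True
    return False

print(is_subclass("poodle", "animal"))  # True
print(is_subclass("cat", "dog"))        # False
```

Richer ontologies (RDF/OWL) add many relation types and constraints on top of this single-relation backbone.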
A corpus is a large and structured set of texts.
- The British National Corpus is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres with the intention that it be a representative sample of spoken and written British English of that time.
- The Brown Corpus of Present-Day American English was compiled in the 1960s by Henry Kucera and W. Nelson Francis at Brown University, Providence, Rhode Island as a general corpus (text collection) in the field of corpus linguistics. It contains 500 samples of English-language text, totaling roughly one million words, compiled from works published in the United States in 1961.
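Corpora such as the BNC and Brown are routinely summarized by token and type counts; a minimal sketch on a toy corpus (NLTK provides the real Brown Corpus as `nltk.corpus.brown`):

```python
from collections import Counter

# Toy corpus standing in for a real collection such as Brown.
corpus = [
    "the fox jumped over the dog",
    "the dog slept",
]

# Flatten documents into one token stream and count word frequencies.
tokens = [tok for text in corpus for tok in text.split()]
freq = Counter(tokens)

print("tokens:", len(tokens))               # tokens: 9
print("types:", len(freq))                  # types: 6
print("most common:", freq.most_common(1))  # most common: [('the', 3)]
```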
A speech corpus (or spoken corpus) is a database of speech audio files and text transcriptions.
The Santa Barbara Corpus includes transcriptions, audio, and timestamps which correlate transcription and audio at the level of individual intonation units.
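The timestamped alignment used by the Santa Barbara Corpus can be modelled as a list of (start, end, text) intonation units; a minimal sketch (the units below are invented, not taken from the corpus):

```python
# Each intonation unit carries its time span in the audio, in seconds.
units = [
    (0.00, 1.42, "so I was walking"),
    (1.42, 2.10, "down the street"),
    (2.10, 3.05, "and I saw her"),
]

def unit_at(time):
    """Return the transcription of the unit covering `time`, if any."""
    for start, end, text in units:
        if start <= time < end:
            return text
    return None

print(unit_at(1.5))  # down the street
```

This kind of alignment is what lets a spoken corpus be queried jointly by text and by position in the audio.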
How can we collect texts from the Web?
- Use a crawler to collect pages, together with tools for boilerplate and duplicate removal
- Use a search engine API for iterative search (generate queries, search for them, find new pages)
- WaCky [Baroni, Bernardini, Ferraresi, Zanchetta, 2009] 
- Focused Web crawling
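After crawling, duplicate removal is an essential cleaning step; a minimal sketch of exact-duplicate removal by hashing normalized text (a real pipeline such as WaCky's also performs near-duplicate detection and boilerplate stripping):

```python
import hashlib

def normalize(text):
    """Collapse whitespace and lowercase, so trivial variants hash equally."""
    return " ".join(text.lower().split())

def deduplicate(pages):
    """Keep the first occurrence of each distinct normalized text."""
    seen, unique = set(), []
    for page in pages:
        digest = hashlib.sha1(normalize(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

pages = ["A large corpus.", "a  large corpus.", "Another page."]
print(len(deduplicate(pages)))  # 2
```

Hashing the normalized text keeps memory per page constant, which matters at Web-corpus scale; near-duplicates would need techniques such as shingling or MinHash instead.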