Treebank

A treebank or parsed corpus is a text corpus in which each sentence has been parsed, i.e. annotated with syntactic structure. Syntactic structure is commonly represented as a tree structure, hence the name treebank. The term parsed corpus is often used interchangeably with treebank, with the emphasis on the primacy of sentences rather than trees.

Treebanks are often created on top of a corpus that has already been annotated with part-of-speech tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information.

Treebanks can be created completely "manually", where linguists annotate each sentence with syntactic structure, or "semi-automatically", where a parser assigns some syntactic structure which linguists then check and, if necessary, correct. In practice, fully checking and completing the parsing of natural language corpora is a labour-intensive project that can take teams of graduate linguists many years. The level of annotation detail and the breadth of the linguistic sample determine the difficulty of the task and the length of time required to build a treebank.

Some treebanks follow a specific linguistic theory in their syntactic annotation (e.g. the [http://www.bultreebank.org/ BulTreeBank] follows HPSG), but most try to be less theory-specific. However, two main groups can be distinguished: treebanks that annotate phrase structure (for example the [http://www.cis.upenn.edu/~treebank/ Penn Treebank] or [http://www.ucl.ac.uk/english-usage/projects/ice-gb/index.htm ICE-GB]) and those that annotate dependency structure (for example the [http://ufal.mff.cuni.cz/pdt/ Prague Dependency Treebank]).
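The difference between the two groups can be made concrete with a small sketch. The Python below encodes one possible analysis of the sentence "John loves Mary" in both styles. The data layout and the relation names ("subject", "object") are illustrative assumptions and are not taken from the annotation guidelines of any treebank named above.

```python
# Phrase structure: nested constituents, written here as [label, child, ...].
phrase_structure = ["S",
                    ["NP", ["NNP", "John"]],
                    ["VP", ["VBZ", "loves"], ["NP", ["NNP", "Mary"]]]]

# Dependency structure: one entry per word, no phrasal nodes.
# Each entry is (index, word, head index, relation); head index 0 marks the root.
# The relation names are hypothetical labels chosen for this example.
dependency_structure = [
    (1, "John", 2, "subject"),
    (2, "loves", 0, "root"),
    (3, "Mary", 2, "object"),
]

def root(tree):
    """Return the word whose head index is 0."""
    return next(word for _, word, head, _ in tree if head == 0)

def dependents(tree, head_index):
    """Return the words directly governed by the word at head_index."""
    return [word for _, word, head, _ in tree if head == head_index]

print(root(dependency_structure))
print(dependents(dependency_structure, 2))
```

In the phrase-structure version the verb and its object are grouped under a VP node; in the dependency version both nouns attach directly to the verb.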

It is important to clarify the distinction between the formal representation and the file format used. Treebanks are necessarily constructed according to a particular grammar. The same grammar may be implemented by different file formats.

For example, the syntactic analysis for "John loves Mary", shown in the figure on the right, may be represented by simple labelled brackets in a text file, like this (following the [http://www.cis.upenn.edu/~treebank/ Penn Treebank] notation): (S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))
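Labelled brackets of this kind are straightforward to read programmatically. The following is a minimal sketch of a reader, not the tooling distributed with any treebank: it assumes well-formed input, represents each node as a Python list `[label, child, ...]`, and the function names `parse` and `leaves` are chosen for this example only.

```python
from typing import List

def tokenize(text: str) -> List[str]:
    # Pad brackets with spaces so split() separates them from labels and words.
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(text: str) -> list:
    """Parse one bracketed tree into nested lists of the form [label, child, ...]."""
    tokens = tokenize(text)
    pos = 0

    def read_node() -> list:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        node = [tokens[pos]]          # the node label, e.g. "NP"
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read_node())   # nested constituent
            else:
                node.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                      # consume the closing bracket
        return node

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the tree")
    return tree

def leaves(node: list) -> List[str]:
    """Collect the words of a tree from left to right."""
    words = []
    for child in node[1:]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words

tree = parse("(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))")
print(tree[0])
print(leaves(tree))
```

Unbalanced brackets are not handled gracefully here; a production reader would need explicit error handling for truncated input.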

This type of representation is popular because it is 'light' on resources, and the tree structure is relatively easy to 'read' without software tools. However, as corpora become increasingly complex, other file formats may be preferred. Alternatives include treebank-specific XML schemes, numbered indentation and various types of standoff notation. To compare schemes, see the [http://www.scs.leeds.ac.uk/amalgam/amalgam/multi-parsed.html Amalgam Multi-Treebank], a pico corpus of 20 sentences annotated by different grammars and notation schemes.
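To illustrate how the same analysis can be carried by a different file format, the sketch below writes the "John loves Mary" tree as XML using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`node`, `word`, `cat`, `pos`) are a hypothetical scheme invented for this example; real treebank XML schemes differ from one another and from this one.

```python
import xml.etree.ElementTree as ET

# The analysis as nested lists: [label, child, ...]; a word is a plain string.
tree = ["S",
        ["NP", ["NNP", "John"]],
        ["VP", ["VBZ", "loves"], ["NP", ["NNP", "Mary"]]],
        [".", "."]]

def to_xml(node: list) -> ET.Element:
    """Convert a nested-list tree into XML elements (hypothetical scheme)."""
    label, children = node[0], node[1:]
    # A part-of-speech node holds exactly one word.
    if len(children) == 1 and isinstance(children[0], str):
        elem = ET.Element("word", {"pos": label})
        elem.text = children[0]
        return elem
    # Any other node is a phrase with its category stored in an attribute,
    # so labels such as "." never have to serve as XML element names.
    elem = ET.Element("node", {"cat": label})
    for child in children:
        elem.append(to_xml(child))
    return elem

root = to_xml(tree)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Storing labels in attributes rather than in element names is a design choice of this sketch: it keeps the output well-formed even for punctuation tags.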

What is the purpose of a treebank?

Treebanks can be used in corpus linguistics for studying syntactic phenomena or in computational linguistics for training or testing parsers.

The value of parsed corpora is beginning to be understood. Introspection about grammar is inevitably partial, as linguists have found when attempting to parse actual speech and writing.

Once completely parsed, a corpus will contain evidence of both frequency (how common different grammatical structures are in use) and coverage (the discovery of new, unanticipated, grammatical phenomena).

An automatically parsed corpus that is not corrected by human linguists is still useful. It can provide evidence of "rule frequency" for a parser: a parser may be improved by applying it to large amounts of text and gathering rule frequencies. However, only by correcting and completing a corpus by hand is it possible to identify rules "absent" from the parser's knowledge base. (As a bonus, the frequencies are likely to be more accurate.)
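Gathering rule frequencies amounts to counting how often each phrase-structure rule occurs across the trees of a corpus. The sketch below does this for a toy corpus of two hand-written trees. The nested-list tree format, the function name `productions` and the second sentence are assumptions made for the example; lexical rules such as NNP -> John are skipped.

```python
from collections import Counter

# A toy corpus: each tree is [label, child, ...]; a word is a plain string.
corpus = [
    ["S", ["NP", ["NNP", "John"]],
          ["VP", ["VBZ", "loves"], ["NP", ["NNP", "Mary"]]],
          [".", "."]],
    ["S", ["NP", ["NNP", "Mary"]],
          ["VP", ["VBZ", "sleeps"]],
          [".", "."]],
]

def productions(node):
    """Yield (parent, (child labels...)) for every non-lexical rule in a tree."""
    label, children = node[0], node[1:]
    if all(isinstance(child, str) for child in children):
        return  # part-of-speech node over a word: not a phrase-structure rule
    yield (label, tuple(child[0] for child in children))
    for child in children:
        yield from productions(child)

rule_counts = Counter(rule for tree in corpus for rule in productions(tree))

for (parent, kids), count in rule_counts.most_common():
    print(f"{count}  {parent} -> {' '.join(kids)}")
```

Counts like these can only reflect rules that already appear in the trees; a rule missing from the parser's knowledge base will never appear in automatically produced output, which is why hand correction matters.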

Potentially, however, by far the most interesting use of parsed corpora for theoretical linguists and psycholinguists is as a source of interaction evidence. A completed treebank can help linguists carry out experiments as to how the decision to use one grammatical construction tends to influence the decision to form others. The idea here is not to improve parsing algorithms but to go to the heart of the question of linguistic choice: to try to understand how speakers and writers make decisions as they form sentences.

Interaction research is particularly fruitful as further layers of annotation, e.g. semantic, pragmatic, are added to a corpus. It is then possible to evaluate the impact of 'non-syntactic' phenomena on grammatical choices.

The parsing and exploitation of parsed corpora have become an important subdiscipline of Corpus Linguistics ever since the first large-scale treebank, [http://www.cis.upenn.edu/~treebank/ The Penn Treebank], was published. Many of the theoretical criticisms of lexical corpora do not apply to parsed corpora. Results from a parsed corpus are more closely commensurate with linguistic theories. However, a new epistemological problem arises: a parsed corpus necessarily requires a "particular" analysis, and this analysis, and the theory behind it, may be incorrect or deficient.

Theoretical linguists, following Noam Chomsky, have made a distinction between Internal (I-) Language and External (E-) Language, or Deep Grammar and Surface Grammar. A treebank necessarily only represents the "performance" of the grammar, that is, the Surface Grammar or the E-Language. The big question remains: is it possible, by studying interaction in E-Language in corpora, to perceive the impacts of constraints on I-Language?

The value of parsed corpora for general linguistics, therefore, remains an open question.

Searching treebanks

One of the key ways to extract evidence from a treebank is through search tools. Search tools for parsed corpora typically depend on the annotation scheme that was applied to the corpus. User interfaces range in sophistication from expression-based query systems aimed at computer programmers to full exploration environments aimed at general linguists.
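At the expression-based end of that range, a query is essentially a structural pattern matched against every node of every tree. The sketch below shows the idea for one simple pattern, "a node with label X that immediately dominates a node with label Y". It is a generic illustration over a nested-list tree, not the query language or implementation of any of the tools listed below, and the function names are invented for the example.

```python
# One tree in nested-list form: [label, child, ...]; a word is a plain string.
tree = ["S",
        ["NP", ["NNP", "John"]],
        ["VP", ["VBZ", "loves"], ["NP", ["NNP", "Mary"]]],
        [".", "."]]

def subtrees(node):
    """Yield the node itself and every constituent below it, top-down."""
    yield node
    for child in node[1:]:
        if not isinstance(child, str):
            yield from subtrees(child)

def find_dominating(root, parent_label, child_label):
    """Return all nodes labelled parent_label with a direct child labelled child_label."""
    return [
        node for node in subtrees(root)
        if node[0] == parent_label
        and any(not isinstance(child, str) and child[0] == child_label
                for child in node[1:])
    ]

# Verb phrases that directly contain a noun phrase (here, a transitive verb).
matches = find_dominating(tree, "VP", "NP")
print(matches)
```

Even this toy query depends on the label set of the annotation scheme ("VP", "NP"), which is why search tools are typically tied to the scheme applied to the corpus.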

The question facing a new researcher is not only, "which corpus is relevant to my needs?" but also "how can I find the information I want in this corpus, and how do I know that the results of my experiments mean what I think they do?"

Tools

* Phrase structure grammar
** [http://www.ldc.upenn.edu/ldc/online/treebank/ tgrep; tgrep2]
** [http://corpussearch.sourceforge.net/ CorpusSearch]
** Linguistic DataBase (LDB)
** VIQTORYA
** [http://www.ucl.ac.uk/english-usage/resources/icecup ICECUP III] ; [http://www.ucl.ac.uk/english-usage/resources/icecup/iv.htm ICECUP IV]
* Dependency grammar
** [http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch TigerSearch]
** [http://quest.ms.mff.cuni.cz/netgraph/indexEn.html Netgraph]

* Others
** [http://www.hcrc.ed.ac.uk/gsearch/ GSearch]
** [http://lse.umiacs.umd.edu Linguist's Search Engine]

Wallis (2008) discusses the principles of searching treebanks in detail and reviews the state of the art (as of 2006). [Wallis, Sean (2008). "Searching treebanks and other structured corpora." Chapter 34 in Lüdeling, A. & Kytö, M. (eds.), "Corpus Linguistics: An International Handbook." Handbücher zur Sprache und Kommunikationswissenschaft series. Berlin: Mouton de Gruyter.]

In addition to dedicated treebank search tools, some tools for searching speech data also exist. These tools are designed to support searches on overlapping hierarchies or graph structures.

List of treebanks sorted by language

* Arabic: [http://www.ircs.upenn.edu/arabic/ Penn Arabic Treebank] , [http://ufal.mff.cuni.cz/padt/PADT_1.0/index.html Prague Arabic Dependency Treebank (PADT)]
* Basque: [http://www.dlsi.ua.es/projectes/3lb/index_en.html Eus3LB] , see also [http://ixa.si.ehu.es/Ixa/Argitalpenak/proba/1068549887/publikoak/guia.pdf Annotation guide for Eus3LB] and the [http://ixa.si.ehu.es/Ixa group's home page]
* Bulgarian: [http://www.bultreebank.org/ BulTreeBank] (HPSG-based Syntactic Treebank)
* Catalan: [http://www.dlsi.ua.es/projectes/3lb/index_en.html Cat3LB]
* Chinese: [http://www.cis.upenn.edu/%7Echinese/ctb.html Penn Chinese Treebank] , [http://godel.iis.sinica.edu.tw/CKIP/engversion/treebank.htm Sinica Treebank] by CKIP, [http://ling.cuc.edu.cn/htliu/ctreebank.htm a tentative Chinese Dependency Treebank]
* Czech: [http://ufal.mff.cuni.cz/pdt/ Prague Dependency Treebank]
* Danish: [http://www.id.cbs.dk/~mtk/treebank/ Danish Dependency Treebank] , [http://corp.hum.sdu.dk/arboretum.html Arboretum: A syntactic tree corpus of Danish]
* Dutch: [http://lands.let.kun.nl/cgn/ehome.htm CGN] , [http://www.let.rug.nl/%7Evannoord/trees/ Alpino]
* English:
** [http://www.cis.upenn.edu/~treebank/ Penn] ;
** [http://www.cis.upenn.edu/~creswell/dependency/ English Dependency Treebank] ?;
** [http://www.ucl.ac.uk/english-usage/ice/index.htm BLLIP WSJ corpus] ;
** [http://www.ucl.ac.uk/english-usage/projects/ice-gb British Component of the International Corpus of English (ICE-GB)] ;
** [http://www.ucl.ac.uk/english-usage/projects/dcpse Diachronic Corpus of Present-Day Spoken English (DCPSE)] ;
** Lancaster Parsed Corpus;
** [http://www.grsampson.net/RSue.html Susanne Corpus] , [http://www.grsampson.net/RChristine.html Christine Corpus] , [http://www.grsampson.net/RLucy.html Lucy Corpus] ;
** Verbmobil treebanks;
** [http://redwoods.stanford.edu/ LinGO Redwoods] ;
** [http://www.scs.leeds.ac.uk/amalgam/amalgam/multi-parsed.html Multi-Treebank] ;
** [http://www2.parc.com/istl/groups/nltt/fsbank/default.html The PARC 700 Dependency Bank] ;
** [http://childes.psy.cmu.edu/ CHILDES] Brown Eve corpus with dependency annotation, see Sagae, K., MacWhinney, B., and Lavie, A. (2004) [http://www.cs.cmu.edu/~sagae/docs/sagae-LREC2004-final.pdf Adding syntactic annotations to transcripts of parent-child dialogs] . In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004). Lisbon, Portugal.
** [http://www.ling.su.se/DaLi/research/smultron/index.htm SMULTRON - Parallel Treebank EN-DE-SV]
* Estonian: [http://math.ut.ee/~heli_u/syntkorpus.html Syntactically analyzed and disambiguated text corpus] , see also [http://corp.hum.sdu.dk/tgrepeye_est.html Arborest]
* French: [http://treebank.linguist.jussieu.fr/buildingFrench.html Paris 7] , [http://corp.hum.sdu.dk/arboratoire.html L'Arboratoire]
* German: [http://www.coli.uni-saarland.de/projects/sfb378/negra-corpus/ NEGRA] , [http://www.ims.uni-stuttgart.de/projekte/TIGER/ TIGER] , [http://www.sfs.uni-tuebingen.de/en_tuebads.shtml The Tuebingen Treebank of Spoken German (TueBa-D/S)] , [http://www.sfs.uni-tuebingen.de/en_tuebadz.shtml The Tuebingen Treebank of Written German (TueBa-D/Z)] , [http://www.ling.su.se/DaLi/research/smultron/index.htm SMULTRON - Parallel Treebank EN-DE-SV]
* Greek, Modern: [http://www.ilsp.gr/homepages/prokopidis/documents/gdt_tlt2005.pdf Greek Dependency Treebank]
* Greek, Ancient: [http://foni.uio.no:3000 PROIEL Corpus]
* Hebrew: [http://mila.cs.technion.ac.il/website/english/resources/corpora/treebank/index.html Hebrew Treebank]
* Hindi: [http://www.iiit.net/ltrc/Publications/Techreports/tr014/guidelines_anncorra AnnCorra]
* Hungarian: [http://www.inf.u-szeged.hu/projectdirs/hlt/ikta37-en.htm Hungarian treebank]
* Italian: [http://www.di.unito.it/~tutreeb/index.html TUT - Turin University Treebank] , [http://torvald.aksis.uib.no/corpora/2005-1/0385.html VIT - Venice Italian Treebank] , [http://corpus1.mpi.nl/ds/imdi_browser/BC?virtpath=/IMDI-corpora/Pisa%20resources/ISTT&metadata=1 ISST - Italian Syntactic-Semantic Treebank]
* Japanese: [http://acl.ldc.upenn.edu/W/W98/W98-0513.pdf ATR Dependency corpus] , [http://www.kc.t.u-tokyo.ac.jp/nl-resource/corpus-e.html Kyoto Text Corpus] , [http://www.phonetik.uni-muenchen.de/Forschung/Verbmobil/Verbmobil.html Verbmobil treebanks]
* Korean: [http://www.cis.upenn.edu/~xtag/koreantag/#Treebank Korean Treebank]
* Latin:
** [http://nlp.perseus.tufts.edu/syntax/treebank/ Latin Dependency Treebank] ;
** [http://itreebank.marginalia.it/ "Index Thomisticus" Treebank] .
** [http://foni.uio.no:3000 PROIEL Corpus]
* Norwegian: [http://spraktek.aksis.uib.no/projects/trepil TREPIL Norwegian treebank]
* Polish: [http://dach.ipipan.waw.pl/CRIT2/ A Treebank / Test Suite for Polish] (HPSG treebank)
* Portuguese: [http://acdc.linguateca.pt/treebank/info_floresta_English.html Projecto Floresta Sintá(c)tica]
* Russian: [http://acl.ldc.upenn.edu/C/C00/C00-2143.pdf Dependency Treebank for Russian] , see also [http://proling.iitp.ru/bibitems/treebank_lrec.pdf another paper]
* Slovene: [http://nl.ijs.si/sdt/ Slovene Dependency Treebank]
* Spanish: [http://www.dlsi.ua.es/projectes/3lb/index_en.html Cast3LB] , [http://www.lllf.uam.es/%7Esandoval/UAMTreebank.html UAM Treebank of Spanish]
* Swedish: [http://w3.msi.vxu.se/~nivre/research/Talbanken05.html Talbanken05] , [http://w3.msi.vxu.se/~nivre/research/st.html Swedish Treebank] , [http://www.ling.su.se/DaLi/research/smultron/index.htm SMULTRON - Parallel Treebank EN-DE-SV]
* Thai: [http://naist.cpe.ku.ac.th/tred/ NAiST Thai Treebank]
* Turkish: [http://www.ii.metu.edu.tr/~corpus/treebank.html METU-Sabanci Treebank]

Wikimedia Foundation. 2010.
