Croatian National Corpus

The Croatian National Corpus (Croatian: Hrvatski nacionalni korpus, HNK) is the biggest and most important corpus of the Croatian language. Its compilation started in 1998 at the Institute of Linguistics^[1] of the Faculty of Humanities and Social Sciences, University of Zagreb, following the ideas of Marko Tadić. The theoretical foundations of, and the arguments for, a general-purpose, representative, multi-million corpus of the Croatian language had started to appear even earlier^[2]. The Croatian National Corpus is compiled from selected texts written in Croatian covering all fields, topics, genres and styles: from literary and scientific texts to textbooks, newspapers, user groups and chat rooms.

The initial composition was divided into two constituents:

A 30-million corpus of contemporary Croatian language (30m), which included samples of texts from 1990 onwards. The criteria for including text samples were that they be written by native speakers and cover different fields, genres and topics. Translated texts and poetry were excluded.

The Croatian Electronic Text Archive (HETA), which included complete texts, particularly serial publications (volumes, series, editions etc.) that would have unbalanced the 30m had they been inserted there.

Since 2004, with the adoption of the concept of the 3rd generation corpus, the two-constituent structure has been abandoned in favor of several subcorpora and a larger size. Since 2005 HNK has comprised 105 million tokens and is composed of a number of different subcorpora, which can be searched individually or all together as a whole corpus. In 2004 HNK also migrated to a new server platform, the Manatee/Bonito server-client architecture. Searching the HNK (today still with free test access) requires the free client program Bonito^[3], produced at the Natural Language Processing Laboratory^[4] of the Faculty of Informatics^[5], Masaryk University in Brno, Czech Republic. Its interface supports complex and elaborate queries over the corpus, different types of statistical results, total or partial word lists according to different query criteria (with their frequencies), frequency distributions of types, automatic collocation detection etc.
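To make two of the features listed above more concrete (word lists with frequencies and automatic collocation detection), here is a minimal Python sketch. It does not use Bonito or Manatee and says nothing about how they compute their results: the function names, the toy sample sentence and the choice of pointwise mutual information (PMI) as the collocation score are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_frequencies(tokens):
    """Return a Counter mapping each word form to its number of occurrences."""
    return Counter(tokens)


def collocations(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    PMI is one common association measure; it is an assumption here,
    not a statement about the measure any particular corpus tool uses.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI is unreliable
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scored.append(((w1, w2), math.log2(p_pair / (p_w1 * p_w2))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy sample, not taken from HNK.
sample = (
    "hrvatski nacionalni korpus je korpus hrvatskoga jezika "
    "a hrvatski nacionalni korpus je javno dostupan"
).split()

if __name__ == "__main__":
    # Word list with frequencies, most frequent first.
    for word, count in word_frequencies(sample).most_common(5):
        print(word, count)
    # Candidate collocations, strongest association first.
    for pair, score in collocations(sample):
        print(pair, round(score, 2))
```

On a real corpus of many millions of tokens the same counting would normally be done over a prebuilt index rather than a Python list held in memory, and the minimum-count threshold would be set much higher.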

References

^[1] Institute of Linguistics

^[2] Tadić 1990, 1996, 1998

^[3] Bonito

^[4] Natural Language Processing Laboratory

^[5] Faculty of Informatics

External links

Croatian National Corpus website

(Croatian) Hrvatska jezična riznica, another online Croatian corpus, by the Institute of Croatian Language and Linguistics
