n-gram

n-gram: Not to be confused with engram.

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items in question can be phonemes, syllables, letters, words or base pairs according to the application. n-grams are collected from a text or speech corpus.

An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram"; size 4 is a "four-gram" and size 5 or more is simply called an "n-gram". Some language models built from n-grams are "(n − 1)-order Markov models".

An n-gram model is a type of probabilistic model for predicting the next item in such a sequence. n-gram models are used in various areas of statistical natural language processing and genetic sequence analysis.

Contents

1 Examples

2 n-gram models

3 Applications and considerations

4 n-grams for approximate matching

5 Other applications

6 Bias-versus-variance trade-off

6.1 Smoothing techniques

7 See also

8 References

9 Bibliography

10 External links

Examples

Here are examples of word level 3-grams and 4-grams (and counts of the number of times they appeared) from the Google n-gram corpus.^[1]

ceramics collectables collectibles (55)

ceramics collectables fine (130)

ceramics collected by (52)

ceramics collectible pottery (50)

ceramics collectibles cooking (45)

4-grams

serve as the incoming (92)

serve as the incubator (99)

serve as the independent (794)

serve as the index (223)

serve as the indication (72)

serve as the indicator (120)

n-gram models

An n-gram model models sequences, notably natural languages, using the statistical properties of n-grams.

This idea can be traced to an experiment by Claude Shannon's work in information theory. His question was, given a sequence of letters (for example, the sequence "for ex"), what is the likelihood of the next letter? From training data, one can derive a probability distribution for the next letter given a history of size $n$ : a = 0.4, b = 0.00001, c = 0, ....; where the probabilities of all possible "next-letters" sum to 1.0.

More concisely, an n-gram model predicts $x i$ based on $x_{i-(n-1)}, \dots, x_{i-1}$ . In Probability terms, this is $P(x_{i} | x_{i-(n-1)}, \dots, x_{i-1})$ . When used for language modeling, independence assumptions are made so that each word depends only on the last n-1 words. This Markov model is used as an approximation of the true underlying language. This assumption is important because it massively simplifies the problem of learning the language model from data. In addition, because of the open nature of language, it is common to group words unknown to the language model together.

Note that in a simple n-gram language model, the probability of a word, conditioned on some number of previous words (one word in a bigram model, two words in a trigram model, etc.) can be described as following a categorical distribution (often imprecisely called a "multinomial distribution").

In practice, however, it is necessary to smooth the probability distributions by also assigning non-zero probabilities to unseen words or n-grams. See #Smoothing techniques for details.

Applications and considerations

n-gram models are widely used in statistical natural language processing. In speech recognition, phonemes and sequences of phonemes are modeled using a n-gram distribution. For parsing, words are modeled such that each n-gram is composed of n words. For language identification, sequences of characters/graphemes (e.g., letters of the alphabet) are modeled for different languages. For a sequence of words, (for example "the dog smelled like a skunk"), the trigrams would be: "# the dog", "the dog smelled", "dog smelled like", "smelled like a", "like a skunk" and "a skunk #". For sequences of characters, the 3-grams (sometimes referred to as "trigrams") that can be generated from "good morning" are "goo", "ood", "od ", "d m", " mo", "mor" and so forth. Some practitioners preprocess strings to remove spaces, most simply collapse whitespace to a single space while preserving paragraph marks. Punctuation is also commonly reduced or removed by preprocessing. n-grams can also be used for sequences of words or, in fact, for almost any type of data. They have been used for example for extracting features for clustering large sets of satellite earth images and for determining what part of the Earth a particular image came from. They have also been very successful as the first pass in genetic sequence search and in the identification of the species from which short sequences of DNA originated.

n-gram models are often criticized, especially by Chomskyans, because they lack any explicit representation of long range dependency. (In fact, it was Chomsky's blistering attack on Markov models, in the late 1950s, that caused their virtual disappearance from natural language processing, along with statistical methods in general, until well into the 1980s.) This is because the only explicit dependency range is (n-1) tokens for an n-gram model, and since natural languages incorporate many cases of unbounded dependencies (such as wh-movement), this means that an n-gram model cannot in principle distinguish unbounded dependencies from noise (since long range correlations drop exponentially with distance for any Markov model). For this reason, n-gram models have not made much impact on linguistic theory, where part of the explicit goal is to model such dependencies.

Another Chomskyan criticism that has been leveled is that Markov models of language, including n-gram models, do not explicitly capture the performance/competence distinction introduced by Noam Chomsky. This is because n-gram models are not designed to model linguistic knowledge as such, and make no claims to being (even potentially) complete models of linguistic knowledge; instead, they are used in practical applications.

In practice, however, n-gram models have been shown through abundant experiments to be extremely effective in creating language models, which are a core component in modern statistical language applications. In contrast, the typical applications based on Chomskyan syntactic models that were common until the 1990's were based largely or entirely on hand-crafted rules and were hopelessly unscalable. Such applications tended to do well on toy problems but failed entirely when tested on real-world data. (Despite this, for some 30 or 40 years most applications were created in such a fashion. The reason for the long persistence of this paradigm was partly the limitations of computers of the time and partly the preeminence of the Chomskyan paradigm in linguistics at the time. Chomskyans tend to disdain testing of their theories on typical real-world data, focusing instead on thought experiments that are designed to test the theoretical limits of their theories, similar to the way that mathematicians tend to focus on "pathological" rather than typical cases. The dominance of Chomskyan paradigms during this time meant that testing on real-world data was simply not considered important.)

Furthermore, most modern applications that rely on a language model, such as machine translation applications, do not rely exclusively on a language model. In keeping with the paradigm of Bayesian inference, modern statistical models are typically made up of two parts, a prior distribution describing the inherent likelihood of a possible result and a likelihood function used to assess the compatibility of a possible result with observed data. When a language model is used, it is used as part of the prior distribution (e.g. to gauge the inherent "goodness" of a possible translation), and even then it is often not the only component in this distribution. Handcrafted features of various sorts are also used, for example variables that represent the position of a word in a sentence or the general topic of discourse. In addition, features based on the structure of the potential result, such as syntactic considerations, are often used. Such features are also used as part of the likelihood function, which makes use of the observed data. Conventional linguistic theory can be incorporated in these features (although in practice, it is rare that features specific to Chomskyan or other particular theories of grammar are incorporated, as computational linguists tend to be "agnostic" towards individual theories of grammar).

n-grams for approximate matching

n-grams can also be used for efficient approximate matching. By converting a sequence of items to a set of n-grams, it can be embedded in a vector space, thus allowing the sequence to be compared to other sequences in an efficient manner. For example, if we convert strings with only letters in the English alphabet into 3-grams, we get a $263$ -dimensional space (the first dimension measures the number of occurrences of "aaa", the second "aab", and so forth for all possible combinations of three letters). Using this representation, we lose information about the string. For example, both the strings "abc" and "bca" give rise to exactly the same 2-gram "bc" (although {"ab", "bc"} is clearly not the same as {"bc", "ca"}). However, we know empirically that if two strings of real text have a similar vector representation (as measured by cosine distance) then they are likely to be similar. Other metrics have also been applied to vectors of n-grams with varying, sometimes better, results. For example z-scores have been used to compare documents by examining how many standard deviations each n-gram differs from its mean occurrence in a large collection, or text corpus, of documents (which form the "background" vector). In the event of small counts, the g-score may give better results for comparing alternative models.

It is also possible to take a more principled approach to the statistics of n-grams, modeling similarity as the likelihood that two strings came from the same source directly in terms of a problem in Bayesian inference.

Other applications

n-grams find use in several areas of computer science, computational linguistics, and applied mathematics.

They have been used to:

design kernels that allow machine learning algorithms such as support vector machines to learn from string data

find likely candidates for the correct spelling of a misspelled word

improve compression in compression algorithms where a small area of data requires n-grams of greater length

assess the probability of a given word sequence appearing in text of a language of interest in pattern recognition systems, speech recognition, OCR (optical character recognition), Intelligent Character Recognition (ICR), machine translation and similar applications

improve retrieval in information retrieval systems when it is hoped to find similar "documents" (a term for which the conventional meaning is sometimes stretched, depending on the data set) given a single query document and a database of reference documents

improve retrieval performance in genetic sequence analysis as in the BLAST family of programs

identify the language a text is in or the species a small sequence of DNA was taken from

predict letters or words at random in order to create text, as in the dissociated press algorithm.

Bias-versus-variance trade-off

What goes into picking the n for the n-gram?

Smoothing techniques

There are problems of balance weight between infrequent grams (for example, if a proper name appeared in the training data) and frequent grams. Also, items not seen in the training data will be given a probability of 0.0 without smoothing. For unseen but plausible data from a sample, one can introduce pseudocounts. Pseudocounts are generally motivated on Bayesian grounds.

In practice it is necessary to smooth the probability distributions by also assigning non-zero probabilities to unseen words or n-grams. The reason is that models derived directly from the n-gram frequency counts have severe problems when confronted with any n-grams that have not explicitly been seen before -- the zero-frequency problem. Various smoothing methods are used, from simple "add-one" smoothing (assign a count of 1 to unseen n-grams; see Rule of succession) to more sophisticated models, such as Good-Turing discounting or back-off models. Some of these methods are equivalent to assigning a prior distribution to the probabilities of the N-grams and using Bayesian inference to compute the resulting posterior N-gram probabilities. However, the more sophisticated smoothing models were typically not derived in this fashion, but instead through independent considerations.

Linear interpolation (e.g., taking the weighted mean of the unigram, bigram, and trigram)

Good-Turing discounting

Witten-Bell discounting

Lidstone's smoothing

Katz's back-off model (trigram)

See also

Hidden Markov model

Trigram, Bigram

n-tuple

k-mer

String Kernel

References

^ http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Bibliography

Christopher D. Manning, Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press: 1999. ISBN 0-262-13360-1.

Ted Dunning, Statistical Identification of Language. Computing Research Laboratory Memorandum (1994) MCCS-94-273.

Owen White, Ted Dunning, Granger Sutton, Mark Adams, J.Craig Venter, and Chris Fields. A quality control algorithm for dna sequencing projects. Nucleic Acids Research, 21(16):3829—3838, 1993.

Frederick J. Damerau, Markov Models and Linguistic Theory. Mouton. The Hague, 1971.

External links

Google's ngram viewer and web n-grams database (September 2006)

Microsoft's web n-grams service

Peachnote's music ngram viewer

Google n-gram Information Extracter

Two visualizations of Google's n-gram dataset: Word Association, Word Spectrum.

Dunning, T. (1994) "Statistical Identification of Language". Technical Report MCCS 94-273, New Mexico State University, 1994. N-gram language identification algorithm [1] (language identification)

Ngram Statistics Package, open source package to identify statistically significant Ngrams

Stochastic Language Models (N-Gram) Specification (W3C)

language_detector, open source N-Gram based language detector, written in ruby

The Gibberizer, open source software to generate familiar-sounding gibberish using N-Grams

Categories:
Natural language processing
Computational linguistics
Speech recognition
Corpus linguistics
Probabilistic models

Игры ⚽ Поможем написать реферат

Look at other dictionaries:

gram — gram·a·rye; gram; gram·i·ci·din; gram·i·na·ce·ae; gram·i·na·ceous; gram·i·na·les; gram·ine; gram·i·nic·o·lous; gram·i·niv·o·rous; gram·i·nous; gram·ma·log; gram·mar·i·an; gram·mar·less; gram·mat·i·cal; gram·mat·i·cism; gram·mat·i·cize;… … English syllables
Gram staining — (or Gram s method) is an empirical method of differentiating bacterial species into two large groups (Gram positive and Gram negative) based on the chemical and physical properties of their cell walls.cite book | last = Bergey | first = David H.… … Wikipedia
Gram-positive bacteria — Gram positive bacteria, stained purple, of both the bacillus (“rod shaped”) and coccus (spherical) forms. A few Gram negative bacteria are also present, stained pink. Numbered ticks are eleven (11) microns apart. Gram positive… … Wikipedia
gram — [ gram ] n. m. inv. • 1897; de coloration de Gram, nom d un bactériologiste danois ♦ Biol. Méthode de coloration des microbes avec une solution à base d iode et d iodure de potassium. ⊗ HOM. Gramme. ● Gram nom masculin invariable Coloration de… … Encyclopédie Universelle
Gram+ — Gram positif Les bactéries à Gram positif sont mises en évidence par une technique de coloration appelée coloration de Gram. Les bactéries à Gram positif apparaissent alors mauves au microscope. La technique de coloration repose sur les… … Wikipédia en Français
Gram-positif — Les bactéries à Gram positif sont mises en évidence par une technique de coloration appelée coloration de Gram. Les bactéries à Gram positif apparaissent alors mauves au microscope. La technique de coloration repose sur les caractéristiques… … Wikipédia en Français
Gram-positive — Gram positif Les bactéries à Gram positif sont mises en évidence par une technique de coloration appelée coloration de Gram. Les bactéries à Gram positif apparaissent alors mauves au microscope. La technique de coloration repose sur les… … Wikipédia en Français
Gram + — Gram positif Les bactéries à Gram positif sont mises en évidence par une technique de coloration appelée coloration de Gram. Les bactéries à Gram positif apparaissent alors mauves au microscope. La technique de coloration repose sur les… … Wikipédia en Français
Gram Positif — Les bactéries à Gram positif sont mises en évidence par une technique de coloration appelée coloration de Gram. Les bactéries à Gram positif apparaissent alors mauves au microscope. La technique de coloration repose sur les caractéristiques… … Wikipédia en Français
Gram positive — Gram positif Les bactéries à Gram positif sont mises en évidence par une technique de coloration appelée coloration de Gram. Les bactéries à Gram positif apparaissent alors mauves au microscope. La technique de coloration repose sur les… … Wikipédia en Français
Gram (disambiguation) — Gram is a unit of measurement of mass. Otherwise, gram may refer to: gram, the Greek based suffix meaning drawing or representation. Contents 1 Places 2 People … Wikipedia

Academic Dictionaries and Encyclopedias

n-gram

Contents

Examples

n-gram models

Applications and considerations

n-grams for approximate matching

Other applications

Bias-versus-variance trade-off

Smoothing techniques

See also

References

Bibliography

External links

Look at other dictionaries:

Share the article and excerpts

Academic Dictionaries and Encyclopedias

Wikipedia

n-gram

Contents

Examples

n-gram models

Applications and considerations

n-grams for approximate matching

Other applications

Bias-versus-variance trade-off

Smoothing techniques

See also

References

Bibliography

External links

Look at other dictionaries:

Share the article and excerpts

Direct link