Language model

A statistical language model assigns a probability $P(w_1,\ldots,w_m)$ to a sequence of "m" words by means of a probability distribution.

Language modeling is used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing and information retrieval.

In speech recognition and in data compression, such a model tries to capture the properties of a language, and to predict the next word in a speech sequence.

When used in information retrieval, a language model is associated with a document in a collection. With query "Q" as input, retrieved documents are ranked based on the probability that the document's language model would generate the terms of the query, "P(Q|M_d)".
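The ranking described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes a unigram model per document and additive smoothing (the `alpha` parameter) so that a query term missing from a document does not force the score to zero. The function names `query_likelihood` and `rank_documents` are hypothetical.

```python
from collections import Counter


def query_likelihood(query_terms, doc_terms, vocab_size, alpha=1.0):
    """Estimate P(Q|M_d) with a unigram document model.

    Assumption: additive smoothing with pseudo-count `alpha`; the source
    text does not prescribe a particular estimator.
    """
    counts = Counter(doc_terms)
    total = len(doc_terms)
    prob = 1.0
    for term in query_terms:
        prob *= (counts[term] + alpha) / (total + alpha * vocab_size)
    return prob


def rank_documents(query, docs, alpha=1.0):
    """Return document names sorted by decreasing P(Q|M_d)."""
    query_terms = query.split()
    # Vocabulary: every term seen in any document or in the query.
    vocab = {t for text in docs.values() for t in text.split()} | set(query_terms)
    scores = {
        name: query_likelihood(query_terms, text.split(), len(vocab), alpha)
        for name, text in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


docs = {"d1": "the red house is red", "d2": "the blue car"}
ranking = rank_documents("red house", docs)
```

With these two toy documents, "d1" is ranked first because both query terms occur in it.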

Estimating sequence probabilities directly from a corpus is difficult: phrases and sentences can be arbitrarily long, so many valid sequences are never observed during training of the language model (the data sparseness problem, which leads to overfitting). For that reason these models are often approximated using smoothed n-gram models.

N-gram models

In an n-gram model, the probability $P(w_1,\ldots,w_m)$ of observing the sentence $w_1,\ldots,w_m$ is approximated as

$P(w_1,\ldots,w_m) = \prod^m_{i=1} P(w_i|w_1,\ldots,w_{i-1}) \approx \prod^m_{i=1} P(w_i|w_{i-(n-1)},\ldots,w_{i-1})$

Here, it is assumed that the probability of observing the "i"-th word "w_i" in the context history of the preceding "i-1" words can be approximated by the probability of observing it in the shortened context history of the preceding "n-1" words (a Markov property of order "n-1").

The conditional probability can be calculated from n-gram frequency counts: $P(w_i|w_{i-(n-1)},\ldots,w_{i-1}) = \frac{count(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{count(w_{i-(n-1)},\ldots,w_{i-1})}$
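The count ratio above translates directly into code. The following is a minimal sketch on a toy token list; the helper names `ngram_counts` and `conditional_probability` are hypothetical, and returning 0.0 for an unseen context is an assumption made here to avoid division by zero.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram (as a tuple) in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def conditional_probability(tokens, context, word):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    n = len(context) + 1
    numerator = ngram_counts(tokens, n)[context + (word,)]
    denominator = ngram_counts(tokens, n - 1)[context]
    # Assumption: an unseen context yields probability 0.0.
    return numerator / denominator if denominator else 0.0


tokens = "i saw the red house and i saw the blue house".split()
p_red_given_the = conditional_probability(tokens, ["the"], "red")
p_house_given_the_red = conditional_probability(tokens, ["the", "red"], "house")
```

In this toy corpus "the" occurs twice and is followed by "red" once, so the first estimate is 1/2; "the red" occurs once and is always followed by "house", so the second is 1.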

The terms bigram and trigram language model denote n-gram language models with "n=2" and "n=3", respectively.
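Raw frequency counts assign zero probability to any n-gram absent from the training data, which is the sparseness problem that motivates smoothing. As one illustration, the sketch below applies additive (add-one, or Laplace) smoothing to a bigram model. The choice of this particular smoothing method is an assumption, since the text does not name one, and `smoothed_bigram_probability` is a hypothetical name.

```python
from collections import Counter


def smoothed_bigram_probability(prev, word, tokens, alpha=1.0):
    """Estimate P(word | prev) with additive smoothing.

    Assumption: the vocabulary is the set of tokens seen in training,
    and `alpha` is the pseudo-count added to every possible bigram.
    """
    vocab = set(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Count each token only where it is followed by another token.
    contexts = Counter(tokens[:-1])
    return (bigrams[(prev, word)] + alpha) / (contexts[prev] + alpha * len(vocab))


tokens = "the red house the blue house".split()
p_seen = smoothed_bigram_probability("the", "red", tokens)
p_unseen = smoothed_bigram_probability("the", "house", tokens)
```

The unseen bigram "the house" now receives a small non-zero probability, and the estimates for a fixed context still sum to one over the vocabulary.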

Example

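In a bigram (n=2) language model, the probability of the sentence "I saw the red house" is approximated as $P(\text{I},\text{saw},\text{the},\text{red},\text{house}) \approx P(\text{I}) P(\text{saw}|\text{I}) P(\text{the}|\text{saw}) P(\text{red}|\text{the}) P(\text{house}|\text{red})$

whereas in a trigram (n=3) language model, the approximation is $P(\text{I},\text{saw},\text{the},\text{red},\text{house}) \approx P(\text{I}) P(\text{saw}|\text{I}) P(\text{the}|\text{I},\text{saw}) P(\text{red}|\text{saw},\text{the}) P(\text{house}|\text{the},\text{red})$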
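The bigram factorisation of the example sentence can be evaluated on a small corpus as follows. This is a sketch under stated assumptions: the ten-word training corpus is invented for illustration, probabilities are unsmoothed count ratios, and `bigram_sentence_probability` is a hypothetical name.

```python
from collections import Counter


def bigram_sentence_probability(sentence, corpus_tokens):
    """Approximate P(w_1..w_m) as P(w_1) * prod_i P(w_i | w_{i-1})."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    words = sentence.split()
    # First factor: unigram probability of the first word.
    prob = unigrams[words[0]] / len(corpus_tokens)
    for prev, word in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob


# Invented toy corpus, used only to make the arithmetic concrete.
corpus = "I saw the red house I saw the blue house".split()
p = bigram_sentence_probability("I saw the red house", corpus)
```

Here the factors are P(I) = 2/10, P(saw|I) = 1, P(the|saw) = 1, P(red|the) = 1/2 and P(house|red) = 1, giving a sentence probability of 0.1. A sentence containing an unseen bigram, such as "I saw the green house", receives probability zero under this unsmoothed estimate.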

See also

* Factored language model

References

Wikimedia Foundation. 2010.



Academic Dictionaries and Encyclopedias
