Language model

A statistical language model assigns a probability $P(w_1,\ldots,w_m)$ to a sequence of $m$ words by means of a probability distribution.

Language modeling is used in many natural language processing applications such as speech recognition, machine translation, part-of-speech tagging, parsing and information retrieval.

In speech recognition and in data compression, such a model tries to capture the properties of a language, and to predict the next word in a speech sequence.

When used in information retrieval, a language model is associated with each document in a collection. With a query $Q$ as input, retrieved documents are ranked by the probability that the document's language model $M_d$ would generate the terms of the query, $P(Q \mid M_d)$.
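The ranking described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes each document model is an unsmoothed unigram model (the source does not say which model is used), and the documents, query, and function names are hypothetical.

```python
from collections import Counter

def unigram_model(text):
    """Maximum-likelihood unigram model: word -> relative frequency in the document."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def query_likelihood(query, model):
    """P(Q | M_d) under a unigram model: product of the per-term probabilities."""
    p = 1.0
    for term in query.lower().split():
        # A term the document never contains gets probability 0 (no smoothing here).
        p *= model.get(term, 0.0)
    return p

# Hypothetical toy collection, for illustration only.
docs = {
    "d1": "the red house stands on the hill",
    "d2": "the red car drives past the red house",
}
models = {name: unigram_model(text) for name, text in docs.items()}

# Rank documents by how likely their model is to generate the query.
ranking = sorted(
    docs,
    key=lambda d: query_likelihood("red house", models[d]),
    reverse=True,
)
print(ranking)
```

Because the sketch uses no smoothing, any document missing a single query term scores zero, which is one reason practical systems smooth the document models.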

Estimating the probability of sequences can become difficult in corpora in which phrases or sentences can be arbitrarily long, because many sequences are never observed while training the language model (the data sparseness problem, which leads to overfitting). For that reason, these models are often approximated using smoothed n-gram models.
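To make the sparseness problem concrete, the sketch below compares an unsmoothed estimate with a smoothed one for pairs of adjacent words. Add-one (Laplace) smoothing is used only as one simple, assumed choice; the text does not name a particular smoothing method, and the toy corpus and function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy training corpus, for illustration only.
corpus = "i saw the red house . i saw the dog .".split()

vocab = set(corpus)
contexts = Counter(corpus[:-1])             # how often each word is followed by another
pairs = Counter(zip(corpus, corpus[1:]))    # counts of adjacent word pairs

def p_mle(word, prev):
    """Unsmoothed relative-frequency estimate of P(word | prev)."""
    return pairs[(prev, word)] / contexts[prev]

def p_add_one(word, prev):
    """Add-one (Laplace) smoothed estimate of P(word | prev)."""
    return (pairs[(prev, word)] + 1) / (contexts[prev] + len(vocab))

# "red dog" never occurs in training, so the unsmoothed estimate is zero,
# while the smoothed estimate reserves some probability for it.
print(p_mle("dog", "red"), p_add_one("dog", "red"))
```

The smoothed estimates still sum to one over the vocabulary for each preceding word; the probability given to unseen pairs is taken from the pairs that were observed.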

N-gram models

In an n-gram model, the probability $P(w_1,\ldots,w_m)$ of observing the sentence $w_1,\ldots,w_m$ is approximated as

$$P(w_1,\ldots,w_m) = \prod_{i=1}^{m} P(w_i \mid w_1,\ldots,w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)},\ldots,w_{i-1})$$

Here, it is assumed that the probability of observing the $i$th word $w_i$ given the context history of the preceding $i-1$ words can be approximated by the probability of observing it given the shortened context history of the preceding $n-1$ words (a Markov property of order $n-1$).

The conditional probability can be calculated from n-gram frequency counts:

$$P(w_i \mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})}$$
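The count ratio above translates directly into code. The following is a minimal sketch under stated assumptions: the token list is a hypothetical toy corpus, the function names are illustrative, and an unseen context is mapped to probability 0 rather than being smoothed.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def conditional_probability(word, context, tokens):
    """Estimate P(word | context) as count(context, word) / count(context)."""
    context = tuple(context)
    n = len(context) + 1
    numerator = ngram_counts(tokens, n)[context + (word,)]
    # With an empty context the denominator is simply the corpus size.
    denominator = ngram_counts(tokens, n - 1)[context] if context else len(tokens)
    return numerator / denominator if denominator else 0.0

# Hypothetical toy corpus, for illustration only.
tokens = "the red house and the red car and the blue house".split()

# Trigram estimate: count(the, red, house) / count(the, red) = 1 / 2
print(conditional_probability("house", ["the", "red"], tokens))
```

Recounting the corpus on every call keeps the sketch short; a practical implementation would count once and reuse the tables.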

The terms bigram language model and trigram language model denote n-gram language models with $n=2$ and $n=3$, respectively.

Example

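In a bigram ($n=2$) language model, the probability of the sentence "I saw the red house" is approximated as

$$P(\text{I},\text{saw},\text{the},\text{red},\text{house}) \approx P(\text{I})\, P(\text{saw} \mid \text{I})\, P(\text{the} \mid \text{saw})\, P(\text{red} \mid \text{the})\, P(\text{house} \mid \text{red})$$

whereas in a trigram ($n=3$) language model, the approximation is

$$P(\text{I},\text{saw},\text{the},\text{red},\text{house}) \approx P(\text{I})\, P(\text{saw} \mid \text{I})\, P(\text{the} \mid \text{I},\text{saw})\, P(\text{red} \mid \text{saw},\text{the})\, P(\text{house} \mid \text{the},\text{red})$$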
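The two factorizations in this example can be generated mechanically for any $n$. The sketch below only lists the factors and computes no probabilities; the function names are hypothetical. It truncates the context at the start of the sentence, matching the example, where the first words have fewer than $n-1$ predecessors.

```python
def ngram_factors(words, n):
    """Return (word, context) pairs, one for each factor P(word | context)."""
    factors = []
    for i, word in enumerate(words):
        # Keep at most the n-1 preceding words; fewer are available near the start.
        start = max(0, i - (n - 1))
        factors.append((word, tuple(words[start:i])))
    return factors

def format_factors(words, n):
    """Render the factorization as text, e.g. 'P(I) P(saw|I) ...'."""
    parts = []
    for word, context in ngram_factors(words, n):
        parts.append(f"P({word}|{','.join(context)})" if context else f"P({word})")
    return " ".join(parts)

sentence = "I saw the red house".split()
print(format_factors(sentence, 2))
print(format_factors(sentence, 3))
```

Multiplying estimates for each listed factor gives the approximate sentence probability under the corresponding model.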

See also

* Factored language model

Wikimedia Foundation. 2010.
