Maximum entropy probability distribution

In statistics and information theory, a maximum entropy probability distribution is a probability distribution whose entropy is at least as great as that of all other members of a specified class of distributions.

According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class, then the distribution with the largest entropy should be chosen as the default. The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.

1 Definition of entropy
2 Examples of maximum entropy distributions
3 A theorem by Boltzmann
4 See also
5 Notes
6 References

Definition of entropy

Further information: Entropy (information theory)

If X is a discrete random variable with distribution given by

$\operatorname{Pr}(X=x_k) = p_k \quad\mbox{ for } k=1,2,\ldots$

then the entropy of X is defined as

$H(X) = - \sum_{k\ge 1}p_k\log p_k .$

If X is a continuous random variable with probability density p(x), then the entropy of X is sometimes defined as^[1]^[2]^[3]

$H(X) = - \int_{-\infty}^\infty p(x)\log p(x) dx$

where p(x) log p(x) is understood to be zero whenever p(x) = 0. In connection with maximum entropy distributions, this form of definition is often the only one given, or at least it is taken as the standard form. However, it is recognisable as the special case m=1 of the more general definition

$H^c(p(x)\|m(x)) = -\int p(x)\log\frac{p(x)}{m(x)}\,dx,$

which is discussed in the articles Entropy (information theory) and Principle of maximum entropy.

The base of the logarithm is not important as long as the same one is used consistently: change of base merely results in a rescaling of the entropy. Information theoreticians may prefer to use base 2 in order to express the entropy in bits; mathematicians and physicists will often prefer the natural logarithm, resulting in a unit of nats or nepers for the entropy.
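As a minimal sketch of the discrete definition and of the change-of-base remark, the following Python snippet computes the entropy of a finite distribution in nats and in bits. The function name `entropy` and the example probabilities are illustrative assumptions, not part of the sources cited here.

```python
import math

def entropy(probs, base=math.e):
    """Shannon entropy of a finite distribution; terms with p = 0 contribute 0."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Illustrative distribution (an assumption made for this example).
p = [0.5, 0.25, 0.25, 0.0]

h_nats = entropy(p)           # natural logarithm, unit: nats
h_bits = entropy(p, base=2)   # base-2 logarithm, unit: bits

# Changing the base only rescales the entropy by a constant factor:
# the two printed values agree up to floating-point rounding.
print(h_bits)
print(h_nats / math.log(2))
```

Skipping the terms with p = 0 implements the convention that p log p is taken to be zero there.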

Examples of maximum entropy distributions

A table of examples of maximum entropy distributions is given in Park & Bera (2009).^[4]

Given mean and standard deviation: the normal distribution

The normal distribution N(μ,σ²) has maximum entropy among all real-valued distributions with specified mean μ and standard deviation σ. Therefore, the assumption of normality imposes the minimal prior structural constraint beyond these two moments. (See the differential entropy article for a derivation.)
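The claim can be illustrated numerically by comparing a few distributions that share the same standard deviation. The sketch below uses the standard closed-form differential entropies (in nats) of the normal, Laplace and uniform distributions. The choice of comparison distributions and the value σ = 1 are assumptions made for this example.

```python
import math

sigma = 1.0  # common standard deviation (illustrative value)

# Normal distribution: entropy 0.5 * ln(2 * pi * e * sigma^2).
h_normal = 0.5 * math.log(2 * math.pi * math.e * sigma**2)

# Laplace with scale b has variance 2 * b^2 and entropy 1 + ln(2 * b).
b = sigma / math.sqrt(2)
h_laplace = 1 + math.log(2 * b)

# Uniform on an interval of width w has variance w^2 / 12 and entropy ln(w).
w = sigma * math.sqrt(12)
h_uniform = math.log(w)

print(h_normal, h_laplace, h_uniform)
```

With equal variance, the normal entropy comes out largest of the three, as the statement above predicts. This is a spot check on three families, not a proof.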

Uniform and piecewise uniform distributions

The uniform distribution on the interval [a,b] is the maximum entropy distribution among all continuous distributions which are supported in the interval [a, b] (which means that the probability density is 0 outside of the interval).

More generally, if we're given a subdivision a=a₀ < a₁ < ... < a_k = b of the interval [a,b] and probabilities p₁,...,p_k which add up to one, then we can consider the class of all continuous distributions such that

$\operatorname{Pr}(a_{j-1}\le X < a_j) = p_j \quad \mbox{ for } j=1,\ldots,k$

The density of the maximum entropy distribution for this class is constant on each of the intervals [a_{j−1}, a_j); it looks somewhat like a histogram.
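A small sketch of this histogram-like density: on the j-th interval the density equals p_j divided by the interval width, and the entropy is −Σ p_j log(p_j / width_j). The function name `histogram_maxent` and the example subdivision are hypothetical, chosen only for illustration.

```python
import math

def histogram_maxent(edges, probs):
    """Return (densities, entropy in nats) of the piecewise-constant density
    that spreads mass probs[j] uniformly over [edges[j], edges[j+1])."""
    if len(edges) != len(probs) + 1:
        raise ValueError("need exactly one more edge than probabilities")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    densities = [p / w for p, w in zip(probs, widths)]
    # h = -sum_j p_j * ln(p_j / width_j), with 0 * ln 0 taken as 0
    h = -sum(p * math.log(p / w) for p, w in zip(probs, widths) if p > 0)
    return densities, h

# Illustrative subdivision of [0, 4] and interval probabilities.
edges = [0.0, 1.0, 3.0, 4.0]
probs = [0.2, 0.5, 0.3]
densities, h = histogram_maxent(edges, probs)
print(densities, h)
```

When the p_j are proportional to the interval widths, the density reduces to the uniform density on [a,b] and the entropy reaches log(b − a). Any other choice of the p_j gives a smaller value.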

The uniform distribution on the finite set {x₁,...,x_n} (which assigns a probability of 1/n to each of these values) is the maximum entropy distribution among all discrete distributions supported on this set.

Positive and given mean: the exponential distribution

The exponential distribution with mean 1/λ is the maximum entropy distribution among all continuous distributions supported in [0,∞) that have a mean of 1/λ.

In physics, this occurs when gravity acts on a gas that is kept at constant pressure and temperature: if X describes the height of a molecule, then X is exponentially distributed, which also means that the density of the gas decreases exponentially with height. The reason: X is clearly positive, and its mean, which corresponds to the average potential energy, is fixed. Over time, the system will attain its maximum entropy configuration, according to the second law of thermodynamics.
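As a numerical illustration of the maximum entropy property of the exponential distribution, the sketch below compares three distributions supported in [0,∞) that share the same mean, using standard closed-form differential entropies in nats. The comparison distributions (uniform and half-normal) and the mean value 1 are assumptions made for this example.

```python
import math

mean = 1.0  # common mean, i.e. 1/lambda (illustrative value)

# Exponential with rate lam = 1/mean: entropy 1 - ln(lam) = 1 + ln(mean).
h_exponential = 1 + math.log(mean)

# Uniform on [0, 2 * mean] has the same mean and entropy ln(2 * mean).
h_uniform = math.log(2 * mean)

# Half-normal with scale s has mean s * sqrt(2 / pi)
# and entropy 0.5 * ln(pi * s^2 / 2) + 0.5.
s = mean * math.sqrt(math.pi / 2)
h_half_normal = 0.5 * math.log(math.pi * s**2 / 2) + 0.5

print(h_exponential, h_half_normal, h_uniform)
```

The exponential entropy is the largest of the three, in line with the statement above.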

Discrete distributions with given mean

Among all the discrete distributions supported on the set {x₁,...,x_n} with mean μ, the maximum entropy distribution has the following shape:

$\operatorname{Pr}(X=x_k) = Cr^{x_k} \quad\mbox{ for } k=1,\ldots, n$

where the positive constants C and r can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ.

For example, suppose a large number N of dice are thrown and you are told that the sum of all the shown numbers is S. Based on this information alone, what would be a reasonable assumption for the number of dice showing 1, 2, ..., 6? This is an instance of the situation considered above, with {x₁,...,x₆} = {1,...,6} and μ = S/N.
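The constants C and r generally have no closed form on a finite set, but they are easy to find numerically. The sketch below solves for r by bisection on t = log r, using the fact that the mean of the distribution C·r^x is increasing in r. The function name `maxent_given_mean`, the search bracket for t and the value μ = 4.5 are assumptions made for this illustration.

```python
import math

def maxent_given_mean(values, mu, tol=1e-12):
    """Probabilities C * r**x on `values` with mean mu, via bisection on t = ln r.
    Returns (probabilities, r)."""
    if not min(values) < mu < max(values):
        raise ValueError("mu must lie strictly between the smallest and largest value")

    def dist(t):
        # Subtract the largest exponent before exponentiating, for stability.
        m = max(t * x for x in values)
        weights = [math.exp(t * x - m) for x in values]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(t):
        return sum(x * p for x, p in zip(values, dist(t)))

    lo, hi = -50.0, 50.0  # assumed bracket; wide enough for die faces
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean(mid) < mu:
            lo = mid
        else:
            hi = mid
    t = (lo + hi) / 2
    return dist(t), math.exp(t)

faces = [1, 2, 3, 4, 5, 6]
probs, r = maxent_given_mean(faces, 4.5)  # e.g. S/N = 4.5
print(probs, r)
```

Under this model the expected number of dice showing face k is N times the k-th probability. For μ = 3.5 the solution has r = 1, i.e. the uniform distribution on the six faces.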

Finally, among all the discrete distributions supported on the infinite set {x₁,x₂,...} with mean μ, the maximum entropy distribution has the shape:

$\operatorname{Pr}(X=x_k) = Cr^{x_k} \quad\mbox{ for } k=1,2,\ldots ,$

where again the constants C and r are determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ. For example, in the case that x_k = k, this gives

$C = \frac{1}{\mu - 1} , \quad\quad r = \frac{\mu - 1}{\mu} .$
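As a brief check of these constants (a sketch that uses only the geometric series and assumes 0 < r < 1, which requires μ > 1), the two requirements with x_k = k read

$\sum_{k\ge 1} C r^k = \frac{Cr}{1-r} = 1, \qquad \sum_{k\ge 1} k\,C r^k = \frac{Cr}{(1-r)^2} = \mu .$

Dividing the second equation by the first gives 1/(1−r) = μ, so r = (μ−1)/μ, and substituting back gives C = (1−r)/r = 1/(μ−1). The resulting distribution is

$\operatorname{Pr}(X=k) = \frac{1}{\mu}\left(1-\frac{1}{\mu}\right)^{k-1} \quad\mbox{ for } k=1,2,\ldots ,$

that is, the geometric distribution with success probability 1/μ.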

Circular random variables

For a continuous random variable $\theta_i$ distributed about the unit circle, the von Mises distribution maximizes the entropy when given the real and imaginary parts of the first circular moment^[5] or, equivalently, the circular mean and circular variance.

When given the mean and variance of the angles $\theta_i$ modulo $2\pi$, the wrapped normal distribution maximizes the entropy.^[5]

A theorem by Boltzmann

All the above examples are consequences of the following theorem by Ludwig Boltzmann.

Continuous version

Suppose S is a closed subset of the real numbers R and we're given n measurable functions f₁,...,f_n and n numbers a₁,...,a_n. We consider the class C of all continuous random variables which are supported on S (i.e. whose density function is zero outside of S) and which satisfy the n expected value conditions

$\operatorname{E}(f_j(X)) = a_j\quad\mbox{ for } j=1,\ldots,n$

If there is a member in C whose density function is positive everywhere in S, and if there exists a maximal entropy distribution for C, then its probability density p(x) has the following shape:

$p(x)=c \exp\left(\sum_{j=1}^n \lambda_j f_j(x)\right)\quad \mbox{ for all } x\in S$

where the constants c and λ_j have to be determined so that the integral of p(x) over S is 1 and the above conditions for the expected values are satisfied.

Conversely, if constants c and λ_j like this can be found, then p(x) is indeed the density of the (unique) maximum entropy distribution for our class C.
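Two of the earlier examples can be recovered this way, as a worked sketch. The multipliers are written λ₁, λ₂ here to keep them apart from the rate λ of the exponential distribution.

For S = [0,∞), n = 1, f₁(x) = x and a₁ = 1/λ, the candidate density is p(x) = c exp(λ₁x). It is integrable only if λ₁ < 0, and then

$\int_0^\infty c\,e^{\lambda_1 x}\,dx = -\frac{c}{\lambda_1} = 1, \qquad \int_0^\infty x\,c\,e^{\lambda_1 x}\,dx = \frac{c}{\lambda_1^2} = \frac{1}{\lambda},$

which gives λ₁ = −λ and c = λ, i.e. the exponential density $p(x)=\lambda e^{-\lambda x}$.

For S = R, n = 2, f₁(x) = x, f₂(x) = x², a₁ = μ and a₂ = μ² + σ², the choice

$\lambda_1 = \frac{\mu}{\sigma^2}, \qquad \lambda_2 = -\frac{1}{2\sigma^2}, \qquad c = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{\mu^2}{2\sigma^2}\right)$

turns $c\exp(\lambda_1 x + \lambda_2 x^2)$ into the normal density $\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$, which satisfies both conditions.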

This theorem is proved with the calculus of variations and Lagrange multipliers.
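A formal sketch of that argument, ignoring regularity questions and using the natural logarithm, runs as follows. The extra multiplier λ₀ for the normalization condition is notation introduced here. One forms the functional

$J(p) = -\int_S p(x)\log p(x)\,dx + \lambda_0\left(\int_S p(x)\,dx - 1\right) + \sum_{j=1}^n \lambda_j\left(\int_S f_j(x)\,p(x)\,dx - a_j\right)$

and requires its variation with respect to p to vanish at every x in S:

$-\log p(x) - 1 + \lambda_0 + \sum_{j=1}^n \lambda_j f_j(x) = 0 .$

Solving for p(x) gives

$p(x) = e^{\lambda_0 - 1}\exp\left(\sum_{j=1}^n \lambda_j f_j(x)\right),$

which is the stated shape with $c = e^{\lambda_0-1}$.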

Discrete version

Suppose S = {x₁,x₂,...} is a (finite or infinite) discrete subset of the reals and we're given n functions f₁,...,f_n and n numbers a₁,...,a_n. We consider the class C of all discrete random variables X which are supported on S and which satisfy the n conditions

$\operatorname{E}(f_j(X)) = a_j\quad\mbox{ for } j=1,\ldots,n$

If there exists a member of C which assigns positive probability to all members of S and if there exists a maximum entropy distribution for C, then this distribution has the following shape:

$\operatorname{Pr}(X=x_k)=c \exp\left(\sum_{j=1}^n \lambda_j f_j(x_k)\right)\quad \mbox{ for } k=1,2,\ldots$

where the constants c and λ_j have to be determined so that the sum of the probabilities is 1 and the above conditions for the expected values are satisfied.

Conversely, if constants c and λ_j like this can be found, then the above distribution is indeed the maximum entropy distribution for our class C.
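The converse can be spot-checked numerically. The sketch below builds a distribution of the stated form on {1,...,6} with the single constraint function f₁(x) = x, then moves it along a direction that keeps both the total probability and the mean unchanged. Every such perturbed distribution lies in the same class C and should have strictly smaller entropy. The multiplier value 0.3, the direction `d` and the step sizes are arbitrary choices made for this illustration.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; terms with p = 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

values = [1, 2, 3, 4, 5, 6]
lam = 0.3  # illustrative multiplier for the constraint f_1(x) = x

# Distribution of the form c * exp(lam * x).
weights = [math.exp(lam * x) for x in values]
c = 1 / sum(weights)
p = [c * w for w in weights]
mu = sum(x * pk for x, pk in zip(values, p))

# Direction with sum(d) = 0 and sum(x * d) = 0, so p + eps * d keeps
# the same total probability and the same mean as p.
d = [1, -2, 1, 0, 0, 0]

perturbed = []
for eps in (0.01, -0.01, 0.02):
    q = [pk + eps * dk for pk, dk in zip(p, d)]
    assert min(q) > 0  # q is still a valid probability distribution
    perturbed.append((eps, entropy(q)))

print(entropy(p), perturbed)
```

This checks only one feasible direction and a few step sizes, so it illustrates the theorem rather than proving it.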

This version of the theorem can be proved with the tools of ordinary calculus and Lagrange multipliers.

Caveats

Note that not all classes of distributions contain a maximum entropy distribution. It is possible that a class contains distributions of arbitrarily large entropy (e.g. the class of all continuous distributions on R with mean 0 but arbitrary standard deviation), or that the entropies are bounded above but there is no distribution which attains the maximal entropy (e.g. the class of all continuous distributions X on R with E(X) = 0 and E(X²) = E(X³) = 1).^[citation needed]
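To make the first caveat concrete, using the standard closed form for the differential entropy of a normal distribution: the class of continuous distributions on R with mean 0 contains N(0,σ²) for every σ > 0, and

$H\big(N(0,\sigma^2)\big) = \tfrac{1}{2}\log\left(2\pi e\,\sigma^2\right) \to \infty \quad\mbox{ as } \sigma\to\infty ,$

so the entropies in this class are unbounded and no maximum entropy distribution exists.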

It is also possible that the expected value restrictions for the class C force the probability distribution to be zero in certain subsets of S. In that case our theorem doesn't apply, but one can work around this by shrinking the set S.

Notes

1. Williams, D. (2001). Weighing the Odds. Cambridge University Press. ISBN 0-521-00618-x (pages 197-199)
2. Bernardo, J.M.; Smith, A.F.M. (2000). Bayesian Theory. Wiley. ISBN 0-471-49464-x (pages 209, 366)
3. O'Hagan, A. (1994). Kendall's Advanced Theory of Statistics, Vol 2B, Bayesian Inference. Edward Arnold. ISBN 0-340-52922-9 (Section 5.40)
4. Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model". Journal of Econometrics (Elsevier): 219–230. http://www.wise.xmu.edu.cn/Master/Download/..%5C..%5CUploadFiles%5Cpaper-masterdownload%5C2009519932327055475115776.pdf. Retrieved 2011-06-02.
5. Jammalamadaka, S. Rao; SenGupta, A. (2001). Topics in circular statistics. New Jersey: World Scientific. ISBN 9810237782. http://books.google.com/books?id=sKqWMGqQXQkC&printsec=frontcover&dq=Jammalamadaka+Topics+in+circular&hl=en&ei=iJ3QTe77NKL00gGdyqHoDQ&sa=X&oi=book_result&ct=result&resnum=1&ved=0CDcQ6AEwAA#v=onepage&q&f=false. Retrieved 2011-05-15.

References

T. M. Cover and J. A. Thomas, Elements of Information Theory, 1991. Chapter 11.
I. J. Taneja, Generalized Information Measures and Their Applications 2001. Chapter 1


Wikimedia Foundation. 2010.
