Distance correlation

In statistics and in probability theory, distance correlation is a measure of statistical dependence between two random variables or two random vectors of arbitrary, not necessarily equal, dimension. An important property of this measure is that it is zero if and only if the random variables are statistically independent. Distance correlation is defined in terms of several related quantities, namely distance variance, distance standard deviation and distance covariance, which play the same roles as the ordinary moments with corresponding names in the specification of the Pearson product-moment correlation coefficient.

These distance-based measures can be put into an indirect relationship to the ordinary moments by an alternative formulation (described below) using ideas related to Brownian motion, and this has led to the use of names such as Brownian covariance and Brownian distance covariance.

Background

The classical measure of dependence, the Pearson correlation coefficient,[1] is mainly sensitive to a linear relationship between two variables. Distance correlation was introduced in 2005 by Gábor J. Székely in several lectures to address this deficiency of Pearson's correlation, namely that it can easily be zero for dependent variables: correlation = 0 (uncorrelatedness) does not imply independence, while distance correlation = 0 does imply independence. The first results on distance correlation were published in 2007 and 2009.[2][3] It was proved that distance covariance is the same as the Brownian covariance.[3] These measures are examples of energy distances.

Definitions

Distance covariance

The population value of the distance covariance[2][4] is the square root of

\operatorname{dCov}^2(X,Y) := \operatorname{E}|X - X'|\,|Y - Y'| + \operatorname{E}|X - X'|\,\operatorname{E}|Y - Y'| - \operatorname{E}|X - X'|\,|Y - Y''| - \operatorname{E}|X - X''|\,|Y - Y'|
             = \operatorname{E}|X - X'|\,|Y - Y'| + \operatorname{E}|X - X'|\,\operatorname{E}|Y - Y'| - 2\operatorname{E}|X - X'|\,|Y - Y''|,

where E denotes expected value, |·| denotes the Euclidean norm, and (X, Y), (X′, Y′) and (X″, Y″) are independent and identically distributed. Distance covariance can be expressed in terms of Pearson's covariance, cov, as follows: dCov²(X, Y) = cov(|X − X′|, |Y − Y′|) − 2 cov(|X − X′|, |Y − Y″|). This identity shows that the distance covariance is not the same as the covariance of distances, cov(|X − X′|, |Y − Y′|), which can be zero even if X and Y are not independent.
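
This identity can be checked directly from the definition above. Since (X′, Y′) and (X″, Y″) are identically distributed, E|Y − Y″| = E|Y − Y′|, so expanding the two Pearson covariances of distances gives

\operatorname{cov}(|X - X'|, |Y - Y'|) - 2\operatorname{cov}(|X - X'|, |Y - Y''|)
  = \left(\operatorname{E}|X - X'|\,|Y - Y'| - \operatorname{E}|X - X'|\,\operatorname{E}|Y - Y'|\right) - 2\left(\operatorname{E}|X - X'|\,|Y - Y''| - \operatorname{E}|X - X'|\,\operatorname{E}|Y - Y''|\right)
  = \operatorname{E}|X - X'|\,|Y - Y'| + \operatorname{E}|X - X'|\,\operatorname{E}|Y - Y'| - 2\operatorname{E}|X - X'|\,|Y - Y''|,

which is exactly the second displayed form of dCov²(X, Y) above.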

The sample distance covariance is defined as follows. Let (Xk, Yk), k = 1, 2, …, n be a statistical sample from a pair of real-valued or vector-valued random variables (X, Y). First, compute all pairwise distances

a_{k,l} = |X_k - X_l| \quad \text{and} \quad b_{k,l} = |Y_k - Y_l| \qquad \text{for } k, l = 1, 2, \ldots, n.

That is, compute the n × n distance matrices (a_{k,l}) and (b_{k,l}). Then form the centered distances

A_{k,l} := a_{k,l} - \bar{a}_{k \cdot} - \bar{a}_{\cdot l} + \bar{a}_{\cdot \cdot}, \qquad B_{k,l} := b_{k,l} - \bar{b}_{k \cdot} - \bar{b}_{\cdot l} + \bar{b}_{\cdot \cdot},

where the three correction terms are the k-th row mean, the l-th column mean, and the grand mean of the distance matrix of the X sample, respectively; the notation is similar for the b values. (In the matrices of centered distances (A_{k,l}) and (B_{k,l}) all row sums and all column sums equal zero.) The squared sample distance covariance is simply the arithmetic average of the products A_{k,l} B_{k,l}; that is

\operatorname{dCov}^2_n(X,Y) := \frac{1}{n^2} \sum_{k,l=1}^{n} A_{k,l}\, B_{k,l}.
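
For concreteness, here is a minimal NumPy sketch of this computation. The function name dcov2_n and the convention that a sample is a 2-D array with one observation per row are this article's illustrative choices, not part of any published API.

import numpy as np

def dcov2_n(x, y):
    # Squared sample distance covariance; x and y hold one observation per row.
    # Step 1: the n-by-n pairwise Euclidean distance matrices (a_kl) and (b_kl).
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    # Step 2: double-center each matrix: subtract row means and column means,
    # then add back the grand mean. All row and column sums of A, B are now zero.
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    # Step 3: the arithmetic average of the elementwise products A_kl * B_kl.
    return (A * B).mean()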

The statistic Tₙ = n·dCovₙ²(X,Y) determines a consistent multivariate test of independence of random vectors in arbitrary dimensions. For an implementation, see the dcov.test function in the energy package for R.[5]
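
The dcov.test function is the reference implementation; purely to illustrate the idea, the following sketch computes a permutation p-value for Tₙ, reusing the dcov2_n function from the sketch above. The permutation scheme and all names here are assumptions of this sketch, not a description of the package's internals.

import numpy as np

def independence_test(x, y, n_perm=999, seed=0):
    # Permutation p-value for T_n = n * dCov^2_n(X, Y): permuting the Y sample
    # breaks any dependence, so under independence the observed statistic
    # should not be unusually large among the permuted replicates.
    rng = np.random.default_rng(seed)
    n = len(x)
    t_obs = n * dcov2_n(x, y)
    t_perm = [n * dcov2_n(x, y[rng.permutation(n)]) for _ in range(n_perm)]
    return (1 + sum(t >= t_obs for t in t_perm)) / (1 + n_perm)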

Distance variance

The distance variance is a special case of distance covariance when the two variables are identical. The population value of distance variance is the square root of

\operatorname{dVar}^2(X) := \operatorname{E}|X - X'|^2 + \left(\operatorname{E}|X - X'|\right)^2 - 2\operatorname{E}\left[|X - X'|\,|X - X''|\right],

where E denotes the expected value, X′ is an independent and identically distributed copy of X, and X″ is independent of X and X′ and has the same distribution as X and X′.

The sample distance variance is the square root of

\operatorname{dVar}^2_n(X) := \operatorname{dCov}^2_n(X,X) = \frac{1}{n^2} \sum_{k,l} A_{k,l}^2,

which is a relative of Corrado Gini’s mean difference introduced in 1912 (but Gini did not work with centered distances).
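
Continuing the NumPy sketch above (the names are again illustrative, not a published API), the squared sample distance variance and its square root follow directly:

import numpy as np

def dvar2_n(x):
    # Squared sample distance variance: the distance covariance of x with itself.
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    return (A * A).mean()

def dvar_n(x):
    # Sample distance variance (square root of the squared statistic).
    return dvar2_n(x) ** 0.5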

Distance standard deviation

The distance standard deviation is the square root of the distance variance.

Distance correlation

The distance correlation[2][3] of two random variables is obtained by dividing their distance covariance by the product of their distance standard deviations. The distance correlation is


\operatorname{dCor}(X,Y) = \frac{\operatorname{dCov}(X,Y)}{\sqrt{\operatorname{dVar}(X)\,\operatorname{dVar}(Y)}},

and the sample distance correlation is defined by substituting the sample distance covariance and distance variances for the population coefficients above.

For easy computation of the sample distance correlation, see the dcor function in the energy package for R.[5]
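
Again purely as a sketch (reusing the dcov2_n function from the sample-distance-covariance section; the remaining names are illustrative), the sample distance correlation can be assembled from the pieces above:

import numpy as np

def dcor_n(x, y):
    # Sample distance correlation: dCov_n / sqrt(dVar_n(X) * dVar_n(Y)).
    # Since dVar_n(X) = sqrt(dcov2_n(x, x)), the squared correlation equals
    # dcov2_n(x, y) / sqrt(dcov2_n(x, x) * dcov2_n(y, y)).
    denom2 = dcov2_n(x, x) * dcov2_n(y, y)
    if denom2 == 0.0:
        return 0.0  # a constant sample has zero distance variance
    return (dcov2_n(x, y) / denom2 ** 0.5) ** 0.5

# Usage: Y = X^2 with X standard normal is dependent on X but uncorrelated
# with it, so Pearson's r is near zero while dcor_n is noticeably positive.
rng = np.random.default_rng(1)
x = rng.normal(size=(500, 1))
y = x ** 2
print(dcor_n(x, y))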

Properties

Distance correlation

(i) 0 ≤ dCorₙ(X,Y) ≤ 1 and 0 ≤ dCor(X,Y) ≤ 1.

(ii) dCor(X,Y) = 0 if and only if X and Y are independent.

(iii) dCorₙ(X,Y) = 1 implies that the dimensions of the linear subspaces spanned by the X and Y samples, respectively, are almost surely equal, and if we assume that these subspaces are equal, then in this subspace Y = a + bCX for some vector a, scalar b, and orthonormal matrix C.

Distance covariance

(i) dCov(X,Y) ≥ 0 and dCovₙ(X,Y) ≥ 0.

(ii) dCov²(a₁ + b₁C₁X, a₂ + b₂C₂Y) = |b₁b₂| dCov²(X,Y) for all constant vectors a₁, a₂, scalars b₁, b₂, and orthonormal matrices C₁, C₂.

(iii) If the random vectors (X₁, Y₁) and (X₂, Y₂) are independent, then

dCov(X₁ + X₂, Y₁ + Y₂) ≤ dCov(X₁, Y₁) + dCov(X₂, Y₂).

Equality holds if and only if X₁ and Y₁ are both constants, or X₂ and Y₂ are both constants, or X₁, X₂, Y₁, Y₂ are mutually independent.

(iv) dCov (X,Y) = 0 if and only if X and Y are independent.

This last property is the most important effect of working with centered distances.

The statistic dCovₙ²(X,Y) is a biased estimator of dCov²(X,Y) because

\operatorname{E}\left[\operatorname{dCov}^2_n(X,Y)\right] = \frac{n-1}{n^2}\left[(n-2)\operatorname{dCov}^2(X,Y) + \operatorname{E}|X - X'|\,\operatorname{E}|Y - Y'|\right].

The bias can therefore easily be corrected.[6]
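
One standard correction, introduced in later work of Székely and Rizzo on partial distance correlation, replaces the double centering above with "U-centering", which yields an unbiased estimator of dCov²(X,Y). The following NumPy sketch is this article's rendering of that idea (names illustrative); it requires n > 3.

import numpy as np

def u_center(d):
    # U-centering of an n-by-n distance matrix: like double centering, but
    # with n-2 and (n-1)(n-2) denominators and a zeroed diagonal.
    n = d.shape[0]
    u = (d
         - d.sum(axis=1, keepdims=True) / (n - 2)
         - d.sum(axis=0, keepdims=True) / (n - 2)
         + d.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(u, 0.0)
    return u

def dcov2_n_unbiased(x, y):
    # Unbiased estimator of dCov^2(X, Y) from U-centered distance matrices.
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    n = a.shape[0]
    return (u_center(a) * u_center(b)).sum() / (n * (n - 3))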

Distance variance

(i) dVar(X) = 0 if and only if X = E(X) almost surely.

(ii) dVarₙ(X) = 0 if and only if every sample observation is identical.

(iii) dVar(a + bCX) = |b| dVar(X) for all constant vectors a, scalars b, and orthonormal matrices C.

(iv) If X and Y are independent then dVar(X + Y) ≤ dVar(X) + dVar(Y).

Equality holds in (iv) if and only if one of the random variables X or Y is a constant.

Generalization

Distance covariance can be generalized to include powers of Euclidean distance. Define

\operatorname{dCov}^2(X,Y;\alpha) := \operatorname{E}|X - X'|^\alpha\,|Y - Y'|^\alpha + \operatorname{E}|X - X'|^\alpha\,\operatorname{E}|Y - Y'|^\alpha - 2\operatorname{E}|X - X'|^\alpha\,|Y - Y''|^\alpha.

Then for every 0 < α < 2, X and Y are independent if and only if dCov²(X, Y; α) = 0. This characterization does not hold for the exponent α = 2; in that case, for bivariate (X, Y), dCor(X, Y; α = 2) is a deterministic function of the Pearson correlation.[2] If a_{k,l} and b_{k,l} are the α-th powers of the corresponding distances, 0 < α ≤ 2, then the sample distance covariance of order α can be defined as the nonnegative number whose square is

\operatorname{dCov}^2_n(X,Y;\alpha) := \frac{1}{n^2} \sum_{k,l} A_{k,l}\, B_{k,l}.
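
In sample terms, the only change from the earlier NumPy sketch is raising the pairwise distances to the power α before centering (again an illustrative sketch, not a published API):

import numpy as np

def dcov2_n_alpha(x, y, alpha=1.0):
    # Squared sample distance covariance of order alpha, 0 < alpha <= 2;
    # alpha = 1 recovers the ordinary squared sample distance covariance.
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1) ** alpha
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1) ** alpha
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return (A * B).mean()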

One can extend dCov to random variables X and Y taking values in metric spaces: define a_{k,l} = K(X_k, X_l) and b_{k,l} = L(Y_k, Y_l), where K and L are squares of metrics and (strictly) negative definite continuous functions. A sketch of this generalization appears below.
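
The sample statistic only ever sees the matrices (a_{k,l}) and (b_{k,l}), so the NumPy sketch generalizes by letting the caller supply those matrices directly. Whether a given choice of K and L actually characterizes independence depends on the negative-definiteness conditions just stated; the code below makes no attempt to verify them.

import numpy as np

def dcov2_n_from_matrices(a, b):
    # a, b: precomputed n-by-n matrices a_kl = K(X_k, X_l), b_kl = L(Y_k, Y_l).
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    return (A * B).mean()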

Alternative formulation: Brownian covariance

Brownian covariance is motivated by a generalization of the notion of covariance to stochastic processes. The square of the covariance of random variables X and Y can be written in the following form:


\operatorname{cov}(X,Y)^2 = \operatorname{E}\left[
       \left(X - \operatorname{E}(X)\right)
       \left(X' - \operatorname{E}(X')\right)
       \left(Y - \operatorname{E}(Y)\right)
       \left(Y' - \operatorname{E}(Y')\right)
     \right]

where E denotes the expected value and the prime denotes independent and identically distributed copies. (The identity holds because (X, Y) and (X′, Y′) are independent of each other, so the expectation factors into E[(X − E X)(Y − E Y)] · E[(X′ − E X′)(Y′ − E Y′)] = cov(X, Y)².) We need the following generalization of this formula. If U(s), V(t) are arbitrary random processes defined for all real s and t, then define the U-centered version of X by


X_U := U(X) - \operatorname{E}_X\left[ U(X) \mid \left \{ U(t) \right \} \right]

whenever the subtracted conditional expected value exists, and denote by Y_V the V-centered version of Y.[3][7][8] The (U, V) covariance of (X, Y) is defined as the nonnegative number whose square is


\operatorname{cov}_{U,V}^2(X,Y) := \operatorname{E}\left[X_U X'_U Y_V Y'_V\right]

whenever the right-hand side is nonnegative and finite. The most important example is when U and V are two-sided independent Brownian motions (Wiener processes) with expectation zero and covariance |s| + |t| − |s − t| = 2 min(s, t) for nonnegative s, t. (This is twice the covariance of the standard Wiener process; here the factor 2 simplifies the computations.) In this case the (U, V) covariance is called Brownian covariance and is denoted by


\operatorname{cov}_W(X,Y).

There is a surprising coincidence: The Brownian covariance is the same as the distance covariance:


\operatorname{cov}_{\mathrm{W}}(X, Y) = \operatorname{dCov}(X, Y).

On the other hand, if we replace the Brownian motion with the deterministic identity function id, then cov_id(X, Y) is simply the absolute value of the classical Pearson covariance,


\operatorname{cov}_{\mathrm{id}}(X,Y) = \left\vert\operatorname{cov}(X,Y)\right\vert.
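
Indeed, id-centering gives X_id = X − E(X) and Y_id = Y − E(Y), so the factorization noted at the start of this section yields

\operatorname{cov}_{\mathrm{id}}^2(X,Y) = \operatorname{E}\left[(X - \operatorname{E}X)(X' - \operatorname{E}X')(Y - \operatorname{E}Y)(Y' - \operatorname{E}Y')\right] = \operatorname{cov}(X,Y)^2.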

Notes

  1. Pearson (1895)
  2. Székely, Rizzo and Bakirov (2007)
  3. Székely & Rizzo (2009)
  4. Székely & Rizzo (2009), Theorem 7, (3.7), p. 1249.
  5. energy package for R
  6. Székely and Rizzo (2009), Rejoinder
  7. Bickel & Xu (2009)
  8. Kosorok (2009)

References

  • Bickel, P. J. and Xu, Y. (2009). "Discussion of: Brownian distance covariance", Annals of Applied Statistics, 3 (4), 1266–1269. doi:10.1214/09-AOAS312A
  • Gini, C. (1912). Variabilità e Mutabilità. Bologna: Tipografia di Paolo Cuppini.
  • Pearson, K. (1895). "Note on regression and inheritance in the case of two parents", Proceedings of the Royal Society, 58, 240–242.
  • Pearson, K. (1920). "Notes on the history of correlation", Biometrika, 13, 25–45.
  • Székely, G. J., Rizzo, M. L. and Bakirov, N. K. (2007). "Measuring and testing independence by correlation of distances", Annals of Statistics, 35 (6), 2769–2794. doi:10.1214/009053607000000505
  • Székely, G. J. and Rizzo, M. L. (2009). "Brownian distance covariance", Annals of Applied Statistics, 3 (4), 1233–1303. doi:10.1214/09-AOAS312
  • Kosorok, M. R. (2009). "Discussion of: Brownian distance covariance", Annals of Applied Statistics, 3 (4), 1270–1278. doi:10.1214/09-AOAS312B
