Neighbourhood components analysis

Neighbourhood components analysis

Neighbourhood components analysis is a supervised learning method for clustering multivariate data into distinct classes according to a given distance metric over the data. Functionally, it serves the same purposes as the K-nearest neighbour algorithm, and makes direct use of a related concept termed stochastic nearest neighbours.

Contents

Definition

Neighbourhood components analysis aims at "learning" a distance metric by finding a linear transformation of input data such that the average leave-one-out (LOO) classification performance is maximized in the transformed space. The key insight to the algorithm is that a matrix A corresponding to the transformation can be found by defining a differentiable objective function for A, followed by use of an iterative solver such as conjugate gradient descent. One of the benefits of this algorithm is that the number of classes k can be determined as a function of A, up to a scalar constant. This use of the algorithm therefore addresses the issue of model selection.

Explanation

In order to define A, we define an objective function describing classification accuracy in the transformed space and try to determine A * such that this objective function is maximized.

A * = argmaxAf(A)

Leave-one-out (LOO) classification

Consider predicting the class label of a single data point by consensus of its k-nearest neighbours with a given distance metric. This is known as leave-one-out classification. However, the set of nearest-neighbours Ci can be quite different after passing all the points through a linear transformation. Specifically, the set of neighbours for a point can undergo discrete changes in response to smooth changes in the elements of A, implying that any objective function f(\cdot) based on the neighbours of a point will be piecewise-constant, and hence not differentiable.

Solution

We can resolve this difficulty by using an approach inspired by stochastic gradient descent. Rather than considering the k-nearest neighbours at each transformed point in LOO-classification, we'll consider the entire transformed data set as stochastic nearest neighbours. We define these using a softmax function of the squared Euclidean distance between a given LOO-classification point and each other point in the transformed space:

p_{ij} =
\begin{cases}
 \frac{e^{-||Ax_i - Ax_j||^2}}{\sum_k e^{-||Ax_i - Ax_k||^2}}, & \mbox{if} j \ne i \\
 0, & \mbox{if} j = i
\end{cases}

The probability of correctly classifying data point i is the probability of classifying the points of each of its neighbours Ci:

p_i = \sum_j^n p_{ij} \quad where pij is the probability of classifying neighbour j of point i.

Define the objective function using LOO classification, this time using the entire data set as stochastic nearest neighbours:

f(A) = \sum_i \sum_{j \in C_i} p_{ij} = \sum_i p_i

Note that under stochastic nearest neighbours, the consensus class for a single point i is the expected value of a point's class in the limit of an infinite number of samples drawn from the distribution over its neighbours j \in C_i i.e.: P(Class(Xi) = Class(Xj)) = pij. Thus the predicted class is an affine combination of the classes of every other point, weighted by the softmax function for each j \in C_j where Cj is now the entire transformed data set.

This choice of objective function is preferable as it is differentiable with respect to A:


\frac{\partial f}{\partial A} = - 2A \sum_i \sum_{j \in C_i} p_{ij} \left ( x_{ij} x_{ij}^T - \sum_k p_{ik} x_{ik} x_{ik}^T \right )

 
 = 2A \sum_i \left ( p_i\sum_k p_{ik}x_{ik}x_{ik}^T - \sum_{j \in C_i} p_{ij}x_{ij}x_{ij}^T \right )

Obtaining a gradient for A means that it can be found with an iterative solver such as conjugate gradient descent. Note that in practice, most of the innermost terms of the gradient evaluate to insignificant contributions due to the rapidly diminishing contribution of distant points from the point of interest. This means that the inner sum of the gradient can be truncated, resulting in reasonable computation times even for large data sets.

Alternative formulation

"Maximizing f(\cdot) is equivalent to minimizing the L1-distance between the predicted class distribution and the true class distribution (ie: where the pi induced by A are all equal to 1). A natural alternative is the KL-divergence, which induces the following objective function and gradient:" (Goldberger 2005)


g(A) = \sum_i \log \left ( \sum_{j \in C_i} p_{ij} \right ) = \sum_i \log (p_i)


\frac{\partial g}{\partial A} = 2A \sum_i \left ( \sum_k p_{ik} x_{ik} x_{ik}^T - \frac{\sum_{j \in C_i} p_{ij} x_{ij} x_{ij}^T }{\sum_{j \in C_i} p_{ij}} \right )

In practice, optimization of A using this function tends to give similar performance results as with the original.

History and background

Neighbourhood components analysis was developed by Jacob Goldberger, Sam Roweis, Ruslan Salakhudinov, and Geoff Hinton at the University of Toronto's department of computer science in 2004.

See also

References

  • J. Goldberger, S. Roweis, G. Hinton, R. Salakhutdinov. (2005) Neighbourhood Component Analysis. Advances in Neural Information Processing Systems. 17, 513-520.

External links

General

Software


Wikimedia Foundation. 2010.

Игры ⚽ Нужна курсовая?

Look at other dictionaries:

  • Cluster analysis — The result of a cluster analysis shown as the coloring of the squares into three clusters. Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more… …   Wikipedia

  • Component analysis — may refer to: Principal component analysis Kernel principal component analysis Independent component analysis Neighbourhood components analysis ANOVA simultaneous component analysis Connected Component Analysis This disambiguation pag …   Wikipedia

  • k-nearest neighbor algorithm — KNN redirects here. For other uses, see KNN (disambiguation). In pattern recognition, the k nearest neighbor algorithm (k NN) is a method for classifying objects based on closest training examples in the feature space. k NN is a type of instance… …   Wikipedia

  • NCA — is a three letter acronym, which may refer to the: Constitutional Assembly of Zimbabwe National Campaign for the Arts, a lobbying group for the arts in the United Kingdom National Capital Authority, a government authority involved in development… …   Wikipedia

  • Scale-invariant feature transform — Feature detection Output of a typical corner detection algorithm …   Wikipedia

  • Milky Way Galaxy — Large spiral galaxy (roughly 150,000 light years in diameter) that contains Earth s solar system. It includes the multitude of stars whose light is seen as the Milky Way, the irregular luminous band that encircles the sky defining the plane of… …   Universalium

  • architecture — /ahr ki tek cheuhr/, n. 1. the profession of designing buildings, open areas, communities, and other artificial constructions and environments, usually with some regard to aesthetic effect. Architecture often includes design or selection of… …   Universalium

  • education — /ej oo kay sheuhn/, n. 1. the act or process of imparting or acquiring general knowledge, developing the powers of reasoning and judgment, and generally of preparing oneself or others intellectually for mature life. 2. the act or process of… …   Universalium

  • physical science, principles of — Introduction       the procedures and concepts employed by those who study the inorganic world.        physical science, like all the natural sciences, is concerned with describing and relating to one another those experiences of the surrounding… …   Universalium

  • cosmos — /koz meuhs, mohs/, n., pl. cosmos, cosmoses for 2, 4. 1. the world or universe regarded as an orderly, harmonious system. 2. a complete, orderly, harmonious system. 3. order; harmony. 4. any composite plant of the genus Cosmos, of tropical… …   Universalium

Share the article and excerpts

Direct link
Do a right-click on the link above
and select “Copy Link”