k-nearest neighbor algorithm

"KNN" redirects here. For other uses, see KNN (disambiguation).
In pattern recognition, the k-nearest neighbor algorithm (kNN) is a method for classifying objects based on closest training examples in the feature space. kNN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of its nearest neighbor.
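The majority-vote rule can be written as a short sketch. The helper name `knn_classify` is hypothetical, and Euclidean distance is assumed as the metric:

```python
from collections import Counter
import math

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; Euclidean distance
    is used here, though any metric could be substituted.
    """
    # Sort training points by distance to the query point.
    by_dist = sorted(train, key=lambda p: math.dist(p[0], query))
    # Majority vote among the k closest labels.
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
         ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
print(knn_classify(train, (1.1, 0.9), k=3))  # -> "A"
```

With k = 1 the function reduces to the nearest neighbor rule: the query simply takes the label of its single closest training point.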
The same method can be used for regression, by simply assigning the property value for the object to be the average of the values of its k nearest neighbors. It can be useful to weight the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. (A common weighting scheme is to give each neighbor a weight of 1/d, where d is the distance to the neighbor. This scheme is a generalization of linear interpolation.)
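The 1/d weighting scheme for regression can be sketched as follows (the helper name `knn_regress` is hypothetical; Euclidean distance is assumed):

```python
import math

def knn_regress(train, query, k=3):
    """Predict a continuous value as the 1/d-weighted average of the
    values of the k nearest neighbours."""
    by_dist = sorted((math.dist(x, query), y) for x, y in train)
    nearest = by_dist[:k]
    # An exact match (d == 0) would make 1/d blow up; return its value directly.
    for d, y in nearest:
        if d == 0.0:
            return y
    weights = [1.0 / d for d, _ in nearest]
    values = [y for _, y in nearest]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)

train = [((0.0,), 0.0), ((1.0,), 1.0), ((4.0,), 4.0)]
print(knn_regress(train, (0.5,), k=2))  # equidistant neighbours -> 0.5
```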
The neighbors are taken from a set of objects for which the correct classification (or, in the case of regression, the value of the property) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required. The k-nearest neighbor algorithm is sensitive to the local structure of the data.^{[citation needed]}
Nearest neighbor rules in effect compute the decision boundary in an implicit manner. It is also possible to compute the decision boundary itself explicitly, and to do so in an efficient manner so that the computational complexity is a function of the boundary complexity.^{[1]}
Algorithm
The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.
In the classification phase, k is a user-defined constant, and an unlabelled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point.
Usually Euclidean distance is used as the distance metric; however, this is only applicable to continuous variables. In cases such as text classification, another metric such as the overlap metric (or Hamming distance) can be used. Often, the classification accuracy of kNN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood components analysis.
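For discrete features such as fixed-length strings, the overlap (Hamming) distance simply counts the positions at which two sequences differ. A minimal sketch:

```python
def hamming(a, b):
    """Overlap/Hamming distance: the number of positions at which
    two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming("karolin", "kathrin"))  # -> 3
```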
A drawback of the basic "majority voting" classification is that classes with more frequent examples tend to dominate the prediction of a new vector: because they are numerous, they tend to turn up among the k nearest neighbors.^{[2]} One way to overcome this problem is to weight the votes, taking into account the distance from the test point to each of its k nearest neighbors.
kNN is a special case of a variable-bandwidth, kernel density "balloon" estimator with a uniform kernel.^{[3]}^{[4]}
Parameter selection
The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification^{[citation needed]}, but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques, for example, cross-validation. The special case where the class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor algorithm.
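One such heuristic can be sketched as leave-one-out cross-validation over a few candidate values of k. The helper names below are hypothetical, and Euclidean distance is assumed:

```python
from collections import Counter
import math

def knn_classify(train, query, k):
    """Majority vote among the k nearest training points (Euclidean metric)."""
    by_dist = sorted(train, key=lambda p: math.dist(p[0], query))
    return Counter(label for _, label in by_dist[:k]).most_common(1)[0][0]

def best_k(data, candidates):
    """Pick the candidate k with the highest leave-one-out accuracy."""
    def loo_accuracy(k):
        hits = sum(knn_classify(data[:i] + data[i + 1:], x, k) == y
                   for i, (x, y) in enumerate(data))
        return hits / len(data)
    return max(candidates, key=loo_accuracy)

data = [((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 0.0), "A"),
        ((0.4, 0.4), "B"),  # a noisy point inside the "A" cluster
        ((5.0, 5.0), "B"), ((5.0, 6.0), "B"), ((6.0, 5.0), "B")]
print(best_k(data, [1, 3]))  # -> 3: k = 3 outvotes the noisy point
```

This illustrates the trade-off described above: k = 1 lets the single noisy point misclassify its whole neighborhood, while k = 3 smooths it away.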
The accuracy of the kNN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling.^{[5]} Another popular approach is to scale features by the mutual information of the training data with the training classes.^{[citation needed]}
In binary (two-class) classification problems, it is helpful to choose k to be an odd number, as this avoids tied votes. One popular way of choosing the empirically optimal k in this setting is via the bootstrap method.^{[6]}
Properties
The naive version of the algorithm is easy to implement by computing the distances from the test sample to all stored vectors, but it is computationally intensive, especially when the size of the training set grows. Many nearest neighbor search algorithms have been proposed over the years; these generally seek to reduce the number of distance evaluations actually performed. Using an appropriate nearest neighbor search algorithm makes kNN computationally tractable even for large data sets.
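A common way to reduce the number of distance evaluations is a space-partitioning structure such as a k-d tree. A minimal pure-Python sketch of a 1-nearest-neighbor query (production libraries are far more sophisticated):

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree, splitting on one coordinate per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(tree, query, depth=0, best=None):
    """Return the stored point closest to `query`, pruning subtrees that
    cannot contain a point closer than the current best."""
    if tree is None:
        return best
    point, left, right = tree
    if best is None or math.dist(point, query) < math.dist(best, query):
        best = point
    axis = depth % len(query)
    near, far = (left, right) if query[axis] < point[axis] else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Only search the far side if the splitting plane is closer than `best`.
    if abs(query[axis] - point[axis]) < math.dist(best, query):
        best = nearest(far, query, depth + 1, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
print(nearest(build_kdtree(pts), (9, 2)))  # -> (8, 1)
```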
The nearest neighbor algorithm has some strong consistency results. As the amount of data approaches infinity, the algorithm is guaranteed to yield an error rate no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data).^{[7]} k-nearest neighbor is guaranteed to approach the Bayes error rate, for some value of k (where k increases as a function of the number of data points). Various improvements to k-nearest neighbor methods are possible by using proximity graphs.^{[8]}
For estimating continuous variables
The kNN algorithm can also be adapted for use in estimating continuous variables. One such implementation uses an inverse distance weighted average of the k nearest multivariate neighbors. This algorithm functions as follows:
 Compute the Euclidean or Mahalanobis distance from the target plot to the sampled plots.
 Order the samples by increasing distance.
 Choose a heuristically optimal number k of nearest neighbors based on the RMSE obtained by cross-validation.
 Calculate an inverse-distance-weighted average over the k nearest multivariate neighbors.
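The Mahalanobis distance mentioned in the first step accounts for the scale of, and correlations between, features. A small sketch, assuming the inverse covariance matrix is supplied precomputed as a nested list:

```python
import math

def mahalanobis(x, y, inv_cov):
    """sqrt((x - y)^T S^{-1} (x - y)) for a given inverse covariance matrix.
    With the identity matrix this reduces to the Euclidean distance."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return math.sqrt(sum(d[i] * inv_cov[i][j] * d[j]
                         for i in range(n) for j in range(n)))

print(mahalanobis((0.0, 0.0), (3.0, 4.0), [[1.0, 0.0], [0.0, 1.0]]))  # -> 5.0
```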
Using a weighted kNN also significantly improves the results: the class (or value, in regression problems) of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance between that point and the point for which the class is to be predicted.
See also
 Nearest neighbor search
 Cluster analysis
 Classification (machine learning)
 Data mining
 Machine learning
 Pattern recognition
 Predictive analytics
 Dimension reduction
References
 ^ Bremner D, Demaine E, Erickson J, Iacono J, Langerman S, Morin P, Toussaint G (2005). "Output-sensitive algorithms for computing nearest-neighbor decision boundaries". Discrete and Computational Geometry 33 (4): 593–604. doi:10.1007/s00454-004-1152-0.
 ^ D. Coomans; D.L. Massart (1982). "Alternative k-nearest neighbour rules in supervised pattern recognition: Part 1. k-Nearest neighbour classification by using alternative voting rules". Analytica Chimica Acta 136: 15–27. doi:10.1016/S0003-2670(01)95359-0.
 ^ D. G. Terrell; D. W. Scott (1992). "Variable kernel density estimation". Annals of Statistics 20 (3): 1236–1265. doi:10.1214/aos/1176348768.
 ^ Mills, Peter. "Efficient statistical classification of satellite measurements". International Journal of Remote Sensing.
 ^ Nigsch F, Bender A, van Buuren B, Tissen J, Nigsch E, Mitchell JB (2006). "Melting point prediction employing k-nearest neighbor algorithms and genetic parameter optimization". Journal of Chemical Information and Modeling 46 (6): 2412–2422. doi:10.1021/ci060149f. PMID 17125183.
 ^ Hall P, Park BU, Samworth RJ (2008). "Choice of neighbor order in nearest-neighbor classification". Annals of Statistics 36 (5): 2135–2152. doi:10.1214/07-AOS537.
 ^ Cover TM, Hart PE (1967). "Nearest neighbor pattern classification". IEEE Transactions on Information Theory 13 (1): 21–27. doi:10.1109/TIT.1967.1053964.
 ^ Toussaint GT (April 2005). "Geometric proximity graphs for improving nearest neighbor methods in instance-based learning and data mining". International Journal of Computational Geometry and Applications 15 (2): 101–150. doi:10.1142/S0218195905001622.
Further reading
 When Is "Nearest Neighbor" Meaningful?
 Belur V. Dasarathy, ed (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. ISBN 0818689307.
 Shakhnarovish, Darrell, and Indyk, ed (2005). NearestNeighbor Methods in Learning and Vision. MIT Press. ISBN 026219547X.
 Mäkelä H, Pekkarinen A (2004-07-26). "Estimation of forest stand volumes by Landsat TM imagery and stand-level field-inventory data". Forest Ecology and Management 196 (2–3): 245–255. doi:10.1016/j.foreco.2004.02.049.
 Fast k nearest neighbor search using GPU. In Proceedings of the CVPR Workshop on Computer Vision on GPU, Anchorage, Alaska, USA, June 2008. V. Garcia and E. Debreuve and M. Barlaud.
External links
 k-nearest neighbor algorithm in C++ and Boost by Antonio Gulli
 k-nearest neighbor algorithm in Java (Applet) (includes source code) by Leonardo Nascimento Ferreira
 k-nearest neighbor algorithm in Visual Basic and Java (includes executable and source code)
 kNN and Potential energy (Applet), University of Leicester
 k-nearest neighbor tutorial using MS Excel
 STANN: A simple threaded approximate nearest neighbor C++ library that can compute Euclidean k-nearest neighbor graphs in parallel
 TiMBL: a fast C++ implementation of kNN with feature and distance weighting, specifically suited to symbolic feature spaces
 libAGF: A library for multivariate, adaptive kernel estimation, including KNN and Gaussian kernels
 OBSearch: A library for similarity search in metric spaces created during Google Summer of Code 2007
 ANN: A Library for Approximate Nearest Neighbor Searching
 Into contains open source implementations of exact and approximate kNN algorithms in C++