K-means algorithm

The k-means algorithm clusters n objects, based on their attributes, into k partitions, with k < n. It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data. It assumes that the object attributes form a vector space. Its objective is to minimize the total intra-cluster variance, or squared error function,

$$V = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$

where there are k clusters S_i, i = 1, 2, ..., k, and μ_i is the centroid, or mean point, of all the points x_j ∈ S_i.

K-means clustering was invented in 1956. [H. Steinhaus, "Sur la division des corps matériels en parties", Bull. Acad. Polon. Sci., Cl. III, vol. IV:801–804, 1956.] The most common form of the algorithm uses an iterative refinement heuristic known as Lloyd's algorithm. [S. Lloyd, "Least squares quantization in PCM's", Bell Telephone Laboratories Paper (1957); published much later as: S. P. Lloyd, "Least squares quantization in PCM", Special issue on quantization, IEEE Trans. Inform. Theory, 28:129–137, 1982.] Lloyd's algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic. It then calculates the mean point, or centroid, of each set, and constructs a new partition by associating each point with the closest centroid. The centroids are then recalculated for the new clusters, and the two steps are repeated in alternation until convergence, which is reached when the points no longer switch clusters (or, equivalently, when the centroids no longer change).
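
A minimal sketch of Lloyd's two alternating steps, in NumPy; the function name, the uniform random initialization, and the empty-cluster guard are illustrative choices, not part of the original formulation:

```python
import numpy as np

def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Cluster the rows of `points` (an n x d array) into k sets."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize with k distinct input points chosen uniformly at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: associate each point with the closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(new_assignment, assignment):
            break  # no point switched clusters: converged
        assignment = new_assignment
        # Update step: recompute each centroid as the mean of its cluster.
        for i in range(k):
            members = points[assignment == i]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[i] = members.mean(axis=0)
    return centroids, assignment
```

Calling `lloyd_kmeans(data, k=3)` returns the final centroids and, for each point, the index of its cluster; the later sketches in this article reuse this function.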

Lloyd's algorithm and k-means are often used synonymously, but in reality Lloyd's algorithm is a heuristic for solving the k-means problem [D. Arthur and S. Vassilvitskii, [http://www.stanford.edu/~darthur/kMeansPlusPlus.pdf "k-means++: The Advantages of Careful Seeding"], 2007 Symposium on Discrete Algorithms (SODA).], since with certain combinations of starting points and centroids Lloyd's algorithm can in fact converge to the wrong answer (i.e., a different and better solution to the minimization problem above exists).

Other variations exist [T. Kanungo, D. M. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Y. Wu, "An efficient k-means clustering algorithm: Analysis and implementation", IEEE Trans. Pattern Analysis and Machine Intelligence, 24 (2002), 881–892.], but Lloyd's algorithm has remained popular because it converges extremely quickly in practice. In fact, many have observed that the number of iterations is typically much smaller than the number of points. Recently, however, David Arthur and Sergei Vassilvitskii showed that there exist point sets on which k-means takes superpolynomial time, 2^{Ω(√n)}, to converge. [David Arthur and Sergei Vassilvitskii (2006), "How Slow is the k-means Method?", Proceedings of the 2006 Symposium on Computational Geometry (SoCG). http://www.cs.duke.edu/courses/spring07/cps296.2/papers/kMeans-socg.pdf]

Approximate k-means algorithms have been designed that make use of coresets: small subsets of the original data.
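
Published coreset constructions are more involved (they weight the sampled points, e.g. by sensitivity), but the shape of the idea can be sketched as follows; the uniform, unweighted sample here is a deliberate simplification, and `lloyd_kmeans` is the sketch from above:

```python
import numpy as np

def kmeans_on_sample(points, k, sample_size=1000, seed=0):
    """Toy stand-in for a coreset method: cluster a small uniform sample,
    then assign every original point to the resulting centroids."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    idx = rng.choice(len(points), size=min(sample_size, len(points)),
                     replace=False)
    centroids, _ = lloyd_kmeans(points[idx], k, seed=seed)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, dists.argmin(axis=1)
```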

In terms of performance, the algorithm is not guaranteed to return a global optimum. The quality of the final solution depends largely on the initial set of clusters and may, in practice, be much poorer than the global optimum. Since the algorithm is extremely fast, a common method is to run it several times and return the best clustering found, as sketched below.
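
Since each run is cheap, the restart strategy is only a few lines on top of the sketch above; the cost computed here is the squared-error objective V from the definition:

```python
import numpy as np

def kmeans_restarts(points, k, n_restarts=10):
    """Run Lloyd's algorithm several times; keep the lowest-cost clustering."""
    points = np.asarray(points, dtype=float)
    best = None
    for seed in range(n_restarts):
        centroids, assignment = lloyd_kmeans(points, k, seed=seed)
        # V: sum of squared distances of points to their own centroid.
        cost = ((points - centroids[assignment]) ** 2).sum()
        if best is None or cost < best[0]:
            best = (cost, centroids, assignment)
    return best[1], best[2]
```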

A drawback of the k-means algorithm is that the number of clusters "k" is an input parameter. An inappropriate choice of "k" may yield poor results. The algorithm also assumes that the variance is an appropriate measure of cluster scatter.

Demonstration of the algorithm

The following images demonstrate the k-means clustering algorithm in action, for the two-dimensional case. The initial centers are generated randomly to demonstrate the stages in more detail. The background space partitions are only for illustration and are not generated by the k-means algorithm.

Applications of the algorithm

Image segmentation

The k-means clustering algorithm is commonly used in computer vision as a form of image segmentation. The results of the segmentation are used to aid border detection and object recognition. In this context, the standard Euclidean distance is usually insufficient for forming the clusters. Instead, a weighted distance measure utilizing pixel coordinates, RGB pixel color and/or intensity, and image texture is commonly used. [Shapiro, Linda G. & Stockman, George C. (2001). "Computer Vision". Upper Saddle River, NJ: Prentice Hall.]
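
A sketch of the weighted feature construction described above: each pixel becomes a five-dimensional vector mixing (weighted) image coordinates and color, so Euclidean distance in feature space trades off spatial against chromatic similarity. The weights `w_xy` and `w_rgb` are illustrative tuning knobs, not values from the cited text:

```python
import numpy as np

def segment_image(rgb_image, k, w_xy=0.5, w_rgb=1.0):
    """Segment an (h, w, 3) RGB array by clustering per-pixel features."""
    h, w, _ = rgb_image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Stack weighted pixel coordinates and color channels into features.
    features = np.column_stack([
        w_xy * xs.ravel(),
        w_xy * ys.ravel(),
        w_rgb * rgb_image.reshape(-1, 3).astype(float),
    ])
    _, labels = lloyd_kmeans(features, k)
    return labels.reshape(h, w)  # per-pixel cluster id = segment label
```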

Relation to PCA

It has been shown recently (2001) [H. Zha, C. Ding, M. Gu, X. He and H. D. Simon, "Spectral Relaxation for K-means Clustering", Neural Information Processing Systems vol. 14 (NIPS 2001), pp. 1057–1064, Vancouver, Canada, Dec. 2001.] [Chris Ding and Xiaofeng He, "K-means Clustering via Principal Component Analysis", Proc. of Int'l Conf. on Machine Learning (ICML 2004), pp. 225–232, July 2004.] that the relaxed solution of k-means clustering, specified by the cluster indicators, is given by the PCA (principal component analysis) principal components, and that the PCA subspace spanned by the principal directions is identical to the cluster centroid subspace specified by the between-class scatter matrix.
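
One practical reading of this result, sketched under the assumption that the data are centered: run k-means inside the subspace spanned by the top k − 1 principal components, which the cited papers identify with the relaxed cluster-indicator space. The function name is illustrative:

```python
import numpy as np

def kmeans_in_pca_subspace(points, k):
    """Project onto the top k-1 principal components, then cluster there."""
    X = np.asarray(points, dtype=float)
    X = X - X.mean(axis=0)                   # center the data
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    reduced = X @ vt[:k - 1].T               # coordinates in the PCA subspace
    return lloyd_kmeans(reduced, k)
```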

Enhancements

In 2006 a new way of choosing the initial centers was proposed, dubbed "k-means++". The idea is to select centers that are already initially close to large quantities of points. The authors use the L^2 norm in selecting the centers, but a general L^n may be used to tune the aggressiveness of the seeding.
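
A minimal sketch of the k-means++ seeding rule: the first center is uniform at random, and each subsequent center is sampled with probability proportional to its squared distance from the nearest center already chosen. The `exponent` parameter generalizes the squared (L^2) weighting, in the spirit of the tuning remark above:

```python
import numpy as np

def kmeans_pp_seeds(points, k, exponent=2, seed=0):
    """Pick k initial centers by distance-weighted sampling (k-means++)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    centers = [points[rng.integers(len(points))]]  # first center: uniform
    while len(centers) < k:
        # Distance from every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers],
                   axis=0)
        weights = d ** exponent
        centers.append(points[rng.choice(len(points), p=weights / weights.sum())])
    return np.array(centers)
```

The returned seeds would replace the uniform initialization in the Lloyd's-algorithm sketch earlier in the article.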

This seeding method yields considerable improvements in the final error of k-means. Although the initial selection takes extra time, the k-means phase itself converges very quickly afterwards, so the seeding actually lowers the total computation time as well. The authors tested their method with real and synthetic datasets and obtained typically 2-fold to 10-fold improvements in speed, and for certain datasets close to 1000-fold improvements in error. Their tests almost always showed the new method to be at least as good as vanilla k-means in both speed and error.

Additionally, the authors prove an approximation ratio for their algorithm, something that has not been done for vanilla k-means (although it has been for several variations of it). k-means++ guarantees an approximation ratio of O(log k), where k is the number of clusters used.

Variations

The set of squared-error-minimizing cluster functions also includes the k-medoids algorithm, an approach which forces the center point of each cluster to be one of the actual data points.
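
The difference from Lloyd's update step fits in a few lines: instead of the mean, each cluster's representative becomes the member point with the smallest total squared distance to the other members. A sketch of that medoid update, under the same illustrative naming as before:

```python
import numpy as np

def medoid(cluster_points):
    """Return the member point minimizing total squared distance to the
    rest of its cluster (the update step of k-medoids)."""
    d = np.linalg.norm(
        cluster_points[:, None, :] - cluster_points[None, :, :], axis=2)
    return cluster_points[(d ** 2).sum(axis=1).argmin()]
```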

References

* J. B. MacQueen (1967): "Some Methods for Classification and Analysis of Multivariate Observations", Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1:281–297

* J. A. Hartigan (1975) "Clustering Algorithms". Wiley.

* J. A. Hartigan and M. A. Wong (1979): "A K-Means Clustering Algorithm", Applied Statistics, 28(1):100–108. doi:10.2307/2346830

External links

* [http://people.revoledu.com/kardi/tutorial/kMean/NumericalExample.htm Numerical example of K-means clustering]
* [http://www.leet.it/home/lale/clustering/ Application example which uses K-means clustering to reduce the number of colors in images]
* [http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html Interactive demo of the K-means algorithm (applet)]
* [http://www.javaworld.com/javaworld/jw-11-2006/jw-1121-thread.html An example of a multithreaded application which uses K-means in Java]
* [http://www25.brinkster.com/denshade/kmeans.php.htm K-means application in PHP]
* [http://informationandvisualization.de/blog/kmeans-and-voronoi-tesselation-built-processing Another animation of the K-means algorithm]
* [http://mloss.org/software/view/48/ A fast implementation of the K-means algorithm which uses the triangle inequality to speed up computation]
* [http://www.user.cifnet.com/~lwebzem/k_means/test3.cgi K-means clustering using Perl. Online clustering.]

See also

* Data clustering
* Linde-Buzo-Gray algorithm

