# K-means++: Wikis

Note: Many of our articles have direct quotes from sources you can cite, within the Wikipedia article! This article doesn't yet, but we're working on it! See more info or our list of citable articles.

# Encyclopedia

k-means++[1][2] is an algorithm for choosing the initial values for k-means clustering in statistics and machine learning. It was proposed in 2007 by David Arthur and Sergei Vassilvitskii as an approximation algorithm for the NP-hard k-means problem---a way of avoiding the sometimes poor clusterings found by the standard k-means algorithm.

## Background

The k-means problem is to find cluster centers that minimize the sum of squared distances from each data point being clustered to its cluster center (the center that is closest to it). Although finding an exact solution to the k-means problem for arbitrary input is NP-hard[3], the standard approach to finding an approximate solution (often called Lloyd's algorithm or the k-means algorithm) is used widely and frequently finds reasonable solutions quickly.

However, the k-means algorithm has at least two major theoretic shortcomings:

• First, it has been shown that the worst case running time of the algorithm is super-polynomial in the input size.[4]
• Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal clustering.

In a nutshell, k-means++ addresses the second of these obstacles by specifying a procedure to initialize the cluster centers before proceeding with the standard k-means optimization iterations. With the k-means++ initialization, the algorithm is guaranteed to find a solution that is O(log k) competitive to the optimal k-means solution.

To illustrate the potential of the k-means algorithm to perform arbitrarily poorly with respect to the objective function of minimizing the sum squared distance of points to assigned clusters, consider the example of four points in $\mathbb{R}^2$ that form an axis aligned rectangle with the width of the rectangle they form being somewhat larger than its height.

If k = 2 and the two initial cluster centers lie at the mid-points of the top and bottom line segments of the rectangle formed by the four data points, the k-means algorithm will converge without moving these cluster centers. Consequently, the two bottom data points are clustered together and the two data points forming the top of the rectangle are clustered together---a suboptimal clustering because the width of the rectangle is greater than the height of the rectangle.

Now, consider stretching the rectangle horizontally to an arbitrary width. The standard k-means algorithm will still cluster the points suboptimally, and by increasing the horizontal distance between the two data points in each cluster, we can make the algorithm do arbitrarily poorly with respect to the k-means objective function.

## Initialization Algorithm

With the intuition of spreading the k initial cluster centers away from each other, the first cluster center is chosen uniformly at random from the data points that are being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its distance squared to the point's closest cluster center.

The exact algorithm is as follows:

1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Add one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)2.
4. Repeat Steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard k-means clustering.

This seeding method gives out considerable improvements in the final error of k-means. Although the initial selection in the algorithm takes extra time, the k-means part itself converges very fast after this seeding and thus the algorithm actually lowers the computation time too. The authors tested their method with real and synthetic datasets and obtained typically 2-fold improvements in speed, and for certain datasets close to 1000-fold improvements in error. Their tests almost always showed the new method to be at least as good as vanilla k-means in both speed and error.

Additionally, the authors calculate an approximation ratio for their algorithm. The k-means++ algorithm guarantees an approximation ratio O(log(k)) where k is the number of clusters used. This is in contrast to vanilla k-means, which can generate clusterings arbitrarily worse than the optimum [5].

## Applications

The k-means++ approach has been applied since its initial proposal. In a review[6], which includes many types of clustering algorithms, the method is said to successfully overcome some of the problems associated with other ways of defining initial cluster-centres for k-means clustering. Lee et al.[7] report an application of k-means++ to create geographical cluster of photographs based the latitude and longitude information attached to the photos. An application to financial diversification is reported by Howard and Johansen.[8] Jaiswal[9] describes k-means++ as "very useful". Other support for the method and ongoing discussion is also available online.[10]

## References

1. ^ Arthur, D. and Vassilvitskii, S. (2007). "k-means++: the advantages of careful seeding". Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms. pp. 1027--1035.
2. ^ http://theory.stanford.edu/~sergei/slides/BATS-Means.pdf Slides for presentation of method by Arthur, D. and Vassilvitskii, S.
3. ^ Drineas, P. and Frieze, A. and Kannan, R. and Vempala, S. and Vinay, V. (2004). "Clustering Large Graphs via the Singular Value Decomposition". Machine Learning (Kluwer Academic Publishers Hingham, MA, USA) 56 (1-3): pp. 9--33.
4. ^ Arthur, D. and Vassilvitskii, S. (2006), How slow is the k-means method?, pp. 144--153
5. ^ T. Kanungo, D. Mount, N. Netanyahux, C. Piatko, R. Silverman, A. Wu "A Local Search Approximation Algorithm for k-Means Clustering" 2004 Computational Geometry: Theory and Applications.
6. ^ http://www.cs.ucla.edu/~shindler/shindler-kMedian-survey.pdf Approximation Algorithms for the Metric k-Median Problem
7. ^ http://sir-lab.usc.edu/publications/2008-ICWSM2LEES.pdf Discovering Relationships among Tags and Geotags, 2007
8. ^ http://www.cse.ohio-state.edu/~johansek/clustering.pdf Clustering Techniques for Financial Diversification, March 2009
9. ^ http://www1.cs.columbia.edu/~rjaiswal/jaiswal-research.pdf Research statement, April 2009