K-Means Clustering — Grouping Data Without Labels
A scroll-driven visual deep dive into K-Means clustering. Learn the iterative algorithm, choosing K with the elbow method, limitations, and when to use alternatives.
Unsupervised Learning
No labels? No problem.
Let the data group itself.
K-Means finds structure in unlabeled data by partitioning points into K clusters, each defined by its center. It’s the simplest, fastest, and most widely used clustering algorithm — and understanding its strengths and limits is essential.
Two Steps, Repeated
K-Means is guaranteed to converge, but is it guaranteed to find the GLOBAL optimum?
💡 Think about what happens if two initial centroids start in the same cluster...
K-Means decreases the objective (within-cluster sum of squares) at every step, so it ALWAYS converges. But it converges to a LOCAL minimum that depends entirely on where the initial centroids land. Bad initialization → bad clusters. The solution: K-Means++ initializes centroids far apart from each other, dramatically improving results. In practice, scikit-learn uses K-Means++ by default and can rerun the whole algorithm with several different initializations (the n_init parameter), keeping the run with the lowest inertia.
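A minimal sketch of what this looks like with scikit-learn's KMeans, assuming some synthetic blob data from make_blobs (the dataset and parameter values here are illustrative, not taken from the visual above):

```python
# Sketch: K-Means++ initialization with multiple restarts in scikit-learn.
# The blob data and parameter values are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1500, centers=4, cluster_std=1.0, random_state=42)

# init="k-means++" spreads the initial centroids apart;
# n_init=10 reruns the whole algorithm 10 times and keeps the lowest-inertia result.
km = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)

print("Best inertia (WCSS):", km.inertia_)
print("Centroid locations:\n", km.cluster_centers_)
```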
What K-Means Optimizes
Within-Cluster Sum of Squares (WCSS / Inertia)
J = Σₖ Σ_{x∈Cₖ} ||x − μₖ||²
Assignment step: cᵢ = argminₖ ||xᵢ − μₖ||²
Update step: μₖ = (1/|Cₖ|) Σ_{x∈Cₖ} x
Each step decreases J → guaranteed convergence.
K-Means uses Euclidean distance (L2). This means it finds clusters that are:
💡 Think about what region of space is 'closest' to a single point using Euclidean distance...
Because K-Means assigns points to the NEAREST centroid using Euclidean distance, cluster boundaries are Voronoi cells — convex polygons around each centroid. This means K-Means can only find roughly spherical clusters. It fails on crescent shapes, rings, or elongated clusters. For non-convex clusters, use DBSCAN or spectral clustering.
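To make the assignment and update steps concrete, here is a bare-bones NumPy sketch of the loop (Lloyd's algorithm). The function name, the naive random initialization, and the empty-cluster handling are illustrative assumptions; printing J each pass shows it never increasing.

```python
# Bare-bones sketch of the K-Means loop (Lloyd's algorithm) in NumPy.
# Naive random init (not K-Means++); names and structure are illustrative.
import numpy as np

def kmeans_lloyd(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: cᵢ = argminₖ ||xᵢ − μₖ||²
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: μₖ = mean of the points assigned to cluster k
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster goes empty
                centroids[j] = members.mean(axis=0)
        # J (WCSS) never increases from one iteration to the next
        J = ((X - centroids[labels]) ** 2).sum()
        print(f"WCSS: {J:.2f}")
    return labels, centroids
```

Run on the blob data above, the printed WCSS drops quickly and then flattens once the assignments stop changing.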
How Many Clusters?
The Silhouette Score for a point is defined as (b − a) / max(a, b), where a = average distance to the other points in its own cluster and b = average distance to the points in the nearest other cluster. A silhouette score of 0 means:
💡 If a = b, what is (b − a) / max(a, b)?
When silhouette = 0, a = b: the point is equidistant from its own cluster and the nearest other cluster. It's on the decision boundary. Silhouette = +1 means perfectly clustered (a ≪ b), and silhouette = −1 means probably misassigned (a ≫ b — closer to another cluster). Averaging silhouette scores across all points gives an overall quality metric for the clustering.
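Both criteria are easy to compute with scikit-learn. A small sketch, again assuming synthetic blob data, that sweeps K and reports the inertia (for an elbow plot) and the mean silhouette score:

```python
# Sketch: sweep K, record inertia (elbow method) and mean silhouette score.
# The dataset and the range of K values are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1000, centers=5, random_state=7)

for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    sil = silhouette_score(X, km.labels_)  # mean of (b − a) / max(a, b) over all points
    print(f"K={k}  inertia={km.inertia_:10.1f}  silhouette={sil:.3f}")

# Elbow: pick the K where inertia stops dropping sharply.
# Silhouette: pick the K with the highest mean score.
```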
When K-Means Fails
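To see the failure concretely, the sketch below runs K-Means on the classic two-moons dataset, whose true clusters are crescent-shaped and non-convex, and compares it with DBSCAN. The dataset and the DBSCAN parameters (eps, min_samples) are illustrative assumptions:

```python
# Sketch: K-Means vs. DBSCAN on non-convex, crescent-shaped clusters.
# Dataset and DBSCAN parameters are illustrative assumptions.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y_true = make_moons(n_samples=600, noise=0.06, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Adjusted Rand Index: 1.0 = matches the true grouping exactly, ~0 = random.
print("K-Means ARI:", adjusted_rand_score(y_true, km_labels))  # typically well below 1
print("DBSCAN  ARI:", adjusted_rand_score(y_true, db_labels))  # typically close to 1
```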
K-Means in the Real World
K-Means time complexity is O(n·K·d·i) where n=points, K=clusters, d=dimensions, i=iterations. For 1M points, 10 clusters, 100 dimensions, and 50 iterations, approximately how many distance computations?
💡 Multiply n × K × i...
n × K × i = 1,000,000 × 10 × 50 = 500,000,000 distance computations. Each distance computation involves d=100 dimensions. K-Means' linear scaling in n is what makes it practical for large datasets — unlike hierarchical clustering (O(n²) or O(n³)). Mini-batch K-Means speeds this up further by using random subsets of data per iteration.
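A short sketch of the mini-batch variant in scikit-learn; the (smaller) stand-in dataset and the batch_size are illustrative assumptions:

```python
# Sketch: MiniBatchKMeans fits on random subsets (mini-batches) per iteration,
# trading a little accuracy for much lower cost on large datasets.
# Dataset size and batch_size are illustrative assumptions.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 100))  # a smaller stand-in for the 1M-point scenario above

mbk = MiniBatchKMeans(n_clusters=10, batch_size=10_000, n_init=3, random_state=0)
mbk.fit(X)

print("Inertia:", mbk.inertia_)
```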
🎓 What You Now Know
✓ K-Means alternates assign + update — Assign points to nearest centroid, move centroids to cluster mean.
✓ Minimizes WCSS (inertia) — Converges to a local optimum. Use K-Means++ for smart initialization.
✓ Choose K with elbow or silhouette — Or let domain knowledge decide.
✓ Assumes spherical clusters — Fails on non-convex shapes, varying sizes/densities.
✓ Fast and scalable — O(nKdi), linear in data size. Mini-batch variant for very large datasets.
K-Means is the “Hello World” of unsupervised learning — simple, fast, and surprisingly effective. But knowing its limitations is what separates a beginner from a practitioner. When K-Means fails, DBSCAN and PCA are your next moves. 🎯
↗ Keep Learning
DBSCAN — Finding Clusters of Any Shape
A scroll-driven visual deep dive into DBSCAN. Learn density-based clustering, core/border/noise points, choosing epsilon and minPts, and when DBSCAN beats K-Means.
PCA — Compressing Reality Without Losing the Plot
A scroll-driven visual deep dive into Principal Component Analysis. Learn eigenvectors, variance maximization, dimensionality reduction, and when PCA transforms your data — and when it doesn't.