K-Means Clustering
Pseudo-code
function k_means(data, k):
    choose k random points from data as centroids
    for iter <- 1 to max_iterations:
        create k empty clusters
        for each point p in data:
            best_centroid <- null
            min_dist <- INF
            for each centroid c in centroids:
                dist <- distance(p, c)
                if dist < min_dist then
                    min_dist <- dist
                    best_centroid <- c
            assign p to cluster of best_centroid
        new_centroids <- empty list
        for each cluster:
            compute mean of all points in cluster
            add mean to new_centroids
        if centroids == new_centroids then
            break
        centroids <- new_centroids
    return clusters, centroids
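The pseudo-code above can be sketched as runnable Python. This is a minimal version, not a production implementation: the `seed` parameter and the empty-cluster fallback are additions for reproducibility and robustness, not part of the pseudo-code.

```python
import random
from math import dist

def k_means(data, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Choose k random data points as the initial centroids.
    centroids = rng.sample(data, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in data:
            best = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[best].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
                new_centroids.append(mean)
            else:
                # Keep the old centroid if its cluster went empty.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return clusters, centroids
```

Calling `k_means(points, 2)` on two well-separated groups of points returns one cluster per group, with each centroid at its group's mean.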
- x: one coordinate of a data point, such as its left-right position.
- y: another coordinate, such as its up-down position.
- k: the number of groups you want to split the data into.
- centroid: the center of a cluster, i.e. the average position of all points in that group.
- distance: how far a point is from a centroid, usually the straight-line (Euclidean) distance.
- 2: the number of dimensions (x and y) in this example; real data can have more.
- max_iterations: a cap on how long the algorithm runs, to avoid infinite loops.
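The distance in the glossary is usually the Euclidean (straight-line) distance; in Python it can be computed by hand or with the standard library's `math.dist`. The point and centroid below are made-up values for illustration:

```python
from math import sqrt, dist

point = (3.0, 4.0)       # a data point (x, y)
centroid = (0.0, 0.0)    # a cluster center

# Straight-line distance: sqrt of the sum of squared coordinate differences.
manual = sqrt((point[0] - centroid[0]) ** 2 + (point[1] - centroid[1]) ** 2)
print(manual)                  # 5.0
print(dist(point, centroid))   # 5.0, same result via the standard library
```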
Recipe Steps
Imagine you have many dots on a map and you want to group them into k groups.
1. Place k random centers on the map. These are like temporary group leaders.
2. For each point, find the closest center and join that group, like choosing the nearest shop.
3. Move each center to the average position of its group, like moving the shop closer to its customers.
4. Repeat assigning and moving until nothing changes anymore.
5. At the end, points that are close together form clusters.
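One round of the recipe can be traced by hand. The six points and two starting centers below are made-up values chosen so the grouping is easy to see:

```python
from math import dist

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers = [(0, 0), (10, 10)]  # hypothetical starting centers

# Step: each point joins its nearest center.
groups = {0: [], 1: []}
for p in points:
    nearest = min((0, 1), key=lambda i: dist(p, centers[i]))
    groups[nearest].append(p)

# Step: move each center to the average position of its group.
new_centers = [
    tuple(sum(coord) / len(g) for coord in zip(*g))
    for g in (groups[0], groups[1])
]
print(new_centers)  # each center drifts toward the middle of its group
```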
Questions & Answers
- Why do centroids move? They are recomputed after each assignment step to better fit the data.
- When does the algorithm stop? When the centroids no longer change, or when max_iterations is reached.
- Do starting points matter? Yes, different random initial centroids can lead to different results.
- Does the choice of k matter? Yes, choosing the right number of clusters is important.
- When does k-means work best? When clusters are roughly circular and similar in size.
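One way to see why the number of clusters matters is to compare the within-cluster spread (often called inertia) for different k: it drops sharply once k matches the data's real structure. The points below are made up, and the centroids are fixed by hand so the comparison is deterministic:

```python
from math import dist

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

def inertia(points, centroids):
    # Sum of squared distances from each point to its nearest centroid.
    return sum(min(dist(p, c) ** 2 for c in centroids) for p in points)

one_center = [(16 / 3, 16 / 3)]                    # mean of all six points
two_centers = [(1 / 3, 1 / 3), (31 / 3, 31 / 3)]   # mean of each group
print(inertia(points, one_center) > inertia(points, two_centers))  # True
```

With k matching the two real groups, the inertia is far smaller, which is the intuition behind the "elbow" heuristic for choosing k.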
Related Algorithms
Decision Tree (ID3 / CART)
A Decision Tree builds a flowchart-like structure that splits data step by step using features, until it can make a clear prediction.
Gradient Descent
Gradient Descent is a way for machines to learn from data by adjusting their predictions to get closer to the correct answer. It's like walking down a staircase in small steps until you reach the bottom, which is the optimal solution.
Longest Common Subsequence (LCS)
The Longest Common Subsequence (LCS) finds the longest sequence of characters that appear in the same order in both sequences (not necessarily next to each other).