
Monday, June 29, 2026

K-Means Clustering in Machine Learning using Python

 

🟢 K-Means Clustering in Machine Learning


Note: K-Means Clustering is an Unsupervised Machine Learning algorithm. It groups similar data points into clusters without using labeled data.


🟦 Program Aim

Aim:

To implement the K-Means Clustering Algorithm using Python and group customers based on their annual income.


🟩 Algorithm Used

K-Means Clustering


🟨 Problem Statement

A shopping mall wants to divide its customers into different groups based on their Annual Income.

The objective is to identify customers with similar income levels for better marketing and promotional strategies.


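🟪 Step 1: Install Required Libraries

Install scikit-learn, pandas, and matplotlib if they are not already installed.

pip install scikit-learn pandas matplotlib

Explanation

scikit-learn provides the KMeans algorithm. pandas and matplotlib are used in the later steps to store the data and plot the clusters.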


🟦 Step 2: Import Required Libraries

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Explanation

Library          Purpose
pandas           Store and manage data
matplotlib       Plot graphs
sklearn.cluster  Provides the KMeans algorithm

🟩 Step 3: Create the Dataset

data = {
    "Income": [20, 22, 25, 28, 60, 62, 65, 68]
}

df = pd.DataFrame(data)

print(df)

Explanation

The dataset contains the annual income (in thousands) of eight customers.

Customer  Income (₹ Thousands)
1         20
2         22
3         25
4         28
5         60
6         62
7         65
8         68

Notice that the data naturally forms two groups:

  • Low Income
  • High Income

🟨 Step 4: Create the K-Means Model

model = KMeans(n_clusters=2, random_state=42)

Explanation

  • KMeans() creates the clustering model.
  • n_clusters=2 means divide the data into 2 clusters.
  • random_state=42 fixes the random choice of initial centroids, so the program produces the same result every time it runs.

🟦 Step 5: Train the Model

model.fit(df)

Explanation

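fit() runs the K-Means algorithm on the Income values and finds the two cluster centers.

During training, K-Means automatically:

  • Chooses initial cluster centers (centroids)
  • Assigns each data point to the nearest centroid
  • Recalculates each centroid as the mean of the points assigned to it
  • Repeats the last two steps until the centroids no longer change (or an iteration limit is reached)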
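
After fitting, scikit-learn keeps a record of how this loop went. The following is a minimal sketch, assuming the same eight incomes as in Step 3, that reads two fitted attributes: n_iter_ (how many iterations the best run needed) and inertia_ (the sum of squared distances from each point to its nearest centroid). The explicit n_init=10 is an assumption added for this sketch and is not part of the program in this post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Same eight incomes as in Step 3, shaped as one feature column.
X = np.array([20, 22, 25, 28, 60, 62, 65, 68], dtype=float).reshape(-1, 1)

# n_init=10 is an assumption for this sketch: run 10 initializations, keep the best.
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Number of assign/update iterations used by the best run.
print("Iterations run:", model.n_iter_)

# Sum of squared distances of points to their nearest centroid.
print("Inertia:", model.inertia_)
```

A lower inertia means the points sit closer to their centroids. For this dataset the two-cluster solution has an inertia of 73.5.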

🟩 Step 6: Find Cluster Labels

df["Cluster"] = model.labels_

print(df)

Explanation

model.labels_ stores the cluster number assigned to each customer.

Example Output

Income  Cluster
20      0
22      0
25      0
28      0
60      1
62      1
65      1
68      1

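Cluster 0 → Low Income

Cluster 1 → High Income

The numbers 0 and 1 are only identifiers. K-Means does not guarantee which group gets which number, so on another run (for example, with a different random_state) the two labels may be swapped.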
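
To make the mapping from label to segment name independent of the numbering K-Means happens to pick, the clusters can be ranked by centroid value. This is only a sketch under stated assumptions: it uses the same eight incomes, sets n_init=10 explicitly, and the names order, rank, stable_labels, and segment_names are illustrative, not part of the original program.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([20, 22, 25, 28, 60, 62, 65, 68], dtype=float).reshape(-1, 1)
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# order[0] is the original id of the cluster with the lowest centroid.
order = np.argsort(model.cluster_centers_.ravel())

# rank[old_label] is the new label: 0 for the lowest centroid, 1 for the next.
rank = np.empty_like(order)
rank[order] = np.arange(len(order))

# Relabel every customer, then attach a readable name (names are illustrative).
stable_labels = rank[model.labels_]
segment_names = np.array(["Low Income", "High Income"])[stable_labels]

for income, name in zip(X.ravel(), segment_names):
    print(int(income), name)
```

With this relabeling, 0 always refers to the cluster with the smaller average income, whatever numbering the fitted model used internally.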


🟨 Step 7: Display Cluster Centers

print(model.cluster_centers_)

Explanation

Cluster centers (centroids) represent the average value of each cluster.

Example Output

[[23.75]
 [63.75]]

Meaning

Cluster 0 (Low Income) Average Income = ₹23.75 Thousand

Cluster 1 (High Income) Average Income = ₹63.75 Thousand
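
These two values can be checked by hand, because each centroid is simply the arithmetic mean of the incomes in its cluster. A small sketch using only the Python standard library, assuming the low-income and high-income groupings shown in Step 6:

```python
from statistics import mean

# Cluster membership as shown in the Step 6 output.
low_income = [20, 22, 25, 28]
high_income = [60, 62, 65, 68]

# A centroid is the mean of the points assigned to the cluster.
low_centroid = mean(low_income)
high_centroid = mean(high_income)

print(low_centroid)   # 23.75
print(high_centroid)  # 63.75
```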


🟦 Step 8: Predict the Cluster of a New Customer

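# Use the same column name as the training data to avoid a feature-name warning
new_customer = pd.DataFrame({"Income": [55]})

prediction = model.predict(new_customer)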

print("Cluster =", prediction[0])

Explanation

Suppose a new customer has an income of ₹55 Thousand.

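predict() assigns the customer to the cluster whose centroid is nearest. Since 55 is closer to 63.75 than to 23.75, the customer is placed in the high-income cluster.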

Example Output

Cluster = 1
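
The prediction can be reproduced without the model by comparing distances to the two centroids. This is a plain-Python sketch, assuming the centroid values from Step 7 and assuming that index 0 is the low-income centroid and index 1 the high-income one:

```python
# Centroids from Step 7 (assumed order: 0 = low income, 1 = high income).
centroids = [23.75, 63.75]
new_income = 55

# Distance from the new customer to each centroid (one feature, so |a - b|).
distances = [abs(new_income - c) for c in centroids]

# The predicted cluster is the index of the smallest distance.
nearest = distances.index(min(distances))

print(distances)             # [31.25, 8.75]
print("Cluster =", nearest)  # Cluster = 1
```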

🟩 Step 9: Plot the Clusters

plt.scatter(
    df["Income"],
    [1] * len(df),
    c=df["Cluster"],
    s=120
)
plt.title("K-Means Clustering")
plt.xlabel("Income")
plt.yticks([])
plt.show()

Explanation

In this graph:

  • All points are drawn at the same height because the data has only one feature (Income).
  • Different colors represent different clusters.
  • Customers in the same cluster have similar income levels.

🟪 Complete Python Program

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Create Dataset
data = {
    "Income": [20, 22, 25, 28, 60, 62, 65, 68]
}
df = pd.DataFrame(data)

print("Original Dataset")
print(df)

# Create Model
model = KMeans(
    n_clusters=2,
    random_state=42
)

# Train Model
model.fit(df)

# Display Cluster Labels
df["Cluster"] = model.labels_

print("\nClustered Dataset")
print(df)

# Display Cluster Centers
print("\nCluster Centers")
print(model.cluster_centers_)

# Predict New Customer (same column name as the training data)
new_customer = pd.DataFrame({"Income": [55]})
prediction = model.predict(new_customer)

print("\nNew Customer belongs to Cluster =", prediction[0])

# Plot Graph
plt.scatter(
    df["Income"],
    [1] * len(df),
    c=df["Cluster"],
    s=120
)
plt.title("K-Means Clustering")
plt.xlabel("Income")
plt.yticks([])
plt.show()

🟥 Sample Output

Original Dataset

Income
20
22
25
28
60
62
65
68

Clustered Dataset

Income  Cluster
20      0
22      0
25      0
28      0
60      1
62      1
65      1
68      1

Cluster Centers

[[23.75]
 [63.75]]

New Customer belongs to Cluster = 1

🟦 Step-by-Step Working of K-Means

Step 1

Choose the number of clusters (K).

Example:

K = 2

Step 2

Randomly select two centroids.

C1

C2

Step 3

Calculate the distance of every data point from each centroid.

Step 4

Assign each data point to its nearest centroid.

Step 5

Calculate new centroids.

Step 6

Repeat Steps 3–5 until the centroids no longer change.

Step 7

Final clusters are formed.
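
The seven steps above can be sketched in plain Python for one-dimensional data. This is an illustrative version written for this post's dataset, not the implementation scikit-learn uses; the function name kmeans_1d and the choice of the first two data points as starting centroids are assumptions made for the example.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Basic K-Means loop for 1-D data; `centroids` are the starting guesses."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Steps 3-4: assign each point to the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            for p in points
        ]
        # Step 5: move each centroid to the mean of its assigned points.
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(sum(members) / len(members) if members else centroids[j])
        # Step 6: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


incomes = [20, 22, 25, 28, 60, 62, 65, 68]

# Hypothetical starting centroids: the first two data points.
centers, labels = kmeans_1d(incomes, centroids=[20, 22])

print(centers)  # [23.75, 63.75]
print(labels)   # [0, 0, 0, 0, 1, 1, 1, 1]
```

Even from this poor starting point (both centroids inside the low-income group), the loop reaches the same centroids as the scikit-learn model after a few iterations.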


🟨 Workflow Diagram

Customer Dataset
↓
Choose K
↓
Select Initial Centroids
↓
Calculate Distance
↓
Assign Data Points
↓
Update Centroids
↓
Repeat Until Stable
↓
Final Clusters

🟩 Advantages

✔ Simple and easy to implement

✔ Fast for large datasets

✔ Efficient clustering algorithm

✔ Easy to understand

✔ Works well with numerical data


🟥 Limitations

❌ Number of clusters (K) must be specified in advance.

❌ Sensitive to outliers.

❌ Works best when clusters are roughly spherical and similar in size.

❌ Different initial centroids may produce different results.
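
For the first limitation, a common heuristic is the elbow method: fit the model for several values of K and look for the point where inertia (the sum of squared distances to the nearest centroid) stops dropping sharply. The sketch below assumes the same eight incomes and sets n_init=10 explicitly; the dictionary name inertias is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([20, 22, 25, 28, 60, 62, 65, 68], dtype=float).reshape(-1, 1)

# Fit one model per candidate K and record its inertia.
inertias = {}
for k in range(1, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_
    print(k, round(km.inertia_, 2))
```

For this dataset, inertia falls from 3273.5 at K = 1 to 73.5 at K = 2. Larger values of K can only remove part of the remaining 73.5, so the "elbow" is at K = 2, which matches the two income groups.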


🟦 Applications

  • 🛒 Customer Segmentation
  • 🏥 Disease Pattern Analysis
  • 📷 Image Compression
  • 🌐 Website User Grouping
  • 🎯 Recommendation Systems
  • 📊 Market Basket Analysis
  • 🛰 Satellite Image Segmentation

🟨 Viva Questions

  1. What is K-Means Clustering?
  2. Why is K-Means called an unsupervised algorithm?
  3. What is a centroid?
  4. What is the purpose of n_clusters?
  5. What is random_state?
  6. What happens during each iteration of K-Means?
  7. Name two applications of K-Means Clustering.
  8. What are the limitations of K-Means?

⭐ One-Line Revision

K-Means Clustering groups similar data points into K clusters by repeatedly assigning points to the nearest centroid and updating the centroids until stable clusters are formed.
