Monday, June 29, 2026

K-Means Clustering in Machine Learning using Python

🟢 K-Means Clustering in Machine Learning

Note: K-Means Clustering is an Unsupervised Machine Learning algorithm. It groups similar data points into clusters without using labeled data.

🟦 Program Aim

Aim:

To implement the K-Means Clustering Algorithm using Python and group customers based on their annual income.

🟩 Algorithm Used

K-Means Clustering

🟨 Problem Statement

A shopping mall wants to divide its customers into different groups based on their Annual Income.

The objective is to identify customers with similar income levels for better marketing and promotional strategies.

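🟪 Step 1: Install Required Libraries

Install Scikit-Learn, pandas, and Matplotlib if they are not already installed.


pip install scikit-learn pandas matplotlib

Explanation

Scikit-Learn provides the KMeans algorithm. pandas stores the data and Matplotlib draws the plot.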

🟦 Step 2: Import Required Libraries


import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Explanation

Library	Purpose
pandas	Store and manage data
matplotlib	Plot graphs
sklearn.cluster	Provides the KMeans algorithm

🟩 Step 3: Create the Dataset


data = {
    "Income":[20,22,25,28,60,62,65,68]
}

df = pd.DataFrame(data)

print(df)

Explanation

The dataset contains the annual income (in thousands) of eight customers.

Customer	Income (₹ Thousands)
1	20
2	22
3	25
4	28
5	60
6	62
7	65
8	68

Notice that the data naturally forms two groups:

Low Income
High Income

🟨 Step 4: Create the K-Means Model


model = KMeans(n_clusters=2, random_state=42)

Explanation

KMeans() creates the clustering model.
n_clusters=2 means divide the data into 2 clusters.
random_state=42 fixes the random choice of starting centroids, so the program gives the same result every time it runs.

🟦 Step 5: Train the Model


model.fit(df)

Explanation

fit() runs the K-Means algorithm on the dataset. No labels are supplied; the model looks only at the income values.

During training, K-Means automatically:

Chooses initial cluster centers (centroids)
Assigns each data point to the nearest centroid
Recalculates each centroid as the mean of the points assigned to it
Repeats the previous two steps until the centroids no longer change
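The stopping behaviour described above can be inspected after training. The following is a minimal sketch, not part of the original program: it passes n_init, max_iter, and tol explicitly (illustrative values, chosen here because the n_init default differs between scikit-learn versions) and reads the fitted attributes n_iter_ and inertia_.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"Income": [20, 22, 25, 28, 60, 62, 65, 68]})

# max_iter and tol control when the loop stops; n_init is the number of
# restarts with different starting centroids (values here are assumptions).
model = KMeans(n_clusters=2, random_state=42, n_init=10, max_iter=300, tol=1e-4)
model.fit(df)

# Number of assign/update rounds used by the best run
print("Iterations used:", model.n_iter_)

# Sum of squared distances from each point to its centroid
print("Inertia:", model.inertia_)
```

A small n_iter_ means the centroids stopped moving after only a few rounds, which is expected for data that separates as cleanly as this.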

🟩 Step 6: Find Cluster Labels


df["Cluster"] = model.labels_

print(df)

Explanation

model.labels_ stores the cluster number assigned to each customer.

Example Output

Income	Cluster
20	0
22	0
25	0
28	0
60	1
62	1
65	1
68	1

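Cluster 0 → Low Income

Cluster 1 → High Income

The label numbers themselves are arbitrary. Another run or library version may swap 0 and 1, but the customers grouped together stay the same.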
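Because K-Means does not promise which group receives which number, it can be safer to derive the names from the centroids instead of hard-coding them. A possible sketch, where the Segment column and the names dictionary are illustrative additions rather than part of the original program:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"Income": [20, 22, 25, 28, 60, 62, 65, 68]})
model = KMeans(n_clusters=2, random_state=42, n_init=10).fit(df)

# Order the cluster ids by centroid value: the smallest centroid comes first.
order = np.argsort(model.cluster_centers_.ravel())
names = {int(order[0]): "Low Income", int(order[1]): "High Income"}

df["Cluster"] = model.labels_
df["Segment"] = df["Cluster"].map(names)  # hypothetical helper column
print(df)
```

With this mapping, the Segment column stays correct even if the numeric labels come out swapped.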

🟨 Step 7: Display Cluster Centers


print(model.cluster_centers_)

Explanation

Cluster centers (centroids) represent the average value of each cluster.

Example Output


[[23.75]
 [63.75]]

Meaning

Cluster 0 Average Income = ₹23.75 Thousand

Cluster 1 Average Income = ₹63.75 Thousand
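To check that a centroid really is the average of its cluster, the group means can be computed with pandas and compared with cluster_centers_. This is a small verification sketch that reuses the tutorial's dataset; n_init=10 is an added assumption.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"Income": [20, 22, 25, 28, 60, 62, 65, 68]})
model = KMeans(n_clusters=2, random_state=42, n_init=10).fit(df)
df["Cluster"] = model.labels_

# Average income per cluster, ordered by cluster number
group_means = df.groupby("Cluster")["Income"].mean()

print(group_means)
print(model.cluster_centers_.ravel())
```

Both printouts should list the same two values, one per cluster.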

🟦 Step 8: Predict the Cluster of a New Customer


# Use the same column name as in training to avoid a feature-name warning
prediction = model.predict(pd.DataFrame({"Income": [55]}))

print("Cluster =", prediction[0])

Explanation

Suppose a new customer has an income of ₹55 Thousand.

predict() assigns the customer to the cluster whose centroid is nearest to ₹55 Thousand.

Example Output


Cluster = 1
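The prediction is simply a nearest-centroid lookup. The sketch below repeats it by hand with NumPy, taking the two centroid values from the example output above. The index it prints matches the cluster label only if the centroids are stored in that order.

```python
import numpy as np

# Centroids taken from the example output (₹ thousands)
centroids = np.array([23.75, 63.75])
new_income = 55

# Distance from the new customer to each centroid
distances = np.abs(centroids - new_income)
nearest = int(np.argmin(distances))

print("Distances:", distances)  # 31.25 and 8.75
print("Nearest centroid index:", nearest)
```

An income of 55 is 8.75 away from 63.75 but 31.25 away from 23.75, so the customer joins the high-income group.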

🟩 Step 9: Plot the Clusters


plt.scatter(df["Income"],
            [1]*len(df),
            c=df["Cluster"],
            s=120)

plt.title("K-Means Clustering")

plt.xlabel("Income")

plt.yticks([])

plt.show()

Explanation

In this graph:

Each dot is one customer, placed along the x-axis by income.
Different colors represent different clusters.
Customers in the same cluster have similar income levels.

🟪 Complete Python Program


import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Create Dataset

data = {
    "Income":[20,22,25,28,60,62,65,68]
}

df = pd.DataFrame(data)

print("Original Dataset")

print(df)

# Create Model

model = KMeans(
    n_clusters=2,
    random_state=42
)

# Train Model

model.fit(df)

# Display Cluster Labels

df["Cluster"] = model.labels_

print("\nClustered Dataset")

print(df)

# Display Cluster Centers

print("\nCluster Centers")

print(model.cluster_centers_)

# Predict New Customer

# Use the same column name as in training to avoid a feature-name warning
prediction = model.predict(pd.DataFrame({"Income": [55]}))

print("\nNew Customer belongs to Cluster =", prediction[0])

# Plot Graph

plt.scatter(
    df["Income"],
    [1]*len(df),
    c=df["Cluster"],
    s=120
)

plt.title("K-Means Clustering")

plt.xlabel("Income")

plt.yticks([])

plt.show()

🟥 Sample Output


Original Dataset
   Income
0      20
1      22
2      25
3      28
4      60
5      62
6      65
7      68

Clustered Dataset
   Income  Cluster
0      20        0
1      22        0
2      25        0
3      28        0
4      60        1
5      62        1
6      65        1
7      68        1

Cluster Centers
[[23.75]
 [63.75]]

New Customer belongs to Cluster = 1

🟦 Step-by-Step Working of K-Means

Step 1

Choose the number of clusters (K).

Example:


K = 2

⬇

Step 2

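Select K initial centroids (here, two). The classic algorithm picks them at random; scikit-learn's default, k-means++, picks starting points that are spread apart.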


C1

C2

⬇

Step 3

Calculate the distance of every data point from each centroid.

⬇

Step 4

Assign each data point to its nearest centroid.

⬇

Step 5

Calculate new centroids.

⬇

Step 6

Repeat Steps 3–5 until the centroids no longer change.

⬇

Step 7

Final clusters are formed.
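The seven steps can be sketched in plain Python without any library. This is an illustrative implementation for one-dimensional data, not scikit-learn's code. The function name kmeans_1d and the starting centroids 20 and 22 are assumptions, chosen so the run is repeatable instead of random.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain K-Means on a list of numbers, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Steps 3-4: assign every point to the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        # Step 5: each centroid becomes the mean of its assigned points.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(sum(members) / len(members) if members else centroids[k])
        # Step 6: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


incomes = [20, 22, 25, 28, 60, 62, 65, 68]
labels, centroids = kmeans_1d(incomes, centroids=[20, 22])
print(labels)     # [0, 0, 0, 0, 1, 1, 1, 1]
print(centroids)  # [23.75, 63.75]
```

Even from this poor starting point the centroids move to 20 and about 47.14 after the first round, then to 23.75 and 63.75, where they stay.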

🟨 Workflow Diagram


Customer Dataset
        │
        ▼
Choose K
        │
        ▼
Select Initial Centroids
        │
        ▼
Calculate Distance
        │
        ▼
Assign Data Points
        │
        ▼
Update Centroids
        │
        ▼
Repeat Until Stable
        │
        ▼
Final Clusters

🟩 Advantages

✔ Simple and easy to implement

✔ Fast for large datasets: the cost of each iteration grows linearly with the number of data points

✔ Easy to understand

✔ Works well with numerical data

🟥 Limitations

❌ Number of clusters (K) must be specified in advance.

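❌ Sensitive to outliers, because every centroid is a mean and a single extreme value can pull it away.

❌ Works best for roughly spherical clusters of similar size.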

❌ Different initial centroids may produce different results.
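A common way to work around the first limitation is the elbow method: fit the model for several values of K and compare the inertia (sum of squared distances to the nearest centroid). The sketch below applies it to the income data. The range 1 to 4 and n_init=10 are illustrative choices.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"Income": [20, 22, 25, 28, 60, 62, 65, 68]})

# Fit one model per candidate K and record its inertia
inertias = {}
for k in range(1, 5):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(df)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"K={k}: inertia={value:.2f}")
```

For this dataset the inertia falls from 3273.50 at K=1 to 73.50 at K=2 and improves far less after that, which suggests K=2 is a reasonable choice.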

🟦 Applications

🛒 Customer Segmentation
🏥 Disease Pattern Analysis
📷 Image Compression
🌐 Website User Grouping
🎯 Recommendation Systems
📊 Market Basket Analysis
🛰 Satellite Image Segmentation

🟨 Viva Questions

What is K-Means Clustering?
Why is K-Means called an unsupervised algorithm?
What is a centroid?
What is the purpose of n_clusters?
What is random_state?
What happens during each iteration of K-Means?
Name two applications of K-Means Clustering.
What are the limitations of K-Means?

⭐ One-Line Revision

K-Means Clustering groups similar data points into K clusters by repeatedly assigning points to the nearest centroid and updating the centroids until stable clusters are formed.
