🟢 K-Means Clustering in Machine Learning
Note: K-Means Clustering is an Unsupervised Machine Learning algorithm. It groups similar data points into clusters without using labeled data.
🟦 Program Aim
Aim:
To implement the K-Means Clustering Algorithm using Python and group customers based on their annual income.
🟩 Algorithm Used
K-Means Clustering
🟨 Problem Statement
A shopping mall wants to divide its customers into different groups based on their Annual Income.
The objective is to identify customers with similar income levels for better marketing and promotional strategies.
🟪 Step 1: Install Required Library
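Install Scikit-Learn if it is not already installed. The program also uses pandas and matplotlib, so install them in the same command if they are missing.
pip install scikit-learn pandas matplotlib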
Explanation
Scikit-Learn provides the KMeans algorithm.
🟦 Step 2: Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
Explanation
| Library | Purpose |
|---|---|
| pandas | Store and manage data |
| matplotlib | Plot graphs |
| sklearn.cluster | Provides the KMeans algorithm |
🟩 Step 3: Create the Dataset
data = {
    "Income": [20, 22, 25, 28, 60, 62, 65, 68]
}
df = pd.DataFrame(data)
print(df)
Explanation
The dataset contains the annual income (in thousands) of eight customers.
| Customer | Income (₹ Thousands) |
|---|---|
| 1 | 20 |
| 2 | 22 |
| 3 | 25 |
| 4 | 28 |
| 5 | 60 |
| 6 | 62 |
| 7 | 65 |
| 8 | 68 |
Notice that the data naturally forms two groups:
- Low Income
- High Income
🟨 Step 4: Create the K-Means Model
model = KMeans(n_clusters=2, random_state=42)
Explanation
- KMeans() creates the clustering model.
- n_clusters=2 means divide the data into 2 clusters.
- random_state=42 ensures the same result every time the program runs.
🟦 Step 5: Train the Model
model.fit(df)
Explanation
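fit() runs the K-Means algorithm on the income values and finds the two cluster centers.
During training, K-Means automatically:
- Chooses initial cluster centers (centroids)
- Assigns each data point to the nearest centroid
- Recalculates each centroid as the mean of the points assigned to it
- Repeats the assign and recalculate steps until the centroids no longer change (or a maximum number of iterations is reached)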
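The result of this training loop can be inspected on the fitted model. The following is a minimal sketch, assuming a recent scikit-learn: n_iter_ and inertia_ are attributes of the fitted KMeans object, and n_init=10 is set explicitly here as an addition that is not part of the original program. Inertia (the sum of squared distances from each point to its own centroid) is a term this post does not otherwise introduce.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"Income": [20, 22, 25, 28, 60, 62, 65, 68]})

# n_init=10 is passed explicitly so the behaviour does not depend on
# the default of the installed scikit-learn version (an assumption).
model = KMeans(n_clusters=2, n_init=10, random_state=42)
model.fit(df)

# Number of assign/update iterations used by the best run.
print("Iterations:", model.n_iter_)

# Sum of squared distances from each point to its own centroid.
print("Inertia:", model.inertia_)
```

A lower inertia means the points sit closer to their centroids. For this dataset the two groups are well separated, so the loop should stop after only a few iterations.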
🟩 Step 6: Find Cluster Labels
df["Cluster"] = model.labels_
print(df)
Explanation
model.labels_ stores the cluster number assigned to each customer.
Example Output
| Income | Cluster |
|---|---|
| 20 | 0 |
| 22 | 0 |
| 25 | 0 |
| 28 | 0 |
| 60 | 1 |
| 62 | 1 |
| 65 | 1 |
| 68 | 1 |
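Cluster 0 → Low Income
Cluster 1 → High Income
The label numbers themselves are arbitrary. Depending on the scikit-learn version and the initialization, the low-income group may be labelled 1 instead of 0; what matters is which customers are grouped together.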
🟨 Step 7: Display Cluster Centers
print(model.cluster_centers_)
Explanation
Cluster centers (centroids) represent the average value of each cluster.
Example Output
[[23.75]
[63.75]]
Meaning
Cluster 0 Average Income = ₹23.75 Thousand
Cluster 1 Average Income = ₹63.75 Thousand
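The centroid values can be checked by hand, since each one is just the mean of the incomes in its cluster. A small sketch using only the standard library, with the labels copied from the Step 6 example output (the variable names are illustrative):

```python
from statistics import mean

incomes = [20, 22, 25, 28, 60, 62, 65, 68]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # labels from the Step 6 example output

centroids = {}
for cluster in sorted(set(labels)):
    # Collect the incomes that carry this cluster label.
    members = [x for x, lab in zip(incomes, labels) if lab == cluster]
    centroids[cluster] = mean(members)

print(centroids)  # {0: 23.75, 1: 63.75}
```

The result matches cluster_centers_: (20 + 22 + 25 + 28) / 4 = 23.75 and (60 + 62 + 65 + 68) / 4 = 63.75.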
🟦 Step 8: Predict the Cluster of a New Customer
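new_customer = pd.DataFrame({"Income": [55]})
prediction = model.predict(new_customer)
print("Cluster =", prediction[0])
Explanation
Suppose a new customer has an income of ₹55 Thousand.
predict() assigns the customer to the cluster whose centroid is nearest. Since 55 is closer to 63.75 than to 23.75, the customer is placed in the High Income cluster. The input is passed as a DataFrame with the same column name used for training, which avoids scikit-learn's feature-name warning.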
Example Output
Cluster = 1
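The prediction can be reproduced by hand as a nearest-centroid lookup. This is a sketch of the idea in plain Python, not scikit-learn's actual implementation; the centroid values are taken from Step 7, and the list index is assumed to be the cluster label.

```python
centroids = [23.75, 63.75]  # cluster centers from Step 7 (index = cluster label)
new_income = 55

# Distance from the new customer to each centroid (1-D, so absolute difference).
distances = [abs(new_income - c) for c in centroids]

# The predicted cluster is the one with the smallest distance.
predicted = distances.index(min(distances))

print("Distances:", distances)  # [31.25, 8.75]
print("Cluster =", predicted)   # 1
```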
🟩 Step 9: Plot the Clusters
plt.scatter(df["Income"],
            [1]*len(df),
            c=df["Cluster"],
            s=120)
plt.title("K-Means Clustering")
plt.xlabel("Income")
plt.yticks([])
plt.show()
Explanation
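The graph shows:
- Each customer as a point placed at their income on the x-axis (the y-value is fixed at 1 because the data has only one feature).
- Different colors for different clusters.
- Customers in the same cluster close together, since they have similar income levels.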
🟪 Complete Python Program
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Create Dataset
data = {
    "Income": [20, 22, 25, 28, 60, 62, 65, 68]
}
df = pd.DataFrame(data)
print("Original Dataset")
print(df)
# Create Model
model = KMeans(
    n_clusters=2,
    random_state=42
)
# Train Model
model.fit(df)
# Display Cluster Labels
df["Cluster"] = model.labels_
print("\nClustered Dataset")
print(df)
# Display Cluster Centers
print("\nCluster Centers")
print(model.cluster_centers_)
# Predict New Customer
prediction = model.predict(pd.DataFrame({"Income": [55]}))
print("\nNew Customer belongs to Cluster =", prediction[0])
# Plot Graph
plt.scatter(
    df["Income"],
    [1]*len(df),
    c=df["Cluster"],
    s=120
)
plt.title("K-Means Clustering")
plt.xlabel("Income")
plt.yticks([])
plt.show()
🟥 Sample Output
Original Dataset
   Income
0      20
1      22
2      25
3      28
4      60
5      62
6      65
7      68

Clustered Dataset
   Income  Cluster
0      20        0
1      22        0
2      25        0
3      28        0
4      60        1
5      62        1
6      65        1
7      68        1

Cluster Centers
[[23.75]
 [63.75]]

New Customer belongs to Cluster = 1
🟦 Step-by-Step Working of K-Means
Step 1
Choose the number of clusters (K).
Example:
K = 2
⬇
Step 2
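Select K initial centroids (two here). Basic K-Means picks them at random; scikit-learn's default is the k-means++ scheme, which spreads the starting centroids apart.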
C1
C2
⬇
Step 3
Calculate the distance of every data point from each centroid.
⬇
Step 4
Assign each data point to its nearest centroid.
⬇
Step 5
Calculate new centroids.
⬇
Step 6
Repeat Steps 3–5 until the centroids no longer change.
⬇
Step 7
Final clusters are formed.
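The steps above can be sketched in plain Python for the one-dimensional income data. This is an illustrative implementation under simplifying assumptions, not scikit-learn's code: the function name kmeans_1d is hypothetical, the starting centroids are chosen by hand rather than at random, and the distance is the absolute difference because there is only one feature.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Plain K-Means on 1-D data, starting from the given centroids."""
    history = [list(centroids)]
    for _ in range(max_iter):
        # Steps 3-4: assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Step 5: move each centroid to the mean of its points
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            sum(c) / len(c) if c else old
            for c, old in zip(clusters, centroids)
        ]
        # Step 6: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
        history.append(list(centroids))
    return centroids, clusters, history


incomes = [20, 22, 25, 28, 60, 62, 65, 68]

# Deliberately poor starting centroids, chosen by hand for illustration.
final, groups, history = kmeans_1d(incomes, [20, 22])

for step, cents in enumerate(history):
    print("After update", step, "->", cents)
print("Final centroids:", final)   # [23.75, 63.75]
print("Final clusters:", groups)   # [[20, 22, 25, 28], [60, 62, 65, 68]]
```

Even from this poor start, the centroids move from (20, 22) to roughly (20, 47.14) and then to (23.75, 63.75), after which another pass changes nothing and the loop stops.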
🟨 Workflow Diagram
Customer Dataset
│
▼
Choose K
│
▼
Select Initial Centroids
│
▼
Calculate Distance
│
▼
Assign Data Points
│
▼
Update Centroids
│
▼
Repeat Until Stable
│
▼
Final Clusters
🟩 Advantages
✔ Simple and easy to implement
✔ Fast and scales to large datasets (the work in each iteration grows linearly with the number of data points)
✔ Easy to understand
✔ Works well with numerical data
🟥 Limitations
❌ Number of clusters (K) must be specified in advance.
❌ Sensitive to outliers.
❌ Works best for spherical clusters.
❌ Different initial centroids may produce different results.
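One common way to deal with the first limitation is the elbow method: fit the model for several values of K and see where the inertia (sum of squared distances to the nearest centroid) stops dropping sharply. The sketch below is an addition to the original program, assuming scikit-learn and pandas are installed; the range of K values and n_init=10 are illustrative choices.

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"Income": [20, 22, 25, 28, 60, 62, 65, 68]})

inertias = {}
for k in range(1, 5):
    # n_init=10 restarts from several initial centroids and keeps the best
    # run, which also softens the "different initial centroids" limitation.
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(df)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print("K =", k, "inertia =", round(value, 2))
```

On this dataset the inertia falls from 3273.5 at K = 1 to 73.5 at K = 2, and further increases in K bring only small improvements, which suggests K = 2 is a reasonable choice.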
🟦 Applications
- 🛒 Customer Segmentation
- 🏥 Disease Pattern Analysis
- 📷 Image Compression
- 🌐 Website User Grouping
- 🎯 Recommendation Systems
- 📊 Market Basket Analysis
- 🛰 Satellite Image Segmentation
🟨 Viva Questions
- What is K-Means Clustering?
- Why is K-Means called an unsupervised algorithm?
- What is a centroid?
- What is the purpose of n_clusters?
- What is random_state?
- What happens during each iteration of K-Means?
- Name two applications of K-Means Clustering.
- What are the limitations of K-Means?
⭐ One-Line Revision
K-Means Clustering groups similar data points into K clusters by repeatedly assigning points to the nearest centroid and updating the centroids until stable clusters are formed.