Bijan Krishna Paul
VILL- DEBANDI, AMTA, HOWRAH PIN-711410 Phone-9836357266
Tuesday, June 30, 2026
Monday, June 29, 2026
K-Means Clustering in Machine Learning using Python
🟢 K-Means Clustering in Machine Learning
Note: K-Means Clustering is an Unsupervised Machine Learning algorithm. It groups similar data points into clusters without using labeled data.
🟦 Program Aim
Aim:
To implement the K-Means Clustering Algorithm using Python and group customers based on their annual income.
🟩 Algorithm Used
K-Means Clustering
🟨 Problem Statement
A shopping mall wants to divide its customers into different groups based on their Annual Income.
The objective is to identify customers with similar income levels for better marketing and promotional strategies.
🟪 Step 1: Install Required Library
Install Scikit-Learn if it is not already installed.
pip install scikit-learn
Explanation
Scikit-Learn provides the KMeans algorithm.
🟦 Step 2: Import Required Libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
Explanation
| Library | Purpose |
|---|---|
| pandas | Store and manage data |
| matplotlib | Plot graphs |
| sklearn.cluster | Provides the KMeans algorithm |
🟩 Step 3: Create the Dataset
data = {
    "Income": [20, 22, 25, 28, 60, 62, 65, 68]
}
df = pd.DataFrame(data)
print(df)
Explanation
The dataset contains the annual income (in thousands) of eight customers.
| Customer | Income (₹ Thousands) |
|---|---|
| 1 | 20 |
| 2 | 22 |
| 3 | 25 |
| 4 | 28 |
| 5 | 60 |
| 6 | 62 |
| 7 | 65 |
| 8 | 68 |
Notice that the data naturally forms two groups:
- Low Income
- High Income
🟨 Step 4: Create the K-Means Model
model = KMeans(n_clusters=2, random_state=42)
Explanation
- KMeans() creates the clustering model.
- n_clusters=2 means divide the data into 2 clusters.
- random_state=42 ensures the same result every time the program runs.
🟦 Step 5: Train the Model
model.fit(df)
Explanation
The algorithm learns the patterns in the dataset.
During training, K-Means automatically:
- Chooses cluster centers (centroids)
- Assigns each data point to the nearest centroid
- Recalculates the centroids
- Repeats until the centroids no longer change
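The four steps above can be sketched in plain Python for one-dimensional data. This is only an illustration, not scikit-learn's implementation: the function name kmeans_1d and the hand-picked starting centroids (20 and 22) are assumptions made for this example, whereas scikit-learn chooses its starting centroids with the k-means++ method by default.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Illustrative 1-D K-Means; returns (final centroids, labels)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assign each point to the index of its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        # Recalculate each centroid as the mean of its assigned points
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # An empty cluster keeps its old centroid
            new_centroids.append(sum(members) / len(members) if members else old)
        # Stop when the centroids no longer change
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


incomes = [20, 22, 25, 28, 60, 62, 65, 68]
centers, labels = kmeans_1d(incomes, centroids=[20, 22])
print(centers)  # [23.75, 63.75]
print(labels)   # [0, 0, 0, 0, 1, 1, 1, 1]
```

Even with a poor starting guess, the loop settles on the low-income and high-income groups after three passes over this small dataset.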
🟩 Step 6: Find Cluster Labels
df["Cluster"] = model.labels_
print(df)
Explanation
model.labels_ stores the cluster number assigned to each customer.
Example Output
| Income | Cluster |
|---|---|
| 20 | 0 |
| 22 | 0 |
| 25 | 0 |
| 28 | 0 |
| 60 | 1 |
| 62 | 1 |
| 65 | 1 |
| 68 | 1 |
Cluster 0 → Low Income
Cluster 1 → High Income
The cluster numbers are arbitrary labels, so a different run or random_state may swap 0 and 1.
🟨 Step 7: Display Cluster Centers
print(model.cluster_centers_)
Explanation
Cluster centers (centroids) represent the average value of each cluster.
Example Output
[[23.75]
[63.75]]
Meaning
Cluster 0 Average Income = ₹23.75 Thousand
Cluster 1 Average Income = ₹63.75 Thousand
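As a quick check, each centroid should equal the plain average of the incomes in its cluster. The snippet below recomputes the averages by hand; the two lists are copied from the example output above, so they are an assumption about how the points were grouped.

```python
low_income = [20, 22, 25, 28]    # incomes grouped into the low-income cluster
high_income = [60, 62, 65, 68]   # incomes grouped into the high-income cluster

# A centroid is the arithmetic mean of the points in its cluster
low_center = sum(low_income) / len(low_income)
high_center = sum(high_income) / len(high_income)
print(low_center, high_center)  # 23.75 63.75
```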
🟦 Step 8: Predict the Cluster of a New Customer
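# Pass the new value as a DataFrame with the training column name to avoid a feature-name warning
prediction = model.predict(pd.DataFrame({"Income": [55]}))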
print("Cluster =", prediction[0])
Explanation
Suppose a new customer has an income of ₹55 Thousand.
The algorithm predicts the cluster to which the customer belongs.
Example Output
Cluster = 1
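Prediction for a new point amounts to picking the nearest centroid. A minimal sketch, assuming the centroids 23.75 and 63.75 from Step 7 are stored in that order (the helper name nearest_centroid is made up for this example):

```python
def nearest_centroid(value, centroids):
    """Return the index of the centroid closest to value."""
    distances = [abs(value - c) for c in centroids]
    return distances.index(min(distances))


centroids = [23.75, 63.75]
cluster = nearest_centroid(55, centroids)
print("Cluster =", cluster)  # Cluster = 1
```

Here |55 − 23.75| = 31.25 and |55 − 63.75| = 8.75, so the new customer falls in the high-income cluster.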
🟩 Step 9: Plot the Clusters
plt.scatter(df["Income"],
            [1] * len(df),
            c=df["Cluster"],
            s=120)
plt.title("K-Means Clustering")
plt.xlabel("Income")
plt.yticks([])
plt.show()
Explanation
In this graph:
- Different colors represent different clusters.
- Customers in the same cluster have similar income levels.
🟪 Complete Python Program
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Create Dataset
data = {
    "Income": [20, 22, 25, 28, 60, 62, 65, 68]
}
df = pd.DataFrame(data)
print("Original Dataset")
print(df)
# Create Model
model = KMeans(
    n_clusters=2,
    random_state=42
)
# Train Model
model.fit(df)
# Display Cluster Labels
df["Cluster"] = model.labels_
print("\nClustered Dataset")
print(df)
# Display Cluster Centers
print("\nCluster Centers")
print(model.cluster_centers_)
# Predict New Customer (the DataFrame keeps the training column name)
prediction = model.predict(pd.DataFrame({"Income": [55]}))
print("\nNew Customer belongs to Cluster =", prediction[0])
# Plot Graph
plt.scatter(
    df["Income"],
    [1] * len(df),
    c=df["Cluster"],
    s=120
)
plt.title("K-Means Clustering")
plt.xlabel("Income")
plt.yticks([])
plt.show()
🟥 Sample Output
Original Dataset
Income
20
22
25
28
60
62
65
68
Clustered Dataset
Income Cluster
20 0
22 0
25 0
28 0
60 1
62 1
65 1
68 1
Cluster Centers
[[23.75]
[63.75]]
New Customer belongs to Cluster = 1
🟦 Step-by-Step Working of K-Means
Step 1
Choose the number of clusters (K).
Example:
K = 2
⬇
Step 2
Randomly select two centroids.
C1
C2
⬇
Step 3
Calculate the distance of every data point from each centroid.
⬇
Step 4
Assign each data point to its nearest centroid.
⬇
Step 5
Calculate new centroids.
⬇
Step 6
Repeat Steps 3–5 until the centroids no longer change.
⬇
Step 7
Final clusters are formed.
🟨 Workflow Diagram
Customer Dataset
│
▼
Choose K
│
▼
Select Initial Centroids
│
▼
Calculate Distance
│
▼
Assign Data Points
│
▼
Update Centroids
│
▼
Repeat Until Stable
│
▼
Final Clusters
🟩 Advantages
✔ Simple and easy to implement
✔ Fast for large datasets
✔ Efficient clustering algorithm
✔ Easy to understand
✔ Works well with numerical data
🟥 Limitations
❌ Number of clusters (K) must be specified in advance.
❌ Sensitive to outliers.
❌ Works best for spherical clusters.
❌ Different initial centroids may produce different results.
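The sensitivity to outliers follows from each centroid being a mean. The sketch below adds one made-up extreme income (500, which is not part of the original dataset) to the high-income group and shows how far the centroid shifts.

```python
high_income = [60, 62, 65, 68]
with_outlier = high_income + [500]   # 500 is a hypothetical outlier

# The centroid is the mean, so one extreme value pulls it strongly
center_before = sum(high_income) / len(high_income)
center_after = sum(with_outlier) / len(with_outlier)
print(center_before)  # 63.75
print(center_after)   # 151.0
```

A single extreme customer moves the centroid from 63.75 to 151, far away from the four typical members of the group.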
🟦 Applications
- 🛒 Customer Segmentation
- 🏥 Disease Pattern Analysis
- 📷 Image Compression
- 🌐 Website User Grouping
- 🎯 Recommendation Systems
- 📊 Market Basket Analysis
- 🛰 Satellite Image Segmentation
🟨 Viva Questions
- What is K-Means Clustering?
- Why is K-Means called an unsupervised algorithm?
- What is a centroid?
- What is the purpose of n_clusters?
- What is random_state?
- What happens during each iteration of K-Means?
- Name two applications of K-Means Clustering.
- What are the limitations of K-Means?
⭐ One-Line Revision
K-Means Clustering groups similar data points into K clusters by repeatedly assigning points to the nearest centroid and updating the centroids until stable clusters are formed.
🌳 Decision Tree in Machine Learning Using Python
🌳 Decision Tree in Machine Learning
🎯 Aim
Aim:
To implement the Decision Tree Classification Algorithm using Python and predict whether a customer is eligible for a Loan Approval based on their Age.
📖 Problem Statement
A bank has customer records containing Age and Loan Approval Status.
The bank wants to predict whether a new customer will get a loan based on the customer's age.
🟦 Step 1: Import Required Library
from sklearn.tree import DecisionTreeClassifier
🔍 Explanation
- sklearn → Scikit-Learn library used for Machine Learning.
- tree → Module containing Decision Tree algorithms.
- DecisionTreeClassifier → Used for solving classification problems.
🟩 Step 2: Create the Training Dataset
X = [
    [22],
    [25],
    [35],
    [40],
    [28],
    [50]
]
🔍 Explanation
X represents the Independent Variable (Input Feature).
Here, the feature is Age.
| Customer | Age |
|---|---|
| 1 | 22 |
| 2 | 25 |
| 3 | 35 |
| 4 | 40 |
| 5 | 28 |
| 6 | 50 |
The model learns patterns from these age values.
🟨 Step 3: Create the Target Variable
y = [
    "Reject",
    "Reject",
    "Approve",
    "Approve",
    "Reject",
    "Approve"
]
🔍 Explanation
y represents the Dependent Variable (Target Output).
| Age | Loan Status |
|---|---|
| 22 | Reject |
| 25 | Reject |
| 35 | Approve |
| 40 | Approve |
| 28 | Reject |
| 50 | Approve |
The model learns the relationship between Age and Loan Status.
🟪 Step 4: Create the Decision Tree Model
model = DecisionTreeClassifier()
🔍 Explanation
This line creates a Decision Tree Classifier object.
No training happens here.
It only creates an empty model.
🟦 Step 5: Train the Model
model.fit(X, y)
🔍 Explanation
The fit() method trains the model.
Syntax:
model.fit(input, output)
Here,
- Input → X (Age)
- Output → y (Loan Status)
During training, the Decision Tree:
- Reads all training data.
- Finds the best splitting condition.
- Creates decision rules.
- Builds the tree.
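To make "finds the best splitting condition" concrete, the sketch below scores every candidate threshold on Age by weighted Gini impurity, which is the default criterion of DecisionTreeClassifier. It is a simplified single-split illustration, not scikit-learn's actual code; the helper names gini and best_split are invented for this example, and candidate thresholds are taken as midpoints between adjacent sorted ages.

```python
from collections import Counter

ages = [22, 25, 35, 40, 28, 50]
labels = ["Reject", "Reject", "Approve", "Approve", "Reject", "Approve"]


def gini(group):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    if not group:
        return 0.0
    counts = Counter(group)
    return 1.0 - sum((n / len(group)) ** 2 for n in counts.values())


def best_split(xs, ys):
    """Try the midpoint between each pair of adjacent sorted values."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_score = None, float("inf")
    for (a, _), (b, _) in zip(pairs, pairs[1:]):
        if a == b:
            continue  # identical values cannot be separated
        threshold = (a + b) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        # Weighted average impurity of the two child nodes
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


threshold, impurity = best_split(ages, labels)
print("Best rule: Age <=", threshold, "| weighted Gini =", impurity)
```

For this data the best rule is Age ≤ 31.5, which separates the two classes perfectly (impurity 0), so no further splits are needed.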
🟩 Step 6: Predict New Data
Suppose a new customer is 30 years old.
prediction = model.predict([[30]])
🔍 Explanation
The model compares the new customer's age with the learned decision rules and predicts the loan status.
🟨 Step 7: Display the Result
print("Loan Status =", prediction[0])
🔍 Explanation
prediction is returned as a NumPy array.
Example:
['Reject']
prediction[0] extracts the first element.
Output:
Loan Status = Reject
📌 Complete Python Program
# Decision Tree Classification Example
from sklearn.tree import DecisionTreeClassifier
# Training Data (Input Feature)
X = [
    [22],
    [25],
    [35],
    [40],
    [28],
    [50]
]
# Target Output
y = [
    "Reject",
    "Reject",
    "Approve",
    "Approve",
    "Reject",
    "Approve"
]
# Create Decision Tree Model
model = DecisionTreeClassifier()
# Train the Model
model.fit(X, y)
# Predict Loan Status for Age = 30
prediction = model.predict([[30]])
# Display Result
print("Loan Status =", prediction[0])
💻 Sample Output
Loan Status = Reject
🌳 How the Decision Tree Works
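Suppose the trained model creates the following decision tree. The threshold of 30 is illustrative; with this training data, scikit-learn places the split at 31.5, midway between ages 28 and 35, which gives the same predictions for the two examples below.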
Age
│
Age ≤ 30 ?
/ \
Yes No
│ │
Reject Approve
Explanation
- If Age ≤ 30, predict Reject.
- If Age > 30, predict Approve.
For a customer aged 30:
30 ≤ 30
➡ Prediction = Reject
For a customer aged 40:
40 > 30
➡ Prediction = Approve
⚙️ Step-by-Step Working
Start
│
▼
Import DecisionTreeClassifier
│
▼
Create Training Dataset (X and y)
│
▼
Create Decision Tree Model
│
▼
Train the Model using fit()
│
▼
Provide New Customer Data
│
▼
Predict Loan Status
│
▼
Display Result
│
▼
End
📊 Explanation of Important Functions
| Function | Purpose |
|---|---|
| DecisionTreeClassifier() | Creates the Decision Tree model |
| fit(X, y) | Trains the model using the training dataset |
| predict() | Predicts the class of new data |
| print() | Displays the prediction |
🌍 Real-Life Applications
- 🏦 Loan Approval
- 🏥 Disease Diagnosis
- 📧 Spam Email Detection
- 🎓 Student Performance Prediction
- 🛒 Customer Purchase Prediction
- 🚗 Car Insurance Approval
- 🌾 Crop Recommendation
- 💳 Credit Risk Analysis
✅ Advantages
- Easy to understand and interpret.
- Requires little data preprocessing.
- Handles both numerical and categorical data.
- Works for classification and regression.
- Can visualize decision-making as a tree.
❌ Limitations
- Can overfit the training data.
- Sensitive to small changes in the dataset.
- Large trees become difficult to interpret.
- May not perform well with very complex datasets.
🎯 Viva Questions
- What is a Decision Tree?
- Why is it called a Decision Tree?
- What is DecisionTreeClassifier?
- What is the purpose of fit()?
- What is the purpose of predict()?
- What are independent and dependent variables?
- What are the advantages of Decision Trees?
- What are the limitations of Decision Trees?
- Give two real-life applications of Decision Trees.
- Differentiate between Decision Tree Classification and Decision Tree Regression.
📝 University Exam Definition
Decision Tree is a supervised machine learning algorithm used for classification and regression. It predicts the output by splitting data into smaller subsets using decision rules based on input features, forming a tree-like structure.
⭐ One-Line Revision
Decision Tree builds a tree-like model by asking a series of questions about the input data and predicts the final output based on the learned decision rules.
Association Rule Mining (Apriori Algorithm) Using Python
Association Rule Mining (Apriori Algorithm)
Note: Association Rule Mining is an Unsupervised Machine Learning technique. It is mainly used for Market Basket Analysis to discover relationships between items frequently purchased together.
🟦 Program Aim
Aim:
To implement the Association Rule Mining (Apriori Algorithm) using Python and identify products that are frequently purchased together.
🟩 Algorithm Used
Apriori Algorithm
🟨 Problem Statement
A supermarket wants to analyze customer shopping patterns. By examining previous transactions, the store aims to identify products that are frequently purchased together. This information helps improve product placement, cross-selling, and promotional strategies.
🟪 Step 1: Install Required Library
Install the mlxtend package (only once).
pip install mlxtend
Explanation
- mlxtend stands for Machine Learning Extensions.
- It provides the Apriori algorithm and functions for generating association rules.
🟦 Step 2: Import Required Libraries
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
Explanation
- pandas → Used to create and manipulate data.
- TransactionEncoder → Converts transaction data into a True/False matrix.
- apriori() → Finds frequent itemsets.
- association_rules() → Generates association rules from frequent itemsets.
🟩 Step 3: Create the Transaction Dataset
transactions = [
    ["Milk", "Bread", "Butter"],
    ["Milk", "Bread"],
    ["Milk", "Butter"],
    ["Bread", "Butter"],
    ["Milk", "Bread", "Butter", "Eggs"],
    ["Bread", "Eggs"],
    ["Milk", "Eggs"]
]
Explanation
Each inner list represents one customer's shopping basket.
| Customer | Purchased Items |
|---|---|
| 1 | Milk, Bread, Butter |
| 2 | Milk, Bread |
| 3 | Milk, Butter |
| 4 | Bread, Butter |
| 5 | Milk, Bread, Butter, Eggs |
| 6 | Bread, Eggs |
| 7 | Milk, Eggs |
🟨 Step 4: Convert Transactions into Binary Format
encoder = TransactionEncoder()
encoded_data = encoder.fit(transactions).transform(transactions)
df = pd.DataFrame(encoded_data, columns=encoder.columns_)
Explanation
The Apriori algorithm requires data in binary (True/False or 1/0) format.
The dataset becomes:
| Bread | Butter | Eggs | Milk |
|---|---|---|---|
| True | True | False | True |
| True | False | False | True |
| False | True | False | True |
| True | True | False | False |
| True | True | True | True |
| True | False | True | False |
| False | False | True | True |
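What the encoder does can be imitated in plain Python. This is only a sketch of the idea, not mlxtend's implementation; the variable names columns and matrix are chosen for this example.

```python
transactions = [
    ["Milk", "Bread", "Butter"],
    ["Milk", "Bread"],
    ["Milk", "Butter"],
    ["Bread", "Butter"],
    ["Milk", "Bread", "Butter", "Eggs"],
    ["Bread", "Eggs"],
    ["Milk", "Eggs"],
]

# Column order: every distinct item, sorted alphabetically
columns = sorted({item for basket in transactions for item in basket})

# One row per transaction: True where the basket contains the item
matrix = [[item in basket for item in columns] for basket in transactions]

print(columns)
for row in matrix:
    print(row)
```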
🟦 Step 5: Display the Dataset
print(df)
Explanation
Displays the converted transaction matrix used for mining frequent itemsets.
🟩 Step 6: Find Frequent Itemsets
frequent_items = apriori(df, min_support=0.3, use_colnames=True)
print(frequent_items)
Explanation
- min_support = 0.3 means an itemset must appear in at least 30% of all transactions.
- use_colnames=True displays product names instead of column numbers.
Example Output:
| Support | Itemsets |
|---|---|
| 0.71 | {Milk} |
| 0.71 | {Bread} |
| 0.57 | {Butter} |
| 0.43 | {Eggs} |
| 0.43 | {Milk, Bread} |
| 0.43 | {Milk, Butter} |
| 0.43 | {Bread, Butter} |
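The support values can be checked by counting directly. The sketch below uses brute force over every item combination rather than the Apriori pruning strategy (Apriori skips supersets of infrequent itemsets), so it is meant only to verify the numbers; the helper name support and the dictionary frequent are assumptions for this example.

```python
from itertools import combinations

transactions = [
    ["Milk", "Bread", "Butter"],
    ["Milk", "Bread"],
    ["Milk", "Butter"],
    ["Bread", "Butter"],
    ["Milk", "Bread", "Butter", "Eggs"],
    ["Bread", "Eggs"],
    ["Milk", "Eggs"],
]
min_support = 0.3
items = sorted({item for basket in transactions for item in basket})


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for basket in transactions if set(itemset) <= set(basket))
    return hits / len(transactions)


# Brute force: test every combination of 1, 2, 3, ... items
frequent = {}
for size in range(1, len(items) + 1):
    for itemset in combinations(items, size):
        s = support(itemset)
        if s >= min_support:
            frequent[itemset] = s

for itemset, s in frequent.items():
    print(round(s, 2), set(itemset))
```

With a threshold of 0.3, an itemset needs to appear in at least 3 of the 7 baskets, which leaves the four single items and three pairs.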
🟨 Step 7: Generate Association Rules
rules = association_rules(
    frequent_items,
    metric="confidence",
    min_threshold=0.7
)
print(rules)
Explanation
This step generates association rules using:
- Metric = Confidence
- Minimum Confidence = 70%
Example Rule:
Butter → Milk
Meaning:
Customers buying Butter are likely to buy Milk as well (3 of the 4 baskets containing Butter also contain Milk).
🟥 Step 8: Display Selected Columns
print(rules[['antecedents',
             'consequents',
             'support',
             'confidence',
             'lift']])
Explanation
This displays the most important measures:
| Antecedent | Consequent | Support | Confidence | Lift |
|---|---|---|---|---|
| Butter | Bread | 0.43 | 0.75 | 1.05 |
| Butter | Milk | 0.43 | 0.75 | 1.05 |
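The measures can be recomputed by hand from the seven transactions. The sketch below is not mlxtend's code; the helper names support and rule_metrics are invented for this example, and only the three items that occur in frequent pairs are tested.

```python
transactions = [
    ["Milk", "Bread", "Butter"],
    ["Milk", "Bread"],
    ["Milk", "Butter"],
    ["Bread", "Butter"],
    ["Milk", "Bread", "Butter", "Eggs"],
    ["Bread", "Eggs"],
    ["Milk", "Eggs"],
]


def support(items):
    """Fraction of transactions that contain all of the given items."""
    return sum(1 for t in transactions if set(items) <= set(t)) / len(transactions)


def rule_metrics(antecedent, consequent):
    """Return (support, confidence, lift) for antecedent -> consequent."""
    both = support([antecedent, consequent])
    confidence = both / support([antecedent])
    lift = confidence / support([consequent])
    return both, confidence, lift


# Pairs involving Eggs fall below min_support=0.3, so Eggs is left out
items = ["Bread", "Butter", "Milk"]
strong_rules = []
for a in items:
    for c in items:
        if a != c:
            s, conf, lift = rule_metrics(a, c)
            if conf >= 0.7:  # same threshold as min_threshold above
                strong_rules.append((a, c))
                print(f"{a} -> {c}: support={s:.2f}, "
                      f"confidence={conf:.2f}, lift={lift:.2f}")
```

Only the two rules starting from Butter reach 70% confidence; rules such as Milk → Bread have a confidence of 3/5 = 0.60 and are filtered out.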
🟪 Complete Python Program
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
transactions = [
    ["Milk", "Bread", "Butter"],
    ["Milk", "Bread"],
    ["Milk", "Butter"],
    ["Bread", "Butter"],
    ["Milk", "Bread", "Butter", "Eggs"],
    ["Bread", "Eggs"],
    ["Milk", "Eggs"]
]
encoder = TransactionEncoder()
encoded_data = encoder.fit(transactions).transform(transactions)
df = pd.DataFrame(encoded_data, columns=encoder.columns_)
print("Transaction Dataset")
print(df)
frequent_items = apriori(df,
                         min_support=0.3,
                         use_colnames=True)
print("\nFrequent Itemsets")
print(frequent_items)
rules = association_rules(frequent_items,
                          metric="confidence",
                          min_threshold=0.7)
print("\nAssociation Rules")
print(rules[['antecedents',
             'consequents',
             'support',
             'confidence',
             'lift']])
🟩 Sample Output
Transaction Dataset
Bread Butter Eggs Milk
0 True True False True
1 True False False True
2 False True False True
3 True True False False
4 True True True True
5 True False True False
6 False False True True
Frequent Itemsets
support itemsets
0.71 {Milk}
0.71 {Bread}
0.57 {Butter}
0.43 {Eggs}
0.43 {Milk, Bread}
...
Association Rules
Butter → Bread
Butter → Milk
🟦 Step-by-Step Working of the Algorithm
Transaction Data
│
▼
Convert into Binary Matrix
│
▼
Apply Apriori Algorithm
│
▼
Find Frequent Itemsets
│
▼
Generate Association Rules
│
▼
Display Support, Confidence & Lift
🟨 Important Terms
| Term | Description |
|---|---|
| Support | Fraction of all transactions that contain the itemset. |
| Confidence | Probability that customers who buy item A also buy item B. |
| Lift | Measures the strength of the relationship between two items. A lift value greater than 1 indicates a positive association. |
| Frequent Itemset | A group of items that appears frequently in the dataset. |
| Association Rule | A rule showing the relationship between two or more items (e.g., Milk → Bread). |
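The three measures can also be written as formulas. For a rule $A \rightarrow B$ over $N$ transactions, let $\text{count}(X)$ be the number of transactions containing every item in $X$ (this notation is introduced here only for illustration):

```latex
\text{support}(A \rightarrow B) = \frac{\text{count}(A \cup B)}{N}, \qquad
\text{confidence}(A \rightarrow B) = \frac{\text{count}(A \cup B)}{\text{count}(A)}, \qquad
\text{lift}(A \rightarrow B) = \frac{\text{confidence}(A \rightarrow B)}{\text{support}(B)}
```

With the seven transactions from Step 3, Butter appears in 4 baskets, Milk in 5, and both together in 3. For Butter → Milk this gives support = 3/7 ≈ 0.43, confidence = 3/4 = 0.75, and lift = 0.75 / (5/7) = 1.05.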
🌍 Real-Life Applications
- 🛒 Market Basket Analysis
- 🛍 Product Recommendation Systems
- 🏪 Store Shelf Arrangement
- 💳 Banking Product Recommendations
- 🎬 Movie Recommendation Systems
- 🌐 E-commerce Websites (Amazon, Flipkart)
- 🍔 Restaurant Combo Offers
🎯 Viva Questions
- What is Association Rule Mining?
- What is the Apriori Algorithm?
- Define Support, Confidence, and Lift.
- What is a Frequent Itemset?
- Why is TransactionEncoder used?
- What is the purpose of min_support?
- What is the purpose of min_threshold in association rules?
- Give two real-life applications of Association Rule Mining.
⭐ One-Line Revision
Association Rule Mining uses the Apriori algorithm to discover frequently occurring item combinations and generate rules such as "If a customer buys Milk, they are also likely to buy Bread."