📘 Module 1: Introduction to R Programming
(6 Classes)
🎯 Learning Outcomes
After completing this module, students will be able to:
✔ Install R and RStudio
✔ Understand the RStudio Interface
✔ Write basic R programs
✔ Perform arithmetic and logical operations
✔ Work with different data types
✔ Create and manipulate vectors, lists, matrices and data frames
✔ Understand factors and categorical variables
CLASS 1
Introduction to R
What is R?
R is an open-source programming language and environment designed for statistical computing and graphics. It is widely used for
- Data Analysis
- Statistics
- Machine Learning
- Artificial Intelligence
- Data Visualization
- Research
It was developed by
- Ross Ihaka
- Robert Gentleman
at the University of Auckland.
Today R is developed by the R Core Team, with support from the R Foundation.
Why Learn R?
Advantages
✔ Free
✔ Open Source
✔ Easy to Learn
✔ Powerful Graphics
✔ Huge Package Library
✔ Excellent Statistical Functions
✔ Cross Platform
Applications of R
R is widely used in
- Data Science
- Business Analytics
- Bioinformatics
- Finance
- Healthcare
- Marketing
- Machine Learning
- Research
Installing R
Step 1
Download R from
Install normally.
Step 2
Download RStudio
https://posit.co/download/rstudio-desktop/
Install after installing R.
CLASS 2
RStudio Interface
When RStudio opens, four main panes appear. Some panes hold several tabs, so the seven items described below are spread across these four panes.
+---------------------+----------------------+
| Source Editor | Environment |
| | History |
+---------------------+----------------------+
| Console | Files |
| | Plots |
| | Packages |
| | Help |
+---------------------+----------------------+
1. Source
Used to
- Write scripts
- Save programs
- Edit code
Shortcut to create a new R script
Ctrl + Shift + N
2. Console
Used to execute commands immediately.
Example
5+10
Output
15
3. Environment
Shows
- Variables
- Data
- Functions
4. Files
Displays project files.
5. Plots
Displays graphs.
6. Packages
Shows installed packages.
7. Help
Displays documentation.
Example
help(mean)
Understanding the R Command Prompt
Console Prompt
>
means R is ready.
Example
> 5+2
[1] 7
CLASS 3
Basic Operations
Arithmetic Operators
| Operator | Meaning |
|---|---|
| + | Addition |
| - | Subtraction |
| * | Multiplication |
| / | Division |
| ^ | Power |
| %% | Modulus |
| %/% | Integer Division |
Example
a <- 20
b <- 6
a+b
a-b
a*b
a/b
a%%b
a%/%b
a^2
Output
26
14
120
3.333333
2
3
400
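The integer division and modulus operators are two halves of the same calculation. A minimal sketch, reusing the values a = 20 and b = 6 from the example above (the names quotient, remainder, and rebuilt are only illustrative):

```r
a <- 20
b <- 6

quotient  <- a %/% b   # how many whole times b fits into a
remainder <- a %% b    # what is left over after that

# Multiplying back and adding the remainder should give a again
rebuilt <- quotient * b + remainder

print(quotient)
print(remainder)
print(rebuilt == a)
```

This relationship is handy for tasks such as checking whether a number is even (x %% 2 == 0).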
Comparison Operators
| Operator | Meaning |
|---|---|
| > | Greater |
| < | Less |
| >= | Greater Equal |
| <= | Less Equal |
| == | Equal |
| != | Not Equal |
Example
10>5
5==5
10!=2
Output
TRUE
TRUE
TRUE
Logical Operators
| Operator | Meaning |
|---|---|
| & | AND |
| \| | OR |
| ! | NOT |
Example
TRUE & FALSE
TRUE | FALSE
!TRUE
Output
FALSE
TRUE
FALSE
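The operators &, | and ! also work element by element on whole logical vectors. The sketch below assumes two made-up vectors, passed_theory and passed_practical, for three students:

```r
# Hypothetical pass/fail results for three students
passed_theory    <- c(TRUE, FALSE, TRUE)
passed_practical <- c(TRUE, TRUE, FALSE)

both_passed   <- passed_theory & passed_practical   # TRUE only where both are TRUE
either_passed <- passed_theory | passed_practical   # TRUE where at least one is TRUE
failed_theory <- !passed_theory                     # flips every value

print(both_passed)
print(either_passed)
print(failed_theory)
```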
CLASS 4
Data Types
R supports several data types. The most commonly used ones are shown below.
Numeric
x <- 10.5
class(x)
typeof(x)
Output
"numeric"
"double"
Integer
x <- 10L
class(x)
Output
"integer"
Character
name <- "Rahul"
class(name)
Output
"character"
Logical
flag <- TRUE
class(flag)
Output
"logical"
Factor
gender <- factor(c("Male","Female","Male"))
gender
Output
Male Female Male
Levels: Female Male
Variable Assignment
There are three commonly used assignment operators. The <- form is the conventional choice in R code.
x <- 10
y = 20
30 -> z
Result (an assignment prints nothing; these are the values stored)
x = 10
y = 20
z = 30
Variable Naming Rules
✔ Can contain letters
✔ Numbers
✔ Underscore
✔ Dot
Cannot start with a number or an underscore, and cannot contain spaces or hyphens.
Correct
student_name
age
salary1
marks.math
Wrong
1age
my-name
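One way to check a name before using it is the base function make.names(), which rewrites an invalid name into a valid one. If the result equals the input, the name was already valid. A small sketch using the names listed above (candidates and is_valid are illustrative variable names):

```r
candidates <- c("student_name", "age", "salary1", "marks.math",
                "1age", "my-name")

# make.names() leaves valid names unchanged and repairs invalid ones
repaired <- make.names(candidates)
is_valid <- repaired == candidates

print(data.frame(name = candidates, valid = is_valid, repaired = repaired))
```

Reserved words such as if or TRUE are also reported as invalid by this check, because make.names() alters them too.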
CLASS 5
Data Structures in R
Vector
A vector stores elements of the same data type.
Create Vector
marks <- c(80,90,75,85,95)
marks
Output
80 90 75 85 95
Length
length(marks)
Output
5
Class
class(marks)
Output
"numeric"
Type
typeof(marks)
Output
"double"
Indexing
marks[2]
Output
90
Multiple Values
marks[c(2,4)]
Output
90
85
Functions
sum(marks)
mean(marks)
max(marks)
min(marks)
Output
425
85
95
75
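Because R arithmetic and comparisons are vectorised, an operation applies to every element at once, and a logical vector can be used as an index. A short sketch with the same marks vector (bonus_marks, above_80, and sorted_marks are illustrative names):

```r
marks <- c(80, 90, 75, 85, 95)

bonus_marks  <- marks + 5                      # adds 5 to every element
above_80     <- marks[marks > 80]              # keeps only elements greater than 80
sorted_marks <- sort(marks, decreasing = TRUE) # highest mark first

print(bonus_marks)
print(above_80)
print(sorted_marks)
```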
List
Lists store different data types.
student <- list(
Name="Amit",
Age=20,
Marks=85,
Passed=TRUE
)
student
Output
$Name
"Amit"
$Age
20
$Marks
85
$Passed
TRUE
Access
student$Name
student[[2]]
Output
"Amit"
20
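Lists can be changed after they are created, and single brackets behave differently from double brackets. The sketch below rebuilds the student list; the Course element and its value "BCA" are made up for illustration:

```r
student <- list(Name = "Amit", Age = 20, Marks = 85, Passed = TRUE)

student$Course     <- "BCA"  # add a new element (hypothetical value)
student[["Marks"]] <- 90     # change an existing element

print(names(student))
print(length(student))

# [ ] returns a smaller list, [[ ]] returns the element itself
print(class(student["Name"]))
print(class(student[["Name"]]))
```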
Matrix
Stores data in rows and columns.
mat <- matrix(1:9,nrow=3,ncol=3)
mat
Output
1 4 7
2 5 8
3 6 9
Indexing
mat[2,3]
Output
8
Matrix Addition
A<-matrix(1:4,2,2)
B<-matrix(5:8,2,2)
A+B
Output
6 10
8 12
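Like +, the * operator works element by element on matrices. True matrix multiplication uses %*% instead. A brief sketch with the same A and B (elementwise and product are illustrative names):

```r
A <- matrix(1:4, 2, 2)
B <- matrix(5:8, 2, 2)

elementwise <- A * B     # multiplies matching cells
product     <- A %*% B   # row-by-column matrix multiplication
transposed  <- t(A)      # swaps rows and columns

print(elementwise)
print(product)
print(transposed)
```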
CLASS 6
Data Frame and Factors
Data Frame
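The data frame is the most commonly used structure for tabular data. Each column is a vector, and different columns can hold different data types.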
student <- data.frame(
Roll=c(1,2,3),
Name=c("A","B","C"),
Marks=c(90,85,95)
)
student
Output
Roll Name Marks
1 A 90
2 B 85
3 C 95
Structure
str(student)
Summary
summary(student)
Access Column
student$Marks
First Row
student[1,]
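Rows can also be selected with a condition, and new columns can be added by assignment. A minimal sketch on the same data frame; the Result column and the 90-mark cut-off are assumptions made only for this example:

```r
student <- data.frame(
  Roll  = c(1, 2, 3),
  Name  = c("A", "B", "C"),
  Marks = c(90, 85, 95)
)

# Add a column derived from an existing one (hypothetical grading rule)
student$Result <- ifelse(student$Marks >= 90, "Distinction", "Pass")

# Keep only the rows that satisfy a condition
top_students <- student[student$Marks >= 90, ]

print(student)
print(top_students)
```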
Import CSV
data <- read.csv("student.csv")
head(data)
Export CSV
write.csv(student,"student.csv")
Factors
Factors store categorical data.
Example
grade <- factor(c(
"A",
"B",
"A",
"C",
"B"
))
grade
Output
A B A C B
Levels: A B C
Levels
levels(grade)
Output
"A"
"B"
"C"
Frequency
table(grade)
Output
A 2
B 2
C 1
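Internally a factor stores integer codes that point to its levels, and the level order can be set explicitly when the categories have a natural ranking. The sketch below assumes a made-up size variable with the levels Low, Medium, and High:

```r
grade <- factor(c("A", "B", "A", "C", "B"))
codes <- as.integer(grade)   # position of each value in levels(grade)
print(codes)

# Ordered factor: levels are given in ranking order (hypothetical data)
size <- factor(c("Low", "High", "Medium", "Low"),
               levels  = c("Low", "Medium", "High"),
               ordered = TRUE)

below_high <- size < "High"  # comparisons are allowed on ordered factors
print(below_high)
print(table(size))           # counts follow the level order, not the alphabet
```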
Summary of Data Structures
| Data Structure | Stores |
|---|---|
| Vector | Same Data Type |
| List | Different Data Types |
| Matrix | 2D Same Data Type |
| Data Frame | Tabular Data |
| Factor | Categorical Data |
Common Built-in Functions
| Function | Purpose |
|---|---|
| length() | Number of elements |
| class() | Data class |
| typeof() | Internal type |
| sum() | Addition |
| mean() | Average |
| max() | Maximum |
| min() | Minimum |
| str() | Structure |
| summary() | Summary |
| head() | First rows |
| tail() | Last rows |
| table() | Frequency |
Practical Exercises
- Create two variables and perform all arithmetic operations.
- Compare two numbers using comparison operators.
- Demonstrate logical operators using TRUE and FALSE.
- Create variables of numeric, integer, character, logical, and factor types.
- Create a vector of 10 numbers and calculate its sum, mean, maximum, and minimum.
- Create a list containing a student's name, age, course, and marks.
- Create a 3×3 matrix and print the second row.
- Create a data frame of five students with roll number, name, and marks.
- Import a CSV file and display the first five records.
- Create a factor for student grades and display the frequency of each grade.
Viva Questions
- What is R?
- What is RStudio?
- What is the difference between R and RStudio?
- What are the data types in R?
- Explain vectors with an example.
- What is a list?
- What is a matrix?
- What is a data frame?
- What are factors?
- Explain the difference between class() and typeof().
- What is the use of summary()?
- What is indexing in R?
- How do you import a CSV file?
- How do you export a CSV file?
- Why are factors important in statistical analysis?
📘 Module 2: Data Manipulation and Management (10 Classes)
📚 Syllabus
1. Data Import and Export
- Reading data from CSV files
- Reading data from Excel files
- Writing data to CSV files
- Writing data to Excel files
2. Data Cleaning and Preparation
- Handling missing values (NA)
- Detecting and removing duplicates
- Data type conversion
- Renaming rows and columns
3. Data Transformation
- Selecting columns (select())
- Filtering rows (filter())
- Arranging data (arrange())
- Creating new variables (mutate())
- Transforming variables (transmute())
- Summarizing data (summarise())
- Grouping data (group_by())
📖 Class-wise Course Plan
| Class | Topics |
|---|---|
| Class 1 | Introduction to Data Manipulation, Reading CSV Files (read.csv()) |
| Class 2 | Reading Excel Files (readxl), Importing Different File Formats |
| Class 3 | Writing Data to CSV and Excel (write.csv(), writexl) |
| Class 4 | Data Cleaning: Missing Values (NA), is.na(), na.omit() |
| Class 5 | Handling Duplicate Records, Data Type Conversion |
| Class 6 | Renaming Rows and Columns, Working with Data Frames |
| Class 7 | Data Transformation: select(), filter(), arrange() |
| Class 8 | mutate(), transmute(), Creating New Variables |
| Class 9 | summarise(), group_by(), Statistical Summaries |
| Class 10 | Complete Data Cleaning & Transformation Case Study, Revision, Viva Questions, Lab Exercises |
Class 1: Data Import and Export – Reading Data from CSV Files
Duration: 1 Class
🎯 Learning Objectives
After completing this lesson, students will be able to:
- Understand the concept of data import.
- Know different file formats supported by R.
- Read CSV files into R.
- Display and inspect imported data.
- Understand the structure of a data frame.
- Perform basic data exploration.
📖 2.1 Introduction to Data Import
Definition
Data Import is the process of loading data from external sources into R for analysis and visualization.
Most real-world datasets are stored in external files such as:
- CSV Files
- Excel Files
- Text Files
- JSON Files
- Database Tables
R provides powerful functions to import these datasets efficiently.
🌟 Why Data Import is Important?
Data import is the first step in any data analysis project because it allows users to work with real-world datasets.
Advantages
- Imports large datasets quickly.
- Supports multiple file formats.
- Easy to analyze imported data.
- Compatible with data visualization and machine learning.
📊 Common Data File Formats
| File Format | Extension | Description |
|---|---|---|
| CSV | .csv | Comma-Separated Values |
| Excel | .xlsx | Microsoft Excel Workbook |
| Text | .txt | Plain Text File |
| JSON | .json | JavaScript Object Notation |
| R Data | .RData | Native R Data File |
📖 2.2 What is a CSV File?
CSV stands for Comma-Separated Values.
Each row represents one record, and each column represents one variable.
CSV is one of the most widely used formats for data exchange because it is simple and supported by almost every software application.
📊 Sample CSV Dataset (10 Records)
File Name: employee.csv
| Emp_ID | Name | Department | Age | Salary |
|---|---|---|---|---|
| 101 | Amit | HR | 25 | 30000 |
| 102 | Priya | Sales | 28 | 35000 |
| 103 | Rahul | IT | 30 | 50000 |
| 104 | Sneha | HR | 27 | 32000 |
| 105 | Karan | IT | 35 | 60000 |
| 106 | Neha | Finance | 31 | 55000 |
| 107 | Arjun | Sales | 29 | 40000 |
| 108 | Pooja | Finance | 33 | 58000 |
| 109 | Rohan | IT | 26 | 45000 |
| 110 | Anjali | HR | 32 | 52000 |
📖 2.3 Creating a CSV File
The dataset above can be saved in Notepad or Microsoft Excel as:
employee.csv
CSV Content
Emp_ID,Name,Department,Age,Salary
101,Amit,HR,25,30000
102,Priya,Sales,28,35000
103,Rahul,IT,30,50000
104,Sneha,HR,27,32000
105,Karan,IT,35,60000
106,Neha,Finance,31,55000
107,Arjun,Sales,29,40000
108,Pooja,Finance,33,58000
109,Rohan,IT,26,45000
110,Anjali,HR,32,52000
🔵 2.4 Reading a CSV File
Method 1: Using read.csv()
Syntax
read.csv(file, header = TRUE)
Parameters
| Parameter | Description |
|---|---|
| file | CSV file path |
| header | TRUE if the first row contains column names |
💻 Example 1: Read Employee Data
employee <- read.csv("employee.csv")
employee
Output
Emp_ID Name Department Age Salary
1 101 Amit HR 25 30000
2 102 Priya Sales 28 35000
3 103 Rahul IT 30 50000
4 104 Sneha HR 27 32000
5 105 Karan IT 35 60000
6 106 Neha Finance 31 55000
7 107 Arjun Sales 29 40000
8 108 Pooja Finance 33 58000
9 109 Rohan IT 26 45000
10 110 Anjali HR 32 52000
Explanation
- read.csv() imports the CSV file.
- The data is stored as a data frame.
- Each row represents one employee.
- Each column represents one variable.
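To try read.csv() without preparing a file by hand, the CSV text can be written to a temporary file first. This is only a sketch: it uses three of the ten employee records, and csv_lines and csv_path are illustrative names.

```r
# A few lines of the employee data, written to a temporary CSV file
csv_lines <- c(
  "Emp_ID,Name,Department,Age,Salary",
  "101,Amit,HR,25,30000",
  "102,Priya,Sales,28,35000",
  "103,Rahul,IT,30,50000"
)
csv_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, csv_path)

# header = TRUE tells R that the first line holds the column names
employee <- read.csv(csv_path, header = TRUE)

print(employee)
print(class(employee))          # the result is a data frame
print(sapply(employee, class))  # the type R chose for each column
```

When reading your own file, the path is resolved relative to the working directory, which getwd() displays.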
💻 Example 2: View the First Six Records
head(employee)
Output
Emp_ID Name Department Age Salary
1 101 Amit HR 25 30000
2 102 Priya Sales 28 35000
3 103 Rahul IT 30 50000
4 104 Sneha HR 27 32000
5 105 Karan IT 35 60000
6 106 Neha Finance 31 55000
💻 Example 3: View the Last Six Records
tail(employee)
Output
Emp_ID Name Department Age Salary
5 105 Karan IT 35 60000
6 106 Neha Finance 31 55000
7 107 Arjun Sales 29 40000
8 108 Pooja Finance 33 58000
9 109 Rohan IT 26 45000
10 110 Anjali HR 32 52000
💻 Example 4: Display Structure of Dataset
str(employee)
Output
'data.frame': 10 obs. of 5 variables:
$ Emp_ID : int
$ Name : chr
$ Department : chr
$ Age : int
$ Salary : int
Explanation
str() displays:
- Number of rows
- Number of columns
- Data types of variables
💻 Example 5: Dataset Dimensions
dim(employee)
Output
[1] 10 5
Interpretation: The dataset contains 10 rows and 5 columns.
💻 Example 6: Column Names
colnames(employee)
Output
[1] "Emp_ID" "Name" "Department" "Age" "Salary"
💻 Example 7: Row Names
rownames(employee)
Output
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
💻 Example 8: Summary of Dataset
summary(employee)
Output (Example)
Emp_ID
Min. :101
1st Qu.:103.25
Median :105.5
Mean :105.5
3rd Qu.:107.75
Max. :110
Age
Min. :25
Mean :29.6
Max. :35
Salary
Min. :30000
Mean :45700
Max. :60000
💻 Example 9: Display Individual Column
employee$Salary
Output
[1] 30000 35000 50000 32000 60000
[6] 55000 40000 58000 45000 52000
💻 Example 10: Display Multiple Columns
employee[,c("Name","Salary")]
Output
Name Salary
1 Amit 30000
2 Priya 35000
3 Rahul 50000
4 Sneha 32000
5 Karan 60000
6 Neha 55000
7 Arjun 40000
8 Pooja 58000
9 Rohan 45000
10 Anjali 52000
📊 Common Functions for Exploring Data
| Function | Purpose |
|---|---|
| head() | First 6 rows |
| tail() | Last 6 rows |
| str() | Structure |
| summary() | Statistical summary |
| dim() | Rows and columns |
| nrow() | Number of rows |
| ncol() | Number of columns |
| colnames() | Column names |
| rownames() | Row names |
🌍 Real-Life Applications
- Importing student records
- Employee databases
- Sales reports
- Banking transactions
- Hospital patient data
- Survey results
- Research datasets
- Machine learning datasets
✔ Advantages of CSV Files
- Easy to create and edit.
- Lightweight and portable.
- Supported by Excel, R, Python, and databases.
- Ideal for data exchange.
✖ Limitations
- Does not store formatting.
- Does not support formulas.
- No multiple worksheets (unlike Excel).
- Data types are not preserved automatically.
📝 Lab Exercises
- Create an employee.csv file with 10 employee records.
- Import the file using read.csv().
- Display the first and last six records.
- Find the number of rows and columns.
- Display the structure of the dataset.
- Print only the Name and Salary columns.
- Generate a statistical summary using summary().
❓ Viva Questions
- What is a CSV file?
- What is the purpose of read.csv()?
- What does the header argument do?
- Which function displays the first six rows?
- Which function shows the structure of a dataset?
- How do you display column names?
- What is the difference between head() and tail()?
- What information does summary() provide?
- Name two advantages of CSV files.
- Give two real-world applications of importing CSV data.
📚 Class Summary
In this class, you learned:
- The concept of data import.
- CSV file structure.
- Reading CSV files using read.csv().
- Exploring datasets with head(), tail(), str(), dim(), and summary().
- Practical examples using a 10-record employee dataset.
- Real-world applications, advantages, limitations, exercises, and viva questions.
Class 2: Data Import and Export – Reading Data from Excel Files (.xlsx)
Duration: 1 Class
🎯 Learning Objectives
After completing this lesson, students will be able to:
- Understand Excel file formats.
- Install and use the readxl package.
- Read Excel files into R.
- Import specific worksheets.
- Read multiple sheets from an Excel workbook.
- Explore imported data using R functions.
- Compare CSV and Excel file formats.
📖 2.5 Introduction to Excel Files
Definition
An Excel file is a spreadsheet created using Microsoft Excel. It stores data in rows and columns and may contain multiple worksheets, formulas, charts, and formatting.
Unlike CSV files, Excel files can store multiple sheets in a single workbook.
🌟 Advantages of Excel Files
- Multiple worksheets in one file
- Supports formulas and functions
- Can contain charts and graphs
- Easy to edit using Microsoft Excel
- Widely used in businesses and organizations
📊 Excel File Extensions
| Extension | Description |
|---|---|
| .xls | Excel 97–2003 Workbook |
| .xlsx | Excel 2007 and Later Workbook |
| .xlsm | Macro-Enabled Workbook |
📖 2.6 The readxl Package
The readxl package is used to import Excel files into R.
If it is not installed, install it once using:
Install Package
install.packages("readxl")
Load Package
library(readxl)
📊 Sample Excel File
File Name: employee.xlsx
Worksheet: Employee
| Emp_ID | Name | Department | Age | Salary |
|---|---|---|---|---|
| 101 | Amit | HR | 25 | 30000 |
| 102 | Priya | Sales | 28 | 35000 |
| 103 | Rahul | IT | 30 | 50000 |
| 104 | Sneha | HR | 27 | 32000 |
| 105 | Karan | IT | 35 | 60000 |
| 106 | Neha | Finance | 31 | 55000 |
| 107 | Arjun | Sales | 29 | 40000 |
| 108 | Pooja | Finance | 33 | 58000 |
| 109 | Rohan | IT | 26 | 45000 |
| 110 | Anjali | HR | 32 | 52000 |
📖 2.7 Reading an Excel File
Syntax
read_excel(path)
💻 Example 1: Read an Excel File
library(readxl)
employee <- read_excel("employee.xlsx")
employee
Output
# A tibble: 10 × 5
Emp_ID Name Department Age Salary
1 101 Amit HR 25 30000
2 102 Priya Sales 28 35000
3 103 Rahul IT 30 50000
4 104 Sneha HR 27 32000
5 105 Karan IT 35 60000
6 106 Neha Finance 31 55000
7 107 Arjun Sales 29 40000
8 108 Pooja Finance 33 58000
9 109 Rohan IT 26 45000
10 110 Anjali HR 32 52000
💻 Example 2: Read a Specific Worksheet
Suppose the workbook contains two sheets:
- Employee
- Salary
library(readxl)
employee <- read_excel(
"employee.xlsx",
sheet="Employee"
)
employee
Output
Displays all records from the Employee worksheet.
💻 Example 3: Read Sheet by Number
library(readxl)
employee <- read_excel(
"employee.xlsx",
sheet=1
)
Output
Imports the first worksheet.
💻 Example 4: Display Available Sheet Names
library(readxl)
excel_sheets("employee.xlsx")
Output
[1] "Employee"
[2] "Salary"
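Once the sheet names are known, every sheet can be read in one pass by looping over them. The sketch below is an assumption-laden illustration: it uses the sample workbook datasets.xlsx that ships with readxl (located with readxl_example()) rather than employee.xlsx, and it runs only if readxl is installed.

```r
if (requireNamespace("readxl", quietly = TRUE)) {
  # Sample workbook bundled with readxl; replace with your own file path
  path <- readxl::readxl_example("datasets.xlsx")

  sheet_names <- readxl::excel_sheets(path)

  # Read each sheet into one element of a list
  all_sheets <- lapply(sheet_names, function(s) {
    readxl::read_excel(path, sheet = s)
  })
  names(all_sheets) <- sheet_names

  print(sheet_names)
  print(sapply(all_sheets, nrow))  # number of rows in each sheet
}
```

Each sheet is then available by name, for example all_sheets[["Employee"]] for a workbook that has an Employee sheet.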
💻 Example 5: Read Selected Columns
library(readxl)
employee <- read_excel(
"employee.xlsx",
range=cell_cols("A:C")
)
employee
Output
Emp_ID Name Department
101 Amit HR
102 Priya Sales
103 Rahul IT
...
110 Anjali HR
💻 Example 6: Read Specific Cell Range
library(readxl)
employee <- read_excel(
"employee.xlsx",
range="A1:E6"
)
employee
Output
Imports only cells A1 to E6: the header row and the first five records.
💻 Example 7: View Dataset Structure
employee <- read_excel("employee.xlsx")
str(employee)
Output
tibble [10 × 5]
Emp_ID : numeric
Name : character
Department : character
Age : numeric
Salary : numeric
💻 Example 8: Display Summary
summary(employee)
Output
Emp_ID
Min :101
Mean :105.5
Max :110
Age
Min :25
Mean :29.6
Max :35
Salary
Min :30000
Mean :45700
Max :60000
💻 Example 9: First Six Records
head(employee)
Output
First six employee records are displayed.
💻 Example 10: Last Six Records
tail(employee)
Output
Last six employee records are displayed.
📊 Comparison: CSV vs Excel
| Feature | CSV | Excel |
|---|---|---|
| File Extension | .csv | .xlsx |
| Multiple Sheets | ❌ No | ✅ Yes |
| Supports Formatting | ❌ No | ✅ Yes |
| Supports Charts | ❌ No | ✅ Yes |
| File Size | Small | Larger |
| Speed | Faster | Slightly Slower |
| Best For | Data Exchange | Business Reports |
🌍 Real-Life Applications
- Student attendance records
- Employee payroll
- Banking reports
- Hospital patient data
- Sales reports
- Inventory management
- Research datasets
- Financial statements
✔ Advantages of readxl
- Reads Excel files directly.
- Supports .xls and .xlsx.
- Imports selected sheets.
- Imports selected cell ranges.
- Fast and reliable.
✖ Limitations
- Cannot modify Excel files (reading only).
- Formatting is not imported.
- Macros are ignored.
- Charts and images are not imported.
📝 Lab Exercises
Exercise 1
Install the readxl package.
Exercise 2
Read an Excel file named employee.xlsx.
Exercise 3
Display available worksheet names.
Exercise 4
Read only the first worksheet.
Exercise 5
Import only columns A to C.
Exercise 6
Import rows 1–6 from the worksheet.
Exercise 7
Display the structure and summary of the imported dataset.
❓ Viva Questions
- What is an Excel workbook?
- Which package is used to read Excel files in R?
- Which function imports Excel data?
- What is the purpose of excel_sheets()?
- How do you read a worksheet by name?
- How do you read a worksheet by number?
- What is the difference between CSV and Excel?
- Can readxl read .xls files?
- Can readxl import charts?
- Give two applications of Excel data import.
📚 Class Summary
In this class, you learned:
- Introduction to Excel files.
- Installing and loading the readxl package.
- Reading Excel files with read_excel().
- Importing specific worksheets and ranges.
- Viewing sheet names with excel_sheets().
- Comparing CSV and Excel formats.
- Practical R programs with outputs.
- Real-world applications, exercises, and viva questions.
Class 3: Data Export – Writing Data to CSV and Excel Files
Duration: 1 Class
🎯 Learning Objectives
After completing this lesson, students will be able to:
- Understand data export in R.
- Write data frames to CSV files.
- Write data frames to Excel files.
- Export selected columns and filtered data.
- Save processed data for future use.
- Understand the differences between CSV and Excel exports.
📖 2.8 Introduction to Data Export
Definition
Data Export is the process of saving data from R into an external file so that it can be used in other software such as Microsoft Excel, LibreOffice Calc, databases, or shared with others.
Common export formats include:
- CSV (.csv)
- Excel (.xlsx)
- Text (.txt)
- RData (.RData)
🌟 Why Data Export is Important?
Data export allows users to:
- Save processed datasets.
- Share reports with others.
- Store analysis results.
- Create backup copies.
- Use data in other applications.
📊 Sample Dataset (10 Records)
employee <- data.frame(
Emp_ID=c(101,102,103,104,105,106,107,108,109,110),
Name=c("Amit","Priya","Rahul","Sneha","Karan",
"Neha","Arjun","Pooja","Rohan","Anjali"),
Department=c("HR","Sales","IT","HR","IT",
"Finance","Sales","Finance","IT","HR"),
Age=c(25,28,30,27,35,31,29,33,26,32),
Salary=c(30000,35000,50000,32000,60000,
55000,40000,58000,45000,52000)
)
employee
Output
Emp_ID Name Department Age Salary
1 101 Amit HR 25 30000
2 102 Priya Sales 28 35000
3 103 Rahul IT 30 50000
4 104 Sneha HR 27 32000
5 105 Karan IT 35 60000
6 106 Neha Finance 31 55000
7 107 Arjun Sales 29 40000
8 108 Pooja Finance 33 58000
9 109 Rohan IT 26 45000
10 110 Anjali HR 32 52000
🔵 2.9 Writing Data to a CSV File
Syntax
write.csv(data, file, row.names = FALSE)
Parameters
| Parameter | Description |
|---|---|
| data | Data frame to export |
| file | Output file name |
| row.names=FALSE | Prevents row numbers from being written |
💻 Example 1: Export Entire Dataset
write.csv(employee,
"employee.csv",
row.names=FALSE)
Output
employee.csv created successfully.
💻 Example 2: Export Selected Columns
emp_salary <- employee[,c("Name","Salary")]
write.csv(emp_salary,
"salary.csv",
row.names=FALSE)
Output
salary.csv created successfully.
💻 Example 3: Export Employees from IT Department
IT_emp <- subset(employee,
Department=="IT")
write.csv(IT_emp,
"IT_Employees.csv",
row.names=FALSE)
Output
IT_Employees.csv created successfully.
💻 Example 4: Export Employees with Salary > 50,000
high_salary <- subset(employee,
Salary>50000)
write.csv(high_salary,
"HighSalary.csv",
row.names=FALSE)
Output
HighSalary.csv created successfully.
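A quick way to confirm an export is to read the file back and compare it with the original. The sketch below also shows what happens when row.names = FALSE is left out. It uses a three-row subset of the employee data and temporary files; out_path and with_rows_path are illustrative names.

```r
employee <- data.frame(
  Emp_ID = c(101, 102, 103),
  Name   = c("Amit", "Priya", "Rahul"),
  Salary = c(30000, 35000, 50000)
)

# Export without row names, then read the file back
out_path <- tempfile(fileext = ".csv")
write.csv(employee, out_path, row.names = FALSE)
check <- read.csv(out_path)
print(names(check))

# Export with the default row.names = TRUE for comparison
with_rows_path <- tempfile(fileext = ".csv")
write.csv(employee, with_rows_path)
check_rows <- read.csv(with_rows_path)
print(names(check_rows))  # an extra first column named "X" appears
```

The extra column is the reason row.names = FALSE is used throughout these examples.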
🟢 2.10 Writing Data to Excel Files
The writexl package is a simple way to export data frames to Excel files.
Install Package
install.packages("writexl")
Load Package
library(writexl)
Syntax
write_xlsx(data, path)
💻 Example 5: Export to Excel
library(writexl)
write_xlsx(employee,
"employee.xlsx")
Output
employee.xlsx created successfully.
💻 Example 6: Export Salary Data
salary_data <- employee[,c("Name","Salary")]
write_xlsx(salary_data,
"EmployeeSalary.xlsx")
Output
EmployeeSalary.xlsx created successfully.
💻 Example 7: Export HR Department
HR_emp <- subset(employee,
Department=="HR")
write_xlsx(HR_emp,
"HR_Department.xlsx")
Output
HR_Department.xlsx created successfully.
💻 Example 8: Export Finance Department
Finance_emp <- subset(employee,
Department=="Finance")
write_xlsx(Finance_emp,
"Finance.xlsx")
Output
Finance.xlsx created successfully.
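Instead of one file per department, write_xlsx() also accepts a named list of data frames and writes each element to its own sheet. A hedged sketch: it uses a four-row subset of the employee data, writes to a temporary file, and performs the export only if writexl is installed. The names by_dept and out_path are illustrative.

```r
employee <- data.frame(
  Name       = c("Amit", "Priya", "Rahul", "Sneha"),
  Department = c("HR", "Sales", "IT", "HR"),
  Salary     = c(30000, 35000, 50000, 32000)
)

# split() returns a named list with one data frame per department
by_dept <- split(employee, employee$Department)
print(names(by_dept))

if (requireNamespace("writexl", quietly = TRUE)) {
  out_path <- tempfile(fileext = ".xlsx")
  # Each list element becomes a sheet named after the department
  writexl::write_xlsx(by_dept, out_path)
  print(file.exists(out_path))
}
```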
💻 Example 9: Export Employees Older Than 30
older_emp <- subset(employee,
Age>30)
write.csv(older_emp,
"AgeAbove30.csv",
row.names=FALSE)
Output
AgeAbove30.csv created successfully.
💻 Example 10: Export Summary Statistics
summary_data <- summary(employee)
write.table(summary_data,
"Summary.txt")
Output
Summary.txt created successfully.
📊 Comparison: write.csv() vs write_xlsx()
| Feature | write.csv() | write_xlsx() |
|---|---|---|
| Output Format | CSV | Excel |
| Multiple Sheets | ❌ No | ✅ Yes (pass a named list of data frames) |
| File Size | Smaller | Larger |
| Readable in Excel | ✅ Yes | ✅ Yes |
| Supports Formatting | ❌ No | Limited |
🌍 Real-Life Applications
- Exporting employee payroll reports.
- Saving student examination results.
- Generating monthly sales reports.
- Creating financial statements.
- Exporting survey responses.
- Sharing machine learning results.
- Backing up processed datasets.
- Sending reports to management.
✔ Advantages
- Saves processed data permanently.
- Easy to share with others.
- Compatible with Excel and other software.
- Useful for report generation.
- Supports automation.
✖ Limitations
- CSV files cannot store formatting.
- Excel export requires an additional package.
- Charts and formulas are not exported automatically.
📝 Lab Exercises
- Create a data frame containing 10 student records.
- Export the data frame to a CSV file.
- Export only the Name and Marks columns.
- Export students scoring more than 80 marks.
- Export the dataset to an Excel file.
- Create separate Excel files for different departments.
- Generate a summary report and save it as a text file.
❓ Viva Questions
- What is data export?
- Which function exports data to CSV?
- Why is row.names = FALSE commonly used?
- Which package is used to export Excel files?
- What is the purpose of write_xlsx()?
- Can CSV files store formatting?
- Name two advantages of exporting data.
- What is the difference between CSV and Excel export?
- How can you export only selected columns?
- Give two real-life applications of data export.
📚 Class Summary
In this class, you learned:
- The concept of data export.
- Writing data frames to CSV files using write.csv().
- Writing Excel files using the writexl package.
- Exporting filtered and selected datasets.
- Comparison of CSV and Excel exports.
- Practical examples with outputs.
- Real-world applications, lab exercises, and viva questions.
Class 4: Data Cleaning and Preparation – Handling Missing Values (NA)
Duration: 1 Class
🎯 Learning Objectives
After completing this lesson, students will be able to:
- Understand missing values (NA) in R.
- Identify missing values in datasets.
- Count missing values.
- Remove missing values.
- Replace missing values.
- Perform statistical analysis after handling missing data.
📖 2.11 Introduction to Data Cleaning
Definition
Data Cleaning is the process of detecting, correcting, or removing inaccurate, incomplete, duplicate, or inconsistent data from a dataset.
Data cleaning is one of the most important steps in Data Science, Machine Learning, and Statistical Analysis because the quality of the analysis depends on the quality of the data.
🌟 Why Data Cleaning is Important?
Data cleaning helps to:
- Improve data quality.
- Increase the accuracy of analysis.
- Remove errors and inconsistencies.
- Handle missing values effectively.
- Improve machine learning model performance.
📖 2.12 What are Missing Values?
A missing value is a data value that is unavailable or unknown. In R, missing values are represented by NA (Not Available).
Common Causes of Missing Values
- Data entry errors
- Survey respondents skipping questions
- Equipment or sensor failures
- Data transmission errors
- Incomplete records
📊 Sample Dataset (10 Records)
student <- data.frame(
Roll_No=c(1,2,3,4,5,6,7,8,9,10),
Name=c("Amit","Priya","Rahul","Sneha","Karan",
"Neha","Arjun","Pooja","Rohan","Anjali"),
Marks=c(85,NA,78,92,NA,81,75,88,NA,95),
Age=c(20,21,20,22,21,NA,20,22,21,20)
)
student
Output
Roll_No Name Marks Age
1 1 Amit 85 20
2 2 Priya NA 21
3 3 Rahul 78 20
4 4 Sneha 92 22
5 5 Karan NA 21
6 6 Neha 81 NA
7 7 Arjun 75 20
8 8 Pooja 88 22
9 9 Rohan NA 21
10 10 Anjali 95 20
📖 2.13 Detecting Missing Values
Syntax
is.na(object)
is.na() checks each value and returns TRUE if it is missing, otherwise FALSE.
💻 Example 1: Detect Missing Values
is.na(student)
Output
Roll_No Name Marks Age
1 FALSE FALSE FALSE FALSE
2 FALSE FALSE TRUE FALSE
3 FALSE FALSE FALSE FALSE
4 FALSE FALSE FALSE FALSE
5 FALSE FALSE TRUE FALSE
6 FALSE FALSE FALSE TRUE
7 FALSE FALSE FALSE FALSE
8 FALSE FALSE FALSE FALSE
9 FALSE FALSE TRUE FALSE
10 FALSE FALSE FALSE FALSE
💻 Example 2: Count Missing Values
sum(is.na(student))
Output
[1] 4
Explanation: There are 4 missing values in the dataset.
💻 Example 3: Missing Values in Each Column
colSums(is.na(student))
Output
Roll_No 0
Name 0
Marks 3
Age 1
💻 Example 4: Missing Values in Each Row
rowSums(is.na(student))
Output
1 0
2 1
3 0
4 0
5 1
6 1
7 0
8 0
9 1
10 0
📖 2.14 Removing Missing Values
Syntax
na.omit(data)
💻 Example 5: Remove Missing Records
clean_student <- na.omit(student)
clean_student
Output
Roll_No Name Marks Age
1 1 Amit 85 20
3 3 Rahul 78 20
4 4 Sneha 92 22
7 7 Arjun 75 20
8 8 Pooja 88 22
10 10 Anjali 95 20
Explanation
Every row that contains at least one missing value is removed, so 6 of the 10 records remain.
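The base function complete.cases() gives finer control than na.omit(): it returns TRUE for each row with no missing values, and is.na() on a single column lets you drop rows based on that column only. A minimal sketch on a smaller, made-up version of the student data:

```r
student <- data.frame(
  Roll_No = 1:5,
  Marks   = c(85, NA, 78, 92, NA),
  Age     = c(20, 21, NA, 22, 21)
)

# Rows with no NA in any column (same rows na.omit() would keep)
complete_rows <- student[complete.cases(student), ]

# Rows where Marks is present, even if Age is missing
has_marks <- student[!is.na(student$Marks), ]

print(complete_rows)
print(has_marks)
```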
📖 2.15 Replacing Missing Values
Instead of deleting rows, missing values can be replaced.
💻 Example 6: Replace Missing Marks with Zero
student$Marks[is.na(student$Marks)] <- 0
student
Output
Marks
85
0
78
92
0
81
75
88
0
95
💻 Example 7: Replace Missing Age with Mean Age
student$Age[is.na(student$Age)] <-
mean(student$Age, na.rm=TRUE)
student
Output
Age
20
21
20
22
21
20.78
20
22
21
20
Explanation
na.rm=TRUE ignores missing values while calculating the mean.
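💻 Example 8: Calculate Mean Without Missing Values
Examples 8 to 10 use the original student data frame, before the missing marks were replaced in Example 6. Re-create the data frame first if you have already run Example 6.
mean(student$Marks, na.rm=TRUE)
Output
[1] 84.85714
💻 Example 9: Calculate Median
median(student$Marks, na.rm=TRUE)
Output
[1] 85
💻 Example 10: Standard Deviation
sd(student$Marks, na.rm=TRUE)
Output
[1] 7.34
(Approximate value.)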
📖 2.16 Methods for Handling Missing Values
| Method | Description |
|---|---|
| Delete rows | Remove incomplete records |
| Replace with Mean | Numerical data |
| Replace with Median | Skewed numerical data |
| Replace with Mode | Categorical data |
| Predict Missing Values | Machine learning techniques |
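Two of the methods in the table can be sketched in a few lines of base R: median replacement for a numeric column and mode replacement for a categorical one. The marks values match the sample dataset; the dept vector is made up for illustration, and base R has no built-in mode function for data, so table() is used to find the most frequent value.

```r
# Median replacement for numeric data
marks <- c(85, NA, 78, 92, NA, 81, 75, 88, NA, 95)
marks_filled <- marks
marks_filled[is.na(marks_filled)] <- median(marks, na.rm = TRUE)
print(marks_filled)

# Mode replacement for categorical data (hypothetical department values)
dept <- c("IT", "HR", NA, "IT", "Sales")
counts     <- table(dept)                      # table() skips NA by default
mode_value <- names(counts)[which.max(counts)] # most frequent category
dept_filled <- dept
dept_filled[is.na(dept_filled)] <- mode_value
print(dept_filled)
```

When two categories are tied, which.max() picks the first one, so ties may need a rule of your own.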
📊 Useful Functions
| Function | Purpose |
|---|---|
| is.na() | Detect missing values |
| sum(is.na()) | Count missing values |
| colSums(is.na()) | Missing values by column |
| rowSums(is.na()) | Missing values by row |
| na.omit() | Remove missing rows |
| mean(..., na.rm=TRUE) | Ignore missing values |
| median(..., na.rm=TRUE) | Ignore missing values |
| sd(..., na.rm=TRUE) | Standard deviation without missing values |
🌍 Real-Life Applications
- Student attendance records
- Hospital patient databases
- Banking transactions
- Insurance claims
- Sales and inventory management
- Customer feedback analysis
- Survey data cleaning
- Machine learning preprocessing
✔ Advantages
- Improves data quality.
- Increases analysis accuracy.
- Prevents errors in statistical calculations.
- Enhances model performance.
- Produces reliable reports.
✖ Disadvantages
- Removing records may reduce dataset size.
- Replacing values may introduce bias if done incorrectly.
- Requires careful selection of imputation methods.
📝 Lab Exercises
- Create a dataset with 10 student records containing missing values.
- Detect missing values using is.na().
- Count total missing values.
- Find missing values in each column.
- Find missing values in each row.
- Remove missing records using na.omit().
- Replace missing marks with 0.
- Replace missing ages with the mean age.
- Calculate the mean and median while ignoring missing values.
- Find the standard deviation of marks after handling missing values.
❓ Viva Questions
- What is a missing value in R?
- How are missing values represented in R?
- What is the purpose of is.na()?
- What does na.omit() do?
- Why is na.rm=TRUE used?
- How can missing values be counted?
- What are common causes of missing data?
- When should you replace missing values instead of deleting rows?
- What are the advantages of handling missing values?
- Give two real-life applications of data cleaning.
📚 Class Summary
In this class, you learned:
- The concept of data cleaning.
- Missing values (NA) and their causes.
- Detecting missing values using is.na().
- Counting missing values.
- Removing missing records with na.omit().
- Replacing missing values with constants and statistical measures.
- Practical examples with outputs.
- Real-world applications, exercises, and viva questions.
Class 5: Handling Duplicate Records and Data Type Conversion
Duration: 1 Class
🎯 Learning Objectives
After completing this lesson, students will be able to:
- Understand duplicate records in datasets.
- Detect duplicate rows and values.
- Remove duplicate records.
- Understand different data types in R.
- Convert data between numeric, character, factor, and logical types.
- Apply data type conversion in real-world datasets.
📖 2.17 Introduction to Duplicate Data
Definition
A duplicate record is a row or value that appears more than once in a dataset.
Duplicate data may occur because of:
- Repeated data entry
- System errors
- Database merging
- Data import from multiple sources
Duplicate records can lead to inaccurate statistical analysis and incorrect reports.
🌟 Why Remove Duplicate Records?
Removing duplicates helps to:
- Improve data quality.
- Reduce storage space.
- Increase analysis accuracy.
- Prevent incorrect statistical results.
- Improve machine learning performance.
📊 Sample Dataset (10 Records)
employee <- data.frame(
Emp_ID=c(101,102,103,104,105,103,107,108,109,110),
Name=c("Amit","Priya","Rahul","Sneha","Karan",
"Rahul","Arjun","Pooja","Rohan","Anjali"),
Department=c("HR","Sales","IT","HR","IT",
"IT","Sales","Finance","IT","HR"),
Salary=c(30000,35000,50000,32000,60000,
50000,40000,58000,45000,52000)
)
employee
Output
Emp_ID Name Department Salary
101 Amit HR 30000
102 Priya Sales 35000
103 Rahul IT 50000
104 Sneha HR 32000
105 Karan IT 60000
103 Rahul IT 50000
107 Arjun Sales 40000
108 Pooja Finance 58000
109 Rohan IT 45000
110 Anjali HR 52000
📖 2.18 Detecting Duplicate Records
Syntax
duplicated(data)
💻 Example 1: Detect Duplicate Rows
duplicated(employee)
Output
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
Explanation
The 6th row is a duplicate of the 3rd row.
💻 Example 2: Display Duplicate Records
employee[duplicated(employee),]
Output
Emp_ID Name Department Salary
103 Rahul IT 50000
💻 Example 3: Count Duplicate Records
sum(duplicated(employee))
Output
[1] 1
💻 Example 4: Remove Duplicate Records
employee_unique <- employee[!duplicated(employee),]
employee_unique
Output
Duplicate row removed successfully.
Total Records = 9
💻 Example 5: Detect Duplicate Employee IDs
duplicated(employee$Emp_ID)
Output
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
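duplicated() marks only the second and later copies of a value. A minimal sketch, assuming a cut-down version of the employee data, of two related base R tools: the fromLast argument, which flags the earlier copy instead, and unique(), which drops repeated rows in one step.

```r
# Cut-down employee data with one repeated row (illustration only)
emp <- data.frame(
  Emp_ID = c(101, 102, 103, 103),
  Name   = c("Amit", "Priya", "Rahul", "Rahul"),
  Salary = c(30000, 35000, 50000, 50000)
)

# Default: the later copy (row 4) is flagged
dup_later <- duplicated(emp)

# fromLast = TRUE: the earlier copy (row 3) is flagged instead
dup_earlier <- duplicated(emp, fromLast = TRUE)

# Flag every row that has a copy anywhere in the data
dup_any <- dup_later | dup_earlier

# unique() keeps the first copy of each row, like emp[!duplicated(emp), ]
emp_unique <- unique(emp)
nrow(emp_unique)
```

Combining both directions is useful when every copy needs to be reviewed before anything is deleted.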
📖 2.19 Data Types in R
R supports different types of data.
| Data Type | Description | Example |
|---|---|---|
| Numeric | Numbers | 100 |
| Character | Text | "Amit" |
| Logical | TRUE/FALSE | TRUE |
| Factor | Categories | HR, Sales |
📖 2.20 Data Type Conversion
Data type conversion changes one data type into another.
💻 Example 6: Numeric to Character
x <- 100
class(x)
x <- as.character(x)
class(x)
Output
[1] "numeric"
[1] "character"
💻 Example 7: Character to Numeric
x <- "250"
class(x)
x <- as.numeric(x)
class(x)
Output
[1] "character"
[1] "numeric"
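Conversion succeeds only when every string looks like a number. A short sketch with hypothetical salary strings showing what happens otherwise: as.numeric() returns NA for anything it cannot parse and issues a warning.

```r
# Hypothetical salary values read in as text (illustration only)
salary_text <- c("30000", "35,000", "fifty", "42000.5")

# suppressWarnings() hides the "NAs introduced by coercion" warning
salary_num <- suppressWarnings(as.numeric(salary_text))
salary_num                 # "35,000" and "fifty" become NA

# Locate the entries that failed to convert
salary_text[is.na(salary_num)]

# Remove thousands separators before converting
salary_clean <- as.numeric(gsub(",", "", "35,000"))
salary_clean
```

Checking for new NA values after a conversion is a simple way to catch this kind of data loss.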
💻 Example 8: Character to Factor
department <- c(
"HR",
"Sales",
"IT",
"HR",
"Finance"
)
factor_department <-
as.factor(department)
factor_department
Output
[1] HR      Sales   IT      HR      Finance
Levels: Finance HR IT Sales
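A factor stores integer codes behind its labels, which matters when the labels are themselves numbers. The sketch below, using a hypothetical ratings vector, shows why as.numeric() on such a factor returns the codes, and how converting through as.character() recovers the values.

```r
# Hypothetical numeric-looking categories stored as a factor
ratings <- factor(c("10", "30", "20", "30"))

levels(ratings)                                # "10" "20" "30"

# Pitfall: returns the internal level codes, not the values
codes <- as.numeric(ratings)                   # 1 3 2 3

# Convert to character first to get the actual numbers
values <- as.numeric(as.character(ratings))    # 10 30 20 30
```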
💻 Example 9: Numeric to Logical
x <- c(1,0,5)
as.logical(x)
Output
[1]  TRUE FALSE  TRUE
Explanation
- 0 becomes FALSE.
- Any non-zero value becomes TRUE.
💻 Example 10: Check Data Type
class(employee)
str(employee)
Output
[1] "data.frame"
'data.frame': 10 obs. of 4 variables
📊 Common Conversion Functions
| Function | Purpose |
|---|---|
| as.numeric() | Convert to numeric |
| as.character() | Convert to character |
| as.factor() | Convert to factor |
| as.logical() | Convert to logical |
| class() | Display data type |
| str() | Display structure |
📊 Comparison of Data Types
| Type | Stores | Example |
|---|---|---|
| Numeric | Numbers | 100 |
| Character | Text | "Amit" |
| Logical | TRUE/FALSE | TRUE |
| Factor | Categories | HR |
🌍 Real-Life Applications
Duplicate Handling
- Banking transactions
- Employee databases
- Hospital patient records
- Student admission systems
- Customer databases
Data Type Conversion
- Machine learning preprocessing
- Survey analysis
- Statistical modeling
- Financial analysis
- Database management
✔ Advantages
- Removes redundant information.
- Improves dataset quality.
- Ensures correct data types for analysis.
- Enhances model accuracy.
- Simplifies data manipulation.
✖ Disadvantages
- Removing duplicates without verification may delete valid records.
- Incorrect data type conversion may cause data loss.
- Requires careful validation before conversion.
📝 Lab Exercises
- Create a dataset containing duplicate employee records.
- Detect duplicate rows using duplicated().
- Count duplicate records.
- Remove duplicate records.
- Detect duplicate employee IDs.
- Convert numeric data to character.
- Convert character data to numeric.
- Convert department names to factors.
- Convert numeric values to logical.
- Display the structure of the dataset.
❓ Viva Questions
- What is a duplicate record?
- Which function detects duplicate rows?
- How can duplicate rows be removed?
- What is the purpose of duplicated()?
- What are the four basic data types in R?
- Which function converts data to numeric?
- Which function converts data to character?
- What is a factor in R?
- How does as.logical() work?
- Why is data type conversion important?
📚 Class Summary
In this class, you learned:
- Duplicate records and their effects.
- Detecting and removing duplicate data.
- Basic data types in R.
- Data type conversion using as.numeric(), as.character(), as.factor(), and as.logical().
- Practical examples with outputs.
- Real-world applications, exercises, and viva questions.
Class 6: Renaming Columns and Rows in R
Duration: 1 Class
🎯 Learning Objectives
After completing this lesson, students will be able to:
- Understand the importance of meaningful column and row names.
- Rename columns using colnames(), names(), and rename().
- Rename rows using rownames().
- Rename multiple columns simultaneously.
- Apply renaming techniques in real-world datasets.
📖 2.21 Introduction to Renaming
Definition
Renaming is the process of changing the names of columns or rows in a dataset to make them more meaningful, readable, and easier to understand.
For example:
| Old Name | New Name |
|---|---|
| M1 | Marks |
| Dept | Department |
| Sal | Salary |
| Age1 | Age |
Using meaningful names improves code readability and makes data analysis easier.
🌟 Why Rename Columns and Rows?
Renaming helps to:
- Improve readability.
- Use meaningful variable names.
- Avoid confusion during analysis.
- Make reports easier to understand.
- Prepare data for machine learning and visualization.
📊 Sample Dataset (10 Records)
employee <- data.frame(
ID=c(101,102,103,104,105,106,107,108,109,110),
EmpName=c("Amit","Priya","Rahul","Sneha","Karan",
"Neha","Arjun","Pooja","Rohan","Anjali"),
Dept=c("HR","Sales","IT","HR","IT",
"Finance","Sales","Finance","IT","HR"),
Age=c(25,28,30,27,35,31,29,33,26,32),
Sal=c(30000,35000,50000,32000,60000,
55000,40000,58000,45000,52000)
)
employee
Output
| ID | EmpName | Dept | Age | Sal |
|---|---|---|---|---|
| 101 | Amit | HR | 25 | 30000 |
| 102 | Priya | Sales | 28 | 35000 |
| 103 | Rahul | IT | 30 | 50000 |
| 104 | Sneha | HR | 27 | 32000 |
| 105 | Karan | IT | 35 | 60000 |
| 106 | Neha | Finance | 31 | 55000 |
| 107 | Arjun | Sales | 29 | 40000 |
| 108 | Pooja | Finance | 33 | 58000 |
| 109 | Rohan | IT | 26 | 45000 |
| 110 | Anjali | HR | 32 | 52000 |
📖 2.22 Renaming Columns Using colnames()
Syntax
colnames(dataframe) <- c("Column1","Column2",...)
💻 Example 1: Rename All Columns
colnames(employee) <- c("Emp_ID",
"Name",
"Department",
"Age",
"Salary")
employee
Output
| Emp_ID | Name | Department | Age | Salary |
|---|---|---|---|---|
| 101 | Amit | HR | 25 | 30000 |
| 102 | Priya | Sales | 28 | 35000 |
| 103 | Rahul | IT | 30 | 50000 |
| ... | ... | ... | ... | ... |
💻 Example 2: Display Column Names
colnames(employee)
Output
[1] "Emp_ID"     "Name"       "Department" "Age"        "Salary"
📖 2.23 Renaming Columns Using names()
For a data frame, names() returns and sets the same column names as colnames().
Syntax
names(dataframe)
💻 Example 3
names(employee)
Output
[1] "Emp_ID"     "Name"       "Department" "Age"        "Salary"
💻 Example 4: Rename One Column
names(employee)[5] <- "Monthly_Salary"
employee
Output
| Emp_ID | Name | Department | Age | Monthly_Salary |
|---|---|---|---|---|
| 101 | Amit | HR | 25 | 30000 |
| 102 | Priya | Sales | 28 | 35000 |
| ... | ... | ... | ... | ... |
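Renaming by position, as in Example 4, can rename the wrong column if the column order changes. A minimal sketch of the same rename done by matching the old name instead, on a small stand-in data frame:

```r
# Small stand-in for the employee data (illustration only)
emp <- data.frame(Emp_ID = c(101, 102), Salary = c(30000, 35000))

# Rename the column called "Salary", wherever it sits
names(emp)[names(emp) == "Salary"] <- "Monthly_Salary"

names(emp)   # "Emp_ID" "Monthly_Salary"
```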
📖 2.24 Renaming Rows
Rows can also have names.
Syntax
rownames(dataframe)
💻 Example 5: Display Row Names
rownames(employee)
Output
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
💻 Example 6: Rename Rows
rownames(employee) <-
paste("Employee",
1:10,
sep="_")
employee
Output
Employee_1
Employee_2
Employee_3
...
Employee_10
📖 2.25 Renaming Using rename() from dplyr
The dplyr package provides the rename() function.
Install Package
install.packages("dplyr")
Load Package
library(dplyr)
Syntax
rename(data,
NewName = OldName)
💻 Example 7
library(dplyr)
employee <-
rename(employee,
Salary=Monthly_Salary)
employee
Output
The column Monthly_Salary is renamed to Salary.
💻 Example 8: Rename Multiple Columns
library(dplyr)
employee <-
rename(
employee,
Employee_ID=Emp_ID,
Employee_Name=Name
)
Output
| Employee_ID | Employee_Name | Department | Age | Salary |
|---|---|---|---|---|
| 101 | Amit | HR | 25 | 30000 |
| 102 | Priya | Sales | 28 | 35000 |
| ... | ... | ... | ... | ... |
💻 Example 9: Verify Structure
str(employee)
Output
'data.frame': 10 obs. of 5 variables
💻 Example 10: Display Dataset
head(employee)
Output
Employee_ID Employee_Name Department Age Salary
101 Amit HR 25 30000
102 Priya Sales 28 35000
103 Rahul IT 30 50000
104 Sneha HR 27 32000
105 Karan IT 35 60000
106 Neha Finance 31 55000
📊 Comparison of Renaming Functions
| Function | Purpose |
|---|---|
| colnames() | Rename all columns |
| names() | Rename one or more columns |
| rownames() | Rename rows |
| rename() | Rename selected columns using dplyr |
📊 Advantages of Meaningful Column Names
| Poor Name | Better Name |
|---|---|
| M1 | Marks |
| Dept | Department |
| Sal | Salary |
| Emp | Employee_Name |
| ID | Employee_ID |
🌍 Real-Life Applications
- Employee management systems
- Student databases
- Banking records
- Hospital patient databases
- Inventory management
- Sales reporting
- Data visualization
- Machine learning preprocessing
✔ Advantages
- Improves readability.
- Makes code easier to understand.
- Helps create professional reports.
- Simplifies data manipulation.
- Enhances collaboration among team members.
✖ Limitations
- Renaming columns incorrectly may break existing code.
- Duplicate column names should be avoided.
- Frequent renaming may reduce code consistency.
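Duplicate column names can be checked before they are assigned. A short base R sketch, using hypothetical column names, with anyDuplicated() to detect a clash and make.unique() to repair one:

```r
# Hypothetical set of new names with an accidental repeat
new_names <- c("Emp_ID", "Name", "Salary", "Salary")

# anyDuplicated() returns 0 when all names are distinct,
# otherwise the position of the first repeated name
anyDuplicated(new_names)          # 4

# make.unique() appends a suffix to repeated names
fixed_names <- make.unique(new_names)
fixed_names                       # "Emp_ID" "Name" "Salary" "Salary.1"
```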
📝 Lab Exercises
- Create a dataset containing 10 employee records.
- Rename all column names using colnames().
- Display column names.
- Rename one column using names().
- Display row names.
- Rename all row names.
- Install and load the dplyr package.
- Rename one column using rename().
- Rename two columns simultaneously.
- Display the structure of the renamed dataset.
❓ Viva Questions
- What is the purpose of renaming columns?
- Which function changes column names?
- Which function changes row names?
- What is the difference between colnames() and names()?
- Which package contains rename()?
- How do you rename multiple columns?
- Why are meaningful column names important?
- Can row names be customized?
- What is the syntax of rename()?
- Give two real-life applications of renaming data.
📚 Class Summary
In this class, you learned:
- The importance of meaningful column and row names.
- Renaming columns using colnames() and names().
- Renaming rows using rownames().
- Using rename() from the dplyr package.
- Practical examples with outputs.
- Comparison tables, real-world applications, lab exercises, and viva questions.
Class 7: Data Transformation – select(), filter(), and arrange()
Duration: 1 Class
🎯 Learning Objectives
After completing this lesson, students will be able to:
- Select specific columns from a dataset.
- Filter rows based on conditions.
- Sort data in ascending and descending order.
- Combine multiple transformation operations.
- Use dplyr functions for efficient data analysis.
📖 2.26 Introduction to Data Transformation
Data Transformation means modifying, selecting, filtering, or arranging data into a form suitable for analysis.
R provides powerful transformation functions through the dplyr package.
Install and Load dplyr
install.packages("dplyr")
library(dplyr)
📊 Sample Dataset (10 Records)
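A sketch of the dataset, assuming the same ten employee records used in Classes 8 and 9 (the outputs shown in this class are consistent with them):

```r
# Assumed dataset: the ten employee records from Classes 8 and 9
employee <- data.frame(
  Emp_ID = c(101, 102, 103, 104, 105, 106, 107, 108, 109, 110),
  Name = c("Amit", "Priya", "Rahul", "Sneha", "Karan",
           "Neha", "Arjun", "Pooja", "Rohan", "Anjali"),
  Department = c("HR", "Sales", "IT", "HR", "IT",
                 "Finance", "Sales", "Finance", "IT", "HR"),
  Age = c(25, 28, 30, 27, 35, 31, 29, 33, 26, 32),
  Salary = c(30000, 35000, 50000, 32000, 60000,
             55000, 40000, 58000, 45000, 52000)
)
employee
```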
🔵 2.27 Selecting Columns with select()
Definition
The select() function chooses specific columns from a dataset.
Syntax
select(dataframe, Column1, Column2, ...)
💻 Example 1: Select Name and Salary
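A minimal sketch of this example, assuming the employee records from Classes 8 and 9 and that dplyr is installed:

```r
library(dplyr)

# Assumed dataset: the ten employee records from Classes 8 and 9
employee <- data.frame(
  Emp_ID = 101:110,
  Name = c("Amit", "Priya", "Rahul", "Sneha", "Karan",
           "Neha", "Arjun", "Pooja", "Rohan", "Anjali"),
  Department = c("HR", "Sales", "IT", "HR", "IT",
                 "Finance", "Sales", "Finance", "IT", "HR"),
  Age = c(25, 28, 30, 27, 35, 31, 29, 33, 26, 32),
  Salary = c(30000, 35000, 50000, 32000, 60000,
             55000, 40000, 58000, 45000, 52000)
)

# Keep only the Name and Salary columns
name_salary <- select(employee, Name, Salary)
name_salary
```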
Output
| Name | Salary |
|---|---|
| Amit | 30000 |
| Priya | 35000 |
| Rahul | 50000 |
| ... | ... |
💻 Example 2: Select Multiple Columns
💻 Example 3: Exclude a Column
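Examples 2 and 3 can be sketched as follows. The particular columns chosen are assumptions for illustration; the dataset is the one assumed from Classes 8 and 9.

```r
library(dplyr)

# Assumed dataset: the ten employee records from Classes 8 and 9
employee <- data.frame(
  Emp_ID = 101:110,
  Name = c("Amit", "Priya", "Rahul", "Sneha", "Karan",
           "Neha", "Arjun", "Pooja", "Rohan", "Anjali"),
  Department = c("HR", "Sales", "IT", "HR", "IT",
                 "Finance", "Sales", "Finance", "IT", "HR"),
  Age = c(25, 28, 30, 27, 35, 31, 29, 33, 26, 32),
  Salary = c(30000, 35000, 50000, 32000, 60000,
             55000, 40000, 58000, 45000, 52000)
)

# Example 2: select several columns at once
three_cols <- select(employee, Emp_ID, Name, Department)

# Example 3: a minus sign drops a column and keeps the rest
without_age <- select(employee, -Age)
names(without_age)
```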
🟢 2.28 Filtering Rows with filter()
Definition
The filter() function selects rows that satisfy specified conditions.
Syntax
filter(dataframe, Condition)
💻 Example 4: Employees from IT Department
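A minimal sketch of this example, assuming the employee records from Classes 8 and 9. The final select() is added only so that the result matches the two columns shown in the output table.

```r
library(dplyr)

# Assumed dataset: the ten employee records from Classes 8 and 9
employee <- data.frame(
  Emp_ID = 101:110,
  Name = c("Amit", "Priya", "Rahul", "Sneha", "Karan",
           "Neha", "Arjun", "Pooja", "Rohan", "Anjali"),
  Department = c("HR", "Sales", "IT", "HR", "IT",
                 "Finance", "Sales", "Finance", "IT", "HR"),
  Age = c(25, 28, 30, 27, 35, 31, 29, 33, 26, 32),
  Salary = c(30000, 35000, 50000, 32000, 60000,
             55000, 40000, 58000, 45000, 52000)
)

# Keep only the rows where Department equals "IT"
it_employees <- filter(employee, Department == "IT")

# Show just the two columns of interest
select(it_employees, Name, Department)
```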
Output
| Name | Department |
|---|---|
| Rahul | IT |
| Karan | IT |
| Rohan | IT |
💻 Example 5: Salary Greater Than 50,000
💻 Example 6: Multiple Conditions (AND)
💻 Example 7: Multiple Conditions (OR)
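Examples 5 to 7 can be sketched as follows. The AND and OR conditions are assumptions chosen for illustration; the dataset is the one assumed from Classes 8 and 9.

```r
library(dplyr)

# Assumed dataset: the ten employee records from Classes 8 and 9
employee <- data.frame(
  Emp_ID = 101:110,
  Name = c("Amit", "Priya", "Rahul", "Sneha", "Karan",
           "Neha", "Arjun", "Pooja", "Rohan", "Anjali"),
  Department = c("HR", "Sales", "IT", "HR", "IT",
                 "Finance", "Sales", "Finance", "IT", "HR"),
  Age = c(25, 28, 30, 27, 35, 31, 29, 33, 26, 32),
  Salary = c(30000, 35000, 50000, 32000, 60000,
             55000, 40000, 58000, 45000, 52000)
)

# Example 5: salary greater than 50,000
high_paid <- filter(employee, Salary > 50000)

# Example 6 (AND): conditions separated by commas must all hold
it_high <- filter(employee, Department == "IT", Salary > 45000)

# Example 7 (OR): the | operator keeps rows matching either condition
hr_or_finance <- filter(employee,
                        Department == "HR" | Department == "Finance")
```

Separating conditions with a comma is equivalent to joining them with &.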
🟣 2.29 Arranging Data with arrange()
Definition
The arrange() function sorts rows based on one or more columns.
Syntax
arrange(dataframe, Column1, Column2, ...)
💻 Example 8: Sort by Salary (Ascending)
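A minimal sketch of this example, assuming the employee records from Classes 8 and 9:

```r
library(dplyr)

# Assumed dataset: the ten employee records from Classes 8 and 9
employee <- data.frame(
  Emp_ID = 101:110,
  Name = c("Amit", "Priya", "Rahul", "Sneha", "Karan",
           "Neha", "Arjun", "Pooja", "Rohan", "Anjali"),
  Department = c("HR", "Sales", "IT", "HR", "IT",
                 "Finance", "Sales", "Finance", "IT", "HR"),
  Age = c(25, 28, 30, 27, 35, 31, 29, 33, 26, 32),
  Salary = c(30000, 35000, 50000, 32000, 60000,
             55000, 40000, 58000, 45000, 52000)
)

# arrange() sorts in ascending order by default
salary_asc <- arrange(employee, Salary)
select(salary_asc, Name, Salary)
```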
Output
| Name | Salary |
|---|---|
| Amit | 30000 |
| Sneha | 32000 |
| Priya | 35000 |
| ... | ... |
💻 Example 9: Sort by Salary (Descending)
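A minimal sketch of this example, assuming the employee records from Classes 8 and 9. Wrapping a column in desc() reverses the sort order.

```r
library(dplyr)

# Assumed dataset: the ten employee records from Classes 8 and 9
employee <- data.frame(
  Emp_ID = 101:110,
  Name = c("Amit", "Priya", "Rahul", "Sneha", "Karan",
           "Neha", "Arjun", "Pooja", "Rohan", "Anjali"),
  Department = c("HR", "Sales", "IT", "HR", "IT",
                 "Finance", "Sales", "Finance", "IT", "HR"),
  Age = c(25, 28, 30, 27, 35, 31, 29, 33, 26, 32),
  Salary = c(30000, 35000, 50000, 32000, 60000,
             55000, 40000, 58000, 45000, 52000)
)

# desc() sorts from the highest salary to the lowest
salary_desc <- arrange(employee, desc(Salary))
select(salary_desc, Name, Salary)
```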
Output
| Name | Salary |
|---|---|
| Karan | 60000 |
| Pooja | 58000 |
| Neha | 55000 |
| ... | ... |
💻 Example 10: Sort by Department and Salary
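A minimal sketch of this example, assuming the employee records from Classes 8 and 9 and ascending order for both columns. Rows are sorted by Department first; Salary breaks ties within each department.

```r
library(dplyr)

# Assumed dataset: the ten employee records from Classes 8 and 9
employee <- data.frame(
  Emp_ID = 101:110,
  Name = c("Amit", "Priya", "Rahul", "Sneha", "Karan",
           "Neha", "Arjun", "Pooja", "Rohan", "Anjali"),
  Department = c("HR", "Sales", "IT", "HR", "IT",
                 "Finance", "Sales", "Finance", "IT", "HR"),
  Age = c(25, 28, 30, 27, 35, 31, 29, 33, 26, 32),
  Salary = c(30000, 35000, 50000, 32000, 60000,
             55000, 40000, 58000, 45000, 52000)
)

# Sort by Department, then by Salary within each department
by_dept <- arrange(employee, Department, Salary)
select(by_dept, Name, Department, Salary)
```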
📊 Combining Functions
Example: IT Employees Sorted by Salary
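A minimal sketch of the combined operation, assuming the employee records from Classes 8 and 9. The pipe operator %>% passes the result of each step to the next one.

```r
library(dplyr)

# Assumed dataset: the ten employee records from Classes 8 and 9
employee <- data.frame(
  Emp_ID = 101:110,
  Name = c("Amit", "Priya", "Rahul", "Sneha", "Karan",
           "Neha", "Arjun", "Pooja", "Rohan", "Anjali"),
  Department = c("HR", "Sales", "IT", "HR", "IT",
                 "Finance", "Sales", "Finance", "IT", "HR"),
  Age = c(25, 28, 30, 27, 35, 31, 29, 33, 26, 32),
  Salary = c(30000, 35000, 50000, 32000, 60000,
             55000, 40000, 58000, 45000, 52000)
)

it_by_salary <- employee %>%
  filter(Department == "IT") %>%   # keep IT rows
  select(Name, Salary) %>%         # keep two columns
  arrange(desc(Salary))            # highest salary first
it_by_salary
```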
Output
| Name | Salary |
|---|---|
| Karan | 60000 |
| Rahul | 50000 |
| Rohan | 45000 |
📊 Comparison of Functions
| Function | Purpose |
|---|---|
| select() | Choose columns |
| filter() | Choose rows |
| arrange() | Sort rows |
🌍 Real-Life Applications
- Selecting important columns from large databases.
- Filtering customers with high purchases.
- Sorting employees by salary.
- Analyzing sales by region.
- Preparing data for machine learning.
- Generating management reports.
✔ Advantages
- Simple and readable syntax.
- Fast processing.
- Works well with large datasets.
- Easy to combine multiple operations.
- Widely used in data science projects.
✖ Limitations
- Requires the dplyr package.
- Very large datasets may require additional optimization.
- Incorrect conditions may produce unexpected results.
📝 Lab Exercises
- Select only Name and Salary columns.
- Exclude the Age column.
- Filter employees from the Sales department.
- Filter employees with salary greater than 40,000.
- Filter employees from IT with salary greater than 45,000.
- Sort employees by Age ascending.
- Sort employees by Salary descending.
- Sort employees by Department and Salary.
- Display only IT employees sorted by salary.
- Combine select(), filter(), and arrange() in one program.
❓ Viva Questions
- What is data transformation?
- What is the purpose of select()?
- What is the purpose of filter()?
- What is the purpose of arrange()?
- How do you sort data in descending order?
- How do you apply multiple conditions in filter()?
- What does desc() do?
- Can select() exclude columns?
- What is the pipe operator %>%?
- Give two real-life applications of data transformation.
📚 Class Summary
In this class, you learned:
- select() for choosing columns.
- filter() for selecting rows.
- arrange() for sorting data.
- Using multiple conditions.
- Combining transformation functions with the pipe operator.
- Practical examples with outputs.
- Real-world applications, exercises, and viva questions.
Class 8: Data Transformation using mutate() and transmute()
Duration: 1 Class
🎯 Learning Objectives
After completing this lesson, students will be able to:
- Understand the purpose of mutate() and transmute().
- Create new variables in a dataset.
- Modify existing variables.
- Perform arithmetic operations on columns.
- Calculate bonus, tax, gross salary, and net salary.
- Understand the difference between mutate() and transmute().
📖 2.30 Introduction to mutate()
Definition
The mutate() function from the dplyr package is used to create new columns or modify existing columns in a data frame.
It is one of the most frequently used functions in data analysis and machine learning.
Install and Load Package
install.packages("dplyr")
library(dplyr)
📊 Sample Dataset (10 Records)
employee <- data.frame(
Emp_ID=c(101,102,103,104,105,106,107,108,109,110),
Name=c("Amit","Priya","Rahul","Sneha","Karan",
"Neha","Arjun","Pooja","Rohan","Anjali"),
Department=c("HR","Sales","IT","HR","IT",
"Finance","Sales","Finance","IT","HR"),
Age=c(25,28,30,27,35,31,29,33,26,32),
Salary=c(30000,35000,50000,32000,60000,
55000,40000,58000,45000,52000)
)
employee
Output
| Emp_ID | Name | Department | Age | Salary |
|---|---|---|---|---|
| 101 | Amit | HR | 25 | 30000 |
| 102 | Priya | Sales | 28 | 35000 |
| 103 | Rahul | IT | 30 | 50000 |
| 104 | Sneha | HR | 27 | 32000 |
| 105 | Karan | IT | 35 | 60000 |
| 106 | Neha | Finance | 31 | 55000 |
| 107 | Arjun | Sales | 29 | 40000 |
| 108 | Pooja | Finance | 33 | 58000 |
| 109 | Rohan | IT | 26 | 45000 |
| 110 | Anjali | HR | 32 | 52000 |
📖 2.31 Creating New Columns with mutate()
Syntax
mutate(dataframe,
NewColumn = Expression)
💻 Example 1: Calculate 10% Bonus
library(dplyr)
employee_bonus <- employee %>%
mutate(Bonus = Salary * 0.10)
employee_bonus
Output
| Name | Salary | Bonus |
|---|---|---|
| Amit | 30000 | 3000 |
| Priya | 35000 | 3500 |
| Rahul | 50000 | 5000 |
| Sneha | 32000 | 3200 |
| Karan | 60000 | 6000 |
| Neha | 55000 | 5500 |
| Arjun | 40000 | 4000 |
| Pooja | 58000 | 5800 |
| Rohan | 45000 | 4500 |
| Anjali | 52000 | 5200 |
💻 Example 2: Calculate Gross Salary
employee_gross <- employee %>%
mutate(Gross_Salary = Salary + (Salary * 0.10))
employee_gross
Output
| Name | Salary | Gross_Salary |
|---|---|---|
| Amit | 30000 | 33000 |
| Priya | 35000 | 38500 |
| Rahul | 50000 | 55000 |
| Sneha | 32000 | 35200 |
| Karan | 60000 | 66000 |
| Neha | 55000 | 60500 |
| Arjun | 40000 | 44000 |
| Pooja | 58000 | 63800 |
| Rohan | 45000 | 49500 |
| Anjali | 52000 | 57200 |
💻 Example 3: Calculate 5% Income Tax
employee_tax <- employee %>%
mutate(Tax = Salary * 0.05)
employee_tax
Output
| Name | Salary | Tax |
|---|---|---|
| Amit | 30000 | 1500 |
| Priya | 35000 | 1750 |
| Rahul | 50000 | 2500 |
| Sneha | 32000 | 1600 |
| Karan | 60000 | 3000 |
| Neha | 55000 | 2750 |
| Arjun | 40000 | 2000 |
| Pooja | 58000 | 2900 |
| Rohan | 45000 | 2250 |
| Anjali | 52000 | 2600 |
💻 Example 4: Calculate Net Salary
employee_net <- employee %>%
mutate(
Bonus = Salary*0.10,
Tax = Salary*0.05,
Net_Salary = Salary + Bonus - Tax
)
employee_net
Output
| Name | Salary | Bonus | Tax | Net_Salary |
|---|---|---|---|---|
| Amit | 30000 | 3000 | 1500 | 31500 |
| Priya | 35000 | 3500 | 1750 | 36750 |
| Rahul | 50000 | 5000 | 2500 | 52500 |
| Sneha | 32000 | 3200 | 1600 | 33600 |
| Karan | 60000 | 6000 | 3000 | 63000 |
| Neha | 55000 | 5500 | 2750 | 57750 |
| Arjun | 40000 | 4000 | 2000 | 42000 |
| Pooja | 58000 | 5800 | 2900 | 60900 |
| Rohan | 45000 | 4500 | 2250 | 47250 |
| Anjali | 52000 | 5200 | 2600 | 54600 |
💻 Example 5: Increase Salary by ₹5,000
employee %>%
mutate(Salary = Salary + 5000)
Output
Each employee's salary increases by ₹5,000.
📖 2.32 The transmute() Function
Definition
The transmute() function computes columns in the same way as mutate() but returns only the columns named in the call.
Unlike mutate(), the remaining original columns are dropped; to keep one, such as Name, list it explicitly.
Syntax
transmute(dataframe,
NewColumn = Expression)
💻 Example 6: Display Bonus Only
employee %>%
transmute(Name,
Bonus = Salary*0.10)
Output
| Name | Bonus |
|---|---|
| Amit | 3000 |
| Priya | 3500 |
| Rahul | 5000 |
| Sneha | 3200 |
| Karan | 6000 |
| Neha | 5500 |
| Arjun | 4000 |
| Pooja | 5800 |
| Rohan | 4500 |
| Anjali | 5200 |
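The difference between the two functions shows up in the columns of the result. A minimal sketch on a three-row stand-in for the employee data, assuming dplyr is installed:

```r
library(dplyr)

# Three-row stand-in for the employee data (illustration only)
emp <- data.frame(
  Name   = c("Amit", "Priya", "Rahul"),
  Age    = c(25, 28, 30),
  Salary = c(30000, 35000, 50000)
)

# mutate() keeps every original column and adds Bonus
with_mutate <- mutate(emp, Bonus = Salary * 0.10)
names(with_mutate)      # "Name" "Age" "Salary" "Bonus"

# transmute() returns only the columns listed in the call
with_transmute <- transmute(emp, Name, Bonus = Salary * 0.10)
names(with_transmute)   # "Name" "Bonus"
```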
💻 Example 7: Gross Salary Only
employee %>%
transmute(Name,
Gross = Salary*1.10)
Output
Displays only Name and Gross Salary.
💻 Example 8: Age After Five Years
employee %>%
mutate(Age_After_5_Years = Age + 5)
Output
| Name | Age | Age_After_5_Years |
|---|---|---|
| Amit | 25 | 30 |
| Priya | 28 | 33 |
| Rahul | 30 | 35 |
| Sneha | 27 | 32 |
| Karan | 35 | 40 |
| Neha | 31 | 36 |
| Arjun | 29 | 34 |
| Pooja | 33 | 38 |
| Rohan | 26 | 31 |
| Anjali | 32 | 37 |
💻 Example 9: Annual Salary
employee %>%
mutate(Annual_Salary = Salary * 12)
Output
| Name | Salary | Annual_Salary |
|---|---|---|
| Amit | 30000 | 360000 |
| Priya | 35000 | 420000 |
| Rahul | 50000 | 600000 |
| Sneha | 32000 | 384000 |
| Karan | 60000 | 720000 |
| Neha | 55000 | 660000 |
| Arjun | 40000 | 480000 |
| Pooja | 58000 | 696000 |
| Rohan | 45000 | 540000 |
| Anjali | 52000 | 624000 |
💻 Example 10: Employee Category
employee %>%
mutate(Category = ifelse(Salary >= 50000,
"High Salary",
"Normal Salary"))
Output
| Name | Salary | Category |
|---|---|---|
| Amit | 30000 | Normal Salary |
| Priya | 35000 | Normal Salary |
| Rahul | 50000 | High Salary |
| Sneha | 32000 | Normal Salary |
| Karan | 60000 | High Salary |
| Neha | 55000 | High Salary |
| Arjun | 40000 | Normal Salary |
| Pooja | 58000 | High Salary |
| Rohan | 45000 | Normal Salary |
| Anjali | 52000 | High Salary |
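ifelse() handles two categories. For three or more, dplyr provides case_when(), which checks its conditions in order and uses the first one that is true. The sketch below is one possible approach; the 50,000 and 35,000 cut-offs for High, Medium, and Low are assumptions chosen for illustration.

```r
library(dplyr)

# Name and Salary columns of the employee data
emp <- data.frame(
  Name = c("Amit", "Priya", "Rahul", "Sneha", "Karan",
           "Neha", "Arjun", "Pooja", "Rohan", "Anjali"),
  Salary = c(30000, 35000, 50000, 32000, 60000,
             55000, 40000, 58000, 45000, 52000)
)

emp_category <- emp %>%
  mutate(Category = case_when(
    Salary >= 50000 ~ "High Salary",     # checked first
    Salary >= 35000 ~ "Medium Salary",   # reached only if below 50000
    TRUE            ~ "Low Salary"       # everything else
  ))
emp_category
```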
📊 Comparison of mutate() and transmute()
| Feature | mutate() | transmute() |
|---|---|---|
| Keeps Original Columns | ✅ Yes | ❌ No |
| Creates New Columns | ✅ Yes | ✅ Yes |
| Modifies Existing Columns | ✅ Yes | ✅ Yes |
| Returns Only New Columns | ❌ No | ✅ Yes |
🌍 Real-Life Applications
- Employee payroll systems
- Student result processing
- Banking interest calculation
- GST and tax calculation
- Insurance premium calculation
- Sales commission reports
- Financial reporting
- Business analytics
📝 Lab Exercises
- Calculate a 15% bonus for each employee.
- Create a Gross Salary column.
- Create a Net Salary column after deducting 8% tax.
- Calculate annual salary.
- Increase every salary by ₹2,000.
- Create a category column (High Salary, Medium Salary, Low Salary).
- Display only Name and Bonus using transmute().
- Calculate age after 10 years.
- Create a PF deduction column (12% of salary).
- Calculate Take Home Salary = Salary + Bonus − Tax − PF.
❓ Viva Questions
- What is the purpose of mutate()?
- What is the difference between mutate() and transmute()?
- Can mutate() modify existing columns?
- Which package contains mutate()?
- Which function returns only new columns?
- How do you create a new column in R?
- What is the use of ifelse() inside mutate()?
- How do you calculate annual salary?
- What are the advantages of mutate()?
- Give two real-life applications of transmute().
📚 Class Summary
In this class, you learned:
- Creating new variables with mutate().
- Modifying existing variables.
- Using transmute() to return only selected transformed columns.
- Calculating bonus, tax, gross salary, net salary, annual salary, and employee categories.
- Practical R programs with outputs.
- Real-world applications, lab exercises, and viva questions.
Class 9: Data Transformation using summarise() and group_by()
Duration: 1 Class
🎯 Learning Objectives
After completing this lesson, students will be able to:
- Understand the purpose of summarise() and group_by().
- Calculate statistical summaries of datasets.
- Group data based on one or more columns.
- Generate department-wise reports.
- Perform grouped statistical analysis.
- Apply summary functions in real-world business scenarios.
📖 2.33 Introduction to summarise()
Definition
The summarise() (or summarize()) function from the dplyr package calculates summary statistics for a dataset. It collapses many rows into a single summary row, or into one row per group when combined with group_by().
Common statistics include:
- Mean
- Sum
- Minimum
- Maximum
- Count
- Standard Deviation
- Variance
Install and Load Package
install.packages("dplyr")
library(dplyr)
📊 Sample Dataset (10 Records)
employee <- data.frame(
Emp_ID=c(101,102,103,104,105,106,107,108,109,110),
Name=c("Amit","Priya","Rahul","Sneha","Karan",
"Neha","Arjun","Pooja","Rohan","Anjali"),
Department=c("HR","Sales","IT","HR","IT",
"Finance","Sales","Finance","IT","HR"),
Age=c(25,28,30,27,35,31,29,33,26,32),
Salary=c(30000,35000,50000,32000,60000,
55000,40000,58000,45000,52000)
)
employee
Output
| Emp_ID | Name | Department | Age | Salary |
|---|---|---|---|---|
| 101 | Amit | HR | 25 | 30000 |
| 102 | Priya | Sales | 28 | 35000 |
| 103 | Rahul | IT | 30 | 50000 |
| 104 | Sneha | HR | 27 | 32000 |
| 105 | Karan | IT | 35 | 60000 |
| 106 | Neha | Finance | 31 | 55000 |
| 107 | Arjun | Sales | 29 | 40000 |
| 108 | Pooja | Finance | 33 | 58000 |
| 109 | Rohan | IT | 26 | 45000 |
| 110 | Anjali | HR | 32 | 52000 |
📖 2.34 Using summarise()
Syntax
summarise(dataframe,
NewColumn = function(column))
💻 Example 1: Calculate Average Salary
library(dplyr)
employee %>%
summarise(
Average_Salary = mean(Salary)
)
Output
| Average_Salary |
|---|
| 45700 |
💻 Example 2: Total Salary
employee %>%
summarise(
Total_Salary = sum(Salary)
)
Output
| Total_Salary |
|---|
| 457000 |
💻 Example 3: Minimum and Maximum Salary
employee %>%
summarise(
Minimum = min(Salary),
Maximum = max(Salary)
)
Output
| Minimum | Maximum |
|---|---|
| 30000 | 60000 |
💻 Example 4: Count Employees
employee %>%
summarise(
Total_Employees = n()
)
Output
| Total_Employees |
|---|
| 10 |
💻 Example 5: Standard Deviation
employee %>%
summarise(
Standard_Deviation = sd(Salary)
)
Output
| Standard_Deviation |
|---|
| 10965.1 (approx.) |
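sd() computes the sample standard deviation, which divides by n - 1 rather than n. A minimal base R sketch that reproduces it step by step from the Salary column above:

```r
# Salary column of the employee dataset
salary <- c(30000, 35000, 50000, 32000, 60000,
            55000, 40000, 58000, 45000, 52000)

n <- length(salary)

# Distance of each salary from the mean
deviations <- salary - mean(salary)

# Sample variance: sum of squared deviations divided by n - 1
variance_manual <- sum(deviations^2) / (n - 1)

# Standard deviation is the square root of the variance
sd_manual <- sqrt(variance_manual)

# Compare with the built-in function
sd_builtin <- sd(salary)
all.equal(sd_manual, sd_builtin)
```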
📖 2.35 Using group_by()
Definition
The group_by() function divides a dataset into groups. When used with summarise(), it calculates statistics for each group separately.
Syntax
group_by(dataframe, Column_Name)
💻 Example 6: Average Salary by Department
employee %>%
group_by(Department) %>%
summarise(
Average_Salary = mean(Salary)
)
Output
| Department | Average_Salary |
|---|---|
| Finance | 56500 |
| HR | 38000 |
| IT | 51667 |
| Sales | 37500 |
💻 Example 7: Total Salary by Department
employee %>%
group_by(Department) %>%
summarise(
Total_Salary = sum(Salary)
)
Output
| Department | Total_Salary |
|---|---|
| Finance | 113000 |
| HR | 114000 |
| IT | 155000 |
| Sales | 75000 |
💻 Example 8: Employee Count by Department
employee %>%
group_by(Department) %>%
summarise(
Employees = n()
)
Output
| Department | Employees |
|---|---|
| Finance | 2 |
| HR | 3 |
| IT | 3 |
| Sales | 2 |
💻 Example 9: Department-wise Minimum and Maximum Salary
employee %>%
group_by(Department) %>%
summarise(
Minimum = min(Salary),
Maximum = max(Salary)
)
Output
| Department | Minimum | Maximum |
|---|---|---|
| Finance | 55000 | 58000 |
| HR | 30000 | 52000 |
| IT | 45000 | 60000 |
| Sales | 35000 | 40000 |
💻 Example 10: Multiple Summary Statistics
employee %>%
group_by(Department) %>%
summarise(
Average_Age = mean(Age),
Average_Salary = mean(Salary),
Highest_Salary = max(Salary),
Lowest_Salary = min(Salary),
Employees = n()
)
Output
| Department | Average_Age | Average_Salary | Highest_Salary | Lowest_Salary | Employees |
|---|---|---|---|---|---|
| Finance | 32.0 | 56500 | 58000 | 55000 | 2 |
| HR | 28.0 | 38000 | 52000 | 30000 | 3 |
| IT | 30.3 | 51667 | 60000 | 45000 | 3 |
| Sales | 28.5 | 37500 | 40000 | 35000 | 2 |
📊 Common Summary Functions
| Function | Purpose |
|---|---|
| mean() | Average |
| sum() | Total |
| min() | Minimum |
| max() | Maximum |
| n() | Count |
| sd() | Standard Deviation |
| var() | Variance |
| median() | Median |
📊 Comparison of Functions
| Function | Purpose |
|---|---|
| summarise() | Creates summary statistics |
| group_by() | Groups data into categories |
| n() | Counts rows in each group |
| mean() | Calculates average |
| sum() | Calculates total |
🌍 Real-Life Applications
- Department-wise salary analysis.
- Student performance reports by class.
- Monthly sales summaries by region.
- Customer purchase analysis.
- Banking transaction summaries.
- Hospital patient statistics.
- Inventory reports.
- Business intelligence dashboards.
✔ Advantages
- Produces concise statistical summaries.
- Supports grouped analysis.
- Easy to combine with other dplyr functions.
- Ideal for dashboards and reports.
- Highly efficient for large datasets.
✖ Limitations
- Requires correctly grouped data.
- Missing values should be handled before summarizing.
- Complex summaries may require additional functions.
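The effect of missing values on a grouped summary can be sketched as follows. The four-row data frame is hypothetical; it shows that a single NA makes the whole group mean NA unless na.rm = TRUE is supplied.

```r
library(dplyr)

# Hypothetical data with one missing age (illustration only)
emp <- data.frame(
  Department = c("Finance", "Finance", "HR", "HR"),
  Age        = c(NA, 33, 25, 27)
)

age_summary <- emp %>%
  group_by(Department) %>%
  summarise(
    Age_With_NA    = mean(Age),                 # NA for Finance
    Age_Without_NA = mean(Age, na.rm = TRUE),   # ignores the NA
    Missing        = sum(is.na(Age))            # missing count per group
  )
age_summary
```

Reporting the missing count next to each average makes it clear how many records the figure is based on.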
📝 Lab Exercises
- Calculate the average salary of all employees.
- Find the total salary paid.
- Count the total number of employees.
- Find the highest and lowest salary.
- Calculate the standard deviation of salaries.
- Find the average salary for each department.
- Count employees in each department.
- Calculate total salary by department.
- Find the minimum and maximum salary for each department.
- Create a department-wise summary showing average age, average salary, highest salary, lowest salary, and employee count.
❓ Viva Questions
- What is the purpose of summarise()?
- What is the purpose of group_by()?
- Which function counts the number of rows?
- How do you calculate the average salary?
- What is the difference between summarise() and group_by()?
- Can summarise() be used without group_by()?
- Which function calculates standard deviation?
- What is the purpose of n()?
- Why is grouped analysis important?
- Give two real-life applications of group_by().
Class 10 (Final): Complete Data Cleaning and Data Transformation Case Study
Duration: 1 Class
🎯 Learning Objectives
After completing this lesson, students will be able to:
- Import data from a CSV file.
- Explore the dataset.
- Handle missing values.
- Remove duplicate records.
- Rename columns.
- Transform data using dplyr.
- Generate summary reports.
- Export the processed dataset.
- Apply the complete data analysis workflow in R.
📖 2.36 Complete Data Analysis Workflow
A typical data analysis project follows these steps:
Raw Data
│
▼
Import Data
│
▼
Explore Dataset
│
▼
Clean Data
│
▼
Transform Data
│
▼
Summarize Data
│
▼
Export Results
📊 Case Study: Employee Salary Analysis
Suppose a company provides the following employee dataset.
Sample Dataset (10 Records)
employee <- data.frame(
Emp_ID = c(101,102,103,104,105,106,107,108,109,109),
Name = c("Amit","Priya","Rahul","Sneha","Karan",
"Neha","Arjun","Pooja","Rohan","Rohan"),
Department = c("HR","Sales","IT","HR","IT",
"Finance","Sales","Finance","IT","IT"),
Age = c(25,28,30,27,35,NA,29,33,26,26),
Salary = c(30000,35000,50000,32000,60000,
55000,40000,58000,45000,45000)
)
employee
Output
| Emp_ID | Name | Department | Age | Salary |
|---|---|---|---|---|
| 101 | Amit | HR | 25 | 30000 |
| 102 | Priya | Sales | 28 | 35000 |
| 103 | Rahul | IT | 30 | 50000 |
| 104 | Sneha | HR | 27 | 32000 |
| 105 | Karan | IT | 35 | 60000 |
| 106 | Neha | Finance | NA | 55000 |
| 107 | Arjun | Sales | 29 | 40000 |
| 108 | Pooja | Finance | 33 | 58000 |
| 109 | Rohan | IT | 26 | 45000 |
| 109 | Rohan | IT | 26 | 45000 |
Notice that:
- One missing value exists in Age.
- One duplicate employee record exists.
Step 1: Explore the Dataset
Program 1
str(employee)
summary(employee)
Output
'data.frame': 10 observations of 5 variables
Summary:
Emp_ID
Name
Department
Age
Salary
Step 2: Detect Missing Values
Program 2
sum(is.na(employee))
Output
[1] 1
Step 3: Replace Missing Age with Mean
Program 3
employee$Age[is.na(employee$Age)] <-
mean(employee$Age,
na.rm=TRUE)
employee
Output
The missing Age for Neha is replaced with the mean of the nine known ages, approximately 28.78.
Step 4: Detect Duplicate Records
Program 4
duplicated(employee)
Output
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
Step 5: Remove Duplicate Records
Program 5
employee <-
employee[!duplicated(employee),]
Output
Duplicate record removed.
Total Records = 9
Step 6: Rename Columns
Program 6
colnames(employee) <-
c("Employee_ID",
"Employee_Name",
"Department",
"Age",
"Salary")
employee
Output
Columns renamed successfully.
Step 7: Create Bonus Column
Program 7
library(dplyr)
employee <-
employee %>%
mutate(
Bonus = Salary*0.10
)
employee
Output
| Employee_Name | Salary | Bonus |
|---|---|---|
| Amit | 30000 | 3000 |
| Priya | 35000 | 3500 |
| Rahul | 50000 | 5000 |
| Sneha | 32000 | 3200 |
| Karan | 60000 | 6000 |
| Neha | 55000 | 5500 |
| Arjun | 40000 | 4000 |
| Pooja | 58000 | 5800 |
| Rohan | 45000 | 4500 |
Step 8: Create Gross Salary
Program 8
employee <- employee %>%
  mutate(Gross_Salary = Salary + Bonus)
employee
Output
| Employee_Name | Gross_Salary |
|---|---|
| Amit | 33000 |
| Priya | 38500 |
| Rahul | 55000 |
| Sneha | 35200 |
| Karan | 66000 |
| Neha | 60500 |
| Arjun | 44000 |
| Pooja | 63800 |
| Rohan | 49500 |
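For comparison, the Bonus and Gross_Salary columns can also be created without dplyr. A base R sketch on the first three rows of the table (the data frame `emp` is illustrative):

```r
emp <- data.frame(
  Employee_Name = c("Amit", "Priya", "Rahul"),
  Salary        = c(30000, 35000, 50000)
)

emp$Bonus        <- emp$Salary * 0.10      # 10% of salary
emp$Gross_Salary <- emp$Salary + emp$Bonus # salary plus bonus

print(emp)
```

The dplyr version reads more naturally inside a longer pipeline, while the `$` form needs no extra package.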
Step 9: Department-wise Summary
Program 9
employee %>%
  group_by(Department) %>%
  summarise(
    Employees      = n(),
    Average_Salary = mean(Salary),
    Highest        = max(Salary),
    Lowest         = min(Salary)
  )
Output
| Department | Employees | Average_Salary | Highest | Lowest |
|---|---|---|---|---|
| Finance | 2 | 56500 | 58000 | 55000 |
| HR | 2 | 31000 | 32000 | 30000 |
| IT | 3 | 51667 | 60000 | 45000 |
| Sales | 2 | 37500 | 40000 | 35000 |
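The department averages can be cross-checked in base R with aggregate(). This sketch uses the Department and Salary values of the nine de-duplicated records (the data frame `emp` is a cut-down illustration) and rounds the result, which is how the IT average of 51666.67 appears as 51667 in the table:

```r
emp <- data.frame(
  Department = c("HR", "Sales", "IT", "HR", "IT",
                 "Finance", "Sales", "Finance", "IT"),
  Salary     = c(30000, 35000, 50000, 32000, 60000,
                 55000, 40000, 58000, 45000)
)

# Mean salary per department, rounded to whole numbers
avg_by_dept <- aggregate(Salary ~ Department, data = emp, FUN = mean)
avg_by_dept$Salary <- round(avg_by_dept$Salary)

print(avg_by_dept)
```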
Step 10: Export Processed Dataset
Program 10
write.csv(
employee,
"Employee_Report.csv",
row.names=FALSE
)
Output
Employee_Report.csv created successfully.
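write.csv() prints nothing on success, so a quick check is to confirm that the file exists and read it back. A minimal sketch, assuming a temporary folder is acceptable for the demonstration (the data frame `emp` and the path `out_file` are illustrative):

```r
emp <- data.frame(
  Employee_ID = c(101, 102),
  Salary      = c(30000, 35000)
)

# Write to a temporary location so the demo leaves no files behind
out_file <- file.path(tempdir(), "Employee_Report.csv")
write.csv(emp, out_file, row.names = FALSE)

# Read the file back and compare its shape with the original
check <- read.csv(out_file)
print(file.exists(out_file))
print(dim(check))
print(names(check))
```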
📊 Complete Workflow Summary
| Step | Function |
|---|---|
| Import Data | read.csv() |
| Check Structure | str() |
| Summary | summary() |
| Missing Values | is.na() |
| Remove Missing | na.omit() |
| Replace Missing | mean() |
| Duplicate Detection | duplicated() |
| Remove Duplicates | !duplicated() |
| Rename Columns | colnames() |
| Create New Columns | mutate() |
| Group Data | group_by() |
| Statistical Summary | summarise() |
| Export Data | write.csv() |
📊 Best Practices
✔ Keep a backup of the original dataset.
✔ Handle missing values before analysis.
✔ Remove duplicate records carefully.
✔ Use meaningful column names.
✔ Verify data types.
✔ Use group_by() for grouped analysis.
✔ Export the final cleaned dataset.
✔ Document every transformation step.
⚠ Common Errors and Solutions
| Error | Cause | Solution |
|---|---|---|
| Object not found | Incorrect variable name | Check spelling |
| Missing package | Package not installed | install.packages() |
| NA values in mean | Missing values present | Use na.rm = TRUE |
| Duplicate records | Repeated data | Use duplicated() |
| Wrong column name | Typing mistake | Use colnames() |
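Three of the errors in the table can be reproduced and avoided in a few lines. A short sketch with an illustrative vector `ages`:

```r
ages <- c(25, 28, NA, 27)

mean_raw   <- mean(ages)                # NA, because one value is missing
mean_clean <- mean(ages, na.rm = TRUE)  # missing value ignored

# Check that a package is installed before calling library()
has_dplyr <- requireNamespace("dplyr", quietly = TRUE)

# Check that an object exists before using it
ages_exists <- exists("ages")

print(mean_raw)
print(mean_clean)
print(has_dplyr)
print(ages_exists)
```

If `has_dplyr` is FALSE, running install.packages("dplyr") once resolves the missing package error.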
🌍 Real-Life Applications
- Employee payroll processing.
- Student examination systems.
- Banking customer databases.
- Hospital patient records.
- Insurance claim processing.
- Retail sales analysis.
- Inventory management.
- Government census data.
- Customer relationship management (CRM).
- Machine learning data preprocessing.
📝 Lab Programs
- Import a CSV file.
- Display the first 10 records.
- Check the structure of the dataset.
- Count missing values.
- Replace missing values with the mean.
- Detect duplicate records.
- Remove duplicate records.
- Rename all columns.
- Create a Bonus column.
- Calculate Gross Salary.
- Calculate Annual Salary.
- Group employees by department.
- Calculate average salary department-wise.
- Export the cleaned dataset.
- Create a complete employee report.
❓ Viva Questions
- What is data cleaning?
- What is data transformation?
- Which function imports a CSV file?
- How do you detect missing values?
- Which function removes duplicate records?
- What is the purpose of mutate()?
- What is group_by() used for?
- Which function exports data to CSV?
- Why is data cleaning important?
- What are the steps in a data analysis workflow?
- What is the difference between summarise() and mutate()?
- Why are meaningful column names important?
- What is the purpose of na.rm = TRUE?
- How do you calculate department-wise statistics?
- Give three real-life applications of data transformation.
- What is the use of duplicated()?
- How do you create a new variable in R?
- What is the difference between CSV and Excel files?
- Why should raw data be backed up before cleaning?
- Explain the complete data analysis process in R.
📚 Module 2 Summary
In this module, you learned:
- Importing and exporting data using CSV and Excel files.
- Handling missing values and duplicate records.
- Converting data types.
- Renaming rows and columns.
- Selecting, filtering, and arranging data.
- Creating and transforming variables with mutate() and transmute().
- Summarizing data using summarise() and group_by().
- Applying a complete data cleaning and transformation workflow using R.
- Solving real-world data analysis problems with practical R programs and outputs.
Module 3: Data Visualization in R Programming
📘 CHAPTER 1: Introduction to Data Visualization in R
🌟 1.1 What is Data Visualization?
Data Visualization is the graphical representation of data using charts, graphs, and plots.
It helps to convert raw data into meaningful visual information.
🎯 Purpose:
- To understand patterns in data
- To identify trends and relationships
- To detect outliers
- To support decision making
📊 1.2 Importance of Data Visualization
- Makes complex data easy to understand
- Improves analysis speed
- Helps in statistical interpretation
- Useful in business intelligence
- Enhances presentation quality
📈 1.3 Types of Data Visualizations in R
| Type | Purpose |
|---|---|
| Scatter Plot | Relationship between variables |
| Line Plot | Trend analysis |
| Bar Chart | Category comparison |
| Histogram | Data distribution |
| Pie Chart | Percentage representation |
| Box Plot | Outlier detection |
🟦 1.4 Base R Graphics
Base R provides built-in functions to create plots without installing additional packages.
🔧 Common Functions:
- plot() → General plotting
- barplot() → Bar chart
- hist() → Histogram
- pie() → Pie chart
- boxplot() → Box plot
📍 1.5 Scatter Plot in Base R
🎯 Objective:
To show the relationship between two variables.
💻 R Script:
# Scatter Plot Example
x <- c(10, 20, 30, 40, 50)
y <- c(15, 25, 35, 45, 60)
plot(x, y,
main = "Scatter Plot Example",
xlab = "X Values",
ylab = "Y Values",
col = "blue",
pch = 19,
cex = 1.5)
🖥️ Output:
- A blue scatter plot
- Points increasing diagonally
- Title: Scatter Plot Example
📌 Interpretation:
There is a positive relationship between X and Y values.
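One way to back this reading with a number is the correlation coefficient. A minimal sketch using the same x and y vectors; a value close to +1 indicates a strong positive linear relationship:

```r
x <- c(10, 20, 30, 40, 50)
y <- c(15, 25, 35, 45, 60)

# Pearson correlation between the two variables
r <- cor(x, y)
print(round(r, 3))
```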
📉 1.6 Line Plot in Base R
💻 R Script:
# Line Plot Example
sales <- c(100, 120, 150, 180, 200)
plot(sales,
type = "l",
col = "red",
lwd = 3,
main = "Sales Growth Over Time",
xlab = "Time",
ylab = "Sales")
🖥️ Output:
- Red line graph
- Shows increasing trend
📌 Interpretation:
Sales are increasing steadily over time.
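The period-to-period change can be checked directly with diff(). In this sketch on the same sales vector, every difference is positive, which confirms the upward trend, although the size of the increase varies between periods:

```r
sales <- c(100, 120, 150, 180, 200)

# Change from each period to the next
growth <- diff(sales)
print(growth)

# TRUE when sales rise in every period
always_rising <- all(growth > 0)
print(always_rising)
```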
📊 1.7 Bar Plot in Base R
💻 R Script:
# Bar Plot Example
students <- c(30, 25, 40, 35)
barplot(students,
names.arg = c("A", "B", "C", "D"),
col = "green",
main = "Class Strength")
🖥️ Output:
- Green vertical bars
- Categories A, B, C, D
📊 1.8 Histogram in Base R
💻 R Script:
# Histogram Example
marks <- c(45, 50, 55, 60, 65, 70, 75, 80, 85)
hist(marks,
col = "skyblue",
main = "Marks Distribution",
xlab = "Marks")
🖥️ Output:
- Blue histogram bars
- Frequency distribution of marks
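The bins behind the bars can be inspected without drawing anything. A small sketch, assuming the default bin rule that hist() picks for this data; `plot = FALSE` returns the breaks and counts as an object:

```r
marks <- c(45, 50, 55, 60, 65, 70, 75, 80, 85)

# Compute the histogram without plotting it
h <- hist(marks, plot = FALSE)

print(h$breaks)  # bin boundaries chosen by hist()
print(h$counts)  # number of marks falling in each bin
```

Passing an explicit `breaks` argument to hist() changes these boundaries when a different grouping is needed.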
🥧 1.9 Pie Chart in Base R
💻 R Script:
# Pie Chart Example
expenses <- c(20, 30, 25, 25)  # named 'expenses' because data() is already a base R function
pie(expenses,
labels = c("Food", "Rent", "Travel", "Savings"),
col = rainbow(4),
main = "Expense Distribution")
🖥️ Output:
- Multicolor pie chart
- Shows percentage distribution
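By default, pie() draws the slices but does not print the percentages. A sketch that computes them and adds them to the labels (the names `expenses`, `pct`, and `slice_labels` are illustrative):

```r
expenses <- c(Food = 20, Rent = 30, Travel = 25, Savings = 25)

# Share of each category as a whole-number percentage
pct <- round(100 * expenses / sum(expenses))

# Labels such as "Rent (30%)"
slice_labels <- paste0(names(expenses), " (", pct, "%)")

pie(expenses,
    labels = slice_labels,
    col    = rainbow(4),
    main   = "Expense Distribution")
```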
📦 1.10 Box Plot in Base R
💻 R Script:
# Box Plot Example
marks <- c(40, 50, 55, 60, 65, 70, 75, 90)
boxplot(marks,
col = "orange",
main = "Marks Analysis")
🖥️ Output:
- Orange box plot
- Shows median and spread
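The numbers that the box plot draws can be printed as well. A minimal sketch on the same marks: fivenum() returns the minimum, lower hinge, median, upper hinge, and maximum, and boxplot.stats() lists any points that would be drawn as outliers:

```r
marks <- c(40, 50, 55, 60, 65, 70, 75, 90)

five  <- fivenum(marks)        # min, lower hinge, median, upper hinge, max
stats <- boxplot.stats(marks)  # same summary plus outliers in stats$out

print(five)
print(stats$out)
```

For this data, stats$out is empty, so the plot shows only the box and whiskers with no separate outlier points.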
⚡ 1.11 Key Advantages of Base R Graphics
- Easy to use
- No installation required
- Fast execution
- Good for basic analysis
📌 1.12 Summary
- Data visualization converts data into graphical form
- Base R provides simple plotting tools
- Common plots: scatter, line, bar, histogram, pie, box
- Helps in understanding patterns and trends
❓ 1.13 Viva Questions
- What is data visualization?
- What is the use of plot() in R?
- What is a scatter plot?
- Difference between bar plot and histogram?
- What is the purpose of a box plot?
- What does col parameter do?
- What is the use of pch in scatter plot?
📘 CHAPTER 2: Advanced Data Visualization Using Base R Graphics + Introduction to ggplot2
🌟 2.1 Limitations of Base R Graphics
Although Base R graphics are useful, they have some limitations:
- ❌ Limited customization
- ❌ Not visually attractive for reports
- ❌ Difficult to create complex plots
- ❌ No grammar-based structure
- ❌ Hard to build advanced dashboards
👉 To overcome these problems, we use ggplot2
🎨 2.2 Introduction to ggplot2
ggplot2 is a powerful visualization package in R based on the Grammar of Graphics.
📦 Install Package:
install.packages("ggplot2")
📥 Load Package:
library(ggplot2)
📚 2.3 Grammar of Graphics (Core Concept)
A plot in ggplot2 is built using layers:
🧩 Components:
| Component | Meaning |
|---|---|
| Data | Dataset |
| Aesthetics (aes) | Mapping variables |
| Geom | Type of plot |
| Stats | Statistical transformation |
| Coord | Coordinate system |
| Theme | Visual appearance |
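The components in the table map onto individual lines of a ggplot2 call. The sketch below is one possible illustration, using the small student dataset from section 2.5; the choice of a fitted straight line as the statistical layer is an assumption made only to show where a stat fits:

```r
library(ggplot2)

# Example dataset from section 2.5
student <- data.frame(
  Name  = c("A", "B", "C", "D", "E"),
  Marks = c(70, 85, 90, 60, 75),
  Age   = c(18, 19, 20, 18, 21)
)

p <- ggplot(data = student,                       # Data
            mapping = aes(x = Age, y = Marks)) +  # Aesthetics
  geom_point(size = 3) +                          # Geom: points
  stat_smooth(method = "lm", formula = y ~ x,
              se = FALSE) +                       # Stat: fitted straight line
  coord_cartesian(ylim = c(50, 100)) +            # Coord: zoom the y axis
  theme_minimal()                                 # Theme

print(p)
```

Each `+` adds one layer or setting, so any line can be removed or swapped without rewriting the rest of the plot.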
📊 2.4 Basic ggplot Structure
ggplot(data, aes(x, y)) +
geom_function()
📌 2.5 Example Dataset
student <- data.frame(
Name = c("A", "B", "C", "D", "E"),
Marks = c(70, 85, 90, 60, 75),
Age = c(18, 19, 20, 18, 21)
)
📍 2.6 Scatter Plot (ggplot2)
library(ggplot2)
ggplot(student, aes(x = Age, y = Marks)) +
geom_point(color = "blue", size = 4) +
ggtitle("Age vs Marks Scatter Plot") +
xlab("Age") +
ylab("Marks")
🖥️ Output:
- Blue circular points
- Each point shows one student's Marks against Age
📉 2.7 Line Plot (ggplot2)
ggplot(student, aes(x = Age, y = Marks)) +
geom_line(color = "red", linewidth = 1.5) +  # linewidth replaces size for lines in ggplot2 3.4.0 and later
geom_point(color = "black", size = 3) +
ggtitle("Line Plot of Marks")
🖥️ Output:
- Red line connecting points
- Black dots on each value
📊 2.8 Bar Plot (ggplot2)
ggplot(student, aes(x = Name, y = Marks)) +
geom_bar(stat = "identity", fill = "green") +
ggtitle("Student Marks Bar Chart")
🖥️ Output:
- Green vertical bars
- Each student’s marks compared
📊 2.9 Histogram (ggplot2)
ggplot(student, aes(x = Marks)) +
geom_histogram(binwidth = 10,
fill = "skyblue",
color = "black") +
ggtitle("Marks Distribution")
🖥️ Output:
- Histogram showing frequency of marks
📦 2.10 Box Plot (ggplot2)
ggplot(student, aes(y = Marks)) +
geom_boxplot(fill = "orange") +
ggtitle("Box Plot of Marks")
🖥️ Output:
- Orange box showing the median and quartiles; outliers, if present, appear as separate points
🌈 2.11 Density Plot
ggplot(student, aes(x = Marks)) +
geom_density(fill = "pink", alpha = 0.5) +
ggtitle("Density Plot of Marks")
🖥️ Output:
- Smooth curve showing distribution
🎨 2.12 Customizing ggplot2
🔹 Titles & Labels
ggplot(student, aes(Age, Marks)) +
geom_point() +
labs(title = "Student Performance",
x = "Age",
y = "Marks")
🔹 Themes
ggplot(student, aes(Age, Marks)) +
geom_point() +
theme_minimal()
Other Themes:
- theme_bw()
- theme_classic()
- theme_dark()
🔹 Colors & Size
ggplot(student, aes(Age, Marks)) +
geom_point(color = "red", size = 4)
🔹 Scales
ggplot(student, aes(Age, Marks)) +
geom_point() +
scale_y_continuous(limits = c(50, 100))
🧩 2.13 Faceting (Multiple Plots)
student$Gender <- c("M", "F", "M", "F", "M")
ggplot(student, aes(Age, Marks)) +
geom_point() +
facet_wrap(~Gender)
🖥️ Output:
- Separate plots for Male and Female
📊 2.14 Multiple Plot Layout
library(gridExtra)  # install once with install.packages("gridExtra")
p1 <- ggplot(student, aes(Age, Marks)) + geom_point()
p2 <- ggplot(student, aes(Name, Marks)) + geom_bar(stat="identity")
grid.arrange(p1, p2, ncol = 2)
📌 2.15 Summary
- Base R is simple but limited
- ggplot2 is powerful and flexible
- Grammar of Graphics is core concept
- Customization is easy in ggplot2
- Faceting helps in multi-view analysis
❓ 2.16 Viva Questions
- What is ggplot2?
- What is Grammar of Graphics?
- Difference between base R and ggplot2?
- What is aes() in ggplot2?
- What is geom_point()?
- What is faceting?
- What is theme in ggplot2?
- What is density plot?
📘 CHAPTER 3: Interactive Data Visualization in R (Plotly & Shiny)
🌟 3.1 What is Interactive Visualization?
Interactive visualization allows users to:
- 🔍 Zoom in/out of graphs
- 🖱️ Hover to see values
- 🎯 Click and explore data
- 📊 Filter and analyze dynamically
👉 It makes data exploration more powerful than static graphs.
📦 3.2 Plotly in R
Plotly is used to create interactive charts in R.
📥 Install Plotly
install.packages("plotly")
📥 Load Library
library(plotly)
📊 3.3 Interactive Scatter Plot
library(plotly)
x <- c(1,2,3,4,5)
y <- c(10,20,15,25,30)
fig <- plot_ly(
x = x,
y = y,
type = "scatter",
mode = "markers",
marker = list(color = "blue", size = 10)
)
fig
🖥️ Output:
- Interactive blue points
- Hover shows values
- Zoom enabled
📈 3.4 Interactive Line Plot
plot_ly(
x = 1:10,
y = (1:10)^2,
type = "scatter",
mode = "lines+markers",
line = list(color = "red")
)
🖥️ Output:
- Red curve showing quadratic growth
- Click and zoom enabled
📊 3.5 Interactive Bar Chart
plot_ly(
x = c("A", "B", "C", "D"),
y = c(20, 35, 30, 40),
type = "bar",
marker = list(color = "green")
)
🖥️ Output:
- Green bars
- Hover shows values
📊 3.6 ggplot2 + Plotly Integration
library(ggplot2)
library(plotly)
student <- data.frame(
Name = c("A","B","C","D"),
Marks = c(70,80,90,85)
)
p <- ggplot(student, aes(Name, Marks)) +
geom_bar(stat="identity", fill="blue")
ggplotly(p)
🖥️ Output:
- Interactive bar chart
- Hover + zoom + click enabled
🌐 3.7 Introduction to Shiny
Shiny is used to create interactive web applications in R.
👉 Used for:
- Dashboards
- Data apps
- Live reports
📥 Install Shiny
install.packages("shiny")
📥 Load Library
library(shiny)
🧱 3.8 Structure of Shiny App
A Shiny app has 2 parts:
| Component | Purpose |
|---|---|
| UI | User Interface |
| Server | Logic/Backend |
📱 3.9 Simple Shiny App
library(shiny)
ui <- fluidPage(
  titlePanel("Simple Shiny App"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("num",
                  "Select Number:",
                  min = 1,
                  max = 100,
                  value = 50)
    ),
    mainPanel(
      textOutput("result")
    )
  )
)
server <- function(input, output) {
  output$result <- renderText({
    paste("Selected Value:", input$num)
  })
}
shinyApp(ui = ui, server = server)
🖥️ Output:
- Slider input (1–100)
- Dynamic text updates instantly
📊 3.10 Shiny Dashboard Example
library(shiny)
ui <- fluidPage(
  titlePanel("Student Dashboard"),
  sidebarLayout(
    sidebarPanel(
      selectInput("subject",
                  "Choose Subject:",
                  choices = c("Math", "Science", "English"))
    ),
    mainPanel(
      textOutput("outputText")
    )
  )
)
server <- function(input, output) {
  output$outputText <- renderText({
    paste("You selected:", input$subject)
  })
}
shinyApp(ui = ui, server = server)
🖥️ Output:
- Dropdown menu
- Dynamic response display
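A dashboard usually shows a chart as well as text. The following is a sketch only, reusing the student dataset from section 2.5; the input id `variable`, the output ids `barChart` and `avgText`, and the object name `app` are illustrative choices, not part of the examples above:

```r
library(shiny)

# Example dataset from section 2.5
student <- data.frame(
  Name  = c("A", "B", "C", "D", "E"),
  Marks = c(70, 85, 90, 60, 75),
  Age   = c(18, 19, 20, 18, 21)
)

ui <- fluidPage(
  titlePanel("Student Dashboard with Plot"),
  sidebarLayout(
    sidebarPanel(
      selectInput("variable", "Choose Variable:",
                  choices = c("Marks", "Age"))
    ),
    mainPanel(
      plotOutput("barChart"),
      textOutput("avgText")
    )
  )
)

server <- function(input, output) {
  # Reactive expression: recomputed whenever the selection changes
  selected <- reactive(student[[input$variable]])

  output$barChart <- renderPlot({
    barplot(selected(),
            names.arg = student$Name,
            col  = "steelblue",
            main = paste(input$variable, "by Student"))
  })

  output$avgText <- renderText({
    paste("Average", input$variable, "=", mean(selected()))
  })
}

# Build the app object; start it with runApp(app) in an interactive session
app <- shinyApp(ui = ui, server = server)
```

The reactive expression is shared by both outputs, so the selected column is looked up once per change rather than separately in each render function.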
📊 3.11 Advantages of Interactive Visualization
- 🎯 Real-time interaction
- 📊 Better data understanding
- 📈 Professional dashboards
- 🧠 Easy decision-making
- 🌐 Web-based applications
⚖️ 3.12 Comparison
| Tool | Type | Use |
|---|---|---|
| Base R | Static | Basic plots |
| ggplot2 | Static advanced | Publication graphs |
| Plotly | Interactive | Dynamic charts |
| Shiny | Web app | Dashboards |
📌 3.13 Summary
- Plotly adds interactivity to graphs
- ggplotly converts ggplot to interactive charts
- Shiny creates full web applications
- Interactive tools are used in real-world analytics
❓ 3.14 Viva Questions
- What is interactive visualization?
- What is Plotly used for?
- What is Shiny in R?
- Difference between ggplot2 and Plotly?
- What are UI and Server in Shiny?
- What is ggplotly()?
- What are dashboards?
🎓 FINAL SUMMARY (FULL MODULE)
✔ Base R Graphics → Simple plots
✔ ggplot2 → Advanced visualization
✔ Plotly → Interactive charts
✔ Shiny → Full web dashboards
📘 MODULE 4: STATISTICAL ANALYSIS AND MODELING
Class 1: Descriptive Statistics and Measures of Central Tendency
🌟 Learning Objectives
After completing this chapter, students will be able to:
- Understand the concept of descriptive statistics.
- Explain measures of central tendency.
- Calculate Mean, Median, and Mode using R.
- Interpret statistical results.
- Apply descriptive statistics to real-world data.
📚 4.1 Introduction to Statistics
Statistics is the science of collecting, organizing, analyzing, interpreting, and presenting data. It helps researchers, businesses, scientists, and governments make informed decisions based on numerical information.
For example:
- A school calculates the average marks of students.
- A company analyzes monthly sales.
- A hospital studies patient recovery rates.
- Weather departments analyze temperature records.
Statistics transforms raw data into useful information.
📖 Types of Statistics
Statistics is broadly classified into two categories:
1. Descriptive Statistics
Descriptive statistics summarizes and describes the main features of a dataset. It does not make predictions but presents the data in a meaningful way.
Examples:
- Mean
- Median
- Mode
- Range
- Variance
- Standard Deviation
Applications
- Student result analysis
- Employee salary reports
- Sales reports
- Population surveys
2. Inferential Statistics
Inferential statistics uses sample data to make predictions or conclusions about a larger population.
Examples:
- Hypothesis testing
- Regression analysis
- ANOVA
- Confidence intervals
⭐ Importance of Descriptive Statistics
Descriptive statistics helps to:
- Summarize large datasets.
- Identify patterns and trends.
- Compare different datasets.
- Support decision-making.
- Prepare data for advanced analysis.
📊 Measures of Central Tendency
Measures of Central Tendency describe the center or typical value of a dataset.
The three common measures are:
- Mean
- Median
- Mode
🔵 4.2 Mean (Arithmetic Mean)
Definition
The Mean is the arithmetic average of all observations.
It is the most commonly used measure of central tendency.
Formula
Mean = ΣX ÷ N
Where:
- ΣX = Sum of all observations
- N = Number of observations
Sample Data (10 Students' Marks)
| Student | Marks |
|---|---|
| 1 | 45 |
| 2 | 52 |
| 3 | 58 |
| 4 | 63 |
| 5 | 67 |
| 6 | 72 |
| 7 | 78 |
| 8 | 84 |
| 9 | 90 |
| 10 | 95 |
Manual Calculation
Step 1: Add all values
45 + 52 + 58 + 63 + 67 + 72 + 78 + 84 + 90 + 95
= 704
Step 2: Count observations
Number of observations = 10
Step 3: Apply Formula
Mean = 704 ÷ 10
= 70.4
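The three manual steps translate line by line into R. A minimal sketch (the names `total`, `n`, and `manual_mean` are illustrative):

```r
marks <- c(45, 52, 58, 63, 67, 72, 78, 84, 90, 95)

total       <- sum(marks)     # Step 1: add all values
n           <- length(marks)  # Step 2: count observations
manual_mean <- total / n      # Step 3: apply the formula

print(total)
print(n)
print(manual_mean)
```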
💻 R Program
# Program to Calculate Mean
marks <- c(45,52,58,63,67,72,78,84,90,95)
print("Student Marks")
print(marks)
mean_value <- mean(marks)
print("Mean of Marks")
print(mean_value)
🖥 Output
[1] "Student Marks"
[1] 45 52 58 63 67 72 78 84 90 95
[1] "Mean of Marks"
[1] 70.4
📖 Explanation
The mean() function in R calculates the arithmetic average of all values in the vector.
mean(marks)
returns
70.4
because the total marks are 704, divided by 10 students.
✅ Interpretation
The average marks obtained by the students are 70.4.
This means that if the total marks were equally distributed among all students, each student would receive 70.4 marks.
🌍 Real-Life Applications of Mean
- Calculating students' average marks.
- Measuring average monthly income.
- Determining average rainfall.
- Calculating average temperature.
- Business profit analysis.
- Cricket batting average.
- Manufacturing quality control.
✔ Advantages of Mean
- Easy to calculate.
- Uses all observations.
- Suitable for mathematical analysis.
- Widely used in statistics.
✖ Disadvantages of Mean
- Affected by very high or very low values (outliers).
- Not suitable for highly skewed data.
- Cannot be used for categorical data.
💡 Important Note
The Mean is the most widely used measure of central tendency, but it can be misleading when a dataset contains extreme values.
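The effect of an extreme value is easy to demonstrate. In this sketch the last mark is replaced by 950, a value invented purely to simulate a data-entry error; the mean moves sharply while the median stays where it was:

```r
marks         <- c(45, 52, 58, 63, 67, 72, 78, 84, 90, 95)
marks_outlier <- c(45, 52, 58, 63, 67, 72, 78, 84, 90, 950)  # 950 is a made-up extreme value

mean_before   <- mean(marks)
mean_after    <- mean(marks_outlier)
median_before <- median(marks)
median_after  <- median(marks_outlier)

print(c(mean_before, mean_after))
print(c(median_before, median_after))
```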
📝 Practice Exercise
Use the following data to calculate the Mean manually and using R.
| Data |
|---|
| 25 |
| 30 |
| 35 |
| 40 |
| 45 |
| 50 |
| 55 |
| 60 |
| 65 |
| 70 |
Write an R Program
marks <- c(25,30,35,40,45,50,55,60,65,70)
mean(marks)
Expected Output
[1] 47.5
📌 Key Points
- Mean is the arithmetic average.
- It is calculated using all observations.
- R provides the mean() function.
- Mean is affected by extreme values.
- It is widely used in business, science, education, and research.
🎯 Learning Summary
After completing this lesson, you have learned:
- What is descriptive statistics?
- Types of statistics.
- Importance of descriptive statistics.
- Definition and formula of Mean.
- Manual calculation of Mean.
- R program to calculate Mean.
- Interpretation of output.
- Applications, advantages, and disadvantages of Mean.
🔴 4.3 Median
📖 Definition
The Median is the middle value of a dataset when the observations are arranged in ascending or descending order.
Unlike the Mean, the Median is not affected by extremely high or low values (outliers). Therefore, it is considered a better measure of central tendency for skewed data.
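🎯 Formula
For Odd Number of Observations
Median = value of the ((n + 1) ÷ 2)th observation
For Even Number of Observations
Median = [(n ÷ 2)th observation + (n ÷ 2 + 1)th observation] ÷ 2
Where:
- n = Total number of observations (arranged in order)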
📊 Example (10 Student Marks)
| Student | Marks |
|---|---|
| 1 | 45 |
| 2 | 52 |
| 3 | 58 |
| 4 | 63 |
| 5 | 67 |
| 6 | 72 |
| 7 | 78 |
| 8 | 84 |
| 9 | 90 |
| 10 | 95 |
The data is already arranged in ascending order.
🧮 Manual Calculation
Number of observations = 10 (Even)
Middle positions:
- 5th value = 67
- 6th value = 72
Median
= (67 + 72) ÷ 2
= 69.5
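The manual procedure can be written as a small function that handles both the odd and the even case. This is a sketch for learning purposes (the name `manual_median` is illustrative); in practice the built-in median() does the same job:

```r
manual_median <- function(x) {
  x <- sort(x)    # arrange in ascending order
  n <- length(x)  # number of observations
  if (n %% 2 == 1) {
    x[(n + 1) / 2]                 # odd: the single middle value
  } else {
    (x[n / 2] + x[n / 2 + 1]) / 2  # even: average of the two middle values
  }
}

marks <- c(45, 52, 58, 63, 67, 72, 78, 84, 90, 95)
even_case <- manual_median(marks)       # 10 values: average of 5th and 6th
odd_case  <- manual_median(c(3, 1, 2))  # 3 values: middle value after sorting

print(even_case)
print(odd_case)
```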
💻 R Program
# Program to Calculate Median
marks <- c(45,52,58,63,67,72,78,84,90,95)
print("Student Marks")
print(marks)
median_value <- median(marks)
print("Median of Marks")
print(median_value)
🖥 Output
[1] "Student Marks"
[1] 45 52 58 63 67 72 78 84 90 95
[1] "Median of Marks"
[1] 69.5
📖 Explanation
The median() function automatically sorts the values (if required) and finds the middle value.
For an even number of observations, it calculates the average of the two middle values.
✅ Interpretation
The median marks are 69.5.
This means:
- 50% of students scored below 69.5
- 50% of students scored above 69.5
🌍 Real-Life Applications
- Income analysis
- House price analysis
- Population studies
- Salary surveys
- Medical research
✔ Advantages
- Not affected by outliers.
- Easy to understand.
- Suitable for skewed data.
- Useful for ordinal data.
✖ Disadvantages
- Does not use every observation.
- Difficult to calculate for grouped data manually.
📝 Practice Exercise
Find the median of the following data using R.
Sample Data
28, 35, 40, 45, 50, 55, 60, 65, 70, 80
R Script
marks <- c(28,35,40,45,50,55,60,65,70,80)
median(marks)
Output
[1] 52.5
🟣 4.4 Mode
📖 Definition
The Mode is the value that appears most frequently in a dataset.
A dataset may have:
- One Mode (Unimodal)
- Two Modes (Bimodal)
- More than Two Modes (Multimodal)
- No Mode (all values occur once)
Since R does not provide a built-in function for the statistical mode (the built-in mode() reports an object's internal type instead), we create a custom function.
📊 Example (10 Student Marks)
| Student | Marks |
|---|---|
| 1 | 45 |
| 2 | 52 |
| 3 | 63 |
| 4 | 63 |
| 5 | 63 |
| 6 | 72 |
| 7 | 78 |
| 8 | 84 |
| 9 | 90 |
| 10 | 95 |
📋 Frequency Table
| Marks | Frequency |
|---|---|
| 45 | 1 |
| 52 | 1 |
| 63 | 3 |
| 72 | 1 |
| 78 | 1 |
| 84 | 1 |
| 90 | 1 |
| 95 | 1 |
The highest frequency is 3.
Therefore,
Mode = 63
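The frequency table itself can be produced in R with table(). A short sketch on the same marks (the names `freq` and `most_common` are illustrative):

```r
marks <- c(45, 52, 63, 63, 63, 72, 78, 84, 90, 95)

# Count how often each mark occurs
freq <- table(marks)
print(freq)

# Value(s) with the highest frequency, converted back to numbers
most_common <- as.numeric(names(freq)[freq == max(freq)])
print(most_common)
```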
💻 R Program
# Program to Calculate Mode
marks <- c(45,52,63,63,63,72,78,84,90,95)
Mode <- function(x) {
  unique_values <- unique(x)
  # Count occurrences of each unique value and return the most frequent one
  unique_values[which.max(tabulate(match(x, unique_values)))]
}
mode_value <- Mode(marks)
print("Student Marks")
print(marks)
print("Mode of Marks")
print(mode_value)
🖥 Output
[1] "Student Marks"
[1] 45 52 63 63 63 72 78 84 90 95
[1] "Mode of Marks"
[1] 63
📖 Explanation
The custom Mode() function:
- Finds the unique values.
- Counts how many times each value appears.
- Returns the value with the highest frequency. If several values tie for the highest frequency, only the first one encountered is returned.
✅ Interpretation
The most frequently occurring mark is 63.
This indicates that 63 is the most common score among the students.
🌍 Real-Life Applications
- Most sold product
- Most common blood group
- Most frequently purchased item
- Customer preference analysis
- Election survey analysis
✔ Advantages
- Easy to understand.
- Suitable for categorical data.
- Not affected by outliers.
- Represents the most common value.
✖ Disadvantages
- Some datasets have multiple modes.
- Some datasets have no mode.
- Less useful for mathematical calculations.
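The first two disadvantages can be handled in code. The function below is a sketch, not a standard R function: the name `all_modes` is illustrative, and returning NULL when every value occurs once is a convention assumed here for the "no mode" case:

```r
all_modes <- function(x) {
  freq <- table(x)
  if (max(freq) == 1) {
    return(NULL)  # every value occurs once: treated as "no mode"
  }
  values <- names(freq)[freq == max(freq)]  # all values tied for the top count
  if (is.numeric(x)) as.numeric(values) else values
}

bimodal     <- all_modes(c(1, 1, 2, 2, 3))   # two values tie
no_mode     <- all_modes(c(5, 6, 7))         # no repeated value
categorical <- all_modes(c("A", "B", "B"))   # works for text data too

print(bimodal)
print(no_mode)
print(categorical)
```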
📊 Comparison of Mean, Median, and Mode
| Feature | Mean | Median | Mode |
|---|---|---|---|
| Definition | Average of all values | Middle value | Most frequent value |
| Uses All Data | ✔ Yes | ✖ No | ✖ No |
| Affected by Outliers | ✔ Yes | ✖ No | ✖ No |
| Suitable for Categorical Data | ✖ No | ✖ No | ✔ Yes |
| R Function | mean() | median() | Custom Function |