
Directorate of Distance Education,
Guru Jambheshwar University of Science & Technology, Hisar

Data Analytics
Practical File
2022-23

Submitted to:                Submitted by:
Mr. Neeraj Verma,            Poonam
Assistant Professor,         21151005010
Dept. of Computer Science    3rd Sem, MCA (2 years)
INDEX

Sr. No.  Title
1.  Write a script for statistics techniques (mean, mode, median, variance, standard deviation).
2.  Write a script to work with a data frame.
3.  Script for a cross-validation example.
4.  Draw all types of graphs with examples in different scripts.
5.  R script for creating a Confusion Matrix and measuring its performance.
6.  Exercise with the file handling package fs.
1. Write a script for statistics techniques
(mean, mode, median, variance, standard
deviation).
Mean: Calculate the sum of all the values and divide it by the total number of values in the data set.

> x <- c(1,2,3,4,5,1,2,3,1,2,4,5,2,3,1,1,2,3,5,6) # our data set
> mean.result <- mean(x) # calculate mean
> print(mean.result)
[1] 2.8

Median: The middle value of the data set.

> x <- c(1,2,3,4,5,1,2,3,1,2,4,5,2,3,1,1,2,3,5,6) # our data set
> median.result <- median(x) # calculate median
> print(median.result)
[1] 2.5

Mode: The most frequently occurring number in the data set. There is no built-in function in R for the statistical mode (the base function mode() reports an object's storage mode instead), so we have to create our own custom function.

> mode <- function(x) {
+   ux <- unique(x)
+   ux[which.max(tabulate(match(x, ux)))]
+ }

> x <- c(1,2,3,4,5,1,2,3,1,2,4,5,2,3,1,1,2,3,5,6) # our data set

> mode.result <- mode(x) # calculate mode (with our custom function named 'mode')
> print(mode.result)
[1] 1

Variance: A measure of how far a set of data values is spread out from its mean.

> variance.result <- var(x) # calculate variance
> print(variance.result)
[1] 2.484211

Standard Deviation: A measure that is used to quantify the amount of variation or dispersion of a set of data values. It is the square root of the variance.

> sd.result <- sqrt(var(x)) # calculate standard deviation
> print(sd.result)
[1] 1.576138
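The hand-rolled results above can be cross-checked against R's built-in functions; a minimal sketch using the same data set (note that var() and sd() use the sample formula with an n - 1 denominator):

```r
x <- c(1,2,3,4,5,1,2,3,1,2,4,5,2,3,1,1,2,3,5,6) # our data set

# Sample variance computed by hand: sum of squared deviations over (n - 1)
manual.var <- sum((x - mean(x))^2) / (length(x) - 1)
print(manual.var) # [1] 2.484211

# sd() is simply the square root of var()
print(sd(x))      # [1] 1.576138
```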

2. Write a script to work with a data frame.

A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

Following are the characteristics of a data frame.

The column names should be non-empty.
The row names should be unique.
The data stored in a data frame can be of numeric, factor or character type.
Each column should contain the same number of data items.

Create Data Frame

# Create the data frame.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
                         "2015-03-27")),
  stringsAsFactors = FALSE
)

# Print the data frame.
print(emp.data)

When we execute the above code, it produces the following result



emp_id emp_name salary start_date
1 1 Rick 623.30 2012-01-01
2 2 Dan 515.20 2013-09-23
3 3 Michelle 611.00 2014-11-15
4 4 Ryan 729.00 2014-05-11
5 5 Gary 843.25 2015-03-27

Get the Structure of the Data Frame


The structure of the data frame can be seen by using str() function.

# Create the data frame.


emp.data <- data.frame(
emp_id = c (1:5),
emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
salary = c(623.3,515.2,611.0,729.0,843.25),

start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-


11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Get the structure of the data frame.
str(emp.data)

When we execute the above code, it produces the following result



'data.frame': 5 obs. of 4 variables:
 $ emp_id    : int 1 2 3 4 5
 $ emp_name  : chr "Rick" "Dan" "Michelle" "Ryan" ...
 $ salary    : num 623 515 611 729 843
 $ start_date: Date, format: "2012-01-01" "2013-09-23" "2014-11-15" "2014-05-11" ...

Summary of Data in Data Frame

The statistical summary and nature of the data can be obtained by
applying summary() function.

# Create the data frame.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
                         "2015-03-27")),
  stringsAsFactors = FALSE
)
# Print the summary.
print(summary(emp.data))

When we execute the above code, it produces the following result



     emp_id   emp_name            salary        start_date        
 Min.   :1   Length:5           Min.   :515.2   Min.   :2012-01-01
 1st Qu.:2   Class :character   1st Qu.:611.0   1st Qu.:2013-09-23
 Median :3   Mode  :character   Median :623.3   Median :2014-05-11
 Mean   :3                      Mean   :664.4   Mean   :2014-01-14
 3rd Qu.:4                      3rd Qu.:729.0   3rd Qu.:2014-11-15
 Max.   :5                      Max.   :843.2   Max.   :2015-03-27

Extract Data from Data Frame

Extract specific columns from a data frame using the column names.

# Create the data frame.
emp.data <- data.frame(
  emp_id = c(1:5),
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
                         "2015-03-27")),
  stringsAsFactors = FALSE
)
# Extract specific columns.
result <- data.frame(emp.data$emp_name, emp.data$salary)
print(result)

When we execute the above code, it produces the following result



emp.data.emp_name emp.data.salary
1 Rick 623.30
2 Dan 515.20
3 Michelle 611.00
4 Ryan 729.00
5 Gary 843.25
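Rows can be extracted with the same bracket notation; a minimal sketch, reusing the emp.data frame from above (start_date omitted for brevity):

```r
# Recreate the data frame from the example above (without start_date)
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick","Dan","Michelle","Ryan","Gary"),
  salary = c(623.3,515.2,611.0,729.0,843.25),
  stringsAsFactors = FALSE
)

# Extract the first two rows with [rows, columns]; an empty columns slot means "all columns"
print(emp.data[1:2, ])

# Extract the rows that satisfy a condition on a column
print(emp.data[emp.data$salary > 700, ])
```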

3. Script for cross validation example.


What is Cross-Validation?

In machine learning, cross-validation is a resampling method used for model evaluation, to avoid testing a model on the same dataset on which it was trained. Testing on the training data is a common mistake, especially when a separate testing dataset is not available, and it usually leads to misleading performance measures (the model will have an almost perfect score, since it is being tested on the same data it was trained on). To avoid this kind of mistake, cross-validation is usually preferred.

The concept of cross-validation is actually simple: instead of using the whole dataset to train and then testing on the same data, we randomly divide our data into training and testing datasets.

There are several types of cross-validation methods (LOOCV – leave-one-out cross-validation, the holdout method, k-fold cross-validation).

The Validation set Approach

The validation set approach consists of randomly splitting the data into two sets: one set is used to train the model and the remaining set is used to test the model.

The process works as follows:

Build (train) the model on the training data set.
Apply the model to the test data set to predict the outcome of new unseen observations.
Quantify the prediction error as the mean squared difference between the observed and the predicted outcome values.

The example below splits the swiss data set so that 80% is used for training a linear regression model and 20% is used to evaluate the model performance.

# Load the required packages: caret provides createDataPartition(), R2(), RMSE()
# and MAE(); the %>% pipe is re-exported by dplyr
library(caret)
library(dplyr)

# Split the data into training and test set
set.seed(123)
training.samples <- swiss$Fertility %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- swiss[training.samples, ]
test.data <- swiss[-training.samples, ]
# Build the model
model <- lm(Fertility ~ ., data = train.data)
# Make predictions and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)
data.frame(R2 = R2(predictions, test.data$Fertility),
           RMSE = RMSE(predictions, test.data$Fertility),
           MAE = MAE(predictions, test.data$Fertility))
##     R2 RMSE  MAE
## 1 0.39 9.11 7.48

When comparing two models, the one that produces the lowest test
sample RMSE is the preferred model.

The RMSE and the MAE are measured on the same scale as the outcome variable. Dividing the RMSE by the average value of the outcome variable gives the prediction error rate, which should be as small as possible:

RMSE(predictions, test.data$Fertility)/mean(test.data$Fertility)

## [1] 0.128

Note that the validation set method is only useful when you have a large data set that can be partitioned. A disadvantage is that we build the model on only a fraction of the data set, possibly leaving out some interesting information about the data and leading to higher bias. The test error rate can therefore be highly variable, depending on which observations are included in the training set and which are included in the validation set.

Leave one out cross validation - LOOCV

This method works as follows:

1. Leave out one data point and build the model on the rest of the
data set
2. Test the model against the data point that is left out at step 1 and
record the test error associated with the prediction
3. Repeat the process for all data points
4. Compute the overall prediction error by taking the average of all
these test error estimates recorded at step 2.

Practical example in R using the caret package:

# Load caret for trainControl() and train()
library(caret)

# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model
model <- train(Fertility ~ ., data = swiss, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)

## Linear Regression
##
## 47 samples

## 5 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 46, 46, 46, 46, 46, 46, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.74 0.613 6.12
##
## Tuning parameter 'intercept' was held constant at a value of TRUE

The advantage of the LOOCV method is that we make use of all data points, reducing potential bias.

However, the process is repeated as many times as there are data points, resulting in a long execution time when n is extremely large.

Additionally, we test the model performance against a single data point at each iteration. This may result in higher variation in the prediction error if some data points are outliers. So we need a good ratio of testing data points, a solution provided by the k-fold cross-validation method.

K-fold cross-validation

The k-fold cross-validation method evaluates the model performance on different subsets of the training data and then calculates the average prediction error rate. The algorithm is as follows:

1. Randomly split the data set into k subsets (for example, 5 subsets).
2. Reserve one subset and train the model on all the other subsets.
3. Test the model on the reserved subset and record the prediction error.
4. Repeat this process until each of the k subsets has served as the test set.
5. Compute the average of the k recorded errors. This is called the cross-validation error and serves as the performance metric for the model.
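The steps above can also be sketched by hand in base R on the built-in swiss data set; this is a hand-rolled illustration, not the caret implementation used in the examples below:

```r
set.seed(123)
k <- 5
data <- swiss # built-in dataset

# Step 1: randomly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(data)))

errors <- numeric(k)
for (i in 1:k) {
  # Steps 2-3: train on all folds except fold i, then test on fold i
  train <- data[folds != i, ]
  test  <- data[folds == i, ]
  fit   <- lm(Fertility ~ ., data = train)
  pred  <- predict(fit, newdata = test)
  errors[i] <- sqrt(mean((test$Fertility - pred)^2)) # RMSE on the held-out fold
}

# Step 5: the cross-validation error is the average of the k fold errors
cv.error <- mean(errors)
print(cv.error)
```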

K-fold cross-validation (CV) is a robust method for estimating the


accuracy of a model.

The most obvious advantage of k-fold CV compared to LOOCV is


computational. A less obvious but potentially more important advantage
of k-fold CV is that it often gives more accurate estimates of the test
error rate than does LOOCV (James et al. 2014).

A typical question is how to choose the right value of k.

A lower value of k is more biased and hence undesirable. On the other hand, a higher value of k is less biased but can suffer from large variability. It is not hard to see that a smaller value of k (say k = 2) takes us towards the validation set approach, whereas a higher value of k (say k = the number of data points) leads us to the LOOCV approach.

In practice, one typically performs k-fold cross-validation using k = 5 or


k = 10, as these values have been shown empirically to yield test error
rate estimates that suffer neither from excessively high bias nor from
very high variance.

The following example uses 10-fold cross validation to estimate the


prediction error. Make sure to set seed for reproducibility.

# Define training control


set.seed(123)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
trControl = train.control)
# Summarize the results
print(model)

## Linear Regression

##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 43, 42, 42, 41, 43, 41, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.38 0.751 6.03
##
## Tuning parameter 'intercept' was held constant at a value of TRUE

Repeated K-fold cross-validation

The process of splitting the data into k folds can be repeated a number of times; this is called repeated k-fold cross-validation.

The final model error is taken as the mean error from the number of
repeats.

The following example uses 10-fold cross validation with 3 repeats:

# Define training control


set.seed(123)
train.control <- trainControl(method = "repeatedcv",
number = 10, repeats = 3)
# Train the model
model <- train(Fertility ~., data = swiss, method = "lm",
trControl = train.control)
# Summarize the results
print(model)

4. Draw all types of graphs with examples in different scripts.
Pie Charts
In R the pie chart is created using the pie() function, which takes positive numbers as a vector input. The additional parameters are used to control the labels, colors, title etc.

Syntax
The basic syntax for creating a pie-chart using the R is −

pie(x, labels, radius, main, col, clockwise)

Following is the description of the parameters used −

x is a vector containing the numeric values used in the pie chart.
labels is used to give a description to the slices.
radius indicates the radius of the circle of the pie chart (a value between −1 and +1).
main indicates the title of the chart.
col indicates the color palette.
clockwise is a logical value indicating if the slices are drawn clockwise or anticlockwise.

Example
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")

# Give the chart file a name.
png(file = "city.png")

# Plot the chart.
pie(x, labels)

# Save the file.
dev.off()

When we execute the above code, it produces the following result
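The remaining parameters from the syntax above (main, col, clockwise) can be added to the same call; a minimal sketch reusing the city data, where the output file name and the rainbow() palette are illustrative choices:

```r
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")

# Give the chart file a name.
png(file = "city_color.png")

# Plot with a title, one color per slice, and clockwise slices.
pie(x, labels, main = "City pie chart",
    col = rainbow(length(x)), clockwise = TRUE)

# Save the file.
dev.off()
```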


Bar Chart
A bar chart represents data in rectangular bars, with the length of each bar proportional to the value of the variable. R uses the function barplot() to create bar charts. R can draw both vertical and horizontal bars, and each bar can be given a different color.

Syntax
The basic syntax to create a bar-chart in R is −

barplot(H,xlab,ylab,main, names.arg,col)

Following is the description of the parameters used −

H is a vector or matrix containing the numeric values used in the bar chart.
xlab is the label for the x axis.
ylab is the label for the y axis.
main is the title of the bar chart.
names.arg is a vector of names appearing under each bar.
col is used to give colors to the bars in the graph.

Example
A simple bar chart is created using just the input vector and the name of
each bar.

The below script will create and save the bar chart in the current R
working directory.

# Create the data for the chart
H <- c(7,12,28,3,41)

# Give the chart file a name
png(file = "barchart.png")

# Plot the bar chart
barplot(H)

# Save the file
dev.off()

When we execute above code, it produces following result −
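The other barplot() parameters listed above (names.arg, xlab, ylab, main, col) can be exercised in the same way; a minimal sketch, where the labels and file name are illustrative:

```r
# Create the data and (assumed) labels for the chart
H <- c(7,12,28,3,41)
M <- c("Mar","Apr","May","Jun","Jul")

# Give the chart file a name
png(file = "barchart_color.png")

# Plot the bar chart with labels, axis titles, a main title and a bar color
barplot(H, names.arg = M, xlab = "Month", ylab = "Value",
        main = "Monthly chart", col = "blue")

# Save the file
dev.off()
```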

Boxplot
Boxplots are a measure of how well distributed the data in a data set is. A boxplot divides the data set into quarters using three quartiles, and represents the minimum, first quartile, median, third quartile and maximum of the data set. It is also useful for comparing the distribution of data across data sets by drawing a boxplot for each of them.

Boxplots are created in R by using the boxplot() function.

Syntax
The basic syntax to create a boxplot in R is −

boxplot(x, data, notch, varwidth, names, main)

Following is the description of the parameters used −

x is a vector or a formula.
data is the data frame.
notch is a logical value. Set as TRUE to draw a notch.
varwidth is a logical value. Set as TRUE to draw the width of the box proportionate to the sample size.
names are the group labels which will be printed under each boxplot.
main is used to give a title to the graph.

Example
We use the data set "mtcars" available in the R environment to create a
basic boxplot. Let's look at the columns "mpg" and "cyl" in mtcars.

input <- mtcars[,c('mpg','cyl')]


print(head(input))

When we execute above code, it produces following result −

mpg cyl
Mazda RX4 21.0 6
Mazda RX4 Wag 21.0 6
Datsun 710 22.8 4
Hornet 4 Drive 21.4 6
Hornet Sportabout 18.7 8
Valiant 18.1 6

Creating the Boxplot

The below script will create a boxplot graph for the relation between mpg (miles per gallon) and cyl (number of cylinders).

# Give the chart file a name.
png(file = "boxplot.png")

# Plot the chart.
boxplot(mpg ~ cyl, data = mtcars, xlab = "Number of Cylinders",
        ylab = "Miles Per Gallon", main = "Mileage Data")

# Save the file.
dev.off()

When we execute the above code, it produces the following result

Histogram
A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is similar to a bar chart, but the difference is that it groups the values into continuous ranges. The height of each bar in a histogram represents the number of values present in that range.

R creates a histogram using the hist() function. This function takes a vector as input and uses some more parameters to plot histograms.

Syntax
The basic syntax for creating a histogram using R is −

hist(v,main,xlab,xlim,ylim,breaks,col,border)

Following is the description of the parameters used −

v is a vector containing the numeric values used in the histogram.
main indicates the title of the chart.
col is used to set the color of the bars.
border is used to set the border color of each bar.
xlab is used to give a description of the x-axis.
xlim is used to specify the range of values on the x-axis.
ylim is used to specify the range of values on the y-axis.
breaks is used to control the number of bins the values are grouped into (or the breakpoints between them).

Example
A simple histogram is created using input vector, label, col and border
parameters.

The script given below will create and save the histogram in the current
R working directory.

# Create data for the graph.
v <- c(9,13,21,8,36,22,12,41,31,33,19)

# Give the chart file a name.
png(file = "histogram.png")

# Create the histogram.
hist(v, xlab = "Weight", col = "yellow", border = "blue")

# Save the file.
dev.off()

When we execute the above code, it produces the following result


Line Graph
A line chart is a graph that connects a series of points by drawing line segments between them. These points are ordered by one of their coordinates (usually the x-coordinate). Line charts are usually used for identifying trends in data.

The plot() function in R is used to create the line graph.

Syntax
The basic syntax to create a line chart in R is −

plot(v,type,col,xlab,ylab)

Following is the description of the parameters used −

v is a vector containing the numeric values.
type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw both points and lines.
xlab is the label for the x axis.
ylab is the label for the y axis.
main is the title of the chart.
col is used to give colors to both the points and the lines.

Example
A simple line chart is created using the input vector and the type parameter as "o". The below script will create and save a line chart in the current R working directory.

# Create the data for the chart.
v <- c(7,12,28,3,41)

# Give the chart file a name.
png(file = "line_chart.png")

# Plot the line chart.
plot(v, type = "o")

# Save the file.
dev.off()

When we execute the above code, it produces the following result


Scatterplot
Scatterplots show many points plotted in the Cartesian plane. Each point
represents the values of two variables. One variable is chosen in the
horizontal axis and another in the vertical axis.

The simple scatterplot is created using the plot() function.

Syntax
The basic syntax for creating scatterplot in R is −

plot(x, y, main, xlab, ylab, xlim, ylim, axes)

Following is the description of the parameters used −

x is the data set whose values are the horizontal coordinates.
y is the data set whose values are the vertical coordinates.
main is the title of the graph.
xlab is the label on the horizontal axis.
ylab is the label on the vertical axis.
xlim is the limits of the values of x used for plotting.
ylim is the limits of the values of y used for plotting.
axes indicates whether both axes should be drawn on the plot.

Example
We use the data set "mtcars" available in the R environment to create a
basic scatterplot. Let's use the columns "wt" and "mpg" in mtcars.

input <- mtcars[,c('wt','mpg')]


print(head(input))

When we execute the above code, it produces the following result


wt mpg
Mazda RX4 2.620 21.0
Mazda RX4 Wag 2.875 21.0
Datsun 710 2.320 22.8
Hornet 4 Drive 3.215 21.4
Hornet Sportabout 3.440 18.7
Valiant 3.460 18.1

Creating the Scatterplot

The below script will create a scatterplot graph for the relation between wt (weight) and mpg (miles per gallon).

# Get the input values.
input <- mtcars[,c('wt','mpg')]

# Give the chart file a name.
png(file = "scatterplot.png")

# Plot the chart for cars with weight between 2.5 and 5 and mileage between 15 and 30.
plot(x = input$wt, y = input$mpg,
     xlab = "Weight",
     ylab = "Mileage",
     xlim = c(2.5,5),
     ylim = c(15,30),
     main = "Weight vs Mileage"
)

# Save the file.
dev.off()

When we execute the above code, it produces the following result


5. R script for creating Confusion Matrix and
measuring its performance.

A confusion matrix in R is a table that categorizes the predictions against the actual values. It has two dimensions: one indicates the predicted values and the other represents the actual values.

Each row in the confusion matrix represents the predicted values and the columns represent the actual values (this can also be vice versa). Even though the matrices themselves are simple, the terminology behind them can seem complex, and there is always a chance to get confused about the classes. Hence the term - confusion matrix.

This is a two-class binary model that shows the distribution of predicted and actual values.

This is a three-class model that shows the distribution of predicted and actual values of the data.

In the confusion matrix in R, the class of interest (our target class) will be the positive class and the rest will be negative.

You can express the relationship between the positive and negative classes with the help of the 2x2 confusion matrix. It will include 4 categories -

True Positive (TP) - This is correctly classified as the class of interest / target.
True Negative (TN) - This is correctly classified as not the class of interest / target.
False Positive (FP) - This is wrongly classified as the class of interest / target.
False Negative (FN) - This is wrongly classified as not the class of interest / target.

Creating a Simple Confusion Matrix

# Install the required package
install.packages('caret')

# Import the required library
library(caret)

# Create vectors holding the data points
expected_value <- factor(c(1,0,1,0,1,1,1,0,0,1))
predicted_value <- factor(c(1,0,0,1,1,1,0,0,0,1))

# Create the confusion matrix
example <- confusionMatrix(data = predicted_value, reference = expected_value)

# Display the results
example

Confusion Matrix and Statistics

Reference
Prediction 0 1
0 3 2
1 1 4

Accuracy : 0.7
95% CI : (0.3475, 0.9333)
No Information Rate : 0.6
P-Value [Acc > NIR] : 0.3823

Kappa : 0.4

Mcnemar's Test P-Value : 1.0000

Sensitivity : 0.7500
Specificity : 0.6667
Pos Pred Value : 0.6000
Neg Pred Value : 0.8000
Prevalence : 0.4000
Detection Rate : 0.3000
Detection Prevalence : 0.5000
Balanced Accuracy : 0.7083

'Positive' Class : 0

Measuring the performance

The success rate or the accuracy of the model can be easily calculated using the 2x2 confusion matrix. The formula for calculating accuracy is:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Here, TP, TN, FP and FN represent the counts of the values that belong to each category. The accuracy is calculated by summing and dividing the values as per the formula.

After this, you are encouraged to find the error rate, the fraction of cases that the model has predicted wrongly. The formula for the error rate is:

Error Rate = (FP + FN) / (TP + TN + FP + FN) = 1 - Accuracy

The error rate calculation is simple and to the point: if a model performs at 90% accuracy, then the error rate is 10%. As simple as that.

The simple way to get the confusion matrix in R is by using the table()
function. Let’s see how it works.

table(expected_value,predicted_value)

predicted_value
expected_value 0 1
0 3 1
1 2 4

Observation:

The model has predicted 0 as 0, 3 times and 0 as 1, 1 time.
The model has predicted 1 as 0, 2 times and 1 as 1, 4 times.
The accuracy of the model is 70%.
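The accuracy and error-rate formulas can be applied directly to the table() output, since the correct predictions sit on the diagonal; a minimal sketch:

```r
# The same vectors as in the confusionMatrix() example
expected_value  <- factor(c(1,0,1,0,1,1,1,0,0,1))
predicted_value <- factor(c(1,0,0,1,1,1,0,0,0,1))

tab <- table(expected_value, predicted_value)

# Correct predictions are on the diagonal of the table
accuracy   <- sum(diag(tab)) / sum(tab)
error.rate <- 1 - accuracy

print(accuracy)   # [1] 0.7
print(error.rate) # [1] 0.3
```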

6. Exercise with file handling package fs.
# install.packages("fs", repos = "http://cran.us.r-project.org")
library(fs)
library(stringr)

dname <- "txt_files" # the folder used throughout this exercise

Folder exists

dir_exists(dname)

txt_files
     TRUE

Files exist

fnames <- path("txt_files", c("a","b","c","d","e","f"), ext = "txt") # the files to check
file_exists(fnames)

txt_files/a.txt txt_files/b.txt txt_files/c.txt
           TRUE           FALSE           FALSE
txt_files/d.txt txt_files/e.txt txt_files/f.txt
           TRUE            TRUE            TRUE

List files in the folder and set a variable

Then make a new variable replacing .txt by .crt for new file names

fnames <- dir_ls("txt_files")
fnames_new <- str_replace(fnames, ".txt", ".crt")
fnames_new

[1] "txt_files/a.crt" "txt_files/d.crt" "txt_files/e.crt" "txt_files/f.crt"

Rename files with move function

file_move(fnames, fnames_new)

List files

dir_ls(dname)

txt_files/a.crt txt_files/d.crt txt_files/e.crt txt_files/f.crt

All files are .crt now

Package fs has many more functions; for example, files can be copied or deleted

file_copy("txt_files/a.crt", "txt_files/aa.crt")
print("files before deleting: ")

[1] "files before deleting: "

dir_ls(dname)

txt_files/a.crt txt_files/aa.crt txt_files/d.crt txt_files/e.crt


txt_files/f.crt

file_delete("txt_files/aa.crt")
print("files after deleting: ")

[1] "files after deleting: "

dir_ls(dname)

txt_files/a.crt txt_files/d.crt txt_files/e.crt txt_files/f.crt
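As a further exercise, the same fs verbs can build and tear down a folder from scratch; a minimal self-contained sketch, assuming the fs package is installed (the fs_demo folder name is illustrative):

```r
library(fs)

# Create a scratch folder with three files
dir_create("fs_demo")
file_create(path("fs_demo", c("a", "b", "c"), ext = "txt"))

created <- dir_ls("fs_demo")
print(created)

# Clean up: dir_delete removes the folder together with its contents
dir_delete("fs_demo")
print(dir_exists("fs_demo")) # FALSE
```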
