Data Analytics
Practical File
2022-23
1. Write a script for statistics techniques
(mean, mode, median, variance, standard
deviation).
Mean: The sum of all the values divided by the total number of values in
the data set.
Mode: The most frequently occurring value in the data set. R has no
built-in function for the statistical mode (the base mode() function
reports an object's storage mode instead), so we have to write our own
custom function.
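As a minimal sketch of these definitions, the block below computes the mean and median with base R and defines a custom mode function. The vector x and the helper name get_mode are illustrative assumptions, not taken from the original script.

```r
# Hypothetical sample data (not from the original practical file).
x <- c(4, 8, 6, 5, 3, 8, 9, 8, 5)

# Custom mode: return the value that occurs most often in v.
# If several values tie, the one that appears first in v is returned.
get_mode <- function(v) {
  uniq <- unique(v)
  uniq[which.max(tabulate(match(v, uniq)))]
}

mean.result   <- mean(x)      # sum(x) / length(x)
median.result <- median(x)    # middle value of the sorted data
mode.result   <- get_mode(x)  # most frequent value

print(mean.result)
print(median.result)
print(mode.result)
```

Because get_mode() works on unique values rather than on numbers specifically, it should also work for character vectors.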
Variance: A measure of how far a set of data values is spread out from
their mean.
> print (variance.result)
[1] 2.484211
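The data behind the output above is not reproduced in this file. As a hedged sketch with a hypothetical vector, the sample variance and standard deviation can be obtained with base R's var() and sd(); both divide by n - 1 rather than n.

```r
# Hypothetical sample data (the original vector is not shown).
y <- c(2, 4, 4, 4, 5, 5, 7, 9)

variance.result <- var(y)  # sample variance: sum((y - mean(y))^2) / (n - 1)
sd.result <- sd(y)         # standard deviation: square root of the variance

print(variance.result)
print(sd.result)
```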
# Print the data frame.
print(emp.data)
The statistical summary and nature of the data can be obtained by
applying the summary() function.
start_date = as.Date(c("2012-01-01","2013-09-23","2014-11-15","2014-05-11",
"2015-03-27")),
stringsAsFactors = FALSE
)
# Extract Specific columns.
result <- data.frame(emp.data$emp_name,emp.data$salary)
print(result)
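The fragments above come from a longer script that is only partly reproduced here. As a self-contained sketch, the block below builds a small data frame, summarises it, and extracts two columns. The column names emp_name, salary and start_date and the dates follow the fragments above; the employee names and salary figures are hypothetical placeholders.

```r
# Build a data frame. Names and salaries are made-up placeholder values.
emp.data <- data.frame(
  emp_name = c("A", "B", "C", "D", "E"),
  salary = c(620, 515, 611, 729, 843),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)

# Print the data frame and its statistical summary.
print(emp.data)
print(summary(emp.data))

# Extract specific columns into a new data frame.
result <- data.frame(emp.data$emp_name, emp.data$salary)
print(result)
```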
The validation set approach consists of randomly splitting the data into
two sets: one set is used to train the model and the remaining set
is used to test the model.
When comparing two models, the one that produces the lowest test
sample RMSE is the preferred model.
The RMSE and the MAE are measured on the same scale as the outcome
variable. Dividing the RMSE by the average value of the outcome variable
gives the prediction error rate, which should be as small as
possible:
RMSE(predictions, test.data$Fertility)/mean(test.data$Fertility)
## [1] 0.128
Note that the validation set method is only useful when you have a large
data set that can be partitioned. A disadvantage is that the model is built
on only a fraction of the data set, possibly leaving out some interesting
information about the data and leading to higher bias. In addition, the
test error rate can be highly variable, depending on which observations
are included in the training set and which are included in the
validation set.
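A minimal base-R sketch of the validation set approach follows. It assumes the built-in swiss data set (which has a Fertility column), an 80/20 split, and an arbitrary seed; none of these choices is confirmed by the text above, and the RMSE is computed by hand rather than with a helper function.

```r
# Assumption: the built-in swiss data set, with Fertility as the outcome.
data("swiss")
set.seed(123)  # arbitrary seed, only for reproducibility

# Randomly assign about 80% of the rows to training and the rest to testing.
n <- nrow(swiss)
train.idx <- sample(seq_len(n), size = floor(0.8 * n))
train.data <- swiss[train.idx, ]
test.data <- swiss[-train.idx, ]

# Fit a linear model on the training set only.
model <- lm(Fertility ~ ., data = train.data)

# Predict on the held-out test set and measure the error.
predictions <- predict(model, newdata = test.data)
rmse <- sqrt(mean((predictions - test.data$Fertility)^2))
mae <- mean(abs(predictions - test.data$Fertility))

# Prediction error rate: RMSE relative to the mean outcome.
error.rate <- rmse / mean(test.data$Fertility)
print(c(RMSE = rmse, MAE = mae, error.rate = error.rate))
```

Changing the seed changes which rows land in each set, which is one way to see how variable the test error can be.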
Leave-one-out cross-validation (LOOCV)
This method works as follows:
1. Leave out one data point and build the model on the rest of the
data set.
2. Test the model against the data point that was left out at step 1 and
record the test error associated with the prediction.
3. Repeat the process for all data points.
4. Compute the overall prediction error by taking the average of all
the test error estimates recorded at step 2.
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 46, 46, 46, 46, 46, 46, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.74 0.613 6.12
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The advantage of the LOOCV method is that we make use of all data points,
reducing potential bias.
However, the process is repeated as many times as there are data points,
resulting in a higher execution time when n is extremely large.
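The four LOOCV steps can be sketched by hand in base R. This is an illustrative sketch that assumes the built-in swiss data set; it is not the script that produced the output above, and it pools the held-out errors into a single RMSE and MAE.

```r
data("swiss")
n <- nrow(swiss)
errors <- numeric(n)

for (i in seq_len(n)) {
  # Step 1: leave out observation i and fit on the remaining n - 1 rows.
  fit <- lm(Fertility ~ ., data = swiss[-i, ])
  # Step 2: predict the left-out observation and record its error.
  pred <- predict(fit, newdata = swiss[i, , drop = FALSE])
  errors[i] <- swiss$Fertility[i] - pred
}  # Step 3: the loop repeats this for every data point.

# Step 4: summarise the recorded errors.
loocv.rmse <- sqrt(mean(errors^2))
loocv.mae <- mean(abs(errors))
print(c(RMSE = loocv.rmse, MAE = loocv.mae))
```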
K-fold cross-validation
The k-fold method first randomly splits the data set into k subsets, or
folds (for example, 5 subsets). It then proceeds as follows:
1. Reserve one subset and train the model on all other subsets.
2. Test the model on the reserved subset and record the prediction
error.
3. Repeat this process until each of the k subsets has served as the
test set.
4. Compute the average of the k recorded errors. This is called the
cross-validation error, and it serves as the performance metric for the
model.
## Linear Regression
##
## 47 samples
## 5 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 43, 42, 42, 41, 43, 41, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 7.38 0.751 6.03
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The process of splitting the data into k folds can be repeated a number
of times; this is called repeated k-fold cross-validation.
The final model error is taken as the mean error over the
repeats.
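A hand-rolled base-R sketch of k-fold cross-validation is given below. The swiss data set, k = 10, and the seed are assumptions chosen for illustration; the output shown earlier came from a different script, so the numbers need not match.

```r
data("swiss")
set.seed(123)  # arbitrary seed, only for reproducibility
k <- 10
n <- nrow(swiss)

# Randomly assign each row to one of k folds of nearly equal size.
folds <- sample(rep(seq_len(k), length.out = n))

fold.rmse <- numeric(k)
for (j in seq_len(k)) {
  # Reserve fold j for testing and train on all other folds.
  train.data <- swiss[folds != j, ]
  test.data <- swiss[folds == j, ]
  fit <- lm(Fertility ~ ., data = train.data)
  # Record the prediction error on the reserved fold.
  pred <- predict(fit, newdata = test.data)
  fold.rmse[j] <- sqrt(mean((pred - test.data$Fertility)^2))
}

# The cross-validation error is the average of the k recorded errors.
cv.rmse <- mean(fold.rmse)
print(cv.rmse)
```

Wrapping the fold assignment and loop in an outer loop, and averaging the resulting cv.rmse values, would give the repeated variant described above.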
4. Draw all types of graph with example in
different scripts.
Pie Chart
In R, a pie chart is created using the pie() function, which takes a vector
of positive numbers as input. Additional parameters control the
labels, colors, title, and so on.
Syntax
The basic syntax for creating a pie chart using R is −
Example
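# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Open a PNG device; the file name city.png is an assumed placeholder.
png(file = "city.png")
# Plot the pie chart, then close the device to save the file.
pie(x, labels)
dev.off()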
Bar Chart
A bar chart represents data as rectangular bars, with the length of each
bar proportional to the value of the variable. R uses the function
barplot() to create bar charts. R can draw both vertical and horizontal
bars, and each bar can be given a different color.
Syntax
The basic syntax to create a bar-chart in R is −
barplot(H,xlab,ylab,main, names.arg,col)
H is a vector or matrix containing the numeric values used in the bar chart.
xlab is the label for x axis.
ylab is the label for y axis.
main is the title of the bar chart.
names.arg is a vector of names appearing under each bar.
col is used to give colors to the bars in the graph.
Example
A simple bar chart is created using just the input vector and the name of
each bar.
The script below will create and save the bar chart in the current R
working directory.
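The script itself is not reproduced in this file, so the following is a minimal sketch of what such a script could look like. The values in H, the month labels, and the file name barchart.png are hypothetical.

```r
# Hypothetical data: one value per bar and one name per bar.
H <- c(7, 12, 28, 3, 41)
M <- c("Mar", "Apr", "May", "Jun", "Jul")

# Open a PNG device so the chart is saved in the working directory.
png(file = "barchart.png")

# Draw the bar chart with axis labels, a title and a bar color.
barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue",
        col = "blue", main = "Revenue chart")

# Close the device to write the file.
dev.off()
```

Adding horiz = TRUE to the barplot() call should draw the bars horizontally instead.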
Boxplot
A boxplot shows how the values in a data set are distributed. It
divides the data set using three quartiles and represents the
minimum, maximum, median, first quartile and third quartile of the
data. It is also useful for comparing the distribution of data across data
sets by drawing a boxplot for each of them.
Syntax
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
x is a vector or a formula.
data is the data frame.
notch is a logical value. Set it to TRUE to draw a notch.
varwidth is a logical value. Set it to TRUE to draw the width of each box
proportional to the sample size.
names are the group labels which will be printed under each
boxplot.
main is used to give a title to the graph.
Example
We use the data set "mtcars" available in the R environment to create a
basic boxplot. Let's look at the columns "mpg" and "cyl" in mtcars.
                   mpg cyl
Mazda RX4         21.0   6
Mazda RX4 Wag     21.0   6
Datsun 710        22.8   4
Hornet 4 Drive    21.4   6
Hornet Sportabout 18.7   8
Valiant           18.1   6
When we execute the above code, it produces the following result −
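The plotting script is not reproduced in this file. A minimal sketch, assuming the chart relates mpg to cyl through the formula interface and is saved under the hypothetical name boxplot.png, could be:

```r
# Select the two columns of interest from the built-in mtcars data set.
input <- mtcars[, c("mpg", "cyl")]
print(head(input))

# Open a PNG device (the file name is a placeholder).
png(file = "boxplot.png")

# One box per cylinder count, showing the distribution of mpg.
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of Cylinders",
        ylab = "Miles Per Gallon",
        main = "Mileage Data")

# Close the device to write the file.
dev.off()
```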
Histogram
A histogram represents the frequencies of the values of a variable,
bucketed into ranges. A histogram is similar to a bar chart, but the
difference is that it groups the values into continuous ranges. The height
of each bar represents the number of values present in that range.
Syntax
The basic syntax for creating a histogram using R is −
hist(v,main,xlab,xlim,ylim,breaks,col,border)
Example
A simple histogram is created using input vector, label, col and border
parameters.
The script given below will create and save the histogram in the current
R working directory.
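Since the script is not included in this file, here is a minimal sketch under stated assumptions: the data vector v, the labels, and the file name histogram.png are all hypothetical.

```r
# Hypothetical data vector.
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# Open a PNG device so the chart is saved in the working directory.
png(file = "histogram.png")

# Draw the histogram with an x-axis label, fill color and border color.
h <- hist(v, xlab = "Weight", col = "yellow", border = "blue",
          main = "Histogram of v")

# Close the device to write the file.
dev.off()

# hist() also returns the bin counts, which sum to the number of values.
print(h$counts)
```

Passing breaks, xlim or ylim to hist() would control the number of bins and the axis ranges.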
Line Graph
A line chart is a graph that connects a series of points by drawing line
segments between them. These points are ordered by one of their
coordinates (usually the x-coordinate). Line charts are usually used
to identify trends in data.
Syntax
The basic syntax to create a line chart in R is −
plot(v,type,col,xlab,ylab)
Example
A simple line chart is created using the input vector and the type
parameter set to "o" (lowercase), which draws both points and lines. The
script below will create and save a line chart in the current R working
directory.
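The script is not reproduced here, so this is a minimal sketch with a hypothetical data vector, labels, and file name (line_chart.png).

```r
# Hypothetical data vector.
v <- c(7, 12, 28, 3, 41)

# Open a PNG device so the chart is saved in the working directory.
png(file = "line_chart.png")

# type = "o" draws both the points and the lines connecting them.
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
     main = "Rain fall chart")

# Close the device to write the file.
dev.off()
```

Further series could be overlaid on the same chart by calling lines() before dev.off().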
Scatterplot
Scatterplots show many points plotted in the Cartesian plane. Each point
represents the values of two variables. One variable is chosen in the
horizontal axis and another in the vertical axis.
Syntax
The basic syntax for creating a scatterplot in R is −
main is the title of the graph.
xlab is the label in the horizontal axis.
ylab is the label in the vertical axis.
xlim is the limits of the values of x used for plotting.
ylim is the limits of the values of y used for plotting.
axes indicates whether both axes should be drawn on the plot.
Example
We use the data set "mtcars" available in the R environment to create a
basic scatterplot. Let's use the columns "wt" and "mpg" in mtcars.
                     wt  mpg
Mazda RX4         2.620 21.0
Mazda RX4 Wag     2.875 21.0
Datsun 710        2.320 22.8
Hornet 4 Drive    3.215 21.4
Hornet Sportabout 3.440 18.7
Valiant           3.460 18.1
# Use the wt and mpg columns of the built-in mtcars data set.
input <- mtcars[, c("wt", "mpg")]
# Plot the chart for cars with weight between 2.5 and 5 and mileage
# between 15 and 30.
plot(x = input$wt, y = input$mpg,
     xlab = "Weight",
     ylab = "Mileage",
     xlim = c(2.5, 5),
     ylim = c(15, 30),
     main = "Weight vs Mileage"
)
5. R script for creating Confusion Matrix and
measuring its performance.
Each row in the confusion matrix represents the predicted values and each
column represents the actual values; the convention can also be reversed.
Even though the matrices are simple, the terminology behind them can seem
complex, and there is always a chance of confusing the classes. Hence
the term: confusion matrix.
This is a two-class (binary) model that shows the distribution of predicted
and actual values of the data.
In the confusion matrix in R, the class of interest or our target class will be a
positive class and the rest will be negative.
You can express the relationship between the positive and negative classes
with the help of the 2x2 confusion matrix. It includes 4 categories: true
positives (TP), true negatives (TN), false positives (FP) and false
negatives (FN).
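# Load caret, which provides confusionMatrix().
library(caret)
# Create factor vectors holding the actual and predicted classes.
expected_value <- factor(c(1,0,1,0,1,1,1,0,0,1))
predicted_value <- factor(c(1,0,0,1,1,1,0,0,0,1))
# Build the confusion matrix.
example <- confusionMatrix(data = predicted_value, reference = expected_value)
# Display results.
example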
          Reference
Prediction 0 1
         0 3 2
         1 1 4
Accuracy : 0.7
95% CI : (0.3475, 0.9333)
No Information Rate : 0.6
P-Value [Acc > NIR] : 0.3823
Kappa : 0.4
Sensitivity : 0.7500
Specificity : 0.6667
Pos Pred Value : 0.6000
Neg Pred Value : 0.8000
Prevalence : 0.4000
Detection Rate : 0.3000
Detection Prevalence : 0.5000
Balanced Accuracy : 0.7083
'Positive' Class : 0
The success rate or the accuracy of the model can be easily calculated
using the 2x2 confusion matrix. The formula for calculating accuracy is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Here, TP, TN, FP and FN are the counts of true positives, true negatives,
false positives and false negatives. The accuracy is the number of correct
predictions divided by the total number of predictions.
Next, find the error rate, the proportion of cases that the model
predicted wrongly. The formula for error rate is:
Error rate = (FP + FN) / (TP + TN + FP + FN) = 1 - Accuracy
The error rate calculation is simple and to the point. If a model
performs at 90% accuracy, then the error rate is 10%.
A simple way to get the confusion matrix in R is to use the table()
function. Let’s see how it works.
table(expected_value,predicted_value)
              predicted_value
expected_value 0 1
             0 3 1
             1 2 4
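Using the same two vectors, accuracy and error rate can be computed directly from the table() result. This sketch relies only on base R; the variable names cm, accuracy and error.rate are illustrative.

```r
expected_value <- factor(c(1,0,1,0,1,1,1,0,0,1))
predicted_value <- factor(c(1,0,0,1,1,1,0,0,0,1))

# Cross-tabulate actual (rows) against predicted (columns) classes.
cm <- table(expected_value, predicted_value)

# Correct predictions lie on the diagonal of the matrix.
accuracy <- sum(diag(cm)) / sum(cm)
error.rate <- 1 - accuracy

print(accuracy)    # 0.7
print(error.rate)  # 0.3
```

The accuracy of 0.7 agrees with the value reported in the confusionMatrix() output earlier in this section.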
Observation:
6. Exercise with file handling package fs.
# install.packages("fs", repos = "http://cran.us.r-project.org")
library(fs)
library(stringr)
Folder exists
dir_exists(dname)
txt_files
TRUE
Files exist
file_exists(fnames)
txt_files/d.txt txt_files/e.txt
TRUE TRUE
txt_files/f.txt
TRUE
Then create a new variable holding the new file names, with .txt replaced by .crt
file_move(fnames, fnames_new)
List files
dir_ls(dname)
The fs package also covers the other common file operations; for example, a file can be copied or deleted
file_copy("txt_files/a.crt", "txt_files/aa.crt")
print("files before deleting: ")
dir_ls(dname)
file_delete("txt_files/aa.crt")
print("files after deleting: ")
dir_ls(dname)
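The setup that defines dname and fnames is not shown above. A self-contained sketch of the whole workflow, run inside a temporary directory so nothing in the working directory is touched, might look like this. The file names a.txt, b.txt and c.txt are hypothetical, and path_ext_set() from fs is used here in place of a string replacement to swap the extension.

```r
library(fs)

# Work in a temporary folder (assumption: the original used "txt_files").
dname <- path(tempdir(), "txt_files")
dir_create(dname)

# Create three empty text files with placeholder names.
fnames <- path(dname, c("a.txt", "b.txt", "c.txt"))
file_create(fnames)

# Check that the folder and the files exist.
print(dir_exists(dname))
print(file_exists(fnames))

# Build the new names by swapping the .txt extension for .crt, then rename.
fnames_new <- path_ext_set(fnames, "crt")
file_move(fnames, fnames_new)
print(dir_ls(dname))

# Copy one file, then delete the copy again.
copy_name <- path(dname, "aa.crt")
file_copy(fnames_new[1], copy_name)
file_delete(copy_name)
print(dir_ls(dname))
```

The fs functions are vectorised over paths, which is why a single file_move() call can rename all three files at once.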