Amanwit Kumar
Roll No. 004
Research and Business Analytics (2021-23)
WeSchool Bengaluru
Table of Contents
2. Assumptions
i. Histogram
3.4 Bi-Variate Analysis
4. Conclusion
1 Project Objective
In this project, exploratory data analysis is carried out on the ‘cars’ dataset to
explore the data and derive insights from it. The analysis uses descriptive statistics,
including measures of central tendency and measures of dispersion, as well as
visualization methods such as tables and charts.
2 Assumptions
2. Variable Identification
3. Univariate Analysis
4. Bi-Variate Analysis
8. Feature Exploration
Although Steps 5 and 6 are not in scope for this project, a brief description of these
steps (and of the other steps) is given, as they are important steps in the data
exploration journey.
3.1 Environment Set up and Data Import
The base installation of R comes with many useful packages as standard. These
packages contain many of the functions used on a daily basis. However, as R is used
for more diverse projects, its capabilities eventually need to be extended, and many
installable packages exist for that purpose.
This section installs the necessary packages and loads the associated libraries.
Keeping all package loading in one place improves code readability.
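A minimal sketch of this set-up step is given below. It only checks which of the packages loaded in Appendix A are already installed; the variable names (required_pkgs, is_installed, missing_pkgs) are illustrative, and the install call is left commented out on the assumption that installation should be a deliberate choice.

```r
# Packages loaded in Appendix A of this report
required_pkgs <- c("lattice", "ggplot2", "e1071", "rpivotTable")

# requireNamespace() returns TRUE when a package is installed and loadable
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Install before running the analysis: ",
          paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from CRAN
}
```

Once every package is available, the library() calls can be grouped at the top of the script, as is done in Appendix A.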
The working directory is the default location where R looks for files to load and
where it puts any files that are saved. Setting the working directory at the start of the
R session makes importing and exporting data files and code files easier. In short, the
working directory is the folder on the PC that holds the data, code and other files
related to the project.
In this project, the dataset used is an in-built dataset in R. The in-built dataset ‘cars’ is
loaded using the data() function and used for exploratory data analysis.
If a dataset is instead stored in .csv format, the read.csv() function is used to import the file.
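Both import routes can be sketched as follows. The CSV file here is a temporary file written only for illustration (csv_path and cars_from_csv are hypothetical names); in a real project the path would point to a file in the working directory.

```r
# Load the built-in dataset into the workspace
data(cars)

# Illustration of the .csv route: write the data to a temporary file,
# then read it back with read.csv()
csv_path <- tempfile(fileext = ".csv")   # hypothetical file location
write.csv(cars, csv_path, row.names = FALSE)
cars_from_csv <- read.csv(csv_path)

dim(cars_from_csv)     # 50 rows, 2 columns
names(cars_from_csv)   # "speed" "dist"
```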
Variable identification defines each variable and establishes its data type. This can
be done using the str() function, which reports the data type of each variable.
3.2.1 Variable Identification – Inferences
Using the function str() on the given dataset, the basic structure and data types can
be observed.
The dataset has 50 observations of 2 variables.
Variable   Definition                 Data Type
speed      Speed of the car (mph)     Numeric
dist       Stopping distance (ft)     Numeric
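As a complement to str(), the dimensions and the class of each column can be checked directly. This is a small sketch using base R only; the variable names n_obs and col_classes are illustrative.

```r
data(cars)

n_obs <- nrow(cars)                 # number of observations
col_classes <- sapply(cars, class)  # class of each variable

n_obs         # 50
col_classes   # speed and dist are both "numeric"
```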
Summary statistics for each variable can be calculated with the following functions.
mean(speed)
[1] 15.4
mean(dist)
[1] 42.98
median(speed)
[1] 15
median(dist)
[1] 36
Range of the variable
range(speed)
[1] 4 25
range(dist)
[1] 2 120
for speed
min(speed)
[1] 4
max(speed)
[1] 25
for distance
min(dist)
[1] 2
max(dist)
[1] 120
quantile(speed)
0% 25% 50% 75% 100%
4 12 15 19 25
quantile(dist)
0% 25% 50% 75% 100%
2 26 36 56 120
sd(speed)
[1] 5.287644
sd(dist)
[1] 25.76938
var(speed)
[1] 27.95918
var(dist)
[1] 664.0608
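The measures of dispersion above are related to one another, which the following sketch checks for speed: the variance is the square of the standard deviation, the width of the range is the maximum minus the minimum, and the interquartile range is the 75% quantile minus the 25% quantile. The variable names are illustrative, and columns are accessed as cars$speed rather than through attach().

```r
speed_var   <- var(cars$speed)
speed_sd_sq <- sd(cars$speed)^2       # equals the variance
speed_width <- diff(range(cars$speed))  # max - min
speed_iqr   <- IQR(cars$speed)          # 75% quantile - 25% quantile

all.equal(speed_var, speed_sd_sq)   # TRUE
speed_width                         # 21
speed_iqr                           # 7
```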
The str() function shows the basic structure of the data: its dimensions, the type of
each variable and the first few values. It is sometimes used as a compact alternative to
summary().
str(cars)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...
The summary() function reports the minimum, quartiles, median, mean and maximum of
each variable in a single concise output.
summary(cars)
speed dist
Min. : 4.0 Min. : 2.00
1st Qu. : 12.0 1st Qu. : 26.00
Median : 15.0 Median : 36.00
Mean : 15.4 Mean : 42.98
3rd Qu. : 19.0 3rd Qu. : 56.00
Max. : 25.0 Max. : 120.00
The term “frequency” refers to how often something occurs. The observation frequency
is the number of times a given value occurs in the data.
A frequency distribution table can be built for numeric (quantitative) data as well as
for categorical (qualitative) data. The distribution provides a glimpse of the data and
helps to identify trends.
To create a frequency table for a variable, the following syntax can be used:
table(speed)
speed
4 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 25
2 2 1 1 3 2 4 4 4 3 2 3 4 3 5 1 1 4 1
The frequency table for distance is created in the same way.
table(dist)
dist
2 4 10 14 16 17 18 20 22 24 26 28 32 34
1 1 2 1 1 1 1 2 1 1 4 2 3 3
36 40 42 46 48 50 52 54 56 60 64 66 68 70
2 2 1 2 1 1 1 2 2 1 1 1 1 1
76 80 84 85 92 93 120
1 1 1 1 1 1 1
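The frequency tables also give the statistical mode, which base R's mode() function does not compute (it returns the storage mode instead). The helper below, stat_mode(), is a hypothetical function written for this sketch; it returns every value that attains the highest frequency.

```r
# Hypothetical helper: most frequent value(s) of a numeric vector
stat_mode <- function(x) {
  freq <- table(x)
  as.numeric(names(freq)[freq == max(freq)])
}

stat_mode(cars$speed)   # 20 (occurs 5 times)
stat_mode(cars$dist)    # 26 (occurs 4 times)
```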
3.3.3 Charts
R is widely used for statistics and data analytics, and charts and graphs are its main
way of representing data graphically.
Many types of charts and graphs are available in R. For univariate analysis, plots such
as the box plot, dot chart and histogram can be used.
1. Histogram
Fig 2. Histogram of Distance
The histogram created using the base package is plain in appearance. Several other R
packages produce the same result with a better presentation.
The histograms shown below are created using the ‘lattice’ package. Here, the histogram
is coloured and shows the percent of total rather than the count. Histograms for both
speed and distance are given.
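A sketch of how such percent-of-total histograms can be produced with lattice is shown below. The titles and the object names (speed_hist, dist_hist) are assumptions made for illustration and are not taken from the original figures.

```r
library(lattice)

# type = "percent" puts percent of total on the y-axis instead of counts
speed_hist <- histogram(~ speed, data = cars, type = "percent",
                        main = "Histogram of Speed", xlab = "speed")
dist_hist  <- histogram(~ dist, data = cars, type = "percent",
                        main = "Histogram of Distance", xlab = "dist")
```

lattice functions return plot objects; typing speed_hist at the console (or calling print(speed_hist) inside a script) draws the chart.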
Fig 4. Histogram of Distance (using lattice)
The histogram can also be plotted using the ‘ggplot2’ package. In the graphs below,
ggplot2 has been used to plot the histograms for speed and distance.
Fig 6. Histogram of Distance (using ggplot2)
However, this histogram is not informative, as a bar is plotted for each individual value
of speed and distance respectively. So, in the second case, the histogram is plotted using
‘bins’. The ‘bins’ argument sets the number of intervals into which the values are
grouped. Here the number of bins is taken to be 10.
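A binned histogram of this kind can be sketched with ggplot2 as follows. The fill colour, title and the object name dist_hist_binned are assumptions for illustration; only bins = 10 comes from the text.

```r
library(ggplot2)

# bins = 10 groups the distance values into 10 intervals
dist_hist_binned <- ggplot(cars, aes(x = dist)) +
  geom_histogram(bins = 10, fill = "steelblue", colour = "white") +
  labs(title = "Histogram of Distance", x = "dist", y = "Count")
```

Replacing dist with speed in aes() gives the corresponding histogram for speed. Printing the object draws the chart.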
Fig 8. Histogram of Distance (using ggplot2)
From the above histogram, the outlier in the distance data can be seen clearly.
2. Boxplot
A box plot shows how the values in a data vector are distributed. It displays the
dataset’s five-number summary: the minimum value, the first quartile, the second
quartile (median), the third quartile and the maximum value of the data vector.
Fig 10. Boxplot of Distance
The boxplots of speed and distance can be drawn using the base package alone, but by
default it draws them vertically. For this analysis, the data is easier to explore when
the graphs are horizontal. Therefore, the ‘lattice’ package is used to get the boxplots of
the given data in horizontal format. The two packages also mark the median
differently: the ‘base’ package draws it as a line across the box, whereas the ‘lattice’
package marks it with a dot.
Fig 12. Boxplot of Distance (using ggplot2)
The ‘ggplot2’ package can also be used to plot the boxplot. Here the median is clearly
marked, and gridlines are drawn to make the values easier to read.
Fig 14. Boxplot of Distance (using ggplot2)
From the boxplots above, the median speed is 15 mph, the speeds range from 4 mph to
25 mph, and there are no outliers. In the boxplot for distance, the median is 36 ft and
the values range from 2 ft to 120 ft. Here 120 ft is an outlier, as it lies above the
upper fence of Q3 + 1.5 × IQR = 56 + 1.5 × 30 = 101 ft.
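The outlier can be verified numerically with the usual boxplot convention that a point is an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that rule to dist; the variable names are illustrative.

```r
# Quartiles of distance (26 and 56, as in the quantile() output above)
q <- unname(quantile(cars$dist, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]                 # 30

lower_fence <- q[1] - 1.5 * iqr    # -19
upper_fence <- q[2] + 1.5 * iqr    # 101

# Values outside the fences are flagged as outliers
dist_outliers <- cars$dist[cars$dist < lower_fence | cars$dist > upper_fence]
dist_outliers                      # 120
```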
Base R has no single high-level function for a kernel density plot, although
plot(density(x)) can be used. Here the ‘lattice’ package or the ‘ggplot2’ package is
used instead. In the graph given below, the ‘lattice’ package is used to draw the
kernel density plot.
Fig 15. Kernel Density Plot of Speed (using lattice)
In the above density plot, both the density curve and the corresponding data values
are shown. A kernel density plot can also be drawn using the ‘ggplot2’ package. In the
graphs shown below, ‘ggplot2’ has been used to plot the density for both speed and
distance.
Fig 17. Kernel Density Plot of Speed (using ggplot2)
There is a noticeable difference between the density plots created using the ‘lattice’
package and the ‘ggplot2’ package. In the plot created using ‘ggplot2’, the individual
data values are not shown, and the curve does not start and end at zero (0) but at a
base value.
From the kernel density plots, it can be inferred that most speeds lie between 10 and
20 mph and that stopping distances are most concentrated between about 25 and 35 ft.
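As a rough numerical cross-check of the ranges read off the density plots, the share of observations that falls inside each range can be computed directly. The range limits are taken from the text above; speed_share and dist_share are illustrative names.

```r
# Proportion of observations inside each range (limits inclusive)
speed_share <- mean(cars$speed >= 10 & cars$speed <= 20)
dist_share  <- mean(cars$dist >= 25 & cars$dist <= 35)

speed_share   # 0.74
dist_share    # 0.24
```

About three quarters of the speeds fall in the 10 to 20 range, while roughly a quarter of the distances fall in the 25 to 35 range, so the latter marks where the density peaks rather than where the bulk of the observations lie.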
4. Dotplot
A dot plot or dot chart is similar to a scatter plot. The main difference is that the dot
plot in R displays the index (each category) in the vertical axis and the corresponding
value in the horizontal axis, so it forms a horizontal line.
Each of these graphs provides a different perspective on the distribution of values for
the said variable.
In statistics, univariate analysis is the most basic type of data analysis. The important
thing to understand about univariate analysis is that only one variable is involved at a
time. While univariate analysis is simple to do and understand, it can give a misleading
picture when several factors are at play. In such situations, bivariate and multivariate
analysis should be used to analyse the data better.
2. Correlation coefficient
The results of bivariate analysis can be stored in a table. Different functions can be
used for that, such as table().
In this case, the ‘rpivotTable’ package has been used, and the output is in table
format.
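Because both variables are numeric, a plain table() of the raw values is sparse. One option, sketched below with base R only, is to group each variable into bands with cut() before cross-tabulating. The band limits are assumptions chosen for illustration and are not part of the original analysis.

```r
# Group each variable into three bands (limits chosen for illustration)
speed_band <- cut(cars$speed, breaks = c(0, 10, 20, 30))
dist_band  <- cut(cars$dist,  breaks = c(0, 40, 80, 120))

# Two-way frequency table of speed band against distance band
cross_tab <- table(speed = speed_band, dist = dist_band)
cross_tab
```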
3.4.2 Correlation Coefficient
cor(speed,dist)
[1] 0.8068949
The correlation coefficient is approximately 0.807. This value is close to 1, which
indicates a strong positive correlation between speed and distance.
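The value returned by cor() is the Pearson correlation coefficient, which is the covariance of the two variables divided by the product of their standard deviations. The sketch below reproduces it by hand; r_manual and r_builtin are illustrative names.

```r
# Pearson correlation computed from its definition
r_manual <- cov(cars$speed, cars$dist) / (sd(cars$speed) * sd(cars$dist))

# Same quantity from the built-in function
r_builtin <- cor(cars$speed, cars$dist)

all.equal(r_manual, r_builtin)   # TRUE
round(r_builtin, 3)              # 0.807
```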
3.4.3 Charts
In bi-variate analysis also, there are many charts and graphs which can be used.
1. Scatterplot
2. Scatter matrix
1. Scatter plot
A scatter plot uses dots to represent values for two different numeric variables. The
position of each dot on the horizontal and vertical axis indicates values for an
individual data point. These give a visual idea of the pattern that the variables follow.
Identification of correlation is common with scatter plots. Relationships between
variables can be described in many ways: positive or negative, strong or weak, linear
or nonlinear.
Scatterplots are available in base package. So, the first scatterplot between speed and
distance is done using base package.
Fig 22. Scatterplot
In the second case, the scatterplot is drawn using the ‘lattice’ package. The only
differences are that the ‘x’ and ‘y’ axis labels are given and, as the ‘lattice’ package is
used, the points have a different colour.
2. Scatter matrix
Here the scatter matrix is plotted for speed and distance. As there are only 2 variables,
it creates a 2 by 2 matrix of panels. The two off-diagonal panels show the same
scatterplot with the axes swapped, and the two diagonal panels label the variables.
A boxplot hides the distribution of the individual observations. This can be addressed
by adding the individual observations as dots with jittering. If the number of
observations is not too high, they can be drawn on top of the box, using jittering to
avoid overlapping dots.
Fig 24. Box – Scatter Plot (using ggplot2)
Here, the box and scatter plot draws the individual values on top of the box plot.
From this graph the trend of the data can be observed.
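A plot of this kind can be sketched with ggplot2 by layering jittered points over a boxplot. The jitter width, transparency, title and the object name box_jitter are assumptions for illustration; the outlier symbol of the boxplot is suppressed so that the outlying value is not drawn twice.

```r
library(ggplot2)

set.seed(1)  # makes the random jitter reproducible

box_jitter <- ggplot(cars, aes(x = "", y = dist)) +
  geom_boxplot(outlier.shape = NA) +                  # box without outlier dots
  geom_jitter(width = 0.15, height = 0, alpha = 0.6) + # individual observations
  labs(title = "Box - Scatter Plot", x = NULL, y = "dist")
```

Setting height = 0 jitters the points only sideways, so their distance values are not altered. Printing the object draws the chart.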
The given ‘cars’ dataset has no missing values, so there is no need to carry out the
missing value identification procedure.
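The absence of missing values can be confirmed with a short check in base R; na_per_column is an illustrative name.

```r
# Count missing values in each column
na_per_column <- colSums(is.na(cars))
na_per_column   # speed 0, dist 0

# TRUE if any value in the data frame is missing
anyNA(cars)     # FALSE
```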
Outlier identification can be done using a boxplot. Any package can be used for that,
whether the ‘base’, ‘lattice’ or ‘ggplot2’ package.
From the univariate analysis done previously, the boxplot is easiest to read when drawn
with the ‘ggplot2’ package. Boxplots are plotted for both speed and distance to
identify outliers. The boxplots show that distance has an outlier.
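The points that a boxplot would draw as outliers can also be listed without plotting, using boxplot.stats() from base R. The sketch below does this for both variables and looks up the row that holds the outlying distance; the variable names are illustrative.

```r
# Values that a boxplot would flag as outliers
speed_out <- boxplot.stats(cars$speed)$out
dist_out  <- boxplot.stats(cars$dist)$out

length(speed_out)   # 0, speed has no outliers
dist_out            # 120

# Row of the dataset that contains the outlying distance
outlier_rows <- which(cars$dist %in% dist_out)
cars[outlier_rows, ]   # row 49: speed 24, dist 120
```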
Fig 25. Boxplot of Distance to identify the outlier
4 Conclusion
In this project, exploratory data analysis (EDA) of the ‘cars’ dataset was carried out. It
was found that most cars were driven at speeds of 10 mph to 20 mph and that stopping
distances were most concentrated between 25 ft and 35 ft. The slowest speed recorded
was 4 mph and the highest was 25 mph. The shortest stopping distance was 2 ft and the
longest was 120 ft. The average speed was 15.4 mph and the average stopping distance
was 42.98 ft.
5 Appendix A
attach(cars)
# Importing libraries
library(e1071)
library(lattice)
library(rpivotTable)
library(ggplot2)
# descriptive statistics
dim(cars)
head(cars)
tail(cars)
str(cars)
summary(cars)
mean(speed)
median(speed)
mean(dist)
median(dist)
mode(speed)
mode(dist)
# mode() returns the storage mode ("numeric"), not the statistical mode
# measures of dispersion
# for speed
range(speed)
sd(speed)
var(speed)
quantile(speed)
min(speed)
max(speed)
# for distance
range(dist)
sd(dist)
var(dist)
quantile(dist)
min(dist)
max(dist)
# correlation (cor() comes from base R's stats package)
cor(speed,dist)
# e1071 package - provides skewness() and kurtosis()
library(e1071)
# skewness
skewness(speed)
skewness(dist)
# kurtosis
kurtosis(speed)
kurtosis(dist)
# uni-variate
table(speed)
table(dist)
# bi-variate
table(dist,speed)
# Visualization
# rpivotTable package
library(rpivotTable)
rpivotTable(cars)
# histogram
hist(speed)
hist(dist)
# boxplot
boxplot(speed)
boxplot(dist)
# scatterplot
plot(speed,dist)
library(lattice)
# histogram
histogram(~speed)
histogram(~dist)
# boxplot
bwplot(~speed)
bwplot(~dist)
# unlike the base boxplot, lattice marks the median with a dot
densityplot(~speed)
densityplot(~speed,
main="Density plot of Speed",
ylab="Density",xlab="Speed in mph")
# density plot for distance
densityplot(~dist)
densityplot(~dist,
main="Density plot of Distance",
ylab="Density",xlab="Distance in ft")
# dot plot
dotplot(~speed)
dotplot(~dist)
# scatter plot
xyplot(speed~dist)
xyplot(dist~speed,
main="Scatterplot",
ylab="Distance (ft)",xlab="Speed (mph)")
# scatterplot matrix
splom(cars[c(1,2)],main="scatter matrix")
# ggplot2 package
library(ggplot2)
# boxplot
qplot(speed, data = cars, geom = "boxplot")
qplot(dist, data = cars, geom = "boxplot")
# histogram