
Analytics Project Report

Amanwit Kumar
Roll No. 004
Research and Business Analytics (2021-23)
WeSchool Bengaluru

Table of Contents

1. Project Objective
2. Assumptions
3. Exploratory Data Analysis – Step by step approach
   3.1 Environment Set up and Data Import
      3.1.1 Install necessary Packages and Invoke Libraries
      3.1.2 Set up working Directory
      3.1.3 Import and Read the Dataset
   3.2 Variable Identification
      3.2.1 Variable Identification – Inferences
   3.3 Univariate Analysis
      3.3.1 Summary Statistics
      3.3.2 Frequency Table
      3.3.3 Charts
         i. Histogram
         ii. Boxplot
         iii. Kernel Density Plot
         iv. Dotplot
   3.4 Bi-Variate Analysis
      3.4.1 Pivot Table
      3.4.2 Correlation Coefficient
      3.4.3 Charts
         i. Scatter plot
         ii. Scatter matrix
         iii. Box – Scatter plot
   3.5 Missing Value Identification
   3.6 Outlier Identification
4. Conclusion
5. Appendix A – Source Code

1 Project Objective

In this project, exploratory data analysis is carried out on the given ‘cars’ dataset
to explore the data and derive insights from it. The analysis uses descriptive
statistics, including measures of central tendency and measures of dispersion,
together with visualization methods such as tables and charts.

2 Assumptions

In the project undertaken, no assumptions have been made.

3 Exploratory Data Analysis – Step by step approach


Exploratory Data Analysis (EDA) is the process of performing initial investigations on
data in order to discover patterns, spot anomalies, test hypotheses and check
assumptions with the help of summary statistics and graphical representations. It
summarizes the main characteristics of a data set, often using statistical graphics and
other data visualization methods.

A typical data exploration activity consists of the following steps:

1. Environment Set up and Data Import

2. Variable Identification

3. Univariate Analysis

4. Bi-Variate Analysis

5. Missing Value Treatment (Not in scope for our project)

6. Outlier Treatment (Not in scope for our project)

7. Variable Transformation / Feature Creation

8. Feature Exploration

We shall follow these steps in exploring the provided dataset.

Although Steps 5 and 6 (treatment) are not in scope for this project, missing values
and outliers are briefly identified in Sections 3.5 and 3.6, as these are important
steps in the data exploration journey.

3.1 Environment Set up and Data Import

3.1.1 Install necessary Packages and Invoke Libraries

The base installation of R comes with many useful packages as standard, and these
contain many of the functions used on a daily basis. For more diverse projects,
however, R’s capabilities often need to be extended, and many installable packages
are available for that purpose.

This section installs the necessary packages and loads the associated libraries.
Keeping all the library calls in one place improves code readability.

The packages used in this project include:


1. e1071
2. rpivotTable
3. lattice
4. ggplot2
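
As a minimal sketch of this step, the snippet below checks which of the four packages are missing before loading the ones that are available. The names `required`, `is_installed` and `missing_pkgs` are illustrative and not part of the original script, and the install line is left commented out so that nothing is downloaded unintentionally.

```r
# Packages used in this project
required <- c("e1071", "rpivotTable", "lattice", "ggplot2")

# requireNamespace() returns FALSE (instead of stopping) when a package is absent
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)   # uncomment to install from CRAN
}

# Attach the packages that are available
for (pkg in required[is_installed]) {
  library(pkg, character.only = TRUE)
}
```

Checking before installing avoids re-downloading packages every time the script is run.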

3.1.2 Set up working Directory

The working directory is the default location where R looks for files to load and
where it puts any files that are saved. Setting the working directory at the start of
an R session makes importing and exporting data files and code files easier. In short,
the working directory is the folder on the PC that holds the data, code and other
files related to the project.

Please refer to Appendix A for the source code.
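
The following sketch illustrates how the working directory can be set, inspected and restored. The folder used here is R's temporary directory, chosen only so that the example runs on any machine; `project_dir` and `old_dir` are hypothetical names, and in practice `project_dir` would be the project folder.

```r
# Hypothetical project folder; tempdir() is used only so the example runs anywhere
project_dir <- tempdir()

old_dir <- setwd(project_dir)   # setwd() returns the previous directory invisibly
getwd()                         # confirm the current working directory
list.files()                    # files R can now load without a full path

setwd(old_dir)                  # restore the original directory
```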

3.1.3 Import and Read the Dataset

The dataset used in this project is ‘cars’, a dataset built into R. It is loaded using
the data() function and then used for exploratory data analysis. If a dataset is
instead stored in .csv format, the read.csv() function is used to import the file.

Please refer to Appendix A for the source code.
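
A short sketch of both import routes is given below. Since the project has no .csv file of its own, the example first writes ‘cars’ to a temporary file and reads it back; `csv_path` and `cars_from_csv` are illustrative names, not part of the original script.

```r
# Route 1: load the built-in dataset
data(cars)
head(cars, 3)    # first three observations

# Route 2: read a .csv file (a temporary copy of cars, created for illustration)
csv_path <- tempfile(fileext = ".csv")
write.csv(cars, csv_path, row.names = FALSE)

cars_from_csv <- read.csv(csv_path)
dim(cars_from_csv)   # same number of rows and columns as the built-in dataset
```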

3.2 Variable Identification

Variable identification is used to define the variable and to know the data type of each
variable. This can be done using the str() function, which gives the data type of each
variable.

3.2.1 Variable Identification – Inferences

Using the function str() on the given dataset, the basic structure and data types can
be observed.
There are 2 variables with 50 observations each.

Variable   Definition                      Data Type
speed      Speed of the car (mph)          Numeric
dist       Distance taken to stop (ft)     Numeric
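
The same inference can be reproduced with a few base R calls, as in the sketch below. The name `var_types` is an illustrative choice.

```r
data(cars)

dim(cars)     # number of observations and number of variables
names(cars)   # variable names

# Data type of each variable
var_types <- sapply(cars, class)
var_types
```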

3.3 Univariate Analysis

The term “univariate analysis” refers to the analysis of a single variable. It is a
fundamental statistical data analysis technique: the data comprise only one variable,
so no cause-and-effect relationship has to be considered.

Univariate analysis on a single variable can be done in three ways:

1. Summary statistics - Determines the value’s centre and spread.

2. Frequency table - This shows how frequently various values occur.

3. Charts - A visual representation of the distribution of values.

3.3.1 Summary Statistics

R provides a function for each of the common summary statistics. The variables are
referred to directly by name because the dataset has been attached (see Appendix A).

Let’s start with the mean of each variable:

mean(speed)
[1] 15.4
mean(dist)
[1] 42.98

Now we can find out the median of the data

median(speed)
[1] 15
median(dist)
[1] 36

Range of the variable

range(speed)
[1] 4 25
range(dist)
[1] 2 120

Also, we can verify it by using min and max functions

for speed
min(speed)
[1] 4
max(speed)
[1] 25

for distance
min(dist)
[1] 2
max(dist)
[1] 120

The quartiles can also be obtained:

quantile(speed)
  0%  25%  50%  75% 100%
   4   12   15   19   25
quantile(dist)
  0%  25%  50%  75% 100%
   2   26   36   56  120

The standard deviation measures how widely the values of a continuous variable are
spread around the mean.

sd(speed)
[1] 5.287644
sd(dist)
[1] 25.76938

We can also find the variance

var(speed)
[1] 27.95918
var(dist)
[1] 664.0608

The structure function str() shows the basic structure of the data. It is sometimes
used alongside summary() to get a quick overview.

str(cars)
'data.frame': 50 obs. of 2 variables:
$ speed: num 4 4 7 7 8 9 10 10 10 11 ...
$ dist : num 2 10 4 22 16 10 18 26 34 17 ...

summary() gives a concise overview that combines several of the above statistics: the
minimum, quartiles, median, mean and maximum of each variable.

summary(cars)

     speed           dist
 Min.   : 4.0   Min.   :  2.00
 1st Qu.:12.0   1st Qu.: 26.00
 Median :15.0   Median : 36.00
 Mean   :15.4   Mean   : 42.98
 3rd Qu.:19.0   3rd Qu.: 56.00
 Max.   :25.0   Max.   :120.00
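
As a compact alternative to calling each function separately, the statistics above can be collected into one table. This is only a sketch: the helper `describe` and the object `stats_table` are illustrative names that do not appear in the original script.

```r
data(cars)

# Hypothetical helper: the summary statistics of one numeric vector
describe <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x),
    var = var(x), min = min(x), max = max(x))
}

# Apply the helper to every column; the result has one column per variable
stats_table <- sapply(cars, describe)
round(stats_table, 2)
```

Each column of `stats_table` corresponds to one variable, so the figures for speed and dist can be compared side by side.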

3.3.2 Frequency table

The term “frequency” refers to how often something occurs. The frequency of an
observation is the number of times that value occurs in the data.

A frequency distribution table can be built for numeric (quantitative) as well as
categorical (qualitative) data. The distribution provides a glimpse of the data and
allows trends to be identified.

To create a frequency table for our variable, we can use the following syntax:

table(speed)
speed
4 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 25
2 2 1 1 3 2 4 4 4 3 2 3 4 3 5 1 1 4 1

The output can be read as follows:

The value 4 occurs 2 times
The value 8 occurs 1 time
The value 12 occurs 4 times
And so on.

Similarly, we can say the same for the frequency table of distance.

table(dist)
dist
2 4 10 14 16 17 18 20 22 24 26 28 32 34
1 1 2 1 1 1 1 2 1 1 4 2 3 3

36 40 42 46 48 50 52 54 56 60 64 66 68 70
2 2 1 2 1 1 1 2 2 1 1 1 1 1

76 80 84 85 92 93 120
1 1 1 1 1 1 1
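
Because dist takes 35 distinct values, most of them occurring only once or twice, a grouped frequency table can be easier to read. The sketch below groups the values with cut(); the class widths (5 for speed, 20 for dist) are arbitrary choices made for illustration, and all object names are hypothetical.

```r
data(cars)

# Group speed into classes of width 5: (0,5], (5,10], ..., (20,25]
speed_groups <- cut(cars$speed, breaks = seq(0, 25, by = 5))
speed_freq <- table(speed_groups)
speed_freq

# Group distance into classes of width 20: (0,20], (20,40], ..., (100,120]
dist_groups <- cut(cars$dist, breaks = seq(0, 120, by = 20))
dist_freq <- table(dist_groups)
dist_freq

# The most frequently occurring individual speed value
most_common_speed <- names(which.max(table(cars$speed)))
most_common_speed   # "20", which occurs 5 times
```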

3.3.3 Charts

R is widely used for statistics and data analytics, and it offers many charts and
graphs for representing data graphically. For univariate analysis we can use a number
of plots such as the histogram, box plot, kernel density plot and dot plot.

1. Histogram

A histogram is a graph whose bars represent the frequency of grouped data in a vector.
It resembles a bar chart, the difference being that a histogram shows the frequency of
grouped (binned) values rather than the individual values themselves. The histograms
presented below are drawn using the base package.

Fig 1. Histogram of Speed
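
A minimal sketch of the base-package histogram is shown below. hist() also returns the class boundaries and counts it used, which makes the grouping explicit; the title, axis label and colour are illustrative choices.

```r
data(cars)

# Draw the histogram and keep the object that hist() returns
h <- hist(cars$speed,
          main = "Histogram of Speed",
          xlab = "speed",
          col  = "grey80")

h$breaks   # class boundaries chosen by hist()
h$counts   # number of observations in each class
```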

Fig 2. Histogram of Distance

The histogram created using the base package in R is rather plain. Several other R
packages give the same result with a better presentation. The histograms shown below
are created using the ‘lattice’ package. Here the bars are coloured and show the
percent of total rather than the count. Histograms for both speed and distance are
given.

Fig 3. Histogram of Speed (using lattice)

Fig 4. Histogram of Distance (using lattice)

The histogram can also be plotted using ‘ggplot2’ package. In the below given graph,
ggplot2 package has been used to plot the histogram for speed and distance.

Fig 5. Histogram of Speed (using ggplot2)

Fig 6. Histogram of Distance (using ggplot2)

This histogram does not look good because, by default, ggplot2 divides the data into
30 bins, which is too many for only 50 observations, so many bars hold just one or two
values. In the second case the histogram is therefore plotted with the ‘bins’
argument, which sets the number of bins into which the values are grouped. Here the
number of bins is taken to be 10.
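
The sketch below contrasts the two ways of controlling the grouping in ggplot2: `bins` fixes the number of bins, whereas `binwidth` fixes the width of each bin. It uses the ggplot() interface rather than the qplot() calls of Appendix A, and the object names `p_bins` and `p_width` are illustrative.

```r
library(ggplot2)
data(cars)

# bins = 10: the range of dist is divided into 10 bins of equal width
p_bins <- ggplot(cars, aes(x = dist)) +
  geom_histogram(bins = 10, colour = "white") +
  labs(title = "Histogram of Distance (10 bins)", x = "dist", y = "count")

# binwidth = 10: every bin is 10 units wide, however many bins that takes
p_width <- ggplot(cars, aes(x = dist)) +
  geom_histogram(binwidth = 10, colour = "white") +
  labs(title = "Histogram of Distance (bin width 10)", x = "dist", y = "count")

print(p_bins)
print(p_width)
```

Whichever argument is used, every observation falls into exactly one bin, so the bar heights add up to 50.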

Fig 7. Histogram of Speed (using ggplot2)

Fig 8. Histogram of Distance (using ggplot2)

From the above histogram of distance, the outlier in the data can clearly be seen.

2. Boxplot

A box plot shows how the data in a vector is distributed by displaying its five-number
summary: the minimum, the first quartile, the second quartile (median), the third
quartile and the maximum.

Fig 9. Boxplot of Speed

Fig 10. Boxplot of Distance

The boxplots of speed and distance can be drawn using the base package alone, but by
default it draws them vertically (a horizontal layout requires the extra argument
horizontal = TRUE). For data analysis, it is easier to explore the data if the graphs
are in horizontal alignment, so the ‘lattice’ package, which draws boxplots
horizontally by default, is used next. The two packages also mark the median
differently: the base package draws it as a line inside the box, whereas ‘lattice’
marks it with a dot.

Fig 11. Boxplot of Speed (using lattice)

Fig 12. Boxplot of Distance (using lattice)

The ‘ggplot2’ package can also be used to plot the boxplot. Here the median is clearly
marked by a line, and gridlines are drawn for easier reading of the values.

Fig 13. Boxplot of Speed (using ggplot2)

Fig 14. Boxplot of Distance (using ggplot2)

From the above boxplots we can infer that the median speed is 15 mph, that the speed
ranges from 4 mph to 25 mph, and that there are no outliers. (In R’s ‘cars’ dataset,
speed is recorded in mph and dist is the stopping distance in ft.) In the boxplot for
distance, the median is 36 ft and the range is from 2 ft to 120 ft. Here 120 ft is an
outlier, as it lies more than 1.5 times the interquartile range above the third
quartile (56 + 1.5 × 30 = 101).
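
The five values that a boxplot displays can be checked numerically with fivenum(). The sketch below does so and also draws a horizontal boxplot with the base package; the object names, title and axis label are illustrative.

```r
data(cars)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum
five_speed <- fivenum(cars$speed)
five_dist  <- fivenum(cars$dist)
five_speed   # 4 12 15 19 25
five_dist    # 2 26 36 56 120

# Base-package boxplot drawn horizontally
boxplot(cars$dist,
        horizontal = TRUE,
        main = "Boxplot of Distance",
        xlab = "dist")
```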

3. Kernel Density Plot

The distribution of values in a dataset can be represented by a density curve. It is
especially useful for viewing a distribution’s “shape”, such as whether the
distribution contains one or more “peaks” of frequently occurring values and whether
it is skewed to the left or right.

The base graphics package has no dedicated kernel density plot function (although the
result of density() can be passed to plot()), so the ‘lattice’ or ‘ggplot2’ package is
used here to get the density plot. In the graphs below, the ‘lattice’ package is used.
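
For comparison, a kernel density estimate can also be computed with the density() function from the stats package that ships with R, and the result drawn with plot(). The sketch below is not part of the original script; the object names and labels are illustrative.

```r
data(cars)

# Kernel density estimate with the default Gaussian kernel and bandwidth
d_speed <- density(cars$speed)

plot(d_speed, main = "Kernel Density Plot of Speed", xlab = "speed")
rug(cars$speed)   # mark the individual observations along the x axis

# The area under a density curve should be close to 1
area <- sum(d_speed$y) * diff(d_speed$x[1:2])
area
```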

Fig 15. Kernel Density Plot of Speed (using lattice)

Fig 16. Kernel Density Plot of Distance (using lattice)

In the above density plots, both the density curve and the individual data values
(drawn as points along the baseline) are shown. A kernel density plot can also be
drawn using the ‘ggplot2’ package, as in the graphs below for both speed and distance.

Fig 17. Kernel Density Plot of Speed (using ggplot2)

Fig 18. Kernel Density Plot of Distance (using ggplot2)

There is a noticeable difference between the density plots created using ‘lattice’ and
‘ggplot2’. In the ‘ggplot2’ plot, the individual data values are not shown, and the
curve does not taper down to zero at either end; it starts and ends at a base value
because it is drawn only over the range of the data.

From the kernel density plots we can infer that most of the cars travel at a speed
between 10 and 20 mph and that the stopping distance for most of them is between 25
and 35 ft.

4. Dotplot

A dot plot or dot chart is similar to a scatter plot. The main difference is that the dot
plot in R displays the index (each category) in the vertical axis and the corresponding
value in the horizontal axis, so it forms a horizontal line.

Fig 19. Dotplot of Speed

Fig 20. Dotplot of Distance

Each of these graphs provides a different perspective on the distribution of values
for the variable concerned.

In statistics, univariate analysis is the most basic type of data analysis. The
important thing to understand is that only one variable is involved at a time. While
univariate analysis is simple to do and understand, it can give a misleading picture
when several factors interact. In such situations, bivariate and multivariate analysis
should be used to analyse the data better.

3.4 Bi-Variate Analysis


The term bivariate analysis refers to the analysis of two variables. Basically, it is used to
understand the relationship between two variables.

Bi-variate analysis can be done using various methods like

1. Bivariate / Pivot table

2. Correlation coefficient

3. Charts and graphs

3.4.1 Bivariate / Pivot table

The results of a bivariate analysis can be stored in a table. Different functions can
be used for this, such as table(). In this case, the ‘rpivotTable’ package has been
used, and the output is an interactive pivot table.
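
As a static counterpart to the pivot table, a two-way table can be built with table() after grouping both variables. This is only a sketch: the class boundaries and the names `speed_band`, `dist_band` and `cross_tab` are arbitrary choices made for illustration.

```r
data(cars)

# Group each variable into three bands
speed_band <- cut(cars$speed, breaks = c(0, 10, 20, 30))
dist_band  <- cut(cars$dist,  breaks = c(0, 40, 80, 120))

# Rows are speed bands, columns are distance bands
cross_tab <- table(speed_band, dist_band)
cross_tab

# Row and column totals
addmargins(cross_tab)
```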

Fig 21. Pivot Table

3.4.2 Correlation Coefficient

A correlation coefficient quantifies the strength and direction of the linear
relationship between two variables. We can use the cor() function in R to calculate
the correlation coefficient between two variables; by default it returns the Pearson
coefficient.

cor(speed,dist)
[1] 0.8068949

The correlation coefficient turns out to be approximately 0.807. This value is close
to 1, which indicates a strong positive correlation between speed and distance.
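
The Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. As a quick check, the sketch below computes it by hand and compares it with cor(); the names `r_manual` and `r_builtin` are illustrative.

```r
data(cars)

# Pearson correlation: covariance scaled by both standard deviations
r_manual <- cov(cars$speed, cars$dist) / (sd(cars$speed) * sd(cars$dist))

# Built-in calculation for comparison
r_builtin <- cor(cars$speed, cars$dist)

all.equal(r_manual, r_builtin)   # TRUE
round(r_manual, 3)               # 0.807
```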

3.4.3 Charts

In bi-variate analysis also, there are many charts and graphs which can be used.

The charts used in this project include:

1. Scatterplot

2. Scatter matrix

3. Box - Scatter plot

1. Scatter plot

A scatter plot uses dots to represent values for two different numeric variables. The
position of each dot on the horizontal and vertical axis indicates values for an
individual data point. These give a visual idea of the pattern that the variables follow.
Identification of correlation is common with scatter plots. Relationships between
variables can be described in many ways: positive or negative, strong or weak, linear
or nonlinear.
Scatterplots are available in the base package, so the first scatterplot between speed
and distance is drawn using it.

Fig 22. Scatterplot

In the second case scatterplot is done using ‘lattice’ package. The only difference is
that in the second case ‘x’ and ‘y’ labels are given and as ‘lattice’ package is used, the
values have a different colour.

Fig 22. Scatterplot (using lattice)

2. Scatter matrix

A scatterplot matrix is a collection of scatterplots organized into a matrix, in which
each scatterplot shows the relationship between a pair of variables. It is an
excellent way of visualizing the pairwise relationships among several variables and
gives a quick, rough idea of the linear correlation between them.

Fig 23. Scatter Matrix (using lattice)

Here the scatter matrix is plotted for speed and distance. As there are only 2
variables, it creates a 2 by 2 matrix. The two off-diagonal panels show the same
scatterplot with the axes swapped (they are transposes of each other), while the two
diagonal panels carry the variable names and axis scales.

3. Box – Scatter plot

A boxplot on its own hides the individual observations behind each box. This can be
addressed by adding the individual observations as dots on top of the box, using
jittering to avoid overlap between dots. This works well when the number of
observations is not too high.

Fig 24. Box – Scatter Plot (using ggplot2)

Here, the box and scatter plot graph plots the individual values of scatter plot on top
of box plot. From this graph the trend of the data can be observed.

3.5 Missing Value Identification

In the given ‘cars’ dataset, there is no missing value. So, there is no need to proceed
with identifying missing value procedure.
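
The absence of missing values can be confirmed with a short check such as the sketch below. The name `na_per_column` is an illustrative choice.

```r
data(cars)

# Number of missing values in each variable
na_per_column <- colSums(is.na(cars))
na_per_column

anyNA(cars)                  # FALSE when there is no missing value
sum(complete.cases(cars))    # number of rows without any missing value
```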

3.6 Outlier Identification

Outlier identification can be done using boxplot. For that, any package can be used
whether it be the ‘base’ package, ‘lattice’ package or ‘ggplot2’ package.

From the uni-variate data analysis done previously, it is observed that using ‘ggplot2’
package, the boxplot analysis is easier. Boxplot is plotted for both speed and distance
to identify the outlier. From the boxplot it is observed that, distance has an outlier.

Fig 25. Boxplot of Distance to identify the outlier

In the boxplot, the outlier is found to be 120 ft (the ‘dist’ variable in ‘cars’ is
the stopping distance in feet).
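
The boxplot flags a point as an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies this rule by hand and compares the result with boxplot.stats(); the names `lower_fence`, `upper_fence` and `outliers` are illustrative.

```r
data(cars)

# Quartiles and interquartile range of distance
q   <- quantile(cars$dist, c(0.25, 0.75))   # 26 and 56
iqr <- IQR(cars$dist)                       # 30

# Fences: 1.5 * IQR below the first and above the third quartile
lower_fence <- q[[1]] - 1.5 * iqr   # -19
upper_fence <- q[[2]] + 1.5 * iqr   # 101

# Observations outside the fences
outliers <- cars$dist[cars$dist < lower_fence | cars$dist > upper_fence]
outliers                          # 120

# The same rule as applied by the boxplot functions
boxplot.stats(cars$dist)$out      # 120
boxplot.stats(cars$speed)$out     # numeric(0): speed has no outliers
```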

4 Conclusion
In this project, exploratory data analysis (EDA) of the ‘cars’ dataset was carried out.
In this dataset, speed is recorded in mph and dist is the stopping distance in ft. It
was found that most of the cars travel at a speed of 10 mph to 20 mph and most of them
have a stopping distance of 25 ft to 35 ft. The lowest speed recorded was 4 mph and
the highest was 25 mph. The shortest stopping distance is 2 ft and the longest is
120 ft, which was identified as an outlier. The average speed is 15.4 mph and the
average stopping distance is 42.98 ft. Speed and distance are strongly positively
correlated (r ≈ 0.807).

5 Appendix A

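This section contains the R script used in this project.

# ***** Exploratory Data Analysis (EDA) *****

# 1. set the working directory
# 2. import the data
# 3. run the attach() function

# code to set up working directory
setwd("D:/WESCHOOL/Tri II/R & Python/R Studio")

# Importing in-built dataset from R
data()       # lists the datasets available in the loaded packages
data(cars)   # loads the 'cars' dataset
View(cars)   # opens the data viewer (interactive sessions only)

# attach() lets the columns be referred to as speed and dist
attach(cars)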

# Importing libraries
library(e1071)
library(lattice)
library(rpivotTable)
library(ggplot2)

# descriptive statistics
dim(cars)
head(cars)
tail(cars)
str(cars)
summary(cars)

# measures of central tendency

mean(speed)
median(speed)

mean(dist)
median(dist)

mode(speed)
mode(dist)
# mode() does not compute the statistical mode; it returns the
# storage mode of the variable ("numeric" here)

# measures of dispersion

# for speed

range(speed)
sd(speed)
var(speed)
quantile(speed)
min(speed)
max(speed)

# for distance

range(dist)
sd(dist)
var(dist)
quantile(dist)
min(dist)
max(dist)

# correlation coefficient (base R)
cor(speed,dist)

# e1071 package - provides skewness() and kurtosis()
library(e1071)

# skewness
skewness(speed)
skewness(dist)

# kurtosis
kurtosis(speed)
kurtosis(dist)

# uni-variate
table(speed)
table(dist)

# bi-variate
table(dist,speed)

# Visualization

# rpivotTable package

library(rpivotTable)
rpivotTable(cars)

# histogram
hist(speed)
hist(dist)

# boxplot
boxplot(speed)
boxplot(dist)

# scatterplot
plot(speed,dist)

# lattice package - for visualization purpose

library(lattice)

# histogram
histogram(~speed)
histogram(~dist)

# boxplot
bwplot(~speed)
bwplot(~dist)
# drawn horizontally by default; the median is marked with a dot

# kernel density plot

# density plot for speed

densityplot(~speed)

densityplot(~speed,
            main="Density plot of Speed",
            ylab="Density",xlab="Speed in mph")

# density plot for distance

densityplot(~dist)

densityplot(~dist,
            main="Density plot of Distance",
            ylab="Density",xlab="Distance in ft")

# dot plot
dotplot(~speed)
dotplot(~dist)

# scatter plot

xyplot(speed~dist)

xyplot(dist~speed,
       main="Scatterplot",
       ylab="Distance (ft)",xlab="Speed (mph)")

# scatterplot matrix
splom(cars[c(1,2)],main="scatter matrix")

# ggplot2 package

library(ggplot2)

# kernel density plot

qplot(speed, data = cars, geom = "density")
qplot(dist, data = cars, geom = "density")

qplot(speed, data = cars, geom = "density",
      main = "Kernel Density Plot",
      ylab = "Density", xlab = "Speed (mph)")

qplot(dist, data = cars, geom = "density",
      main = "Kernel Density Plot",
      ylab = "Density", xlab = "Distance (ft)")

# boxplot
qplot(speed, data = cars, geom = "boxplot")
qplot(dist, data = cars, geom = "boxplot")

# displays the combination of scatter plot and box plot
qplot(speed, dist, data = cars, geom=c("boxplot","jitter"))

# histogram (default: 30 bins)
qplot(speed, data = cars, geom = "histogram")
qplot(dist, data = cars, geom = "histogram")

# bins=10 divides the x axis into 10 bins of equal width
# and then counts the number of observations in each bin
qplot(speed, data = cars, geom = "histogram", bins=10)
qplot(dist, data = cars, geom = "histogram", bins=10)
