
Indira Institute of Management, Pune

MBA Semester 2 (2021-23)


Advance Statistical Methods CCA-1

Assignment Report Topic

Exploratory Data Analysis (EDA)

Submitted By

MK Grp 1

Ajay Aynile MK B 1
Akshay Rathod MK B 4
Ankita Wadhe MK B 6
Arpan Shah MK B 7
Arpita Pojge MK B 8

Under the Guidance of

Prof. Dr. Punam Bhoyar


Exploratory Data Analysis (EDA):
Exploratory Data Analysis (EDA) is a statistical approach to analyzing data sets in order to
summarize their main characteristics, generally with the help of visual aids. EDA can be used
to learn about the following aspects of the data:

• The main characteristics or features of the data.
• The variables and their relationships.
• The important variables that can be used in our problem.

EDA is an iterative approach that includes:

• Generating questions about our data.
• Searching for the answers by using visualization, transformation, and modeling of our
data.
• Using the lessons that we learn to refine our set of questions or to generate a
new set of questions.

About the dataset:

The dataset contains 5 variables: TV, Radio, Social Media, Influencer and Sales. It
represents the company's spending on marketing campaigns through TV, Radio, Social Media
and Influencers, and the amount of Sales generated through these campaigns.

Objective:
To perform Exploratory Data Analysis and data cleaning on this dataset.

Analysis tool:
The analysis is done using RStudio.

Exploratory Data Analysis and Data Cleaning:

STEP 1:

Setting the working directory to do the analysis on the selected dataset. The functions used for
this are getwd() and setwd(). The code for the same is as follows:

Code:

STEP 2:

The dimensions of the dataset were checked using the function dim() to find the number of
rows (observations) and columns (variables). The dataset was displayed using the View()
function, and the str() function was used to get its structure. The dataset was then checked
for missing values using is.na(), and the total number of missing values was found using
sum(is.na()).
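
The checks described in this step can be tried on a small data frame. The following is a minimal sketch, not the report's own code: the data frame demo and all of its values are hypothetical stand-ins for the marketing dataset, and colSums(is.na()) is added here as an assumption about what a reader may want next, namely the missing count per variable.

```r
# Hypothetical stand-in for the marketing dataset; all values are made up.
demo <- data.frame(
  TV           = c(16L, 13L, NA, 83L, 15L),
  Radio        = c(6.57, 9.24, 15.89, NA, 8.44),
  Social.Media = c(2.91, 2.41, 2.91, 6.92, 1.41),
  Influencer   = c("Mega", "Mega", "Macro", "Nano", "Micro"),
  Sales        = c(54.73, 46.68, 150.18, 298.25, NA),
  stringsAsFactors = FALSE
)

dim(demo)   # number of rows (observations) and columns (variables)
str(demo)   # storage type of each column

# Total number of missing cells in the whole data frame
total_missing <- sum(is.na(demo))

# Missing cells counted separately for each column
missing_by_column <- colSums(is.na(demo))

print(total_missing)
print(missing_by_column)
```

Counting per column shows which variables will need imputation later, which the single total cannot do.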

Code:

Result:

The above results show that there are a total of 4572 observations and 5 variables. The
variables Radio, Social Media and Sales have numeric values, while TV has integer values
and Influencer has character values. The dataset also has 26 missing values as per the
results obtained.

STEP 3:

Plotting a scatter plot for each pair of variables to check whether the direction of the
relationship is positive or negative. The variable Influencer was excluded as it has
character values. The scatter plots were drawn using the function plot().

Code:

Result:

The above scatter plots show that all pairs of variables have a positive direction, which
indicates a positive relationship between them. Also, from the plot we can say that the
relationship between TV and Sales is strong.
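
Excluding the non-numeric variable before plotting can also be done by type rather than by column position. This is a minimal sketch under assumptions: the data frame demo and its values are hypothetical, and selecting columns with sapply(demo, is.numeric) is one possible approach, not necessarily the one used in the report.

```r
# Hypothetical data; the values are made up for illustration only.
demo <- data.frame(
  TV         = c(10, 20, 30, 40, 50),
  Radio      = c(3, 7, 9, 14, 16),
  Influencer = c("Mega", "Micro", "Nano", "Macro", "Mega"),
  Sales      = c(35, 72, 105, 143, 178),
  stringsAsFactors = FALSE
)

# Keep only the columns that hold numbers, whatever their position
numeric_cols <- sapply(demo, is.numeric)
demo_numeric <- demo[, numeric_cols, drop = FALSE]

# plot() on a data frame with several numeric columns draws a
# scatter plot matrix, one panel for each pair of variables
plot(demo_numeric)
```

Selecting by type keeps working if the column order of the file changes.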

STEP 4:

A boxplot of each variable was plotted to check whether there are outliers in its
observations. The boxplots were plotted using the function boxplot().

Code:

Result:

From the above results it can be observed that the observations of the variables Social Media
and Radio contain outliers, while the observations of the variables Sales and TV have no
outliers.
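
The outliers that a boxplot draws as separate points can also be listed numerically. The sketch below uses boxplot.stats() from base R, which applies the same 1.5 times IQR rule as boxplot() by default. The vectors radio and tv are hypothetical values, not taken from the dataset.

```r
# Hypothetical spending values; 48 is far above the rest.
radio <- c(2, 4, 5, 5, 6, 7, 8, 9, 10, 48)

# Hypothetical values that are evenly spread, with no extreme points.
tv <- c(10, 25, 40, 55, 70, 85, 100)

# $out holds the points lying beyond the whiskers of the boxplot
radio_outliers <- boxplot.stats(radio)$out
tv_outliers    <- boxplot.stats(tv)$out

print(radio_outliers)
print(length(tv_outliers))
```

A variable with a non-empty $out is a candidate for median rather than mean imputation.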

STEP 5:

Cleaning the data by replacing the missing values of each variable with the mean or median
of that variable's observations. As the boxplots showed that the variables Social Media and
Radio contain outliers, the missing values in these variables are replaced by their
respective medians, which are less affected by outliers. The missing values in TV and Sales,
which have no outliers, are replaced by their respective means. The functions used for mean
and median are mean() and median() respectively.

Code:

The syntax used for replacing the values is as follows:

For mean –

dataset_name$variable1_name[is.na(dataset_name$variable1_name)]=mean(dataset_name$variable1_name, na.rm=T)

For median –

dataset_name$variable1_name[is.na(dataset_name$variable1_name)]=median(dataset_name$variable1_name, na.rm=T)
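
The replacement pattern above can be wrapped in a small helper so that the same logic is not retyped for every variable. This is a minimal sketch, not the code used in the report: the function name impute_missing and the sample vectors are hypothetical.

```r
# Replace NA values in a numeric vector with its mean or its median.
# impute_missing is a hypothetical helper written for illustration.
impute_missing <- function(x, method = c("mean", "median")) {
  method <- match.arg(method)
  fill <- if (method == "mean") {
    mean(x, na.rm = TRUE)
  } else {
    median(x, na.rm = TRUE)
  }
  x[is.na(x)] <- fill
  x
}

tv    <- c(10, 20, NA, 30)       # no extreme values: use the mean
radio <- c(1, 2, NA, 3, 100)     # 100 is an outlier: use the median

tv_filled    <- impute_missing(tv, "mean")
radio_filled <- impute_missing(radio, "median")

print(tv_filled)
print(radio_filled)
```

In the radio example the mean of the observed values is 26.5 because of the single outlier, while the median is 2.5, which is why the median is the safer fill value when outliers are present.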

STEP 6:

Plotting the histogram to check the frequency distribution of each variable. This was done
using the function hist().

Code:

Result:

From the above charts it can be observed that the variables Sales and TV have a roughly
symmetric distribution that is nearly flat. For the variables Social Media and Radio, the
distribution has a longer tail on the right side, which shows that both variables have a
right-skewed (positively skewed) distribution.
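
The shape read from a histogram can be cross-checked with numbers. The sketch below calls hist() with plot = FALSE to obtain the bin counts without drawing, and compares mean and median as a quick indication of right skew. The vector spend and the chosen breaks are hypothetical.

```r
# Hypothetical right-skewed spending values
spend <- c(1, 1, 2, 2, 2, 3, 3, 4, 6, 12)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(spend, breaks = seq(0, 12, by = 3), plot = FALSE)

print(h$breaks)   # bin edges
print(h$counts)   # number of observations per bin

# In a right-skewed sample like this one, the long right tail
# pulls the mean above the median
spend_mean   <- mean(spend)
spend_median <- median(spend)
print(c(mean = spend_mean, median = spend_median))
```

Most observations fall in the first bin and the counts thin out to the right, which matches the description of a positively skewed distribution.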

STEP 7:

Finding the correlation between the variables and the level of correlation between them i.e.,
strong, moderate or weak. The correlation was found out using the function cor().

Code:

Result:

The correlation coefficient varies between -1 and 1, and its strength is judged by its
absolute value: values in the range 0 - 0.5 indicate weak correlation, 0.5 - 0.85 moderate
correlation and 0.85 - 1 strong correlation. Based on this and the above results, it can be
interpreted that TV has a strong correlation with Radio and Sales and a moderate correlation
with Social Media. Radio has a strong correlation with Sales and a moderate correlation with
Social Media. Sales therefore has a strong correlation with TV and Radio, and a moderate
correlation with Social Media. The strength of these correlations indicates which
relationships are worth testing for significance in further statistical analysis.
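
The classification rule above can be applied to a whole correlation matrix at once. This is a minimal sketch under assumptions: the data frame demo and its values are hypothetical, classify_strength is a hypothetical helper, and treating exactly 0.5 as moderate and exactly 0.85 as strong is a choice made here because the text does not say how boundary values are handled.

```r
# Hypothetical data; the values are made up for illustration only.
demo <- data.frame(
  TV           = c(10, 20, 30, 40, 50, 60),
  Radio        = c(4, 6, 11, 13, 18, 19),
  Social.Media = c(3, 1, 4, 2, 6, 3),
  Sales        = c(36, 70, 108, 141, 179, 212)
)

# Pearson correlation matrix; use = "complete.obs" drops rows with NA
r <- cor(demo, use = "complete.obs")

# Label each coefficient by the absolute-value thresholds from the text
classify_strength <- function(r) {
  a <- abs(r)
  ifelse(a >= 0.85, "strong", ifelse(a >= 0.5, "moderate", "weak"))
}

strength <- classify_strength(r)
print(round(r, 2))
print(strength)
```

Using the absolute value means a coefficient of -0.9 is labelled strong as well, since the sign only gives the direction of the relationship.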

STEP 8:

Installing the package and loading the library required for performing descriptive statistics
on the variables. The functions used are install.packages() and library().

Code:

STEP 9:

Performing Exploratory Data Analysis on the dataset using descriptive statistics, obtained
with the function describe().

Code:

Results:

Conclusion:
From the above results it can be interpreted that the distributions of TV and Sales are
symmetric, as their skewness values are approximately zero, which can be confirmed from
their respective histograms. Similarly, Radio and Social Media have positively skewed
distributions, with higher skewness for Social Media than for Radio; this can also be
confirmed from their respective histograms. The kurtosis values support this: the
distribution is flat for TV and Sales as their kurtosis values are less than -1, while the
distribution for Radio is slightly peaked and the distribution for Social Media has a normal
peak, as their kurtosis values are between -1 and 1.
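
Skewness and kurtosis can be computed by hand to see what these values measure. The sketch below uses the plain moment-based formulas in base R; moment_skewness and excess_kurtosis are hypothetical helper names, the sample vectors are made up, and the results may differ slightly from those of describe(), which can apply a small-sample adjustment.

```r
# Moment-based skewness: third central moment over variance^1.5
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment over variance^2, minus 3
# so that a Normal distribution scores approximately zero
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

flat       <- 1:100                                   # evenly spread values
right_tail <- c(rep(1, 50), rep(2, 30), rep(5, 15), rep(20, 5))

flat_skew  <- moment_skewness(flat)
flat_kurt  <- excess_kurtosis(flat)
right_skew <- moment_skewness(right_tail)

print(c(flat_skew = flat_skew, flat_kurt = flat_kurt, right_skew = right_skew))
```

The evenly spread sample has skewness of about zero and excess kurtosis below -1, the pattern reported for TV and Sales, while the sample with a long right tail has positive skewness.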

Based on the above results, the standard error of the mean is much higher for Sales than for
the other variables and lowest for Social Media. This shows that, after data cleaning, the
sample mean of Sales is estimated with the least absolute precision, whereas the sample mean
of Social Media, whose standard error is very close to zero, is estimated most precisely.
Because the standard error is expressed in the units of the variable and grows with its
spread, it should be read relative to the size of each variable's mean rather than as proof
that the Sales mean is inaccurate.
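
The standard error of the mean is commonly computed as the standard deviation divided by the square root of the sample size, so it carries the scale of the variable. The sketch below illustrates this with two hypothetical vectors, where social is simply sales divided by 50; standard_error is a hypothetical helper, not a function from the report.

```r
# Standard error of the mean: sd(x) / sqrt(n), ignoring NA values
standard_error <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# Hypothetical values: social is sales rescaled by a factor of 50
sales  <- c(50, 120, 190, 260, 330)
social <- sales / 50

se_sales  <- standard_error(sales)
se_social <- standard_error(social)

# Standard error relative to the mean removes the effect of scale
rel_se_sales  <- se_sales / mean(sales)
rel_se_social <- se_social / mean(social)

print(c(se_sales = se_sales, se_social = se_social))
print(c(rel_se_sales = rel_se_sales, rel_se_social = rel_se_social))
```

The two vectors have the same shape, yet the standard error of sales is 50 times larger, while the relative standard errors are equal. Comparing standard errors across variables measured on different scales is therefore more informative when done relative to each mean.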

#Exploratory Data Analysis

#Setting the Working Directory


getwd()
setwd("D:\\ASM CCA_1")

sales=read.csv("Sales_Dataset.csv")

#Performing Exploratory Data Analysis


dim(sales) #to know the dimensions of the dataset
View(sales) #to view dataset
str(sales) #shows that TV has integer values; radio, social media &
#sales have numeric values; and influencer has character values
is.na(sales) #to check presence of the missing values in the dataset
sum(is.na(sales)) #to find total no. of missing values

sales=sales[-c(4)] #dropping column 4 (Influencer) as it is not numeric
plot(sales) #to check the direction of the plot whether positive
#or negative

#To check for outliers


boxplot(sales$TV) #NO outliers
boxplot(sales$Radio) #outliers are present
boxplot(sales$Social.Media) #outliers are present
boxplot(sales$Sales) #NO outliers

#Replacing the missing values with mean or median


sales$TV[is.na(sales$TV)]=mean(sales$TV, na.rm = TRUE)
#replacing missing values with mean as there are no outliers
sales$Radio[is.na(sales$Radio)]=median(sales$Radio, na.rm = TRUE)
#replacing missing values with median as there are outliers
sales$Social.Media[is.na(sales$Social.Media)]=median(sales$Social.Media, na.rm = TRUE)
#replacing missing values with median as there are outliers
sales$Sales[is.na(sales$Sales)]=mean(sales$Sales, na.rm = TRUE)
#replacing missing values with mean as there are no outliers

#Plotting histogram to check the frequency distribution for each variable


hist(sales$TV)
hist(sales$Radio)
hist(sales$Social.Media)
hist(sales$Sales)

#To find correlation between variables in dataset


cor(sales)

#Installing the packages required for Descriptive Statistics


install.packages("psych")
library(psych)

#Performing Descriptive Statistical Method to find Skewness and Kurtosis


describe(sales)
