You are on page 1of 10

Exploratory Data Analysis

Instructions:

Please share your answers filled inline in the word document. Submit Python code and R code
files wherever applicable.

Please ensure you update all the details:

Name: _________________R.Divya________

Batch Id: __________________


Topic: Exploratory Data Analysis

Problem Statements:
Q1) Calculate Skewness, Kurtosis using R/Python code & draw inferences on the following data.
Hint: [Insights drawn from the data such as data is normally distributed/not, outliers, measures
like mean, median, mode, variance, std. deviation]
a. Cars speed and distance

© 2013 - 2022 360DigiTMG. All Rights Reserved.


Ans: By looking at the above data and calculating the skewness and kurtosis value:
Skewness (speed) = -0.11751 Skewness (dist) = 0.806895
We can say that speed is slightly negatively skewed and dist is slightly positively skewed
Kurtosis (speed) = -0.50899 Kurtosis (dist) = 0.405053
We can say that the speeds and dist is platykurtic, means data has thinner tales then standard
normal distribution.
Python Code:
import pandas as pd
Cars = pd.read_csv(path)

© 2013 - 2022 360DigiTMG. All Rights Reserved.


Cars.speed.mean() # '.' is used to refer to the variables within object
Cars.speed.median()
Cars.speed.mode()

# Measures of Dispersion / Second moment business decision


Cars.speed.var() # variance
Cars.speed.std() # standard deviation
range = max(Cars.speed) - min(Cars.speed) # range
range

# Third moment business decision


Cars.speed.skew()
Cars.dist.skew()

# Fourth moment business decision


Cars.speed.kurt()

import numpy as np
import seaborn as sns
sns.boxplot(Cars.speed)
sns.boxplot(Cars.dist) – There are outliers in distance

b. Top Speed (SP) and Weight (WT)

© 2013 - 2022 360DigiTMG. All Rights Reserved.


Ans: By looking at the above data and calculating the skewness and kurtosis value:
Skewness (speed) = 1.61145 Skewness (dist) = -0.61475

© 2013 - 2022 360DigiTMG. All Rights Reserved.


We can say that speed is positively skewed and dist is slightly negatively skewed
Kurtosis (speed) = 2.977329 Kurtosis (dist) = 0.950291
We can say that the speeds and dist is platykurtic, means data has thinner tales then standard
normal distribution.
Python Code:
import pandas as pd
Cars = pd.read_csv(path)
Cars.skew(axis = 0, skipna = True)

Cars.kurt(axis = 0, skipna = True)


Q2) Draw inferences about the following boxplot & histogram.
Hint: [Insights drawn from the plots about the data such as whether data is normally
distributed/not, outliers, measures like mean, median, mode, variance, std. deviation]

© 2013 - 2022 360DigiTMG. All Rights Reserved.


Ans: By looking at the above histogram and boxplot we can say that:
1. Data is not normally distributed
2. Data has a long right tale means it is positively skewed
3. At extreme right we have about 8 outliers
4. It has long right whisker

Histogram :
1.The Data is Right skewed
2. Max. Value – (0-100)
3. Min. Value – (300-400)
4. Mode will be between 50-100
Box Plot :
1. Box Plot is Right Skewed/Positive Skewed
2. Outliers are at upper side

Q3) Below are the scores obtained by a student in tests


34,36,36,38,38,39,39,40,40,41,41,41,41,42,42,45,49,56

© 2013 - 2022 360DigiTMG. All Rights Reserved.


1) Find mean, median, variance, standard deviation.
2) What can we say about the student marks? [Hint: Looking at the various measures
calculated above whether the data is normal/skewed or if outliers are present].
Ans: Mean= 41, Median= 40, variance= 24.111, Standard deviation= 4.910

Q5) What is the nature of skewness when mean, median of data is equal?
Ans: Symmetrical
Q6) What is the nature of skewness when mean > median?
Ans: Right skewed
Q7) What is the nature of skewness when median > mean?
Ans: Left skewed
Q8) What does positive kurtosis value indicates for a data?
Ans: The data is normally distributed and kurtosis value is 0
Q9) What does negative kurtosis value indicates for a data?
Ans: The distribution of the data has lighter tails and a flatter peaks than the normal
distribution
Q10) Answer the below questions using the below boxplot visualization.

What can we say about the distribution of the data?


Ans: Let’s assume above box plot is about age’s of the students in a school.
50% of the people are above 10 yrs old and remaining are less.
And students who’s age is above 15 are approx 40%.

What is nature of skewness of the data?

© 2013 - 2022 360DigiTMG. All Rights Reserved.


Ans: Left skewed, median is greater than mean
What will be the IQR of the data (approximately)?
Ans: Approximately= -8

Q11) Comment on the below Boxplot visualizations?

Draw an Inference from the distribution of data for Boxplot 1 with respect Boxplot 2.
Hint: [On comparing both the plots, and check if the data is normally distributed/not, outliers
present, skewness etc.]
Ans: By observing both the plots whisker’s level is high in boxplot 2, mean and median are
equal hence distribution is symmetrical.

Q12)

© 2013 - 2022 360DigiTMG. All Rights Reserved.


Answer the following three questions based on the boxplot above.
(i) What is inter-quartile range of this dataset? [Hint: IQR = Q3 – Q1] 12-5 = 7
In one line, explain what this value implies. (Hint: Based on IQR
definition)Middle 50% lies between 5 and 12.
(ii) What can we say about the skewness of this dataset? Lightly Positive skew
(iii) If it were found that the data point with the value 25 is 2.5, how would the new
boxplot be affected? No outliers
(Hint: On changing the data point from 25 to 2.5 in the data, how is it different from the
current one.)

Q13)

Answer the following three questions based on the histogram above.


(i) Where would the mode of this dataset lie? Hint: [In terms of values On Y-
axis]9,20
(ii) Comment on the skewness of the dataset. Ans: Positive skew,
Mode>Median>Mean

© 2013 - 2022 360DigiTMG. All Rights Reserved.


(iii) Suppose that the above histogram and the boxplot in question 2 are plotted for
the same dataset. Explain how these graphs complement each other in providing
information about any dataset. Hint: [Visualizing both the plots, draw the
insights] Ans: 50% data between 5 and 12, outlier in 25, , Positive skew,
Asymetrical, Mode>Median>Mean, Positive kurtosis

Hints:
For each assignment, the solution should be submitted in the below format
1. Research and Perform all possible steps for obtaining solution
2.
3. For Statistics calculations, explanation of the solutions should be documented in black and
white along with the codes.
Must follow these guidelines:
3.1. Be thorough with the concepts of Probability, Central Limit Theorem and Perform the
calculation stepwise
3.2. For True/False Questions, or short answer type questions explanation is must

3.3. R & Python code for Univariate Analysis (histogram, box plot, bar plots etc.) the data
distribution to be attached
4. All the codes (executable programs) should execute without errors
5. Code modularization should be followed
6. Each line of code should have comments explaining the logic and why you are using that

© 2013 - 2022 360DigiTMG. All Rights Reserved.

You might also like