BY:
PRANAV VISWANATHAN
Table of Contents
1 Project Objective
2 Assumptions
2.1 Assumptions from the problem’s point of view
3 Exploratory Data Analysis – Step by step approach
3.1 Environment Set up and Data Import
3.1.1 Install necessary Packages and Invoke Libraries
3.1.2 Set up working Directory
3.1.3 Import and Read the Dataset
3.2 Variable Identification
3.2.1 Variable Identification – Inferences
3.3 Univariate Analysis
3.4 Bi-Variate Analysis
3.5 Missing Value Identification
3.6 Outlier Identification
4 Problem and solutions
4.1 Problem 1
4.2 Problem 2
5 Conclusion
6 Appendix A – Source Code
1. PROJECT OBJECTIVE:
Obtain the source data, i.e. the dataset, in the desired file format.
2. ASSUMPTIONS:
The datasets provided for the problems are assumed to be free from missing
values and errors.
2.1 Assumptions from the problem’s point of view:
1. The dataset is correctly imported and checked for errors and missing values.
2. The dataset consists of date, month, season and temperature, so we check
for possible errors in the data types of the different parameters.
3. Let us assume the temperature of the cold storage is maintained properly
between 2-4 deg C.
4. Let us assume that the temperature data is normally distributed, with the mean
and standard deviation computed from the dataset.
5. The temperature is read properly at correct intervals.
6. Let the maximum accepted temperature be 3.9 deg C.
7. It is assumed that in the first year of business they outsourced the plant
maintenance work to a professional company with stiff penalty clauses.
8. If it is proven that the probability of the temperature falling outside 2-4 deg C is
above 2.5% and less than 5%, the penalty is 10% of the AMC fee; if it exceeds
5%, the penalty is 25% of the AMC fee.
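Assumptions 1 and 2 can be checked directly in R. The lines below are a minimal sketch on a small made-up data frame (the name cold and all of its values are illustrative and are not taken from the project files); the same calls can be applied to the imported dataset.

```r
# Hypothetical miniature of the cold-storage data (illustrative values only)
cold <- data.frame(
  Season = c("Winter", "Winter", "Summer", "Rainy"),
  Month = c("Jan", "Jan", "Apr", "Jul"),
  Date = c(1L, 2L, 1L, 1L),
  Temperature = c(2.4, NA, 3.1, 3.0),
  stringsAsFactors = TRUE
)

# Data type of every column (assumption 2)
col_types <- sapply(cold, class)
print(col_types)

# Number of missing values per column (assumption 1)
na_counts <- colSums(is.na(cold))
print(na_counts)

# Row numbers that contain at least one missing value
incomplete_rows <- which(!complete.cases(cold))
print(incomplete_rows)
```

A non-zero count in na_counts, or an unexpected type in col_types, would mean the assumption does not hold for that column.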
EDA is different from initial data analysis (IDA), which focuses more narrowly on
checking assumptions required for model fitting and hypothesis testing, and
handling missing values and making transformations of variables as needed.
3.1 Environment Set up and Data Import
3.1.1 Install necessary Packages and Invoke Libraries
Here the necessary packages for using various functions are installed and the
respective libraries are invoked for the purpose of analyzing.
install.packages() - function used for installing packages.
library() - used to load the libraries from the installed packages.
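As a rough sketch of this step, the lines below check which packages are already available and attach them. The package list is taken from the library() calls used later in this report; the install line is left commented out because it needs internet access.

```r
# Packages used in this report
pkgs <- c("ggplot2", "tidyverse")

# Find the ones that are not yet installed
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!is_installed]
if (length(missing_pkgs) > 0) {
  # install.packages(missing_pkgs)   # uncomment to install them
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach every package that is available
for (p in setdiff(pkgs, missing_pkgs)) {
  library(p, character.only = TRUE)
}
```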
R STUDIO O/P:
Alternatively, we can use Session -> Set Working Directory -> Choose Directory in RStudio.
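The same thing can be done from code with setwd() and getwd(), followed by read.csv() to import the file. The sketch below uses a temporary folder and a made-up file name (demo_cold_storage.csv) so that it runs anywhere; in the project the folder and file are the ones shown in the source code.

```r
# Hypothetical project folder; tempdir() is used so the sketch runs anywhere
project_dir <- tempdir()
old_dir <- setwd(project_dir)   # setwd() returns the previous directory
getwd()

# Write a tiny illustrative CSV, then read it back the way the report does
write.csv(
  data.frame(Season = "Winter", Month = "Jan", Date = 1, Temperature = 2.4),
  "demo_cold_storage.csv",
  row.names = FALSE
)
demo <- read.csv("demo_cold_storage.csv")
print(demo)

setwd(old_dir)                  # restore the original working directory
```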
R STUDIO O/P:
Fig 5: output of dataset
3.2.1 Variable Identification – Inferences
After loading the file into RStudio:
1. The first step is to see the structure of the dataset. We use the str() function for this.
From the above figure we see that the dataset is of data frame type and the data type
of each field is shown. This helps to identify whether the variables are categorical
or numerical.
We see that there are 35 observations of 4 variables.
2. Next, the summary statistics of the dataset are examined.
Here we use summary() to get the statistical parameters.
From the above figure we get the mean, median and quartiles of the various variables, which
are used to analyse the dataset for further manipulation.
3. If the dataset consists of a large number of rows and columns, we make use of
head() and tail() to get the top and bottom n rows and check the consistency
of the dataset.
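The three steps above can be sketched as follows. The data frame demo is made up for illustration (random values, not the project data); only the function calls mirror the report.

```r
set.seed(1)  # reproducible illustrative values

# Hypothetical stand-in for the imported dataset
demo <- data.frame(
  Season = factor(rep(c("Winter", "Summer", "Rainy"), each = 4)),
  Date = rep(1:4, times = 3),
  Temperature = round(rnorm(12, mean = 3, sd = 0.5), 1)
)

str(demo)        # step 1: structure and data type of each field
summary(demo)    # step 2: mean, median and quartiles per variable

# step 3: first and last n rows
top3 <- head(demo, n = 3)
bottom3 <- tail(demo, n = 3)
print(top3)
print(bottom3)
```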
Numerical variables:
par(mfrow=c(2,2))
hist(Temperature)
boxplot(Temperature,horizontal =TRUE ,main='Boxplot of temperature ')
hist(Date)
boxplot(Date,horizontal =TRUE,main='Boxplot of Date' )
R studio o/p:
Fig 11: output
Analysis:
From the above charts of the numerical variables of the given dataset we see that the
histogram of temperature is approximately normally distributed: it increases, reaches a maximum
and then decreases, resembling the bell curve.
From the boxplot of temperature we see that it has an extreme value, shown as
an outlier.
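The point that the boxplot draws as an outlier can also be listed numerically. This is a small sketch with made-up temperatures (the vector temps is illustrative); boxplot.stats() applies the same 1.5 x IQR rule that boxplot() uses by default.

```r
# Hypothetical temperature readings with one extreme value
temps <- c(2.8, 2.9, 3.0, 3.0, 3.1, 3.1, 3.2, 3.3, 5.0)

# Values lying beyond 1.5 times the inter-quartile range from the hinges
out_vals <- boxplot.stats(temps)$out
print(out_vals)

# Positions of those values in the original vector
out_pos <- which(temps %in% out_vals)
print(out_pos)
```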
Fig 13: output of outliers present
Categorical variable:
plot(Season)
plot(Month)
R STUDIO O/P:
Bar chart of season:
Bar chart of month:
The above figure represents the change in temperature across the seasons. We can
see that the temperature decreases during the winter and rainy seasons.
Next we use ggplot() to plot the temperature distribution of each season separately, which
makes the comparison easier.
Code:
library(ggplot2)
ggplot(data,aes(x=Temperature,fill=Season))+
geom_histogram(col='Black',bins=15)+
facet_wrap(~Season)
Analysis:
From the above two plots we see that the temperature of the cold storage unit reaches its
maximum in summer and decreases considerably in the winter and rainy seasons.
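The visual comparison can be backed by a season-wise summary table. The sketch below uses aggregate() on a made-up data frame (demo and its values are illustrative, not the project data).

```r
# Hypothetical readings, three per season
demo <- data.frame(
  Season = rep(c("Summer", "Winter", "Rainy"), each = 3),
  Temperature = c(3.2, 3.4, 3.0, 2.6, 2.8, 2.7, 3.0, 3.1, 2.9)
)

# Mean temperature for each season in one call
season_means <- aggregate(Temperature ~ Season, data = demo, FUN = mean)
print(season_means)

# Season with the highest mean temperature
warmest <- as.character(season_means$Season[which.max(season_means$Temperature)])
print(warmest)
```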
R STUDIO O/P:
change of texture, body appearance, separation of fats the optimal temperature to be maintained
is between 2 deg - 4 deg C.
In the first year of business they outsourced the plant maintenance work to a professional
company with stiff penalty clauses. It was agreed that if it was statistically proven that
probability of temperature going outside the 2 degrees - 4 degrees C during the one-year contract
was above 2.5% and less than 5% then the penalty would be 10% of AMC (annual maintenance
case). In case it exceeded 5% then the penalty would be 25% of the AMC fee. The average
temperature data at date level is given in the file “Cold_Storage_Temp_Data.csv”
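The penalty clause can be written as a small function under the normal-distribution assumption. This is only a sketch: the function name penalty_rate and the mean and standard deviation passed to it are illustrative, and treating a probability of exactly 5% as the 10% band is an assumption, since the clause does not cover that boundary.

```r
# Penalty (as a fraction of the AMC fee) from the probability of the
# temperature falling outside the lower-upper range, assuming normality
penalty_rate <- function(mean_temp, sd_temp, lower = 2, upper = 4) {
  p_below <- pnorm(lower, mean = mean_temp, sd = sd_temp)
  p_above <- pnorm(upper, mean = mean_temp, sd = sd_temp, lower.tail = FALSE)
  p_out <- p_below + p_above
  rate <- if (p_out > 0.05) {
    0.25            # exceeds 5%
  } else if (p_out > 0.025) {
    0.10            # above 2.5% and up to 5% (boundary is an assumption)
  } else {
    0               # no penalty
  }
  list(p_out = p_out, rate = rate)
}

# Illustrative values only, not the results of this project
res <- penalty_rate(mean_temp = 3.0, sd_temp = 0.5)
print(res)
```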
Q1) Find mean cold storage temperature for Summer, Winter and Rainy Season:
#mean of temperature in summer:
library(tidyverse)
d2=data %>% filter(Season=="Summer")
d2
summer_mean=mean(d2$Temperature)
summer_mean
> summer_mean=mean(d2$Temperature)
> summer_mean
[1] 3.153333
R STUDIO O/P:
Fig 23:output to find mean of a particular season (summer)
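#mean of temperature in winter:
#note: this takes the first 31 rows only; data %>% filter(Season=="Winter") would select every Winter row
d3=data[1:31,c('Season','Month','Date','Temperature')]
d3
winter_mean=mean(d3$Temperature)
winter_mean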
> winter_mean=mean(d3$Temperature)
> winter_mean
[1] 2.703226
R STUDIO O/P:
Mean of temperature in rainy season:
d4=data %>% filter(Season=="Rainy")
d4
rainy_mean=mean(d4$Temperature)
rainy_mean
R STUDIO O/P:
Q2) Find overall mean for the full year
Overall mean:
overall_mean=mean(data$Temperature)
overall_mean
Q3)Find Standard Deviation for the full year
Overall standard deviation:
overall_sd=sd(data$Temperature)
overall_sd
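#probability of the temperature falling below 2 deg C, assuming a normal distribution
mean=overall_mean
sd=overall_sd
pnorm(q=2,mean,sd,lower.tail = T)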
> pnorm(q=2,mean,sd,lower.tail = T)
[1] 0.02918146
R STUDIO O/P:
Fig 32 : output
R STUDIO O/P:
Fig 34 : output
PROBLEM 2:
In Mar 2018, Cold Storage started getting complaints from their Clients that they have been getting
complaints from end consumers of the dairy products going sour and often smelling. On getting
these complaints, the supervisor pulls out data of last 35 days temperatures. As a safety measure,
the Supervisor has been vigilant to maintain the temperature below 3.9 deg C.
Assume 3.9 deg C as upper acceptable temperature range and at alpha = 0.1 do you feel that there
is need for some corrective action in the Cold Storage Plant or is it that the problem is from
procurement side from where Cold Storage is getting the Dairy Products. The data of the last 35
days is in “Cold_Storage_Mar2018.csv”
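Q1) State the Hypothesis, do the calculation using z test
NULL HO: mu<=3.9 deg C (As a safety measure, the Supervisor has been vigilant to maintain the temperature below 3.9 deg C.)
ALTERNATE HA: mu>3.9 deg C (Assume 3.9 deg C as upper acceptable temperature range)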
CODE:
setwd('C:\\Users\\user\\Desktop\\pgp-babi')
getwd()
##population data
data=read.csv('cold_storage.csv')
data
##sampled data
data2=read.csv('Cold_Storage_Mar2018.csv')
data2
standard_deviation=sd(data$Temperature)
standard_deviation
p=sqrt(35)
sd_n=standard_deviation/p
sd_n
##mean of sample
mean_sample=mean(data2$Temperature)
mean_sample
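####PROBLEM:2 Q.1
####ASSUMPTIONS
####1)As a safety measure, the Supervisor has been vigilant to maintain the temperature below 3.9 deg C.
####2)Assume 3.9 deg C as upper acceptable temperature range and at alpha = 0.1
## HO:MU<=3.9 deg C
## HA:MU>3.9 deg C
alpha=0.1
X_bar=mean_sample #mean of sample
N=35 #sample size
MU=3.9 #as per assumption
SD=sd_n #standard error: population standard deviation divided by sqrt(N)
##CALCULATED Z statistic:
Z=(X_bar-MU)/SD
Z
q=1-alpha ##cumulative probability up to the critical value
Z_c=qnorm(q)
Z_c
### from the values of Z and Z_c we see that Z<Z_c so we fail to reject HO (null hypothesis)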
###P-value method:
alpha=0.1
Z= 0.8641166
p_value=1-pnorm(Z) ##right-tailed test: P(Z > observed value)
p_value
### comparing p_value with alpha we see that p_value is greater than alpha, so we fail to reject HO.
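The critical-value and p-value methods can be wrapped into one function for a right-tailed Z-test. This is a sketch under the hypotheses stated above; the function name z_test_upper and the numbers passed to it are illustrative and are not the project's values.

```r
# Right-tailed one-sample Z-test: H0: mu <= mu0 against HA: mu > mu0
z_test_upper <- function(xbar, mu0, sigma, n, alpha = 0.1) {
  z <- (xbar - mu0) / (sigma / sqrt(n))     # test statistic
  z_crit <- qnorm(1 - alpha)                # critical value
  p_value <- pnorm(z, lower.tail = FALSE)   # P(Z > z)
  list(z = z, z_crit = z_crit, p_value = p_value, reject_H0 = z > z_crit)
}

# Illustrative inputs only
res <- z_test_upper(xbar = 4.0, mu0 = 3.9, sigma = 0.5, n = 35)
print(res)
```

Both methods always agree: z exceeds z_crit exactly when p_value is smaller than alpha.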
R STUDIO O/P:
Fig 35 a.
Fig 35 b.
Conclusion of Z-test:
Firstly, on comparing Z with Z_c, and the p-value with alpha, we see that Z does
not fall in the critical region and that the p-value is greater than alpha. So we
fail to reject the null hypothesis, which states that the cold storage temperature
is maintained at or below 3.9 deg C.
Secondly, this suggests that the problem is from the procurement side.
ALTERNATE HA: mu>3.9 deg C (Assume 3.9 deg C as upper acceptable temperature range)
Alpha=0.1
Based on the above hypothesis,
tstat=(xbar-mu)/(s/sqrt(n))
where s is the sample standard deviation and n=35.
tstat is found to be 2.752359.
CODE:
setwd('C:\\Users\\user\\Desktop\\pgp-babi')
getwd()
data=read.csv('Cold_Storage_Mar2018.csv')
data
####ASSUMPTIONS
####1)As a safety measure, the Supervisor has been vigilant to maintain the temperature below 3.9 deg C.
####2)Assume 3.9 deg C as upper acceptable temperature range and at alpha = 0.1
R STUDIO O/P:
> xbar=mean(data$Temperature)
> s=sd(data$Temperature)
> tstat=(xbar-3.9)/(s/sqrt(35))
> pvalue=pt(tstat,34)
> pvalue
[1] 0.9952888
> p=1-pt(tstat,34)
> p
[1] 0.004711198
p = 0.004711198 is less than alpha = 0.1, so we reject the null hypothesis and
accept the alternate hypothesis.
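The manual calculation can be cross-checked against R's built-in t.test(). The sketch below uses randomly generated temperatures (temps is made up for illustration, not the March 2018 data) and confirms that both routes give the same statistic and p-value.

```r
set.seed(42)
temps <- rnorm(35, mean = 4.0, sd = 0.2)   # hypothetical sample of 35 readings
mu0 <- 3.9
n <- length(temps)

# Manual right-tailed t-test, as done in this report
tstat <- (mean(temps) - mu0) / (sd(temps) / sqrt(n))
p_manual <- pt(tstat, df = n - 1, lower.tail = FALSE)

# Same test with the built-in function
tt <- t.test(temps, mu = mu0, alternative = "greater")

print(c(manual = tstat, builtin = unname(tt$statistic)))
print(c(manual = p_manual, builtin = tt$p.value))
```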
Conclusion of T-test:
Firstly, on comparing p with alpha we see that p is less than alpha. So
we accept the alternate hypothesis, which is that the mean temperature is greater than
3.9 deg C.
Secondly, we can see that the sample mean is greater than the population
mean.
Thirdly, we can conclude that the problem is with the cold storage unit.
If we compare the parameters of the t-test with the population parameters, it
is seen that the average temperature of the sample is greater than that of the
population.
Moreover, the T-test is preferred over the Z-test as the sample size, N=35, is
considerably small.
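One way to see the effect of the small sample is to compare the critical values of the two tests at the same alpha. This is a short sketch using the alpha and sample size stated in the problem; the variable names are illustrative.

```r
alpha <- 0.1
n <- 35

# Right-tail critical values at the same significance level
z_crit <- qnorm(1 - alpha)            # standard normal
t_crit <- qt(1 - alpha, df = n - 1)   # t distribution with n - 1 degrees of freedom

print(c(z_crit = z_crit, t_crit = t_crit))
```

The t critical value is the larger of the two, because the t distribution has heavier tails when the standard deviation is estimated from a small sample.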
5. CONCLUSION:
The dataset is analyzed to find the cause of the problem faced by the customers of
the cold storage unit. It is found through
hypothesis testing that the problem lies with the cold storage unit, in spite of the
supervisor reporting that an optimum temperature is maintained. The data analysis
helps to bring out the insights of the dataset, making it possible to reach a conclusion
on the hypothesis. Moreover, it is finally concluded that the T-test is more appropriate in
this case, based on comparing the statistical parameters of the sample and
population datasets.
APPENDIX
####Setting up the environment
setwd("C:\\Users\\user\\Desktop\\pgp-babi")
getwd()
####Import and read the dataset
data=read.csv('cold_storage.csv')
####Dimensions of dataset
dim(data)
####Getting Summary of dataset
summary(data)
####Bi-variate analysis
library(ggplot2)
ggplot(data,aes(x=Temperature,fill=Season))+
geom_histogram(col='Black',bins=15)+
facet_wrap(~Season)
####PROBLEM 1
###Q1)Find mean cold storage temperature for Summer, Winter and Rainy Season:
#mean of temperature in summer:
library(tidyverse)
d2=data %>% filter(Season=="Summer")
d2
summer_mean=mean(d2$Temperature)
summer_mean
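#mean of temperature in winter:
#note: this takes the first 31 rows only; data %>% filter(Season=="Winter") would select every Winter row
d3=data[1:31,c('Season','Month','Date','Temperature')]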
d3
winter_mean=mean(d3$Temperature)
winter_mean
#mean of temperature in rainy:
d4=data %>% filter(Season=="Rainy")
d4
rainy_mean=mean(d4$Temperature)
rainy_mean
#overall mean for the full year
overall_mean=mean(data$Temperature)
overall_mean
####PROBLEM 2
###Q1)State the Hypothesis, do the calculation using z test
#Assumptions:
#1)As a safety measure, the Supervisor has been vigilant to maintain the temperature below 3.9 deg C.
#2)Assume 3.9 deg C as upper acceptable temperature range and at alpha = 0.1
#According to above assumptions the hypothesis is:
#NULL HO: mu<=3.9 deg C (As a safety measure, the Supervisor has been vigilant to maintain the temperature below 3.9 deg C.)
#ALTERNATE HA: mu>3.9 deg C (Assume 3.9 deg C as upper acceptable temperature range)
##population data
data=read.csv('cold_storage.csv')
data
##sampled data
data2=read.csv('Cold_Storage_Mar2018.csv')
data2
##standard deviation of population
standard_deviation=sd(data$Temperature)
standard_deviation
p=sqrt(35)
p
sd_n=standard_deviation/p
sd_n
##mean of sample
mean_sample=mean(data2$Temperature)
mean_sample
alpha=0.1
X_bar=mean_sample #mean of sample
N=35 #sample size
MU=3.9 #as per assumption
SD=sd_n #standard error: population standard deviation divided by sqrt(N)
##CALCULATED Z statistic:
Z=(X_bar-MU)/SD
Z
q=1-alpha ## unrejected region
Z_c=qnorm(q)
Z_c
### from the values of Z and Z_c we see that Z<Z_c so we fail to reject HO (null hypothesis)
### Moreover the problem is from the procurement side
### The assumption here holds, that is, the temperature is maintained
###P-value method:
alpha=0.1
Z= 0.8641166
p_value=1-pnorm(Z) ##right-tailed test: P(Z > observed value)
p_value
### comparing p_value with alpha we see that p_value is greater than alpha.
### we fail to reject HO
##sampled data, as read in the main text for the t test
data=read.csv('Cold_Storage_Mar2018.csv')
#1)As a safety measure, the Supervisor has been vigilant to maintain the temperature below 3.9 deg C.
#2)Assume 3.9 deg C as upper acceptable temperature range and at alpha = 0.1
#According to above assumptions the hypothesis is:
#NULL HO: mu<=3.9 deg C (As a safety measure, the Supervisor has been vigilant to maintain the temperature below 3.9 deg C.)
#ALTERNATE HA: mu>3.9 deg C (Assume 3.9 deg C as upper acceptable temperature range)
Alpha=0.1
mu=3.9
n=35
xbar=mean(data$Temperature)
s=sd(data$Temperature)
tstat=(xbar-mu)/(s/sqrt(n))
tstat
pvalue=pt(tstat,n-1) ##cumulative probability P(T<=tstat)
pvalue
p=1-pt(tstat,n-1) ##right-tail p-value
p