Final Report

Project 1 – COLD STORAGE CASE STUDY
REPORT ON THE ANALYSIS OF THE DATASET
BY:
PRANAV VISWANATHAN
Table of Contents
1 Project Objective
2 Assumptions
2.1 Assumptions from the problem’s point of view
3 Exploratory Data Analysis – Step by step approach
3.1 Environment Set up and Data Import
3.1.1 Install necessary Packages and Invoke Libraries
3.1.2 Set up working Directory
3.1.3 Import and Read the Dataset
3.2 Variable Identification
3.2.1 Variable Identification – Inferences
3.3 Univariate Analysis
3.4 Bi-Variate Analysis
3.5 Missing Value Identification
3.6 Outlier Identification
4 Problem and solutions
4.1 Problem 1
4.2 Problem 2
5 Conclusion
6 Appendix A – Source Code
1.PROJECT OBJECTIVE:
The objective of this project is to explore the COLD STORAGE DATASET,
namely Cold_Storage_Temp_Data.csv and Cold_Storage_Mar2018.csv, in R to
generate solutions and draw insights from the data provided. The exploration of
the dataset is done in steps to get the desired output.
The following are the steps to be followed:
1. Getting the source, i.e. the dataset, in the desired file format
(e.g. .csv, Excel).
2. Importing the dataset into RStudio.
3. Exploring the structure and nuances of the dataset.
4. Graphical exploration to compare the different variables
present in the dataset.
5. Descriptive statistics to get a brief summary of the dataset, such as the
type and class of each variable and its measures of central
tendency.
6. Drawing insights and solutions from the analysis done on the dataset.
2.ASSUMPTIONS:
The datasets provided for the problems are assumed to be free from missing
values and errors.
2.1 Assumptions from the problem’s point of view:
1. The dataset is correctly imported and checked for errors and missing values.
2. The dataset consists of date, month, season and temperature, so we check
for possible errors in the data type of each variable.
3. The temperature of the cold storage is assumed to be maintained properly
between 2 and 4 deg C.
4. The temperature is assumed to be normally distributed, with the mean and
standard deviation estimated from the data.
5. The temperature is read properly at correct intervals.
6. The maximum accepted temperature is taken to be 3.9 deg C.
7. It is assumed that in the first year of business the plant
maintenance work was outsourced to a professional company with stiff penalty clauses.
8. If it is proven that the probability of the temperature falling outside 2-4 deg C is
above 2.5% and less than 5%, the penalty is 10% of the AMC fee; if it exceeds
5%, the penalty is 25% of the AMC fee.
3. EDA - Exploratory Data Analysis:
In statistics, exploratory data analysis (EDA) is an approach to analyzing data
sets to summarize their main characteristics, often with visual methods.
A statistical model can be used or not, but primarily EDA is for seeing what the
data can tell us beyond the formal modeling or hypothesis testing task. Exploratory
data analysis was promoted by John Tukey to encourage statisticians to explore the
data, and possibly formulate hypotheses that could lead to new data collection and
experiments.
EDA is different from initial data analysis (IDA), which focuses more narrowly on
checking assumptions required for model fitting and hypothesis testing, and
handling missing values and making transformations of variables as needed.
The objectives of EDA are to:
• Suggest hypotheses about the causes of observed phenomena
• Assess assumptions on which statistical inference will be based
• Support the selection of appropriate statistical tools and techniques
• Provide a basis for further data collection through surveys or experiments
Exploratory Data Analysis – Step by step approach:

A typical data exploration activity consists of the following steps:
1. Environment Set up and Data Import
2. Variable Identification
3. Univariate Analysis
4. Bi-Variate Analysis
5. Missing Value Treatment
6. Outlier Treatment
7. Variable Transformation / Feature Creation
8. Feature Exploration
Steps 5 and 6 are not in the scope of this project.
3.1 Environment Set up and Data Import
3.1.1 Install necessary Packages and Invoke Libraries
Here the packages needed for the various functions are installed and the
respective libraries are loaded for the analysis.
install.packages() - installs packages.
library() - loads an installed package into the current session.
3.1.2 Set up working Directory

Before exploring the given dataset, we first set up the working directory, i.e.
the folder from which the dataset is read and to which output is saved.
This is done with setwd(), which sets the working
directory.
getwd() - returns the directory that is currently set.
CODE:
setwd('C:\\Users\\user\\Desktop\\pgp-babi')
getwd()
R STUDIO O/P:
Fig 1: setting directory
Fig 2: output of directory in R console
Alternatively, we can use the menu: Session -> Set Working Directory -> Choose Directory.
Fig 3: Alternate method to set directory
3.1.3 Import and Read the Dataset

The given dataset is in .csv format. Hence, the function read.csv() is used for
importing the file.
Data=read.csv('file_name')
Data
The above code reads the file into a data frame and prints the dataset.
R STUDIO O/P:
Fig 4: Reading data into console
Fig 5: output of dataset
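The import step can be sketched as follows. This is only an illustration: a small made-up CSV is written to a temporary file in place of the real dataset, the column names follow those described in this report, and the values and the name cold are hypothetical.

```r
# Hypothetical stand-in for the real file: write a tiny CSV to a temp location
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Season,Month,Date,Temperature",
  "Winter,Jan,1,2.4",
  "Winter,Jan,2,2.3",
  "Summer,Apr,1,3.2"
), csv_path)

# Read it back; stringsAsFactors = TRUE keeps Season and Month as factors
cold <- read.csv(csv_path, stringsAsFactors = TRUE)

# Quick sanity checks after import
dim(cold)      # number of rows and columns
names(cold)    # column names
head(cold, 2)  # first two rows
```

Checking the dimensions and column names right after the import is a cheap way to catch a wrong file or a wrong separator before any analysis is done.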
3.2 Variable Identification

Variables are the factors in an experiment that change or potentially change.
The following functions are used in this report:
read.csv() - reads the data from a .csv file into a data frame.
str() - compactly displays the structure of an R object. It can be used
as a diagnostic function and as an alternative to summary(). str() outputs
roughly one line for each basic structure, and is well suited to displaying the
contents of lists.
summary() - a generic function used to produce
summaries of objects, including the results of various model fitting functions. The function
invokes particular methods which depend on the class of the first argument.
mean() - gives the average of the selected field.
sd() - gives the standard deviation of the selected field.
pnorm() - calculates the cumulative distribution function of the normal
distribution, i.e. P(X ≤ q) for X ~ N(μ, σ²), where μ is the mean and σ is the standard deviation.
filter() - selects the rows of a data table that meet certain criteria,
creating a new data subset (from the dplyr package, part of the tidyverse).
head(data,n=value) - gives the first n rows of the dataset.
tail(data,n=value) - gives the last n rows of the dataset.
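To make the role of pnorm() concrete, here is a small sketch on made-up temperatures (the vector temps and its values are hypothetical, not taken from the dataset). It shows the lower and upper tail calls and checks that pnorm() on the raw scale agrees with the standard normal evaluated at the z-score.

```r
# Hypothetical temperatures (deg C), for illustration only
temps <- c(2.4, 2.9, 3.1, 3.3, 2.7, 3.0, 3.6, 2.8)

str(temps)   # compact structure: type and first values

m <- mean(temps)
s <- sd(temps)

# P(X <= 2) for X ~ Normal(m, s): the lower tail
p_below <- pnorm(2, mean = m, sd = s)
# P(X > 4): the upper tail
p_above <- pnorm(4, mean = m, sd = s, lower.tail = FALSE)

# pnorm on the raw scale equals pnorm of the z-score on the standard scale
all.equal(p_below, pnorm((2 - m) / s))
```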
3.2.1 Variable Identification – Inferences
After loading the file into RStudio:
1. The first step is to see the structure of the dataset. We use the str() function for this.
Fig 6: str function – it gives the structure of data
Fig 7: output of the structure of data
From the above figure we see that the dataset is a data frame, and the data type
of each field is shown. This helps to identify whether the variables are categorical
or numerical.
We see that there are 35 observations of 4 variables.
2. Next, the statistical parameters of the dataset
are examined.
Here we use summary() to get the statistical parameters.
Fig 8: summary function
Fig 9: output of summary function
From the above figure we get the mean, median and quartiles of the variables, which are used to
analyse the dataset for further manipulation.
3. If the dataset has a large number of rows and columns, we use
head() and tail() to view the first and last n rows and check the consistency
of the dataset.
3.3 Univariate Analysis

Univariate analysis is the simplest form of analyzing data. “Uni” means “one”, so
the analysis looks at only one variable at a time. It doesn’t deal with causes or
relationships, and its major purpose is to describe: it takes data, summarizes that
data and finds patterns in the data.
Some ways to describe patterns found in univariate data include central
tendency (mean, mode and median) and dispersion (range, variance, maximum,
minimum, quartiles including the interquartile range, and standard deviation).
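As a brief illustration of these measures, the sketch below computes them in base R on a made-up vector (temps and its values are hypothetical, not taken from the dataset).

```r
# Hypothetical temperatures (deg C), for illustration only
temps <- c(2.4, 2.9, 3.1, 3.3, 2.7, 3.0, 3.6, 2.8)

# Central tendency
mean(temps)
median(temps)

# Dispersion
range(temps)      # minimum and maximum
var(temps)        # sample variance
sd(temps)         # sample standard deviation
quantile(temps)   # minimum, quartiles and maximum
IQR(temps)        # interquartile range
```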
Numerical variables:
par(mfrow=c(2,2))
hist(Temperature)
boxplot(Temperature,horizontal =TRUE ,main='Boxplot of temperature ')
hist(Date)
boxplot(Date,horizontal =TRUE,main='Boxplot of Date' )
R studio o/p:
Fig 10: Function and code for generating charts
Fig 11: output
Analysis:
From the above charts of the numerical variables of the given dataset we see that the
histogram of temperature is approximately normal: it increases, reaches a maximum
and then decreases, resembling a bell curve.
From the boxplot of temperature we see that it has an extreme value, shown as
an outlier.
Fig 12: command to see the outliers
Fig 13: output of outliers present
Categorical variable:
plot(Season)
plot(Month)
R STUDIO O/P:
Bar chart of season:
Fig 14: Bar plot of categorical variable
Bar chart of month:
Fig 15: Bar plot of categorical variable
3.4 Bi-Variate Analysis:

Here we examine the relationship between two variables.
Here I will be using rpivotTable() for easy analysis.
A) Count vs temperature by season
Fig 16: output of count vs temperature by season
The above figure shows how the temperature varies across seasons. We can
see that the temperature is lower during the winter and rainy seasons.
Next we use ggplot() to visualise the same relationship more clearly.
Code:
library(ggplot2)
ggplot(data,aes(x=Temperature,fill=Season))+
  geom_histogram(col='Black',bins=15)+
  facet_wrap(~Season)
Fig 17: ggplot output of count vs temperature by season
Analysis:
From the above two plots we see that the temperature of the cold storage unit reaches
its maximum in summer and decreases considerably in the winter and rainy seasons.
3.5 Missing Value Identification

In R, missing values are coded by the symbol NA. To identify missing values in the
dataset, the function is.na() is used.
R STUDIO O/P:
Fig 18: command to find missing values
Fig 19: output showing there is no missing value.
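The check can be sketched on a small made-up data frame that deliberately contains missing entries (the name cold and the values are hypothetical; the real dataset has no missing values).

```r
# Hypothetical data frame with two missing entries, for illustration only
cold <- data.frame(
  Season = c("Winter", "Winter", "Summer", NA),
  Temperature = c(2.4, NA, 3.2, 3.0)
)

sum(is.na(cold))               # total number of missing cells
colSums(is.na(cold))           # missing count per column
anyNA(cold)                    # TRUE if at least one value is missing
cold[!complete.cases(cold), ]  # rows that contain a missing value
```

A total of zero from sum(is.na(...)) is what confirms that a dataset has no missing values.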
3.6 Outlier Identification:

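An outlier is an observation that lies far from the rest of the data. In a boxplot,
points beyond the whiskers are drawn individually as outliers. To list them, the value
returned by boxplot() is stored in a variable and printed; its out component holds the outlying values.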
Fig 20: command to see outlier
Fig 21:output of outliers
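A minimal sketch of listing outliers without drawing the plot, using boxplot.stats() from base R. The vector temps is made up, with one deliberately extreme reading; it is not taken from the dataset.

```r
# Hypothetical temperatures with one deliberately extreme reading
temps <- c(2.8, 2.9, 3.0, 3.0, 3.1, 3.1, 3.2, 3.3, 5.0)

# boxplot.stats() applies the same rule the boxplot uses:
# points beyond 1.5 * IQR from the hinges are reported in $out
outliers <- boxplot.stats(temps)$out
outliers

# Positions of those readings in the original vector
which(temps %in% outliers)
```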
4.Problem and solution:

4.1)Problem 1:
Cold Storage started its operations in Jan 2016. They are in the business of storing Pasteurized
Fresh Whole or Skimmed Milk, Sweet Cream, Flavored Milk Drinks. To ensure that there is no
change of texture, body appearance, separation of fats the optimal temperature to be maintained
is between 2 deg - 4 deg C.
In the first year of business they outsourced the plant maintenance work to a professional
company with stiff penalty clauses. It was agreed that if it was statistically proven that
probability of temperature going outside the 2 degrees - 4 degrees C during the one-year contract
was above 2.5% and less than 5% then the penalty would be 10% of AMC (annual maintenance
case). In case it exceeded 5% then the penalty would be 25% of the AMC fee. The average
temperature data at date level is given in the file “Cold_Storage_Temp_Data.csv”
Q1) Find mean cold storage temperature for Summer, Winter and Rainy
Season:
#mean of temperature in summer (filter() and %>% come from the tidyverse)
library(tidyverse)
d2=data %>% filter(Season=="Summer")
d2
summer_mean=mean(d2$Temperature)
summer_mean
> summer_mean
[1] 3.153333
R STUDIO O/P:
Fig 22:command to find mean of a particular season (summer)
Fig 23:output to find mean of a particular season (summer)
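Mean of temperature in winter:
# Note: rows 1:31 select only the first 31 rows of the file, so this is the mean of
# those rows; data %>% filter(Season=="Winter") would cover every Winter row.
d3=data[1:31,c('Season','Month','Date','Temperature')]
d3
winter_mean=mean(d3$Temperature)
winter_mean
> winter_mean
[1] 2.703226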
R STUDIO O/P:
Fig 24 : mean of winter season
Fig 25: output
Mean of temperature in rainy season:
d4=data %>% filter(Season=="Rainy")
d4
rainy_mean=mean(d4$Temperature)
rainy_mean
R STUDIO O/P:
Fig 26: mean in rainy season
Fig 27: output
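The three seasonal means can also be obtained in a single call. The sketch below uses base R on a made-up data frame (the name cold and the values are hypothetical), so it does not reproduce the figures reported above.

```r
# Hypothetical readings; the real dataset has one row per day
cold <- data.frame(
  Season = c("Summer", "Summer", "Winter", "Winter", "Rainy", "Rainy"),
  Temperature = c(3.2, 3.4, 2.6, 2.8, 3.0, 3.1)
)

# Mean temperature for every season in a single call
season_means <- tapply(cold$Temperature, cold$Season, mean)
season_means

# The same result as a data frame
aggregate(Temperature ~ Season, data = cold, FUN = mean)
```

Grouping on the Season column, rather than on row positions, keeps the result correct even if the rows are reordered.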
Q2)Find overall mean for the full year

Overall mean:
overall_mean=mean(data$Temperature)
overall_mean
Fig 28: overall mean
Fig 28: output overall mean
Q3)Find Standard Deviation for the full year
Overall standard deviation:
overall_sd=sd(data$Temperature)
overall_sd
Fig 29: standard deviation of dataset
Fig 30: output of standard deviation
Q4) Assume Normal distribution, what is the probability of temperature
having fallen below 2 deg C
mean=overall_mean
sd=overall_sd
pnorm(q=2,mean,sd,lower.tail = T)
> pnorm(q=2,mean,sd,lower.tail = T)
[1] 0.02918146
R STUDIO O/P:
Fig 31: probability of temperature below 2 deg
Fig 32 : output
Q5) Assume Normal distribution, what is the probability of temperature

having gone above 4 deg C
pnorm(q=4,mean,sd,lower.tail=F)
R STUDIO O/P:
Fig 33:probability of temperature above 4 deg c
Fig 34 : output
Q6)What will be the penalty for the AMC Company

The penalty for the AMC Company:
1) For less than 2 deg C = 10% of AMC (as the probability is greater than 2.5%)
2) For greater than 4 deg C = 0% of AMC (as the probability is less than 2.5%)
Total penalty = 10% of AMC
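The contract clause is stated in terms of the probability of the temperature lying outside the 2 to 4 deg C band, which is the sum of the two tail probabilities. The sketch below shows one way to apply the tiers to that combined probability. The function name amc_penalty_pct, the mean and the standard deviation are all hypothetical, and treating a probability of exactly 5% as the 10% tier is an assumption, since the clause does not cover that case.

```r
# Penalty tier as a percentage of the AMC fee, given the probability
# of the temperature lying outside the 2 to 4 deg C band
amc_penalty_pct <- function(p_outside) {
  if (p_outside > 0.05) {
    25
  } else if (p_outside > 0.025) {
    10
  } else {
    0
  }
}

# Hypothetical mean and standard deviation, for illustration only
mu_hat <- 3.0
sigma_hat <- 0.5

p_low  <- pnorm(2, mean = mu_hat, sd = sigma_hat)                      # P(X < 2)
p_high <- pnorm(4, mean = mu_hat, sd = sigma_hat, lower.tail = FALSE)  # P(X > 4)
p_out  <- p_low + p_high                                               # P(outside band)

amc_penalty_pct(p_out)
```

With the overall mean and standard deviation of the real dataset substituted for mu_hat and sigma_hat, the same call gives the penalty tier for the year.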
PROBLEM 2:
In Mar 2018, Cold Storage started getting complaints from their Clients that they have been getting
complaints from end consumers of the dairy products going sour and often smelling. On getting
these complaints, the supervisor pulls out data of last 35 days temperatures. As a safety measure,
the Supervisor has been vigilant to maintain the temperature below 3.9 deg C.
Assume 3.9 deg C as upper acceptable temperature range and at alpha = 0.1 do you feel that there
is need for some corrective action in the Cold Storage Plant or is it that the problem is from
procurement side from where Cold Storage is getting the Dairy Products. The data of the last 35
days is in “Cold_Storage_Mar2018.csv”
Q1) State the Hypothesis, do the calculation using z test

Assumptions:
1) As a safety measure, the Supervisor has been vigilant to maintain the
temperature below 3.9 deg C.
2) Assume 3.9 deg C as the upper acceptable temperature, with alpha = 0.1.
According to the above assumptions the hypotheses are:
NULL H0: mu <= 3.9 deg C (As a safety measure, the Supervisor has been vigilant to maintain
the temperature below 3.9 deg C.)
ALTERNATE HA: mu > 3.9 deg C (Assume 3.9 deg C as upper acceptable temperature
range)
Based on the above hypotheses,

Z=(xbar-mu)/(sigma/sqrt(n))
Z is found to be Z=0.8641166
Z critical is found to be Z_c=1.281552
Comparing Z and Z_c we see that Z<Z_c, therefore we fail to reject the NULL hypothesis.
CODE:
getwd()
##population data
data=read.csv('cold_storage.csv')
data
##sampled data
data2=read.csv('Cold_Storage_Mar2018.csv')
data2
##standard deviation of population
standard_deviation=sd(data$Temperature)
standard_deviation
p=sqrt(35)
sd_n=standard_deviation/p
sd_n
##mean of sample
mean_sample=mean(data2$Temperature)
mean_sample
####PROBLEM:2 Q.1
####ASSUMPTIONS
####1)As a safety measure, the Supervisor has been vigilant to maintain the temperature below 3.9 deg C.
####2)Assume 3.9 deg C as upper acceptable temperature range and at alpha = 0.1
##### ACCORDING TO THE ASSUMPTIONS,NULL AND ALTERNATE HYPOTHESIS ARE:
## HO:MU<=3.9 deg c
## HA:MU>3.9 DEG C
alpha=0.1
X_bar=mean_sample #mean of sample
N=35 #sample size
MU=3.9 #as per assumption
SD=sd_n #standard error: population standard deviation divided by sqrt(sample size)
##CALCULATED Z statistic:
Z=(X_bar-MU)/SD
q=1-alpha ## unrejected region
Z_c=qnorm(q)
Z_c
### from the values of Z and Z_c we see that Z<Z_c so we fail to reject H0 (null hypothesis)
### This suggests that the problem is from the procurement side
### i.e. the data do not contradict the assumption that the temperature is maintained
###P-value method:
alpha=0.1
Z= 0.8641166
p_value=1-pnorm(Z) ## upper-tail probability, matching HA: mu > 3.9
p_value
### p_value is greater than alpha,
### so we do not reject H0
R STUDIO O/P:
Fig 35 a.
Fig 35 b.
Fig 35 a ,b :Z- test and its outcome
Fig 36: output of Z-test
Conclusion of Z-test:
• Firstly, on comparing Z with Z_c we see that Z does
not fall in the critical region, and on comparing the p-value with alpha we see that the
p-value is greater than alpha. So we fail to reject the null
hypothesis, which states that the cold storage temperature is maintained.
• Secondly, this suggests that the problem is from the procurement side.
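The Z-test steps can be wrapped in a small function, sketched below under stated assumptions: the function name z_test_upper, the sample values and the population standard deviation of 0.5 are hypothetical and are not the report's data. The p-value is the upper-tail probability, matching the alternative mu > mu0.

```r
# One-sided (upper-tail) one-sample z-test: H0: mu <= mu0 against HA: mu > mu0
z_test_upper <- function(x, mu0, sigma, alpha = 0.1) {
  n <- length(x)
  z <- (mean(x) - mu0) / (sigma / sqrt(n))   # test statistic
  z_crit <- qnorm(1 - alpha)                 # critical value
  p_value <- pnorm(z, lower.tail = FALSE)    # P(Z >= z) under H0
  list(z = z, z_crit = z_crit, p_value = p_value,
       reject_h0 = z > z_crit)
}

# Hypothetical sample and population standard deviation
sample_temps <- c(3.8, 4.0, 3.9, 4.1, 3.7, 4.0, 3.9, 4.2)
res <- z_test_upper(sample_temps, mu0 = 3.9, sigma = 0.5)
res
```

The critical value rule and the p-value rule always agree: z exceeds z_crit exactly when p_value is below alpha.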
Q2) State the Hypothesis, do the calculation using t-test

Assumptions:
1) As a safety measure, the Supervisor has been vigilant to maintain the
temperature below 3.9 deg C.
2) Assume 3.9 deg C as the upper acceptable temperature, with alpha = 0.1.
According to the above assumptions the hypotheses are:
NULL H0: mu <= 3.9 deg C (As a safety measure, the Supervisor has been vigilant to maintain
the temperature below 3.9 deg C.)
ALTERNATE HA: mu > 3.9 deg C (Assume 3.9 deg C as upper acceptable temperature
range)
Alpha=0.1
Based on the above hypotheses,
tstat=(xbar-mu)/(s/sqrt(n))
tstat is found to be tstat= 2.752359
pvalue=pt(tstat,34) ##cumulative (lower-tail) probability
pvalue= 0.9952888
pvalue(single tail)= 1-pt(tstat,34)
pvalue= 0.004711198
Here the p-value is less than alpha, hence the null hypothesis is rejected.
CODE:
getwd()
data=read.csv('Cold_Storage_Mar2018.csv')
data
####ASSUMPTIONS
####1)As a safety measure, the Supervisor has been vigilant to maintain the temperature below 3.9 deg C.
####2)Assume 3.9 deg C as upper acceptable temperature range and at alpha = 0.1
##### ACCORDING TO THE ASSUMPTIONS,NULL AND ALTERNATE HYPOTHESIS ARE:
## H0:MU<=3.9 deg c
## HA:MU>3.9 DEG C
alpha=0.1
mu=3.9
n=35
xbar=mean(data$Temperature)
s=sd(data$Temperature)
tstat=(xbar-mu)/(s/sqrt(n))
tstat
pvalue=pt(tstat,n-1) ## cumulative (lower-tail) probability
pvalue
p=1-pt(tstat,n-1) ## for single (upper) tail
p
R STUDIO O/P:
> xbar=mean(data$Temperature)
> s=sd(data$Temperature)
> tstat=(xbar-3.9)/(s/sqrt(35))
> pvalue=pt(tstat,34)
> pvalue
[1] 0.9952888
> p=1-pt(tstat,34)
> p
[1] 0.004711198
p=0.004711198 is less than alpha = 0.1, so we reject the null hypothesis in
favour of the alternate hypothesis.
Conclusion of T-test:
• Firstly, on comparing p with alpha we see that p is less than alpha, so
we accept the alternate hypothesis, which is that the mean temperature is greater than
3.9 deg C.
• Secondly, we can see that the sample mean is greater than the population
mean.
• Thirdly, we can conclude that the problem is with the cold storage unit.
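The manual calculation can be cross-checked against R's built-in t.test(). The sketch below does this on a made-up sample (sample_temps and its values are hypothetical, not the March 2018 data), using alternative = "greater" to match HA: mu > 3.9.

```r
# Hypothetical sample of daily temperatures (deg C)
sample_temps <- c(3.8, 4.0, 3.9, 4.1, 3.7, 4.0, 3.9, 4.2, 4.1, 4.0)

mu0 <- 3.9
n <- length(sample_temps)

# Manual calculation, using the sample standard deviation
t_manual <- (mean(sample_temps) - mu0) / (sd(sample_temps) / sqrt(n))
p_manual <- pt(t_manual, df = n - 1, lower.tail = FALSE)

# Built-in equivalent: one-sided test of H0: mu <= mu0 vs HA: mu > mu0
res <- t.test(sample_temps, mu = mu0, alternative = "greater")

c(manual = t_manual, builtin = unname(res$statistic))
c(manual = p_manual, builtin = res$p.value)
```

Both pairs of values should agree, which confirms that the degrees of freedom (n - 1) and the tail used in the manual calculation are the intended ones.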
Q3) Give your inference after doing both the tests

• The supervisor of the cold storage company insists that the cold storage unit
maintains a temperature below 3.9 deg C.
• To substantiate this to the clients he shows that the average temperature
during a period of 35 days is well below the average temperature from
2016-17.
• However, after stating the hypotheses and doing the Z and T tests, the T test is more
appropriate for this situation, since we are comparing a short duration
with a period of 2-3 years.
• According to the T-test,
Fig 37: output of T test
• If we compare the sample statistics from the t test with the population parameters, it
is seen that the average temperature of the sample is greater than that of the
population.
• Moreover, the T-test is preferred over the Z-test as the sample size N=35 is
considerably smaller.
5.CONCLUSION:
The dataset is analyzed to find the cause of the problem reported by
the customers of the cold storage unit. It is found through
hypothesis testing that the problem lies with the cold storage unit, in spite of the
supervisor stating that an optimum temperature is maintained. The data analysis
helps to bring out the insights of the dataset, making it possible to reach a conclusion on the
hypothesis. Moreover, it is finally concluded that the T-test is more appropriate in
this case, based on comparing the statistical parameters of the sample and
population datasets.
APPENDIX
####Setting up the environment
setwd("C:\\Users\\user\\Desktop\\pgp-babi")
getwd()
####Getting the dataset
data=read.csv('cold_storage.csv')
data
####Attaching dataset to R path

attach(data)
data
####Dimensions of dataset
dim(data)
####Getting top 5 rows

head(data,5)
####Getting bottom 5 rows

tail(data,5)
####Getting the structure of the Dataset

str(data)
####Getting Summary of dataset
summary(data)
####Checking for missing values and total no. of missing values

is.na(data)
sum(is.na(data))
####Univariate data analysis

###Analysis of numerical variables
par(mfrow=c(2,2))
hist(Temperature)
boxplot(Temperature,horizontal =TRUE ,main='Boxplot of temperature ')
hist(Date)
boxplot(Date,horizontal =TRUE,main='Boxplot of Date' )
####Checking for outlier and printing them

OutVals=boxplot(Temperature,horizontal =TRUE ,main='Boxplot of temperature')
OutVals #the out component (OutVals$out) holds the outlying values
###Analysis of categorical values

plot(Season)
plot(Month)
####Bi-variate analysis
library(ggplot2)
ggplot(data,aes(x=Temperature,fill=Season))+
  geom_histogram(col='Black',bins=15)+
  facet_wrap(~Season)
####PROBLEM 1
###Q1)Find mean cold storage temperature for Summer, Winter and Rainy Season:
#mean of temperature in summer:
library(tidyverse)
d2=data %>% filter(Season=="Summer")
d2
summer_mean=mean(d2$Temperature)
summer_mean
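#mean of temperature in winter:
#note: rows 1:31 select only the first 31 rows; filter(Season=="Winter") would cover every Winter row
d3=data[1:31,c('Season','Month','Date','Temperature')]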
d3
winter_mean=mean(d3$Temperature)
winter_mean
#mean of temperature in rainy:
d4=data %>% filter(Season=="Rainy")
d4
rainy_mean=mean(d4$Temperature)
rainy_mean
###Q2)Find overall mean for the full year

overall_mean=mean(data$Temperature)
overall_mean
###Q3)Find Standard Deviation for the full year

overall_sd=sd(data$Temperature)
overall_sd
###Q4) Assume Normal distribution, what is the probability of temperature
###having fallen below 2 deg C
mean=overall_mean
sd=overall_sd
pnorm(q=2,mean,sd,lower.tail = T)
###Q5) Assume Normal distribution, what is the probability of temperature having
###gone above 4 deg C
pnorm(q=4,mean,sd,lower.tail=F)
##Q6)What will be the penalty for the AMC Company

##The penalty for the AMC Company:
##1)for less than 2 deg c = 10% of AMC (as probability greater than 2.5%)
##2)for greater than 4 deg c = 0% of AMC (as probability less than 2.5%)
##Total penalty = 10% of AMC
####PROBLEM 2
###Q1)State the Hypothesis, do the calculation using z test
#Assumptions:
#1)As a safety measure, the Supervisor has been vigilant to maintain the temperature below 3.9 deg C.
#2)Assume 3.9 deg C as upper acceptable temperature range and at alpha = 0.1
#According to above assumptions the hypothesis is:
#NULL H0: mu<=3.9 deg c (As a safety measure, the Supervisor has been vigilant
#to maintain the temperature below 3.9 deg C.)
#ALTERNATE HA: mu> 3.9 deg c (Assume 3.9 deg C as upper acceptable
#temperature range)
##population data
data
##sampled data
data2=read.csv('Cold_Storage_Mar2018.csv')
data2
##standard deviation of population
standard_deviation=sd(data$Temperature)
standard_deviation
p=sqrt(35)
p
sd_n=standard_deviation/p
sd_n
##mean of sample
mean_sample=mean(data2$Temperature)
mean_sample
alpha=0.1
X_bar=mean_sample #mean of sample
N=35 #sample size
MU=3.9 #as per assumption
SD=sd_n #standard error: population standard deviation divided by sqrt(sample size)
##CALCULATED Z statistic:
Z=(X_bar-MU)/SD
Z
q=1-alpha ## unrejected region
Z_c=qnorm(q)
Z_c
### from the values of Z and Z_c we see that Z<Z_c so we fail to reject H0 (null hypothesis)
### This suggests that the problem is from the procurement side
### i.e. the data do not contradict the assumption that the temperature is maintained
###P-value method:
alpha=0.1
Z= 0.8641166
p_value=1-pnorm(Z) ## upper-tail probability, matching HA: mu > 3.9
p_value
### p_value is greater than alpha,
### so we do not reject H0
###Q2)State the Hypothesis, do the calculation using t-test

##Assumptions:
#1)As a safety measure, the Supervisor has been vigilant to maintain the temperature below 3.9 deg C.
#2)Assume 3.9 deg C as upper acceptable temperature range and at alpha = 0.1
#According to above assumptions the hypothesis is:
#NULL H0: mu<=3.9 deg c (As a safety measure, the Supervisor has been vigilant
#to maintain the temperature below 3.9 deg C.)
#ALTERNATE HA: mu> 3.9 deg c (Assume 3.9 deg C as upper acceptable
#temperature range)
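data=read.csv('Cold_Storage_Mar2018.csv') ## sample data for March 2018
alpha=0.1
mu=3.9
n=35
xbar=mean(data$Temperature)
s=sd(data$Temperature)
tstat=(xbar-mu)/(s/sqrt(n))
tstat
pvalue=pt(tstat,n-1) ## cumulative (lower-tail) probability
pvalue
p=1-pt(tstat,n-1) ## for single (upper) tail
p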
OUTPUT:
Q1)
Q2)
Q3)
Q4)
Q4)
PROBLEM 2
Q1)
Q2)
> xbar=mean(data$Temperature)
> s=sd(data$Temperature)
> tstat=(xbar-3.9)/(s/sqrt(35))
> pvalue
[1] 0.9952888
> p=1-pt(tstat,34)
> p
[1] 0.004711198