
Mini Project – Golf

Sravanthi.M

Table of Contents
1. Project Objective
2. Assumptions
3. Exploratory Data Analysis – Step by step approach
3.1. Environment Set up and Data Import
3.1.1. Install necessary Packages and Invoke Libraries
3.1.2. Set up working Directory
3.1.3. Import and Read the Dataset
3.2. Variable Identification
4. Conclusion
5. Detailed Explanation of Findings
5.1Q Formulate and present the rationale for a hypothesis test that Par could use to compare
the driving distances of the current and new golf balls.
5.2Q Analyze the data to provide the hypothesis testing conclusion. What is the p-value for your
test? What is your recommendation for Par Inc.?
5.3Q Provide descriptive statistical summaries of the data for each model.
5.4Q What is the 95% confidence interval for the population mean of each model, and what is
the 95% confidence interval for the difference between the means of the two populations?
5.5Q Do you see a need for larger sample sizes and more testing with the golf balls? Discuss.
5.6 Source Code


1 Project Objective
The objective of this report is to explore the Golf data set (“Golf”) in R and generate insights about
the data set. The exploration consists of the following:

• Importing the dataset in R
• Understanding the structure of the dataset
• Graphical exploration
• Descriptive statistics
• Insights from the dataset

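2 Assumptions
• The Independent Samples t-Test compares the means of two independent groups in
order to determine whether there is statistical evidence that the associated population
means are significantly different.
• The Independent Samples t-Test is a parametric test: it assumes that the two samples are
independent and that the sample means are approximately normally distributed, which is
reasonable here because each sample is large (n = 40).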

3 Exploratory Data Analysis – Step by step approach


A typical data exploration activity consists of the following steps:

1. Environment Set up and Data Import
2. Variable Identification
3. Variable Transformation / Feature Creation
4. Feature Exploration

We shall follow these steps in exploring the provided dataset.

3.1 Environment Set up and Data Import


3.1.1 Install necessary Packages and Invoke Libraries
Use this section to install the necessary packages and invoke the associated libraries. Keeping all
package installations and library calls in one place improves code readability. To install a package, use
install.packages("package_name")
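
A minimal sketch of one way to keep installation and loading together is to install a package only when it is missing. The helper names `missing_packages` and `ensure_packages` are illustrative assumptions and are not part of the original project code.

```r
# Return the subset of `pkgs` that is not yet installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install whatever is missing, then attach every package.
ensure_packages <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Usage (not run here because it may download packages):
# ensure_packages(c("readxl"))
```

With this pattern, re-running the script does not reinstall packages that are already present.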

3.1.2 Set up working Directory


Setting a working directory at the start of the R session makes importing and exporting data
files and code files easier. The working directory is the location (folder) on the PC where
the data, code and other files related to the project are kept. To set and check the working
directory we use the syntax below.
Syntax → setwd() & getwd()

Please refer to Section 5.6 for the source code.
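
The following sketch shows how the two calls fit together. It uses `tempdir()` as a stand-in for the project folder, which is an assumption made so the example runs on any machine; the variable names are illustrative.

```r
# Remember the current working directory so it can be restored later.
old_dir <- getwd()

# tempdir() stands in for the project folder in this sketch.
project_dir <- tempdir()
setwd(project_dir)

# getwd() now reports the project folder.
current_dir <- getwd()
print(current_dir)

# Restore the original working directory.
setwd(old_dir)
```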

3.1.3 Import and Read the Dataset
The given dataset is in .xls format. Hence, the ‘read_excel’ function from the readxl package is
used to import the file.

Please refer to Section 5.6 for the source code.
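
As a hedged sketch, the import can be guarded so that a missing file or a missing package gives a clear message instead of an error. The wrapper name `load_golf` is a hypothetical helper and is not part of the original code; `readxl::read_excel` is the function the report uses.

```r
# Read an Excel file if it exists and readxl is available; otherwise return NULL.
load_golf <- function(path = "Golf.xls") {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(NULL)
  }
  if (!requireNamespace("readxl", quietly = TRUE)) {
    message("Package 'readxl' is not installed.")
    return(NULL)
  }
  readxl::read_excel(path)
}

# Looks for Golf.xls in the current working directory.
Golf <- load_golf("Golf.xls")
```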

3.2 Variable Identification


We use the following functions:
• setwd(): sets the working directory
• getwd(): returns the absolute file path of the current working directory
• dim(): returns the dimensions of the dataset (number of rows and columns)
• str(): compactly displays the structure of the dataset (column names, types and first values)
• head(): displays the first rows of the dataset (6 by default; head(Golf, 10) displays the first 10)
• tail(): displays the last rows of the dataset (6 by default; tail(Golf, 10) displays the last 10)
• colSums(): returns the sum of each column
• colMeans(): returns the mean of each column
• summary(): a generic function that produces result summaries. The function invokes particular
methods which depend on the class of the first argument; for a data frame it reports the minimum,
quartiles, median, mean and maximum of each numeric column.
• hist(): plots a histogram
• boxplot(): plots a boxplot
• mean(): returns the mean
• sd(): returns the standard deviation
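
The inspection functions listed above can be tried on a small placeholder data frame. The column names `Current` and `New` follow the Golf data set, but the values below are made up for illustration and are not the project data.

```r
# Placeholder data frame with the same column names as the Golf data set.
# The values are made up for illustration and are NOT the project data.
demo <- data.frame(
  Current = c(270, 265, 280, 275, 260),
  New     = c(268, 272, 261, 266, 273)
)

dim(demo)                     # number of rows, then number of columns
str(demo)                     # column names, types and first values
first_rows <- head(demo, 3)   # first 3 rows
last_rows  <- tail(demo, 3)   # last 3 rows
col_totals <- colSums(demo)   # sum of each column
col_means  <- colMeans(demo)  # mean of each column
summary(demo)                 # min, quartiles, median, mean, max per column
```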

4 Conclusion
Based on the given data set, Par, Inc. can go ahead and launch the new (cut-resistant) ball: the test
finds no statistically significant difference in mean driving distance between the two models. To draw
a final conclusion, however, we need larger sample sizes and tests under different weather
conditions and at different locations.
The detailed findings behind this conclusion are given below.

5 Detailed Explanation of Findings

5.1Q Formulate and present the rationale for a hypothesis test that Par could use to compare the driving
distances of the current and new golf balls.

Ans: This section formulates and presents the rationale for a hypothesis test that Par, Inc. could use to
compare the driving distances of the current and new golf balls. Following the tests on the durability of
the improved product, another issue has been raised: the effect of the new coating on driving distance.
Forty balls of each model (new and current) were subjected to distance tests. The two samples are
independent, and with n = 40 per model this is a large-sample case. The null hypothesis and alternative
hypothesis are formulated as follows:

• Mean distance of current-model balls: µ1.
• Mean distance of new cut-resistant balls: µ2.
• H0: µ1 = µ2 (Mean distance of current balls equals mean distance of new balls)
• H1: µ1 ≠ µ2 (Mean distance of current balls does not equal mean distance of new balls, i.e. the
difference µ1 − µ2 is not equal to 0)
• Level of significance: α = 0.05, so the two-tailed critical value is z = 1.96

Next, we perform a two-tailed t-test using the syntax below. The code is attached in Section 5.6.

t.test(Golf$Current, Golf$New, paired = FALSE, conf.level = 0.95, alternative = "two.sided")

The t-test produces the following values:

Observations                      Values
Degrees of Freedom (df)           76.852
t-value                           1.32
p-value                           0.188
95% Lower Confidence Interval     -1.384
95% Upper Confidence Interval     6.934

Results: The p-value (0.188) is greater than α = 0.05, so we fail to reject the Null Hypothesis. The data
do not provide sufficient evidence that the mean driving distances of Current balls and New balls differ.
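
As a cross-check, the test statistic, degrees of freedom and p-value can be recomputed from the summary statistics reported in this report (means 270.275 and 267.5, standard deviations 8.75 and 9.89, n = 40 each). This is a sketch using the Welch (unequal variance) formulas, which is what `t.test` applies by default; because the inputs are rounded, the results match the reported values only approximately. The variable names are illustrative.

```r
# Summary statistics as reported in this report (rounded).
mean_current <- 270.275; sd_current <- 8.75; n_current <- 40
mean_new     <- 267.5;   sd_new     <- 9.89; n_new     <- 40
alpha <- 0.05

# Standard error of the difference between the two sample means.
var_term_current <- sd_current^2 / n_current
var_term_new     <- sd_new^2 / n_new
se_diff <- sqrt(var_term_current + var_term_new)

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
t_stat <- (mean_current - mean_new) / se_diff
df_welch <- (var_term_current + var_term_new)^2 /
  (var_term_current^2 / (n_current - 1) + var_term_new^2 / (n_new - 1))

# Two-tailed p-value and the resulting decision at level alpha.
p_value <- 2 * pt(-abs(t_stat), df = df_welch)
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"

cat("t =", round(t_stat, 3), " df =", round(df_welch, 2),
    " p =", round(p_value, 3), " ->", decision, "\n")
```

The decision rule compares the p-value with α: H0 is rejected only when the p-value is smaller than α.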

5.2Q Analyze the data to provide the hypothesis testing conclusion. What is the p-value for your test?
What is your recommendation for Par Inc.?
Ans: The hypothesis testing conclusion is as follows:
We performed a two-tailed t-test and obtained a p-value of 0.188. Since 0.188 > 0.05, we fail to reject
the null hypothesis.
Recommendation: With the given sample data of Par, Inc., we observed no statistically significant
difference in mean driving distance, so Par, Inc. can go ahead and launch the new golf ball (cut-resistant).

5.3Q Provide descriptive statistical summaries of the data for each model.

Ans: For descriptive statistical summaries we use histograms and boxplots. For the Golf data set, the
histograms and boxplots are plotted with the syntax below.

Histogram for Current Balls:
Syntax: hist(Golf$Current, main = "Current Balls", xlab = "Driving distance", border = "pink",
col = "Blue")

Boxplot for Current Balls:
Syntax: boxplot(Golf$Current, main = "Current Balls", xlab = "Driving distance", border = "Red",
col = "Blue", horizontal = TRUE)

Histogram for New Balls:
Syntax: hist(Golf$New, main = "New Balls", xlab = "Driving distance", border = "Green",
col = "Blue")

Boxplot for New Balls:
Syntax: boxplot(Golf$New, main = "New Balls", xlab = "Driving distance", border = "Orange",
col = "Blue", horizontal = TRUE)
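
Alongside the plots, a numeric summary of each model can be useful. The helper `describe` below is a hypothetical sketch, not part of the original code, and the vector it is applied to holds placeholder values rather than the Golf data.

```r
# Numeric descriptive summary of one sample.
describe <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x),
    min = min(x), median = median(x), max = max(x))
}

# Placeholder values for illustration only, NOT the Golf data.
demo_distances <- c(270, 265, 280, 275, 260)
demo_summary <- describe(demo_distances)
print(round(demo_summary, 2))

# With the data loaded, the same helper could be applied per model:
# describe(Golf$Current); describe(Golf$New)
```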

5.4Q What is the 95% confidence interval for the population mean of each model, and what is the 95%
confidence interval for the difference between the means of the two population?
Ans: The problem has two parts. The first part is to find the 95% confidence interval for the
population mean of each model, Current and New. We can calculate it with the formula
given below.
$$\bar{x} \pm Z_{\alpha/2} \, \frac{\sigma}{\sqrt{n}}$$

Confidence Interval for Current Ball:

x̄ = mean(Golf$Current) ← syntax for the mean of the Current ball
Z = 1.960
Note: For a 95% confidence interval, α = 0.05 and the standard Z value is 1.960.
σ = sd(Golf$Current) ← syntax for the standard deviation
n = sample size
95% Confidence Interval for Current Ball
Confidence Interval        95%
Mean                       270.275
Standard Deviation         8.75
n                          40
Z                          1.960
Upper Confidence Limit     272.98
Lower Confidence Limit     267.56

Note: The values in the table are derived from the formula above using the syntax shown. The
code is attached in Section 5.6.

Confidence Interval for New Ball:

x̄ = mean(Golf$New) ← syntax for the mean of the New ball
Z = 1.960
Note: For a 95% confidence interval, α = 0.05 and the standard Z value is 1.960.
σ = sd(Golf$New) ← syntax for the standard deviation
n = sample size
95% Confidence Interval for New Ball
Confidence Interval        95%
Mean                       267.5
Standard Deviation         9.89
n                          40
Z                          1.960
Upper Confidence Limit     270.56
Lower Confidence Limit     264.43

Note: The values in the table are derived from the formula above using the syntax shown. The
code is attached in Section 5.6.
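
The two intervals can be reproduced from the tabulated summary statistics. The sketch below wraps the formula x̄ ± Z·σ/√n in a hypothetical helper `z_interval` (not part of the original code) and obtains Z from `qnorm` rather than hard-coding 1.960.

```r
# z-based confidence interval for a mean from summary statistics.
z_interval <- function(xbar, s, n, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)   # about 1.96 for a 95% level
  margin <- z * s / sqrt(n)              # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

# Mean, standard deviation and n as reported in the tables above.
ci_current <- z_interval(270.275, 8.75, 40)
ci_new     <- z_interval(267.5,   9.89, 40)

print(round(ci_current, 2))
print(round(ci_new, 2))
```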

The second part of the problem is to find the 95% confidence interval for the difference between the
means of the two populations. We can calculate it with the formula given below, or obtain it from the
output of the t-test.

$$(\bar{x}_1 - \bar{x}_2) \pm t_{1-\alpha/2,\,\nu} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Note: The values in the table are derived using the syntax and code attached in
Section 5.6.

95% Confidence Interval by Hypothesis T-test for given Independent Group

Observations                           Current Ball    New Ball
Mean                                   270.275         267.5
Standard Deviation                     8.75            9.89
Sample Size                            40              40
Difference of Means (Current – New)    2.775
Degrees of Freedom (df)                76.852
t-value                                1.32
p-value                                0.188
95% Lower Confidence Interval          -1.384
95% Upper Confidence Interval          6.934

Result: The 95% confidence interval for the difference between the means of the two populations is
(-1.384, 6.934)
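
The interval for the difference can be recomputed from the tabulated values as a check. This sketch plugs the reported degrees of freedom (76.852) into `qt` for the critical value; since the standard deviations are rounded, the limits agree with the t-test output only to about two decimal places. The variable names are illustrative.

```r
# Values taken from the table above.
diff_means <- 270.275 - 267.5                  # Current minus New
se_diff <- sqrt(8.75^2 / 40 + 9.89^2 / 40)     # standard error of the difference
df_welch <- 76.852                             # degrees of freedom reported by t.test

# Two-tailed critical value for a 95% confidence level.
t_crit <- qt(1 - 0.05 / 2, df = df_welch)

ci_diff <- c(lower = diff_means - t_crit * se_diff,
             upper = diff_means + t_crit * se_diff)
print(round(ci_diff, 3))

# An interval that contains 0 is consistent with no difference in means.
contains_zero <- ci_diff[["lower"]] < 0 && ci_diff[["upper"]] > 0
print(contains_zero)
```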
5.5Q Do you see a need for larger sample sizes and more testing with the golf balls? Discuss.
Ans: Yes. The statistical summaries of the given data suggest that larger sample sizes are needed, so
that the balls can be tested under different circumstances and weather conditions and the conclusions
rest on firmer ground. With a larger sample, the observed cost and driving distance of the ball may also
differ from the current results. Hence, we recommend testing with a larger sample size.
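
One way to put a number on the need for a larger sample is a power calculation with `power.t.test` from base R. The sketch below treats the observed difference (2.775) and the pooled standard deviation as the true values and targets 80% power; both choices are assumptions made for illustration, not figures from the case.

```r
# Inputs taken from the summary statistics in this report.
delta <- 270.275 - 267.5                    # observed difference in means
sd_pooled <- sqrt((8.75^2 + 9.89^2) / 2)    # pooled SD for equal group sizes

# Power of the current design (40 balls per model) to detect `delta`.
current_design <- power.t.test(n = 40, delta = delta, sd = sd_pooled,
                               sig.level = 0.05)

# Balls per model needed for 80% power (the 80% target is an assumption).
target_design <- power.t.test(delta = delta, sd = sd_pooled,
                              sig.level = 0.05, power = 0.80)

cat("Power with n = 40 per model:", round(current_design$power, 2), "\n")
cat("n per model for 80% power:", ceiling(target_design$n), "\n")
```

If the required n per model is well above 40, the current test has limited ability to detect a difference of this size, which supports collecting more data.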
5.6 Source Code

## Setting up working directory and getting working directory

setwd("D:/College Data/Statistical Methods for Decision Making/Project")

getwd()

## installing readr package

install.packages("readr")

library(readr)

##installing readxl package

install.packages("readxl")

library(readxl)

##import data

Golf <- read_excel("Golf.xls")

dim(Golf) ##dimensions of dataset

str(Golf)

head(Golf,10)

tail(Golf,10)

colSums(Golf,na.rm = FALSE,dims = 1L)

colMeans(Golf,na.rm = FALSE,dims = 1L)

summary(Golf)

## Apply t test

t.test(Golf$Current, Golf$New, paired = FALSE, conf.level = 0.95,
       alternative = "two.sided")

##install ggplot2 package

install.packages("ggplot2")

library(ggplot2)

## statistical summaries of the data for each model.

hist(Golf$Current, main = "Current Balls", xlab = "Driving distance",
     border = "pink", col = "Blue")

boxplot(Golf$Current, main = "Current Balls", xlab = "Driving distance",
        border = "Red", col = "Blue", horizontal = TRUE)

hist(Golf$New, main = "New Balls", xlab = "Driving distance",
     border = "Green", col = "Blue")

boxplot(Golf$New, main = "New Balls", xlab = "Driving distance",
        border = "Orange", col = "Blue", horizontal = TRUE)

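## 95% confidence interval for the population mean of each model, Current and New

# Current
x1bar = mean(Golf$Current)  # sample mean of the Current balls
s1 = sd(Golf$Current)       # sample standard deviation of the Current balls
n = 40                      # sample size of each model
z = 1.960                   # Z value for a 95% confidence level

ULC = x1bar + z * s1 / sqrt(n)  # upper confidence limit, Current
LLC = x1bar - z * s1 / sqrt(n)  # lower confidence limit, Current

# New
x2bar = mean(Golf$New)      # sample mean of the New balls
s2 = sd(Golf$New)           # sample standard deviation of the New balls

ULN = x2bar + z * s2 / sqrt(n)  # upper confidence limit, New
LLN = x2bar - z * s2 / sqrt(n)  # lower confidence limit, New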
