Probability & Statistics in Golf Ball Distances

2DI90 Probability & Statistics
2DI90 – Chapter 6 of MR
What is Statistics?
According to the Encyclopedia Britannica:
“Statistics is the art and science of gathering, analyzing, and
making inferences from data”
In his book “Statistical Models”, A. C. Davison answers
the question in a more thorough way:
“Statistics concerns what can be learned from data. Applied
statistics comprises a body of methods for data collection
and analysis across the whole range of science, and in areas
such as engineering, medicine, business, and law - wherever
variable data must be summarized, or used to test or confirm
theories, or to inform decisions. Theoretical statistics
underpins this by providing a framework for understanding
the properties and scope of methods used in applications.”
A Typical Example
The presidential elections in the United States work in a funny
way: in each state there is essentially a separate election.
Very important are the so-called “swing states”, for which it
is difficult to predict the outcome of the electoral process.
In the 2012 election Wisconsin appeared to be such a swing
state. A phone survey (July 25, 2012) of 480 likely voters
yielded the following data: 248 individuals indicated they would
vote for Barack Obama; 232 individuals indicated they would vote
for Mitt Romney.
What predictions can be made about the outcome of the
Wisconsin election (if it were to take place on that same day)?
The data in this example is loosely based on a recent poll, as described in http://www.rasmussenreports.com/public_content/politics/elections/election_2012/election_2012_presidential_election/wisconsin/election_2012_wisconsin_president.
Probability AND Statistics
Probability and Statistics are NOT the same thing!!!
•  Probability provides the foundation of statistics

•  Statistics is concerned with making inferences from data,
assumed to be collected according to a probabilistic model
[Diagram: Population (World) and Sample. The Probabilistic Model explains how the data was created, going from the population to the sample. Statistics is inference about the population/world from the sample.]
What is Data?
Definition: Data and Dataset
This seems a bit vague… For our purposes:
Data is a collection of numerical or categorical observations of
a certain process (physical, biological, social, etc.).
Sometimes the order of the data is very important (e.g. AEX
over time); other times it is irrelevant in terms of the
questions asked (e.g. exam grades of 2DI90).
Descriptive Statistics
Typically we can only say something sensible about the world
if we assume a statistical model for it. Nevertheless, a good
start is to summarize the contents of a dataset, or
represent them in a palatable way.
This is the goal of Descriptive Statistics, which are either
numerical or graphical summaries and representations of data.
In what follows we will concentrate mostly on scenarios where
the ordering of the elements in the dataset is not considered
important. E.g.:
•  Exam grades of 2DI90
•  Customer satisfaction ratings of a store
•  Number of rotten apples in crates of apples from a certain
producer (order of the crates doesn’t matter)
A Typical Dataset (Exercise 6.33 of M)
Official golf balls must conform to a number of rules. In
particular they are expected to all have similar aerodynamic
properties.
To test these properties 100 balls from the brand “HIT-ME”
are “hit” by a machine, and the distance reached by each ball
is measured (in meters):
261.3 ; 258.1 ; 254.2 ; 257.7 ; 237.9 ; 255.8 ; 241.4 ; 256.8 ; 255.3 ; 255 ; 259.4 ; 270.5 ; 270.7 ; 272.6 ;
272.2 ; 251.1 ; 232.1 ; 273 ; 266.6 ; 273.2 ; 265.7 ; 264.5 ; 245.5 ; 280.3 ; 248.3 ; 267 ; 271.5 ; 240.8 ;
268.9 ; 263.5 ; 262.2 ; 244.8 ; 279.6 ; 272.7 ; 278.7 ; 255.8 ; 276.1 ; 274.2 ; 267.4 ; 244.5 ; 252 ; 264 ;
247.7 ; 273.6 ; 264.5 ; 285.3 ; 277.8 ; 261.4 ; 253.6 ; 278.5 ; 260 ; 271.2 ; 254.8 ; 256.1 ; 264.5 ; 255.4 ;
259.5 ; 274.9 ; 272.1 ; 273.3 ; 279.3 ; 279.8 ; 272.8 ; 268.5 ; 283.7 ; 263.2 ; 257.5 ; 233.7 ; 260.2 ; 263.7 ;
244.3 ; 241.2 ; 254.4 ; 274 ; 260.7 ; 260.6 ; 255.1 ; 233.7 ; 253.7 ; 250.2 ; 251.4 ; 270.6 ; 273.4 ; 242.9 ;
276.6 ; 237.8 ; 261 ; 236 ; 251.8 ; 280.3 ; 268.3 ; 266.8 ; 254.5 ; 234.3 ; 251.6 ; 226.8 ; 240.5 ; 252.1 ;
245.6 ; 270.5
Our goal is to make “meaningful” statements about all the
“HIT-ME” balls, but using only this sample…
A Typical Dataset (Exercise 6.33 of M)
[Diagram: the Population (all “HIT-ME” golf balls) and the Sample (a small number of golf balls randomly taken from the population).]
Our hope is that the sample is somewhat representative of the
entire population of golf balls…
Before trying to do this, let’s see if we can “understand” the
data a bit better, and summarize it in nice ways…
Numerical Summaries – Sample Mean
Definition: Sample Mean/Sample Average
Given a dataset $x_1, \ldots, x_n$, the sample mean is
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
For the golf ball dataset we can easily compute this (with a
computer) and see that $\bar{x} \approx 260.3$ m.
Clearly this is good information to have, but it would be good to
know also if the distances are similar for all balls, or vary
wildly…
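The “with a computer” step can be sketched in a few lines of Python. This is only an illustration: the list `distances` is typed in from the 100 values listed above, and the helper name `sample_mean` is made up for this example.

```python
# Distances (in meters) of the 100 "HIT-ME" balls, copied from the dataset above.
distances = [
    261.3, 258.1, 254.2, 257.7, 237.9, 255.8, 241.4, 256.8, 255.3, 255.0, 259.4, 270.5, 270.7, 272.6,
    272.2, 251.1, 232.1, 273.0, 266.6, 273.2, 265.7, 264.5, 245.5, 280.3, 248.3, 267.0, 271.5, 240.8,
    268.9, 263.5, 262.2, 244.8, 279.6, 272.7, 278.7, 255.8, 276.1, 274.2, 267.4, 244.5, 252.0, 264.0,
    247.7, 273.6, 264.5, 285.3, 277.8, 261.4, 253.6, 278.5, 260.0, 271.2, 254.8, 256.1, 264.5, 255.4,
    259.5, 274.9, 272.1, 273.3, 279.3, 279.8, 272.8, 268.5, 283.7, 263.2, 257.5, 233.7, 260.2, 263.7,
    244.3, 241.2, 254.4, 274.0, 260.7, 260.6, 255.1, 233.7, 253.7, 250.2, 251.4, 270.6, 273.4, 242.9,
    276.6, 237.8, 261.0, 236.0, 251.8, 280.3, 268.3, 266.8, 254.5, 234.3, 251.6, 226.8, 240.5, 252.1,
    245.6, 270.5,
]


def sample_mean(xs):
    """Sample mean: the sum of the observations divided by their number."""
    return sum(xs) / len(xs)


x_bar = sample_mean(distances)
print(f"n = {len(distances)}, sample mean = {x_bar:.1f} m")
```

The standard library function `statistics.fmean` computes the same quantity, so in practice there is no need to write the helper by hand.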
Sample Variance/Standard Deviation
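Definition: Sample Variance/Standard Deviation
Given a dataset $x_1, \ldots, x_n$ with sample mean $\bar{x}$, the sample variance is
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
In our example $s^2 \approx 180\ \mathrm{m}^2$. Notice the units are squared!
The sample standard deviation is given by
$s = \sqrt{s^2}$
which in our example is $s \approx 13.41$ m.
An intuitive interpretation of what the sample standard
deviation represents is not so easy, but we can still understand
why it does measure variability: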
Sample Variance/Standard Deviation
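Properties: Sample Variance/Standard Deviation
For a dataset $x_1, \ldots, x_n$ with sample mean $\bar{x}$:
$s^2 \geq 0$ (always non-negative)
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right)$
The last expression makes computations on paper easier (but it
is often numerically not good)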
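To see both the definition and the numerical warning in action, here is a small Python sketch. The function names and the `shifted` dataset are made up for illustration; `small` holds the first five golf ball distances. The shortcut form subtracts two huge, nearly equal numbers, which is where floating point arithmetic gets into trouble.

```python
import math


def sample_variance(xs):
    """Definition: squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Shortcut form: (sum of squares - n * mean^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean * mean) / (n - 1)


# First five golf ball distances (meters): both forms agree closely.
small = [261.3, 258.1, 254.2, 257.7, 237.9]
print(sample_variance(small), sample_variance_shortcut(small))
print(math.sqrt(sample_variance(small)))  # sample standard deviation

# Hypothetical data with a huge offset and a small spread. The deviations
# from the mean are exactly -6, -3, 3, 6, but the shortcut has to subtract
# two numbers of size about 4e18 and cannot recover the small difference.
shifted = [1e9 + 4.0, 1e9 + 7.0, 1e9 + 13.0, 1e9 + 16.0]
print(sample_variance(shifted))
print(sample_variance_shortcut(shifted))
```

On well-scaled data like the golf ball distances the two forms are practically indistinguishable; the problem only shows up when the mean is very large compared to the spread.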
Toyish Example
Four measurements were made of the inside diameter of
forged piston rings used in a scooter engine. The data, in
millimeters, is
The Sample Range
Another way to assess variability:
Definition: Sample Range
$r = \max_i x_i - \min_i x_i$
In our example $r = 285.3 - 226.8 = 58.5$ m.
This is easy to calculate, but ignores a lot of information in the
sample!
It is also extremely sensitive to outliers (abnormal values)…
The Sample Range
Suppose there was an element in the dataset that
corresponded to a defective ball, and the corresponding
distance was only 150 meters. The sample range would change
to 135.3 m (before it was 58.5 m), and the sample standard
deviation would change to 17.36 m (before it was 13.41 m).
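This sensitivity is easy to check numerically. The sketch below is an illustration under one assumption: the slides do not say which ball is the defective one, so here the first observation is (arbitrarily) replaced by 150 m. The helper `sample_range` is a made-up name; `stdev` is the sample standard deviation from the Python standard library.

```python
from statistics import stdev

# Distances (in meters) of the 100 "HIT-ME" balls, copied from the dataset.
distances = [
    261.3, 258.1, 254.2, 257.7, 237.9, 255.8, 241.4, 256.8, 255.3, 255.0, 259.4, 270.5, 270.7, 272.6,
    272.2, 251.1, 232.1, 273.0, 266.6, 273.2, 265.7, 264.5, 245.5, 280.3, 248.3, 267.0, 271.5, 240.8,
    268.9, 263.5, 262.2, 244.8, 279.6, 272.7, 278.7, 255.8, 276.1, 274.2, 267.4, 244.5, 252.0, 264.0,
    247.7, 273.6, 264.5, 285.3, 277.8, 261.4, 253.6, 278.5, 260.0, 271.2, 254.8, 256.1, 264.5, 255.4,
    259.5, 274.9, 272.1, 273.3, 279.3, 279.8, 272.8, 268.5, 283.7, 263.2, 257.5, 233.7, 260.2, 263.7,
    244.3, 241.2, 254.4, 274.0, 260.7, 260.6, 255.1, 233.7, 253.7, 250.2, 251.4, 270.6, 273.4, 242.9,
    276.6, 237.8, 261.0, 236.0, 251.8, 280.3, 268.3, 266.8, 254.5, 234.3, 251.6, 226.8, 240.5, 252.1,
    245.6, 270.5,
]


def sample_range(xs):
    """Largest observation minus smallest observation."""
    return max(xs) - min(xs)


# Assumption: the defective ball replaces the first observation.
with_defect = [150.0] + distances[1:]

for label, xs in (("original", distances), ("with defective ball", with_defect)):
    print(f"{label}: range = {sample_range(xs):.1f} m, "
          f"standard deviation = {stdev(xs):.2f} m")
```

A single abnormal value more than doubles the range, while the standard deviation, which uses all observations, moves much less.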
Other Numerical Summaries
There are many other numerical summaries that are important
(we’ll encounter them later, in the context of graphical
representations of data)
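Definition: Order Statistics
Given a dataset $x_1, \ldots, x_n$, the order statistics are the same values sorted in increasing order:
$x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}$
If one is in a setting where the order of the data is not
important then all the information one might need is contained
in the order statistics!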
Sample Median and Percentiles
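Definition: Sample Median
With $x_{(1)} \leq \cdots \leq x_{(n)}$ denoting the order statistics, the sample median is
$\tilde{x} = x_{((n+1)/2)}$ if $n$ is odd, and $\tilde{x} = \frac{1}{2} \left( x_{(n/2)} + x_{(n/2+1)} \right)$ if $n$ is even
This is essentially the value that “splits” the dataset in two:
approximately half of the data is below the median and half is
above the median. We can more generally define
Definition: Sample Percentiles
For $0 < p < 1$, the $100p$-th sample percentile is a value such that approximately $100p\%$ of the data lies below it.
Calculation of sample percentiles is not done the same way
everywhere, and most statistical packages use a definition that
involves interpolation (like the median above).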
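As a rough illustration, the median can be read off the sorted data directly, and the remark about differing percentile conventions can be seen with Python's standard library, which offers two interpolation methods. The helper `sample_median` and the small datasets are made up for this sketch.

```python
from statistics import median, quantiles


def sample_median(xs):
    """Middle order statistic, or the average of the two middle ones."""
    ordered = sorted(xs)  # order statistics x_(1) <= ... <= x_(n)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# First six golf ball distances (meters), in the order they were recorded.
sample = [261.3, 258.1, 254.2, 257.7, 237.9, 255.8]
print(sorted(sample))
print(sample_median(sample), median(sample))

# Two interpolation conventions applied to the same data: the quartiles
# differ, although the median (the middle value) is the same.
data = list(range(1, 11))
print(quantiles(data, n=4, method="exclusive"))
print(quantiles(data, n=4, method="inclusive"))
```

Neither convention is “the” right one, so when comparing percentiles across software packages it is worth checking which definition each of them uses.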
Graphical Representations
Especially for large datasets, graphical representations are
often much more (qualitatively) informative than numerical
summaries. Perhaps the simplest graphical representation is the
scatter-plot:
[Figure: scatter-plot of the golf ball distances along a single axis, from 230 to 280]
It is sometimes convenient to jitter the points (displace them
slightly at random, perpendicular to the axis),
so it is easier to see what’s going on…
[Figure: jittered scatter-plot of the same data, axis from 230 to 280]
Histograms
Scatterplots are still a bit difficult to read – a way we can get
a better view is by aggregating data into bins
[Figure: scatter-plot of the data (axis from 230 to 280) above a histogram of x; horizontal axis x from 230 to 290, vertical axis Frequency]
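Aggregating into bins just means counting how many observations fall in each interval. A minimal sketch follows; the function name, the bin edges and the choice of half-open bins are assumptions made for this example, and plotting software may place the edges differently.

```python
def histogram_counts(xs, low, high, n_bins):
    """Count the observations in each of n_bins equal-width bins covering
    [low, high]; a value equal to `high` is put in the last bin."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in xs:
        if not (low <= x <= high):
            raise ValueError(f"{x} lies outside [{low}, {high}]")
        index = min(int((x - low) // width), n_bins - 1)
        counts[index] += 1
    return counts


# First 14 golf ball distances (meters) from the dataset, for illustration.
sample = [261.3, 258.1, 254.2, 257.7, 237.9, 255.8, 241.4,
          256.8, 255.3, 255.0, 259.4, 270.5, 270.7, 272.6]

counts = histogram_counts(sample, low=230, high=280, n_bins=5)
for i, count in enumerate(counts):
    # A crude text histogram: one star per observation in the bin.
    print(f"[{230 + 10 * i}, {240 + 10 * i}): {'*' * count}")
```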
Histograms – Choice of Binning
The choice of the number of bins is a tricky business…
[Figure: three histograms of x (vertical axis Frequency): one with too few bins, one with just the right number of bins, and one with too many bins]
There are rules-of-thumb for the number of bins that most
software will use… You don’t need to worry too much...
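One widely used rule of thumb is Sturges’ rule, which is for instance the default in R’s `hist` function. Whether it is the rule behind the plots shown here is an assumption; the sketch below only illustrates how such a rule turns the sample size into a number of bins.

```python
import math


def sturges_bins(n):
    """Sturges' rule: ceil(log2(n)) + 1 bins for n observations."""
    return math.ceil(math.log2(n)) + 1


for n in (10, 100, 1000):
    print(f"n = {n}: {sturges_bins(n)} bins")
```

Note how slowly the number of bins grows with the sample size: going from 100 to 1000 observations adds only a few bins.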
Histograms
Actually, if the data can be viewed as independent samples
from some continuous distribution, the histogram can be
interpreted as an estimate of the true underlying density
function !!!
[Figure: histogram of x; horizontal axis x from 230 to 290, vertical axis Frequency]
Golf ball example: This clearly doesn’t look like a bell-shape, so
it is unlikely we can model the distances using a normal
distribution…
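For the histogram to be read as a density estimate, the bar heights have to be rescaled so that the total area of the bars is 1, just like a density integrates to 1. A small sketch, with made-up bin counts and a made-up function name:

```python
def density_heights(counts, bin_width):
    """Rescale bin counts so that the bars have total area 1."""
    n = sum(counts)
    return [count / (n * bin_width) for count in counts]


# Hypothetical counts for five bins, each 10 m wide.
counts = [1, 1, 8, 1, 3]
bin_width = 10

heights = density_heights(counts, bin_width)
area = sum(height * bin_width for height in heights)
print(heights)
print(area)
```

The shape of the histogram is unchanged by this rescaling; only the vertical axis changes from Frequency to Density.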
Frequency Distributions
You can also describe the histogram numerically
[Figure: histogram of x; horizontal axis x from 230 to 290, vertical axis Frequency]
Important: you always lose information when binning, but it will
often make it much easier to qualitatively understand the
data… Keep this in mind…
Box-Plots
These are funny looking plots that give a nice graphical
representation of the data…
[Figure: scatter-plot of the data and the corresponding box-plot, axis from 230 to 280. The box-plot marks five values:]
•  Smallest Value
•  First Quartile: 25% of the data is below this
•  Sample Median: half the data is below this
•  Third Quartile: 75% of the data is below this
•  Largest Value
Actually, the box-plots can be a bit different than this…
Box and Whisker Plots
Suppose that in the golf ball dataset there were three elements
that were quite extreme: three extra balls with distances
220m, 215m, 310m.
[Figure: box and whisker plot, axis from 220 to 300, marking the First Quartile, Sample Median and Third Quartile; the box spans the InterQuartile Range (IQR), and the points beyond the whiskers are marked as Outliers]
The lower whisker extends to the smallest point within 1.5 IQR of
the first quartile, and the upper whisker extends to the largest
point within 1.5 IQR of the third quartile.
These plots are easy to understand, and are therefore quite
useful; we can even compare different datasets easily…
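The ingredients of a box and whisker plot can be computed by hand. The sketch below is one possible way to do it: the regular distances are hypothetical, only the three extreme balls come from the example, and the quartiles use the “inclusive” interpolation method of Python’s standard library, which is just one of several conventions.

```python
from statistics import quantiles


def box_plot_summary(xs):
    """Quartiles, whisker ends and outliers using the 1.5 IQR rule."""
    ordered = sorted(xs)
    q1, med, q3 = quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whiskers": (inside[0], inside[-1]),  # most extreme non-outliers
        "outliers": outliers,
    }


# Hypothetical regular distances (meters) plus the three extreme balls.
regular = [250.0, 252.0, 254.0, 256.0, 258.0, 260.0,
           262.0, 264.0, 266.0, 268.0, 270.0]
summary = box_plot_summary(regular + [220.0, 215.0, 310.0])
for name, value in summary.items():
    print(f"{name}: {value}")
```

The whiskers stop at the most extreme points that are not flagged, and the three extra balls end up as individually drawn outliers.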
Multiple Box Plots
Example inspired by Figure 6.15 of Montgomery:
Data of the manufacturing quality index of three companies:
[Figure: comparative boxplots of Quality Index for companies C1, C2 and C3; axis from 70 to 110]
Which company should you prefer?
Time-Sequence Plots
Sometimes the order of the data matters !!! Example:
PSI20 financial data (01/07/2011 - 29/06/2012)
[Figure: time-sequence plot of x against Index (0 to 250), a histogram of x (vertical axis Frequency) and a box-plot of x; x ranges from about 4500 to 7500]
There is clearly a temporal trend that is completely ignored
in the histogram or boxplot representations!!! The order of
the data really matters…
PSI20 Example
We can, however, look at the daily returns instead…
[Figure: PSI20 returns r against Index (0 to 250), a histogram of r (vertical axis Frequency) and a box-plot of r; r ranges from about -0.06 to 0.04]
There seems to be much less of
a temporal trend on the returns,
so histograms and box-plots are
potentially valid representations
of the data…
The choice of Statistical Model is already important for the
description of the data!!!
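The slides do not spell out how the daily returns are computed. A common choice, assumed here, is the simple return: the relative change from one day to the next (log returns are another popular option). The prices below are hypothetical, not PSI20 values.

```python
def simple_returns(prices):
    """Relative day-to-day changes: r_t = (p_t - p_{t-1}) / p_{t-1}."""
    return [(today - yesterday) / yesterday
            for yesterday, today in zip(prices, prices[1:])]


# Hypothetical closing values of an index on five consecutive days.
prices = [6000.0, 6060.0, 5999.4, 6059.394, 6059.394]
returns = simple_returns(prices)
for day, r in enumerate(returns, start=1):
    print(f"day {day}: {r:+.4f}")
```

Note that a series of n prices gives only n - 1 returns, and that the returns fluctuate around zero even when the prices themselves drift a lot.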
Quantile-Quantile Plots
These are part of a general class of qualitative plots that are
meant to help you assess some properties of the data, namely
whether the data can be reasonably modeled by independent samples
from some distribution…
Let’s recall some concepts from a few slides back:
Definition: Order Statistics
The order statistics of a dataset $x_1, \ldots, x_n$ are the same values sorted in increasing order, $x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}$.
We can compare the order statistics to the values we would
expect for some distributions (e.g. a normal distribution).
So this gives an easy visual way to check if assuming such a
model is somewhat reasonable…
* - The reasoning underlying the fact stated in the example is beyond what we will discuss in the course
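The points of a normal quantile-quantile plot can be computed without any plotting library. In the sketch below the i-th order statistic is paired with the standard normal quantile at probability (i - 0.5)/n; this choice of “plotting position” is an assumption, as software packages use slightly different formulas. The function name is made up for this example.

```python
from statistics import NormalDist


def normal_qq_points(xs):
    """Pair each order statistic with a standard normal quantile."""
    ordered = sorted(xs)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# First 14 golf ball distances (meters) from the dataset, for illustration.
sample = [261.3, 258.1, 254.2, 257.7, 237.9, 255.8, 241.4,
          256.8, 255.3, 255.0, 259.4, 270.5, 270.7, 272.6]

print("theoretical  sample")
for t, x in normal_qq_points(sample):
    print(f"{t:11.2f}  {x:6.1f}")
```

Plotting the second column against the first gives the Q-Q plot: if the points fall roughly on a straight line, a normal model is plausible for the data.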
Normal Quantile-Quantile Plots
Example: PSI20 Daily returns
[Figure: Normal Q-Q Plot of r (Sample Quantiles against Theoretical Quantiles, from -3 to 3) next to a histogram of r (vertical axis Density)]
Daily Returns seem to be reasonably modeled by a normal
distribution !!!
Example: Synthetic data – normal distribution (sanity check)
[Figure: Normal Q-Q Plot (Sample Quantiles against theoretical quantiles, from -3 to 3) next to a histogram of the data (vertical axis Density)]
Example: Synthetic data – exponential distribution
[Figure: Normal Q-Q Plot (Sample Quantiles against theoretical quantiles, from -3 to 3), annotated “Too many points away from the line”, next to a histogram of the data (vertical axis Density, data from 0 to 4)]
Everything seems to make sense here…
Summary: normal QQ plots give us a qualitative way to check
if data can be reasonably modeled by a normal distribution.
If most points lie approximately on a straight line then the
normal modeling assumption might be reasonable – otherwise it
is doubtful.
[Figure: Normal Q-Q Plot of the golf ball distances (Sample Quantiles from 230 to 280), annotated “Too many points away from the line”, next to the Normal Q-Q Plot and histogram of the PSI20 daily returns r; the horizontal axes of the Q-Q plots show Theoretical Quantiles]
What’s Next
Now that we can summarize and represent data in nice ways we
would like to make meaningful statements about the processes
that generated the data.
For this we need to make some assumptions, leading into the
notion of a Statistical Model.
In this course we will focus mainly on two types of statistical
models:
•  Multiple Independent Observations
•  Regression Models
Important!!! All models are wrong…
…but some are useful.
