
Statistical Lingo

You may have come to this
presentation because you
really like statistics, but there's
also the possibility that you'd
rather be somewhere else,
like maybe playing golf at a
fancy resort or something?
The irony is that sports probably
refers to statistics more than any
other segment of our society.

Virtually everyone has a pretty good
understanding of what is meant by
the word average.
A golfer who shot rounds of 78, 84,
and 87 could compute her average,
or what statisticians would call her
mean (μ or x̄).
NOTE: In general, a Greek letter (μ) is
used if an entire population's data is
being checked,
but in the case of a sample, the
regular letter (x̄) is used.

For the golfer who shot rounds of 78, 84,
and 87, the mean (μ or x̄) is computed
like this:

78 + 84 + 87 = 249
249 / 3 = 83

Each of these scores deviates from
the average (83) by some amount.
These deviations can be combined
to calculate what is called a
standard deviation.
Score   Deviation from the mean (83)
78      -5
84      +1
87      +4

But if we want to calculate the
standard deviation we can't simply
add the deviations up: they'll cancel each
other out and we'll get zero.
On the other hand, squaring the
deviations will prevent that problem.
Score   Deviation   Squared deviation
78      -5          25
84      +1           1
87      +4          16

Then we can add the squares up - this
helps to get an estimate of how much
variation is present. (The concept of
adding up squares of differences like
this is called the sum of the squares.)

Sum of the squares: 25 + 1 + 16 = 42

Then we divide this sum by the number
of scores in the list (N) minus 1. (This is
because we only have a sample of all
this person's golf scores; if we had all
of their golf scores we would simply
divide by N.)
42 / 2 = 21

If we just leave it like this, it's called the
variance (σ² or s²). If we take the square
root (which cancels out the fact that we
squared the deviations earlier) we'll get
the standard deviation (σ or s). (Also,
we divide by 2 because it's the number of
data points in the sample minus 1.)
Variance (s²): 42 / 2 = 21
Standard deviation (s): √21 ≈ 4.6
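
The steps so far can be retraced in a few lines of code. This is only a sketch in plain Python (the slides themselves use Minitab); the variable names are made up for illustration, and the scores are the three golf rounds from the example.

```python
import math

scores = [78, 84, 87]  # the golf scores from the slides

n = len(scores)
mean = sum(scores) / n                    # 249 / 3 = 83
deviations = [x - mean for x in scores]   # -5, +1, +4
squares = [d ** 2 for d in deviations]    # 25, 1, 16
sum_of_squares = sum(squares)             # 42
variance = sum_of_squares / (n - 1)       # 42 / 2 = 21 (sample, so N - 1)
std_dev = math.sqrt(variance)             # square root of 21

print("sum of raw deviations:", sum(deviations))  # they cancel out
print("variance:", variance)
print("standard deviation:", round(std_dev, 1))
```

Printing the sum of the raw deviations shows why they have to be squared first: on their own they cancel to zero.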

Another common term is the median.
It's the middle value of the sorted data and is
insensitive to extreme values in the set.
Real estate folks might refer to a median
income level for an area; it's virtually
unaffected by Bill Gates moving into
(or out of) the neighborhood.
For the golf scores 78, 84, and 87, the median is 84.
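
To see how little one extreme value moves the median, here is a small sketch using Python's standard `statistics` module. The income figures are invented purely for illustration and do not come from the slides.

```python
from statistics import mean, median

# Hypothetical household incomes, in thousands of dollars (made-up numbers).
incomes = [42, 48, 55, 61, 70]
# The same neighborhood after one extremely wealthy household moves in.
with_newcomer = incomes + [1_000_000]

print("before:", mean(incomes), median(incomes))
print("after: ", mean(with_newcomer), median(with_newcomer))
```

The mean jumps by several orders of magnitude, while the median only shifts from 55 to 58.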

In a few short slides, we've covered a
number of the most frequently used
statistical terms.

scores: 78, 84, 87 (sum 249)
mean (x̄): 249 / 3 = 83
deviation: -5, +1, +4
sum of the squares: 25 + 1 + 16 = 42
variance (s²): 42 / 2 = 21
standard deviation (s): √21 ≈ 4.6
median: 84

Of course, if you had to manually
compute:
an average
a deviation for each data point
a square of all the deviations
a sum of the squares
a variance
a standard deviation
a median
every time you got some data, things
could get crazy, especially if there's a
lot of data. Thankfully, we have Minitab.
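
For readers without Minitab at hand, the same handful of numbers can be obtained in one pass with Python's standard `statistics` module. This is a sketch offered as a stand-in, not part of the Minitab workflow described here; the dictionary labels are made up.

```python
import statistics

scores = [78, 84, 87]  # the golf scores from the slides

summary = {
    "mean": statistics.mean(scores),
    "median": statistics.median(scores),
    "sample variance (divide by N - 1)": statistics.variance(scores),
    "sample std dev": statistics.stdev(scores),
    "population variance (divide by N)": statistics.pvariance(scores),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

The last entry shows what changes if the three rounds were the golfer's entire history rather than a sample: the sum of the squares is divided by N instead of N - 1.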


Getting Basic Stats From Minitab


1. Enter whatever data you want to analyze into a column in Minitab.

2. Click on Stat, then on Basic Statistics, then on Display Descriptive Statistics.

3. In the box labeled Variable, indicate the column containing the data.
4. Click on the box labeled Graphs.
5. Check Graphical summary.
6. Click OK.
7. Click OK.

Minitab will provide a summary of the data that looks something like this.
We'll break this down in pieces to explain all the information displayed.

Does the data fit a normal distribution well enough to assume normality?
(p < 0.05, no; p > 0.05, yes)
If a set of data is normally distributed it means that when it is plotted
as a histogram it has a symmetric, bell-shaped distribution.

[Figure: three example histograms, labeled Normal, Not Normal, and Not Normal]

If data is normally distributed, it allows for a number of predictions and
analytical methods that would otherwise not be valid. For example, the
mean and standard deviation can be used to predict the odds of having
values fall within certain ranges (like within specified tolerances).

Mean: The average value of all the data points. (If calculated using a sample of
data from a population it may be written x̄; if calculated using all the data in the
population it may be written μ.)
StDev: The standard deviation of all the data points. It can be thought of as roughly the
average distance that data points are from the mean: the larger the standard
deviation, the greater the variation. (If calculated using a sample of data from a
population it's usually written s; if calculated using all the data in
the population it's usually written σ.)

Variance: Equal to the standard deviation squared. (Written s² for a sample or σ² for a
population.)
Skewness: A measure of asymmetry; the further from zero, the more skewed the
data. For example, if a distribution has a large tail at the upper end of its distribution,
skewness will likely be positive. Typically, the skewness value will range from negative
3 to positive 3.
Kurtosis: A number reflecting how much the sample data resembles a normal
distribution in shape. A very negative kurtosis indicates a distribution that is flatter than
usual; a very positive kurtosis indicates a distribution that is more peaked than usual.
The kurtosis value is approximately zero for a normal distribution.
N: The number of data points used in the creation of this summary.
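
As a rough illustration of what skewness and kurtosis measure, the sketch below uses the plain moment-based formulas. These are an assumption made for the example: Minitab's exact formulas are not given in these slides and may include small-sample adjustments, so its numbers can differ. The function name and both data sets are hypothetical.

```python
def skewness_and_excess_kurtosis(data):
    """Moment-based skewness and excess kurtosis, with no small-sample correction."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # minus 3 so a normal curve scores about zero
    return skew, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]      # hypothetical, perfectly symmetric data
right_tailed = [1, 2, 3, 4, 20]  # hypothetical data with a long upper tail

print(skewness_and_excess_kurtosis(symmetric))
print(skewness_and_excess_kurtosis(right_tailed))
```

The symmetric list gives a skewness of zero, while the list with the long upper tail gives a positive skewness, matching the description above.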

Minimum: The lowest value data point in the sample.
1st Quartile: The value which 25% of the data points fall below.
Median: The value which 50% of the data points fall below.
3rd Quartile: The value which 75% of the data points fall below.
Maximum: The highest value data point in the sample.
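
These five values can be sketched with Python's standard library as well. The data below is hypothetical, and `statistics.quantiles` is used with its default "exclusive" method; statistical packages use slightly different quartile conventions, so Minitab's values for the same data may not match exactly.

```python
import statistics

data = list(range(1, 12))  # hypothetical data: 1, 2, ..., 11

# n=4 splits the data into quarters and returns the three cut points.
q1, med, q3 = statistics.quantiles(data, n=4)

print("Minimum:     ", min(data))
print("1st Quartile:", q1)
print("Median:      ", med)
print("3rd Quartile:", q3)
print("Maximum:     ", max(data))
```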

Confidence Intervals: Because we only gave Minitab a
sample of data from a presumably larger population, it
can only estimate what the entire population is like.
Minitab can help us to understand how good our
estimates of things like the mean (Mu), the standard
deviation (Sigma), and median are.
Minitab does this by calculating an interval within which
it is 95% certain that these parameters actually reside if
the whole population were to be included.

The vertical line part way through each of the red boxes is the calculated
mean (top) and median (bottom) for the sample of data entered.
Around these points, Minitab calculates an interval within which it is 95%
certain that the population mean and median actually reside.
For example, in the case of the top red bar, the vertical line in the middle of
the red bar shows a mean of about 50.6. While this is probably not the EXACT
mean for the population, using the number of data points and the amount of
variation they exhibited it can be estimated with good confidence (95%) that
the mean for the population falls somewhere between 48.9 and 52.3.
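
To show how the number of data points and the amount of variation feed into such an interval, here is a minimal sketch using a normal approximation. Several things here are assumptions: the slides do not say how Minitab computes the interval (it may well use a t-based method), the function name is made up, and the standard deviation (8.7) and sample size (100) are hypothetical values chosen only so the result lands near the interval quoted above. Only the mean of 50.6 comes from the example.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(mean, std_dev, n, confidence=0.95):
    """Approximate confidence interval for a population mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * std_dev / sqrt(n)  # shrinks as n grows, widens with more variation
    return mean - margin, mean + margin

# mean is from the example above; std_dev and n are hypothetical.
low, high = mean_confidence_interval(mean=50.6, std_dev=8.7, n=100)
print(round(low, 1), round(high, 1))
```

More data points or less variation narrow the interval; fewer points or more variation widen it.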

Histogram of the data (with Minitab's best estimate of the normal curve that
fits the data)

The Box and Whisker plot divides data into quarters, marking the 1st quartile,
the median, and the 3rd quartile.

NOTE: Data points with values lower than Q1 - 1.5(Q3 - Q1) or
greater than Q3 + 1.5(Q3 - Q1) are considered outliers and
appear as individual dots.
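
The outlier rule for the box and whisker plot, Q1 - 1.5(Q3 - Q1) and Q3 + 1.5(Q3 - Q1), can be sketched as follows. The function name and the data are hypothetical, and the quartiles come from Python's `statistics.quantiles` default method, which may differ slightly from Minitab's quartile convention.

```python
import statistics

def find_outliers(data):
    """Return the points outside Q1 - 1.5(Q3 - Q1) and Q3 + 1.5(Q3 - Q1)."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    spread = q3 - q1  # the interquartile range, Q3 - Q1
    lower_fence = q1 - 1.5 * spread
    upper_fence = q3 + 1.5 * spread
    return [x for x in data if x < lower_fence or x > upper_fence]

data = list(range(1, 12)) + [30]  # hypothetical data: 1..11 plus one extreme point
print(find_outliers(data))
```

Only the extreme point falls outside the fences, so it is the one that would be drawn as an individual dot.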

Once you have the basic stats, what's next?

Given a process with a mean = 83 & std dev = 4.6:
68% of the population will be captured within one standard
deviation of the mean.
95% of the population will be captured within two standard
deviations of the mean.
99.73% of the population will be captured within three standard
deviations of the mean.

[Figure: normal curve centered at 83, with the axis marked at 69.5, 74, 78.5,
83, 87.5, 92, and 96.5. Band labels: 34% on each side of the mean (68% together),
13.5% in each of the next bands (95% together), and 2.36% in each of the outer
bands (99.73% together).]
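
These percentages can be checked with a short sketch using `NormalDist` from Python's standard `statistics` module, assuming the process really is normally distributed. The function name `share_within` is made up for illustration.

```python
from statistics import NormalDist

# The process from the slide, assumed to be normally distributed.
process = NormalDist(mu=83, sigma=4.6)

def share_within(k):
    """Fraction of the population within k standard deviations of the mean."""
    low = process.mean - k * process.stdev
    high = process.mean + k * process.stdev
    return process.cdf(high) - process.cdf(low)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
```

The computed shares come out slightly above the rounded 68% and 95% figures and agree with 99.73% for three standard deviations.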

Note that the three items mentioned
(shape, mean, and standard deviation)
help to characterize the process
[or the performance of a process].
It's somewhat like when you ship a box
for overnight delivery: the courier wants
to know the length, width, height, and
weight of the box. That information
characterizes the box for them. In other
words, they know what to expect when
they come to get it.

Understanding a process this well has
some rather powerful implications.
For example, once you know the mean
and standard deviation of a process
that's normally distributed, predicting
the percentage of times something will
fall above or below any given value
(like a tolerance limit, for instance) is
relatively easy.
In other words, we can tell how often
the process will perform properly.
That's the topic of another tool time:
Process Capability.
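
As a small preview of that kind of prediction, the sketch below estimates how often the process would land above or below a limit. It assumes the process is normally distributed with the mean and standard deviation used above; the tolerance limit of 90 is a hypothetical value, not one from the slides.

```python
from statistics import NormalDist

# The process from the slides, assumed to be normally distributed.
process = NormalDist(mu=83, sigma=4.6)
upper_limit = 90  # hypothetical tolerance limit

share_below = process.cdf(upper_limit)  # fraction at or below the limit
share_above = 1 - share_below           # fraction exceeding the limit

print(f"at or below {upper_limit}: {share_below:.1%}")
print(f"above {upper_limit}: {share_above:.1%}")
```

Under these assumptions, a little over 6% of results would be expected to exceed the limit.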
