LEC 03 - Descriptive Statistics

Descriptive Statistics
If you want to inspire confidence, give plenty of statistics. It does
not matter that they should be accurate, or even intelligible, as
long as there is enough of them. -Lewis Carroll
Methods for Describing Data
• Motivating Example
• Data in SPSS
• Summarizing Data
• Graphical displays of data
• Descriptive Summaries of data
• Center
• Spread
• Shape
Think back to when you were 4 years old…
Actually, suppose you were placed in a room by yourself with
this marshmallow, and a grown-up told you that you could eat
the marshmallow now or, if you waited, you could have this one
plus one more when she returned…
…Would you have eaten the marshmallow or would you have
waited? How long would you have been able to wait?
Mischel’s Marshmallows
“One now… or two when I get back.”
Walter Mischel
“To function effectively, individuals must voluntarily postpone immediate
gratification and persist in goal-directed behavior for the sake of later outcomes.
This research analyzed the nature of future-oriented self control and the
psychological processes that underlie it. Individual differences in self control
were found as early as the preschool years. Those 4-year-old children who
delayed gratification longer in certain laboratory situations developed into more
cognitively and socially competent adolescents, achieving higher scholastic
performance and coping better with frustration and stress.”
- Mischel, Shoda & Rodriguez, 1989
Study Sample
Site: Bing Nursery School at Stanford University
Sample: n = 550 preschool children
Mostly “middle-class” children of faculty and students at Stanford
What type of study? Any issues you can see?
Marshmallow Dataset in SPSS
SPSS’s Data View: to browse the actual data
• Variable names head each column
• Each row represents a different individual child (i = 1, 2, …, n = 550)
Marshmallow Dataset in SPSS
Variable View: to browse variable characteristics
• Variable type: String (words) or numeric
• Values: codes for the values of the variable
• Labels: more info about the variable
• Measure: the scale on which the variable is measured (nominal, ordinal or scale)
Methods for Describing Data
• Motivating Example
• Data in SPSS
• Graphical displays of data
• Descriptive Summaries of data
• Center
• Spread
• Shape
Frequency Distribution
A frequency distribution lists all possible values that a variable
can take on along with the number of observations for each
value. May also show relative frequency (the proportions or
percentages of each value).
[Table: frequency of each value and relative frequency, for Boys and Girls]
In SPSS: Analyze → Descriptive Statistics → Frequencies
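Outside SPSS, the same kind of table can be sketched in a few lines of Python. The data below are a hypothetical handful of values, not the real marshmallow dataset, and the function name frequency_table is made up for this illustration.

```python
from collections import Counter

# Hypothetical values of a categorical variable (not the real course data).
sex = ["Boy", "Girl", "Girl", "Boy", "Girl", "Girl", "Boy", "Girl"]

def frequency_table(values):
    """Return {value: (count, relative frequency)} for a list of values."""
    counts = Counter(values)
    n = len(values)
    return {value: (count, count / n) for value, count in counts.items()}

table = frequency_table(sex)
for value, (count, rel) in sorted(table.items()):
    print(f"{value:5s} count={count} relative={rel:.3f}")
```

The counts add up to n and the relative frequencies add up to 1, which is a quick check on any frequency table.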

Bar Graphs
A bar graph shows the frequency or relative frequency of the
values of a categorical variable based on the height of the bars
[Figure: two bar graphs, one showing frequency (count) and one showing relative frequency (percent)]
In SPSS: Graphs → Chart Builder → Bar

Histograms
A histogram shows the [relative] frequency of the values of a
quantitative variable. The height of each bar represents the [relative]
frequency of observations falling in that interval (often called a ‘bin’).
[Figure: histogram with frequency (count) on the vertical axis and the variable of interest on the horizontal axis]
In SPSS: Graphs → Chart Builder → Histogram
Histograms: choice of bins
• The appearance of a histogram may be altered based on the choice of the
number of bins and the bin widths (they all must be the same width…why?).
We will use SPSS’s default settings (unless otherwise specified).
• What is the major difference between these two graphs? Why is the 2nd
slightly preferable? (It’s subtle.)
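To see how the choice of bins changes the picture, here is a minimal Python sketch that counts observations in equal-width bins. The waiting times are hypothetical and the function histogram_counts is written for this illustration only; it is not how SPSS picks its default bins.

```python
def histogram_counts(values, low, high, n_bins):
    """Count observations in n_bins equal-width bins covering [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # A value equal to `high` is placed in the last bin so it is not dropped.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical waiting times in minutes (illustrative only).
times = [0.5, 1, 1.5, 2, 6, 7, 14, 15, 15, 15]
coarse = histogram_counts(times, 0, 15, 3)  # three bins of width 5
fine = histogram_counts(times, 0, 15, 5)    # five bins of width 3
print(coarse)
print(fine)
```

The same ten observations look like three similar bars with the coarse bins, while the finer bins reveal empty gaps between clusters.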
Measures of Center
• Mean and median are the two most common
measures of center of a distribution
• Mean, denoted $\bar{x}$, is the simple arithmetic
average (formula coming up)
• Mean of the set of numbers {1, 1, 5, -1} is
• $\bar{x}$ = (1 + 1 + 5 - 1) / 4 = 6 / 4 = 1.5
Algebraic formula for the mean
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Some notes about this formula…
• Assuming a sample of n individuals indexed by
i = 1, 2, 3, …, n
• “x_i” denotes the variable measurement on each person in
the sample
• “∑” denotes the summation operator
• “Bar” notation denotes average (we say “x-bar”)
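The formula translates directly into code. A minimal Python sketch, using the set {1, 1, 5, -1} from the earlier slide (the function name mean is our own choice):

```python
def mean(xs):
    """Arithmetic mean: the sum of the observations divided by n."""
    return sum(xs) / len(xs)

data = [1, 1, 5, -1]
x_bar = mean(data)
print(x_bar)

# Deviations from the mean always sum to zero (up to rounding),
# which is one way to see the mean as a balance point.
deviations = [x - x_bar for x in data]
print(sum(deviations))
```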
Marshmallow waiting time (n = 550)
Mean = 7.57 min ≈ 7 min, 42 sec
**Mean is the ‘balance point’ of the distribution: the place to put
a fulcrum to balance the histogram.
Median: another measure of center
• Mean is sensitive to presence of large observations
• Think of
• mean of {1, 3, 5} = 3
• mean of {1, 3, 20} = 8
• Median is the middle number in the set of observations and is not
sensitive to ‘extreme’ observations
• Sort the observations from smallest to largest
• If there is an odd number of observations, median is the middle
number
• If there is an even number of observations, median is the average of the
two values ‘straddling’ the middle
• Ex.1: {1, 2, 3, 6}: median = 2.5, mean = 3
• Ex.2: {1, 2, 3, 6, 500}: median = 3, mean = 102.4
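The sorting recipe above can be sketched in Python as follows. The function names are our own; Python's standard library also offers statistics.median, which follows the same rule.

```python
def median(xs):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def mean(xs):
    return sum(xs) / len(xs)

ex1 = [1, 2, 3, 6]
ex2 = [1, 2, 3, 6, 500]
print(median(ex1), mean(ex1))
print(median(ex2), mean(ex2))
```

Adding the single value 500 moves the median from 2.5 to 3 but moves the mean from 3 to 102.4.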
Mean = 7 min, 42 sec
Median = 6.125 min ≈ 6 min, 7.5 sec
Histograms show shape
• Skewed to the left (ex: score on an exam)
• Skewed to the right (ex: financial data)
• Symmetric & bell-shaped (ex: IQ, height)
• Bimodal (usually shows two groups)
Idealized right-skewed distribution
Mean larger than Median
Idealized Symmetric Distribution
Mean and median are the same
Effect of Shape on Mean and Median
• In a right skewed distribution, the mean is
greater than the median
• In a left skewed distribution, the mean is less
than the median
• In a symmetric distribution the mean is
approximately (sometimes exactly) equal to the
median
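A quick way to convince yourself of these three statements is to try them on small data sets. The three lists below are hypothetical, chosen only to have the named shapes; the sketch uses Python's standard statistics module.

```python
import statistics

# Hypothetical small data sets chosen to illustrate each shape.
right_skewed = [1, 2, 2, 3, 12]    # long tail of large values
left_skewed = [1, 10, 11, 11, 12]  # long tail of small values
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed),
                   ("left", left_skewed),
                   ("symmetric", symmetric)]:
    print(name, "mean =", statistics.mean(data), "median =", statistics.median(data))
```

In each case the mean is pulled toward the long tail, while the median stays with the bulk of the data.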

Median = 6 min, 7.5 sec
Shape: Bimodal & Skewed-Right??
Measuring Spread (Variability) in Data
Two common methods

1. Variance and standard deviation
• Measure spread about the mean
• Most often used, but also sensitive to large values
in skewed distributions
2. Quantiles and percentiles
• Median
• Quartiles and more general percentiles
The variance of a set of data
• The “center” of a group of observations can be
measured by the mean
• The variability of a single observation x_i can be
measured by its distance from the center (e.g. mean)
$(x_i - \bar{x})$
• Since we want this to never be a negative number, this
distance is converted to
$(x_i - \bar{x})^2$
• The “average” of these “squared deviations from the
mean” is used as a measure of variability
Variance
• The variance is the “average” of squared deviations from the
mean
• If there are n observations x_1, x_2, …, x_n, then the variance is
$$s^2 = s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Standard Deviation
• The standard deviation (SD) is the square root of the
variance
$$s = s_x = \sqrt{s_x^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
• Note:
The SD is in the original units of measurement
The variance is in (original units)^2
Example: variance & standard deviation
• N.E. Patriots points scored in each game of their preseason (4 games) were: 31, 25, 9, 28
• Calculate the mean ($\bar{x}$):
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{31 + 25 + 9 + 28}{4} = \frac{93}{4} = 23.25$$
• Calculate the variance ($s^2$):
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{(31 - 23.25)^2 + (25 - 23.25)^2 + (9 - 23.25)^2 + (28 - 23.25)^2}{4 - 1}$$
$$s^2 = \frac{60.06 + 3.06 + 203.06 + 22.56}{3} = 96.25 \text{ points}^2$$
• Calculate the standard deviation (s):
$$s = \sqrt{s^2} = \sqrt{96.25 \text{ points}^2} \approx 9.81 \text{ points}$$
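The same three steps can be checked in Python. This is a sketch of the hand calculation, with the standard library's statistics.variance (which also divides by n - 1) used as a cross-check.

```python
import math
import statistics

points = [31, 25, 9, 28]  # points scored in the 4 preseason games

n = len(points)
x_bar = sum(points) / n                             # step 1: mean
squared_devs = [(x - x_bar) ** 2 for x in points]   # squared deviations from the mean
s2 = sum(squared_devs) / (n - 1)                    # step 2: variance, dividing by n - 1
s = math.sqrt(s2)                                   # step 3: standard deviation

print(x_bar, s2, round(s, 2))
print(statistics.variance(points), round(statistics.stdev(points), 2))
```

The unrounded squared deviations are 60.0625, 3.0625, 203.0625 and 22.5625, which sum to 288.75 and give exactly 96.25 after dividing by 3.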
Interpreting Standard Deviation:
the Empirical Rule
• If the histogram of the x’s is approximately bell-shaped, then
• ~68% of observations fall within one SD of the mean: $\bar{x} \pm s$
• ~95% of observations fall within two SDs: $\bar{x} \pm 2s$
• Essentially all observations fall within three SDs: $\bar{x} \pm 3s$
• Quick rule of thumb to estimate standard deviation:
• Take the whole range, and divide by 5 or 6
• This does not work for variables that are skewed,
multimodal, etc…
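The empirical rule can be checked on simulated data. The sketch below draws bell-shaped values with Python's random module; the mean of 100, SD of 15, seed and sample size are arbitrary assumptions for the illustration, and fraction_within is a helper written for it.

```python
import random
import statistics

def fraction_within(data, k):
    """Fraction of observations within k sample SDs of the sample mean."""
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)
    inside = [x for x in data if abs(x - x_bar) <= k * s]
    return len(inside) / len(data)

# Simulated bell-shaped data (an assumption for illustration, not course data).
rng = random.Random(0)
data = [rng.gauss(100, 15) for _ in range(10_000)]

for k in (1, 2, 3):
    print(k, round(fraction_within(data, k), 3))
```

The printed fractions should land close to 0.68, 0.95 and nearly 1. Running the same function on skewed or bimodal data would generally give different fractions, which is why the rule is stated only for bell-shaped histograms.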
A more important detail: sensitivity to
extreme values
• Standard deviation and variance (like the mean) can be
sensitive to large observations
• SD of {1, 3, 5} = 2
• SD of {1, 3, 20} = 10.4
• Actually, even more sensitive than the mean…why?
• This issue will arise several times in the course…
• Standard deviation and mean lose natural interpretation
in skewed data or data with outliers
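One way to explore the two small sets above is to look at the squared deviations that go into the SD. A minimal Python sketch (the helper squared_deviations is our own):

```python
import statistics

small = [1, 3, 5]
with_large = [1, 3, 20]  # same data with 5 replaced by 20

def squared_deviations(xs):
    """Squared distance of each observation from the mean of xs."""
    x_bar = statistics.mean(xs)
    return [(x - x_bar) ** 2 for x in xs]

for data in (small, with_large):
    print(data,
          "mean =", statistics.mean(data),
          "squared deviations =", squared_deviations(data),
          "SD =", round(statistics.stdev(data), 1))
```

Replacing 5 with 20 raises the mean from 3 to 8, and because each deviation is squared, the single large value contributes 144 of the 218 total squared deviation.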

Median = 6 min, 7.5 sec
Shape: Bimodal
Std. Deviation: 6 min, 28 sec (6.47 min)
But hard to interpret!
Measuring Location:
Percentiles and Quartiles of Data
• The pth percentile of a distribution is that value such that p%
of the observations fall at or below it.
• The 25th percentile is the value with 25% of the
observations at or below it, 75% above
• It is called the first quartile Q1,
• the 50th percentile is the median M, and
• the 75th percentile is the third quartile Q3
• Called a quantile when expressed as a proportion instead of
percentage (25th percentile = .25 quantile)
• In a small set of numbers, it may not be possible to find exact
values for the percentiles
• The five-number summary of a distribution consists of
• Min, Q1, M, Q3, Max
Calculating Quartiles and IQR
• The first quartile Q1 is the median of the observations
whose position in the ordered list is to the left of the
location of the overall median.
• The third quartile Q3 is the median of the observations
whose position in the ordered list is to the right of the
location of the overall median.
• e.g., 1, 2, 3, 4, 5: Q1 = 1.5, M = 3, Q3 = 4.5
• Interquartile range, IQR = Q3 – Q1, is another measure of
spread in data.
• IQR measures how spread out the middle 50% of the data is
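The quartile rule on this slide (median of the observations to the left and to the right of the overall median's location) can be sketched in Python as below. The function names are our own, and software packages may use slightly different conventions for quartiles, so results on small data sets can differ.

```python
def median(xs):
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(xs):
    """Return (Q1, M, Q3) using the median-of-each-half rule."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]  # observations left of the median's location
    # When n is odd, the median itself is left out of both halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

q1, m, q3 = quartiles([1, 2, 3, 4, 5])
iqr = q3 - q1
print(q1, m, q3, iqr)
```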
Five number summary of a distribution:
1. Min = 0
2. Q1 = 0.815
3. Median = 6.125
4. Q3 = 15
5. Max = 15
Shape - Detecting Outliers
• For this class: an observation is an outlier if it falls more
than
• 1.5 x IQR below Q1 or
• 1.5 x IQR above Q3
• i.e., outside the interval
• (Q1 - 1.5 x IQR, Q3 + 1.5 x IQR)
• Marshmallow Data:
• Q1 = 0.815, Q3 = 15.00
• 1.5 x IQR = 1.5 x (15 – 0.815) = 21.28
• So the criterion is: an observation
• below 0.815 – 21.28 = – 20.46 or
• above 15.00 + 21.28 = 36.28
• There are no low or high outliers. But it’s silly to think
there would be in this bimodal distribution
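The 1.5 x IQR rule is easy to code. A minimal Python sketch, using the marshmallow quartiles from this slide and the minimum (0) and maximum (15) from the five-number summary; the function names are our own.

```python
def outlier_fences(q1, q3):
    """Return the (low, high) cutoffs of the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_outlier(x, q1, q3):
    low, high = outlier_fences(q1, q3)
    return x < low or x > high

low, high = outlier_fences(0.815, 15.0)
print(round(low, 2), round(high, 2))

# The smallest and largest waiting times are both inside the fences.
print(is_outlier(0, 0.815, 15.0), is_outlier(15, 0.815, 15.0))
```

The cutoffs round to -20.46 and 36.28, matching the hand calculation, and neither the minimum nor the maximum is flagged.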
Another Plot Type – Box plots
• Box plots are designed to show clearly the center, spread
(especially IQR), and outliers
• They are based on the five-number summary
• Minimum, Q1, Median, Q3, Maximum
• Easiest to explain with an example, using the tuition data.
Box plots
From SPSS Documentation
[Figure: box plot labeled, from top to bottom: outlier, outlier, largest non-outlier, Q3, median, Q1, smallest non-outlier]
• The dark line in the middle of the boxes is the median.
• The bottom of the box indicates the 25th percentile.
• The top of the box represents the 75th percentile.
• The T-bars that extend from the boxes are called inner fences or
whiskers. These extend to [up to] 1.5 times the height of the box [the
IQR]: the closest observation within those bounds.
• The points are outliers. These are defined as values that do not
fall in the inner fences.
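The pieces of a box plot can be computed directly from this description. The sketch below uses hypothetical heights in inches, with Q1 = 63 and Q3 = 69 worked out by the median-of-each-half rule; boxplot_parts is a helper written for this illustration, not an SPSS function.

```python
def boxplot_parts(values, q1, q3):
    """Return (lower whisker end, upper whisker end, outliers) for a box plot."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations still inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return min(inside), max(inside), outliers

# Hypothetical heights in inches (illustrative only).
heights = [5.2, 62, 64, 65, 66, 67, 68, 70, 72]
whisker_low, whisker_high, outliers = boxplot_parts(heights, q1=63, q3=69)
print(whisker_low, whisker_high, outliers)
```

Here the fences sit at 54 and 78, so the whiskers run from 62 to 72 and the value 5.2 is drawn as a separate point.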
Box plots vs. Histograms
Histogram shows relative frequency of observations and general shape
Boxplot shows center (median), spread (IQR and range), and outliers
Box plots vs. Histograms: Marshmallow data
Outliers are sometimes data errors
One value of height on a Stat 104 poll entered as 5.2 inches
Unit Recap
• What is statistics?
• Types of data
• Graphically
• Bar plots (Categorical)
• Histograms (Quantitative)
• Box plots (Quantitative)
• Details of a graph may be important
• Vertical axis scale
• Number of bins in histogram
Unit Recap
• Summarizing Data (cont.)

• Numerically
• Frequencies or Proportions/percentages (categorical)
• Center (mean, median)
• Spread (std.dev./variance, IQR)
• Shape (skewness, outliers)
• SPSS is your friend!