
Descriptive Statistics

If you want to inspire confidence, give plenty of statistics. It does not matter that they should be accurate, or even intelligible, as long as there is enough of them. - Lewis Carroll

Methods for Describing Data
• Motivating Example
• Data in SPSS
• Summarizing Data
• Graphical displays of data
• Descriptive Summaries of data
• Center
• Spread
• Shape

Think back to when you were 4 years old…

Actually, suppose you were placed in a room by yourself with this marshmallow, and a grown-up told you that you could eat the marshmallow now or, if you waited, you could have this one plus one more when she returned…

…Would you have eaten the marshmallow or would you have waited? How long would you have been able to wait?

Mischel’s Marshmallows

One now… …Or two when I get back.

Walter Mischel

“To function effectively, individuals must voluntarily postpone immediate gratification and persist in goal-directed behavior for the sake of later outcomes. This research analyzed the nature of future-oriented self control and the psychological processes that underlie it. Individual differences in self control were found as early as the preschool years. Those 4-year-old children who delayed gratification longer in certain laboratory situations developed into more cognitively and socially competent adolescents, achieving higher scholastic performance and coping better with frustration and stress.”
- Mischel, Shoda & Rodriguez, 1989

Study Sample

Site: Bing Nursery School at Stanford University
Sample: n = 550 preschool children, mostly “middle-class” children of faculty and students at Stanford
What type of study? Any issues you can see?
Marshmallow Dataset in SPSS

SPSS’s Data View: to browse the actual data
• Variable names appear as the column headers
• Each row represents a different individual child (i = 1, 2, …, n = 550)

Variable View: to browse variable characteristics
• Variable type: String (words) or numeric
• Values: codes for the values of the variable
• Labels: more info about the variable
• Measure: the scale on which the variable is measured (nominal, ordinal or scale)
Frequency Distribution
A frequency distribution lists all possible values that a variable
can take on along with the number of observations for each
value. May also show relative frequency (the proportions or
percentages of each value).

[SPSS output: frequency table for boys and girls, showing the frequency of each value and its relative frequency]

In SPSS: Analyze → Descriptive Statistics → Frequencies
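
Outside SPSS, the same kind of table can be sketched in a few lines of Python. This is only an illustration: the `sex` list below is made-up data, not the marshmallow dataset.

```python
from collections import Counter

# Hypothetical values of a categorical variable (not the real marshmallow data)
sex = ["Boy", "Girl", "Girl", "Boy", "Girl", "Boy", "Girl", "Girl"]

counts = Counter(sex)   # frequency of each value
n = len(sex)
rel_freq = {value: count / n for value, count in counts.items()}  # proportions

for value in sorted(counts):
    print(f"{value}: frequency = {counts[value]}, "
          f"relative frequency = {rel_freq[value]:.1%}")
```

The relative frequencies always add up to 1 (100%), which is a quick check on any frequency table.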


Bar Graphs
A bar graph shows the frequency or relative frequency of the
values of a categorical variable based on the height of the bars

[Figure: two bar graphs of the same variable, one showing frequency (count) and one showing relative frequency (percent)]

In SPSS: Graphs → Chart Builder → Bar


Histograms
A histogram shows the [relative] frequency of the values of a
quantitative variable. The height of each bar represents the [relative]
frequency of observations falling in that interval (often called a ‘bin’).

[Figure: histogram with frequency (count) on the vertical axis and the variable of interest on the horizontal axis]
In SPSS: Graphs → Chart Builder → Histogram
Histograms: choice of bins

• The appearance of a histogram may change with the choice of the number of bins and the bin widths (the bins must all be the same width… why?). We will use SPSS’s default settings (unless otherwise specified).
• What is the major difference between these two graphs? Why is the 2nd slightly preferable? (It’s subtle.)
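
To see how the bin choice changes the picture, here is a minimal sketch that counts observations in equal-width bins. The function `histogram_counts` and the `times` list are hypothetical, written only for this illustration, and the sketch assumes the values are not all identical.

```python
def histogram_counts(values, num_bins):
    """Count observations in num_bins equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a bin of its own
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical waiting times in minutes (illustrative only)
times = [0.5, 1.0, 1.5, 2.0, 6.0, 7.0, 8.0, 14.0, 15.0, 15.0]

print(histogram_counts(times, 3))  # [4, 3, 3]
print(histogram_counts(times, 6))  # [4, 0, 2, 1, 0, 3]
```

With 3 bins the data look fairly even; with 6 bins the gaps and the clusters at each end become visible.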
Measures of Center
• Mean and median are the two most common measures of the center of a distribution
• The mean, denoted $\bar{x}$, is the simple arithmetic average (formula coming up)
• The mean of the set of numbers {1, 1, 5, -1} is $\bar{x} = (1 + 1 + 5 - 1) / 4 = 6 / 4 = 1.5$

Algebraic formula for the mean
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Some notes about this formula…
• Assuming a sample of n individuals indexed by i = 1, 2, 3, …, n
• “$x_i$” denotes the variable’s measurement on each person in the sample
• “∑” denotes the summation operator
• “Bar” notation denotes average (we say “x-bar”)
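
The formula translates directly into code. A minimal sketch using the set {1, 1, 5, -1} from the earlier slide (the function name `mean` is just a label chosen here):

```python
def mean(xs):
    """Arithmetic mean: (1/n) times the sum of the x_i."""
    n = len(xs)
    return sum(xs) / n

x = [1, 1, 5, -1]
x_bar = mean(x)
print(x_bar)  # 1.5
```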

Marshmallow waiting time (n = 550)

Mean = 7.57 min ≈ 7 min, 42 sec

The mean is the ‘balance point’ of the distribution: the place to put a fulcrum to balance the histogram.
Median: another measure of center
• Mean is sensitive to presence of large observations
• Think of
• mean of {1, 3, 5} = 3
• mean of {1, 3, 20} = 8
• Median is the middle number in the set of observations and is not
sensitive to ‘extreme’ observations
• Sort the observations from smallest to largest
• If there is an odd number of observations, median is the middle
number
• If there is an even number of observations, median is the average of the two values ‘straddling’ the middle
• Ex.1: {1, 2, 3, 6}: median = 2.5, mean = 3
• Ex.2: {1, 2, 3, 6, 500}: median = 3, mean = 102.4
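
The sorting recipe above can be sketched as follows, using the two example sets from this slide. The helper name `median` is chosen here for illustration.

```python
def median(xs):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

ex1 = [1, 2, 3, 6]
ex2 = [1, 2, 3, 6, 500]

print(median(ex1), sum(ex1) / len(ex1))  # 2.5 3.0
print(median(ex2), sum(ex2) / len(ex2))  # 3 102.4
```

Adding the single value 500 moves the mean from 3 to 102.4, while the median only moves from 2.5 to 3.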

Marshmallow waiting time (n = 550)

Mean = 7 min, 42 sec

Median = 6.125 min ≈ 6 min, 7.5 sec

Histograms show shape
• Skewed to the left (ex: score on an exam)
• Skewed to the right (ex: financial data)
• Symmetric & bell-shaped (ex: IQ, height)
• Bimodal (usually shows two groups)

Idealized right-skewed distribution
Mean larger than Median

Idealized Symmetric Distribution
Mean and median are the same

Effect of Shape on Mean and Median

• In a right skewed distribution, the mean is greater than the median
• In a left skewed distribution, the mean is less than the median
• In a symmetric distribution the mean is approximately (sometimes exactly) equal to the median
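
A quick way to convince yourself of these three statements is to try them on small data sets. The three lists below are made up for illustration; they are not from the course data.

```python
from statistics import mean, median

# Hypothetical data sets, one per shape
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]          # long tail to the right
left_skewed = [-20, -4, -3, -3, -3, -2, -2, -1]   # mirror image: long tail to the left
symmetric = [1, 2, 3, 4, 5, 6, 7]

for name, data in [("right skewed", right_skewed),
                   ("left skewed", left_skewed),
                   ("symmetric", symmetric)]:
    print(f"{name}: mean = {mean(data)}, median = {median(data)}")
```

The long tail pulls the mean toward it, while the median stays with the bulk of the data.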
Marshmallow waiting time (n = 550)

Mean = 7 min, 42 sec
Median = 6 min, 7.5 sec

Shape: Bimodal & Skewed-Right??

Measuring Spread (Variability) in Data

Two common methods:
1. Variance and standard deviation
• Measure spread about the mean
• Most often used, but also sensitive to large values
in skewed distributions
2. Quantiles and percentiles
• Median
• Quartiles and more general percentiles

The variance of a set of data
• The “center” of a group of observations can be measured by the mean
• The variability of a single observation $x_i$ can be measured by its distance from the center (e.g. the mean): $(x_i - \bar{x})$
• Since we want this to never be negative, the deviation is squared: $(x_i - \bar{x})^2$
• The “average” of these “squared deviations from the mean” is used as a measure of variability

Variance
• The variance is the “average” of squared deviations from the
mean

• If there are n observations $x_1, x_2, \dots, x_n$, then the variance is

$$s^2 = s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Standard Deviation
• The standard deviation (SD) is the square root of the
variance

$$s = s_x = \sqrt{s_x^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

• Note:
The SD is in the original units of measurement
The variance is in (original units)²

Example: variance & standard deviation
• N.E. Patriots points scored in each game of their preseason (4 games) were: 31, 25, 9, 28
• Calculate the mean ($\bar{x}$):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{31 + 25 + 9 + 28}{4} = \frac{93}{4} = 23.25$$

• Calculate the variance ($s^2$):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{(31 - 23.25)^2 + (25 - 23.25)^2 + (9 - 23.25)^2 + (28 - 23.25)^2}{4 - 1}$$

$$s^2 = \frac{60.06 + 3.06 + 203.06 + 22.56}{3} = 96.25 \text{ points}^2$$

• Calculate the standard deviation ($s$):

$$s = \sqrt{s^2} = \sqrt{96.25 \text{ points}^2} \approx 9.81 \text{ points}$$

Interpreting Standard Deviation:
the Empirical Rule
• If the histogram of the x’s is approximately bell-shaped, then
• ~68% of observations fall within one sd of the mean: $\bar{x} \pm s$
• ~95% of observations fall within two sd’s: $\bar{x} \pm 2s$
• Essentially all observations fall within three sd’s: $\bar{x} \pm 3s$
• Quick rule of thumb to estimate the standard deviation:
• Take the whole range, and divide by 5 or 6
• This does not work for variables that are skewed, multimodal, etc…
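
The 68/95/“essentially all” figures can be checked on simulated bell-shaped data. This is a sketch under stated assumptions: the data are random draws from a normal distribution, and the center of 100, the spread of 15, the sample size, and the seed are all arbitrary choices made here.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the sketch gives the same numbers each run
xs = [random.gauss(100, 15) for _ in range(20_000)]  # simulated bell-shaped data

x_bar, s = mean(xs), stdev(xs)

def share_within(k):
    """Proportion of observations within k standard deviations of the mean."""
    return sum(abs(x - x_bar) <= k * s for x in xs) / len(xs)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.1%}")
```

The printed shares should land close to 68%, 95% and 99.7%. Rerunning the same check on skewed or bimodal data is a good way to see the rule break down.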
A more important detail: sensitivity to
extreme values
• Standard deviation and variance (like the mean) can be
sensitive to large observations
• SD of {1, 3, 5} = 2
• SD of {1, 3, 20} = 10.4
• Actually, even more sensitive than the mean…why?
• This issue will arise several times in the course…
• Standard deviation and mean lose natural interpretation
in skewed data or data with outliers
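
The two small sets above can be compared directly. Printing the squared deviations alongside the SD (a choice made for this sketch) shows how much a single large value contributes.

```python
from statistics import mean, stdev

a = [1, 3, 5]
b = [1, 3, 20]

for data in (a, b):
    x_bar = mean(data)
    squared_devs = [(x - x_bar) ** 2 for x in data]
    print(data, "mean =", x_bar, "SD =", round(stdev(data), 1),
          "squared deviations =", squared_devs)
```

In the second set the value 20 accounts for 144 of the 218 total squared deviation, so changing one observation moves the mean from 3 to 8 but the SD from 2 to about 10.4.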

Marshmallow waiting time (n = 550)

Mean = 7 min, 42 sec
Median = 6 min, 7.5 sec
Shape: Bimodal

Std. Deviation: 6 min, 28 sec (6.47 min)
But hard to interpret!

Measuring Location:
Percentiles and Quartiles of Data
• The pth percentile of a distribution is that value such that p%
of the observations fall at or below it.
• The 25th percentile is the value with 25% of the
observations at or below it, 75% above
• It is called the first quartile Q1,
• the 50th percentile is the median M, and
• the 75th percentile is the third quartile Q3
• Called a quantile when expressed as a proportion instead of
percentage (25th percentile = .25 quantile)
• In a small set of numbers, it may not be possible to find exact
values for the percentiles
• The five-number summary of a distribution consists of
• Min, Q1, M, Q3, Max
Calculating Quartiles and IQR
• The first quartile Q1 is the median of the observations
whose position in the ordered list is to the left of the
location of the overall median.
• The third quartile Q3 is the median of the observations
whose position in the ordered list is to the right of the
location of the overall median.
• e.g., 1, 2, 3, 4, 5: Q1= 1.5, M = 3, Q3 = 4.5
• The interquartile range, IQR = Q3 – Q1, is another measure of spread in data.
• The IQR measures how spread out the middle 50% of the data is
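
The “median of each half” rule described above can be sketched as follows. The function names are chosen here for illustration, and the sketch assumes at least two observations. Other software may use slightly different percentile conventions, so its quartiles for small data sets can differ from these.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(xs):
    """Min, Q1, median, Q3, max using the median-of-halves rule."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # observations left of the median's location
    upper = ordered[half + (n % 2):]   # observations right of the median's location
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

summary = five_number_summary([1, 2, 3, 4, 5])
q1, q3 = summary[1], summary[3]
iqr = q3 - q1

print(summary)  # (1, 1.5, 3, 4.5, 5)
print(iqr)      # 3.0
```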

Five number summary of a distribution:
1. Min = 0
2. Q1 = 0.815
3. Median = 6.125
4. Q3 = 15
5. Max = 15
Shape - Detecting Outliers
• For this class: an observation is an outlier if it falls more than
• 1.5 x IQR below Q1 or
• 1.5 x IQR above Q3
• i.e., outside the interval
• (Q1 - 1.5 x IQR, Q3 + 1.5 x IQR)
• Marshmallow Data:
• Q1 = 0.815, Q3 = 15.00
• 1.5 x IQR = 1.5 x (15 – 0.815) = 21.28
• So the criterion is: an observation
• below 0.815 – 21.28 = – 20.46 or
• above 15.00 + 21.28 = 36.28
• There are no low or high outliers. But it’s silly to think there would be in this bimodal distribution
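
The outlier rule above takes only a few lines to apply. A minimal sketch using the marshmallow quartiles from this slide (the names `lower_fence`, `upper_fence` and `is_outlier` are labels chosen here):

```python
q1, q3 = 0.815, 15.00          # marshmallow quartiles, in minutes
iqr = q3 - q1
step = 1.5 * iqr

lower_fence = q1 - step
upper_fence = q3 + step

def is_outlier(x):
    """True if x lies more than 1.5 x IQR below Q1 or above Q3."""
    return x < lower_fence or x > upper_fence

print(round(step, 2))         # 21.28
print(round(lower_fence, 2))  # -20.46
print(round(upper_fence, 2))  # 36.28
print(is_outlier(0), is_outlier(15))  # False False  (the min and the max)
```

Since neither the minimum (0) nor the maximum (15) is flagged, no observation in between can be flagged either.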
Another Plot Type – Box plots
• Box plots are designed to show clearly the center, spread (especially IQR), and outliers
• They are based on the five-number summary
• Minimum, Q1, Median, Q3, Maximum

• Easiest to explain with an example, using the tuition data.

Box plots
From SPSS Documentation

[Figure: annotated box plot with labels for outliers, largest non-outlier, Q3, median, Q1, and smallest non-outlier]

• The dark line in the middle of the boxes is the median.
• The bottom of the box indicates the 25th percentile.
• The top of the box represents the 75th percentile.
• The T-bars that extend from the boxes are called inner fences or whiskers. These extend to [up to] 1.5 times the height of the box [the IQR]: the closest observation within those bounds.
• The points are outliers. These are defined as values that do not fall in the inner fences.
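
To make the whisker rule concrete, the sketch below finds where each whisker ends and which points would be drawn as outliers. The function `box_plot_parts` and the `data` list are hypothetical, and the quartiles are worked out by hand with the median-of-halves rule, which may differ slightly from the convention SPSS uses.

```python
def box_plot_parts(xs, q1, q3):
    """Return (lower whisker end, upper whisker end, outliers) for a box plot."""
    iqr = q3 - q1
    low_bound, high_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in xs if low_bound <= x <= high_bound]
    outliers = [x for x in xs if x < low_bound or x > high_bound]
    # Whiskers stop at the most extreme observations still within the bounds
    return min(inside), max(inside), outliers

# Hypothetical data: median = 7, Q1 = 4.5, Q3 = 9.5, so IQR = 5
data = [2, 4, 5, 6, 7, 8, 9, 10, 30]

print(box_plot_parts(data, q1=4.5, q3=9.5))  # (2, 10, [30])
```

The upper whisker ends at 10, the largest non-outlier, not at the bound of 17, and the value 30 is plotted as a separate point.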

Box plots vs. Histograms

Histogram shows relative frequency of observations and general shape

Boxplot shows center (median), spread (IQR and range), and outliers

Box plots vs. Histograms: Marshmallow data

Outliers are sometimes data errors

One value of height on a Stat 104 poll entered as 5.2 inches

Unit Recap
• What is statistics?
• Types of data
• Summarizing Data
• Graphically
• Bar plots (Categorical)
• Histograms (Quantitative)
• Box plots (Quantitative)
• Details of a graph may be important
• Vertical axis scale
• Number of bins in histogram

Unit Recap

• Summarizing Data (cont.)
• Numerically
• Frequencies or Proportions/percentages (categorical)
• Center (mean, median)
• Spread (std.dev./variance, IQR)
• Shape (skewness, outliers)
• SPSS is your friend!
