Methods for Describing Data
• Motivating Example
• Data in SPSS
• Summarizing Data
• Graphical displays of data
• Descriptive Summaries of data
• Center
• Spread
• Shape
Think back to when you were 4 years old…
One now… or two when I get back.
Walter Mischel
Variable names
Frequency Distribution
A frequency distribution lists all possible values that a variable
can take on along with the number of observations for each
value. It may also show the relative frequency (the proportion or
percentage of observations at each value).
[Figure: frequency table for Boys and Girls, annotated with the variable of interest, the frequency (count) of each value, and the relative frequency]
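As a rough illustration of the definition above, the following Python sketch builds a frequency distribution with counts and relative frequencies. The function name `frequency_distribution` and the small sample are hypothetical and are not taken from the course data set.

```python
from collections import Counter

def frequency_distribution(values):
    """Return {value: (count, relative_frequency)} for a list of observations."""
    counts = Counter(values)
    n = len(values)
    # Relative frequency is the count divided by the total number of observations.
    return {value: (count, count / n) for value, count in sorted(counts.items())}

# Hypothetical sample, for illustration only.
sex = ["Boy", "Girl", "Girl", "Boy", "Girl"]
table = frequency_distribution(sex)
for value, (count, rel) in table.items():
    print(f"{value}: count = {count}, relative frequency = {rel:.2f}")
```

The relative frequencies always sum to 1, which is a quick check on any frequency table.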
In SPSS: Graphs → Chart Builder → Histogram
Histograms: choice of bins
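To see why the choice of bins matters, here is a minimal Python sketch that counts the same observations using two different bin widths. The helper `bin_counts` and the waiting times are made up for illustration; this is not the binning rule SPSS uses.

```python
def bin_counts(values, bin_width, start=0.0):
    """Count observations in bins [start + k*w, start + (k+1)*w); empty bins are omitted."""
    counts = {}
    for v in values:
        k = int((v - start) // bin_width)  # index of the bin that contains v
        counts[k] = counts.get(k, 0) + 1
    return [(start + k * bin_width, counts[k]) for k in sorted(counts)]

# Hypothetical waiting times in minutes, for illustration only.
times = [0.5, 1.0, 2.5, 6.0, 7.5, 14.0, 15.0, 15.0]

wide = bin_counts(times, 5.0)    # few, wide bins: coarse picture of the shape
narrow = bin_counts(times, 1.0)  # many, narrow bins: more detail, more gaps
print(wide)
print(narrow)
```

Both results describe the same data, but the apparent shape of the histogram differs with the bin width.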
Algebraic formula for the mean
$$ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n} \sum_{i=1}^{n} x_i $$
Some notes about this formula…
• Assuming a sample of n individuals indexed by
i = 1, 2, 3, …, n
• “$x_i$” denotes the variable measurement on person i in
the sample
• “∑” denotes the summation operator
• “Bar” notation denotes average (we say “x-bar”)
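The formula for the mean can be written out step by step in Python. This is a sketch with a made-up sample; the list `waits` is hypothetical and is not the marshmallow data.

```python
def mean(xs):
    """Arithmetic mean: add up the x_i and divide by n."""
    n = len(xs)
    total = 0.0
    for x_i in xs:  # the summation over i = 1, ..., n
        total += x_i
    return total / n

# Hypothetical waiting times in minutes, for illustration only.
waits = [2.0, 4.0, 4.0, 10.0]
x_bar = mean(waits)  # (2 + 4 + 4 + 10) / 4
print(x_bar)
```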
Marshmallow waiting time (n = 550)
Mean = 7.57 min ≈ 7 min, 34 sec
Marshmallow waiting time (n = 550)
Median = 6.125 min ≈ 6 min, 7.5 sec
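For reference, the median is the middle value of the ordered data (or the average of the two middle values when n is even). The sketch below shows that calculation on made-up numbers, plus a small helper, `to_min_sec` (a hypothetical name), that converts decimal minutes such as 6.125 into minutes and seconds.

```python
def median(xs):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def to_min_sec(minutes):
    """Convert decimal minutes into (whole minutes, seconds)."""
    whole, seconds = divmod(minutes * 60, 60)
    return int(whole), seconds

print(median([3, 1, 2]))             # odd n: the single middle value
print(median([1.0, 2.0, 4.0, 9.0]))  # even n: average of 2.0 and 4.0
print(to_min_sec(6.125))             # the median reported on the slide
```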
Histograms show shape
• Skewed to the left (ex: score on an exam)
• Skewed to the right (ex: financial data)
Idealized right-skewed distribution
Mean larger than Median
Idealized Symmetric Distribution
Mean and median are the same
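A small numerical check of these two idealized cases, using Python's standard `statistics` module. Both data sets are invented for illustration: one is symmetric, the other has a single large value in the right tail.

```python
from statistics import mean, median

# Hypothetical data sets, for illustration only.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]  # one large value stretches the right tail

sym_mean, sym_median = mean(symmetric), median(symmetric)
skew_mean, skew_median = mean(right_skewed), median(right_skewed)

print("symmetric:   ", sym_mean, sym_median)    # mean and median agree
print("right-skewed:", skew_mean, skew_median)  # mean is pulled toward the tail
```

The median only depends on the middle of the ordered data, so the large value moves the mean but not the median.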
Effect of Shape on Mean and Median
Shape: Bimodal & Skewed-Right??
Measuring Spread (Variability) in Data
The variance of a set of data
• The “center” of a group of observations can be
measured by the mean
• The variability of a single observation $x_i$ can be
measured by its deviation from the center (e.g. mean):
$$ x_i - \bar{x} $$
• Since a deviation can be negative and we want a quantity that
is never negative, the deviation is squared:
$$ (x_i - \bar{x})^2 $$
Variance
• The variance is the “average” of squared deviations from the
mean
$$ s^2 = s_x^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 $$
Standard Deviation
• The standard deviation (SD) is the square root of the
variance
$$ s = s_x = \sqrt{s_x^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} $$
• Note:
The SD is in the original units of measurement
The variance is in (original units)²
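The variance and standard deviation formulas translate directly into code. The following Python sketch uses a made-up sample (`data` is hypothetical) and cross-checks the hand-written functions against the standard library's `statistics.variance` and `statistics.stdev`, which also divide by n - 1.

```python
import math
import statistics

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x_i - x_bar) ** 2 for x_i in xs) / (n - 1)

def sample_sd(xs):
    """Square root of the sample variance, in the original units."""
    return math.sqrt(sample_variance(xs))

# Hypothetical measurements in minutes, for illustration only.
data = [2.0, 4.0, 4.0, 10.0]
s2 = sample_variance(data)  # deviations -3, -1, -1, 5; squares 9, 1, 1, 25; 36 / 3
s = sample_sd(data)

print(s2, "minutes squared")
print(round(s, 2), "minutes")
print(math.isclose(s2, statistics.variance(data)), math.isclose(s, statistics.stdev(data)))
```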
Example: variance & standard deviation
• N.E. Patriots points scored in each game of their preseason (4 games) were:
$$ s^2 = \frac{60.06 + 3.06 + 203.06 + 22.56}{4 - 1} = \frac{60.06 + 3.06 + 203.06 + 22.56}{3} \approx 96.25 \text{ points}^2 $$
• Calculate the standard deviation (s):
$$ s = \sqrt{s^2} = \sqrt{96.25 \text{ points}^2} \approx 9.81 \text{ points} $$
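The arithmetic in this example can be checked with a few lines of Python. The sketch starts from the four squared deviations shown on the slide, since the raw game scores are not reproduced here; the variable names are chosen for illustration.

```python
import math

# Squared deviations from the mean, as given in the example (one per game).
squared_deviations = [60.06, 3.06, 203.06, 22.56]

n = len(squared_deviations)              # 4 games
s2 = sum(squared_deviations) / (n - 1)   # variance, in points squared
s = math.sqrt(s2)                        # standard deviation, in points

print(round(s2, 2), "points squared")
print(round(s, 2), "points")
```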
Interpreting Standard Deviation:
the Empirical Rule
• If the histogram of the x’s is approximately bell-shaped, then
• ~68% of observations fall within one sd of the mean: $\bar{x} \pm s$
• ~95% of observations fall within two sd’s: $\bar{x} \pm 2s$
• Essentially all observations fall within three sd’s: $\bar{x} \pm 3s$
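The percentages in the Empirical Rule are the ones that hold for an exactly normal (bell-shaped) curve. As a sketch, they can be reproduced with `statistics.NormalDist` from the Python standard library (Python 3.8 or later); real data will only match them approximately.

```python
from statistics import NormalDist

# Idealized bell curve with mean 0 and standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} sd of the mean: {share:.1%}")
```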
Marshmallow waiting time (n = 550)
Std. Deviation: 6 min, 28 sec (6.47 min)
But hard to interpret!
Measuring Location:
Percentiles and Quartiles of Data
• The pth percentile of a distribution is that value such that p%
of the observations fall at or below it.
• The 25th percentile is the value with 25% of the
observations at or below it, 75% above
• It is called the first quartile Q1,
• the 50th percentile is the median M, and
• the 75th percentile is the third quartile Q3
• Called a quantile when expressed as a proportion instead of
percentage (25th percentile = .25 quantile)
• In a small set of numbers, it may not be possible to find exact
values for the percentiles
• The five-number summary of a distribution consists of
• Min, Q1, M, Q3, Max
Calculating Quartiles and IQR
• The first quartile Q1 is the median of the observations
whose position in the ordered list is to the left of the
location of the overall median.
• The third quartile Q3 is the median of the observations
whose position in the ordered list is to the right of the
location of the overall median.
• e.g., 1, 2, 3, 4, 5: Q1 = 1.5, M = 3, Q3 = 4.5
• Interquartile range, IQR = Q3 – Q1, is another measure of
spread in data.
• The IQR measures how spread out the middle 50% of the data is
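The quartile rule above (median of the observations to the left and to the right of the overall median's location) can be sketched in Python as follows. Other software may use different quartile conventions, so results from SPSS or other packages can differ slightly; the function names here are chosen for illustration.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(xs):
    """Return (Min, Q1, M, Q3, Max) using the rule described above (needs n >= 2)."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]             # observations left of the median's location
    upper = ordered[half + (n % 2):]   # observations right of the median's location
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

summary = five_number_summary([1, 2, 3, 4, 5])
iqr = summary[3] - summary[1]  # IQR = Q3 - Q1
print(summary)
print(iqr)
```

For the list 1, 2, 3, 4, 5 this reproduces Q1 = 1.5, M = 3, and Q3 = 4.5, giving an IQR of 3.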
Five number summary of a distribution:
1. Min = 0
2. Q1 = 0.815
3. Median = 6.125
4. Q3 = 15
5. Max = 15
Shape - Detecting Outliers
• For this class: an observation is an outlier if it falls more
than
• 1.5 x IQR below Q1 or
• 1.5 x IQR above Q3,
• i.e., outside the interval
• (Q1 - 1.5 x IQR, Q3 + 1.5 x IQR)
• Marshmallow Data:
• Q1 = 0.815, Q3 = 15.00
• 1.5 x IQR = 1.5 x (15 – 0.815) = 21.28
• So the criterion is: an observation is an outlier if it falls
• below 0.815 – 21.28 = – 20.46 or
• above 15.00 + 21.28 = 36.28
• There are no low or high outliers. But it would be silly to
expect any in this bimodal distribution
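The 1.5 x IQR rule is easy to check in code. This Python sketch uses the Q1 and Q3 values quoted above for the marshmallow data; the function names `outlier_fences` and `is_outlier` are hypothetical.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the interval (Q1 - k*IQR, Q3 + k*IQR) used to flag outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(x, q1, q3):
    """True if x falls outside the 1.5 x IQR fences."""
    low, high = outlier_fences(q1, q3)
    return x < low or x > high

# Quartiles of the marshmallow waiting times, in minutes.
low, high = outlier_fences(0.815, 15.0)
print(round(low, 2), round(high, 2))

# The smallest and largest observed times (0 and 15 minutes) are both inside the fences.
print(is_outlier(0.0, 0.815, 15.0), is_outlier(15.0, 0.815, 15.0))
```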
Another Plot Type – Box plots
• Box plots are designed to show clearly the center, spread
(especially IQR), and outliers
Box plots
[Figure: annotated box plot from the SPSS documentation, with an outlier marked]
Box plots vs. Histograms
• Histogram shows relative frequency of observations and general shape
• Boxplot shows center (median), spread (IQR and range), and outliers
Box plots vs. Histograms: Marshmallow data
Outliers are sometimes data errors
Unit Recap
• What is statistics?
• Types of data
• Summarizing Data
• Graphically
• Bar plots (Categorical)
• Histograms (Quantitative)
• Box plots (Quantitative)
• Details of a graph may be important
• Vertical axis scale
• Number of bins in histogram