
Statistics for Business and Economics
Session 3
DESCRIPTIVE STATISTICS (CONT’D)

Data display, exploration & variable relationship description

Andrianantenaina Hajanirina, B.A., B.Sc., M.M.


Probability & Statistics

Methods for Describing Sets of Data
Learning Objectives

3.1. Displaying and exploring data
3.2. Describing the relationship between two variables
Thinking Challenge
“Our market share far exceeds all competitors!” - VP
[Figure: bar chart of market share for competitors X and Y and for “Us”; the vertical axis runs only from 30% to 36%]
Data Presentation
• Qualitative Data
  • Summary Table
  • Bar Graph, Pie Chart, Pareto Diagram
• Quantitative Data
  • Stem-&-Leaf Display
  • Frequency Distribution
  • Histogram
Presenting Qualitative Data
Summary Table
1. Lists categories & number of elements in category
2. Obtained by tallying responses in category
3. May show frequencies (counts), % or both

Major         Count
Accounting    130
Economics     20
Management    50
Total         200

Each row is a category; each count comes from a tally (|||| ||||) of the responses.
Bar Graph
[Figure: vertical bar graph of frequency (axis from 0 to 150) by major: Acct., Econ., Mgmt.]
• Equal bar widths
• Bar height shows frequency; percent (%) may also be used
• Vertical bars for qualitative variables
• Vertical axis starts at the zero point
Pie Chart

1. Shows breakdown of total quantity into categories
2. Useful for showing relative differences
3. Angle size
• (360°)(percent)
• Example: (360°)(10%) = 36°

[Figure: pie chart of Majors: Acct. 65%, Mgmt. 25%, Econ. 10% (a 36° slice)]
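The angle rule can be checked with a minimal Python sketch. The percentages are the ones shown for the Majors pie chart; the function name `slice_angles` is only illustrative.

```python
# Percent shares for the Majors pie chart (values taken from the slide).
shares = {"Acct.": 65, "Mgmt.": 25, "Econ.": 10}

def slice_angles(percent_by_category):
    """Return each category's central angle in degrees: 360 * percent / 100."""
    return {name: 360 * pct / 100 for name, pct in percent_by_category.items()}

angles = slice_angles(shares)
for name, angle in angles.items():
    print(f"{name}: {angle:.0f} degrees")
```

The angles should add up to a full circle of 360 degrees whenever the percentages add up to 100.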
Pareto Diagram
Like a bar graph, but with the categories arranged by
height in descending order from left to right.
[Figure: Pareto diagram of frequency (axis from 0 to 150) by major, ordered Acct., Mgmt., Econ.]
• Equal bar widths
• Bar height shows frequency; percent (%) may also be used
• Vertical bars for qualitative variables
• Vertical axis starts at the zero point
Thinking Challenge
You’re an analyst for IRI. You want to show the
market shares held by Web browsers in 2006.
Construct a bar graph, pie chart, & Pareto diagram
to describe the data.
Browser Mkt. Share (%)
Firefox 14
Internet Explorer 81
Safari 4
Others 1
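As a rough illustration of what separates a Pareto diagram from a plain bar graph, the sketch below sorts the browser categories by share in descending order. The helper name `pareto_order` is an assumption made for this example, and the plotting itself is left out.

```python
# Market shares (%) from the Thinking Challenge table.
market_share = {"Firefox": 14, "Internet Explorer": 81, "Safari": 4, "Others": 1}

def pareto_order(values):
    """Return (category, value) pairs sorted from the tallest bar to the shortest."""
    return sorted(values.items(), key=lambda item: item[1], reverse=True)

ordered = pareto_order(market_share)
for browser, share in ordered:
    print(f"{browser}: {share}%")
```

A bar graph would keep the categories in their original order; only the ordering step differs.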
Bar Graph Solution*
[Figure: bar graph of Market Share (%) by Browser, vertical axis from 0% to 100%; bars in the order Firefox, Internet Explorer, Safari, Others]
Pie Chart Solution*
[Figure: pie chart of Market Share: Internet Explorer, 81%; Firefox, 14%; Safari, 4%; Others, 1%]
Pareto Diagram Solution*
[Figure: Pareto diagram of Market Share (%) by Browser, vertical axis from 0% to 100%; bars in descending order: Internet Explorer, Firefox, Safari, Others]
Presenting Quantitative Data
Stem-and-Leaf Display
1. Divide each observation into stem value and leaf value
• Stem value defines class
• Leaf value defines frequency (count)
2. Data: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41

Stem   Leaves
2      1 4 4 6 7 7
3      0 2 8
4      1

For example, the observation 26 has stem 2 and leaf 6.
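The display can be reproduced with a short Python sketch. It assumes two-digit whole numbers, so the tens digit is the stem and the units digit is the leaf; other data would need a different split.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Group the units digits (leaves) of each value under its tens digit (stem)."""
    groups = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    return dict(groups)

data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]
display = stem_and_leaf(data)
for stem, leaves in display.items():
    print(stem, "|", "".join(str(leaf) for leaf in leaves))
```

The number of leaves on each stem is the frequency of that class.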
Frequency Distribution Table Steps
1. Determine range
2. Select number of classes
• Usually between 5 & 15 inclusive
3. Compute class intervals (width)
4. Determine class boundaries (limits)
5. Compute class midpoints
6. Count observations & assign to classes

• Determine the range
Range (R) = highest value – lowest value
• Number of classes
C = 1 + (10/3) × log N (N = number of observations; log is base 10)
• Class Interval
CI = R/C (rounded)
• Class Boundaries
Lowest boundary value <= lowest value
Highest boundary value >= highest value
• Class Midpoint
CM = (Lower + Upper Boundaries) / 2
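The first three steps can be sketched in Python for the ten observations used in this session. The base-10 logarithm and the use of `round` for the class interval are assumptions; the slides only say "rounded".

```python
import math

data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]

# Step 1: range R = highest value - lowest value
value_range = max(data) - min(data)

# Step 2: number of classes C = 1 + (10/3) * log10(N)
num_classes = 1 + (10 / 3) * math.log10(len(data))

# Step 3: class interval CI = R / C, rounded to a whole number (assumed rounding rule)
class_interval = round(value_range / num_classes)

print(value_range, round(num_classes, 2), class_interval)
```

The worked histogram example in these slides uses three classes of width 10 for the same data, so the formula is probably best read as a guideline and not a fixed requirement.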
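Histogram

Class          Freq.
15.5 – 25.5    3
25.5 – 35.5    5
35.5 – 45.5    2

[Figure: histogram of the classes above; the horizontal axis shows lower boundaries (0, 15.5, 25.5, 35.5, 45.5, 55.5); the vertical axis shows count from 0 to 5 and may instead show frequency, relative frequency, or percent; bars touch]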
Frequency Distribution Table Example
Raw Data: 24, 26, 24, 21, 27, 27, 30, 41, 32, 38

Class (boundaries)    Midpoint    Frequency
15.5 – 25.5           20.5        3
25.5 – 35.5           30.5        5
35.5 – 45.5           40.5        2

Width = upper boundary – lower boundary
Midpoint = (Lower + Upper Boundaries) / 2
Relative Frequency & % Distribution Tables

Class          Relative Frequency (Prop.)    Percentage (%)
15.5 – 25.5    .3                            30.0
25.5 – 35.5    .5                            50.0
35.5 – 45.5    .2                            20.0
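The three tables can be rebuilt from the raw data with a small Python sketch. The boundaries are the ones shown in the slides; the function name `frequency_table` and the dictionary layout are illustrative choices.

```python
raw_data = [24, 26, 24, 21, 27, 27, 30, 41, 32, 38]
boundaries = [15.5, 25.5, 35.5, 45.5]  # class boundaries from the slides

def frequency_table(data, bounds):
    """Count observations per class and derive midpoint, proportion, and percent."""
    rows = []
    for lower, upper in zip(bounds, bounds[1:]):
        count = sum(1 for x in data if lower < x <= upper)
        rows.append({
            "class": (lower, upper),
            "midpoint": (lower + upper) / 2,
            "frequency": count,
            "proportion": count / len(data),
            "percent": 100 * count / len(data),
        })
    return rows

table = frequency_table(raw_data, boundaries)
for row in table:
    print(row)
```

Because the boundaries end in .5 and the data are whole numbers, no observation can fall exactly on a boundary, so the choice between `<` and `<=` does not matter here.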
Numerical Data Properties
Thinking Challenge
... employees cite low pay -- most workers earn only $20,000.
... President claims average pay is $70,000!
[Figure: salary scale marked at $20,000, $30,000, $50,000, $70,000, and $400,000]
Standard Notation

Measure               Sample        Population
Mean                  $\bar{X}$     $\mu$
Standard Deviation    $S$           $\sigma$
Variance              $S^2$         $\sigma^2$
Size                  $n$           $N$
Numerical Data Properties

Central Tendency
(Location)

Variation
(Dispersion)

Shape
Numerical Data Properties & Measures
• Central Tendency: Mean, Median, Mode
• Variation: Range, Interquartile Range, Variance, Standard Deviation
• Relative Standing: Percentiles, Z–scores
Central Tendency
Mean
1. Measure of central tendency
2. Most common measure
3. Acts as ‘balance point’
4. Affected by extreme values (‘outliers’)
5. Formula (sample mean)
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \dots + X_n}{n}$
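A minimal Python sketch of the sample mean formula, applied to the six raw values used in this session's mean example. The function name `sample_mean` is illustrative.

```python
def sample_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

raw_data = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]
mean = sample_mean(raw_data)
print(round(mean, 2))
```

Replacing one value with an extreme one (for example, changing 11.7 to 117) shifts the result noticeably, which illustrates why the mean is said to be affected by outliers.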
Mean Example
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7

$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + X_3 + X_4 + X_5 + X_6}{6}$
$= \frac{10.3 + 4.9 + 8.9 + 11.7 + 6.3 + 7.7}{6}$
$= 8.30$
Median
1. Measure of central tendency
2. Middle value in ordered sequence
• If n is odd, middle value of sequence
• If n is even, average of 2 middle values
3. Position of median in sequence
Positioning Point $= \frac{n + 1}{2}$
4. Not affected by extreme values
Median Example
Odd-Sized Sample
• Raw Data: 24.1 22.6 21.5 23.7 22.6
• Ordered: 21.5 22.6 22.6 23.7 24.1
• Position: 1 2 3 4 5

Positioning Point $= \frac{n + 1}{2} = \frac{5 + 1}{2} = 3.0$
Median $= 22.6$
Median Example
Even-Sized Sample
• Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
• Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
• Position: 1 2 3 4 5 6

Positioning Point $= \frac{n + 1}{2} = \frac{6 + 1}{2} = 3.5$
Median $= \frac{7.7 + 8.9}{2} = 8.30$
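Both median examples follow the same positioning-point rule, which can be sketched in Python as below. The function name `median` is illustrative; Python's standard `statistics.median` gives the same result.

```python
def median(values):
    """Middle value of the ordered data, located at position (n + 1) / 2."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2            # 1-based positioning point
    if n % 2 == 1:
        # Odd n: the position is a whole number, so take that value.
        return ordered[int(position) - 1]
    # Even n: the position ends in .5, so average the two middle values.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

odd_sample = [24.1, 22.6, 21.5, 23.7, 22.6]
even_sample = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]
print(median(odd_sample), round(median(even_sample), 2))
```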
Mode
1. Measure of central tendency
2. Value that occurs most often
3. Not affected by extreme values
4. May be no mode or several modes
5. May be used for quantitative or qualitative
data
Mode Example
• No Mode
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
• One Mode
Raw Data: 6.3 4.9 8.9 6.3 4.9 4.9
• More Than 1 Mode
Raw Data: 21 28 28 41 43 43
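The three cases above can be reproduced with a short Python sketch. It assumes the convention implied by the first case: when no value repeats, the data set has no mode.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # assumed convention: every value occurs once, so no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([10.3, 4.9, 8.9, 11.7, 6.3, 7.7]))
print(modes([6.3, 4.9, 8.9, 6.3, 4.9, 4.9]))
print(modes([21, 28, 28, 41, 43, 43]))
```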
Thinking Challenge
You’re a financial analyst
for Prudential-Bache
Securities. You have
collected the following
closing stock prices of new
stock issues: 17, 16, 21, 18,
13, 16, 12, 11.
Describe the stock prices
in terms of central
tendency.
Central Tendency Solution*
Mean
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \dots + X_8}{8}$
$= \frac{17 + 16 + 21 + 18 + 13 + 16 + 12 + 11}{8}$
$= 15.5$
Central Tendency Solution*

Median
• Raw Data: 17 16 21 18 13 16 12 11
• Ordered: 11 12 13 16 16 17 18 21
• Position: 1 2 3 4 5 6 7 8

Positioning Point $= \frac{n + 1}{2} = \frac{8 + 1}{2} = 4.5$
Median $= \frac{16 + 16}{2} = 16$
Central Tendency Solution*

Mode
Raw Data: 17 16 21 18 13 16 12 11

Mode = 16
Summary of Central Tendency Measures

Measure    Formula                  Description
Mean       $\sum X_i / n$           Balance Point
Median     Position $(n + 1)/2$     Middle Value When Ordered
Mode       none                     Most Frequent
Shape
1. Describes how data are distributed
2. Measures of Shape
• Skew = lack of symmetry

Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean
Variation
Range
1. Measure of dispersion
2. Difference between largest & smallest
observations
Range = Xlargest – Xsmallest
3. Ignores how data are distributed

[Figure: two dot plots on a scale from 7 to 10 with differently distributed values; in both, Range = 10 – 7 = 3]
Variance &
Standard Deviation
1. Measures of dispersion
2. Most common measures
3. Consider how data are distributed
4. Show variation about mean ($\bar{X}$ or μ)

[Figure: dot plot on a scale from 4 to 12 with the mean $\bar{X}$ = 8.3 marked]
Sample Variance Formula
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2}{n - 1}$

n – 1 in denominator! (Use N if Population Variance)
Sample Standard Deviation Formula
$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} = \sqrt{\frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2}{n - 1}}$
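The two formulas can be sketched in Python as follows, applied to the six raw values from the mean example and to the closing stock prices from the Thinking Challenge. The function names are illustrative; the standard `statistics.variance` and `statistics.stdev` functions also use the n – 1 denominator.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std_dev(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

raw_data = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]
stock_prices = [17, 16, 21, 18, 13, 16, 12, 11]

print(round(sample_variance(raw_data), 3))
print(round(sample_variance(stock_prices), 2), round(sample_std_dev(stock_prices), 2))
```

For a population variance, the divisor `n - 1` would be replaced by `n` (written N in the slides).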
Variance Example
Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7

$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$ where $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = 8.3$

$S^2 = \frac{(10.3 - 8.3)^2 + (4.9 - 8.3)^2 + \dots + (7.7 - 8.3)^2}{6 - 1} = 6.368$
Thinking Challenge
• You’re a financial analyst
for Prudential-Bache
Securities. You have
collected the following
closing stock prices of
new stock issues: 17, 16,
21, 18, 13, 16, 12, 11.
• What are the variance
and standard deviation
of the stock prices?
Variation Solution*
Sample Variance
Raw Data: 17 16 21 18 13 16 12 11

$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$ where $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = 15.5$

$S^2 = \frac{(17 - 15.5)^2 + (16 - 15.5)^2 + \dots + (11 - 15.5)^2}{8 - 1} = 11.14$
Variation Solution*

Sample Standard Deviation

$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} = \sqrt{11.14} = 3.34$
Summary of Variation Measures

Measure                            Formula                                              Description
Range                              $X_{largest} - X_{smallest}$                         Total Spread
Standard Deviation (Sample)        $\sqrt{\frac{\sum (X_i - \bar{X})^2}{n - 1}}$        Dispersion about Sample Mean
Standard Deviation (Population)    $\sqrt{\frac{\sum (X_i - \mu)^2}{N}}$                Dispersion about Population Mean
Variance (Sample)                  $\frac{\sum (X_i - \bar{X})^2}{n - 1}$               Squared Dispersion about Sample Mean
Interpreting Standard
Deviation
Interpreting Standard Deviation:
Chebyshev’s Theorem
• Applies to any shape data set
• No useful information about the fraction of data in the interval x̄ – s to x̄ + s
• At least 3/4 of the data lies in the interval x̄ – 2s to x̄ + 2s
• At least 8/9 of the data lies in the interval x̄ – 3s to x̄ + 3s
• In general, for k > 1, at least 1 – 1/k² of the data lies in the interval x̄ – ks to x̄ + ks
[Figure: number line marked x̄ – 3s, x̄ – 2s, x̄ – s, x̄, x̄ + s, x̄ + 2s, x̄ + 3s; brackets show no useful information within one standard deviation of the mean, at least 3/4 of the data within two, and at least 8/9 of the data within three]

Chebyshev’s Theorem Example
• Previously we found the mean
closing stock price of new stock
issues is 15.5 and the standard
deviation is 3.34.
• Use this information to form an
interval that will contain at least
75% of the closing stock prices of
new stock issues.
Chebyshev’s Theorem Example
At least 75% of the closing stock prices of new stock
issues will lie within 2 standard deviations of the mean.

x̄ = 15.5, s = 3.34

(x̄ – 2s, x̄ + 2s) = (15.5 – 2∙3.34, 15.5 + 2∙3.34) = (8.82, 22.18)
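The same reasoning can be sketched in Python: solve 1 – 1/k² = 0.75 for k, then build the interval. Both function names are illustrative, and the mean and standard deviation are the rounded values from the slides.

```python
import math

def chebyshev_k(min_fraction):
    """Smallest k with 1 - 1/k**2 >= min_fraction (needs 0 < min_fraction < 1)."""
    return 1 / math.sqrt(1 - min_fraction)

def chebyshev_interval(mean, std_dev, k):
    """Interval from mean - k*s to mean + k*s."""
    return (mean - k * std_dev, mean + k * std_dev)

k = chebyshev_k(0.75)                        # 1 / sqrt(0.25)
low, high = chebyshev_interval(15.5, 3.34, k)
print(k, round(low, 2), round(high, 2))
```

Asking for at least 8/9 of the data instead gives k close to 3, matching the third bullet of the theorem.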
Interpreting Standard Deviation:
Empirical Rule
• Applies to data sets that are mound shaped and
symmetric
• Approximately 68% of the measurements lie in the
interval μ – σ to μ + σ
• Approximately 95% of the measurements lie in the
interval μ – 2σ to μ + 2σ
• Approximately 99.7% of the measurements lie in the
interval μ – 3σ to μ + 3σ
[Figure: number line marked μ – 3σ, μ – 2σ, μ – σ, μ, μ + σ, μ + 2σ, μ + 3σ; brackets show approximately 68% of the measurements within one standard deviation of the mean, approximately 95% within two, and approximately 99.7% within three]

Empirical Rule Example
Previously we found the mean
closing stock price of new
stock issues is 15.5 and the
standard deviation is 3.34. If
we can assume the data is
symmetric and mound shaped,
calculate the percentage of the
data that lie within the intervals
x̄ ± s, x̄ ± 2s, x̄ ± 3s.
Empirical Rule Example
• According to the Empirical Rule, approximately 68%
of the data will lie in the interval (x – s, x + s),
(15.5 – 3.34, 15.5 + 3.34) = (12.16, 18.84)
• Approximately 95% of the data will lie in the interval
(x – 2s, x + 2s),
(15.5 – 2∙3.34, 15.5 + 2∙3.34) = (8.82, 22.18)
• Approximately 99.7% of the data will lie in the interval
(x – 3s, x + 3s),
(15.5 – 3∙3.34, 15.5 + 3∙3.34) = (5.48, 25.52)
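As a rough check, the sketch below rebuilds the three intervals in Python and counts how many of the eight closing stock prices actually fall inside each one. The function name is illustrative, and with only eight observations the observed shares can only loosely match the rule's approximations.

```python
stock_prices = [17, 16, 21, 18, 13, 16, 12, 11]
mean, std_dev = 15.5, 3.34  # rounded values from the slides

def interval_and_share(values, center, spread, k):
    """Return (low, high, share of values inside) for center +/- k * spread."""
    low, high = center - k * spread, center + k * spread
    inside = sum(1 for v in values if low <= v <= high)
    return low, high, inside / len(values)

for k, rule in [(1, "68%"), (2, "95%"), (3, "99.7%")]:
    low, high, share = interval_and_share(stock_prices, mean, std_dev, k)
    print(f"k={k}: ({low:.2f}, {high:.2f}) rule ~{rule}, observed {share:.1%}")
```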
Numerical Measures of
Relative Standing
Numerical Measures of
Relative Standing: Percentiles
• Describes the relative location of a
measurement compared to the rest of the data
• The pth percentile is a number such that p% of
the data falls below it and (100 – p)% falls
above it
• Median = 50th percentile
Percentile Example
• You scored 560 on the GMAT exam. This
score puts you in the 58th percentile.
• What percentage of test takers scored lower
than you did?
• What percentage of test takers scored higher
than you did?
Percentile Example
• What percentage of test takers scored lower
than you did?
58% of test takers scored lower than 560.
• What percentage of test takers scored higher
than you did?
(100 – 58)% = 42% of test takers scored
higher than 560.
Numerical Measures of
Relative Standing: Z–Scores
• Describes the relative location of a
measurement compared to the rest of the data
• Sample z–score: $z = \frac{x - \bar{x}}{s}$
• Population z–score: $z = \frac{x - \mu}{\sigma}$
• Measures the number of standard deviations away from the mean a data value is located
Z–Score Example
• The mean time to assemble a
product is 22.5 minutes with a
standard deviation of 2.5 minutes.
• Find the z–score for an item that
took 20 minutes to assemble.
• Find the z–score for an item that
took 27.5 minutes to assemble.
Z–Score Example
x = 20, μ = 22.5, σ = 2.5
$z = \frac{x - \mu}{\sigma} = \frac{20 - 22.5}{2.5} = -1.0$

x = 27.5, μ = 22.5, σ = 2.5
$z = \frac{x - \mu}{\sigma} = \frac{27.5 - 22.5}{2.5} = 2.0$
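A minimal Python sketch of the z–score calculation for the two assembly times. The function name `z_score` is illustrative.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

z_fast = z_score(20, 22.5, 2.5)
z_slow = z_score(27.5, 22.5, 2.5)
print(z_fast, z_slow)
```

A negative z–score means the value is below the mean; a positive one means it is above.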
Quartiles & Box Plots
Quartiles
1. Measure of noncentral tendency
2. Split ordered data into 4 quarters

[Figure: ordered data divided into four segments of 25% each, separated by Q1, Q2, and Q3]
3. Position of i-th quartile
Positioning Point of $Q_i = \frac{i(n + 1)}{4}$
Quartile (Q1) Example
• Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
• Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
• Position: 1 2 3 4 5 6

$Q_1$ Position $= \frac{1(n + 1)}{4} = \frac{1(6 + 1)}{4} = 1.75 \approx 2$
$Q_1 = 6.3$
Quartile (Q2) Example
• Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
• Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
• Position: 1 2 3 4 5 6

$Q_2$ Position $= \frac{2(n + 1)}{4} = \frac{2(6 + 1)}{4} = 3.5$
$Q_2 = \frac{7.7 + 8.9}{2} = 8.3$
Quartile (Q3) Example
• Raw Data: 10.3 4.9 8.9 11.7 6.3 7.7
• Ordered: 4.9 6.3 7.7 8.9 10.3 11.7
• Position: 1 2 3 4 5 6

$Q_3$ Position $= \frac{3(n + 1)}{4} = \frac{3(6 + 1)}{4} = 5.25 \approx 5$
$Q_3 = 10.3$
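The three quartile examples can be sketched in Python as below. The rounding rule is inferred from the examples and should be treated as an assumption: a position ending in .5 averages the two neighbouring values, and any other fractional position is rounded to the nearest whole position. Other textbooks and software use different quartile conventions and may give slightly different values.

```python
import math

def quartile(values, i):
    """Return the i-th quartile (i = 1, 2, 3) using the positioning point i(n + 1)/4."""
    ordered = sorted(values)
    n = len(ordered)
    position = i * (n + 1) / 4          # 1-based position in the ordered data
    lower = math.floor(position)
    if position - lower == 0.5:
        # Halfway between two positions: average the neighbouring values.
        return (ordered[lower - 1] + ordered[lower]) / 2
    # Otherwise round to the nearest whole position (assumed rule).
    return ordered[round(position) - 1]

raw_data = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]
q1, q2, q3 = (quartile(raw_data, i) for i in (1, 2, 3))
print(q1, round(q2, 2), q3)
```

This sketch does not guard against very small samples, where the positioning point can fall outside the data.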
Interquartile Range
1. Measure of dispersion
2. Also called midspread
3. Difference between third & first quartiles
• Interquartile Range = Q3 – Q1
4. Spread in middle 50%
5. Not affected by extreme values
Thinking Challenge
• You’re a financial analyst for
Prudential-Bache Securities.
You have collected the
following closing stock prices
of new stock issues: 17, 16,
21, 18, 13, 16, 12, 11.
• What are the quartiles, Q1
and Q3, and the interquartile
range?
Quartile Solution*
Q1
Raw Data: 17 16 21 18 13 16 12 11
Ordered: 11 12 13 16 16 17 18 21
Position: 1 2 3 4 5 6 7 8

$Q_1$ Position $= \frac{1(n + 1)}{4} = \frac{1(8 + 1)}{4} = 2.25 \approx 2$
$Q_1 = 12$
Quartile Solution*
Q3
Raw Data: 17 16 21 18 13 16 12 11
Ordered: 11 12 13 16 16 17 18 21
Position: 1 2 3 4 5 6 7 8

$Q_3$ Position $= \frac{3(n + 1)}{4} = \frac{3(8 + 1)}{4} = 6.75 \approx 7$
$Q_3 = 18$
Interquartile Range Solution*

Interquartile Range
Raw Data: 17 16 21 18 13 16 12 11
Ordered: 11 12 13 16 16 17 18 21
Position: 1 2 3 4 5 6 7 8

Interquartile Range $= Q_3 - Q_1 = 18 - 12 = 6$


Box Plot
1. Graphical display of data using 5-number
summary

[Figure: box plot on a scale from 4 to 12 marking Xsmallest, Q1, Median, Q3, and Xlargest]
Shape & Box Plot
[Figure: three box plots labeled Left-Skewed, Symmetric, and Right-Skewed, each marking Q1, Median, and Q3]
Graphing Bivariate
Relationships
• Describes a relationship between two
quantitative variables
• Plot the data in a Scattergram
[Figure: three scattergrams of y against x showing a positive relationship, a negative relationship, and no relationship]
Scattergram Example
• You’re a marketing analyst for Hasbro Toys.
You gather the following data:
Ad $ (x) Sales (Units) (y)
1 1
2 1
3 2
4 2
5 4
• Draw a scattergram of the data
Scattergram Example
[Figure: scattergram of Sales (vertical axis, 0 to 4) against Advertising (horizontal axis, 0 to 5) for the five data points]
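For readers without plotting software, a text-only scattergram of the Hasbro data can be sketched in Python. The function name and the character-grid layout are illustrative choices, and the sketch assumes small non-negative whole-number coordinates.

```python
def ascii_scattergram(points, x_max, y_max):
    """Draw points on a character grid: y labels on the left, x labels on the last line."""
    marked = set(points)
    lines = []
    for y in range(y_max, -1, -1):
        row = "".join("*" if (x, y) in marked else "." for x in range(x_max + 1))
        lines.append(f"{y} {row}")
    lines.append("  " + "".join(str(x) for x in range(x_max + 1)))
    return "\n".join(lines)

# (Ad $, Sales in units) pairs from the example table.
ad_sales = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 4)]
plot = ascii_scattergram(ad_sales, x_max=5, y_max=4)
print(plot)
```

The points rise from lower left to upper right, which is the pattern the slides call a positive relationship.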
Time Series Plot
• Used to graphically display data produced over
time
• Shows trends and changes in the data over
time
• Time recorded on the horizontal axis
• Measurements recorded on the vertical axis
• Points connected by straight lines
Time Series Plot Example
• The following data shows the average retail price of regular gasoline in New York City for 8 weeks in 2006.
• Draw a time series plot for this data.

Date            Average Price
Oct 16, 2006    $2.219
Oct 23, 2006    $2.173
Oct 30, 2006    $2.177
Nov 6, 2006     $2.158
Nov 13, 2006    $2.185
Nov 20, 2006    $2.208
Nov 27, 2006    $2.236
Dec 4, 2006     $2.298
Time Series Plot Example
[Figure: time series plot of Price (vertical axis, 2.05 to 2.35) against Date (horizontal axis, 10/16 to 12/4), with points connected by straight lines]
Distorting the Truth
with Descriptive Techniques
Errors in Presenting Data
1. Using ‘chart junk’
2. No relative basis in
comparing data
batches
3. Compressing the
vertical axis
4. No zero point on the
vertical axis
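‘Chart Junk’
Bad Presentation: [Figure: graphic titled “Minimum Wage” listing 1960: $1.00, 1970: $1.60, 1980: $3.10, 1990: $3.80]
Good Presentation: [Figure: plain chart titled “Minimum Wage” with a $ axis from 0 to 4 and the years 1960, 1970, 1980, 1990]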
No Relative Basis
Bad Presentation: [Figure: “A’s by Class” shown as frequencies (axis from 0 to 300) for FR, SO, JR, SR]
Good Presentation: [Figure: “A’s by Class” shown as percentages (axis from 0% to 30%) for FR, SO, JR, SR]
Compressing Vertical Axis
Bad Presentation: [Figure: “Quarterly Sales” for Q1 to Q4 on a $ axis from 0 to 200]
Good Presentation: [Figure: “Quarterly Sales” for Q1 to Q4 on a $ axis from 0 to 50]
No Zero Point on Vertical Axis
Bad Presentation: [Figure: “Monthly Sales” for J, M, M, J, S, N on a $ axis from 36 to 45]
Good Presentation: [Figure: “Monthly Sales” for J, M, M, J, S, N on a $ axis from 0 to 60]
Conclusion
1. Described Qualitative Data Graphically
2. Described Numerical Data Graphically
3. Explained Numerical Data Properties
4. Described Summary Measures
5. Analyzed Numerical Data Using Summary
Measures
