1-Exploratory Data Analysis: Graphs, Tables and Summary Measures

First :
What is Statistics?
Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. 1.1

What is Statistics?
“Statistics is a way to get information from data.”

-Gerald Keller
(textbook author)

What is Statistics?
“Statistics is a way to get information from data”

Data → Statistics → Information
Data: Facts, especially numerical facts, collected together for reference or information.
Information: Knowledge communicated concerning some particular fact.
Statistics is a tool for creating new understanding from a set of numbers.
Definitions: Oxford English Dictionary

Key Statistical Concepts…
Population
— a population is the group of all items of interest to
a statistics practitioner.
— frequently very large; sometimes infinite.
E.g. All 30 million (?) voters in Egypt
Sample
— A sample is a set of data drawn from the
population.
— Potentially very large, but less than the population.
E.g. a sample of 1,000 voters who took part in a poll.
Parameter
— A descriptive measure of a population.
Statistic
— A descriptive measure of a sample.

[Diagram: the Sample is a subset of the Population. The Population has a Parameter; the Sample has a Statistic.]
Populations have Parameters,
Samples have Statistics.

Descriptive Statistics…
…are methods of organizing, summarizing, and presenting
data in a convenient and informative way. These methods
include:
Graphical Techniques, and Numerical Techniques .
The actual method used depends on what information we
would like to extract. Are we interested in…
• measure(s) of central location? and/or
• measure(s) of variability (dispersion)?
Descriptive Statistics helps to answer these questions…

Inferential Statistics…
Descriptive statistics describes the data set that is being analyzed, but it does not allow us to draw any conclusions or make any inferences beyond the data. Hence we need another branch of statistics: inferential statistics.
Inferential statistics is also a set of methods, but it is used to draw conclusions or inferences about characteristics of populations based on data from a sample.

Statistical Inference…
Statistical inference is the process of making an estimate,
prediction, or decision about a population based on a sample.
[Diagram: a Sample is drawn from the Population; inference runs from the sample's Statistic back to the population's Parameter.]
What can we infer about a Population’s Parameters based on a Sample’s Statistics?
Statistical Inference…
We use statistics to make inferences about parameters.
Therefore, we can make an estimate, prediction, or decision about a population based on sample data.
• Large populations make investigating each member impractical and expensive.
• It is easier and cheaper to take a sample and make estimates about the population from the sample.
However:
Such conclusions and estimates are not always going to be correct.
We need to know when a sample is less likely to be a good sample.
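The point that sample-based estimates are not always correct can be illustrated with a small simulation. This is only a sketch: the population below is made-up random data (not data from these slides), and the sample size of 1,000 simply echoes the polling example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 30,000 made-up ages (an assumption, not slide data).
population = [random.randint(18, 90) for _ in range(30_000)]
parameter = statistics.mean(population)  # population mean, normally unknown

# Draw several samples of 1,000 and compute the statistic for each one.
sample_means = [
    statistics.mean(random.sample(population, 1_000)) for _ in range(5)
]

print(f"parameter (population mean): {parameter:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i} mean: {m:.2f}  (error {m - parameter:+.2f})")
```

Each sample mean lands near the population mean but rarely exactly on it, which is why inference always carries some uncertainty.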

Statistical Inference
The process of making guesses about the truth from a sample.
Sample statistics (observation):
$\hat{\mu} = \bar{X}_n = \frac{\sum_{i=1}^{n} x_i}{n}$
$\hat{\sigma}^2 = s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X}_n)^2}{n - 1}$
*hat notation ^ is often used to indicate “estimate”
Truth (not observable), population parameters:
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
The sample is used to make guesses about the whole population.
Statistic (from sample) | Parameter (from entire population)
Mean: $\bar{X}$ estimates ____
Standard deviation: s estimates ____
Proportion: p estimates ____
Population: the mean, μ, is unknown.
Sample, point estimate: Mean $\bar{X}$ = 50
Sample, interval estimate: I am 95% confident that μ is between 40 & 60.

Graphical Techniques for
Nominal data
The only allowable calculation on nominal data is to count the frequency of each value of a variable.
When the raw data can be naturally categorized in a meaningful manner, we can display frequencies by
Bar charts – emphasize the frequency of occurrences of the different categories.
Pie charts – emphasize the proportion of occurrences of each category.

Graphical & Tabular Techniques for
Nominal Data
First we need to summarize the data in a table, called a frequency distribution, that presents the categories and their counts.
A relative frequency distribution lists the categories and the proportion with which each occurs.
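A frequency distribution and its relative frequencies can be sketched in a few lines. The counts are the bar heights shown on the bar chart slide; pairing each count with a category name is an inference made by matching the pie chart percentages (for example 73/253 ≈ 28.9%), so treat the labels as an assumption.

```python
from collections import Counter

# Counts read from the bar chart slide; the category pairing is inferred
# from the pie chart percentages and is an assumption.
counts_from_slides = {
    "Accounting": 73,
    "Finance": 52,
    "General management": 36,
    "Marketing": 64,
    "Other": 28,
}

# Expand the counts into raw nominal observations, then tally them again,
# which is what one would do starting from the raw data.
observations = [area for area, k in counts_from_slides.items() for _ in range(k)]
frequency = Counter(observations)

n = sum(frequency.values())
relative = {area: count / n for area, count in frequency.items()}

for area, count in frequency.most_common():
    print(f"{area:20s} {count:4d} {relative[area]:7.1%}")
```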

Nominal Data (Tabular Summary)

The Pie Chart
The pie chart is a circle, subdivided into a number of slices that represent the various categories.
The size of each slice is proportional to the percentage corresponding to the category it represents.

The Pie Chart
[Pie chart with five slices: Accounting 28.9%, Marketing 25.3%, Finance 20.6%, General management 14.2%, Other 11.1%]
Angle of the Accounting slice: (28.9/100)(360°) = 104°
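The slice-angle calculation can be repeated for every category. This is a sketch using the percentages printed on the pie chart; the pairing of labels with percentages follows the most plausible reading of the slide and is an assumption.

```python
# Percentages as printed on the pie chart (label pairing is an assumption).
percentages = {
    "Accounting": 28.9,
    "Marketing": 25.3,
    "Finance": 20.6,
    "General management": 14.2,
    "Other": 11.1,
}

# Each slice angle is the category's share of the full 360-degree circle.
angles = {area: pct / 100 * 360 for area, pct in percentages.items()}

for area, angle in angles.items():
    print(f"{area:20s} {angle:6.1f} degrees")
```

Because the printed percentages are rounded, they add up to 100.1 rather than exactly 100, so the angles add up to slightly more than 360 degrees.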

The Bar Chart
Rectangles represent each category.

The height of the rectangle represents the frequency.
The width of the rectangle is arbitrary, but must be the same
for all bars.
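Bar Chart
[Bar chart: Frequency (vertical axis, 0 to 80) against Area (horizontal axis, categories 1 to 5); the bar heights are 73, 64, 52, 36 and 28, listed from tallest to shortest.]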

It’s all the same information,
(based on the same data).
Just different presentation.

2 Graphical Techniques for
Interval Data
Example 2.1 Display & describe information concerning
the monthly bills of new telephone subscribers

Cumulative Relative Frequencies:
first class…
next class: .355+.185=.540
:
:
last class: .930+.070=1.00
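The cumulative relative frequencies can be reproduced from the class frequencies given on the histogram slide (71, 37, 13, 9, 10, 18, 28, 14). A minimal sketch, with variable names chosen here for illustration:

```python
from itertools import accumulate

# Class upper limits and frequencies from the histogram slide.
upper_limits = [15, 30, 45, 60, 75, 90, 105, 120]
frequencies = [71, 37, 13, 9, 10, 18, 28, 14]

total = sum(frequencies)
relative = [f / total for f in frequencies]
# Accumulate the integer counts first, then divide, to avoid float drift.
cumulative = [c / total for c in accumulate(frequencies)]

for limit, rel, cum in zip(upper_limits, relative, cumulative):
    print(f"up to {limit:3d}: relative {rel:.3f}, cumulative {cum:.3f}")
```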

Drawing the histogram
Draw a Histogram
Bin | Frequency
15 | 71
30 | 37
45 | 13
60 | 9
75 | 10
90 | 18
105 | 28
120 | 14
[Histogram: Frequency (vertical axis, 0 to 80) against Bills (horizontal axis, bins 15 to 120).]

Interpreting the Histogram
What information is visible from this histogram?
• About half of all the bills are small: 71+37=108
• A few bills are in the middle range: 13+9+10=32
• A relatively large number of bills are large: 18+28+14=60
[Histogram: Frequency (vertical axis, 0 to 80) against Bills (horizontal axis, bins 15 to 120).]
Shapes of histograms

Shapes of histograms
Negatively (left) skewed
Positively (right) skewed

Modal classes
A modal class is the one with the largest number of observations.
A unimodal histogram
The modal class

Modal classes
A bimodal histogram : two peaks (don’t have to be same height)
A modal class A modal class

Bell shaped histograms
• Many statistical techniques require that the population be bell shaped.
• Drawing the histogram helps verify the shape of the population in question.

A Contingency Table for two
Nominal variables
A sample of newspaper readers was asked to report which newspaper they read:
Globe and Mail (1), Post (2), Star (3), or Sun (4), and to indicate whether they
were blue-collar worker (1), white-collar worker (2), or professional (3).
Note how this reader is cross-classified according to both variables.

Contingency Table
Interpretation: The relative frequencies in the columns 2 & 3 are similar, but
there are large differences between columns 1 and 2 and between columns 1 and 3.
This tells us that blue collar workers tend to read different newspapers from both
white collar workers and professionals and that white collar and professionals are
quite similar in their newspaper choice.

Graphing a contingency table
Professionals tend
to read the Post
more than twice as
often as the Star or
Sun…

The Relationship Between Two Interval
Variables
A scatter diagram plots one variable against another.
The independent variable is labeled X while the other, dependent variable, is labeled Y.
For example: A real estate agent wants to study the relationship between house price and house size.
X variable: Size
Y variable: Price
Size | Price
23 | 315
18 | 229
26 | 335
20 | 261
… | …
Scatter Diagram
It appears that in fact there is a relationship: the greater the house size, the greater the selling price.
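Without drawing the plot, the same pattern can be checked on the four (Size, Price) pairs listed in the example. This is only a sketch on those four points; the rest of the data set is elided on the slide, so it says nothing about the full sample.

```python
# The four (size, price) pairs visible on the slide; units are not stated there.
houses = [(23, 315), (18, 229), (26, 335), (20, 261)]

# Order the houses by size (the X variable) and read off the prices (Y).
by_size = sorted(houses)
prices_in_size_order = [price for _, price in by_size]

# A positive relationship shows up as prices rising along with size.
increasing = all(
    a < b for a, b in zip(prices_in_size_order, prices_in_size_order[1:])
)

print(by_size)
print("price rises with size:", increasing)
```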

Typical Patterns of Scatter Diagrams
Positive linear relationship No relationship Negative linear relationship
Negative nonlinear relationship Nonlinear (concave) relationship

This is a weak linear relationship.
A non linear relationship seems to
fit the data better.

Summary
 | Interval Data | Nominal Data
Single Set of Data | Histogram, Ogive, or Stem-and-Leaf Display | Frequency and Relative Frequency Tables, Bar and Pie Charts
Relationship Between Two Variables | Scatter Diagram | Contingency Table, Bar Charts
Numerical Description
1 Measures of Central Location
Mean, Median, Mode
The measure of central location reflects an “average” location for the data
points.
2 Measures of Variability
Range, Standard Deviation, Variance, Coefficient of Variation

1. Measures of location
The mean is the most popular measure of central location.
Mean = (Sum of the observations) / (Number of observations)

The Mean
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where n is the sample size.
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where N is the population size.

The Mean
• is appropriate for describing approximately symmetric measurement data, e.g. heights of people, student grades, etc.
• is seriously affected by extreme values called “outliers”. E.g. as soon as a billionaire moves into a neighborhood, the average household income increases beyond what it was previously!

The Mean
• Example 4.1
The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33, 14, 8, 0, 9, 22 hours. Find the mean time on the Internet.
$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{0 + 7 + \dots + 22}{10} = 11.0$
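Example 4.1 can be checked directly; the sketch below uses only the ten values given in the example.

```python
# Hours on the Internet reported by the 10 adults in Example 4.1.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

# Mean = sum of the observations / number of observations.
mean_hours = sum(hours) / len(hours)

print(mean_hours)  # 11.0
```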

The Mean of a Probability Distribution
MEAN
•The mean is a typical value used to represent the
central location of a probability distribution.
•The mean of a probability distribution is also
referred to as its expected value.

The Median
The Median of a set of observations is the value that falls in the middle when the observations are arranged in order of magnitude.
Example 4.3
Find the median of the time on the internet for the 10 adults of example 4.1.
Even number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22, 33. The median is 8.5, the average of the two middle values (8 and 9).
Comment
Suppose only 9 adults were sampled (exclude, say, the longest time (33)).
Odd number of observations: 0, 0, 5, 7, 8, 9, 12, 14, 22. The median is 8, the single middle value.
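Both cases of Example 4.3 can be reproduced with the standard library. A minimal sketch, using `statistics.median`, which averages the two middle values when the count is even:

```python
import statistics

# Example 4.1 data (10 observations, an even count).
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]
median_even = statistics.median(hours)

# Drop the longest time (33) to get 9 observations, an odd count.
hours_without_longest = sorted(hours)[:-1]
median_odd = statistics.median(hours_without_longest)

print(sorted(hours), "->", median_even)       # 8.5
print(hours_without_longest, "->", median_odd)  # 8
```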

The Mode
The Mode of a set of observations is the value that occurs most frequently.
A set of data may have one mode (or modal class), or two or more modes.
The modal class

The Mode
Example 4.5
Find the mode for the data in Example 4.1. Here are the data
again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Solution
All observations except “0” occur once. There are two “0”s. Thus, the mode is zero.
Is this a good measure of central location?
The value “0” does not reside at the center of this set (compare with the mean = 11.0 and the median = 8.5).
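The three measures of central location can be put side by side for the Example 4.1 data. This sketch uses `statistics.multimode` (Python 3.8 or later), which returns every value tied for the highest count.

```python
import statistics

hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

modes = statistics.multimode(hours)   # all most-frequent values
mean_hours = statistics.mean(hours)
median_hours = statistics.median(hours)

print("mode(s):", modes)       # [0]
print("mean:", mean_hours)     # 11
print("median:", median_hours)  # 8.5
```

Here the mode sits at the smallest value in the data, far from both the mean and the median.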

Relationship among Mean and
Median
• If a distribution is symmetrical, the mean and median coincide.
• If the distribution is skewed (right or left), the mean follows the tail (right or left).
[Figures: a positively skewed distribution (skewed to the right) and a negatively skewed distribution (skewed to the left), each with the mean and median marked.]
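The idea that the mean follows the tail can be sketched as a simple comparison of mean and median. The function name `tail_direction` is hypothetical, and this is a rough rule of thumb rather than a formal measure of skewness.

```python
import statistics


def tail_direction(data):
    """Return which way the mean is pulled relative to the median."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    if mean > median:
        return "right"  # mean pulled toward a right (positive) tail
    if mean < median:
        return "left"   # mean pulled toward a left (negative) tail
    return "none"       # mean and median coincide


hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]   # Example 4.1 data
mirrored = [33 - h for h in hours]           # same data, reflected

print(tail_direction(hours))     # right
print(tail_direction(mirrored))  # left
```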
Validity of mean, median and
mode
Mean: valid only for interval data.
Median: valid for ordinal and interval data.
Mode: valid for nominal, ordinal and interval data.

2 Measures of variability
Measures of central location fail to tell the whole story about the distribution.
A question of interest still remains unanswered:
How much are the observations spread out around the central (mean) value?
Note: All measures of variability are applicable for interval data only.

Variability of Data
Two sets of class grades are shown. The mean (=50) is the same in each case…
But, the red class has greater variability than the blue class.

The range
The range of a set of observations is the largest observation minus the smallest observation.
It is easy to compute.
However, it cannot provide any information on the dispersion of the data between these two extremes.
[Figure: the range spans from the smallest observation to the largest observation.]
The Variance
This measure reflects the dispersion of all the observations.
The variance of a sample of n observations x1, x2, …, xn with mean $\bar{x}$ is defined as
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

The Variance
Example 4.7
The following sample consists of the number of jobs six students applied for: 17, 15, 23, 7, 9, 13. Find its mean and variance.
Solution:
$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14 \text{ jobs}$
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{6 - 1}\left[(17 - 14)^2 + (15 - 14)^2 + \dots + (13 - 14)^2\right] = 33.2 \text{ jobs}^2$
The Variance – Shortcut method
$s^2 = \frac{1}{n - 1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$
$= \frac{1}{6 - 1}\left[17^2 + 15^2 + \dots + 13^2 - \frac{(17 + 15 + \dots + 13)^2}{6}\right] = 33.2 \text{ jobs}^2$

Standard Deviation
The standard deviation of a set of observations is the square root of the variance.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
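The variance definition, the shortcut method and the standard deviation can all be checked on the six job counts from Example 4.7. A minimal sketch; `statistics.variance` is included only as a cross-check, since it also divides by n − 1.

```python
import math
import statistics

jobs = [17, 15, 23, 7, 9, 13]  # Example 4.7 data
n = len(jobs)
mean_jobs = sum(jobs) / n

# Definition: squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - mean_jobs) ** 2 for x in jobs) / (n - 1)

# Shortcut: sum of squares minus (sum squared) / n, divided by n - 1.
s2_shortcut = (sum(x * x for x in jobs) - sum(jobs) ** 2 / n) / (n - 1)

# Library cross-check (sample variance, n - 1 divisor).
s2_library = statistics.variance(jobs)

# The standard deviation is the square root of the variance.
s = math.sqrt(s2_definition)

print(mean_jobs, s2_definition, s2_shortcut, s2_library)
print(f"standard deviation: {s:.2f} jobs")
```

All three variance calculations agree at 33.2 jobs², and the standard deviation is back in the original units (jobs).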

Standard Deviation
Example 4.8
To examine the consistency of shots for a new innovative golf club, a golfer was asked to hit 150 shots, 75 with a currently used (7-iron) club, and 75 with the new club.
The distances were recorded.
Which 7-iron is more consistent?

Standard Deviation
Example 4.8 – solution
Excel printout, from the “Descriptive Statistics” sub-menu.
Statistic | Current | Innovation
Mean | 150.5467 | 150.1467
Standard Error | 0.668815 | 0.357011
Median | 151 | 150
Mode | 150 | 149
Standard Deviation | 5.792104 | 3.091808
Sample Variance | 33.54847 | 9.559279
Kurtosis | 0.12674 | -0.88542
Skewness | -0.42989 | 0.177338
Range | 28 | 12
Minimum | 134 | 144
Maximum | 162 | 156
Sum | 11291 | 11261
Count | 75 | 75
The innovation club is more consistent, so is more predictable.
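The comparison can be retraced from the printout's own numbers. This sketch starts from the reported sample variances and counts, recovers the standard deviation and standard error (standard deviation divided by the square root of the count), and picks the club with the smaller spread. The dictionary layout is an illustration, not part of the example.

```python
import math

# Sample variance and count as reported in the Excel printout.
summary = {
    "Current": {"variance": 33.54847, "count": 75},
    "Innovation": {"variance": 9.559279, "count": 75},
}

for stats in summary.values():
    stats["sd"] = math.sqrt(stats["variance"])            # standard deviation
    stats["se"] = stats["sd"] / math.sqrt(stats["count"])  # standard error

# The more consistent club is the one with the smaller standard deviation.
most_consistent = min(summary, key=lambda club: summary[club]["sd"])

for club, stats in summary.items():
    print(f"{club:10s} sd={stats['sd']:.4f} se={stats['se']:.4f}")
print("more consistent:", most_consistent)
```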

The Variance and Standard
Deviation of a Probability Distribution
VARIANCE AND STANDARD DEVIATION

• Measures the amount of spread in a distribution
• The computational steps are:
1. Subtract the mean from each value, and square this
difference.
2. Multiply each squared difference by its probability.
3. Sum the resulting products to arrive at the variance.
The standard deviation is found by taking the positive square root of the variance.
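The computational steps above can be sketched as follows. The probability table for the Pelican Ford example that follows did not survive in this copy, so the distribution below is an assumption, chosen only because it reproduces the variance 1.290 and standard deviation 1.136 shown in that example.

```python
import math

# ASSUMED distribution (number of cars sold -> probability); not taken
# from the slides, only consistent with their reported variance.
distribution = {0: 0.10, 1: 0.20, 2: 0.30, 3: 0.30, 4: 0.10}

# Mean (expected value): each value weighted by its probability.
mean = sum(x * p for x, p in distribution.items())

# Steps 1-3: squared difference from the mean, times probability, summed.
variance = sum((x - mean) ** 2 * p for x, p in distribution.items())

# Standard deviation: positive square root of the variance.
sd = math.sqrt(variance)

print(f"mean={mean:.3f} variance={variance:.3f} sd={sd:.3f}")
```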

Mean, Variance, and Standard
Deviation of a Probability Distribution - Example
John Ragsdale sells new cars for Pelican Ford. John usually sells the largest number of cars on Saturday. He has developed the following probability distribution for the number of cars he expects to sell on a particular Saturday.

Mean of a Probability Distribution - Example

Variance and Standard
Deviation of a Probability Distribution - Example
$\sigma = \sqrt{\sigma^2} = \sqrt{1.290} = 1.136$
The Coefficient of
Variation
The coefficient of variation of a set of measurements is the standard deviation divided by the mean value.
Sample coefficient of variation: $cv = \frac{s}{\bar{x}}$
Population coefficient of variation: $CV = \frac{\sigma}{\mu}$
This coefficient provides a proportionate measure of variation.
A standard deviation of 10 may be perceived as large when the mean value is 100, but only moderately large when the mean value is 500.
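The comparison in the last sentence can be made concrete. A minimal sketch; the function name is hypothetical, and the guard against a zero mean is an added precaution rather than something stated on the slide.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a proportion of the mean."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for mean 0")
    return sd / mean


# The same standard deviation of 10 against two different means.
cv_mean_100 = coefficient_of_variation(10, 100)
cv_mean_500 = coefficient_of_variation(10, 500)

print(f"mean 100: cv = {cv_mean_100:.2f}")  # 0.10
print(f"mean 500: cv = {cv_mean_500:.2f}")  # 0.02
```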
Box Plot
This is a graph that shows five descriptive measures of the data:
L - The largest observation
Q3 - The upper quartile
Q2 - The median
Q1 - The lower quartile
S - The smallest observation
[Figure: a box from Q1 to Q3 with the median Q2 marked inside; whiskers extend from the box toward S and L, each at most 1.5(Q3 – Q1) long.]

Box Plot: Telephone Bill Amounts
Example 4.14
Bills: 42.19, 38.45, 29.23, 89.35, 118.04, 110.46, …
Smallest = 0
Q1 = 9.275
Median = 26.905
Q3 = 84.9425
Largest = 119.63
IQR = 75.6675
Outliers = ()
Left hand boundary = 9.275 – 1.5(IQR) = -104.226
Right hand boundary = 84.9425 + 1.5(IQR) = 198.4438
No outliers are found.
[Box plot axis marks: -104.226, 0, 9.275, 26.905, 84.9425, 119.63, 198.4438]
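The two boundaries in Example 4.14 follow directly from the quartiles. A sketch using the quartiles reported in the example; the helper name `fences` is hypothetical.

```python
def fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier boundaries k * IQR beyond the box."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and extremes as reported for the telephone bills.
q1, q3 = 9.275, 84.9425
smallest, largest = 0, 119.63

lower, upper = fences(q1, q3)
has_outliers = smallest < lower or largest > upper

print(f"IQR = {q3 - q1:.4f}")
print(f"boundaries: {lower:.3f} to {upper:.4f}")
print("outliers present:", has_outliers)
```

Because both the smallest and the largest bill fall inside the boundaries, no observation is flagged.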
Box Plot: GMAT Scores
Additional Example - GMAT scores
Create a box plot for the data regarding the GMAT scores of 200 applicants.
GMAT: 512, 531, 461, 515, …
Smallest = 449
Q1 = 512
Median = 537
Q3 = 575
Largest = 788
IQR = 63
Outliers = (788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675, )
Left hand boundary = 512 – 1.5(IQR) = 417.5
Right hand boundary = 575 + 1.5(IQR) = 669.5
[Box plot axis marks: 417.5, 449, 512, 537, 575, 669.5, 788]
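For the GMAT example, the listed outliers can be checked against the boundaries. This sketch uses only the summary values and the outlier list printed on the slide, since the full set of 200 scores is not reproduced there.

```python
# Summary values reported for the GMAT scores.
q1, q3 = 512, 575
smallest = 449
listed_outliers = [788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675]

iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Every listed outlier should lie beyond the upper boundary.
all_above_upper = all(score > upper for score in listed_outliers)
# The smallest score should still be inside the lower boundary.
low_outlier = smallest < lower

print(f"IQR = {iqr}, boundaries: {lower} to {upper}")
print("listed outliers all above upper boundary:", all_above_upper)
print("any outlier on the low side:", low_outlier)
```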
