
1 - Exploratory Data Analysis:
Graphs, Tables and Summary Measures

First:
What is Statistics?

Copyright © 2005 Brooks/Cole, a division of Thomson Learning, Inc. 1.1


What is Statistics?

“Statistics is a way to get information from data.”


-Gerald Keller
(textbook author)



What is Statistics?

“Statistics is a way to get information from data”


Data -> Statistics -> Information

Data: Facts, especially numerical facts, collected together for reference or information.

Information: Knowledge communicated concerning some particular fact.

Statistics is a tool for creating new understanding from a set of numbers.

Definitions: Oxford English Dictionary


Key Statistical Concepts…
Population
— a population is the group of all items of interest to
a statistics practitioner.
— frequently very large; sometimes infinite.
E.g. All 30 million (?) voters in Egypt

Sample
— A sample is a set of data drawn from the
population.
— Potentially very large, but less than the population.
E.g. a sample of 1,000 voters who took part in a poll.
Key Statistical Concepts…
Parameter
— A descriptive measure of a population.

Statistic
— A descriptive measure of a sample.



Key Statistical Concepts…
[Figure: a Sample is a subset of the Population. The Population has a Parameter; the Sample has a Statistic.]

Populations have Parameters,
Samples have Statistics.


Descriptive Statistics…
…are methods of organizing, summarizing, and presenting
data in a convenient and informative way. These methods
include:
Graphical Techniques, and Numerical Techniques .
The actual method used depends on what information we
would like to extract. Are we interested in…
• measure(s) of central location? and/or
• measure(s) of variability (dispersion)?

Descriptive Statistics helps to answer these questions…



Inferential Statistics…
Descriptive statistics describes the data set that is being
analyzed, but it does not allow us to draw any conclusions or
make any inferences from the data. Hence we need
another branch of statistics: inferential statistics.

Inferential statistics is also a set of methods, but it is used to
draw conclusions or inferences about characteristics of
populations based on data from a sample.


Statistical Inference…
Statistical inference is the process of making an estimate,
prediction, or decision about a population based on a sample.
[Figure: a Sample is drawn from the Population; Inference runs from the sample Statistic to the population Parameter.]

What can we infer about a Population’s Parameters
based on a Sample’s Statistics?
Statistical Inference…
We use statistics to make inferences about parameters.

Therefore, we can make an estimate, prediction, or decision about a
population based on sample data.

• Large populations make investigating each member impractical and
expensive.
• It is easier and cheaper to take a sample and make estimates about the
population from the sample.

However:
Such conclusions and estimates are not always going to be correct.
We need to know when a sample is less likely to be a good sample
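
The trade-off above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population of 100,000 voters and its 55% support level are invented for illustration and do not come from the slides. It places a parameter (computed from every member) next to a statistic (computed from a sample of 1,000), which will usually be close to the parameter but not equal to it.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1 = supports a candidate, 0 = does not.
# The size and the 55% support level are assumptions made for illustration.
population = [1] * 55_000 + [0] * 45_000
parameter = sum(population) / len(population)   # population proportion

# Draw a simple random sample of 1,000 voters and compute the statistic.
sample = random.sample(population, 1_000)
statistic = sum(sample) / len(sample)           # sample proportion

print(f"parameter = {parameter:.3f}, statistic = {statistic:.3f}")
```

Changing the seed gives a different sample and a slightly different statistic, which is why conclusions drawn from a sample are not always correct.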



Statistical Inference
The process of making guesses
about the truth from a sample.

Sample statistics (from a sample of n observations):

$$\hat{\mu} = \bar{X}_n = \frac{\sum_{i=1}^{n} x_i}{n}$$

$$\hat{\sigma}^2 = s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{X}_n)^2}{n-1}$$

*Hat notation (^) is often used to indicate “estimate”.

Population parameters (the truth, not observable; population of N items):

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

[Figure: the sample (observation) is used to make guesses about the whole population.]
Statistic (from sample)          Parameter (from entire population)

Mean: X̄                          estimates ____
Standard deviation: s            estimates ____
Proportion: p                    estimates ____
[Figure: point estimate versus interval estimate]
Population: the mean, μ, is unknown.
Point estimate (from the sample): mean X̄ = 50.
Interval estimate: I am 95% confident that μ is between 40 and 60.


Graphical Techniques for
Nominal data

The only allowable calculation on nominal data is to
count the frequency of each value of a variable.
When the raw data can be naturally categorized in a
meaningful manner, we can display frequencies by:
• Bar charts – emphasize the frequency of occurrence of the
different categories.
• Pie charts – emphasize the proportion of occurrences of
each category.


Graphical & Tabular Techniques for
Nominal Data

First we need to summarize the data in a table, called a frequency
distribution, that presents the categories and their counts.

A relative frequency distribution lists the categories and the
proportion with which each occurs.


Nominal Data (Tabular Summary)



The Pie Chart

The pie chart is a circle, subdivided into a number of
slices that represent the various categories.
The size of each slice is proportional to the
percentage corresponding to the category it
represents.


The Pie Chart

[Figure: pie chart with the following slices]

Accounting            28.9%
Finance               20.6%
General management    14.2%
Marketing             25.3%
Other                 11.1%

Slice angle for Accounting: (28.9/100)(360°) = 104°
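
As a rough sketch of how the table and the pie chart relate, the snippet below turns category counts into relative frequencies and slice angles. The counts are the bar heights that appear on the bar-chart slide (73, 64, 52, 36, 28); pairing each count with a category label is an assumption, made by matching the counts to the pie-chart percentages.

```python
# Category counts. The count-to-label pairing is an assumption inferred by
# matching each count's share of the total to the pie-chart percentages.
counts = {
    "Accounting": 73,
    "Finance": 52,
    "General management": 36,
    "Marketing": 64,
    "Other": 28,
}
total = sum(counts.values())

for category, count in counts.items():
    rel = count / total      # relative frequency
    angle = rel * 360        # slice angle in degrees
    print(f"{category:20s} {count:3d} {rel:6.1%} {angle:6.1f} degrees")
```

The Accounting slice comes out at about 104 degrees, in line with the (28.9/100)(360°) calculation.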


The Bar Chart

Rectangles represent each category.
The height of the rectangle represents the frequency.
The width of the rectangle is arbitrary, but must be the same
for all bars.

[Figure: bar chart. Horizontal axis: Area (categories 1 to 5); vertical axis: Frequency (0 to 80). Bar heights shown: 73, 64, 52, 36, 28.]


It’s all the same information,
(based on the same data).
Just different presentation.



2 Graphical Techniques for
Interval Data
Example 2.1 Display & describe information concerning
the monthly bills of new telephone subscribers



Cumulative Relative Frequencies:

first class: .355
next class: .355 + .185 = .540
⋮
last class: .930 + .070 = 1.00
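
The running totals above can be reproduced from the class frequencies of the telephone-bill histogram (71, 37, 13, 9, 10, 18, 28, 14). The following is a minimal sketch; the variable names are illustrative, and the counts are accumulated before dividing so that rounding does not creep into the running total.

```python
from itertools import accumulate

upper_limits = [15, 30, 45, 60, 75, 90, 105, 120]   # class upper limits (bins)
frequencies = [71, 37, 13, 9, 10, 18, 28, 14]       # bills in each class
n = sum(frequencies)                                # total number of bills

# Accumulate counts first, then convert to proportions.
cumulative_counts = list(accumulate(frequencies))

for limit, f, c in zip(upper_limits, frequencies, cumulative_counts):
    print(f"<= {limit:3d}   relative = {f / n:.3f}   cumulative = {c / n:.3f}")
```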


Drawing the histogram

Bin     Frequency
15      71
30      37
45      13
60      9
75      10
90      18
105     28
120     14

[Figure: histogram of Bills (horizontal axis, bins 15 to 120) against Frequency (vertical axis, 0 to 80).]


Interpreting the Histogram

What information is visible from this histogram?


• About half of all the bills are small: 71 + 37 = 108.
• A few bills are in the middle range: 13 + 9 + 10 = 32.
• There is a relatively large number of large bills: 18 + 28 + 14 = 60.

[Figure: histogram of Bills against Frequency, annotated with the three groups above.]
Shapes of histograms



Shapes of histograms

Negatively (left) skewed

Positively (right) skewed



Modal classes

A modal class is the one with the largest number
of observations.

[Figure: a unimodal histogram with the modal class marked.]
Modal classes

A bimodal histogram has two peaks (they don’t have to be the same height).

[Figure: a bimodal histogram with the two modal classes marked.]
Bell shaped histograms

• Many statistical techniques require that the
population be bell shaped.
• Drawing the histogram helps verify the shape of
the population in question.


A Contingency Table for two
Nominal variables
A sample of newspaper readers was asked to report which newspaper they read:
Globe and Mail (1), Post (2), Star (3), or Sun (4), and to indicate whether they
were a blue-collar worker (1), white-collar worker (2), or professional (3).

Note how each reader is cross-classified according to both
variables.


Contingency Table
Interpretation: The relative frequencies in columns 2 and 3 are similar, but
there are large differences between columns 1 and 2 and between columns 1 and 3.
This tells us that blue-collar workers tend to read different newspapers from both
white-collar workers and professionals, and that white-collar workers and
professionals are quite similar in their newspaper choice.


Graphing a contingency table

Professionals tend to read the Post more than twice as
often as the Star or Sun…


The Relationship Between Two Interval
Variables

A scatter diagram plots one variable against another.
The independent variable is labeled X while the
other, dependent variable, is labeled Y.

For example: A real estate agent wants to
study the relationship between house
price and house size.

X variable: Size
Y variable: Price

Size    Price
23      315
18      229
26      335
20      261
…       …
Scatter Diagram

It appears that in fact there is a relationship: the greater the
house size, the greater the selling price.


Typical Patterns of Scatter Diagrams

[Figure panels: Positive linear relationship; No relationship; Negative linear relationship; Negative nonlinear relationship; Nonlinear (concave) relationship]

This is a weak linear relationship.
A nonlinear relationship seems to
fit the data better.


Summary

                          Interval Data              Nominal Data

Single Set of Data        Histogram, Ogive, or       Frequency and Relative Frequency
                          Stem-and-Leaf Display      Tables, Bar and Pie Charts

Relationship Between      Scatter Diagram            Contingency Table, Bar Charts
Two Variables
Numerical Description
1 Measures of Central Location
Mean, Median, Mode

The measure of central location reflects an “average” location for the data
points.

2 Measures of Variability
Range, Standard Deviation, Variance, Coefficient of Variation



1. Measures of location

The mean is the most popular measure of central location.

Mean = (Sum of the observations) / (Number of observations)


The Mean

Sample mean (n = sample size):

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Population mean (N = population size):

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$


The Mean
• It is appropriate for describing approximately symmetric
measurement data, e.g. heights of people, student grades, etc.

• It is seriously affected by extreme values called “outliers”. E.g.
as soon as a billionaire moves into a neighborhood, the
average household income increases beyond what it was
previously!


The Mean

• Example 4.1
The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33,
14, 8, 0, 9, 22 hours. Find the mean time on the Internet.

$$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{0 + 7 + \dots + 22}{10} = \frac{110}{10} = 11.0 \text{ hours}$$
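
As a quick check of Example 4.1, the mean can be computed directly from its definition. This is a minimal sketch; the second calculation, which drops the largest value (33), is added only to illustrate how a single extreme value pulls the mean.

```python
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]   # data from Example 4.1

# Mean = sum of the observations / number of observations
mean_hours = sum(hours) / len(hours)

# Illustration only: the same calculation without the largest observation.
trimmed = sorted(hours)[:-1]
mean_without_max = sum(trimmed) / len(trimmed)

print(mean_hours, round(mean_without_max, 2))
```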


The Mean of a Probability Distribution

MEAN
•The mean is a typical value used to represent the
central location of a probability distribution.
•The mean of a probability distribution is also
referred to as its expected value.



The Median

The median of a set of observations is the value that
falls in the middle when the observations are
arranged in order of magnitude.

Example 4.3
Find the median of the time on the Internet for the 10 adults of Example 4.1.
Even number of observations:
0, 0, 5, 7, 8, 9, 12, 14, 22, 33
The median is the average of the two middle values: (8 + 9)/2 = 8.5.

Comment
Suppose only 9 adults were sampled (exclude, say, the longest time (33)).
Odd number of observations:
0, 0, 5, 7, 8, 9, 12, 14, 22
The median is the middle value: 8.
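
Both cases of Example 4.3 can be checked with Python's standard `statistics` module, which averages the two middle values when the number of observations is even. This is a small sketch using the Example 4.1 data.

```python
import statistics

hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]   # data from Example 4.1

# Even number of observations: the two middle values are averaged.
median_10 = statistics.median(hours)

# Odd number of observations: drop the longest time (33), as in the Comment.
hours_9 = sorted(hours)[:-1]
median_9 = statistics.median(hours_9)

print(median_10, median_9)
```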
The Mode
The mode of a set of observations is the value that
occurs most frequently.
A set of data may have one mode (or modal class), or two
or more modes.

[Figure: histogram with the modal class marked.]


The Mode
Example 4.5
Find the mode for the data in Example 4.1. Here are the data
again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22

Solution

All observations except “0” occur once. There are two “0”s. Thus, the mode
is zero.
Is this a good measure of central location?
The value “0” does not reside at the center of this set
(compare with the mean = 11.0 and the median = 8.5).
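
The mode in Example 4.5 can be found with `statistics.multimode` (Python 3.8 or later), which returns every value tied for the highest count. The second data set below is hypothetical, invented only to show a case with two modes.

```python
import statistics

hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]   # data from Example 4.1

# multimode returns a list of the most frequent value(s).
modes = statistics.multimode(hours)

# Hypothetical data set with two modes, for illustration only.
two_modes = statistics.multimode([1, 1, 2, 2, 3])

print(modes, two_modes)
```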


Relationship among Mean and
Median
• If a distribution is symmetrical, the mean and
median coincide.

• If the distribution is skewed (right or left),
the mean follows the tail (right or left).

[Figure: a positively skewed distribution (skewed to the right), with the mean to the right of the median, and a negatively skewed distribution (skewed to the left), with the mean to the left of the median.]
Validity of mean, median and
mode

Mean: valid only for interval data.

Median: valid for ordinal and interval data.

Mode: valid for nominal, ordinal, and interval data (it is the only one of the three that is valid for nominal data).



2 Measures of variability

Measures of central location fail to tell the whole story
about the distribution.
A question of interest still remains unanswered:

How much are the observations spread out
around the central (mean) value?

Note: All measures of variability are applicable for
interval data only.


Variability of Data

Two sets of class grades are
shown. The mean (= 50) is the
same in each case…

But the red class has greater
variability than the blue class.

[Figure: grade distributions for the red class and the blue class.]


The range

The range of a set of observations is the largest
observation – smallest observation.
It is easy to compute.
However, it cannot provide any information on the
dispersion of the data between these two extremes.

[Figure: the range spans from the smallest observation to the largest observation.]
The Variance

This measure reflects the dispersion of all the
observations.

The variance of a sample of n observations
$x_1, x_2, \dots, x_n$ with mean $\bar{x}$ is defined as

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$


The Variance

Example 4.7
The following sample consists of the number of jobs six
students applied for: 17, 15, 23, 7, 9, 13. Find its mean
and variance.
Solution:

$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14 \text{ jobs}$$

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{1}{6-1}\left[(17-14)^2 + (15-14)^2 + \dots + (13-14)^2\right] = 33.2 \text{ jobs}^2$$
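
The arithmetic in Example 4.7 can be followed step by step in a few lines. This sketch computes the squared deviations explicitly and divides by n - 1, then cross-checks against `statistics.variance`, which uses the same n - 1 divisor.

```python
import statistics

jobs = [17, 15, 23, 7, 9, 13]   # data from Example 4.7
n = len(jobs)

mean_jobs = sum(jobs) / n                                # sample mean
squared_deviations = [(x - mean_jobs) ** 2 for x in jobs]
sample_variance = sum(squared_deviations) / (n - 1)      # divide by n - 1

# Cross-check with the standard library (sample variance, n - 1 divisor).
library_variance = statistics.variance(jobs)

print(mean_jobs, sample_variance, library_variance)
```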
The Variance – Shortcut method

$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$

$$= \frac{1}{6-1}\left[\left(17^2 + 15^2 + \dots + 13^2\right) - \frac{(17 + 15 + \dots + 13)^2}{6}\right] = 33.2 \text{ jobs}^2$$
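
A short sketch of the shortcut method on the same six observations: only the sum of the values and the sum of their squares are needed, so the mean never has to be computed first. The variable names are illustrative.

```python
jobs = [17, 15, 23, 7, 9, 13]   # data from Example 4.7
n = len(jobs)

sum_x = sum(jobs)                    # sum of the observations
sum_x2 = sum(x * x for x in jobs)    # sum of the squared observations

# Shortcut formula: s^2 = [sum(x^2) - (sum x)^2 / n] / (n - 1)
shortcut_variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(sum_x, sum_x2, shortcut_variance)
```

Both routes give the same 33.2 jobs², since the shortcut is an algebraic rearrangement of the definition.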


Standard Deviation

The standard deviation of a set of observations is the
square root of the variance.

Sample standard deviation: $s = \sqrt{s^2}$

Population standard deviation: $\sigma = \sqrt{\sigma^2}$


Standard Deviation
Example 4.8
To examine the consistency of shots for a new innovative golf club,
a golfer was asked to hit 150 shots, 75 with a currently used (7-
iron) club, and 75 with the new club.
The distances were recorded.
Which 7-iron is more consistent?



Standard Deviation
Example 4.8 – solution

Excel printout, from the “Descriptive Statistics” sub-menu:

                      Current      Innovation
Mean                  150.5467     150.1467
Standard Error        0.668815     0.357011
Median                151          150
Mode                  150          149
Standard Deviation    5.792104     3.091808
Sample Variance       33.54847     9.559279
Kurtosis              0.12674      -0.88542
Skewness              -0.42989     0.177338
Range                 28           12
Minimum               134          144
Maximum               162          156
Sum                   11291        11261
Count                 75           75

The innovation club is more consistent, and so is more predictable.


The Variance and Standard
Deviation of a Probability Distribution

VARIANCE AND STANDARD DEVIATION


• Measures the amount of spread in a distribution
• The computational steps are:
1. Subtract the mean from each value, and square this
difference.
2. Multiply each squared difference by its probability.
3. Sum the resulting products to arrive at the variance.

The standard deviation is found by taking the positive
square root of the variance.


Mean, Variance, and Standard
Deviation of a Probability Distribution - Example

John Ragsdale sells new cars for Pelican Ford.
John usually sells the largest number of cars on
Saturday. He has developed the following
probability distribution for the number of cars he
expects to sell on a particular Saturday.


Mean of a Probability Distribution - Example



Variance and Standard
Deviation of a Probability Distribution - Example

$$\sigma = \sqrt{\sigma^2} = \sqrt{1.290} = 1.136$$
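
The three computational steps for the variance of a probability distribution can be sketched as follows. The distribution table for this example did not survive in this text, so the probabilities below are an assumption: an illustrative distribution over 0 to 4 cars chosen so that the variance matches the 1.290 quoted above. Treat the inputs as hypothetical, not as the original table.

```python
import math

# ASSUMED distribution (the original table is missing from this text):
# number of cars sold -> probability. Chosen to reproduce variance 1.290.
distribution = {0: 0.10, 1: 0.20, 2: 0.30, 3: 0.30, 4: 0.10}
assert math.isclose(sum(distribution.values()), 1.0)

# Mean (expected value): sum of value * probability.
mean = sum(x * p for x, p in distribution.items())

# Steps 1-3: squared deviation from the mean, weighted by probability, summed.
variance = sum((x - mean) ** 2 * p for x, p in distribution.items())

# Standard deviation: positive square root of the variance.
std_dev = math.sqrt(variance)

print(round(mean, 3), round(variance, 3), round(std_dev, 3))
```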
The Coefficient of Variation
The coefficient of variation of a set of measurements is
the standard deviation divided by the mean value.

Sample coefficient of variation: $cv = \frac{s}{\bar{x}}$

Population coefficient of variation: $CV = \frac{\sigma}{\mu}$

This coefficient provides a proportionate measure of
variation.
A standard deviation of 10 may be perceived as
large when the mean value is 100, but only
moderately large when the mean value is 500.
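
The comparison above takes one line per case. This is a minimal sketch; the helper name `coefficient_of_variation` is illustrative and not from the slides.

```python
def coefficient_of_variation(std_dev, mean):
    """Return the standard deviation as a proportion of the mean."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return std_dev / mean

# Same standard deviation, different means (the case described above).
cv_mean_100 = coefficient_of_variation(10, 100)
cv_mean_500 = coefficient_of_variation(10, 500)

print(cv_mean_100, cv_mean_500)
```

The same spread of 10 is 10% of a mean of 100 but only 2% of a mean of 500.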
Box Plot

This is a graph that shows five descriptive measures of the
data:
L - the largest observation
Q3 - the upper quartile
Q2 - the median
Q1 - the lower quartile
S - the smallest observation

[Figure: box plot marking S, Q1, Q2, Q3 and L. The box spans Q1 to Q3; each whisker extends at most 1.5(Q3 – Q1) beyond the box.]


Box Plot: Telephone Bill Amounts

Example 4.14

Bills: 42.19, 38.45, 29.23, 89.35, 118.04, 110.46, …

Smallest = 0
Q1 = 9.275
Median = 26.905
Q3 = 84.9425
Largest = 119.63
IQR = 75.6675
Outliers = ()

Left hand boundary = 9.275 – 1.5(IQR) = -104.226
Right hand boundary = 84.9425 + 1.5(IQR) = 198.4438

[Figure: box plot on an axis marked -104.226, 0, 9.275, 26.905, 84.9425, 119.63, 198.4438.]

No outliers are found.
Box Plot: GMAT Scores

Additional Example - GMAT scores

Create a box plot for the data regarding the GMAT
scores of 200 applicants.

GMAT: 512, 531, 461, 515, …

Smallest = 449
Q1 = 512
Median = 537
Q3 = 575
Largest = 788
IQR = 63
Outliers = (788, 788, 766, 763, 756, 719, 712, 707, 703, 694, 690, 675, )

Left hand boundary = 512 – 1.5(IQR) = 417.5
Right hand boundary = 575 + 1.5(IQR) = 669.5

[Figure: box plot on an axis marked 417.5, 449, 512, 537, 575, 669.5, 788.]
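
The boundary calculations in both box-plot examples follow the same rule, which can be sketched as a small helper. The function names are illustrative, and only the summary values printed on the slides are used because the full data sets are not reproduced here.

```python
def fences(q1, q3, k=1.5):
    """Return the (lower, upper) boundaries Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(values, q1, q3):
    """Return the values that fall outside the two boundaries."""
    low, high = fences(q1, q3)
    return [v for v in values if v < low or v > high]

# Telephone bills (Example 4.14): Q1 = 9.275, Q3 = 84.9425
bill_low, bill_high = fences(9.275, 84.9425)

# GMAT scores: Q1 = 512, Q3 = 575
gmat_low, gmat_high = fences(512, 575)

# Check only the extremes quoted on the slides, not the full data sets.
bill_outliers = find_outliers([0, 119.63], 9.275, 84.9425)
gmat_outliers = find_outliers([449, 675, 788], 512, 575)

print(bill_low, bill_high, gmat_low, gmat_high)
print(bill_outliers, gmat_outliers)
```

With these inputs the telephone-bill extremes fall inside the boundaries, while the GMAT values 675 and 788 fall above the upper boundary, in line with the outlier lists shown.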
