Slide
Descriptive Statistics
Data and Statistics
Applications in Business and Economics
Data
Data Sources
Descriptive Statistics
Statistical Inference
2
Slide
Applications in Business and Economics
Statistics is the process of collecting, organizing,
analyzing, and interpreting data in order to make
decisions.
Accounting
Public accounting firms use statistical sampling
procedures when conducting audits for their clients.
Finance
Financial analysts use a variety of statistical
information, including price-earnings ratios and
dividend yields, to guide their investment
recommendations.
Marketing
Point-of-sale scanners at retail checkout counters are
being used to collect data for a variety of marketing
research applications.
3
Slide
Production
A variety of statistical quality control charts are used
to monitor the output of a production process.
Economics
Economists use statistical information in making
forecasts about the future of the economy or some
aspect of it.
Applications in
Business and Economics
4
Slide
Why Study Statistics?
1. Numerical information is everywhere
2. Statistical techniques are used to make decisions that
affect our daily lives
3. The knowledge of statistical methods will help you to
understand how decisions are made and give you a better
understanding of how they affect you.
4. No matter what line of work you select, you will find
yourself faced with decisions where an understanding of
data analysis is helpful.
Some examples of the need for data collection.
1. Research analysts evaluate many facets of a particular
stock before making a “buy” or “sell” recommendation.
2. Managers, for example in a marketing department, must make
decisions about the quality of their product or service.
5
Slide
What is Meant by Statistics?
In the more common usage, statistics refers to
numerical information.
Examples: the average starting salary of college
graduates, the number of deaths due to alcoholism last
year etc.
We often present statistical information in a
graphical form for capturing reader attention.
6
Slide
Types of Statistics
Descriptive Statistics: methods of organizing,
summarizing, and presenting data in an informative
way.
Inferential Statistics: methods used to make a decision,
estimate, prediction, or generalization about a
population, based on a sample.
7
Slide
Population versus Sample
A population is a collection of all possible individuals,
objects, or measurements of interest.
A sample is a portion, or part, of the population of
interest
8
Slide
Data
Elements, Variables, and Observations
Scales of Measurement
Qualitative and Quantitative Data
Cross-Sectional and Time Series Data
9
Slide
Data and Data Sets
Data are the facts and figures that are collected,
summarized, analyzed, and interpreted. E.g.,
•IBM’s sales revenue is $100 bn.; stock price $80.
The data collected in a particular study are referred to
as the data set. E.g.,
•The sales revenue and stock price data for a
number of firms including IBM, Dell, Apple, etc.
10
Slide
Elements, Variables, and Observations
The elements are the entities on which data are
collected. E.g.,
•IBM, Dell, Apple, etc. in the previous setting.
A variable is a characteristic of interest for the
elements. E.g.,
•Sales revenue, stock price (of a company)
The set of measurements collected for a particular
element is called an observation.
•Sales revenue, stock price for 2003
11
Slide
Scales of Measurement
Scales of measurement include:
•Nominal
•Ordinal
•Interval
•Ratio
The scale determines the amount of information
contained in the data.
The scale indicates the data summarization and
statistical analyses that are most appropriate.
12
Slide
Scales of Measurement
Nominal
•Data that are classified into categories and cannot be
arranged in any particular order. A numeric code may
be used. The nominal scale categorizes individuals
or groups, and responses on this scale are typically
summarized as the percentage in each category,
e.g., male/female or Pakistani/American.
Example:
Students of a university are classified by the school
in which they are enrolled using a nonnumeric
label such as Business, Humanities, Education, and
so on.
Alternatively, a numeric code could be used for the
school variable (e.g. 1 denotes Business, 2 denotes
Humanities, 3 denotes Education, and so on).
13
Slide
Scales of Measurement
Ordinal
•Similar to the nominal level, with the additional property
that the order or rank of the data is meaningful; the size
of the differences between data values cannot be
determined. It categorizes and ranks the variables
according to preference, e.g., from best to worst or first
to last. A numeric code may be used, e.g., to rank job
characteristics.
•Example:
Students of a university are classified by their class
standing using a nonnumeric label such as
Freshman, Junior, Senior.
Alternatively, a numeric code could be used for the
class standing variable (e.g. 1 denotes Freshman, 2
denotes Junior, and so on).
14
Slide
Scales of Measurement
Interval
•The data have the properties of ordinal data, and the
interval between observations is expressed in
terms of a fixed unit of measure. A typical use is
rating preferences on a 5- or 7-point scale, which
also measures the magnitude of the differences in
preferences among individuals. Interval data are
always numeric.
•Example:
Responses coded 1 to 5 for strongly disagree, disagree,
neither agree nor disagree, agree, and strongly agree.
15
Slide
Scales of Measurement
Ratio
•The data have all the properties of interval data and
the ratio of two values is meaningful. This scale
must contain a zero value that indicates that
nothing exists for the variable at the zero point.
•Example:
Variables such as distance, height, weight, and
time use the ratio scale.
16
Slide
Scales of Measurement
Ratio scale: used when exact numbers are called for,
e.g., how many orders do you operate?
Interval scale: used for responses to various items
on 5- or 7-point scales; as with the ratio scale, suitable
statistical measures include the arithmetic mean and
standard deviation.
Ordinal scale: used for ranking preferences; suitable
statistical measures are the median, range, and
rank-order correlations.
Nominal scale: used for personal data.
17
Slide
Types of Variables
A. Qualitative variable: the characteristic
being studied is nonnumeric.
EXAMPLES: gender, religious affiliation, type of
automobile owned, eye color.
Qualitative variables use either the nominal or ordinal
scale of measurement.
B. Quantitative variable: information is
reported numerically.
EXAMPLES: balance in your account, minutes
remaining in class, or number of children in a family.
18
Slide
Quantitative Data
Quantitative data indicate either how many or how
much.
•Quantitative data that measure how many are
discrete.
•Quantitative data that measure how much are
continuous.
Quantitative data are always numeric.
Arithmetic operations (e.g., +, -) are meaningful only
with quantitative data.
19
Slide
Summary of Types of Variables
20
Slide
Cross-Sectional and Time Series Data
Cross-sectional data are collected at the same or
approximately the same point in time.
•Example: data detailing the number of building
permits issued in June 2000
Time series data are collected over several time
periods.
•Example: data detailing the number of building
permits issued in Texas in each of the last 36 months
21
Slide
Data Sources
Existing Sources
•Data needed for a particular application might
already exist within a firm. Detailed information
is often kept on customers, suppliers, and
employees.
•Substantial amounts of business and economic
data are available from organizations that
specialize in collecting and maintaining data.
•Government agencies are another important
source of data.
•Data are also available from a variety of industry
associations and special-interest organizations.
22
Slide
Data Sources
Internet
•The Internet has become an important source of
data.
•Most government agencies, like the Bureau of the
Census (www.census.gov), make their data
available through a web site.
•More and more companies are creating web sites
and providing public access to them.
•A number of companies now specialize in making
information available over the Internet.
23
Slide
Data Acquisition Considerations
Time Requirement
•Searching for information can be time consuming.
•Information might no longer be useful by the time
it is available.
Cost of Acquisition
•Organizations often charge for information even
when it is not their primary business activity.
Data Errors
•Using any data that happen to be available or
that were acquired with little care can lead to poor
and misleading information.
24
Slide
Descriptive Statistics
Descriptive statistics are the tabular, graphical, and
numerical methods used to summarize data.
25
Slide
Example: Hudson Auto Repair
The manager of Hudson Auto would like to have
a better understanding of the cost of parts used in the
engine tune-ups performed in the shop. She examines
50 customer invoices for tune-ups. The costs of parts,
rounded to the nearest dollar, are listed below.
91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73
26
Slide
Example: Hudson Auto Repair
Tabular Summary (Frequencies and Percent Frequencies)
Parts Cost ($)   Frequency   Percent Frequency
50-59             2            4
60-69            13           26
70-79            16           32
80-89             7           14
90-99             7           14
100-109           5           10
Total            50          100
27
Slide
Example: Hudson Auto Repair
Graphical Summary (Histogram)
[Figure: histogram of frequency (vertical axis, 2 to 18) against parts cost ($) (horizontal axis, 50 to 110)]
28
Slide
Example: Hudson Auto Repair
Numerical Descriptive Statistics
•The most common numerical descriptive statistic
is the average (or mean).
•Hudson's average cost of parts, based on the 50
tune-ups studied, is $79 (found by summing the
50 cost values and then dividing by 50).
29
Slide
Statistical Inference
Statistical inference is the process of using data
obtained from a small group of elements (the sample)
to make estimates and test hypotheses about the
characteristics of a larger group of elements (the
population).
30
Slide
Example: Hudson Auto Repair
Process of Statistical Inference
1. The population consists of all tune-ups. The average
cost of parts is unknown.
2. A sample of 50 engine tune-ups is examined.
3. The sample data provide a sample average cost of
$79 per tune-up.
4. The value of the sample average is used to make an
estimate of the population average.
31
Slide
Descriptive Statistics:
Tabular and Graphical Methods
Summarizing the Qualitative Data
Frequency Distribution
Relative Frequency
Percent Frequency Distribution
Bar Graph
Pie Chart
32
Slide
Frequency Distribution
A frequency distribution is a tabular summary of
data showing the frequency (or number) of items in
each of several classes.
33
Slide
Example: Marada Inn
Guests staying at Marada Inn were asked to rate the
quality of their accommodations as being excellent,
above average, average, below average, or poor. The
ratings provided by a sample of 20 guests are shown
below.
Below Average    Average          Above Average
Above Average    Above Average    Above Average
Above Average    Below Average    Below Average
Average          Poor             Poor
Above Average    Excellent        Above Average
Average          Above Average    Average
Above Average    Average
34
Slide
Example: Marada Inn
Frequency Distribution
Rating           Frequency
Poor              2
Below Average     3
Average           5
Above Average     9
Excellent         1
Total            20
35
Slide
Relative Frequency Distribution
The relative frequency of a class is the fraction or
proportion of the total number of data items
belonging to the class.
A relative frequency distribution is a tabular
summary of a set of data showing the relative
frequency for each class.
36
Slide
Percent Frequency Distribution
The percent frequency of a class is the relative
frequency multiplied by 100.
A percent frequency distribution is a tabular
summary of a set of data showing the percent
frequency for each class.
37
Slide
Example: Marada Inn
Relative Frequency and Percent Frequency Distributions
Rating           Relative Frequency   Percent Frequency
Poor              .10                  10
Below Average     .15                  15
Average           .25                  25
Above Average     .45                  45
Excellent         .05                   5
Total            1.00                 100
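The relative and percent frequencies in the Marada Inn table can be reproduced with a few lines of Python. This is a minimal sketch, assuming the ratings are rebuilt from the class counts in the frequency distribution; the variable names are illustrative only.

```python
from collections import Counter

# Marada Inn sample of 20 guests, rebuilt from the class counts.
ratings = (
    ["Poor"] * 2
    + ["Below Average"] * 3
    + ["Average"] * 5
    + ["Above Average"] * 9
    + ["Excellent"] * 1
)

order = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]
frequency = Counter(ratings)
n = len(ratings)

# Relative frequency = class frequency / n; percent frequency = relative frequency * 100.
relative = {label: frequency[label] / n for label in order}
percent = {label: 100 * relative[label] for label in order}

for label in order:
    print(f"{label:<14} {frequency[label]:>3} {relative[label]:>6.2f} {percent[label]:>5.0f}")
```

The relative frequencies should sum to 1.00 and the percent frequencies to 100, which is a quick check on any frequency table.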
38
Slide
Bar Graph
A bar graph is a graphical device for depicting
qualitative data.
On the horizontal axis we specify the labels that are
used for each of the classes.
A frequency, relative frequency, or percent frequency
scale can be used for the vertical axis.
The bars are separated to emphasize the fact that
each class is a separate category.
39
Slide
Example: Marada Inn
Bar Graph
[Figure: bar graph of frequency (vertical axis, 1 to 9) by rating (horizontal axis: Poor, Below Average, Average, Above Average, Excellent)]
40
Slide
Pie Chart
The pie chart is a commonly used graphical device
for presenting relative frequency distributions for
qualitative data.
First draw a circle; then use the relative frequencies
to subdivide the circle into sectors that correspond to
the relative frequency for each class.
Since there are 360 degrees in a circle, a class with a
relative frequency of .25 would consume .25(360) =
90 degrees of the circle.
41
Slide
Example: Marada Inn
Pie Chart
[Figure: pie chart of quality ratings: Poor 10%, Below Average 15%, Average 25%, Above Average 45%, Excellent 5%]
42
Slide
Insights Gained from the Preceding Pie Chart
•One-half of the customers surveyed gave Marada
a quality rating of “above average” or “excellent”
(looking at the left side of the pie). This might
please the manager.
•For each customer who gave an “excellent” rating,
there were two customers who gave a “poor”
rating (looking at the top of the pie). This should
displease the manager.
Example: Marada Inn
43
Slide
Summarizing Quantitative Data
Frequency Distribution
Relative Frequency
Percent Frequency Distributions
Cumulative Distributions
Dot Plot
Histogram
Ogive/ Frequency Polygon
44
Slide
Example: Hudson Auto Repair
The manager of Hudson Auto would like to get a
better picture of the distribution of costs for engine
tune-up parts. A sample of 50 customer invoices has
been taken and the costs of parts, rounded to the
nearest dollar, are listed below.
91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73
45
Slide
Frequency Distribution
Guidelines for Selecting Number of Classes
•Use between 5 and 20 classes.
•Data sets with a larger number of elements
usually require a larger number of classes.
•Smaller data sets usually require fewer classes.
46
Slide
Frequency Distribution
Guidelines for Selecting Width of Classes
•Use classes of equal width.
•Approximate class width:
\[ \text{Approximate Class Width} = \frac{\text{Largest Data Value} - \text{Smallest Data Value}}{\text{Number of Classes}} \]
47
Slide
Example: Hudson Auto Repair
Frequency Distribution
If we choose six classes:
Approximate Class Width = (109 - 52)/6 = 9.5, rounded up to 10
Cost ($)    Frequency
50-59        2
60-69       13
70-79       16
80-89        7
90-99        7
100-109      5
Total       50
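The class-width guideline and the resulting frequency distribution can be sketched in Python as follows. The starting limit of 50 for the first class is taken from the slide's table; rounding the approximate width up with `math.ceil` is an assumption about how 9.5 becomes 10.

```python
import math

# Parts costs from the 50 Hudson Auto Repair invoices.
costs = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]

num_classes = 6
approx_width = (max(costs) - min(costs)) / num_classes  # (109 - 52) / 6
width = math.ceil(approx_width)                         # round up to a convenient width
start = 50                                              # lower limit of the first class (from the slide)

# Count the values falling in each class, limits inclusive.
frequency = {}
for k in range(num_classes):
    lower = start + k * width
    upper = lower + width - 1
    frequency[(lower, upper)] = sum(lower <= c <= upper for c in costs)

for (lower, upper), count in frequency.items():
    print(f"{lower}-{upper}: {count}")
```

Because the classes do not overlap and together cover the whole range of the data, the class frequencies should add up to the sample size of 50.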
48
Slide
Example: Hudson Auto Repair
Relative Frequency and Percent Frequency Distributions
Cost ($)    Relative Frequency   Percent Frequency
50-59        .04                   4
60-69        .26                  26
70-79        .32                  32
80-89        .14                  14
90-99        .14                  14
100-109      .10                  10
Total       1.00                 100
49
Slide
Example: Hudson Auto Repair
Insights Gained from the Percent Frequency
Distribution
•Only 4% of the parts costs are in the $50-59 class.
•30% of the parts costs are under $70.
•The greatest percentage (32% or almost one-third)
of the parts costs are in the $70-79 class.
•10% of the parts costs are $100 or more.
50
Slide
Dot Plot
One of the simplest graphical summaries of
quantitative data is a dot plot.
A horizontal axis shows the range of data values.
Then each data value is represented by a dot placed
above the axis.
51
Slide
Example: Hudson Auto Repair
Dot Plot
[Figure: dot plot of the 50 parts costs above a horizontal Cost ($) axis running from 50 to 110]
52
Slide
Histogram
Another common graphical presentation of
quantitative data is a histogram.
The variable of interest is placed on the horizontal
axis.
A rectangle is drawn above each class interval, with its
height corresponding to the interval's frequency,
relative frequency, or percent frequency.
Unlike a bar graph, a histogram has no natural
separation between rectangles of adjacent classes.
53
Slide
Example: Hudson Auto Repair
Histogram
[Figure: histogram of frequency (vertical axis, 2 to 18) against parts cost ($) (horizontal axis, 50 to 110)]
54
Slide
Cumulative Distributions
Cumulative frequency distribution: shows the
number of items with values less than or equal to the
upper limit of each class.
Cumulative relative frequency distribution: shows
the proportion of items with values less than or equal
to the upper limit of each class.
Cumulative percent frequency distribution: shows
the percentage of items with values less than or equal
to the upper limit of each class.
55
Slide
Example: Hudson Auto Repair
Cumulative Distributions
Cost ($)   Cumulative Frequency   Cumulative Relative Frequency   Cumulative Percent Frequency
≤ 59        2                      .04                              4
≤ 69       15                      .30                             30
≤ 79       31                      .62                             62
≤ 89       38                      .76                             76
≤ 99       45                      .90                             90
≤ 109      50                     1.00                            100
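The three cumulative distributions are running totals of the class frequencies. A minimal sketch, assuming the Hudson Auto Repair class frequencies from the earlier frequency distribution:

```python
from itertools import accumulate

upper_limits = [59, 69, 79, 89, 99, 109]   # upper limit of each class
frequencies = [2, 13, 16, 7, 7, 5]         # class frequencies
n = sum(frequencies)

# Running total of items with values less than or equal to each upper limit.
cumulative = list(accumulate(frequencies))
cumulative_relative = [c / n for c in cumulative]
cumulative_percent = [100 * c / n for c in cumulative]

for limit, cf, crf, cpf in zip(upper_limits, cumulative, cumulative_relative, cumulative_percent):
    print(f"<= {limit:>3}: {cf:>2}  {crf:.2f}  {cpf:.0f}")
```

The last cumulative frequency always equals the sample size, and the last cumulative percent frequency is always 100.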
56
Slide
Ogive
An ogive is a graph of a cumulative distribution.
The data values are shown on the horizontal axis.
Shown on the vertical axis are the:
•cumulative frequencies, or cumulative relative
frequencies, or cumulative percent frequencies
The frequency (one of the above) of each class is
plotted as a point.
The plotted points are connected by straight lines.
57
Slide
Example: Hudson Auto Repair
Ogive
•Because the class limits for the parts-cost data are
50-59, 60-69, and so on, there appear to be one-unit
gaps from 59 to 60, 69 to 70, and so on.
•These gaps are eliminated by plotting points
halfway between the class limits.
•Thus, 59.5 is used for the 50-59 class, 69.5 is used
for the 60-69 class, and so on.
58
Slide
Example: Hudson Auto Repair
Ogive with Cumulative Percent Frequencies
[Figure: ogive of cumulative percent frequency (vertical axis, 20 to 100) against parts cost ($) (horizontal axis, 50 to 110)]
59
Slide
Cross tabulations and Scatter Diagrams
Thus far we have focused on methods that are used
to summarize the data for one variable at a time.
Often a manager is interested in tabular and
graphical methods that will help to understand the
relationship between two variables.
Cross tabulation and a scatter diagram are two
methods for summarizing the data for two (or more)
variables simultaneously.
60
Slide
Crosstabulation
Crosstabulation is a tabular method for summarizing
the data for two variables simultaneously.
Crosstabulation can be used when:
•One variable is qualitative and the other is
quantitative
•Both variables are qualitative
•Both variables are quantitative
The left and top margin labels define the classes for
the two variables.
61
Slide
Example: Finger Lakes Homes
Crosstabulation
The number of Finger Lakes homes sold for each
style and price for the past two years is shown below.
                          Home Style
Price Range   Colonial   Ranch   Split-Level   A-Frame   Total
≤ $99,000     18          6      19            12         55
> $99,000     12         14      16             3         45
Total         30         20      35            15        100
62
Slide
Example: Finger Lakes Homes
Insights Gained from the Preceding Crosstabulation
•The greatest number of homes in the sample (19)
are a split-level style and priced at less than or
equal to $99,000.
•Only three homes in the sample are an A-Frame
style and priced at more than $99,000.
63
Slide
Crosstabulation: Row or Column Percentages
Converting the entries in the table into row
percentages or column percentages can provide
additional insight about the relationship between the
two variables.
64
Slide
Example: Finger Lakes Homes
Row Percentages
                          Home Style
Price Range   Colonial   Ranch   Split-Level   A-Frame   Total
≤ $99,000     32.73      10.91   34.55         21.82     100
> $99,000     26.67      31.11   35.56          6.67     100
Note: row totals are actually 100.01 due to rounding.
65
Slide
Example: Finger Lakes Homes
Column Percentages
                          Home Style
Price Range   Colonial   Ranch   Split-Level   A-Frame
≤ $99,000     60.00      30.00   54.29         80.00
> $99,000     40.00      70.00   45.71         20.00
Total         100        100     100           100
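Row and column percentages come from dividing each cell of the crosstabulation by its row total or its column total. The sketch below uses the Finger Lakes Homes counts; the dictionary layout and names are illustrative, and the lower price class is treated as "less than or equal to $99,000", as in the insights slide.

```python
styles = ["Colonial", "Ranch", "Split-Level", "A-Frame"]
counts = {
    "<= $99,000": [18, 6, 19, 12],
    "> $99,000": [12, 14, 16, 3],
}

# Row percentages: each cell divided by the total of its price-range row.
row_pct = {
    price: [round(100 * c / sum(row), 2) for c in row]
    for price, row in counts.items()
}

# Column percentages: each cell divided by the total of its home-style column.
col_totals = [sum(col) for col in zip(*counts.values())]
col_pct = {
    price: [round(100 * c / t, 2) for c, t in zip(row, col_totals)]
    for price, row in counts.items()
}

for price in counts:
    print(price, row_pct[price], col_pct[price])
```

Row percentages answer "of the homes in this price range, what share is each style?", while column percentages answer "of the homes of this style, what share falls in each price range?".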
66
Slide
Scatter Diagram
A scatter diagram is a graphical presentation of the
relationship between two quantitative variables.
One variable is shown on the horizontal axis and the
other variable is shown on the vertical axis.
The general pattern of the plotted points suggests the
overall relationship between the variables.
67
Slide
Example: Panthers Football Team
Scatter Diagram
The Panthers football team is interested in
investigating the relationship, if any, between
interceptions made and points scored.
x = Number of Interceptions   y = Number of Points Scored
1                             14
3                             24
2                             18
1                             17
3                             27
68
Slide
Example: Panthers Football Team
Scatter Diagram
[Figure: scatter diagram of number of points scored (y, vertical axis, 0 to 30) against number of interceptions (x, horizontal axis, 0 to 3)]
69
Slide
Example: Panthers Football Team
The preceding scatter diagram indicates a positive
relationship between the number of interceptions
and the number of points scored.
Higher points scored are associated with a higher
number of interceptions.
The relationship is not perfect; not all plotted points in
the scatter diagram lie on a straight line.
70
Slide
Scatter Diagram
A Positive Relationship
[Figure: scatter diagram of y against x with points rising from left to right]
71
Slide
Scatter Diagram
A Negative Relationship
[Figure: scatter diagram of y against x with points falling from left to right]
72
Slide
Scatter Diagram
No Apparent Relationship
[Figure: scatter diagram of y against x with no visible pattern]
73
Slide
Tabular and Graphical Procedures
Qualitative Data
Tabular Methods
•Frequency Distribution
•Rel. Freq. Dist.
•% Freq. Dist.
•Crosstabulation
Graphical Methods
•Bar Graph
•Pie Chart
Quantitative Data
Tabular Methods
•Frequency Distribution
•Rel. Freq. Dist.
•Cum. Freq. Dist.
•Cum. Rel. Freq. Distribution
•Crosstabulation
Graphical Methods
•Dot Plot
•Histogram
•Ogive
•Scatter Diagram
74
Slide
Descriptive Statistics: Numerical
Methods
Measures of Location
The Mean (arithmetic mean, geometric mean, and harmonic mean)
The Median
The Mode
Percentiles
Quartiles
75
Slide
Summary Measures
Describing Data Numerically
Center and Location
•Mean
•Median
•Mode
•Weighted Mean
•Percentiles
•Quartiles
Variation
•Range
•Variance
•Standard Deviation
•Coefficient of Variation
76
Slide
Mean
The Mean is the average of data values
The most common measure of central tendency
Mean = sum of values divided by the number of values
[Figure: two dot plots on a 0 to 10 axis, one with Mean = 3 and one with Mean = 4]
For the second plot:
\[ \frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4 \]
77
Slide
Mean
The mean (or average) is
the basic measure of
location or "central
tendency" of the data.
•The sample mean \(\bar{x}\) is a
sample statistic.
•The population mean \(\mu\) is a
population parameter.
78
Slide
Mean
•Sample mean (n = sample size)
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
•Population mean (N = population size)
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} \]
79
Slide
Example: College Class Size
We have the following sample of data
for 5 college classes:
46 54 42 46 32
We use the notation \(x_1\), \(x_2\), \(x_3\), \(x_4\), and \(x_5\) to represent the
number of students in each of the 5 classes:
\(x_1 = 46\), \(x_2 = 54\), \(x_3 = 42\), \(x_4 = 46\), \(x_5 = 32\)
Thus we have:
\[ \bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5} = \frac{46 + 54 + 42 + 46 + 32}{5} = 44 \]
The average class size is 44 students
80
Slide
Median
The median is the value in the
middle when the data are arranged in
ascending order (from smallest value
to largest value).
a. For an odd number of observations the median
is the middle value.
b. For an even number of observations the
median is the average of the two middle values.
81
Slide
The College Class Size example
First, arrange the data in ascending order:
32 42 46 46 54
Notice that n = 5, an odd number. Thus the
median is given by the middle value.
32 42 46 46 54
The median class
size is 46
82
Slide
Median Starting Salary For a Sample
of 12 Business School Graduates
A college placement office has obtained the
following data for 12 recent graduates:
Graduate   Starting Salary     Graduate   Starting Salary
1          2850                7          2890
2          2950                8          3130
3          3050                9          2940
4          2880                10         3325
5          2755                11         2920
6          2710                12         2880
83
Slide
First we arrange the data in ascending order:
2710 2755 2850 2880 2880 2890 2920 2940 2950 3050 3130 3325
Notice that n = 12, an even number. Thus we take an
average of the middle two values, 2890 and 2920:
\[ \text{Median} = \frac{2890 + 2920}{2} = 2905 \]
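The odd and even cases of the median rule can be written as one short function. This is a minimal sketch; the function name `median` is chosen for illustration and is applied to the class-size and starting-salary samples from the slides.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average of the two middle values


class_sizes = [46, 54, 42, 46, 32]
salaries = [2850, 2950, 3050, 2880, 2755, 2710,
            2890, 3130, 2940, 3325, 2920, 2880]

print(median(class_sizes))  # odd number of observations
print(median(salaries))     # even number of observations
```

Python's standard library offers the same calculation as `statistics.median`; the explicit version is shown only to mirror the two cases on the slide.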
84
Slide
Mode
The mode is the value that occurs with
greatest frequency
A measure of central tendency
Value that occurs most often
There may be no mode
There may be several modes
[Figure: two dot plots, one on a 0 to 14 axis with Mode = 5 and one on a 0 to 6 axis with No Mode]
85
Slide
The Mode
MODE The value of the observation that appears most frequently.
86
Slide
Characteristics of the Mean
1. The most widely used measure of
location.
2. Major characteristics:
• All values are used.
• It is unique.
• It is calculated by summing the
values and dividing by the
number of values.
3. Weakness: Its value can be distorted
when a few values are extremely large
or extremely small compared to the
majority of the data.
Properties and Uses of the Median
1. There is a unique median for each data set.
2. Not affected by extremely large or small
values and is therefore a valuable measure
of central tendency when such values
occur.
Characteristics of the Mode
1. Mode: the value of the
observation that appears
most frequently.
2. Advantage: Not affected
by extremely high or low
values.
3. Disadvantages:
For many sets of data,
there is no mode
because no value
appears more than
once.
For some data sets
there is more than one
mode.
87
Slide
Weighted Mean
When the mean is computed by giving each data
value a weight that reflects its importance, it is
referred to as a weighted mean.
In the computation of a grade point average (GPA),
the weights are the number of credit hours earned for
each grade.
When data values vary in importance, the analyst
must choose the weight that best reflects the
importance of each value.
88
Slide
Weighted Mean
\[ \bar{x} = \frac{\sum w_i x_i}{\sum w_i} \]
where:
\(x_i\) = value of observation i
\(w_i\) = weight for observation i
89
Slide
Mean for Grouped Data
Sample Data
\[ \bar{x} = \frac{\sum f_i M_i}{\sum f_i} \]
Population Data
\[ \mu = \frac{\sum f_i M_i}{N} \]
where:
\(f_i\) = frequency of class i
\(M_i\) = midpoint of class i
90
Slide
Weighted Mean
Used when values are grouped by frequency or relative
importance
Example: Sample of 26 Repair Projects
Days to Complete   Frequency
5                   4
6                  12
7                   8
8                   2
Weighted Mean Days to Complete:
\[ \bar{X}_W = \frac{\sum w_i x_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31 \text{ days} \]
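The weighted mean for the repair-project example can be checked with a short Python sketch. The function name `weighted_mean` is illustrative; the values are the days to complete and the weights are the frequencies from the table.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


days = [5, 6, 7, 8]        # days to complete
projects = [4, 12, 8, 2]   # number of projects taking that many days (the weights)

mean_days = weighted_mean(days, projects)
print(round(mean_days, 2))
```

When every weight is equal, the weighted mean reduces to the ordinary arithmetic mean.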
91
Slide
Example: Apartment Rents
Given below is the previous sample of monthly rents
for one-bedroom apartments presented here as grouped
data in the form of a frequency distribution.
Rent ($)   Frequency
420-439     8
440-459    17
460-479    12
480-499     8
500-519     7
520-539     4
540-559     2
560-579     4
580-599     2
600-619     6
92
Slide
Example: Apartment Rents
Mean for Grouped Data
Rent ($)   f_i   M_i     f_i M_i
420-439     8    429.5    3436.0
440-459    17    449.5    7641.5
460-479    12    469.5    5634.0
480-499     8    489.5    3916.0
500-519     7    509.5    3566.5
520-539     4    529.5    2118.0
540-559     2    549.5    1099.0
560-579     4    569.5    2278.0
580-599     2    589.5    1179.0
600-619     6    609.5    3657.0
Total      70            34525.0
\[ \bar{x} = \frac{34{,}525}{70} = 493.21 \]
This approximation differs by $2.41 from the actual
sample mean of $490.80.
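The grouped-data mean treats every observation in a class as if it sat at the class midpoint. A minimal sketch for the apartment-rent frequency distribution, with the class limits and frequencies taken from the table and all names chosen for illustration:

```python
# (lower limit, upper limit) of each rent class and its frequency.
classes = [(420, 439), (440, 459), (460, 479), (480, 499), (500, 519),
           (520, 539), (540, 559), (560, 579), (580, 599), (600, 619)]
frequencies = [8, 17, 12, 8, 7, 4, 2, 4, 2, 6]

# Midpoint M_i of each class, then the sum of f_i * M_i.
midpoints = [(lower + upper) / 2 for lower, upper in classes]
total = sum(f * m for f, m in zip(frequencies, midpoints))

grouped_mean = total / sum(frequencies)
print(round(grouped_mean, 2))
```

The result is an approximation because the individual rents inside each class are replaced by the midpoint, which is why it differs slightly from the mean of the raw data.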
93
Slide
Review Example
Five houses on a hill by the beach
House Prices:
$2,000,000
500,000
300,000
100,000
100,000
Slide
Summary Statistics
Mean: $3,000,000/5 = $600,000
Median: middle value of ranked data = $300,000
Mode: most frequent value = $100,000
House Prices:
$2,000,000
500,000
300,000
100,000
100,000
Sum 3,000,000
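The three summary measures for the house prices can be checked with Python's standard `statistics` module. This is a small sketch; only the variable names are invented.

```python
import statistics

prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(prices)      # sum of the values divided by how many there are
median_price = statistics.median(prices)  # middle value of the ranked data
mode_price = statistics.mode(prices)      # most frequent value

print(mean_price, median_price, mode_price)
```

The single $2,000,000 house pulls the mean well above the median, which illustrates why the median is often preferred when a few extreme values are present.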
95
Slide
Percentiles
The pth percentile is a value such that at least p
percent of the observations are less than or equal to
this value and at least (100 – p) percent of the
observations are greater than or equal to this value.
Example: I scored in the 70th percentile on the
Graduate Record Exam (GRE), meaning I
scored higher than 70 percent of those who
took the exam.
96
Slide
Calculating the pth Percentile
•Step 1: Arrange the data in ascending order
(smallest value to largest value).
•Step 2: Compute an index i
\[ i = \left( \frac{p}{100} \right) n \]
where p is the percentile of interest and n is the number
of observations.
•Step 3: (a) If i is not an integer, round up. The next
integer greater than i denotes the position of the
pth percentile.
(b) If i is an integer, the pth percentile is the
average of the values in positions i and i + 1.
97
Slide
Example: Starting Salaries of
Business Grads
Let's compute the 85th percentile using the starting
salary data.
Step 1: Arrange the data in ascending order.
2710 2755 2850 2880 2880 2890 2920 2940 2950 3050
3130 3325
Step 2:
\[ i = \left( \frac{p}{100} \right) n = \left( \frac{85}{100} \right) 12 = 10.2 \]
Step 3: Since 10.2 is not an integer, round up to
11. The 85th percentile is the value in the 11th position (3130).
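The three-step percentile procedure translates directly into code. The sketch below follows the slide's rule (round up when the index is not an integer, otherwise average positions i and i + 1); the function name and the restriction to 0 < p < 100 are assumptions made for this illustration. Note that statistical libraries often use different interpolation rules, so their results may not match this rule exactly.

```python
import math


def percentile(values, p):
    """pth percentile using the index rule i = (p / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)          # Step 1: ascending order
    n = len(ordered)
    i = p * n / 100                   # Step 2: compute the index
    if i != int(i):                   # Step 3(a): not an integer, round up
        return ordered[math.ceil(i) - 1]
    position = int(i)                 # Step 3(b): average positions i and i + 1
    return (ordered[position - 1] + ordered[position]) / 2


salaries = [2850, 2950, 3050, 2880, 2755, 2710,
            2890, 3130, 2940, 3325, 2920, 2880]

print(percentile(salaries, 85))
print(percentile(salaries, 25), percentile(salaries, 50), percentile(salaries, 75))
```

Positions on the slide are counted from 1, while Python lists are indexed from 0, which is why the code subtracts 1 when picking values.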
98
Slide
Quartiles
Quartiles are just specific percentiles
Let:
Q1 = first quartile, or 25th percentile
Q2 = second quartile, or 50th percentile (also the median)
Q3 = third quartile, or 75th percentile
Let's compute the 1st and 3rd quartiles using the
starting salary data. Note that we already computed the
median for this sample, so we know the 2nd quartile.
99
Slide
2710 2755 2850 2880 2880 2890 2920 2940 2950 3050
3130 3325
Now find the 25th percentile:
\[ i = \left( \frac{p}{100} \right) n = \left( \frac{25}{100} \right) 12 = 3 \]
Note that 3 is an integer, so to find the 25th percentile we must
average together the 3rd and 4th values:
Q1 = (2850 + 2880)/2 = 2865
Now find the 75th percentile:
\[ i = \left( \frac{p}{100} \right) n = \left( \frac{75}{100} \right) 12 = 9 \]
Note that 9 is an integer, so to find the 75th percentile we must
average together the 9th and 10th values:
Q3 = (2950 + 3050)/2 = 3000
100
Slide
Quartiles for the Starting Salary Data
2710 2755 2850 2880 2880 2890 2920 2940 2950 3050
3130 3325
Q1 = 2865
Q2 = 2905 (Median)
Q3 = 3000
101
Slide
Measures of Variability
Measures of Relative Location and Detecting
Outliers
Exploratory Data Analysis
Measures of Association
Between Two Variables
102
Slide
Measures of Variability
It is often desirable to consider measures of variability
(dispersion), as well as measures of location.
For example, in choosing supplier A or supplier B we
might consider not only the average delivery time for
each, but also the variability in delivery time for each.
Range
Interquartile Range
Variance
Standard Deviation
Coefficient of Variation
103
Slide
Measures of Variation
Variation
•Range
•Interquartile Range
•Variance
  •Population Variance
  •Sample Variance
•Standard Deviation
  •Population Standard Deviation
  •Sample Standard Deviation
•Coefficient of Variation
104
Slide
Measures of variation give information on the
spread or variability of the data values.
[Figure: distributions with the same center but different variation]
105
Slide
Range
Simplest measure of variation
Difference between the largest and the smallest
observations:
\[ \text{Range} = x_{\text{maximum}} - x_{\text{minimum}} \]
Example: for data on a 0 to 14 axis with smallest value 1
and largest value 14,
Range = 14 - 1 = 13
106
Slide
Example: Apartment Rents
Range
Range = largest value - smallest value
Range = 615 - 425 = 190
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
107
Slide
Interquartile Range
The interquartile range of a data set is the difference
between the third quartile and the first quartile.
It is the range for the middle 50% of the data.
108
Slide
Example: Apartment Rents
Interquartile Range
3rd Quartile (Q3) = 525
1st Quartile (Q1) = 445
Interquartile Range = Q3 - Q1 = 525 - 445 = 80
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
109
Slide
Variance
The variance is a measure of variability that utilizes
all the data.
It is based on the difference between the value of
each observation (\(x_i\)) and the mean (\(\bar{x}\) for a sample,
\(\mu\) for a population).
110
Slide
Variance
The variance is the average of the squared differences
between each data value and the mean.
If the data set is a sample, the variance is denoted by
\(s^2\):
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
If the data set is a population, the variance is denoted
by \(\sigma^2\):
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]
111
Slide
Variance for Grouped Data
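Sample Data
\[ s^2 = \frac{\sum f_i (X_i - \bar{x})^2}{n - 1} \]
Population Data
\[ \sigma^2 = \frac{\sum f_i (X_i - \mu)^2}{N} \]
where \(f_i\) is the frequency of class i and \(X_i\) is its midpoint.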
112
Slide
Standard Deviation
Most commonly used measure of variation
Shows variation about the mean
The standard deviation of a data set is the positive
square root of the variance.
If the data set is a sample, the standard deviation is
denoted s:
\[ s = \sqrt{s^2} \]
If the data set is a population, the standard deviation
is denoted \(\sigma\) (sigma):
\[ \sigma = \sqrt{\sigma^2} \]
113
Slide
Calculation Example:
Sample Standard Deviation
Sample Data ($X_i$): 10 12 14 15 17 18 18 24
n = 8, Mean = $\bar{x}$ = 16
$s = \sqrt{\frac{(10 - \bar{x})^2 + (12 - \bar{x})^2 + (14 - \bar{x})^2 + \cdots + (24 - \bar{x})^2}{n - 1}}$
$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$
$= \sqrt{\frac{130}{7}} = 4.3095$
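The calculation can be reproduced step by step in Python. This sketch recomputes the sum of squared deviations directly from the eight data values, then compares the result with `statistics.stdev` from the standard library.

```python
import math
import statistics

# Sample data from the slide.
data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
mean = sum(data) / n

# Squared deviation of each observation from the mean.
squared_devs = [(x - mean) ** 2 for x in data]
ss = sum(squared_devs)

# Sample standard deviation: divide by n - 1, then take the square root.
s = math.sqrt(ss / (n - 1))

print("mean:", mean)
print("sum of squared deviations:", ss)
print("s:", round(s, 4))
print("statistics.stdev:", round(statistics.stdev(data), 4))
```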
114
Slide
Coefficient of Variation
Measures relative variation
Always in percentage (%)
Shows variation relative to mean
Is used to compare two or more sets of data measured
in different units
Sample: $CV = \left(\frac{s}{\bar{x}}\right) \cdot 100\%$
Population: $CV = \left(\frac{\sigma}{\mu}\right) \cdot 100\%$
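A small sketch of how the coefficient of variation allows a comparison across different units. The height and weight figures are hypothetical, chosen only for illustration.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100


# Hypothetical summary statistics measured in different units.
cv_height = coefficient_of_variation(std_dev=10, mean=170)  # centimetres
cv_weight = coefficient_of_variation(std_dev=12, mean=70)   # kilograms

print(round(cv_height, 2), round(cv_weight, 2))
```

Because the units cancel in the ratio, the two percentages can be compared directly. In this made-up case the weights vary more, relative to their mean, than the heights do.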
115
Slide
Example: Apartment Rents
Variance
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2{,}996.16$
Standard Deviation
$s = \sqrt{s^2} = \sqrt{2996.16} = 54.74$
Coefficient of Variation
$\frac{s}{\bar{x}} \times 100 = \frac{54.74}{490.80} \times 100 = 11.15$
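These three measures can be recomputed from the 70 apartment rents with Python's standard library. This is a sketch for checking the arithmetic, with the data copied from the earlier slide.

```python
import statistics

# Apartment rents copied from the slide (70 observations).
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

mean = statistics.mean(rents)
variance = statistics.variance(rents)  # sample variance, divisor n - 1
std_dev = statistics.stdev(rents)      # positive square root of the variance
cv = std_dev / mean * 100              # coefficient of variation, in percent

print(round(mean, 2), round(variance, 2), round(std_dev, 2), round(cv, 2))
```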
116
Slide
Measures of Relative Location
and Detecting Outliers
z-Scores
Detecting Outliers
117
Slide
z-Scores
The z-score is often called the standardized value.
It denotes the number of standard deviations a data
value $x_i$ is from the mean.
$z_i = \frac{x_i - \bar{x}}{s}$
A data value less than the sample mean will have a
z-score less than zero.
A data value greater than the sample mean will have
a z-score greater than zero.
A data value equal to the sample mean will have a
z-score of zero.
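A short sketch of the z-score formula, applied to the eight-value sample used in the standard deviation example. The function name `z_scores` is illustrative.

```python
import statistics


def z_scores(data):
    """Standardize each value: (x_i - mean) / sample standard deviation."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return [(x - mean) / s for x in data]


data = [10, 12, 14, 15, 17, 18, 18, 24]
z = z_scores(data)

for x, zi in zip(data, z):
    print(x, round(zi, 2))
```

Values below the mean of 16 come out negative and values above it come out positive. The z-scores of a data set always sum to zero, up to rounding.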
118
Slide
Example: Apartment Rents
z-Score of Smallest Value (425)
$z = \frac{x_i - \bar{x}}{s} = \frac{425 - 490.80}{54.74} = -1.20$
Standardized Values for Apartment Rents
-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27
119
Slide
Detecting Outliers
An outlier is an unusually small or unusually large
value in a data set.
A data value with a z-score less than -3 or greater
than +3 might be considered an outlier.
It might be an incorrectly recorded data value.
It might be a data value that was incorrectly included
in the data set.
120
Slide
Example: Apartment Rents
Detecting Outliers
The most extreme z-scores are -1.20 and 2.27.
Using |z| > 3 as the criterion for an outlier,
there are no outliers in this data set.
Standardized Values for Apartment Rents
-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27
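The outlier check can be sketched in Python: standardize every rent, then flag any value whose z-score is beyond 3 standard deviations in either direction. The threshold of 3 comes from the slide. The variable names are illustrative.

```python
import statistics

# Apartment rents copied from the slide (70 observations).
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

mean = statistics.mean(rents)
s = statistics.stdev(rents)

# z-score of every observation.
z = [(x - mean) / s for x in rents]

# Flag values more than 3 standard deviations from the mean.
outliers = [x for x, zi in zip(rents, z) if abs(zi) > 3]

print("most extreme z-scores:", round(min(z), 2), round(max(z), 2))
print("outliers:", outliers)
```

A flagged value is only a candidate for review. As the slide notes, it might be a recording error or a value that does not belong in the data set.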
121
Slide
Exploratory Data Analysis
Five-Number Summary
122
Slide
Five-Number Summary
Smallest Value
First Quartile
Median
Third Quartile
Largest Value
123
Slide
Example: Apartment Rents
Five-Number Summary
Lowest Value = 425 First Quartile = 445
Median = 475
Third Quartile = 525 Largest Value = 615
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
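A five-number summary for the rents can be sketched with `statistics.quantiles` from the Python standard library (Python 3.8 or later). Note the assumption: that function uses its own interpolation convention (the default `method="exclusive"`), which is not necessarily the quartile rule used in these slides, so results can differ on other data sets.

```python
import statistics

# Apartment rents copied from the slide (70 observations).
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

# Three cut points that split the data into four equal parts.
q1, median, q3 = statistics.quantiles(rents, n=4)

# Smallest value, first quartile, median, third quartile, largest value.
summary = (min(rents), q1, median, q3, max(rents))
print(summary)
```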
124
Slide
Measures of Association
between Two Variables
Covariance
Correlation Coefficient
125
Slide
Covariance
The covariance is a measure of the linear association
between two variables.
Positive values indicate a positive relationship.
Negative values indicate a negative relationship.
126
Slide
Covariance
If the data sets are samples, the covariance is denoted
by $s_{xy}$.
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
If the data sets are populations, the covariance is
denoted by $\sigma_{xy}$.
$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
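A minimal sketch of the sample and population covariance formulas. The paired data are hypothetical and the function name is illustrative.

```python
def covariance(xs, ys, sample=True):
    """Average product of paired deviations from the two means.

    Divides by n - 1 for a sample, by N for a population.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n - 1) if sample else cross / n


# Hypothetical paired observations, not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

s_xy = covariance(x, y, sample=True)
sigma_xy = covariance(x, y, sample=False)
print(s_xy, sigma_xy)
```

Both values are positive here, which indicates a positive linear relationship: larger x values tend to be paired with larger y values.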
127
Slide
Correlation Coefficient
The coefficient can take on values between -1 and +1.
Values near -1 indicate a strong negative linear
relationship.
Values near +1 indicate a strong positive linear
relationship.
If the data sets are samples, the coefficient is $r_{xy}$.
$r_{xy} = \frac{s_{xy}}{s_x s_y}$
If the data sets are populations, the coefficient is $\rho_{xy}$.
$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
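The sample correlation coefficient can be sketched as the sample covariance divided by the product of the two sample standard deviations. The data and the function name are hypothetical.

```python
import math


def correlation(xs, ys):
    """Sample correlation: r_xy = s_xy / (s_x * s_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)


# Hypothetical paired observations, not taken from the slides.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
print(round(r, 4))

# Perfectly linear data sit at the two extremes of the scale.
r_up = correlation([1, 2, 3], [2, 4, 6])
r_down = correlation([1, 2, 3], [6, 4, 2])
print(round(r_up, 4), round(r_down, 4))
```

Unlike the covariance, the result does not depend on the units of x and y, which is why it stays between -1 and +1.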