
Business Statistics

Graphs, Charts, and Tables: Describing Your Data

Dr.M.Raghunadh Acharya
11/07/10 1
Contents …
• Construct a frequency distribution both
manually and with a computer
• Construct and interpret a histogram
• Create and interpret bar charts, pie
charts, and stem-and-leaf diagrams
• Present and interpret data in line charts
and scatter diagrams

Frequency Distributions
What is a Frequency Distribution?
• A frequency distribution is a list or a table …
• containing the values of a variable (or a set
of ranges within which the data falls) ...
• and the corresponding frequencies with
which each value occurs (or frequencies with
which data falls within each range)

Why Use Frequency Distributions?

• A frequency distribution is a way to summarize data
• The distribution condenses the
raw data into a more useful
form...
• and allows for a quick visual
interpretation of the data

Frequency Distribution:
Discrete Data
• Discrete data: possible values are countable

Example: An advertiser asks 200 customers how many days per week they read the daily newspaper.

Number of days read    Frequency
0                      44
1                      24
2                      18
3                      16
4                      20
5                      22
6                      26
7                      30
Total                  200

Relative Frequency
Relative Frequency: What proportion is in each category?

Number of days read    Frequency    Relative Frequency
0                      44           .22
1                      24           .12
2                      18           .09
3                      16           .08
4                      20           .10
5                      22           .11
6                      26           .13
7                      30           .15
Total                  200          1.00

44/200 = .22: 22% of the people in the sample report that they read the newspaper 0 days per week.

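As a minimal sketch of the relative frequency calculation, the snippet below divides each frequency from the newspaper example by the total count. The names `frequency` and `relative_frequencies` are illustrative choices, not part of the original slides.

```python
# Frequency table from the newspaper example: days read per week -> customers.
frequency = {0: 44, 1: 24, 2: 18, 3: 16, 4: 20, 5: 22, 6: 26, 7: 30}


def relative_frequencies(freq):
    """Return each value's frequency divided by the total count."""
    total = sum(freq.values())
    return {value: count / total for value, count in freq.items()}


rel = relative_frequencies(frequency)
for days, share in rel.items():
    # One table row: value, frequency, relative frequency.
    print(f"{days:>4} {frequency[days]:>9} {share:>8.2f}")
```

The relative frequencies always sum to 1, which is a quick check that no category was dropped.
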
Frequency Distribution: Continuous Data

• Continuous Data: may take on any value in some interval

Example: A manufacturer of insulation randomly selects 20 winter days and records the daily high temperature:

24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
32, 13, 12, 38, 41, 43, 44, 27, 53, 27

(Temperature is a continuous variable because it could be measured to any degree of precision desired)

Grouping Data by Classes

Sort raw data in ascending order:

12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

• Find range: 58 - 12 = 46
• Select number of classes: 5 (usually between 5 and 20)
• Compute class width: 10 (46/5 = 9.2, then round up)
• Determine class boundaries: 10, 20, 30, 40, 50, 60
• Compute class midpoints: 15, 25, 35, 45, 55
• Count observations & assign to classes

Frequency Distribution Example

Data in ordered array:


12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Frequency Distribution

Class              Frequency    Relative Frequency
10 but under 20    3            .15
20 but under 30    6            .30
30 but under 40    5            .25
40 but under 50    4            .20
50 but under 60    2            .10
Total              20           1.00
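
The counting step can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `frequency_distribution` and its parameters are hypothetical, and each class is treated as "lower but under upper", matching the class labels in the table.

```python
# Daily high temperatures from the insulation example.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]


def frequency_distribution(data, start, width, n_classes):
    """Count observations in classes [lower, lower + width)."""
    counts = [0] * n_classes
    for x in data:
        index = int((x - start) // width)
        # Assumption: values outside the class range are ignored.
        if 0 <= index < n_classes:
            counts[index] += 1
    return counts


counts = frequency_distribution(temps, start=10, width=10, n_classes=5)
for k, count in enumerate(counts):
    lower = 10 + 10 * k
    print(f"{lower} but under {lower + 10}: {count}  {count / len(temps):.2f}")
```
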
Histograms
• The classes or intervals are shown on the horizontal axis
• Frequency is measured on the vertical axis
• Bars of the appropriate heights can be used to represent the number of observations within each class
• Such a graph is called a histogram

Histogram Example
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

[Figure: histogram of frequency by class midpoints. No gaps between bars, since the data are continuous.]

Questions for Grouping Data
into Classes

1. How wide should each interval be? (How many classes should be used?)
2. How should the endpoints of the intervals be determined?

• Often answered by trial and error, subject to user judgment
• The goal is to create a distribution that is neither too "jagged" nor too "blocky"
• Goal is to appropriately show the pattern of variation in the data

How Many Class Intervals?

• Many (Narrow class intervals)
– May yield a very jagged distribution with gaps from empty classes
– Can give a poor indication of how frequency varies across classes

• Few (Wide class intervals)
– May compress variation too much and yield a blocky distribution
– Can obscure important patterns of variation

[Figure: example histograms; X axis labels are upper class endpoints]

General Guidelines

Number of Data Points    Number of Classes
under 50                 5 - 7
50 - 100                 6 - 10
100 - 250                7 - 12
over 250                 10 - 20

– Class widths can typically be reduced as the number of observations increases
– Distributions with numerous observations are more likely to be smooth and have gaps filled since data are plentiful

Class Width
• The class width is the distance between the
lowest possible value and the highest possible
value for a frequency class

• The minimum class width is

\[ W = \dfrac{\text{Largest Value} - \text{Smallest Value}}{\text{Number of Classes}} \]
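
A short sketch of this formula, applied to the 20 temperature readings used earlier in the deck. The function name `minimum_class_width` is illustrative, and rounding up to the next whole number is an assumption about what "round off" means for this data.

```python
import math

# Daily high temperatures from the insulation example.
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]


def minimum_class_width(data, n_classes):
    """W = (largest value - smallest value) / number of classes."""
    return (max(data) - min(data)) / n_classes


w = minimum_class_width(temps, 5)
# Assumption: round up so that the classes cover the whole range.
width = math.ceil(w)
print(w, width)
```

With 5 classes the minimum width is 46/5 = 9.2, which rounds up to the width of 10 used in the grouped table.
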

Histograms in Excel

1. Select Tools/Data Analysis
2. Choose Histogram
3. Input data and bin ranges

Stem and Leaf Diagram
• A simple way to see distribution
details in a data set
METHOD: Separate the sorted
data series into leading digits
(the stem) and the trailing digits
(the leaves)

Example:

Data in ordered array:


12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

Here, use the 10's digit for the stem unit:

                    Stem   Leaf
• 12 is shown as    1      2
• 35 is shown as    3      5

Example:

Data in ordered array:


12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

• Completed Stem-and-leaf diagram:


Stem   Leaves
1      2 3 7
2      1 4 4 6 7 7
3      0 2 5 7 8
4      1 3 4 6
5      3 8
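
The method can be sketched as follows, assuming two-digit whole numbers so that the 10's digit is the stem and the unit digit is the leaf. The function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict


def stem_and_leaf(data):
    """Map each stem (10's digit) to its sorted list of leaves (unit digits)."""
    stems = defaultdict(list)
    for x in sorted(data):
        stem, leaf = divmod(x, 10)
        stems[stem].append(leaf)
    return dict(stems)


data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
diagram = stem_and_leaf(data)
for stem, leaves in diagram.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```

Each row has as many leaves as there are observations with that stem, so the rows also show the frequencies of the classes 10 to 19, 20 to 29, and so on.
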
Using other stem units

• Using the 100's digit as the stem:
– Round off the 10's digit to form the leaves

                        Stem   Leaf
• 613 would become      6      1
• 776 would become      7      8
• ...
• 1224 becomes          12     2

Graphing Categorical Data

Categorical Data:
• Pie Charts
• Bar Charts
• Pareto Diagram

Bar and Pie Charts

• Bar charts and Pie charts are often used for qualitative (category) data
• Height of bar or size of pie slice shows the frequency or percentage for each category

Pie Chart Example
Current Investment Portfolio

Investment Type    Amount (in thousands $)    Percentage
Stocks             46.5                       42.27
Bonds              32.0                       29.09
CD                 15.5                       14.09
Savings            16.0                       14.55
Total              110                        100

(Variables are Qualitative)

[Figure: pie chart with slices Stocks 42%, Bonds 29%, CD 14%, Savings 15%. Percentages are rounded to the nearest percent.]
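
The percentage column and the rounded pie-slice labels can be reproduced with a short sketch. The dictionary name `portfolio` and the variable names are illustrative.

```python
# Investment amounts in thousands of dollars, from the portfolio table.
portfolio = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}

total = sum(portfolio.values())
# Each category's share of the total, as a percentage.
percentages = {name: 100 * amount / total for name, amount in portfolio.items()}

for name, pct in percentages.items():
    # Two decimals for the table, nearest whole percent for the pie labels.
    print(f"{name:<8} {pct:6.2f}  {round(pct)}%")
```
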

Bar Chart Example

[Figure: bar chart]

Pareto Diagram Example

[Figure: Pareto diagram showing % invested in each category (bar graph) and cumulative % invested (line graph)]

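A Pareto diagram orders the categories from largest to smallest and adds a running cumulative percentage. The sketch below builds those two series from the investment portfolio amounts shown in the pie chart example; the function name `pareto_table` is illustrative, and plotting is left out.

```python
def pareto_table(amounts):
    """Return (category, percent, cumulative percent) rows, largest first."""
    total = sum(amounts.values())
    rows = []
    cumulative = 0.0
    ordered = sorted(amounts.items(), key=lambda item: item[1], reverse=True)
    for name, amount in ordered:
        share = 100 * amount / total
        cumulative += share
        rows.append((name, share, cumulative))
    return rows


# Investment amounts in thousands of dollars.
portfolio = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}
rows = pareto_table(portfolio)
for name, share, cumulative in rows:
    print(f"{name:<8} {share:6.2f} {cumulative:7.2f}")
```

The first column of numbers gives the bar heights and the second gives the points of the cumulative line, which always ends at 100%.
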
Bar Chart Example

Number of days read    Frequency

0 44
1 24
2 18
3 16
4 20
5 22
6 26
7 30
Total 200

Tabulating and Graphing
Multivariate Categorical Data

• Investment in thousands of dollars

Investment Category    Investor A    Investor B    Investor C    Total
Stocks                 46.5          55            27.5          129
Bonds                  32.0          44            19.0          95
CD                     15.5          20            13.5          49
Savings                16.0          28            7.0           51
Total                  110.0         147           67.0          324

Tabulating and Graphing
Multivariate Categorical Data
(continued)
• Side by side charts

Side-by-Side Chart Example
Sales by quarter for three sales territories:

1st Qtr 2nd Qtr 3rd Qtr 4th Qtr
East 20.4 27.4 59 20.4
West 30.6 38.6 34.6 31.6
North 45.9 46.9 45 43.9

Line Charts and Scatter Diagrams

• Line charts show values of one variable vs. time
– Time is traditionally shown on the horizontal axis
• Scatter diagrams show points for bivariate data
– One variable is measured on the vertical axis and the other variable is measured on the horizontal axis

Line Chart Example
Year    Inflation Rate

1985 3.56
1986 1.86
1987 3.65
1988 4.14
1989 4.82
1990 5.40
1991 4.21
1992 3.01
1993 2.99
1994 2.56
1995 2.83
1996 2.95
1997 2.29
1998 1.56
1999 2.21
2000 3.36
2001 2.85
2002 1.58
Scatter Diagram Example

Volume per day    Cost per day

23 125
26 140
29 146
33 160
38 167
42 170
50 188
55 195
60 200
Types of Relationships

• Linear Relationships
• Curvilinear Relationships
• No Relationship

[Figures: scatter plots of Y against X illustrating each type of relationship]

Chapter Summary

• Data in raw form are usually not easy to use for decision making. Some type of organization is needed:
♦ Table ♦ Graph

• Techniques reviewed in this chapter:
– Frequency Distributions and Histograms
– Bar Charts and Pie Charts
– Stem and Leaf Diagrams
– Line Charts and Scatter Diagrams

Summarization measures …..

Summarization measures represent the data with a single number or a few numbers. They help describe a data set and compare one data set with another. Summary measures computed from a sample can also be used to estimate the corresponding population measures.

The measures used to represent the data are:

1. Measures of Center and Location: mean, median, mode, geometric mean, midrange
2. Other Measures of Location: weighted mean, percentiles, quartiles
3. Measures of Variation: range, interquartile range, variance and standard deviation, coefficient of variation

Summary Measures

Describing Data Numerically

• Center and Location: Mean, Median, Mode, Weighted Mean
• Other Measures of Location: Percentiles, Quartiles
• Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation

Overview: Measures of Center and Location

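Center and Location: Mean, Median, Mode, Weighted Mean

Mean:
– Sample: \( \bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n} \)
– Population: \( \mu = \dfrac{\sum_{i=1}^{N} x_i}{N} \)

Weighted Mean:
– Sample: \( \bar{X}_W = \dfrac{\sum w_i x_i}{\sum w_i} \)
– Population: \( \mu_W = \dfrac{\sum w_i x_i}{\sum w_i} \)
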
Mean (Arithmetic Average)

• The Mean is the arithmetic average of data values


– Sample mean (n = Sample Size):

\[ \bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} \]

– Population mean (N = Population Size):

\[ \mu = \dfrac{\sum_{i=1}^{N} x_i}{N} = \dfrac{x_1 + x_2 + \cdots + x_N}{N} \]

Mean (Arithmetic Average)

• The most common measure of central tendency


• Mean = sum of values divided by the number of values
• Affected by extreme values (outliers)

Data 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
Data 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4

Median

• Not affected by extreme values

[Figure: two data sets on number lines from 0 to 10; Median = 3 for both]

• In an ordered array, the median is the “middle” number


– If n or N is odd, the median is the middle number
– If n or N is even, the median is the average of the two middle numbers
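
The odd/even rule can be sketched directly, next to a plain mean for comparison. The data sets 1, 2, 3, 4, 5 and 1, 2, 3, 4, 10 are the ones used on the mean slide; reusing them here for the median is an assumption, and the function names are illustrative.

```python
def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


for data in ([1, 2, 3, 4, 5], [1, 2, 3, 4, 10]):
    print(data, "mean =", mean(data), "median =", median(data))
```

Replacing 5 with the extreme value 10 moves the mean from 3 to 4 but leaves the median at 3.
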

Mode

• A measure of central tendency


• Value that occurs most often
• Not affected by extreme values
• Used for either numerical or categorical data
• There may be no mode
• There may be several modes

[Figure: two data sets on number lines, one with Mode = 5 and one with No Mode]

Weighted Mean

Used when values are grouped by frequency or relative importance


Example: Sample of 26 Repair Projects

Days to Complete    Frequency
5                   4
6                   12
7                   8
8                   2

Weighted Mean Days to Complete:

\[ \bar{X}_W = \dfrac{\sum w_i x_i}{\sum w_i} = \dfrac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \dfrac{164}{26} = 6.31 \text{ days} \]
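
The same calculation as a minimal sketch, with the frequencies used as weights. The function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


days_to_complete = [5, 6, 7, 8]
frequency = [4, 12, 8, 2]

result = weighted_mean(days_to_complete, frequency)
print(f"{result:.2f} days")
```
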

Review Example

House Prices: Five houses on a hill by the beach

$2,000,000
500,000
300,000
100,000
100,000

Summary Statistics

House Prices:

$2,000,000
500,000
300,000
100,000
100,000
Sum 3,000,000

• Mean: ($3,000,000/5) = $600,000
• Median: middle value of ranked data = $300,000
• Mode: most frequent value = $100,000
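
These three summary statistics can be checked with Python's standard `statistics` module. This is only a quick verification sketch; the variable names are illustrative.

```python
import statistics

# The five house prices from the review example.
prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(prices)      # sum of prices / 5
median_price = statistics.median(prices)  # middle value of the ranked data
mode_price = statistics.mode(prices)      # most frequent value

print(mean_price, median_price, mode_price)
```

The single $2,000,000 house pulls the mean well above the median, which is why the median is often preferred for prices.
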

Which measure of location is the “best”?

• Mean is generally used, unless extreme values (outliers) exist


• Then median is often used, since the median is not sensitive to
extreme values.
– Example: Median home prices may be reported for a region –
less sensitive to outliers

Shape of a Distribution
Describes how data is distributed

Symmetric or skewed

• Left-Skewed: Mean < Median < Mode (longer tail extends to left)
• Symmetric: Mean = Median = Mode
• Right-Skewed: Mode < Median < Mean (longer tail extends to right)

Other Location Measures

Other Measures of Location: Percentiles, Quartiles

Percentiles
The pth percentile in a data array:
• p% are less than or equal to this value
• (100 – p)% are greater than or equal to this value
(where 0 ≤ p ≤ 100)

Quartiles
• 1st quartile = 25th percentile
• 2nd quartile = 50th percentile = median
• 3rd quartile = 75th percentile

Percentiles

• The pth percentile in an ordered array of n values is the value in the ith position, where

\[ i = \dfrac{p}{100}(n + 1) \]

• Example: The 60th percentile in an ordered array of 19 values is the value in 12th position:

\[ i = \dfrac{p}{100}(n + 1) = \dfrac{60}{100}(19 + 1) = 12 \]

Quartiles

[Figure: ranked data split into four groups of 25% each by Q1, Q2, and Q3]

• Quartiles split the ranked data into 4 equal groups
• Example: Find the first quartile

Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22 (n = 9)

Q1 = 25th percentile, so find the \( \dfrac{25}{100}(9 + 1) = 2.5 \) position,
so use the value halfway between the 2nd and 3rd values,
so Q1 = 12.5
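
The position rule i = (p/100)(n + 1) and the "halfway between" step can be sketched as below. The function names are illustrative, and two details are assumptions not stated on the slides: fractional positions are handled by linear interpolation in general, and positions outside 1..n are clamped to the smallest or largest value.

```python
def percentile_position(p, n):
    """1-based position i = (p / 100) * (n + 1) in an ordered array of n values."""
    return p * (n + 1) / 100


def percentile(ordered, p):
    """Value at the percentile position, interpolating between neighbours."""
    n = len(ordered)
    position = percentile_position(p, n)
    # Assumption: positions outside 1..n are clamped to the end values.
    position = min(max(position, 1), n)
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]
print(percentile_position(60, 19))  # position of the 60th percentile, n = 19
print(percentile(sample, 25))       # first quartile of the sample
```
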

Box and Whisker Plot

• A graphical display of data using the 5-number summary:

Minimum -- Q1 -- Median -- Q3 -- Maximum

Example:

[Figure: box and whisker plot marking Minimum, 1st Quartile, Median, 3rd Quartile, and Maximum, with 25% of the data in each of the four sections]

Shape of Box and Whisker Plots

• The Box and central line are centered between the endpoints if data is symmetric around the median
• A Box and Whisker plot can be shown in either vertical or horizontal format

Distribution Shape and Box and Whisker Plot

[Figure: box and whisker plots for Left-Skewed, Symmetric, and Right-Skewed distributions, each marking Q1, Q2, and Q3]

Box-and-Whisker Plot Example

• Below is a Box-and-Whisker plot for the following data:

0  2  2  2  3  3  4  5  5  10  27

Min = 0, Q1 = 2, Q2 = 3, Q3 = 5, Max = 27

[Figure: box and whisker plot with axis marks at 0, 2, 3, 5, and 27]

• This data is very right skewed, as the plot depicts
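
The five-number summary for this data can be checked with the standard library. As a sketch, it relies on `statistics.quantiles` with `method="exclusive"`, which places quartiles using the same (n + 1) position rule as the percentile slides; the function name `five_number_summary` is illustrative.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(data)
    # n=4 asks for the three cut points that split the data into quartiles.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, q2, q3, ordered[-1]


data = [0, 2, 2, 2, 3, 3, 4, 5, 5, 10, 27]
summary = five_number_summary(data)
print(summary)
```

The distance from Q3 to the maximum (22) is far larger than from the minimum to Q1 (2), which is the right skew the plot shows.
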

Measures of Variation

Variation:
• Range
• Interquartile Range
• Variance
– Population Variance
– Sample Variance
• Standard Deviation
– Population Standard Deviation
– Sample Standard Deviation
• Coefficient of Variation

Variation

• Measures of variation give information on the spread or variability of the data values.

[Figure: two distributions with the same center, different variation]

Range

• Difference between the largest and the smallest observations.

\[ \text{Range} = x_{\text{maximum}} - x_{\text{minimum}} \]

Example:

[Figure: data plotted on a number line from 0 to 14]

Range = 14 - 1 = 13

Disadvantages of the Range

• Ignores the way in which data are distributed

[Figure: two differently distributed data sets on number lines from 7 to 12; for both, Range = 12 - 7 = 5]

• Sensitive to outliers

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4

1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119

Interquartile Range

• Can eliminate some outlier problems by using the interquartile range
• Eliminate some high- and low-valued observations and calculate the range from the remaining values
• Interquartile range = 3rd quartile – 1st quartile

Interquartile Range

Example:

X minimum    Q1    Median (Q2)    Q3    X maximum
12           30    45             57    70

(25% of the data lie in each of the four sections)

Interquartile range = 57 – 30 = 27

Variance

• Average of squared deviations of values from the mean

– Sample variance: \( s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \)
– Population variance: \( \sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \)

Standard Deviation

• Most commonly used measure of variation
• Shows variation about the mean
• Has the same units as the original data

– Sample standard deviation: \( s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \)
– Population standard deviation: \( \sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \)

Calculation Example: Sample Standard Deviation

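Sample Data (\( x_i \)): 10 12 14 15 17 18 18 24
n = 8, Mean = \( \bar{x} \) = 16

\[ s = \sqrt{\dfrac{(10 - \bar{x})^2 + (12 - \bar{x})^2 + (14 - \bar{x})^2 + \cdots + (24 - \bar{x})^2}{n - 1}} \]

\[ = \sqrt{\dfrac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} \]

\[ = \sqrt{\dfrac{36 + 16 + 4 + 1 + 1 + 4 + 4 + 64}{7}} = \sqrt{\dfrac{130}{7}} = 4.3095 \]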
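
As a check on the arithmetic, the sketch below computes the sample standard deviation for the data 10, 12, 14, 15, 17, 18, 18, 24 from the definition (divide by n - 1). The function names are illustrative. For this data the squared deviations from the mean of 16 sum to 130.

```python
import math


def sum_squared_deviations(values):
    """Sum of (x - mean)^2 over all values."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def sample_std(values):
    """Square root of the sum of squared deviations divided by n - 1."""
    return math.sqrt(sum_squared_deviations(values) / (len(values) - 1))


data = [10, 12, 14, 15, 17, 18, 18, 24]
ss = sum_squared_deviations(data)
s = sample_std(data)
print(ss, round(s, 4))
```
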

Comparing Standard Deviations

[Figure: three data sets, each plotted on a scale from 11 to 21]

Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = .9258
Data C: Mean = 15.5, s = 4.57

Coefficient of Variation

• Measures relative variation
• Always in percentage (%)
• Shows variation relative to mean
• Is used to compare two or more sets of data measured in different units

Population: \( CV = \left( \dfrac{\sigma}{\mu} \right) \cdot 100\% \)
Sample: \( CV = \left( \dfrac{s}{\bar{x}} \right) \cdot 100\% \)

Comparing Coefficient of Variation

• Stock A:
– Average price last year = $50
– Standard deviation = $5

\[ CV_A = \left( \dfrac{s}{\bar{x}} \right) \cdot 100\% = \dfrac{\$5}{\$50} \cdot 100\% = 10\% \]

• Stock B:
– Average price last year = $100
– Standard deviation = $5

\[ CV_B = \left( \dfrac{s}{\bar{x}} \right) \cdot 100\% = \dfrac{\$5}{\$100} \cdot 100\% = 5\% \]

Both stocks have the same standard deviation, but stock B is less variable relative to its price.

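The stock comparison reduces to one division per stock. A minimal sketch, with an illustrative function name:

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean


cv_a = coefficient_of_variation(5, 50)   # Stock A: s = $5, mean = $50
cv_b = coefficient_of_variation(5, 100)  # Stock B: s = $5, mean = $100
print(f"CV_A = {cv_a:.0f}%, CV_B = {cv_b:.0f}%")
```

Because the dollar units cancel in the ratio, the result can be compared across data sets measured in different units.
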
The Empirical Rule

• If the data distribution is bell-shaped, then the interval:
• μ ± 1σ contains about 68% of the values in the population or the sample
• μ ± 2σ contains about 95% of the values in the population or the sample
• μ ± 3σ contains about 99.7% of the values in the population or the sample

[Figure: bell-shaped curves marking the intervals μ ± 1σ (68%), μ ± 2σ (95%), and μ ± 3σ (99.7%)]

Tchebysheff’s Theorem

• Regardless of how the data are distributed, at least \( (1 - 1/k^2) \) of the values will fall within k standard deviations of the mean
• Examples:

At least                   within
(1 - 1/1^2) = 0%           k = 1 (μ ± 1σ)
(1 - 1/2^2) = 75%          k = 2 (μ ± 2σ)
(1 - 1/3^2) = 89%          k = 3 (μ ± 3σ)
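
The bound is easy to tabulate for any k. A minimal sketch, with an illustrative function name:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    return 1 - 1 / k ** 2


for k in (1, 2, 3):
    print(f"k = {k}: at least {100 * chebyshev_lower_bound(k):.0f}%")
```

For k = 1 the bound is 0%, so the theorem only gives useful information for k greater than 1.
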

Standardized Data Values

• A standardized data value refers to the number of standard deviations a value is from the mean

• Standardized data values are sometimes referred to as z-scores

Standardized Population Values

\[ z = \dfrac{x - \mu}{\sigma} \]

where:
• x = original data value
• μ = population mean
• σ = population standard deviation
• z = standard score
(number of standard deviations x is from μ)

Standardized Sample Values

\[ z = \dfrac{x - \bar{x}}{s} \]

where:
• x = original data value
• x̄ = sample mean
• s = sample standard deviation
• z = standard score
(number of standard deviations x is from x̄)

Remark: The standardized sample values are used for constructing the confidence limits for the population parameters.
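
A short sketch of standardizing sample values. The data set is the one from the sample standard deviation example earlier in the deck, reused here as an assumption for illustration; the function name `z_scores` is also illustrative.

```python
import statistics


def z_scores(sample):
    """Standardize each value: (x - sample mean) / sample standard deviation."""
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    return [(x - mean) / s for x in sample]


data = [10, 12, 14, 15, 17, 18, 18, 24]
z = z_scores(data)
for x, score in zip(data, z):
    print(f"x = {x:>2}  z = {score:+.3f}")
```

Standardized values always have mean 0 and standard deviation 1, whatever the original units were.
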
