
Session 2: Describing data visually
Statistics for Business
Dr. Le Anh Tuan

Graphical Presentation of Data
►Data in raw form are usually not easy to use for decision making.
►Methods of organizing, exploring, and summarizing data include:
  ►Visual methods (charts and graphs) provide insight into characteristics of a data set without using mathematics.
  ►Numerical methods (statistics or tables) provide insight into characteristics of a data set using mathematics.
►The type of graph to use depends on the variable being summarized.

Graphical Presentation of Data

►Categorical Variables
►Frequency distribution
►Bar chart
►Pie chart
►Pareto diagram

►Numerical variables
►Line chart
►Frequency distribution
►Histogram and ogive
►Stem-and-leaf display
►Scatter plot

Graphical Presentation of Data

Categorical Data
  ►Tabulating data: frequency distribution table
  ►Graphing data: bar chart, pie chart, Pareto diagram

Bar and Pie Charts

► Bar charts and pie charts are often used for qualitative (categorical) data.
► The height of a bar or the size of a pie slice shows the frequency or percentage for each category.
► A simple bar chart can display the same data as a pie chart, and many statisticians would prefer it.

Graphical Presentation of Data
Summarize data by category

Example: Students by Major

Major                     Number of students
Finance                   120
International Business    200
Marketing                 150
Management                50
Accounting                75

(The variable Major is categorical)


Bar Charts
[Figure: bar chart "Number of students" by major. Bar heights: Finance 120, International Business 200, Marketing 150, Management 50, Accounting 75.]

Bar Charts

► Clustered bar charts group several values side by side within the same category in a vertical direction.
► Stacked bar charts group several values in a single column within the same category in a vertical direction.

Pie Charts
► Pie charts are another excellent tool for comparing proportions for categorical data.
► Each segment of the pie represents the relative frequency of one category.
► Pie charts should be used to portray data which sum to a total (e.g., percent market shares).
  ► All categories in the data set must be included in the pie.
► A pie chart should only have a few (i.e., 2 to 5) slices.
► Each slice can be labeled with data values or percents.

Pie Charts
Major                     # of students   Percentage
Finance                   120             20
International Business    200             34
Marketing                 150             25
Management                50              8
Accounting                75              13

[Figure: pie chart "Number of students". Slices: Finance 20%, International Business 34%, Marketing 25%, Management 8%, Accounting 13%.]

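The percentages in the table can be reproduced by dividing each count by the total. The following is a minimal sketch in Python; the variable names are illustrative and not part of the slides, and rounding to whole percentages is an assumption that happens to match the table.

```python
# Student counts per major, taken from the table above.
counts = {
    "Finance": 120,
    "International Business": 200,
    "Marketing": 150,
    "Management": 50,
    "Accounting": 75,
}

total = sum(counts.values())  # 595 students in all

# Relative frequency of each category as a whole-number percentage.
percentages = {major: round(100 * n / total) for major, n in counts.items()}

for major, pct in percentages.items():
    print(f"{major:<24}{counts[major]:>5}{pct:>5}%")
```

Each percentage is the size of the corresponding pie slice.
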
Pareto Diagram

Pareto Diagram Example
► A Pareto Chart is a combination of a bar graph and a
line graph. A Pareto Chart is a graph that indicates
the frequency of defects, as well as their cumulative
impact. Pareto Charts are useful to find the defects
to prioritize in order to observe the greatest overall
improvement.
► The most problematic categories are shown first.
► For example, suppose you collect data on customer complaints:

Customer Complaints   Frequency
Product               9
Service               7
Store                 5
Price                 3
Location              2

Pareto Diagram Example

► Step 1: Sort the defect causes by frequency, in descending order
► Step 2: Determine the percentage in each category

Customer Complaints   Frequency   Percentage
Product               9           35
Service               7           27
Store                 5           19
Price                 3           11
Location              2           8
Total                 26          100

Pareto Diagram Example

► Step 3: Show results graphically


[Figure: Pareto chart of customer complaints. Categories in order: Product, Service, Store, Price, Location. Left axis: Frequency (0 to 10); right axis: Percentage (0 to 120).]

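The three steps can be sketched in a few lines of Python. This is only an illustration: the names are hypothetical, the percentages are kept to one decimal place (so they can differ slightly from the whole-number values in the table), and a row of `#` characters stands in for the bars of the chart.

```python
from itertools import accumulate

# Complaint counts from the example, deliberately unsorted.
complaints = {"Store": 5, "Product": 9, "Location": 2, "Service": 7, "Price": 3}

# Step 1: sort categories by frequency, largest first.
ordered = sorted(complaints.items(), key=lambda item: item[1], reverse=True)

# Step 2: percentage of each category and running (cumulative) percentage.
total = sum(freq for _, freq in ordered)
running = list(accumulate(freq for _, freq in ordered))
rows = [
    (name, freq, round(100 * freq / total, 1), round(100 * cum / total, 1))
    for (name, freq), cum in zip(ordered, running)
]

# Step 3: text stand-in for the chart (bars plus cumulative percentage).
for name, freq, pct, cum_pct in rows:
    print(f"{name:<10}{freq:>3}{pct:>7}{cum_pct:>7}  {'#' * freq}")
```

The cumulative column is what the line in a Pareto chart plots; here the first three categories already account for about 81% of all complaints.
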
Graphical Presentation of Data

Numerical Data
  ►Frequency distributions and cumulative distributions: histogram, ogive
  ►Stem-and-leaf display

Frequency Distribution

►A frequency distribution is a table formed by classifying n data values into k classes (bins).
►Frequencies are the number of observations within each class (bin).

Relative Frequency Distribution

►Relative frequency distributions display the proportion of observations in each class relative to the total number of observations.
►They show the fraction of observations in each class.
  ►Found by dividing each frequency by the total number of observations.
  ►The fractions in a relative frequency distribution add up to 1.00.

Number of Classes

►Use at least 5 but no more than 15-20 classes.
►One method to determine the number of classes in a frequency distribution is the rule
$2^k \geq n$
where k = number of classes and n = number of data points.
►Find the lowest value of k that satisfies the rule.
►For example, with n = 50:
$2^5 = 32 < 50$
$2^6 = 64 > 50$, so k = 6 is a good choice.

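The rule can be checked with a short loop. The function below is a minimal sketch; its name is hypothetical and not taken from the slides.

```python
def number_of_classes(n: int) -> int:
    """Return the smallest k such that 2**k >= n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    k = 0
    while 2 ** k < n:
        k += 1
    return k


print(number_of_classes(50))  # 6, because 2**5 = 32 < 50 <= 64 = 2**6
```

The result is a starting point only; it should still be checked against the guideline of using at least 5 classes.
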
Frequency Distribution

►Once the desired number of classes (k) is known, the width of each class can be found.
  ►The width is the range of numbers to put into each class.
►Determine the width of each class by
$w = \text{interval width} = \dfrac{\text{largest number} - \text{smallest number}}{\text{number of desired intervals}}$
►Classes never overlap
►Round up the interval width to get desirable class endpoints

Frequency Distribution

►There is no one correct answer for the class width.
►The goal is to create a histogram that clearly and usefully shows the pattern in the data.
►Often there is more than one acceptable way to accomplish this.

Class Boundaries

►Class boundaries represent the minimum and maximum values for each class.
►Choose class boundaries that are easy to read

Easy to read                   Harder to read
3 to less than 6 minutes       3.21 to less than 6.21 minutes
6 to less than 10 minutes      6.21 to less than 10.21 minutes
10 to less than 15 minutes     10.21 to less than 15.21 minutes

Frequency Distribution

►Example: A manufacturer randomly selects 20 winter days and records the daily high temperature

24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
32, 13, 12, 38, 41, 43, 44, 27, 53, 27

Frequency Distribution

►Sort raw data in ascending order:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
►Find range: 58 - 12 = 46
►Select number of classes: 5 (or use the $2^k \geq n$ rule)
►Compute interval width: 10 (46/5 = 9.2, then round up)
►Determine interval boundaries: 10 but less than 20, 20 but less than 30, . . . , 50 but less than 60
►Count observations & assign to classes

Frequency Distribution

Interval               Frequency   Relative Frequency   Percentage
10 but less than 20    3           0.15                 15
20 but less than 30    6           0.30                 30
30 but less than 40    5           0.25                 25
40 but less than 50    4           0.20                 20
50 but less than 60    2           0.10                 10
Total                  20          1.00                 100

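The construction of this frequency distribution can be sketched in Python as follows. Two choices are assumptions rather than something the slides prescribe: the width is rounded up to the next whole number, and the first class starts at the smallest value rounded down to a multiple of the width.

```python
import math

# Daily high temperatures from the example (20 winter days).
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

k = 5                                      # number of classes chosen above
data_range = max(temps) - min(temps)       # 58 - 12 = 46
width = math.ceil(data_range / k)          # 46 / 5 = 9.2, rounded up to 10
start = (min(temps) // width) * width      # assumed lower boundary: 10

# Count how many observations fall in each class [lo, lo + width).
freq = [0] * k
for t in temps:
    freq[(t - start) // width] += 1

n = len(temps)
for i, f in enumerate(freq):
    lo = start + i * width
    print(f"{lo} but less than {lo + width}: {f:>2}  {f / n:.2f}  {100 * f / n:.0f}%")
```

Because each class includes its lower boundary and excludes its upper boundary, no observation can be counted twice.
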
Histograms
►A histogram is a graphical representation of a frequency distribution.
►A histogram is a bar chart with one bar per class and no gaps between the bars.
►The Y-axis (vertical) shows the frequency within each class.
►The X-axis (horizontal) ticks show the endpoints of each class.

Histograms
Interval               Frequency
10 but less than 20    3
20 but less than 30    6
30 but less than 40    5
40 but less than 50    4
50 but less than 60    2
Total                  20

[Figure: "Histogram: Daily High Temperature". X-axis: Temperature in Degrees (0 to 60 in steps of 10); Y-axis: Frequency. Bar heights 3, 6, 5, 4, 2 for the classes from 10 to 60, and 0 elsewhere. No gaps between bars.]

The Shapes of Histograms

The Consequences of Too Few or
Too Many Classes
► Wide classes result in few class intervals:
  ► Can hide important patterns.
  ► Gives a “blocky” distribution graph.
  ► Summarizes the data too much.
  ► Tells us little about the true distribution shape.

[Figure: "Weight Distribution" histogram with three classes: [8, 51], (51, 94], (94, 137]. Frequency axis from 0 to 9.]

The Consequences of Too Few or
Too Many Classes
► Too many narrow classes also have consequences:
  ► Results in a “jagged” histogram.
  ► Some classes may be empty.
  ► Does not summarize the data enough.

[Figure: histogram of weight with many narrow classes (bin=13, start=8, width=7). X-axis: weight (0 to 100); Y-axis: Frequency (0 to 4).]

The Ogive

►A cumulative relative frequency distribution totals the proportion of observations that are less than or equal to the class at which you are looking.
  ►It shows the accumulated proportion as values vary from low to high.
►The ogive is a line graph that plots the cumulative relative frequency distribution.

The Cumulative Frequency Distribution

Interval               Frequency   Relative    Percentage   Cumulative   Cumulative
                                   Frequency                Frequency    Percentage
10 but less than 20    3           0.15        15           3            15
20 but less than 30    6           0.30        30           9            45
30 but less than 40    5           0.25        25           14           70
40 but less than 50    4           0.20        20           18           90
50 but less than 60    2           0.10        10           20           100
Total                  20          1.00        100

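The cumulative columns are running totals of the frequency column. A minimal Python sketch follows; placing each ogive point at the upper boundary of its class, and starting the line at 0% at the lowest boundary, is a common convention assumed here rather than something stated on the slides.

```python
from itertools import accumulate

# Class frequencies and upper class boundaries from the table above.
frequencies = [3, 6, 5, 4, 2]
upper_bounds = [20, 30, 40, 50, 60]

n = sum(frequencies)                          # 20 observations
cum_freq = list(accumulate(frequencies))      # running totals
cum_pct = [100 * c / n for c in cum_freq]     # cumulative percentages

# Points for the ogive: (class boundary, cumulative percentage).
# Assumption: the line starts at 0% at the lower boundary of the first class.
ogive_points = [(10, 0.0)] + list(zip(upper_bounds, cum_pct))

for boundary, pct in ogive_points:
    print(f"less than {boundary}: {pct:.0f}%")
```

Reading a point such as (40, 70.0) from the ogive says that 70% of the days had a high temperature below 40 degrees.
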
The Ogive Graphing Cumulative Frequencies

[Figure: "Ogive: Daily High Temperature". X-axis: temperature (10 to 60); Y-axis: Cumulative Percentage (0 to 100).]

Stem-and-Leaf Diagram

► A simple way to see distribution details in a data set.
► Method: separate the sorted data series into leading digits (the stem) and trailing digits (the leaves).
► By listing all of the leaves to the right of each stem, we can graphically describe how the data are distributed.
  ► All the original data points are visible on the display.
  ► Easy to construct by hand.
  ► Provides a histogram-like view of the distribution.

Stem-and-Leaf Diagram

Data in ordered array:
21, 24, 24, 26, 27, 27, 30, 32, 38, 41

Here, use the 10’s digit for the stem unit.
Completed stem-and-leaf diagram:

Stem   Leaves
2      1 4 4 6 7 7
3      0 2 8
4      1

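For two-digit data like this, splitting each value into stem and leaf is an integer division by 10. The sketch below is illustrative only; the variable names are not from the slides.

```python
from collections import defaultdict

data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]

# Map each stem (10's digit) to the list of its leaves (1's digits).
stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)
    stems[stem].append(leaf)

for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
```

The printed rows match the completed diagram, and the length of each row gives the class frequency.
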
Stem-and-Leaf Diagram

► Using the 100’s digit as the stem:
  ► Round off the 10’s digit to form the leaves

Value   Stem   Leaf
613     6      1
729     7      3
800     8      0
1221    12     2

Stem-and-Leaf Diagram

► Using the 100’s digit as the stem:
► Data:
613, 632, 658, 717, 722, 750, 776, 827, 841, 859, 863, 891,
894, 906, 928, 933, 955, 982, 1034, 1047, 1056, 1140, 1169, 1224
► The completed stem-and-leaf display:

Stem   Leaf
6      1, 3, 6
7      2, 2, 5, 8
8      3, 4, 6, 6, 9
9      6, 8
10     3, 5, 6
11     4, 7
12     2

Stem-and-Leaf Diagram
► The stem-and-leaf display can reveal central tendency (the largest number of values fall in the 8 stem, i.e., 800 to 899) as well as dispersion (the range is from 613 to 1224).
► In this illustration, the leaf digits have been sorted, although this is not necessary.

Frequency   Stem   Leaf
3           6      1, 3, 6
4           7      2, 2, 5, 8
5           8      3, 4, 6, 6, 9
2           9      6, 8
3           10     3, 5, 6
2           11     4, 7
1           12     2

Dot Plots
►A dot plot is the simplest graphical display of n
individual values of numerical data.
►Easy to understand.
►It reveals dispersion, central tendency, and the
shape of the distribution.

►If more than one data value lies at about the same
axis location, the dots are stacked vertically.

Dot Plots

►The range is from 0 to 7.
►The highest frequency occurs at 3.

Graphs for Time-Series Data
► A line chart (time-series plot) is used to show the values of a
variable over time

► Time is measured on the horizontal axis (X)

► The variable of interest is measured on the vertical axis (Y)

Relationships Between Variables

► The graphs illustrated so far have involved only a single variable.
► When two variables are involved, other techniques are used:
  ► Categorical (qualitative) variables ➔ Cross tables
  ► Numerical (quantitative) variables ➔ Scatter plots

Cross Tables

► Cross tables (or contingency tables) list the number of observations for every combination of values of two categorical or ordinal variables.
► If there are r categories for the first variable (rows) and c categories for the second variable (columns), the table is called an r x c cross table.
► Tools: PivotTables

Cross Tables
► 4 x 3 Cross Table for Investment Portfolios by Investor (values in millions of VND)

               Investor A   Investor B   Investor C   Total
Savings        25           40           5            70
Stock market   31           10           28           69
Bond market    10           20           40           70
Insurance      0            5            20           25
Total          66           75           93           234

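The row and column totals of a cross table are simple sums over its cells. The following Python sketch uses a plain dictionary as a stand-in for the table; the structure and names are illustrative, and a spreadsheet PivotTable would produce the same totals.

```python
investors = ["Investor A", "Investor B", "Investor C"]

# Rows: investment type; columns: investors (values in millions of VND).
table = {
    "Savings":      [25, 40, 5],
    "Stock market": [31, 10, 28],
    "Bond market":  [10, 20, 40],
    "Insurance":    [0, 5, 20],
}

row_totals = {kind: sum(values) for kind, values in table.items()}
col_totals = [sum(column) for column in zip(*table.values())]
grand_total = sum(row_totals.values())

for kind, values in table.items():
    print(f"{kind:<14}", *values, row_totals[kind])
print(f"{'Total':<14}", *col_totals, grand_total)
```

The sum of the row totals and the sum of the column totals must both equal the grand total, which is a quick check on the table.
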
Cross Tables

[Figure: "Investment Portfolio". Clustered bar chart with one bar per investor (Investor A, Investor B, Investor C) for each of Savings, Stock market, Bond market, Insurance. Vertical axis from 0 to 45.]

Scatter Plots
► Scatter plots can convey patterns in data pairs that would not be apparent from a table.
► A scatter plot is a starting point for bivariate data analysis, in which we investigate the association and relationship between two quantitative variables.
► The dependent variable, which is placed on the vertical axis of the scatter plot, is influenced by changes in the independent variable, which is placed on the horizontal axis.

Scatter Plots
Happiness Index   GDP Per Capita ($US)
9                 40,000
3                 10,230
4                 12,939
3                 9,383
6                 28,300
2                 4,000
10                65,000
3                 9,999
9                 33,200
4                 9,311
5                 15,494
8                 32,030

[Figure: scatter plot "Happiness and GDP Per Capita". X-axis: Happiness Index (0 to 12); Y-axis: GDP per capita (0 to 70,000).]

Scatter Plots
► The figure shows a scatter plot with Happiness Index on the X-axis and GDP per capita on the Y-axis.
► In this illustration, there seems to be an association between X and Y.
  ► That is, nations with a higher happiness level tend to have higher GDP per capita (and vice versa).
► No cause-and-effect relationship is implied because, in this example, both variables could be influenced by a third variable that is not mentioned (e.g., population).

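One rough way to see the association without drawing the plot is to compare average GDP per capita for nations with high and low happiness scores. The sketch below does this in Python; the cut-off of 6 is an arbitrary value chosen for illustration and does not come from the slides.

```python
from statistics import mean

# (Happiness Index, GDP per capita in $US) pairs from the scatter plot data.
pairs = [(9, 40000), (3, 10230), (4, 12939), (3, 9383), (6, 28300), (2, 4000),
         (10, 65000), (3, 9999), (9, 33200), (4, 9311), (5, 15494), (8, 32030)]

THRESHOLD = 6  # hypothetical cut-off between "high" and "low" happiness

high = [gdp for happiness, gdp in pairs if happiness >= THRESHOLD]
low = [gdp for happiness, gdp in pairs if happiness < THRESHOLD]

print(f"Mean GDP per capita, index >= {THRESHOLD}: {mean(high):,.0f}")
print(f"Mean GDP per capita, index <  {THRESHOLD}: {mean(low):,.0f}")
```

A higher mean in the first group is consistent with the upward pattern in the scatter plot, but, as noted above, it says nothing about cause and effect.
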
Scatter Plots
► [Figure: some scatter plot patterns similar to those that you might observe when you have a sample of (X, Y) data pairs.]

Exercise

► Review Session 1, Online Quiz 2.
► Read Chapter 3: Descriptive statistics.
