
Data Visualization

(fancy name: Data dashboarding)

Types of Data

Data
• Categorical (defined categories or groups)
  Examples:
  ■ Marital status
  ■ Are you registered to vote?
  ■ Eye color
• Numerical
  • Discrete (counted items)
    Examples:
    ■ Number of children
    ■ Defects per hour
  • Continuous (measured characteristics)
    Examples:
    ■ Weight
    ■ Voltage
Graphical Presentation of Data

• Data in raw form are usually not easy to use for decision making
• Some type of organization is needed:
  • Table
  • Graph
• The type of graph to use depends on the variable being summarized
Graphical Presentation of Data

Categorical variables:
• Frequency distribution
• Bar chart
• Pie chart
• Pareto diagram

Numerical variables:
• Line chart
• Frequency distribution
• Histogram
• Box plot
• Stem-and-leaf display
• Scatter plot
Tables and Graphs for Categorical Variables

Categorical Data
• Tabulating data: frequency distribution table
• Graphing data: bar chart, pie chart, Pareto diagram
DESCRIBING CATEGORICAL DATA

Charts of Categorical Data

Bar Charts and Pie Charts

• Unless you need to know exact counts, charts are better than tables for summarizing more than five categories.
• The two most common displays of a categorical variable are the bar chart and the pie chart.
• Both describe a categorical variable by displaying its frequency table.
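
As a rough illustration of the frequency table that both charts display, here is a minimal Python sketch. The eye-color responses are hypothetical and are not taken from the course data; the text bars simply stand in for a drawn bar chart.

```python
from collections import Counter

# Hypothetical survey responses for one categorical variable (eye color).
eye_colors = ["brown", "blue", "brown", "green", "brown", "blue", "hazel", "brown"]

# Frequency table: (category, count) pairs, most frequent first.
frequency_table = Counter(eye_colors).most_common()

# Text bar chart: one '#' per observation, so bar length is proportional to the count.
for category, count in frequency_table:
    print(f"{category:<6} {'#' * count} ({count})")
```

Sorting by count is optional for nominal categories such as eye color; for ordinal categories the natural ordering should be kept instead.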
BAR CHART

Charts of Categorical Data

[Figure: Bar chart (horizontal) of top 10 hosts. Annotation: emphasise bigger hosts.]

[Figure: Bar chart (vertical) of top 10 hosts. Annotation: can be drawn either way.]

A bar chart:

• Uses horizontal or vertical bars to show the distribution of a categorical variable
• Is called a Pareto chart when the categories are sorted by frequency (popular in quality control)
• Becomes cluttered with too many categories
• Is appropriate for ordinal categorical variables, provided the ordering of the categories is preserved

Question: Does every bar chart show the distribution of a categorical variable?

No. At times the x-axis may represent time (years, for example), as in a chart of an organization's number of employees by year.
PIE CHART

Charts of Categorical Data

A pie chart:

• Uses wedges of a circle to show the distribution of a categorical variable
• Is commonly chosen to illustrate market shares or sources of revenue for a company (each category's "share of the pie")
• Is less useful than a bar chart if we want to compare actual counts (it is easier to compare bars than angles of wedges)

[Figure: Pie chart of recruiters]
PARETO DIAGRAM

Pareto Diagram

• Used to portray categorical data


• A bar chart, where categories are shown in descending order of frequency.
• A cumulative polygon is often shown in the same graph.
• Used to separate the “vital few” from the “trivial many”.
Pareto Diagram Example

Example: 400 defective items are examined for cause of defect:

Source of manufacturing error    Number of defects
Bad Weld                          34
Poor Alignment                   223
Missing Part                      25
Paint Flaw                        78
Electrical Short                  19
Cracked Case                      21
Total                            400
Pareto Diagram Example (continued)

Step 1: Sort the causes by number of defects, in descending order
Step 2: Determine the % of total defects in each category

Source of manufacturing error    Number of defects    % of total defects
Poor Alignment                   223                   55.75
Paint Flaw                        78                   19.50
Bad Weld                          34                    8.50
Missing Part                      25                    6.25
Cracked Case                      21                    5.25
Electrical Short                  19                    4.75
Total                            400                  100.00
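
The two steps above can be sketched in a few lines of Python using the defect counts from the example. The cumulative percentage column is an addition for illustration: it is the quantity a cumulative polygon would plot, and the variable names are arbitrary.

```python
# Defect counts from the example above.
defects = {
    "Bad Weld": 34,
    "Poor Alignment": 223,
    "Missing Part": 25,
    "Paint Flaw": 78,
    "Electrical Short": 19,
    "Cracked Case": 21,
}

total = sum(defects.values())

# Step 1: sort categories by frequency, largest first.
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)

# Step 2: percentage per category, plus the running (cumulative) percentage
# that a Pareto diagram often overlays as a line.
pareto_rows = []
cumulative = 0
for cause, count in ordered:
    cumulative += count
    pareto_rows.append((cause, count, 100 * count / total, 100 * cumulative / total))

for cause, count, pct, cum_pct in pareto_rows:
    print(f"{cause:<17}{count:>4}{pct:>8.2f}{cum_pct:>9.2f}")
```

In this data the first two causes (Poor Alignment and Paint Flaw) already account for 75.25% of all defects, which is the "vital few" pattern the diagram is meant to expose.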
AREA PRINCIPLE

The Area Principle
The Fundamental Rule for Data Displays

• The area occupied by a part of the graph/chart that displays data should be
proportional to the amount of data it represents.
• Charts decorated to attract attention often violate the area principle.
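
One common way decorated charts break this rule is by scaling a picture in both width and height. The sketch below is only an illustration under that assumption, with hypothetical values; it is not necessarily the mechanism in the example figures.

```python
# Two hypothetical values to display; the second is twice the first.
small_value, large_value = 10.0, 20.0
ratio = large_value / small_value

# Plain bar chart: constant bar width, height proportional to the value,
# so the area ratio equals the data ratio.
bar_width = 1.0
bar_area_ratio = (bar_width * large_value) / (bar_width * small_value)

# Decorated pictogram (assumed): the picture is scaled in BOTH width and
# height, so its area grows with the square of the value.
picture_area_ratio = ratio * ratio

print(bar_area_ratio)      # 2.0
print(picture_area_ratio)  # 4.0
```

A value that is twice as large ends up looking four times as large, which is why the plain bar respects the area principle and the scaled picture does not.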
[Figure: An example violating the area principle]

[Figure: The same example respecting the area principle]
Mode and Median
Mode
• Category with the highest frequency
• The longest bar in a bar chart
• The widest slice in a pie chart
• Two or more categories can tie with the highest frequency (bimodal or multimodal)

Median
• Not appropriate for nominal data
• Data must be ordinal
• It is the category label of the middle observation in ordered data
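
A minimal sketch of both summaries for ordinal data follows. The rating scale and responses are hypothetical, and taking the lower of the two middle observations when the count is even is a simplifying assumption of this sketch.

```python
from collections import Counter

# Hypothetical ordinal scale, listed from lowest to highest.
scale = ["poor", "fair", "good", "excellent"]
responses = ["poor", "good", "poor", "fair", "excellent", "poor", "good"]

# Mode(s): every category tied for the highest frequency.
counts = Counter(responses)
highest = max(counts.values())
modes = [category for category in scale if counts[category] == highest]

# Median: order the observations by their position on the scale,
# then take the label of the middle observation.
ordered = sorted(responses, key=scale.index)
median = ordered[(len(ordered) - 1) // 2]

print(modes, median)
```

Here the mode is "poor" while the median is "fair", which shows that the two summaries can point to different categories.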
Self exercise

• Look at the iPod song size example from the textbook.


Best Practices

• Use a bar chart to show the frequencies of a categorical variable.


• Use a pie chart to show the proportions of a categorical variable.
• Preserve the ordering of an ordinal variable.
• Respect the area principle.
• Show the best plots to answer the motivating question.
• Label your chart to show the categories and indicate whether some have been
combined or omitted.
Pitfalls

• Avoid elaborate plots that may be deceptive.


• Do not show too many categories.
• Do not put ordinal data in a pie chart.
• Do not carelessly round data.
Graphs for Time-Series Data

• A line chart (time-series plot) is used to show the values of a variable over time.
• Time is measured on the horizontal axis.
• The variable of interest is measured on the vertical axis.
Graphs to describe numerical variables

Numerical Data
• Frequency distributions and cumulative distributions
• Stem-and-leaf display
• Histogram
• Box plots


HISTOGRAMS

Histograms and the Distribution of Numerical Data

• Plot the distribution of a numerical variable by showing counts of values occurring within adjacent intervals
• Similar to bar charts but designed for continuous quantitative data (bar charts are only appropriate for discrete categories)
Class Intervals and Class Boundaries

• Each class grouping has the same width
• Determine the width of each interval by dividing the range of the data (largest value minus smallest value) by the desired number of intervals
• Use at least 5 but no more than 15-20 intervals
• Intervals never overlap
• Round up the interval width to get desirable interval endpoints
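
The rules above can be sketched as follows. This assumes the usual textbook convention that the raw width is the range divided by the number of intervals; the data, the choice of 5 intervals, and rounding up to a multiple of 5 are all hypothetical choices for illustration.

```python
import math

# Hypothetical measurements (any numerical variable would do).
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
num_intervals = 5  # chosen by the analyst; an assumption for this sketch

# Raw width = range / number of intervals, then rounded UP to a convenient value.
raw_width = (max(data) - min(data)) / num_intervals   # (58 - 12) / 5 = 9.2
width = math.ceil(raw_width / 5) * 5                   # next multiple of 5 -> 10

# Start at a convenient endpoint at or below the smallest value.
start = math.floor(min(data) / width) * width          # 10

# Count values in equal-width, non-overlapping intervals [lower, upper).
# For this data the 5 intervals cover every value; other data may need one more.
counts = [0] * num_intervals
for value in data:
    counts[int((value - start) // width)] += 1

for i, count in enumerate(counts):
    lower = start + i * width
    print(f"{lower} but less than {lower + width}: {count}")
```

Using half-open intervals (lower endpoint included, upper endpoint excluded) is one simple way to guarantee that intervals never overlap and every value falls in exactly one class.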
Histograms in Excel

1. Select the Data tab
2. Click on Data Analysis
3. Choose Histogram
4. Input the data range and bin range (the bin range is a cell range containing the upper interval endpoints for each class grouping)
5. Select Chart Output and click “OK”
[Figure: Histogram of song sizes]

Histograms and the Distribution of Numerical Data

▪ The histogram indicates a few very long songs (outliers)
▪ The graph devotes more than half of its area to showing less than 1% of the songs (white space rule: graphs with mostly white space can be improved by changing the interval of the plot to focus on the data rather than the white space)
Questions for Grouping Data into Intervals

1. How wide should each interval be? (How many classes should be used?)
2. How should the endpoints of the intervals be determined?
• Often answered by trial and error, subject to user judgment
• The goal is to create a distribution that is neither too "jagged" nor too "blocky"
• The goal is to appropriately show the pattern of variation in the data
How Many Class Intervals?

• Many (narrow class intervals)
  • May yield a very jagged distribution with gaps from empty classes
  • Can give a poor indication of how frequency varies across classes
• Few (wide class intervals)
  • May compress variation too much and yield a blocky distribution
  • Can obscure important patterns of variation

(X-axis labels are upper class endpoints)
BOX PLOTS

Boxplot

Combining Boxplots with Histograms

• Boxplots locate the median and quartiles and highlight outliers

• The median splits the area of the histogram in half (unlike the mean, it is resistant or
robust to the effects of outliers)
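
A small sketch of the quantities a boxplot locates, using hypothetical song sizes with one deliberately extreme value. Flagging outliers as points more than 1.5 times the interquartile range beyond the quartiles is a common boxplot convention assumed here; it is not stated in these slides.

```python
import statistics

# Hypothetical song sizes in MB; the last value is deliberately extreme.
sizes = [2.1, 2.8, 3.0, 3.3, 3.5, 3.6, 3.9, 4.2, 4.4, 4.8, 21.0]

# Quartiles (Python's default "exclusive" method; other software may differ slightly).
q1, median, q3 = statistics.quantiles(sizes, n=4)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are shown as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in sizes if x < lower_fence or x > upper_fence]

# Resistance of the median: drop the extreme song and compare both summaries.
trimmed = sizes[:-1]
mean_shift = statistics.mean(sizes) - statistics.mean(trimmed)
median_shift = statistics.median(sizes) - statistics.median(trimmed)

print(q1, median, q3, outliers)
print(round(mean_shift, 3), round(median_shift, 3))
```

The single large song moves the mean far more than it moves the median, which is what "resistant" means in the bullet above.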
[Figure: Boxplot with histogram of song sizes]
STEM AND LEAF DISPLAY

Stem-and-Leaf Diagram

• A simple way to see distribution details in a data set

METHOD: Separate the sorted data series into leading digits (the stem) and trailing digits (the leaves)
Example

Data in ordered array: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41

• Here, use the 10’s digit for the stem unit:

                   Stem   Leaf
• 21 is shown as     2      1
• 38 is shown as     3      8
Using other stem units

• Using the 100’s digit as the stem:

Data: 613, 632, 658, 717, 722, 750, 776, 827, 841, 859, 863, 891, 894, 906, 928, 933, 955, 982, 1034, 1047, 1056, 1140, 1169, 1224

– The completed stem-and-leaf display:

Stem   Leaves
  6    136
  7    2258
  8    346699
  9    13368
 10    356
 11    47
 12    2
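
The display above can be reproduced with a short Python sketch. The rounding rule is inferred from the display rather than stated in the slides: each value appears to be rounded to the nearest ten before the tens digit is used as the leaf (for example, 658 is shown with leaf 6, not 5).

```python
from collections import defaultdict

data = [613, 632, 658, 717, 722, 750, 776, 827, 841, 859, 863, 891,
        894, 906, 928, 933, 955, 982, 1034, 1047, 1056, 1140, 1169, 1224]

stems = defaultdict(str)
for value in sorted(data):
    # Inferred rule: round to the nearest ten (halves up), e.g. 658 -> 66.
    rounded_tens = (value + 5) // 10
    # Split into stem (hundreds and above) and leaf (tens digit): 66 -> 6 and 6.
    stem, leaf = divmod(rounded_tens, 10)
    stems[stem] += str(leaf)

for stem in sorted(stems):
    print(f"{stem:>2} | {stems[stem]}")
```

Truncating instead of rounding would be an equally reasonable convention, but it would give slightly different leaves from the ones shown.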
Relationships Between Variables

• Graphs illustrated so far have involved only a single variable


• When two variables exist other techniques are used:

• Categorical (qualitative) variables: cross tables
• Numerical (quantitative) variables: scatter plots


SHAPE OF A DISTRIBUTION

Shape of a Distribution

Modes
• Position of an isolated peak in a histogram
• A histogram with one peak is unimodal; two is bimodal; three or more is multimodal
• A histogram with all bars about the same height is uniform
Shape of a Distribution

Symmetry and Skewness

• A distribution is symmetric if the two sides of its histogram are mirror images

• A distribution is skewed if one tail of the histogram stretches out farther than the
other
Empirical Rule

• A bell-shaped distribution is symmetric and unimodal.


• The empirical rule uses the standard deviation to describe how data with a
bell-shaped distribution cluster around the mean.

• If the data distribution is bell-shaped, then:
  • The interval μ ± 1σ contains about 68% of the values in the population or the sample
  • The interval μ ± 2σ contains about 95% of the values in the population or the sample
  • The interval μ ± 3σ contains almost all (about 99.7%) of the values in the population or the sample
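
As an informal check of these percentages, the sketch below simulates bell-shaped data and counts the share of values within one, two and three standard deviations of the mean. The normal distribution, its parameters (mean 50, standard deviation 10), the sample size, and the seed are all arbitrary assumptions.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated bell-shaped data: 100,000 draws from a normal distribution
# with hypothetical mean 50 and standard deviation 10.
values = [random.gauss(50, 10) for _ in range(100_000)]

mean = statistics.fmean(values)
sd = statistics.pstdev(values)

def share_within(k):
    """Fraction of values inside mean ± k standard deviations."""
    return sum(abs(v - mean) <= k * sd for v in values) / len(values)

for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```

The printed shares should land close to 0.68, 0.95 and 0.997. Running the same check on clearly skewed data would generally give different shares, since the rule only describes bell-shaped distributions.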


Best Practices

• Be sure that data are numerical when using histograms and summaries such as the
mean and standard deviation.
• Summarize the distribution of a numerical variable with a graph.
• Choose interval widths appropriate to the data when preparing a histogram.
• Scale your plots to show data, not empty space.
• Anticipate what you will see in a histogram.
• Label clearly.
• Check for gaps.
Pitfalls

• Do not assume that all numerical data have a bell-shaped distribution.


• Do not ignore the presence of outliers.
• Do not remove outliers unless you have a good reason.
• Do not forget to take the square root of a variance.
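
The last pitfall can be made concrete with a tiny sketch. The weights are hypothetical, and the population formulas are used here for simplicity; `statistics.variance` and `statistics.stdev` are the sample versions.

```python
import math
import statistics

# Hypothetical weights in kg.
weights = [62.0, 70.0, 74.0, 78.0, 86.0]

variance = statistics.pvariance(weights)  # in squared units (kg^2)
std_dev = math.sqrt(variance)             # back in the original units (kg)

print(variance, std_dev)
```

The variance (64 kg²) is not on the same scale as the data, so it cannot be plugged into an interval such as μ ± 1σ; the standard deviation (8 kg) can.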
