Types of Data
Data are either categorical or numerical; numerical data are either discrete or continuous.
Categorical (defined categories or groups)
Examples:
■ Marital Status
■ Are you registered to vote?
■ Eye Color
Numerical: Discrete (counted items)
Examples:
■ Number of Children
■ Defects per hour
Numerical: Continuous (measured characteristics)
Examples:
■ Weight
■ Voltage
Graphical Presentation of Data
• Data in raw form are usually not easy to use for decision making
• Some type of organization is needed
• Table
• Graph
• The type of graph to use depends on the variable being summarized.
Graphical Presentation of Data
• Categorical Variables
• Numerical Variables
Categorical Data
• Frequency Distribution Table
• Bar Chart
• Pie Chart
• Pareto Diagram
DESCRIBING CATEGORICAL DATA
Charts of Categorical Data
• Unless you need to know exact counts, charts are better than tables for summarizing
more than five categories
• The two most common displays of a categorical variable are a bar chart and a pie
chart
• Both describe a categorical variable by displaying its frequency table.
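Since both charts display a frequency table, it may help to see how such a table can be built. The sketch below is a minimal illustration, not a prescribed method: the function name `frequency_table` and the eye-color responses are invented for this example.

```python
from collections import Counter

def frequency_table(values):
    """Return (category, count, relative frequency) rows in first-seen order."""
    counts = Counter(values)
    total = sum(counts.values())
    return [(category, count, count / total) for category, count in counts.items()]

# Hypothetical eye-color responses, invented for illustration only.
eye_colors = ["Brown", "Blue", "Brown", "Green", "Brown", "Blue", "Brown", "Hazel"]

table = frequency_table(eye_colors)
for category, count, share in table:
    # Left-align the label, right-align the count and the percentage.
    print(f"{category:<6} {count:>3} {share:>6.1%}")
```

The counts are what a bar chart would plot as bar heights; the relative frequencies are what a pie chart would plot as wedge sizes.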
BAR CHART
Charts of Categorical Data
[Figure annotation: Emphasise bigger hosts]
Charts of Categorical Data
• A bar chart is called a Pareto chart when the categories are sorted by frequency (popular in quality control)
• The x-axis need not show categories: at times it may represent time (years, for example), as in a chart of an organization's employees over the years.
PIE CHART
Charts of Categorical Data
• A pie chart is less useful than a bar chart if we want to compare actual counts (bars are easier to compare than the angles of wedges)
Pie chart of recruiters
PARETO DIAGRAM
Pareto Diagram
Source of Manufacturing Error    Number of defects
Bad Weld                          34
Poor Alignment                   223
Missing Part                      25
Paint Flaw                        78
Electrical Short                  19
Cracked case                      21
Total                            400
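A Pareto diagram sorts the categories by frequency and usually adds a cumulative percentage. The sketch below computes both for the defect counts in the table above; the function name `pareto_rows` is an assumption made for this example.

```python
# Defect counts copied from the table above.
defects = {
    "Bad Weld": 34,
    "Poor Alignment": 223,
    "Missing Part": 25,
    "Paint Flaw": 78,
    "Electrical Short": 19,
    "Cracked case": 21,
}

def pareto_rows(counts):
    """Return (source, count, percent, cumulative percent), largest count first."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for source, count in ordered:
        running += count
        rows.append((source, count, 100 * count / total, 100 * running / total))
    return rows

rows = pareto_rows(defects)
for source, count, pct, cum in rows:
    print(f"{source:<18}{count:>5}{pct:>8.2f}{cum:>8.2f}")
```

With these counts, Poor Alignment alone accounts for more than half of the 400 defects, which is the kind of "vital few" pattern a Pareto diagram is meant to expose.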
Pareto Diagram Example
(continued)
The Area Principle
The Fundamental Rule for Data Displays
• The area occupied by a part of the graph/chart that displays data should be
proportional to the amount of data it represents.
• Charts decorated to attract attention often violate the area principle.
The Area Principle
An Example Violating the Area Principle
The Area Principle
The Same Example Respecting the Area Principle
Mode and Median
Mode
• Category with the highest frequency
• The longest bar in a bar chart
• The widest slice in a pie chart
• Two or more categories can tie for the highest frequency (bimodal or multimodal)
Median
• Not appropriate for nominal data
• Data must be ordinal
• It is the category label of the middle observation in ordered data
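The two definitions above can be sketched in a few lines. This is an illustration under stated assumptions: the names `modes` and `ordinal_median`, the rating scale, and the ratings themselves are hypothetical, and for an even number of observations this sketch arbitrarily reports the lower of the two middle categories.

```python
from collections import Counter

def modes(values):
    """Return every category that ties for the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [category for category, count in counts.items() if count == top]

def ordinal_median(values, order):
    """Return the category of the middle observation once values are ordered.

    `order` lists the categories from lowest to highest. With an even number
    of observations, the lower of the two middle observations is used.
    """
    rank = {category: position for position, category in enumerate(order)}
    ranked = sorted(values, key=rank.__getitem__)
    return ranked[(len(ranked) - 1) // 2]

# Hypothetical ordinal ratings, invented for illustration only.
scale = ["Poor", "Fair", "Good", "Excellent"]
ratings = ["Good", "Fair", "Excellent", "Good", "Poor", "Good", "Fair"]

print(modes(ratings))                  # categories with the highest frequency
print(ordinal_median(ratings, scale))  # category of the middle observation
```

The median needs the `scale` argument because, without an agreed ordering, there is no "middle" observation; this is why it does not apply to nominal data.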
Self exercise
• A line chart (time-series plot) is used to show the values of a variable over time.
• Time is measured on the horizontal axis.
• The variable of interest is measured on the vertical axis.
Graphs to describe numerical variables
Numerical Data
Histograms and the Distribution of Numerical Data
• A histogram is similar to a bar chart but is designed for continuous quantitative data (bar charts are only appropriate for discrete categories)
Class Intervals and Class Boundaries
Histograms in Excel
1. Select the Data tab
2. Click on Data Analysis
3. Choose Histogram
4. Input the data range and the bin range (the bin range is a cell range containing the upper interval endpoints for each class grouping)
▪ The graph devotes more than half of its area to show less than 1% of the songs
(white space rule: graphs with mostly white space can be improved by changing the
interval of the plot to focus on the data rather than the white space)
Questions for Grouping Data into Intervals
1. How wide should each interval be? (How many classes should be used?)
2. How should the endpoints of the intervals be determined?
• Often answered by trial and error, subject to user judgment
• The goal is to create a distribution that is neither too "jagged" nor too "blocky"
• The goal is to appropriately show the pattern of variation in the data
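The trial-and-error step can be sketched by counting the same data under several interval widths and comparing the results. The example below uses the 24 values from this document's stem-and-leaf example; the function name `class_counts`, the lower endpoint 600, and the widths tried are assumptions chosen for illustration, and the code assumes integer data with `lower` no larger than the smallest value.

```python
def class_counts(data, lower, width):
    """Count values in [lower, lower + width), [lower + width, lower + 2*width), ..."""
    n_classes = (max(data) - lower) // width + 1
    counts = [0] * n_classes
    for x in data:
        counts[(x - lower) // width] += 1
    return [(lower + i * width, lower + (i + 1) * width, c)
            for i, c in enumerate(counts)]

data = [613, 632, 658, 717, 722, 750, 776, 827, 841, 859, 863, 891,
        894, 906, 928, 933, 955, 982, 1034, 1047, 1056, 1140, 1169, 1224]

# Try a narrow, a medium, and a wide interval and compare the shapes.
for width in (50, 100, 200):
    print(f"width = {width}")
    for start, end, count in class_counts(data, 600, width):
        print(f"  {start:>4} to under {end:<4} {'*' * count}")
```

Narrow intervals tend to give a jagged picture with many small counts, while wide intervals give a blocky one; the middle choice is often, though not always, the most readable.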
How Many Class Intervals?
Boxplot
• The median splits the area of the histogram in half (unlike the mean, it is resistant or
robust to the effects of outliers)
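The robustness of the median can be checked numerically. The sketch below uses a small sample of ten values and then replaces the largest value with an extreme one; the value 410 and the helper name `five_number_summary` are assumptions for illustration. Quartile conventions differ between textbooks and software, so the quartiles from `statistics.quantiles` may not match those of another tool.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max), the values a boxplot is drawn from."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # default 'exclusive' method
    return (min(data), q1, statistics.median(data), q3, max(data))

values = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]
with_outlier = values[:-1] + [410]  # hypothetical outlier replacing 41

for label, sample in (("original", values), ("with outlier", with_outlier)):
    print(label,
          "mean =", statistics.mean(sample),
          "median =", statistics.median(sample))

print(five_number_summary(values))
```

The outlier drags the mean upward while the median stays where it was, which is why the boxplot is built around the median and the quartiles.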
Boxplot with Histogram of Song Sizes
STEM AND LEAF DISPLAY
Stem-and-Leaf Diagram
Data in ordered array: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41
• 38 is shown as stem 3, leaf 8 (3 | 8)
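A stem-and-leaf display for the ordered array above can be produced with a short function. This is a minimal sketch: the name `stem_and_leaf` and its `leaf_unit` parameter are assumptions for this example, the code assumes non-negative integers, and stems with no leaves are simply omitted.

```python
from collections import defaultdict

def stem_and_leaf(data, leaf_unit=1):
    """Map each stem to its string of leaves.

    Values are rounded to the nearest `leaf_unit`; the last digit of the
    rounded value is the leaf and the remaining digits form the stem.
    """
    rows = defaultdict(list)
    for x in sorted(data):
        scaled = (x + leaf_unit // 2) // leaf_unit  # round to nearest leaf unit
        rows[scaled // 10].append(scaled % 10)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(rows.items())}

ages = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]

display = stem_and_leaf(ages)
for stem, leaves in display.items():
    print(f"{stem:>2} | {leaves}")
```

Passing `leaf_unit=10` rounds each value to the nearest ten before splitting it, which is one way to handle three- and four-digit data with hundreds as the stem unit.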
Using other stem units
(continued)
Data: 613, 632, 658, 717, 722, 750, 776, 827, 841, 859, 863, 891, 894, 906, 928, 933, 955, 982, 1034, 1047, 1056, 1140, 1169, 1224
Stem unit: 100; each leaf is the tens digit after rounding to the nearest 10.
Stem | Leaves
   6 | 136
   7 | 2258
   8 | 346699
   9 | 13368
  10 | 356
  11 | 47
  12 | 2
Relationships Between Variables
• Categorical (Qualitative) Variables
• Numerical (Quantitative) Variables
Shape of a Distribution
Modes
• Position of an isolated peak in a histogram
• A histogram with one peak is unimodal; two is bimodal; three or more is multimodal
• A histogram with all bars about the same height is uniform
Shape of a Distribution
• A distribution is symmetric if the two sides of its histogram are mirror images
• A distribution is skewed if one tail of the histogram stretches out farther than the
other
The Empirical Rule
• For an approximately bell-shaped distribution, the interval μ ± 1σ contains about 68% of the values in the population or the sample
• The interval μ ± 2σ contains about 95% of the values in the population or the sample
• The interval μ ± 3σ contains almost all (about 99.7%) of the values in the population or the sample
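The percentages in the empirical rule can be checked on simulated bell-shaped data. The sketch below is an illustration only: the mean of 100, the standard deviation of 15, the sample size, and the function name `coverage` are assumptions, and the observed fractions will be close to, not exactly equal to, 68%, 95%, and 99.7%.

```python
import math
import random

def coverage(data, k):
    """Return the fraction of values within k standard deviations of the mean."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    inside = sum(1 for x in data if abs(x - mu) <= k * sigma)
    return inside / n

random.seed(0)  # fixed seed so repeated runs give the same sample
sample = [random.gauss(100, 15) for _ in range(20_000)]

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage(sample, k):.3f}")
```

Running the same check on strongly skewed data would generally give different fractions, since the rule describes bell-shaped distributions.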
Shape of a Distribution
• Be sure that data are numerical when using histograms and summaries such as the
mean and standard deviation.
• Summarize the distribution of a numerical variable with a graph.
• Choose interval widths appropriate to the data when preparing a histogram.
• Scale your plots to show data, not empty space.
• Anticipate what you will see in a histogram.
• Label clearly.
• Check for gaps.
Pitfalls