Note 2

DS 535: ADVANCED DATA MINING FOR BUSINESS
Lecture Notes #2: Data Visualization
(Textbook reading: Chapter 3)
Graphs for Data Exploration
1. Basic Plots
• Line Graphs
• Bar Charts
• Scatterplots
2. Distribution Plots
• Boxplots
• Histograms
Line Graph for Time Series

Amtrak Ridership
Amtrak, a US railway company, routinely collects data on ridership. Here we focus on forecasting future
ridership using the series of monthly ridership between January 1991 and March 2004. The data and their
source are described in Chapter 16. Hence our task here is (numerical) time series forecasting.
[Figure: time plot of monthly Amtrak ridership; x-axis: Year, y-axis: Ridership (in 000s)]
Amtrak.df <- read.csv("Amtrak data.csv")
# use time series analysis
library(forecast)
ridership.ts <- ts(Amtrak.df$Ridership, start = c(1991, 1), end = c(2004, 3), freq = 12)
plot(ridership.ts, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))
Bar Chart for Categorical Variable

Boston Housing Data
The Boston Housing data contain information on census tracts in Boston for which several measurements
are taken (e.g., crime rate, pupil/teacher ratio). Each record in the dataset has 14 variables:
CRIM     per capita crime rate by town
ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS    proportion of non-retail business acres per town
CHAS     Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX      nitric oxides concentration (parts per 10 million)
RM       average number of rooms per dwelling
AGE      proportion of owner-occupied units built prior to 1940
DIS      weighted distances to five Boston employment centres
RAD      index of accessibility to radial highways
TAX      full-value property-tax rate per $10,000
PTRATIO  pupil-teacher ratio by town
B        1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
LSTAT    % lower status of the population
MEDV     Median value of owner-occupied homes in $1000
Average median neighborhood value for neighborhoods that do and do not border the Charles River
[Figure: bar chart of Avg. MEDV by CHAS (0, 1)]
## barchart of CHAS vs. mean MEDV
## Boston housing data
housing.df <- read.csv("BostonHousing.csv")
# compute mean MEDV per CHAS = (0, 1)

data.for.plot <- aggregate(housing.df$MEDV, by = list(housing.df$CHAS), FUN = mean)
names(data.for.plot) <- c("CHAS", "MeanMEDV")
barplot(data.for.plot$MeanMEDV, names.arg = data.for.plot$CHAS, xlab = "CHAS", ylab =
"Avg. MEDV")
[Figure: bar chart of % of CAT.MEDV by CHAS (0, 1)]
## barchart of CHAS vs. % CAT.MEDV

data.for.plot <- aggregate(housing.df$CAT..MEDV, by = list(housing.df$CHAS), FUN = mean)
names(data.for.plot) <- c("CHAS", "MeanCATMEDV")
barplot(data.for.plot$MeanCATMEDV * 100, names.arg = data.for.plot$CHAS,
xlab = "CHAS", ylab = "% of CAT.MEDV")
Scatterplot
Displays relationship between two numerical variables
## Boston housing data

housing.df <- read.csv("BostonHousing.csv")
## scatter plot with axes names

# the formula is y ~ x, so MEDV is on the y-axis and LSTAT on the x-axis
plot(housing.df$MEDV ~ housing.df$LSTAT, xlab = "LSTAT", ylab = "MEDV")
[Figure: scatter plot of MEDV (y-axis) vs. LSTAT (x-axis)]
# alternative plot with ggplot

library(ggplot2)
ggplot(housing.df) + geom_point(aes(x = LSTAT, y = MEDV), colour = "navy", alpha = 0.7)
Histograms
Display “how many” of each value occur in a data set or, for continuous data or data with many
possible values, “how many” values fall in each of a series of ranges or “bins”.
## histogram of MEDV
hist(housing.df$MEDV, xlab = "MEDV")
[Figure: histogram of MEDV; x-axis: MEDV, y-axis: count]
# alternative plot with ggplot
library(ggplot2)
ggplot(housing.df) + geom_histogram(aes(x = MEDV), binwidth = 5)
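The binning behind a histogram can be made explicit with a short sketch. The data below are synthetic (an assumption, standing in for a numerical variable such as MEDV), and the names x, breaks, bin.counts, and h are illustrative only. The sketch counts the values per bin by hand with cut() and table(), then compares the result with the counts that hist() computes.

```r
# Synthetic values standing in for a numerical variable (assumption, not the housing data)
set.seed(1)
x <- runif(200, min = 5, max = 50)

# Bin edges of width 5, analogous to binwidth = 5 in geom_histogram()
breaks <- seq(0, 50, by = 5)

# Count how many values fall in each bin (intervals closed on the right)
bin.counts <- table(cut(x, breaks = breaks, right = TRUE))
print(bin.counts)

# hist() computes the same counts; plot = FALSE returns them without drawing
h <- hist(x, breaks = breaks, plot = FALSE)
print(h$counts)
```

Changing the bin width in breaks changes the counts and therefore the shape of the histogram, so it is worth trying a few widths before reading too much into a single display.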
Boxplots
## boxplot of MEDV for different values of CHAS

boxplot(housing.df$MEDV ~ housing.df$CHAS, xlab = "CHAS", ylab = "MEDV")
[Figure: boxplots of MEDV by CHAS (0, 1)]
# alternative plot with ggplot
ggplot(housing.df) + geom_boxplot(aes(x = as.factor(CHAS), y = MEDV)) + xlab("CHAS")
[Figure 3.3: side-by-side boxplots of NOX, LSTAT, PTRATIO, and INDUS by CAT.MEDV (0, 1)]
#### Figure 3.3
## side-by-side boxplots
# use par() to split the plots into panels.
par(mfcol = c(1, 4))
boxplot(housing.df$NOX ~ housing.df$CAT..MEDV, xlab = "CAT.MEDV", ylab = "NOX")
boxplot(housing.df$LSTAT ~ housing.df$CAT..MEDV, xlab = "CAT.MEDV", ylab = "LSTAT")
boxplot(housing.df$PTRATIO ~ housing.df$CAT..MEDV, xlab = "CAT.MEDV", ylab =
"PTRATIO")
boxplot(housing.df$INDUS ~ housing.df$CAT..MEDV, xlab = "CAT.MEDV", ylab = "INDUS")
Heat Maps
• Color conveys information
• In data mining, used to visualize correlations and missing data
A heatmap is a graphical display of numerical data where color is used to denote values. In a data mining
context, heatmaps are especially useful for two purposes: for visualizing correlation tables and for visualizing
missing values in the data. In both cases the information is conveyed in a two-dimensional table. A correlation
table for p variables has p rows and p columns. A data table contains p columns (variables) and n rows
(observations). If the number of rows is huge, then a subset can be used. In both cases, it is much easier and
faster to scan the color-coding than to read the values. Note that heatmaps are useful when examining a large
number of values, but they are not a replacement for more precise graphical displays, such as bar charts,
because color differences cannot be perceived accurately.
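The second use named above, visualizing missing values, can be sketched as follows. This is a minimal illustration under assumptions: it uses the airquality data set that ships with R as a stand-in for a data table with missing values, and the names missing.mat and missing.share are illustrative. Each cell of the table is recoded as 1 (missing) or 0 (present) and the recoded table is drawn as a heatmap.

```r
# airquality ships with R and contains missing values (a stand-in, not the housing data)
data(airquality)

# 1 where a value is missing, 0 otherwise; rows = observations, columns = variables
missing.mat <- 1 * is.na(airquality)

# Share of missing values per variable, as a numeric companion to the picture
missing.share <- colMeans(missing.mat)
print(round(missing.share, 2))

# Draw the table as a heatmap: no reordering, no scaling, white = present, black = missing
heatmap(missing.mat, Rowv = NA, Colv = NA, scale = "none",
        col = c("white", "black"))
```

Columns that show up mostly dark are candidates for dropping or imputation, and dark horizontal bands point to observations with many missing values.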
Heatmap to highlight correlations
#### Figure 3.4

## simple heatmap of correlations (without values)
heatmap(cor(housing.df), Rowv = NA, Colv = NA)
## heatmap with values
library(gplots)
heatmap.2(cor(housing.df), Rowv = FALSE, Colv = FALSE, dendrogram = "none",
cellnote = round(cor(housing.df),2),
notecol = "black", key = FALSE, trace = 'none', margins = c(10,10))
library(ggplot2)
library(reshape) # to generate input for the plot
cor.mat <- round(cor(housing.df),2) # rounded correlation matrix
melted.cor.mat <- melt(cor.mat)
ggplot(melted.cor.mat, aes(x = X1, y = X2, fill = value)) +
geom_tile() +
geom_text(aes(x = X1, y = X2, label = value))
Multidimensional Visualization
Basic plots can convey richer information with features such as color, size, and multiple panels, and by
enabling operations such as rescaling, aggregation, and interactivity.
Scatter plot of two numerical predictors, color-coded by the categorical outcome CAT.MEDV
[Figure 3.6: scatter plot of NOX (y-axis) vs. LSTAT (x-axis); legend: CAT.MEDV = 1, CAT.MEDV = 0]
#### Figure 3.6

## color plot
par(mfcol = c(1,1), xpd=TRUE) # allow legend to be displayed outside of plot area
plot(housing.df$NOX ~ housing.df$LSTAT, ylab = "NOX", xlab = "LSTAT",
col = ifelse(housing.df$CAT..MEDV == 1, "black", "gray"))
# add legend outside of plotting area
# In legend() use argument inset = to control the location of the legend relative
# to the plot.
legend("topleft", inset=c(0, -0.2),
legend = c("CAT.MEDV = 1", "CAT.MEDV = 0"), col = c("black", "gray"),
pch = 1, cex = 0.5)
library(ggplot2)
ggplot(housing.df, aes(y = NOX, x = LSTAT, colour= CAT..MEDV)) +
geom_point(alpha = 0.6)
Boston Housing
Bar chart of MEDV by two categorical predictors (CHAS and RAD), using multiple panels for CHAS
## panel plots
# compute mean MEDV per RAD and CHAS
# In aggregate() use argument drop = FALSE to include all combinations
# (existing and missing) of RAD X CHAS.
data.for.plot <- aggregate(housing.df$MEDV, by = list(housing.df$RAD,
housing.df$CHAS), FUN = mean, drop = FALSE)
names(data.for.plot) <- c("RAD", "CHAS", "meanMEDV")
# plot the data
par(mfcol = c(2,1))
barplot(height = data.for.plot$meanMEDV[data.for.plot$CHAS == 0],
names.arg = data.for.plot$RAD[data.for.plot$CHAS == 0],
xlab = "RAD", ylab = "Avg. MEDV", main = "CHAS = 0")
barplot(height = data.for.plot$meanMEDV[data.for.plot$CHAS == 1],
names.arg = data.for.plot$RAD[data.for.plot$CHAS == 1],
xlab = "RAD", ylab = "Avg. MEDV", main = "CHAS = 1")
ggplot(data.for.plot) +
geom_bar(aes(x = as.factor(RAD), y = `meanMEDV`), stat = "identity") +
xlab("RAD") + facet_grid(CHAS ~ .)
[Figure: bar charts of meanMEDV by RAD, one panel for each value of CHAS (0, 1)]
Matrix Scatterplot
A special plot that uses scatter plots with multiple panels is the scatter plot matrix. In it, all pairwise
scatter plots are shown in a single display. The panels in a matrix scatter plot are organized so that
each column and each row corresponds to a variable, and the intersections create all the possible
pairwise scatter plots. The scatter plot matrix is useful in unsupervised learning for studying the
associations between numerical variables, detecting outliers, and identifying clusters.
[Figure 3.7: scatter plot matrix of CRIM, INDUS, LSTAT, and MEDV]
#### Figure 3.7

## simple plot
# use plot() to generate a matrix of 4X4 panels with variable name on the diagonal,
# and scatter plots in the remaining panels.
plot(housing.df[, c(1, 3, 12, 13)])
# alternative, nicer plot (displayed)
library(GGally)
ggpairs(housing.df[, c(1, 3, 12, 13)])
Most of the time spent in data mining projects is spent in preprocessing. Typically, considerable effort is
expended getting all the data in a format that can actually be used in the data mining software. Additional
time is spent processing the data in ways that improve the performance of the data mining procedures.
Manipulation:
• Rescaling
• Aggregation
• Zooming
The ability to zoom in and out of certain areas of the data on a plot is important for revealing patterns
and outliers.
• Filtering
Filtering means removing some of the observations from the plot.
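The difference between zooming and filtering can be sketched in base R. The data here are synthetic (an assumption), and the names demo.df and filtered.df are illustrative. Zooming keeps all observations and only restricts the displayed axis range, while filtering removes observations before plotting.

```r
# Synthetic data (assumption): a skewed predictor x and an outcome y
set.seed(2)
demo.df <- data.frame(x = rexp(300, rate = 0.3))
demo.df$y <- 30 - 4 * log1p(demo.df$x) + rnorm(300, sd = 2)

par(mfcol = c(1, 3))
# full view of all observations
plot(y ~ x, data = demo.df, main = "All observations")
# zooming: same data, but the x-axis range is restricted with xlim
plot(y ~ x, data = demo.df, xlim = c(0, 2), main = "Zoom: x in [0, 2]")
# filtering: observations that fail the condition are removed before plotting
filtered.df <- subset(demo.df, x <= 2)
plot(y ~ x, data = filtered.df, main = "Filter: x <= 2")
par(mfcol = c(1, 1))
```

The two right-hand panels look similar, but only the filtered version changes the data itself, which matters as soon as summaries or fitted lines are added to the plot.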
Rescaling to log scale

“uncrowds” the data
#### Figure 3.8

options(scipen=999) # avoid scientific notation
## scatter plot: regular and log scale

plot(housing.df$MEDV ~ housing.df$CRIM, xlab = "CRIM", ylab = "MEDV")
# to use logarithmic scale set argument log = to either 'x', 'y', or 'xy'.
plot(housing.df$MEDV ~ housing.df$CRIM, xlab = "CRIM", ylab = "MEDV", log = 'xy')
# alternative log-scale plot with ggplot
library(ggplot2)
ggplot(housing.df) + geom_point(aes(x = CRIM, y = MEDV)) +
scale_x_log10(breaks = 10^(-2:2),
labels = format(10^(-2:2), scientific = FALSE, drop0trailing = TRUE)) +
scale_y_log10(breaks = c(5, 10, 20, 40))
[Figure 3.8: scatter plot of MEDV vs. CRIM, both axes on a log scale]
## boxplot: regular and log scale

boxplot(housing.df$CRIM ~ housing.df$CAT..MEDV,
xlab = "CAT.MEDV", ylab = "CRIM")
boxplot(housing.df$CRIM ~ housing.df$CAT..MEDV,
xlab = "CAT.MEDV", ylab = "CRIM", log = 'y')
[Figure 3.8, continued: boxplots of CRIM by CAT.MEDV (0, 1), on a regular scale and on a log scale]
Amtrak Ridership – Monthly Data – Curve Added
[Figure 3.9: time plot of monthly Amtrak ridership with a fitted quadratic curve; x-axis: Year, y-axis: Ridership (in 000s)]
#### Figure 3.9
library(forecast)
Amtrak.df <- read.csv("Amtrak data.csv")
ridership.ts <- ts(Amtrak.df$Ridership, start = c(1991, 1), end = c(2004, 3), freq = 12)
## fit curve
ridership.lm <- tslm(ridership.ts ~ trend + I(trend^2))
plot(ridership.ts, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))
lines(ridership.lm$fitted, lwd = 2)
Aggregation
Another useful manipulation of scaling is changing the level of aggregation. For a temporal scale, we can
aggregate by different granularity (e.g., monthly, daily, hourly) or even by a “seasonal” factor of interest
such as month-of-year or day-of-week. A popular aggregation for time series is a moving average, where
the average of neighboring values within a given window size is plotted.
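The moving average mentioned above can be sketched with stats::filter(). The series below is synthetic (an assumption: a linear trend plus a fixed seasonal pattern that sums to zero over a year), and the names seasonal, demo.ts, w, and ma.trailing are illustrative. With a window of 12 months, the trailing moving average cancels the seasonal pattern and leaves the trend.

```r
# Synthetic monthly series (assumption): linear trend plus a fixed seasonal pattern
seasonal <- c(-150, -200, 50, 30, 80, 40, 120, 180, -60, 0, -40, -50)  # sums to 0
t.index <- 1:48
demo.ts <- ts(1500 + 2 * t.index + rep(seasonal, times = 4),
              start = c(1991, 1), frequency = 12)

# Trailing moving average: mean of the current month and the previous 11 months
w <- 12
ma.trailing <- stats::filter(demo.ts, filter = rep(1 / w, w), sides = 1)

# Plot the raw series and overlay the smoothed one
plot(demo.ts, xlab = "Year", ylab = "Value")
lines(ma.trailing, lwd = 2)
```

The first w - 1 values of the moving average are NA because a full window is not yet available. Setting sides = 2 instead gives a centered window, which is a common choice for visualization.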
[Figure: three time plots of Amtrak ridership: zoom in to 1991-1992 (x-axis: Year, y-axis: Ridership (in 000s)); average ridership by month of year (x-axis: Month); average annual ridership (x-axis: Year)]
## zoom in, monthly, and annual plots

ridership.2yrs <- window(ridership.ts, start = c(1991,1), end = c(1992,12))
plot(ridership.2yrs, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))
monthly.ridership.ts <- tapply(ridership.ts, cycle(ridership.ts), mean)

plot(monthly.ridership.ts, xlab = "Month", ylab = "Average Ridership",
ylim = c(1300, 2300), type = "l", xaxt = 'n')
## set x labels
axis(1, at = c(1:12), labels = c("Jan","Feb","Mar", "Apr","May","Jun",
"Jul","Aug","Sep", "Oct","Nov","Dec"))
annual.ridership.ts <- aggregate(ridership.ts, FUN = mean)

plot(annual.ridership.ts, xlab = "Year", ylab = "Average Ridership",
ylim = c(1300, 2300))
In displays that are not overcrowded, the use of in-plot labels can be useful for better exploration of outliers
and clusters.
Scatter Plot with Labels (Utilities)

Scaling: Smaller markers, jittering, color contrast (Universal Bank; red = accept loan)
[Figure 3.10: scatter plot of Fuel Cost (y-axis) vs. Sales (x-axis), each point labeled with the company name]
#### Figure 3.10
utilities.df <- read.csv("Utilities.csv")
plot(utilities.df$Fuel_Cost ~ utilities.df$Sales,
xlab = "Sales", ylab = "Fuel Cost", xlim = c(2000, 20000))
text(x = utilities.df$Sales, y = utilities.df$Fuel_Cost,
labels = utilities.df$Company, pos = 4, cex = 0.8, srt = 20, offset = 0.2)
# alternative with ggplot
library(ggplot2)
ggplot(utilities.df, aes(y = Fuel_Cost, x = Sales)) + geom_point() +
geom_text(aes(label = paste(" ", Company)), size = 4, hjust = 0.0, angle = 15) +
ylim(0.25, 2.25) + xlim(3000, 18000)
Multivariate Plot: Parallel Coordinates Plot
Another approach toward presenting multidimensional information in a two-dimensional plot is via specialized
plots such as the parallel coordinates plot. In this plot a vertical axis is drawn for each variable. Then each
observation is represented by drawing a line that connects its values on the different axes, thereby creating a
“multivariate profile.” An example is shown in Figure 3.12 for the Boston Housing data. In this display,
separate panels are used for the two values of CAT.MEDV, in order to compare the profiles of homes in the
two classes (for a classification task). We see that the more expensive homes (bottom panel) consistently have
low CRIM, low LSTAT, and high RM compared to cheaper homes (top panel), which are more mixed on
CRIM and LSTAT, and have a medium level of RM. This observation gives an indication of useful predictors and
suggests possible binning for some numerical predictors.
Parallel Coordinate Plot (Boston Housing)
[Figure 3.12: parallel coordinates plots in two panels, CAT.MEDV = 0 (top) and CAT.MEDV = 1 (bottom); axes: CRIM, ZN, INDUS, NOX, RM, AGE, DIS, RAD, TAX, LSTAT]
#### Figure 3.12
library(MASS)
par(mfcol = c(2,1))
parcoord(housing.df[housing.df$CAT..MEDV == 0, -14], main = "CAT.MEDV = 0")
parcoord(housing.df[housing.df$CAT..MEDV == 1, -14], main = "CAT.MEDV = 1")
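To see how such a plot is put together, the construction can be sketched by hand in base R. This is an illustration under assumptions, not the method used by parcoord(): it uses the iris data that ships with R as a stand-in for a table of numerical variables, and the names num.df, rescale01, and scaled.mat are illustrative. Each variable is rescaled to [0, 1] so that all vertical axes share one scale, and each observation is then drawn as one line across the axes.

```r
# iris ships with R; used here only as a stand-in for a table of numerical variables
num.df <- iris[, 1:4]

# Rescale each variable to [0, 1] so that all vertical axes share one scale
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
scaled.mat <- sapply(num.df, rescale01)

# One line per observation: matplot() draws each column of its input as a line,
# so the matrix is transposed to put observations in columns
matplot(t(scaled.mat), type = "l", lty = 1, col = "gray40",
        xaxt = "n", xlab = "", ylab = "Rescaled value")
axis(1, at = seq_len(ncol(scaled.mat)), labels = colnames(scaled.mat))
```

Without the rescaling step, variables measured on large scales would dominate the display and the profiles on the other axes would be flattened.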
Problem 3.1
Shipments of Household Appliances: Line Graphs.
The file ApplianceShipments.csv contains the series of quarterly shipments (in millions of
dollars) of US household appliances between 1985 and 1989.
a. Create a well-formatted time plot of the data using R.
b. Does there appear to be a quarterly pattern? For a closer view of the patterns, zoom in to
the range of 3500–5000 on the y-axis.
c. Using R, create one chart with four separate lines, one line for each of Q1, Q2, Q3, and
Q4. In R, this can be achieved by generating a data.frame for each quarter Q1, Q2, Q3,
Q4, and then plotting them as separate series on the line graph. Zoom in to the range of
3500–5000 on the y-axis. Does there appear to be a difference between quarters?
d. Using R, create a line graph of the series at a yearly aggregated level (i.e., the total
shipments in each year).
Problem 3.2
Sales of Riding Mowers: Scatter Plots.
A company that manufactures riding mowers wants to identify the best sales prospects for an
intensive sales campaign. In particular, the manufacturer is interested in classifying households
as prospective owners or nonowners on the basis of Income (in $1000s) and Lot Size (in 1000
ft^2). The marketing expert looked at a random sample of 24 households, given in the file
RidingMowers.csv.
a. Using R, create a scatter plot of Lot Size vs. Income, color-coded by the outcome
variable owner/nonowner. Make sure to obtain a well-formatted plot (create legible labels
and a legend, etc.).
Problem 3.3
Laptop Sales at a London Computer Chain: Bar Charts and Boxplots.
The file LaptopSalesJanuary2008.csv contains data for all sales of laptops at a computer chain in
London in January 2008. This is a subset of the full dataset that includes data for the entire year.
a. Create a bar chart, showing the average retail price by store. Which store has the highest
average? Which has the lowest?
b. To better compare retail prices across stores, create side-by-side boxplots of retail price
by store. Now compare the prices in the two stores from (a). Does there seem to be a
difference between their price distributions?