
DS 535: ADVANCED DATA MINING FOR BUSINESS

Lecture Notes #2: Data Visualization

(Textbook reading: Chapter 3)

Graphs for Data Exploration

1. Basic Plots
Line Graphs
Bar Charts
Scatterplots

2. Distribution Plots
Boxplots
Histograms

Line Graph for Time Series


Amtrak Ridership
Amtrak, a US railway company, routinely collects data on ridership. Here we focus on forecasting future
ridership using the series of monthly ridership between January 1991 and March 2004. The data and their
source are described in Chapter 16. Hence our task here is (numerical) time series forecasting.
[Figure: line graph of monthly Amtrak ridership. x-axis: Year (1991 to 2004); y-axis: Ridership (in 000s)]

Amtrak.df <- read.csv("Amtrak data.csv")

# use time series analysis
library(forecast)
ridership.ts <- ts(Amtrak.df$Ridership, start = c(1991, 1), end = c(2004, 3), freq = 12)
plot(ridership.ts, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))

Bar Chart for Categorical Variable


Boston Housing Data
The Boston Housing data contain information on census tracts in Boston for which several measurements are
taken (e.g., crime rate, pupil/teacher ratio). Each record in the dataset has the following 14 attributes:

CRIM per capita crime rate by town


ZN proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS proportion of non-retail business acres per town.
CHAS Charles River dummy variable (1 if tract bounds river; 0 otherwise)
NOX nitric oxides concentration (parts per 10 million)
RM average number of rooms per dwelling
AGE proportion of owner-occupied units built prior to 1940
DIS weighted distances to five Boston employment centres
RAD index of accessibility to radial highways
TAX full-value property-tax rate per $10,000
PTRATIO pupil-teacher ratio by town
B 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
LSTAT % lower status of the population
MEDV median value of owner-occupied homes in $1000s

[Figure: bar chart of average median neighborhood value for neighborhoods that do (CHAS = 1) and do not
(CHAS = 0) border the Charles River. x-axis: CHAS; y-axis: Avg. MEDV]

## barchart of CHAS vs. mean MEDV
## Boston housing data
housing.df <- read.csv("BostonHousing.csv")

# compute mean MEDV per CHAS = (0, 1)


data.for.plot <- aggregate(housing.df$MEDV, by = list(housing.df$CHAS), FUN = mean)
names(data.for.plot) <- c("CHAS", "MeanMEDV")
barplot(data.for.plot$MeanMEDV, names.arg = data.for.plot$CHAS, xlab = "CHAS", ylab =
"Avg. MEDV")

[Figure: bar chart of the percentage of tracts with CAT.MEDV = 1, by CHAS. x-axis: CHAS; y-axis: % of CAT.MEDV]

## barchart of CHAS vs. % CAT.MEDV


data.for.plot <- aggregate(housing.df$CAT..MEDV, by = list(housing.df$CHAS), FUN = mean)
names(data.for.plot) <- c("CHAS", "MeanCATMEDV")
barplot(data.for.plot$MeanCATMEDV * 100, names.arg = data.for.plot$CHAS,
xlab = "CHAS", ylab = "% of CAT.MEDV")

Scatterplot
Displays relationship between two numerical variables

## Boston housing data


housing.df <- read.csv("BostonHousing.csv")

## scatter plot with axes names


plot(housing.df$MEDV ~ housing.df$LSTAT, xlab = "LSTAT", ylab = "MEDV")

[Figure: scatter plot of MEDV (y-axis) vs. LSTAT (x-axis)]

# alternative plot with ggplot


library(ggplot2)
ggplot(housing.df) + geom_point(aes(x = LSTAT, y = MEDV), colour = "navy", alpha = 0.7)

Histograms

Display “how many” of each value occur in a data set.

Or, for continuous data or data with many possible values, display “how many” values fall in each of a
series of ranges or “bins”.

## histogram of MEDV
hist(housing.df$MEDV, xlab = "MEDV")

[Figure: histogram of MEDV. x-axis: MEDV; y-axis: count]

# alternative plot with ggplot


library(ggplot2)
ggplot(housing.df) + geom_histogram(aes(x = MEDV), binwidth = 5)
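
The binning idea behind a histogram can be sketched by counting values per range directly. The snippet below is a minimal illustration on made-up values (the vector `x` and the break points are assumptions for the example, not part of the Boston Housing data). It uses base R's `cut()` and `table()` and compares the result with the counts that `hist()` computes for the same breaks.

```r
# Made-up values standing in for a numerical variable such as MEDV (assumption)
x <- c(5, 7, 12, 14, 18, 21, 22, 23, 27, 31, 36, 50)

# Bin edges of width 10, similar in spirit to binwidth in geom_histogram()
breaks <- seq(0, 50, by = 10)

# Assign each value to a bin (right-closed intervals, lowest edge included),
# then count how many values fall in each bin
bins <- cut(x, breaks = breaks, include.lowest = TRUE)
bin.counts <- table(bins)
print(bin.counts)

# hist() uses the same right-closed convention by default, so with identical
# breaks the counts agree; plot = FALSE returns the counts without drawing
h <- hist(x, breaks = breaks, plot = FALSE)
print(h$counts)
```

Changing `breaks` (or `binwidth` in ggplot) changes the counts and can change the apparent shape of the distribution, so it is worth trying more than one bin width.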
Boxplots

## boxplot of MEDV for different values of CHAS


boxplot(housing.df$MEDV ~ housing.df$CHAS, xlab = "CHAS", ylab = "MEDV")

[Figure: boxplots of MEDV for CHAS = 0 and CHAS = 1. x-axis: CHAS; y-axis: MEDV]

# alternative plot with ggplot


ggplot(housing.df) + geom_boxplot(aes(x = as.factor(CHAS), y = MEDV)) + xlab("CHAS")

[Figure: side-by-side boxplots of NOX, LSTAT, PTRATIO, and INDUS (y-axes) by CAT.MEDV (x-axis, values 0 and 1)]

#### Figure 3.3

## side-by-side boxplots
# use par() to split the plots into panels.
par(mfcol = c(1, 4))
boxplot(housing.df$NOX ~ housing.df$CAT..MEDV, xlab = "CAT.MEDV", ylab = "NOX")
boxplot(housing.df$LSTAT ~ housing.df$CAT..MEDV, xlab = "CAT.MEDV", ylab = "LSTAT")
boxplot(housing.df$PTRATIO ~ housing.df$CAT..MEDV, xlab = "CAT.MEDV", ylab =
"PTRATIO")
boxplot(housing.df$INDUS ~ housing.df$CAT..MEDV, xlab = "CAT.MEDV", ylab = "INDUS")

Heat Maps
Color conveys information

In data mining, used to visualize:
• Correlations
• Missing Data

A heatmap is a graphical display of numerical data where color is used to denote values. In a data mining
context, heatmaps are especially useful for two purposes: for visualizing correlation tables and for visualizing
missing values in the data. In both cases the information is conveyed in a two-dimensional table. A correlation
table for p variables has p rows and p columns. A data table contains p columns (variables) and n rows
(observations). If the number of rows is huge, then a subset can be used. In both cases, it is much easier and
faster to scan the color-coding than the values. Note that heatmaps are useful when examining a large
number of values, but they are not a replacement for more precise graphical displays, such as bar charts,
because color differences cannot be perceived accurately.
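
The second use, visualizing missing values, can be sketched as follows. This is a minimal illustration on a small made-up data frame (`toy.df` and the positions of the missing values are assumptions for the example, not a course dataset). It converts the table into a 0/1 indicator matrix and passes it to base R's `heatmap()`.

```r
# Small made-up data table with some missing values (assumption: not a course file)
set.seed(1)
toy.df <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20), d = rnorm(20))
toy.df$b[c(3, 4, 5, 12)] <- NA
toy.df$d[c(5, 6, 18)] <- NA

# is.na() on a data frame returns a logical matrix; multiplying by 1 turns it
# into a 0/1 numeric matrix (1 = missing, 0 = present)
missing.mat <- 1 * is.na(toy.df)

# Draw the indicator matrix without reordering rows/columns and without
# rescaling, so the dark cells show exactly where values are missing
heatmap(missing.mat, Rowv = NA, Colv = NA, scale = "none",
        col = c("gray90", "black"))

# Share of missing values per variable, as a precise numerical summary
print(colMeans(missing.mat))
```

For a table with a very large number of rows, the same call could be applied to a random subset of the rows, as suggested above.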

Heatmap to highlight correlations

#### Figure 3.4


## simple heatmap of correlations (without values)
heatmap(cor(housing.df), Rowv = NA, Colv = NA)

## heatmap with values

library(gplots)
heatmap.2(cor(housing.df), Rowv = FALSE, Colv = FALSE, dendrogram = "none",
cellnote = round(cor(housing.df),2),
notecol = "black", key = FALSE, trace = 'none', margins = c(10,10))

# alternative plot with ggplot

library(ggplot2)
library(reshape) # to generate input for the plot
cor.mat <- round(cor(housing.df),2) # rounded correlation matrix
melted.cor.mat <- melt(cor.mat)
ggplot(melted.cor.mat, aes(x = X1, y = X2, fill = value)) +
geom_tile() +
geom_text(aes(x = X1, y = X2, label = value))

Multidimensional Visualization
Basic plots can convey richer information with features such as color, size, and multiple panels, and by
enabling operations such as rescaling, aggregation, and interactivity.

Scatter plot of two numerical predictors, color-coded by the categorical outcome CAT.MEDV

[Figure: scatter plot of NOX (y-axis) vs. LSTAT (x-axis); legend: CAT.MEDV = 1, CAT.MEDV = 0]

#### Figure 3.6


## color plot
par(mfcol = c(1,1), xpd=TRUE) # allow legend to be displayed outside of plot area
plot(housing.df$NOX ~ housing.df$LSTAT, ylab = "NOX", xlab = "LSTAT",
col = ifelse(housing.df$CAT..MEDV == 1, "black", "gray"))
# add legend outside of plotting area
# In legend() use argument inset = to control the location of the legend relative
# to the plot.
legend("topleft", inset=c(0, -0.2),
legend = c("CAT.MEDV = 1", "CAT.MEDV = 0"), col = c("black", "gray"),
pch = 1, cex = 0.5)
# alternative plot with ggplot
library(ggplot2)
ggplot(housing.df, aes(y = NOX, x = LSTAT, colour= CAT..MEDV)) +
geom_point(alpha = 0.6)

Boston Housing

Bar chart of MEDV by two categorical predictors (CHAS and RAD), using multiple panels for CHAS

## panel plots
# compute mean MEDV per RAD and CHAS
# In aggregate() use argument drop = FALSE to include all combinations
# (existing and missing) of RAD X CHAS.
data.for.plot <- aggregate(housing.df$MEDV, by = list(housing.df$RAD,
housing.df$CHAS), FUN = mean, drop = FALSE)
names(data.for.plot) <- c("RAD", "CHAS", "meanMEDV")
# plot the data
par(mfcol = c(2,1))
barplot(height = data.for.plot$meanMEDV[data.for.plot$CHAS == 0],
names.arg = data.for.plot$RAD[data.for.plot$CHAS == 0],
xlab = "RAD", ylab = "Avg. MEDV", main = "CHAS = 0")
barplot(height = data.for.plot$meanMEDV[data.for.plot$CHAS == 1],
names.arg = data.for.plot$RAD[data.for.plot$CHAS == 1],
xlab = "RAD", ylab = "Avg. MEDV", main = "CHAS = 1")

# alternative plot with ggplot
ggplot(data.for.plot) +
geom_bar(aes(x = as.factor(RAD), y = `meanMEDV`), stat = "identity") +
xlab("RAD") + facet_grid(CHAS ~ .)

[Figure: bar charts of mean MEDV by RAD, in two panels: CHAS = 0 (top) and CHAS = 1 (bottom). x-axis: RAD
(1 to 8, and 24); y-axis: meanMEDV]

Matrix Scatterplot
A special plot that uses scatter plots with multiple panels is the scatter plot matrix. In it, all pairwise
scatter plots are shown in a single display. The panels are organized so that each row and each column
corresponds to a variable, and the intersections form all possible pairwise scatter plots. The scatter
plot matrix is useful in unsupervised learning for studying the associations between numerical variables,
detecting outliers, and identifying clusters.

[Figure: scatter plot matrix of CRIM, INDUS, LSTAT, and MEDV, with variable names on the diagonal]

#### Figure 3.7


## simple plot
# use plot() to generate a matrix of 4X4 panels with variable name on the diagonal,
# and scatter plots in the remaining panels.
plot(housing.df[, c(1, 3, 12, 13)])

# alternative, nicer plot (displayed)
library(GGally)
ggpairs(housing.df[, c(1, 3, 12, 13)])

Most of the time spent in data mining projects is spent in preprocessing. Typically, considerable effort is
expended getting all the data in a format that can actually be used in the data mining software. Additional
time is spent processing the data in ways that improve the performance of the data mining procedures.

Manipulation:
• Rescaling
• Aggregation
• Zooming
The ability to zoom in and out of certain areas of the data on a plot is important for revealing patterns
and outliers.
• Filtering
Filtering means removing some of the observations from the plot.
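
Zooming and filtering can be sketched in base R as follows. This is a minimal illustration on simulated data (`toy.df`, its columns `x` and `y`, and the cutoff of 20 are assumptions for the example). Zooming restricts the axis ranges through `xlim` and `ylim`, while filtering subsets the rows before plotting.

```r
# Made-up data: one numerical predictor and one outcome (assumption)
set.seed(2)
toy.df <- data.frame(x = runif(200, min = 0, max = 100))
toy.df$y <- 50 - 0.3 * toy.df$x + rnorm(200, sd = 5)

# Full view of all observations
plot(toy.df$y ~ toy.df$x, xlab = "x", ylab = "y")

# Zooming: keep all observations but restrict the axis ranges
plot(toy.df$y ~ toy.df$x, xlab = "x", ylab = "y",
     xlim = c(0, 20), ylim = c(30, 60))

# Filtering: remove observations from the plot by subsetting the rows first
filtered.df <- toy.df[toy.df$x <= 20, ]
plot(filtered.df$y ~ filtered.df$x, xlab = "x", ylab = "y")
```

The two results look similar here, but they are not the same operation: the zoomed plot still holds the full data and only hides points outside the window, whereas the filtered plot lets the axes adapt to the remaining observations.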

Rescaling to log scale


“uncrowds” the data

#### Figure 3.8


options(scipen=999) # avoid scientific notation

## scatter plot: regular and log scale


plot(housing.df$MEDV ~ housing.df$CRIM, xlab = "CRIM", ylab = "MEDV")
# to use logarithmic scale set argument log = to either 'x', 'y', or 'xy'.
plot(housing.df$MEDV ~ housing.df$CRIM, xlab = "CRIM", ylab = "MEDV", log = 'xy')

# alternative log-scale plot with ggplot
library(ggplot2)
ggplot(housing.df) + geom_point(aes(x = CRIM, y = MEDV)) +
scale_x_log10(breaks = 10^(-2:2),
labels = format(10^(-2:2), scientific = FALSE, drop0trailing = TRUE)) +
scale_y_log10(breaks = c(5, 10, 20, 40))

[Figure: scatter plot of MEDV (y-axis) vs. CRIM (x-axis), both on a log scale]

## boxplot: regular and log scale


boxplot(housing.df$CRIM ~ housing.df$CAT..MEDV,
xlab = "CAT.MEDV", ylab = "CRIM")
boxplot(housing.df$CRIM ~ housing.df$CAT..MEDV,
xlab = "CAT.MEDV", ylab = "CRIM", log = 'y')

[Figure: boxplots of CRIM by CAT.MEDV on a regular scale (left) and a log scale (right). x-axis: CAT.MEDV;
y-axis: CRIM]

Amtrak Ridership – Monthly Data – Curve Added

[Figure: line graph of monthly Amtrak ridership with a fitted curve overlaid. x-axis: Year; y-axis:
Ridership (in 000s)]

#### Figure 3.9

library(forecast)
Amtrak.df <- read.csv("Amtrak data.csv")
ridership.ts <- ts(Amtrak.df$Ridership, start = c(1991, 1), end = c(2004, 3), freq = 12)

## fit curve
ridership.lm <- tslm(ridership.ts ~ trend + I(trend^2))
plot(ridership.ts, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))
lines(ridership.lm$fitted, lwd = 2)
Aggregation
Another useful manipulation of scaling is changing the level of aggregation. For a temporal scale, we can
aggregate by different granularity (e.g., monthly, daily, hourly) or even by a “seasonal” factor of interest
such as month-of-year or day-of-week. A popular aggregation for time series is a moving average, where
the average of neighboring values within a given window size is plotted.
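
A moving average can be sketched with base R's `stats::filter()`. This is a minimal illustration on a simulated monthly series (`toy.ts` and the window size `k = 5` are assumptions for the example; the series only stands in for something like the ridership data). Each smoothed value is the mean of the point itself and its neighbors within the window.

```r
# Made-up monthly series with trend and yearly seasonality (assumption:
# the values are simulated, not the Amtrak data)
set.seed(3)
n <- 60
toy.ts <- ts(1500 + 2 * (1:n) + 100 * sin(2 * pi * (1:n) / 12) + rnorm(n, sd = 30),
             start = c(1991, 1), frequency = 12)

# Centered moving average with an odd window of k months:
# each point is the mean of itself and (k - 1) / 2 neighbors on either side
k <- 5
ma.ts <- stats::filter(toy.ts, filter = rep(1 / k, k), sides = 2)

# Plot the raw series and overlay the smoothed one
plot(toy.ts, xlab = "Year", ylab = "Simulated series")
lines(ma.ts, lwd = 2)
```

The first and last `(k - 1) / 2` values of the smoothed series are `NA` because the window does not fit there. A wider window smooths more but also loses more points at the ends.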

[Figure: three line graphs of Amtrak ridership. Top: zoom-in on the first two years (x-axis: Year; y-axis:
Ridership (in 000s)). Middle: average ridership by month of year (x-axis: Month; y-axis: Average Ridership).
Bottom: average ridership by year (x-axis: Year; y-axis: Average Ridership)]

## zoom in, monthly, and annual plots


ridership.2yrs <- window(ridership.ts, start = c(1991,1), end = c(1992,12))
plot(ridership.2yrs, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))

monthly.ridership.ts <- tapply(ridership.ts, cycle(ridership.ts), mean)


plot(monthly.ridership.ts, xlab = "Month", ylab = "Average Ridership",
ylim = c(1300, 2300), type = "l", xaxt = 'n')
## set x labels
axis(1, at = c(1:12), labels = c("Jan","Feb","Mar", "Apr","May","Jun",
"Jul","Aug","Sep", "Oct","Nov","Dec"))

annual.ridership.ts <- aggregate(ridership.ts, FUN = mean)


plot(annual.ridership.ts, xlab = "Year", ylab = "Average Ridership",
ylim = c(1300, 2300))

In displays that are not overcrowded, the use of in-plot labels can be useful for better exploration of outliers
and clusters.

Scatter Plot with Labels (Utilities)


Scaling: Smaller markers, jittering, color contrast (Universal Bank; red = accept loan)

[Figure: scatter plot of Fuel Cost (y-axis) vs. Sales (x-axis) for the utilities data, with each point
labeled by company name]

#### Figure 3.10

utilities.df <- read.csv("Utilities.csv")

plot(utilities.df$Fuel_Cost ~ utilities.df$Sales,
xlab = "Sales", ylab = "Fuel Cost", xlim = c(2000, 20000))
text(x = utilities.df$Sales, y = utilities.df$Fuel_Cost,
labels = utilities.df$Company, pos = 4, cex = 0.8, srt = 20, offset = 0.2)

# alternative with ggplot
library(ggplot2)
ggplot(utilities.df, aes(y = Fuel_Cost, x = Sales)) + geom_point() +
geom_text(aes(label = paste(" ", Company)), size = 4, hjust = 0.0, angle = 15) +
ylim(0.25, 2.25) + xlim(3000, 18000)

Multivariate Plot: Parallel Coordinates Plot
Another approach toward presenting multidimensional information in a two-dimensional plot is via specialized
plots such as the parallel coordinates plot. In this plot a vertical axis is drawn for each variable. Then each
observation is represented by drawing a line that connects its values on the different axes, thereby creating a
“multivariate profile.” An example is shown in Figure 3.12 for the Boston Housing data. In this display,
separate panels are used for the two values of CAT.MEDV, in order to compare the profiles of homes in the
two classes (for a classification task). We see that the more expensive homes (bottom panel) consistently have
low CRIM, low LSTAT, and high RM compared to cheaper homes (top panel), which are more mixed on
CRIM and LSTAT and have a medium level of RM. This observation indicates potentially useful predictors and
suggests possible binning for some numerical predictors.

Parallel Coordinate Plot (Boston Housing)

[Figure: parallel coordinates plots in two panels, CAT.MEDV = 0 (top) and CAT.MEDV = 1 (bottom). Axes shown:
CRIM, ZN, INDUS, NOX, RM, AGE, DIS, RAD, TAX, LSTAT]

#### Figure 3.12

library(MASS)
par(mfcol = c(2,1))
parcoord(housing.df[housing.df$CAT..MEDV == 0, -14], main = "CAT.MEDV = 0")
parcoord(housing.df[housing.df$CAT..MEDV == 1, -14], main = "CAT.MEDV = 1")
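
To make the construction described above concrete, a parallel coordinates plot can also be sketched by hand in base R. This is a minimal illustration using the built-in `mtcars` data (an assumption for the example, not a course dataset). The helper `rescale01` is a hypothetical name introduced here. Each variable is rescaled to [0, 1] so that all vertical axes share one scale, and each observation is then drawn as a line across the axes.

```r
# Built-in mtcars data used purely for illustration (assumption: not a course file)
vars.df <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Rescale each variable to [0, 1] so that all vertical axes share one scale
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))
scaled.mat <- sapply(vars.df, rescale01)

# matplot() draws one line per column, so transpose: after t(), each column
# is one observation and each row is one variable (one vertical axis)
matplot(t(scaled.mat), type = "l", lty = 1, col = "gray40",
        xaxt = "n", xlab = "", ylab = "Rescaled value")
axis(1, at = seq_len(ncol(scaled.mat)), labels = colnames(scaled.mat))
abline(v = seq_len(ncol(scaled.mat)))  # mark the vertical axis of each variable
```

Because of the rescaling, the plot compares relative positions within each variable rather than raw units, so the order of the axes and the presence of outliers can strongly affect how the profiles look.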

Problem 3.1
Shipments of Household Appliances: Line Graphs.
The file ApplianceShipments.csv contains the series of quarterly shipments (in millions of
dollars) of US household appliances between 1985 and 1989.
a. Create a well-formatted time plot of the data using R.
b. Does there appear to be a quarterly pattern? For a closer view of the patterns, zoom in to
the range of 3500–5000 on the y-axis.
c. Using R, create one chart with four separate lines, one line for each of Q1, Q2, Q3, and
Q4. In R, this can be achieved by generating a data.frame for each quarter (Q1, Q2, Q3,
Q4), and then plotting them as separate series on the line graph. Zoom in to the range of
3500–5000 on the y-axis. Does there appear to be a difference between quarters?
d. Using R, create a line graph of the series at a yearly aggregated level (i.e., the total
shipments in each year).

Problem 3.2
Sales of Riding Mowers: Scatter Plots.
A company that manufactures riding mowers wants to identify the best sales prospects for an
intensive sales campaign. In particular, the manufacturer is interested in classifying households
as prospective owners or nonowners on the basis of Income (in $1000s) and Lot Size (in 1000
ft²). The marketing expert looked at a random sample of 24 households, given in the file
RidingMowers.csv.
a. Using R, create a scatter plot of Lot Size vs. Income, color-coded by the outcome
variable owner/nonowner. Make sure to obtain a well-formatted plot (create legible labels
and a legend, etc.).

Problem 3.3
Laptop Sales at a London Computer Chain: Bar Charts and Boxplots.
The file LaptopSalesJanuary2008.csv contains data for all sales of laptops at a computer chain in
London in January 2008. This is a subset of the full dataset that includes data for the entire year.
a. Create a bar chart, showing the average retail price by store. Which store has the highest
average? Which has the lowest?
b. To better compare retail prices across stores, create side-by-side boxplots of retail price
by store. Now compare the prices in the two stores from (a). Does there seem to be a
difference between their price distributions?
