1. Basic Plots
Line Graphs
Bar Charts
Scatterplots
2. Distribution Plots
Boxplots
Histograms
[Figure: line graph of monthly ridership; x-axis: Year]
# use time series analysis
library(forecast)
# load the data before building the time series
Amtrak.df <- read.csv("Amtrak data.csv")
ridership.ts <- ts(Amtrak.df$Ridership, start = c(1991, 1), end = c(2004, 3), freq = 12)
plot(ridership.ts, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))
Average median neighborhood value for neighborhoods that do and do not border the Charles River
[Figure: bar chart; x-axis: CHAS (0, 1), y-axis: Avg. MEDV]
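## barchart of CHAS vs. mean MEDV (Boston housing data)
housing.df <- read.csv("BostonHousing.csv")
# compute mean MEDV for CHAS = 0 and CHAS = 1, then plot
data.for.plot <- aggregate(housing.df$MEDV, by = list(housing.df$CHAS), FUN = mean)
names(data.for.plot) <- c("CHAS", "MeanMEDV")
barplot(data.for.plot$MeanMEDV, names.arg = data.for.plot$CHAS, xlab = "CHAS", ylab = "Avg. MEDV")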
Scatterplot
Displays relationship between two numerical variables
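A minimal base R sketch of such a plot follows. It assumes the Boston data frame from the MASS package as a stand-in for housing.df; that data frame uses lowercase column names (lstat, medv), so the names differ from those in BostonHousing.csv.

```r
library(MASS)  # provides the Boston data frame (assumed stand-in for housing.df)

# scatter plot of two numerical variables: MEDV (y) against LSTAT (x)
plot(Boston$medv ~ Boston$lstat, xlab = "LSTAT", ylab = "MEDV")

# the correlation coefficient summarizes the direction of the relationship
lstat.medv.cor <- cor(Boston$lstat, Boston$medv)
print(lstat.medv.cor)
```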
[Figure: scatter plot of LSTAT and MEDV]
Histograms
Display how many times each value occurs in a data set or, for continuous data or data with many
possible values, how many values fall in each of a series of ranges or “bins”
## histogram of MEDV
hist(housing.df$MEDV, xlab = "MEDV")
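To make the idea of bins concrete, the sketch below computes the bin counts without drawing and then plots the stored result. It assumes the Boston data frame from the MASS package as a stand-in for housing.df, and the bin width of 5 is an arbitrary choice for illustration.

```r
library(MASS)  # Boston data frame, an assumed stand-in for housing.df

# compute the bins without drawing, using breaks every 5 units
medv.hist <- hist(Boston$medv, breaks = seq(0, 50, by = 5), plot = FALSE)

# one row per bin: the bin's upper edge and the number of values in it
bin.counts <- data.frame(upper = medv.hist$breaks[-1], count = medv.hist$counts)
print(bin.counts)

# drawing the stored object gives the usual histogram
plot(medv.hist, xlab = "MEDV", main = "")
```

Changing the breaks argument changes how the same values are grouped, which is worth trying before reading too much into the shape of any one histogram.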
[Figure: histogram of MEDV; x-axis: MEDV, y-axis: count]
[Figure: boxplots of MEDV by CHAS (0, 1); x-axis: CHAS, y-axis: MEDV]
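## side-by-side boxplots of MEDV for the two values of CHAS
boxplot(housing.df$MEDV ~ housing.df$CHAS, xlab = "CHAS", ylab = "MEDV")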
[Figure: side-by-side boxplots of NOX, LSTAT, PTRATIO, and INDUS by CAT.MEDV (0, 1)]
## side-by-side boxplots
# use par() to split the plots into panels.
par(mfcol = c(1, 4))
boxplot(housing.df$NOX ~ housing.df$CAT..MEDV, xlab = "CAT.MEDV", ylab = "NOX")
boxplot(housing.df$LSTAT ~ housing.df$CAT..MEDV, xlab = "CAT.MEDV", ylab = "LSTAT")
boxplot(housing.df$PTRATIO ~ housing.df$CAT..MEDV, xlab = "CAT.MEDV", ylab = "PTRATIO")
boxplot(housing.df$INDUS ~ housing.df$CAT..MEDV, xlab = "CAT.MEDV", ylab = "INDUS")
Heat Maps
Color conveys information
A heatmap is a graphical display of numerical data where color is used to denote values. In a data mining
context, heatmaps are especially useful for two purposes: for visualizing correlation tables and for visualizing
missing values in the data. In both cases the information is conveyed in a two-dimensional table. A correlation
table for p variables has p rows and p columns. A data table contains p columns (variables) and n rows
(observations). If the number of rows is huge, then a subset can be used. In both cases, it is much easier and
faster to scan the color-coding than the values themselves. Note that heatmaps are useful when examining a large
number of values, but they are not a replacement for more precise graphical displays, such as bar charts,
because color differences cannot be perceived accurately.
## heatmap with values
library(gplots)
heatmap.2(cor(housing.df), Rowv = FALSE, Colv = FALSE, dendrogram = "none",
cellnote = round(cor(housing.df),2),
notecol = "black", key = FALSE, trace = 'none', margins = c(10,10))
# alternative plot with ggplot
library(ggplot2)
library(reshape) # to generate input for the plot
cor.mat <- round(cor(housing.df),2) # rounded correlation matrix
melted.cor.mat <- melt(cor.mat)
ggplot(melted.cor.mat, aes(x = X1, y = X2, fill = value)) +
geom_tile() +
geom_text(aes(x = X1, y = X2, label = value))
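The second use of heatmaps mentioned above, visualizing missing values, can be sketched in base R as follows. The data frame example.df and its missing entries are hypothetical, generated only for illustration.

```r
# hypothetical data: 20 observations of 4 variables, with some values set to NA
set.seed(1)
example.df <- data.frame(v1 = rnorm(20), v2 = rnorm(20), v3 = rnorm(20), v4 = rnorm(20))
example.df$v2[c(3, 7, 8)] <- NA
example.df$v4[c(7, 15)] <- NA

# 1 marks a missing value, 0 a present value
missing.mat <- 1 * is.na(example.df)

# Rowv = NA and Colv = NA suppress reordering and dendrograms;
# scale = "none" keeps the 0/1 coding as is
heatmap(missing.mat, Rowv = NA, Colv = NA, scale = "none")
```

Each row is an observation and each column a variable, so a column with many highlighted cells is a variable with many missing values.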
Multidimensional Visualization
Basic plots can convey richer information with features such as color, size, and multiple panels, and by
enabling operations such as rescaling, aggregation, and interactivity.
Scatter plot of two numerical predictors, color-coded by the categorical outcome CAT.MEDV
[Figure: scatter plot; x-axis: LSTAT, y-axis: NOX; legend: CAT.MEDV = 1, CAT.MEDV = 0]
# alternative plot with ggplot
library(ggplot2)
ggplot(housing.df, aes(y = NOX, x = LSTAT, colour= CAT..MEDV)) +
geom_point(alpha = 0.6)
Boston Housing
Bar chart of MEDV by two categorical predictors (CHAS and RAD), using multiple panels for CHAS
## panel plots
# compute mean MEDV per RAD and CHAS
# In aggregate() use argument drop = FALSE to include all combinations
# (existing and missing) of RAD X CHAS.
data.for.plot <- aggregate(housing.df$MEDV, by = list(housing.df$RAD,
housing.df$CHAS), FUN = mean, drop = FALSE)
names(data.for.plot) <- c("RAD", "CHAS", "meanMEDV")
# plot the data
par(mfcol = c(2,1))
barplot(height = data.for.plot$meanMEDV[data.for.plot$CHAS == 0],
names.arg = data.for.plot$RAD[data.for.plot$CHAS == 0],
xlab = "RAD", ylab = "Avg. MEDV", main = "CHAS = 0")
barplot(height = data.for.plot$meanMEDV[data.for.plot$CHAS == 1],
names.arg = data.for.plot$RAD[data.for.plot$CHAS == 1],
xlab = "RAD", ylab = "Avg. MEDV", main = "CHAS = 1")
# alternative plot with ggplot
ggplot(data.for.plot) +
geom_bar(aes(x = as.factor(RAD), y = `meanMEDV`), stat = "identity") +
xlab("RAD") + facet_grid(CHAS ~ .)
[Figure: bar charts of meanMEDV by RAD (1 to 8, 24), in separate panels for CHAS = 0 and CHAS = 1]
Matrix Scatterplot
A special plot that uses scatter plots with multiple panels is the scatter plot matrix. In it, all pairwise
scatter plots are shown in a single display. The panels are organized so that each column and each row
corresponds to a variable, and the intersections form all possible pairwise scatter plots. The scatter plot
matrix is useful in unsupervised learning for studying the associations between numerical variables,
detecting outliers, and identifying clusters.
[Figure: scatter plot matrix of CRIM, INDUS, LSTAT, and MEDV]
# alternative, nicer plot (displayed)
library(GGally)
ggpairs(housing.df[, c(1, 3, 12, 13)])
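A plainer version of the same display can be sketched with the base R pairs() function. This assumes the Boston data frame from the MASS package as a stand-in for housing.df; because its column order differs from BostonHousing.csv, the four variables are selected by name rather than by position.

```r
library(MASS)  # Boston data frame, an assumed stand-in for housing.df

# the four variables shown in the figure, selected by (lowercase) name
pairs.vars <- Boston[, c("crim", "indus", "lstat", "medv")]

# one panel for every pair of variables
pairs(pairs.vars)
```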
Most of the time in data mining projects is spent on preprocessing. Typically, considerable effort is
expended getting all the data in a format that can actually be used in the data mining software. Additional
time is spent processing the data in ways that improve the performance of the data mining procedures.
Manipulation:
• Rescaling
• Aggregation
• Zooming
The ability to zoom in and out of certain areas of the data on a plot is important for revealing patterns
and outliers.
• Filtering
Filtering means removing some of the observations from the plot.
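Three of the manipulations listed above (rescaling, zooming, and filtering) can be sketched in base R as follows. The Boston data frame from the MASS package is an assumed stand-in for housing.df, and the axis ranges used for zooming are arbitrary values chosen for illustration.

```r
library(MASS)  # Boston data frame, an assumed stand-in for housing.df

# rescaling: draw both axes on a logarithmic scale
plot(Boston$medv ~ Boston$crim, xlab = "CRIM", ylab = "MEDV", log = "xy")

# zooming: restrict the axis ranges to a region of interest
plot(Boston$medv ~ Boston$crim, xlab = "CRIM", ylab = "MEDV",
     xlim = c(0, 1), ylim = c(10, 30))

# filtering: keep only the neighborhoods that border the Charles River
river.df <- Boston[Boston$chas == 1, ]
plot(river.df$medv ~ river.df$crim, xlab = "CRIM", ylab = "MEDV")
```

Zooming changes only what is displayed, while filtering changes which observations are plotted at all.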
# alternative log-scale plot with ggplot
library(ggplot2)
ggplot(housing.df) + geom_point(aes(x = CRIM, y = MEDV)) +
scale_x_log10(breaks = 10^(-2:2),
labels = format(10^(-2:2), scientific = FALSE, drop0trailing = TRUE)) +
scale_y_log10(breaks = c(5, 10, 20, 40))
[Figure: rescaled plots; log-scale scatter plot of MEDV vs. CRIM, and boxplots of CRIM by CAT.MEDV (0, 1) on the original and logarithmic scales]
Amtrak Ridership – Monthly Data – Curve Added
[Figure: line graph with fitted curve; x-axis: Year, y-axis: Ridership (in 000s)]
library(forecast)
Amtrak.df <- read.csv("Amtrak data.csv")
ridership.ts <- ts(Amtrak.df$Ridership, start = c(1991, 1), end = c(2004, 3), freq = 12)
## fit curve
ridership.lm <- tslm(ridership.ts ~ trend + I(trend^2))
plot(ridership.ts, xlab = "Year", ylab = "Ridership (in 000s)", ylim = c(1300, 2300))
lines(ridership.lm$fitted, lwd = 2)
Aggregation
Another useful manipulation of scaling is changing the level of aggregation. For a temporal scale, we can
aggregate by different granularity (e.g., monthly, daily, hourly) or even by a “seasonal” factor of interest
such as month-of-year or day-of-week. A popular aggregation for time series is a moving average, where
the average of neighboring values within a given window size is plotted.
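The aggregations described above can be sketched in base R as follows. The series example.ts is hypothetical (a trend plus a yearly cycle plus noise) and stands in for a monthly series such as the ridership data; the trailing 12-month window is one possible choice of moving average.

```r
# hypothetical monthly series: trend plus a seasonal pattern plus noise
set.seed(1)
n.months <- 120
values <- 1500 + 2 * (1:n.months) + 100 * sin(2 * pi * (1:n.months) / 12) +
  rnorm(n.months, sd = 30)
example.ts <- ts(values, start = c(1991, 1), frequency = 12)

# aggregate to yearly averages
annual.ts <- aggregate(example.ts, FUN = mean)

# average by month-of-year (a "seasonal" factor)
monthly.avg <- tapply(as.numeric(example.ts), as.integer(cycle(example.ts)), mean)

# trailing 12-month moving average: mean of the current and previous 11 months
ma.ts <- stats::filter(example.ts, filter = rep(1 / 12, 12), sides = 1)

# plot the monthly series with the moving average overlaid
plot(example.ts, xlab = "Year", ylab = "Value")
lines(ma.ts, lwd = 2)
```

The first 11 values of the moving average are NA because a full 12-month window is not yet available.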
[Figure: three line graphs; Ridership (in 000s) by Year, Average Ridership by Month, and Average Ridership by Year]
In displays that are not overcrowded, the use of in-plot labels can be useful for better exploration of outliers
and clusters.
[Figure: scatter plot of Fuel Cost (y-axis) vs. Sales (x-axis), with each point labeled by company name]
plot(utilities.df$Fuel_Cost ~ utilities.df$Sales,
xlab = "Sales", ylab = "Fuel Cost", xlim = c(2000, 20000))
text(x = utilities.df$Sales, y = utilities.df$Fuel_Cost,
labels = utilities.df$Company, pos = 4, cex = 0.8, srt = 20, offset = 0.2)
# alternative with ggplot
library(ggplot2)
ggplot(utilities.df, aes(y = Fuel_Cost, x = Sales)) + geom_point() +
geom_text(aes(label = paste(" ", Company)), size = 4, hjust = 0.0, angle = 15) +
ylim(0.25, 2.25) + xlim(3000, 18000)
Multivariate Plot: Parallel Coordinates Plot
Another approach toward presenting multidimensional information in a two-dimensional plot is via specialized
plots such as the parallel coordinates plot. In this plot a vertical axis is drawn for each variable. Then each
observation is represented by drawing a line that connects its values on the different axes, thereby creating a
“multivariate profile.” An example is shown in Figure 3.12 for the Boston Housing data. In this display,
separate panels are used for the two values of CAT.MEDV, in order to compare the profiles of homes in the
two classes (for a classification task). We see that the more expensive homes (bottom panel) consistently have
low CRIM, low LSTAT, and high RM compared to cheaper homes (top panel), which are more mixed on
CRIM and LSTAT and have a medium level of RM. This observation indicates useful predictors and
suggests possible binning for some numerical predictors.
[Figure: parallel coordinates plots in two panels; CAT.MEDV = 0 (top) and CAT.MEDV = 1 (bottom)]
library(MASS)
par(mfcol = c(2,1))
parcoord(housing.df[housing.df$CAT..MEDV == 0, -14], main = "CAT.MEDV = 0")
parcoord(housing.df[housing.df$CAT..MEDV == 1, -14], main = "CAT.MEDV = 1")
Problem 3.1
Shipments of Household Appliances: Line Graphs.
The file ApplianceShipments.csv contains the series of quarterly shipments (in millions of
dollars) of US household appliances between 1985 and 1989.
a. Create a well-formatted time plot of the data using R.
b. Does there appear to be a quarterly pattern? For a closer view of the patterns, zoom in to
the range of 3500–5000 on the y-axis.
c. Using R, create one chart with four separate lines, one line for each of Q1, Q2, Q3, and
Q4. In R, this can be achieved by generating a data.frame for each quarter (Q1, Q2, Q3,
Q4) and then plotting them as separate series on the line graph. Zoom in to the range of
3500–5000 on the y-axis. Does there appear to be a difference between quarters?
d. Using R, create a line graph of the series at a yearly aggregated level (i.e., the total
shipments in each year).
Problem 3.2
Sales of Riding Mowers: Scatter Plots.
A company that manufactures riding mowers wants to identify the best sales prospects for an
intensive sales campaign. In particular, the manufacturer is interested in classifying households
as prospective owners or nonowners on the basis of Income (in $1000s) and Lot Size (in 1000
ft²). The marketing expert looked at a random sample of 24 households, given in the file
RidingMowers.csv.
a. Using R, create a scatter plot of Lot Size vs. Income, color-coded by the outcome
variable owner/nonowner. Make sure to obtain a well-formatted plot (create legible labels
and a legend, etc.).
Problem 3.3
Laptop Sales at a London Computer Chain: Bar Charts and Boxplots.
The file LaptopSalesJanuary2008.csv contains data for all sales of laptops at a computer chain in
London in January 2008. This is a subset of the full dataset that includes data for the entire year.
a. Create a bar chart, showing the average retail price by store. Which store has the highest
average? Which has the lowest?
b. To better compare retail prices across stores, create side-by-side boxplots of retail price
by store. Now compare the prices in the two stores from (a). Does there seem to be a
difference between their price distributions?