Professional Documents
Culture Documents
0.1 Introduction
Data visualization is often the first step in any data analysis. Pictorial represen-
tations give useful initial insights about the data and also become useful in model
diagnostics. In this chapter, we discuss a few popular data visualization tools for
univariate and multivariate data. Histograms, boxplots and stem-and-leaf plots
are popularly used for the exploration of numerical data. By numerical data, we
mean quantitative data that may be either integer-valued or real-valued. These
data are represented using different graphical methods than qualitative data,
which usually occur as categories or labels and are usually represented using
barplots, pie diagrams, etc. The R software has wide graphic capabilities; use
demo(graphics) to explore the various possibilities.
attach(Wirebond)
str(Wirebond) # structure of variables
Example 0.2.2 (Crab Mating Data) (Source: [?]) Each female horseshoe
crab in this study had a male crab attached to her in her nest. The study
investigated some factors that were thought to affect whether the female crab
had any other males, called satellites, residing nearby. These factors included
the color, spine condition, weight (kg), and carapace width (cm) of the female
crabs. Color has four levels (1: light medium; 2: medium; 3: dark medium; 4:
1
2
dark). Spine condition has three levels (1: both good; 2: one worn or broken; 3:
both worn or broken). This data is available as crabs in the R package glmbb.
Note that the spine condition levels are coded as good, bad, middle in the R
package.
library(glmbb)
data("crabs",package = "glmbb")
attach(crabs)
The data has 173 observations on six variables of which two are categorical
while the rest are quantitative. Use the function levels on the data object
through the command sapply to see the levels of the variables. The command
factor can be used to encode a vector as a factor.
$spine
[1] "bad" "good" "middle"
$width
NULL
$satell
NULL
$weight
NULL
$y
NULL
color=factor(color,levels=c("light","medium","Dark","Darker"))
#Changes the order of levels.
library(Sleuth3)
data(case2002)
birds=case2002
str(birds)
’data.frame’: 147 obs. of 7 variables:
$ LC: Factor w/ 2 levels "LungCancer","NoCancer": 1 1 1 1 1 1 1 1 1 1 ...
$ FM: Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ SS: Factor w/ 2 levels "High","Low": 2 2 1 2 2 1 1 2 2 1 ...
$ BK: Factor w/ 2 levels "Bird","NoBird": 1 1 2 1 1 2 1 2 1 2 ...
$ AG: int 37 41 43 46 49 51 52 53 56 56 ...
$ YR: int 19 22 19 24 31 24 31 33 33 26 ...
$ CD: int 12 15 15 15 20 15 20 20 10 25 ...
The command relevel(x, ref) reorders the levels of the factor variable LC so that
the level specified in ref becomes the reference or first level.
Example 0.2.4 (Tensile Strength Data) (Source: [?]) This data was dis-
cussed in [?]. An engineer wants to see whether the cotton weight percent-
age(CWP) in a synthetic fiber affects its tensile strength. The cotton weight
percentage is fixed at five different levels, i.e., 15%, 20%, 25%, 30%, and 35%.
Each percentage level is assigned five experimental units and the tensile strength
of the fabric is measured on each of them. The randomization is specified in
the run number column. The goal of the engineer is to investigate whether the
tensile strength is the same across all levels of cotton weight percentage.
library(ACSWR)
data(tensile)
attach(tensile)
str(tensile)
’data.frame’: 25 obs. of 4 variables:
$ Test_Sequence : int 1 2 3 4 5 6 7 8 9 10 ...
4
0.3.2 Boxplots
A box and whisker plot, also known as a boxplot, is a popular way of displaying
the distribution of numerical data based on the five number summary which in-
cludes the minimum, first quartile, median, third quartile, and maximum. Box
plots display information about the shape of the data distribution as well as
its central value and variability. The box represents the middle 50 percent of
the data. Boxplots also have lines extending vertically from the boxes; these
are called whiskers and indicate variability beyond the upper and lower quar-
tiles.The spacing between the different parts of the box indicates the degree of
dispersion (spread) and skewness in the data and also reveal outliers. To draw
a boxplot of sample data, use the command
The option range provides a distance to which the whiskers extend out from the
box. The option width gives the relative width of the boxes. The option outline
determines whether the outliers are to be drawn. By default, the boxplots
are vertical. The option horizontal=T gives horizontal boxplots. The option
notch=TRUE provides notches on each side of the box. If the notches in two
boxplots do not overlap, we can conclude that the two medians differ.
classes of values occur. For cases where there are many leaves per stem, we may
split the data corresponding to one stem into several rows to enhance readability.
We can use a stem and leaf plot for
• detecting outliers.
stem(pull.strength)
0 | 00024778
2 | 12244582557
4 | 25747
6 | 9
The options lty= and lwd= specifies line type and line width, the default in
each case being 1. Try changing the pch option to any number between 1 and
25. What do you see?
on the x-axis and the corresponding quantile values of the data set (which
correspond to the ordered values of the data set) on the y-axis. To draw a
normal Q-Q plot of data, use the command qqnorm().
Note that although the central part of the plot in Figure 4 appear to lie along
a straight line, points at the tails seem to deviate from the line. You may also
consider using tests for normality such as Kolmogrov-Smirnov or Shapiro-Wilk
tests.
d=table(spine)
pie(d,labels = rownames(d),col=rainbow(length(d)),main=NA)
0.4.2 Barplots
A barplot or a barchart is a simple graphical tool for visually summarizing
qualitative data, viz., data that may be classified into one of several, say K
categories. The frequency of each category is the total number of measurements
in that category. The relative frequency of each category is the proportion of
data in that category, and is the the ratio of the frequency of the category to
the total number of measurements. To construct a barchart, we show categories
on the x-axis, place corresponding frequencies (or relative frequencies) on the
y-axis, and construct vertical bars of equal width, one for each category. The
height of a bar is proportional to the frequency (or relative frequency) of that
category. There are many variations of the barchart, for instance, the bars may
be displayed horizontally or vertically.
10
x=table(spine, color)
barplot(x,beside = F,col=c("red","yellow","orange"),ylab = "Counts",
main =NA)
legend("topleft",rownames(x),cex=1,fill = c("red","yellow","orange"))
Does Figure 7 suggest that the percentage of cotton affects the tensile
strength of the fabric? We will discuss this question in Chapter ??.
Figure 8: Empirical Q-Q plot of width for crabs with good and bad spine
condition
library(car)
14
scatterplotMatrix(Wirebond[,-1],main=NA)
Note the [,-1] excludes the observation number from the plot.
plot(wire.length,pull.strength)
plot(jitter(wire.length),jitter(pull.strength))
(a) (b)
(c)
Figure 11: (a) Scatter plot of pull.strength versus wire.length; (b) Scat-
ter plot of jittered pull.strength versus jittered wire.length; (c) Smoothed
scatter plot of pull.strength versus wire.length
16
smoothScatter(wire.length,pull.strength)
The darker shades correspond to high density regions. Here, we do not have
to jitter the variables since the smoothScatter function automatically does it by
spreading a pair of variables over a small region.
attach(birds)
y=table(LC,BK)
spineplot(y,col = c("pink","orange"),main=NA)
In Figure 12, the partitions within the bars are the proportions of a given
status of bird keeping in having lung cancer or not. It appears that individuals
who have not kept birds are less prone to lung cancer. We will analyze this data
further in Chapter 13.
Note that all the graphics can be saved as files with commonly used filetypes
such as eps, pdf, jpg, svg, etc. For a more elaborate illustration of base R’s
capabilities for data visualization, refer to [?].
Exercises
1. (i) In the hist command, try changing the values of breaks and col, and
redraw Figure 1 (a).
(ii) Show that by using label=T, you can label the corresponding fre-
quency at the top of a frequency histogram.
0.6. VISUALIZATION FOR DEPENDENT MULTIVARIATE DATA 17
2. Explain what you see when you use the boxplot() function in these ways.
5. The area in thousands of square miles of the land masses which exceed
10,000 sq.miles is given in the dataset Islands in R. Draw a stem and
leaf plot. Comment on the skewness. Do you see an outlier?
7. In Example 0.2.2, draw a pie diagram categorizing the crabs by the con-
dition of their spines.
8. ggplot2 is an extensive R package with very flexible options compared to
the base graphics in R. Draw a barplot of the variable color in Exam-
ple 0.2.2 using the package ggplot2. Hint: ggplot(crabs,aes(color))+geom bar().
9. For the data in Example 0.2.2, draw a geometric count diagram for the
variables satell, width, color.
10. Use the data mtcars to answer these questions.
(i) Draw a scatter plot of the variables mpg and wt. Discuss whether the
two variables are associated.
(ii) Repeat (i) using the package ggplot2.
(iii) Create a matrix scatter plot for the data mtcars. On which variables
do you think mpg depends.
(iv) Which pair of variables in mtcars has the highest correlation? (Hint:
use the cor command.)