Notes 3

0.1 Introduction
Data visualization is often the first step in any data analysis. Pictorial represen-
tations give useful initial insights about the data and also become useful in model
diagnostics. In this chapter, we discuss a few popular data visualization tools for
univariate and multivariate data. Histograms, boxplots and stem-and-leaf plots
are popularly used for the exploration of numerical data. By numerical data, we
mean quantitative data that may be either integer-valued or real-valued. These
data are represented using different graphical methods than qualitative data,
which usually occur as categories or labels and are usually represented using
barplots, pie diagrams, etc. The R software has wide graphic capabilities; use
demo(graphics) to explore the various possibilities.
0.2 Data Sets

Before embarking on a statistical analysis, it is important to explore the data
we are dealing with. It is good practice to identify the variables, study the data
structure and handle the data appropriately. This helps us to identify suitable
statistical tools for data analysis. The function str() helps us to understand the
structure of data sets stored in R. Data types are character, numeric, integer,
complex or logical. The function typeof() will exhibit the data type. Factors
are normally used to group variables into categories or levels. The function
as.factor() coerces its argument to a factor. The function levels() provides access
to the levels attribute of a factor variable.
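As a small illustration of these functions, the sketch below builds a toy data frame and inspects it. The object toy and its columns id, group and score are made up for illustration; they are not one of the data sets used in this chapter.

```r
# Toy data frame; all names and values here are illustrative assumptions.
toy = data.frame(id = 1:4,
                 group = c("a", "b", "a", "c"),
                 score = c(2.5, 3.1, 4.0, 1.7),
                 stringsAsFactors = FALSE)

str(toy)              # compact display of the structure
typeof(toy$id)        # "integer"
typeof(toy$score)     # "double"
typeof(toy$group)     # "character"

toy$group = as.factor(toy$group)  # coerce the character column to a factor
levels(toy$group)     # "a" "b" "c"
typeof(toy$group)     # "integer": a factor is stored as integer codes
```

Note that typeof() reports "double" for real-valued numeric vectors.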
Example 0.2.1 (Wirebond Strength Data) The data describes an observational
study at a semiconductor manufacturing plant where the finished
semiconductor was wire-bonded to a frame. This data, given in [?], consists of 25
observations on three quantitative or numeric variables: pull.strength (the
amount of force required to break the bond), wire.length (length of the wire)
and die.height (height of the die).
attach(Wirebond)
str(Wirebond) # structure of variables
’data.frame’: 25 obs. of 3 variables:
$ pull.strength: num 9.95 24.45 31.75 35 25.02 ...
$ wire.length : int 2 8 11 10 8 4 2 2 9 8 ...
$ die.height : int 50 110 120 550 295 200 375 52 100 300 ...
Example 0.2.2 (Crab Mating Data) (Source: [?]) Each female horseshoe
crab in this study had a male crab attached to her in her nest. The study
investigated some factors that were thought to affect whether the female crab
had any other males, called satellites, residing nearby. These factors included
the color, spine condition, weight (kg), and carapace width (cm) of the female
crabs. Color has four levels (1: light medium; 2: medium; 3: dark medium; 4:
dark). Spine condition has three levels (1: both good; 2: one worn or broken; 3:
both worn or broken). This data is available as crabs in the R package glmbb.
Note that the spine condition levels are coded as good, bad, middle in the R
package.
library(glmbb)
data("crabs",package = "glmbb")
attach(crabs)
The data has 173 observations on six variables of which two are categorical
while the rest are quantitative. Use the function levels on the data object
through the command sapply to see the levels of the variables. The command
factor can be used to encode a vector as a factor.
sapply(crabs, levels) #levels of variables in crabs.

$color
[1] "dark" "darker" "light" "medium"
$spine
[1] "bad" "good" "middle"
$width
NULL
$satell
NULL
$weight
NULL
$y
NULL
color=factor(color,levels=c("light","medium","dark","darker"))
#Changes the order of levels.
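The labels supplied to levels must match the existing values exactly, including case. The following sketch shows this on a made-up vector shade (a hypothetical name, not part of the crabs data).

```r
# Illustrative vector; the values are made up.
shade = c("dark", "light", "medium", "light")

# Default: levels are sorted alphabetically.
f1 = factor(shade)
levels(f1)                       # "dark" "light" "medium"

# Explicit order of levels.
f2 = factor(shade, levels = c("light", "medium", "dark"))
levels(f2)                       # "light" "medium" "dark"

# A label that does not match exactly (here "Dark") turns the value into NA.
f3 = factor(shade, levels = c("light", "medium", "Dark"))
sum(is.na(f3))                   # 1
```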
Example 0.2.3 (Birdkeeping Data) (Source: [?]) A 1972–1981 health survey
in The Hague, The Netherlands, discovered an association between keeping
pet birds and increased risk of lung cancer. To investigate bird keeping as a risk
factor, researchers conducted a case-control study of patients in 1985 at four
hospitals in The Hague (population 450,000). They identified 49 cases of lung
cancer among the patients who were registered with a general practice, who
were aged 65 or younger and who had resided in the city since 1965. They also
selected 98 controls from a population of residents having the same general age
structure. They collected data on the following variables:
LC, whether the subject has lung cancer,
FM, gender of the subject,
SS, socioeconomic status, determined by occupation of the household’s
principal wage earner,
BK, indicator of bird keeping,
AG, age of the subject (in years),
YR, years of smoking prior to diagnosis or examination, and
CD, average rate of smoking (in number of cigarettes per day).
This data is available in the R package Sleuth3.
library(Sleuth3)
data(case2002)
birds=case2002
str(birds)
’data.frame’: 147 obs. of 7 variables:
$ LC: Factor w/ 2 levels "LungCancer","NoCancer": 1 1 1 1 1 1 1 1 1 1 ...
$ FM: Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ SS: Factor w/ 2 levels "High","Low": 2 2 1 2 2 1 1 2 2 1 ...
$ BK: Factor w/ 2 levels "Bird","NoBird": 1 1 2 1 1 2 1 2 1 2 ...
$ AG: int 37 41 43 46 49 51 52 53 56 56 ...
$ YR: int 19 22 19 24 31 24 31 33 33 26 ...
$ CD: int 12 15 15 15 20 15 20 20 10 25 ...
The command relevel(x, ref) reorders the levels of the factor x so that
the level specified in ref becomes the reference or first level.
#Change the reference level of LC to NoCancer.

birds$LC=relevel(birds$LC,ref = "NoCancer")
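The effect of relevel() can be checked on a small factor. In this sketch the factor status and its labels are made up for illustration.

```r
# Illustrative factor; the labels are made up.
status = factor(c("case", "control", "control", "case"))
levels(status)                        # "case" "control"

status2 = relevel(status, ref = "control")
levels(status2)                       # "control" "case"

# The data values themselves are unchanged; only the level order differs.
identical(as.character(status), as.character(status2))   # TRUE
```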
Example 0.2.4 (Tensile Strength Data) (Source: [?]) This data was
discussed in [?]. An engineer wants to see whether the cotton weight
percentage (CWP) in a synthetic fiber affects its tensile strength. The cotton weight
percentage is fixed at five different levels, i.e., 15%, 20%, 25%, 30%, and 35%.
Each percentage level is assigned five experimental units and the tensile strength
of the fabric is measured on each of them. The randomization is specified in
the run number column. The goal of the engineer is to investigate whether the
tensile strength is the same across all levels of cotton weight percentage.
library(ACSWR)
data(tensile)
attach(tensile)
str(tensile)
’data.frame’: 25 obs. of 4 variables:
$ Test_Sequence : int 1 2 3 4 5 6 7 8 9 10 ...
$ Run_Number : int 8 18 10 23 17 5 14 6 15 20 ...
$ CWP : int 20 30 20 35 30 15 25 20 25 30 ...
$ Tensile_Strength: int 12 19 17 7 25 7 14 12 18 22 ...
Recall as.factor coerces its argument to a factor.

tensile$CWP=as.factor(tensile$CWP) #coerces CWP to a factor variable.
0.3 Visualization for Univariate Quantitative Data

0.3.1 Histograms
A histogram is a pictorial representation of numerical or quantitative data. It
uses rectangles to show the frequency or relative frequencies of data items in
successive intervals. A histogram is constructed by dividing the entire range of
data values into a series of intervals called bins, which are represented along the
x-axis, while the frequencies (or relative frequencies) of data items in each bin
are shown by the heights of rectangles measured along the y-axis. The shape of a
histogram provides useful information about the location, spread, skewness,
outliers and density of the data.
A histogram can be plotted using the command hist(x, breaks = , freq =
NULL, right = TRUE, density = NULL, angle = 45) where x is the vector of
values for which the histogram is to be plotted. Using the parameter breaks we
can suggest the number of bins; R treats a single number as a suggestion only
and chooses convenient break points near it. When freq = FALSE, probability
densities are plotted, so that the total area of the rectangles is one. For
right = TRUE, the cells in the histogram will be right-closed (left open)
intervals. The parameter density provides the density of shading lines, in
lines per inch, the default value of NULL giving no shading lines. The option
angle gives the slope of shading lines, as an angle in degrees.
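The quantities behind a histogram can be inspected without drawing it by setting plot = FALSE. The sketch below uses simulated data (an assumption made purely for illustration) to show the bin counts and the density scale.

```r
set.seed(1)                       # illustrative simulated data
x = rnorm(200, mean = 50, sd = 10)

h = hist(x, breaks = 12, plot = FALSE)   # compute the bins without drawing
length(h$counts)                  # number of bins actually used; may differ from 12
sum(h$counts)                     # 200: every observation falls in exactly one bin

# On the density scale (freq = FALSE) the bar areas sum to 1.
sum(h$density * diff(h$breaks))   # 1
```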
Example 0.2.1. Wirebond Data (continued)

hist(pull.strength,main=NA,
breaks=12,col=3)
Figure 1 shows a histogram of pull.strength. To add a normal curve to
the plot we use
hist(pull.strength, main = "Histogram of pull strength",
     breaks = 12, col = 3, freq = FALSE) # density scale, comparable to the curve
# curve() evaluates the expression in x over the current x-axis range
curve(dnorm(x, mean = mean(pull.strength), sd = sd(pull.strength)), add = TRUE)
We may also add a kernel density plot to the histogram using
hist(pull.strength,main=NA,
breaks=12,col=3,prob=T)
lines(density(pull.strength))
Figure 1: Histogram of pull strength (a) with a normal curve; (b) with a kernel density function
0.3.2 Boxplots
A box and whisker plot, also known as a boxplot, is a popular way of displaying
the distribution of numerical data based on the five number summary, which
includes the minimum, first quartile, median, third quartile, and maximum.
Boxplots display information about the shape of the data distribution as well as
its central value and variability. The box represents the middle 50 percent of
the data. Boxplots also have lines extending from the boxes; these are called
whiskers and indicate variability beyond the upper and lower quartiles. The
spacing between the different parts of the box indicates the degree of
dispersion (spread) and skewness in the data, and points plotted beyond the
whiskers reveal possible outliers. To draw a boxplot of sample data, use the
command
boxplot(x, range = , width = , notch = , outline =, names, horizontal = ).
The option range determines how far the whiskers extend out from the box, as a
multiple of the interquartile range (the default is 1.5). The option width gives
the relative widths of the boxes. The option outline determines whether the
outliers are to be drawn. By default, the boxplots are vertical. The option
horizontal=TRUE gives horizontal boxplots. The option notch=TRUE provides
notches on each side of the box. If the notches in two boxplots do not overlap,
this is strong evidence that the two medians differ.
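The numbers that a boxplot is built from can be obtained without drawing anything. The following sketch uses a made-up vector x with one unusually large value; fivenum() gives the five number summary, and boxplot(..., plot = FALSE) returns the statistics that would be drawn.

```r
# Illustrative data (made up) with one unusually large value.
x = c(2, 4, 5, 7, 8, 9, 11, 12, 14, 40)

fivenum(x)      # 2 5 8.5 12 40: minimum, lower hinge, median, upper hinge, maximum

b = boxplot(x, range = 1.5, plot = FALSE)   # compute without drawing
b$stats         # 2 5 8.5 12 14: whisker ends, hinges and median used for the plot
b$out           # 40: the point drawn individually as an outlier
```

Here the box spans 5 to 12, so the upper whisker may reach at most 12 + 1.5 × 7 = 22.5; it stops at 14, the largest observation within that limit, and 40 is flagged as an outlier.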

boxplot(pull.strength,main="Pull Strength")
A boxplot for pull.strength is shown in Figure 2. The box represents
the middle 50 percent of the data, and the line inside it shows that the median
pull strength is just above 20. The data set is right skewed and one point seems
to be an outlier.
Figure 2: Boxplot of pull.strength
0.3.3 Stem and Leaf plots

A stem and leaf plot is a textual graph of quantitative data. Each data value is
split into a stem and a leaf. The plot shows the frequency with which certain
classes of values occur. For cases where there are many leaves per stem, we may
split the data corresponding to one stem into several rows to enhance readability.
We can use a stem and leaf plot for
• showing the frequency corresponding to observed data values,
• checking skewness of the data, and
• detecting outliers.
In R we can use the command stem(x, scale =, width=, atom =) where x is a
numeric vector. The values assigned to scale and width control the length and
width of the plot respectively, while atom specifies the tolerance. The default
is scale = 1, width = 80 and atom = 1e-08.

The stem and leaf plot of pull.strength is obtained as follows:
stem(pull.strength)
The decimal point is 1 digit(s) to the right of the |
0 | 00024778
2 | 12244582557
4 | 25747
6 | 9
We observe that the distribution of pull.strength is right skewed, which
is what the boxplot showed as well. The row with stem 2, which covers values
from 20 to 39, holds the largest share of the data (11 of the 25 values).
Further, the last row of the stem plot (data value=69) may be treated
as an outlier.
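The idea of splitting each value into a stem and a leaf can be sketched by hand. In the example below the vector x is made up, the stem is taken to be the tens digit and the leaf the units digit; stem() makes its own choice of stem width, so its display need not match this manual split.

```r
# Illustrative two-digit values (made up), already sorted.
x = c(12, 15, 21, 24, 24, 27, 33, 38, 41, 56)

stems  = x %/% 10     # tens digit
leaves = x %% 10      # units digit

# Collect the leaves belonging to each stem into one string per row.
rows = tapply(leaves, stems, paste, collapse = "")
rows                  # stems 1 to 5 with leaves "25" "1447" "38" "1" "6"

stem(x)               # the built-in display for the same data
```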
Figure 3: Dot plot of color of crabs categorized by spine condition
0.3.4 Dot Plot or Dot Chart

A dot plot or a dot chart, which is sometimes referred to as a one-dimensional
scatter plot, is a simple and compact way to describe univariate data. It consists
of a plot of the data along an axis labeled according to the measurement scale.
We can plot a dot chart using the command dotchart(x, labels = NULL, groups
= NULL, gdata = NULL). Here x is a vector or matrix of numeric values. The
option labels provides labels for each point. Grouping the elements of x can be
done using the option groups. The option gdata provides the data values for
the groups, which are usually a summary statistic such as the mean or median of
each group.
Example 0.2.2. Crab Mating Data (continued)

y=table(color,spine)
dotchart(y,color = c("red","blue","green","black"),pch = 16,
main = NA,xlab="Counts")
The options lty= and lwd= specify line type and line width, the default in
each case being 1. Try changing the pch option to any number between 1 and
25. What do you see?
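The options groups and gdata are not used in the crab example, so here is a minimal sketch of them on made-up data. The names score, item and grp are hypothetical, and the group means are used as the summary values passed to gdata.

```r
# Illustrative data (made up): a score for six items in two groups.
score = c(3, 5, 4, 8, 9, 7)
item  = c("a", "b", "c", "d", "e", "f")
grp   = factor(c("low", "low", "low", "high", "high", "high"))

grp_means = tapply(score, grp, mean)   # one summary value per group
grp_means                              # high 8, low 4

dotchart(score, labels = item, groups = grp, gdata = grp_means,
         pch = 16, xlab = "Score")
```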
0.3.5 Normal Quantile-Quantile (Q-Q) plot

The normal Q-Q plot is a graphical technique for assessing whether or not a
random sample of data is normally distributed. In R, we draw the normal Q-Q
plot where the quantile values of the standard normal distribution are plotted
on the x-axis and the corresponding quantile values of the data set (which
correspond to the ordered values of the data set) on the y-axis. To draw a
normal Q-Q plot of data, use the command qqnorm().
qqnorm(y) # normal Q-Q plot
To draw the normal Q-Q plot of pull.strength, use:
qqnorm(pull.strength, main = NA,
       xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
       plot.it = TRUE, datax = FALSE, pch=20)
qqline(pull.strength,datax=FALSE, distribution = qnorm,
       probs = c(0.25, 0.75), qtype = 7) # adds a QQ line to the plot
Figure 4: Normal Q-Q plot of pull strength
Note that although the central part of the plot in Figure 4 appears to lie along
a straight line, points at the tails seem to deviate from the line. You may also
consider using tests for normality such as the Kolmogorov-Smirnov or Shapiro-Wilk
tests.
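To see what qqnorm() actually plots, the coordinates can be extracted with plot.it = FALSE. The sketch below uses a simulated, deliberately non-normal sample (an assumption for illustration) and also runs the Shapiro-Wilk test mentioned above.

```r
set.seed(2)                         # illustrative simulated sample
y = rexp(30)                        # a clearly non-normal sample

qq = qqnorm(y, plot.it = FALSE)     # coordinates only, no plot
# The x-coordinates are standard normal quantiles at the probabilities
# ppoints(n); the y-coordinates are the data values themselves.
all.equal(sort(qq$x), qnorm(ppoints(length(y))))   # TRUE
all.equal(sort(qq$y), sort(y))                     # TRUE

shapiro.test(y)                     # Shapiro-Wilk test of normality
```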
0.4 Visualization of Univariate Qualitative or Categorical Data

We describe the pie diagram or pie chart for one qualitative variable followed
by a segmented bar graph to represent two qualitative variables.
Figure 5: Pie diagram showing the crabs categorized by spine condition.
0.4.1 Pie Diagram

A pie chart or a pie diagram is often used to represent the frequency of each
categorical data value as a percentage of the total number of observations. We
construct a circle (which spans 360° at the center), then divide the circle into
sectors, one for each distinct category value, with the size of any given sector
being proportional to the percentage (or relative frequency) of that category.
Note that the complete circle represents the total number of measurements. We
determine the angle of each slice by multiplying the relative frequency by 360°.

Figure 5 shows the pie chart for spine condition of the crabs. The spine condition
for most of the crabs appears to be either worn or broken!
d=table(spine)
pie(d,labels = rownames(d),col=rainbow(length(d)),main=NA)
Try heat.colors(n), terrain.colors(n), topo.colors(n), and cm.colors(n)
instead of using the rainbow option. The command colors() returns the 657 color
names available in R. To see these names, refer to the color chart in [?]. The
dot plot or the bar graph are preferred by users over the pie chart ([?], p. 264).
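The rule for the slice angles (relative frequency times 360 degrees) can be checked with a short calculation. The counts below are made up for illustration and are not the crab data.

```r
# Illustrative category counts (made up).
counts = c(good = 10, middle = 5, bad = 25)

rel_freq = counts / sum(counts)    # relative frequencies
angles   = rel_freq * 360          # central angle of each slice, in degrees
angles                             # good 90, middle 45, bad 225
sum(angles)                        # 360

pie(counts, labels = names(counts))
```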
0.4.2 Barplots
A barplot or a barchart is a simple graphical tool for visually summarizing
qualitative data, viz., data that may be classified into one of several, say K
categories. The frequency of each category is the total number of measurements
in that category. The relative frequency of each category is the proportion of
data in that category, and is the ratio of the frequency of the category to
the total number of measurements. To construct a barchart, we show categories
on the x-axis, place corresponding frequencies (or relative frequencies) on the
y-axis, and construct vertical bars of equal width, one for each category. The
height of a bar is proportional to the frequency (or relative frequency) of that
category. There are many variations of the barchart, for instance, the bars may
be displayed horizontally or vertically.
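Frequencies and relative frequencies can be computed with table() and prop.table() before plotting. The sketch below uses a made-up vector grade (a hypothetical name) and draws both a vertical and a horizontal version of the barchart.

```r
# Illustrative categorical data (made up).
grade = c("B", "A", "C", "B", "B", "A", "C", "B")

freq     = table(grade)          # frequency of each category
rel_freq = prop.table(freq)      # frequency divided by the total count

freq                             # A 2, B 4, C 2
rel_freq                         # A 0.25, B 0.50, C 0.25

barplot(rel_freq, ylab = "Relative frequency")               # vertical bars
barplot(rel_freq, horiz = TRUE, xlab = "Relative frequency") # horizontal bars
```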
Figure 6: Barplot of crabs categorized by spine condition and color.

The barplot in Figure 6 can be constructed using the barplot() function. Light colored
crabs seem to be less prone to having a bad spine condition than the other
colored crabs, suggesting a possible relationship between spine condition and
color of the crabs.
x=table(spine, color)
barplot(x,beside = F,col=c("red","yellow","orange"),ylab = "Counts",
main =NA)
legend("topleft",rownames(x),cex=1,fill = c("red","yellow","orange"))
0.5 Visualization of Independent Multivariate Data

In many examples, we have observations on more than one variable per exper-
imental unit. Parallel boxplots and empirical Q-Q plots are useful graphical
tools and are discussed below.
0.5.1 Parallel Boxplots

Parallel boxplots are used for comparing several independent samples. They are
also useful when we need to create separate boxplots for a variable for each level
of another variable. In both situations, parallel boxplots enable us to compare
boxes that are drawn on the same scale. These boxplots are also referred to as
side-by-side boxplots. To draw parallel boxplots of a numerical variable y across
several levels of a categorical variable x, use the command
boxplot (y~x)

We can draw parallel boxplots for the tensile strength against different levels of
the cotton percentages:
boxplot (Tensile_Strength~CWP,main="",xlab="Percentage of cotton",
ylab="Tensile strength")
Figure 7: Parallel boxplots of tensile strength versus percentage of cotton.
Does Figure 7 suggest that the percentage of cotton affects the tensile
strength of the fabric? We will discuss this question in Chapter ??.
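The statistics drawn in parallel boxplots are returned by boxplot(), which makes it easy to compare the groups numerically as well. The sketch below uses made-up values resp and a made-up factor level (hypothetical names, not the tensile data).

```r
# Illustrative data (made up): a response measured at three levels of a factor.
resp  = c(5, 7, 6, 9, 11, 10, 14, 15, 13)
level = factor(rep(c("low", "mid", "high"), each = 3),
               levels = c("low", "mid", "high"))

b = boxplot(resp ~ level, xlab = "Level", ylab = "Response")
b$names                       # "low" "mid" "high": order of the factor levels
b$stats[3, ]                  # 6 10 14: the medians drawn in the three boxes
tapply(resp, level, median)   # the same medians computed directly
```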
0.5.2 Empirical Q-Q Plots

Quantile-quantile (Q-Q) plots are useful graphical tools that enable us to com-
pare two data sets via their distribution functions. In general, in any Q-Q plot,
the quantiles of one distribution (theoretical or empirical) are plotted against
the quantiles of another.
The empirical Q-Q plot is constructed by plotting the sample quantiles (or
the quantiles of the empirical distribution) of one data set against the
corresponding quantiles (for the same proportion p) of the other. For example,
the median and quartiles of one set are plotted against the median and quartiles
of the second set.
This plot is extremely simple to construct in the situation where both data
sets have the same number of observations. In this case, the empirical Q-Q
plot is obtained by plotting the order statistics of one data set against the
order statistics of the second data set. If the two samples have different sizes,
interpolation is necessary.
If the two empirical distributions coincide, all of the points in the empirical
Q-Q plot would lie exactly on the 45 degree line y = x; otherwise, departures
from this line provide information on how the two distributions differ. If the
points lie on any straight line, the two data sets have the same shape. If
the points in an empirical Q-Q plot show large, systematic departures from
a straight line, the two data distributions differ in shape. In R, we use the
command qqplot(x, y, plot.it = TRUE, xlab, ylab).
Figure 8: Empirical Q-Q plot of width for crabs with good and bad spine
condition
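The two cases described above (equal and unequal sample sizes) can be checked by extracting the plotted coordinates with plot.it = FALSE. The samples below are simulated purely for illustration.

```r
set.seed(3)                       # illustrative simulated samples
a = rnorm(20)
b = rnorm(20, mean = 1)

# Equal sample sizes: the plotted points are the paired order statistics.
q1 = qqplot(a, b, plot.it = FALSE)
all.equal(q1$x, sort(a))          # TRUE
all.equal(q1$y, sort(b))          # TRUE

# Unequal sample sizes: the larger sample is interpolated down to the
# size of the smaller one, so both coordinate vectors have length 20.
b_long = rnorm(50, mean = 1)
q2 = qqplot(a, b_long, plot.it = FALSE)
length(q2$x)                      # 20
length(q2$y)                      # 20
```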
Example 0.2.2 Crab Mating Data (continued)

The empirical Q-Q plot of the widths pertaining to good and bad spine
conditions in Figure 8 does not show any large systematic departures from a
straight line, indicating that the widths in the two groups do not differ in shape.
qqplot(width[spine=="good"], width[spine=="bad"], plot.it = TRUE,

xlab= "good spine condition", ylab="bad spine condition",
main=NA )
0.6 Visualization for Dependent Multivariate Data

In many situations, we encounter data sets with several variables that may be
associated with one another. In this section, we discuss a few ways to visualize
such data.
0.6.1 Scatter Plots

A scatter plot is a graph in which each case in the data set is plotted in a two-
dimensional coordinate system. If the points show a tendency to lie about a line
with either positive or negative slope then we can suspect a linear relationship.
To draw a scatter plot of data on two variables x and y, use the command
plot(x,y) #scatter plot

Figure 9: Scatter plots for the Wirebond Data (a) against wire length; (b) against die height

The variables pull.strength and wire.length are associated with the same
semiconductor and may be dependent. Likewise, pull.strength and die.height
may also be associated. A scatter plot of pull.strength against wire.length
and a scatterplot of pull.strength against die.height can be constructed as
follows:
plot(wire.length,pull.strength,
main=NA,
xlab= "Wire length",ylab="Pull strength",
pch=20,col=2)
plot(die.height,pull.strength,main=NA, xlab= "Die height",ylab="Pull strength",
pch=20,col=4)
From the first scatter plot, we suspect a linear relationship between wire.length
and pull.strength whereas the points are more randomly scattered in the sec-
ond scatter plot, and a strong linear relationship between pull.strength and
die.height is less likely. Try changing pch and col to different values. You can
see the pattern and color of plotted points changing.
0.6.2 Matrix Scatter Plots

A matrix scatter plot (also referred to as a scatter plot matrix) helps us to visualize
relationships between three or more variables. The plot consists of pairwise
scatter plots of the variables in a matrix format. Kernel densities (by default)
of the variables are shown along the main diagonal.

A scatter plot of pull.strength against both the predictors wire.length and
die.height can be constructed in a single figure by using the
scatterplotMatrix() function in the car package.
library(car)
scatterplotMatrix(Wirebond[,-1],main=NA)
Figure 10: Matrix Scatter Plot for the Wirebond Data
Note that [,-1] excludes the observation number from the plot.
0.6.3 Jittered Scatter Plots

In some situations, a jittered scatter plot gives a better visualization of the
data, by avoiding over plotting. Over plotting can occur when a continuous
measurement is rounded to some convenient unit. Jittering is the act of adding
random noise to the data through pseudo random variables. That is, we replace
y by y + u where u may be a Uniform(−d, d) variable, and d may be chosen as
the smallest distance between two distinct values of y. Over plotting may also
occur if we have a very large number of data points or when there is a perfect
linear relationship between two variables, and jittering may not be helpful in
such situations. We construct such a plot using the command
plot(jitter(x),jitter(y)) # jittered scatter plot
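The replacement of y by y + u described above can be written out directly. This is a sketch of that description on made-up rounded values; R's jitter() chooses its own default amount of noise, which need not equal the d used here.

```r
set.seed(4)
# Illustrative rounded measurements (made up), so several values coincide.
y = c(10, 10, 12, 12, 12, 14, 16, 16)

d = min(diff(sort(unique(y))))           # smallest gap between distinct values
u = runif(length(y), min = -d, max = d)  # Uniform(-d, d) noise
y_jit = y + u

d                                        # 2
d >= max(abs(y_jit - y))                 # TRUE: each point moves by at most d
```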
An alternative approach to jittering uses the smoothScatter() function in R.
It produces a smooth density representation of the scatter plot using color. To
draw the smoothed scatter plot, use the command smoothScatter().
smoothScatter(x,y) # smooth scatter plot

We examine the scatter plot for the unjittered data in Figure 11 (a). To draw the
unjittered and jittered scatter plots of the data (see Figures 11 (a) and 11 (b)),
use the commands:
plot(wire.length,pull.strength)
plot(jitter(wire.length),jitter(pull.strength))
To draw the smoothed scatter plot in Figure 11 (c), use the command smoothScatter():
smoothScatter(wire.length,pull.strength)
Figure 11: (a) Scatter plot of pull.strength versus wire.length; (b) Scatter
plot of jittered pull.strength versus jittered wire.length; (c) Smoothed
scatter plot of pull.strength versus wire.length
The darker shades correspond to high density regions. Here, we do not have
to jitter the variables, since smoothScatter() smooths the points with a
two-dimensional kernel density estimate, which spreads each pair of values over
a small region.
0.6.4 Spine plots and Spinograms

The relationship between two categorical variables can be explored using spine
plots. A spine plot displays the conditional distribution of y given x together
with the marginal distribution of x; both quantities are approximated by the
corresponding empirical relative frequencies. When x is numerical,
spineplot(x,y) gives spinograms. For the spinogram, we first discretize x by
using hist with the breaks option, and then compute empirical relative
frequencies. Spinograms may be viewed as extended stacked histograms. Spine
plots may also be viewed as generalizations of stacked barplots, where the
widths of the bars (and not their heights) correspond to the relative
frequencies of x. The heights of the partitions within each bar then correspond
to the conditional relative frequencies of y in that x group; the levels of y
are labeled on the left axis and the frequencies are quantified on the right axis.
To draw a spine plot, use the commands spineplot(x,y) or spineplot(y ~ x),
where y is categorical and x can be categorical or numerical.
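The two sets of relative frequencies that a spine plot displays can be computed with prop.table(). The sketch below uses two made-up factors x and y (not the birdkeeping data) to show the bar widths and the splits within the bars.

```r
# Illustrative two-way data (made up).
x = factor(c("u", "u", "u", "u", "v", "v", "v", "v", "v", "v"))
y = factor(c("no", "no", "no", "yes", "no", "yes", "yes", "yes", "no", "no"))

tab = table(x, y)
widths = prop.table(margin.table(tab, 1))  # bar widths: relative frequencies of x
splits = prop.table(tab, margin = 1)       # within-bar splits: y given x

widths                     # u 0.4, v 0.6
splits                     # u: no 0.75, yes 0.25; v: no 0.5, yes 0.5

spineplot(y ~ x)           # the same information drawn as a spine plot
```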
Example 0.2.3. Birdkeeping Data (continued)

We draw a spine plot to investigate whether bird keeping is related to lung
cancer.
attach(birds)
y=table(LC,BK)
spineplot(y,col = c("pink","orange"),main=NA)
In Figure 12, the partitions within each bar show the proportions of bird
keepers and non-keepers among the subjects with and without lung cancer. It
appears that individuals who have not kept birds are less prone to lung cancer.
We will analyze this data further in Chapter 13.
Note that all the graphics can be saved as files with commonly used filetypes
such as eps, pdf, jpg, svg, etc. For a more elaborate illustration of base R’s
capabilities for data visualization, refer to [?].
Figure 12: Spine plot of bird keeping versus lung cancer
Exercises
1. (i) In the hist command, try changing the values of breaks and col, and
redraw Figure 1 (a).
(ii) Show that by using labels=TRUE, you can label the corresponding
frequency at the top of a frequency histogram.
(iii) Draw a histogram of the variable wire.length. By adding a normal
curve to the plot, discuss whether you may assume that wire.length
follows a normal distribution.
(iv) Compare the three variables in the Wirebond data in Example 0.2.1
by drawing all three histograms in a single plot. (Hint: use add=T)
2. Explain what you see when you use the boxplot() function in these ways.
(i) boxplot(pull.strength, varwidth=T)
(ii) boxplot(pull.strength, notch=T)
(iii) boxplot(pull.strength, outline=T)
(iv) boxplot(pull.strength, col=3, border=1)
(v) boxplot(pull.strength, add=T)
3. Construct and interpret a stem and leaf plot of pull.strength with
scale=2.
4. Construct and interpret a stem and leaf plot of wire.length. Comment
on the skewness in the distribution of this variable.
5. The area in thousands of square miles of the land masses which exceed
10,000 sq. miles is given in the dataset islands in R. Draw a stem and
leaf plot. Comment on the skewness. Do you see an outlier?
6. Generate a sample of size n = 100 from a normal distribution with mean 25
and standard deviation 3. Find summary statistics of the data. Construct
a histogram, boxplot and a stem-and-leaf plot of the data. Discuss their
features.
7. In Example 0.2.2, draw a pie diagram categorizing the crabs by the
condition of their spines.
8. ggplot2 is an extensive R package with very flexible options compared to
the base graphics in R. Draw a barplot of the variable color in
Example 0.2.2 using the package ggplot2. Hint: ggplot(crabs,aes(color))+geom_bar().
9. For the data in Example 0.2.2, draw a geometric count diagram for the
variables satell, width, color.
10. Use the data mtcars to answer these questions.
(i) Draw a scatter plot of the variables mpg and wt. Discuss whether the
two variables are associated.
(ii) Repeat (i) using the package ggplot2.
(iii) Create a matrix scatter plot for the data mtcars. On which variables
do you think mpg depends?
(iv) Which pair of variables in mtcars has the highest correlation? (Hint:
use the cor command.)
