
Chapter 1: Some basic statistical concepts

Statistical concepts and methods are not only useful but indeed often indispensable in understanding
the world around us. They provide ways of gaining new insights into the behaviour of many phenomena
that you will encounter in your chosen field of specialisation. The discipline of statistics teaches us how
to make intelligent judgments and informed decisions in the presence of uncertainty and variation.
Without uncertainty or variation, there would be little need for statistical methods or statisticians. If the
yield of a crop was the same in every field, if all individuals reacted the same way to a drug, if everyone
gave the same response to an opinion survey, and so on, then a single observation would reveal all
desired information.

1. The language of statistics

We are constantly exposed to collections of facts, or data, both in our professional capacities and in
everyday activities. The discipline of statistics provides methods for organising and summarising data
and for drawing conclusions based on information contained in the data.
An investigation will typically focus on a well-defined collection of objects constituting a population
of interest. In one study, the population might consist of all multivitamin capsules produced by a certain
manufacturer in a particular week. Another investigation might involve the population of all individuals
who received a B.S. in statistics or mathematics during the most recent academic year. When desired
information is available for all objects in the population, we have what is called a census. Constraints
on time, money, and other scarce resources usually make a census impractical or infeasible. Instead, a
subset of the population—a sample—is selected in some prescribed manner. Thus we might obtain a
sample of pills from a particular production run as a basis for investigating whether pills are conforming
to manufacturing specifications, or we might select a sample of last year’s graduates to obtain feedback
about the quality of the curriculum.
We are usually interested only in certain characteristics of the objects in a population: the amount of
vitamin C in the pill, the sex of a student, the age of a vehicle, and so on. A characteristic may be
categorical, such as sex or college major, or it may be quantitative in nature. In the former case, the
value of the characteristic is a category (e.g., female or economics), whereas in the latter case, the value
is a number (e.g., age = 5.1 years or vitamin C content = 65 mg). A variable is any characteristic whose
value may change from one object to another in the population. We shall initially denote variables by
lowercase letters from the end of our alphabet. Examples include

x = brand of computer owned by a student


y = number of items purchased by a customer at a grocery store
z = braking distance of an automobile under specified conditions

Data comes from making observations either on a single variable or simultaneously on two or more
variables. A univariate data set consists of observations on a single variable. For example, we might
consider the type of computer, laptop (L) or desktop (D), for ten recent purchases, resulting in the
categorical data set

DLLLDLLDLL

The following sample of lifetimes (hours) of cell phone batteries under continuous use is a quantitative
univariate data set:

10.6 10.1 11.2 9.0 10.0 8 9.5 8.8 11.5

We have bivariate data when observations are made on each of two variables. Our data set might consist
of a (height, weight) pair for each basketball player on a team, with the first observation as (72, 168),
the second as (75, 212), and so on. If a kinesiologist determines the values of x = recuperation time

from an injury and y = type of injury, the resulting data set is bivariate with one variable quantitative
and the other categorical. Multivariate data arises when observations are made on more than two
variables. For example, a research physician might determine the systolic blood pressure, diastolic
blood pressure, and serum cholesterol level for each patient participating in a study. Each observation
would be a triple of numbers, such as (120, 80, 146). In many multivariate data sets, some variables are
quantitative, and others are categorical. Thus, the annual automobile issue of Consumer Reports gives
values of such variables as type of vehicle (small, sporty, compact, midsize, large), city fuel efficiency
(mpg), highway fuel efficiency (mpg), drive train type (rear wheel, front wheel, four wheel), and so on.

Summarising data
An investigator who has collected data may wish simply to summarise and describe important features
of the data. This entails using methods from descriptive statistics. Some of these methods are graphical
in nature; the constructions of histograms, boxplots, and scatterplots are primary examples. Other
descriptive methods involve calculation of numerical summary measures, such as means, standard
deviations, and correlation coefficients. The wide availability of statistical computer software packages
has made these tasks much easier to carry out than they used to be. Computers are much more efficient
than human beings at calculation and the creation of pictures (once they have received appropriate
instructions from the user!). This means that the investigator doesn’t have to expend much effort on
“grunt work” and will have more time to study the data and extract important messages. In our class,
we will present output from the free statistical software R.

Example 1 Charity is a big business in the USA. The website charitynavigator.com gives information
on roughly 5500 charitable organisations, and there are many smaller charities that fly below the
navigator’s radar. Some charities operate very efficiently, with fundraising and administrative expenses
that are only a small percentage of total expenses, whereas others spend a high percentage of what they
take in on such activities. Here is data on fundraising expenses, as a percentage of total expenditures,
for a random sample of 60 charities:

6.1 12.6 34.7 1.6 18.8 2.2 3.0 2.2 5.6 3.8 2.2 3.1 1.3
1.1 14.1 4.0 21.0 6.1 1.3 20.4 7.5 3.9 10.1 8.1 19.5 5.2
12.0 15.8 10.4 5.2 6.4 10.8 83.1 3.6 6.2 6.3 16.3 12.7 1.3
0.8 8.8 5.1 3.7 26.3 6.0 48.0 8.2 11.7 7.2 3.9 15.3 16.6
8.8 12.0 4.7 14.7 6.4 17.0 2.5 16.2

Without any organisation, it is difficult to get a sense of the data’s most prominent features: what a
typical (i.e., representative) value might be, whether values are highly concentrated about a typical value
or quite dispersed, whether there are any gaps in the data, what fraction of the values are less than 20%,
and so on. Figure 1 shows a histogram. After introducing you to R, we will discuss
construction and interpretation of this graph. For the moment, I hope you see how it describes the way
the percentages are distributed over the range of possible values from 0 to 100. Of the 60 charities, 36
use less than 10% on fundraising, and 18 use between 10% and 20%. Thus 54 out of the 60 charities in
the sample, or 90%, spend less than 20% of money collected on fundraising. How much is too much?
There is a delicate balance: most charities must spend money to raise money, but then money spent on
fundraising is not available to help beneficiaries of the charity. Perhaps each individual giver should
draw his or her own line in the sand.

Having obtained a sample from a population, an investigator would frequently like to use sample
information to draw some type of conclusion (make an inference of some sort) about the population.
That is, the sample is typically a means to an end rather than an end in itself. Techniques for generalising
from a sample to a population in a precise and objective way are gathered within the branch of our
discipline called inferential statistics.

Figure 1 A histogram for the charity fundraising data of Example 1

Example 2 The authors of the article “Fire Safety of Glued-Laminated Timber Beams in Bending” (J.
of Structural Engr. 2017) conducted an experiment to test the fire resistance properties of wood pieces
connected at corners by sawtooth-shaped “fingers” along with various types of commercial adhesive.
The beams were all exposed to the same fire and load conditions. The accompanying data on fire
resistance time (min) for a sample of timber beams bonded with polyurethane adhesive appeared in the
article:

47.0 53.0 52.5 52.0 47.5 56.5 45.0 43.5 48.0 48.0 41.0 34.0 36.5
49.0 47.5 34.0 34.0 36.0 42.0

Suppose we want an estimate of the true average fire resistance time under these conditions.
(Conceptualising a population of all such beams with polyurethane bonding under these experimental
conditions, we are trying to estimate the population mean.) It can be shown that, with a high degree of
confidence, the population mean fire resistance time is between 41.2 and 48.0 min; this is called a
confidence interval or an interval estimate. On the other hand, this data can also be used to predict the
fire resistance time of a single timber beam under these conditions. With a high degree of certainty, the
fire resistance time of a single such beam will exceed 29.4 min; the number 29.4 is called a lower
prediction bound.

Probability versus statistics
In probability, properties of the population under study are assumed known (e.g., in a numerical
population, some specified distribution of the population values may be assumed), and questions
regarding a sample taken from the population are posed and answered. In statistics, characteristics of a
sample are available to the experimenter, and this information enables the experimenter to draw
conclusions about the population. The relationship between the two disciplines can be summarised by
saying that probability reasons from the population to the sample (deductive reasoning), whereas
inferential statistics reasons from the sample to the population (inductive reasoning). This is illustrated
in Figure 2.

Figure 2 The relationship between probability and inferential statistics

Before we can understand what a particular sample can tell us about the population, we should first
understand the uncertainty associated with taking a sample from a given population. This is why we
study probability before statistics. As an example of the contrasting focus of probability and inferential
statistics, consider drivers’ use of seatbelts in automobiles. According to the article “Somehow, Way
Too Many Americans Still Aren’t Wearing Seatbelts” (www.wired.com, Sept. 2016), data collected by
observers from the National Highway Traffic Safety Administration indicates that 88.5% of drivers and
front seat passengers buckle up. But this percentage varies considerably by location. In the 34 states in
which a driver can be pulled over and cited for nonusage, 91.2% wore their seatbelts in 2015. By
contrast, in the 15 states where a citation can be given only if a driver is pulled over for another
infraction and the one state where usage is not mandatory (New Hampshire), usage drops to 78.6%. In
a probability context, we might assume that 85% of all drivers in a particular metropolitan area regularly
use seatbelts (an assumption about the population) and then ask, “How likely is it that a sample of 100
drivers will include at most 70 who regularly use their seatbelt?” or “How many drivers in a sample of
size 100 can we expect to regularly use their seatbelt?” On the other hand, in inferential statistics, sample
information is available, e.g., a sample of 100 drivers from this area reveals that 80 regularly use their
seatbelts. We might then ask, “Does this provide strong evidence for concluding that less than 90% of
all drivers in this area are regular seatbelt users?” In this latter scenario, sample information will be
employed to answer a question about the structure of the entire population from which the sample was
selected.
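As a preview of how such a probability question could be answered once R is introduced in Chapter 2, here is a minimal sketch. It assumes the 85% figure and that drivers can be treated as independent, so that the number of belt users in the sample follows a binomial distribution:
> pbinom(70, size = 100, prob = 0.85)  # P(at most 70 of the 100 drivers buckle up)
> 100 * 0.85                           # expected number of belt users in the sample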
Next, consider a study involving a sample of 25 patients to investigate the efficacy of a new minimally
invasive method for rotator cuff surgery. The amount of time that each individual subsequently spends
in physical therapy is then determined. The resulting sample of 25 PT times is from a population that
does not actually exist. Instead, it is convenient to think of the population as consisting of all possible
times that might be observed under similar experimental conditions. Such a population is referred to as
a conceptual or hypothetical population. There are a number of situations in which we fit questions into
the framework of inferential statistics by conceptualising a population.

2. Graphical methods in descriptive statistics

There are two general types of methods within descriptive statistics: graphical and numerical
summaries. In this section we will discuss the first of these types—representing a data set using visual
techniques. In the next section, we will develop some numerical summary measures for data sets. Many
visual techniques may already be familiar to you: frequency tables, histograms, pie charts, bar graphs,
scatterplots, and the like. Here we focus on a selected few of these techniques that are most useful and
relevant to probability and inferential statistics.

Notation
Some general notations will make subsequent discussions easier. The number of observations in a single
sample, that is, the sample size, will often be denoted by n. So, n = 4 for the sample of universities
{Stanford, Iowa State, Wyoming, Rochester} and also for the sample of pH measurements {6.3, 6.2,
5.9, 6.5}. If two samples are simultaneously under consideration, either m and n or n1 and n2 can be
used to denote the numbers of observations. Thus if {3.75, 2.60, 3.20, 3.79} and {2.75, 1.20, 2.45} are
GPAs for two sets of friends, respectively, then m = 4 and n = 3.
Given a data set consisting of n observations on some variable x, the individual observations will be
denoted by x1, x2, x3, …, xn. The subscript bears no relation to the magnitude of a particular observation.
Thus x1 will not in general be the smallest observation in the set, nor will xn typically be the largest. In
many applications, x1 will be the first observation gathered by the experimenter, x2 the second, and so
on. The ith observation in the data set will be denoted by xi.

Stem-and-leaf plot
Consider a numerical data set x1, x2, x3, …, xn for which each xi consists of at least two digits. A quick
way to obtain an informative visual representation of the data set is to construct a stem-and-leaf display,
or stem plot.

Steps for constructing a stem-and-leaf display


1. Select one or more leading digits for the stem values. The trailing digits become the leaves.
2. List possible stem values in a vertical column.
3. Record the leaf for every observation beside the corresponding stem value.
4. Order the leaves from smallest to largest on each line.
5. Indicate the units for stems and leaves someplace in the display.

If the data set consists of exam scores, each between 0 and 100, the score of 83 would have a stem of 8
and a leaf of 3. For a data set of automobile fuel efficiencies (mpg), all between 8.1 and 47.8, we could
use the tens digit as the stem, so 32.6 would then have a stem of 3 and a leaf of 2.6. Usually, a display based on between
5 and 20 stems is appropriate. For a simple example, assume a sample of seven test scores: 93, 84, 86,
78, 95, 81, 72. Then the first-pass stem plot would be
7|82
8|461
9|35
With the leaves ordered this becomes
7|28 Stem: tens digit
8|146 Leaf: ones digit
9|35
Occasionally stems will be repeated to spread out the stem-and-leaf display. For instance, if the
preceding test scores included dozens of values in the 70s, we could repeat the stem 7 twice, using 7L
for scores in the low 70s (leaves 0, 1, 2, 3, 4) and 7H for scores in the high 70s (leaves 5, 6, 7, 8, 9).

Histograms
While stem-and-leaf displays are useful for smaller data sets, histograms are well-suited to larger
samples or the results of a census.
Consider first data resulting from observations on a “counting variable” x, such as the number of traffic
citations a person received during the last year, or the number of people arriving for service during a
particular period. The frequency of any particular x value is simply the number of times that value
occurs in the data set. The relative frequency of a value is the fraction or proportion of times the value
occurs.
Suppose, for example, that our data set consists of 200 observations on x = the number of major defects
in a new car of a certain type. If 70 of these x values are 1, then the frequency of the value 1 is
(obviously) 70, while the relative frequency of the value 1 is 70/200 = .35. Multiplying a relative
frequency by 100 gives a percentage; in the defect example, 35% of the cars in the sample had just one
major defect. The relative frequencies, or percentages, are usually of more interest than the frequencies
themselves. In theory, the relative frequencies should sum to 1, but in practice the sum may differ
slightly from 1 because of rounding. A frequency distribution is a tabulation of the frequencies and/or
relative frequencies.
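In R (introduced in Chapter 2), frequencies and relative frequencies of a counting variable can be tabulated directly. The sketch below uses a small made-up vector of defect counts, not the 200-car sample described above:
> defects = c(0, 1, 1, 2, 0, 1, 3, 0, 1, 2)   # hypothetical counts for 10 cars
> table(defects)                              # frequency of each value
> table(defects)/length(defects)              # relative frequencies (sum to 1)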

Example 3 How unusual is a no-hitter or a one-hitter in a major league baseball game, and how
frequently does a team get more than 10, 15, or 20 hits? Table 1 is a frequency distribution for the
number of hits per team per game for all games in the 2016 regular season, courtesy of the website
www.retrosheet.org.

Table 1 Frequency distribution for hits per team in 2016 MLB games

The corresponding histogram in Figure 3 rises rather smoothly to a single peak and
then declines. The histogram extends a bit more to the right (toward large values) than it does on the
left—a slight “positive skew.”

Constructing a histogram for measurement data (e.g., weights of individuals, reaction times to a
particular stimulus) requires subdividing the measurement axis into a suitable number of class intervals
or classes, such that each observation is contained in exactly one class. Suppose, for example, that we
have 50 observations on x = fuel efficiency of an automobile (mpg), the smallest being 27.8 and the
largest 31.4. Then we could use the class boundaries 27.5, 28.0, 28.5, …, and 31.5.

Figure 3 Relative frequency histogram of the number of hits per team per game for the 2016 MLB season

When all class widths are equal, a histogram is constructed as follows: first, mark the class boundaries
on a horizontal axis like the one in the last example. Then, above each interval, draw a rectangle whose
height is the corresponding relative frequency (or frequency). One potential difficulty is that
occasionally an observation falls on a class boundary and therefore does not lie in exactly one interval,
for example, 29.0. R uses the convention that any observation falling on a class boundary will be
included in the class to the left of the observation. Thus 29.0 would go in the 28.5–29.0 class rather than
the 29.0–29.5 class. In contrast, for another software Minitab, it is done the other way, with 29.0 going
into the 29.0–29.5 class. Note that the R default can easily be changed.
Example 4 Power companies need information about customer usage to obtain accurate forecasts of
demands. Investigators from Wisconsin Power and Light determined energy consumption (BTUs)
during a particular period for a sample of 90 gas-heated homes. For each home, an adjusted consumption
value was calculated to account for weather and house size. This resulted in the accompanying data,
which we have ordered from smallest to largest.

2.97 4.00 5.20 5.56 5.94 5.98 6.35 6.62 6.72 6.78
6.80 6.85 6.94 7.15 7.16 7.23 7.29 7.62 7.62 7.69
7.73 7.87 7.93 8.00 8.26 8.29 8.37 8.47 8.54 8.58
8.61 8.67 8.69 8.81 9.07 9.27 9.37 9.43 9.52 9.58
9.60 9.76 9.82 9.83 9.83 9.84 9.96 10.04 10.21 10.28
10.28 10.30 10.35 10.36 10.40 10.49 10.50 10.64 10.95 11.09
11.12 11.21 11.29 11.43 11.62 11.70 11.70 12.16 12.19 12.28
12.31 12.62 12.69 12.71 12.91 12.92 13.11 13.38 13.42 13.43
13.47 13.60 13.96 14.24 14.35 15.12 15.24 16.06 16.90 18.26

We let R select the class intervals. The most striking feature of the histogram in Figure 4 is its
resemblance to a bell-shaped (and therefore symmetric) curve, with the point of symmetry roughly at
10.

Figure 4 Histogram of the energy consumption data from Example 4

Class [2,4] (4,6] (6,8] (8,10] (10,12] (12,14] (14,16] (16,18] (18,20]
Frequency 2 4 18 23 20 16 4 2 1
Relative Frequency 0.022 0.044 0.2 0.256 0.222 0.178 0.044 0.022 0.011

Table 2 Frequency and relative frequency distribution of the energy consumption data

From the histogram, proportion of observations less than 10 ≈ 0.022 + 0.044 + 0.2 + 0.256 = 0.522.

There are no hard-and-fast rules concerning either the number of classes or the choice of classes
themselves. Between 5 and 20 classes will be satisfactory for most data sets. Generally, the larger the
number of observations in a data set, the more classes should be used. Equal-width classes may not be
a sensible choice if a data set “stretches out” to one side or the other (see, for example, Figure 1). A
sound choice is to use a few wider intervals near extreme observations and narrower intervals in the
region of high concentration. In such situations a density histogram must be used.
For any class to be used in a histogram, the density of the data in that class is defined as the relative
frequency of the class divided by the class width. A histogram can then be constructed in which the
height of the rectangle over each class is its density. The vertical scale on such a histogram is called a
density scale. When the class widths are unequal, not using a density scale will give a picture with
distorted areas. For equal class widths, the divisor is the same in each density calculation, and the extra
arithmetic simply results in a rescaling of the vertical axis (i.e., the histogram using relative frequency
and the one using density will have exactly the same appearance).
A density histogram does have one interesting property. The area of each rectangle is the relative
frequency of the corresponding class. Furthermore, because the sum of relative frequencies must be 1.0
(except for roundoff), the total area of all rectangles in a density histogram is 1.
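As a small illustration, the density calculation can be done by hand in R and compared with the histogram that hist() draws when freq = FALSE; the data and class boundaries below are made up:
> x = c(1, 2, 2, 3, 4, 5, 7, 9, 14, 22)     # hypothetical observations
> brks = c(0, 5, 10, 25)                    # unequal class widths
> relfreq = table(cut(x, brks))/length(x)   # relative frequency of each class
> dens = relfreq/diff(brks)                 # density = relative frequency / width
> sum(dens * diff(brks))                    # total rectangle area is 1
> hist(x, breaks = brks, freq = FALSE)      # density-scale histogram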

Histogram shapes
Histograms come in a variety of shapes. A unimodal histogram is one that rises to a single peak and
then declines. A bimodal histogram has two different peaks. Bimodality can occur when the data set

consists of observations on two quite different kinds of individuals or objects. For example, the
histogram of a data set consisting of driving times between San Luis Obispo and Monterey in California
would show two peaks, one for those cars that took the inland route (roughly 2.5 h) and another for
those cars traveling up the coast (3.5–4 h). A histogram with more than two peaks is said to be
multimodal. A histogram is symmetric if the left half is a mirror image of the right half. A unimodal
histogram is positively skewed if the right or upper tail is stretched out compared with the left or lower
tail and negatively skewed if the stretching is to the left. Figure 5 shows “smoothed” histograms,
obtained by superimposing a smooth curve on the rectangles, that illustrate various possibilities.

Figure 5 Smoothed histograms: (a) symmetric unimodal, (b) bimodal, (c) positively skewed, and (d)
negatively skewed

Categorical Data
Both a frequency distribution and a pie chart or bar graph can be constructed when a data set is
categorical in nature; generally speaking, statisticians prefer bar graphs over pie charts in most
circumstances. Sometimes there will be a natural ordering of categories (freshman, sophomore, junior,
senior, graduate student); for such ordinal data the categories should be presented in their natural order.
In other cases, the order will be arbitrary (e.g., Catholic, Jewish, Protestant, and so on); while we have
the choice of displaying nominal data in any order, it is common to sort the categories in decreasing
order of their (relative) frequencies. Either way, the rectangles for the bar graph should have equal
width.
Example 5 Each member of a sample of 120 individuals owning motorcycles was asked for the name
of the manufacturer of his or her bike. The frequency distribution for the resulting data is given in Table
3 and the bar chart is shown in Figure 6.

Table 3 Frequency distribution for motorcycle data

Figure 6 Bar chart representing the motorcycle data
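Once R has been introduced in Chapter 2, a bar chart such as Figure 6 can be produced with the barplot function. Since Table 3 is not reproduced here, the brand frequencies below are made up purely for illustration:
> counts = c(Honda = 41, Yamaha = 27, Kawasaki = 20, "Harley-Davidson" = 18,
+            BMW = 11, Other = 3)              # hypothetical frequencies
> counts = sort(counts, decreasing = TRUE)     # nominal data: order by frequency
> barplot(counts, ylab = "Frequency")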

Measures of centre
Visual summaries of data are excellent tools for obtaining preliminary impressions and insights. More
formal data analysis often requires the calculation and interpretation of numerical summary measures
— numbers that might serve to characterise the data set and convey some of its most important features.
Our primary concern will be with quantitative data. Suppose that our data set is of the form x1, x2, x3,
…, xn where each xi is a number. What features of such a set of numbers are of most interest and deserve
emphasis? One important characteristic of a set of numbers is its “centre”: a single value that we might
consider typical or representative of the entire data set. This section presents methods for describing the
centre of a data set; in the next section we will turn to methods for measuring variability in a set of numbers.
The Mean
For a given set of numbers x1, x2, x3, …, xn the most familiar and useful measure of the centre is the
mean, or arithmetic average, of the set. Because we will almost always think of the xi’s as constituting
a sample, we will often refer to the arithmetic average as the sample mean and denote it by x̅.
The sample mean of observations x1, x2, x3, …, xn is x̅ = (x1 + x2 + ⋯ + xn)/n, i.e., the sum of the observations divided by the sample size n.

Example 6 Students in a class were assigned to make wingspan measurements at home. The wingspan
is the horizontal measurement from fingertip to fingertip with outstretched arms. Here are the
measurements (inches) given by 21 of the students:

60 64 72 63 66 62 75 66 59 75 69 62 63
61 65 67 65 69 95 60 70

With n = 21 and x1 + x2 + ⋯ + x21 = 1408, the sample mean is 1408/21 ≈ 67.0 inches.
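This calculation is easy to reproduce in R (introduced in Chapter 2):
> wingspan = c(60, 64, 72, 63, 66, 62, 75, 66, 59, 75, 69, 62, 63,
+              61, 65, 67, 65, 69, 95, 60, 70)
> sum(wingspan)    # 1408
> mean(wingspan)   # approximately 67.0 inches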

Just as x̅ represents the average value of the observations in a sample, the average of all values in a
population can, in principle, be calculated. This average is called the population mean and will be
denoted by the Greek letter μ. When there are N values in the population (a finite population), then μ =
(sum of the N population values)/N. In our probability class, we will give a more general definition for
μ that applies to both finite and (conceptually) infinite populations. In the chapters on statistical
inference, we will present methods based on the sample mean for drawing conclusions about a
population mean. For example, we might use the sample mean x̅ = 67.0 computed in Example 6 as a
point estimate (a single number that is our “best” guess) of μ = the true average wingspan for all students
in introductory statistics classes.
The mean suffers from one deficiency that makes it an inappropriate measure of centre in some
circumstances: its value can be greatly affected by the presence of even a single outlier (i.e., an
unusually large or small observation). In Example 6, the value 95 is obviously an outlier. Without this
observation, x̅ = 1313/20 = 65.7 inches; so the outlier increases the mean by about 1.3 inches. The value 95 is
clearly an error — there is no way a student could have a wingspan of almost 8 ft. As Leonardo da Vinci
noticed, wingspan is usually quite close to height. (Note, though, that outliers are often not the result of
recording errors!)
We will next consider an alternative to the mean, namely the median, that is insensitive to outliers.
However, the mean is still by far the most widely used measure of centre, largely because there are
many populations for which outliers are very scarce. When sampling from such a population (a
“normal” or bell-shaped distribution being the most important example), outliers are highly unlikely to
enter the sample. The sample mean will then tend to be stable and quite representative of the sample.

The Median
The word median is synonymous with “middle,” and the sample median is indeed the middle value
when the observations are ordered from smallest to largest. We will use the symbol 𝑥̃ to represent the
sample median.

The sample median is obtained by first ordering the n observations from smallest to largest (with any
repeated values included so that every sample observation appears in the ordered list). Then,
x̃ = the ((n + 1)/2)th ordered observation if n is odd,
x̃ = the average of the (n/2)th and (n/2 + 1)th ordered observations if n is even.

Example 7 People not familiar with classical music tend to believe that a composer’s instructions for
playing a particular piece are so specific that the duration would not depend at all on the performer(s).
However, there is typically plenty of room for interpretation, and orchestral conductors and musicians
take full advantage of this. A sample of 12 recordings of Beethoven’s Symphony No. 9 (the “Choral,”
a stunningly beautiful work) from the website www.ArkivMusic.com, yielded the following durations
(min) listed in increasing order:

62.3 62.8 63.6 65.2 65.7 66.4 67.4 68.4 68.8 70.8 75.7 79.0

Since n = 12 is even, the sample median is the average of the n/2 = 6th and (n/2 + 1) = 7th values from
the ordered list:

x̃ = (66.4 + 67.4)/2 = 66.9.

Note that half of the durations in the sample are less than 66.9 min, and half are greater than that. The
sample mean is 816.1/12 = 68.01 min, a bit more than a full minute larger than the median. The mean
is pulled out a bit relative to the median because the sample “stretches out” somewhat more on the upper
end than on the lower end.
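In R, the same two summaries can be obtained with the median and mean functions:
> durations = c(62.3, 62.8, 63.6, 65.2, 65.7, 66.4, 67.4, 68.4, 68.8, 70.8, 75.7, 79.0)
> median(durations)   # 66.9 min
> mean(durations)     # about 68.01 min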

The data in Example 7 illustrates an important property of x̃ in contrast to x̅. The sample median is very
insensitive to a number of extremely small or extremely large data values. If, for example, we increased
the two largest xi’s from 75.7 and 79.0 to 95.7 and 99.0, respectively, x̃ would be unaffected. Thus, in
the treatment of outlying data values, x̅ and x̃ are at opposite ends of a spectrum: x̅ is sensitive to even
one such value, whereas x̃ is insensitive to a large number of outlying values. Although x̅ and x̃ both
provide a measure for the centre of a data set, they will not in general be equal because they focus on
different aspects of the sample.

Figure 7 Three different shapes for a population distribution

Analogous to x̃ as the middle value in the sample is a middle value in the population, the population
median, denoted by μ̃. As with x̅ and μ, we can think of using the sample median x̃ to make an inference
about μ̃. In Example 7, we might use x̃ = 66.9 min as an estimate of the median duration in the entire
population from which the sample was selected. Or, if the median salary for a sample of statisticians
was x̃ = $96,416, we might use this as a basis for concluding that the median salary μ̃ for all statisticians
exceeds $90,000.

The population mean μ and median μ̃ will not generally be identical. If the population distribution is
positively or negatively skewed, as shown in Figure 7, then μ ≠ μ̃. When this is the case,
in making inferences we must first decide which of the two population characteristics is of greater
interest and then proceed accordingly. As an example, according to the report “How America Saves
2019” issued by the Vanguard Funds investment company, the mean retirement fund balance among
workers 65 and older is $192,877, whereas the median balance is just $58,035. Clearly a small minority
of such people has extremely large retirement fund balances, inflating the mean relative to the median;
the latter is arguably a better representation of a “typical” retirement fund balance.
3. Measures of variability
Reporting a measure of centre gives only partial information about a data set or distribution. Different
samples or populations may have identical measures of centre yet differ from one another in other
important ways. For example, consider the following three samples:

Sample 1: 4 5 5 5 6

Sample 2: 4 4 5 6 6

Sample 3: 1 1 5 9 9

The three samples have the same mean and median, yet the extent of spread about the centre is different
for all three samples. The first sample has the least variability, the second has more variability than the
first, and the third has the highest amount.
Measures of variability for sample data
The simplest measure of variability in a sample is the range, which is the difference between the largest
and smallest sample values. Notice that the value of the range for Sample 3 above is much larger than
it is for Sample 1, reflecting less variability in the first sample than in the third. A defect of the range,
though, is that it depends on only the two most extreme observations and disregards the positions of the
remaining n - 2 values. Samples 1 and 2 above have identical ranges, yet when we take into account the
observations between the two extremes, there is less variability or dispersion in the first sample than in
the second.

Our primary measures of variability will involve the n deviations from the mean, obtained by
subtracting x̅ from each sample observation. A deviation will be positive if the observation is larger
than the mean (to the right of the mean on the measurement axis) and negative if the observation is
smaller than the mean. If all the deviations are small in magnitude, then all x i’s are close to the mean
and there is little variability. On the other hand, if some of the deviations are large in magnitude, then
some xi’s lie far from x̅, suggesting a greater amount of variability.

A simple way to combine the deviations into a single quantity is to average them (sum them and divide
by n). Unfortunately, this does not yield a useful measure, because the positive and negative deviations
counteract one another:

Sum of deviations = (x1 − x̅) + (x2 − x̅) + ⋯ + (xn − x̅) = 0.

Thus the average deviation is always zero! Another possibility is to base a measure on the absolute
values of the deviations, in particular the mean absolute deviation, (|x1 − x̅| + ⋯ + |xn − x̅|)/n. But because the
absolute value operation leads to some calculus-related difficulties, a more popular option is to work
with the squared deviations. Rather than use the average squared deviation, [(x1 − x̅)² + ⋯ + (xn − x̅)²]/n, for some
technical reasons to be made clear later in the class, the sum of squared deviations is divided by n − 1
rather than n:

The sample standard deviation, denoted by s, is given by
s = √( [(x1 − x̅)² + (x2 − x̅)² + ⋯ + (xn − x̅)²] / (n − 1) ).
The quantity s² is known as the sample variance.

The unit for s is the same as the unit for each of the xi’s. If, for example, the observations are fuel
efficiencies in miles per gallon, then we might have s = 2.0 mpg. A rough interpretation of the sample
standard deviation is that it represents the size of a typical deviation from the sample mean within the
given sample. Thus if s = 2.0 mpg, then some xi’s in the sample are closer than 2.0 to x̅, whereas others
are farther away; 2.0 is a representative (or “standard”) deviation from the mean fuel efficiency. If s =
3.0 for a second sample of cars of another type, a typical deviation in this sample is roughly 1.5 times
what it is in the first sample, an indication of more variability in the second sample.

Example 8 The website www.fueleconomy.gov contains a wealth of information about fuel


characteristics of various vehicles. In addition to EPA (Environmental Protection Agency) mileage
ratings, there are many vehicles for which users have reported their own values of fuel efficiency (mpg).
Consider the following sample with n = 10 efficiencies for the 2015 Toyota Camry (for this model, the
EPA reports an overall rating of 25 mpg in city driving and 34 mpg in highway driving):

31 27.8 38.3 27 23.4 30 30.1 21.5 25.4 34.5

The sample mean comes out to be 28.9 mpg, and s = 5.04 mpg. Thus the size of a typical difference
between a driver’s fuel efficiency and the mean of 28.9 in this sample is roughly 5.04 mpg.
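These values can be reproduced in R with the mean and sd functions (var gives the sample variance s²):
> mpg = c(31, 27.8, 38.3, 27, 23.4, 30, 30.1, 21.5, 25.4, 34.5)
> mean(mpg)   # 28.9 mpg
> sd(mpg)     # about 5.04 mpg
> var(mpg)    # about 25.4, the sample variance s^2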
To explain heuristically why n – 1 rather than n is used to compute s, note first that whereas s measures
variability in a sample, there is a measure of population variability called the population standard
deviation. We will use σ (lowercase Greek letter sigma) to denote the population standard deviation
and σ² to denote the population variance. When the population is finite and consists of N values,
σ² = [(x1 − μ)² + (x2 − μ)² + ⋯ + (xN − μ)²]/N, which is the average of all squared deviations from the population mean (for the
population, the divisor is N and not N − 1).

Just as x̅ will be used to make inferences about the population mean μ, we should define the sample
standard deviation s so that it can be used to make inferences about σ. Note that σ involves squared
deviations about the population mean μ. If we actually knew the value of μ, then we could define the
sample standard deviation as the square root of the average squared deviation of the sample xi’s about μ. However, the
value of μ is almost never known, so the sum of squared deviations about x̅ must be used in the definition
of s. But the xi’s tend to be closer to their own average x̅ than to the population average μ. Using the
divisor n - 1 rather than n compensates for this tendency. A more formal explanation for this choice will
be given later in class.

Quartiles and the Interquartile Range

In the previous section, we discussed the sensitivity of the sample mean x̅ to outliers. Since the standard deviation
is based on deviations from the mean, s is also heavily influenced by outliers. (In fact, the effect of
outliers on s can be especially severe, since each deviation is squared during computation.) It is therefore
desirable to create a measure of variability that is “resistant” to the presence of a few outliers, analogous
to the median.

Order the n observations from smallest to largest, and separate the lower half from the upper half; the
median is included in both halves if n is odd. The lower quartile (or first quartile), q1, is the median of
the lower half of the data, and the upper quartile (or third quartile), q3, is the median of the upper half.
A measure of spread that is resistant to outliers is the interquartile range (IQR), given by IQR = q3 – q1.

The term quartile comes from the fact that the lower quartile divides the smallest quarter of observations
from the remainder of the data set, while the upper quartile separates the top quarter of values from the

rest. The interquartile range is unaffected by observations in the smallest 25% or the largest 25% of the
data—hence, it is robust against (resistant to) outliers. Roughly speaking, we can interpret the IQR as
the range of the “middle 50%” of the observations.

Different software packages calculate the quartiles (and, thus, the IQR) somewhat differently, for
example using different interpolation methods between x values. In particular, R offers nine different
definitions! The quartiles defined above are often referred to as the lower and upper hinges.

For smaller data sets, the difference can be noticeable; this is typically less of an issue for larger data
sets.

Example 8 (continued) Consider the ordered fuel efficiency data:

21.5 23.4 25.4 27 27.8 | 30 30.1 31 34.5 38.3

The vertical line separates the two halves of the data; the median efficiency is (27.8 + 30.0)/2 = 28.9
mpg, coincidentally exactly the same as the mean. The quartiles are the middle values of the two halves;
from the displayed data, we see that q1 = 25.4, q3 = 31.0, so IQR = 5.6 mpg.

The default option in R will report q1 and q3 respectively as 25.8 and 30.775, while Minitab reports 24.9
and 31.875.

Imagine that the lowest value had been 10.5 instead of 21.5 (indicating something very wrong with that
particular Camry!). Then the sample standard deviation would explode from 5.04 mpg (see Example 8)
to 7.46 mpg, a nearly 50% increase. Meanwhile, the quartiles and the IQR would not change at all;
those quantities would be unaffected by this low outlier.
The quartiles and interquartile range lead to a popular statistical convention for defining outliers (i.e.,
unusual observations) first proposed by renowned statistician John Tukey:
Any observation farther than 1.5IQR from the closest quartile is an outlier. An outlier is extreme if it is
more than 3IQR from the nearest quartile, and it is mild otherwise.
That is, outliers are defined to be all x values in the sample that satisfy either x < q1 − 1.5IQR or
x > q3 + 1.5IQR.
The idea of a quantile generalises the concepts of the median and the quartiles. The pth quantile (also known as
the 100pth percentile) is the point in the data below which 100p% of the observations lie and above which
100(1 − p)% lie. If there are n data points, then the pth quantile occurs at position 1 + (n − 1)p, with weighted
averaging between adjacent ordered values if this position is not an integer.
Under this definition, the 0.25, 0.5 and 0.75 quantiles are respectively the lower quartile, the median and the
upper quartile. However, this definition differs slightly from the definition of the quartiles (hinges) given earlier.
For example, the 0.25 quantile of the numbers 10, 17, 18, 25, 28, 28 occurs at position 1 + (6 − 1)(0.25) = 2.25,
that is, one quarter of the way between the second and third ordered values, which here gives 17.25.
Notice that our previous definition yields a first quartile (lower hinge) of 17, slightly different from the first
quartile under this definition.
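The small example above is easy to check in R, whose quantile function uses the position rule 1 + (n − 1)p by default, while fivenum returns the hinges:
> x = c(10, 17, 18, 25, 28, 28)
> quantile(x, 0.25)   # 17.25, from position 1 + (6 - 1)(0.25) = 2.25
> fivenum(x)[2]       # 17, the lower hinge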

Boxplot
In Section 2, some graphical displays (stem-and-leaf, histogram) were introduced as tools for
visualising quantitative data. We now introduce one more graph, the boxplot, which relies on the
quartiles, IQR, and aforementioned outlier rule. A boxplot shows several of a data set’s most prominent
features, including centre, spread, the extent and nature of any departure from symmetry, and outliers.

Steps for Constructing a Boxplot
1. Draw a measurement scale (horizontal or vertical).
2. Draw a rectangle adjacent to this axis beginning at q1 and ending at q3 (so rectangle length = IQR).
3. Place a line segment at the location of the median. (The position of the median symbol relative to the
two edges conveys information about the skewness of the middle 50% of the data.)
4. Determine which data values, if any, are outliers. Mark each outlier individually. (We may use
different symbols for mild and extreme outliers; most statistical software packages do not make a
distinction.)
5. Finally, draw “whiskers” out from either end of the rectangle to the smallest and largest observations
that are not outliers.

Example 9 The Clean Water Act and subsequent amendments require that all waters in the USA meet
specific pollution reduction goals to ensure that water is “fishable and swimmable.” The article
“Spurious Correlation in the USEPA Rating Curve Method for Estimating Pollutant Loads” (J. Environ.
Engr. 2008: 610–618) investigated various techniques for estimating pollutant loads in watersheds; the
authors discuss “the imperative need to use sound statistical methods” for this purpose. Among the data
considered is the following sample of (n = 57) total nitrogen loads (TN, in kg of nitrogen/day) from a
particular Chesapeake Bay location, displayed here in increasing order.

9.69 13.16 17.09 18.12 23.70 24.07 24.29 26.43 30.75 31.54 35.07 36.99 40.32
42.51 45.64 48.22 49.98 50.06 55.02 57.00 58.41 61.31 64.25 65.24 66.14 67.68
81.40 90.80 92.17 92.42 100.82 101.94 103.61 106.28 106.80 108.69 114.61 120.86
124.54 143.27 143.75 149.64 167.79 182.50 192.55 193.53 271.57 292.61 312.45 352.09
371.47 444.68 460.86 563.92 690.11 826.54 1529.35

Our definitions give: x̃ = 92.17 q1 = 45.64 q3 = 167.79 IQR = 122.15


1.5IQR = 183.225 3IQR = 366.45

Note that the default R output matches with the above in this case.

Now, subtracting 1.5IQR from the lower quartile gives a negative number, and none of the observations
are negative, so there are no outliers on the lower end of the data. However, q3 + 1.5IQR = 351.015 and
q3 + 3IQR = 534.24, hence we have four extreme outliers and a further four mild outliers.

Figure 8 A boxplot of the nitrogen load data showing the outliers

Figure 8 gives the corresponding boxplot (produced by R). While I have given a horizontal
boxplot here, the default is a vertical boxplot. (We will revisit that later.) The whiskers in the boxplot
extend out to the smallest observation 9.69 on the low end and 312.45, the largest observation that is
not an outlier, on the upper end. There is some positive skewness in the middle half of the data (the
right edge of the box is somewhat further from the median line than is the left edge) and a great deal of
positive skewness overall. Observe that the 8 outliers are marked as circles, but the mild and extreme
outliers have not been distinguished.
Among other advantages, placing individual boxplots side by side can reveal similarities and differences
between two or more data sets consisting of observations on the same variable.
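As a sketch of how a plot like Figure 8 can be produced in R, the commands below enter the nitrogen loads and call boxplot, which uses the 1.5IQR whisker rule by default (mild and extreme outliers are not distinguished):
> tn = c(9.69, 13.16, 17.09, 18.12, 23.70, 24.07, 24.29, 26.43, 30.75, 31.54,
+        35.07, 36.99, 40.32, 42.51, 45.64, 48.22, 49.98, 50.06, 55.02, 57.00,
+        58.41, 61.31, 64.25, 65.24, 66.14, 67.68, 81.40, 90.80, 92.17, 92.42,
+        100.82, 101.94, 103.61, 106.28, 106.80, 108.69, 114.61, 120.86, 124.54,
+        143.27, 143.75, 149.64, 167.79, 182.50, 192.55, 193.53, 271.57, 292.61,
+        312.45, 352.09, 371.47, 444.68, 460.86, 563.92, 690.11, 826.54, 1529.35)
> fivenum(tn)                      # min, lower hinge, median, upper hinge, max
> boxplot(tn, horizontal = TRUE)   # outliers are drawn as individual circles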

Chapter 2: R preliminaries
The first step is to install R and RStudio, in that order. The RStudio IDE is a set of integrated tools
designed to help you be more productive with R. It includes a console, syntax-highlighting editor that
supports direct code execution, and a variety of robust tools for plotting, viewing history, debugging
and managing your workspace.

1. Install R
For Windows:
a) Download the latest version of R, for Windows, from CRAN at: https://cran.r-
project.org/bin/windows/base/

b) Double-click on the file you just downloaded to install R


c) Click ok –> Next –> Next –> Next …. (no need to change default installation parameters)
For Mac:
a) Download the latest version of R, for macOS, from CRAN at: https://cran.r-
project.org/bin/macosx/
b) Double-click on the file you just downloaded to install R
c) Click ok –> Next –> Next –> Next …. (no need to change default installation parameters)

2. Install RStudio
Download RStudio at: https://www.rstudio.com/products/rstudio/download/#download

3. Use R inside RStudio


After installing R and RStudio, launch RStudio from your computer’s applications folder. RStudio is a
four-pane workspace for 1) creating files containing R scripts, 2) typing R commands, 3)
viewing command histories, and 4) viewing plots and more.
I. Top-left panel: Code editor, allowing you to create and open files containing R scripts. An R script
is where you keep a record of your work. An R script can be created as follows: File –> New –> R Script.
II. Bottom-left panel: R console for typing R commands.
III. Top-right panel:
Workspace tab: shows the list of R objects you created during your R session
History tab: shows the history of all previous commands

IV. Bottom-right panel:
Files tab: shows files in your working directory
Plots tab: shows the history of plots you created. From this tab, you can export a plot to a PDF or an
image file
Packages tab: shows external R packages available on your system. If a package is checked, it is loaded
in R.

4. Starting R
R is most easily used in an interactive manner. You ask it a question and R gives you an answer.
Questions are asked and answered on the command line. The > is called the prompt. In what follows
below it is not typed, but is used to indicate where you are to type if you follow the examples. If a
command is too long to fit on a line, a + is used for the continuation prompt.

5. R as a calculator
R evaluates and prints out the result of any expression that one types in at the command line in the
console window. Expressions are typed following the prompt (>) on the screen. The result, if any,
appears on the subsequent lines. Note that anything after # is a ‘comment’; it is not executed. It is a
good practice to insert comments in strategic places for later reference.

> 2+2
[1] 4
> sqrt(10)
[1] 3.162278
> 2*3*4*5
[1] 120
> 1000*(1+0.075)^5 - 1000 # Interest on $1000, compounded annually

[1] 435.6293
> # at 7.5% p.a. for five years
> pi # R knows about pi
[1] 3.141593
> 2*pi*6378 #Circumference of Earth at Equator, in km; radius is 6378 km
[1] 40074.16
> sin(c(30,60,90)*pi/180) # Convert angles to radians, then take sin()
[1] 0.5000000 0.8660254 1.0000000

6. R as a smart calculator

> x = 1 # Can define variables


> y = 3 # use “=” or “<-” operator to set values
> z = 4
> x * y * z
[1] 12
> X * Y * Z # Variable names are case sensitive
Error: Object "X" not found
> This.Year = 2021 # Variable names can include a period
> This.Year
[1] 2021

7. Entering data with c


The most useful R command for quickly entering small data sets is the c function. This
function combines or concatenates terms together. As an example, suppose we have the following count
of the number of typos for the first eight pages of these notes: 2 3 0 3 1 0 0 1.
To enter this into an R session we do so with
> typos = c(2,3,0,3,1,0,0,1)
> typos
[1] 2 3 0 3 1 0 0 1

Notice a few things:


• Every line starts with >, the prompt. Don’t type it!
• We assigned the values to a variable called typos
• The assignment operator is a =. You can also use a <-. Both can be used interchangeably, but you
should learn one and stick with it.
• The values of typos do not automatically print out. They do when we type just the name, though,
as the last input line indicates.
• The value of typos is prefaced with a funny looking [1]. This indicates that the value is a vector.
More on that later.

8. Typing less

For many implementations of R, you can save yourself a lot of typing if you learn that the arrow keys
can be used to retrieve your previous commands. In particular, each command is stored in a history and
the up arrow will traverse backwards along this history and the down arrow forwards. The left and right
arrow keys will work as expected. This combined with a mouse can make it quite easy to do simple
editing of your previous commands.

9. Applying a function
R comes with many built-in functions that one can apply to data such as typos. One of them is
the mean function for finding the sample mean. It is easy to use:
> mean(typos)

[1] 1.25

Similarly, we could call the functions median, var and sd to find respectively the sample median,
the sample variance and the sample standard deviation. The syntax is the same -- the function name followed by parentheses
to contain the argument(s). For example
> median(typos)
[1] 1
> var(typos)
[1] 1.642857

10. Data as vector


The data are stored in R as a vector. This means simply that R keeps track of the order in which the data
were entered. In particular, there is a first element, a second element, and so on up to a last element. This is a good
thing for several reasons:
• Our simple data vector typos has a natural order -- page 1, page 2 etc. We would not want to mix
these up.
• We would like to be able to make changes to the data item by item instead of having to enter in the
entire data set again.
• Vectors are also a mathematical object. There are natural extensions of mathematical concepts such
as addition and multiplication that make it easy to work with data when they are vectors.
• Basic vector operations work in R:

> x = c(2,0,0,4) # Creates vector with elements 2,0,0,4


> y = c(1,9,9,9)
> x + y # Sums elements of two vectors
[1] 3 9 9 13
> x * 4 # Multiplies elements
[1] 8 0 0 16
> sqrt(x) # Function applies to each element, returns vector
[1] 1.41 0.00 0.00 2.00

Example 10 Keeping track of a stock; adding to the data

Suppose the daily closing prices of your favourite stock for two weeks are (weekdays only): 45, 43, 46,
48, 51, 46, 50, 47, 46, 45.

We keep track of this with R using a vector:

> x = c(45,43,46,48,51,46,50,47,46,45)
> mean(x) # the mean
[1] 46.7
> median(x) # the median
[1] 46
> max(x) # the maximum or largest value
[1] 51
> min(x) # the minimum value
[1] 43

This illustrates that many interesting functions can be found easily. Let us see how we can do some
others. First, let us add the next two weeks’ data to x. This was: 48,49,51,50,49,41,40,38,35,40.
We can add this in several ways.
> x = c(x,48,49,51,50,49) # appended 5 values to x
> length(x) # how long is x now (it was 10)
[1] 15
> x[16] = 41 # add to one specified index
> x[17:20] = c(40,38,35,40) # add to many specified indices

Notice that we did three different things to add to a vector. All are useful, so let us explain. First, we
used the c (combine) operator to combine the previous value of x with the next week's numbers. Then
we assigned directly to the 16th index. At the time of the assignment, x had only 15 indices, this
automatically created another one. Finally, we assigned to a slice of indices. This latter makes some
things very simple to do.

Now notice a few more things. First, the comment character, #, is used to make comments. Basically,
anything after the comment character is ignored (by R, hopefully not the reader). More importantly, the
assignment to an entry in the vector is done by referencing that entry in the vector. This is done with
square brackets [ ]. It is important to keep this in mind: parentheses () are for functions, and square
brackets [ ] are for vectors. Note that negative indices give everything except these indices.

To create the vector 17 18 19 20 we used the colon operator, :. We could have typed this in,
but this is a useful thing to know. The command a:b is simply a, a+1, a+2, ..., b if a,b are
integers and intuitively defined if not. A more general R function is seq() which is a bit more typing.
Try ?seq to see its options. To produce the above try seq(17,20,1).
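A few quick illustrations of the colon operator and seq (the numbers are arbitrary):
> 17:20                 # 17 18 19 20
> seq(17, 20, 1)        # the same sequence using seq()
> seq(0, 1, by = 0.25)  # seq() also allows non-integer step sizes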

R Basics: accessing data


There are several ways to extract data from a vector. Here is a summary using both slicing and extraction
by a logical vector. Suppose x is the data vector, for example x = 1:10.

how many elements? length(x)


ith element x[2] (i = 2)
all but ith element x[-2] (i = 2)
first k elements x[1:5] (k = 5)
last k elements x[(length(x)-4):length(x)] (k = 5)
specific elements x[c(1,3,5)] (the 1st, 3rd and 5th)
all greater than some value x[x>3] (the value is 3)
bigger than or less than some values x[x < -2 | x > 2]
which indices are largest which(x == max(x))

Notice the usage of the double equals sign (==). This tests all the values of x to see if they are equal to
the maximum of x. The 10th answers yes (TRUE), the others no. Think of this as asking R a question: is
each value equal to max(x)? R answers all at once with a long vector of TRUEs and FALSEs.

Now the question is, how can we get the index (or indices) corresponding to the TRUE values? Let us
rephrase: which indices have values equal to the maximum of the vector? That is done by the
command which; it picks up the corresponding index/indices.
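For example, with the vector x = 1:10 from the table above:
> x = 1:10
> x == max(x)          # a vector of ten logicals; only the last is TRUE
> which(x == max(x))   # 10, the index of the maximum
> x[x > 3]             # logical indexing: the elements 4 through 10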

11. Summarising numerical data using R

Numerical data

There are many options for displaying and/or summarising numerical data. We will only consider some
common numerical summaries of centre and spread introduced in the first chapter.

Numeric measures of centre and spread

To describe a distribution, we often want to know where it is centred and how spread out it is. These are
typically measured with the mean and variance (or standard deviation), or with the median and, more generally,
the five-number summary (the minimum, the lower hinge, the median, the upper hinge and the maximum).
The R commands for these are mean, var, sd, median, and fivenum or summary.

Example 11 CEO salaries

Suppose CEO yearly compensations are sampled, and the following are found (in millions): 12 0.4 5 2
50 8 3 1 4 0.25. (This is before being indicted for cooking the books.)
> sals = scan() # read in with scan
1: 12 0.4 5 2 50 8 3 1 4 0.25
11:
Read 10 items
> mean(sals) # the average
[1] 8.565
> var(sals) # the variance
[1] 225.5145
> sd(sals) # the standard deviation
[1] 15.01714
> median(sals) # the median
[1] 3.5
> fivenum(sals) # min, lower hinge, Median, upper hinge, max
[1] 0.25 1.00 3.50 8.00 50.00
> summary(sals)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.250 1.250 3.500 8.565 7.250 50.000
Notice the summary command. For a numeric variable it prints out the five-number summary and the
mean. For other variables, it adapts itself in an intelligent manner.

The difference between fivenum and the quantiles: You may have noticed the slight difference
between the fivenum and the summary commands. In particular, one gives 1.00 for the lower hinge and
the other 1.250 for the first quartile. The difference is that while fivenum uses the hinges for the quartiles,
summary uses the quantile definition, which is the default choice for the R functions quantile and
IQR. Here is an illustration with the sals data, which has n = 10. From the hinge definition we should have the median
at position (10+1)/2 = 5.5, the lower hinge at the 3rd smallest value and the upper hinge at the 8th smallest value.
In contrast, under the quantile definition q1 should be at position 1 + (10-1)(1/4) = 3.25. We can check that this is the
case by sorting the data:
> sort(sals)
[1] 0.25 0.40 1.00 2.00 3.00 4.00 5.00 8.00 12.00 50.00
> fivenum(sals) # note 1 is the 3rd value, 8 the 8th.
[1] 0.25 1.00 3.50 8.00 50.00
> summary(sals) # the 3.25th value is 1/4 of the way between 1 and 2, i.e. 1.25

Min. 1st Qu. Median Mean 3rd Qu. Max.
0.250 1.250 3.500 8.565 7.250 50.000
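We can also confirm the first and third quartiles directly with the quantile function, which uses the same default definition as summary:

> quantile(sals, c(0.25, 0.75))
 25%  75%
1.25 7.25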

Stem-and-leaf plot

If the data set is relatively small, the stem-and-leaf diagram is very useful for seeing the shape of the distribution and the values. The number on the left of the bar is the stem, and the digits on the right are the leaves. We put a stem and a leaf together to recover an observation, as we have seen in Chapter 1. For example, in the plot below, the row 8 | 14 represents the observations 81 and 84.

To create a stem-and-leaf plot is simple:


> x = c(93, 84, 86, 78, 95, 81, 72)
> stem(x)
The decimal point is 1 digit(s) to the right of the |

7 | 2
7 | 8
8 | 14
8 | 6
9 | 3
9 | 5

Tabulating numerical variables

Sometimes we may need to tabulate a numeric variable by aggregating its values into classes. For example, the CEO salaries could be placed into broad categories of 0-1 million, 1-5 million and over 5 million. To do this in R we use the cut function together with the table function.

Example 11 CEO salaries (continued)

Consider the salaries again: 12 0.4 5 2 50 8 3 1 4 0.25.

Suppose we want to break the data into the three classes (0,1], (1,5], and (5,50]. To use the cut command, we need to specify the cut points, in this case 0, 1, 5 and 50 (= max(sals)). Here is the syntax:
> sals = c(12, 0.4, 5, 2, 50, 8, 3, 1, 4, 0.25) # enter data
> cats = cut(sals, breaks=c(0,1,5,max(sals))) # specify the breaks
> cats # view the values
[1] (5,50] (0,1] (1,5] (1,5] (5,50] (5,50] (1,5] (0,1] (1,5] (0,1]
Levels: (0,1] (1,5] (5,50]
> table(cats) # organise
cats
(0,1] (1,5] (5,50]
3 4 3
> levels(cats) = c("poor","rich","rolling in it") # change labels
> table(cats)
cats
poor rich rolling in it
3 4 3

Notice that cut answers the question "which interval is the number in?". The output is the interval. The table command is then used to summarise the result of cut in a frequency table. Additionally, the names of the levels were changed as an illustration of how to manipulate them. We detail the use of the table command in the section on categorical variables.
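As an aside, the category labels could also have been supplied directly through the labels argument of cut, avoiding the separate call to levels. A sketch of this alternative:

> cats = cut(sals, breaks=c(0,1,5,max(sals)), labels=c("poor","rich","rolling in it"))
> table(cats) # same frequency table as above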

Histogram

Unless the number of data points is small, as we have already seen in Section 1, the histogram is a better choice than the stem-and-leaf plot. The histogram defines a sequence of breaks and then counts the number of observations in the bins formed by the breaks, much as the cut function does. It plots these counts as bars, similar to a bar chart, except that the bars touch. The heights can be the frequencies or the proportions. In the latter case the areas of the bars sum to 1. In either case the area of a bar is proportional to the proportion of observations falling in that bin.

Let us begin with a simple example. Suppose the top 25 ranked movies made the following gross
receipts for a week: 29.6 28.2 19.6 13.7 13.0 7.8 3.4 2.0 1.9 1.0 0.7 0.4 0.4 0.3 0.3
0.3 0.3 0.3 0.2 0.2 0.2 0.1 0.1 0.1 0.1 0.1

Let us visualise it (Figure 9). First we scan the data in, then make some histograms:
> x=scan()
1: 29.6 28.2 19.6 13.7 13.0 7.8 3.4 2.0 1.9 1.0 0.7 0.4 0.4 0.3 0.3 0.3 0.3 0.3
0.2 0.2 0.2 0.1 0.1 0.1 0.1 0.1
27:
Read 26 items
> hist(x) # frequencies
> hist(x,probability=TRUE) # proportions (or probabilities)

Figure 9 Histograms using frequencies and proportions

Two graphs are shown. The first is the default histogram, displaying the frequencies (total counts). The second is a histogram of proportions, which makes the total area add to 1. This is preferred, as it relates better to the concept of a probability density. Note the only difference is the scale on the y-axis.
The basic histogram has a predefined set of break points for the bins. If you want, you can specify the
number of breaks or your own break points (Figure 10).

> hist(x,breaks = 10) # 10 breaks, or just hist(x,10); no changes


> hist(x,breaks = c(0,1,2,3,4,5,10,20,max(x))) # specify break points

You can check that the first command does not affect the histogram at all (not shown). That is because R treats the requested number of breaks as a suggestion only, and may ignore it. The second command, however, produces the graph given in Figure 10.
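If you only want to inspect the bin counts, hist can also be called with plot = FALSE, in which case it returns the breaks and counts it would have used rather than drawing anything. A sketch (the counts shown are what this call should give for these break points):

> h = hist(x, breaks = c(0,1,2,3,4,5,10,20,max(x)), plot = FALSE)
> h$counts # the number of observations in each bin
[1] 17  2  0  1  0  1  3  2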

Figure 10 Histograms with the breakpoints specified

From the histogram, you can easily make guesses as to the values of the mean, the median, and the IQR. To do so, you need to know that the median divides the histogram into two pieces of equal area, the mean is the point at which the histogram would balance, and the IQR covers exactly the middle half of the data.
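You can then check your guesses against the exact values (only the commands are shown here):

> mean(x) # the balance point of the histogram
> median(x) # divides the area into two equal halves
> IQR(x) # the width covering the middle half of the data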

Boxplot

The boxplot, as we have seen in Chapter 1, is used to summarise data succinctly, quickly displaying whether the data is symmetric or has suspected outliers. It is based on the 5-number summary. In its simplest usage, the boxplot has a box with lines at the lower hinge (basically q1), the median and the upper hinge (basically q3), and whiskers which extend to the minimum and the maximum. To highlight possible outliers, a convention is adopted whereby the whiskers extend only as far as the most extreme points within 1.5 times the IQR of the box. Any points beyond that are plotted separately, and may be marked differently again if they lie more than 3 IQRs away (extreme outliers). Thus, boxplots allow us to check quickly for asymmetry (the shape looks unbalanced) and outliers (data points plotted beyond the whiskers).

Example 9 (continued)

In this example, we look at a sample (n = 57) of total nitrogen loads (TN, in kg of nitrogen/day) from a particular Chesapeake Bay location, and produce the corresponding boxplot.

> loadN = scan()


1: 9.69 13.16 17.09 18.12 23.70 24.07 24.29 26.43
9: 30.75 31.54 35.07 36.99 40.32 42.51 45.64 48.22
17: 49.98 50.06 55.02 57.00 58.41 61.31 64.25 65.24
25: 66.14 67.68 81.40 90.80 92.17 92.42 100.82 101.94
33: 103.61 106.28 106.80 108.69 114.61 120.86 124.54 143.27
41: 143.75 149.64 167.79 182.50 192.55 193.53 271.57 292.61
49: 312.45 352.09 371.47 444.68 460.86 563.92 690.11 826.54 1529.35
58:
Read 57 items
> boxplot(loadN) #Vertical boxplot
> boxplot(loadN, horizontal = T) #Horizontal boxplot

The graph in Figure 11 gives the vertical boxplot. The horizontal boxplot is given in Figure 8.
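As a rough sketch of what the boxplot is doing, the points it flags as suspected outliers on the upper side could be listed with something like the following (using the 1.5 times IQR convention described above):

> q3 = quantile(loadN, 0.75) # the third quartile
> loadN[loadN > q3 + 1.5*IQR(loadN)] # points beyond the upper whisker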

Figure 11 A vertical boxplot of the nitrogen load data

12. Summarising categorical data using R

We view categorical data with tables. Graphically, we look at the data with bar charts or pie charts.

Using tables

As seen before, the table command allows us to look at tables. Its simplest usage looks
like table(x) where x is a categorical variable.

Example 12 Smoking survey

A survey asks people if they smoke or not. The data are: Yes, No, No, Yes, Yes. We can enter this
into R with the c() command, and summarise with the table command as follows

> x=c("Yes","No","No","Yes","Yes")
> table(x)
x
No Yes
2 3

The table command simply counts the frequency of each unique value in the data.

Factors

Categorical data is often used to classify observations into various levels, or factors. For example, the smoking data could be part of a broader survey on student health issues. R has a special class for working with factors, which is occasionally important to know about, as many R functions automatically adapt their behaviour when given a factor. Making a factor is easy with the command factor or as.factor. Notice the difference in how R treats factors in this example:

> x=c("Yes","No","No","Yes","Yes")
> x # print out values in x
[1] "Yes" "No" "No" "Yes" "Yes"
> factor(x) # print out value in factor(x)
[1] Yes No No Yes Yes
Levels: No Yes # notice levels are printed.
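One practical consequence is that a factor keeps track of all of its levels, even those that do not occur in the data, and tables built from it report zero counts for them. For instance (the level "Maybe" below is made up purely for illustration):

> y = factor(x, levels=c("No","Yes","Maybe"))
> table(y)
y
   No   Yes Maybe
    2     3     0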

Bar charts

A bar chart draws a bar with a height proportional to the count in the table. The height could be given
by the frequency, or the proportion. The graph will look the same, but the scales may be different.

Suppose a group of 25 people are surveyed as to their beer-drinking preference. The categories were (1) Domestic can, (2) Domestic bottle, (3) Microbrew and (4) Import. The raw data set is:

3 4 1 1 3 4 3 3 1 3 2 1 2 1 2 3 2 3 1 1 1 1 4 3 1

Let us produce barplots of both frequencies and proportions. First we use the scan function to read in the data, then we plot (Figure 12).

> beer = scan()


1: 3 4 1 1 3 4 3 3 1 3 2 1 2 1 2 3 2 3 1 1 1 1 4 3 1
26:
Read 25 items
> barplot(beer) # this isn't correct
> barplot(table(beer)) # Yes, call with summarised data
> barplot(table(beer)/length(beer)) # divide by n for proportion

Figure 12 Sample barplots

Notice a few things:

• We used scan() to read in the data. This command is very useful for reading data from a file or by typing. Try ?scan for more information, but the basic usage is simple: you type in the data, and it stops adding data when you enter a blank row.
• We produced 3 barplots. The first shows that we should not call barplot on the raw data.
• The second shows the use of the table command to create summarised data; the result is sent to barplot, creating the barplot of frequencies shown.
• Finally, the command
> table(beer)/length(beer)
1 2 3 4
0.40 0.16 0.32 0.12

produces the proportions first. (We divided by the number of data points, which is 25, or length(beer).) The result is then handed off to barplot to produce the graph. Notice it has the same shape as the previous one, but the height axis is now between 0 and 1, as it measures the proportion and not the frequency.

Note: the following code will produce the frequency and proportion barplots for the bikes data from Example 5:
> bikes = c(41,27,20,18,3,11)
> names(bikes)=c("Honda", "Yamaha","Kawasaki", "Harley-Davidson", "BMW", "Other")
> barplot(bikes, main ="Frequency") #Frequency barplot
> barplot(bikes/sum(bikes), main="Proportion") #Proportion barplot

Pie charts

The same beer data can be studied with pie charts using the pie function. Here are some simple
examples illustrating the usage (similar to barplot, but with some added features):

> beer.counts = table(beer) # store the table result


> pie(beer.counts) # first pie
> names(beer.counts) = c("Domestic\n can","Domestic\n bottle", "Microbrew","Import") # give names
> pie(beer.counts) # prints out names
> pie(beer.counts,col=c("purple","green2","cyan","white")) #now with your choice of colours

Figure 13 Piechart example

The first pie chart was confusing, so we added names. This is done with the names function, which allows us to assign names to the categories. The resulting pie chart shows how the names are used. Finally, we specified our own choice of colours for the pie chart. This is done by setting the argument col equal to a vector of colour names of the same length as beer.counts. The help page (?pie) gives some examples for automatically generating different colours, notably using rainbow and grey.

Notice that we can use additional arguments in the function pie, such as col for colour. The syntax for these is name = value. The ability to pass in named values to a function makes it possible to have fewer functions, as each one can have more functionality. This is a common feature of R commands.
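For instance, the following sketch combines the col argument with the rainbow helper mentioned above and a main argument for a title (the title text is just an example):

> pie(beer.counts, col = rainbow(4), main = "Beer preference")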
