University of Illinois at Urbana-Champaign Instructor: Junho Song

Dept. of Civil and Environmental Engineering junho@illinois.edu

CEE202: Engineering Risk and Uncertainty


Supplement 1

Graphical Representation and Numerical Summaries of Data*

Prepared by A. Der Kiureghian**

1 INTRODUCTION

This supplement describes useful methods to represent statistical data in graphical form
and formulas to compute important numerical descriptors of statistical data. Applicable
MATLAB commands are introduced and examples are given.

2 SAMPLE OF DATA

Virtually all phenomena of engineering interest are subject to variability. Material proper-
ties, flow in rivers, earthquake magnitudes, structural capacities, pollutant concentrations,
etc., tend to vary when measured at different locations, times or occurrences, or for dif-
ferent specimens. This variability must be accounted for when we design engineering
systems or assess their condition. Naturally, we cannot measure the entire population of
the phenomenon of interest. Instead, we usually have a representative sample of meas-
urements, much smaller in number and volume than the population. By use of statistics
and probability, we try to learn about the population by analyzing the sample.

Below, we describe various graphical methods for representing a sample of data. For
this discussion, assume the sample is represented by the measurements $x_1, x_2, \ldots, x_N$,
where $N$ is the sample size. These could be, for example, the measured strengths of $N$
specimens of a material. In statistical sampling, it is important that the members of the
sample be independent of one another. This is known as random sampling. For example,
if samples are to be taken from the conveyor belt of a manufacturing plant, it is important
that they be selected randomly and from a long stream of the output, and not from a single
stretch next to each other. In the case of the material specimens, it is important that
the specimens be taken from different parts of the material population. Otherwise, the
sample may not correctly represent the variability in the population. Several samples of
real data collected by civil and environmental engineers have been posted on the course
website. Please review these data and download them to your computer for future use.

* Used as supplementary material for CEE202 with the permission of the author.
** Taisei Professor of Civil Engineering, Dept. of Civil and Environmental Engineering, UC Berkeley.

In general, we deal with two types of variables: discrete and continuous. Measure-
ments of a discrete variable are discrete numbers, usually integers. An example is the
number of damaging earthquakes (say having magnitudes greater than 5 on the Richter
scale) occurring per year in California. The possible outcomes are 0, 1, 2, 3, and so on.
For a continuous variable, the measured values may be anywhere inside an interval.
For example, the daily runoff of a manufacturing plant could be any positive real number,
i.e., within the interval from 0 to (theoretically) infinity. This distinction is important
when preparing graphical representations of data.

Table 1 shows data on the seasonal number of rainy days and the amount of rainfall in
San Francisco from the 1960-61 season to the 2002-03 season (total of 43 seasons), as
reported by Golden Gate Weather Services (see http://ggweather.com/sf/daily.html). A
season is defined as July 1 until June 30 of the following year. Clearly, the number of
rainy days is a discrete variable, whereas the amount of rainfall is a continuous variable.
Considerable variation in the data for both variables is observed. Graphical representa-
tions of these data are described in the following sections.

3 GRAPHICAL REPRESENTATION OF DATA

Various graphical methods are available to explore features of statistical data. Below we
describe the most common and useful graphical methods. Corresponding MATLAB
commands are introduced.

3.1 Histogram Diagram

A histogram describes the number of occurrences in the sample for different outcomes of
a variable. For a discrete variable, the histogram is a bar chart describing the number of
occurrences in the sample for each discrete outcome of the variable. To construct this
diagram, first put the sample values in ascending order, and then count the number of
occurrences for each discrete outcome in the sample. Plot the number of occurrences versus
the discrete outcome value as a bar chart. Figure 1 shows the histogram of the number
of rainy days in San Francisco based on the data in Table 1. The MATLAB command
hist is used to compute the number of occurrences at each discrete value and then the
bar command is used to generate the plot. It is seen that, during the period of observation,
the number of rainy days varies from 41 to 119, with most seasons falling in the range of
55 to 85 days. Since the sample size, $N = 43$, is small in relation to the range of variation,
the histogram does not show a clear distributional shape. In such cases, it is often
useful to consolidate sets of the discrete values of the variable into bins, as described
below for a continuous variable.

Table 1. Seasonal rainfall and rainy days in San Francisco since 1960

Season    Rainfall  Rainy    Season    Rainfall  Rainy    Season    Rainfall  Rainy
          (in.)     days               (in.)     days               (in.)     days
1960-61   13.87      68      1975-76    7.97      47      1990-91   14.08      59
1961-62   17.65      52      1976-77   11.06      41      1991-92   19.20      59
1962-63   22.15      63      1977-78   26.87      81      1992-93   26.66      83
1963-64   12.32      55      1978-79   18.74      56      1993-94   15.22      58
1964-65   22.29      76      1979-80   24.49      64      1994-95   34.02     100
1965-66   16.32      54      1980-81   15.39      54      1995-96   24.89      65
1966-67   29.41      81      1981-82   37.10      80      1996-97   22.63      63
1967-68   14.46      54      1982-83   38.10     100      1997-98   47.22     119
1968-69   25.09      93      1983-84   22.47      70      1998-99   23.49      80
1969-70   20.80      63      1984-85   20.01      77      1999-00   24.89      82
1970-71   18.79      67      1985-86   28.68      71      2000-01   19.47      69
1971-72   11.06      60      1986-87   13.86      65      2001-02   25.03      75
1972-73   34.36      84      1987-88   17.74      57      2002-03   23.87      71
1973-74   27.76      83      1988-89   17.43      71
1974-75   18.26      72      1989-90   14.32      49

Figure 1. Histogram of seasonal rainy days in San Francisco

For a continuous variable, it is necessary to select bins in order to count the number
of occurrences. Proceed as follows:

1. Put the sample values in ascending order. Let $x_{\min}$ denote the smallest value and
$x_{\max}$ denote the largest value. The difference $r = x_{\max} - x_{\min}$ denotes the range of
observations.

2. Divide the range into a number of bins. If necessary, you may adjust the end points of
the range outward so as to select more convenient numbers for the bin edges. The
number of bins, denoted $m$, depends on the sample size, $N$. Ideally, it should be
somewhere between 10 and 20, but may have to be fewer for a small sample size. A
good rule of thumb is $m = \sqrt{N}$. Samples with fewer than about 20 items are not
appropriate for developing a histogram. The bin sizes are usually equal, but it is possible
to make them unequal.

3. Count the number of occurrences within each bin in the sample.

4. Plot a bar chart (with bars extending over the entire widths of the bins) showing the
number of occurrences in each bin versus the center value of the bin. MATLAB
commands hist and histc are useful for this purpose. The command hist can be
used when bin widths are equal, or when the bins are defined by their center values.
For example, if x denotes a vector containing the sample values and m denotes the
desired number of bins, then hist(x,m) will compute and plot the histogram for m
equal-size bins. However, you will not have control over the bin sizes or the location
of the bin edges. On the other hand, if c denotes a vector storing the desired bin centers
in ascending order, then hist(x,c) will compute and plot the histogram
with bins centered at the specified locations, provided the distances between the bin
centers are uniform. You can use this version of the command to control the bin sizes. By
specifying non-uniformly spaced bin centers, you can obtain a histogram with
unequal bin sizes. MATLAB claims that the bins will then be centered at the locations
specified by the vector c. However, this turns out not to be true. MATLAB actually
positions the bin edges midway between successive specified locations, so the
specified values are not the centers of bins of unequal width. In any case, you
can generate histograms with unequal bin sizes by use of this command. To have
better control over the bin sizes when they are unequal, you can use the command
histc. Let e denote a vector containing the locations of the bin edges in ascending
order. Then the command n=histc(x,e) will compute the number of occurrences
in each bin within the specified edges and store them in the vector n. You can
then use the bar command to plot n versus the center coordinates of the bins, say c
(after merging or discarding the final element of n, which counts only the values equal
to the last edge). In that case use the command bar(c,n,'hist') so that the bars extend over
the entire widths of the bins.
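
The four steps above can be sketched for the rainfall amounts of Table 1 as follows. This is a minimal illustration, not the script used to produce the figures: the bin edges 6, 10, ..., 50 in (the range adjusted outward to convenient numbers) and all variable names are assumptions.

```matlab
% Seasonal rainfall (in.) from Table 1, 1960-61 through 2002-03
y = [13.87 17.65 22.15 12.32 22.29 16.32 29.41 14.46 25.09 20.80 ...
     18.79 11.06 34.36 27.76 18.26  7.97 11.06 26.87 18.74 24.49 ...
     15.39 37.10 38.10 22.47 20.01 28.68 13.86 17.74 17.43 14.32 ...
     14.08 19.20 26.66 15.22 34.02 24.89 22.63 47.22 23.49 24.89 ...
     19.47 25.03 23.87];

ys = sort(y);                    % step 1: ascending order
r  = ys(end) - ys(1);            % range r = xmax - xmin

e = 6:4:50;                      % step 2: assumed bin edges (11 bins, 4 in wide)
c = (e(1:end-1) + e(2:end))/2;   % bin centers 8, 12, ..., 48

n = histc(y, e);                 % step 3: n(k) counts e(k) <= y < e(k+1)
n(end-1) = n(end-1) + n(end);    % histc's last entry counts y == e(end); merge it
n(end)   = [];                   % one count per bin remains

bar(c, n, 'hist');               % step 4: bars span the full bin widths
xlabel('Seasonal rainfall (in.)');
ylabel('Number of occurrences');
```

With $N = 43$ the rule of thumb would suggest about 7 bins; the 11 bins assumed here are narrower, which is why some of them are empty.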

Figure 2. Histograms with different dispersions

One can learn a great deal from a histogram. It readily provides the range of the data,
as well as the distribution of the sample data over this range. Of particular interest are the
central region of the histogram, where the bulk of observations occur, and the extent of
dispersion and asymmetry of the distribution. For example, the two nearly symmetric his-
tograms shown in Figure 2 are both centered around zero. Furthermore, the one on the
right shows a larger dispersion than the one on the left. Similarly, of the two asymmetric
histograms with similar dispersions shown in Figure 3, the one on the left is centered

around 8 and shows skewness (long tail) to the right, whereas the one on the right is cen-
tered around 3 and shows a skewness to the left. In Section 4 we will introduce numerical
measures to quantify these characteristics of a histogram.

positive skewness negative skewness

Figure 3. Asymmetric histograms with positive and negative skewness

For the data in Table 1, we have $N = 43$, $x_{\min} = 7.97$ in, $x_{\max} = 47.22$ in and
$r = 39.25$ in for the amounts of seasonal rainfall. It is decided to use 11 bins of 4 in width
with the first bin centered at 8 in, the second bin centered at 12 in, etc., up to the 11th bin
centered at 48 in. Figure 4a shows the histogram developed by use of the MATLAB command
hist by specifying these bin centers. Figure 5a shows a histogram for the same
data, where the last two bins are taken to be wider. This is generated by using the hist
command and specifying the bin centers at 8 in, 12 in, ..., 40 in and 46 in. Note that the bin
centered at 44 in has been dropped and the center of the next bin is positioned at 46 in.
These locations are shown in the MATLAB output with asterisks. Note that the last two
bins are wider. However, contrary to what MATLAB claims, these bins are not centered
at 40 in and 46 in. Since occurrences in the tail regions are usually few, it is sometimes
convenient to have wider bins there in order to capture more occurrences.
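
The behavior of hist with unequally spaced centers can be checked numerically. The sketch below is an illustration with assumed variable names. It calls hist with an output argument, in which case the counts are returned and no plot is drawn, and it recomputes the implied bin edges as the midpoints between successive centers.

```matlab
% Seasonal rainfall (in.) from Table 1
y = [13.87 17.65 22.15 12.32 22.29 16.32 29.41 14.46 25.09 20.80 ...
     18.79 11.06 34.36 27.76 18.26  7.97 11.06 26.87 18.74 24.49 ...
     15.39 37.10 38.10 22.47 20.01 28.68 13.86 17.74 17.43 14.32 ...
     14.08 19.20 26.66 15.22 34.02 24.89 22.63 47.22 23.49 24.89 ...
     19.47 25.03 23.87];

c1 = 8:4:48;             % 11 equally spaced centers, as for Figure 4a
n1 = hist(y, c1);        % counts per bin

c2 = [8:4:40 46];        % center at 44 dropped, next one moved to 46 (Figure 5a)
n2 = hist(y, c2);

% Interior edges implied by c2: midpoints between successive centers
e2 = (c2(1:end-1) + c2(2:end))/2;    % 10, 14, ..., 38, 43

disp(n1);
disp(n2);
disp(e2);
```

The last interior edges are 38 and 43, so the bin specified at 40 in spans 38 in to 43 in, whose midpoint is 40.5 in rather than 40 in.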

3.2 Frequency Diagram

Sometimes it is useful to show the histogram in terms of the frequency (percentage) of
observations in each bin. The graph is then called a frequency diagram. This is obtained
by simply dividing the number of occurrences in each bin by the total number of observations,
$N$. MATLAB does not have a built-in function that computes and plots a frequency
diagram in one step. A modified version of the hist command, called freqdm, has been
developed by Junho Song and is available for download from the course website.
This command functions in a manner similar to hist, but it computes and plots the
frequency of occurrence rather than the number of occurrences in each bin.

Figure 4. Histogram, frequency diagram and normalized frequency diagram and polygon of
seasonal rainfall in San Francisco with equal bins: (a) histogram, (b) frequency diagram,
(c) normalized frequency diagram and polygon

Figure 5. Histogram, frequency diagram and normalized frequency diagram and polygon of
seasonal rainfall in San Francisco with unequal bins: (a) histogram, (b) frequency diagram,
(c) normalized frequency diagram and polygon

Figures 4b and 5b show the frequency diagrams for the San Francisco rainfall data
with the same bins as in the histogram plots shown in Figures 4a and 5a, respectively.
These have been generated by use of the command freqdm. Note that the shapes of
these diagrams are identical to the corresponding histograms.

3.3 Normalized Frequency Diagram

A normalized frequency diagram shows the percentage of observations that fall within
each bin per unit value of the variable. It is obtained by scaling the histogram and is use-
ful in selecting a probability distribution model for a continuous variable (to be described
later). The area underneath the normalized frequency diagram represents 100% of the
observations. Therefore, this area must be equal to 1. If equal size bins are used, the height
of each bar in the histogram is scaled by the factor $1/(N \Delta x)$, where $\Delta x$ is the common
width of the bins and $N$ is the sample size. In this case, the shape of the normalized
frequency diagram is similar to the shape of the corresponding histogram or frequency
diagram. If unequal bins are used, the height of the histogram bar for the i-th bin should be
scaled by $1/(N \Delta x_i)$, where $\Delta x_i$ is the width of the i-th bin. Figures 4c and 5c show the
normalized frequency diagrams for the San Francisco rainfall data with the same bins as
the histograms in Figures 4a and 5a, respectively. Note that in the case of unequal bin
widths, the shape of the normalized frequency diagram is different from the correspond-
ing histogram. MATLAB does not have a built-in command to generate a normalized
frequency diagram. However, based on the command hist, Junho Song has developed
the command normfreq for computing and plotting the normalized frequency diagram.
This command can be downloaded from the course website. The plots in Figures 4c and
5c have been generated by use of this command.
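
As a rough check of the scaling rule for unequal bins, the sketch below normalizes a set of bin counts and confirms that the area under the resulting diagram is 1. The bin edges are an assumption chosen for illustration (they are not necessarily those of Figure 5), and the counts were tallied here from the rainfall column of Table 1 for those edges.

```matlab
% Assumed bin edges (in.); the last two bins are 8 in wide, the others 4 in
e = [6 10 14 18 22 26 30 34 42 50];
% Number of Table 1 rainfall values falling in each bin
n = [1 5 9 7 11 5 0 4 1];

N  = sum(n);             % sample size
dx = diff(e);            % bin widths, Delta x_i
f  = n / N;              % frequency diagram ordinates
fn = n ./ (N * dx);      % normalized frequency ordinates (per inch)

area = sum(fn .* dx);    % area under the normalized frequency diagram
fprintf('N = %d, sum of frequencies = %.4f, area = %.4f\n', N, sum(f), area);
```

Because the wide bins are divided by a larger width, their bars are lower in the normalized diagram than in the histogram, which is the change of shape noted above.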

3.4 Normalized Frequency Polygon

A normalized frequency polygon is obtained by first extending the normalized frequency
diagram by one bin of zero height at each end, and then joining the midpoints of the tops
of the bars for each bin. This plot is useful in constructing a theoretical distribution
model, to be described later in this course. Figures 4c and 5c show the normalized frequency
polygons for the San Francisco rainfall data, which are superimposed on the respective
normalized frequency diagrams. These plots have been generated by use of the
command normpg, which was developed by J. Song and is available for download from
the course website.
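
The construction of the polygon can be sketched directly, without the course command normpg, whose implementation is not reproduced here. The counts below were tallied from the Table 1 rainfall values for the 11 equal bins centered at 8, 12, ..., 48 in; the variable names are assumptions.

```matlab
c  = 8:4:48;                      % bin centers (in.)
n  = [1 5 9 7 11 5 0 3 1 0 1];    % occurrences per bin, Table 1 rainfall
dx = 4;                           % common bin width (in.)
fn = n / (sum(n) * dx);           % normalized frequencies

% Add one bin of zero height at each end, then join the midpoints of the bar tops
cp = [c(1) - dx, c, c(end) + dx];
fp = [0, fn, 0];

plot(cp, fp, '-o');
xlabel('Seasonal rainfall (in.)');
ylabel('Normalized frequency (1/in.)');

A = trapz(cp, fp);                % area under the polygon
```

For equal bin widths the area under the polygon equals the area under the normalized frequency diagram, so A evaluates to 1 up to rounding.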

3.5 Cumulative Frequency Diagram

The cumulative frequency diagram shows the fraction of observations in the sample that
are equal to or smaller than a given value. To generate such a plot, proceed as follows:

1. Put the sample values in ascending order. As before, let $x_{\min}$ denote the smallest
value and $x_{\max}$ denote the largest value. Also let $x_{(i)}$, $i = 1, \ldots, N$, denote the i-th
observation such that $x_{(1)} = x_{\min}$ and $x_{(N)} = x_{\max}$.

2. For each selected value $x$ of the variable, count the number of observations that satisfy
the condition $x_{(i)} \le x$, i.e., that are equal to or smaller than the threshold $x$. Let
$m(x)$ be that number. Obviously, we have $0 \le m(x) \le N$. Note that $m(x)$ is a
non-decreasing function of $x$.

3. Plot $m(x)/N$ as a function of $x$. Clearly, $m(x)/N = 0$ if $x$ is selected smaller than
$x_{\min}$ and $m(x)/N = 1$ if $x$ is selected greater than $x_{\max}$. Therefore, the cumulative
frequency diagram starts from zero on the extreme left and gradually increases to a
value of 1 on the extreme right. At each observed value it makes a jump equal to the
frequency of occurrence of that observation in the sample. In between the observed
values, the cumulative frequency has a constant value. As a result, the diagram in
general appears as a series of steps of varying height and depth. The MATLAB command
cdfplot is used to generate the cumulative frequency diagram, which is also
known as the empirical cumulative distribution function, hence the name of the
command.
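
The three steps above can be sketched with basic commands, here for the rainy-day counts of Table 1. This is an illustrative alternative to cdfplot, not a description of how that command works internally; the function handle m and the 5-day padding at both ends of the plot are assumptions.

```matlab
% Number of rainy days per season from Table 1, 1960-61 through 2002-03
x = [68 52 63 55 76 54 81 54 93 63 67 60 84 83 72 ...
     47 41 81 56 64 54 80 100 70 77 71 65 57 71 49 ...
     59 59 83 58 100 65 63 119 80 82 69 75 71];

N  = numel(x);
xs = sort(x);                    % step 1: ordered sample x(1), ..., x(N)
m  = @(t) sum(xs <= t);          % step 2: number of observations not exceeding t
F  = arrayfun(m, xs) / N;        % step 3: cumulative frequency at each observation

% Staircase plot: zero to the left of xmin, one to the right of xmax
stairs([xs(1) - 5, xs, xs(end) + 5], [0, F, 1]);
xlabel('Number of rainy days');
ylabel('Cumulative frequency');
```

Repeated observations, such as the three seasons with 54 rainy days, produce a single jump of height 3/N.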

The cumulative frequency diagram provides useful information about the nature of the
distribution of a random quantity. In particular, it can be used to test the fitness of a theo-
retical distribution model, as we will learn later in this course. Unlike the histogram and
the normalized frequency diagram, it does not require selection of bins for a continuous
variable. Therefore, it can be used for samples with even fewer than 20 observations.

As examples, Figure 6 shows the cumulative frequency diagrams for the number of
seasonal rainy days in San Francisco, a discrete variable, and the amount of seasonal cu-
mulative rainfall, a continuous variable. These have been generated by use of the com-
mand cdfplot and the data in Table 1. Observe that both plots appear as staircases,
with the steps having varying depths and heights.

3.6 Quantiles

The value $x_q$ of a variable associated with a cumulative frequency $q$, $0 < q \le 1$, is known
as the $q$-quantile or $(q \times 100)$-percentile value of the variable. For example, $x_{0.10}$ is the
0.10-quantile or 10-percentile value of the variable; that is, $x_{0.10}$ is the value of the
variable for which the cumulative frequency equals 0.10. Thus, a quantile is the inverse
solution of the cumulative frequency. If $q$ happens to fall on a horizontal segment of the
cumulative frequency curve, then the midpoint of that segment is taken as the value of $x_q$.
Several quantiles are of particular interest when exploring a sample of data. The
0.50-quantile, i.e., $x_{0.50}$, is known as the median. It is also known as the second quartile, since
two quarters of the observations are equal to or smaller than this value. If the sample size
$N$ is an odd number, $x_{0.50}$ is the middle observation when the sample is put in ascending
order. If $N$ is even, $x_{0.50}$ equals the average of the two observations that fall in the
middle of the ordered set. The first quartile of the sample is $x_{0.25}$ and the third quartile is
$x_{0.75}$. Furthermore, the so-called interquartile range (IQR) is defined as the distance
between the first and third quartiles, i.e., $\mathrm{IQR} = x_{0.75} - x_{0.25}$. These quantities are used in the
box plot to be described below, as well as in developing the so-called Q-Q plots. Other
quantiles commonly used in statistics are $x_{0.05}$, $x_{0.10}$, $x_{0.90}$ and $x_{0.95}$.
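
The stepwise definition can be sketched as follows for the rainy-day counts of Table 1. The handles lo, hi and xq and the tolerance tol are hypothetical names introduced for this illustration; this is not the course function prctile_step, whose code is not shown in this supplement.

```matlab
% Number of rainy days per season from Table 1
x = [68 52 63 55 76 54 81 54 93 63 67 60 84 83 72 ...
     47 41 81 56 64 54 80 100 70 77 71 65 57 71 49 ...
     59 59 83 58 100 65 63 119 80 82 69 75 71];

xs  = sort(x);
N   = numel(xs);
tol = 1e-9;                                % guards against round-off in q*N

% If q*N is not an integer, both indices equal ceil(q*N): the step that crosses q.
% If q*N is an integer k, q lies on the flat segment between x(k) and x(k+1),
% and the midpoint of that segment is taken.
lo = @(q) max(ceil(q*N - tol), 1);
hi = @(q) min(floor(q*N + tol) + 1, N);
xq = @(q) (xs(lo(q)) + xs(hi(q))) / 2;

q   = [0.05 0.10 0.25 0.50 0.75 0.90 0.95];
val = xq(q);                               % quantiles at the listed q values
IQR = xq(0.75) - xq(0.25);                 % interquartile range
disp([q; val]);
```

For an even sample size the same handles return the average of the two middle observations at q = 0.50, in line with the definition of the median given above.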

For the San Francisco rainfall data in Table 1, the quartiles of the number of seasonal
rainy days are x0.25 = 58 , x0.50 = 68 and x0.75 = 80 . Note that x0.50 = 68 also represents the
median. The interquartile range is IQR= 22 . Other quantiles of interest are x0.05 = 49 ,
x0.10 = 54 , x0.90 = 84 and x0.95 = 100 . For the seasonal cumulative rainfall, the quartiles
are y0.25 = 15.39 in, y0.50 = 20.80 in and y0.75 = 25.09 in, the interquartile range is IQR = 9.70
in and the other quantiles of interest are y0.05 = 11.06 in, y0.10 = 13.86 in, y0.90 = 34.02 in
and y0.95 = 37.10 in. These have been computed by the MATLAB command
prctile_step, which you can download from the course website. The standard
MATLAB command prctile does the same, but it interpolates between successive
values of the variable rather than using the stepwise cumulative frequency function.

Figure 6. Cumulative frequency diagrams of the seasonal rainfall data in San Francisco:
(a) data for number of rainy days, (b) data for the cumulative rainfall amount.


3.7 Box Plot

The box plot, or the box-and-whiskers plot as it is sometimes called, is a summary
graphical representation of the data. It uses the quartiles $x_{0.25}$, $x_{0.50}$ and $x_{0.75}$ and the
extreme values $x_{\min}$ and $x_{\max}$ to construct the diagram. First, two rectangles are constructed
by drawing horizontal lines at $x_{0.25}$, $x_{0.50}$ and $x_{0.75}$. The vertical sides are drawn such that
the width of the rectangles is proportional to the size of the sample (relative to the width
of box plots of other data). Together, the two rectangles form a box that contains 50% of
the data. To represent the remaining data, vertical whiskers are extended from the top and
bottom of the box. These are usually extended to the extreme (minimum and maximum) data
points within a distance of $1.5 \times \mathrm{IQR}$ from the top and bottom of the box, where IQR stands
for the interquartile range, $\mathrm{IQR} = x_{0.75} - x_{0.25}$. Two short horizontal lines denote the ends
of the whiskers. Any data points beyond the whiskers are shown with a mark and are usually
considered as outliers. These are unexpectedly large or small magnitude observations and
possibly unrepresentative of the variability in the sample. One should carefully reexamine
these data points to make sure that errors have not occurred in measuring or reporting
them. In statistical analysis, the outliers are often excluded from consideration, if they are
believed to be unrepresentative of the variability in the sample or population. More
precise methods for determining outliers are available in advanced statistics texts. In
essence, the box plot provides a quick review of the location, spread, range and
symmetry/asymmetry of the data, as well as the presence of any outliers.
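
The whisker and outlier rule can be sketched numerically for the rainy-day counts of Table 1, using the quartiles quoted in Section 3.6. The variable names are assumptions, and the built-in boxplot command may compute the quartiles slightly differently (by interpolation), so its whisker limits need not match these values exactly.

```matlab
% Number of rainy days per season from Table 1
x = [68 52 63 55 76 54 81 54 93 63 67 60 84 83 72 ...
     47 41 81 56 64 54 80 100 70 77 71 65 57 71 49 ...
     59 59 83 58 100 65 63 119 80 82 69 75 71];

q1  = 58;                       % first quartile x0.25 (stepwise rule, Section 3.6)
q3  = 80;                       % third quartile x0.75
IQR = q3 - q1;                  % interquartile range

loLimit = q1 - 1.5*IQR;         % farthest a whisker may reach below the box
hiLimit = q3 + 1.5*IQR;         % farthest a whisker may reach above the box

inside   = x(x >= loLimit & x <= hiLimit);
whiskers = [min(inside), max(inside)];     % whisker ends are actual data points
outliers = x(x < loLimit | x > hiLimit);   % points plotted individually

fprintf('whiskers: %d to %d, outliers: %s\n', ...
        whiskers(1), whiskers(2), mat2str(outliers));
```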

Figure 7 shows the box plots of the data in Table 1 for both the number of rainy days
and the amount of rainfall in a season. These are generated by use of the MATLAB
command boxplot. It is seen that in each case one data item (marked + in the figure)
can be considered as an outlier. This is the data for the 1997-98 season, which has 119
rainy days and a cumulative rainfall amount of 47.22in. Evidently, these observations are
unexpectedly large in magnitude for a sample of this size and, hence, potentially not rep-
resentative of the expected variations in the number of rainy days and the amount of rain-
fall in San Francisco. However, a careful reexamination of the data reveals that no errors
were made in measuring or reporting these values. Whether or not they should be in-
cluded in statistical analysis of this data is a matter of judgment.


Figure 7. Box plots of the seasonal rainfall data in San Francisco: (a) data for number of
rainy days, (b) data for the cumulative rainfall amount.

3.8 Scatter Plot

Often data are observed in pairs and one wishes to examine the relationship between the
two variables. A scatter plot is a graphical means for this purpose. Let $(x_1, y_1), (x_2, y_2),
\ldots, (x_N, y_N)$ denote $N$ pairs of observations. To generate a scatter plot, we simply place
a mark for each pair of observations in a two-dimensional coordinate system with $x$ and
$y$ as its coordinates. The MATLAB command scatter can be used for this purpose.

As an example, Figure 8 shows the scatter diagram for the San Francisco rainfall data
showing the amount of rainfall versus the number of rainy days for each season. A clear
tendency for the amount of rainfall to increase with increasing number of rainy days is
seen. A numerical measure of this tendency is the correlation coefficient, which is intro-
duced in Section 4.4.

Figure 8. Scatter diagram for San Francisco seasonal rainfall data

Sometimes it is of interest to develop a scatter diagram with groupings of the data.
The MATLAB command gscatter is useful for this purpose. As an example, suppose
we are interested in the scatter diagram of the rainfall data with the various decades
grouped separately. To use the gscatter command of MATLAB, we introduce a vector
g of length 43 (same as the rainfall data) with its elements indicating the decades.
Thus, the first 10 elements of g are set to 60, elements 11 to 20 are set to 70,
and so on. The gscatter(x,y,g,'bgrcm','o*+.^') command then produces
the graph shown in Figure 9 with different colors and symbols used for the different
decades.

Figure 9. Scatter diagram for San Francisco seasonal rainfall data

3.9 Q-Q Plot

Quantiles, $x_q$, were defined in Section 3.6 as values of the variable $x$ associated with
specified values of the frequency of non-exceedance, $q$. Another way to explore the
relationship between pairs of data sets is to plot the quantiles of one data set against the
corresponding quantiles of the other data set. That is, plot $y_q$ versus $x_q$ for the same values
of $q$. This is known as a quantile-quantile plot, or Q-Q plot for short. As we will see later, if
the two data sets come from similar distributions, the Q-Q plot will nearly coincide with a
straight line. Any deviation from a straight line indicates that the distributions of the two
variables are dissimilar. This method can also be used to check the distribution of a data
set against a theoretical distribution model. This topic will be addressed later in this
course.
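
For two samples of equal size, the quantiles at common values of $q$ are simply the ordered observations, so a Q-Q plot can be sketched by sorting both samples and plotting one against the other. The data below are hypothetical numbers invented for illustration (they are not course data): y is a rescaled and shuffled copy of x, so the two samples have the same distributional shape and the points fall on a straight line.

```matlab
% Hypothetical samples of equal size (illustration only)
x = [4.1 2.7 5.9 3.3 7.2 4.8 6.4 3.9];
y = 2*x([5 1 8 3 2 7 4 6]) + 1;      % same values, rescaled and reordered

xs = sort(x);                        % ordered x sample
ys = sort(y);                        % ordered y sample, paired by rank

plot(xs, ys, 'o');
xlabel('x quantiles');
ylabel('y quantiles');

p = polyfit(xs, ys, 1);              % straight-line fit through the Q-Q points
fprintf('slope = %.3f, intercept = %.3f\n', p(1), p(2));
```

The fitted slope and intercept recover the scale (2) and shift (1) relating the two samples. With real data the points scatter about the line, and systematic curvature indicates dissimilar distributions.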

Shown in Figure 10 is the Q-Q plot of the SF rainfall data generated by using the
MATLAB command qqplot. This command also draws a straight line connecting the
first and third quartile points of the data, that is, it connects the points $(x_{0.25}, y_{0.25})$ and
$(x_{0.75}, y_{0.75})$. It can be seen that the data points line up on a nearly, but not perfectly,
straight line. So, the seasonal number of rainy days and the amount of seasonal rainfall in
San Francisco come from nearly, but not perfectly, similar distributions.

Figure 10. Q-Q plot of cumulative rainfall versus number of rainy days

Figure 11. Q-Q plot of cumulative rainfall versus normal distribution

If the qqplot command is used with one data set alone, MATLAB generates a Q-Q
plot of the data set versus the corresponding quantiles of a normal distribution (to be de-
scribed later). If the data points on this plot fall on a nearly straight line, then it is reason-
able to assume that the data is coming from a nearly normal distribution. Figure 11 shows
the Q-Q plot of the rainfall data against the normal distribution, which is generated by
issuing the command qqplot(y). It is seen that the cumulative rainfall data is nearly
normal in the middle region, but far from normal in the tail regions. More precise meth-
ods for checking the fitness of a theoretical distribution model to data will be described
later in this course.

4 NUMERICAL DESCRIPTORS OF DATA

Several numerical descriptors of data have already been introduced in the foregoing.
These included the extreme values $x_{\min}$ and $x_{\max}$, the range $r = x_{\max} - x_{\min}$, the median
$x_{0.50}$, the first and third quartiles $x_{0.25}$ and $x_{0.75}$ (with the median being the second
quartile), the interquartile range $\mathrm{IQR} = x_{0.75} - x_{0.25}$, and other quantiles $x_q$ for various values
of $q$. In this section we introduce several other important numerical descriptors of data.

4.1 Measure of Central Tendency

As we have already seen, the median provides a central measure of the data, as it is the
middle value of the data when it is sorted in ascending order. The sample mean provides
another central measure of the data. For a sample $x_1, x_2, \ldots, x_N$, the sample mean,
denoted $\bar{x}$, is the arithmetic average of the sample values defined as

\[ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{1} \]

One can show that with increasing sample size $\bar{x}$ tends to coincide with the centroidal
coordinate of the normalized frequency diagram. For a given sample, the sample mean
and the median generally are different, though both describe central measures of the data.
Whereas the median is not affected by the extreme values of the data (as it is simply the
middle value when the data is sorted), the mean can be affected by the extreme values, as
can be readily seen in (1).

MATLAB commands median and mean can be used to compute these measures
for any given data. For the San Francisco rainfall data, using these commands we find
that the median number of rainy days in a season is $x_{0.50} = 68$, whereas the mean is
$\bar{x} = 69.6$. Note that the mean value can be non-integer, even if the individual
observations are integer valued. For the amount of seasonal cumulative rainfall, we find the
median value to be $y_{0.50} = 20.80$ in., whereas the mean is $\bar{y} = 21.85$ in. Observe that these
median and mean values fall in the central regions of the respective histograms and fre-
quency diagrams shown in Figures 1 and 4.
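
The different sensitivity of the two measures to extreme values can be illustrated with the rainy-day counts of Table 1. In the sketch below, the perturbed sample x2, in which the largest observation is raised from 119 to a hypothetical 219, is an assumption made purely for illustration.

```matlab
% Number of rainy days per season from Table 1
x = [68 52 63 55 76 54 81 54 93 63 67 60 84 83 72 ...
     47 41 81 56 64 54 80 100 70 77 71 65 57 71 49 ...
     59 59 83 58 100 65 63 119 80 82 69 75 71];

xbar = sum(x) / numel(x);        % sample mean, equation (1); same as mean(x)
xmed = median(x);                % middle value of the ordered sample

% Hypothetical perturbation: make the most extreme season even more extreme
x2 = x;
x2(x2 == 119) = 219;
xbar2 = mean(x2);                % the mean shifts upward
xmed2 = median(x2);              % the median is unchanged

fprintf('mean %.1f -> %.1f, median %g -> %g\n', xbar, xbar2, xmed, xmed2);
```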

4.2 Measures of Dispersion

The dispersion or spread in the data is an indication of variability in the population from
which the data is derived. The example histograms in Figure 2 showed varying degrees of
dispersion. Here, we wish to define numerical measures to describe the magnitude of dis-
persion. Several of the previously introduced numerical descriptors provide measures of
dispersion. For example, the range r is one such measure, as it describes the spread be-
tween the largest and smallest observations. However, this measure is not stable, as it
tends to increase with increasing sample size. Another measure is the interquartile range,
IQR. This measure is more stable and is used in some applications, including box plots,
as we have already seen. More generally, the spread between any pair of quantiles $x_q$ and
$x_{1-q}$ for a small $q$ (say $q = 0.05$ or 0.25) can be used to describe a measure of dispersion.
In fact, such a measure is used to describe interval estimates of distribution parameters, as
we will see later in this course.

One way to measure dispersion is to consider the absolute deviations of the data items
from the sample mean, i.e., $|x_i - \bar{x}|$, and take some sort of average over the sample
values. Since the sample mean is a central value of the sample, these deviations measure the
spread of the data away from the center. Taking the absolute value keeps positive and
negative deviations from canceling when they are added. The mean absolute deviation,
denoted $d$, is the arithmetic average of these absolute deviations, i.e.,

\[ d = \frac{1}{N} \sum_{i=1}^{N} |x_i - \bar{x}| \tag{2} \]

The MATLAB command mad can be used to compute this value. For the San Francisco
rainfall data, the mean absolute deviation for the number of rainy days is 11.99 and for the
seasonal cumulative rainfall it is 6.17 in.
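
A minimal Python sketch of the mean absolute deviation in (2) is given below for readers who wish to see the computation spelled out. The function name mean_absolute_deviation and the data values are hypothetical and are not taken from the rainfall data set.

```python
def mean_absolute_deviation(data):
    """Mean absolute deviation d of Eq. (2): average of |x_i - mean|."""
    n = len(data)
    mean = sum(data) / n
    return sum(abs(x - mean) for x in data) / n


# Hypothetical sample (illustrative values only); its mean is 40/8 = 5
sample = [2, 4, 4, 4, 5, 5, 7, 9]
d = mean_absolute_deviation(sample)   # (3+1+1+1+0+0+2+4)/8 = 1.5
print(d)
```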

A more commonly used measure of dispersion is the sample standard deviation, denoted
$s$. This is obtained as the square root of the average of the squares of the deviations
defined above, i.e.,

$$ s = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2} \tag{3} $$

For reasons to be described later in this course, the $N$ in the above expression is often
replaced by $N - 1$ (to make $s^2$ an unbiased estimator of the population variance), i.e., $s$
is defined as

$$ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2} \tag{4} $$

Furthermore, by substituting for the sample mean from (1) and expanding the expression,
one can show that (4) is equivalent to

$$ s = \sqrt{\frac{1}{N-1} \left( \sum_{i=1}^{N} x_i^2 - N \bar{x}^2 \right)} \tag{5} $$

This expression is more convenient for computing, as it does not require calculating the
deviations of the individual data items from the sample mean. The square of the sample
standard deviation, $s^2$, is known as the sample variance. Furthermore, the ratio of the
sample standard deviation to the absolute value of the sample mean, i.e.,

$$ \delta = \frac{s}{|\bar{x}|} \tag{6} $$

is known as the sample coefficient of variation or sample c.o.v. in short. The latter is a
dimensionless measure of dispersion, as it describes the standard deviation in units of the
mean. This definition is meaningful only if the sample mean is not zero or near zero. Fur-
thermore, if the sample mean has a negative value, its absolute value should be used in
(6).
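
The definitions of the sample standard deviation and the sample c.o.v. given above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names std_dev, std_dev_shortcut and coefficient_of_variation, the keyword unbiased, and the data values are all hypothetical, and the c.o.v. is computed here with the N - 1 form of s.

```python
import math


def std_dev(data, unbiased=True):
    """Sample standard deviation: Eq. (4) if unbiased, otherwise Eq. (3)."""
    n = len(data)
    mean = sum(data) / n
    sum_sq = sum((x - mean) ** 2 for x in data)
    return math.sqrt(sum_sq / (n - 1 if unbiased else n))


def std_dev_shortcut(data):
    """Eq. (5): same value as Eq. (4), without forming individual deviations."""
    n = len(data)
    mean = sum(data) / n
    return math.sqrt((sum(x * x for x in data) - n * mean ** 2) / (n - 1))


def coefficient_of_variation(data):
    """Sample c.o.v.: s divided by |mean|; undefined for a zero mean."""
    mean = sum(data) / len(data)
    if mean == 0:
        raise ValueError("c.o.v. is not meaningful when the sample mean is zero")
    return std_dev(data) / abs(mean)


# Hypothetical sample (illustrative values only); its mean is 5
sample = [2, 4, 4, 4, 5, 5, 7, 9]
s_n = std_dev(sample, unbiased=False)   # Eq. (3): sqrt(32/8) = 2.0
s_n1 = std_dev(sample)                  # Eq. (4): sqrt(32/7)
s_short = std_dev_shortcut(sample)      # Eq. (5): sqrt((232 - 8*25)/7)
cov = coefficient_of_variation(sample)  # sqrt(32/7) / 5
print(s_n, s_n1, s_short, cov)
```

In floating-point arithmetic the shortcut form subtracts two nearly equal numbers when the mean is large compared with the spread, so the form based on deviations is usually the safer choice in software even though the shortcut is convenient for hand calculation.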

When comparing the dispersions of two samples, the standard deviations can be
compared if the two samples represent the same quantity in the same units. More generally,
the dispersion in two samples can be compared in terms of the c.o.v., even if the samples
do not represent the same quantity, as long as the sample means are not zero or near zero.

The MATLAB commands std and var can be used to compute the sample standard
deviation and the sample variance, respectively. The standard versions of these commands
use the definition in (4) or (5). However, by specifying a flag (e.g., std(x,1)), one can
compute $s$ according to the definition in (3). For the San Francisco rainfall data, using the
standard version of the command std, the sample standard deviation for the number of rainy
days is determined to be $s_x = 15.6$ and that of the seasonal cumulative rainfall is
determined to be $s_y = 8.05$ in. These are substantially larger than the mean absolute
deviations computed above. The reason is that the standard deviation is more affected by the
extreme values of the data due to the squaring of the deviations. Dividing by the
corresponding sample mean values, the c.o.v. of the number of rainy days is found to be
$\delta_x = 0.224$, whereas that of the seasonal cumulative rainfall is found to be
$\delta_y = 0.368$. It is seen that there is substantially more variability in the amount of
rainfall than in the number of rainy days. Note that we cannot compare the variabilities in
the two data sets in terms of their standard deviations, because the two quantities are
measured in different units.
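
As a quick check of the arithmetic, and writing $\delta_y$ for the c.o.v. of the seasonal cumulative rainfall, the value quoted above follows from the sample standard deviation of 8.05 in. and the sample mean of 21.85 in. reported earlier:

```latex
\delta_y = \frac{s_y}{|\bar{y}|} = \frac{8.05\ \text{in.}}{21.85\ \text{in.}} \approx 0.368
```

The units cancel in the ratio, which is why the c.o.v. of the rainfall amount can be compared directly with that of the number of rainy days.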

4.3 Measure of Asymmetry

To measure the asymmetry of the data relative to its central value, the sample coefficient
of skewness is used. Denoted $\gamma$, this dimensionless measure is defined as

$$ \gamma = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^3}{N s^3} \tag{7} $$

Since the cubed deviations from the mean, $(x_i - \bar{x})^3$, assume positive or negative values
depending on whether $x_i$ is greater or smaller than $\bar{x}$, the sum in the numerator tends to
zero if the data is symmetrically distributed on the two sides of the mean. On the other
hand, if the data is asymmetrically distributed, the sum over the cubed deviations does
not cancel out and assumes a positive or negative value, depending on whether the
distribution has a longer tail to the right side or the left side of the mean (see Figure 3). The
MATLAB command skewness can be used to compute this coefficient. Using this
command for the San Francisco rainfall data, the skewness coefficient for the number of
rainy days is determined to be $\gamma_x = 0.845$ and that for the seasonal cumulative rainfall is
determined to be $\gamma_y = 0.915$. Thus, both data sets are positively skewed, which means that
their distributions have a longer tail towards the right, i.e., towards larger values. This
tendency is clearly evident in the histograms plotted in Figures 1 and 4, and to a lesser
extent in the box plots shown in Figure 7.
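
The sign behavior of the skewness coefficient in (7) can be illustrated with a short Python sketch. The function name skewness_coefficient, the keyword use_n_minus_1 and the three small data sets are hypothetical. By default the sketch takes s from (3), i.e., with N in the denominator; this choice is an assumption, since (7) can be evaluated with either definition of s.

```python
import math


def skewness_coefficient(data, use_n_minus_1=False):
    """Sample coefficient of skewness, Eq. (7): sum of cubed deviations / (N s^3).

    Assumption: s is taken from Eq. (3) unless use_n_minus_1 is True,
    in which case s is taken from Eq. (4).
    """
    n = len(data)
    mean = sum(data) / n
    sum_sq = sum((x - mean) ** 2 for x in data)
    sum_cu = sum((x - mean) ** 3 for x in data)
    s = math.sqrt(sum_sq / (n - 1 if use_n_minus_1 else n))
    return sum_cu / (n * s ** 3)


# Hypothetical samples (illustrative values only)
right_tailed = [2, 4, 4, 4, 5, 5, 7, 9]   # longer tail towards larger values
left_tailed = [1, 3, 5, 5, 6, 6, 6, 8]    # mirror image of the sample above
symmetric = [1, 2, 3, 4, 5]

g_right = skewness_coefficient(right_tailed)  # 42 / (8 * 2^3) = 0.65625
g_left = skewness_coefficient(left_tailed)    # -42 / (8 * 2^3) = -0.65625
g_sym = skewness_coefficient(symmetric)       # cubed deviations cancel: 0.0
print(g_right, g_left, g_sym)
```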

4.4 Measure of Linear Dependence Between Two Data Samples

The numerical descriptors introduced so far are all for a single data set. For data in pairs,
as in $(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)$, one is often interested in exploring the possibility of
a relationship between the pairs of observations. The simplest measure of this kind is the
sample covariance. Denoted $s_{XY}$, this measure is defined as

$$ s_{XY} = \frac{1}{N-1} \left( \sum_{i=1}^{N} x_i y_i - N \bar{x} \bar{y} \right) \tag{8} $$

A more convenient measure is obtained by normalizing the covariance by the sample
standard deviations of the two variables. The resulting dimensionless quantity, denoted
$r_{XY}$, is the sample correlation coefficient defined by

$$ r_{XY} = \frac{1}{N-1} \, \frac{\sum_{i=1}^{N} x_i y_i - N \bar{x} \bar{y}}{s_x s_y} \tag{9} $$

One can show that the correlation coefficient is always bounded between $-1$ and $+1$.
Furthermore, the covariance and the correlation coefficient provide measures of linear
relation between the two data sets. Thus, if in a scatter diagram the pairs of data lie close to a
straight line, the sample correlation coefficient will be near $-1$ if the line has a negative
slope and near $+1$ if the line has a positive slope. If the data in the scatter diagram are
spread without any linear tendency, then $r_{XY}$ will be near zero. Shown in Figure 12 are
example scatter diagrams with corresponding correlation coefficients. In case (a), the
variables have a strong linear dependence with a positive slope and the correlation
coefficient is near $+1$. In case (b), the variables have strong linear dependence, but with a
negative slope. The correlation coefficient in this case is near $-1$. In case (c), no dependence
between the two variables is evident and the correlation coefficient is near zero. In case
(d), the variables are clearly strongly related; however, the correlation coefficient is near
zero. This is because the relationship between the two variables in this case has no linear
component.

Figure 12. Scatter diagrams with corresponding sample correlation coefficients:
(a) $r_{xy} = 0.952$; (b) $r_{xy} = -0.907$; (c) $r_{xy} = 0.069$; (d) $r_{xy} = 0.003$

The MATLAB commands cov and corrcoef can be used to compute the sample
covariance and correlation coefficient, respectively. For the San Francisco rainfall data,
using corrcoef, $r_{XY}$ is determined to be 0.871. This value, being close to $+1$, indicates
a strong linear tendency between the number of rainy days in the season and the
seasonal cumulative rainfall. This implies that the larger the number of rainy days, the
greater the amount of rainfall is likely to be. This kind of tendency is clearly evident in
the scatter diagram shown in Figure 8.
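
The sample covariance in (8) and the sample correlation coefficient in (9) can be sketched in Python as shown below. The function names sample_covariance, sample_std and sample_correlation and the small data sets are hypothetical and are chosen only to reproduce the three limiting situations discussed above: an exact line with positive slope, an exact line with negative slope, and a strong but purely nonlinear relationship.

```python
import math


def sample_covariance(x, y):
    """Sample covariance s_XY, Eq. (8)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return (sum(a * b for a, b in zip(x, y)) - n * mean_x * mean_y) / (n - 1)


def sample_std(v):
    """Sample standard deviation with N - 1 in the denominator, Eq. (4)."""
    n = len(v)
    mean = sum(v) / n
    return math.sqrt(sum((a - mean) ** 2 for a in v) / (n - 1))


def sample_correlation(x, y):
    """Sample correlation coefficient r_XY, Eq. (9)."""
    return sample_covariance(x, y) / (sample_std(x) * sample_std(y))


# Hypothetical paired data (illustrative values only)
x = [-2, -1, 0, 1, 2]
y_up = [2 * a + 1 for a in x]      # exact line with positive slope
y_down = [-3 * a + 4 for a in x]   # exact line with negative slope
y_curve = [a ** 2 for a in x]      # strong relation with no linear component

r_up = sample_correlation(x, y_up)        # close to +1
r_down = sample_correlation(x, y_down)    # close to -1
r_curve = sample_correlation(x, y_curve)  # zero, although y is a function of x
print(r_up, r_down, r_curve)
```

The last case mirrors panel (d) of Figure 12: a correlation coefficient near zero rules out a linear tendency but does not by itself imply that the two variables are unrelated.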
