
AACE INTERNATIONAL STATISTICS & PROBABILITY

Chapter 29

Statistics & Probability

Dr. Elizabeth Y. Chen and Mark T. Chen, PE CCE

INTRODUCTION

Statistics is the field of study where data are collected for the purpose of drawing conclusions and making inferences. Descriptive statistics is the summarization and description of data, and inferential statistics is the estimation, prediction, and/or generalization about the population based on the data from a sample.

Four elements are essential to inferential statistical problems:

1. Population is the collection of all elements of interest to the decision-maker. The size of the population is usually denoted by N. Very often, the population is so large that a complete census is out of the question. Sometimes, not even a small population can be examined entirely because it may be destructive or prohibitively expensive to obtain the data. Under these situations, we draw inferences based upon a part of the population (called a sample).
2. Sample is a subset of data randomly selected from a population. The size of a sample is usually denoted by n.
3. Statistical inference is an estimation, prediction or generalization about the population based on the information from the sample.
4. Reliability is the measurement of the “goodness” of the inference.

Only the first two elements will be discussed in this chapter. Numerical characteristics of a population are called parameters of the population. The corresponding numerical characteristics calculated from a sample are called sample statistics.

LEARNING OBJECTIVES

After completing this chapter, readers should be able to

• understand basic definitions and terminologies in probability and statistics, and
• apply statistical techniques in decision making.

DESCRIBING DATA

In general, data can be classified as either qualitative or quantitative.

Qualitative Data
Qualitative data can be categorized or summarized.

Example:
As of September 2, 2003, the total membership of AACE International is 4,307. This can be classified according to member types:

members and associates    4,036
students                    147
honorary                    124
total                     4,307

Or according to geographical distribution:

U.S.         3,509
Canada         480
Caribbean       28
Asia           158
Africa          29
Europe          48
Australia       55
total        4,307

Quantitative Data
The description of quantitative data is more complex. It can be described graphically or numerically.

Some graphic methods for describing quantitative data include the following:

• frequency distribution and relative frequency (f/n),
• stem and leaf plots, and
• histogram.


Numerical methods for describing quantitative data include the following:

• measures of location (central tendency),
  -mean (average),
  -median, and
  -mode;
• measures of dispersion,
  -range,
  -variance, and
  -standard deviation;
• relative standing,
  -percentile, and
  -Z-score.

Example:
Many companies invest in training their employees. The following average training hours for every employee are selected from “The 100 Best Companies to Work For” (Fortune, January 20, 2003). Numbers are rounded to the nearest 5 for convenience.

145   50   35   50   25  160   30   40   40   20
 40   20   30   95   40   50   30   50   20   35
 70   35   40   45   30   45   70   45   40  140
 50   70   20   30   40   40   30   60   40   50
 60   55   50   35   40   30   45   75   20   40

The data will first be divided into smaller equal intervals (classes). The number of observations that fall into each class, the frequency, is then counted. These classes should not overlap and there should be enough classes to include all the data. There may be open-ended intervals when the first class contains no lower limit or the last data class contains no upper limit. The number of classes depends on the number of observations in the data set. In practice, a frequency distribution usually has from five to twenty classes.

The stem and leaf plot is developed by first determining the stem and then adding leaves. In this example, the stem is formed by the “tens” digit and the leaves are the “ones” digit. Note that the stem values are placed to the left of the vertical line and the leaves on the right.

Example: 45 is shown as:

Stem | Leaf
  4  | 5

Stem | Leaf                                          | f (frequency) | f/n (relative frequency)
  2  | 5, 0, 0, 0, 0, 0                              |       6       |   6/50
  3  | 5, 0, 0, 0, 5, 5, 0, 0, 0, 5, 0               |      11       |  11/50
  4  | 0, 0, 0, 0, 0, 5, 5, 5, 0, 0, 0, 0, 0, 5, 0   |      15       |  15/50
  5  | 0, 0, 0, 0, 0, 0, 5, 0                        |       8       |   8/50
  6  | 0, 0                                          |       2       |   2/50
  7  | 0, 0, 0, 5                                    |       4       |   4/50
  8  |                                               |       0       |   0
  9+ | 95, 145, 160, 140                             |       4       |   4/50
     |                                               |      50       |  50/50

Figure 29.1—Stem, Leaf, Frequency and Relative Frequency Distributions

[Histogram: frequency (vertical axis) against training hours (horizontal axis, classes 20, 30, 40, 50, 60, 70, 80, 90+)]

Figure 29.2—Histogram of Training Hours
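The counting procedure behind Figure 29.1 can be sketched in a few lines of Python. This is only an illustrative sketch: the names `TRAINING_HOURS`, `class_label`, and `frequency_table` are hypothetical, and the choice to lump every value of 90 or more into a single open-ended "9+" class simply mirrors the figure.

```python
from collections import Counter

# Average training hours per employee for the 50 companies in the example.
TRAINING_HOURS = [
    145, 50, 35, 50, 25, 160, 30, 40, 40, 20,
    40, 20, 30, 95, 40, 50, 30, 50, 20, 35,
    70, 35, 40, 45, 30, 45, 70, 45, 40, 140,
    50, 70, 20, 30, 40, 40, 30, 60, 40, 50,
    60, 55, 50, 35, 40, 30, 45, 75, 20, 40,
]


def class_label(hours):
    """Return the stem (tens digit) as a label; 90 and above share the open class '9+'."""
    return "9+" if hours >= 90 else str(hours // 10)


def frequency_table(data):
    """Map each class label to (frequency f, relative frequency f/n)."""
    # Assumption: classes run from stem 2 to stem 8 plus the open-ended 9+ class,
    # as in Figure 29.1; classes with no observations are kept with f = 0.
    labels = [str(stem) for stem in range(2, 9)] + ["9+"]
    counts = Counter(class_label(x) for x in data)
    n = len(data)
    return {label: (counts[label], counts[label] / n) for label in labels}


table = frequency_table(TRAINING_HOURS)
for label, (f, rel) in table.items():
    print(f"{label:>2} | f = {f:2d} | f/n = {rel:.2f}")
```

The frequencies should add up to the number of observations (50), which is a quick check that no value was dropped or counted twice.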


GLOSSARY TERMS IN THIS CHAPTER
frequency distribution ◆ standard deviation ◆ statistics

The sum of the training hours from the above 50 companies is

145 + 50 + 35 + 50 + … + 20 + 40 = 2,445 hours

Mean (average): Mean is the sum of measurements divided by the number of measurements.
Population mean is denoted by µ = sum of all numbers in population/N
Sample mean is denoted by x̄ = sum of all numbers in sample/n
The mean of this example is 2,445/50 = 48.9 hours.

Median: Median is the middle number when the data observations are arranged in ascending or descending order.
If the number n of measurements is even, the median is the average of the two middle measurements in the ranking.
The median of this example is 40 hours.

For a symmetric data set, the mean equals the median.
If the median is less than the mean, the data set is skewed to the right.
If the median is greater than the mean, the data set is skewed to the left.

Figure 29.3—Symmetry

Figure 29.4—Rightward Skewness

Figure 29.5—Leftward Skewness

Mode: Mode is the measurement that occurs most often in the data set.
The mode of this example is 40 hours.
If the observations have two modes, the data set is said to have a bimodal distribution.
When the data set is multi-modal, the mode(s) is no longer a viable measure of the central tendency.
In a large data set, the modal class is the class containing the largest frequency. The simplest way to define the mode will then be the midpoint of the modal class.

Comparison of the Mean, Median, and Mode

The mean is the most commonly used measure of central location. However, it is affected by extreme values. For example, the high incomes of a few employees will influence the mean income of a small company. In such a situation, the median may be a better measure of central tendency.

The median is of most value in describing large data sets. It is often used in reporting salaries, ages, sale prices, and test scores.

The mode is frequently applied in marketing. Examples include the modal men’s shirt neck size and sleeve length, shoe size, etc.

Numerical Measures of Variability

Measures of central tendency do not describe the spread of the data set, which may be of greater interest to the decision-maker.

The simplest measure of the variability of a data set is its range.

Range: The difference between the largest and the smallest values of the data set.
The range of this example is 160 - 20 = 140 hours.

The range only uses the two extreme values and ignores the rest of the data set. One instinctive attempt to measure the dispersion would be to find the deviation of each value from the mean and then calculate the average of these deviations. One will find that this value is always zero, an answer which is no accident.
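The measures of location and the range for the training-hours example can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are hypothetical, and the last line illustrates the remark that deviations from the mean always average to zero (up to floating-point rounding).

```python
from statistics import mean, median, mode

# Average training hours per employee for the 50 companies in the example.
TRAINING_HOURS = [
    145, 50, 35, 50, 25, 160, 30, 40, 40, 20,
    40, 20, 30, 95, 40, 50, 30, 50, 20, 35,
    70, 35, 40, 45, 30, 45, 70, 45, 40, 140,
    50, 70, 20, 30, 40, 40, 30, 60, 40, 50,
    60, 55, 50, 35, 40, 30, 45, 75, 20, 40,
]

n = len(TRAINING_HOURS)
total = sum(TRAINING_HOURS)            # 2,445 hours
avg = mean(TRAINING_HOURS)             # sum of measurements / number of measurements
mid = median(TRAINING_HOURS)           # n is even: average of the two middle values
most_common = mode(TRAINING_HOURS)     # value that occurs most often
data_range = max(TRAINING_HOURS) - min(TRAINING_HOURS)

# Positive and negative deviations from the mean cancel each other out.
mean_deviation = sum(x - avg for x in TRAINING_HOURS) / n

print(total, avg, mid, most_common, data_range, mean_deviation)
```

Here the median (40) is smaller than the mean (48.9), which is consistent with a data set skewed to the right by the few very large values.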


An alternative to averaging the raw deviations might be to calculate the average absolute deviation. However, this measure is rarely used because it is difficult to handle algebraically and does not have the nice mathematical properties possessed by the variance.

Variance: The average of the squared deviations from the mean.

The population variance is denoted by

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{\sum x^2 - N\mu^2}{N}$$

The sample variance is denoted by

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum x^2 - n\bar{x}^2}{n - 1}$$

Note the divisor (n - 1) is used instead of the more obvious n. This will make the sample variance, s², a better estimate of the population variance, σ². The explanation for this practice is beyond the scope of this text but can be found in many statistics textbooks.

The variance of this example is:

$$s^2 = \frac{162{,}825 - 50(48.9)^2}{50 - 1} = 882.95 \text{ hours}^2$$

The variance has a squared unit and is in a much larger scale than that of the original data. To offset these, the square root is used.

Standard Deviation: The positive square root of the variance.
The population standard deviation is denoted by σ.
The sample standard deviation is denoted by s.

The sample standard deviation s of this example is

√882.95 = 29.71 hours

The standard deviation can be approximated by range/6. Some may prefer to use range/4.

Measures of Relative Standing
Another measure of interest describes the relative location of a particular observation within a data set.

pth percentile: In any data set, the pth percentile is the number with exactly p percent of the measurements falling below it and (100-p) percent falling above it when the data are arranged in ascending or descending order.

The first (lower) quartile: the 25th percentile
The third (upper) quartile: the 75th percentile

And you guessed it; the second (middle) quartile is the median.

For the above training hours example, the 80th percentile is 60 hours.

Another measure of relative standing is the famous z-score.

A z-score is the number of standard deviations a point is above or below the mean of a set of data.

The population z-score for a measurement x is z = (x - µ)/σ
The sample z-score for a measurement x is z = (x - x̄)/s

RANDOM VARIABLES AND SOME IMPORTANT PROBABILITY DISTRIBUTIONS

Analyzing frequency distribution for every decision-making situation would be very time consuming. Fortunately, many physical events that appear to be unrelated have the same underlying characteristics and can be explained by the same laws of probability. The mathematical model used to represent frequency distributions is called a probability distribution. To understand the concept and know which distribution to use in a particular situation will save considerable time and effort in the decision-making process. We will start from the concept of a random variable.

Random Variable: A random variable is a variable whose numerical value is determined by the outcome of a random experiment. A random experiment is the type of experiment that may produce different results in spite of all efforts to keep the conditions of performance constant.

If a random variable can take on only a countable number of values, then we call it a discrete random variable. For example, the number of sales made by a salesperson in a given day.

Random variables that can assume any value within some interval or intervals are called continuous random variables. For example, the length of time an employee is late for work.

Probability Distribution: A probability distribution is expressed in a table listing all possible values that a random variable can take on together with the associated probabilities.

Notice that the random variable itself will be denoted by X, while the small x denotes a particular value of X. The symbol p(x) = Pr(X = x) means the probability that the experiment yields the value x.
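As a numerical check on the dispersion and relative-standing measures described for the training-hours example, the sketch below recomputes the sample variance (both by the shortcut formula and with the standard library), the sample standard deviation, a sample z-score, and the share of observations below 60 hours. The helper name `z_score` and the variable names are hypothetical.

```python
from statistics import mean, stdev, variance

# Average training hours per employee for the 50 companies in the example.
TRAINING_HOURS = [
    145, 50, 35, 50, 25, 160, 30, 40, 40, 20,
    40, 20, 30, 95, 40, 50, 30, 50, 20, 35,
    70, 35, 40, 45, 30, 45, 70, 45, 40, 140,
    50, 70, 20, 30, 40, 40, 30, 60, 40, 50,
    60, 55, 50, 35, 40, 30, 45, 75, 20, 40,
]

n = len(TRAINING_HOURS)
x_bar = mean(TRAINING_HOURS)

# Shortcut form of the sample variance: (sum of x^2 - n * x_bar^2) / (n - 1).
sum_of_squares = sum(x * x for x in TRAINING_HOURS)
s2_shortcut = (sum_of_squares - n * x_bar ** 2) / (n - 1)

# statistics.variance and statistics.stdev both use the n - 1 divisor.
s2 = variance(TRAINING_HOURS)
s = stdev(TRAINING_HOURS)


def z_score(x):
    """Number of sample standard deviations that x lies above (+) or below (-) the mean."""
    return (x - x_bar) / s


# Share of measurements falling below 60 hours, the quoted 80th percentile.
share_below_60 = sum(1 for x in TRAINING_HOURS if x < 60) / n

print(sum_of_squares, round(s2, 2), round(s, 2), round(z_score(160), 2), share_below_60)
```

The two variance calculations agree, which confirms that the shortcut formula needs the sample mean of 48.9 hours.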


In a discrete probability distribution, p(x) ≥ 0 for all values of x and ∑p(x) = 1.

Example: Two coins are tossed. Let X be the number of heads that appear.

x       0      1      2
p(x)    1/4    2/4    1/4

The probabilities p(x) can be interpreted as long-run relative frequencies. For instance, if two coins were flipped many, many times, we could anticipate obtaining two tails (X=0) about one-fourth of the time, one head and one tail (X=1) one half of the time and two heads one-fourth of the time. Therefore, the probability distribution for a random variable is a theoretical model for the relative frequency distribution of a population.

Like the frequency distribution, the mean and standard deviation of a probability distribution need to be calculated to describe the central location and spread of the probability distribution.

The predicted long-range average of a discrete random variable X, often called the expected value (or mean) of X, is defined by

$$\mu = E(X) = \sum x\,p(x)$$

The population variance is defined as

$$\sigma^2 = E\left[(X - \mu)^2\right] = \sum (x - \mu)^2 p(x)$$

The standard deviation σ is the square root of the variance.

In the above two-coin example, the expected value of X is

µ = ∑xp(x) = 0·p(0) + 1·p(1) + 2·p(2) = 0(1/4) + 1(1/2) + 2(1/4) = 1

It is very important to remember that the expected value is not a number we “expect” to get at a given experiment. What this expected value tells is that if we were to toss two fair coins many, many times, carefully record the number of heads that appeared in each toss, and at the end calculate the average number of heads, then this average would be 1.

The variance and the standard deviation of X are, respectively,

σ² = (0 - 1)²(1/4) + (1 - 1)²(1/2) + (2 - 1)²(1/4) = 1/2
σ = √(1/2) ≈ 0.707

Another example of expected value is the following:

Suppose you work for an insurance company and sell an individual 10-year $100,000 term life insurance coverage at an annual premium of $240. Actuarial tables show that the probability of death during the next year for a person of your customer’s age, sex, health, etc., is .001. What is the expected gain to the company for a policy of this type?

Probability distribution of X

Gain (X)           Event             Probability
$240               customer lives    .999
$240 – 100,000     customer dies     .001

If the customer lives, the company keeps the $240 premium. If the customer dies, the company must pay $100,000 and will have a net “gain” of $(240 - 100,000). The expected gain is therefore

µ = E(X) = (240)(.999) + (240 - 100,000)(.001) = 240(.999 + .001) - 100,000(.001) = $140

Please note that for each policy sold, the insurance company is taking a risk of either gaining $240 or losing $99,760. However, if the company were to sell a very large number of such insurance policies to customers possessing the characteristics described above, the company would on the average net $140 per policy written.

DISCRETE RANDOM VARIABLES

There are several theoretical discrete probability distributions that have extensive applications in decision-making. One will be introduced in this section.

Binomial Distribution

Many decisions are of the either/or variety. A company bidding for a contract may either get the contract or it won’t. The responses to a public opinion poll may be either “favor” or “oppose”. Many experiments (situations) have only two possible alternatives, such as yes/no, pass/fail, or acceptable/defective.

Consider a series of experiments which have the following properties:

• The experiment is performed n times under identical conditions.
• The result of each experiment can be classified into one of two categories, say, success (S) and failure (F).
• The probability of a success, denoted by p, is the same for each experiment. The probability of a failure is denoted by q. Note that q = 1 - p.
• Each experiment is independent of all the others.
• The binomial random variable X is the number of successes in n experiments.

Probability of x successes in n experiments:

$$p(x) = \binom{n}{x} p^x q^{n-x}, \qquad x = 0, 1, 2, \ldots, n, \qquad \text{where } \binom{n}{x} = \frac{n!}{x!\,(n-x)!}$$
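The probability of x successes in n independent experiments can be sketched directly with `math.comb`. The function names `binomial_pmf` and `binomial_cdf` are hypothetical helpers, not part of any library. As a sanity check, n = 2 and p = 1/2 should reproduce the two-coin distribution, and the cumulative values for n = 25 correspond to the fuse-inspection example worked in this section (accept the lot when at most one sampled fuse is defective).

```python
from math import comb


def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials with success probability p."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)


def binomial_cdf(x, n, p):
    """Cumulative probability Pr(X <= x)."""
    return sum(binomial_pmf(k, n, p) for k in range(x + 1))


# Two fair coins: n = 2, p = 1/2 gives p(0), p(1), p(2) = 1/4, 1/2, 1/4.
two_coins = [binomial_pmf(x, 2, 0.5) for x in range(3)]
expected_heads = sum(x * px for x, px in enumerate(two_coins))

# Fuse sampling plan: n = 25 fuses inspected, lot accepted if x <= 1.
accept_if_10pct_defective = binomial_cdf(1, 25, 0.10)
accept_if_1pct_defective = binomial_cdf(1, 25, 0.01)

print(two_coins, expected_heads)
print(round(accept_if_10pct_defective, 5), round(accept_if_1pct_defective, 5))
```

Summing `binomial_pmf` over all x from 0 to n should give 1, which is a convenient check on any hand calculation.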


The name binomial arises from the fact that the probabilities $p(x) = \binom{n}{x} p^x q^{n-x}$, x = 0, 1, 2, …, n, are terms of the binomial expansion (q + p)ⁿ.

Mean, Variance, and Standard Deviation for a Binomial Random Variable:

Mean: µ = np
Variance: σ² = npq
Standard Deviation: σ = √(npq)

Several extensive tables of the binomial distributions for some values of p and n have been published. Either one of the cumulative probabilities Pr(X ≤ x) or Pr(X ≥ x) is listed.

Example: Suppose your company ships electrical fuses in lots, each lot containing 10,000 fuses. Your quality control plan requires that you will randomly sample twenty-five fuses from each lot and accept the lot if the number of defective fuses, x, is less than 2. If x ≥ 2 you will reject the lot and will conduct a complete re-inspection. What is the probability of accepting a lot (x = 0, 1) if the actual fraction defective in the lot is (a) .1? (b) .01?

Solution:

n = 25

(a) If p = 0.1; q = 1 - p = 1 - 0.1 = 0.9

Pr(Accepting the lot) = Pr(x < 2) = Pr(x ≤ 1) = Pr(x = 0) + Pr(x = 1)

$$= \binom{25}{0}(.1)^0(.9)^{25} + \binom{25}{1}(.1)^1(.9)^{24} = .07179 + .19942 = .27121$$

(b) If p = .01; q = .99

$$\Pr(x \le 1) = \binom{25}{0}(.01)^0(.99)^{25} + \binom{25}{1}(.01)^1(.99)^{24} = .77782 + .19642 = .97424$$

The two results indicate that, under the proposed quality control plan, there is a 73 percent chance that you will reject the lot if in fact 10 percent (p = 0.1) of the fuses manufactured are defective. If only 1 percent of the fuses are defective, the chance of re-inspection is very small (less than 3 percent).

Continuous Random Variables

The probability distribution for a continuous random variable is often denoted by f(x) and is commonly called a probability density function. The primary difference between probabilities for discrete and continuous random variables is that while probabilities for a discrete random variable are defined for specific values of the variable, the probabilities of a continuous random variable are defined for a range of values of the variable. The graphic form of f(x) is a smooth curve and the area under the curve corresponds to probabilities for x. For example, the area A beneath the curve between the two points a and b is the probability Pr(a<x<b).

Figure 29.6—Continuous Distribution

Because there is no area over a point, the probability associated with any particular value of x, say, x = a, is equal to zero. Hence, Pr(a≤x≤b) = Pr(a<x<b). In other words, the probability is the same regardless of whether the endpoints of the interval are included. The total area under the curve, which is the total probability for x, equals 1.

The areas under most probability density functions are obtained by the use of calculus or other numerical methods. This is often a difficult procedure. However, as with commonly used discrete probability distributions, tables exist for finding probabilities under commonly used continuous probability distributions.

Similar to the requirements for a discrete probability distribution, we require

$$f(x) \ge 0 \text{ for all } x \quad \text{and} \quad \int_{-\infty}^{+\infty} f(x)\,dx = 1$$

The Normal Distribution
The most important continuous distribution in statistical decision making is the normal distribution. It is important for the following reasons:


• as odd as it may seem, many observed variables are normally distributed, or approximately so.
• many of the procedures used in statistical inference require the assumption that a population is normal.

Probability distribution for a normal random variable X is

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\; e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \qquad \text{for } -\infty < x < +\infty$$

where
µ = mean of the random variable X
σ = standard deviation of X
e = 2.71828…
π = 3.14159…

Figure 29.7—Two Normal Distributions

The graph of a normal distribution is called a normal curve and it has the following characteristics:

• It is bell-shaped and is symmetrical about the mean. The mean, median and mode are all equal. Probability density decreases symmetrically as x values move from the mean in either direction. Since the total area (probability) under the curve is 1, the area on each side of the mean is 1/2.
• The curve approaches but never touches the horizontal axis. However, when the value of X is more than three standard deviations from the mean, the curve approaches the axis so closely that the extended area under the curve is negligible.

Figure 29.8—A Normal Probability Distribution

The Standard Normal Distribution

If X is a normally distributed random variable with mean µ and standard deviation σ, the random variable Z, defined by

Z = (X - µ)/σ

is a normally distributed variable with mean zero and standard deviation 1. The probability distribution of Z is called the standard normal distribution. Notice that z gives the number of standard deviations that a value of x lies above or below the mean. By using the Z score, all normal distributions can be transformed to the Standard Normal Distribution. We can say that if X is N(µ, σ²), then Z = (X - µ)/σ is N(0,1).

The standard normal distribution table that gives the area under the standard normal curve is available. Some tables give the area between the mean 0 and any particular value of z, where 0 < z < 3.59. Remember that the area represents the probability that a value of z will lie between zero and the given value and it must always be positive. Further, since the normal curve is symmetric, the area between -z and zero is the same as the area between zero and z; the area between -z and +z is twice the area between zero and z.

Frequently quoted z values and the probabilities:

Table 29.1—Frequently Quoted Z Values and Probabilities

z        Pr(-z < Z < z)    Pr(Z < -z) or Pr(Z > z)
1.00     .683              .158
1.282    .80               .10
1.645    .90               .05
1.96     .95               .025
2.00     .954              .023
2.326    .98               .01
2.576    .99               .005
3.00     .997              .0015
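The frequently quoted areas under the standard normal curve can be recomputed with `statistics.NormalDist` from the Python standard library (available from Python 3.8). This is a sketch for checking table values, not a replacement for the published tables; `central_area` and `tail_area` are hypothetical helper names.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def central_area(z):
    """Pr(-z < Z < z): the area within z standard deviations of the mean."""
    return STANDARD_NORMAL.cdf(z) - STANDARD_NORMAL.cdf(-z)


def tail_area(z):
    """Pr(Z > z), which equals Pr(Z < -z) because the curve is symmetric."""
    return 1.0 - STANDARD_NORMAL.cdf(z)


for z in (1.00, 1.282, 1.645, 1.96, 2.00, 2.326, 2.576, 3.00):
    print(f"z = {z:5.3f}  central = {central_area(z):.3f}  tail = {tail_area(z):.4f}")

# Going the other way: the z value that leaves 1 percent in the upper tail.
z_for_one_percent_tail = STANDARD_NORMAL.inv_cdf(0.99)
print(round(z_for_one_percent_tail, 3))
```

The inverse lookup in the last lines is the operation needed whenever a problem states a probability and asks for the corresponding z value.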


Example: The actual amount of coffee grounds that a filling machine puts into “6-ounce” jars varies from jar to jar, and it may be assumed to be a normal random variable with a standard deviation of 0.04 ounce. If the jar contains less than 6 ounces, it is considered unacceptable. Determine the mean fill of the machine so that only 1 percent of the jars will be unacceptable.

Solution: Let x be the amount of coffee in the jar. We are given σ = 0.04

We are asked to find the average fill, µ, such that Pr(x < 6) = .01

$$\Pr(x < 6) = \Pr\left(\frac{x-\mu}{\sigma} < \frac{6-\mu}{\sigma}\right) = \Pr\left(z < \frac{6-\mu}{.04}\right) = .01$$

From the above table we find that Pr(z < –2.326) = .01; therefore,

$$\frac{6-\mu}{.04} = -2.326, \qquad \mu = 6.093$$

If the average fill is set at 6.093 ounces, only 1 percent of the jars will contain less than 6 ounces.

PRACTICE PROBLEMS AND QUESTIONS

1. After a long-term observation of a production line, the following productivity data are recorded.

Workhours/unit            6    7    8    9   10   11   12   13
Number of observations    6   11   27   47   52   44    9    4

Find:

a. mean
b. median
c. mode
d. variance
e. standard deviation

2. The average project duration to build a greenfield manufacturing plant is 26 months with a standard deviation of two months, assuming the project critical path follows a normal distribution. Your company is planning to build a similar greenfield manufacturing plant. The senior management is interested in the following three schedule outcomes:

a. What is the chance to complete the project between 24 and 28 months?
b. What is the likelihood of completing the project in 24 months?
c. What is the risk that the project duration would exceed 30 months?

Solutions:

1. a. Mean is the average workhours/unit of the 200 observations.

$$\mu = \frac{\sum x}{N} = \frac{(6 \times 6) + (7 \times 11) + (8 \times 27) + (9 \times 47) + (10 \times 52) + (11 \times 44) + (12 \times 9) + (13 \times 4)}{200} = \frac{1{,}916}{200} = 9.58 \text{ workhours/unit}$$

b. After arranging the observed 200 workhours/unit data in ascending order, the median is the average of the 100th and 101st observations. Both are 10 workhours/unit, so the median is 10 workhours/unit.

c. The mode is the observation that occurred most often. Ten workhours/unit was recorded 52 times, the highest frequency. Thus, 10 workhours/unit is the mode.

d. Variance:

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{\sum x^2 - N\mu^2}{N}$$

$$= \frac{\left[(6^2 \times 6) + (7^2 \times 11) + (8^2 \times 27) + (9^2 \times 47) + (10^2 \times 52) + (11^2 \times 44) + (12^2 \times 9) + (13^2 \times 4)\right] - 200 \times (9.58)^2}{200}$$

$$= \frac{18{,}786 - 200(9.58)^2}{200} = \frac{18{,}786 - 18{,}355.28}{200} = 2.1536$$

e. Standard deviation σ = √2.1536 = 1.47 workhours/unit.
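The solution to Problem 1 works from a frequency table rather than from 200 individual observations, and that approach can be sketched as follows. The dictionary `FREQUENCIES` is read off the terms of solution 1a, and `grouped_median` is a hypothetical helper that finds the middle observation(s) by walking the cumulative frequencies.

```python
from math import sqrt

# workhours/unit -> number of observations, taken from the terms in solution 1a.
FREQUENCIES = {6: 6, 7: 11, 8: 27, 9: 47, 10: 52, 11: 44, 12: 9, 13: 4}

N = sum(FREQUENCIES.values())

# Population mean and variance computed from grouped data.
mean = sum(x * f for x, f in FREQUENCIES.items()) / N
sum_of_squares = sum(f * x * x for x, f in FREQUENCIES.items())
variance = (sum_of_squares - N * mean ** 2) / N
std_dev = sqrt(variance)

# The mode is the value with the highest frequency.
mode = max(FREQUENCIES, key=FREQUENCIES.get)


def grouped_median(freqs):
    """Median of grouped data: locate the middle rank(s) via cumulative frequencies."""
    total = sum(freqs.values())
    ordered_values = sorted(freqs)

    def value_at(rank):
        # rank is 1-based: the rank-th observation in ascending order
        running = 0
        for x in ordered_values:
            running += freqs[x]
            if rank <= running:
                return x

    if total % 2 == 1:
        return value_at((total + 1) // 2)
    return (value_at(total // 2) + value_at(total // 2 + 1)) / 2


median = grouped_median(FREQUENCIES)

print(N, mean, median, mode, round(variance, 4), round(std_dev, 2))
```

Because N = 200 is even, the median is the average of the 100th and 101st observations, exactly as in solution 1b.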


Problem 1—Frequency Distribution

2. a. The project duration between 24 and 28 months is within one standard deviation from the average (mean) duration of 26 months. Since the probability within one standard deviation from the mean is 0.68, this project has a 68 percent chance to be completed between 24 and 28 months.

Problem 2a—Project Duration Between 24 and 28 Months

b. Twenty-four-month duration is shorter than the 26-month mean duration by two months, which is one standard deviation below the mean. Referring to the standard normal distribution table, the probability is 0.16. This project has only a 16 percent chance to be completed within 24 months.

Problem 2b—Project Duration within 24 Months

c. Thirty-month duration is four months longer than the 26-month mean duration, which is two standard deviations above the mean. From the standard normal distribution table, the probability of exceeding two standard deviations is 0.023. Thus, there is only a 2 percent risk that the project duration would exceed 30 months.

Problem 2c—Project Duration Exceeding 30 Months
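The three schedule probabilities in Problem 2 can be cross-checked with `statistics.NormalDist`, assuming (as the problem does) that project duration is normally distributed with a mean of 26 months and a standard deviation of 2 months. The variable names are hypothetical.

```python
from statistics import NormalDist

# Project duration in months: mean 26, standard deviation 2 (as stated in Problem 2).
duration = NormalDist(mu=26, sigma=2)

# a. Chance of finishing between 24 and 28 months (within one standard deviation).
p_between_24_and_28 = duration.cdf(28) - duration.cdf(24)

# b. Chance of finishing within 24 months (one standard deviation below the mean).
p_within_24 = duration.cdf(24)

# c. Risk of exceeding 30 months (two standard deviations above the mean).
p_over_30 = 1 - duration.cdf(30)

print(round(p_between_24_and_28, 3), round(p_within_24, 3), round(p_over_30, 3))
```

The computed values agree with the rounded table figures of 0.68, 0.16, and 0.023 used in the solutions.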


REFERENCES

1. AACE International. 2003. Certification Study Guide. 2nd ed. Chapter 20.
2. Brockett, P., and A. Levine. 1984. Statistics & Probability &
Their Applications. CBS College Publishing.
3. Byrkit, D.R. 1987. Statistics Today. The Benjamin
Cummings Publishing Company.
4. Groebner, D. F., and P.W. Shannon. 1985. Business
Statistics. Charles E. Merrill Publishing Company.
5. McClave, J.T., and F.H. Dietrich. 1985. Statistics. 3rd ed.
Dellen Publishing Company.
6. Smith, G. 1985. Statistical Reasoning. Allyn and Bacon.
7. Summers, G. W., W. S. Peters, and C. P. Armstrong. 1985.
Basic Statistics in Business and Economics. 4th ed.
Wadsworth Publishing Company.

AACE INTERNATIONAL BASIC CONCEPTS IN DESCRIPTIVE STATISTICS

Chapter 30

Basic Concepts in Descriptive Statistics

Dr. Frederick B. Muehlhausen

INTRODUCTION

The manager of the construction process must make decisions daily that affect the operation of an individual project as well as the company as a whole. Rarely does a manager have the intuition to make decisions that avoid serious or continued error without input from past field and company performance. The successful company collects information so that when analyzed, good decisions can be made. Statistics constitutes all methods useful for the analysis of this information. In general, statistical methods are of two types and subsequent purpose:

1. descriptive statistics, which allow the cost engineer to organize, summarize, interpret, and communicate quantitative information obtained from observations; and
2. inferential statistics, which allow the cost engineer to go beyond the data collected from a small sample to formulate tentative conclusions about the population from which the sample was taken.

This chapter examines some of the basic concepts and procedures that are a part of descriptive statistics.

LEARNING OBJECTIVES

After completing this chapter, the readers should be able to

• understand the basic concepts and procedures of descriptive statistics, such as frequency distributions, frequency graphs, the normal curve, and cumulative probability curve.

FREQUENCY DISTRIBUTIONS

A concrete contractor engaged in installing foundation footings and walls needs to be able to predict with some accuracy the time it takes to install formwork. Not only will a knowledge of past performance help the contractor predict future performance useful in estimating and scheduling, this knowledge also will provide an internal benchmark for cost control purposes once the project is in progress. Before these predictions can be made, the contractor must know more about past performance. Thus, data about similar forming techniques is collected from 20 projects completed over the last several years. This data is summarized in Table 30.1.

Table 30.1—Formwork Production in Hours Per Square Foot of Contact Area

JOB   HRS/SFCA     JOB   HRS/SFCA     JOB   HRS/SFCA     JOB   HRS/SFCA
 1    .050          6    .050         11    .040         16    .050
 2    .050          7    .065         12    .055         17    .060
 3    .065          8    .060         13    .045         18    .055
 4    .055          9    .050         14    .050         19    .070
 5    .050         10    .045         15    .065         20    .045

The data above is hard to interpret in its present form. It must be organized so that the data yields meaning to the manager.


Frequency Distribution—A frequency distribution is an organization of measures or observations that lists the class (in this case, the productivity rate as measured in labor hours per square foot of contact area) and the frequency or the number of times this production rate was achieved. In Table 30.2, the data has been rearranged by listing the data from high productivity to low productivity (column 1), and the number of times this production rate occurred (column 2).

Note that it is much easier to get a “feel” for the measures observed when arranged by frequency (column 2) than when examining the unorganized raw data. One can easily determine the highest productivity (.040 HRS/SFCA), the lowest (.070 HRS/SFCA), and the production rate that occurred most often (.050 HRS/SFCA). In addition, one can easily observe how the measures are distributed along the entire scale; that is, whether the measures are distributed uniformly or whether gaps appear at certain points. In this case, the data is distributed uniformly.

Cumulative Frequency Distribution—Sometimes, one is not particularly interested in the number of occurrences within a particular class but in the number of occurrences that fall below or above a certain value. For example, suppose the contractor bid this type of formwork at .055 HRS/SFCA. Any value above this rate would be over budget and extend the project duration. The question arises, “How many projects failed to yield the required production rate?” The cumulative frequency distribution answers this question by adding successively from the bottom (.070 HRS/SFCA) the number of cases in each class interval. Thus, the distribution would be developed as shown in Table 30.2, column 3. Note that the topmost entry in the cumulative frequency column must agree with the total number of measures (n = 20). If it does not, then an error has been made in adding the frequencies.

Interpretation of the cumulative frequency distribution (column 3) indicates that 6 of the 20 productivity rates fall below (too many hours expended per square foot of contact area) the required rate of .055 HRS/SFCA. In order to be within budget and on time for more than 2/3 of the projects, the data indicates that the contractor has two choices: (1) bid future work at a higher rate per square foot of contact area, or (2) implement process changes to increase production on future projects.

Cumulative Percentage Distribution—Sometimes it is useful to show the percent of scores that fall below certain values. The contractor acknowledges the fact that variations from project to project in labor and management will prevent process changes from reducing the rate on all projects to .055 HRS/SFCA or lower. In addition, market competition will not allow the budget rate to go above .055 HRS/SFCA. The contractor will accept a 10 percent failure rate. That is, 90 percent of the projects must yield a production rate of .055 HRS/SFCA or less. The cumulative frequency distribution can be converted into a cumulative percentage distribution to readily find the failure rate. This is accomplished by dividing each cumulative frequency by the total number of measures (n = 20). Table 30.2, column 4, shows the cumulative percents for the production rates.

The advantage of this distribution is that it readily shows the percentage of measures falling below a certain value. Generally, it is more meaningful to know the percentage of those measures that fall below a certain value rather than to know the number of measures. In this case, 30 percent of the production rates failed to make .055 HRS/SFCA. Hence, one would conclude that the contractor would not be satisfied.

Table 30.2—Frequency Distributions

COLUMN 1           COLUMN 2         COLUMN 3           COLUMN 4
rate (HRS/SFCA)    frequency (f)    cum. freq. (cf)    cum. percent (%)
.040                1               20                 100
.045                3               19                  95
.050                7               16                  80
.055                3                9                  45
.060                2                6                  30
.065                3                4                  20
.070                1                1                   5
                   n = 20
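The four columns of Table 30.2 can be rebuilt from the raw job data in Table 30.1 with a short script. This is a sketch under two assumptions: the rates are stored as whole thousandths of an hour per square foot (so .050 HRS/SFCA becomes 50) to avoid floating-point keys, and the names `RATES_THOUSANDTHS` and `frequency_distribution` are hypothetical.

```python
from collections import Counter

# Table 30.1 production rates for jobs 1-20, in thousandths of HRS/SFCA.
RATES_THOUSANDTHS = [
    50, 50, 65, 55, 50, 50, 65, 60, 50, 45,
    40, 55, 45, 50, 65, 50, 60, 55, 70, 45,
]


def frequency_distribution(rates):
    """Return rows of (rate, frequency, cumulative frequency, cumulative percent)."""
    counts = Counter(rates)
    n = len(rates)
    rows = []
    cumulative = 0
    # Accumulate from the bottom of the table (most hours per SFCA) upward.
    for rate in sorted(counts, reverse=True):
        cumulative += counts[rate]
        rows.append((rate, counts[rate], cumulative, 100 * cumulative / n))
    rows.reverse()  # list from high productivity (fewest hours) to low productivity
    return rows


rows = frequency_distribution(RATES_THOUSANDTHS)
for rate, f, cf, pct in rows:
    print(f".{rate:03d}  f = {f}  cf = {cf:2d}  cum. % = {pct:5.1f}")
```

The topmost cumulative frequency equals the number of jobs (20), which is the same consistency check the text recommends when adding the frequencies by hand.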


GLOSSARY TERMS IN THIS CHAPTER
frequency distribution ◆ standard deviation ◆ statistics

FREQUENCY GRAPHS

Once the contractor finds that 30 percent of the projects yield an unacceptable production rate, this information is conveyed to project management and labor. Rather than show the information in tabular form, the contractor decides to present the information graphically. Graphs convey the essential characteristics of a frequency distribution in a pictorial form. Graphical information is usually easier to take in at a glance than tables, and so provides an effective medium for communicating frequency distribution information to others.

Frequency graphs have two characteristics in common: (1) one axis that represents all possible scores or classes within a distribution, and (2) one axis that represents the frequency of occurrence of that score or class. Frequency distributions are represented graphically via the histogram or the frequency polygon.

Histogram: In a histogram, the frequency of each score or class is represented as a vertical bar. For the production rate data, a histogram would be produced as shown in Figure 30.1.

When developing the histogram, the 3/4 rule should be applied. That is, the highest frequency should be laid out so that the height is approximately 3/4 the length of the horizontal axis. Otherwise, the viewer may obtain the wrong impression based on graph appearance rather than on graph data. The bar width should span the “real limits” of a class. For example, suppose the production rates were rounded off to the nearest .005; then the real productivity rate for class .05 would fall between .0475 and .0525. Thus, the width of the vertical bar would be .005 and extend from .0475 to .0525 on the graph. In addition, the graph should be titled in a descriptive fashion to indicate what the graph is showing.

Frequency Polygon: For the frequency polygon, the vertical and horizontal axes are laid out the same way as for the histogram, but instead of drawing a vertical bar, a point is plotted at the exact score (or midpoint of the interval) and at a height corresponding to the frequency of that score (or interval). These points are then connected by straight lines, which results in a polygon. See Figure 30.2.

The reason for constructing histograms and frequency polygons is to reveal how scores are distributed along the score scale. That is, the form of the distribution is shown. A distribution is symmetrical if one side is a mirror image of the other. If not, it is asymmetrical. Asymmetrical curves can be skewed either positively or negatively. See Figure 30.3.

For negative skewness, the tail travels to the left; for positive skewness, the tail travels to the right. The production rate frequency polygon, Figure 30.2, indicates a mild skew in the positive direction. Another noticeable feature of a polygon is the number of humps or high points. If there is only one high point, then the curve is unimodal. If there are two humps, then the curve is bimodal. For three humps, it is trimodal, and so on. The production rate frequency polygon, Figure 30.2, shows two humps, one much higher than the other.

The frequency graph's information should generate curiosity among the viewers. In this case, the following questions arise: “Why the second hump at .065 HRS/SFCA? What causes the variation? Can we isolate this cause and fix it?” For example, suppose that an examination of job data reveals that the projects with the lower productivity rate (higher number of hours per square foot of contact area installed) occurred where formwork was stacked. One might conclude that the stacking activity requires more hours per square foot of contact area than if no stacking occurred. Thus, the outcome of the analysis would be to have two budget rates: one for when panels are not stacked and one for when they are. Based on the frequency distribution, viable budget rates would be .050 HRS/SFCA for no stacking and .065 HRS/SFCA for stacking. Each rate reflects a mode of the frequency distribution.
[Figure 30.1: Histogram of Production Rates. Horizontal axis: production rates (HRS/SFCA); vertical axis: frequency.]
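
The shape features discussed above (bar heights and the number of humps) can be checked without drawing anything. The following is a minimal sketch, assuming the rates and frequencies of Table 30.2; the function names are hypothetical, and a "hump" is taken here to mean a class whose frequency is strictly greater than both neighbors, so flat-topped plateaus are not detected.

```python
# Rates (HRS/SFCA) and frequencies from Table 30.2.
rates = [0.040, 0.045, 0.050, 0.055, 0.060, 0.065, 0.070]
freqs = [1, 3, 7, 3, 2, 3, 1]


def find_humps(values, counts):
    """Return (value, count) pairs that are strict local maxima.

    The ends are padded with zero frequency, as a frequency polygon
    is normally closed down to the axis on both sides.
    """
    padded = [0] + list(counts) + [0]
    humps = []
    for i, value in enumerate(values):
        left, middle, right = padded[i], padded[i + 1], padded[i + 2]
        if middle > left and middle > right:
            humps.append((value, middle))
    return humps


def text_histogram(values, counts):
    """Build one line per class, with one '#' per observation."""
    return [f"{v:.3f} | {'#' * c}" for v, c in zip(values, counts)]


humps = find_humps(rates, freqs)
for line in text_histogram(rates, freqs):
    print(line)
print("Humps (rate, frequency):", humps)
```

For these data the sketch reports humps at .050 and .065 HRS/SFCA, matching the two budget rates suggested in the text.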

30.3
BASIC CONCEPTS IN DESCRIPTIVE STATISTICS AACE INTERNATIONAL

[Figure 30.2: Frequency Polygon of Production Rates. Horizontal axis: production rates (HRS/SFCA); vertical axis: frequency.]

[Figure 30.3: Skewed Curves. Panels: Positive, Negative.]

MEASURES OF CENTRAL TENDENCY

The frequency distributions can be characterized by certain statistics. One type of statistic is called the index of central tendency, or average, and represents the general location of a distribution of measures on the measurement scale. There are three commonly used indexes of central tendency: the mode, the median, and the mean.

Mode: The mode is the simplest measure of central tendency. It is merely the score value or measure that occurs most often in a distribution of scores. For the production rate distribution, the score occurring most often is .050 HRS/SFCA. Hence, the mode for this distribution is .050.

Median: The median is the middle point in a distribution. Half of the distribution is above this point and half is below. To find the median one arranges the scores in order. For example, consider the following observations of concrete test cylinders: 2700 psi, 2750 psi, 2965 psi, 3100 psi, 3130 psi, 3480 psi, and 3500 psi. The score, 3100 psi, is the middle point. There are three observations above and three below the median of 3100 psi.

When the number of scores is even or there is a repetition of a certain score, then the location of the median requires computation. Consider the production rate distribution shown in Figure 30.4. Since there are 20 scores, the median is the point below which 10 cases fall.

If one counts from the left of the distribution, one finds that the median falls between the sixth and seventh .05 in the distribution. Since there is an additional .05 lying above the tenth score, which is also .05, one cannot say that .05 is the median. Looking at the distribution, there are 4 scores that fall below .05, and 9 scores that lie above .05. Thus, .05 would not fit the definition of the median. But one knows that the median falls somewhere within the class interval of width .005 that runs from .0475 to .0525. One can locate the median within an interval by applying the formula shown in Figure 30.5.

.04, .045, .045, .045, .05, .05, .05, .05, .05, .05 | .05, .055, .055, .055, .06, .06, .065, .065, .065, .07
10 scores <-- | --> 10 scores

Figure 30.4: Production Rate Distribution

Mdn = L + [(N/2 - cfb)/fw]i = .0475 + [(20/2 - 4)/7](.005) = .05179

where: Mdn = the median
       N = total number of cases in the distribution
       L = lower real limit of the median interval
       cfb = the cumulative frequency below the median interval
       fw = the frequency of cases within the median interval
       i = the interval size

Figure 30.5: Calculation of the Median Production Rate

Mean: The best known and most reliable measure of central tendency is the mean. The mean is the arithmetic average of a group of scores. Thus, for the production rate distribution containing the 20 observations in Table 30.1, one would compute the mean as shown in Table 30.3.

Table 30.3: Calculation of the Mean Production Rate

RATE (HRS/SFCA)   FREQUENCY (F)   PRODUCT
.040              1               .040
.045              3               .135
.050              7               .350
.055              3               .165
.060              2               .120
.065              3               .195
.070              1               .070
TOTALS            20              1.075

Mean = Sum of scores/N = 1.075/20 = .05375

Comparison of Mean, Median, and Mode: If the contractor wants to know what production rate occurred most often, the mode would be calculated. However, the mode is a crude and unstable measure of central tendency and is generally not used to describe a distribution. Usually the median or the mean is used. However, there is an important difference between the median and mean. The median is a rank or a position statistic unaffected by the numerical size of the individual scores, while the mean is sensitive to the size of the individual scores in a distribution, including extreme scores.

If the frequency distribution is unimodal and perfectly symmetrical, then the mean, median, and mode will fall at exactly the same point. This frequency distribution is called the normal curve. If a distribution is skewed (that is, scores are concentrated more at one end or the other), then the curve will not be symmetrical, and the three measures of central tendency will not be equal. Note that in a unimodal skewed distribution the median typically lies between the mode and the mean. If negatively skewed, the median is higher than the mean. If positively skewed, the median is lower than the mean.

The mean is the most stable or reliable measure of central tendency. If one were to draw repeated samples from the total population, the mean would show less fluctuation from sample to sample than the median. Thus, if one wanted to infer some characteristic about a population from a sample, the mean would yield the most reliable estimate of the population parameter. Or, stated another way, if the contractor wanted to bid the next job based on past experience, the best estimate (that rate with the least amount of error) of the actual production rate would be the mean of .05375 HRS/SFCA.

MEASURE OF VARIABILITY

The measures of central tendency (mean, median, and mode) provide a concise index of the average value of a set of scores or measures. However, there is more to be known about a distribution of scores than this one characteristic. The amount of variability or spread of the scores within the distribution is also an important characteristic to know about a given distribution. For example, suppose the production rates varied from .02 to .09, a spread of .07, rather than .04 to .07, a spread of .03; yet, in both instances the mean was .05375. If the contractor used the mean production rate to bid the work, the distribution with the greater spread would yield less confidence in the accuracy of the mean rate for any one particular project than if the spread were small. In addition, the actual error on any one project would tend to be greater.

The Range: The simplest measure of variability is the range. The range is defined as the difference between the highest and lowest scores in a distribution of scores. Figure 30.6 shows the calculation of the range.

Range = Xh - Xl = .07 - .04 = .03

where: Xh = the highest score
       Xl = the lowest score

Figure 30.6: Calculation of the Range for the Production Rates

The range is not considered a stable measure of variability because the value can change greatly with the change in a single score within the distribution, either the high or the low score. In addition, there may be frequent or large gaps in the distribution, which the range does not reflect, because it uses only two scores, the high and the low. Thus, the range is only useful as a quick estimate of variability.

Quartile Deviation: The quartile deviation is more stable than the range because it is based on the spread of the scores through the center of the distribution rather than through the two extremes. The quartile deviation is half the distance between the first and third quartiles. The first quartile (Q1) is the score that sets off the lowest 25 percent of the scores, while the third quartile (Q3) sets off the upper 25 percent of the scores. The interval from Q1 to Q3 contains the middle 50 percent of the scores in a distribution and is called the inter-quartile range. This distance is then divided by 2 to give the average distance from the median to each of the quartiles. This is called the quartile deviation (QD). For the production rates, the calculations are shown in Figure 30.7.

Q1 = L + [(N/4 - cfb)/fw]i = .0475 + [(20/4 - 4)/7](.005) = .0482
Q3 = L + [(.75N - cfb)/fw]i = .0575 + [(.75 x 20 - 14)/2](.005) = .060
QD = (Q3 - Q1)/2 = (.060 - .0482)/2 = .0059

where: Q1 = first quartile
       Q3 = third quartile
       L = the lower real limit of the interval within which the first quartile or the third quartile lies
       N = the number of cases
       cfb = the cumulative frequency below the interval containing either the first quartile or third quartile
       fw = the frequency of cases within the interval containing either the first or third quartile
       i = the interval size

Figure 30.7: Calculation of the Quartile Deviation for the Production Rates

Since the quartile deviation is an index that reflects the spread of scores throughout the middle part of the distribution, it should be used whenever extreme scores may distort the data. Thus, the median and the quartile deviation are both insensitive to extreme scores in the distribution and should be used accordingly.

Standard Deviation: The major disadvantage of the quartile deviation is that it does not take into account the value of each of the raw scores in the distribution. A more reliable indicator of the spread of a distribution can be found by determining the amount each score deviates from the mean of the distribution. The standard deviation does this: it is the square root of the averaged squared deviations from the mean. In most instances, the contractor will be computing a sample statistic rather than a population statistic. The production rates do not include the entire population of rates for every job past, present, and future. Thus, the 20 rates are a sample of all work the contractor does. The sample standard deviation can be calculated from the frequency distribution shown in Table 30.4.

The standard deviation can be computed from raw scores with the use of an inexpensive hand-held calculator that has statistical functions built in. One simply enters the raw scores in the STAT mode on the calculator. Then, a few key strokes will yield such information as the sample mean and the sample standard deviation.

Table 30.4: Calculation of the Standard Deviation From the Frequency Distribution

Production Rate (X)   Frequency (f)   Product (fX)          fX²
.070                  1               .070                  .004900
.065                  3               .195                  .012675
.060                  2               .120                  .007200
.055                  3               .165                  .009075
.050                  7               .350                  .017500
.045                  3               .135                  .006075
.040                  1               .040                  .001600
                      N = 20          sum of fX = 1.075     sum of fX² = .059025

s = sqrt{[Sum of fX² - (Sum of fX)²/N]/(N - 1)} = sqrt{[.059025 - (1.075)²/20]/(20 - 1)} = sqrt(.00006546) = .00809 or about .0081

where: s = sample standard deviation
       X = raw data (individual observed production rate)
       f = frequency with which that raw data occurs
       N = number of observations in the sample
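
As a cross-check on the hand calculations in this section, the statistics can be recomputed directly from the frequency distribution. The sketch below is illustrative only: the function name `grouped_quantile` and the constant `WIDTH` are hypothetical, the data are the production rates and frequencies tabulated above, and the interpolation follows the L + [(target - cfb)/fw]i pattern used for the median and quartiles.

```python
import math

# Production rates (HRS/SFCA) and their frequencies, lowest rate first.
rates = [0.040, 0.045, 0.050, 0.055, 0.060, 0.065, 0.070]
freqs = [1, 3, 7, 3, 2, 3, 1]
WIDTH = 0.005  # class interval size i (rates rounded to the nearest .005)


def grouped_quantile(values, counts, fraction, width):
    """Interpolate inside the class that contains fraction * N cases:
    L + [(fraction * N - cfb) / fw] * i."""
    target = fraction * sum(counts)
    cum_below = 0
    for value, count in zip(values, counts):
        if cum_below + count >= target:
            lower_real_limit = value - width / 2
            return lower_real_limit + (target - cum_below) / count * width
        cum_below += count
    raise ValueError("fraction must lie between 0 and 1")


n = sum(freqs)
sum_fx = sum(f * x for x, f in zip(rates, freqs))
sum_fx2 = sum(f * x * x for x, f in zip(rates, freqs))

mode = rates[freqs.index(max(freqs))]
mean = sum_fx / n
median = grouped_quantile(rates, freqs, 0.50, WIDTH)
value_range = max(rates) - min(rates)
q1 = grouped_quantile(rates, freqs, 0.25, WIDTH)
q3 = grouped_quantile(rates, freqs, 0.75, WIDTH)
quartile_deviation = (q3 - q1) / 2
# Sample standard deviation: square root of the variance with N - 1.
sample_sd = math.sqrt((sum_fx2 - sum_fx ** 2 / n) / (n - 1))

print(f"mode={mode:.3f} median={median:.5f} mean={mean:.5f}")
print(f"range={value_range:.3f} QD={quartile_deviation:.4f} s={sample_sd:.5f}")
```

Run on the 20 production rates, this gives a mean of .05375, a median of about .05179, a quartile deviation of about .0059, and a sample standard deviation of roughly .0081.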


THE NORMAL CURVE

Recall that a frequency distribution that is unimodal and perfectly symmetrical is known as the normal curve. The normal curve has its mean, median, and mode at the same point. See Figure 30.8.

[Figure 30.8: A Normal Curve. Horizontal axis: class; vertical axis: frequency.]

In actuality, the normal curve is a theoretical curve, which by definition can take many shapes, but in all of those shapes the curve is symmetrical and unimodal. This theoretical curve is important because many physical and psychological phenomena resemble the normal curve when shown in a frequency distribution.

Properties: The important properties of normal curves are (1) the curve is symmetrical with its maximum height at the mean; (2) the mean, median, and mode fall at the same point; (3) the height of the curve decreases to the left and to the right of the mean at an accelerated rate, which forms the convex portion of the curve, until reaching one standard deviation above or below the mean, at which point the decrease decelerates and the curve becomes concave; and (4) the theoretical range of the curve is minus infinity to plus infinity, but for all practical purposes so little of the curve falls below -3 standard deviations or above +3 standard deviations that for most frequency distributions these are the practical limits of the curve.

z-scores: Though two or more frequency distributions may approximate normality yet differ in terms of their means and standard deviations, any normal distribution can be transformed into a distribution of standard scores. These scores are known as z-scores. The distribution is known as the standard normal curve.

The z-score is computed from a sample score by applying the formula z = (X - X̄)/s, where X is a raw score from the distribution, X̄ is the mean of the distribution, and s is the sample standard deviation. For example, assume that the production rate frequency distribution approximates the normal curve, with a mean of .05375 and a sample standard deviation of about .0081. Then, a rate of .060 would yield the following:

z = (.060 - .05375)/.0081 = .772 or .77

When all scores from a distribution are transformed to their corresponding z-scores, the result is a distribution of scores with a mean of 0 and a standard deviation of 1. Thus, the score above is .77 standard deviation above the mean. One can determine what this z-score indicates by consulting published tables of areas under the standard normal curve. A portion of one such table is shown in Table 30.5.

In a published table laid out like Table 30.5, the contractor finds that for the z-score of .77 (column 1), the area (proportion of scores) from the mean to .77 is .2794 or 27.94 percent (column 2), the area below the score of .77 is .7794 or 77.94 percent (column 3), the area above the score of .77 is .2206 or 22.06 percent (column 4), and the .77 score is found on the curve at the y-ordinate of .2966 (column 5). Thus, the contractor could interpret the score by saying that about 77.94 percent of all the possible production rates will fall below .060 HRS/SFCA.

The contractor also can use the standard normal curve table to determine a specific production rate to be used in estimating and scheduling. Assume that the contractor wants to select a rate based on past experience that will be greater than 95 percent of the rates possible. From the table, the contractor finds the area in larger portion (column 3) closest to .95 or 95 percent. From the table, .9505 is found, with a corresponding z-score of 1.65. (Typically, for the kinds of analysis the contractor will perform, interpolation is not necessary.)

Solving for the unknown by applying the equation above, the production rate can be computed as follows:

1.65 = (X - .05375)/.0081
X = .05375 + 1.65(.0081) = .0671 HRS/SFCA

When one compares this rate (.0671) with that shown on the cumulative percent frequency distribution (.07), one notices a discrepancy. The difference is that the cumulative percent is based on only those rates in the sample, whereas the rate extracted from the normal curve is based on all possible rates within the distribution and is an estimate of the true population rate. In addition, the assumption was made that the distribution of production rates was normally distributed. An examination of the frequency polygon reveals that it is not normally distributed. If the lower production rates (high scores) were omitted, then the curve would indeed be more normally distributed.
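
The two table lookups described above (from a rate to a proportion, and from a proportion back to a rate) can also be done in software. The following is a rough sketch, assuming the production rates are approximately normal with a mean of .05375 and a sample standard deviation of about .0081 (the value the sample standard deviation formula gives for the 20 rates); both numbers are inputs you would replace with your own. It uses `NormalDist` from the Python standard library (`statistics` module, Python 3.8 or later), and the constant and function names are hypothetical.

```python
from statistics import NormalDist

# Assumed inputs; substitute your own sample statistics.
MEAN_RATE = 0.05375   # sample mean, HRS/SFCA
STD_DEV = 0.0081      # sample standard deviation, HRS/SFCA (rounded)


def z_score(raw_score, mean, std_dev):
    """Standard score: distance from the mean in standard deviations."""
    return (raw_score - mean) / std_dev


standard_normal = NormalDist()  # mean 0, standard deviation 1

# Forward lookup: what share of rates falls below .060 HRS/SFCA?
z = z_score(0.060, MEAN_RATE, STD_DEV)
share_below = standard_normal.cdf(z)

# Reverse lookup: which rate exceeds 95 percent of possible rates?
z_95 = standard_normal.inv_cdf(0.95)
rate_95 = MEAN_RATE + z_95 * STD_DEV

print(f"z = {z:.2f}, share below = {share_below:.4f}")
print(f"z for 95 percent = {z_95:.3f}, rate = {rate_95:.4f} HRS/SFCA")
```

Because the software does not round z to two decimals before the lookup, its results can differ slightly in the third or fourth decimal place from values read off a printed table.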


Table 30.5: Areas of the Standard Normal Curve

1         2                 3                4                 5
z-score   Area from         Area in          Area in           y ordinate
          mean to z-score   larger portion   smaller portion   at z-score
.67       .2486             .7486            .2514             .3187
.68       .2517             .7517            .2483             .3166
.69       .2549             .7549            .2451             .3144
.70       .2580             .7580            .2420             .3123
.71       .2611             .7611            .2389             .3101
.72       .2642             .7642            .2358             .3079
.77       .2794             .7794            .2206             .2966

1.65      .4505             .9505            .0495             .1023
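
Any row of a table like this can be regenerated from the definition of the standard normal curve. The sketch below is a minimal illustration using only the Python standard library; the function name `normal_table_row` is hypothetical, and the area is obtained from the error function via the identity area(0 to z) = 0.5 * erf(z / sqrt(2)).

```python
import math


def normal_table_row(z):
    """Return the four table entries for a non-negative z-score:
    (area from mean to z, area in larger portion,
     area in smaller portion, y ordinate at z)."""
    area_mean_to_z = 0.5 * math.erf(z / math.sqrt(2))
    larger_portion = 0.5 + area_mean_to_z
    smaller_portion = 0.5 - area_mean_to_z
    ordinate = math.exp(-z * z / 2) / math.sqrt(2 * math.pi)
    return area_mean_to_z, larger_portion, smaller_portion, ordinate


for z in (0.69, 1.65):
    area, larger, smaller, y = normal_table_row(z)
    print(f"{z:.2f}  {area:.4f}  {larger:.4f}  {smaller:.4f}  {y:.4f}")
```

Columns 2 through 4 always satisfy larger = .5 + area and smaller = .5 - area, which gives a quick consistency check on any printed row.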

CUMULATIVE PROBABILITY CURVE

One key purpose of the frequency polygon is to examine how the data is distributed. From this examination, one gets an indication of whether the scores are normally distributed or not. If they are not, then the cumulative probability curve can be applied to the data. An examination of the frequency polygon in Figure 30.2 reveals that the production rate data is not normally distributed. The curve plots the cumulative percentage distribution data (Table 30.2, column 4) as illustrated in Figure 30.9.

Note that the curve shows the percentage of scores (a crude measure of probability) at or above a given HRS/SFCA value, that is, where productivity falls below a certain level. For example, the contractor may wish to know the probability that the production rate will not meet or exceed the acceptable rate. From the cumulative probability curve, one reads that 30 percent of the production rates are .060 HRS/SFCA or higher; thus, 30 percent of the observed rates failed to yield the acceptable rate of .055 HRS/SFCA.

It is important to note that the construction of the cumulative probability curve is dependent upon the question asked. In this case, the contractor wanted to determine the probability of failure. The contractor could just as easily have asked for the probability of success. In this instance, the data in Table 30.2 would have been rearranged from low production (high score) to high production (low score). The cumulative percentage frequency would be found as shown in Table 30.6.

[Figure 30.9: Cumulative Probability Curve. Horizontal axis: production rates (HRS/SFCA); vertical axis: cumulative percentage (%).]


Table 30.6: Cumulative Probability

Rate (HRS/SFCA)   Frequency   Probability of Occurrence   Cumulative Probability
.070              1           5 percent                   100 percent
.065              3           15 percent                  95 percent
.060              2           10 percent                  80 percent
.055              3           15 percent                  70 percent
.050              7           35 percent                  55 percent
.045              3           15 percent                  20 percent
.040              1           5 percent                   5 percent
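
The success-oriented table can be built by accumulating from the lowest rate upward. The following sketch assumes the rates and frequencies listed in Table 30.6 and an acceptable rate of .055 HRS/SFCA; the function name `probability_of_success` is hypothetical, and observed relative frequency is used as a crude stand-in for probability, as in the text.

```python
from itertools import accumulate

# Rates (HRS/SFCA) and frequencies, lowest (most productive) rate first.
rates = [0.040, 0.045, 0.050, 0.055, 0.060, 0.065, 0.070]
freqs = [1, 3, 7, 3, 2, 3, 1]

n = sum(freqs)
prob_pct = [100 * f / n for f in freqs]                  # probability of occurrence
cum_pct = [100 * c / n for c in accumulate(freqs)]       # cumulative probability


def probability_of_success(acceptable_rate):
    """Percent of observed rates at or below the acceptable rate."""
    met = sum(f for r, f in zip(rates, freqs) if r <= acceptable_rate)
    return 100 * met / n


# Print highest rate first to match the layout of the table.
for r, f, p, c in reversed(list(zip(rates, freqs, prob_pct, cum_pct))):
    print(f"{r:.3f}  {f}  {p:4.0f} percent  {c:4.0f} percent")
print(f"Probability of success at .055: {probability_of_success(0.055):.0f} percent")
```

The probability of success and the probability of failure for the same acceptable rate sum to 100 percent, so either table answers both questions.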

The resultant cumulative probability curve is shown in Figure 30.10. Note that the successful production rate is .055 HRS/SFCA. Thus, from the curve, the probability of success is 70 percent.

RECOMMENDED READING

1. Ary, D., and L. C. Jacobs. 1976. Introduction to Statistics: Purposes and Procedures. New York: Holt, Rinehart and Winston.
2. DeFranco, D., and M. R. Spiegel. 1996. Schaum’s Interactive Outline of Statistics. New York: McGraw-Hill, Inc.
3. Montgomery, D. C., and G. C. Runger. 1994. Applied Statistics and Probability for Engineers. New York: John Wiley & Sons, Inc.
4. Spiegel, M. R. 1989. Schaum’s Outline of Statistics. New York: McGraw-Hill, Inc.

[Figure 30.10: Cumulative Probability Curve. Horizontal axis: production rates (HRS/SFCA); vertical axis: cumulative percentage (%).]
