6/1/2020

Ho Chi Minh City University of Technology – Applied Statistics in Construction


Bach Khoa
Applied Statistics in Construction Management
Lecturer: Nguyen Hoai Nghia (Jack), Ph.D.
Email: nhnghia@hcmiu.edu.vn
nghianew@yahoo.com

Lecturer: Nguyen Hoai Nghia, Ph.D.
• Ph.D., Management Technology, SIIT, Thammasat University, Thailand (graduated in 2018).
• M.Eng., Construction Technology and Management, HCMC Polytechnic University, HCMC, Vietnam (graduated in 2008).
• B.Eng., Civil Engineering, HCMC Polytechnic University, HCMC, Vietnam (graduated in 2002).


Research Areas
• Project Appraisal.
• Construction Cost/Finance Management.
• Construction Procurement Management.
• Construction Schedule Management.
• Real Estate Development.

Course Objectives
• To master the basic concepts in statistics.
• To understand the tools and methods commonly used in statistics.
• To know how to use at least one statistical software package.
• To apply that knowledge to survey, analyze, evaluate, and manage systems and processes in the construction industry.


Grading Policy
• Assignments + Term project: 40%
• Software practice: 10%
• Final exam: 50%
Note:
• Class attendance → Compulsory

Textbooks
• De Veaux, R. D., Velleman, P. F., & Bock, D. E., Intro Stats, 5th Edition, Pearson Education Inc., USA, 2017.
• Dalgaard, P., Introductory Statistics with R, 2nd Edition, Springer, USA, 2008.
• Hair, Jr., J. F., Anderson, R. E., Tatham, R. L., & Black, W. C., Multivariate Data Analysis, 8th Edition, McGraw-Hill, USA, 2019.
• Benjamin, J. R. & Cornell, C. A., Probability, Statistics, and Decision for Civil Engineers, 3rd Edition, McGraw-Hill, USA, 1970.
• Hoàng Trọng & Chu Nguyễn Mộng Ngọc, Thống kê ứng dụng trong kinh tế xã hội (Applied Statistics in Socio-Economics), Statistical Publishing House, 2015.


Content
• Part 1: Exploring and understanding data
  ◦ 1.1 What are statistics?
  ◦ 1.2 Populations and samples
  ◦ 1.3 Variables
  ◦ 1.4 Measures of data
  ◦ 1.5 Pattern of data
  ◦ 1.6 Displaying and describing data
  ◦ 1.7 Tables
  ◦ 1.8 Comparing distributions
• Part 2: Exploring relationships between variables
  ◦ 2.1 Scatter plots
  ◦ 2.2 Correlation
  ◦ 2.3 Linear regression
  ◦ 2.4 Multiple regression
• Part 3: Data collection
  ◦ 3.1 Three big ideas of sampling
  ◦ 3.2 Methods of data collection
  ◦ 3.3 Scales of measurement
  ◦ 3.4 Sample surveys


Content
• Part 4: Probability
  ◦ 4.1 Probability
  ◦ 4.2 Variables
  ◦ 4.3 Sampling distributions
• Part 5: Inference for relationships
  ◦ 5.1 Confidence intervals
  ◦ 5.2 Hypothesis testing
  ◦ 5.3 Statistical significance

1.1 What are statistics?
• Statistics helps us to make sense of the world described by our data by seeing past the underlying variation to find patterns and relationships.
• Statistics are particular calculations made from data.


1.1 What are statistics?
• Data vs. information
Data: Data is raw, unorganized facts that need to be processed. Data can be something simple and seemingly random and useless until it is organized.
Ex: Each student's test score is one piece of data.
Information: When data is processed, organized, structured or presented in a given context so as to make it useful, it is called information.
Ex: The average score of a class or of the entire school is information that can be derived from the given data.

1.2 Populations and samples
• 1.2.1 Populations and samples
• 1.2.2 Differences among populations and samples
• 1.2.3 Simple random sampling
• 1.2.4 Sampling with and without replacement


1.2 Populations and samples
1.2.1 Populations and samples
• Both are data sets.
• A population includes all of the elements from a set of data.
• A sample consists of one or more observations drawn from the population.

1.2.2 Differences among populations and samples
• Depending on the sampling method, a sample can have fewer observations than the population, the same number of observations, or more observations.
• More than one sample can be derived from the same population.
• A measurable characteristic of a population (a mean, a standard deviation, …) is called a parameter, but a measurable characteristic of a sample is called a statistic.
• The formulas and symbols are mostly different.


1.2 Populations and samples
1.2.3 Simple random sampling
• A sampling method is a procedure for selecting sample elements from a population.
• Simple random sampling refers to a sampling method that has the following properties.
  ◦ The population consists of N objects.
  ◦ The sample consists of n objects.
  ◦ All possible samples of n objects are equally likely to occur.

1.2.3 Simple random sampling (cont.)
• There are many ways to obtain a simple random sample.
• One way would be the lottery method.
  ◦ Each of the N population members is assigned a unique number.
  ◦ The numbers are placed in a bowl and thoroughly mixed.
  ◦ Then, a blindfolded researcher selects n numbers. Population members having the selected numbers are included in the sample.


1.2 Populations and samples
1.2.4 Sampling with and without replacement
• When a population element can be selected more than one time, we are sampling with replacement.
• When a population element can be selected only one time, we are sampling without replacement.

1.3 Variables
• 1.3.1 Definition
• 1.3.2 Classifications
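
Returning to 1.2.4, the difference between sampling with and without replacement can be shown with Python's standard `random` module. The five-element population and the seed are hypothetical values chosen for the example.

```python
import random

rng = random.Random(0)
population = ["A", "B", "C", "D", "E"]  # hypothetical population, N = 5

# With replacement: an element may be drawn more than once,
# so the sample can even be larger than the population.
with_replacement = rng.choices(population, k=8)

# Without replacement: each element can be drawn at most once,
# so the sample size cannot exceed N.
without_replacement = rng.sample(population, k=3)

print(with_replacement)
print(without_replacement)
```

This also illustrates why, depending on the sampling method, a sample can contain more observations than the population.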


1.3 Variables
1.3.1 Definition
• A variable is an attribute that describes a person, place, thing, or idea.
• The value of the variable can "vary" from one entity to another.

1.3.2 Classifications
(Figure. Source: Dr. Le Hoai Long’s lecture note)


1.3 Variables
1.3.2 Classifications
(Two figures. Source: Dr. Le Hoai Long’s lecture note)


1.3 Variables
1.3.2 Classifications
(Figure. Source: Dr. Le Hoai Long’s lecture note)

1.3.2 Classifications
• Qualitative / Quantitative variables
• Discrete / Continuous variables
• Identifier variables
• Ordinal variables


1.3 Variables
1.3.2 Classifications
• Qualitative (categorical) variables take on values that are names or labels.
  ◦ Ex: The color of a ball (e.g., red, green, blue) or the breed of a dog (e.g., collie, shepherd, terrier).
• Quantitative variables are numeric. They represent a measurable quantity.
  ◦ Ex: The population of a city → the number of people in the city, a measurable attribute of the city.

1.3.2 Classifications
• Experience ???
  ◦ Qualitative variable?
  ◦ Quantitative variable?


1.3 Variables
1.3.2 Classifications
• Variables that can only take on a finite (or countable) number of distinct values are called discrete variables.
• Variables that can take on any value between their minimum value and their maximum value are called continuous variables.

1.3.2 Classifications
• Project cost ???
  ◦ Discrete variable?
  ◦ Continuous variable?


1.3 Variables
1.3.2 Classifications
• Which of the following statements are true?
I. All variables can be classified as quantitative or categorical variables.
II. Categorical variables can be continuous variables.
III. Quantitative variables can be discrete variables.
(A) I only
(B) II only
(C) III only
(D) I and II
(E) I and III

1.3.2 Classifications
• Identifier variables are variables for which each individual receives a unique value.
  ◦ Ex: ID numbers, such as a student ID.
• Ordinal variables are variables that report order without natural units.
  ◦ Ex: professional.


1.4 Measures of data
1.4.1 Mean and median (measures of central tendency)
• To find the median, we arrange the observations in order from smallest to largest value.
  ◦ If there is an odd number of observations, the median is the middle value.
  ◦ If there is an even number of observations, the median is the average of the two middle values.
• The mean of a sample or a population is computed by adding all of the observations and dividing by the number of observations.

1.4.1 Mean and median (measures of central tendency)
Quiz:
• Suppose we draw a sample of five cement bags and measure their weights. They weigh 90 pounds, 90 pounds, 110 pounds, 120 pounds, and 130 pounds. Find the mean and median.
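
One way to check a hand calculation for the cement-bag quiz is Python's standard `statistics` module. This is a sketch for self-checking, with variable names chosen for the example.

```python
from statistics import mean, median

# Weights (in pounds) of the five cement bags from the quiz.
weights = [90, 90, 110, 120, 130]

sample_mean = mean(weights)      # sum of observations / number of observations
sample_median = median(weights)  # middle value of the ordered observations

print("mean:", sample_mean)
print("median:", sample_median)

# With an even number of observations, the median is the average
# of the two middle values.
even_median = median([90, 90, 110, 120])
print("median of four bags:", even_median)
```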


1.4 Measures of data
1.4.1 Mean and median (measures of central tendency)
• Population mean:
μ = ΣX / N
• Sample mean:
x̄ = Σx / n

1.4.1 Mean and median (measures of central tendency)
• The median may be a better indicator of the most typical value if a set of scores has an outlier.
  ◦ An outlier is an extreme value that differs greatly from other values.
• However, when the sample size is large and does not include outliers, the mean score usually provides a better measure of central tendency.


1.4 Measures of data
1.4.1 Mean and median (measures of central tendency)
Ex:
• Suppose we examine a sample of 10 households to estimate the typical family income. Nine of the households have incomes between $20,000 and $100,000, but the tenth household has an annual income of $1,000,000,000.
  ◦ What is the outlier?
  ◦ What are the mean and median?
  ◦ What do you think about your results?

1.4.1 Mean and median (measures of central tendency)
Question:
• What happens to the mean and median if we add a constant to every value, or multiply every value by a constant?
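
The question about adding or multiplying by a constant can be explored numerically. The five values and the constant c = 10 below are hypothetical; the sketch only lets you compare the results before and after the transformation.

```python
from statistics import mean, median

values = [20, 35, 50, 65, 80]  # hypothetical observations
c = 10                         # hypothetical constant

shifted = [x + c for x in values]  # add c to every value
scaled = [x * c for x in values]   # multiply every value by c

for label, data in [("original", values), ("shifted", shifted), ("scaled", scaled)]:
    print(label, "mean =", mean(data), "median =", median(data))
```

Comparing the three printed lines shows how both measures of center respond to each transformation.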


1.4 Measures of data
1.4.2 Range, variance, and standard deviation (measures of variability)
Range
• The range is the difference between the largest and smallest values in a set of values.
• Ex: Consider the following numbers: 2, 3, 5, 6, 6, 8, 10, 12. For this set of numbers, the range would be: ……..

1.4.2 Range, variance, and standard deviation (measures of variability)
Variance
• Variance of a population is the average squared deviation from the population mean:
σ² = Σ(Xi − μ)² / N
where:
• σ² is the population variance,
• μ is the population mean,
• Xi is the ith element from the population,
• and N is the number of elements in the population.

1.4 Measures of data
1.4.2 Range, variance, and standard deviation (measures of variability)
Variance
• Variance of a sample is defined by a slightly different formula, and uses slightly different notation:
s² = Σ(xi − x̄)² / (n − 1)
where:
• s² is the sample variance,
• x̄ is the sample mean,
• xi is the ith element from the sample,
• and n is the number of elements in the sample.
The sample variance can be considered an unbiased estimate of the true population variance.

1.4.2 Range, variance, and standard deviation (measures of variability)
Standard Deviation
• The standard deviation of a population is the square root of the variance:
σ = √σ² = √( Σ(Xi − μ)² / N )
• Statisticians often use simple random samples to estimate the standard deviation of a population, based on sample data → the best estimate of the standard deviation of a population:
s = √s² = √( Σ(xi − x̄)² / (n − 1) )
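
As a rough check on the formulas for range, variance, and standard deviation, the sketch below applies Python's standard `statistics` module to the numbers from the range example (2, 3, 5, 6, 6, 8, 10, 12). Treating the same eight numbers first as a whole population and then as a sample is an assumption made purely to contrast the N and n − 1 divisors.

```python
from statistics import pvariance, variance, pstdev, stdev

data = [2, 3, 5, 6, 6, 8, 10, 12]

data_range = max(data) - min(data)  # largest minus smallest value

# Population formulas: divide the sum of squared deviations by N.
pop_var = pvariance(data)
pop_sd = pstdev(data)

# Sample formulas: divide the sum of squared deviations by n - 1.
samp_var = variance(data)
samp_sd = stdev(data)

print("range:", data_range)
print("population variance:", pop_var, "population SD:", round(pop_sd, 3))
print("sample variance:", round(samp_var, 3), "sample SD:", round(samp_sd, 3))
```

Here the mean is 6.5 and the squared deviations sum to 80, so the population variance is 80/8 = 10 while the sample variance is 80/7 ≈ 11.43.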



1.4 Measures of data
1.4.2 Range, variance, and standard deviation (measures of variability)
Question:
• What happens to the mean and median if we add a constant to every value, or multiply every value by a constant?

1.4.3 Percentiles, Quartiles, and Standard Scores (measures of position)
Percentiles:
• The values that divide a rank-ordered set of elements into 100 equal parts are called percentiles.
• An element having a percentile rank of Pi would have a greater value than i% of all the elements in the set. → The observation at the 50th percentile would be denoted P50, and it would be greater than 50% of the observations in the set.


1.4 Measures of data
1.4.3 Percentiles, Quartiles, and Standard Scores (measures of position)
Quartiles:
• Quartiles divide a rank-ordered data set into four equal parts. The values that divide each part are called the first, second, and third quartiles; and they are denoted by Q1, Q2, and Q3, respectively.
• Ex: Consider a set of numbers: 1, 2, 3, 4, 5, 6, 7, 8. What are the quartiles of the set?

1.4.3 Percentiles, Quartiles, and Standard Scores (measures of position)
Question:
• What are the relationships among quartiles and percentiles?
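
The quartile example can be worked with a short function. Several conventions for computing quartiles exist and can give slightly different values; the sketch below assumes the "median of halves" convention (Q1 and Q3 are the medians of the lower and upper halves), and the function name `quartiles` is hypothetical.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd number of observations, leave the median out of both halves.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles([1, 2, 3, 4, 5, 6, 7, 8])
print(q1, q2, q3)
```

Under this convention Q2 is the median, which is also the 50th percentile; Q1 and Q3 play the role of the 25th and 75th percentiles.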


1.4 Measures of data
1.4.3 Percentiles, Quartiles, and Standard Scores (measures of position)
Standard Scores (Z-scores) → Standardize values
• A standard score (aka a z-score) indicates how many standard deviations an element is from the mean. A standard score can be calculated from the following formula.
z = (X − μ) / σ
where
• z is the z-score,
• X is the value of the element,
• μ is the mean of the population,
• and σ is the standard deviation.

1.4.3 Percentiles, Quartiles, and Standard Scores (measures of position)
Question:
• How do we interpret z-scores?
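
A minimal sketch of the z-score formula follows. The mean of 108 and standard deviation of 11 are hypothetical population values chosen for the example, and `z_score` is an illustrative helper name.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Hypothetical population: mean 108, standard deviation 11.
z_above = z_score(130, mu=108, sigma=11)  # value above the mean
z_at = z_score(108, mu=108, sigma=11)     # value equal to the mean
z_below = z_score(97, mu=108, sigma=11)   # value below the mean

print(z_above, z_at, z_below)
```

A positive z-score means the element is above the mean, a negative one means it is below, and zero means it equals the mean.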


1.4 Measures of data
The 68–95–99.7 Rule
• In 1733, Abraham de Moivre observed that, in many unimodal and symmetric distributions, about 68% of the values fall within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations of the mean → the 68–95–99.7 Rule.

1.5 Pattern of data
1.5.1 Center
• The center of a distribution is located at the median of the distribution.
• This is the point in a graphic display where about half of the observations are on either side.
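
The percentages in the 68–95–99.7 Rule can be checked against the normal model, the standard example of a unimodal, symmetric distribution. The sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); using the standard normal is an assumption made for the example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Share of values within k standard deviations of the mean, for k = 1, 2, 3.
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, share in coverage.items():
    print(f"within {k} SD: {share:.1%}")
```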


1.5 Pattern of data
1.5.2 Spread
• The spread of a distribution refers to the variability of the data.
• If the observations cover a wide range, the spread is larger. If the observations are clustered around a single value, the spread is smaller.

1.5.3 Shape
• Symmetry → a symmetric distribution can be divided into two parts so that each part is a mirror image of the other.
• Number of peaks.
  ◦ Distributions with one clear peak are called unimodal,
  ◦ and distributions with two clear peaks are called bimodal.
  ◦ When a symmetric distribution has a single peak at the center, it is referred to as bell-shaped.


1.5 Pattern of data
1.5.3 Shape
• Skewness → some distributions have many more observations on one side of the graph than the other.
  ◦ Distributions with fewer observations on the right (toward higher values) are said to be skewed right;
  ◦ and distributions with fewer observations on the left (toward lower values) are said to be skewed left.
• Uniform. When the observations in a set of data are equally spread across the range of the distribution, the distribution is called a uniform distribution. A uniform distribution has no clear peaks.

(Figure. Source: Dr. Le Hoai Long’s lecture note)


1.5 Pattern of data
1.5.4 Gap
• Gaps refer to areas of a distribution where there are no observations. The figure below has a gap.
(Figure. Source: Dr. Le Hoai Long’s lecture note)

1.5.5 Outlier
• Distributions are sometimes characterized by extreme values that differ greatly from the other observations → called outliers.
(Figure. Source: Dr. Le Hoai Long’s lecture note)


1.5 Pattern of data
1.5.5 Outlier (cont.)
• As a "rule of thumb", an extreme value is often considered to be an outlier if it is at least 1.5 interquartile ranges below the first quartile (Q1), or at least 1.5 interquartile ranges above the third quartile (Q3).
• The interquartile range (IQR) is equal to Q3 minus Q1.

1.6 Displaying and describing data
1.6.1 Bar Charts
A bar chart is made up of columns plotted on a graph.
• The columns are positioned over a label that represents a categorical variable.
• The height of the column indicates the size of the group defined by the column label.
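
The 1.5 × IQR rule of thumb from 1.5.5 can be sketched as follows. The data are hypothetical, the helper name `iqr_outliers` is illustrative, and the quartiles come from `statistics.quantiles` with its default method, which is only one of several quartile conventions.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return values at least 1.5 IQRs below Q1 or above Q3."""
    q1, _q2, q3 = quantiles(values, n=4)  # default quartile convention
    iqr = q3 - q1                          # interquartile range
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return [v for v in values if v <= lower_fence or v >= upper_fence]

# Hypothetical activity durations (days); 40 is far from the rest.
durations = [10, 12, 13, 14, 15, 16, 18, 40]
outliers = iqr_outliers(durations)
print(outliers)
```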


1.6 Displaying and describing data
1.6.2 Histograms
Like a bar chart, a histogram is made up of columns plotted on a graph. Usually, there is no space between adjacent columns. Here is how to read a histogram.
• The columns are positioned over a label that represents a continuous, quantitative variable.
• The column label can be a single value or a range of values.
• The height of the column indicates the size of the group defined by the column label.

1.6.2 Histograms (cont.)
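
To make the idea of column heights concrete, the sketch below counts how many observations fall into each range of values and prints a text histogram. The concrete strength values, the bin settings, and the function name `histogram_counts` are all hypothetical.

```python
def histogram_counts(values, start, width, bins):
    """Count observations in `bins` equal-width ranges beginning at `start`."""
    counts = [0] * bins
    for v in values:
        idx = int((v - start) // width)  # index of the range containing v
        if 0 <= idx < bins:
            counts[idx] += 1
    return counts

# Hypothetical concrete compressive strengths (MPa).
strengths = [22.5, 24.1, 25.0, 26.3, 27.8, 28.4, 29.9, 31.2, 33.0]
counts = histogram_counts(strengths, start=20, width=5, bins=3)

# Each row label is a range of values; the bar length is the group size.
for i, count in enumerate(counts):
    low = 20 + 5 * i
    print(f"[{low}, {low + 5}): {'#' * count}")
```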


1.6 Displaying and describing data
The Difference Between Bar Charts and Histograms
• With bar charts, each column represents a group defined by a categorical variable;
• and with histograms, each column represents a group defined by a continuous, quantitative variable.
→ It is appropriate to talk about the skewness of a histogram.
• How about a bar chart?

1.6.3 Pie Charts
Pie charts display all the cases as a circle whose slices have areas proportional to each category’s fraction of the whole.
• Pie charts give a quick impression of the distribution. Because we’re used to cutting up pies into 2, 4, or 8 pieces, pie charts are particularly good for seeing relative frequencies near 1/2, 1/4, or 1/8.


1.6 Displaying and describing data
1.6.3 Pie Charts
• Bar charts are almost always better than pie charts for comparing the relative frequencies of categories.
• Pie charts are widely understood and colorful, and they often appear in reports → problem??

1.6.4 Dotplot
A dotplot is made up of dots plotted on a graph.
• Each dot can represent a single observation or a specified number of observations.
• The dots are stacked in a column over a category.
• If the categories are quantitative, the pattern of data in a dotplot can be described in terms of symmetry and skewness.
• Dotplots are used most often to plot frequency counts within a small number of categories, usually with small sets of data.


1.6 Displaying and describing data
1.6.4 Dotplot

1.6.5 Stemplots
A stemplot (aka stem-and-leaf plot) is a type of chart that shows how individual values are distributed within a set of data.
• A stemplot is used to display quantitative data, generally from small data sets (50 or fewer observations).
• The entries on the left are called stems; and the entries on the right are called leaves.
• Stemplots usually do not include explicit labels for the stems and leaves.
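
A stemplot is simple enough to build by hand, and the sketch below shows one possible construction. It assumes non-negative whole numbers, with the tens digit(s) as the stem and the units digit as the leaf; the data and the function name `stemplot` are hypothetical. Stems with no observations are skipped here, although a full stemplot would normally list them to show gaps.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows: tens digit(s) as stem, units digit as leaf."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 23 -> stem 2, leaf 3
        stems[stem].append(leaf)
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]

# Hypothetical numbers of workers on site over nine days.
rows = stemplot([12, 15, 21, 23, 23, 28, 30, 34, 41])
print("\n".join(rows))
```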


1.6 Displaying and describing data
1.6.5 Stemplots (cont.)

1.6.6 Boxplot Basics
A boxplot splits the data set into quartiles.
• The body of the boxplot consists of a "box", which goes from the first quartile (Q1) to the third quartile (Q3).
• Within the box, a horizontal line is drawn at Q2, the median of the data set.
• Two vertical lines, called whiskers, extend from the top and bottom of the box. The top whisker goes from Q3 to the largest non-outlier in the data set, and the bottom whisker goes from Q1 to the smallest non-outlier.
• If the data set includes one or more outliers, they are plotted separately as points on the chart.

1.6 Displaying and describing data
1.6.6 Boxplot Basics (cont.)

1.7 Tables
Data can be presented in table form
• One-way table
• Two-way table


1.7 Tables
1.7.1 One-way tables
A one-way table is the tabular equivalent of a bar chart. Like a bar chart, a one-way table displays categorical data in the form of frequency counts and/or relative frequencies.
• Frequency tables: a one-way table shows frequency counts for the particular categories of a categorical variable.
• Relative frequency tables: a one-way table shows relative frequencies for the particular categories of a categorical variable.

1.7.2 Two-way tables
A two-way table (also called a contingency table) is a useful tool for examining relationships between categorical variables.
• The entries in the cells of a two-way table can be frequency counts or relative frequencies, just like a one-way table.
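
One-way and two-way tables can both be built by counting. The sketch below uses `collections.Counter` on a small hypothetical data set of six contracts, each with a type and a quality rating; all names and values are assumptions made for the example.

```python
from collections import Counter

# Hypothetical records: one contract type and one quality rating per contract.
contract_types = ["Civil", "Civil", "Industrial", "Civil", "Industrial", "Civil"]
quality = ["Good", "Poor", "Good", "Good", "Good", "Poor"]

# One-way frequency table: counts per category of one variable.
freq = Counter(contract_types)

# One-way relative frequency table: counts divided by the total.
rel_freq = {cat: count / len(contract_types) for cat, count in freq.items()}

# Two-way (contingency) table: counts per combination of two variables.
two_way = Counter(zip(contract_types, quality))

print(dict(freq))
print(rel_freq)
for (ctype, rating), count in sorted(two_way.items()):
    print(ctype, rating, count)
```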

1.7 Tables
1.7.3 Simpson’s paradox
Simpson's paradox (or the Yule-Simpson effect) is a paradox in which a correlation present in different groups is reversed when the groups are combined.
• It occurs when frequency data are hastily given causal interpretations.
• Simpson's paradox disappears when causal relations are brought into consideration (Wikipedia).

1.7.3 Simpson’s paradox (cont.)
• Consider the situation of two contractors in the table below (Good quality/number of contracts).
• Who is better? (Long N.D. 2010)

                Type of contract
                Civil            Industrial       Total
Contractor A    40/60 (66.7%)    13/15 (86.7%)    53/75 (70.7%)
Contractor B    5/8 (62.5%)      42/50 (84%)      47/58 (81%)
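
The reversal in the contractor table can be verified directly from the counts. The sketch below recomputes the good-quality rates per contract type and overall; the dictionary layout and names are illustrative, while the counts are taken from the table.

```python
# (good-quality contracts, total contracts) per contractor and contract type.
records = {
    "A": {"Civil": (40, 60), "Industrial": (13, 15)},
    "B": {"Civil": (5, 8), "Industrial": (42, 50)},
}

# Good-quality rate within each contract type.
by_type = {
    contractor: {ctype: good / total for ctype, (good, total) in types.items()}
    for contractor, types in records.items()
}

# Overall rate: pool the counts across contract types before dividing.
overall = {
    contractor: sum(g for g, _ in types.values()) / sum(t for _, t in types.values())
    for contractor, types in records.items()
}

for contractor in records:
    rates = ", ".join(f"{t}: {r:.1%}" for t, r in by_type[contractor].items())
    print(f"Contractor {contractor}: {rates}, Total: {overall[contractor]:.1%}")
```

Contractor A has the higher rate within each contract type, yet Contractor B has the higher pooled rate, because most of B's contracts are of the type with the higher success rate.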

1.8 Comparing distributions


Focus on four features when you compare two or more data sets.
 Center.
 Spread.
 Shape.
 Unusual features (gaps and outliers).
