Part 1: Exploring and Understanding Data

6/1/2020
Ho Chi Minh City University of Technology – Bach Khoa

Applied Statistics in Construction Management

Lecturer: Nguyen Hoai Nghia (Jack), Ph.D.
Email: nhnghia@hcmiu.edu.vn
nghianew@yahoo.com

Lecturer: Nguyen Hoai Nghia, Ph.D.
▪ Ph.D., Management Technology, SIIT, Thammasat University, Thailand (graduated in 2018).
▪ M.Eng., Construction Technology and Management, HCMC Polytechnic University, HCMC, Vietnam (graduated in 2008).
▪ B.Eng., Civil Engineering, HCMC Polytechnic University, HCMC, Vietnam (graduated in 2002).
Research Areas
▪ Project Appraisal.
▪ Construction Cost/Finance Management.
▪ Construction Procurement Management.
▪ Construction Schedule Management.
▪ Real Estate Development.

Course Objectives
▪ To master the basic concepts in statistics.
▪ To understand the tools and methods commonly used in statistics.
▪ To know how to use at least one statistical software package.
▪ To apply that knowledge to survey, analyze, evaluate, and manage systems and processes in the construction industry.
Grading Policy
▪ Assignments + Term project: 40%
▪ Software practice: 10%
▪ Final exam: 50%
Note:
▪ Class attendance → Compulsory

Textbooks
▪ De Veaux, R. D., Velleman, P. F., & Bock, D. E., Intro Stats, 5th Edition, Pearson Education Inc., USA, 2017.
▪ Dalgaard, P., Introductory Statistics with R, 2nd Edition, Springer, USA, 2008.
▪ Hair, Jr., J. F., Anderson, R. E., Tatham, R. L., & Black, W. C., Multivariate Data Analysis, 8th Edition, McGraw-Hill, USA, 2019.
▪ Benjamin, J. R. & Cornell, C. A., Probability, Statistics, and Decision for Civil Engineers, 3rd Edition, McGraw-Hill, USA, 1970.
▪ Hoàng Trọng & Chu Nguyễn Mộng Ngọc, Thống kê ứng dụng trong kinh tế xã hội (Applied Statistics in Socio-Economics), Nhà xuất bản Thống kê (Statistical Publishing House), 2015.
Content
▪ Part 1: Exploring and understanding data
➢ 1.1 What are statistics?
➢ 1.2 Populations and samples
➢ 1.3 Variables
➢ 1.4 Measures of data
➢ 1.5 Pattern of data
➢ 1.6 Displaying and describing data
➢ 1.7 Tables
➢ 1.8 Comparing distributions
▪ Part 2: Exploring relationships between variables
➢ 2.1 Scatter plots
➢ 2.2 Correlation
➢ 2.3 Linear regression
➢ 2.4 Multiple regression
▪ Part 3: Data collection
➢ 3.1 Three big ideas of sampling
➢ 3.2 Methods of data collection
➢ 3.3 Scales of measurement
➢ 3.4 Sample surveys
▪ Part 4: Probability
➢ 4.1 Probability
➢ 4.2 Variables
➢ 4.3 Sampling distributions
▪ Part 5: Inference for relationships
➢ 5.1 Confidence intervals
➢ 5.2 Hypothesis testing
➢ 5.3 Statistical Significance

1.1 What are statistics?
▪ Statistics helps us to make sense of the world described by our data by seeing past the underlying variation to find patterns and relationships.
▪ Statistics are particular calculations made from data.
1.1 What are statistics? (cont.)
▪ Data vs. information
Data: Data is raw, unorganized facts that need to be processed. Data can be something simple and seemingly random and useless until it is organized.
Ex: Each student's test score is one piece of data.
Information: When data is processed, organized, structured, or presented in a given context so as to make it useful, it is called information.
Ex: The average score of a class or of the entire school is information that can be derived from the given data.

1.2 Populations and samples
▪ 1.2.1 Populations and samples
▪ 1.2.2 Differences among populations and samples
▪ 1.2.3 Simple random sampling
▪ 1.2.4 Sampling with and without replacements
1.2.1 Populations and samples
▪ Both are data sets.
▪ A population includes all of the elements from a set of data.
▪ A sample consists of one or more observations drawn from the population.

1.2.2 Differences among populations and samples
▪ Depending on the sampling method, a sample can have fewer observations than the population, the same number of observations, or more observations.
▪ More than one sample can be derived from the same population.
▪ A measurable characteristic of a population (a mean, a standard deviation …) is called a parameter; but a measurable characteristic of a sample is called a statistic.
▪ The formulas and symbols used for parameters and for statistics are mostly different.
1.2.3 Simple random sampling
▪ A sampling method is a procedure for selecting sample elements from a population.
▪ Simple random sampling refers to a sampling method that has the following properties.
➢ The population consists of N objects.
➢ The sample consists of n objects.
➢ All possible samples of n objects are equally likely to occur.

1.2.3 Simple random sampling (cont.)
▪ There are many ways to obtain a simple random sample.
▪ One way would be the lottery method.
➢ Each of the N population members is assigned a unique number.
➢ The numbers are placed in a bowl and thoroughly mixed.
➢ Then, a blind-folded researcher selects n numbers. Population members having the selected numbers are included in the sample.
1.2.4 Sampling with and without replacements
▪ When a population element can be selected more than one time, we are sampling with replacement.
▪ When a population element can be selected only one time, we are sampling without replacement.
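The lottery method and the with/without replacement distinction can be sketched in a few lines of Python. This is only an illustration: the population `contractors`, the function names, and the seed are hypothetical, and the standard `random` module stands in for the bowl of numbers.

```python
import random


def lottery_sample(population, n, seed=None):
    """Simple random sample of n members, drawn WITHOUT replacement."""
    rng = random.Random(seed)
    # Assign each of the N population members a unique number (its index).
    numbers = list(range(len(population)))
    # Mix the bowl and draw n distinct numbers.
    drawn = rng.sample(numbers, n)
    # Members holding the drawn numbers form the sample.
    return [population[i] for i in drawn]


def sample_with_replacement(population, n, seed=None):
    """Draw n members WITH replacement: the same member may appear twice."""
    rng = random.Random(seed)
    return rng.choices(population, k=n)


# Hypothetical population of N = 6 contractors.
contractors = ["C1", "C2", "C3", "C4", "C5", "C6"]

without_repl = lottery_sample(contractors, 3, seed=1)
with_repl = sample_with_replacement(contractors, 8, seed=1)

print(without_repl)  # three distinct contractors
print(with_repl)     # eight draws, so repeats are unavoidable
```

Drawing 8 observations from a population of 6 is possible only with replacement, which is why a sample can end up with more observations than the population.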
1.3 Variables
1.3.1 Definition
▪ A variable is an attribute that describes a person, place, thing, or idea.
▪ The value of the variable can "vary" from one entity to another.

1.3.2 Classifications
Source: Dr. Le Hoai Long’s lecture note
1.3.2 Classifications (cont.)
▪ Qualitative / Quantitative variables
▪ Discrete / Continuous variables
▪ Identifier variables
▪ Ordinal variables
Source: Dr. Le Hoai Long’s lecture note
▪ Qualitative (categorical) variables take on values that are names or labels.
➢ Ex: The color of a ball (e.g., red, green, blue) or the breed of a dog (e.g., collie, shepherd, terrier).
▪ Quantitative variables are numeric. They represent a measurable quantity.
➢ Ex: The population of a city → the number of people in the city, a measurable attribute of the city.

Question: Experience?
▪ Qualitative variable?
▪ Quantitative variable?

▪ Variables that can only take on a finite (or countable) number of values are called discrete variables.
▪ Variables that can take on any value between their minimum value and their maximum value are called continuous variables.

Question: Project cost?
▪ Discrete variable?
▪ Continuous variable?
Question:
▪ Which of the following statements are true?
I. All variables can be classified as quantitative or categorical variables.
II. Categorical variables can be continuous variables.
III. Quantitative variables can be discrete variables.
(A) I only
(B) II only
(C) III only
(D) I and II
(E) I and III

▪ Identifier variables are variables for which each individual receives a unique value.
➢ Ex: ID numbers, such as a student ID.
▪ Ordinal variables are variables that report order without natural units.
➢ Ex: professional.
1.4 Measures of data
1.4.1 Mean and median (measures of central tendency)
▪ To find the median, we arrange the observations in order from smallest to largest value.
➢ If there is an odd number of observations, the median is the middle value.
➢ If there is an even number of observations, the median is the average of the two middle values.
▪ The mean of a sample or a population is computed by adding all of the observations and dividing by the number of observations.

Quiz:
▪ Suppose we draw a sample of five cement bags and measure their weights. They weigh 90 pounds, 90 pounds, 110 pounds, 120 pounds, and 130 pounds. Find the mean and median.
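As a rough check on the cement bag quiz, the definitions of the median and mean can be applied directly. The helper name `median_by_hand` is made up for this sketch; the weights are the five values from the quiz.

```python
from statistics import mean, median

weights = [90, 90, 110, 120, 130]  # pounds, from the quiz


def median_by_hand(values):
    """Median as defined on the slide: sort, then take the middle value(s)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd number of observations: the middle value.
        return ordered[mid]
    # Even number of observations: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


sample_mean = sum(weights) / len(weights)  # add all observations, divide by n
sample_median = median_by_hand(weights)

print(sample_mean, sample_median)  # 108.0 110
```

The standard library functions `statistics.mean` and `statistics.median` give the same two values for this sample.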
▪ Population mean:
$$\mu = \frac{\sum X}{N}$$
▪ Sample mean:
$$\bar{x} = \frac{\sum x}{n}$$

▪ The median may be a better indicator of the most typical value if a set of scores has an outlier.
➢ An outlier is an extreme value that differs greatly from other values.
▪ However, when the sample size is large and does not include outliers, the mean score usually provides a better measure of central tendency.

Ex:
▪ Suppose we examine a sample of 10 households to estimate the typical family income. Nine of the households have incomes between $20,000 and $100,000; but the tenth household has an annual income of $1,000,000,000.
➢ What is the outlier?
➢ What are the mean and median?
➢ What do you think about your results?

Question:
▪ What happens to the mean and median if we add a constant to every value, or multiply every value by a constant?
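One way to explore both the household example and the add/multiply question is to try them on numbers. In the sketch below the nine ordinary incomes are hypothetical (the slide gives only their range), and the constants `a` and `b` are arbitrary choices.

```python
from statistics import mean, median

# Hypothetical incomes in dollars: nine ordinary households plus one extreme value.
ordinary = [20_000, 30_000, 40_000, 50_000, 60_000,
            70_000, 80_000, 90_000, 100_000]
incomes = ordinary + [1_000_000_000]

mean_income = mean(incomes)      # pulled far upward by the single outlier
median_income = median(incomes)  # stays among the ordinary households

# Hypothetical constants for the add/multiply question.
b = 5_000  # added to every value
a = 2      # multiplied into every value

shifted = [x + b for x in ordinary]
scaled = [a * x for x in ordinary]

print(mean_income, median_income)
print(mean(shifted) - mean(ordinary), median(shifted) - median(ordinary))
print(mean(scaled) / mean(ordinary), median(scaled) / median(ordinary))
```

With these numbers the mean lands above one hundred million dollars while the median stays at 65,000, and both the mean and the median move by exactly `b` when `b` is added and are multiplied by exactly `a` when every value is scaled.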
1.4.2 Range, variance, and standard deviation (measures of variability)
Range
▪ The range is the difference between the largest and smallest values in a set of values.
▪ Ex: Consider the following numbers: 2, 3, 5, 6, 6, 8, 10, 12. For this set of numbers, the range would be: ……..

Variance
▪ Variance of a population is the average squared deviation from the population mean:
$$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$$
where:
• $\sigma^2$ is the population variance,
• $\mu$ is the population mean,
• $X_i$ is the ith element from the population,
• and $N$ is the number of elements in the population.

1.4.2 Range, variance, and standard deviation (measures of variability)
Variance
▪ Variance of a sample is defined by a slightly different formula, and uses a slightly different notation:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
where:
• $s^2$ is the sample variance,
• $\bar{x}$ is the sample mean,
• $x_i$ is the ith element from the sample,
• and $n$ is the number of elements in the sample.
The sample variance can be considered an unbiased estimate of the true population variance.

Standard Deviation
▪ The standard deviation of a population is the square root of the variance:
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X_i - \mu)^2}{N}}$$
▪ Statisticians often use simple random samples to estimate the standard deviation of a population, based on sample data → the best estimate of the standard deviation of a population:
$$s = \sqrt{s^2} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$
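The range, variance, and standard deviation formulas can be written out almost literally in code. The sketch below reuses the eight numbers from the range example; the function names are illustrative, and the standard library `statistics` functions are used only as a cross-check.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

data = [2, 3, 5, 6, 6, 8, 10, 12]  # numbers from the range example

# Range: largest value minus smallest value.
data_range = max(data) - min(data)


def population_variance(xs):
    """Average squared deviation from the mean (divide by N)."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def sample_variance(xs):
    """Same sum of squared deviations, but divided by n - 1."""
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)


sigma2 = population_variance(data)
s2 = sample_variance(data)
sigma = math.sqrt(sigma2)  # population standard deviation
s = math.sqrt(s2)          # sample standard deviation

print(data_range, sigma2, s2, sigma, s)
```

Whether to divide by N or by n − 1 depends on whether the eight numbers are treated as the whole population or as a sample from it; the two results differ noticeably for a data set this small.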
1.4.2 Range, variance, and standard deviation (measures of variability)
Question:
▪ What happens to the mean and median if we add a constant to every value, or multiply every value by a constant?

1.4.3 Percentiles, Quartiles, and Standard Scores (measures of position)
Percentiles:
▪ The values that divide a rank-ordered set of elements into 100 equal parts are called percentiles.
▪ An element having a percentile rank of Pi would have a greater value than i% of all the elements in the set. → The observation at the 50th percentile would be denoted P50, and it would be greater than 50% of the observations in the set.

1.4.3 Percentiles, Quartiles, and Standard Scores (measures of position)
Quartiles:
▪ Quartiles divide a rank-ordered data set into four equal parts. The values that divide each part are called the first, second, and third quartiles; and they are denoted by Q1, Q2, and Q3, respectively.
▪ Ex: Consider a set of numbers: 1, 2, 3, 4, 5, 6, 7, 8. What are the quartiles of the set?

Question:
▪ What are the relationships among quartiles and percentiles?
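A minimal sketch of one common way to find quartiles is shown below: split the ordered data at the median and take the median of each half. This "median of halves" convention is an assumption of the sketch, not necessarily the one used in the course software; other conventions interpolate differently and can give slightly different Q1 and Q3.

```python
from statistics import median, quantiles


def quartiles_by_halves(values):
    """Q1, Q2, Q3 using the median-of-halves convention (an assumption here)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd number of observations, leave the middle value out of both halves.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


data = [1, 2, 3, 4, 5, 6, 7, 8]
q1, q2, q3 = quartiles_by_halves(data)
print(q1, q2, q3)  # 2.5 4.5 6.5

# The standard library interpolates instead; its Q1 and Q3 need not match exactly.
print(quantiles(data, n=4))
```

Under any convention Q2 is the median, which is also the 50th percentile; Q1 and Q3 correspond to the 25th and 75th percentiles.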
1.4.3 Percentiles, Quartiles, and Standard Scores (measures of position)
Standard Scores (Z-scores) → Standardized values
▪ A standard score (aka, a z-score) indicates how many standard deviations an element is from the mean. A standard score can be calculated from the following formula.
$$z = \frac{X - \mu}{\sigma}$$
where
• $z$ is the z-score,
• $X$ is the value of the element,
• $\mu$ is the mean of the population,
• and $\sigma$ is the standard deviation.

Question:
▪ How to interpret z-scores?
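The z-score formula translates into a one-line function. The numbers used below (a mean of 100 and a standard deviation of 10) are hypothetical and chosen only to make the interpretation easy to read off.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


# Hypothetical population: mean 100, standard deviation 10.
print(z_score(120, 100, 10))  # 2.0  -> two standard deviations above the mean
print(z_score(85, 100, 10))   # -1.5 -> one and a half standard deviations below
print(z_score(100, 100, 10))  # 0.0  -> exactly at the mean
```

A positive z-score means the element is above the mean, a negative one means it is below, and the magnitude says how unusual the value is in units of standard deviation.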
The 68–95–99.7 Rule
▪ In 1733, Abraham de Moivre found that, for many unimodal and symmetric distributions, about 68% of the values fall within one standard deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard deviations of the mean → the 68–95–99.7 Rule

1.5 Pattern of data
1.5.1 Center
▪ The center of a distribution is located at the median of the distribution.
▪ This is the point in a graphic display where about half of the observations are on either side.
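The three percentages in the 68–95–99.7 Rule can be reproduced for an exactly Normal model, which is the assumption behind this sketch; real unimodal, symmetric data will only match approximately. It uses `statistics.NormalDist` from the Python standard library, and the helper name `share_within` is made up here.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Share of a Normal model lying within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```

The printed shares round to about 68%, 95%, and 99.7%, which is where the rule gets its name.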
1.5.2 Spread
▪ The spread of a distribution refers to the variability of the data.
▪ If the observations cover a wide range, the spread is larger. If the observations are clustered around a single value, the spread is smaller.

1.5.3 Shape
▪ Symmetry → a symmetric distribution can be divided into two parts so that each part is a mirror image of the other.
▪ Number of peaks.
➢ Distributions with one clear peak are called unimodal,
➢ and distributions with two clear peaks are called bimodal.
➢ When a symmetric distribution has a single peak at the center, it is referred to as bell-shaped.

1.5.3 Shape (cont.)
▪ Skewness → some distributions have many more observations on one side of the graph than the other.
➢ Distributions with fewer observations on the right (toward higher values) are said to be skewed right;
➢ and distributions with fewer observations on the left (toward lower values) are said to be skewed left.
▪ Uniform. When the observations in a set of data are equally spread across the range of the distribution, the distribution is called a uniform distribution. A uniform distribution has no clear peaks.

1.5.4 Gap
▪ Gaps refer to areas of a distribution where there are no observations. The figure below has a gap.
Source: Dr. Le Hoai Long’s lecture note

1.5.5 Outlier
▪ Distributions are sometimes characterized by extreme values that differ greatly from the other observations → called outliers.
Source: Dr. Le Hoai Long’s lecture note
1.5.5 Outlier (cont.)
▪ As a "rule of thumb", an extreme value is often considered to be an outlier if it is at least 1.5 interquartile ranges below the first quartile (Q1), or at least 1.5 interquartile ranges above the third quartile (Q3).
▪ The interquartile range (IQR) is equal to Q3 minus Q1.

1.6 Displaying and describing data
1.6.1 Bar Charts
A bar chart is made up of columns plotted on a graph.
▪ The columns are positioned over a label that represents a categorical variable.
▪ The height of the column indicates the size of the group defined by the column label.
1.6.2 Histograms
Like a bar chart, a histogram is made up of columns plotted on a graph. Usually, there is no space between adjacent columns. Here is how to read a histogram.
▪ The columns are positioned over a label that represents a continuous, quantitative variable.
▪ The column label can be a single value or a range of values.
▪ The height of the column indicates the size of the group defined by the column label.
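To make the idea of column labels and column heights concrete, the sketch below groups a quantitative variable into equal-width ranges and counts the observations in each. The strength values, the bin width, and the function name `histogram_counts` are all hypothetical.

```python
def histogram_counts(values, bin_width, start):
    """Count observations per bin [left, left + bin_width)."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)
        left = start + index * bin_width
        label = (left, left + bin_width)  # the column label is a range of values
        counts[label] = counts.get(label, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical concrete strengths in MPa.
strengths = [21, 24, 25, 27, 28, 28, 30, 31, 33, 38]
counts = histogram_counts(strengths, bin_width=5, start=20)

# Text version of the histogram: one '#' per observation in the bin.
for (low, high), count in counts.items():
    print(f"{low}-{high}: " + "#" * count)
```

Each printed row plays the role of one column: the label is the range of values and the number of `#` marks is the height. Bins with no observations are simply absent from `counts` in this sketch.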
The Difference Between Bar Charts and Histograms
▪ With bar charts, each column represents a group defined by a categorical variable;
▪ and with histograms, each column represents a group defined by a continuous, quantitative variable.
→ It is appropriate to talk about the skewness of a histogram.
▪ How about a bar chart?

1.6.3 Pie Charts
Pie charts display all the cases as a circle whose slices have areas proportional to each category’s fraction of the whole.
▪ Pie charts give a quick impression of the distribution. Because we’re used to cutting up pies into 2, 4, or 8 pieces, pie charts are particularly good for seeing relative frequencies near 1/2, 1/4, or 1/8.
1.6.3 Pie Charts (cont.)
▪ Bar charts are almost always better than pie charts for comparing the relative frequencies of categories.
▪ Pie charts are widely understood and colorful, and they often appear in reports → problem?

1.6.4 Dotplot
A dotplot is made up of dots plotted on a graph.
▪ Each dot can represent a single observation or a specified number of observations.
▪ The dots are stacked in a column over a category.
▪ If the categories are quantitative, the pattern of data in a dotplot can be described in terms of symmetry and skewness.
▪ Dotplots are used most often to plot frequency counts within a small number of categories, usually with small sets of data.
1.6.5 Stemplots
A stemplot (aka, stem and leaf plot) is a type of chart that shows how individual values are distributed within a set of data.
▪ A stemplot is used to display quantitative data, generally from small data sets (50 or fewer observations).
▪ The entries on the left are called stems; and the entries on the right are called leaves.
▪ Stemplots usually do not include explicit labels for the stems and leaves.
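A stemplot is simple enough to build by hand in code. The sketch below assumes non-negative whole numbers, uses the tens digit as the stem and the units digit as the leaf, and works on a small hypothetical data set; the function name `stemplot` is illustrative.

```python
from collections import defaultdict


def stemplot(values):
    """Return stemplot rows for non-negative integers (stem = tens, leaf = units)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    rows = []
    # Include every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        rows.append(f"{stem} | {leaves}")
    return rows


# Hypothetical small data set.
for row in stemplot([12, 15, 21, 21, 27, 34, 48, 49]):
    print(row)
```

Stems are on the left of the bar and leaves on the right, so the row `2 | 117` stands for the values 21, 21, and 27. Keeping empty stems in the output makes gaps in the data visible.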
1.6.6 Boxplot Basics
A boxplot splits the data set into quartiles.
▪ The body of the boxplot consists of a "box", which goes from the first quartile (Q1) to the third quartile (Q3).
▪ Within the box, a horizontal line is drawn at Q2, the median of the data set.
▪ Two vertical lines, called whiskers, extend from the top and bottom of the box. The upper whisker goes from Q3 to the largest non-outlier in the data set, and the lower whisker goes from Q1 to the smallest non-outlier.
▪ If the data set includes one or more outliers, they are plotted separately as points on the chart.
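The numbers needed to draw a boxplot, including the 1.5 × IQR rule of thumb for outliers from section 1.5.5, can be collected as in the sketch below. The data are hypothetical, the quartiles use the median-of-halves convention (an assumption; other conventions exist), and the function names are made up for illustration.

```python
from statistics import median


def quartiles(values):
    """Q1, Q2, Q3 by the median-of-halves convention (an assumption here)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


def boxplot_summary(values):
    """Box, whisker ends, and outliers using the 1.5 * IQR rule of thumb."""
    q1, q2, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # "At least 1.5 IQR" beyond a quartile counts as an outlier.
    outliers = sorted(v for v in values if v <= low_fence or v >= high_fence)
    inside = [v for v in values if low_fence < v < high_fence]
    return {
        "q1": q1, "median": q2, "q3": q3, "iqr": iqr,
        "lower_whisker": min(inside),  # smallest non-outlier
        "upper_whisker": max(inside),  # largest non-outlier
        "outliers": outliers,
    }


# Hypothetical data with one extreme value.
summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(summary)
```

For this data set the box runs from 2.5 to 7.5, the whiskers end at 1 and 8, and the value 30 is plotted separately as an outlier.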
1.7 Tables
Data can be presented in table form
▪ One-way table
▪ Two-way table
1.7.1 One way tables
A one-way table is the tabular equivalent of a bar chart. Like a bar chart, a one-way table displays categorical data in the form of frequency counts and/or relative frequencies.
▪ Frequency Tables: a one-way table shows frequency counts for a particular category of a categorical variable.
▪ Relative Frequency Tables: a one-way table shows relative frequencies for particular categories of a categorical variable.

1.7.2 Two way tables
A two-way table (also called a contingency table) is a useful tool for examining relationships between categorical variables.
▪ The entries in the cells of a two-way table can be frequency counts or relative frequencies, just like a one-way table.
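One-way and two-way tables can both be built by counting. In the sketch below the records are hypothetical pairs of (contractor, contract type), and `collections.Counter` from the Python standard library does the tallying.

```python
from collections import Counter

# Hypothetical records: (contractor, type of contract).
records = (
    [("A", "Civil")] * 4
    + [("B", "Civil")] * 2
    + [("A", "Industrial")] * 1
    + [("B", "Industrial")] * 3
)

# One-way frequency table for the categorical variable "type of contract".
frequency = Counter(contract_type for _, contract_type in records)

# One-way relative frequency table: each count divided by the total.
total = len(records)
relative = {category: count / total for category, count in frequency.items()}

# Two-way (contingency) table: one cell per (contractor, type) combination.
two_way = Counter(records)

print(dict(frequency))
print(relative)
for (contractor, contract_type), count in sorted(two_way.items()):
    print(contractor, contract_type, count)
```

The one-way table answers "how many contracts of each type?", while the two-way table keeps both variables and so allows the types to be compared across contractors.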
1.7.3 Simpson’s paradox
Simpson's paradox (or the Yule-Simpson effect) is a paradox in which a correlation present in different groups is reversed when the groups are combined.
▪ It occurs when frequency data are hastily given causal interpretations.
▪ Simpson's Paradox disappears when causal relations are brought into consideration (Wikipedia)

1.7.3 Simpson’s paradox (cont.)
▪ Consider the situation of two contractors in the table below (Good quality/number of contracts)
▪ Who is better? (Long N.D. 2010)

                Type of contract
                Civil            Industrial       Total
Contractor A    40/60 (66.7%)    13/15 (86.7%)    53/75 (70.7%)
Contractor B    5/8 (62.5%)      42/50 (84%)      47/58 (81%)
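The reversal in the contractor table can be verified by recomputing the rates. The sketch below uses the counts from the table (good-quality contracts, total contracts); the dictionary layout and function names are illustrative choices.

```python
# (good-quality contracts, total contracts) from the table above.
results = {
    "A": {"Civil": (40, 60), "Industrial": (13, 15)},
    "B": {"Civil": (5, 8), "Industrial": (42, 50)},
}


def rate(contractor, contract_type):
    """Good-quality rate for one contractor within one type of contract."""
    good, total = results[contractor][contract_type]
    return good / total


def overall_rate(contractor):
    """Good-quality rate after combining both types of contract."""
    good = sum(g for g, _ in results[contractor].values())
    total = sum(t for _, t in results[contractor].values())
    return good / total


for contract_type in ("Civil", "Industrial"):
    print(contract_type, round(rate("A", contract_type), 3), round(rate("B", contract_type), 3))
print("Total", round(overall_rate("A"), 3), round(overall_rate("B"), 3))
```

Contractor A has the higher rate within each type of contract, yet Contractor B has the higher combined rate, because most of B's contracts are industrial, where both contractors score well.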
1.8 Comparing distributions

Focus on four features when you compare two or more data sets.
▪ Center.
▪ Spread.
▪ Shape.
▪ Unusual features (gaps and outliers).
