
Module 2: Descriptive Statistics
Answers to the Learning Activities
ESci 117- Engineering Data Analysis
Instructor: Meralyn R. Lebante
Lesson 2.1 Organization of Data
Key points:
• Numerically measured (measurement) data: discrete (e.g., shoe size,
year level, count) or continuous (e.g., weight in kg, height in m, age in
years)
Note: Categories can also be discrete (e.g., a 5-point rating scale of 1, 2, 3, 4, or 5) or
continuous (e.g., income classes).
• Discrete data: organized/displayed using a line diagram and histogram
• Continuous data:
o organized/displayed using a dotplot, histogram, and stem-and-leaf
display (stem-and-leaf plot)
o construction of a histogram with equal class widths and unequal class
widths

• Histogram shapes (for both discrete and continuous data): unimodal, which
may be (a) symmetric, (c) positively skewed, or (d) negatively skewed; or
bimodal
Learning Activity
Consider the following information from Mendenhall and Sincich (1984):
Bacteria are the most important component of microbial ecosystems in
sewage treatment plants. Water management engineers must know the
percentage of active bacteria at each stage of the sewage treatment. The
following data represent the percentages of respiring bacteria in twenty-five
raw sewage samples collected from a sewage plant (lowest and highest
values highlighted). (Measurements were obtained by passing the samples
through a membrane filter and staining with a chlorination solution.)
42.3 50.6 41.8 36.5 28.6
40.7 48.1 48.0 45.7 39.9

31.3 30.7 40.5 40.9 51.2

38.6 35.6 22.9 33.4 46.5

41.5 43.5 41.1 38.5 44.4


A. SALD
Stem: tens digit; H-high, L-low
Leaf: ones and tenths digits (Unit = 0.1)

2L 29
2H 86
3L 13 07 34
3H 86 56 65 85 99
4L 23 07 15 35 18 05 11 09 44
4H 81 80 57 65
5L 06 12

The values tend to cluster around the 40 to 45% levels.
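The split-stem display above can be reproduced with a short Python sketch. The function name `split_stem_display` is hypothetical, and sorting the leaves within each row is a choice made for this illustration (the display above lists leaves in the order they were read from the data).

```python
from collections import defaultdict

data = [42.3, 50.6, 41.8, 36.5, 28.6,
        40.7, 48.1, 48.0, 45.7, 39.9,
        31.3, 30.7, 40.5, 40.9, 51.2,
        38.6, 35.6, 22.9, 33.4, 46.5,
        41.5, 43.5, 41.1, 38.5, 44.4]

def split_stem_display(values):
    """Group values by tens digit, splitting each stem into L (low) and H (high)."""
    rows = defaultdict(list)
    for v in values:
        tenths = round(v * 10)            # 42.3 -> 423 (Unit = 0.1)
        stem, leaf = divmod(tenths, 100)  # stem 4, leaf 23
        half = "L" if leaf < 50 else "H"  # leaves 00-49 are low, 50-99 are high
        rows[(stem, half)].append(leaf)
    # Order rows by stem, with the L half before the H half.
    ordered = sorted(rows, key=lambda k: (k[0], k[1] == "H"))
    return {f"{s}{h}": sorted(rows[(s, h)]) for s, h in ordered}

sald = split_stem_display(data)
for stem, leaves in sald.items():
    print(stem, " ".join(f"{leaf:02d}" for leaf in leaves))
```

The longest row is 4L, which is consistent with the remark that the values cluster around the 40 to 45% levels.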


B. To construct a frequency distribution with classes of equal width:
1. R = 51.2 − 22.9 = 28.3
2. K = 1 + 3.322 log 25 = 1 + (3.322)(1.3979) ≅ 5.6438 ≅ 6
3. C′ = R/K = 28.3/6 ≅ 4.72; C = 4.8 (choosing between 4.7 and 4.8)
4. Lower limit of first class: 22.8
5. List of all classes, their frequencies, relative frequencies, and class boundaries:

Class Limits     Frequency   Relative Frequency   Class Boundaries
22.8 − 27.5      1           1/25 = 0.04          22.75 − 27.55
27.6 − 32.3      3           3/25 = 0.12          27.55 − 32.35
32.4 − 37.1      3           3/25 = 0.12          32.35 − 37.15
37.2 − 41.9      9           9/25 = 0.36          37.15 − 41.95
42.0 − 46.7      5           5/25 = 0.20          41.95 − 46.75
46.8 − 51.5      4           4/25 = 0.16          46.75 − 51.55
Total            25          25/25 = 1.00

Notes:
• UL = Lower Limit + C − 0.1; e.g., UL = 22.8 + 4.8 − 0.1 = 27.5. Each succeeding UL is the preceding UL plus C; e.g., 27.5 + 4.8 = 32.3.
• Each succeeding LL is the sum of the class size C and the preceding LL; e.g., LL + C = 22.8 + 4.8 = 27.6 and 27.6 + 4.8 = 32.4.
• Rel. Freq = Freq/total; e.g., 1/25 = 0.04 and 3/25 = 0.12.
• Since our class limits have one decimal place, we subtract 0.05 from each lower limit and add 0.05 to each upper limit to get the class boundaries; e.g., 22.8 − 0.05 = 22.75 and 27.5 + 0.05 = 27.55.
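The five steps can be checked with a short Python sketch. The class size C = 4.8 and the first lower limit 22.8 are taken from steps 3 and 4 rather than computed, and rounding K up with `math.ceil` is an assumption about how 5.64 becomes 6.

```python
import math

data = [42.3, 50.6, 41.8, 36.5, 28.6,
        40.7, 48.1, 48.0, 45.7, 39.9,
        31.3, 30.7, 40.5, 40.9, 51.2,
        38.6, 35.6, 22.9, 33.4, 46.5,
        41.5, 43.5, 41.1, 38.5, 44.4]

n = len(data)
R = round(max(data) - min(data), 1)        # step 1: range
K = math.ceil(1 + 3.322 * math.log10(n))   # step 2: number of classes (rounded up)
C = 4.8                                    # step 3: chosen class size
first_lower = 22.8                         # step 4: chosen first lower limit

rows = []
for i in range(K):
    ll = round(first_lower + i * C, 1)     # each LL is the preceding LL plus C
    ul = round(ll + C - 0.1, 1)            # UL = LL + C - 0.1
    freq = sum(ll <= x <= ul for x in data)
    boundaries = (round(ll - 0.05, 2), round(ul + 0.05, 2))
    rows.append((ll, ul, freq, freq / n, boundaries))

# step 5: print the frequency distribution
for ll, ul, freq, rel, (lb, ub) in rows:
    print(f"{ll:.1f} - {ul:.1f}  {freq:2d}  {rel:.2f}  {lb:.2f} - {ub:.2f}")
print("Total", sum(r[2] for r in rows))
```

Every observation falls in exactly one class, so the frequencies sum to 25 and the relative frequencies sum to 1.00.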
C. A relative frequency histogram of the data:

[Figure: relative frequency histogram; vertical axis: relative frequency; horizontal axis: percentage of respiring bacteria]

The histogram is negatively skewed since its longer tail stretches to the left. This indicates that
most of the samples have percentages of respiring bacteria at relatively high levels, with
only a few at relatively low levels.
Lesson 2.2 Measures of Location
Key points:
• Measures of central tendency or average:
o mean (at least interval, similar values)
o median (at least ordinal)
o mode (at least nominal)
• Quantiles (at least ordinal):
o quartiles (divide the ordered observations into 4 equal parts)
o deciles (10 equal parts)
o percentiles (100 equal parts)
Learning Activity
• Suppose we are interested in describing the average monthly high temperatures (℃) for 2019 in
Manila, Philippines: 29.5, 30.2, 31.9, 33.3, 33.4, 32.1, 31.2, 30.4, 30.6, 30.9, 30.5, 29.6. (Note:
Since 2019 is the year of interest, these 12 values already comprise the population of interest.)
• Find the following and interpret: mean, median, mode, first quartile, and third quartile. First use
a stem-and-leaf display to organize these temperatures.
Solution:
We start with a SALD to organize the 12 observations:

29 5 6
30 2 4 6 9 5
31 9 2
32 1
33 3 4
Ordered SALD (Unit = 0.1):
29 5 6
30 2 4 5 6 9
31 2 9
32 1
33 3 4

Array: 29.5, 29.6, 30.2, 30.4, 30.5, 30.6, 30.9, 31.2, 31.9, 32.1, 33.3, 33.4
1. To find the population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{373.6}{12} \approx 31.13$ ℃

Interpretation:
The monthly high temperatures in 2019 are centered at about 31.13℃. Or, if the
monthly high temperature had been constant throughout 2019, this temperature
would have been 31.13℃.
2. To find the population median, where $N$ is even: $\tilde{\mu} = \frac{x_{(N/2)} + x_{(N/2+1)}}{2} = \frac{x_{(6)} + x_{(7)}}{2} = \frac{30.6 + 30.9}{2} = 30.75$ ℃
Interpretation: Half of the monthly high temperatures in 2019 were below 30.75℃ and half were
above it (as can be observed).

Note: Since the SALD indicates a positively skewed histogram, the median is a better measure of
central tendency than the mean. That is, the middle value better represents the 12
temperatures in terms of “average” level.
3. There is no mode as each monthly high temperature in 2019
is unique.
4. To find the first quartile ($Q_1 = P_{25}$):
a. Finding $\frac{nk}{100} = \frac{12(25)}{100} = \frac{12(1)}{4} = 3$
b. Since we obtained an integer in a), we have
$Q_1 = P_{25} = \frac{x_{(3)} + x_{(4)}}{2} = \frac{30.2 + 30.4}{2} = 30.3$ ℃.
We can say that, in 2019, the lowest 25% of the monthly high
temperatures were below 30.3℃ (as we can observe).
5. To find the third quartile ($Q_3 = P_{75}$):
a. Finding $\frac{nk}{100} = \frac{12(75)}{100} = \frac{12(3)}{4} = 9$
b. Since we obtained an integer in a), we have
$Q_3 = P_{75} = \frac{x_{(9)} + x_{(10)}}{2} = \frac{31.9 + 32.1}{2} = 32$ ℃.
We can say that, in 2019, the highest 25% of the monthly high
temperatures were above 32℃ (as we can observe).
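The five results above can be verified with a minimal Python sketch. The helper `percentile` is a hypothetical function that implements the rule used in steps 4 and 5 (average two neighbors when nk/100 is an integer); the round-up branch for non-integer positions is an assumption, since it is not needed for these data.

```python
from statistics import mean, median, multimode

temps = [29.5, 30.2, 31.9, 33.3, 33.4, 32.1,
         31.2, 30.4, 30.6, 30.9, 30.5, 29.6]

def percentile(values, k):
    """k-th percentile (0 < k < 100) using the position n*k/100."""
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    xs = sorted(values)
    pos, rem = divmod(len(xs) * k, 100)
    if rem == 0:
        # Integer position: average the pos-th and (pos+1)-th ordered values.
        return (xs[pos - 1] + xs[pos]) / 2
    # Otherwise round the position up (assumed rule) and take that value.
    return xs[pos]

mu = mean(temps)
med = median(temps)
modes = multimode(temps)   # every value is returned when all are unique
q1 = percentile(temps, 25)
q3 = percentile(temps, 75)

print(f"mean   = {mu:.2f}")
print(f"median = {med:.2f}")
print("mode   =", "none" if len(modes) == len(temps) else modes)
print(f"Q1 = {q1:.1f}, Q3 = {q3:.1f}")
```

Note that library routines such as `statistics.quantiles` use different interpolation conventions by default, so their quartiles may differ slightly from the hand computation.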
Lesson 2.3 Measures of Variability
Key points:

• Variation – differences; spread from the center (mean or median)
• Based on the mean (values are similar): variance, standard
deviation
• Based on the median: MAD (normal distribution), average
deviation (skewed distribution)
• Relative measure of variability – coefficient of variation
• Boxplot – to detect outliers
Learning Activity
Consider the data on crack length from Lesson 2, given in the stem-and-leaf display below. Perform as
indicated.

1. Compute the appropriate measure of variability and interpret.
2. Compute a relative measure of variability.
3. Construct a boxplot and further characterize the differences among the data values.

Stem: tens digit; H-high, L-low
Leaf: ones and tenths digits (Unit = 0.1)

0H 89 96
1L 03 18 27 40 46
1H 61 85
2L 04 12 33 42 49
2H 53 58 71 85
3L 02 24
3H
4L
4H 50

Array:
8.9, 9.6, 10.3, 11.8, 12.7, 14.0, 14.6, 16.1, 18.5, 20.4, 21.2, 23.3, 24.2, 24.9,
25.3, 25.8, 27.1, 28.5, 30.2, 32.4, 45.0
1. Compute the appropriate measure of variability and interpret.
Since the SALD shows one value in the upper tail that lies far from the rest, the mean will not be a
good measure of central tendency. We then need to use an alternative measure of variability to the
standard deviation. This is the average deviation based on the median, as the shape of the histogram
indicated by the SALD is not symmetric. We recall that the median of this data set is 21.2. We now
compute the average deviation as follows:

Crack Length   |x_i − x̃|
8.9            12.3
9.6            11.6
10.3           10.9
11.8           9.4
12.7           8.5
14.0           7.2
14.6           6.6
16.1           5.1
18.5           2.7
20.4           0.8
21.2           0.0
23.3           2.1
24.2           3.0
24.9           3.7
25.3           4.1
25.8           4.6
27.1           5.9
28.5           7.3
30.2           9.0
32.4           11.2
45.0           23.8
Total 444.8    149.8

$\text{A.D.} = \frac{\sum_{i=1}^{21} |x_i - 21.2|}{21} = \frac{12.3 + 11.6 + \cdots + 23.8}{21} = \frac{149.8}{21} \approx 7.13$

Interpretation: On the average, the 21 lengths deviated by
7.13 units from their median.
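The column of absolute deviations and the average deviation can be reproduced with a few lines of Python. This is a minimal sketch; the variable names are chosen for illustration only.

```python
from statistics import median

cracks = [8.9, 9.6, 10.3, 11.8, 12.7, 14.0, 14.6, 16.1, 18.5, 20.4, 21.2,
          23.3, 24.2, 24.9, 25.3, 25.8, 27.1, 28.5, 30.2, 32.4, 45.0]

med = median(cracks)                      # 11th of the 21 ordered values
abs_dev = [abs(x - med) for x in cracks]  # |x_i - median| for each length
avg_dev = sum(abs_dev) / len(cracks)      # average deviation about the median

print(f"median = {med}")
print(f"sum of absolute deviations = {sum(abs_dev):.1f}")
print(f"A.D. = {avg_dev:.2f}")
```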
2. The CV of this sample data is computed as follows:

$\text{CV} = \frac{s}{\bar{x}} \times 100\% = \frac{9.0018}{21.18} \times 100\% \approx 42.50\%$
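The values of s and the sample mean used in the CV are not derived in the solution, so the following sketch recomputes them. It assumes the sample standard deviation with divisor n − 1, which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

cracks = [8.9, 9.6, 10.3, 11.8, 12.7, 14.0, 14.6, 16.1, 18.5, 20.4, 21.2,
          23.3, 24.2, 24.9, 25.3, 25.8, 27.1, 28.5, 30.2, 32.4, 45.0]

x_bar = mean(cracks)     # sample mean
s = stdev(cracks)        # sample standard deviation (divisor n - 1)
cv = s / x_bar * 100     # coefficient of variation, in percent

print(f"mean = {x_bar:.2f}, s = {s:.4f}, CV = {cv:.2f}%")
```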
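3. We first prepare the summary measures needed for the boxplot:

Sample median: $\tilde{x} = 21.2$
Position of $Q_1$: $\frac{ni}{4} = \frac{21(1)}{4} = 5.25$; position of $Q_3$: $\frac{ni}{4} = \frac{21(3)}{4} = 15.75$
Since neither position is an integer, we round up to the 6th and 16th ordered values:
$Q_1 = x_{(6)} = 14.0$, $Q_3 = x_{(16)} = 25.8$, IQR = 25.8 − 14.0 = 11.8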

Lower inner fence = 14.0 − 1.5(11.8) = −3.7, not useful (no negative value)
Upper inner fence = 25.8 + 1.5(11.8) = 43.5
Lower outer fence = 14.0 − 3(11.8) = −21.4, not useful
Upper outer fence = 25.8 + 3(11.8) = 61.2
The boxplot for this data set is:

[Figure: boxplot of crack length (𝜇m) showing 𝑄1 = 14.0, the median 𝑥̃ = 21.2, 𝑄3 = 25.8, and IQR = 11.8. The largest data value is marked as a mild outlier since it is beyond the upper inner fence.]

Clearly, the largest value, 45.0, is a mild outlier. There is no outlier in the lower half.
Within the box, the data are negatively skewed, with the median line closer to the upper
quartile.
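The quartiles, fences, and outlier check can be sketched in Python as follows. The helper `quartile` is hypothetical and follows the position rule ni/4 used above (round up when the position is not an integer). Labeling values beyond the outer fences as "extreme" is a common convention assumed here; the solution itself only names mild outliers.

```python
cracks = [8.9, 9.6, 10.3, 11.8, 12.7, 14.0, 14.6, 16.1, 18.5, 20.4, 21.2,
          23.3, 24.2, 24.9, 25.3, 25.8, 27.1, 28.5, 30.2, 32.4, 45.0]

def quartile(values, i):
    """i-th quartile (i = 1, 2, 3) using the position n*i/4."""
    xs = sorted(values)
    pos, rem = divmod(len(xs) * i, 4)
    if rem == 0:
        # Integer position: average that ordered value and the next one.
        return (xs[pos - 1] + xs[pos]) / 2
    # Non-integer position: round up, e.g. 5.25 -> 6th ordered value.
    return xs[pos]

q1, q3 = quartile(cracks, 1), quartile(cracks, 3)
iqr = q3 - q1
inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # inner fences
outer = (q1 - 3 * iqr, q3 + 3 * iqr)       # outer fences

# Mild: beyond an inner fence but not beyond the outer fence on that side.
mild = [x for x in cracks
        if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]]
# Extreme (assumed label): beyond an outer fence.
extreme = [x for x in cracks if x < outer[0] or x > outer[1]]

print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr:.1f}")
print(f"inner fences = ({inner[0]:.1f}, {inner[1]:.1f})")
print(f"outer fences = ({outer[0]:.1f}, {outer[1]:.1f})")
print("mild outliers:", mild, "| extreme outliers:", extreme)
```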
Thank You!
