
EDA
2021-2022

EDA – Module 2
DESCRIBING DATA
ENGINEERING DATA ANALYSIS

Unit 1: Module 1: Intro to EDA as a Subject in BSCE (Week 1)
• EDA Overview
• CHED PSG for BSCE program
• Descriptive Statistics
• Qualitative and Quantitative Variables, Discrete and Continuous Variables
• Levels of measurement

Unit 2: Module 2: Describing Data (Week 2)
• Frequency Tables
• Frequency Distributions
• Graphical Presentations
• Symmetry and Skewness
• Measure of Central Locations
• Measure of Variations

Unit 3: Module 3: Introduction to Probability (Week 3)
• A survey of probability concepts
• Discrete probability concepts
• Discrete Probability Distributions
• Continuous Probability Distributions

Unit 4: Module 4: Descriptive Statistics, Sampling Distributions (Week 4)
• Mean, Median, Mode, Percentile, Quartile, Outliers
• Bayes Theorem, Empirical Rule, Hypothesis testing, Z-test and t-tests, Normal Distribution, Linear Regression
• ANOVA

Unit 5: Module 5: Sensitivity Analysis, Project Study, use of MS Excel and PowerPoint Presentations (Week 5)
• Sensitivity Analysis
• Project Study in Statistics
• Statistical Tools and MS Excel, PowerPoint presentations of analyzed data


EDA:
DESCRIBING DATA

1.Frequency Tables
2.Frequency Distributions
3.Graphical Presentations
4.Symmetry and skewness
5.Measure of Central Locations
6.Measure of Variations
Course Outline

B. Statistical Measures of Data


 Ungrouped data
1. Parameters and statistics
2. Measures of central location
3. Measures of variation
Course Outline

C. Statistical Description of Data


 Grouped Data
1. Frequency Distribution
2. Graphical Representation
3. Symmetry and Skewness
4. Measure of Central Locations
5. Measure of Variation
Engineering Data Analysis
Today’s Topic is about describing data
Statistical Measures Of Data
Ungrouped Data Analysis
Measure of Central Locations/ Tendencies
1. Mean
a. Arithmetic Mean
$$\mu = \frac{\sum_{i=1}^{n} X_i}{n}$$
n = number of observations
b. Weighted Mean
$$\mu_w = \frac{\sum_{i=1}^{n} w_i X_i}{\sum_{i=1}^{n} w_i}$$
w_i = weighting factor of the ith observation
Measure of Central Locations/Tendencies

c. Geometric Mean – the geometric mean is used primarily to average data for which the ratio of consecutive terms remains approximately constant, such as rates of change, ratios, economic index numbers, and population sizes over consecutive periods.
$$\mu_G = \sqrt[n]{X_1 X_2 X_3 \cdots X_n}$$
Measure of Central locations/Tendencies

d. Harmonic Mean – the harmonic mean is most frequently used in averaging speeds for various distances covered where the distances remain constant, and in finding the average cost of some commodity, such as mutual funds, when several different purchases are made by investing the same amount of money each time.
$$\mu_H = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}$$
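The four means defined above can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module (Python 3.8 or later is assumed for `geometric_mean`); the data values and weights are hypothetical, chosen only for illustration, and `weighted_mean` is a helper name introduced here.

```python
import statistics


def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * X_i) / sum(w_i)."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Hypothetical observations and weighting factors (not from the slides).
values = [2.0, 4.0, 8.0]
weights = [1, 2, 1]

arithmetic = statistics.mean(values)           # (2 + 4 + 8) / 3
weighted = weighted_mean(values, weights)      # (2 + 8 + 8) / 4
geometric = statistics.geometric_mean(values)  # cube root of 2 * 4 * 8
harmonic = statistics.harmonic_mean(values)    # 3 / (1/2 + 1/4 + 1/8)

print(arithmetic, weighted, geometric, harmonic)
```

For positive data that are not all equal, the harmonic mean is the smallest of the three unweighted means and the arithmetic mean is the largest, which the printed values should reflect.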
Measure of Central Locations/Tendencies

2. Median
The median of a set of observations arranged in an
increasing or decreasing order of magnitude is
the middle value when the number of observations is
odd or
the arithmetic mean of the two middle values when
the number of observations is even.
Measure of Central Locations/Tendencies

3. Mode
The mode of a set of observations is
that value which occurs most often or
with the greatest frequency.
Measure of Variation

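➢ Population Variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} = \frac{N \sum_{i=1}^{N} X_i^2 - \left(\sum_{i=1}^{N} X_i\right)^2}{N^2}$$
N = population size
➢ Sample Variance
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{n \sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}{n(n - 1)}$$
n = sample size, $\bar{X}$ = sample mean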
Measure of Variation
Standard Deviation = square root of the variance

➢ Population Standard Deviation, σ
$$\sigma = \sqrt{\sigma^2}$$
➢ Sample Standard Deviation, S
$$S = \sqrt{S^2}$$
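As a quick check that the computational (sum of squares) form of the variance agrees with the definitional form, here is a minimal sketch. The four data values are hypothetical, and the function names are introduced here for illustration; `statistics.pvariance` and `statistics.variance` are the standard-library definitional versions used for comparison.

```python
import math
import statistics


def population_variance(xs):
    """Computational form: [N * sum(X^2) - (sum X)^2] / N^2."""
    n = len(xs)
    return (n * sum(x * x for x in xs) - sum(xs) ** 2) / n ** 2


def sample_variance(xs):
    """Computational form: [n * sum(X^2) - (sum X)^2] / [n * (n - 1)]."""
    n = len(xs)
    return (n * sum(x * x for x in xs) - sum(xs) ** 2) / (n * (n - 1))


data = [3, 5, 7, 9]  # hypothetical observations

pop_var = population_variance(data)   # divides by N
samp_var = sample_variance(data)      # divides by n - 1
pop_std = math.sqrt(pop_var)
samp_std = math.sqrt(samp_var)

# The standard library computes the same quantities from the definitions.
print(pop_var, statistics.pvariance(data))
print(samp_var, statistics.variance(data))
```

The sample variance is always the larger of the two for the same data, because it divides the same sum of squared deviations by n − 1 instead of N.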
Coding Techniques
Mean:
Addition: $\mu_{actual} = \mu_{coded} - \text{constant}$
Multiplication: $\mu_{actual} = \dfrac{\mu_{coded}}{\text{constant}}$
Variance:
Addition/Subtraction: $\sigma^2_{actual} = \sigma^2_{coded}$
Multiplication: $\sigma^2_{actual} = \dfrac{\sigma^2_{coded}}{(\text{constant})^2}$
Sample Problems
1. The number of incorrect answers on a true-false
competency test for a random sample of 15
students were recorded as follows: 2, 1, 3, 0, 1, 3,
6, 0, 3, 3, 5, 2, 1, 4, and 2. Find (a) the mean; (b)
the median; (c) the mode.
2. The average IQ of 10 students in a mathematics
course is 114. If 9 of the students have IQs of 101,
125, 118, 128, 106, 115, 99, 118, and 109, what
must be the other IQ?
Sample Problems
1. The number of incorrect answers on a true-false competency test
for a random sample of 15 students were recorded as follows: 2,
1, 3, 0, 1, 3, 6, 0, 3, 3, 5, 2, 1, 4, and 2. Find (a) the mean; (b)
the median; (c) the mode.
By MS Excel

MEAN = 2.4
MEDIAN = 2
MODE = 3
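The same three results can be reproduced outside MS Excel. This is a minimal sketch using Python's standard `statistics` module on the 15 observations of the problem; the variable names are introduced here for illustration.

```python
import statistics

# Number of incorrect answers for the 15 students in Problem 1.
incorrect = [2, 1, 3, 0, 1, 3, 6, 0, 3, 3, 5, 2, 1, 4, 2]

mean = statistics.mean(incorrect)      # sum of 36 divided by 15
median = statistics.median(incorrect)  # 8th value of the sorted list
mode = statistics.mode(incorrect)      # most frequent value

print(mean, median, mode)
```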
Sample Problems

2. The average IQ of 10 students in a mathematics course is 114. If 9 of the students have IQs of 101, 125, 118, 128, 106, 115, 99, 118, and 109, what must be the other IQ?
Solution

Student   1    2    3    4    5    6    7    8    9    10
IQ        101  125  118  128  106  115  99   118  109  x
Mean = 114

Total Scores = (101 + 125 + ... + 109) + x = 1019 + x

$$\frac{1019 + x}{10} = 114; \quad 1140 = 1019 + x; \quad x = 121$$
Sample Problems

3. In estimating the mean breaking strength, in kg, of a


new synthetic fishing line, the data were coded by
subtracting 10 from each observation. Find the sample
mean for 5 of these lines if the coded measurements
are 0.4, -0.2, 1.5, 1.8, and -0.7.
4. The arithmetic mean of 6 numbers is 17. If two
numbers are added to the progression, the new set of
numbers will have an arithmetic mean of 19. What are
the two numbers if their difference is 4?
Sample Problems

3. In estimating the mean breaking strength, in kg, of a


new synthetic fishing line, the data were coded by
subtracting 10 from each observation. Find the sample
mean for 5 of these lines if the coded measurements
are 0.4, -0.2, 1.5, 1.8, and -0.7.
Standard Value or constant = 10 kg

Observation      1      2      3      4      5
Actual           10.4   9.8    11.5   11.8   9.3
Standard Value   10     10     10     10     10
Coded Value      0.4    -0.2   1.5    1.8    -0.7

Mean Actual = Mean Coded + Constant = 0.56 + 10 = 10.56 kg
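The decoding step can be verified numerically. The sketch below, in Python, recovers the actual mean from the coded mean and also illustrates the coding rule for variance under subtraction (the variance is unchanged); the variable names are introduced here for illustration.

```python
import statistics

coded = [0.4, -0.2, 1.5, 1.8, -0.7]  # coded measurements from Problem 3
constant = 10                        # 10 kg was subtracted from each observation

mean_coded = statistics.mean(coded)   # 2.8 / 5
mean_actual = mean_coded + constant   # undo the subtraction

# Rebuild the actual observations to confirm both coding rules directly.
actual = [c + constant for c in coded]
var_coded = statistics.variance(coded)
var_actual = statistics.variance(actual)

print(round(mean_actual, 2), round(var_coded, 4), round(var_actual, 4))
```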
Sample Problems
4. The arithmetic mean of 6 numbers is 17. If two numbers are
added to the progression, the new set of numbers will have an
arithmetic mean of 19. What are the two numbers if their difference
is 4?
Solution

Number   1   2   3   4   5   6   7        8
Value    (mean of these 6 = 17)   x = 23   x + 4 = 27
Mean of all 8 numbers = 19

By Block system
(17)(6) + x + (x + 4) = 19(8)
2x = (19)(8) – (17)(6) – 4 = 46
x = 46/2 = 23
other number: 23 + 4 = 27
Sample Problems

5. The number of goals scored by a college


soccer team for a given season are 4, 9, 0, 1,
3, 12, 3, 7, 4, 5, 8, and 2. Treating the data as
a population, calculate the standard deviation.
6. Find the variance for the population consisting
of the measurements ½, ¼, 1/3, ½, and 1/8
by using the coded data 12, 6, 8, 12, and 3.
Sample Problems

7. A taxi company tested a random sample of 10 steel-belted


radial tires of a certain brand and recorded the following
tread wear: 48,000; 53,000; 45,000; 61,000; 59,000;
56,000; 63,000; 49,000; 53,000; and 54,000 kilometers.
Find the standard deviation of this set of data by first
dividing each observation by 1,000 and then subtracting
50.
Quiz

Faculty salaries for a random sample of teachers in


the public school system of a certain town were
coded by dividing each salary by 1000. Find the
variance of these salaries if the coded observations
are 18, 15, 21, 19, 15, 15, 16, 18, 23, and 17 pesos.
Grouped Data Analysis
Frequency distribution table (FDT)
Steps:
1. Determine The Range Of The Data,
R = Highest Value – Lowest Value

2. Determine The Number Of Classes, K
$$K = \sqrt{N} \quad \text{or} \quad K = 1 + 3.3 \log N$$
3. Determine the class width, C
$$C = \frac{R}{K}$$
GROUPED DATA ANALYSIS

4. Write the different classes. The lowest class must


include the lowest value and the highest class must
include the highest value. Make a frequency tally.
5. Construct the Frequency Distribution Table by
providing the following columns such as the Classes,
the True Class Boundaries, Class Marks (midpoint),
Cumulative Frequencies, and the Relative Cumulative
Frequencies.
6. Draw the Frequency Histogram and the Frequency
Ogives.
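MEASURE OF CENTRAL LOCATIONS
(Grouped Data)
Mean
$$\mu_G = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$$
k = number of class intervals

Median
$$Me = L_{me} + C\left(\frac{\frac{N}{2} - F_{cum}}{f_{me}}\right)$$

Mode
$$Mo = L_{mo} + C\left(\frac{f_{mo} - f_b}{2f_{mo} - f_b - f_a}\right)$$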
Measure of Central
Locations

Where:
fi = frequency of the ith class
Xi = class mark (midpoint) of the ith class
Lme = the lower limit on the true class boundary of the
median class
C = the class width
N = the total number of observations
Fcum = the cumulative frequency (<) of the class just
before the median class
fme = the frequency of the median class
Lmo = the lower limit on the true class boundary of the
modal class
fmo = the frequency of the modal class
fb = the frequency of the class just before the modal class
fa = the frequency of the class just after the modal class
Measure of Variation
(Grouped Data)
RANGE = Highest Value – Lowest Value

Population Variance
$$\sigma_G^2 = \frac{\sum_{i=1}^{k} f_i X_i^2}{N} - \mu^2 = \frac{N \sum_{i=1}^{k} f_i X_i^2 - \left(\sum_{i=1}^{k} f_i X_i\right)^2}{N^2}$$

Standard Deviation, σ
$$\sigma = \sqrt{\sigma_G^2}$$
MEASURE OF VARIATION

Sample Variance
$$s_G^2 = \frac{n \sum_{i=1}^{k} f_i X_i^2 - \left(\sum_{i=1}^{k} f_i X_i\right)^2}{n(n - 1)}$$
Standard Deviation, S
$$s = \sqrt{s_G^2}$$
EXAMPLE:
The following numbers represent the total number of projects
undertaken by 60 different contractors in CAR for the past five
years.

12 6 8 23 6 7 25 7 3 3 4 1

18 10 14 7 19 9 6 8 4 4 6 3

18 13 24 7 6 8 9 7 5 5 6 5

19 6 8 14 8 8 14 8 6 5 21 22

12 8 17 10 2 17 7 7 16 10 22 25
Excel Sheet
12 6 8 23 6 7 25 7 3 3 4 1
18 10 14 7 19 9 6 8 4 4 6 3
18 13 24 7 6 8 9 7 5 5 6 5
19 6 8 14 8 8 14 8 6 5 21 22
12 8 17 10 2 17 7 7 16 10 22 25

MAX      25
MIN      1
COUNT    60
RANGE    24
K        7.745967 (say 7 or 8)
CLASSES  width 3 for K = 8
         width 4 for K = 7
Excel Sheet (tentative)

Classes Frequency Tally


1-3
4-6
7-9
10-12
13-15
16-18
19-21
22-24
Solution:
R = 25 – 1 = 24
$K = \sqrt{N} = \sqrt{60} = 7.746$

Let: K = 8
C = R/K = 24/8 = 3
Classes:
1 – 3
4 – 6
7 – 9
10 – 12
13 – 15
16 – 18
19 – 21
22 – 24
- the highest value (25) is not included

Let: K = 7
C = R/K = 24/7 = 3.43, say 4
Classes: Frequency Tally
0 – 3 . . . . . . . . . . . . . . . 5
4 – 7 . . . . . . . . . . . . . . . 22
8 – 11 . . . . . . . . . . . . . . 13
12 – 15 . . . . . . . . . . . . . 6
16 – 19 . . . . . . . . . . . . . 7
20 – 23 . . . . . . . . . . . . . 4
24 – 27 . . . . . . . . . . . . . 3
Frequency Distribution Table (FDT)

Classes    True Class Boundaries (TCB)    Class Mark (midpoint), Xi    Frequency, fi
0 – 3      -0.5 – 3.5                     1.5                          5
4 – 7      3.5 – 7.5                      5.5                          22
8 – 11     7.5 – 11.5                     9.5                          13
12 – 15    11.5 – 15.5                    13.5                         6
16 – 19    15.5 – 19.5                    17.5                         7
20 – 23    19.5 – 23.5                    21.5                         4
24 – 27    23.5 – 27.5                    25.5                         3
Excel Sheet (Final)

Classes Upper Limit Frequency Tally


0-3 3 5
4-7 7 22
8-11 11 13
12-15 15 6
16-19 19 7
20-23 23 4
24-27 27 3
Frequency Distribution Table (FDT)
True Class          Cumulative Frequency           Relative Cumulative Frequency
Boundaries (TCB)    (No. of Observations)          (percent)
                    Less Than    Greater Than      Less Than    Greater Than
-0.5 – 3.5          5            60                8.33         100
3.5 – 7.5           27           55                45.00        91.67
7.5 – 11.5          40           33                66.67        55.00
11.5 – 15.5         46           20                76.67        33.33
15.5 – 19.5         53           14                88.33        23.33
19.5 – 23.5         57           7                 95.00        11.67
23.5 – 27.5         60           3                 100          5.00
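The tally and the cumulative columns of the FDT can be reproduced from the raw data. The following is a minimal Python sketch for the K = 7, C = 4 classes of this example; the variable names are introduced here for illustration, and the standard library's `itertools.accumulate` is used for the running totals.

```python
from itertools import accumulate

# Number of projects for the 60 contractors in the example.
projects = [
    12, 6, 8, 23, 6, 7, 25, 7, 3, 3, 4, 1,
    18, 10, 14, 7, 19, 9, 6, 8, 4, 4, 6, 3,
    18, 13, 24, 7, 6, 8, 9, 7, 5, 5, 6, 5,
    19, 6, 8, 14, 8, 8, 14, 8, 6, 5, 21, 22,
    12, 8, 17, 10, 2, 17, 7, 7, 16, 10, 22, 25,
]

width = 4
lower_limits = [0, 4, 8, 12, 16, 20, 24]  # classes 0-3, 4-7, ..., 24-27
n = len(projects)

# Frequency tally: count the values falling inside each class.
freq = [sum(1 for x in projects if lo <= x <= lo + width - 1)
        for lo in lower_limits]

# "Less than" accumulates from the first class down,
# "greater than" accumulates from the last class up.
less_than = list(accumulate(freq))
greater_than = list(accumulate(reversed(freq)))[::-1]

# Relative cumulative frequencies in percent, rounded as in the table.
rel_less = [round(100 * c / n, 2) for c in less_than]
rel_greater = [round(100 * c / n, 2) for c in greater_than]

for lo, f, lt, gt, rl, rg in zip(lower_limits, freq, less_than,
                                 greater_than, rel_less, rel_greater):
    print(f"{lo}-{lo + width - 1}: f={f}, <{lt}, >{gt}, {rl}%, {rg}%")
```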
Excel Sheet

[Chart (untitled Excel chart): Series1 to Series4 plotted against the true class boundaries -0.5 – 3.5 through 23.5 – 27.5; vertical axis from 0 to 120.]


Analysis (Mean and Variance)
Class Mark, Xi    Frequency, fi    fi Xi     fi Xi^2
1.5               5                7.5       11.25
5.5               22               121.0     665.50
9.5               13               123.5     1173.25
13.5              6                81.0      1093.50
17.5              7                122.5     2143.75
21.5              4                86.0      1849.00
25.5              3                76.5      1950.75
Total             60               618.0     8887.00
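The two column totals can be recomputed from the class marks and frequencies. A minimal Python sketch, with variable names introduced here for illustration:

```python
class_marks = [1.5, 5.5, 9.5, 13.5, 17.5, 21.5, 25.5]  # Xi
freqs = [5, 22, 13, 6, 7, 4, 3]                        # fi

# Row-by-row products, matching the two right-hand columns of the table.
fx = [f * x for f, x in zip(freqs, class_marks)]
fx2 = [f * x * x for f, x in zip(freqs, class_marks)]

n = sum(freqs)     # total number of observations
sum_fx = sum(fx)   # total of the fi*Xi column
sum_fx2 = sum(fx2) # total of the fi*Xi^2 column

print(n, sum_fx, sum_fx2)
```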
Frequency Histogram
Shows the skewness of the data

[Figure: frequency histogram. Vertical axis: Frequency; horizontal axis: True Class Boundaries (-0.5, 3.5, 7.5, 11.5, 15.5, 19.5, 23.5, 27.5); the modal reading is marked.]

Frequency Histogram

Coefficient of Skewness
$$\text{Coef. of Skewness} = \frac{3(\text{mean} - \text{median})}{\text{Standard Deviation}}$$
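As an illustration of the formula, the sketch below applies it to the rounded results reported for the contractor example in this module (mean 10.3, median 8.4, standard deviation 6.54). Using the sample standard deviation rather than the population value is an assumption made here, and the function name is introduced for illustration; this form is commonly called Pearson's second coefficient of skewness.

```python
def coefficient_of_skewness(mean, median, std_dev):
    """Coefficient of skewness = 3 * (mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev


# Rounded grouped-data results for the 60-contractor example;
# the sample standard deviation is assumed here.
skew = coefficient_of_skewness(mean=10.3, median=8.4, std_dev=6.54)

# A positive value means the mean lies above the median,
# i.e. the distribution has a longer right tail.
print(round(skew, 2))
```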
Frequency Ogives
[Figure: frequency ogives. Vertical axis: Cumulative Frequency (0 to 60); horizontal axis: Class Mark (1.5, 5.5, 9.5, 13.5, 17.5, 21.5, 25.5); shows the less-than ogive and the greater-than ogive, with the median reading marked.]
Central Tendencies
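1. Mean
$$\mu_G = \frac{\sum f_i X_i}{\sum f_i} = \frac{618}{60} = 10.3 \text{ contracts}$$

2. Median
$$Me_G = L_{me} + C\left(\frac{\frac{N}{2} - F_{cum}}{f_{me}}\right)$$
$$Me_G = 7.5 + 4\left(\frac{\frac{60}{2} - 27}{13}\right) = 8.4 \text{ contracts}$$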
Central Tendencies

3. Mode
$$Mo_G = L_{mo} + C\left(\frac{f_{mo} - f_b}{2f_{mo} - f_b - f_a}\right)$$
$$Mo_G = 3.5 + 4\left(\frac{22 - 5}{2(22) - (5 + 13)}\right) = 6.1 \text{ contracts}$$
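The grouped median and mode can also be computed by locating the median class and the modal class programmatically. This is a minimal Python sketch for the example's FDT; the variable names are introduced here for illustration, and the modal class is taken to be the single class with the highest frequency.

```python
from itertools import accumulate

lower_boundaries = [-0.5, 3.5, 7.5, 11.5, 15.5, 19.5, 23.5]  # lower TCB of each class
freqs = [5, 22, 13, 6, 7, 4, 3]
width = 4
n = sum(freqs)

# Median class: first class whose "less than" cumulative frequency reaches N/2.
cum = list(accumulate(freqs))
m = next(i for i, c in enumerate(cum) if c >= n / 2)
f_cum_before = cum[m - 1] if m > 0 else 0
median = lower_boundaries[m] + width * (n / 2 - f_cum_before) / freqs[m]

# Modal class: class with the highest frequency.
k = max(range(len(freqs)), key=lambda i: freqs[i])
f_b = freqs[k - 1] if k > 0 else 0               # class just before
f_a = freqs[k + 1] if k < len(freqs) - 1 else 0  # class just after
mode = lower_boundaries[k] + width * (freqs[k] - f_b) / (2 * freqs[k] - f_b - f_a)

print(round(median, 1), round(mode, 1))
```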
Variance:

1. As a population;
$$\sigma_G^2 = \frac{N \sum f_i X_i^2 - \left(\sum f_i X_i\right)^2}{N^2}$$
$$\sigma_G^2 = \frac{60(8887) - (618)^2}{(60)^2} = 42.03$$
Standard Deviation,
$$\sigma_G = \sqrt{42.03} = 6.48 \text{ contracts}$$


Variance:

2. As a sample;
$$S_G^2 = \frac{n \sum f_i X_i^2 - \left(\sum f_i X_i\right)^2}{n(n - 1)}$$
$$S_G^2 = \frac{60(8887) - (618)^2}{60(59)} = 42.74$$

Standard Deviation,
$$S = \sqrt{42.74} = 6.54 \text{ contracts}$$
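Both variance computations use the same numerator and differ only in the divisor. A minimal Python sketch using the example's totals (N = 60, Σ fX = 618, Σ fX² = 8887), with variable names introduced here for illustration:

```python
import math

n = 60            # number of observations
sum_fx = 618.0    # total of fi * Xi
sum_fx2 = 8887.0  # total of fi * Xi^2

numerator = n * sum_fx2 - sum_fx ** 2

pop_var = numerator / n ** 2          # data treated as a population
samp_var = numerator / (n * (n - 1))  # data treated as a sample

pop_std = math.sqrt(pop_var)
samp_std = math.sqrt(samp_var)

print(round(pop_var, 2), round(pop_std, 2))
print(round(samp_var, 2), round(samp_std, 2))
```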


Percentile

Percentile, Pi
➢Values that divide a set of observations into 100 equal parts.
➢Interpretation: These values, denoted by P1, P2, . . . . , P99, are such that
1% of the data falls below P1; 69% of the data falls below P69.
Decile

Decile, Di
➢Values that divide a set of observations into 10 equal parts.
➢Interpretation: These values, denoted by D1, D2, . . . . , D9, are such that
10% of the data falls below D1, 40% of the data falls below D4, . . . . , 90% of
the data falls below D9.
Quartile

Quartile, Qi
➢Values that divide a set of observations into 4 equal parts.
➢Interpretation: These values, denoted by Q1, Q2, and Q3, are such that 25% of
the data falls below Q1, 50% falls below Q2, and 75% falls below Q3.
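Quartiles, deciles, and percentiles can all be obtained as cut points of the same kind. The sketch below uses `statistics.quantiles` from Python's standard library (Python 3.8 or later) on hypothetical data; note that software packages use slightly different interpolation rules, so the values may differ a little from a hand computation or from MS Excel.

```python
import statistics

# Hypothetical observations, chosen only for illustration.
data = [3, 5, 7, 8, 9, 11, 13, 15, 18, 21, 24]

quartiles = statistics.quantiles(data, n=4)      # Q1, Q2, Q3
deciles = statistics.quantiles(data, n=10)       # D1 ... D9
percentiles = statistics.quantiles(data, n=100)  # P1 ... P99

q1, q2, q3 = quartiles
print(q1, q2, q3)
print(len(deciles), len(percentiles))
```

Q2, D5, and P50 all describe the same point, the median of the data.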
Group Data – problem Solving

Table 2.5 Audit Time


2019 Oct new Data\2020 Audit.xls
12 15 20 22 14

14 15 27 21 18

19 18 22 33 16

18 17 23 28 13
Grouped Data Analysis - Solution
Frequency distribution table (FDT)
Steps:
1. Determine The Range Of The Data,
R = Highest Value – Lowest Value
R = 33-12 = 21
2. Determine The Number Of Classes, K (some textbooks recommend between 5 and 20)
$$K = \sqrt{N} = \sqrt{20} = 4.47, \text{ say } 5 \quad \text{or} \quad K = 1 + 3.3 \log N = 5.29$$

3. Determine the class width, C
$$C = \frac{R}{K} = \frac{21}{5} = 4.2, \text{ say } 5 \text{ days}$$
Group Data – Table 2.6 Frequency data

Audit time (days)    Frequency
10-14                4
15-19                8
20-24                5
25-29                2
30-34                1
Total                20
Group Data – problem Solving – MS
Excel
Audit Time
12
15
20
22
14
14
15
27
21
18
19
18
22
33
16
18
17
23
28
13
Group Data – Charting – MS Excel

Audit Time    Upper Limit    Frequency
10-14         14             4
15-19         19             8
20-24         24             5
25-29         29             2
30-34         34             1
Total                        20

[Figure: Histogram for audit time data. Horizontal axis: Audit time (days), classes 10-14 through 30-34; vertical axis: Frequency (0 to 9).]

Notes: =FREQUENCY(A2:A21,D2:D6)
Use CTRL+SHIFT+ENTER
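For readers working outside MS Excel, the counting done by the FREQUENCY formula can be sketched in Python. It is assumed here that D2:D6 holds the Upper Limit column (14, 19, 24, 29, 34) and A2:A21 holds the 20 audit times; each value is counted in the first bin whose upper limit it does not exceed, with one extra slot for values above the last limit.

```python
from bisect import bisect_left

# Audit times (the values assumed to be in A2:A21).
audit_days = [12, 15, 20, 22, 14, 14, 15, 27, 21, 18,
              19, 18, 22, 33, 16, 18, 17, 23, 28, 13]

# Upper class limits (assumed to be the bins in D2:D6).
upper_limits = [14, 19, 24, 29, 34]

# One counter per bin, plus one overflow counter for values above 34.
counts = [0] * (len(upper_limits) + 1)
for x in audit_days:
    # Index of the first upper limit that is >= x.
    counts[bisect_left(upper_limits, x)] += 1

for limit, count in zip(upper_limits, counts):
    print(f"<= {limit}: {count}")
print(f"above {upper_limits[-1]}: {counts[-1]}")
```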
Group Data – Table 2.7 Frequency data

Audit time (days)    Relative Frequency    Percent Frequency
10-14                0.20                  20
15-19                0.40                  40
20-24                0.25                  25
25-29                0.10                  10
30-34                0.05                  5
Total                1.00                  100

$$\text{Relative frequency of the class} = \frac{\text{frequency of the class}}{n}$$
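The relative and percent frequency columns follow directly from the class frequencies. A minimal Python sketch, with variable names introduced here for illustration:

```python
# Class frequencies for the audit time data (Table 2.6).
frequency = {"10-14": 4, "15-19": 8, "20-24": 5, "25-29": 2, "30-34": 1}

n = sum(frequency.values())  # total number of observations

# Relative frequency = class frequency / n; percent frequency = 100 times that.
relative = {cls: f / n for cls, f in frequency.items()}
percent = {cls: 100 * f / n for cls, f in frequency.items()}

for cls in frequency:
    print(f"{cls}: {relative[cls]:.2f}  {percent[cls]:.0f}")
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a useful check on a hand-built table.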
Quiz
A portion of a frequency distribution table (FDT) is given below.
Classes Frequency
10.25 – 5
24.92 – 8
39.59 – 15
54.26 – 10
68.93 – 19
83.60 – 8
Required:
1.Construct the FDT
2.Determine the mode of the data
3.Determine the coefficient of skewness. Consider the data as a sample.
Frequency Tables:
Describing data

• Whenever statisticians use a sample to estimate population characteristics, they usually provide a statement of the quality, or precision, of the estimate. For example, the average lifetime for the population of lightbulbs is 76 hours with a margin of error of +/-4 hours. Thus, the interval estimate for the new lightbulbs with the new filament is 72 hours to 80 hours. The statistician can also state how confident he or she is that the interval from 72 to 80 hours contains the population average.
EDA:
Describing data

• Because statistical analysis involves working with large amounts of data, computer software is frequently used to conduct the analysis. Often the data to be analyzed reside in a spreadsheet. Given the data management, analysis, and presentation capabilities of modern spreadsheets, it is now possible to conduct statistical analyses using them.
EDA:
Describing data
• We want to emphasize that this course is not about spreadsheets. Our focus is on the appropriate procedures for collecting, analyzing, presenting, and interpreting data. Because MS Excel is widely available in business and engineering organizations, you can expect to use the knowledge gained here in the setting where you work now or soon will. If in the process of this course you become proficient with MS Excel, so much the better.
EDA:
Describing data

• We stress what the statistical procedure is, how it is used, and how to implement it using MS Excel.
EDA
EDA Field Trip.
Where do we go from here?
Videos about Describing Data
• 21
• The Accountant
END
Thoughts?
