Professional Documents
Culture Documents
Introduction To Statistics and Data Analytics: Math 105 Engineering Data Analysis For Ce
Introduction To Statistics and Data Analytics: Math 105 Engineering Data Analysis For Ce
Chapter 1
INTRODUCTION TO STATISTICS AND DATA ANALYTICS
Statistics is a science that helps us make decisions and draw conclusions in the
presence of variability. For example, civil engineers working in the transportation field are
concerned about the capacity of regional highway systems. A typical problem would
involve data on the number of nonwork, home-based trips, the number of persons per
household, and the number of vehicles per household, and the objective would be to
produce a trip-generation model relating trips to the number of persons per household
and the number of vehicles per household. The concepts of probability and statistics are
powerful ones and contribute extensively to the solutions of many types of engineering
problems.
Objectives:
After careful study of this chapter, you should be able to do the following:
1. Identify the role that statistics can play in the engineering problem-solving process.
2. Discuss how variability affects the data collected and used for making
engineering decisions.
3. Explain the difference between enumerative and analytical studies.
4. Discuss the different methods that engineers use to collect data.
5. Identify the advantages that designed experiments have in comparison to other
methods of collecting engineering data.
6. Display data graphically and interpret graphs: stemplots, histograms, and box
plots.
7. Recognize, describe, and calculate the measures of location of data.
8. Recognize, describe, and calculate the measures of the center of data.
9. Recognize, describe, and calculate the measures of the spread of data.
science that we call inferential statistics, which uses techniques that allow us to go
beyond merely reporting data to drawing conclusions (or inferences) about the scientific
system.
Information is gathered in the form of samples, or collections of observations.
Samples are collected from populations, which are collections of all individuals or
individual items of a particular type. At times, a population signifies a scientific system.
Physical laws are applied to help design products and processes. We are familiar
with this reasoning from general laws to specific cases. Moreover, it is also important to
reason from a set of measurements to more general cases to answer questions. This
reasoning is from a sample to a population. The reasoning is referred to as statistical
interference.
There are times when a scientific practitioner wishes only to gain some sort of
summary of a set of data represented in the sample. In other words, inferential statistics
is not required. Rather, a set of single-number statistics or descriptive statistics is helpful.
These numbers give a sense of center of the location of the data, variability in the data,
and the general nature of the distribution of observations in the sample.
studies are usually conducted for a relatively short time period, sometimes variables
that are not routinely measured can be included.
Quantitative data are always numbers. Quantitative data are the result of
counting or measuring attributes of a population. Amount of money, pulse rate,
weight, number of people living in your town, and number of students who take
statistics are examples of quantitative data. Quantitative data may be either discrete
or continuous.
All data that are the result of counting are called quantitative discrete data. These
data take on only certain numerical values. If you count the number of phone calls
you receive for each day of the week, you might get values such as zero, one, two,
or three.
All data that are the result of measuring are quantitative continuous data
assuming that we can measure accurately. Measuring angles in radians might result
𝜋 𝜋 𝜋 3𝜋
in such numbers as 6 , 3 , 2 , 𝜋, ,and so on. If you and your friends carry backpacks with
4
books in them to school, the numbers of books in the backpacks are discrete data
and the weights of the backpacks are continuous data.
Sampling
Gathering information about an entire population often costs too much or is
virtually impossible. Instead, we use a sample of the population. A sample should
have the same characteristics as the population it is representing. Most statisticians
use various methods of random sampling in an attempt to achieve this goal. This
section will describe a few of the most common methods. There are several different
methods of random sampling. In each form of random sampling, each member of a
population initially has an equal chance of being selected for the sample. Each
method has pros and cons. The easiest method to describe is called a simple random
sample. Any group of n individuals is equally likely to be chosen by any other group
of n individuals if the simple random sampling technique is used. In other words, each
sample of the same size has an equal chance of being selected.
Variation in Data
Variation is present in any set of data. It was mentioned previously that two or
more samples from the same population, taken randomly, and having close to the
same characteristics of the population will likely be different from each other.
Data can be classified into four levels of measurement. They are (from lowest to
highest level):
• Nominal scale level
Data that is measured using a nominal scale is qualitative. Categories, colors,
names, labels and favorite foods along with yes or no responses are examples of
nominal level data. Nominal scale data are not ordered.
• Ordinal scale level
Data that is measured using an ordinal scale is similar to nominal scale data
but there is a big difference. The ordinal scale data can be ordered. An example
of ordinal scale data is a list of the top five national parks in the United States. The
top five national parks in the Philippines can be ranked from one to five but we
cannot measure differences between the data.
• Interval scale level
Data that is measured using the interval scale is similar to ordinal level data
because it has a definite ordering but there is a difference between data. The
differences between interval scale data can be measured though the data does
not have a starting point. Temperature scales like Celsius (C) and Fahrenheit (F)
are measured by using the interval scale.
• Ratio scale level
Data that is measured using the ratio scale takes care of the ratio problem and
gives you the most information. Ratio scale data is like interval scale data, but it
has a 0 point and ratios can be calculated. For example, four multiple choice
statistics final exam scores are 80, 68, 20 and 92 (out of a possible 100 points). The
exams are machine-graded.
Frequency
A frequency is the number of times a value of the data occurs.
A relative frequency is the ratio (fraction or proportion) of the number of times a
value of the data occurs in the set of all outcomes to the total number of outcomes. To
find the relative frequencies, divide each frequency by the total number of elements in
the sample.
Cumulative relative frequency is the accumulation of the previous relative
frequencies. To find the cumulative relative frequencies, add all the previous relative
frequencies to the relative frequency for the current row.
Example 1.1
A certain machine is to dispense 1.6 kilos of alum. To determine whether it is
properly adjusted to dispense 1.6 kilos of quality control engineer weighed thirty bags of
1.6 kilo alum after the machine was adjusted. The data given below refer to the net
weight (in kilos) of each bag. Arrange the data in frequency table.
Solution:
Statistical data generated in large masses can be assessed by grouping the data
into different classes. The following are suggested steps in forming a frequency distribution
from raw data into grouped data.
1. Find the range (R). The range is the difference the largest and smallest value.
2. Decide on a suitable number of classes depending upon what information the
table is supposed to present. Sturge suggested the number of classes (m) as
m=1+3.3 log n where n=number of cases.
3. Determine the class size (c). C=R/m
4. Find the number of observations in each class. This is the class frequency (f).
Example 1.2
The following are the observed gasoline consumption in miles per gallon of 40 cars.
Arrange the data in a frequency distribution. Also provide the class boundaries and class
marks for each corresponding classes. Lastly, provide a table for the cumulative
frequency distribution of the gasoline consumption.
Solution:
𝑅 = 25.7 − 22.8
𝑚 = 1 + 3.3 log 40 = 6.2 𝑠𝑎𝑦 7 𝑐𝑙𝑎𝑠𝑠𝑒𝑠
𝑅 2.9
𝐶= = = 0.41~0.5
𝑚 7
The lowest value is 22.8.
So, we can use 22.5 as the lower limit of the 1st class.
Then, 22.5+0.5=23.0 is the lower limit of the 2nd class.
Solution (Cont.)
The class boundaries and class marks are,
has stem nine and leaf three. Write the stems in a vertical line from smallest to largest.
Draw a vertical line to the right of the stems. Then write the leaves in increasing order next
to their corresponding stem. The stemplot is a quick way to graph data and gives an
exact picture of the data. You want to look for an overall pattern and any outliers. An
outlier is an observation of data that does not fit the rest of the data. It is sometimes called
an extreme value. When you graph an outlier, it will appear not to fit the pattern of the
graph. Some outliers are due to mistakes (for example, writing down 50 instead of 500)
while others may indicate that something unusual is happening.
Example 1.3
For Prof. Dumbledore’s calculus class, scores for the first exam were as follows (lowest to
highest):
32, 42, 49, 49, 53, 55, 55, 61, 63, 67, 68, 68, 69, 72, 73, 74, 78,
80, 83, 88, 88, 88, 90, 92, 94, 94, 94, 94, 96, 100
Construct a stemplot and describe the data.
Another type of graph that is useful for specific data values is a line graph. the x-
axis (horizontal axis) consists of data values and the y-axis (vertical axis) consists of
frequency points. The frequency points are connected using line segments.
Example 1.4
In a survey, 40 mothers were asked how many times per week a teenager must be
reminded to do his or her chores. The results are; 2 mothers said 0 times, 5 said once, 8
said twice, 14 said 3 times, 7 said 4 times, 4 said 5 times. Construct a line graph and
interpret the survey.
Bar graphs consist of bars that are separated from each other. The bars can be
rectangles or they can be rectangular boxes (used in three-dimensional plots), and they
can be vertical or horizontal.
Example 1.5
By the end of 2020, TikTok had over 146 million users in the South East Asia. The table below
shows three age groups, the number of users in each age group. Construct a bar graph
using this data and describe the graph.
Example 1.6
The following data are the heights (in inches to the nearest half inch) of 100 male
professional soccer players. The heights are continuous data, since height is measured.
Calculate the width of each bar or class interval, relative frequency of each bar and
construct a histogram.
Frequency polygons are analogous to line graphs, and just as line graphs make
continuous data visually easy to interpret, so too do frequency polygons. To construct a
frequency polygon, first examine the data and decide on the number of intervals, or
class intervals, to use on the x-axis and y-axis.
After choosing the appropriate ranges, begin plotting the data points. After all the
points are plotted, draw line segments to connect them.
Example 1.7
Construct an overlay frequency polygon comparing the students’ calculus final test
scores with the students’ final numeric grade.
Suppose that we want to study the temperature range of a region for an entire
month. Every day at noon we note the temperature and write this down in a log. A variety
of statistical studies could be done with this data. We could find the mean or the median
temperature for the month. We could construct a histogram displaying the number of
days that temperatures reach a certain range of values. However, all of these methods
ignore a portion of the data that we have collected. One feature of the data that we
may want to consider is that of time. Since each date is paired with the temperature
reading for the day, we don‘t have to think of the data as being random. We can instead
use the times given to impose a chronological order on the data. A graph that recognizes
this ordering and displays the changing temperature as the month progresses is called a
time series graph.
To construct a time series graph, we must look at both pieces of our paired data
set. We start with a standard Cartesian coordinate system. The horizontal axis is used to
plot the date or time increments, and the vertical axis is used to plot the
values of the variable that we are measuring. By doing this, we make each point on the
graph correspond to a date and a measured quantity. The points on the graph are
typically connected by straight lines in the order in which they occur.
Example 1.8
The following data shows the Annual Consumer Price Index, each month, for ten years.
Construct a time series graph for the Annual Consumer Price Index data only.
The words “mean” and “average” are often used interchangeably. The
substitution of one word for the other is common practice. The technical term is
“arithmetic mean” and “average” is technically a center location. However, in practice
among non-statisticians, “average" is commonly accepted for “arithmetic mean.”
When each value in the data set is not unique, the mean can be calculated by
multiplying each distinct value by its frequency and then dividing the sum by the total
number of data values. The letter used to represent the sample mean is an x with a bar
over it (pronounced “x bar”): 𝑥̅ .The Greek letter μ (pronounced "mew") represents the
population mean. One of the requirements for the sample mean to be a good estimate
of the population mean is for the sample taken to be truly random. The following formulas
can be used in solving mean.
∑ 𝒙𝒊 ∑ 𝒙𝒊
𝝁= ̅=
𝒙
𝑵 𝒏
Where
𝑁 = 𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛 𝑠𝑖𝑧𝑒
𝑛 = 𝑠𝑎𝑚𝑝𝑙𝑒 𝑠𝑖𝑧𝑒
For the mean of the grouped data, there are three (3) ways to use,
The median of a set of numbers in an array is either the middle value or the
arithmetic mean of the two middle values.
If n is odd,
̃ = 𝒙 𝒏+𝟏
𝒙
( 𝟐 )
If n is even,
𝒙(𝒏) + 𝒙(𝒏+𝟏)
̃= 𝟐
𝒙 𝟐
𝟐
Another measure of the center is the mode. The mode is the most frequent value.
There can be more than one mode in a data set as long as those values have the same
frequency and that frequency is the highest. A data set with two modes is called
bimodal.
Example 1.9
AIDS data indicating the number of months a patient with AIDS lives after taking a new
antibody drug are as follows (smallest to largest):
3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29;
29; 31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47;
Calculate the mean and the median. Find the mode.
Example 1.10
Statistics exam scores for 20 students are as follows:
50; 53; 59; 59; 63; 63; 72; 72; 72; 72; 72; 76; 78; 81; 83; 84; 84; 84; 90; 93
Calculate the mean and the median. Find the mode.
Example 1.11
Find the mean, median, and mode of the gasoline consumption in Example 1.2.
The median is a number that measures the "center" of the data. You can think of
the median as the "middle value," but it does not actually have to be one of the observed
values. It is a number that separates ordered data into halves. Half the values are the
same number or smaller than the median, and half the values are the same number or
larger.
Quartiles are numbers that separate the data into quarters. Quartiles may or may
not be part of the data. To find the quartiles, first find the median or second quartile. The
first quartile, Q1, is the middle value of the lower half of the data, and the third quartile,
Q3, is the middle value, or median, of the upper half of the data.
The interquartile range is a number that indicates the spread of the middle half or
the middle 50% of the data. It is the difference between the third quartile (Q3) and the
first quartile (Q1).
IQR = Q3 – Q1
The IQR can help to determine potential outliers. A value is suspected to be a
potential outlier if it is less than (1.5)(IQR) below the first quartile or more than (1.5)(IQR)
above the third quartile. Potential outliers always require further investigation. A potential
outlier is a data point that is significantly different from the other data points. These special
data points may be errors or some kind of abnormality or they may be a key to
understanding the data.
Deciles are values that divide the data into ten equal parts. D1, D2,…, D9 are
values such that one tenth or 10% of the data falls below D1, 20% falls below D2, …, 90%
falls below D9.
Formula for finding the kth Percentile/ Quartile/ Decile (Ungrouped Data)
In finding the kth percentile/ quartile/ decile,
k = the kth percentile/ quartile/ decile. It may or may not be part of the data.
i = the index (ranking or position of a data value)
n = the total number of data
p= 100 if percentile, 4 if quartile, and 10 if decile
• Order the data from smallest to largest.
• Calculate i = (k /p)(n + 1)th
• If i is an integer, then the kth percentile/ quartile/ decile is the data value in the ith
position in the ordered set of data.
• If i is not an integer, then round i up and round i down to the nearest integers.
Average the two data values in these two positions in the ordered data set.
Where,
𝐿 = 𝑙𝑜𝑤𝑒𝑟 𝑏𝑜𝑢𝑛𝑑𝑎𝑟𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑖𝑛𝑔 𝑃𝑘
𝑘 = 𝑡ℎ𝑒 𝑘𝑡ℎ 𝑝𝑒𝑟𝑐𝑒𝑛𝑡𝑖𝑙𝑒
𝑛 = 𝑡𝑜𝑡𝑎𝑙 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦
(∑ 𝒇) = 𝑐𝑢𝑚𝑢𝑙𝑎𝑡𝑖𝑣𝑒 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑝𝑟𝑒𝑐𝑒𝑑𝑖𝑛𝑔 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑖𝑛𝑔 𝑃𝑘
𝑳
𝑓 = 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑖𝑛𝑔 𝑃𝑘
𝐶 = 𝑐𝑙𝑎𝑠𝑠 𝑖𝑛𝑡𝑒𝑟𝑣𝑎𝑙 𝑠𝑖𝑧𝑒 𝑜𝑓 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑖𝑛𝑔 𝑃𝑘
Where,
𝐿 = 𝑙𝑜𝑤𝑒𝑟 𝑏𝑜𝑢𝑛𝑑𝑎𝑟𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑖𝑛𝑔 𝑄𝑘
𝑘 = 𝑡ℎ𝑒 𝑘𝑡ℎ 𝑞𝑢𝑎𝑟𝑡𝑖𝑙𝑒
𝑛 = 𝑡𝑜𝑡𝑎𝑙 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦
(∑ 𝒇) = 𝑐𝑢𝑚𝑢𝑙𝑎𝑡𝑖𝑣𝑒 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑝𝑟𝑒𝑐𝑒𝑑𝑖𝑛𝑔 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑖𝑛𝑔 𝑄𝑘
𝑳
𝑓 = 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑖𝑛𝑔 𝑄𝑘
𝐶 = 𝑐𝑙𝑎𝑠𝑠 𝑖𝑛𝑡𝑒𝑟𝑣𝑎𝑙 𝑠𝑖𝑧𝑒 𝑜𝑓 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑖𝑛𝑔 𝑄𝑘
Where,
𝐿 = 𝑙𝑜𝑤𝑒𝑟 𝑏𝑜𝑢𝑛𝑑𝑎𝑟𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑖𝑛𝑔 𝐷𝑘
𝑘 = 𝑡ℎ𝑒 𝑘𝑡ℎ 𝑑𝑒𝑐𝑖𝑙𝑒
𝑛 = 𝑡𝑜𝑡𝑎𝑙 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦
(∑ 𝒇) = 𝑐𝑢𝑚𝑢𝑙𝑎𝑡𝑖𝑣𝑒 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑝𝑟𝑒𝑐𝑒𝑑𝑖𝑛𝑔 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑖𝑛𝑔 𝐷𝑘
𝑳
𝑓 = 𝑓𝑟𝑒𝑞𝑢𝑒𝑛𝑐𝑦 𝑜𝑓 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑖𝑛𝑔 𝐷𝑘
𝐶 = 𝑐𝑙𝑎𝑠𝑠 𝑖𝑛𝑡𝑒𝑟𝑣𝑎𝑙 𝑠𝑖𝑧𝑒 𝑜𝑓 𝑡ℎ𝑒 𝑐𝑙𝑎𝑠𝑠 𝑐𝑜𝑛𝑡𝑎𝑖𝑛𝑖𝑛𝑔 𝐷𝑘
Example 1.12
For the following 13 13 real estate prices, calculate the IQR and determine if any prices
are potential outliers. Prices are in pesos.
389,950; 230,500; 158,000; 479,000; 639,000; 114,950; 5,500,000; 387,000; 659,000; 529,000;
575,000; 488,800; 1,095,000
Example 1.13
Find the interquartile range for the following two data sets and compare them.
Example 1.14
Fifty statistics students were asked how much sleep they get per school night (rounded
to the nearest hour). The results were:
Example 1.15
Listed are 29 ages for Academy Award winning best actors in order from smallest to
largest.
18; 21; 22; 25; 26; 27; 29; 30; 31; 33; 36; 37; 41; 42; 47; 52; 55; 57; 58; 62; 64; 67; 69; 71; 72; 73;
74; 76; 77
a. Find the 70th percentile.
b. Find the 83rd percentile.
Example 1.16
On a 20 question math test, the 70th percentile for number of correct answers was
16. Interpret the 70th percentile in the context of this situation.
Example 1.17
On a timed math test, the first quartile for time it took to finish the exam was 35
minutes. Interpret the first quartile in the context of this situation.
Example 1.18
Find the D1, D3, Q1, P1, P30, and P90 for the distribution of gasoline consumption
of 40 cars in Example 1.2.
Shortcut formula
1 (∑ 𝑓𝑖 𝑥𝑖 )2
𝑠=√ [∑ 𝑓𝑖 𝑥𝑖 2 − ]
𝑛−1 𝑛
Example 1.19
In a fifth grade class, the teacher was interested in the average age and the sample
standard deviation of the ages of her students. The following data are the ages for a
SAMPLE of n = 20 fifth grade students. The ages are rounded to the nearest half year:
9; 9.5; 9.5; 10; 10; 10; 10; 10.5; 10.5; 10.5; 10.5; 11; 11; 11; 11; 11; 11; 11.5; 11.5; 11.5;
Example 1.20
Determine the standard deviation of the gasoline consumption in Example 1.2.