
Bachelor of Science in Civil Engineering 1

GEC 3-Mathematics in the Modern World


MODULE 4: Data Management

INTRODUCTION

Statistics is the study of data, from its form to its relevance to daily lives. Data is
everywhere. It is observable or measurable. With the advancement of technology every
day, data can be accessed anywhere and by anyone. When data is correct, valid analysis
and interpretation can be generated to produce valuable information. There are many
classifications of data. Different kinds of data are collected, analyzed, and interpreted.
Being able to differentiate them is the first thing that must be considered when organizing
data.

OBJECTIVES

• Identify and distinguish the classifications of data.
• Calculate the measures of central tendency and measures of dispersion.
• Use a variety of statistical tools to process and manage numerical data.
• Use the methods of linear regression and correlation to predict the value of a variable
given certain conditions.
• Advocate the use of statistical data in making important decisions.

DISCUSSION PROPER

Gathering and Organizing Data

Statistics

Statistics is a form of mathematical analysis that uses quantified models, representations, and
synopses for a given set of experimental data or real-life studies. Statistics studies
methodologies to gather, review, analyze and draw conclusions from data.

For instance, according to The World Factbook, published by the Central Intelligence
Agency (CIA), in 2015 there were approximately 105 males for every 100 females between
the ages of 15 and 24. However, in the category of people 65 years old and older, there
were approximately 79 men for every 100 women.

Data

What is data?

• Data on its own carries no meaning.
• Data is a collection of facts, such as values or measurements.
• It can be numbers, words, measurements, observations, or even just descriptions
of things.
• Data can be qualitative or quantitative.
Qualitative data is descriptive information (it describes something).
Quantitative data is numerical information (numbers).

Prepared by: Engr. Jerico P. Fiel



Data Collection

Data collection is the process of preparing and collecting data.
The purpose of data collection is to obtain information to keep on record, to make
decisions about important issues, and to pass information on to others.

Data constitute the foundation of statistical analysis and interpretation. Hence, the first
step in statistical work is to obtain data. Data can be obtained from three important
sources: surveys, observations, and secondary data.

Methods of Data Collection

SURVEYS - Surveys are a method of gathering information from a group of individuals
by asking them questions. Surveys can be conducted through various mediums such as
paper and pencil, online forms, telephone, or face-to-face interviews.

A. Direct (Interview Method) - An interview is a qualitative research method that
relies on asking questions in order to collect data. Interviews involve two or more people,
one of whom is the interviewer asking the questions. There are several types of interviews,
often differentiated by their level of structure.
Structured - interviews have predetermined questions asked in a predetermined order.
Unstructured - interviews are more free-flowing.

B. Indirect (Questionnaire Method) - Questionnaires can be thought of as a kind
of written interview. They can be carried out face to face, by telephone, computer or post.
Questionnaires provide a relatively cheap, quick and efficient way of obtaining large
amounts of information from a large sample of people.

Your questionnaire can include open-ended or closed-ended (fixed) questions, or
a combination of both.

Using closed-ended questions limits your responses, while open-ended questions
enable a broad range of answers. You’ll need to balance these considerations with your
available time and resources.

Closed-ended (Fixed) questions

Closed-ended, or restricted-choice, questions offer respondents a fixed set of
choices to select from. Closed-ended questions are best for collecting data on categorical
or quantitative variables.


Categorical variables can be nominal or ordinal. Quantitative variables can be
interval or ratio. Understanding the type of variable and level of measurement means you
can perform appropriate statistical analyses for generalisable results.

Examples of closed-ended questions for different variables


Nominal variables include categories that can’t be ranked, such as race or
ethnicity. This includes binary or dichotomous categories.

It’s best to include categories that cover all possible answers and are
mutually exclusive. There should be no overlap between response items.

In binary or dichotomous questions, you’ll give respondents only two
options to choose from.

Example: Nominal variables


What is your race?
White
Black or African American
American Indian or Alaska Native
Asian
Native Hawaiian or Other Pacific Islander

Are you satisfied with the current work-from-home policies?


Yes
No

Ordinal variables include categories that can be ranked. Consider how wide
or narrow a range you’ll include in your response items, and their relevance to your
respondents.

Example: Ordinal variables


What is your age?
15 or younger
16–35
36–60
61–75
76 or older

Likert-type questions collect ordinal data using rating scales with five or
seven points.

Example: Likert-type questions


How satisfied or dissatisfied are you with your online shopping experience
today?
Very dissatisfied
Somewhat dissatisfied
Neither satisfied nor dissatisfied
Somewhat satisfied
Very satisfied

When you have four or more Likert-type questions, you can treat the
composite data as quantitative data on an interval scale. Intelligence tests,
psychological scales, and personality inventories use multiple Likert-type
questions to collect interval data.
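
As a rough illustration of the composite idea, the sketch below codes the five response options as 1 to 5 and averages one respondent's answers to four Likert-type items. The numeric coding, the function name composite_score, and the sample answers are assumptions made for illustration; the module does not prescribe them.

```python
# Assumed numeric coding for the five-point scale shown above.
LIKERT_CODES = {
    "Very dissatisfied": 1,
    "Somewhat dissatisfied": 2,
    "Neither satisfied nor dissatisfied": 3,
    "Somewhat satisfied": 4,
    "Very satisfied": 5,
}


def composite_score(responses):
    """Return the mean of the coded responses for one respondent."""
    codes = [LIKERT_CODES[r] for r in responses]
    return sum(codes) / len(codes)


# Hypothetical answers from one respondent to four Likert-type items.
answers = [
    "Somewhat satisfied",
    "Very satisfied",
    "Neither satisfied nor dissatisfied",
    "Somewhat satisfied",
]
score = composite_score(answers)
print(score)  # 4.0
```

Each single answer is still ordinal; it is only the composite across several items that is commonly treated as interval data.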


With interval or ratio data, you can apply strong statistical hypothesis tests
to address your research aims.

Pros and cons of closed-ended questions


Well-designed closed-ended questions are easy to understand and can be
answered quickly. However, you might still miss important answers that are
relevant to respondents. An incomplete set of response items may force some
respondents to pick the closest alternative to their true answer. These types of
questions may also miss out on valuable detail.

To solve these problems, you can make questions partially closed-ended,
and include an open-ended option where respondents can fill in their own answer.

Open-ended questions

Open-ended, or long-form, questions allow respondents to give answers in their
own words. Because there are no restrictions on their choices, respondents can answer in
ways that researchers may not have otherwise considered. For example, respondents may
want to answer ‘multiracial’ for the question on race rather than selecting from a
restricted list.

Example: Open-ended questions


How do you feel about open science?
How would you describe your personality?
In your opinion, what is the biggest obstacle to productivity in remote work?

Open-ended questions have a few downsides.

They require more time and effort from respondents, which may deter them
from completing the questionnaire.

For researchers, understanding and summarising responses to these
questions can take a lot of time and resources. You’ll need to develop a systematic
coding scheme to categorise answers, and you may also need to involve other
researchers in data analysis for high reliability.

OBSERVATIONS

Quantitative data is expressed in numbers and graphs and is analysed through statistical
methods.

Qualitative data is expressed in words and analysed through interpretations and
categorisations.

SECONDARY DATA - Secondary data is research data that has previously been gathered and can be
accessed by researchers. The term contrasts with primary data, which is data collected
directly from its source.

Example: Secondary Data


1. Censuses and records from government departments, such as housing, social security,
electoral statistics, and tax records.
2. Internet searches and libraries.
3. GPS and remote sensing.
4. Progress reports.
5. Journals, newspapers and magazines.

Classification of Data

The process of arranging data into homogeneous groups or classes according to some
common characteristics present in the data is called classification.

For example, when letters are sorted in a post office, they are classified
according to city and further arranged according to street.

Bases of Classification

There are four important bases of classification:


(1) Qualitative Base
(2) Quantitative Base
(3) Geographical Base
(4) Chronological or Temporal Base

Qualitative Base:
Data are classified according to some quality or attribute, such as sex, religion,
literacy, or intelligence.

Quantitative Base:
Data are classified by quantitative characteristics, such as height, weight, age,
or income.

Geographical Base:
Data are classified by geographical region or location, such as state, province,
city, or country.

Chronological or Temporal Base:
Data are classified or arranged by their time of occurrence, such as years,
months, weeks, or days.
For example: time series data.

Types of Classification
1. One-way Classification:
If we classify observed data according to a single characteristic, this type of classification
is known as one-way classification.
For example: The population of the world may be classified by religion as Muslim, Christian,
etc.

2. Two-way Classification:
If we consider two characteristics at a time in order to classify the observed data, then we
are doing a two-way classification.

For example: The population of the world may be classified by religion and sex.
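
The difference between the two types can be sketched in a few lines of Python. The eight records below are made-up illustrative data, not figures from any census; collections.Counter simply tallies how many records fall in each class.

```python
from collections import Counter

# Hypothetical records: (religion, sex) for eight people.
people = [
    ("Christian", "F"), ("Muslim", "M"), ("Christian", "M"), ("Muslim", "F"),
    ("Christian", "F"), ("Christian", "M"), ("Muslim", "M"), ("Christian", "F"),
]

# One-way classification: group by a single characteristic (religion).
one_way = Counter(religion for religion, _sex in people)

# Two-way classification: group by two characteristics at once (religion and sex).
two_way = Counter(people)

print(one_way["Christian"], one_way["Muslim"])  # 5 3
print(two_way[("Christian", "F")])              # 3
```

A one-way classification yields one count per class, while a two-way classification yields one count per pair of classes, which is what a two-way table displays.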

Measures of Central Tendency

In mathematics and statistics, the arithmetic mean (/ærɪθˈmɛtɪk ˈmiːn/, stress on first
and third syllables of "arithmetic"), or simply the mean or the average (when the context
is clear), is the sum of a collection of numbers divided by the count of numbers in the
collection. The collection is often a set of results of an experiment or an observational
study, or frequently a set of results from a survey. The term "arithmetic mean" is preferred
in some contexts in mathematics and statistics, because it helps distinguish it from other
means, such as the geometric mean and the harmonic mean.

In addition to mathematics and statistics, the arithmetic mean is used frequently in many
diverse fields such as economics, anthropology and history, and it is used in almost every
academic field to some extent. For example, per capita income is the arithmetic average
income of a nation's population.

While the arithmetic mean is often used to report central tendencies, it is not a robust
statistic, meaning that it is greatly influenced by outliers (values that are very much larger
or smaller than most of the values). Notably, for skewed distributions, such as the
distribution of income for which a few people's incomes are substantially greater than
most people's, the arithmetic mean may not coincide with one's notion of "middle", and
robust statistics, such as the median, may provide better description of central tendency.

The arithmetic mean is the most commonly used and readily understood measure of
central tendency in a data set. In statistics, the term average refers to any of the measures
of central tendency. The arithmetic mean of a set of observed data is defined as being
equal to the sum of the numerical values of each and every observation, divided by the
total number of observations. Symbolically, if we have a data set consisting of the values
a1, a2, …, an, then the arithmetic mean x̄ is defined by the formula:

\[ \bar{x} = \frac{a_1 + a_2 + a_3 + \cdots + a_n}{n} \]
If the data set is a statistical population (i.e., consists of every possible observation and
not just a subset of them), then the mean of that population is called the population mean,
and denoted by the Greek letter µ.

Examples:

1. The marks obtained by 6 students in a class test are 20, 22, 24, 26, 28, 30. Find the
mean.

Solution:

x̄ = (20 + 22 + 24 + 26 + 28 + 30)/6 = 150/6

Therefore, mean = 25

2. The arithmetic mean of the 14 observations 26, 12, 14, 15, x, 17, 9, 11, 18, 16, 28, 20, 22, 8
is 17. Find the missing observation.

Solution:

The 14 given observations are: 26, 12, 14, 15, x, 17, 9, 11, 18, 16, 28, 20, 22, 8
Arithmetic mean = 17
The 13 known observations sum to 216, so the sum of all observations is 216 + x.
We know that,
Arithmetic mean = Sum of observations/Total number of observations
17 = (216 + x)/14
17 × 14 = 216 + x
216 + x = 238
x = 238 – 216
x = 22


Therefore, the missing observation is 22.
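
Both worked examples can be checked with a short Python sketch. The helper name arithmetic_mean is chosen here for illustration; the second part rearranges the mean formula to solve for the unknown observation.

```python
def arithmetic_mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)


# Example 1: marks obtained by 6 students.
marks = [20, 22, 24, 26, 28, 30]
print(arithmetic_mean(marks))  # 25.0

# Example 2: 14 observations with mean 17, one of which (x) is unknown.
known = [26, 12, 14, 15, 17, 9, 11, 18, 16, 28, 20, 22, 8]
target_mean = 17
n = 14
# mean = (sum(known) + x) / n, so x = mean * n - sum(known)
missing = target_mean * n - sum(known)
print(missing)  # 22
```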

Median

The arithmetic mean may be contrasted with the median. The median is defined such that
no more than half the values are larger than, and no more than half are smaller than, the
median. If elements in the data increase arithmetically, when placed in some order, then
the median and arithmetic average are equal. For example, consider the data sample 1, 2,
3, 4. The average is 2.5, as is the median. However, when we consider a sample that
cannot be arranged so as to increase arithmetically, such as 1, 2, 4, 8, 16, the median and
arithmetic average can differ significantly. In this case, the arithmetic average is 6.2, while
the median is 4. In general, the average value can vary significantly from most values in
the sample, and can be larger or smaller than most of them.

There are applications of this phenomenon in many fields. For example, since the 1980s,
the median income in the United States has increased more slowly than the arithmetic
average of income.

Examples:

Find the median of the data in the following lists.

a. 4, 8, 1, 14, 9, 21, 12 b. 46, 23, 92, 89, 77, 108

Solution:

The list 4, 8, 1, 14, 9, 21, 12 contains 7 numbers. The median of the list with an odd number
of entries is found by ranking the numbers and finding the middle number. Ranking the
numbers from the smallest number to largest gives

1, 4, 8, 9, 12, 14, 21

The middle number is 9. Thus 9 is the median.

The list 46, 23, 92, 89, 77, 108 contains 6 numbers. The median of a list of data with an
even number of entries is found by ranking the numbers and computing the mean of the
two middle numbers. Ranking the numbers from the smallest to largest gives

23, 46, 77, 89, 92, 108

The two middle numbers are 77 and 89. The mean of 77 and 89 is 83. Thus 83 is the median
of the data.
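
The ranking procedure used in both solutions can be written as a small Python function. This is a minimal sketch; the function name median is our own, and Python's standard statistics module offers an equivalent statistics.median.

```python
import statistics


def median(values):
    """Rank the values, then take the middle one (odd count)
    or the mean of the two middle ones (even count)."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2


print(median([4, 8, 1, 14, 9, 21, 12]))   # 9
print(median([46, 23, 92, 89, 77, 108]))  # 83.0

# The sample from the discussion above: the mean is pulled up by 16,
# while the median stays at the middle value.
sample = [1, 2, 4, 8, 16]
print(statistics.mean(sample), median(sample))  # 6.2 4
```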

Mode

The mode of a list of numbers is the number that occurs most frequently. In statistics,
data can be distributed in various ways. The most often cited distribution is the classic
normal (bell-curve) distribution. In this, and some other distributions, the mean
(average) value falls at the mid-point, which is also the peak frequency of observed
values. For such a distribution, the mean, median, and mode are all the same value. This
means that this value is the average value, the middle value, and also the mode, the most
frequently occurring value in the data.


Mode is most useful as a measure of central tendency when examining categorical data,
such as models of cars or flavors of soda, for which a mathematical average or a median value
based on ordering cannot be calculated.

Advantages and Disadvantages of the Mode


Advantages:
• The mode is easy to understand and calculate.
• The mode is not affected by extreme values.
• The mode is easy to identify in a data set and in a discrete frequency distribution.
• The mode is useful for qualitative data.
• The mode can be computed in an open-ended frequency table.
• The mode can be located graphically.
Disadvantages:
• The mode is not defined when there are no repeats in a data set.
• The mode is not based on all values.
• The mode is unstable when the data consist of a small number of values.
• Sometimes data have one mode, more than one mode, or no mode at all.

Examples:

1. For example, in the following list of numbers, 16 is the mode since it appears more times
in the set than any other number:
• 3, 3, 6, 9, 16, 16, 16, 27, 27, 37, 48

2. A set of numbers can have more than one mode (this is known as bimodal if there are
two modes) if there are multiple numbers that occur with equal frequency, and more
times than the others in the set.
• 3, 3, 3, 9, 16, 16, 16, 27, 37, 48
In the above example, both the number 3 and the number 16 are modes as they each occur
three times and no other number occurs more often.

3. If no number in a set of numbers occurs more than once, that set has no mode:
• 3, 6, 9, 16, 27, 37, 48
A set of numbers with two modes is bimodal, a set of numbers with three modes is
trimodal, and any set of numbers with more than one mode is multimodal.

4. Find the mode of the data in the following lists.


a. 18, 15, 21, 16, 15, 14, 15, 21 b. 2, 5, 8, 9, 11, 4, 7, 23

Solution:

a. In the list 18, 15, 21, 16, 15, 14, 15, 21, the number 15 occurs more often than the other
numbers. Thus 15 is the mode.

b. Each number in the list 2, 5, 8, 9, 11, 4, 7, 23 occurs only once. Because no number
occurs more than the others, there is no mode.
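
The counting behind these answers can be sketched in Python as follows. The function name modes is our own, and it follows this module's convention that a list with no repeated value has no mode. (Python's built-in statistics.multimode uses a different convention and returns every value in that case.)

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest count.
    Returns an empty list when no value occurs more than once."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # nothing repeats, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)


print(modes([18, 15, 21, 16, 15, 14, 15, 21]))      # [15]
print(modes([2, 5, 8, 9, 11, 4, 7, 23]))            # []
print(modes([3, 3, 3, 9, 16, 16, 16, 27, 37, 48]))  # [3, 16]
```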

Weighted Average

A weighted average, or weighted mean, is an average in which some data points count
more heavily than others, in that they are given more weight in the calculation. For
example, the arithmetic mean of 3 and 5 is (3 + 5)/2 = 4, or equivalently (1/2 × 3) + (1/2 × 5) = 4.
In contrast, a weighted mean in which the first number receives, for example, twice as
much weight as the second (perhaps because it is assumed to appear twice as often in the
general population from which these numbers were sampled) would be calculated as
(2/3 × 3) + (1/3 × 5) = 11/3. Here the weights, which necessarily sum to the value one, are 2/3
and 1/3, the former being twice the latter. The arithmetic mean (sometimes called the
"un-weighted average" or "equally weighted average") can be interpreted as a special case
of a weighted average in which all the weights are equal to each other (equal to 1/2 in the
above example, and equal to 1/n in a situation with n numbers being averaged).

In some cases, you might want a number to have more weight. In that case, you’ll want to
find the weighted mean. To find the weighted mean:

1. Multiply the numbers in your data set by the weights.
2. Add the results up.

For example, for the data set 1, 3, 5, 7, 10 with equal weights (1/5 for each number), the math to find
the weighted mean would be:
1(1/5) + 3(1/5) + 5(1/5) + 7(1/5) + 10(1/5) = 5.2

Example:

You take three 100-point exams in your statistics class and score 80, 80 and 95. The last
exam is much easier than the first two, so your professor has given it less weight. The
weights for the three exams are:

Exam 1: 40% of your grade. (Note: 40% as a decimal is 0.4.)
Exam 2: 40% of your grade.
Exam 3: 20% of your grade.

What is your final weighted average for the class?

1. Multiply the numbers in your data set by the weights:
0.4(80) = 32
0.4(80) = 32
0.2(95) = 19
2. Add the numbers up: 32 + 32 + 19 = 83.

The percent weight given to each exam is called a weighting factor.

Weighted Mean Formula

The weighted mean is relatively easy to find. But in some cases, the weights might not add
up to 1. In those cases, you’ll need to use the weighted mean formula. The only difference
between the formula and the steps above is that you divide by the sum of all the weights.

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i w_i}{\sum_{i=1}^{n} w_i} \]
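
A minimal Python sketch of the weighted mean formula is given below, applied to the exam example. The function name weighted_mean is our own. Dividing by the sum of the weights means the same function works whether or not the weights add up to 1.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = sum(x * w for x, w in zip(values, weights))
    return total / sum(weights)


scores = [80, 80, 95]

# Weights as fractions that sum to 1 (result is 83, up to floating-point rounding).
print(round(weighted_mean(scores, [0.4, 0.4, 0.2]), 2))

# The same proportions written as 2 : 2 : 1, which do not sum to 1.
print(weighted_mean(scores, [2, 2, 1]))  # 83.0
```

With equal weights the function reduces to the ordinary arithmetic mean, which matches the special case described in the text.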

Measures of Variability: Range, Variance, and Standard Deviation

While the mean and median tell you about the center of your observations, they say nothing
about the 'spread' of the numbers.

Example:
Suppose two machines produce nails which are on average 10 inches long. A sample of 11
nails is selected from each machine.
Machine A: 6, 8, 8, 10, 10, 10, 10, 10, 12, 12, 14
Machine B: 6, 6, 6, 8, 8, 10, 12, 12, 14, 14, 14


To verify, let's compute the mean:

mean for machine A: 110 / 11 = 10
mean for machine B: 110 / 11 = 10

In both cases, the mean is 10, indeed. However, the first machine seems to be the better
one, since most nails are close to 10 inches. Therefore:
We must find additional numbers indicating the 'spread' of the data.

The Range

The easiest measure of the data spread is the range. It is simply the highest data value
minus the lowest data value (we have seen the range before). In the above example, the
range is the same for both data sets, namely 14 - 6 = 8. The range, while useful, is too crude a
measure of variability: it depends only on the two extreme values, so it cannot tell the two
machines apart.
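
A short Python check of the range for the two nail samples follows; the helper name data_range is our own.

```python
machine_a = [6, 8, 8, 10, 10, 10, 10, 10, 12, 12, 14]
machine_b = [6, 6, 6, 8, 8, 10, 12, 12, 14, 14, 14]


def data_range(values):
    """Highest data value minus lowest data value."""
    return max(values) - min(values)


print(data_range(machine_a), data_range(machine_b))  # 8 8
```

Both samples share the same minimum and maximum, so the range is identical even though the values in between are distributed very differently.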

The Variance

We want to find out how much the data points are spread around the mean. To do that,
we could find the difference between each data point and the mean, and average these
differences. However, we want to measure the differences to the mean regardless of the
sign (positive or negative difference). Therefore, we could find the absolute value of the
difference between each data point and the mean, and average those. But for theoretical reasons an
absolute value function is not easy to deal with, so one chooses a square function
instead (which also neutralizes signs). Finally, for yet other theoretical reasons we shall
use not the sample size n to compute an average, but instead n - 1.

Hence, we will use this formula to compute the data spread, or variance:

Variance = add up the squares of (Data points - mean), then divide that sum by (n - 1)

There are two symbols for the variance, just as for the mean:
• σ² is the variance for a population
• s² is the variance for a sample

In other words, the variance is computed according to the formulas:

\[ \sigma^2 = \frac{\sum (x - \mu)^2}{N} \quad \text{(for the population variance, where } N \text{ is the population size)} \]

\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \quad \text{(for the sample variance, where } n \text{ is the sample size)} \]

The two formulas differ in the mean they use (the population mean μ versus the sample mean x̄)
and in the divisor (N versus n - 1). The examples in this module work with samples, so they use
the second formula. It is useful to compute the variance at least once "by hand" before using
software such as Excel to accomplish the same feat quickly and easily.

How to find the variance "by hand":


1. Make a table of all x values
2. Find the mean of the data
3. Include a column with the difference to the mean
4. Include a column with the square of difference to the mean
5. Add the last column and divide the sum by (n - 1).

Here is the table that this procedure produces for the above sample of nails from machine
A and B:


Machine A:
x      (x − x̄)    (x − x̄)²
6      -4         16
8      -2         4
8      -2         4
10     0          0
10     0          0
10     0          0
10     0          0
10     0          0
12     2          4
12     2          4
14     4          16
Therefore, the variance for machine A is: (16 + 4 + 4 + 0 + 0 + 0 + 0 + 0 + 4 + 4 + 16) /
10 = 48 / 10 = 4.8

Machine B:
x      (x − x̄)    (x − x̄)²
6      -4         16
6      -4         16
6      -4         16
8      -2         4
8      -2         4
10     0          0
12     2          4
12     2          4
14     4          16
14     4          16
14     4          16
Therefore, the variance for machine B is: (16 + 16 + 16 + 4 + 4 + 0 + 4 + 4 + 16 + 16 + 16)
/ 10 = 112 / 10 = 11.2

In other words, the variance, or spread around the mean, for machine A is 4.8, while
machine B has a variance (spread) of 11.2. That means that machine A, as a rule, produces
nails that stay pretty close to the average nail length.
Machine B, on the other hand, produces nails with more variability than machine A.
Therefore, machine A would be much preferred over machine B.
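
The "by hand" procedure can be mirrored step by step in Python. This is a sketch of the sample variance (divisor n - 1) as used in the tables above; the function name sample_variance is our own, and the standard library's statistics.variance computes the same quantity.

```python
def sample_variance(values):
    """Sample variance: squared differences from the mean, summed,
    then divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n                               # step 2: find the mean
    squared_diffs = [(x - mean) ** 2 for x in values]    # steps 3 and 4
    return sum(squared_diffs) / (n - 1)                  # step 5


machine_a = [6, 8, 8, 10, 10, 10, 10, 10, 12, 12, 14]
machine_b = [6, 6, 6, 8, 8, 10, 12, 12, 14, 14, 14]

print(sample_variance(machine_a))  # 4.8
print(sample_variance(machine_b))  # 11.2
```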

Note: The unit of the variance is the square of the original unit; hence, it is not the best
number (considering units). Therefore, one introduces an additional number, called the
standard deviation:

The Standard Deviation

The standard deviation is the square root of the variance.


As with the mean, there are two symbols for the standard deviation:

σ is the population standard deviation.
s is the sample standard deviation.

Example:
Consider the sample data 6, 7, 5, 3, 4. Compute the standard deviation for that data.


To compute the standard deviation, we must first compute the mean, then the variance,
and finally we can take the square root to obtain the standard deviation. In this case we
do not need to create a table since there are so few numbers:
• Computing the mean:
\[ \bar{x} = \frac{6 + 7 + 5 + 3 + 4}{5} = \frac{25}{5} = 5 \]
• Computing the variance:
\[ s^2 = \frac{(6 - 5)^2 + (7 - 5)^2 + (5 - 5)^2 + (3 - 5)^2 + (4 - 5)^2}{5 - 1} = \frac{10}{4} = 2.5 \]
• Standard deviation:
\[ s = \sqrt{2.5} \approx 1.58 \]
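
The same three steps can be checked in Python. This is a minimal sketch for the sample 6, 7, 5, 3, 4; the variable names are our own, and statistics.stdev from the standard library is used only as a cross-check.

```python
import math
import statistics

data = [6, 7, 5, 3, 4]

mean = sum(data) / len(data)                                      # 5.0
variance = sum((x - mean) ** 2 for x in data) / (len(data) - 1)   # 2.5
std_dev = math.sqrt(variance)

print(mean, variance, round(std_dev, 2))   # 5.0 2.5 1.58
print(round(statistics.stdev(data), 2))    # 1.58
```

Unlike the variance, the standard deviation is expressed in the same unit as the original data.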

SUMMARY
Statistics is used in every aspect of life, such as in data science, robotics, business, sports,
weather forecasting, and much more.

REFERENCES

Books:
Mathematics in the Modern World, 14th Edition, by Richard Aufmann, et al.
Mathematics in the Modern World, Philippine Edition, by REX Book Store
Mathematics in the Modern World, by Esmeralda A. Manlulu, et al.

ISUI-CvE-Mod
Revision: 02
Effectivity: August 1, 2020
