
MATHEMATICS IN THE MODERN WORLD

DATA MANAGEMENT
Lesson 4.1 Statistics and Data

Statistics is the science that deals with the collection, presentation, analysis, and interpretation of data. The two
divisions of statistics are descriptive and inferential statistics. Descriptive statistics deals with the gathering,
classification, presentation, and analysis of data without generalizing the results for the entire population.
Inferential statistics is concerned with generalizing sample results to the whole population. It demands
inductive reasoning and requires a higher degree of critical judgment and more advanced mathematical methods.

Functions of Statistics

a. Organizes data for presentation and better understanding
b. Estimates quantities and measurements
c. Facilitates information dissemination
d. Helps in establishing differences
e. Explains the relationship between variables of interest
f. Tests assertions and claims
g. Predicts and forecasts future outcomes

Data are the raw materials of research or any statistical investigation. They arise when measurements are made
and/or observations are recorded. In general, data can be categorized as quantitative or qualitative. Quantitative
data take numerical values for which descriptions such as means, standard deviations, and other parameters or
statistics are meaningful.
Qualitative data, such as the eye and hair colors of an individual, cannot be operated on arithmetically. They
are labels that indicate the category or class into which an individual, object, or process falls.

Data can also be categorized according to source as primary or secondary. Primary data refer to information
that is gathered directly from an original source or that is based on direct or first-hand experience.
Secondary data refer to information taken from published or unpublished materials that have been previously
gathered by other individuals, researchers, or agencies.

Qualities of Statistical Data

1. Accuracy – reflects how close the measurement or the data collected is to the true value
2. Precision – refers to the repeatability or consistency of the information collected
3. Timeliness – refers to the time interval between the date the event occurred and the time the
information is collected or disseminated
4. Adequacy – reflects whether the data collected contain the information needed to meet the
requirements or the interest of the collection process
5. Relevance – reflects how well the data collected match the needs of the users
6. Completeness – refers to the comprehensiveness of the coverage as well as of all items being asked
for

The first activity in statistics is to measure or count. Measurement/counting theory is concerned with the
connection between data and reality. A set of data is a representation (i.e., a model) of the reality based on
numerical and measurable scales.

Scales of Measurement

Measurements are classified, according to the precision of the measurement procedure, as
nominal, ordinal, interval, or ratio. In the nominal scale, a name, label, or category is assigned to classify each
element observed with respect to the property of interest. Gender is one variable measured through the nominal
scale: male and female are two categories that do not follow an order or rank. Other examples are civil
status (single, married, widowed, separated) and quality of products (defective or good).

In an ordinal or ranking scale, the elements or categories are arranged in some meaningful kind of order or rank,
which corresponds to their relative position or “size”. Examples are preference and quality. Preference may have
categories such as most preferred, next preferred, and least preferred. These categories follow an order or rank,
with least preferred being the lowest and most preferred being the highest. Another variable measured through an
ordinal scale is quality (poor, fair, good, very good, and outstanding). Other examples are birth order (eldest, …,
youngest) and size (large, medium, small).

Note that the difference between ranks is not meaningful.

In an interval scale, the elements can be differentiated and ordered, and the arithmetic difference between
elements is meaningful. This scale of measurement is more informative than either the nominal or ordinal scale,
since the fact that the distance between elements can be determined implies that there is a fixed unit of
measurement and a zero point (origin), even though the latter is arbitrary. Examples are temperature, I.Q.,
grade (in numerical form), time, blood pressure, and calendar dates.


The highest level of measurement is the ratio scale. Here, there’s not only an order property, a unit of
measurement and a meaningful difference between elements, but there’s also a fixed origin (which is zero) as
opposed to an arbitrary origin. Examples are height, weight, length, salary, number of bacteria, tensile strength.

Population and Sample

Population or study population is the totality of all objects, individuals, or entities whose unique properties or
characteristics are the subject of a research or statistical inquiry. A study population can be finite or infinite.

A population is said to be finite if it is possible to count its individual members. Sometimes it is not possible to
count the units or members in a population. Such populations are described as infinite.

The students of a school, a set of books, a group of patients, the employees of an organization, a herd of cattle,
and a set of bags of cement are examples of finite populations. Infinite populations include tourists (registered
and unregistered) in a certain location, rats in an open area, stones in a riverbank, turtles in a pond, and
micro-organisms of the same species inhabiting a given area.

The whole population usually cannot be studied because of time and budget constraints. This suggests
considering a small portion of the population in the investigation. A sample is a representative part of a
population. A characteristic of a population that is the subject of a statistical inquiry or research is called
a parameter. On the other hand, a statistic is a characteristic of a sample. A statistic is used to estimate, describe,
or represent a parameter.

Sampling is the process of selecting units, like people, organizations, or objects from a population of interest in
order to study and fairly generalize the results back to the population from which the sample was taken. Sampling
is the process of getting information from only part of a larger group.

Sample Size Determination


The number of respondents or subjects to form a sample is termed as the sample size. Cochran (1977) presented
a set of formulas that can be used to determine the sample size.

In estimating a population mean, the following formulas can be used.

1) For a finite and known population size, N:

$$n \ge \frac{(Z_{\alpha/2})^2 s^2 N}{(Z_{\alpha/2})^2 s^2 + N e^2}$$

where n is the sample size

$Z_{\alpha/2}$ is the two-tailed z-score corresponding to the level of significance, $\alpha$

s is the known standard deviation

e is the margin of error

2) For an infinite or unknown population size, N:

$$n \ge \frac{(Z_{\alpha/2})^2 s^2}{e^2}$$

In estimating a population proportion, the following formulas can be used.

3) For a finite and known population size, N:

$$n \ge \frac{(Z_{\alpha/2})^2 p q N}{(Z_{\alpha/2})^2 p q + (N - 1) e^2}$$

where p is the past estimate of the population proportion

q = 1 – p

4) For an infinite or unknown population size, N:

$$n \ge \frac{(Z_{\alpha/2})^2 p q}{e^2}$$

Notes:

i. The level of significance, 𝛼, can take any of the standard values namely, 0.01, 0.05, and 0.10. Theoretically, the
level of significance is the probability of the type I error in hypothesis testing.

ii. The following table presents the values of $Z_{\alpha/2}$ corresponding to the standard values of $\alpha$:

$\alpha$     $Z_{\alpha/2}$
0.01         2.575
0.05         1.96
0.10         1.645

iii. The standard deviation, s, can be estimated from a pilot data set or the value can be adopted from a previous
study that considered the same or similar population.

iv. In the same manner as s, p can be the past estimate of the population proportion or can be computed from a
pilot data set.
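
The four formulas above can be sketched in Python as follows. This is a minimal illustration rather than part of the cited reference: the function names `n_for_mean` and `n_for_proportion` and the inputs in the sample calls (s = 15, e = 2, p = 0.5, e = 0.05, N = 1,000 and N = 10,000) are hypothetical, chosen only for demonstration. The result is rounded up because n must be at least the computed value.

```python
from math import ceil

# Z-scores from the table in note ii, keyed by the level of significance alpha
Z_TABLE = {0.01: 2.575, 0.05: 1.96, 0.10: 1.645}


def n_for_mean(s, e, alpha=0.05, N=None):
    """Sample size for estimating a population mean (formulas 1 and 2)."""
    z2s2 = (Z_TABLE[alpha] * s) ** 2
    if N is None:   # infinite or unknown population size
        n = z2s2 / e ** 2
    else:           # finite and known population size
        n = z2s2 * N / (z2s2 + N * e ** 2)
    return ceil(n)  # round up so that n is at least the computed bound


def n_for_proportion(p, e, alpha=0.05, N=None):
    """Sample size for estimating a population proportion (formulas 3 and 4)."""
    z2pq = Z_TABLE[alpha] ** 2 * p * (1 - p)
    if N is None:   # infinite or unknown population size
        n = z2pq / e ** 2
    else:           # finite and known population size
        n = z2pq * N / (z2pq + (N - 1) * e ** 2)
    return ceil(n)


# Hypothetical inputs, for illustration only
print(n_for_mean(s=15, e=2))                   # unknown population size
print(n_for_mean(s=15, e=2, N=1000))           # known population size
print(n_for_proportion(p=0.5, e=0.05))         # unknown population size
print(n_for_proportion(p=0.5, e=0.05, N=10000))
```

For the same inputs, the finite-population formulas give a smaller required sample than the infinite-population ones.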

Yamane’s Formula (Simplified Formula for Proportions)

If the behavior of the population is not certain or the researcher is not familiar with the population’s
behavior, Yaro Yamen’s formula (1980) or Taro Yamane’s formula (1967) may be used. The formula is:

$$n = \frac{N}{1 + N e^2}$$

where n is the sample size

N is the population size

e is the level of precision.

Example. From a population of 10,000 individuals of a certain town, what sample size is needed in order to get an
accurate result for a certain study using a margin of error of a) 1%; b) 2.5%; c) 5%?
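
One way to work this example is the short Python sketch below. The function name `yamane_sample_size` is a hypothetical helper, exact fractions are used to avoid floating-point rounding, and the result is rounded up to the next whole respondent (a rounding convention assumed here, not stated in the text).

```python
from fractions import Fraction
from math import ceil


def yamane_sample_size(population_size, margin_of_error):
    """Yamane's formula n = N / (1 + N e^2), rounded up to a whole number."""
    e = Fraction(str(margin_of_error))  # exact value of e, e.g. 1/40 for 0.025
    n = Fraction(population_size) / (1 + population_size * e ** 2)
    return ceil(n)


for e in (0.01, 0.025, 0.05):
    print(e, yamane_sample_size(10_000, e))
# a) e = 1%   -> 5000
# b) e = 2.5% -> 1380
# c) e = 5%   -> 385
```

A smaller margin of error requires a much larger sample from the same population.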

SAMPLING TECHNIQUES

Sampling is the process of getting information from only part of a larger group. The two types of sampling are
random sampling and nonrandom sampling. Nonrandom sampling uses some criteria for choosing the sample,
whereas random sampling does not. The four types of random sampling techniques are simple random sampling,
systematic sampling, stratified random sampling, and cluster random sampling.

A. Probability Sampling: Random Sampling Techniques

Simple Random Sampling

Simple random sampling is the most basic and well-known type of random sampling technique. In simple
random sampling, every case in the population being sampled has an equal chance of being chosen. It is an
equal probability sampling method (EPSEM). EPSEMs are important because they produce representative
samples.

Basic Steps:

1. Construct the sampling frame
2. Determine the sample size
3. Employ any of the following selection procedures:
a. Draw lots
b. Lottery
c. Use of a device such as a calculator or computer to generate random numbers
d. Table of random numbers
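
The steps above, using computer-generated random numbers as the selection procedure, can be sketched as follows. The sampling frame of 100 labelled units, the sample size of 10, and the helper name `simple_random_sample` are assumptions made only for illustration.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the frame, each with an equal chance."""
    rng = random.Random(seed)    # a seed makes the draw reproducible
    return rng.sample(frame, n)  # sampling without replacement


# Step 1: construct the sampling frame (hypothetical list of 100 units)
frame = [f"Unit {i}" for i in range(1, 101)]
# Step 2: determine the sample size; Step 3: select using random numbers
sample = simple_random_sample(frame, 10, seed=1)
print(sample)
```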

Systematic Sampling

This method consists of randomly selecting one unit and choosing additional elements at equal intervals until the
desired sample size is achieved.

Basic Steps:

1. Construct the sampling frame
2. Determine the sample size
3. Determine the sampling interval, k:

$$k = \frac{N}{n}$$

4. Identify the random start, r, using any of the selection procedures under simple random sampling (SRS):

$$1 \le r \le k$$

The random start identifies the first sampling unit.

5. Commencing with the random start, select every kth item until the desired sample size is reached.
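
A minimal sketch of these steps is shown below. The frame of 100 numbered units and the helper name `systematic_sample` are hypothetical, and the sketch assumes N is at least n and drops any fractional part of N/n when computing k.

```python
import random


def systematic_sample(frame, n, seed=None):
    """Select every k-th unit of the frame after a random start r."""
    N = len(frame)
    k = N // n              # sampling interval (fractional part dropped)
    rng = random.Random(seed)
    r = rng.randint(1, k)   # random start with 1 <= r <= k
    # Units r, r + k, r + 2k, ... in 1-based positions
    return [frame[r - 1 + j * k] for j in range(n)]


frame = list(range(1, 101))  # hypothetical frame: units numbered 1 to 100
sample = systematic_sample(frame, 10, seed=7)
print(sample)
```

With N = 100 and n = 10 the interval is k = 10, so the selected unit numbers differ by exactly 10.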

Stratified Random Sampling

Stratified random sampling involves dividing the potential samples into two or more mutually exclusive
groups based on categories of interest in the research. The purpose is to organize the potential samples into
homogeneous subsets before sampling. For example, you could divide the potential samples based on gender,
race or occupation. You then draw a random sample from each subset. Stratified random sampling is
common because it ensures that each subgroup of the larger group is adequately represented in the sample.

Proportional allocation:

$$n_h = \frac{N_h}{N}\,(n)$$

where $n_h$ = sample size for each stratum

$N_h$ = stratum size

$N$ = population size

$n$ = sample size


Example: Suppose a school has five departments composed of the following number of students.
Determine the number of students from each department to be part of the sample when the researcher
needs 363 respondents.

Department                     Nh       nh
a. Business Administration     1,500
b. Management                  1,200
c. Finance                     850
d. Entrepreneurship            200
e. Culinary Arts               150
Total                          3,900
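
The nh column can be filled in with the proportional allocation formula, as in this sketch. The helper name `proportional_allocation` is hypothetical, and rounding each stratum to the nearest whole number is an assumed convention.

```python
def proportional_allocation(strata_sizes, n):
    """Allocate a sample of size n to strata in proportion to their sizes."""
    N = sum(strata_sizes.values())
    # n_h = (N_h / N) * n, rounded to the nearest whole respondent
    return {name: round(N_h * n / N) for name, N_h in strata_sizes.items()}


departments = {
    "Business Administration": 1500,
    "Management": 1200,
    "Finance": 850,
    "Entrepreneurship": 200,
    "Culinary Arts": 150,
}
allocation = proportional_allocation(departments, 363)
for name, n_h in allocation.items():
    print(f"{name}: {n_h}")
print("Total:", sum(allocation.values()))
```

With these figures the rounded allocations are 140, 112, 79, 19, and 14, which add up to 364, one more than the 363 requested, because four of the five values round up. How to resolve such a difference is left to the researcher.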

Cluster Random Sampling

In cluster random sampling, you randomly select clusters instead of individual samples in the first stage of
sampling. For example, a cluster might be a school, a team or a village. This technique is used when no list of
individual samples is available. Usually, the way this type of sampling is done is by starting at the higher level
clusters and then sampling at subsequent levels until individual samples are reached.

Multi-stage sampling

This method uses several stages or phases in getting random samples from the general population.

B. Non-Probability Sampling

This is a sampling method that does not involve random selection of samples. With non-probability samples,
the population may or may not be represented well, and it will often be difficult to know how well the population
has been represented. Some forms of non-probability sampling are:

1. Accidental, Haphazard, or Convenience sampling
- one of the most common methods of sampling; the resulting samples are normally biased since
the researcher considers his/her own convenience in collecting the data.

2. Purposive sampling
- sampling is based on certain criteria laid down by the researcher. People who satisfy the criteria
are interviewed.

Subcategories of Purposive sampling:

a. Modal instance sampling
- When we do modal instance sampling, we are sampling the most frequent case. The problem with
modal instance sampling is identifying the “modal” case. Modal instance sampling is only sensible
for informal sampling contexts.
b. Expert sampling
- Involves the assembling of a sample of persons with known or demonstrable experience and
expertise in some area.
Two reasons we might do expert sampling:
1. It would be the best way to elicit the views of persons who have specific expertise.
2. To provide evidence for the validity of another sampling approach you’ve chosen.


c. Quota sampling
- Select items nonrandomly according to some fixed quota.
d. Snowball sampling
- Begin by identifying someone who meets the criteria for inclusion in your study. You then ask
them to recommend others who they may know who also meet the criteria.

Advantages of Sampling

1. Faster – a smaller group under study requires less time for data collection and processing

2. Cheaper – the cost of studying only a part of the population is much lower than that of investigations involving
the whole population

3. Better quality of information may be collected – a smaller study group allows a more accurate execution of
technical procedures

4. More comprehensive data may be gathered.

Good Sampling Design

1. Representative – samples to be collected should reflect the characteristics as well as the variability of the
population
2. Feasible – sampling procedure should be simple enough to be implemented and can be carried out and
sustained according to plan
3. Adequate – the sample size should be sufficiently large to provide reliable generalization
4. Economic – sampling design should be efficient enough to produce the most information at the least cost

Methods of Data Collection


A. Interview (Direct) Method – a method of person-to-person exchange between the interviewer and the
interviewee.

Positive:

1) It provides consistent and more precise information since clarification may be given by the interviewee.
2) Questions may be repeated or modified to suit the interviewee’s level of understanding.

Negative:

1) Time-consuming
2) Expensive
3) Limited field coverage

B. Questionnaire (Indirect) Method – in this method, written responses are given to prepared questions. A
questionnaire is used to elicit answers to the problems of the study. Questionnaires may be mailed or
hand-carried.
Positive:

1) Inexpensive
2) Can cover a wide area in a shorter span of time.
3) Respondents may feel a greater sense of freedom to express views and opinions because their anonymity
is maintained.
Negative:

1) There’s a strong possibility of non-response, especially when questionnaires are mailed.
2) Questions not easily understood may not be answered.

C. Registration Method – this method of gathering information is enforced by law.

e.g. registration of births, deaths, vehicles, licenses

Positive:

1) Information is kept systematized.
2) Information is always made available to the public.

D. Observation Method – the investigator observes the behavior of the subject/respondent. It is used when the
subjects cannot talk or write.

Positive: The recording of behavior at the appropriate time and situation is made possible.

E. Experiment Method – this method is used when the objective is to determine the cause-and-effect relationship
of certain phenomena under controlled conditions. It is usually used by scientific researchers.

Practical Guidelines in designing questions for questionnaire:

1. State the questions clearly
2. Avoid asking leading questions
3. Construct questions that will give objective rather than subjective replies
4. Follow a logical flow of questioning
5. Ask for essential information only
6. Be sure that questions are well-understood
7. Avoid incorporating too many ideas in a question
8. Construct questions which the respondents can answer confidently
9. If negative items cannot be avoided, be sure to mark (capitalize, underscore) the negation so as to guide
the respondents.

Methods of Data Presentation


a. Textual Presentation – This type of presentation incorporates data in a set of narrative sentences or paragraphs. It
emphasizes and compares important figures. However, it can be tedious to read, especially if it consists of lengthy
paragraphs and some figures or words are repeated many times.

2000 Census of Population

The population of the Philippines as of May 1, 2000 is 75.33 million. This figure is
higher by 6.71 million from the 1995 population.

The annual growth rate from 1995 to 2000 is 2.02 percent, which is lower by 0.30
percentage point from the 1995 figure of 2.32 percent and by 0.33 percentage point
from the 1990 figure of 2.35 percent.

Source: NSO Monthly Bulletin of Statistics, August 2000

b. Tabular Presentation – This is a systematic way of categorizing related data in rows and columns. This
methodical arrangement, called a statistical table, presents data more concisely and in greater detail than
textual or graphical form.

[Figure: layout of a statistical table, showing the table number, title (table heading), stub head, column captions, row captions, and body]

c. Graphical Method – This method presents quantitative data in pictorial form, producing a device which is
often referred to as a graph or chart. Graphs have visual appeal that can better attract and hold the reader’s
interest.

Qualities of a Good Graph

1. Accurate – It must be accurately constructed using correct and reliable data in order to produce correct
interpretation. It should not be deceiving, imprecise or confusing so as not to create illusory vision.
2. Clear – An effective chart is easy to read and understand. It should emphasize the information it wants
to present supported with definite details. It should be useful in interpretation of facts.
3. Simple – Its design should be uncomplicated and straightforward. It should contain only necessary and
relevant data or symbols to gain efficient visual communication.
4. Attractive – Its appearance should be neat and with a scholarly or professional look. The overall design
elements should be harmonious, consistent in style and balanced.


Types of Graphs

1. Dotplot
- A graphic display that is used to compare frequency counts within a small number of categories or
groups, usually with small sets of data.
- The pattern of data in a dotplot can be described in terms of symmetry and skewness, only if the
categories are quantitative. If the categories are qualitative, the dotplot cannot be described in those
terms.

Example:

Bar charts and Histograms

- These are graphs or charts that are used to compare the sizes of different groups.

Bar Chart. It represents the frequency or magnitudes of quantities of each of the categories as a bar rising vertically
from the horizontal axis with the height of each bar proportional to the frequency or magnitude of the corresponding
category.

It may be simple or compound, and it can be arranged vertically or horizontally. It is used for both qualitative and
quantitative data.

Example:

Figure 1. Monthly mean particulate matter (PM10) level in Baguio City for 2010.

Histogram. It is made up of columns plotted on a graph where there is no space between adjacent columns. The
columns are positioned over a label that represents a continuous, quantitative variable.

A histogram is distinct from a bar chart based on the type of variable that is being presented. With this
distinction, it can be appropriate to talk about skewness of a histogram.

Example:


2. Stem and leaf plot
- A chart that shows how individual values are distributed within a set of data. It is used to display
quantitative data from small data sets.

Example: Stem and leaf plot of IQ scores of 30 individuals.

3. Boxplot (box and whisker plot)
- A type of graph which is used to display patterns of quantitative data.
- It splits the data set into quartiles. The body of the boxplot consists of a “box” which goes from the
first quartile (Q1) to the third quartile (Q3).
- Within the box, a vertical line is drawn at the median (Q2) of the data set.
- The whiskers are two horizontal lines from the front and the back of the box. The front whisker
goes from Q1 to the smallest non-outlier in the data set while the back whisker goes from Q3 to the
largest non-outlier.
- Outliers in the data set are plotted separately as points on the chart. A data point is an outlier if it
lies more than 1.5 times the interquartile range below Q1 or above Q3.
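
The outlier rule can be sketched as a pair of "fences" placed 1.5 times the interquartile range beyond the quartiles. The helper names and the numbers in the example (Q1 = 10, Q3 = 20, and the five data values) are hypothetical and serve only to illustrate the rule.

```python
def outlier_fences(q1, q3):
    """Return the lower and upper limits beyond which a value is an outlier."""
    iqr = q3 - q1  # interquartile range, the width of the box
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(data, q1, q3):
    """List the values that fall outside the fences."""
    lower, upper = outlier_fences(q1, q3)
    return [x for x in data if x < lower or x > upper]


# Hypothetical quartiles and data, for illustration only
print(outlier_fences(10, 20))                        # (-5.0, 35.0)
print(find_outliers([2, 12, 15, 18, 40], 10, 20))    # [40]
```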

Example:

Interpreting a boxplot or box and whisker plot:

- Range. It is the horizontal distance between the smallest value and the largest value, which
includes any outliers.

- Interquartile Range (IQR). The interquartile range is represented by the width of the box.

- Shape of the data set.

4. Scatterplot
- A graphic tool used to display the relationship between two quantitative variables.


- Each dot on the scatterplot represents an ordered pair of observations from a data set.
- Used to analyze patterns in bivariate data. The patterns are described in terms of linearity, slope,
and strength.

Examples:

5. Line chart.
- Graphical presentation of data especially useful for showing trends over a period of time.

Example:

Figure 2. Age at First Marriage in the United States

Following a sharp decline during and after World War II, the age
at which men and women in the United States first marry has
steadily increased. In the mid-1990s, the age of first marriage for
women was higher and closer to the age at which men first
marry than at any time in the previous 100 years.

Four ways to describe Data Sets:

When comparing two or more data sets, we focus on four features:

- Center
- Spread
- Shape
- Unusual features

Measures of Central Tendency are numerical values that tend to locate, in some sense, the middle of a set of data.
The term average is often associated with these measures. The most important measures of central tendency are (1)
the mean, (2) the median, and (3) the mode.


A. MEAN, 𝜇 or 𝑥̅

1. Arithmetic Mean – it is obtained by adding all the observations and dividing the sum by the number of
observations, thus it is called a computational average.
Population mean: If a set of data $x_1, x_2, \ldots, x_N$ represents a finite population of size $N$, then the population
mean $\mu$ is

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$

Sample mean: If a set of data $x_1, x_2, \ldots, x_n$ represents a finite sample of size $n$, then the sample mean $\bar{x}$ is

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Example 1

Suppose you are to choose ten people who enter the campus and whose ages are as follows:

15 25 18 20 25 18 18 20 25 15

What is the mean age of this sample?
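
A short Python check of this example, using the sample mean formula directly and the standard library's `statistics.mean` for comparison:

```python
from statistics import mean

ages = [15, 25, 18, 20, 25, 18, 18, 20, 25, 15]

# Sample mean: sum of the observations divided by the number of observations
sample_mean = sum(ages) / len(ages)
print(sample_mean)   # 19.9
print(mean(ages))    # same value from the standard library
```

The ages add up to 199, so the mean age of the ten people is 19.9 years.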

2. Weighted Mean – if the data values $x_1, x_2, \ldots, x_k$ have assigned weights $w_1, w_2, \ldots, w_k$, respectively, then the
weighted mean is computed as follows:

$$\bar{x} = \frac{\sum_{i=1}^{k} w_i x_i}{\sum_{i=1}^{k} w_i}$$

Example 2

The table provides the grades obtained by a student in the different criteria for grading and the
corresponding weight for each criterion. Find his weighted average.

Criteria                  Grade    Weight
Long Tests                80       0.30
Quizzes                   85       0.20
Departmental Exam         82       0.25
Class Participation       88       0.10
Homework and Projects     85       0.15
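
The weighted average in this example can be computed with a few lines of Python. The helper name `weighted_mean` is hypothetical; the grades and weights are those in the table.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


grades = [80, 85, 82, 88, 85]
weights = [0.30, 0.20, 0.25, 0.10, 0.15]

print(round(weighted_mean(grades, weights), 2))   # 83.05
```

Because the weights add up to 1, the weighted average is simply 24 + 17 + 20.5 + 8.8 + 12.75 = 83.05.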

Example 3

Mall goers were asked to rate the level of effectiveness of the inspection being done by security forces in
preventing crimes in malls.

Level of Effectiveness    Very Effective (4)    Moderately Effective (3)    Least Effective (2)    Not Effective (1)
Number of Mall goers      97                    132                         176                    170

*Likert Scale: Interval width = (highest rate – lowest rate) / no. of ratings = (4 – 1) / 4 = 0.75

Rating    Range of Values    Qualitative Description
4         3.25 – 4.00        Very Effective
3         2.50 – 3.24        Moderately Effective
2         1.75 – 2.49        Least Effective
1         1.00 – 1.74        Not Effective
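
A sketch of how this example might be worked: the number of mall goers at each rating acts as the weight, and the resulting weighted mean is matched against the ranges in the table. The helper name `describe` is hypothetical.

```python
# rating -> number of mall goers who gave that rating
ratings = {4: 97, 3: 132, 2: 176, 1: 170}

total = sum(ratings.values())
weighted_mean = sum(rating * count for rating, count in ratings.items()) / total


def describe(mean_rating):
    """Map a mean rating to its qualitative description (interval width 0.75)."""
    if mean_rating >= 3.25:
        return "Very Effective"
    if mean_rating >= 2.50:
        return "Moderately Effective"
    if mean_rating >= 1.75:
        return "Least Effective"
    return "Not Effective"


print(total)                    # 575
print(round(weighted_mean, 2))  # 2.27
print(describe(weighted_mean))  # Least Effective
```

The weighted mean is 1306 / 575, or about 2.27, which falls in the 1.75 – 2.49 range.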

Geometric mean: The geometric mean of n positive numbers is the nth root of their product. That is,

$$GM = \sqrt[n]{x_1 x_2 \cdots x_n}$$

Example: Find the geometric mean of

1. 3 and 12
2. 2, 3, and 30
3. 1, 4, 6, 11, and 2
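
These three exercises can be checked with a small function that follows the definition literally. The helper name `geometric_mean` is hypothetical (the standard library also offers `statistics.geometric_mean`).

```python
from math import prod


def geometric_mean(values):
    """n-th root of the product of n positive numbers."""
    return prod(values) ** (1 / len(values))


print(geometric_mean([3, 12]))                     # square root of 36
print(round(geometric_mean([2, 3, 30]), 3))        # cube root of 180
print(round(geometric_mean([1, 4, 6, 11, 2]), 3))  # fifth root of 528
```

The first answer is exactly 6; the other two are irrational and are shown rounded to three decimal places.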

Harmonic mean: The harmonic mean of n numbers is defined as n divided by the sum of the reciprocals of the n
numbers. That is,

$$HM = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \cdots + \frac{1}{x_n}}$$

Example:

If on a three-day vacation trip to Canada, a family traveled 80 kph the first day, 93 kph the second day,
and 87 kph the third day, find the average speed for the entire trip.
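
A sketch of this example in Python, assuming the family covered the same distance on each of the three days (the condition under which the harmonic mean gives the average speed). The formula is applied directly and compared with the standard library's `statistics.harmonic_mean`.

```python
from statistics import harmonic_mean

speeds = [80, 93, 87]  # kph on days 1, 2, and 3

# HM = n divided by the sum of the reciprocals
hm = len(speeds) / sum(1 / v for v in speeds)

print(round(hm, 2))                     # about 86.34 kph
print(round(harmonic_mean(speeds), 2))  # same value from the standard library
```

The result is slightly lower than the arithmetic mean of the three speeds (about 86.67 kph), since more time is spent travelling at the slower speeds.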

Note:

GM – used with such data as rates of change, ratios, economic index numbers, population sizes over
consecutive time periods

HM – used in averaging speeds for various distances covered where the distances remain constant, costs of
some commodity, and mutual funds.

B. MEDIAN, 𝜇̃ or 𝑥̃

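- a value that divides the distribution into two equal parts (after arranging the values/scores in ascending or
descending order). As such, it is a positional average. The median is defined by

$$\tilde{\mu} \text{ or } \tilde{x} = \begin{cases} x_{\frac{n+1}{2}} & \text{if } n \text{ is odd} \\[4pt] \dfrac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2} & \text{if } n \text{ is even} \end{cases}$$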

Example 4

Find the median: (a) 12, 15, 18, 8, 9,10, 6; (b) 23, 18, 15, 12, 10, 9, 8, 6
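
The definition can be applied to both data sets with the sketch below. The helper name `median_by_position` is hypothetical; it sorts the values first and then uses the odd or even case of the formula.

```python
def median_by_position(values):
    """Median from the positional formula, after sorting in ascending order."""
    x = sorted(values)
    n = len(x)
    if n % 2 == 1:
        return x[(n + 1) // 2 - 1]             # the ((n + 1) / 2)-th value
    return (x[n // 2 - 1] + x[n // 2]) / 2     # mean of the two middle values


print(median_by_position([12, 15, 18, 8, 9, 10, 6]))      # (a) 10
print(median_by_position([23, 18, 15, 12, 10, 9, 8, 6]))  # (b) 11.0
```

In (a) there are seven values, so the median is the 4th value of the sorted list; in (b) there are eight, so it is the average of the 4th and 5th values, 10 and 12.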

C. MODE, 𝜇̂ or 𝑥̂

- the value in the distribution with the highest frequency. It locates the point where the observation values occur
with the greatest density. It can be used for quantitative as well as qualitative data.


Example 5

Find the mode of the following data: 15 12 4 9 6 10 5 15

12 4 12 6 12 5 15 12 4 15 4 6 5
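
A possible way to check this example is to count how often each value occurs. The sketch treats the two rows above as a single data set of 21 values, which is an assumption about how the example is meant to be read.

```python
from collections import Counter
from statistics import multimode

data = [15, 12, 4, 9, 6, 10, 5, 15,
        12, 4, 12, 6, 12, 5, 15, 12, 4, 15, 4, 6, 5]

counts = Counter(data)
print(counts.most_common())  # each value with its frequency, highest first
print(multimode(data))       # [12], the value(s) with the highest frequency
```

The value 12 occurs five times, more than any other value, so it is the mode. `multimode` returns a list because a data set may have more than one mode.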

Evidently, a distribution can have no mode, one mode, or more than one mode. Thus, the mode is not a very reliable
measure of central tendency. However, there are instances when no other measure can be used except the mode.
In determining the prevalent gender, civil status, or highest educational attainment, only the mode can be used
because no numerical values can be assigned to these variables.

D. MIDRANGE

- the mean of the largest and smallest values in the data set.

Remarks

Mean:

1. All the scores or measurements are considered in the computation of the mean.
2. Very high or very low scores or measurements affect the mean.

Median:

1. Only the middle scores or measurements are considered in the computation of the median.
2. Very high or very low scores do not affect the median.
Mode:

1. It is very easy to compute but is seldom used because it is very unstable.
2. It is most appropriate for nominal scale as a measure of popularity.

MEASURES OF LOCATION

There are several other measures of location that describe or locate the position of certain non-central pieces of
data relative to the entire set of data. These measures, often referred to as quantiles or fractiles are values below
which a specific fraction or percentage of the observations in a given set must fall.

PERCENTILES

Percentiles are values that divide a set of observations into 100 equal parts. These values, denoted by 𝑃1 , 𝑃2 , … , 𝑃99 ,
are such that 1% of the data falls below 𝑃1 , 2% falls below 𝑃2 , …, and 99% falls below 𝑃99 .

The $k$th percentile, $P_k$ ($k = 1, 2, 3, \ldots, 99$), can be determined using the following procedure:

1. Arrange the data in increasing order and compute the value of the index $i = \left(\frac{k}{100}\right) n$, where $n$ is the number
of observations.
2. If $i$ is an integer, $P_k = \frac{x_i + x_{i+1}}{2}$. If $i$ is not an integer, round $i$ up to the next integer and take $P_k = x_i$.

DECILES

Deciles are values that divide a set of observations into 10 equal parts. These values, denoted by 𝐷1 , 𝐷2 , … , 𝐷9 , are
such that 10% of the data falls below 𝐷1 , 20% falls below 𝐷2 , …, and 90% falls below 𝐷9 .

The $k$th decile, $D_k$ ($k = 1, 2, \ldots, 9$), can be determined using the following procedure:

1. Arrange the data in increasing order and compute the value of the index $i = \left(\frac{k}{10}\right) n$, where $n$ is the number
of observations.
2. If $i$ is an integer, $D_k = \frac{x_i + x_{i+1}}{2}$. If $i$ is not an integer, round $i$ up to the next integer and take $D_k = x_i$.

QUARTILES
Quartiles are values that divide a set of observations into 4 equal parts. These values, denoted by 𝑄1 , 𝑄2 , and 𝑄3 ,
are such that 25% of the data falls below 𝑄1 , 50% falls below 𝑄2 and 75% falls below 𝑄3 .


The $k$th quartile, $Q_k$ ($k = 1, 2, 3$), can be determined using the following procedure:

1. Arrange the data in increasing order and compute the value of the index $i = \left(\frac{k}{4}\right) n$, where $n$ is the number
of observations.
2. If $i$ is an integer, $Q_k = \frac{x_i + x_{i+1}}{2}$. If $i$ is not an integer, round $i$ up to the next integer and take $Q_k = x_i$.

Examples

1. Find the quartiles, interquartile range, 3rd and 7th deciles, and 12th, 37th, 95th percentiles for the
following examination scores given in the stem-and-leaf plot.
Exam Scores

4 |568

5 |34569

6 |2356699

7 |01133455578

8 |122369

2. As part of a quality-control study aimed at improving a production line, the weights (in ounces) of 50 bars
of soap are measured. The results are as follows, sorted from smallest to largest. Find the interquartile
range, the 3rd and 9th deciles, and the 12th, 43rd, and 61st percentiles.

11.6 12.6 12.7 12.8 13.1 13.3 13.6 13.7 13.8 14.1

14.3 14.3 14.6 14.8 15.1 15.2 15.6 15.6 15.7 15.8

15.8 15.9 15.9 16.1 16.2 16.2 16.3 16.4 16.5 16.5

16.5 16.6 17.0 17.1 17.3 17.3 17.4 17.4 17.4 17.6

17.7 18.1 18.3 18.3 18.3 18.5 18.5 18.8 19.2 20.3
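
Both examples can be worked with one function, since percentiles, deciles, and quartiles share the same index rule with 100, 10, or 4 parts. The sketch below is one possible implementation; the name `quantile` and its `parts` argument are hypothetical. Integer arithmetic is used to test whether the index i is a whole number, which avoids floating-point surprises.

```python
def quantile(data, k, parts):
    """k-th quantile for `parts` equal parts, using i = (k / parts) * n."""
    x = sorted(data)
    n = len(x)
    if (k * n) % parts == 0:           # i is an integer
        i = k * n // parts
        return (x[i - 1] + x[i]) / 2   # average of x_i and x_(i+1), 1-based
    i = k * n // parts + 1             # otherwise round i up
    return x[i - 1]


# Example 1: exam scores read from the stem-and-leaf plot (n = 32)
exam = [45, 46, 48, 53, 54, 55, 56, 59, 62, 63, 65, 66, 66, 69, 69,
        70, 71, 71, 73, 73, 74, 75, 75, 75, 77, 78, 81, 82, 82, 83, 86, 89]
q1, q2, q3 = (quantile(exam, k, 4) for k in (1, 2, 3))
print(q1, q2, q3, q3 - q1)                             # 60.5 70.5 76.0 15.5
print(quantile(exam, 3, 10), quantile(exam, 7, 10))    # 63 75
print([quantile(exam, k, 100) for k in (12, 37, 95)])  # [53, 66, 86]

# Example 2: weights of 50 bars of soap, in ounces
soap = [11.6, 12.6, 12.7, 12.8, 13.1, 13.3, 13.6, 13.7, 13.8, 14.1,
        14.3, 14.3, 14.6, 14.8, 15.1, 15.2, 15.6, 15.6, 15.7, 15.8,
        15.8, 15.9, 15.9, 16.1, 16.2, 16.2, 16.3, 16.4, 16.5, 16.5,
        16.5, 16.6, 17.0, 17.1, 17.3, 17.3, 17.4, 17.4, 17.4, 17.6,
        17.7, 18.1, 18.3, 18.3, 18.3, 18.5, 18.5, 18.8, 19.2, 20.3]
iqr = quantile(soap, 3, 4) - quantile(soap, 1, 4)      # 17.4 - 14.6
print(round(iqr, 2))
print(round(quantile(soap, 3, 10), 2), round(quantile(soap, 9, 10), 2))
print([round(quantile(soap, k, 100), 2) for k in (12, 43, 61)])
```

Other textbooks and software packages use slightly different quantile rules, so their answers may differ a little from those produced by the procedure given here.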

MEASURES OF VARIABILITY OR DISPERSION

The measures of central tendency do not by themselves give an adequate description of the data. It is also very
important for us to know how the observations spread out from the average. The measures of variation indicate the
extent to which individual items in a series are scattered about the average. It is used to determine the extent of
the scatter so that steps may be taken to control the existing variation.

Let us consider the following measurements for two samples of data:

Sample A    P24,500    20,700    22,900    26,000    24,100    23,800    22,500
Sample B    P24,900    17,500    21,600    29,700    25,300    23,800    21,700

Both samples have the same mean, but it is quite obvious that the measurements for sample A are more uniform,
that is, the values are closer to each other, than those for sample B.
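
A quick numerical check of this comparison: the sketch below computes the mean of each sample and, as a first rough indication of spread, the distance between the largest and smallest values.

```python
from statistics import mean

sample_a = [24500, 20700, 22900, 26000, 24100, 23800, 22500]
sample_b = [24900, 17500, 21600, 29700, 25300, 23800, 21700]

for name, sample in (("A", sample_a), ("B", sample_b)):
    spread = max(sample) - min(sample)  # largest value minus smallest value
    print(name, mean(sample), spread)
# A 23500 5300
# B 23500 12200
```

Both means are 23,500, but the values of sample B span more than twice the distance covered by sample A.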

General Classifications of Measures of Variation

- Measures of Absolute Dispersion
- Measures of Relative Dispersion

Measures of Absolute Dispersion

The measures of absolute dispersion are expressed in the units of the original observations. They cannot
be used to compare variations of two data sets when the averages of these data sets differ a lot in value or when the
observations differ in units of measurement. The most common statistics for measuring the variability of a set of
data are the range, variance, and the standard deviation.

RANGE

The range measures the distance between the largest and the smallest values and, as such, gives an idea of the
spread of the data set. However, the range does not use the concept of deviation. It is affected by outliers but does
not consider all values in the data set. Thus, it is not a very useful measure of variability.

𝑅𝑎𝑛𝑔𝑒 (𝑅) = 𝑚𝑎𝑥𝑖𝑚𝑢𝑚 𝑣𝑎𝑙𝑢𝑒 – 𝑚𝑖𝑛𝑖𝑚𝑢𝑚 𝑣𝑎𝑙𝑢𝑒

MEAN ABSOLUTE DEVIATION

The mean absolute deviation (MAD) utilizes deviations of the data values from the mean in its computation. The
MAD is the average of the absolute values of the deviations from the mean, computed using the formula

population: $MAD = \frac{\sum |x_i - \mu|}{N}$          sample: $MAD = \frac{\sum |x_i - \bar{x}|}{n}$

If a data set A has a greater MAD than data set B, then it is reasonable to believe that the values in data set A are
more spread out (variable) than the values in set B.
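The comparison described above can be sketched for Samples A and B from the start of this section. The helper name `mean_absolute_deviation` is an assumption made for illustration.

```python
# Samples A and B from the text (values in pesos).
sample_a = [24500, 20700, 22900, 26000, 24100, 23800, 22500]
sample_b = [24900, 17500, 21600, 29700, 25300, 23800, 21700]


def mean_absolute_deviation(data):
    """Average of the absolute deviations from the mean."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)


mad_a = mean_absolute_deviation(sample_a)
mad_b = mean_absolute_deviation(sample_b)
print(f"MAD of Sample A: {mad_a:.2f}")
print(f"MAD of Sample B: {mad_b:.2f}")
```

Sample B has the greater MAD, so its values are more spread out about the common mean than those of Sample A.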

VARIANCE AND STANDARD DEVIATION

The variance and the standard deviation are the most common and useful measures of variability. These two
measures provide information about how the data vary about the mean. The variance σ² or s² is a measure of
variation which considers the position of each observation relative to the mean of the set. It is an approximate
average of the squared deviations from the sample mean. The standard deviation 𝜎 or 𝑠 is the square root of the
variance.

Population Variance: Given the finite population $x_1, x_2, \ldots, x_N$, the population variance, which is exact, is

$$ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \quad \text{or} \quad \sigma^2 = \frac{N \sum x_i^2 - (\sum x_i)^2}{N^2} $$

Sample Variance: Given a random sample $x_1, x_2, \ldots, x_n$, the sample variance is

$$ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \quad \text{or} \quad s^2 = \frac{n \sum x_i^2 - (\sum x_i)^2}{n(n - 1)} $$

where:
    $\sigma$ = population standard deviation
    $s$ = sample standard deviation
    $x_i$ = $i$th observation
    $\mu$ = population mean
    $\bar{x}$ = sample mean
    $N$ = population size
    $n$ = sample size

If the data are clustered around the mean, then the variance and the standard deviation will be relatively small.
If, however, the data are widely scattered about the mean, the variance and the standard deviation will be relatively
large.
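The sample variance formulas can be checked on Samples A and B from the start of this section. The sketch below uses Python's standard `statistics` module for the definitional form and a hand-written helper (`shortcut_sample_variance`, a name chosen here for illustration) for the computational form.

```python
import math
import statistics

# Samples A and B from the text (values in pesos).
sample_a = [24500, 20700, 22900, 26000, 24100, 23800, 22500]
sample_b = [24900, 17500, 21600, 29700, 25300, 23800, 21700]


def shortcut_sample_variance(data):
    """Computational form: [n*sum(x^2) - (sum x)^2] / [n(n - 1)]."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))


for name, data in (("A", sample_a), ("B", sample_b)):
    var_def = statistics.variance(data)        # divides by n - 1
    var_short = shortcut_sample_variance(data)
    std_dev = math.sqrt(var_def)
    print(f"Sample {name}: s^2 = {var_def:.2f}, "
          f"shortcut = {var_short:.2f}, s = {std_dev:.2f}")

var_a = statistics.variance(sample_a)
var_b = statistics.variance(sample_b)
```

Both forms give the same variance for each sample, and Sample B, whose values are more widely scattered about the mean, has the larger variance and standard deviation.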

Notes:

1. We divide by the quantity 𝑛 − 1 in order to make the sample variance an unbiased estimator of the
population variance. (An estimator is unbiased if its average value is equal to the parameter it is estimating.)

2. The unit of the standard deviation is the same as that of the raw data, so it is preferable to use the standard
deviation as a measure of variability instead of the variance.
3. The range is a quick but rough measure of variation since it considers only the highest value and the lowest
value of the observations.

In attempting to develop a sense for the standard deviation, we consider the following results from Chebyshev’s
theorem:

At least 75% of all scores will fall within two standard deviations of the mean.

At least 89% of all scores will fall within three standard deviations of the mean.
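The two percentages above are instances of the general Chebyshev bound: for any k > 1, at least 1 − 1/k² of the values lie within k standard deviations of the mean. The sketch below evaluates the bound and checks it against Sample B from the start of this section; the function names are illustrative assumptions.

```python
import statistics

# Sample B from the text (values in pesos).
sample_b = [24900, 17500, 21600, 29700, 25300, 23800, 21700]


def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's bound is informative only for k > 1")
    return 1 - 1 / k ** 2


def fraction_within(data, k):
    """Observed fraction of values within k sample standard deviations."""
    mean = statistics.mean(data)
    std_dev = statistics.stdev(data)
    inside = [x for x in data if abs(x - mean) <= k * std_dev]
    return len(inside) / len(data)


for k in (2, 3):
    print(f"k = {k}: bound = {chebyshev_lower_bound(k):.3f}, "
          f"observed in Sample B = {fraction_within(sample_b, k):.3f}")
```

The observed fractions can be much higher than the bound, since the theorem gives only a guaranteed minimum that holds for any distribution.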

Let us also consider the empirical rule, which applies to data that is approximately bell shaped. For these bell-
shaped distributions, the empirical rule states that:

About 68% of all scores fall within one standard deviation of the mean.

About 95% of all scores fall within two standard deviations of the mean.

About 99.7% of all scores fall within three standard deviations of the mean.

Measures of Relative Dispersion

The measures of relative dispersion are unitless and are used when one wishes to compare the dispersion
of one distribution with that of another distribution.

COEFFICIENT OF VARIATION (CV)

The coefficient of variation standardizes the variation by dividing the standard deviation by the mean. Because of
this property, it can be used to compare variations for different variables with different units.

population: $$ CV = \left( \frac{\sigma}{\mu} \right) 100\% $$

sample: $$ CV = \left( \frac{s}{\bar{x}} \right) 100\% $$

A larger coefficient of variation implies a data set that is more spread out, or more dispersed, relative to its mean.

The CV is defined only for a non-zero mean and is most useful for variables that are always positive. It is also known
as unitized risk or the variation coefficient. Because it is unitless, it can be used to compare the dispersion of two
or more data sets with the same or different units.

Example:

Several measurements of the diameter of a spherical instrument bearing made with one micrometer had a
mean of 2.49 mm and a standard deviation of 0.12 mm, and several measurements of the unstretched
length of a spring made with another micrometer had a mean of 0.75 in. with a standard deviation of 0.02
in. Which of the two micrometers is relatively more precise?
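One way to approach this example is to compute the sample CV for each set of measurements, as sketched below. The variable and function names are illustrative assumptions.

```python
def coefficient_of_variation(std_dev, mean):
    """Sample CV in percent: (s / x-bar) * 100."""
    if mean == 0:
        raise ValueError("CV is undefined for a zero mean")
    return std_dev / mean * 100


# Diameter of the bearing, measured in millimeters.
cv_diameter = coefficient_of_variation(std_dev=0.12, mean=2.49)
# Unstretched length of the spring, measured in inches.
cv_length = coefficient_of_variation(std_dev=0.02, mean=0.75)

print(f"CV, first micrometer:  {cv_diameter:.2f}%")
print(f"CV, second micrometer: {cv_length:.2f}%")
```

The CVs are about 4.82% and 2.67%, so the measurements made with the second micrometer (the spring length) are relatively more precise, even though the two sets are recorded in different units.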

Example:

Blood samples from 10 persons were sent to each of two laboratories for cholesterol determination.
Measurements were as follows (Kuzma and Bohnenblust, 2005):

Subject    Lab 1    Lab 2
   1        296      318
   2        268      287
   3        244      260
   4        272      279
   5        240      245
   6        244      249
   7        282      294
   8        254      271
   9        244      262
  10        262      285

Compare the data sets recorded by the two laboratories by considering the following descriptive
measures: mean, median, mode, first quartile, third quartile, min, max, range, standard deviation,
variance, quartile deviation, mean absolute deviation, median deviation, and coefficient of variation.
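A possible starting point for this comparison is sketched below using Python's standard `statistics` module. Several choices here are assumptions rather than the lesson's prescribed method: quartiles use the module's default ("exclusive") convention, so they may differ slightly from hand computations that use another rule; "median deviation" is taken to mean the median of the absolute deviations from the median; and the standard deviation, variance, and CV use the sample (n − 1) formulas.

```python
import statistics as st

lab1 = [296, 268, 244, 272, 240, 244, 282, 254, 244, 262]
lab2 = [318, 287, 260, 279, 245, 249, 294, 271, 262, 285]


def describe(data):
    """Return the descriptive measures listed in the example."""
    mean = st.mean(data)
    median = st.median(data)
    q1, _, q3 = st.quantiles(data, n=4)  # default "exclusive" method
    std_dev = st.stdev(data)
    return {
        "mean": mean,
        "median": median,
        # multimode lists every value when no value repeats
        "mode": st.multimode(data),
        "Q1": q1,
        "Q3": q3,
        "min": min(data),
        "max": max(data),
        "range": max(data) - min(data),
        "standard deviation": std_dev,
        "variance": st.variance(data),
        "quartile deviation": (q3 - q1) / 2,
        "mean absolute deviation":
            sum(abs(x - mean) for x in data) / len(data),
        "median deviation": st.median([abs(x - median) for x in data]),
        "CV (%)": std_dev / mean * 100,
    }


summary1 = describe(lab1)
summary2 = describe(lab2)
for measure in summary1:
    print(f"{measure:>24}: Lab 1 = {summary1[measure]}, "
          f"Lab 2 = {summary2[measure]}")
```

Comparing the two columns of output side by side shows whether one laboratory tends to report higher values and which set of measurements is more variable.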

Correlation Analysis

Correlation analysis is a technique used to describe the relationship or association between variables. If we
want to know the degree of relationship between two variables that are measured on at least an interval scale, the
Pearson Product Moment Correlation Coefficient (r) may be obtained.

Interpreting the Correlation Coefficient:

The value of the correlation coefficient indicates the degree to which the variables are related to each
other. The correlation coefficient is a value between -1 and +1 inclusive. If the value of r is negative, there is
a negative relationship between the variables; if r is positive, the relationship is said to be positive. The value
of r is interpreted as follows:

Correlation Coefficient     Linear Relationship
0                           None
±0.01 to ±0.20              Very Weak
±0.21 to ±0.40              Weak
±0.41 to ±0.60              Moderate
±0.61 to ±0.80              Strong
±0.81 to ±0.99              Very Strong
±1                          Perfect Linear

A. Pearson Product Moment Correlation Coefficient ρ

The estimator of the true population Pearson Product Moment Correlation Coefficient (ρ) is given by

$$ r = \frac{\sum xy - \dfrac{(\sum x)(\sum y)}{n}}{\sqrt{\left[ \sum x^2 - \dfrac{(\sum x)^2}{n} \right] \left[ \sum y^2 - \dfrac{(\sum y)^2}{n} \right]}} $$

Properties of the Correlation Coefficient (r):

1. It is a unitless quantity.
2. It is always some number between -1 and +1, inclusive.
3. The magnitude of r is simply a measure of how closely the points cluster about a certain trend line
which is known as the regression line.

It is known that r is a measure of the degree of relationship between two variables, X and Y. However, if we want to
know if the relationship between the two variables is meaningful, a test of significance is necessary. The null
hypothesis may be stated as there is no significant relationship between the two variables and the alternative
hypothesis is stated as there is a significant relationship between the two variables.

The test statistic used in testing the null hypothesis is the t-statistic with the formula

$$ t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} $$

The computed value of the test statistic will be compared with a tabular t-value at a certain level of
significance with n – 2 degrees of freedom.

Example: Consider the scores obtained in Math (X) and Statistics (Y) by 10 students.

Student           1    2    3    4    5    6    7    8    9   10
Math Score (X)    5    8   10   12   12   14   15   16   18   20
Stat Score (Y)    2    7    8    9   10   12   14   10   16   12

Compute for the correlation coefficient, r. Test if it is significant.
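The computation for this example can be sketched as follows, using the formula for r and the t-statistic given above. The function names are illustrative, and the critical value 2.306 (two-tailed, 0.05 level, 8 degrees of freedom) is taken from a standard t-table and is an assumption about the level of significance, which the example does not specify.

```python
import math

math_scores = [5, 8, 10, 12, 12, 14, 15, 16, 18, 20]   # X
stat_scores = [2, 7, 8, 9, 10, 12, 14, 10, 16, 12]     # Y


def pearson_r(x, y):
    """Pearson r from the computational (sums) formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - sum_x * sum_y / n
    denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) *
                            (sum_y2 - sum_y ** 2 / n))
    return numerator / denominator


def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


n = len(math_scores)
r = pearson_r(math_scores, stat_scores)
t = t_statistic(r, n)

T_CRITICAL = 2.306  # assumed: two-tailed, alpha = 0.05, df = 8
print(f"r = {r:.4f}")
print(f"t = {t:.3f} with {n - 2} degrees of freedom")
print("Significant at the 0.05 level:", abs(t) > T_CRITICAL)
```

Here r is about 0.87, a very strong positive linear relationship by the interpretation table, and the computed t of about 4.97 exceeds the assumed tabular value, so the null hypothesis of no significant relationship would be rejected at that level.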

Exercise Problems:

1. The number of hours of study for an examination and the grade received by a random sample of 10 students
are:

Student                       1    2    3    4    5    6    7    8    9   10
No. of Hours Studied (X)      8    5   11   13   10    5   18   15    2    8
Exam Grade (Y)               56   44   79   72   70   54   94   85   33   65

Calculate r.

B. Spearman Rank Correlation Coefficient (rs)

The Pearson r is based on fairly stringent assumptions, so it is sometimes preferable to use an alternative
statistical tool (which is considered to be nonparametric) that can be applied under much more general conditions.
This is Spearman's rank correlation coefficient, which is essentially the coefficient of correlation for the ranks of
the x's and y's within two samples.

Spearman correlation measures the correlation of variables measured on at least an ordinal scale. It yields a closer
approximation to Pearson's correlation when the data are more or less continuous, that is, when they are not
characterized by a large number of tied ranks.

Procedure in Obtaining Spearman rs:

1. Rank the scores in variable X giving the lowest value a rank of 1 and the highest value a rank of n.
Repeat the process for the scores in variable Y.

2. Obtain the difference (di) between the two sets of ranks.
3. Square each difference and then take the sum of the squared di values.
4. Compute rs using the formula:

$$ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} $$

5. If the proportion of ties in either the X or the Y observations is large, use the formula

$$ r_s = \frac{\sum x^2 + \sum y^2 - \sum d_i^2}{2 \sqrt{\sum x^2 \sum y^2}} $$

where

$$ \sum x^2 = \frac{n(n^2 - 1)}{12} - \sum T_x ; \quad T_x = \frac{t_x^3 - t_x}{12} $$

$$ \sum y^2 = \frac{n(n^2 - 1)}{12} - \sum T_y ; \quad T_y = \frac{t_y^3 - t_y}{12} $$

tx = number of observations in X tied at a given rank
ty = number of observations in Y tied at a given rank
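The procedure above can be sketched in code as follows. The function names are illustrative, and one detail is an assumption not stated in the procedure: tied observations are given the average of the ranks they would otherwise occupy, which is the usual convention. The Math and Statistics scores from the Pearson example are reused here only because they contain a few ties.

```python
import math
from collections import Counter


def average_ranks(values):
    """Rank from lowest (1) to highest (n); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def sum_squared_rank_differences(x, y):
    """Steps 2 and 3: sum of squared differences between the ranks."""
    return sum((a - b) ** 2
               for a, b in zip(average_ranks(x), average_ranks(y)))


def spearman_simple(x, y):
    """Step 4: rs = 1 - 6 * sum(d^2) / (n(n^2 - 1))."""
    n = len(x)
    return 1 - 6 * sum_squared_rank_differences(x, y) / (n * (n ** 2 - 1))


def tie_correction(values):
    """Sum of T = (t^3 - t) / 12 over each group of tied values."""
    return sum((t ** 3 - t) / 12 for t in Counter(values).values())


def spearman_tie_corrected(x, y):
    """Step 5: formula for use when there are many ties."""
    n = len(x)
    base = n * (n ** 2 - 1) / 12
    sum_x2 = base - tie_correction(x)
    sum_y2 = base - tie_correction(y)
    d2 = sum_squared_rank_differences(x, y)
    return (sum_x2 + sum_y2 - d2) / (2 * math.sqrt(sum_x2 * sum_y2))


math_scores = [5, 8, 10, 12, 12, 14, 15, 16, 18, 20]
stat_scores = [2, 7, 8, 9, 10, 12, 14, 10, 16, 12]
rs_simple = spearman_simple(math_scores, stat_scores)
rs_ties = spearman_tie_corrected(math_scores, stat_scores)
print(f"rs (simple formula)        = {rs_simple:.4f}")
print(f"rs (tie-corrected formula) = {rs_ties:.4f}")
```

When there are no ties, every T term is zero and the two formulas give the same value; with only a few ties, as in this data, they differ slightly.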

To test whether the observed rs value indicates a significant association between variables, the following may be
applied:

a. For n from 4 to 30, critical values of rs at the 0.05 and 0.01 levels of significance are shown in the
accompanying table.
b. For n > 30, the significance of the observed rs under the null hypothesis can be determined using the t-test
with the formula:

$$ t = \frac{r_s \sqrt{n - 2}}{\sqrt{1 - r_s^2}} $$

And this computed value will be compared with the tabular t value at a certain level of significance with n
– 2 degrees of freedom.

Table of Critical Values of rs, the Spearman Rank Correlation Coefficient

          Significance Level (one-tailed test)
 N          0.05        0.01
 4         1.000
 5         0.900       1.000
 6         0.829       0.943
 7         0.714       0.893
 8         0.643       0.833
 9         0.600       0.783
10         0.564       0.746
12         0.506       0.712
14         0.456       0.645
16         0.425       0.601
18         0.399       0.564
20         0.377       0.534
22         0.359       0.508
24         0.343       0.485
26         0.329       0.465
28         0.317       0.448
30         0.306       0.432

Regression Analysis

Correlation and regression analysis are closely related since both involve the relationship between two variables
and both use paired observations obtained from the same (or matched) subjects. While correlation is used to
determine the degree as well as the direction of the relationship between variables, regression analysis deals with the
use of the relationship for forecasting or predicting the value of a dependent variable. The primary goal of regression
analysis is to develop a statistical (regression) model that will characterize the association of the variables and also
to determine the statistical relationship, if any, between variables. If the regression model is found to be adequate,
it can then be used to estimate or forecast values of the dependent variable.

Before proceeding with regression analysis, a scatter diagram of Y versus X can be drawn. It may give an
idea of the form of the relationship between them.

Simple Linear Regression

- A statistical tool that is used to


o Describe the dependence of variable Y on the independent variable X.
o Lend support to the hypothesis regarding the possible causation of changes in Y brought about by
changes in X.
o Predict Y in terms of X.
o Explain some of the variations of Y by X.

The Simple Linear Regression Model

In most real situations, the relation between the two variables is not perfect. For example, if a student
obtained a grade of 85%, it cannot be solely attributed to the student's IQ. The student's performance is also
affected by other factors aside from the student's IQ level.

The simple linear regression model expresses the response (or dependent) variable (Y) as a function of one
predictor (or independent) variable (X), as

Yi = β0 + β1Xi + εi

Where

Y = observed value of the dependent variable

X = observed value of the independent variable

β0 = true regression intercept, or the value of the response variable when X is zero

β1 = true regression slope, or the change (increase if positive, decrease if negative) in the
response variable brought about by an increase of one unit in the independent variable

εi = random error component which captures all other factors affecting the response variable but
were not included in the model

Estimation of the Parameters β0 and β1:

The values of the parameters in the regression equation or model are often unknown. The common
practice is to take sample observations and estimate the parameters from these sample data.

The estimate of the parameter β1 is the statistic b1 and is given by

$$ b_1 = \frac{\sum x_i y_i - \dfrac{(\sum x_i)(\sum y_i)}{n}}{\sum x_i^2 - \dfrac{(\sum x_i)^2}{n}} $$

The estimate of the parameter β0, on the other hand, is given by the statistic b0 where

$$ b_0 = \bar{y} - b_1 \bar{x} $$

Example: A corporation administers an aptitude test to all new sales representatives. Management is
interested in the extent to which this test is able to predict their eventual success. The accompanying table
records average weekly sales (in thousands of pesos) and aptitude test scores for a random sample of eight
representatives.

Test Scores     55   60   85   75   80   85   65   60
Weekly Sales    10   12   28   24   18   16   15   12

a.) Estimate the linear regression of weekly sales on aptitude test scores.
b.) Interpret the estimated slope of the regression line.
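A sketch of part (a), applying the formulas for b1 and b0 given above to the tabulated data, is shown below. The function name `fit_simple_linear_regression` is an illustrative assumption.

```python
test_scores = [55, 60, 85, 75, 80, 85, 65, 60]    # X
weekly_sales = [10, 12, 28, 24, 18, 16, 15, 12]   # Y, in thousands of pesos


def fit_simple_linear_regression(x, y):
    """Return (b0, b1) from the computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    b0 = sum_y / n - b1 * sum_x / n   # y-bar minus b1 times x-bar
    return b0, b1


b0, b1 = fit_simple_linear_regression(test_scores, weekly_sales)
print(f"Estimated regression line: sales = {b0:.3f} + {b1:.4f} * score")
```

For part (b), the estimated slope of about 0.40 suggests that each additional point on the aptitude test is associated with an increase of roughly 0.40 thousand pesos in average weekly sales. The intercept should be read with caution, since a test score of zero lies far outside the observed range of scores.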

Coefficient of Determination (R²)

The value R² is a fraction between 0.0 and 1.0 and is unitless. It is the proportionate reduction of total
variation associated with the use of the independent variable. The larger R² is, the more the total variation of the
dependent variable is reduced by introducing the independent variable.

An R² value of 0 means that knowing X does not help you predict Y. There is no linear association or
relationship between X and Y in the sample data. When R² is equal to 1.0, all points lie exactly on a straight line
with no points scattered about the line. Knowing X lets you predict Y perfectly. You can think of R² as the fraction
of the total variance of Y that is "explained" by the variation in X.

The coefficient of determination, R² (%), is computed as

$$ R^2 = \frac{SSR}{SSY} \times 100 $$

where

$$ SSR = b_1 \cdot SPXY $$

$$ SPXY = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} $$

$$ SSY = \sum y_i^2 - \frac{(\sum y_i)^2}{n} $$

$$ SSX = \sum x_i^2 - \frac{(\sum x_i)^2}{n} $$

Note: If R² (%) is close to zero, this may suggest that a linear model is not appropriate for the given data set. A
non-linear model may be fitted.
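As an illustration, R² (%) can be computed for the aptitude test example from the regression section using the quantities SPXY, SSX, SSY, and SSR defined above. This is a minimal sketch; the function name is an illustrative assumption.

```python
test_scores = [55, 60, 85, 75, 80, 85, 65, 60]    # X
weekly_sales = [10, 12, 28, 24, 18, 16, 15, 12]   # Y, in thousands of pesos


def coefficient_of_determination_pct(x, y):
    """R^2 in percent, computed as (SSR / SSY) * 100."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    spxy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    ssx = sum(a * a for a in x) - sum_x ** 2 / n
    ssy = sum(b * b for b in y) - sum_y ** 2 / n
    b1 = spxy / ssx        # estimated slope
    ssr = b1 * spxy        # variation explained by the regression
    return ssr / ssy * 100


r2_pct = coefficient_of_determination_pct(test_scores, weekly_sales)
print(f"R^2 = {r2_pct:.2f}%")
```

The result is about 60%, meaning that roughly 60% of the total variation in average weekly sales is explained by the aptitude test scores in this sample.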
