
Statistics and Data Management Overview

The document discusses statistics and data management. It defines statistics and describes its branches, functions, scope and limitations. It also discusses concepts like data collection, measurement, presentation and distribution.

Uploaded by

Kristine Perez
Copyright
© © All Rights Reserved


Lesson 4: Statistics and Data Management

Learning Outcomes
At the end of the lesson, the students should be able to

1. demonstrate the ability to apply fundamental concepts in exploratory data analysis;

2. describe the field of Statistics in terms of its definition and applications;

3. enumerate the procedures involved in collecting data;

4. distinguish between the nominal, ordinal, interval and ratio levels of data measurement;

5. recognize the various ways to present data;

6. identify the features that describe a data distribution.

Statistics is the study of the collection, organization, analysis, interpretation, and presentation of data.
It deals with all aspects of data, including the planning of its collection in terms of the design of
surveys and experiments. Some consider statistics a mathematical body of science that pertains to the
collection, analysis, interpretation or explanation, and presentation of data, while others consider it a
branch of mathematics concerned with collecting and interpreting data. Because of its empirical roots
and its focus on applications, statistics is usually considered a distinct mathematical science rather than
a branch of mathematics.

4.1 Basic Concepts

Statistics is defined as a branch of mathematics concerned with facilitating wise decision-making
in the face of uncertainty; it therefore develops and uses techniques for the collection,
effective presentation, and proper analysis of data.

Branches of Statistics

1. Descriptive Statistics is concerned with the description and summarization of data. It deals with
the techniques used in the collection, presentation, organization, and analysis of the data on hand.

2. Inferential Statistics is concerned with the drawing of conclusions from data. It deals with the
techniques used in generalizing from samples to populations, performing estimations and hypothesis
tests, determining relationships among variables, and making predictions.

All Rights Reserved. 2020 Abdul, Atienza, et. al.



Functions of Statistics

1. Condensation. To condense means to reduce or to lessen. Condensation makes a huge mass of
data easier to understand by summarizing it in only a few observations.

2. Comparison. Classification and tabulation are the two methods used to condense data, and they
help us compare data collected from different sources. Grand totals, measures of central
tendency, measures of dispersion, graphs and diagrams, coefficients of correlation, etc.
provide ample scope for comparison. Because statistics is an aggregate of facts and figures,
comparison is always possible, and it helps us understand the data better.

3. Forecasting. To forecast means to predict or to estimate beforehand. Given data on the number
of students enrolled in PUP over the last ten years, it is possible to forecast the number of
students who will enroll in the near future. In business, forecasting also plays a dominant role
in connection with production, sales, profits, etc. Time series analysis and regression analysis
play an important role in forecasting.

4. Estimation. One of the main objectives of statistics is to draw inferences about a population
from the analysis of a sample drawn from that population.

5. Tests of Hypothesis. A statistical hypothesis is a statement about the probability distribution
characterizing a population, made on the basis of the information available from the sample
observations. Statistical methods are extremely useful in formulating and testing hypotheses.
Whether the grades of students increased because they are motivated, or whether a new teaching
method is effective in discussing a particular topic, are examples of hypotheses that are tested
with appropriate statistical tools.

Scope of Statistics

1. Statistics and Industry. Statistics is widely used in many industries. In industries, control charts
are widely used to maintain a certain quality level. In production engineering, to find whether the
product is conforming to specifications or not, statistical tools, namely inspection plans, control
charts, etc., are of extreme importance. In inspection plans we have to resort to some kind of
sampling - a very important aspect of Statistics.

2. Statistics and Commerce. Statistics is the lifeblood of successful commerce. No business can
afford either to understock or to overstock its goods. A business first estimates the demand for
its goods and then adjusts its output or purchases accordingly. Thus statistics is indispensable
in business and commerce.


3. Statistics and Economics. Statistical methods are useful in measuring numerical changes in
complex groups and in interpreting collective phenomena. Statistics is now used abundantly in
economic studies. In both economic theory and practice, statistical methods play an important
role.

4. Statistics and Education. Statistics is widely used in education. Research has become a
common feature in all branches of activity. Statistics is necessary for formulating policies on
starting new courses, assessing the facilities available for new courses, etc. Many people are
engaged in research work to test past knowledge and develop new knowledge; this is possible
only through statistics.

5. Statistics and Planning. Statistics is indispensable in planning. In the modern world, which can
be termed the “world of planning”, almost all government organizations rely on planning for
efficient operation and for the formulation and execution of policy decisions. To achieve these
goals, statistical data relating to production, consumption, demand, supply, prices, investments,
income, expenditure, etc., together with advanced statistical techniques for processing, analyzing
and interpreting such complex data, are of importance. In India, statistics plays an important
role in planning and commissioning at both the central and state government levels.

6. Statistics and Medicine. Statistical tools are widely used in the medical sciences. A t-test is
used to test the effectiveness of a new drug or medicine, and a two-sample t-test is used to
compare the effectiveness of two drugs or medicines. More and more applications of statistics
are now used in clinical investigation.

7. Statistics and Modern Applications. Recent developments in computer technology and
information technology have enabled statisticians to integrate their models and thus make
statistics a part of the decision-making procedures of many organizations. Many software
packages are available for solving problems in design of experiments, forecasting, simulation, etc.

Limitations of Statistics

1. Statistics is not suitable for the study of qualitative phenomena. Since statistics is
basically a science that deals with sets of numerical data, it is applicable only to those
subjects of enquiry that can be expressed in terms of quantitative measurements. Qualitative
phenomena such as honesty, poverty, beauty and intelligence cannot be expressed numerically,
so statistical analysis cannot be directly applied to them.

2. Statistics does not study individuals. Statistics gives no specific importance to individual
items; it deals with an aggregate of objects. Individual items, taken individually, do not
constitute statistical data and do not serve any purpose for a statistical enquiry.

3. Statistical laws are not exact. It is well known that the mathematical and physical sciences
are exact. Statistical laws, however, are only approximations. Statistical conclusions are not
universally true; they are true only on average.

4. Statistics may be misused. Statistics must be used only by experts; statistical methods are
among the most dangerous tools in the hands of the inexpert. The use of statistical tools by
inexperienced and untrained persons might lead to wrong conclusions.

5. Statistics is only one of the methods of studying a problem. Statistical methods do
not provide a complete solution to a problem, because problems must be studied against the
background of a country's culture, philosophy or religion. Thus a statistical study should be
supplemented by other evidence.

Population and Sample

In statistics, we are often interested in gathering information from a group of objects. If the group
under consideration consists of a large number of objects, we try to obtain information about the
group by examining a subgroup of it.

Definition 14
The total collection of all the elements that we are interested in is called a population. A
subgroup of the population that will be studied in detail is called a sample.

For the data from the sample to be informative about the population, the sample must be
representative of the population. Being representative does not mean that the characteristics of the
sample exactly match those of the total population; rather, it means that the sample was obtained in
such a way that every member of the population had an equal chance of being included in the sample.

Definition 15
A sample of k members of a population is called a random sample, also called a simple random
sample, if the members are chosen in such a way that all possible choices of the k members are
equally likely.

After a random sample is obtained from the population, we can use statistical inference to draw
generalizations about the population by examining the members of the sample.


4.2 Steps in Statistical Investigation


1. Defining the problem

(a) Identify a specific problem.

(b) Define the scope and limitations, assumptions to be made, and expected outcomes.

2. Collection of data

(a) Make sure to collect the data properly.

(b) Incomplete, fabricated, outdated, and inaccurate data are useless.

3. Summarization and tabulation of data

(a) This refers to the organization of data in text, tables, graphs and charts, so that logical
conclusions can be derived from them.

(b) Explore the data to obtain additional insight that could contribute to the study.

4. Analysis of data

(a) This pertains to the process of deriving from the given data relevant information from which
numerical descriptions can be formulated.

(b) Summarized data must be examined so that insights and meaningful information can be
produced to support decision-making or solutions to the question or problem at hand.

5. Interpretation of data and results

(a) This refers to the task of drawing conclusions from the analyzed data.

(b) Results must be able to answer the research problem and give recommendations.

6. Presentation of the result

(a) Present all pertinent results in a clear and concise manner.

(b) Use an appropriate form of media to present results.

4.3 Sampling and Sampling Techniques


Sampling refers to the process of obtaining samples from the population. Sampling may be categorized as
either probability sampling or non-probability sampling. Probability sampling, also referred to as random
sampling, is the method of sampling in which every member of the population has an equal chance of
being selected for the sample; otherwise, it is considered non-probability sampling. We should note that,
in order to properly use the techniques of statistical inference, probability sampling must be used to
obtain samples.


Probability Sampling Techniques

1. Simple Random Sampling. A probability sampling technique wherein all possible subsets
consisting of n elements selected from the N elements of the population have the same chance of
selection.

2. Systematic Sampling. A probability sampling technique wherein the first element is selected
at random and the other elements in the sample are selected systematically by taking every kth
element from the random start, where k is the sampling interval.

3. Stratified Random Sampling. A probability sampling method where we partition the population
into non-overlapping strata or groups and then choose a proportional sample from each stratum.
The actual sample is the union of the samples drawn from the strata.

4. Cluster Sampling. A probability sampling technique wherein we partition the population into
non-overlapping groups or clusters consisting of one or more elements, and then select a sample
of clusters. Every member of each selected cluster is included in the sample.
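
The four probability sampling techniques above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 100 labelled members, the two strata, the ten "sections" used as clusters, and all function names are hypothetical and do not come from the lesson.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n distinct members so that every n-subset is equally likely."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Pick a random start among the first k members, then take every k-th member."""
    start = rng.randrange(k)
    return population[start::k]

def stratified_sample(strata, fraction, rng):
    """Draw the same fraction from each stratum (proportional allocation)."""
    sample = []
    for members in strata.values():
        size = round(fraction * len(members))
        sample.extend(rng.sample(members, size))
    return sample

def cluster_sample(clusters, m, rng):
    """Select m whole clusters at random and keep every member of each."""
    chosen = rng.sample(list(clusters), m)
    return [member for name in chosen for member in clusters[name]]

rng = random.Random(2020)                 # fixed seed so the run is repeatable
population = list(range(1, 101))          # hypothetical population labelled 1..100
strata = {"first_year": list(range(1, 61)), "second_year": list(range(61, 101))}
clusters = {f"section_{i}": list(range(10 * i + 1, 10 * i + 11)) for i in range(10)}

srs = simple_random_sample(population, 10, rng)
sys_sample = systematic_sample(population, 10, rng)
strat = stratified_sample(strata, 0.10, rng)
clus = cluster_sample(clusters, 2, rng)
print(len(srs), len(sys_sample), len(strat), len(clus))
```

With a sampling fraction of 10%, the stratified sample takes 6 members from the stratum of 60 and 4 from the stratum of 40, which is what "proportional" means here.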

Non-Probability Sampling Techniques

1. Accidental Sampling. The researcher chooses the sample by obtaining members of the
population in a convenient, often haphazard way.

2. Quota Sampling. A specified number of persons of certain types is included in the sample.
The researcher is aware of categories within the population and draws samples from each category.
The size of each categorical sample is proportional to the proportion of the population that belongs
to that category.

3. Purposive Sampling. The researcher uses his or her judgment to choose the members that he or
she believes are representative of the population.

4. Snowball Sampling. This technique is also called referral sampling. A primary set of samples
is chosen based on the criteria set by the researcher. Information on where to find a succeeding
set of samples meeting the same criteria is gathered from this primary set in order to expand
the number of samples.

4.4 Sample Size Considerations


The sample size is typically denoted by n and is always a positive integer. No single sample size
can be prescribed here, since it varies across research settings. However, all else being equal, a
larger sample leads to increased precision in estimates of various properties of the population.
To determine the sample size we can apply one of the following methods:


1. Slovin’s Formula. Slovin’s formula is used to estimate the size n of a random sample given
the population size N and a margin of error E. We can compute

\[ n = \frac{N}{1 + N E^2}, \]

where N is the population size.

Example 27. A researcher plans to conduct a survey about the food preferences of BS Stat students. If
the population of students is 1000, use Slovin’s formula to find the sample size if the margin of error
is 5%.

Solution. Using Slovin’s formula, we get

\[ n = \frac{1000}{1 + 1000(0.05)^2} \approx 285.71. \]

Rounding up, the researcher needs to survey 286 BS Stat students.
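
The computation in Example 27 can be reproduced with a short Python sketch. The function name is an assumption made for illustration; the result is rounded up because a fraction of a respondent cannot be surveyed.

```python
import math

def slovin_sample_size(population_size, margin_of_error):
    """Slovin's formula n = N / (1 + N * E**2), rounded up to a whole respondent."""
    n = population_size / (1 + population_size * margin_of_error ** 2)
    return math.ceil(n)

sample_size = slovin_sample_size(1000, 0.05)
print(sample_size)  # 286, as in Example 27
```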

2. Minimum Sample Size for Estimating a Population Mean. The estimated minimum sample
size n needed to estimate a population mean $\mu$ to within E units at $100(1 - \alpha)\%$ confidence is

\[ n = \frac{(z_{\alpha/2})^2 \sigma^2}{E^2}, \]

where $\sigma$ is the known population standard deviation, E is the margin of error and $z_{\alpha/2}$ is a
value which can be obtained from the z-table.

Example 28. Suppose we want to know the average age of STEM students, and we would like to be 99%
confident about our results. From a previous study, we know that the standard deviation for the
population is 1.3. How many students should be chosen for a survey if the margin of error is 0.2?

Solution. Find $z_{\alpha/2}$ by looking at the z-table:

\[ \alpha = 1 - 0.99 = 0.01 \implies z_{\alpha/2} = z_{0.005}. \]

The closest z-score for 0.005 in the z-table is 2.58. Thus,

\[ n = \frac{(2.58)^2 (1.3)^2}{(0.2)^2} \approx 281.23, \]

which we round up to 282, since it is impossible to take a fractional observation. We need 282 STEM
students as a sample for our study.
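
As a check on Example 28, the following Python sketch evaluates the same formula. The function names are assumptions for illustration. It also shows, as an aside not taken from the lesson, that computing the z-value with the standard library's `statistics.NormalDist` instead of reading the rounded value 2.58 from a table gives a slightly smaller z and therefore a sample size that is one lower.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(z, sigma, margin_of_error):
    """n = z^2 * sigma^2 / E^2, rounded up to a whole observation."""
    return math.ceil((z ** 2) * (sigma ** 2) / margin_of_error ** 2)

def z_for_confidence(confidence):
    """Upper alpha/2 critical value of the standard normal distribution."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

n_table = sample_size_for_mean(2.58, 1.3, 0.2)        # z rounded as in a z-table
z_exact = z_for_confidence(0.99)                       # about 2.5758
n_exact = sample_size_for_mean(z_exact, 1.3, 0.2)
print(n_table, round(z_exact, 4), n_exact)
```

The table-based value reproduces the 282 students of the example; the unrounded z gives 281, which illustrates how sensitive the result is to rounding in the z-table.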


3. Minimum Sample Size for Estimating a Population Proportion. The estimated minimum
sample size n needed to estimate a population proportion p to within E at $100(1 - \alpha)\%$ confidence
is

\[ n = \frac{(z_{\alpha/2})^2 \, \hat{p}(1 - \hat{p})}{E^2}. \]

This is also called the Cochran Formula.

The dilemma here is that the formula for estimating how large a sample to take contains the
number $\hat{p}$, which we know only after we have taken the sample. There are two ways out of this
dilemma.

• First, typically the researcher will have some idea as to the value of the population proportion
p, hence of what the sample proportion $\hat{p}$ is likely to be. For example, if last month 37% of
all voters thought that state taxes are too high, then it is likely that the proportion with that
opinion this month will not be dramatically different, and we would use the value 0.37 for $\hat{p}$
in the formula.

• The second approach to resolving the dilemma is simply to replace $\hat{p}$ in the formula by 0.5.
This is because if $\hat{p}$ is large then $1 - \hat{p}$ is small, and vice versa, which limits their product to
a maximum value of 0.25, which occurs when $\hat{p} = 0.5$. This is called the most conservative
estimate, since it gives the largest possible estimate of n.

Example 29. Suppose we are doing a study on the inhabitants of a large town, and want to find out
how many households serve breakfast in the mornings. We don’t have much information on the subject
to begin with, so we’re going to assume that half of the families serve breakfast: this gives us maximum
variability. Here, $\hat{p} = 0.5$. We want 95% confidence and a margin of error of 5%.

Solution. Find $z_{\alpha/2}$ in the z-table. We have

\[ \alpha = 1 - 0.95 = 0.05 \implies z_{\alpha/2} = z_{0.025}. \]

The closest z-score for 0.025 in the z-table is 1.96. With this value we get

\[ n = \frac{(1.96)^2 (0.5)(1 - 0.5)}{(0.05)^2} \approx 384.16. \]

Hence, a random sample of 385 households in our target population should be enough to give us the
confidence level we need.


Finite Population Correction for Proportions

If the population is small then the sample size can be reduced slightly. This is because a given sample
size provides proportionately more information for a small population than for a large population. The
formula is

\[ n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}}, \]

where $n_0$ is Cochran’s sample size recommendation, N is the population size and n is the new adjusted
sample size.

Example 30. In the preceding example, if there were just 1000 households in the target population, we
would calculate

\[ n = \frac{385}{1 + \dfrac{385 - 1}{1000}} \approx 278.18. \]

All we need are 279 households in our sample, a substantially smaller sample size.
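
Examples 29 and 30 can be verified together with the sketch below. The function names are assumptions for illustration; the z-value 1.96 is taken from the text rather than computed.

```python
import math

def cochran_sample_size(z, p_hat, margin_of_error):
    """Cochran's formula n0 = z^2 * p(1 - p) / E^2 (before rounding)."""
    return z ** 2 * p_hat * (1 - p_hat) / margin_of_error ** 2

def finite_population_correction(n0, population_size):
    """Adjusted sample size n = n0 / (1 + (n0 - 1) / N) (before rounding)."""
    return n0 / (1 + (n0 - 1) / population_size)

n0 = math.ceil(cochran_sample_size(1.96, 0.5, 0.05))      # Example 29
n = math.ceil(finite_population_correction(n0, 1000))     # Example 30
print(n0, n)  # 385 279
```

Both values are rounded up, following the convention used in the examples that a fractional observation cannot be taken.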

4.5 Methods of Data Collection


1. Survey Method. The survey is a method of collecting data on the variable of interest by asking
people questions. This may be done by interview or by using questionnaires.

2. Observation. Observation is a method of obtaining data or information by using our primary
senses.

3. Experiment. Experiment is a method of collecting data where there is direct human intervention
on the conditions that may affect the values of the variable of interest.

4.6 Levels of Measurement


1. The nominal level of measurement classifies data into mutually exclusive (non-overlapping)
categories in which no order or ranking can be imposed on the data.

Example: Gender (male, female), Zip Code, Color, Nationality, Political affiliation, Religious
affiliation.

2. The ordinal level of measurement classifies data into categories that can be ranked; however,
precise differences between the ranks do not exist.

Example: Grade (A, B, C, D, F), Rating Scale/Likert scale, Ranking of tennis players, Judging (first
place, second place, etc.)

3. The interval level of measurement ranks data, and precise differences between units of measure
do exist; however, there is no meaningful zero.

Example: Temperature (in degrees Celsius or Fahrenheit), IQ, SAT score

4. The ratio level of measurement possesses all the characteristics of interval measurement, and
there exists a true zero. In addition, true ratios exist when the same variable is measured on two
different members of the population.

Example: Height, Weight, Volume, Time, Salary, Age

4.7 Presentation of Data

After data have been collected, the researcher can present them in any of the following forms.

1. Textual Form. Data are presented in paragraphs of text. The text highlights the important figures
or results that the researcher wishes to focus on.

2. Tabular Form. Data appear in a systematic manner in rows and columns.

The following is an example of a Simple or One-Way Table.

Table 1
Frequency Distribution of the Students Enrolled for the Last 6 Years

Year     Frequency
2012     13,450
2013     13,200
2014     15,389
2015     16,790
2016     18,900
2017     19,500
Total    97,229


The following is an example of a Two-Way Table.

Table 2
Number of Students Enrolled for the Last 6 Years When Grouped According to Sex

                               Year
Sex       2012     2013     2014     2015     2016     2017     Total
Male      5560     6095     7386     8056     7945     6451     41493
Female    7890     7105     8003     8734     10955    13049    55736
Total     13450    13200    15389    16790    18900    19500    97229

3. Graphical Form. Data or relationships among variables can be presented in visual form, through
graphs or diagrams. In that manner, the reader can easily perceive what is meant by the
figure or any trend portrayed by the data.

Types of Statistical Charts

(a) Bar Graph (Vertical Bar/Column Charts) is applicable for showing a comparison of the
amount of a variable of interest collected over time.

[Figure: Simple Chart]

[Figure: Grouped Column Charts]

[Figure: Subdivided Column Charts]

(b) Histogram is similar to the bar graph, but the base of each rectangle has a length exactly
equal to the class width of the corresponding interval. Also, there are no spaces between
rectangles.

[Figure: Histogram]

(c) Pictograph is similar to the bar chart, but instead of bars we use pictures or symbols to
represent a value or an amount.

[Figure: Pictograph]

(d) Pie Chart is a circular graph partitioned into several sections, depicting relative percentages
with respect to the total distribution.

[Figure: Pie Chart]

(e) Line Graph is a graph used to visualize data that change continuously over time.

[Figure: Simple Line Graph]

[Figure: Multiple Line Graph]

(f) Statistical Map is used to show data in geographical areas.

[Figure: Statistical Map]

4.8 Measures of Central Tendency


A measure of central tendency or average is a location measure that pinpoints the center or typical
middle value of a data set. It is a convenient way of describing a set of data with a single value that
represents the average characteristic of the data set. The three common measures of central tendency
are the mean, median and mode.

Mean

Definition 16
Suppose that a variable x assumes values $x_1, x_2, \ldots, x_n$. The arithmetic mean $\bar{x}$ of these values
is defined as

\[ \bar{x} = \frac{\sum x}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \cdots + x_n}{n}. \]

The (arithmetic) mean of x is obtained by adding all its observed values and dividing the sum by the
total number of observations.

Example 31. The scores of 15 students in Mathematics in the Modern World on an exam consisting
of 25 items are 25, 20, 18, 18, 17, 15, 15, 15, 14, 14, 13, 12, 12, 10, 10. Determine the mean score for
this exam.

Solution. Let x denote the score of a random student from the sample of 15 students in Mathematics in
the Modern World. The sum of these scores is $\sum x = 228$. Hence, the mean score of the 15 students is

\[ \bar{x} = \frac{\sum x}{n} = \frac{228}{15} = 15.2. \]
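
The arithmetic in Example 31 is easy to confirm in Python. This is a minimal sketch; the variable names are chosen here for illustration.

```python
# Scores of the 15 students from Example 31.
scores = [25, 20, 18, 18, 17, 15, 15, 15, 14, 14, 13, 12, 12, 10, 10]

total = sum(scores)              # sum of all observed values
mean_score = total / len(scores) # divide by the number of observations
print(total, mean_score)         # 228 15.2
```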

There are cases when the observations in a data set carry respective weights. When the weights are
positive integers, we can call them frequencies. The following gives a formula for the weighted mean
of a weighted data set.

Definition 17
Given the x values $x_1, x_2, \ldots, x_n$ with respective weights $w_1, w_2, \ldots, w_n$, the weighted mean
is defined as

\[ \bar{x} = \frac{\sum wx}{\sum w} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{w_1 + w_2 + \cdots + w_n}. \]

Example 32. Suppose that we are asked to get the mean of the data set 1, 1, 3, 3, 3, 3, 4, 4, 4, 6, 6, 8.
Using the original formula for the arithmetic mean, and then grouping equal values so that their
frequencies act as weights, we find that

\[
\begin{aligned}
\bar{x} &= \frac{(1 + 1) + (3 + 3 + 3 + 3) + (4 + 4 + 4) + (6 + 6) + 8}{12} \\
        &= \frac{2 \cdot 1 + 4 \cdot 3 + 3 \cdot 4 + 2 \cdot 6 + 1 \cdot 8}{2 + 4 + 3 + 2 + 1} \\
        &= \frac{2 + 12 + 12 + 12 + 8}{12} \\
        &= \frac{46}{12} \\
        &\approx 3.833.
\end{aligned}
\]

We can interpret the mean of the data values as the fulcrum or center of gravity in a balance scale as
shown below.

[Figure: the data values stacked on a number line from 1 to 8, balanced on a fulcrum at the mean = 3.8333]

Example 33. Calculate the General Weighted Average (GWA) of Julius Garde for the first semester of
school year 2019-2020 as shown in the following table.

Course    Grade    Units
BM 112    1.25     3
BM 101    1.00     3
AC 103    1.25     6
MG 101    1.00     3
EC 111    1.50     3
MK 101    1.50     3
FM 111    1.20     3
PE 1      1.00     2

Solution. To solve for the GWA, we first consider the entries in the second column of the table as the
points $x_i$ and the entries in the third column as the corresponding weights $w_i$. By constructing a
fourth column consisting of the products $w_i x_i$ and finding the column totals, we get the table below.

Course    x_i     w_i    w_i x_i
BM 112    1.25    3      3.75
BM 101    1.00    3      3.00
AC 103    1.25    6      7.50
MG 101    1.00    3      3.00
EC 111    1.50    3      4.50
MK 101    1.50    3      4.50
FM 111    1.20    3      3.60
PE 1      1.00    2      2.00
Total             26     31.85

We see from the column totals that $\sum w = 26$ and $\sum wx = 31.85$. Therefore, the weighted mean or the
general weighted average (GWA) of Julius Garde for the first semester of AY 2019-2020 is

\[ \bar{x} = \frac{\sum wx}{\sum w} = \frac{31.85}{26} = 1.225, \]

or 1.23 when rounded up to two decimal places.
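
The weighted mean in Example 33 can be recomputed directly from the grades and units listed in the table. The sketch below is illustrative only; the variable names are assumptions. Summing the products of each grade and its units from the table entries gives 31.85 over a total of 26 units.

```python
# (course, grade, units) rows copied from the table in Example 33.
records = [
    ("BM 112", 1.25, 3), ("BM 101", 1.00, 3), ("AC 103", 1.25, 6),
    ("MG 101", 1.00, 3), ("EC 111", 1.50, 3), ("MK 101", 1.50, 3),
    ("FM 111", 1.20, 3), ("PE 1", 1.00, 2),
]

total_units = sum(units for _, _, units in records)             # sum of weights
weighted_sum = sum(grade * units for _, grade, units in records)  # sum of w * x
gwa = weighted_sum / total_units
print(total_units, round(weighted_sum, 2), round(gwa, 3))
```

Because the grades are stored as floating-point numbers, the products are rounded only when printed; the underlying ratio is 31.85 / 26.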

Median

Definition 18
The median, usually denoted by $\tilde{x}$, is the middle value of a data set when the observations are
arranged either in increasing or decreasing order.

Outliers in the data set have little or no effect on the median. Thus, the median is preferred over the
mean as a measure of central tendency when the data contain outliers. To find the median, begin by
listing the data in order from smallest to largest, or largest to smallest.

If the number of data values, N, is odd, then the median is the middle data value. Its position can be
found by rounding N/2 up to the next whole number. If the number of data values is even, there is no
one middle value, so we find the mean of the two middle values (values N/2 and N/2 + 1).

Example 34. Given the scores of 15 students in Mathematics in the Modern World on an exam consisting
of 25 items:

25, 20, 18, 18, 17, 15, 15, 15, 14, 14, 13, 12, 12, 10, 10

Since the data are already arranged in decreasing order and there are 15 observations, we round
15/2 = 7.5 up to the next whole number, which is 8, and take the 8th observation from the left (or
right). Therefore, the median is $\tilde{x} = 15$. By comparison, the mean computed in Example 31 is 15.2.

[Figure: number line from 10 to 26 with the mean and the median marked]

Remark. In general, the median need not equal the mean.

Example 35. The data given below are the total number of hours lost due to tardiness and absences of
employees in a company in a given year. Find the median.

Month        Hours Lost
January      55
February     23
March        24
April        37
May          37
June         48
July         42
August       27
September    20
October      40
November     30
December     32

Solution. If the data are arranged in increasing order, we have

20, 23, 24, 27, 30, 32, 37, 37, 40, 42, 48, 55.

Since there are 12 observations (even), we take note of the two middle observations and then compute

\[ \tilde{x} = \frac{32 + 37}{2} = 34.5. \]

Therefore, the median number of hours lost due to tardiness and absences of employees in a company
in the given year is 34.5 hours.
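
Both median examples can be checked with Python's standard `statistics` module, which sorts the data and applies the odd/even rule described above. The variable names are chosen for illustration.

```python
from statistics import median

# Example 34: 15 exam scores (odd count, so the 8th ordered value is the median).
scores = [25, 20, 18, 18, 17, 15, 15, 15, 14, 14, 13, 12, 12, 10, 10]

# Example 35: hours lost per month, January to December (even count).
hours_lost = [55, 23, 24, 37, 37, 48, 42, 27, 20, 40, 30, 32]

ordered_hours = sorted(hours_lost)
median_score = median(scores)
median_hours = median(hours_lost)   # mean of the 6th and 7th ordered values
print(ordered_hours)
print(median_score, median_hours)   # 15 34.5
```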

Mode

Definition 19
The mode is the most frequent observation in a given data set.

Outliers in the data set do not affect the mode. It is possible that the mode of a data set does not
exist, and it is not always unique. It is an appropriate measure of average for data measured only at
the nominal level. We will denote the mode using the symbol $\hat{x}$.


Example 36. Suppose that we wanted to know the “average color” of cars used by the residents in a
given village. In our vehicle color survey, we collected the following data.

Color    Frequency
Blue     3
Green    5
Red      4
White    3
Black    2
Grey     3

Since the color of a vehicle is measured only at the nominal level, the most appropriate measure for the
“average color” is the mode. The most frequent color is Green, with a total of 5 vehicles. Therefore, the
“average color” in our survey data is Green.
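
The mode in Example 36 can be found from the frequency table with the sketch below. It uses `statistics.multimode` (available in Python 3.8 and later), which returns every value tied for the highest frequency, so it also handles data sets with more than one mode. The second data set is a made-up illustration, not part of the lesson.

```python
from statistics import multimode

# Frequency table from Example 36.
color_counts = {"Blue": 3, "Green": 5, "Red": 4, "White": 3, "Black": 2, "Grey": 3}

# Expand the table back into one observation per vehicle.
observations = [color for color, count in color_counts.items() for _ in range(count)]

modes = multimode(observations)
print(modes)  # ['Green']

# Hypothetical data with two modes, to show that the mode need not be unique.
two_modes = multimode([1, 1, 2, 2, 3])
print(two_modes)  # [1, 2]
```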

It is possible for a given data set to have more than one mode. Such a data set is said to be multimodal.
If a given set has only one mode, the data set is unimodal. If it has two modes, the data set is bimodal,
and so on.

4.9 Measures of Dispersion or Variability


Measures of dispersion are descriptive summary measures that help us characterize the data set in terms
of how varied the observations are from the center. If the value of such a measure is small, the
observations are not too different from the center. On the other hand, if its value is large, the
observations are very different from the center, or they are widely spread out from the center.

Range

Definition 20
The range is the difference between the largest and the smallest observations or items in a set of
data.

The range of a data set is easy to compute, but it is a limited measure because it depends on only two
of the numbers (the highest and the lowest) in the data set. Hence, the range can easily be affected
by outliers. Also, it does not provide any information regarding the concentration of the data from the
center.

Example 37. The following are the scores of 20 students from two different sections, 10 from each
section, in a 50-item exam in MMW.

section 1    40   38   42   40   39   39   43   40   39   40
section 2    46   37   40   33   42   36   40   47   34   45

For section 1, the highest score is 43, while the lowest score is 38. Thus,

range = 43 − 38 = 5.

On the other hand, for section 2, the highest score is 47, while the lowest score is 33. Thus,

range = 47 − 33 = 14.

Therefore, the scores of the students surveyed from section 2 have a wider range than those of the
students surveyed from section 1.
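
The two ranges in Example 37 follow directly from the largest and smallest scores, as this short sketch shows. The helper name `data_range` is an assumption for illustration.

```python
def data_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

section_1 = [40, 38, 42, 40, 39, 39, 43, 40, 39, 40]
section_2 = [46, 37, 40, 33, 42, 36, 40, 47, 34, 45]

range_1 = data_range(section_1)
range_2 = data_range(section_2)
print(range_1, range_2)  # 5 14
```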

Variance and Standard Deviation

Suppose that the center of a population data set $\{x_1, x_2, \ldots, x_N\}$ is best described by the arithmetic
mean $\mu$ and that our goal is to get the average “distance” of each data point $x_i$ from $\mu$. Naturally, we
would like to compute

\[ \frac{\sum_{i=1}^{N} (x_i - \mu)}{N} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu). \]

However, using the properties of summations, and the fact that $N\mu = x_1 + x_2 + \cdots + x_N$, we can check
that

\[ \sum_{i=1}^{N} (x_i - \mu) = \sum_{i=1}^{N} x_i - \sum_{i=1}^{N} \mu = N\mu - N\mu = 0. \]

In other words, the sum of the deviations from the mean is 0, and therefore we cannot have a meaningful
measure of variability this way. The reason behind this fact is that some of the deviations from the mean
are negative (those which are to the left of the mean) and some are positive (those which are to the right
of the mean), and they cancel each other out. However, we can work our way out of this unfortunate
situation if we ignore the signs of these deviations. One way to do this is to square these
deviations from the mean. We then have the following definition.

Definition 21
The variance of a population data set $\{x_1, x_2, \ldots, x_N\}$ with population mean $\mu$ is defined as

\[ \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2. \]

On the other hand, the variance of a sample data set $\{x_1, x_2, \ldots, x_n\}$ with sample mean $\bar{x}$ is
defined as

\[ s^2 = \frac{1}{n - 1}\sum_{i=1}^{n} (x_i - \bar{x})^2. \]

As we may have noticed, the formula for the sample variance differs from the formula for
the population variance mainly in the divisor n − 1. The reason behind this is rather technical
and mathematical in nature. Simply put, the divisor n − 1 removes the “bias” in $s^2$ when we want it
to estimate $\sigma^2$ for the purposes of making inferences.

Notice that the variance is a nonnegative quantity because it comes from averaging squared quantities.
There is, however, one major drawback to using the variance. If we follow the steps in calculating
the variance, we find that it is measured in square units because we took the squares of the
deviations. For example, if our sample data are measured in meters, then the variance is given in
square meters.

To return to the original units, we can take the square root of the variance. This eliminates the
problem of squared units and gives us a measure of the spread that has the same units as our original
sample or population data.

Definition 22
The population (sample) standard deviation is the nonnegative square root of the population
(sample) variance. In symbols,

\[ \sigma = \sqrt{\sigma^2} \quad \text{and} \quad s = \sqrt{s^2}. \]

Example 38. Using the sample data sets in Example 37, determine which section exhibits a greater
variability in terms of standard deviations.

Solution. Let x denote the scores of students sampled from section 1 and let y denote the scores of
students sampled from section 2. To calculate the standard deviations of each sample, we first take note
that the sample means from each section are

\[ \bar{x} = \frac{\sum x}{n} = \frac{400}{10} = 40 \quad \text{and} \quad \bar{y} = \frac{\sum y}{n} = \frac{400}{10} = 40. \]

To calculate the sample standard deviation, we construct the following table.

x      y      x − x̄    y − ȳ    (x − x̄)²    (y − ȳ)²
40     46     0        6        0           36
38     37     −2       −3       4           9
42     40     2        0        4           0
40     33     0        −7       0           49
39     42     −1       2        1           4
39     36     −1       −4       1           16
43     40     3        0        9           0
40     47     0        7        0           49
39     34     −1       −6       1           36
40     45     0        5        0           25
Σx = 400    Σy = 400    Σ(x − x̄)² = 20    Σ(y − ȳ)² = 224

Therefore, the sample variance for the sample from section 1 is

\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{20}{9} \approx 2.2222, \]

while the sample variance for the sample from section 2 is

\[ s^2 = \frac{\sum (y - \bar{y})^2}{n - 1} = \frac{224}{9} \approx 24.8888. \]

Taking square roots, we find that the sample standard deviations of section 1 and section 2 are
$\sqrt{2.2222} \approx 1.49$ and $\sqrt{24.8888} \approx 4.99$, respectively. We can conclude that, for these samples, the
one from section 1 exhibits less variability than the one from section 2. We note that even though the
two samples have equal means, the standard deviations show the actual difference between the two data
sets.
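
The table-based computation in Example 38 can be cross-checked in Python. The sketch first computes the sum of squared deviations by hand, as in the table, and then compares the result with `statistics.variance` and `statistics.stdev`, which use the n − 1 divisor of the sample formulas. The helper name is an assumption for illustration.

```python
from statistics import mean, stdev, variance

section_1 = [40, 38, 42, 40, 39, 39, 43, 40, 39, 40]
section_2 = [46, 37, 40, 33, 42, 36, 40, 47, 34, 45]

def sum_squared_deviations(values):
    """Sum of (x - mean)^2 over all observations, as in the worked table."""
    center = mean(values)
    return sum((x - center) ** 2 for x in values)

ss_1 = sum_squared_deviations(section_1)   # 20
ss_2 = sum_squared_deviations(section_2)   # 224

# Sample variance: divide by n - 1; standard deviation is its square root.
var_1, var_2 = variance(section_1), variance(section_2)
sd_1, sd_2 = stdev(section_1), stdev(section_2)
print(ss_1, ss_2)
print(round(var_1, 4), round(var_2, 4), round(sd_1, 2), round(sd_2, 2))
```

Both sections have a mean of 40, so the difference between them shows up only in the spread: section 2 has the larger standard deviation.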