Data Management Module

MODULE 4: Data Management
Catalina B. Gayas & Emmeline R. Garcia
Table of Contents
Lesson 1 Data Collection, Organization, and Interpretation
Basic Terminology in Statistics
Data Collection and Sampling Techniques
Frequency Distribution and Graphs for Numerical Data
Lesson 2 Measures of Central Tendency
Mean (Raw and Grouped Data)
The Weighted Mean
Median (Raw and Grouped Data)
Mode (Raw and Grouped Data)
Types of Distribution
Lesson 3 Measures of Variation
Range (Raw and Grouped Data)
Mean Absolute Deviation (Raw and Grouped Data)
Variance and Standard Deviation (Raw and Grouped Data)
Lesson 4 Measures of Relative Position
Standard Score
Percentiles, Deciles, and Quartiles
Lesson 5 Normal Distribution
The Standard Normal Distribution
Applications of Normal Distribution
Lesson 6 Correlation Coefficients and Linear Regression
Correlation Analysis
Linear Regression
Leyte Normal University | Mathematics Unit 1
Overview
Statistics is used in all aspects of human endeavor. It is used to describe data; to determine
significant relationships between and among variables; to determine significant differences in a
variable of interest between or among groups; and to make forecasts and predictions. The concepts of
Statistics were already discussed in your K to 12 curriculum. Hence, this module focuses on the
application of these concepts in real settings to which you can relate. It is the aim of this
module to make you appreciate the importance of Statistics and, at the same time, have fun doing
the exercises and activities.
This module includes the following topics: Data Collection, Organization, and Interpretation; Measures
of Central Tendency; Measures of Dispersion; Measures of Relative Position; Normal Distribution;
and Correlation Coefficients and Linear Regression. Computer applications will be utilized in
this module, especially Microsoft Excel and statistical analysis software such as SPSS, for
data analysis.
Objectives
At the end of this module, you should be able to:
1. demonstrate knowledge of basic statistical terms;
2. use statistical methods to summarize and organize data;
3. solve problems applying normal distribution;
4. apply linear regression and correlation in analyzing data; and
5. interpret computer outputs in data analysis.
LESSON 1: Data Collection, Organization, and Interpretation
Introduction
Statistics is defined as the science of collecting, organizing, summarizing, presenting, and
interpreting data. There are three main reasons why students study statistics. They are as
follows: (1) to read and understand the various statistical studies published in print or broadcast
media; (2) to conduct research in their own field, since statistical procedures are basic to
research; and (3) to become better consumers and citizens by using the knowledge gained from
studying statistics.
Basic Terminology in Statistics
In studying statistics, it is important to understand the basic terms used in the subject. The
following terms are defined for this purpose.
A variable is a characteristic or attribute that can assume different values. Examples
of variables are sex, nationality, score, and height. Data are the measurements or observations that
the variables can assume. A data set is a collection of data values, and each value in the
set is called a datum.
There are two branches of statistics. The branch that involves the collection, organization,
summarization, and presentation of data is called descriptive statistics. The branch that
makes generalizations from a sample (a representative part of a population) to a population (the totality of all
observations or entities of any sort), performs estimation and hypothesis testing, determines
relationships among variables, and makes predictions is called inferential statistics.
Variables can be classified as quantitative or qualitative. A quantitative variable takes
numerical values that can be ordered or ranked. IQ, scores, weight, and temperature are examples of
quantitative variables. Quantitative variables are further classified as discrete or continuous.
A discrete variable assumes values that can be counted. A continuous variable, on the other hand,
can assume an unlimited number of values between any two specific values; it is obtained by
measuring. The number of deaths in a certain locality due to the CoViD-19 pandemic is an example
of a discrete variable, while the height of a person is an example of a continuous variable. Why is
height considered a continuous variable? What are other examples of continuous variables?
How about discrete variables?
Variables are also classified according to four levels of measurement. They are: nominal,
ordinal, interval, and ratio. The nominal scale is the simplest scale of measurement; it classifies data
into mutually exclusive categories and uses numbers for labels only. Sex, occupation,
religious affiliation, and marital status are examples of nominal data. The ordinal scale uses numbers for
labelling, and the numbers can be ranked. However, the differences between ranks are not
equal. Socio-economic status, Latin honors, and academic rank are examples of ordinal data.
The interval scale possesses the characteristics of the ordinal scale (label and rank), and equal differences
between ranks exist. However, interval data have no true zero value. Scores in an examination,
temperature, and Intelligence Quotient (IQ) are examples of interval data. The ratio scale is the
highest level of measurement. It possesses all the characteristics of the interval scale (label, rank,
equal differences between ranks), and a true zero value exists. Distance travelled, height, weight, and age
are examples of ratio data.
Variables are also classified according to their functions, especially in experimental studies.
They are the independent or explanatory variable, the dependent or outcome variable, and the
confounding variable. The independent variable is the variable manipulated by the researcher, while
the dependent variable is the variable affected or influenced by the manipulated variable.
The confounding variable, on the other hand, is a variable other than the independent variable
that also influences the dependent variable. For example, a researcher is interested in finding out
the effect of learning delivery modes (pure online, pure printed module, mixture of online and printed module) on the
performance (test score) of the students in GE104. The delivery mode is the independent
variable; the performance is the dependent variable. The performance can also be affected by
the learning ability of the students. Thus, learning ability is a confounding variable.
Data Collection and Sampling Techniques
Data can be collected in different ways. The method to use in the collection of data depends on the
source of data as well as the type of data to be collected. Data can be collected through
surveys (telephone, questionnaire, or interview), tests, observation, and experimentation. Details
on how each method is done and on the advantages of one over another are not
part of this lesson, as these are exhaustively discussed in your research course.
Data are collected from a representative part of a population called a sample. The process of selecting
samples is called sampling. There are two types of sampling: non-probability and probability
sampling. In non-probability sampling, not every member of the population is given an equal chance to
be chosen; hence, the sample may not be truly representative of the population. If the objective of
the study is to make a generalization, using non-probability sampling is discouraged. Convenience
or Accidental Sampling, Purposive or Judgemental Sampling, and Quota Sampling are the most
common techniques in non-probability sampling.
Probability sampling on the other hand gives equal chance to each member of the population to be
selected as a representative. There are four techniques under this type of sampling. They are as
follows: simple random sampling, systematic random sampling, stratified random sampling
and cluster random sampling.
Simple Random Sampling is a technique used when the population is homogeneous with respect
to the characteristic of interest to the researcher and the population size is known (Petilos, 2012).
The selection of the sample can be done either by the lottery method or by using random numbers.
Systematic Random Sampling is a technique that obtains the desired sample size by selecting every
kth subject. To select the sample, the researcher assigns a number to each member of the population
(by numbering consecutively) and then determines the value of k by dividing the total number
of cases (population) by the desired number of samples. For example, suppose the total population (N) is
1,000 and the sample size (n) is 100. Then the value of k is 1,000 / 100 = 10. Thus, the researcher will select
every 10th subject in the population, beginning from a starting number between
1 and 10 that is chosen by simple random sampling. Suppose the starting number is 6. The
researcher will then consider the subjects whose numbers are 6, 16, 26, and so on, until the desired number of
samples is completed.
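The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `systematic_sample` and its parameters are hypothetical, and the population is assumed to be numbered consecutively from 1 to N with N evenly divisible by n.

```python
import random

def systematic_sample(N, n, start=None, seed=None):
    """Return the interval k and the 1-based positions chosen by systematic sampling.

    Hypothetical helper: assumes members are numbered 1..N and N is divisible by n.
    """
    k = N // n                                   # sampling interval
    if start is None:
        # The starting number itself is drawn by simple random sampling from 1..k.
        start = random.Random(seed).randint(1, k)
    return k, [start + i * k for i in range(n)]

# Reproduce the example in the text: N = 1,000, n = 100, starting number 6.
k, chosen = systematic_sample(1000, 100, start=6)
print(k)            # interval
print(chosen[:3])   # first three selected positions
print(chosen[-1])   # last selected position
```

With the starting number fixed at 6, the list begins 6, 16, 26 and contains exactly 100 positions.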
Stratified Random Sampling is a technique in which the population is grouped into subgroups called
strata according to common characteristic/s determined by the researcher. The subjects are
selected from each stratum in proportion to the size of that subgroup. For example, suppose
the population consists of all freshman students across the three colleges (A, B, and C) in University
X. The total freshman population among the three colleges is 1,400, divided as follows: NA = 350,
NB = 500, and NC = 550, and the researcher wishes to take a total of 350 respondents. Then he has to
select from each stratum the desired samples, using either simple random sampling or systematic
random sampling, based on the following computation:
College   N       n = (N / 1,400) × 350
A         350     87.5 ≈ 88
B         500     125
C         550     137.5 ≈ 138
Total     1,400   351
[Note: Because of rounding in A and C (87.5 becomes 88 and 137.5 becomes 138), the total is 351
instead of 350. The researcher has to decide in which subgroup he has to reduce the
samples by 1.]
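The proportional allocation in this example can be checked with a short Python sketch. The function name `proportional_allocation` is hypothetical, and rounding halves upward is an assumption chosen to match the note (87.5 becomes 88 and 137.5 becomes 138).

```python
import math

def proportional_allocation(strata_sizes, total_sample):
    """Allocate a sample to strata in proportion to stratum size (hypothetical helper)."""
    N = sum(strata_sizes.values())
    exact = {name: size * total_sample / N for name, size in strata_sizes.items()}
    # Assumption: halves are rounded up, as in the note accompanying the table.
    rounded = {name: math.floor(value + 0.5) for name, value in exact.items()}
    return exact, rounded

exact, rounded = proportional_allocation({"A": 350, "B": 500, "C": 550}, 350)
for college in exact:
    print(college, exact[college], rounded[college])
print("Total", sum(rounded.values()))
```

The rounded allocations add up to 351, one more than the target of 350, which is why one stratum has to give up a respondent.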
Cluster Random Sampling is a technique used when the population is very large or the
respondents reside in a large geographic area, so that it is impossible for the researcher to obtain
the list of all members of the population. The members of each cluster are heterogeneous. Unlike
stratified random sampling, where the subjects are selected individually, in this technique
the cluster/s are selected randomly and all members of the selected cluster/s represent the
population. For example, a researcher wishes to determine the type of fertilizer (pure
synthetic, pure organic, or a combination of synthetic and organic) used by rice farmers in the
municipality of Town Q. Assuming that there is no available list of rice farmers (categorized as small
scale, medium scale, and large scale rice producers), the researcher can get a copy of the map of Town Q
and determine the number of barangays that are located outside downtown and along the
seashore areas. Each of these barangays is considered a cluster. Suppose there are 43
barangays that belong to this group. Therefore, there are 43 clusters to choose from. The
researcher then decides how many of these barangays will be included and randomly
selects the cluster/s. The rice farmers in the selected cluster/s represent the group from Town Q.
Frequency Distribution and Graphs for Numerical Data
Once the researcher has collected the data, the next step is to organize them. There are
three ways of presenting data: tabular, graphical, and textual. The following discussion focuses on
how to organize raw data and subsequently represent them using graphs.
Example 1.1
Below are scores of 50 students in Statistics examination.
63 88 79 92 86 87 83 78 47 67
68 76 46 81 92 77 76 84 70 66
77 75 98 81 82 81 87 78 70 60
94 79 52 82 77 81 77 70 74 61
56 69 83 83 71 48 90 52 75 84
Looking at the array of scores, it would be difficult for the reader to tell the characteristics of
the group. Thus, a frequency distribution needs to be prepared. A frequency distribution is
an organization of raw data into classes/groups and their frequencies. It is a
tabular way of organizing raw data. The following are the steps in preparing a frequency distribution.
Step 1. Determine the number of classes.
• Find the highest value (HV) and lowest value (LV).
• Find the range (R) by subtracting the lowest value from the highest value.
• Determine the estimated number of classes by getting the square root of n; call this k.
Your actual number of classes could be greater than the estimated one.
Step 2. Determine the class size of the interval.
$c = \frac{R}{k}$ (rounded up to the next whole number)
Step 3. Determine the lower and upper limits of the lowest class interval. The lower limit should
be divisible by the class size.
Step 4. Determine the highest class interval.
Step 5. Tally the scores in their respective classes.
Step 6. Summarize the tallies.
Illustration: Using the array of raw scores given above, we have:

1. Determine the number of classes.
R = HV – LV
= 98 – 46
R = 52
(This tells us the gap between the highest and lowest scores in the given data set.)
$k = \sqrt{50} = 7.07$
k = 7
2. Determine the class size.
$c = \frac{R}{k} = \frac{52}{7} = 7.43$, which is rounded up to c = 8
3. Determine the lower and upper limits of the lowest class interval.
Since the lowest value in the given data set is 46 and it is not divisible by the class size,
which is 8, we have to find the largest number below 46 that is divisible by 8.
The number is 40. So, the lower limit of our lowest class interval is 40 and the upper limit is
47, because the lower limit of the next class interval is 48 (the lower limit of the preceding
class plus the class size c). It follows that the upper limit of this next class interval is 55. Thus,
the second class interval is 48 – 55. Following the same procedure, you can find the remaining
class intervals.
4. Determine the highest class interval. The highest class interval should contain the highest value of the
given data set. Since our highest value is 98, which is not divisible by the class size of 8, the
lower limit of the highest class interval should be the largest multiple of 8 below 98. The
number is 96. Thus, the highest class interval is 96 – 103.
5. List down the class intervals and tally the scores in their respective classes.
Class Limits   Class Boundaries   Tallies             Frequency
96 – 103       95.5 – 103.5       /                   1
88 – 95        87.5 – 95.5        /////               5
80 – 87        79.5 – 87.5        /////-/////-////    14
72 – 79        71.5 – 79.5        /////-/////-///     13
64 – 71        63.5 – 71.5        /////-///           8
56 – 63        55.5 – 63.5        ////                4
48 – 55        47.5 – 55.5        ///                 3
40 – 47        39.5 – 47.5        //                  2
REMARKS:
• In this illustration the actual number of classes which is 8 is greater than the estimated value of k
which is 7.
• The second column shows the boundary of each class interval in which the actual lower and upper limits
are indicated. These are called true limits or class boundaries.
• The true upper limit of the preceding class is also the true lower limit of the succeeding class. This shows
the continuity of the data.
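The six steps can also be traced in code. The following Python sketch is only an illustration: the function name `frequency_distribution` is hypothetical, the number of classes is estimated as the rounded square root of n, and the class size is rounded up, which is the choice that reproduces the illustration above.

```python
import math

scores = [63, 88, 79, 92, 86, 87, 83, 78, 47, 67,
          68, 76, 46, 81, 92, 77, 76, 84, 70, 66,
          77, 75, 98, 81, 82, 81, 87, 78, 70, 60,
          94, 79, 52, 82, 77, 81, 77, 70, 74, 61,
          56, 69, 83, 83, 71, 48, 90, 52, 75, 84]

def frequency_distribution(data):
    """Build (lower limit, upper limit, frequency) rows, lowest class first."""
    low, high = min(data), max(data)
    k = round(math.sqrt(len(data)))        # Step 1: estimated number of classes
    c = math.ceil((high - low) / k)        # Step 2: class size (rounded up, an assumption)
    lower = (low // c) * c                 # Step 3: largest multiple of c not above the lowest value
    table = []
    while lower <= high:                   # Step 4: stop once the highest value is covered
        upper = lower + c - 1
        frequency = sum(lower <= x <= upper for x in data)   # Steps 5 and 6
        table.append((lower, upper, frequency))
        lower += c
    return c, table

c, table = frequency_distribution(scores)
print("class size:", c)
for lower, upper, f in reversed(table):    # highest class first, as in the table above
    print(f"{lower} - {upper}  {lower - 0.5} - {upper + 0.5}  {f}")
```

The printed frequencies can be compared row by row with the tallies in the table.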
Using the same data set as presented in the frequency distribution above, we can prepare graphs.
In this module, we will discuss only the histogram, frequency polygon, and ogive. These are
the most commonly used graphs in research.
A histogram displays the data using contiguous bars (vertical or horizontal). The histogram is a bar
graph in which the bars are constructed without space in between. This implies that the data presented
are continuous. The heights/lengths of the bars show the frequencies of the respective classes. The
frequency polygon, on the other hand, displays the data by using lines connecting the points
plotted for the frequencies of each class. This graph is used when the data are continuous.
Both graphs use the midpoints of the classes on the variable axis.
The ogive is a graph that shows the cumulative frequencies for the classes in the given distribution.
The ogive can be constructed either for the less than cumulative frequency or for the greater than cumulative frequency.
The following are the steps in constructing the above-specified graphs manually. The same graphs can
be constructed by using either Excel or Minitab, and the specific steps are illustrated in the book
of Bluman.
Example 1.2
Before constructing the different graphs, we need to add more information in our
frequency distribution as shown below.
Class Boundaries   f    X      <cf   >cf   rf (%)
95.5 – 103.5       1    99.5   50    1     2.0
87.5 – 95.5        5    91.5   49    6     10.0
79.5 – 87.5        14   83.5   44    20    28.0
71.5 – 79.5        13   75.5   30    33    26.0
63.5 – 71.5        8    67.5   17    41    16.0
55.5 – 63.5        4    59.5   9     45    8.0
47.5 – 55.5        3    51.5   5     48    6.0
39.5 – 47.5        2    43.5   2     50    4.0
                   N = 50                  100.0
REMARKS:
• The midpoint (X) of each class is obtained using the formula $X = \frac{LL + UL}{2}$,
where LL and UL are the lower and upper limits of the class.
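The additional columns (midpoint, cumulative frequencies, and relative frequency) can be computed from the class limits and frequencies. The Python sketch below is an illustration only; the variable names are hypothetical, the classes are listed from lowest to highest, and the relative frequency is expressed as a percentage of N.

```python
# (lower limit, upper limit, frequency), lowest class first
classes = [(40, 47, 2), (48, 55, 3), (56, 63, 4), (64, 71, 8),
           (72, 79, 13), (80, 87, 14), (88, 95, 5), (96, 103, 1)]

N = sum(f for _, _, f in classes)
rows = []
running = 0
for lower, upper, f in classes:
    running += f
    midpoint = (lower + upper) / 2     # X = (LL + UL) / 2
    less_cf = running                  # cases at or below this class
    greater_cf = N - running + f       # cases at or above this class
    rf = 100 * f / N                   # relative frequency in percent
    rows.append((midpoint, less_cf, greater_cf, rf))

for (lower, upper, f), (x, lcf, gcf, rf) in zip(reversed(classes), reversed(rows)):
    print(f"{lower - 0.5} - {upper + 0.5}  f={f}  X={x}  <cf={lcf}  >cf={gcf}  rf={rf}")
```

A quick check on any such table is that the relative frequencies add up to 100 and that the last less than cumulative frequency equals N.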
Steps in Constructing a Histogram
Step What to do?
1 Construct two perpendicular axes (vertical and horizontal).
2 Label the vertical axis as the frequency axis and the horizontal as the variable
axis. (In our illustration below, our variable is a score.)
3 Lay off segments along the vertical axis (y-axis) to correspond to the
frequencies. (The segments must be equal in length.)
4 Lay off segments along the horizontal axis (x-axis) to correspond to the different
class intervals of the variable. The first line segment should be moved a little to the
right if the lowest value of the variable is not zero.
5 Mark all midpoints of the intervals and label these using class midpoints.
6 Draw rectangles or bars whose heights correspond to the frequency counts and
whose widths correspond to the class size. (Shade or color your bars.)
Adapted from: Resource Materials in Basic Statistics (Petilos, p. 9)
Figure 1.1. Histogram (horizontal axis: Score)
Steps in Constructing a Frequency Polygon
Step What to do?
1 Construct two perpendicular axes (vertical and horizontal).
2 Label the vertical axis as the frequency axis and the horizontal as the variable
axis. (In our illustration above, our variable is a score.)
3 Lay off segments along the vertical axis (y-axis) to correspond to the
frequencies. (The segments must be equal in length.)
4 Lay off segments along the horizontal axis (x-axis) to correspond to the different
class intervals of the variable. The first line segment should be moved a little to the
right if the lowest value of the variable is not zero.
5 For each class interval, the class midpoint and the corresponding frequency are
treated as an ordered pair, which is plotted in the plane determined by the
coordinate axes.
6 The plotted points are then joined using line segments from left to right. To close the
polygon, extend one class interval to both sides by connecting the endpoints of the
graph to the midpoints of the extended segments along the x-axis.
Adapted from: Resource Materials in Basic Statistics (Petilos, p. 10)
Figure 1.2. Frequency Polygon (vertical axis: Frequency; horizontal axis: Score)
Steps in Constructing an Ogive
Step What to do?
1 Construct two perpendicular axes (vertical and horizontal).
2 Label the vertical axis as the cumulative frequency axis and the horizontal as the
variable axis. (In our example the variable is a score.)
3 Lay off equal segments along the vertical axis (y-axis) to correspond to the
cumulative frequencies. Use an appropriate scale to represent the cumulative
frequencies. (Depending on the numbers in the cumulative frequencies, the scales
can be by 2's, 4's, 5's, etc.)
4 Lay off equal segments along the horizontal axis (x-axis) to correspond to the
true upper limits of the classes for the less than ogive, and to the true lower
limits of the classes for the greater than ogive.
5 Plot the cumulative frequencies against the corresponding class boundaries.
6 The plotted points are then joined using line segments from left to right.
Reference: Bluman, pp. 54-55

REMARKS:
• The ogive is used to determine the percentage or the number of cases found below or above a particular boundary.
• If the ogives (for >cf and <cf) are graphed on the same coordinate plane, a line can be drawn from the point
of intersection of the two graphs down to the variable axis; the point where it meets the axis represents the median of the data set.
Figure 1.3. Less than cumulative frequency (vertical axis: Frequency; horizontal axis: Class Boundaries)
Figure 1.4. Greater than cumulative frequency (vertical axis: Frequency; horizontal axis: Class Boundaries)
Stem and Leaf Plot

Another method of organizing data, which is a combination of sorting and graphing, is
called the stem and leaf plot. It is a data plot that uses the leading digit as the stem and the trailing digit as
the leaf.
Steps in Constructing Stem and Leaf Plot.

Step What to do?
1 List down the leading digits of the data set called the stem. Arrange them in
a column either from lowest to highest or vice versa.
2 Starting from the first to the last entry of the data set, carefully record the
trailing digits (leaf) in their corresponding stem.
3 Arrange in order the trailing digits in each row. If there are no data values in a class,
the stem number is written and the leaf row is left blank.
Reference: Bluman, pp81-82
Example 1.3
Let us illustrate the above procedure using the data on the scores of 50 students in
Statistics examination. The data are reproduced as follows:
63 88 79 92 86 87 83 78 47 67
68 76 46 81 92 77 76 84 70 66
77 75 98 81 82 81 87 78 70 60
94 79 52 82 77 81 77 70 74 61
56 69 83 83 71 48 90 52 75 84
Steps:
1. Stem (Leading Digit) Leaf (Trailing Digit)
9 48220
8 831123627113744
7 76599717678800045
6 3897601
5 622
4 687
2. Rearranging the trailing digits (leaf) we have:

Stem Leaf
9 02248
8 111122333446778
7 00014556677778899
6 0136789
5 226
4 678
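The same plot can be produced programmatically. The sketch below is a minimal illustration for two-digit scores, assuming the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is hypothetical.

```python
scores = [63, 88, 79, 92, 86, 87, 83, 78, 47, 67,
          68, 76, 46, 81, 92, 77, 76, 84, 70, 66,
          77, 75, 98, 81, 82, 81, 87, 78, 70, 60,
          94, 79, 52, 82, 77, 81, 77, 70, 74, 61,
          56, 69, 83, 83, 71, 48, 90, 52, 75, 84]

def stem_and_leaf(data):
    """Return {stem: ordered leaves as a string}; stems with no data keep an empty row."""
    stems = range(min(data) // 10, max(data) // 10 + 1)
    plot = {stem: [] for stem in stems}
    for value in sorted(data):              # sorting first keeps each leaf row in order
        plot[value // 10].append(value % 10)
    return {stem: "".join(str(leaf) for leaf in leaves) for stem, leaves in plot.items()}

plot = stem_and_leaf(scores)
for stem in sorted(plot, reverse=True):     # highest stem first, as in the plot above
    print(stem, plot[stem])
```

Because every stem between the lowest and highest is created up front, a stem with no data would still be printed with a blank leaf row, as Step 3 requires.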
REMARKS:
• The figure shows that the distribution peaks in the center and there are no gaps in the data.
• The highest score is 98 and the lowest is 46.
• Most scores are 70 and above.
What other information can you draw from the figure above?
Exercises 1.1
A. Determine the area of statistics (descriptive or inferential) illustrated by the following
statements.
1. A recent study showed that eating garlic could lower blood pressure.
2. The teacher - pupil ratio in public schools has increased from 1:40 in 2015 to 1:50 in
2019.
3. It is predicted that the average number of automobiles each household owns will
increase next year.
4. A study revealed that Lagundi is more effective in curing cough than a similar
product.
5. Consumers generally prefer Colgate to any other toothpaste.
B. In each statement below identify the variable/s and classify it/them according to the level of
measurement (nominal, ordinal, interval, ratio)
1. Marital status of faculty members in a university.
2. Time it takes a student to travel from home to school.
3. Scores in the College Admission Test of freshman students in University Q.
4. Socio-economic status of the residents in a barangay (poor, average, above-average).
5. Ages of freshman college students of Leyte Normal University.
C. Classify each variable as discrete or continuous

1. Number of CoViD-19 cases in Eastern Visayas.
2. Weights of backpacks of college students inside a Science laboratory room.
3. Number of new mono bloc chairs inside the university social hall.
4. Blood pressures of patients seeking admission in a hospital.
5. Number of boxes of disposable surgical masks sold in one pharmacy in three days.
D. A study is to be conducted to determine the level of language proficiency and numeracy
skills among the 700 Education and 300 Management graduating students at University Q.
The researcher wants a sample of 300 by selecting representatives from the two
programs.
1. What is the population of the study?
2. What is the sample in the study?
3. What are the variables of the study? What is the level of measurement of each
variable?
4. What sampling technique is used in this study?
E. An insurance company researcher conducted a survey on the number of car thefts in a
large city for a period of 30 days last summer. The raw data are shown below. Construct a
grouped frequency distribution, frequency polygon, histogram and ogives (Show all
necessary solutions).
52 58 75 79 57 65 62 77 56 51
59 53 51 66 55 68 63 78 50 53
67 65 69 66 69 57 73 72 75 55
LESSON 2: Measures of Central Tendency
Introduction
Statistics is the science of collecting, organizing, summarizing, presenting, and interpreting data. There
are two branches of statistics. The branch that involves the collection, organization, summarization, and
presentation of data is called descriptive statistics, while the branch that involves
interpreting data and drawing conclusions is called inferential statistics. Descriptive statistics
include the measures of central tendency, measures of position, and measures of variability.
There are three measures of central tendency or measures of central location, namely: the mean,
the median, and the mode. A measure of central tendency is a single value that describes a whole set
of data by identifying the central position within the given data set. It is sometimes called
a summary statistic.
Mean (Raw and Grouped Data)
The data gathered in their original form is called raw or ungrouped data, while the data that have
been organized into a frequency distribution is called grouped data.
For raw data, the mean is defined as the arithmetic average of a data set. It is equal to the sum of
the measurements divided by the number of cases (n). It is the measure used when there is
no extreme value in the data set and the data are at the interval or ratio level. Among the three
measures of central tendency, the mean is the most reliable and is amenable to further
mathematical manipulation, which makes it useful for inferential statistics.
Formula: mean = $\frac{\sum x}{n}$
The Greek capital letter sigma ($\Sigma$) is used to denote a sum. Thus, the formula above means the
summation of the values of x divided by the total number of cases. For data collected from a
population, the symbol used for the mean is the Greek letter $\mu$ (read as mu), which is called a parameter.
For data collected from a sample, the symbol used for the mean is $\bar{x}$ (read as x bar), which is
called a statistic. The total number of cases is denoted by N for a parameter and n for a statistic.
Thus, the working formula for the mean of a population is $\mu = \frac{\sum x}{N}$, and for the mean of a
sample it is $\bar{x} = \frac{\sum x}{n}$.
Example 2.1. Compute the average of the scores of 15 students in a Math quiz.
23 25 34 32 22 24 26 24 34 30 26 26 37 25 24
Solution:
Using a calculator, we have:
$\bar{x} = \frac{23 + 25 + 34 + \dots + 24}{15} = \frac{412}{15}$
$\bar{x} \approx 27.5$
This implies that the average score of the 15 students in the Math quiz is 27.5.
Note: Rounding Rule for the Mean. The mean should be rounded to one more decimal place
than occurs in the raw data.
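The computation for the quiz scores can be verified with a few lines of Python. This is a minimal sketch with hypothetical variable names; it applies the rounding rule stated in the note by keeping one decimal place, since the raw scores are whole numbers.

```python
quiz_scores = [23, 25, 34, 32, 22, 24, 26, 24, 34, 30, 26, 26, 37, 25, 24]

total = sum(quiz_scores)          # sum of the measurements
n = len(quiz_scores)              # number of cases
mean = total / n                  # x-bar = (sum of x) / n

print(total, n, round(mean, 1))   # rounded to one more decimal place than the raw data
```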
For grouped data, the mean is obtained by using the formula below:
$\bar{x} = \frac{\sum fX}{N}$
where: $\bar{x}$ = average or mean
f = class frequency
X = midpoint of each class
N = total number of cases
Example 2.2. Using the data in Example 1.1, we find the mean of grouped data. (Scores of
50 students in Statistics examination)
Class Interval   f    X      fX
96 – 103         1    99.5   99.5
88 – 95          5    91.5   457.5
80 – 87          14   83.5   1,169.0
72 – 79          13   75.5   981.5
64 – 71          8    67.5   540.0
56 – 63          4    59.5   238.0
48 – 55          3    51.5   154.5
40 – 47          2    43.5   87.0
                 N = 50      ΣfX = 3,727.0
By substitution, we have:
$\bar{x} = \frac{\sum fX}{N} = \frac{3727}{50} = 74.54$
Therefore, the average score of 50 students in a Statistics examination is 74.54.
Note: We rounded off the computed mean to the nearest hundredth because the true limits of each class interval are
0.5 below and above the given limits. Thus, the true lower limit of each class interval is 0.5
below the apparent lower limit, and the true upper limit is 0.5 higher than the apparent upper limit.
For example, the class interval 40 – 47, with a lower limit of 40 and an upper limit of 47, has a
true lower limit of 39.5 (0.5 lower than the apparent lower limit) and a true upper limit of 47.5
(0.5 higher than the apparent upper limit).
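The grouped mean can be reproduced from the class limits and frequencies alone. The following Python sketch uses hypothetical variable names and takes each class midpoint as the average of the class limits.

```python
# (lower limit, upper limit, frequency) for the 50 Statistics examination scores
classes = [(96, 103, 1), (88, 95, 5), (80, 87, 14), (72, 79, 13),
           (64, 71, 8), (56, 63, 4), (48, 55, 3), (40, 47, 2)]

N = sum(f for _, _, f in classes)
# Each term is f * X, where X is the class midpoint.
sum_fX = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
grouped_mean = sum_fX / N

print(N, sum_fX, round(grouped_mean, 2))
```

Note that the grouped mean is an approximation: it treats every score in a class as if it were equal to the class midpoint.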
There is another method of finding the mean of grouped data by using the assumed
deviation. However, the discussion of this method will not be included in this module.
The Weighted Mean
When the values or observations do not carry equal weights, the weighted mean is obtained. The
weighted mean is computed using the formula below:
$\bar{X} = \frac{\sum wX}{\sum w} = \frac{w_1 X_1 + w_2 X_2 + \dots + w_n X_n}{w_1 + w_2 + \dots + w_n}$
where: $w_1, w_2, \dots, w_n$ are the weights and $X_1, X_2, \dots, X_n$ are the values or observations
$\sum wX$ = sum of the products of each value and its respective weight
$\sum w$ = sum of the weights
Example 2.3
Find the grade weighted average of a student in his five subjects as shown in the table below:
Subject          Grade (X)   No. of Units (w)   wX
Mathematics      1.5         3                  4.5
English          1.7         3                  5.1
PE               1.3         2                  2.6
Physics          1.6         5                  8.0
Social Science   1.5         3                  4.5
                             Σw = 16            ΣwX = 24.7
By substituting into the formula, we find the Grade Weighted Average (GWA) of the student:
$\bar{X} = \frac{\sum wX}{\sum w} = \frac{24.7}{16} = 1.54$
Thus, the grade weighted average of the student is 1.54.
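The grade weighted average can be checked with a short Python sketch. The variable names are hypothetical; each subject is stored with its grade (X) and its number of units (w).

```python
# (subject, grade X, units w)
grades = [("Mathematics", 1.5, 3), ("English", 1.7, 3), ("PE", 1.3, 2),
          ("Physics", 1.6, 5), ("Social Science", 1.5, 3)]

sum_w = sum(w for _, _, w in grades)        # sum of the weights
sum_wX = sum(w * x for _, x, w in grades)   # sum of weight times value
gwa = sum_wX / sum_w                        # weighted mean

print(sum_w, round(sum_wX, 1), round(gwa, 2))
```

Physics, with 5 units, pulls the average toward 1.6 more strongly than PE, with 2 units, pulls it toward 1.3; that is the effect of weighting.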
Median (Raw and Grouped Data)
The median is the middlemost value of the measurements when they are arranged from lowest to
highest. It is used when the data are at least ordinal. The median is not affected by extreme values or
outliers. The median is reliable, but it is less stable than the mean.
For raw data or ungrouped data, the median is obtained by getting the middlemost value after the
data set is arranged from lowest to highest. It is the value that divides the data set into two equal
parts.
Example 2.4
Using the data set in Example 2.1 we have:
23 25 34 32 22 24 26 24 34 30 26 26 37 25 24
Solution: a) Arrange the scores from lowest to highest.

Using stem and leaf plot we have:
Stem Leaf
3 42407
2 3524646654
Rearranging the leaf in our plot above we have
Stem Leaf
3 02447
2 2344455666
22 23 24 24 24 25 25 26 26 26 30 32 34 34 37
Thus, the median of the given data set is 26. This implies that there are
seven cases below and seven cases above the score of 26. Example 2.4 is an example of a data set with an odd number of cases (n = 15). How
do we find the median when the number of cases is even? Based on its definition, the median is still the
middlemost value.
Example 2.5
22 23 24 24 24 25 25 26 26 26 30 32 34 34 35 37
To get the median of even cases, we take the average of the $\left(\frac{n}{2}\right)$th case and the
$\left(\frac{n}{2} + 1\right)$th case. Thus we have:
$Md = \frac{\left(\frac{n}{2}\right)\text{th case} + \left(\frac{n}{2} + 1\right)\text{th case}}{2} = \frac{26 + 26}{2}$
Md = 26
This implies that the value of 26 divides the cases into two equal parts. Here n = 16, so the median is the
average of the 8th and 9th cases; it is neither the 8th nor the 9th case itself but a value of 26 lying between them.
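Both the odd and the even case can be handled by one small function. The sketch below is an illustration with a hypothetical function name, `median_raw`; it sorts the data first and then applies the rule for odd or even n described above.

```python
def median_raw(values):
    """Median of raw data: the middle case for odd n, the average of the two middle cases for even n."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                          # the ((n + 1) / 2)th case
    return (ordered[middle - 1] + ordered[middle]) / 2  # average of the (n/2)th and (n/2 + 1)th cases

odd_cases = [23, 25, 34, 32, 22, 24, 26, 24, 34, 30, 26, 26, 37, 25, 24]       # n = 15
even_cases = [22, 23, 24, 24, 24, 25, 25, 26, 26, 26, 30, 32, 34, 34, 35, 37]  # n = 16

print(median_raw(odd_cases))
print(median_raw(even_cases))
```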
For grouped data, the median is obtained using the formula below:
$Md = LL + \left(\frac{\frac{N}{2} - cf}{f}\right)(c)$
where: LL = true lower limit of the median class
N = total number of cases
cf = cumulative frequency below the median class
f = frequency of the median class
c = class size
Example 2.6
Using the data in Example 1.1, we find the median of grouped data. (Scores of 50 students
in Statistics examination)
Class Interval            f        <cf
96 – 103                  1        50
88 – 95                   5        49
80 – 87                   14       44
72 – 79 (median class)    13 (f)   30
64 – 71                   8        17 (cf)
56 – 63                   4        9
48 – 55                   3        5
40 – 47                   2        2
                          N = 50
Note that 50% of 50 cases is 25. This means that we find a number or value such that 50% of the
total number of cases is below it and 50% is above it. Using the formula above, we have:
$Md = LL + \left(\frac{\frac{N}{2} - cf}{f}\right)(c) = 71.5 + \left(\frac{\frac{50}{2} - 17}{13}\right)(8)$
$= 71.5 + \left(\frac{25 - 17}{13}\right)(8) = 71.5 + (0.6154)(8) = 71.5 + 4.92$
Md = 76.42
This implies that 76.42 is the middlemost value of the given data set. This means that there are 25
cases found below and 25 cases found above this value.
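The grouped median formula can be traced step by step in code. The Python sketch below is an illustration under stated assumptions: the function name `grouped_median` is hypothetical, the classes are listed from lowest to highest, and the median class is taken as the first class whose less than cumulative frequency reaches N/2.

```python
def grouped_median(classes):
    """Md = LL + ((N/2 - cf) / f) * c for classes given as (lower, upper, frequency), lowest first."""
    N = sum(f for _, _, f in classes)
    cf = 0                                   # cumulative frequency below the current class
    for lower, upper, f in classes:
        if cf + f >= N / 2:                  # this class contains the (N/2)th case
            LL = lower - 0.5                 # true lower limit of the median class
            c = upper - lower + 1            # class size
            return LL + ((N / 2 - cf) / f) * c
        cf += f

classes = [(40, 47, 2), (48, 55, 3), (56, 63, 4), (64, 71, 8),
           (72, 79, 13), (80, 87, 14), (88, 95, 5), (96, 103, 1)]

print(round(grouped_median(classes), 2))
```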
Mode (Raw and Grouped Data)
The mode is the most frequent value in a given data set. The mode is used when you want
a quick estimate of the typical value in a given data set. The mode is the most
unstable measure of central tendency, especially if there are only a few cases. A given data set can
have more than one mode. A data set with two modes is called bimodal.
Example 2.7
Using the data set in Example 2.1, we notice that there are two values (24 and 26) that have the
same highest frequency of 3.
22 23 24 24 24 25 25 26 26 26 30 32 34 34 37
Therefore, the modes of the given distribution are 24 and 26. This is an example of a
bimodal distribution.
Example 2.8
Find the mode of the following data: 12, 34, 12, 71, 48, 93, 71 .
By inspection, the numbers 12 and 71 each occur twice, while every other number occurs only once. Therefore, the modes of the data as listed are 12 and 71. (If 71 appeared only once, the mode would be 12 alone and the distribution would be unimodal.)
Example 2.9
Find the mode of the following data set:
12, 5, 8, 9, 11, 11, 4, 7, 23, 7, 8, 12, 23, 9, 4, 5
By inspection, each number in the list occurs twice. There is no number that occurs more
often than the others. Therefore, there is no mode.
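Finding the mode by inspection can be sketched in Python as a frequency count. The function name `modes` is an illustrative assumption. The rule that a data set has no mode when every distinct value occurs equally often follows Example 2.9; other textbooks may treat that case differently.

```python
from collections import Counter


def modes(values):
    """Return the list of modes, or an empty list if there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    winners = sorted(v for v, c in counts.items() if c == highest)
    # Convention of Example 2.9: if all distinct values tie, report no mode.
    if len(counts) > 1 and len(winners) == len(counts):
        return []
    return winners


# Values as printed in Example 2.7 and Example 2.9.
example_2_7 = [22, 23, 24, 24, 24, 25, 25, 26, 26, 26, 30, 32, 34, 34, 37]
example_2_9 = [12, 5, 8, 9, 11, 11, 4, 7, 23, 7, 8, 12, 23, 9, 4, 5]

print(modes(example_2_7))  # [24, 26]
print(modes(example_2_9))  # []
```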
For grouped data, the mode is obtained by using the formula below:
\[ Mo = LL + \left(\frac{d_1}{d_1 + d_2}\right)(c) \]
where: LL = true lower limit or lower boundary of the modal class;
\(d_1\) = absolute difference between the frequencies of the modal class and the lower class interval (interval just below it);
\(d_2\) = absolute difference between the frequencies of the modal class and the higher class interval (interval just above it);
c = the class size
Example 2.10
Using the data in Example 1.1, we find the mode of grouped data. (Scores of 50 students in Statistics examination)
Class Interval f
96 – 103 1
88 – 95 (interval just above the modal class) 5
80 – 87 (modal class) 14
72 – 79 (interval just below the modal class) 13
64 – 71 8
56 – 63 4
48 – 55 3
40 – 47 2
Using the formula above, we obtain the mode of the given data set:
\[ Mo = LL + \left(\frac{d_1}{d_1 + d_2}\right)(c) = 79.5 + \left(\frac{14 - 13}{(14 - 13) + (14 - 5)}\right)(8) \]
\[ Mo = 79.5 + \left(\frac{1}{1 + 9}\right)(8) = 79.5 + (0.10)(8) = 79.5 + 0.80 = 80.30 \]
Therefore, the mode of the given data set is 80.30.
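The grouped-mode formula can be sketched in Python as follows. The function name `grouped_mode` and the list `scores` are illustrative assumptions. The sketch also assumes that a missing neighbor (when the modal class is the first or last class) counts as frequency 0, and that the first class with the highest frequency is the modal class; the module does not address either case.

```python
def grouped_mode(classes, class_size):
    """Mode of a frequency distribution: Mo = LL + (d1 / (d1 + d2)) * c.

    classes: list of (true_lower_limit, frequency), lowest class first.
    """
    freqs = [f for _, f in classes]
    i = freqs.index(max(freqs))  # first class with the highest frequency
    below = freqs[i - 1] if i > 0 else 0               # assumption: 0 if absent
    above = freqs[i + 1] if i < len(freqs) - 1 else 0  # assumption: 0 if absent
    d1 = abs(freqs[i] - below)
    d2 = abs(freqs[i] - above)
    if d1 + d2 == 0:
        raise ValueError("modal class is not distinct from its neighbors")
    return classes[i][0] + d1 / (d1 + d2) * class_size


scores = [(39.5, 2), (47.5, 3), (55.5, 4), (63.5, 8),
          (71.5, 13), (79.5, 14), (87.5, 5), (95.5, 1)]

print(round(grouped_mode(scores, 8), 2))  # 80.3
```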
In summary, the given data set has the following values of the measures of central tendency: Mean = 74.54, Median = 76.42, Mode = 80.30.
What is the characteristic of our illustrative distribution? Why?
Types of Distribution
The characteristic of a distribution can be determined by the shape of its graph (histogram or frequency polygon). According to Bluman, symmetric, positively skewed, and negatively skewed are the most important shapes of graphs that describe a distribution. Skewness refers to the degree of departure of the distribution from the line of symmetry. When the data values are evenly distributed on both sides of the mean and the distribution is unimodal, it is called a symmetric distribution. Further, the mean, median, and mode have equal values and are at the center of the distribution. In symbols, \(\bar{x} = Md = Mo\).
A positively skewed or right-skewed distribution is unimodal, and the majority of the data values cluster at the lower end of the distribution and to the left of the mean. Moreover, in a positively skewed distribution, the mode is less than the median and the median is less than the mean. In symbols, \(Mo < Md < \bar{x}\).
A negatively skewed or left-skewed distribution is observed when the majority of the data values cluster at the upper end of the distribution and to the right of the mean. Furthermore, in a negatively skewed distribution, the mode is greater than the median and the median is greater than the mean. In symbols, \(\bar{x} < Md < Mo\).
The following graphs illustrate the three types of distribution according to skewness (MathBits.com).
[Figures: Symmetric Distribution; Positively Skewed Distribution; Negatively Skewed Distribution]
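The ordering rules for the three shapes can be sketched as a small Python helper. The function name `describe_skewness` and the tolerance `tol` are illustrative assumptions; the comparison of mean, median, and mode is only the rule of thumb stated in this lesson, not a formal test of skewness.

```python
import math


def describe_skewness(mean, median, mode, tol=1e-9):
    """Classify a unimodal distribution from the order of mean, median, mode."""
    if math.isclose(mean, median, abs_tol=tol) and math.isclose(median, mode, abs_tol=tol):
        return "symmetric"
    if mode < median < mean:
        return "positively skewed"
    if mean < median < mode:
        return "negatively skewed"
    return "no clear pattern"


# Values obtained for the scores of 50 students in this lesson.
print(describe_skewness(74.54, 76.42, 80.30))  # negatively skewed
```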
Summary of Measures of Central Tendency

| Measure | Common Name | When to Use | Advantage | Disadvantage |
|---|---|---|---|---|
| Mean | Arithmetic Average | • There are no extreme values • When the data are at least interval level | • Most stable, i.e., stable and less variable from sample to sample • Amenable to further mathematical manipulation, which makes it useful in inferential statistics | • Affected by extreme scores or values |
| Median | Middle Score/Value | • The distribution is skewed • When the data are at least ordinal or rank | • Easy to compute • Not affected by extreme scores or values | • Less stable from sample to sample |
| Mode | Typical Score/Value | • When a quick estimate of the typical score or value is to be determined | • Easy to compute | • The most unstable measure, especially when the number of cases is small |
Adapted from: Resource Materials in Basic Statistics (Petilos, p.14)
Exercises 2.1
A. Using Exercise 13.1 on page 811 of the book, Mathematical Excursion by Aufman,
answer numbers 4 to 9 and 11.
B. Using the same exercise, find the mean, median and mode of the data set of number
14 on page 812.
C. Problem Solving.
1. If the mean age of eight college freshman students is 19.25 and six of the ages are 19, 18, 20, 19, 20, and 18, what are the ages of the two students who are twin siblings? What is the mode (age) of the eight students?
2. Find the mean of 20, 30, 40, 50 and 60.
a. Add 5 to each value and find the mean.
b. Subtract 5 from each value and find the mean.
c. Multiply each value by 5 and find the mean.
d. Divide each value by 5 and find the mean.
e. Make a general statement about each situation.
LESSON 3: Measures of Variation
Introduction
In the preceding lesson, you learned the three measures of central tendency, namely, the mean, median, and mode. However, to describe a data set adequately, one needs more than these measures, because two or more data sets with equal averages can still differ in how spread out their values are. In this lesson, we will discuss the measures of variation (spread) or measures of dispersion. In this module, four measures of variability, for both ungrouped and grouped data, will be discussed. They are the range, mean absolute deviation, variance, and standard deviation.
Range (Raw and Grouped Data)
The range is simply the gap or difference between the highest and lowest value/observation of the
data set. In formula: R = HV – LV.
If R = 0, it implies that all values in a data set are equal. Thus, there is no variability of the data.
Example 3.1
Ages of female faculty members from three departments.
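| Statistical Measure | Implication/Impression | Data Set A | Data Set B | Data Set C |
|---|---|---|---|---|
| | | 37 | 40 | 39 |
| | | 38 | 41 | 40 |
| | | 42 | 42 | 42 |
| | | 45 | 43 | 43 |
| | | 48 | 44 | 46 |
| Mean | Equal distribution | 42 | 42 | 42 |
| Range | Distribution A is more spread. Why? | 11 | 4 | 7 |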
According to Petilos in his Resource Materials in Basic Statistics, the range of grouped data is equal to the difference between the true upper limit of the highest class interval and the true lower limit of the lowest class interval. If the apparent limits are used, the range is equal to the upper limit of the highest class interval minus the lower limit of the lowest class interval, plus 1. In formula:
\[ R = (UL)_H - (LL)_L \]
Example 3.2
Scores of 50 students in Statistics examination
Class Interval f
96 - 103 1
88 – 95 5
80– 87 14
72– 79 13
64 – 71 8
56 – 63 4
48– 55 3
40 – 47 2
N = 50
Using the data set as presented in the distribution above, the range is:
R = 103.5 – 39.5 = 64 (using true limits)
R = 103 – 40 + 1 = 63 + 1 = 64 (using apparent limit)
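Both versions of the range can be sketched in Python. The function names `raw_range` and `grouped_range` are illustrative assumptions. The grouped version assumes whole-number scores, so the true limits lie 0.5 beyond the apparent limits, as in the true limits 39.5 and 103.5 used above.

```python
def raw_range(values):
    """Range of raw data: highest value minus lowest value."""
    return max(values) - min(values)


def grouped_range(apparent_limits):
    """Range of grouped data from (lower, upper) apparent class limits.

    Returns (range from true limits, range from apparent limits).
    Assumes whole-number data, so true limits are offset by 0.5.
    """
    lowest = min(lower for lower, _ in apparent_limits)
    highest = max(upper for _, upper in apparent_limits)
    from_true_limits = (highest + 0.5) - (lowest - 0.5)
    from_apparent_limits = highest - lowest + 1
    return from_true_limits, from_apparent_limits


# Example 3.1: ages of female faculty members from three departments.
print(raw_range([37, 38, 42, 45, 48]))  # 11
print(raw_range([40, 41, 42, 43, 44]))  # 4
print(raw_range([39, 40, 42, 43, 46]))  # 7

# Example 3.2: class intervals 40-47 up to 96-103.
intervals = [(40, 47), (48, 55), (56, 63), (64, 71),
             (72, 79), (80, 87), (88, 95), (96, 103)]
print(grouped_range(intervals))  # (64.0, 64)
```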
Mean Absolute Deviation (Raw and Grouped Data)
The mean absolute deviation (MAD) of a data set is defined as the average distance between each
data value and the mean. It helps to describe how “spread out” the values in a data set
are (https://www.khanacademy.org/math). The MAD for raw data is computed using the
following formula:
\[ MAD = \frac{\sum |X - \bar{x}|}{N} \]
where: X = score or value
\(\bar{x}\) = mean score or mean value
Using the data set of Example 3.1 and computing for the MAD of each distribution, we have:
Example 3.3
Ages of female faculty members from three departments
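| Statistical Measure | Implication/Impression | Data Set A | Data Set B | Data Set C |
|---|---|---|---|---|
| | | 37 | 40 | 39 |
| | | 38 | 41 | 40 |
| | | 42 | 42 | 42 |
| | | 45 | 43 | 43 |
| | | 48 | 44 | 46 |
| Range | Distribution A is more spread. Why? | 11 | 4 | 7 |
| MAD | Distribution B is least variable compared to the other two data sets. Why? | 3.6 | 1.2 | 2.0 |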
By substituting into the formula, we find the MAD of Data Set A as follows:
\[ MAD = \frac{\sum |X - \bar{x}|}{N} = \frac{|37 - 42| + |38 - 42| + |42 - 42| + |45 - 42| + |48 - 42|}{5} \]
\[ MAD = \frac{5 + 4 + 0 + 3 + 6}{5} = \frac{18}{5} = 3.6 \]
Following the same procedure, we find the MAD of the remaining two distributions, as reflected in the table above.
It can be deduced from the table of Example 3.3 that the scores of Data Set A deviate from the mean by an average of 3.6, compared to Data Set B, where the scores deviate from the mean by an average of 1.2. This implies that Data Set B is less spread out than Data Set A. The smaller the value of the MAD, the less spread out the distribution is.
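The MAD values in the table of Example 3.3 can be reproduced with a minimal Python sketch. The function name `mean_absolute_deviation` and the dictionary `data_sets` are illustrative assumptions.

```python
def mean_absolute_deviation(values):
    """MAD = sum(|X - mean|) / N for raw data."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


data_sets = {
    "A": [37, 38, 42, 45, 48],
    "B": [40, 41, 42, 43, 44],
    "C": [39, 40, 42, 43, 46],
}

for name, ages in data_sets.items():
    print(name, round(mean_absolute_deviation(ages), 1))
# A 3.6
# B 1.2
# C 2.0
```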
For grouped data, the MAD is obtained using the formula below:
\[ MAD = \frac{\sum f|X - \bar{x}|}{N} \]
where: X = class mark or midpoint of each class
f = frequency of each class
\(\bar{x}\) = mean score or mean value
N = total number of cases
Example 3.4
Using the data in Example 1.1, we find the mean absolute deviation of grouped data.
Class Interval   f   X   \(|X - \bar{x}|\)   \(f|X - \bar{x}|\)
96 – 103   1   99.5   24.96   24.96
88 – 95   5   91.5   16.96   84.80
80 – 87   14   83.5   8.96   125.44
72 – 79   13   75.5   0.96   12.48
64 – 71   8   67.5   7.04   56.32
56 – 63   4   59.5   15.04   60.16
48 – 55   3   51.5   23.04   69.12
40 – 47   2   43.5   31.04   62.08
N = 50   \(\sum f|X - \bar{x}| = 495.36\)
Recall: \(\bar{x} = 74.54\) (from Example 2.2)
Thus,
\[ MAD = \frac{\sum f|X - \bar{x}|}{N} = \frac{495.36}{50} = 9.9072 \approx 9.91 \]
This implies that the 50 scores deviate from the mean of 74.54 by an average of 9.91 units.
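The grouped MAD can be verified with a short Python sketch. The function name `grouped_mad` and the list `classes` are illustrative assumptions; each class is entered as (class mark, frequency).

```python
def grouped_mad(classes):
    """MAD = sum(f * |X - mean|) / N, where X is the class mark."""
    n = sum(f for _, f in classes)
    mean = sum(x * f for x, f in classes) / n
    return sum(f * abs(x - mean) for x, f in classes) / n


# (class mark, frequency) for the scores of 50 students.
classes = [(99.5, 1), (91.5, 5), (83.5, 14), (75.5, 13),
           (67.5, 8), (59.5, 4), (51.5, 3), (43.5, 2)]

print(round(grouped_mad(classes), 2))  # 9.91
```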
Variance and Standard Deviation (Raw and Grouped Data)
The last two measures of dispersion or measures of variation to be included in this module are the variance and standard deviation. Bluman, in his book Elementary Statistics, defines variance as the average of the squares of the distance of each score or value from the mean, while the standard deviation is the square root of the variance. The standard deviation looks at how spread out a group of numbers is from the mean (https://www.investopedia.com).
The population variance and standard deviation are calculated using the following respective formulas:
Variance (\(\sigma^2\), read as “sigma squared”):
\[ \sigma^2 = \frac{\sum (X - \mu)^2}{N} \]
Standard Deviation (\(\sigma\) = square root of the variance):
\[ \sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}} \]
where: \(\sigma^2\) = population variance
\(\sigma\) = population standard deviation
X = the item or observation
\(\mu\) = population mean
N = number of observations in the population
Example 3.5
The following data are the ages of 10 teachers in one Elementary School:
27, 34, 30, 29, 28, 30, 34, 35, 38, 29.
Find the variance and standard deviation of this population data.
Solution: To compute for the variance, we present the data as shown in the table below:
Age (X)   \(X - \mu\)   \((X - \mu)^2\)
27   -4.4   19.36
34   2.6   6.76
30   -1.4   1.96
29   -2.4   5.76
28   -3.4   11.56
30   -1.4   1.96
34   2.6   6.76
35   3.6   12.96
38   6.6   43.56
29   -2.4   5.76
\(\mu = 31.4\)   \(\sum (X - \mu)^2 = 116.4\)
By substituting into the formula, the variance is
\[ \sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{116.4}{10} = 11.64 \]
When the variance is zero (0), it indicates that all of the data values are the same. Thus, there is no variation. Since a variance is an average of squares, it follows that all non-zero variances are positive. A small variance indicates that the data points tend to be very close to the mean, and to each other. A high variance indicates that the data points are very spread out from the mean, and from one another (MathBits.com).
What does a population variance of 11.64 mean? Since the value of 11.64 is far from zero, this implies that the observations are spread out from one another and from the mean.
From the above value of the population variance, it follows that the population standard deviation, which is the square root of the variance, is:
\[ \sigma = \sqrt{11.64} = 3.41 \]
We recall that the standard deviation measures how concentrated the data are around the mean; the more concentrated, the smaller the standard deviation (https://www.dummies.com/education/math/statistics). What is the implication of the above value in relation to the mean of the given data set?
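The population variance and standard deviation of Example 3.5 can be checked with a minimal Python sketch that follows the definition directly. The function name `population_variance` is an illustrative assumption, and the ages are taken as they appear in the solution table of Example 3.5.

```python
import math


def population_variance(values):
    """sigma^2 = sum((X - mu)^2) / N."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


# Ages as listed in the solution table of Example 3.5.
ages = [27, 34, 30, 29, 28, 30, 34, 35, 38, 29]

variance = population_variance(ages)
std_dev = math.sqrt(variance)

print(round(variance, 2))  # 11.64
print(round(std_dev, 2))   # 3.41
```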
Example 3.6
Using the data set of Example 3.3, determine the variance and standard deviation of each subset of data. Compare your results. The table is reproduced below.
Ages of female faculty members from three departments
| Statistical Measure | Implication/Impression | Data Set A | Data Set B | Data Set C |
|---|---|---|---|---|
| | | 37 | 40 | 39 |
| | | 38 | 41 | 40 |
| | | 42 | 42 | 42 |
| | | 45 | 43 | 43 |
| | | 48 | 44 | 46 |
| Range | Distribution A is more spread. Why? | 11 | 4 | 7 |
| MAD | Distribution B is least variable compared to the other two data sets. Why? | 3.6 | 1.2 | 2.0 |
| Variance | | | | |
| Standard Deviation | | | | |
Computing Sample Variance and Standard Deviation
The table below shows the different notations used for the variance and standard deviation.
Notation   Statistical Measure
\(\sigma^2\)   Variance of a population
\(\sigma\)   Standard deviation of a population
\(s^2\)   Variance of a sample
\(s\)   Standard deviation of a sample
If the data set is taken from a sample, the variance and standard deviation are obtained using the following computational formulas (Bluman, p.137):
Sample Variance:
\[ s^2 = \frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n - 1)} \]
Sample Standard Deviation (square root of the variance):
\[ s = \sqrt{\frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n - 1)}} \]
where: \(s^2\) = sample variance
X = individual observation
n = sample size
Example 3.7
Find the sample variance and standard deviation for the daily production rate of fiberglass boats of
a certain manufacturer. If the company production manager feels that a standard deviation of more
than three boats a day is unacceptable, should the manager be concerned about the plant
production rate? Why?
17 21 18 27 17 21 20 22 18 23
Solution:
X     17   21   18   27   17   21   20   22   18   23   \(\sum X = 204\)
\(X^2\)   289   441   324   729   289   441   400   484   324   529   \(\sum X^2 = 4,250\)
\[ s^2 = \frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n - 1)} = \frac{(10)(4250) - (204)^2}{10(10 - 1)} \]
\[ s^2 = \frac{42500 - 41616}{90} = \frac{884}{90} = 9.82 \]
From the above value of the sample variance, it follows that the sample standard deviation, which is the square root of the variance, is:
\[ s = \sqrt{9.82} = 3.13 \]
REMARK:
The obtained sample standard deviation of 3.13 is greater than three boats a day, so the variation in the plant's daily production exceeds the level the manager considers acceptable. Thus, the manager has reason to be concerned about the plant's production rate.
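The computational formula for the sample variance can be sketched in Python using the boat production data of Example 3.7. The function name `sample_variance` is an illustrative assumption.

```python
import math


def sample_variance(values):
    """s^2 = (n * sum(X^2) - (sum X)^2) / (n * (n - 1))."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))


# Daily production of fiberglass boats (Example 3.7).
boats = [17, 21, 18, 27, 17, 21, 20, 22, 18, 23]

s2 = sample_variance(boats)  # 884 / 90
s = math.sqrt(s2)

print(round(s2, 2))  # 9.82
print(round(s, 2))   # 3.13
```

The divisor n(n - 1) instead of n² is what distinguishes this sample formula from the population formula used in Example 3.5.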
Computing Sample Variance and Standard Deviation from Grouped Data
For grouped data, we find the variance and standard deviation using the following computational formulas (Bluman, p.139):
Variance:
\[ s^2 = \frac{n\left(\sum fX^2\right) - \left(\sum fX\right)^2}{n(n - 1)} \]
Standard Deviation:
\[ s = \sqrt{\frac{n\left(\sum fX^2\right) - \left(\sum fX\right)^2}{n(n - 1)}} \]
where: f = class frequency
X = class mark
n = total number of observations
Example 3.8
Using the data in Example 1.1, we find the variance and standard deviation of grouped data. The table is reproduced below:
Class Interval   f   X   fX   \(fX^2\)
96 – 103   1   99.5   99.5   9900.25
88 – 95   5   91.5   457.5   41861.25
80 – 87   14   83.5   1169.0   97611.50
72 – 79   13   75.5   981.5   74103.25
64 – 71   8   67.5   540.0   36450.00
56 – 63   4   59.5   238.0   14161.00
48 – 55   3   51.5   154.5   7956.75
40 – 47   2   43.5   87.0   3784.50
N = 50   \(\sum fX = 3727\)   \(\sum fX^2 = 285,828.50\)
Substituting into the above computational or shortcut formula, we obtain the sample variance as follows:
\[ s^2 = \frac{n\left(\sum fX^2\right) - \left(\sum fX\right)^2}{n(n - 1)} = \frac{(50)(285828.50) - (3727)^2}{(50)(50 - 1)} \]
\[ s^2 = \frac{14291425 - 13890529}{(50)(49)} = \frac{400896}{2450} = 163.63 \]
With the above sample variance of 163.63, it follows that the sample standard deviation (s), which is the square root of the variance, is 12.79. This implies that the scores of the 50 students deviate from the mean, on average, by a distance of 12.79 units.
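The grouped computation can be verified with a short Python sketch of the shortcut formula. The function name `grouped_sample_variance` and the list `classes` are illustrative assumptions; each class is entered as (class mark, frequency).

```python
import math


def grouped_sample_variance(classes):
    """s^2 = (n * sum(f * X^2) - (sum(f * X))^2) / (n * (n - 1))."""
    n = sum(f for _, f in classes)
    sum_fx = sum(f * x for x, f in classes)
    sum_fx2 = sum(f * x * x for x, f in classes)
    return (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))


# (class mark, frequency) for the scores of 50 students.
classes = [(99.5, 1), (91.5, 5), (83.5, 14), (75.5, 13),
           (67.5, 8), (59.5, 4), (51.5, 3), (43.5, 2)]

s2 = grouped_sample_variance(classes)
s = math.sqrt(s2)

print(round(s2, 2))  # 163.63
print(round(s, 2))   # 12.79
```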
There is another method of computing the sample variance and sample standard deviation by using
the Coded Deviation. However, its discussion is not included in this module.
Exercises 3.1
A. Using Exercise 13.2 on page 823 of the book, Mathematical Excursion by Aufman,
answer numbers 4 to 8 and 12.
B. Using the same exercise, answer number 20 on page 824 on the ages of the female
and male actors Academy awardees. Answer questions a, b, and c found at the end
of the exercise.
C. Critical Thinking
Using the exercise no. 26 on page 825 perform the suggested activity and answer
the question found at the end.