MODULE 4
Data Management
Catalina B. Gayas & Emmeline R. Garcia
Table of Contents
Lesson 1 Data Collection, Organization, and Interpretation
Basic terminology in statistics
Data Collection and Sampling Techniques
Frequency Distribution and Graphs for Numerical Data
Lesson 2 Measures of Central Tendency
Mean (Raw and Grouped Data)
The Weighted Mean
Median (Raw and Grouped Data)
Mode (Raw and Grouped Data)
Types of Distribution
Lesson 3 Measures of Variation
Range (Raw and Grouped Data)
Mean Absolute Deviation (Raw and Grouped Data)
Variance and Standard Deviation (Raw and Grouped Data)
Lesson 4 Measures of Relative Position
Standard Score
Percentiles, Deciles, and Quartiles
Lesson 5 Normal Distribution
The Standard Normal Distribution
Applications of Normal Distribution
Lesson 6 Correlation Coefficients and Linear Regression
Correlation Analysis
Linear Regression
Leyte Normal University | Mathematics Unit 1
MODULE 4: Data Management
Overview
Statistics is used in all aspects of human endeavor. It is used to describe data; to determine
significant relationships between and among variables; to determine significant differences in a
variable of interest between or among groups; and to make forecasts and predictions. The concepts in
Statistics were already discussed in your K to 12 Curriculum. Hence, this module focuses on the
application of these concepts in real settings to which you can relate. It is the aim of this
module to make you appreciate the importance of Statistics and, at the same time, have fun doing
the exercises and activities.
This module includes the topics: Data Collection, Organization, and Interpretation; Measures
of Central Tendency; Measures of Dispersion; Measures of Relative Position; Normal Distribution;
and Correlation Coefficients and Linear Regression. Computer applications will be utilized in
this module, especially Microsoft Excel and statistical analysis software, like SPSS, for
data analysis.
Objectives
At the end of this module, you should be able to:
1. demonstrate knowledge of basic statistical terms;
2. use statistical methods to summarize and organize data;
3. solve problems applying normal distribution;
4. apply linear regression and correlation in analyzing data; and
5. interpret computer outputs in data analysis.
Introduction
In studying statistics, it is important to understand the basic terms used in the subject. The
following terms are defined for this purpose.
Variable refers to a characteristic or attribute that can assume different values. Examples
of variables are sex, nationality, score, and height. Data are the measurements or observations that
the variables can assume. A data set is a collection of data values, and each value in the
set is called a datum.
There are two branches of statistics. The branch that involves the collection, organization,
summarization and presentation of data is called descriptive statistics. The branch that
makes generalizations from a sample (a representative part of a population) to a population (the
totality of all observations or entities of any sort), performs estimation and hypothesis testing,
determines relationships among variables, and makes predictions is called inferential statistics.
Variables are also classified according to four levels of measurement: nominal, ordinal, interval
and ratio. The nominal scale is the simplest scale of measurement; it classifies data
into mutually exclusive categories and uses numbers for labels only. Sex, occupation,
religious affiliation and marital status are examples of nominal data. The ordinal scale uses numbers
for labelling, and the numbers can be ranked. However, the differences between ranks are not
equal. Socio-economic status, Latin honor, and academic rank are examples of ordinal data.
The interval scale possesses the characteristics of the ordinal scale (label and rank), and equal
differences between ranks exist. However, in interval data there is no true zero value. Score in an
examination, temperature, and Intelligence Quotient (IQ) are examples of interval data. The ratio
scale is the highest level of measurement. It possesses all the characteristics of the interval scale
(label, rank, equal differences between ranks), and a true zero value exists. Distance travelled,
height, weight and age are examples of ratio data.
Variables are also classified according to their functions, especially in experimental studies.
They are the independent or explanatory variable, the dependent or outcome variable, and the
confounding variable. The independent variable is the variable manipulated by the researcher, while
the dependent variable is the variable affected or influenced by the manipulated variable.
The confounding variable, on the other hand, is a variable other than the independent variable
that also influences the dependent variable. For example, a researcher is interested in finding out
the effect of learning delivery modes (pure online, pure printed module, mixture of online and
printed module) on the performance (test score) of the students in GE104. The delivery mode is the
independent variable; the performance is the dependent variable. The performance can also be
affected by the learning ability of the students. Thus, the learning ability is a confounding variable.
Data can be collected in different ways. The method to use in the collection of data depends on the
source of data as well as the type of data to be collected. Data can be collected through
survey (telephone, questionnaire or interview), test, observation, and experimentation. Details
on how each method is done and what the advantages of one over the other are will not be
part of this lesson, as these are exhaustively discussed in your research course.
Data are collected from a representative part of a population called a sample. The process of
selecting samples is called sampling. There are two types of sampling: non-probability and probability
sampling. In non-probability sampling, not every member of the population is given an equal chance to
be chosen, hence the samples are not true representatives of the population. If the objective of
the study is to make a generalization, using non-probability sampling is discouraged. Convenience
or Accidental Sampling, Purposive or Judgemental Sampling and Quota Sampling are the most
common techniques in non-probability sampling.
Probability sampling on the other hand gives equal chance to each member of the population to be
selected as a representative. There are four techniques under this type of sampling. They are as
follows: simple random sampling, systematic random sampling, stratified random sampling
and cluster random sampling.
Simple Random Sampling is a technique used when the population is homogeneous with respect
to the characteristic of interest to the researcher and the population size is known (Petilos, 2012).
Selection of the sample can be done either by the lottery method or by using random numbers.
Systematic Random Sampling is a technique that selects the desired sample size by selecting every
kth subject. To select the sample, the researcher assigns a number to each member of the population
(by numbering consecutively) and then determines the value of k by dividing the total number
of cases (population) by the desired number of samples. For example, suppose the total population (N)
is 1,000 and the sample size (n) is 100. Then the value of k is 1,000 / 100 = 10. Thus, the researcher
will select every 10th subject in the population, starting from a number between 1 and 10 chosen
by simple random sampling. Suppose the starting number is 6; the researcher will then
consider the subjects whose numbers are 6, 16, 26, etc. until the desired number of
samples is completed.
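The selection rule just described can be sketched in a few lines of Python. This is only an illustration: the function name systematic_sample and its parameters are hypothetical, and it assumes that N is exactly divisible by n, as in the example.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return the 1-based subject numbers picked by systematic sampling.

    Assumes population_size is exactly divisible by sample_size.
    """
    k = population_size // sample_size              # sampling interval k = N / n
    if start is None:
        # The random start between 1 and k stands in for the simple random draw.
        start = random.Random(seed).randint(1, k)
    return [start + i * k for i in range(sample_size)]

# The example from the text: N = 1,000, n = 100, starting number 6.
picked = systematic_sample(1000, 100, start=6)
print(picked[:3], "...", picked[-1])
```

With a starting number of 6, the list begins 6, 16, 26 and contains exactly 100 subject numbers.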
Stratified Random Sampling is a technique used by grouping the population into subgroups called
strata according to the common characteristic/s as determined by the researcher. The subjects are
selected from each stratum in proportion to the size of that subgroup. For example, suppose
the population consists of all freshman students across the three colleges (A, B, and C) in
University X. The total freshman population among the three colleges is 1400, divided as follows:
NA = 350; NB = 500 and NC = 550, and the researcher wishes to take a total of 350 respondents. Then
he has to select from each stratum the desired samples using either simple random sampling or
systematic random sampling, using the following computation:
College    N      n = (N / 1400)(350)
A          350    87.5 ≈ 88
B          500    125
C          550    137.5 ≈ 138
[Note: Due to the rule of rounding off numbers as applied in A & C, which are 87.5 = 88 and
137.5 = 138, respectively, the rounded samples total 351, so the researcher has to decide in which
subgroup he has to reduce the samples by 1.]
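The proportional allocation above can be checked with a short Python sketch. The function name proportional_allocation is hypothetical, and the round-half-up rule is an assumption chosen to match the rounding in the note (87.5 to 88 and 137.5 to 138).

```python
import math

def proportional_allocation(strata_sizes, total_sample):
    """Allocate a total sample across strata in proportion to stratum size."""
    population = sum(strata_sizes.values())
    exact = {name: size * total_sample / population
             for name, size in strata_sizes.items()}
    # Round half up, as in the note (87.5 -> 88, 137.5 -> 138).
    rounded = {name: math.floor(value + 0.5) for name, value in exact.items()}
    return exact, rounded

exact, rounded = proportional_allocation({"A": 350, "B": 500, "C": 550}, 350)
print(exact)
print(rounded, "total =", sum(rounded.values()))
```

The rounded allocations add up to 351, one more than the desired 350, which is why the researcher must reduce one subgroup by 1.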
Cluster Random Sampling is a technique used when the population is very large or the
respondents are residing in a large geographic area and it is impossible for the researcher to obtain
the list of all members of the population. The members of each cluster are heterogeneous. Unlike
stratified random sampling, where the subjects are selected individually, in this technique
cluster/s are selected randomly and all members of the selected cluster/s represent the
population. For example, a researcher wishes to determine the type of fertilizer (pure
synthetic, pure organic or combination of synthetic and organic) used by rice farmers in the
municipality of Town Q. Assuming that there is no available list of rice farmers (categorized as
small, medium and large scale rice producers), the researcher can get a copy of the map of Town Q
and determine the number of barangays which are located outside downtown and along the
seashore areas. Each of these barangays is considered a cluster. Suppose there are 43
barangays that belong to this group. Therefore, there are 43 clusters to choose from. The
researcher then decides how many of these barangays will be included and then randomly
selects the cluster/s. The rice farmers in the selected cluster/s represent the group from Town Q.
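The random selection of clusters can be sketched as follows. The labels 1 to 43 for the barangays, the choice of 5 clusters, and the function name choose_clusters are all assumptions made for illustration; the text leaves the number of clusters to the researcher.

```python
import random

def choose_clusters(cluster_ids, how_many, seed=None):
    """Randomly select whole clusters; every member of a chosen cluster is included."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(cluster_ids), how_many))

barangays = range(1, 44)          # 43 clusters, labelled 1 to 43 (hypothetical labels)
selected = choose_clusters(barangays, 5, seed=1)   # 5 is an arbitrary illustrative choice
print(selected)
```

Passing a seed only makes the draw repeatable; omit it for an actual selection.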
Once the researcher has already collected the data, the next thing to do is to organize. There are
three ways of presenting data: tabular, graphical and textual. The following discussion focuses on
how to organize raw data and subsequently represent those using graphs.
Example 1.1
Below are scores of 50 students in Statistics examination.
63 88 79 92 86 87 83 78 47 67
68 76 46 81 92 77 76 84 70 66
77 75 98 81 82 81 87 78 70 60
94 79 52 82 77 81 77 70 74 61
56 69 83 83 71 48 90 52 75 84
Looking at the array of scores, it would be difficult for the reader to describe the characteristics
of the group. Thus, a frequency distribution needs to be prepared. A frequency distribution is
an organization of raw data into classes/groups and their frequencies. It is a
tabular way of organizing raw data. The following are the steps in preparing a frequency distribution.
Step 3. Determine the lower and upper limits of the lowest class interval. The lower limit should
be divisible by the class size.
Step 4. Determine the upper class.
Step 5. Tally the scores in their respective classes.
Step 6. Summarize the tallies.
$k = \sqrt{50} = 7.07$
$k \approx 7$
added by the class size (c). It follows that the upper limit of this class interval is 55. Thus,
the class interval is 48 – 55. Following the same procedure, you can find the remaining
class intervals.
4. Determine the upper class. The highest class interval should contain the highest value of the
given data set. Since our highest value, 98, is not divisible by the class size of 8, the
lower limit of the highest class interval should be the largest multiple of 8 that is smaller than
98. The number is 96. Thus, the highest class interval is 96 – 103.
5. List down the class intervals and tally the scores in their respective classes.
Class Limits Class Boundaries Tallies Frequency
40 - 47 39.5 – 47.5 // 2
REMARKS:
• In this illustration the actual number of classes which is 8 is greater than the estimated value of k
which is 7.
• The second column shows the boundary of each class interval in which the actual lower and upper limits
are indicated. These are called true limits or class boundaries.
• The true upper limit of the preceding class is also the true lower limit of the succeeding class. This shows
the continuity of the data.
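The tallying procedure can be reproduced with a short Python sketch using the 50 scores of Example 1.1 and the class size of 8. The function name frequency_distribution is hypothetical; the sketch simply starts at the largest multiple of the class size that does not exceed the lowest score and counts the scores in each class.

```python
scores = [
    63, 88, 79, 92, 86, 87, 83, 78, 47, 67,
    68, 76, 46, 81, 92, 77, 76, 84, 70, 66,
    77, 75, 98, 81, 82, 81, 87, 78, 70, 60,
    94, 79, 52, 82, 77, 81, 77, 70, 74, 61,
    56, 69, 83, 83, 71, 48, 90, 52, 75, 84,
]

def frequency_distribution(data, class_size):
    """Return (lower limit, upper limit, lower boundary, upper boundary, frequency) rows."""
    lower = (min(data) // class_size) * class_size   # lower limit divisible by class size
    rows = []
    while lower <= max(data):
        upper = lower + class_size - 1
        freq = sum(lower <= x <= upper for x in data)
        rows.append((lower, upper, lower - 0.5, upper + 0.5, freq))
        lower += class_size
    return rows

table = frequency_distribution(scores, 8)
for lower, upper, lb, ub, freq in table:
    print(f"{lower:>3} - {upper:<3}   {lb:>5} - {ub:<5}   {freq}")
```

The sketch yields eight classes, from 40 - 47 up to 96 - 103, consistent with the remark that the actual number of classes is 8.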
Using the same data set as presented in the frequency distribution above, we can prepare graphs.
In this module, we will discuss only the histogram, frequency polygon and ogive. These are
the most commonly used graphs in research.
A histogram displays the data using continuous bars (vertical or horizontal). The histogram is a bar
graph in which bars are constructed without space in between. This implies that the data presented
is continuous. The heights/lengths of the bars show the frequency of the respective classes. The
frequency polygon on the other hand displays the data by using lines connecting the points
plotted for the frequencies of each class. This graph is used when the data is continuous.
Both graphs use the midpoints of the classes along the horizontal (variable) axis.
The ogive is a graph that shows the cumulative frequencies for the classes in the given distribution.
The ogive can be constructed for either the less than cumulative frequency or the greater than
cumulative frequency.
The following are the steps in constructing the above-specified graphs manually. The same graphs can
be constructed by using either Excel or Minitab, and the specific steps are illustrated in the book
of Bluman.
Example 1.2
Before constructing the different graphs, we need to add more information in our
frequency distribution as shown below.
N = 50 100.0
REMARKS:
• The midpoint of each class is obtained using the formula: $X = \frac{LL + UL}{2}$.
Step What to do?
2 Label the vertical axis as the frequency axis and the horizontal as the variable
axis. (In our illustration below, our variable is a score.)
3 Lay off segments along the vertical axis (y-axis) to correspond to the
frequencies. (The segments must be equal in length)
4 Lay off segments along the horizontal axis (x-axis) to correspond to the different
class intervals of the variable. The first line segment should be moved a little to the
right if the lowest value of the variable is not zero.
5 Mark all midpoints of the intervals and label these using class midpoints.
6 Draw rectangle or bars whose heights correspond to the frequency counts and
whose widths to the class size. (Shade or color your bars).
Figure 1.1. Histogram (horizontal axis: Score)
2 Label the vertical axis as the frequency axis and the horizontal as the variable
axis. (In our illustration above, our variable is a score.)
3 Lay off segments along the vertical axis (y-axis) to correspond to the
frequencies. (The segments must be equal in length)
4 Lay off segments along the horizontal axis (x-axis) to correspond to the different
class intervals of the variable. The first line segment should be moved a little to the
right if the lowest value of the variable is not zero.
5 For each class interval, the class midpoint and corresponding frequency are
treated as an ordered pair and plotted in the plane determined by the
coordinate axes.
6 The plotted points are then joined using line segments from left to right. To close the
polygon, extend one class interval to both sides by connecting the endpoints of the
graph to the midpoints of the extended segments along the x-axis.
Figure 1.2. Frequency Polygon (vertical axis: Frequency; horizontal axis: Score)
2 Label the vertical axis as the cumulative frequency axis and the horizontal as the
variable axis. (In our example the variable is a score.)
3 Lay off equal segments along the vertical axis (y-axis) to correspond to the
cumulative frequencies. Use an appropriate scale to represent the cumulative
frequencies. (Depending on the numbers in the cumulative frequencies, the scales
can be by 2’s, 4’s, 5’s, etc.)
4 Lay off equal segments along the horizontal axis (x-axis) to correspond to the
true upper limits for the less than cumulative frequency ogive, or the true lower
limits for the greater than cumulative frequency ogive.
6 The plotted points are then joined using line segments from left to right.
REMARKS:
• The ogive is used to determine the percentage or the number of cases found below or above a
particular boundary.
• If the ogives (for >cf and <cf) are graphed on the same coordinate plane, a line drawn from the point
of intersection of the two graphs onto the variable axis marks the median of the data set.
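The cumulative frequencies plotted in the two ogives can be computed as in the sketch below. The class boundaries and frequencies are those of the score distribution in Example 1.1; the variable names are illustrative only.

```python
from itertools import accumulate

# Class boundaries (lowest class first) and frequencies of the 50 scores.
boundaries = [(39.5, 47.5), (47.5, 55.5), (55.5, 63.5), (63.5, 71.5),
              (71.5, 79.5), (79.5, 87.5), (87.5, 95.5), (95.5, 103.5)]
frequencies = [2, 3, 4, 8, 13, 14, 5, 1]

# Less than cf: running total from the lowest class upward.
less_than_cf = list(accumulate(frequencies))
# Greater than cf: running total from the highest class downward.
greater_than_cf = list(accumulate(reversed(frequencies)))[::-1]

# Points to plot: <cf against true upper limits, >cf against true lower limits.
less_than_points = [(ub, cf) for (_, ub), cf in zip(boundaries, less_than_cf)]
greater_than_points = [(lb, cf) for (lb, _), cf in zip(boundaries, greater_than_cf)]

print(less_than_cf)
print(greater_than_cf)
```

Each list ends or starts at 50, the total number of cases, which is a quick check that no class was missed.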
Figure 1.3. Less than cumulative frequency (vertical axis: Frequency; horizontal axis: Class Boundaries)
Figure 1.4. Greater than cumulative frequency (vertical axis: Frequency; horizontal axis: Class Boundaries)
1 List down the leading digits of the data set called the stem. Arrange them in
a column either from lowest to highest or vice versa.
2 Starting from the first to the last entry of the data set, carefully record the
trailing digits (leaf) in their corresponding stem.
3 Arrange in order the trailing digits in each row. If there are no data values in a class,
the stem number is written and the leaf row is left blank.
Example 1.2
Let us illustrate the above procedure using the data on the scores of 50 students in
Statistics examination. The data are reproduced as follows:
63 88 79 92 86 87 83 78 47 67
68 76 46 81 92 77 76 84 70 66
77 75 98 81 82 81 87 78 70 60
94 79 52 82 77 81 77 70 74 61
56 69 83 83 71 48 90 52 75 84
Steps:
1. Stem (Leading Digit) | Leaf (Trailing Digit)
9 | 4 8 2 2 0
8 | 8 3 1 1 2 3 6 2 7 1 1 3 7 4 4
7 | 7 6 5 9 9 7 1 7 6 7 8 8 0 0 0 4 5
6 | 3 8 9 7 6 0 1
5 | 6 2 2
4 | 6 8 7
REMARKS:
• The figure shows that the distribution peaks in the center and there are no gaps in the data.
• The highest score is 98 and the lowest is 46.
• Most scores are 70 and above.
What other information can you draw from the figure above?
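The three steps of the stem-and-leaf procedure can be sketched in Python for the same 50 scores. The function name stem_and_leaf is hypothetical, and the sketch assumes two-digit data so that the tens digit is the stem and the units digit is the leaf.

```python
from collections import defaultdict

scores = [
    63, 88, 79, 92, 86, 87, 83, 78, 47, 67,
    68, 76, 46, 81, 92, 77, 76, 84, 70, 66,
    77, 75, 98, 81, 82, 81, 87, 78, 70, 60,
    94, 79, 52, 82, 77, 81, 77, 70, 74, 61,
    56, 69, 83, 83, 71, 48, 90, 52, 75, 84,
]

def stem_and_leaf(data):
    """Map each stem (tens digit) to its ordered leaves (units digits)."""
    rows = defaultdict(list)
    for value in sorted(data):            # sorting first keeps every leaf row ordered
        rows[value // 10].append(value % 10)
    # Keep stems with no data as blank rows, as step 3 requires.
    return {stem: rows.get(stem, []) for stem in range(min(rows), max(rows) + 1)}

plot = stem_and_leaf(scores)
for stem in sorted(plot, reverse=True):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

The printed rows are the ordered version of the display above, from stem 9 down to stem 4.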
Exercises 1.1
B. In each statement below identify the variable/s and classify it/them according to the level of
measurement (nominal, ordinal, interval, ratio)
1. Marital status of faculty members in a university.
2. Time it takes a student to travel from home to school.
3. Scores in the College Admission Test of freshman students in University Q.
4. Socio-economic status of the residents in a barangay (poor, average, above-average).
5. Ages of freshman college students of Leyte Normal University.
E. An insurance company researcher conducted a survey on the number of car thefts in a
large city for a period of 30 days last summer. The raw data are shown below. Construct a
grouped frequency distribution, frequency polygon, histogram and ogives (Show all
necessary solutions).
52 58 75 79 57 65 62 77 56 51
59 53 51 66 55 68 63 78 50 53
67 65 69 66 69 57 73 72 75 55
Introduction
Statistics is the science of collecting, organizing, summarizing, presenting and interpreting data.
There are two branches of statistics. The branch that involves collection, organization,
summarization and presentation of data is called descriptive statistics. The branch that involves
interpretation and drawing conclusions is called inferential statistics. Descriptive statistics
include the measures of central tendency, measures of position and measures of variability.
There are three measures of central tendency or measures of central location, namely: the mean,
median and the mode. A measure of central tendency is a single value that describes a whole set
of data by identifying the central position within the given data set. It is sometimes called
a measure of central location or a summary statistic.
Data gathered in their original form are called raw or ungrouped data, while data that have
been organized into a frequency distribution are called grouped data.
For raw data, the mean is defined as the arithmetic average of a data set. It is equal to the sum of
the measurements divided by the number of cases (n). It is the measure used when there is
no extreme value in the data set and the data are either interval or ratio. Among the three
measures of central tendency, the mean is the most reliable and is amenable to further
mathematical manipulation, which makes it useful for inferential statistics.
Formula: mean = $\frac{\Sigma X}{n}$
The Greek capital letter sigma ($\Sigma$) is used to denote a sum. Thus, the formula above means the
summation of the values of X divided by the total number of cases. For data collected from a
population, the symbol used for the mean is the Greek letter $\mu$ (read as mu), which is called a
parameter. For data collected from a sample, the symbol used for the mean is $\bar{x}$ (read as:
x bar), which is called a statistic. The total number of cases is denoted by N and n for a parameter
and statistic, respectively. Thus the working formula for the mean of a population is:
$\mu = \frac{\Sigma X}{N}$
Example 2.1. Compute for the average of the scores in a Math quiz of 15 students.
23 25 34 32 22 24 26 24 34 30 26 26 37 25 24
Solution:
Using a calculator, we have:
$\bar{x} = \frac{\Sigma X}{n} = \frac{412}{15} = 27.5$
This implies that the average score in a Math quiz of the 15 students is 27.5.
Note: Rounding Rule for the Mean. The mean should be rounded to one more decimal place
than occurs in the raw data.
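The computation in Example 2.1, including the rounding rule, can be verified with a few lines of Python. The variable names are illustrative only.

```python
quiz_scores = [23, 25, 34, 32, 22, 24, 26, 24, 34, 30, 26, 26, 37, 25, 24]

total = sum(quiz_scores)                 # sum of the measurements
n = len(quiz_scores)                     # number of cases
mean_score = total / n

# The raw data are whole numbers, so the mean is reported to one decimal place.
print(total, n, round(mean_score, 1))
```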
For grouped data, the mean is obtained by using the formula below:
$\bar{x} = \frac{\Sigma fX}{N}$
Example 2.2. Using the data in Example 1.1, we find the mean of grouped data. (Scores of
50 students in Statistics examination)
Class Interval f X fX
64 – 71 8 67.5 540.0
56 – 63 4 59.5 238.0
40 – 47 2 43.5 87.0
N = 50 $\Sigma fX$ = 3,727.00
By substitution, we have:
$\bar{x} = \frac{\Sigma fX}{N} = \frac{3727}{50} = 74.54$
For example, the class interval 40 - 47, with a lower limit of 40 and an upper limit of 47, has a
true lower limit of 39.5 (0.5 lower than the apparent lower limit) and a true upper limit of 47.5
(0.5 higher than the apparent upper limit).
There is another method of finding the mean of grouped data by using the assumed
deviation. However, the discussion of this method will not be included in this module.
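The grouped mean of Example 2.2 can be recomputed from the full frequency distribution of the 50 scores, as sketched below. The variable names are illustrative; X is taken as the class midpoint, the average of the lower and upper class limits.

```python
# (lower limit, upper limit, frequency) for each class of the score distribution.
classes = [(40, 47, 2), (48, 55, 3), (56, 63, 4), (64, 71, 8),
           (72, 79, 13), (80, 87, 14), (88, 95, 5), (96, 103, 1)]

n_cases = sum(f for _, _, f in classes)
# fX for each class, where X is the class midpoint.
sum_fx = sum(f * (lower + upper) / 2 for lower, upper, f in classes)
grouped_mean = sum_fx / n_cases

print(n_cases, sum_fx, round(grouped_mean, 2))
```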
When the weight of each value or observation is not equal, the weighted mean is obtained. The
weighted mean is computed using the formula below:
$\bar{X} = \frac{\Sigma wX}{\Sigma w}$
Example 2.3
Find the grade weighted average of a student in his five subjects as shown in the table below:
Subject Grade (X) No. of Units (w) wX
PE 1.3 2 2.6
$\Sigma w$ = 16 $\Sigma wX$ = 24.7
By substituting into the formula, we find the Grade Weighted Average (GWA) of the student:
$\bar{X} = \frac{\Sigma wX}{\Sigma w} = \frac{24.7}{16} = 1.54$
Thus, the grade weighted average of the student is 1.54.
Median (Raw and Grouped Data)
The median is the middlemost value of the measurements when they are arranged from smallest to
highest. It is used when the data is at least ordinal. The median is not affected by extreme values or
outliers. The median is reliable and less stable than the mean.
For raw data or ungrouped data, the median is obtained by getting the middlemost value after the
data set is arranged from lowest to highest. It is the value that divides the data set into two equal
parts.
Example 2.4
Using the data set in Example 2.1 we have:
23 25 34 32 22 24 26 24 34 30 26 26 37 25 24
Stem Leaf
3 42407
2 3524646654
Stem Leaf
3 02447
2 2344455666
22 23 24 24 24 25 25 26 26 26 30 32 34 34 37
Thus, the median of the given data set is 26. This implies that there are seven cases below and
seven cases above the score of 26. Example 2.4 is an example of a data set with an odd number of
cases (n = 15). How do we find the median when the number of cases is even? By definition, the
median is still the middlemost value.
Example 2.5
22 23 24 24 24 25 25 26 26 26 30 32 34 34 35 37
To get the median of an even number of cases, we take the $\left(\frac{n}{2}\right)$th case and the
$\left(\frac{n}{2}+1\right)$th case. Thus we have:
$Md = \frac{\left(\frac{n}{2}\right)\text{th case} + \left(\frac{n}{2}+1\right)\text{th case}}{2} = \frac{26 + 26}{2}$
$Md = 26$
This implies that the value of 26 divides the cases into two equal parts. This 26 is not the 8th nor the
9th case but there is a value of 26 between 8th and 9th cases.
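Both cases, odd and even, can be checked with Python's standard statistics module, whose median function sorts the data and averages the two middle values when the number of cases is even. The variable names are illustrative.

```python
from statistics import median

# Example 2.4: 15 quiz scores (odd number of cases).
odd_cases = [23, 25, 34, 32, 22, 24, 26, 24, 34, 30, 26, 26, 37, 25, 24]
# Example 2.5: the same scores with 35 added (even number of cases).
even_cases = odd_cases + [35]

ordered = sorted(even_cases)
eighth, ninth = ordered[7], ordered[8]   # the (n/2)th and (n/2 + 1)th cases for n = 16

print(median(odd_cases), median(even_cases), eighth, ninth)
```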
For grouped data, the median is obtained using the formula below:
$Md = LL + \left(\frac{\frac{N}{2} - cf}{f}\right)(c)$
Example 2.6
Using the data in Example 1.1, we find the median of grouped data. (Scores of 50 students
in Statistics examination)
Class Interval    f     <cf
96 - 103          1     50
88 – 95           5     49
80 – 87           14    44
72 – 79           13    30   (median class)
64 – 71           8     17   (cf)
56 – 63           4     9
48 – 55           3     5
40 – 47           2     2
N = 50
Note that 50% of 50 cases is 25. This means that we find a number or value such that 50% of the
total number of cases is below and above it. Using the formula above we have:
$Md = LL + \left(\frac{\frac{N}{2} - cf}{f}\right)(c) = 71.5 + \left(\frac{\frac{50}{2} - 17}{13}\right)(8)$
$= 71.5 + \left(\frac{25 - 17}{13}\right)(8) = 71.5 + (0.6154)(8) = 71.5 + 4.92$
$Md = 76.42$
This implies that 76.42 is the middlemost value of the given data set. This means that there are 25
cases found below and above this value.
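The grouped median formula can be sketched in Python as below. The function name grouped_median is hypothetical; it locates the median class as the first class whose cumulative frequency reaches N/2 and assumes whole-number class limits, so that the true lower limit is 0.5 below the apparent one.

```python
def grouped_median(classes, class_size):
    """Median of grouped data; classes are (lower limit, frequency), lowest class first."""
    n = sum(f for _, f in classes)
    cf = 0                                   # cumulative frequency below the current class
    for lower_limit, f in classes:
        if cf + f >= n / 2:                  # this is the median class
            true_lower_limit = lower_limit - 0.5
            return true_lower_limit + (n / 2 - cf) / f * class_size
        cf += f

score_classes = [(40, 2), (48, 3), (56, 4), (64, 8),
                 (72, 13), (80, 14), (88, 5), (96, 1)]
md = grouped_median(score_classes, 8)
print(round(md, 2))
```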
The mode is the most frequent value in a given data set. The mode is used when you want
to determine a quick estimate of the typical value in a given data set. The mode is the most
unstable measure of central tendency especially if there are only few cases. A given data set can
have more than one mode. For cases where there are two modes it is called bimodal.
Example 2.7
Using the data set in Example 2.1, we notice that there are two values (24 and 26) that have the
same highest frequency of 3.
22 23 24 24 24 25 25 26 26 26 30 32 34 34 37
Therefore, the modes of the given distribution are 24 and 26. This is an example of a
bimodal distribution.
Example 2.8
Find the mode of the following data: 12, 34, 12, 71, 48, 93, 71 .
By inspection, the number 12 occurs more often than the other numbers. Therefore, the mode of
the distribution is 12. This is an example of a unimodal distribution.
Example 2.9
Find the mode of the following data set:
12, 5, 8, 9, 11, 11, 4, 7, 23, 7, 8, 12, 23, 9, 4, 5
By inspection, each number in the list occurs twice. There is no number that occurs more
often than the others. Therefore, there is no mode.
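Finding the mode by inspection can be mirrored in Python with a frequency count. The function name modes is hypothetical, and the rule that a data set has no mode when every distinct value occurs equally often follows the convention used in Example 2.9.

```python
from collections import Counter

def modes(data):
    """Return the most frequent value(s), or an empty list if all values tie."""
    counts = Counter(data)
    highest = max(counts.values())
    # Convention of Example 2.9: if no value occurs more often than the others, no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(value for value, c in counts.items() if c == highest)

example_2_7 = [22, 23, 24, 24, 24, 25, 25, 26, 26, 26, 30, 32, 34, 34, 37]
example_2_9 = [12, 5, 8, 9, 11, 11, 4, 7, 23, 7, 8, 12, 23, 9, 4, 5]

print(modes(example_2_7))   # bimodal data set
print(modes(example_2_9))   # every value occurs twice
```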
For grouped data, the mode is obtained by using the formula below:
$Mo = LL + \left(\frac{d_1}{d_1 + d_2}\right)(c)$
where: LL = true lower limit or lower boundary of the modal class;
$d_1$ = absolute difference between the frequencies of the modal class
and the lower class interval (interval just below it);
$d_2$ = absolute difference between the frequencies of the modal class
and the higher class interval (interval just above it);
c = the class size
Example 2.10
Using the data in Example 1.1, we find the mode of grouped data. (Scores of 50 students
in Statistics examination)
Class Interval    f
96 - 103          1
88 – 95           5     (interval just above the modal class)
80 – 87           14    (modal class)
72 – 79           13    (interval just below the modal class)
64 – 71           8
56 – 63           4
48 – 55           3
40 – 47           2
Using the formula above, we obtain the mode of the given data set:
$Mo = LL + \left(\frac{d_1}{d_1 + d_2}\right)(c) = 79.5 + \left(\frac{14 - 13}{(14 - 13) + (14 - 5)}\right)(8)$
$Mo = 79.5 + \left(\frac{1}{1 + 9}\right)(8) = 79.5 + (0.10)(8) = 79.5 + 0.80$
$Mo = 80.30$
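The grouped mode computation can be sketched in Python as follows. The function name grouped_mode is hypothetical; it assumes a single modal class and whole-number class limits, and treating a missing neighbouring class as having frequency 0 is an added assumption that the example does not need.

```python
def grouped_mode(classes, class_size):
    """Mode of grouped data; classes are (lower limit, frequency), lowest class first."""
    freqs = [f for _, f in classes]
    i = freqs.index(max(freqs))                      # position of the modal class
    below = freqs[i - 1] if i > 0 else 0             # assumption: missing neighbour = 0
    above = freqs[i + 1] if i < len(freqs) - 1 else 0
    d1 = abs(freqs[i] - below)                       # modal class vs. interval just below
    d2 = abs(freqs[i] - above)                       # modal class vs. interval just above
    true_lower_limit = classes[i][0] - 0.5
    return true_lower_limit + d1 / (d1 + d2) * class_size

score_classes = [(40, 2), (48, 3), (56, 4), (64, 8),
                 (72, 13), (80, 14), (88, 5), (96, 1)]
mo = grouped_mode(score_classes, 8)
print(round(mo, 2))
```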
Types of Distribution
The characteristic of a distribution can be determined by the shape of its graph (histogram
or frequency polygon). According to Bluman, the symmetric, positively skewed and negatively
skewed shapes are the most important shapes of graphs that describe a distribution. Skewness refers
to the degree of departure of the distribution from the line of symmetry. When the data values
are evenly distributed on both sides of the mean and the distribution is unimodal, it is
called a symmetric distribution. Further, the mean, median and mode have equal values and are at
the center of the distribution. In symbols, $\bar{x} = Md = Mo$.
A positively skewed or right-skewed distribution is unimodal, and the majority of the data values
cluster at the lower end of the distribution and to the left of the mean. Moreover, in a
positively skewed distribution, the mode is less than the median and the median is less than
the mean. In symbols, $Mo < Md < \bar{x}$.
A negatively skewed or left-skewed distribution is observed when the majority of the data
values cluster at the upper end of the distribution and to the right of the mean. Furthermore,
in a negatively skewed distribution, the mode is greater than the median and the median is greater
than the mean. In symbols, $\bar{x} < Md < Mo$.
The following graphs are illustrations of the three types of distribution according to
skewness (MathBits.com).
Symmetric Distribution
Exercises 2.1
A. Using Exercise 13.1 on page 811 of the book, Mathematical Excursion by Aufman,
answer numbers 4 to 9 and 11.
B. Using the same exercise, find the mean, median and mode of the data set of number
14 on page 812.
C. Problem Solving.
1. The mean age of eight college freshman students is 19.25, and six of the ages
are: 19, 18, 20, 19, 20 and 18. What are the ages of the two students who are
twin siblings? What is the mode (age) of the eight students?
2. Find the mean of 20, 30, 40, 50 and 60.
a. Add 5 to each value and find the mean.
b. Subtract 5 from each value and find the mean.
c. Multiply each value by 5 and find the mean.
d. Divide each value by 5 and find the mean.
e. Make a general statement about each situation.
Introduction
In the preceding lesson you learned the three measures of central tendency, namely the mean, median
and mode. To describe a data set, however, one needs to know more than these measures, because one
might wrongly claim that two or more data sets do not differ in spread simply because their
averages are equal. In this lesson, we will discuss the measures of variation/spread, or measures
of dispersion. Four measures of variability, for both ungrouped and grouped data, will be
discussed: the range, mean absolute deviation, variance and standard deviation.
The range is simply the gap or difference between the highest and lowest value/observation of the
data set. In formula: R = HV – LV.
If R = 0, it implies that all values in a data set are equal. Thus, there is no variability of the data.
Example 3.1
Ages of female faculty members from three departments.
Statistical Measure    Data Set A    Data Set B    Data Set C    Implication/Impression
                       37            40            39
                       38            41            40
                       42            42            42
                       45            43            43
                       48            44            46
Range                  11            4             7             Distribution A is more spread. Why?
According to Petilos in his Resource Material in Basic Statistics, the range of grouped data is equal
to the difference between the true upper limit of the highest class interval and the true lower limit
of the lowest class interval. If the apparent limits are used, the range is equal to the upper limit
of the highest class interval minus the lower limit of the lowest class interval, plus 1. In
formula:
$R = (UL)_H - (LL)_L$
Example 3.2
Scores of 50 students in Statistics examination
Class Interval f
96 - 103 1
88 – 95 5
80– 87 14
72– 79 13
64 – 71 8
56 – 63 4
48– 55 3
40 – 47 2
N = 50
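Using the data set as presented in the distribution above, with a true upper limit of 103.5 for the
highest class and a true lower limit of 39.5 for the lowest class, the range is: $R = 103.5 - 39.5 = 64$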
The mean absolute deviation (MAD) of a data set is defined as the average distance between each
data value and the mean. It helps to describe how “spread out” the values in a data set
are (https://www.khanacademy.org/math). The MAD for raw data is computed using the
following formula:
$MAD = \frac{\Sigma |X - \bar{x}|}{N}$
where: X = score or value
Using the data set of Example 3.1 and computing for the MAD of each distribution, we have:
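Example 3.3
Ages of female faculty members from three departments

Statistical Measure    Data Set A    Data Set B    Data Set C    Implication/Impression
                       37            40            39
                       38            41            40
                       42            42            42
                       45            43            43
                       48            44            46
MAD                    3.6           1.2           2.0

For Data Set A the mean is $\bar{x} = 42$, so
$$\mathrm{MAD} = \frac{|37-42| + |38-42| + |42-42| + |45-42| + |48-42|}{5} = \frac{5 + 4 + 0 + 3 + 6}{5} = \frac{18}{5} = 3.6$$
Following the same procedure we find the MAD of the remaining two distributions as reflected on
the table above.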
It can be deduced from the table of Example 3.3 that the ages in Data Set A deviate from
the mean by an average of 3.6, whereas the ages in Data Set B deviate from the mean by
an average of only 1.2. This implies that Data Set B is less spread out than Data Set A. The smaller
the value of the MAD, the less spread out the distribution is.
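The MAD computation for raw data can be sketched in Python as follows. The helper name `mean_absolute_deviation` is an illustrative choice; the data are the three sets of ages from Example 3.1.

```python
# Ages from Example 3.1, one list per department (data set).
data_sets = {
    "A": [37, 38, 42, 45, 48],
    "B": [40, 41, 42, 43, 44],
    "C": [39, 40, 42, 43, 46],
}


def mean_absolute_deviation(values):
    """Return the average absolute distance of each value from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


# Compute the MAD of every data set.
mads = {name: mean_absolute_deviation(v) for name, v in data_sets.items()}

for name, mad in mads.items():
    print(f"Data Set {name}: MAD = {mad:.1f}")
```

All three data sets have the same mean of 42, so the differences in MAD reflect only how far the ages sit from that common mean.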
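For grouped data the MAD is obtained using the formula below:
$$\mathrm{MAD} = \frac{\sum f\,|X - \bar{x}|}{N}$$
where: f = frequency of the class
X = class mark (midpoint of the class interval)
$\bar{x}$ = mean score or mean value
N = total number of scores or values
Recall, for the distribution of the 50 scores in Example 3.2:
N = 50
$\bar{x} = 74.54$ (from Example 2.2)
$\sum f\,|X - \bar{x}| = 495.36$
Thus,
$$\mathrm{MAD} = \frac{\sum f\,|X - \bar{x}|}{N} = \frac{495.36}{50} = 9.9072 \approx 9.91$$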
This implies that the 50 scores deviate from the mean of 74.54 by an average of 9.91 units.
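The grouped-data MAD can be reproduced with the sketch below, which assumes that the distribution in question is the one in Example 3.2 and that each class is represented by its class mark (the midpoint of its apparent limits). The variable names are illustrative.

```python
# (lower limit, upper limit, frequency) for each class of Example 3.2.
classes = [
    (96, 103, 1), (88, 95, 5), (80, 87, 14), (72, 79, 13),
    (64, 71, 8), (56, 63, 4), (48, 55, 3), (40, 47, 2),
]

# Represent every class by its class mark (midpoint).
marks_and_freqs = [((lower + upper) / 2, f) for lower, upper, f in classes]

n = sum(f for _, f in marks_and_freqs)                  # total frequency N
mean = sum(f * x for x, f in marks_and_freqs) / n       # grouped mean
sum_abs_dev = sum(f * abs(x - mean) for x, f in marks_and_freqs)
mad = sum_abs_dev / n                                   # grouped MAD

print(f"N = {n}, mean = {mean:.2f}, MAD = {mad:.2f}")
```

Under these assumptions the sketch recovers the mean of 74.54 and the sum of weighted absolute deviations of 495.36 used in the computation.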
The last two measures of dispersion or measures of variation to be included in this module are the
variance and the standard deviation. Bluman, in his book Elementary Statistics, defines the variance as the
average of the squares of the distance each value is from the mean. The
standard deviation is the square root of the variance. It describes how spread out a group of
numbers is from the mean (https://www.investopedia.com).
The population variance and standard deviation are calculated using the following respective
formulas:
Variance ($\sigma^2$, read as “sigma squared”):
$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$
Standard Deviation ($\sigma$ = square root of the variance):
$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$
Example 3.5
The following data are ages of 10 teachers in one Elementary School:
27, 34, 30, 29, 28, 30, 34, 35, 38, 29.
Solution: To compute for the variance, we present the data as shown in the table below, where the
population mean is $\mu = 314/10 = 31.4$:

Age (X)    X − μ     (X − μ)²
27         -4.4      19.36
34         2.6       6.76
30         -1.4      1.96
29         -2.4      5.76
28         -3.4      11.56
30         -1.4      1.96
34         2.6       6.76
35         3.6       12.96
38         6.6       43.56
29         -2.4      5.76
ΣX = 314             Σ(X − μ)² = 116.40

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{116.40}{10} = 11.64$$
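What does a population variance of 11.64 mean? The variance equals zero only when all observations
are identical, and it grows as the observations move farther from the mean. A value of 11.64
therefore indicates that the ages are spread out from one another and from the mean of 31.4.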
From the above value of the population variance, it follows that the population standard deviation,
which is the square root of the variance, is:
$$\sigma = \sqrt{11.64} \approx 3.41$$
We recall that the standard deviation measures how concentrated the data are around the mean;
the more concentrated, the smaller the standard deviation
(https://www.dummies.com/education/math/statistics). What is the implication of the above value
in relation to the mean of the given data set?
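The population variance and standard deviation of Example 3.5 can be checked against Python's standard `statistics` module. The sketch below uses the ten ages as tabulated in the solution (whose ninth value is 38), since these are the values that reproduce the mean of 31.4 and the variance of 11.64. The variable names are illustrative.

```python
import math
import statistics

# Ages of the 10 teachers, as tabulated in the solution of Example 3.5.
ages = [27, 34, 30, 29, 28, 30, 34, 35, 38, 29]

n = len(ages)
mu = sum(ages) / n                                  # population mean

# Definitional formula: average of the squared distances from the mean.
sigma_sq = sum((x - mu) ** 2 for x in ages) / n     # population variance
sigma = math.sqrt(sigma_sq)                         # population standard deviation

# Cross-check with the standard library's population formulas.
library_variance = statistics.pvariance(ages)
library_std_dev = statistics.pstdev(ages)

print(f"mean = {mu:.1f}, variance = {sigma_sq:.2f}, std dev = {sigma:.2f}")
```

Note that `statistics.pvariance` and `statistics.pstdev` divide by N, which matches the population formulas; `statistics.variance` and `statistics.stdev` divide by n − 1 and are meant for samples.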
Example 3.6
Using the data set of Example 3.3, determine the variance and standard deviation of each subset
of data. Compare your results. The table is reproduced below.

Statistical Measure    Data Set A    Data Set B    Data Set C
                       37            40            39
                       38            41            40
                       42            42            42
                       45            43            43
                       48            44            46
Variance
Standard Deviation
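The table below shows the different notations used for the variance and standard deviation.

Notation    Statistical Measure
σ²          Population variance
σ           Population standard deviation
s²          Sample variance
s           Sample standard deviation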
If the data set is taken from a sample, the variance and standard deviation are obtained using the
following computational formula (Bluman, p.137)
Sample Variance:
$$s^2 = \frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n-1)}$$
Sample Standard Deviation:
$$s = \sqrt{\frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n-1)}}$$
Where: $s^2$ = sample variance
X = individual observation
n = sample size
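The computational formula is algebraically the same as the definitional formula for the sample variance. The derivation below is a sketch that assumes the standard definition $s^2 = \sum (X - \bar{x})^2 / (n - 1)$, with $\bar{x} = \sum X / n$ the sample mean; that definition is not stated in this section and is introduced here only for the derivation.

```latex
\begin{align*}
\sum (X - \bar{x})^2
  &= \sum X^2 - 2\bar{x}\sum X + n\bar{x}^2 \\
  &= \sum X^2 - \frac{2\left(\sum X\right)^2}{n} + \frac{\left(\sum X\right)^2}{n}
     && \text{since } \bar{x} = \frac{\sum X}{n} \\
  &= \sum X^2 - \frac{\left(\sum X\right)^2}{n} \\[6pt]
s^2 = \frac{\sum (X - \bar{x})^2}{n-1}
  &= \frac{\sum X^2 - \dfrac{\left(\sum X\right)^2}{n}}{n-1}
   = \frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n-1)}
\end{align*}
```

The computational form needs only the two totals $\sum X$ and $\sum X^2$, which is why it is convenient for hand calculation.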
Example 3.7
Find the sample variance and standard deviation for the daily production rate of fiberglass boats of
a certain manufacturer. If the company production manager feels that a standard deviation of more
than three boats a day is unacceptable, should the manager be concerned about the plant
production rate? Why?
17 21 18 27 17 21 20 22 18 23
Solution:

X     17    21    18    27    17    21    20    22    18    23     ΣX = 204
X²    289   441   324   729   289   441   400   484   324   529    ΣX² = 4,250

$$s^2 = \frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n-1)} = \frac{(10)(4250) - (204)^2}{10(10-1)} = \frac{42500 - 41616}{90} = \frac{884}{90} \approx 9.82$$
From the above value of the sample variance, it follows that the sample standard deviation, which is
the square root of the variance, is:
$$s = \sqrt{9.82} \approx 3.13$$
REMARK:
The obtained sample standard deviation of 3.13 is slightly more than three boats a day. By the
production manager's own criterion, the variation in the plant's daily production is therefore
unacceptable, so the manager has reason to be concerned about the production rate, although
the limit is exceeded only by a small margin.
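The arithmetic of Example 3.7 can be verified with the short sketch below. It applies the computational formula directly and cross-checks the result with `statistics.variance`, which uses the n − 1 divisor. The names `boats` and `limit` are illustrative.

```python
import math
import statistics

# Daily production of fiberglass boats from Example 3.7.
boats = [17, 21, 18, 27, 17, 21, 20, 22, 18, 23]

n = len(boats)
sum_x = sum(boats)                      # sum of X
sum_x2 = sum(x * x for x in boats)      # sum of X squared

# Computational (shortcut) formula for the sample variance.
s2 = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))
s = math.sqrt(s2)

# Cross-check against the standard library's sample variance.
library_s2 = statistics.variance(boats)

# The manager's stated limit: a standard deviation of three boats a day.
limit = 3
exceeds_limit = s > limit

print(f"s^2 = {s2:.2f}, s = {s:.2f}, exceeds limit of {limit}: {exceeds_limit}")
```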
For grouped data we find the variance and standard deviation using the following computational
formula (Bluman, p.139)
Variance:
$$s^2 = \frac{n\left(\sum fX^2\right) - \left(\sum fX\right)^2}{n(n-1)}$$
Standard Deviation:
$$s = \sqrt{\frac{n\left(\sum fX^2\right) - \left(\sum fX\right)^2}{n(n-1)}}$$
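Example 3.8
Using the data in Example 4.1.1 we find the variance and standard deviation of grouped data. The
table is reproduced below, where X is the class mark of each interval:

Class Interval    f     X       fX         fX²
96 – 103          1     99.5    99.5       9,900.25
88 – 95           5     91.5    457.5      41,861.25
80 – 87           14    83.5    1,169.0    97,611.50
72 – 79           13    75.5    981.5      74,103.25
64 – 71           8     67.5    540.0      36,450.00
56 – 63           4     59.5    238.0      14,161.00
48 – 55           3     51.5    154.5      7,956.75
40 – 47           2     43.5    87.0       3,784.50
                  N = 50        ΣfX = 3,727    ΣfX² = 285,828.50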
Substituting into the computational or shortcut formula above, we obtain the sample variance as
follows:
$$s^2 = \frac{n\left(\sum fX^2\right) - \left(\sum fX\right)^2}{n(n-1)} = \frac{(50)(285828.50) - (3727)^2}{(50)(50-1)} = \frac{14291425 - 13890529}{(50)(49)} = \frac{400896}{2450} \approx 163.63$$
With the above sample variance value of 163.63 it follows that the sample standard deviation (s),
which is the square root of the variance, is 12.79. This implies that the scores of the 50 students deviate
from the mean on the average by a distance of 12.79 units.
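The grouped-data computation can be reproduced with the sketch below. It assumes that the frequency distribution is the one for the 50 Statistics scores (class intervals 40–47 up to 96–103) and that each class is represented by its class mark; under these assumptions the totals agree with ΣfX = 3727 and ΣfX² = 285,828.50. The variable names are illustrative.

```python
import math

# (lower limit, upper limit, frequency) for each class of the 50 scores.
classes = [
    (96, 103, 1), (88, 95, 5), (80, 87, 14), (72, 79, 13),
    (64, 71, 8), (56, 63, 4), (48, 55, 3), (40, 47, 2),
]

# Represent every class by its class mark (midpoint).
marks_and_freqs = [((lower + upper) / 2, f) for lower, upper, f in classes]

n = sum(f for _, f in marks_and_freqs)                   # total frequency
sum_fx = sum(f * x for x, f in marks_and_freqs)          # sum of fX
sum_fx2 = sum(f * x * x for x, f in marks_and_freqs)     # sum of fX squared

# Computational (shortcut) formula for the grouped sample variance.
s2 = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
s = math.sqrt(s2)

print(f"n = {n}, sum fX = {sum_fx}, sum fX^2 = {sum_fx2}")
print(f"s^2 = {s2:.2f}, s = {s:.2f}")
```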
There is another method of computing the sample variance and sample standard deviation by using
the Coded Deviation. However, its discussion is not included in this module.
Exercises 3.1
A. Using Exercise 13.2 on page 823 of the book Mathematical Excursions by Aufmann,
answer numbers 4 to 8 and 12.
B. Using the same exercise, answer number 20 on page 824 on the ages of the female
and male Academy Award-winning actors. Answer questions a, b, and c found at the end
of the exercise.
C. Critical Thinking
Using exercise no. 26 on page 825, perform the suggested activity and answer
the question found at the end.