GARABA
0764273821/garabaathanas@gmail.com/athanas.garaba@tpsc.go.tz
Mr. Garaba is a Tanzanian scholar working at Tanzania Public Service College, Tabora. He has a Bachelor of
Science with Education (2011) from Sokoine University of Agriculture (SUA), Morogoro Tanzania, Master of
Educational Management and Planning (MEMP)(2014) from St. Augustine University of Tanzania (SAUT) and
International Diploma in Educational Planning and Administration (IDEPA, 2020) from National Institute of
Educational Planning and Administration (NIEPA), New Delhi -110016, India.
Mr. Garaba has published several papers in international and national journals, including
the Use of ICT in selected secondary schools in Tabora Municipal, which was published in IJIRAS (issue of July
2019).
We often represent a data set by numerical summary measures. A measure of central tendency is
a measure that tells us where the middle of a data set lies. This section discusses the three most
common measures of central tendency: the mean, the median, and the mode.
i. Mean: the most common measure of central tendency. It is simply the sum of the
values divided by the number of values in a data set. It is also known as the
arithmetic mean.
ii. Mode: the value that occurs most frequently in a series.
iii. Median: the value of the middle term of a series when the items are arranged in
ascending or descending order.
We will learn how to calculate each of these measures for grouped and ungrouped data. Data that
give information on each member of the population or sample individually are called ungrouped
data, whereas grouped data are presented in the form of a frequency distribution table.
I. MEASURES OF CENTRAL TENDENCY FOR UNGROUPED DATA
MEAN
For ungrouped data, the mean is obtained by dividing the sum of all values by the number of
values in the data set.
$$\text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}}$$
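Therefore, the mean for population and sample ungrouped data is found as follows: for a population
of N values, $\mu = \frac{\sum x}{N}$; for a sample of n values, $\bar{x} = \frac{\sum x}{n}$.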
Example1;
Example2;
The following are the weights (in kgs) of machines of a small company:
53 32 61 27 39 44 49 57
Find the mean weight of these machines.
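The calculation for the machine weights can be sketched in a few lines of Python. This is only an illustrative sketch; the function name `mean` and the variable names are our own choices.

```python
# Weights (in kg) of the machines, copied from the example above.
weights = [53, 32, 61, 27, 39, 44, 49, 57]


def mean(values):
    """Arithmetic mean: the sum of all values divided by the number of values."""
    return sum(values) / len(values)


mean_weight = mean(weights)  # 362 / 8
print(mean_weight)           # 45.25
```

The sum of the eight weights is 362, so the mean weight is 362 / 8 = 45.25 kg.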
We should remember that the mean is not always the best measure of central tendency because it
is heavily influenced by outliers. Sometimes other measures of central tendency give a more
accurate impression of a data set. For example, when a data set has outliers, instead of using the
mean, we can use the median as a measure of central tendency.
MEDIAN
The median is the value of the middle term in a data set that has been ranked in increasing or
decreasing order. As is obvious from the definition of the median, it divides a ranked data set
into two equal parts. The calculation of the median consists of the following two steps:
1. Rank the data set in either increasing or decreasing order.
2. Find the middle term. The value of this term is the median.
Note that if the number of observations in a data set is odd, then the median is given by the value
of the middle term in the ranked data. However, if the number of observations is even, then the
median is given by the average of the values of the two middle terms.
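The two steps, together with the odd/even rule, can be sketched as follows. The function name `median` is our own, and `odd_data` is a hypothetical data set used only for illustration; `even_data` reuses the machine weights from the mean example.

```python
def median(values):
    """Median of ungrouped data: rank the values, then take the middle term."""
    ranked = sorted(values)          # Step 1: rank in increasing order
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:                   # odd count: the single middle term
        return ranked[mid]
    # even count: the average of the two middle terms
    return (ranked[mid - 1] + ranked[mid]) / 2


odd_data = [7, 3, 9, 1, 5]                       # hypothetical values
even_data = [53, 32, 61, 27, 39, 44, 49, 57]     # machine weights (kg)

print(median(odd_data))    # 5
print(median(even_data))   # 46.5
```

For the eight machine weights, the ranked data are 27, 32, 39, 44, 49, 53, 57, 61, so the median is the average of 44 and 49.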
Example1;
Median price is 374.
Example2;
MODE
The mode is the value that occurs with the highest frequency in a data set.
Example1;
A major shortcoming of the mode is that a data set may have none or may have more than one
mode, whereas it will have only one mean and only one median. For instance, a data set with
each value occurring only once has no mode. A data set with only one value occurring with the
highest frequency has only one mode. The data set in this case is called unimodal. A data set
with two values that occur with the same (highest) frequency has two modes. The distribution, in
this case, is said to be bimodal. If more than two values in a data set occur with the same
(highest) frequency, then the data set contains more than two modes and it is said to be
multimodal.
Example2; Last year's incomes of five randomly selected families were 76,150, 95,750, 124,985,
87,490, and 53,740. Find the mode.
Solution Because each value in this data set occurs only once, this data set contains no mode.
Example3; The statuses of five students who are members of the student senate at a college are
senior, sophomore, senior, junior, and senior, respectively. Find the mode.
Solution Because senior occurs more frequently than the other categories, it is the mode for this
data set. We cannot calculate the mean and median for this data set.
One advantage of the mode is that it can be calculated for both kinds of data—quantitative and
qualitative—whereas the mean and median can be calculated for only quantitative data.
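A short sketch of how the mode can be found for both quantitative and qualitative data is given below. The helper name `modes` is our own; it returns a list so that data sets with no mode, one mode, or several modes are all handled. The `bimodal` list is a hypothetical data set added for illustration.

```python
from collections import Counter


def modes(values):
    """Return all values that occur with the highest frequency.

    Following the convention in the text, a data set in which every
    value occurs only once has no mode, so an empty list is returned.
    """
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return [value for value, count in counts.items() if count == highest]


incomes = [76150, 95750, 124985, 87490, 53740]                 # Example 2
status = ["senior", "sophomore", "senior", "junior", "senior"]  # Example 3
bimodal = [1, 1, 2, 2, 3]                                       # hypothetical

print(modes(incomes))   # []
print(modes(status))    # ['senior']
print(modes(bimodal))   # [1, 2]
```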
II. MEASURES OF CENTRAL TENDENCY FOR GROUPED DATA
MEAN
When only grouped data are available, we do not know the individual data values (we know only the
intervals and the interval frequencies); therefore, we cannot compute an exact mean for the data set.
Instead, we estimate the actual mean by calculating the mean of a frequency table. A
frequency table is a data representation in which grouped data are displayed along with the
corresponding frequencies. To calculate the mean from a grouped frequency table we apply
the formula
$$\text{Mean} = \frac{\sum fx}{N}$$
where x is the midpoint of each class, f is the frequency of the class, and N is the total frequency.
Example1;
Solution
Example2: Below is the daily rental for a bank loan application. Find
a) Mean
b) Median
c) Mode
Example 3:
10 30 5 21 28 20 12 11 7 9 10 14 15 34 42 45 34 33 36 35 8 27 34 37 44 25 50 66 70 71 72 79 54 29 32 22
41 52 40 33 35 36 34 37 38 40 39 35.
a. Prepare a table for calculating mean, mode, and median by using five (5) as a class size and
begin class interval from 5.
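Solution:
---------------------------------------------------------------------------------
Class Interval   Midpoint (x)   Frequency (f)   Cumulative Frequency   fx
---------------------------------------------------------------------------------
5-9              7              4               4                      28
10-14            12             5               9                      60
15-19            17             1               10                     17
20-24            22             3               13                     66
25-29            27             4               17                     108
30-34            32             8               25                     256
35-39            37             9               34                     333
40-44            42             5               39                     210
45-49            47             1               40                     47
50-54            52             3               43                     156
55-59            57             0               43                     0
60-64            62             0               43                     0
65-69            67             1               44                     67
70-74            72             3               47                     216
75-79            77             1               48                     77
---------------------------------------------------------------------------------
Total                           N = 48                                 Σfx = 1641
---------------------------------------------------------------------------------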
b. The median class is obtained by taking $\frac{N}{2} = \frac{48}{2} = 24$.
So, 24 is found within the cumulative frequency 25 when the data are arranged in
ascending order. Hence the median class is 30-34.
c. The modal class is the one with the highest frequency. So, from the frequency
distribution table above, the modal class is 35-39.
d. The mean, mode and median.
i. Mean = $\frac{\sum fx}{N} = \frac{1641}{48} = 34.19$
ii. Mode = $L + \left(\frac{t_1}{t_1 + t_2}\right) \times i$
Where, $L$ = lower boundary of the modal class
$t_1$ = the difference between the frequency of the modal class and the frequency of the class
before the modal class
$t_2$ = the difference between the frequency of the modal class and the frequency of the class after
the modal class
$i$ = class size
Mode = $34.5 + \left(\frac{1}{1 + 4}\right) \times 5$
Mode = $34.5 + \frac{5}{5}$
Mode = 35.5
iii. Median = $L + \left(\frac{\frac{N}{2} - n_b}{n_w}\right) \times i$
Where, $L$ = lower boundary of the median class
$i$ = class size
$n_b$ = cumulative frequency of the class preceding/before the median class
$n_w$ = frequency within the median class
$N$ = total frequency
Median = $29.5 + \left(\frac{24 - 17}{8}\right) \times 5$
Median = $29.5 + \frac{7 \times 5}{8}$
Median = $29.5 + \frac{35}{8}$
Median = 29.5 + 4.38
Median = 33.88
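The three grouped-data calculations in Example 3 can be checked with a short Python sketch. The function names are our own, and the lower class boundary is taken as the lower class limit minus 0.5, which assumes whole-number class limits as in this table.

```python
# (lower limit, upper limit, frequency) for each class in the Example 3 table.
classes = [(5, 9, 4), (10, 14, 5), (15, 19, 1), (20, 24, 3), (25, 29, 4),
           (30, 34, 8), (35, 39, 9), (40, 44, 5), (45, 49, 1), (50, 54, 3),
           (55, 59, 0), (60, 64, 0), (65, 69, 1), (70, 74, 3), (75, 79, 1)]
CLASS_SIZE = 5


def grouped_mean(classes):
    """Mean = sum(f * x) / N, where x is the class midpoint."""
    n = sum(f for _, _, f in classes)
    return sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n


def grouped_median(classes, size):
    """Median = L + ((N/2 - nb) / nw) * i."""
    n = sum(f for _, _, f in classes)
    cumulative = 0                       # nb: cumulative frequency before the class
    for lo, _, f in classes:
        if cumulative + f >= n / 2:      # first class whose cumulative frequency reaches N/2
            return (lo - 0.5) + (n / 2 - cumulative) / f * size
        cumulative += f


def grouped_mode(classes, size):
    """Mode = L + (t1 / (t1 + t2)) * i."""
    freqs = [f for _, _, f in classes]
    k = freqs.index(max(freqs))          # position of the modal class
    before = freqs[k - 1] if k > 0 else 0
    after = freqs[k + 1] if k < len(freqs) - 1 else 0
    t1 = freqs[k] - before
    t2 = freqs[k] - after
    return (classes[k][0] - 0.5) + t1 / (t1 + t2) * size


print(round(grouped_mean(classes), 2))                 # 34.19
print(round(grouped_median(classes, CLASS_SIZE), 2))   # 33.88
print(grouped_mode(classes, CLASS_SIZE))               # 35.5
```

The unrounded values are 34.1875 for the mean and 33.875 for the median, which agree with the hand calculation above.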
The number of classes should not be too small or too large. If the classes are few, the
classification becomes very broad and rough which might obscure some important
features and characteristics of the data. The accuracy of the results decreases as the
number of classes becomes smaller. On the other hand, too many classes will leave only a
few observations in each class. This gives an irregular pattern of frequencies across the
classes and so makes the frequency distribution irregular. Moreover, a large
number of classes will render the distribution too unwieldy to handle. The computational
work for further processing of the data will become quite tedious and time consuming
without any proportionate gain in the accuracy of the results.
Hence a balance should be maintained between the loss of information in the first case
and irregularity of frequency distribution in the second case, to arrive at a suitable
number of classes. Normally, the number of classes should be neither fewer than 5 nor more
than 20. Prof. Sturges has given a formula:
k = 1 + 3.322 log n
Where k refers to the number of classes, n refers to the total frequency or number of
observations, and log is the logarithm to base 10.
The value of k is rounded to the nearest integer:
If n = 100, k = 1 + 3.322 log 100 = 1 + 6.644 = 7.644 ≈ 8
If n = 10,000, k = 1 + 3.322 log 10,000 = 1 + 13.288 = 14.288 ≈ 14
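Sturges' formula is easy to evaluate in code. The sketch below returns the unrounded value so that the rounding step is visible; the function name `sturges_k` is our own, and the logarithm is taken to base 10. Rounding to the nearest integer reproduces both worked values, 8 and 14.

```python
import math


def sturges_k(n):
    """Unrounded number of classes from Sturges' formula: k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)


for n in (100, 10_000):
    k = sturges_k(n)
    # Show the raw value and the value rounded to the nearest integer.
    print(n, round(k, 3), round(k))
```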
There are two methods of classifying the data according to class intervals :
(i) exclusive method, and
(ii) inclusive method
In an exclusive method, the class intervals are fixed in such a manner that upper limit of one
class becomes the lower limit of the following class. Moreover, an item equal to the upper limit
of a class would be excluded from that class and included in the next class. The following data
are classified on this basis.
---------------------------------------------------------
According to the inclusive method, an item equal to upper limit of a class is included in
that class itself. The following table demonstrates this method.
-----------------------------------------------------------
---------------------------------------------------------------
Class Interval     Tally Bars        Frequency (f)
---------------------------------------------------------------
10 – 13            ||||/             5
14 – 17            ||||/ |||         8
18 – 21            ||||/ |||         8
22 – 25            ||||/ ||          7
26 – 29            ||||/             5
30 – 33            ||||              4
34 – 37            ||                2
38 – 41            |                 1
---------------------------------------------------------------
Example: Prepare a statistical table from the following :
Weekly wages (Rs.) of 100 workers of Factory A
---------------------------------------------------------------
88 23 27 28 86 96 94 93 86 99
82 24 24 55 88 99 55 86 82 36
96 39 26 54 87 100 56 84 83 46
102 48 27 26 29 100 59 83 84 48
104 46 30 29 40 101 60 89 46 49
106 33 36 30 40 103 70 90 49 50
104 36 37 40 40 106 72 94 50 60
24 39 49 46 66 107 76 96 46 67
26 78 50 44 43 46 79 99 36 68
29 67 56 99 93 48 80 102 32 51
---------------------------------------------------------------
Solution: The lowest value is 23 and the highest is 107. The difference between the lowest and
highest values is 84. If we take a class interval of 10, nine classes are formed. The first class
should be taken as 20 – 30 instead of 23 – 33 as per the guidelines of classification.
-----------------------------------------------------------------
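The classification of the 100 weekly wages can be sketched in Python as follows. This is a minimal sketch assuming the exclusive method with classes 20 – 30, 30 – 40, and so on, so that a value equal to the upper limit of a class is counted in the next class; the function name `frequency_table` is our own.

```python
# Weekly wages (Rs.) of 100 workers of Factory A, row by row from the table above.
wages = [
    88, 23, 27, 28, 86, 96, 94, 93, 86, 99,
    82, 24, 24, 55, 88, 99, 55, 86, 82, 36,
    96, 39, 26, 54, 87, 100, 56, 84, 83, 46,
    102, 48, 27, 26, 29, 100, 59, 83, 84, 48,
    104, 46, 30, 29, 40, 101, 60, 89, 46, 49,
    106, 33, 36, 30, 40, 103, 70, 90, 49, 50,
    104, 36, 37, 40, 40, 106, 72, 94, 50, 60,
    24, 39, 49, 46, 66, 107, 76, 96, 46, 67,
    26, 78, 50, 44, 43, 46, 79, 99, 36, 68,
    29, 67, 56, 99, 93, 48, 80, 102, 32, 51,
]


def frequency_table(values, start, width):
    """Count values per class using the exclusive method.

    Class k covers start + k*width up to (but not including)
    start + (k+1)*width.
    """
    n_classes = (max(values) - start) // width + 1
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1
    return [(start + k * width, start + (k + 1) * width, c)
            for k, c in enumerate(counts)]


table = frequency_table(wages, start=20, width=10)
for lower, upper, freq in table:
    print(f"{lower} - {upper}: {freq}")
print("Total:", sum(freq for _, _, freq in table))
```

Starting the first class at 20 with a class interval of 10 yields nine classes, the last being 100 – 110.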
The measures of central tendency, such as the mean, median, and mode, do not reveal the whole
picture of the distribution of a data set. Two data sets with the same mean may have completely
different spreads. The variation among the values of observations for one data set may be much
larger or smaller than for the other data set. (Note that the words dispersion, spread, and
variation have the same meaning.) Consider the following two data sets on the ages (in years) of
all workers working for each of two small companies.
Company 1: 47 38 35 40 36 45 39
Company 2: 70 33 18 52 27
The mean age of workers in both these companies is the same, 40 years. If we do not know the
ages of individual workers at these two companies and are told only that the mean age of the
workers at both companies is the same, we may deduce that the workers at these two companies
have a similar age distribution. As we can observe, however, the variation in the workers’ ages
for each of these two companies is very different. We need a measure that can provide some
information about the variation among data values. The measures that help us learn about the
spread of a data set are called the measures of dispersion. The measures of central tendency and
dispersion taken together give a better picture of a data set than the measures of central tendency
alone.
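The point can be checked directly on the two companies' data. The sketch below computes the mean and, as the simplest measure of spread, the difference between the largest and smallest ages; the helper names are our own.

```python
company_1 = [47, 38, 35, 40, 36, 45, 39]   # ages in years
company_2 = [70, 33, 18, 52, 27]


def mean(values):
    return sum(values) / len(values)


def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


print(mean(company_1), value_range(company_1))   # 40.0 12
print(mean(company_2), value_range(company_2))   # 40.0 52
```

Both companies have a mean age of 40 years, but the ages in Company 2 are spread over 52 years compared with 12 years in Company 1.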
Measures of dispersion for ungrouped data.
Range
The range is the simplest measure of dispersion to calculate. It is obtained by taking the
difference between the largest and the smallest values in a data set. That is;
Range = Largest value - Smallest value
Example1:
Solution:
The maximum total area for a state in this data set is 267,277 square miles, and the smallest area
is 49,651 square miles. Therefore, the total areas of these four states are spread over a range of
217,626 square miles. The range, like the mean, has the disadvantage of being influenced by
outliers. In the example above, if the state of Texas, with a total area of 267,277 square miles, is
dropped, the range decreases from 217,626 square miles to 20,252 square miles. Consequently,
the range is not a good measure of dispersion to use for a data set that contains outliers. Another
disadvantage of using the range as a measure of dispersion is that its calculation is based on two
values only: the largest and the smallest. All other values in a data set are ignored when
calculating the range. Thus, the range is not a very satisfactory measure of dispersion.
Variance and Standard Deviation
The standard deviation is the most-used measure of dispersion. The value of the standard
deviation tells how closely the values of a data set are clustered around the mean. In general, a
lower value of the standard deviation for a data set indicates that the values of that data set are
spread over a relatively smaller range around the mean. In contrast, a larger value of the standard
deviation for a data set indicates that the values of that data set are spread over a relatively larger
range around the mean.
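As an illustration, the sketch below computes the variance and standard deviation of the ages at the two small companies introduced earlier. It assumes the usual population formulas, $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$ and $\sigma = \sqrt{\sigma^2}$, because each data set covers all workers of a company; for a sample, the divisor would be n - 1 instead of N. The function names are our own.

```python
import math

company_1 = [47, 38, 35, 40, 36, 45, 39]   # ages in years
company_2 = [70, 33, 18, 52, 27]


def population_variance(values):
    """Average squared deviation from the mean (divisor N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def population_std(values):
    """Standard deviation: the square root of the variance."""
    return math.sqrt(population_variance(values))


print(round(population_std(company_1), 2))   # 4.14
print(round(population_std(company_2), 2))   # 18.69
```

The much larger standard deviation for Company 2 reflects that its ages lie farther from the common mean of 40 years.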
Example1:
Measure of dispersion for grouped data
Range
For grouped data, the range is the difference between the upper boundary of the highest class and
the lower boundary of the lowest class.
Example: Find the range of the grouped data:
Class       10 - 14   15 - 19   20 - 24   25 - 29
Frequency   2         8         7         3
Solution:
Range = Highest class boundary – lowest class boundary
= 29.5 – 9.5 = 20
Range is 20.
Variance and standard deviation
Following are what we will call the basic formulas used to calculate the population and sample
variances for grouped data:
Example1:
Solution: All the information required for the calculation of the variance and standard deviation
appears in the table below.
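A hedged sketch of the grouped-data calculation is given here, using the small frequency table from the range example (classes 10 - 14 to 25 - 29). It assumes the commonly used shortcut formulas, with x the class midpoint: population variance $\sigma^2 = \frac{\sum fx^2 - (\sum fx)^2 / N}{N}$ and sample variance $s^2 = \frac{\sum fx^2 - (\sum fx)^2 / n}{n - 1}$. The function name `grouped_variance` is our own.

```python
import math

# (lower limit, upper limit, frequency) from the grouped range example.
classes = [(10, 14, 2), (15, 19, 8), (20, 24, 7), (25, 29, 3)]


def grouped_variance(classes, sample=False):
    """Variance of grouped data, using class midpoints as x."""
    n = sum(f for _, _, f in classes)
    sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    sum_fx2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in classes)
    sum_of_squares = sum_fx2 - sum_fx ** 2 / n
    divisor = n - 1 if sample else n    # n - 1 for a sample, N for a population
    return sum_of_squares / divisor


pop_var = grouped_variance(classes)
sample_var = grouped_variance(classes, sample=True)
print(pop_var, round(math.sqrt(pop_var), 2))                   # 18.6875 4.32
print(round(sample_var, 2), round(math.sqrt(sample_var), 2))   # 19.67 4.44
```

Here Σfx = 395 and Σfx² = 8175 with N = 20, so the sum of squares is 8175 - 395²/20 = 373.75.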
Activity:
4. The frequency distribution of the lengths of 100 leaves from a certain species of plant is
given below:
Find the mean, median, mode, range, variance and standard deviation of the leaf lengths.
The following are the important methods of ascertaining whether two variables are correlated or
not:
The first step in the investigation of the relationship between two continuous variables is a
scatter plot. Create a scatter plot for the two variables and evaluate the quality of the
relationship.
Example: Does the number of years invested in schooling pay off in the job market? Apparently
so – the better educated you are, the more money you will earn. The data in the following table
give the median annual income of full-time workers age 25 or older by the number of years of
schooling completed.
The scatter plot shows a strong, positive, linear association between years and salary.
1. Does a relationship exist that can be described by a straight line (that is, is there a
linear relationship)?
3. If the scatter plot of the variables looks like a cloud, there is no relationship between the two
variables and one would stop at this point.
Pearson’s correlation coefficient (named after Karl Pearson, 1857-1936) is a number between -1
and 1 that measures the strength of a linear relationship between two continuous variables. The
absolute value of the coefficient measures how closely the variables are related. The closer it is
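to 1 the closer the relationship. A correlation coefficient over 0.8 indicates a strong correlation
between the variables.
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$
OR
$$r^2 = b_{yx} \times b_{xy}$$
where $b_{yx}$ and $b_{xy}$ are the regression coefficients of y on x and of x on y.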
The value of a correlation coefficient can vary from minus one to plus one. A minus one
indicates a perfect negative correlation, while a plus one indicates a perfect positive
correlation.
A correlation of zero means there is no relationship between the two variables.
When there is a negative correlation between two variables, as the value of one variable
increases, the value of the other variable decreases, and vice versa.
Example 1:
The Pearson correlation coefficient of years of schooling and salary is r = 0.9942. A correlation this
close to 1 is very high and shows a strong, positive, linear association between years of schooling
and the salary.
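The coefficient can be computed from its definition, $r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$, as in the sketch below. The function name `pearson_r` is our own, and the `years` and `income` lists are hypothetical illustration values, not the schooling data from the example.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data for illustration only.
years = [8, 10, 12, 14, 16]
income = [18, 24, 29, 37, 45]
print(round(pearson_r(years, income), 3))

# Points lying exactly on a line give a perfect correlation of +1 or -1.
print(pearson_r([1, 2, 3], [2, 4, 6]))   # rising line
print(pearson_r([1, 2, 3], [6, 4, 2]))   # falling line
```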
2. Regression
Simple regression is used to examine the relationship between one dependent and one
independent variable. After performing an analysis, the regression statistics can be used
to predict the dependent variable when the independent variable is known.
The regression line (known as the least squares line) is a plot of the expected value of the
dependent variable for all values of the independent variable. Technically, it is the line that
"minimizes the squared residuals". The regression line is the one that best fits the data on a
scatter plot.
Using the regression equation, the dependent variable may be predicted from the
independent variable. In the example above we might want to predict the expected salary
for different times of schooling, or calculate the increase in salary for every year of
schooling. For this purpose we can do a regression analysis.
The slope of the regression line (b) is defined as the rise divided by the run. The y intercept (a) is
the point on the y axis where the regression line would intercept the y axis. The slope and y
intercept are incorporated into the regression equation. The intercept is usually called the
constant, and the slope is referred to as the coefficient. Since the regression model is usually not
a perfect predictor, there is also an error term in the equation.
In the regression equation, y is always the dependent variable and x is always the
independent variable.
Here is a way to mathematically describe a linear regression model: y = a + bx + e
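The intercept a and slope b of the least squares line can be sketched as follows, assuming the standard formulas $b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$ and $a = \bar{y} - b\bar{x}$. The function name `fit_line` and the data are hypothetical; the points are chosen to lie exactly on y = 1 + 2x so that the result is easy to verify.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                 # slope: rise divided by run
    a = mean_y - b * mean_x       # intercept: value of y when x = 0
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = fit_line(xs, ys)
print(a, b)            # 1.0 2.0

# Predict the dependent variable for a new value of x.
print(a + b * 10)      # 21.0
```

With real data the points do not fall exactly on the line, and the differences between observed and predicted values are the error term e.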
The coefficient of determination (r-squared) is the square of the correlation coefficient. Its
value may vary from zero to one. It has the advantage over the correlation coefficient in that it
may be interpreted directly as the proportion of variance in the dependent variable that can
be accounted for by the regression equation. For example, an r-squared value of 0.49 means
that 49% of the variance in the dependent variable can be explained by the regression equation.
The other 51% is unexplained.
The standard error of the estimate for regression measures the amount of variability in
the points around the regression line. It is the standard deviation of the data points as they are
distributed around the regression line. The standard error of the estimate can be used to develop
confidence intervals around a prediction.
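Both quantities can be obtained from the residuals of a fitted line, as in the sketch below. It assumes r-squared is computed as 1 - SSE/SST and the standard error of the estimate as $\sqrt{SSE / (n - 2)}$, the usual divisor for simple regression; the function name and the data are hypothetical.

```python
import math


def regression_summary(xs, ys):
    """Return (a, b, r_squared, standard_error) for the line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    # SSE: squared residuals around the line; SST: squared deviations from the mean.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - sse / sst
    standard_error = math.sqrt(sse / (n - 2))
    return a, b, r_squared, standard_error


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b, r_squared, standard_error = regression_summary(xs, ys)
print(round(a, 2), round(b, 2))       # 2.2 0.6
print(round(r_squared, 2))            # 0.6
print(round(standard_error, 3))       # 0.894
```

In this hypothetical case 60% of the variance in y is explained by the regression equation, and the remaining 40% is unexplained.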