
ATHANAS M.

GARABA

0764273821/garabaathanas@gmail.com/athanas.garaba@tpsc.go.tz

Mr. Garaba is a Tanzanian scholar working at Tanzania Public Service College, Tabora. He holds a Bachelor of
Science with Education (2011) from Sokoine University of Agriculture (SUA), Morogoro, Tanzania, a Master of
Educational Management and Planning (MEMP) (2014) from St. Augustine University of Tanzania (SAUT), and an
International Diploma in Educational Planning and Administration (IDEPA, 2020) from the National Institute of
Educational Planning and Administration (NIEPA), New Delhi -110016, India.

Mr. Garaba has published several papers in international and national journals, including
"The Use of ICT in selected secondary schools in Tabora Municipal", which was published in IJIRAS in the
July 2019 issue.

Measures of central tendency

We often represent a data set by numerical summary measures. A measure of central tendency is
a measure that tells us where the middle of a data set lies. This section discusses the three most
common measures of central tendency: the mean, the median, and the mode.

i. Mean- is the most common measure of central tendency. It is simply the sum of the
values divided by the number of values in a set of data. This is also known as the
arithmetic mean.
ii. Mode- is the value that occurs most frequently in a series.
iii. Median- is the value of the middle term of a series when the items are arranged in
ascending or descending order.

We will learn how to calculate each of these measures for grouped and ungrouped data. Data that
give information on each member of the population or sample individually are called ungrouped
data, whereas grouped data are presented in the form of a frequency distribution table.
I. MEASURES OF CENTRAL TENDENCY FOR UNGROUPED DATA
 MEAN
For ungrouped data, the mean is obtained by dividing the sum of all values by the number of
values in the data set.

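Mean = (Sum of all values) / (Number of values)

Therefore, the mean for population and sample ungrouped data is found as follows:

Population mean: μ = Σx / N
Sample mean: x̄ = Σx / n

where Σx is the sum of all values, N is the population size, and n is the sample size.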

Example1;
Example2;
The following are the weights (in kgs) of machines of a small company:
53 32 61 27 39 44 49 57
Find the mean weight of these machines.
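The calculation for this example can be sketched in a few lines of Python. This is only an illustration of the formula "sum of all values divided by the number of values"; the names weights and mean are chosen here for the sketch.

```python
# Weights (in kg) of the machines listed in the example above.
weights = [53, 32, 61, 27, 39, 44, 49, 57]


def mean(values):
    """Arithmetic mean: the sum of all values divided by the number of values."""
    return sum(values) / len(values)


mean_weight = mean(weights)
print(f"Sum = {sum(weights)}, n = {len(weights)}, mean = {mean_weight} kg")
```

The sum of the eight weights is 362, so the mean weight is 362 / 8 = 45.25 kg.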
We should remember that the mean is not always the best measure of central tendency because it
is heavily influenced by outliers. Sometimes other measures of central tendency give a more
accurate impression of a data set. For example, when a data set has outliers, instead of using the
mean, we can use the median as a measure of central tendency.
 MEDIAN
The median is the value of the middle term in a data set that has been ranked in increasing or
decreasing order. As is obvious from the definition of the median, it divides a ranked data set
into two equal parts. The calculation of the median consists of the following two steps:
1. Rank the data set in either increasing or decreasing order.
2. Find the middle term. The value of this term is the median.
Note that if the number of observations in a data set is odd, then the median is given by the value
of the middle term in the ranked data. However, if the number of observations is even, then the
median is given by the average of the values of the two middle terms.
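The two steps and the odd/even rule can be sketched as follows. The function name median is chosen for this sketch, the first data set reuses the machine weights from the mean example, and the second (odd-sized) list is a made-up illustration.

```python
def median(values):
    """Median of ungrouped data: rank the values, then take the middle term."""
    ranked = sorted(values)      # Step 1: rank in increasing order
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of observations: the single middle term
        return ranked[mid]
    # Even number of observations: average of the two middle terms
    return (ranked[mid - 1] + ranked[mid]) / 2


weights = [53, 32, 61, 27, 39, 44, 49, 57]   # 8 observations (even)
print(median(weights))                        # (44 + 49) / 2 = 46.5

illustration = [10, 5, 19, 8, 3]              # hypothetical, 5 observations (odd)
print(median(illustration))                   # middle of 3, 5, 8, 10, 19 is 8
```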
Example1;
Median price is 374.
Example2;

 MODE
The mode is the value that occurs with the highest frequency in a data set.
Example1;

A major shortcoming of the mode is that a data set may have none or may have more than one
mode, whereas it will have only one mean and only one median. For instance, a data set with
each value occurring only once has no mode. A data set with only one value occurring with the
highest frequency has only one mode. The data set in this case is called unimodal. A data set
with two values that occur with the same (highest) frequency has two modes. The distribution, in
this case, is said to be bimodal. If more than two values in a data set occur with the same
(highest) frequency, then the data set contains more than two modes and it is said to be
multimodal.
Example2; Last year’s incomes of five randomly selected families were 76,150, 95,750, 124,985,
87,490, and 53,740. Find the mode.
Solution: Because each value in this data set occurs only once, this data set contains no mode.
Example3; The statuses of five students who are members of the student senate at a college are
senior, sophomore, senior, junior, and senior, respectively. Find the mode.
Solution: Because senior occurs more frequently than the other categories, it is the mode for this
data set. We cannot calculate the mean and median for this data set.
One advantage of the mode is that it can be calculated for both kinds of data—quantitative and
qualitative—whereas the mean and median can be calculated for only quantitative data.
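The cases described above (no mode, one mode, several modes, and qualitative data) can be sketched with Python's collections.Counter. The helper name modes is chosen for this sketch, and the last list is a made-up bimodal illustration.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency.

    An empty list means the data set has no mode (each value occurs once).
    """
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []
    return [value for value, count in counts.items() if count == highest]


incomes = [76150, 95750, 124985, 87490, 53740]
statuses = ["senior", "sophomore", "senior", "junior", "senior"]

print(modes(incomes))          # [] : no mode
print(modes(statuses))         # ['senior'] : unimodal, works for qualitative data
print(modes([1, 1, 2, 2, 3]))  # [1, 2] : bimodal (hypothetical data)
```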
II. MEASURES OF CENTRAL TENDENCY FOR GROUPED DATA
 MEAN
When only grouped data is available, you do not know the individual data values (we only know
intervals and interval frequencies); therefore, you cannot compute an exact mean for the data set.
What we must do is estimate the actual mean by calculating the mean of a frequency table. A
frequency table is a data representation in which grouped data is displayed along with the
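corresponding frequencies. To calculate the mean from a grouped frequency table we can apply
the formula for the mean:

Mean = Σmf / Σf

where f is the frequency of a class and m is its midpoint. And m is found by taking:

m = (Lower limit + Upper limit) / 2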

Example1;

Solution
Example2: Below is the daily rental for a bank loan application. Find

a) Mean
b) Median
c) Mode

Ans: mean=74.5, median=74.5

Example 3:

Use the data set given to answer the questions below;

10 30 5 21 28 20 12 11 7 9 10 14 15 34 42 45 34 33 36 35 8 27 34 37 44 25 50 66 70 71 72 79 54 29 32 22
41 52 40 33 35 36 34 37 38 40 39 35.

a. Prepare a table for calculating mean, mode, and median by using five (5) as a class size and
begin class interval from 5.

b. Indicate the median class

c. Indicate the modal class

d. Compute the mean, mode and median.

Solution:

a. Frequency distribution table


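-----------------------------------------------------------------------------------
Class Interval   Class Mark (x)   Frequency (f)   Cumulative Frequency   fx
-----------------------------------------------------------------------------------
5-9              7                4               4                      28
10-14            12               5               9                      60
15-19            17               1               10                     17
20-24            22               3               13                     66
25-29            27               4               17                     108
30-34            32               8               25                     256
35-39            37               9               34                     333
40-44            42               5               39                     210
45-49            47               1               40                     47
50-54            52               3               43                     156
55-59            57               0               43                     0
60-64            62               0               43                     0
65-69            67               1               44                     67
70-74            72               3               47                     216
75-79            77               1               48                     77
-----------------------------------------------------------------------------------
Total                             N = 48                                 Σfx = 1641
-----------------------------------------------------------------------------------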

b. The median class is obtained by taking N/2 = 48/2 = 24.
So, 24 is found within the cumulative frequency 25 when the data are arranged in
ascending order.
Hence the median class is 30-34.

c. The modal class is the one with the highest frequency. So, from the frequency
distribution table above, the modal class is 35-39.
d. The mean, mode and median.

i. Mean = Σfx / N = 1641 / 48 = 34.19

ii. Mode = L + [t1 / (t1 + t2)] × i
Where, L = Lower boundary of the modal class
t1 = The difference between the frequency of the modal class and the frequency of the class
before the modal class
t2 = The difference between the frequency of the modal class and the frequency of the class after
the modal class
i = Class size

Mode = 34.5 + [1 / (1 + 4)] × 5
Mode = 34.5 + 5/5
Mode = 35.5

iii. Median = L + [(N/2 − nb) / nw] × i
Where, L = Lower boundary of the median class
i = Class size
nb = Cumulative frequency of the class preceding/before the median class
nw = Frequency within the median class
N = Total number of frequency

Median = 29.5 + [(24 − 17) / 8] × 5
Median = 29.5 + (7 × 5)/8
Median = 29.5 + 35/8
Median = 29.5 + 4.38
Median = 33.88
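The three grouped-data calculations in this example can be checked with a short Python sketch. It hard-codes the class lower limits and frequencies from the frequency distribution table, takes each lower boundary as the lower limit minus 0.5, and assumes the modal class is neither the first nor the last class. All variable names are chosen for this sketch.

```python
# Class lower limits and frequencies taken from the frequency distribution table.
lower_limits = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
freqs = [4, 5, 1, 3, 4, 8, 9, 5, 1, 3, 0, 0, 1, 3, 1]
width = 5            # class size i
n = sum(freqs)       # N = 48

# Mean = sum(f * x) / N, where x is the class mark (7, 12, 17, ...).
marks = [lo + (width - 1) / 2 for lo in lower_limits]
grouped_mean = sum(f * x for f, x in zip(freqs, marks)) / n

# Mode = L + t1 / (t1 + t2) * i for the class with the highest frequency.
# Assumption: the modal class has a neighbour on both sides.
k = freqs.index(max(freqs))
t1 = freqs[k] - freqs[k - 1]     # modal frequency minus the one before
t2 = freqs[k] - freqs[k + 1]     # modal frequency minus the one after
grouped_mode = (lower_limits[k] - 0.5) + t1 / (t1 + t2) * width

# Median = L + (N/2 - nb) / nw * i for the first class whose
# cumulative frequency reaches N/2.
nb = 0                           # cumulative frequency before the current class
for lo, nw in zip(lower_limits, freqs):
    if nb + nw >= n / 2:
        grouped_median = (lo - 0.5) + (n / 2 - nb) / nw * width
        break
    nb += nw

print(grouped_mean, grouped_mode, grouped_median)   # 34.1875 35.5 33.875
```

Rounded to two decimal places these agree with the hand calculation: mean 34.19, mode 35.5, and median 33.88.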

What is frequency distribution?


Collected and classified data are presented in the form of a frequency distribution. A frequency
distribution is simply a table in which the data are grouped into classes on the basis of common
characteristics and the number of cases that fall in each class is recorded. It shows the
frequency of occurrence of different values of a single variable. A frequency distribution is
constructed to satisfy three objectives:
(i) to facilitate the analysis of data,
(ii) to estimate frequencies of the unknown population distribution from the distribution of
sample data, and
(iii) to facilitate the computation of various statistical measures.

Principles for Constructing Frequency Distributions


In spite of the great importance of classification in statistical analysis, no hard and fast rules are
laid down for it. A statistician relies on discretion, sound experience, wisdom, and skill to choose
an appropriate classification of the data.
However, the following guidelines must be considered to construct a frequency distribution:
1. Type of classes: The classes should be clearly defined and should not lead to any ambiguity.
They should be exhaustive and mutually exclusive so that any value of the variable corresponds to
only one class.
2. Number of classes: The choice of the number of classes into which a given frequency
distribution should be divided depends upon the following things:
(i) The total frequency, which means the total number of observations in the distribution.
(ii) The nature of the data, which means the size or magnitude of the values of the variable.
(iii) The desired accuracy.
(iv) The convenience regarding computation of the various descriptive measures of the frequency
distribution, such as the mean, variance, etc.

 The number of classes should not be too small or too large. If the classes are few, the
classification becomes very broad and rough, which might obscure some important
features and characteristics of the data. The accuracy of the results decreases as the
number of classes becomes smaller. On the other hand, too many classes will result in
only a few observations in each class. This gives an irregular pattern of frequencies across
the classes and makes the frequency distribution irregular. Moreover, a large
number of classes will render the distribution too unwieldy to handle. The computational
work for further processing of the data will become quite tedious and time consuming
without any proportionate gain in the accuracy of the results.
 Hence a balance should be maintained between the loss of information in the first case
and irregularity of frequency distribution in the second case, to arrive at a suitable
number of classes. Normally, the number of classes should not be less than 5 or more
than 20. Prof. Sturges has given a formula:

k = 1 + 3.322 log n

Where k refers to the number of classes, n refers to the total frequency or number of
observations, and log is the logarithm to base 10.
The value of k is rounded to the nearest integer:
If n = 100                             k = 1 + 3.322 log 100 = 1 + 6.644 = 7.644 ≈ 8
If n = 10,000                        k = 1 + 3.322 log 10,000 = 1 + 13.288 = 14.288 ≈ 14
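Sturges' formula is easy to evaluate in code. The sketch below assumes a base-10 logarithm and rounds to the nearest integer, which reproduces both worked values (8 classes for n = 100 and 14 classes for n = 10,000). The function name sturges_classes is chosen for this sketch.

```python
import math


def sturges_classes(n):
    """Suggested number of classes: k = 1 + 3.322 * log10(n), rounded to nearest."""
    k = 1 + 3.322 * math.log10(n)
    return round(k)


for n in (100, 10_000):
    print(n, sturges_classes(n))
```

The result is only a starting point; the guideline of keeping between 5 and 20 classes still applies.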
There are two methods of classifying the data according to class intervals :
(i) exclusive method, and
(ii) inclusive method
In an exclusive method, the class intervals are fixed in such a manner that upper limit of one
class becomes the lower limit of the following class. Moreover, an item equal to the upper limit
of a class would be excluded from that class and included in the next class. The following data
are classified on this basis.

---------------------------------------------------------
Income (Rs.)     No. of Persons
---------------------------------------------------------
200 – 250              50
250 – 300              100
300 – 350              70
350 – 400              130
400 – 450              50
450 – 500              100
---------------------------------------------------------
It is clear from the example that the exclusive method ensures continuity of the data inasmuch
as the upper limit of one class is the lower limit of the next class. Therefore, 50 persons have
incomes between 200 and 249.99, and a person whose income is 250 is included in the
next class of 250 – 300.

 According to the inclusive method, an item equal to upper limit of a class is included in
that class itself. The following table demonstrates this method.

-----------------------------------------------------------
Income (Rs.)                          No. of Persons
-----------------------------------------------------------
200 – 249                              50
250 – 299                              100
300 – 349                              70
350 – 399                              130
400 – 449                              50
450 – 499                              100
-----------------------------------------------------------
                                        Total 500
-----------------------------------------------------------
Hence in the class 200 – 249, we include persons whose income is between Rs. 200
and Rs. 249.
Example: Construct a frequency distribution from the following data by the inclusive method, taking
4 as the class interval:
10 17 15 22 11 16 19 24 29 18
25 26 32 14 17 20 23 27 30 12
15 18 24 36 18 15 21 28 33 38
34 13 10 16 20 22 29 19 23 31
Solution: Because the minimum value of the variable is 10, which is a very convenient figure for
taking the lower limit of the first class, and the magnitude of the class interval is given to be 4,
the classes for preparing the frequency distribution by the inclusive method will be 10 – 13, 14 – 17,
18 – 21, 22 – 25, ..., 38 – 41.
Frequency Distribution

---------------------------------------------------------------
Class Interval     Tally Bars        Frequency (f)
---------------------------------------------------------------
10 – 13            |||||             5
14 – 17            ||||| |||         8
18 – 21            ||||| |||         8
22 – 25            ||||| ||          7
26 – 29            |||||             5
30 – 33            ||||              4
34 – 37            ||                2
38 – 41            |                 1
---------------------------------------------------------------
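The tallying for this example can be reproduced with a short Python sketch of the inclusive method for whole-number data. The function name inclusive_distribution and its parameters start and width are chosen for this sketch.

```python
data = [
    10, 17, 15, 22, 11, 16, 19, 24, 29, 18,
    25, 26, 32, 14, 17, 20, 23, 27, 30, 12,
    15, 18, 24, 36, 18, 15, 21, 28, 33, 38,
    34, 13, 10, 16, 20, 22, 29, 19, 23, 31,
]


def inclusive_distribution(values, start, width):
    """Tally whole-number values into classes start..start+width-1, and so on.

    Both limits belong to the class (inclusive method).
    Returns a list of ((lower, upper), frequency) pairs.
    """
    n_classes = (max(values) - start) // width + 1
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1
    return [
        ((start + k * width, start + (k + 1) * width - 1), counts[k])
        for k in range(n_classes)
    ]


table = inclusive_distribution(data, start=10, width=4)
for (lower, upper), f in table:
    print(f"{lower} - {upper}: {f}")
```

The printed frequencies are 5, 8, 8, 7, 5, 4, 2 and 1, which sum to the 40 observations.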
Example: Prepare a statistical table from the following :
Weekly wages (Rs.) of 100 workers of Factory A
---------------------------------------------------------------
88 23 27 28 86 96 94 93 86 99
82 24 24 55 88 99 55 86 82 36
96 39 26 54 87 100 56 84 83 46
102 48 27 26 29 100 59 83 84 48
104 46 30 29 40 101 60 89 46 49
106 33 36 30 40 103 70 90 49 50
104 36 37 40 40 106 72 94 50 60
24 39 49 46 66 107 76 96 46 67
26 78 50 44 43 46 79 99 36 68
29 67 56 99 93 48 80 102 32 51
---------------------------------------------------------------
Solution: The lowest value is 23 and the highest is 107. The difference between the lowest and
highest values is 84. If we take a class interval of 10, nine classes would be made. The first class
should be taken as 20 – 30 instead of 23 – 33 as per the guidelines of classification.

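Frequency Distribution of the Wages of 100 Workers

-----------------------------------------------------------------
Wages (Rs.)     Tally Bars                     Frequency (f)
-----------------------------------------------------------------
20 – 30         ||||| ||||| |||                13
30 – 40         ||||| ||||| |                  11
40 – 50         ||||| ||||| ||||| |||          18
50 – 60         ||||| |||||                    10
60 – 70         ||||| |                        6
70 – 80         |||||                          5
80 – 90         ||||| ||||| ||||               14
90 – 100        ||||| ||||| ||                 12
100 – 110       ||||| ||||| |                  11
-----------------------------------------------------------------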
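The exclusive method used for the wages table can be sketched in the same way. Here each class is treated as a half-open interval, so a wage equal to an upper limit (for example 30) is counted in the next class (30 – 40). The function name exclusive_distribution is chosen for this sketch.

```python
wages = [
    88, 23, 27, 28, 86, 96, 94, 93, 86, 99,
    82, 24, 24, 55, 88, 99, 55, 86, 82, 36,
    96, 39, 26, 54, 87, 100, 56, 84, 83, 46,
    102, 48, 27, 26, 29, 100, 59, 83, 84, 48,
    104, 46, 30, 29, 40, 101, 60, 89, 46, 49,
    106, 33, 36, 30, 40, 103, 70, 90, 49, 50,
    104, 36, 37, 40, 40, 106, 72, 94, 50, 60,
    24, 39, 49, 46, 66, 107, 76, 96, 46, 67,
    26, 78, 50, 44, 43, 46, 79, 99, 36, 68,
    29, 67, 56, 99, 93, 48, 80, 102, 32, 51,
]


def exclusive_distribution(values, start, width):
    """Tally values into classes lower <= value < upper (exclusive method).

    Returns a list of ((lower, upper), frequency) pairs.
    """
    n_classes = (max(values) - start) // width + 1
    counts = [0] * n_classes
    for v in values:
        counts[(v - start) // width] += 1
    return [
        ((start + k * width, start + (k + 1) * width), counts[k])
        for k in range(n_classes)
    ]


wage_table = exclusive_distribution(wages, start=20, width=10)
for (lower, upper), f in wage_table:
    print(f"{lower} - {upper}: {f}")
```

Starting at 20 with a class interval of 10 gives nine classes whose frequencies sum to 100.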
Measures of dispersion

The measures of central tendency, such as the mean, median, and mode, do not reveal the whole
picture of the distribution of a data set. Two data sets with the same mean may have completely
different spreads. The variation among the values of observations for one data set may be much
larger or smaller than for the other data set. (Note that the words dispersion, spread, and
variation have the same meaning.) Consider the following two data sets on the ages (in years) of
all workers working for each of two small companies.
Company 1: 47 38 35 40 36 45 39
Company 2: 70 33 18 52 27
The mean age of workers in both these companies is the same, 40 years. If we do not know the
ages of individual workers at these two companies and are told only that the mean age of the
workers at both companies is the same, we may deduce that the workers at these two companies
have a similar age distribution. As we can observe, however, the variation in the workers’ ages
for each of these two companies is very different. We need a measure that can provide some
information about the variation among data values. The measures that help us learn about the
spread of a data set are called the measures of dispersion. The measures of central tendency and
dispersion taken together give a better picture of a data set than the measures of central tendency
alone.
 Measures of dispersion for ungrouped data.
 Range
The range is the simplest measure of dispersion to calculate. It is obtained by taking the
difference between the largest and the smallest values in a data set. That is;
Range = Largest value -Smallest value
Example1:

Solution:
The maximum total area for a state in this data set is 267,277 square miles, and the smallest area
is 49,651 square miles. Therefore, the total areas of these four states are spread over a range of
217,626 square miles. The range, like the mean, has the disadvantage of being influenced by
outliers. In the example above, if the state of Texas with a total area of 267,277 square miles is
dropped, the range decreases from 217,626 square miles to 20,252 square miles. Consequently,
the range is not a good measure of dispersion to use for a data set that contains outliers. Another
disadvantage of using the range as a measure of dispersion is that its calculation is based on two
values only: the largest and the smallest. All other values in a data set are ignored when
calculating the range. Thus, the range is not a very satisfactory measure of dispersion.
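As a small illustration of the range formula, the sketch below applies it to the ages of the workers at the two companies introduced at the start of this section. The function name value_range is chosen for this sketch.

```python
def value_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


company_1 = [47, 38, 35, 40, 36, 45, 39]
company_2 = [70, 33, 18, 52, 27]

print(value_range(company_1))   # 47 - 35 = 12
print(value_range(company_2))   # 70 - 18 = 52
```

Both companies have a mean age of 40 years, but the range shows that the ages at Company 2 are far more spread out.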
 Variance and Standard Deviation
The standard deviation is the most-used measure of dispersion. The value of the standard
deviation tells how closely the values of a data set are clustered around the mean. In general, a
lower value of the standard deviation for a data set indicates that the values of that data set are
spread over a relatively smaller range around the mean. In contrast, a larger value of the standard
deviation for a data set indicates that the values of that data set are spread over a relatively larger
range around the mean.
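The following sketch computes the variance and standard deviation for the ages of the workers at the two companies introduced at the start of this section. Because each data set covers all workers of a company, the sketch assumes the population formula, variance = Σ(x − μ)² / N; for sample data the divisor would be n − 1 instead. The function name population_variance is chosen for this sketch.

```python
import math


def population_variance(values):
    """Mean of the squared deviations from the mean (divisor N)."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


company_1 = [47, 38, 35, 40, 36, 45, 39]
company_2 = [70, 33, 18, 52, 27]

var_1 = population_variance(company_1)   # 120 / 7
var_2 = population_variance(company_2)   # 1746 / 5
sd_1 = math.sqrt(var_1)
sd_2 = math.sqrt(var_2)

print(f"Company 1: variance = {var_1:.2f}, standard deviation = {sd_1:.2f}")
print(f"Company 2: variance = {var_2:.2f}, standard deviation = {sd_2:.2f}")
```

The much larger standard deviation for Company 2 reflects ages that lie further from the common mean of 40 years.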
Example1:
 Measure of dispersion for grouped data
 Range
For grouped data, the range is the difference between the upper limit of the highest class and the
lower limit of the lowest class.
Example: Find the range of the grouped data:
Class        10 - 14   15 - 19   20 - 24   25 - 29
Frequency    2         8         7         3

Solution:
Range = Upper limit of the highest class – lower limit of the lowest class
= 29 - 10 = 19
Range is 19.
 Variance and standard deviation
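Following are what we will call the basic formulas used to calculate the population and sample
variances for grouped data:

Population variance: σ² = Σf(m − μ)² / N
Sample variance: s² = Σf(m − x̄)² / (n − 1)

where m is the midpoint and f is the frequency of a class. The standard deviation is the positive
square root of the variance.
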
Example1:

Solution: All the information required for the calculation of the variance and standard deviation
appears in the table below.
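As an illustration of the grouped-data calculation, the sketch below uses the small frequency table from the grouped range example (classes 10 - 14 to 25 - 29). It assumes the standard grouped formulas, in which each class midpoint m is weighted by its frequency f, the population variance divides by N, and the sample variance divides by n − 1. All names are chosen for this sketch.

```python
import math

# (lower limit, upper limit, frequency) for each class
classes = [(10, 14, 2), (15, 19, 8), (20, 24, 7), (25, 29, 3)]

midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
freqs = [f for _, _, f in classes]
n = sum(freqs)

# Grouped mean: sum(m * f) / sum(f)
grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / n

# Sum of squared deviations of the midpoints, weighted by frequency
ss = sum(f * (m - grouped_mean) ** 2 for m, f in zip(midpoints, freqs))

population_variance = ss / n        # treats the table as a whole population
sample_variance = ss / (n - 1)      # treats the table as a sample

print(grouped_mean)                                   # 19.75
print(population_variance, math.sqrt(population_variance))
print(sample_variance, math.sqrt(sample_variance))
```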
Activity:

4. The frequency distribution of the lengths of 100 leaves from a certain species of plant is
given below:

length (mm) Frequency


20 – 24 6
25 – 29 10
30 – 34 18
35 – 39 25
40 – 44 22
45 – 49 15
50 – 54 4

5. The following table shows the distribution of heights of 50 students:

Height (cm) Frequency


160 – 164 8
165 – 169 12
170 – 174 14
175 – 179 7
180 – 184 6
185 – 189 3

Find the mean, median, mode, range, variance and standard deviation of heights.

CORRELATION AND REGRESSION ANALYSIS


In this section we will investigate the relationship between two continuous variables, such
as height and weight, the concentration of an injected drug and heart rate, or the consumption
level of some nutrient and weight gain. The tools used to explore this relationship are
regression and correlation analysis. These tools can be used to find out whether the outcome of one
variable depends on the value of the other variable, that is, whether one variable is dependent
on the other. Regression and correlation analysis can be used to describe the nature and
strength of the relationship between two continuous variables.
1. Correlation
Correlation is a measure of association between two variables. The variables are not designated
as dependent or independent. The two most popular correlation coefficients are: Spearman's
correlation coefficient rho and Pearson's product-moment correlation coefficient. When
calculating a correlation coefficient for ordinal data, select Spearman's technique. For interval or
ratio-type data, use Pearson's technique.
 The standard error of a correlation coefficient is used to determine the confidence interval
around a true correlation of zero. If your correlation coefficient falls outside of this range, then it
is significantly different from zero. The standard error can be calculated for interval or ratio-type
data (i.e., only for Pearson's product-moment correlation).

Methods of studying correlation

The following are the important methods of ascertaining whether two variables are correlated or
not:

 Scatter Diagram methods


 Karl Pearson’s coefficient of correlation
 Spearman’s Rank correlation coefficient
 Methods of least square (Regression line)

 Scatter diagram method

The first step in the investigation of the relationship between two continuous variables is a
scatter plot. Create a scatter plot for the two variables and evaluate the quality of the
relationship.

Example: Does the number of years invested in schooling pay off in the job market? Apparently
so – the better educated you are, the more money you will earn. The data in the following table
give the median annual income of full-time workers age 25 or older by the number of years of
schooling completed.
The scatter plot shows a strong, positive, linear association between years and salary.

Questions to be answered with the help of the scatter plot:

1. Does a relationship exist that can be described by a straight line (that is, is there a
linear relationship)?

2. Is there a relationship that is not linear?

3. Does the scatter plot look like a cloud? If so, there is no relationship between the two
variables and one would stop at this point.

 Karl Pearson coefficient of correlation

Pearson’s correlation coefficient (named after Karl Pearson, 1857-1936) is a number between -1
and 1 that measures the strength of a linear relationship between two continuous variables. The
absolute value of the coefficient measures how closely the variables are related. The closer it is
to 1 the closer the relationship. A correlation coefficient over 0.8 indicates a strong correlation
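between the variables.

r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² × Σ(y − ȳ)²]
OR
r² = b_yx × b_xy

where b_yx and b_xy are the slopes of the regression lines of y on x and of x on y.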
 The value of a correlation coefficient can vary from minus one to plus one. A minus one
indicates a perfect negative correlation, while a plus one indicates a perfect positive
correlation.
 A correlation of zero means there is no relationship between the two variables.
 When there is a negative correlation between two variables, as the value of one variable
increases, the value of the other variable decreases, and vice versa.
Example 1:

The Pearson correlation coefficient of years of schooling and salary is r = 0.9942. A correlation of
0.9942 is very high and shows a strong, positive, linear association between years of schooling
and the salary.
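A Pearson coefficient can be computed directly from its definition, r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² × Σ(y − ȳ)²]. The sketch below is a minimal illustration; the years and income figures are hypothetical values made up for the sketch and are not the data of the schooling example. The function name pearson_r is also chosen here.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: years of schooling and income (in thousands).
years = [8, 10, 12, 14, 16]
income = [18, 21, 27, 30, 38]

r = pearson_r(years, income)
print(f"r = {r:.4f}")

# Perfect positive and perfect negative linear relationships.
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [6, 4, 2]))
```

The last two calls illustrate the limits of the coefficient: a perfectly increasing straight line gives a value of plus one and a perfectly decreasing one gives minus one, up to floating-point rounding.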

2. Regression
 Simple regression is used to examine the relationship between one dependent and one
independent variable. After performing an analysis, the regression statistics can be used
to predict the dependent variable when the independent variable is known.
The regression line (known as the least squares line) is a plot of the expected value of the
dependent variable for all values of the independent variable. Technically, it is the line that
"minimizes the squared residuals". The regression line is the one that best fits the data on a
scatter plot.
 Using the regression equation, the dependent variable may be predicted from the
independent variable. In the example above we might want to predict the expected salary
for different times of schooling, or calculate the increase in salary for every year of
schooling. For this purpose we can do a regression analysis.
The slope of the regression line (b) is defined as the rise divided by the run. The y intercept (a) is
the point on the y axis where the regression line would intercept the y axis. The slope and y
intercept are incorporated into the regression equation. The intercept is usually called the
constant, and the slope is referred to as the coefficient. Since the regression model is usually not
a perfect predictor, there is also an error term in the equation.
 In the regression equation, y is always the dependent variable and x is always the
independent variable.
Here is a way to mathematically describe a linear regression model: y = a + bx + e
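The intercept a and slope b of the least squares line can be computed with the usual formulas b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. The sketch below is a minimal illustration on hypothetical data; the function name least_squares_line and the numbers are assumptions made for the sketch.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the fitted line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                 # slope (the coefficient)
    a = mean_y - b * mean_x       # y intercept (the constant)
    return a, b


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = least_squares_line(xs, ys)
print(f"y = {a:.2f} + {b:.2f}x")

# The error term e is what is left over for each observation.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print([round(e, 2) for e in residuals])

# Predict the dependent variable for a new value of x.
print(a + b * 6)
```

For these numbers the fitted line is y = 2.2 + 0.6x, and the residuals sum to zero apart from rounding, as they do for any least squares line with an intercept.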

 The coefficient of determination (r-squared) is the square of the correlation coefficient. Its
value may vary from zero to one. It has the advantage over the correlation coefficient in that it
may be interpreted directly as the proportion of variance in the dependent variable that can
be accounted for by the regression equation. For example, an r-squared value of 0.49 means
that 49% of the variance in the dependent variable can be explained by the regression equation.
The other 51% is unexplained.
 The standard error of the estimate for regression measures the amount of variability in
the points around the regression line. It is the standard deviation of the data points as they are
distributed around the regression line. The standard error of the estimate can be used to develop
confidence intervals around a prediction.
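Both quantities can be obtained from the residuals of a fitted line. The sketch below is a minimal illustration on hypothetical data. It computes r-squared as 1 − SSE/SST and, as an assumption not stated in the text, uses the common convention of dividing the squared residuals by n − 2 for the standard error of the estimate in simple regression. The function name regression_summary is chosen for the sketch.

```python
import math


def regression_summary(xs, ys):
    """Fit y = a + b*x by least squares and return (a, b, r_squared, std_error)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x

    # SSE: variation left unexplained by the line; SST: total variation in y.
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)

    r_squared = 1 - sse / sst
    std_error = math.sqrt(sse / (n - 2))   # assumed convention: n - 2 divisor
    return a, b, r_squared, std_error


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b, r_squared, std_error = regression_summary(xs, ys)
print(f"r-squared = {r_squared:.2f}, standard error of estimate = {std_error:.3f}")
```

For these numbers r-squared is 0.60, so 60% of the variance in y is accounted for by the regression equation and the other 40% is unexplained.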
