DATA ANALYSIS AND PRESENTATION

TOOLS
The tools that will eventually be used for analyzing data are selected during the research planning stage
and not after data has been collected. The preliminary plan for data analysis helps the researcher in
deciding on the following:
i. What to do when
ii. How questionnaires will be checked for completeness
iii. Action to take after the checking – e.g. reject some questionnaires or questions (e.g. due to
omission of some issues, etc.)
iv. Editing the questionnaires – identifying illegible questions, incomplete responses,
unsatisfactory responses, ambiguous responses etc.
v. Options to pursue after rejection: - e.g. returning to the field, assigning missing values,
discarding the unsatisfactory questionnaire, excluding the questions with unsatisfactory
responses from analysis.
vi. Statistical methods to apply in the analysis.
vii. Procedure to follow in analysis – coding scheme, code sheet, tabulation.

Statistics and Data Analysis


Statistics is the science of numbers, a discipline that provides the tools of analysis in research. It comprises the
techniques used in collecting, describing, analyzing, summarizing and drawing conclusions from data.
For our purpose, we shall define statistics as the collection, analysis, and interpretation of numerical data.
Data in this case are the figures collected, or a series of observations.

Types of Statistical Methods


There are two types of statistics - descriptive and mathematical/inductive statistics.

1. Descriptive Statistics
Descriptive statistics deal with the compilation and presentation of data in various forms, e.g. tables, charts, and diagrams. This is
done to display and pass on information from which conclusions and recommendations can be made.

Functions of Descriptive Statistics


a. Measuring
Since concepts (characteristics) have theoretical definitions, it is important that we define them
operationally. Measurement is defining a concept operationally in tangible terms. Any data set has
concepts that need to be defined operationally to make them measurable, e.g. a microfinance client may
be defined operationally as “a duly registered person, saving consistently, attending group meetings…”
A variable may be measured at different levels: nominal, ordinal, interval and ratio.

Nominal measures
These name or label a characteristic, e.g. gender, type of client, etc. From these nominal variables we could have the category
labels “male & female” and “start-up & expansion” respectively. It then becomes possible to group
the population into homogeneous categories and provide a count of each category. Furthermore, it
becomes possible to compare two or more homogeneous categories of nominal values.

Accessed a loan?    Type of client
                    Existing    Start up
Yes                 55          60
No                  65          80
TOTAL*              120         120
* Assuming a sample of 120 clients

Ordinal measures
Here, each individual or unit has a position in a numbered order. e.g.
1. Never
2. Sometimes
3. Always
Each category has a number as a label

Interval measures
Here it is possible to rank the individuals or units and also measure the distance between them.
Therefore, we need a physical unit of measure.
e.g. the no. of loans accessed

Ratio measures
A variable is taken to be a ratio level if the scale of values assumed by the variable includes an
absolute zero value, i.e. 0 – 100%

b. Quantifying
In describing a certain characteristic from a set of data, it is necessary to represent its value in terms of
quantities or numerical values. e.g. we can assign male (1) and female (2). This is called coding.
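
The coding step described above can be sketched in a few lines of Python. The code scheme follows the text (male = 1, female = 2); the list of responses is invented purely for illustration.

```python
from collections import Counter

# Coding scheme from the text: male -> 1, female -> 2.
GENDER_CODES = {"male": 1, "female": 2}

# Hypothetical raw responses, invented for illustration only.
responses = ["female", "male", "female", "female", "male"]

# Replace each label with its numeric code, then count each code.
coded = [GENDER_CODES[r] for r in responses]
counts = Counter(coded)

print(coded)                 # [2, 1, 2, 2, 1]
print(counts[1], counts[2])  # 2 3
```

The codes are labels only; for a nominal variable such as gender, the numbers 1 and 2 carry no order or magnitude.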

c. Organizing data
Data should be organized in a way that makes it easy for the mind to absorb and understand. To organize
data is to arrange them into a pre-set format. This ultimately helps in describing the situation through
numbers and representing it through diagrams.

The number of times a value of a variable occurs is its frequency.

Simple frequency distribution


Example: The number of visits by a branch manager to 10 credit groups is given as:
8, 6, 5, 5, 7, 4, 5, 7, 9, 4
i. First arrange them in ascending (or descending) order:
4, 4, 5, 5, 5, 6, 7, 7, 8, 9
ii. Then put them in tabular form

No of visits (x)    Tally    No of groups
[variable]                   (frequency = f)
4                   II       2
5                   III      3
6                   I        1
7                   II       2
8                   I        1
9                   I        1

iii. Plot the frequency distribution

Table 1: Simple frequency distribution

x    f
4    2
5    3
6    1
7    2
8    1
9    1

The researcher is thus able to see how the numbers of visits are distributed.
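
As a rough sketch, the same simple frequency distribution can be built with Python's standard `collections.Counter`. The data are the ten visit counts from the example; the tally column is drawn with the letter I.

```python
from collections import Counter

visits = [8, 6, 5, 5, 7, 4, 5, 7, 9, 4]  # data from the example above

# Count how often each value occurs, then list the values in ascending order.
frequency = Counter(visits)
table = sorted(frequency.items())

for x, f in table:
    print(f"{x:>2}  {'I' * f:<5} {f}")
```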

Group Frequency Distribution


Sometimes the spread of the data is too large for a simple frequency distribution to be practical.

Example: (These could be no. of products [variable] stocked by 100 small scale retail stores)
81 85 62 71 70 81 86 67 96 51
63 71 75 69 48 34 87 86 73 75
42 91 58 93 52 82 90 95 82 72
53 38 77 93 85 47 70 68 57 71
96 40 70 92 68 88 58 51 90 74
52 63 96 77 83 76 48 92 81 83
92 73 84 78 78 72 60 84 78 60
43 70 83 64 96 93 55 73 58 40
88 96 72 53 87 92 73 77 63 58
71 80 38 63 56 76 82 61 76 63
i. You may first arrange the values in ascending (or descending) order to identify the
smallest and largest values.
ii. Condense the values by allocating them into CLASSES. (We can use groups of 10,
e.g. 1 – 10, 11 – 20, etc. Each group is called a class.)
iii. Use a tally mark to place each value in its corresponding class.
iv. The frequency distribution is constructed as shown below:

Table 2: Group frequency distribution

No of products    Tally                            No of stores
(variable = x)                                     (frequency = f)
31 – 40           IIIII                            5
41 – 50           IIIII                            5
51 – 60           IIIII IIIII IIIII                15
61 – 70           IIIII IIIII IIIII I              16
71 – 80           IIIII IIIII IIIII IIIII IIII     24
81 – 90           IIIII IIIII IIIII IIIII I        21
91 – 100          IIIII IIIII IIII                 14
Total                                              100

NOTE:
In constructing a group frequency distribution, it is important to calculate the class interval.
Class interval = upper class boundary less lower class boundary.
To get the lower boundary: lower class limit less 0.5
= 31 – 0.5 = 30.5
To get the upper boundary: upper class limit plus 0.5
= 40 + 0.5 = 40.5

Therefore the class interval = 40.5 – 30.5 = 10
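
The grouping steps and the class-interval calculation can be sketched as follows. The sketch tallies the 100 values listed in the example into classes of ten; `class_limits` is a hypothetical helper written for this illustration, not part of any library.

```python
# The 100 values listed in the example above.
products = [
    81, 85, 62, 71, 70, 81, 86, 67, 96, 51,
    63, 71, 75, 69, 48, 34, 87, 86, 73, 75,
    42, 91, 58, 93, 52, 82, 90, 95, 82, 72,
    53, 38, 77, 93, 85, 47, 70, 68, 57, 71,
    96, 40, 70, 92, 68, 88, 58, 51, 90, 74,
    52, 63, 96, 77, 83, 76, 48, 92, 81, 83,
    92, 73, 84, 78, 78, 72, 60, 84, 78, 60,
    43, 70, 83, 64, 96, 93, 55, 73, 58, 40,
    88, 96, 72, 53, 87, 92, 73, 77, 63, 58,
    71, 80, 38, 63, 56, 76, 82, 61, 76, 63,
]

WIDTH = 10  # groups of 10, as suggested in the text


def class_limits(value, width=WIDTH):
    """Return the (lower, upper) class limits, e.g. 34 -> (31, 40)."""
    lower = ((value - 1) // width) * width + 1
    return lower, lower + width - 1


# Tally each value into its class.
counts = {}
for value in products:
    limits = class_limits(value)
    counts[limits] = counts.get(limits, 0) + 1

for (lower, upper), f in sorted(counts.items()):
    # Class boundaries sit 0.5 beyond the class limits.
    interval = (upper + 0.5) - (lower - 0.5)
    print(f"{lower:>3} - {upper:<3}  f = {f:>2}  class interval = {interval:g}")

print("Total", sum(counts.values()))
```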

d) Displaying Data
Descriptive statistics also help us in displaying data. We can use tools like histograms, bar charts, cartoons (pictograms),
pie charts, frequency curves, etc. to display data.

i) Cartoons
One can choose cartoons to display data. For example:

[Figure: cartoon symbols for Men and Women; 1 symbol represents 1,000]

ii) Bar Charts (Discrete Data)


This is useful for discrete data.

[Figure: bar chart. Y-axis: frequency; X-axis: score or observation (0, 1, 2, 3, 4)]
Note
i. All bars are of equal width
ii. The height of each bar represents the frequency (sometimes percentage)
iii. The bars are separated to show the fact that the data is discrete
iv. Axes are scaled or labeled and the bar chart titled
v. The first bar can touch the y-axis.

iii) Histogram (continuous data)


This is suitable for continuous data, e.g. age, weight, etc.
i. The bars touch each other to indicate the continuous nature of the data.
ii. The score is not indicated in the middle but at the beginning of the bar.

[Figure: histogram. Y-axis: frequency; X-axis: observation/score]

iv) Pie Chart


A pie chart is circular. The angle of each sector is proportional to the corresponding frequency.

Example:
The following diagram shows how different regions share the total loan amount disbursed by an MFI.

[Figure: pie chart with four sectors: A (angle not labelled), B = 80°, C = 70°, D = 120°]

If the total loan disbursed to region A was Ksh 600,000/=, calculate:


i. The total loans disbursed, and
ii. The loan amount disbursed to each of the other regions
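
One way to approach this exercise is sketched below. It assumes the sector labels in the diagram are angles in degrees (B = 80, C = 70, D = 120), so that region A takes the remainder of the 360-degree circle; that reading of the diagram is an assumption.

```python
# Assumed sector angles in degrees; region A is the remainder of the circle.
angles = {"B": 80, "C": 70, "D": 120}
angles["A"] = 360 - sum(angles.values())

loan_a = 600_000  # Ksh disbursed to region A, given in the question

# The amount is proportional to the angle, so one degree is worth loan_a / angle of A.
per_degree = loan_a / angles["A"]
total = per_degree * 360
loans = {region: per_degree * angle for region, angle in angles.items()}

print(f"Total: Ksh {total:,.0f}")
for region in "ABCD":
    print(f"{region}: {angles[region]} degrees -> Ksh {loans[region]:,.0f}")
```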

2. MATHEMATICAL/INDUCTIVE/INFERENTIAL STATISTICS
This is concerned with extending beyond particular information available and attempting to make general
predictions. They are measures that enhance the understanding of data. The important statistical measures
used are:
• Measures of central tendency (averages)
• Measures of dispersion
• Measures of correlation
• Measures of association

i. MEASURES OF CENTRAL TENDENCY
Collected data is of little use until it is organized and summarized. A measure of central
tendency is a summary measure that represents the value at the centre of a distribution of
values.

a) Mean (Arithmetic mean)


This measure is the arithmetical average of a distribution of values.
The arithmetic mean (AM) is denoted by $\bar{x}$, where x represents a variable with a distribution of n values
$x_1, x_2, \ldots, x_n$. It is worked out as:

$$\bar{x} = \frac{\text{sum of observations}}{\text{number of observations}} = \frac{\sum x}{n}$$

Example:
Mean of raw data: Suppose you collected the following data on the number of members of 5 credit groups:
30, 26, 20, 29, 20
The average number of members per group (mean) is:

$$\bar{x} = \frac{30 + 26 + 20 + 29 + 20}{5} = \frac{125}{5} = 25$$

Mean of a frequency distribution


In case of a large distribution of values, the distribution of variable x takes the form of a frequency
distribution. Then you use a formula to calculate the mean:

$$\bar{x} = \frac{\sum fx}{\sum f}$$

Example:
The following data shows the visits by a credit officer to 15 credit groups:
4, 1, 3, 6, 7, 2, 2, 8, 2, 7, 2, 5, 5, 9, 5

No of visits      Frequency (groups    fx
(value of x)      visited, f)
1                 1                    1
2                 4                    8
3                 1                    3
4                 1                    4
5                 3                    15
6                 1                    6
7                 2                    14
8                 1                    8
9                 1                    9
TOTAL             ∑f = 15              ∑fx = 68

Substituting:

$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{68}{15} \approx 4.5$$
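
A minimal Python sketch of the same calculation: it builds the frequency table from the 15 raw values and then applies ∑fx / ∑f.

```python
from collections import Counter

visits = [4, 1, 3, 6, 7, 2, 2, 8, 2, 7, 2, 5, 5, 9, 5]

frequency = Counter(visits)                        # maps each x to its f
sum_f = sum(frequency.values())                    # sum of f
sum_fx = sum(x * f for x, f in frequency.items())  # sum of f * x

mean = sum_fx / sum_f
print(sum_f, sum_fx, round(mean, 1))  # 15 68 4.5
```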

Mean of a grouped distribution


If the data is grouped, we modify the frequency distribution table to work out the mean.

Example: If the data provided is as follows:

No of visits (x)    Frequency (groups visited, f)
1 – 3               6
4 – 6               5
7 – 9               4
TOTAL               15

NB: We cannot work out fx because x is not a single figure. We therefore need to work out the class
midpoints.
E.g. for class 1 – 3, the class midpoint will be (1 + 3)/2 = 2

No of visits (x)    Class midpoint    Frequency (groups visited, f)    fx
1 – 3               2                 6                                12
4 – 6               5                 5                                25
7 – 9               8                 4                                32
TOTAL                                 ∑f = 15                          ∑fx = 69

$$\bar{x} = \frac{\sum fx}{\sum f} = \frac{69}{15} = 4.6$$
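
The grouped calculation can be sketched the same way, with each class represented by its midpoint. The class limits and frequencies are those of the example.

```python
# (lower limit, upper limit, frequency) for each class in the example.
classes = [(1, 3, 6), (4, 6, 5), (7, 9, 4)]

sum_f = 0
sum_fx = 0
for lower, upper, f in classes:
    midpoint = (lower + upper) / 2  # stands in for every value in the class
    sum_f += f
    sum_fx += f * midpoint

grouped_mean = sum_fx / sum_f
print(sum_fx, sum_f, round(grouped_mean, 1))  # 69.0 15 4.6
```

The result (4.6) differs slightly from the ungrouped mean (4.5) because grouping replaces the actual values with class midpoints.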

b) Mode
• The value that occurs most often. It is therefore the value with the highest frequency.

Example: What is the mode of the following data?
4, 1, 3, 6, 7, 2, 2, 8, 2, 7, 2, 5, 5, 9, 5

• It is a measure of central tendency quite commonly used by research students.

NOTE:
• For a grouped frequency distribution, the mode is the mid-point of the class with the highest
frequency.
• If a distribution has the same frequency for all values, such a distribution has no mode.
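
A short sketch of finding the mode of the example data. It also applies the rule in the note that a distribution in which every value has the same frequency has no mode.

```python
from collections import Counter

visits = [4, 1, 3, 6, 7, 2, 2, 8, 2, 7, 2, 5, 5, 9, 5]

frequency = Counter(visits)
highest = max(frequency.values())

# Collect every value that reaches the highest frequency.
modes = sorted(x for x, f in frequency.items() if f == highest)

# If every value is equally frequent, treat the data as having no mode.
if len(modes) == len(frequency) and len(frequency) > 1:
    modes = []

print(modes, highest)  # [2] 4
```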

c) Median:
The middle value when all values are arranged in order of size. It divides the distribution into two
halves, with an equal number of values on either side.

Example: What is the median of this data?
4, 1, 3, 6, 7, 2, 2, 8, 2, 7, 2, 5, 5, 9, 5

• Re-arrange: 1, 2, 2, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7, 8, 9
• If the total number of values is odd, take the middle value as the median. Here there are 15 values, so the median is the 8th value, which is 5.
• If it is even, calculate the average of the two middle numbers.
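
The two rules above (odd and even number of values) can be sketched as a small function. The function name `median` is chosen for this illustration.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


visits = [4, 1, 3, 6, 7, 2, 2, 8, 2, 7, 2, 5, 5, 9, 5]
print(median(visits))        # 5
print(median([1, 2, 3, 4]))  # 2.5
```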

Median of ungrouped frequency distribution


• Write in columns:
i. The values of the variable (x)
ii. The frequencies
• Have a column for the cumulative frequency, e.g. for the frequency distribution of visits:

Score / no of     Frequency (f)    Cu. f (F)
visits (x)
1                 1                1
2                 4                5
3                 1                6
4                 1                7
5                 3                10
6                 1                11
7                 2                13
8                 1                14
9                 1                15
Total             ∑f = 15

• Locate the middle data item by n/2, where n is the total number of observations (∑f): n/2 = 15/2 = 7.5
• Accumulate the frequencies until they reach or exceed 7.5 for the first time.
• Choose the median as the value of x corresponding to the first value of F to reach or exceed 7.5. This is
10, and the x value that corresponds to it is 5. Therefore the median = 5
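
The cumulative-frequency procedure can be sketched with `itertools.accumulate` from the standard library. The sketch follows the rule in the text: take the first x whose cumulative frequency reaches n/2.

```python
from itertools import accumulate

# Frequency table from the example: (value of x, frequency f).
table = [(1, 1), (2, 4), (3, 1), (4, 1), (5, 3), (6, 1), (7, 2), (8, 1), (9, 1)]

values = [x for x, _ in table]
cumulative = list(accumulate(f for _, f in table))  # the Cu. f column
half = cumulative[-1] / 2                           # n/2 = 7.5

# The median is the first x whose cumulative frequency reaches n/2.
median = next(x for x, F in zip(values, cumulative) if F >= half)

print(cumulative)  # [1, 5, 6, 7, 10, 11, 13, 14, 15]
print(median)      # 5
```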

Median of grouped distribution
Example

No of visits (x)    Frequency (f)    Cu. f (F)
1 – 3               6                6
4 – 6               5                11
7 – 9               4                15
                    ∑f = 15

• n/2 = 15/2 = 7.5. The value 7.5 lies in the class with the cu. f of 11 [this is 4 – 6], which is the median class.
• Then use the formula:

$$\text{Median} = L + \frac{\left(\tfrac{1}{2}F - C\right) w}{N}$$

Where:
L = the lower class boundary of the median class
F = total frequency
C = cumulative frequency up to but not including the median class
w = class interval
N = the frequency of the median class

Therefore:

$$\text{Median} = 3.5 + \frac{\left(\tfrac{1}{2} \times 15 - 6\right) \times 3}{5} = 3.5 + \frac{4.5}{5} = 3.5 + 0.9 = 4.4$$
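
For comparison, here is a sketch of the standard interpolation formula for a grouped median, which measures from the lower class boundary of the median class and divides by that class's own frequency. The function name `grouped_median` is chosen for this illustration, and class boundaries are assumed to lie 0.5 beyond the class limits.

```python
# (lower limit, upper limit, frequency) per class, from the example table.
classes = [(1, 3, 6), (4, 6, 5), (7, 9, 4)]


def grouped_median(classes):
    total = sum(f for _, _, f in classes)
    half = total / 2
    cumulative_before = 0  # C: cumulative frequency below the current class
    for lower, upper, f in classes:
        if cumulative_before + f >= half:
            lower_boundary = lower - 0.5            # L
            width = (upper + 0.5) - lower_boundary  # w
            return lower_boundary + (half - cumulative_before) * width / f
        cumulative_before += f
    raise ValueError("no classes given")


print(round(grouped_median(classes), 1))  # 4.4
```

Under this formula the result always falls between the boundaries of the median class (3.5 to 6.5 here).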

Measures of Dispersion
A measure of dispersion is an aggregate measure of deviation from a central value. The most commonly
used measures are the:
i. Range
ii. Mean deviation
iii. Standard deviation

The Range
The difference between the highest value and the lowest value in the distribution.

Example: Given the following sets of data, calculate the range.

Set A: 1, 2, 3, 4, 5, 6, 7, 8, 9
Set B: 3, 4, 4, 4, 5, 5, 6, 6, 7

Answer:
Set A: 9 – 1 = 8
Set B: 7 – 3 = 4

This shows that the values in set A are spread out more than those in set B.

Variance and Standard Deviation

Deviation is the difference between the value of an observation and the mean, i.e. $(x - \bar{x})$.
The deviation can be from the mean, mode or median. In most cases it is from the mean. (Where you are
not told, assume it is in reference to the mean)

In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values.
A low standard deviation indicates that the values tend to be close to the mean of the set, while a high
standard deviation indicates that the values are spread out over a wider range. It is a useful measure of
spread for normal distributions.

Illustration:
Given the data 1, 2, 2, 6, 9, determine the mean and deviation.
a) Mean = 20/5 = 4
b) Deviation (how far each value is from the mean):

$x$    $\bar{x}$    $x - \bar{x}$
1      4            -3
2      4            -2
2      4            -2
6      4            2
9      4            5
                    $\sum (x - \bar{x}) = 0$

To get the deviation for the whole distribution:

$$\frac{\sum (x - \bar{x})}{n} = \frac{0}{5} = 0$$
To avoid getting zero (0), which does not tell us anything, we can either:
i. Use absolute values $|x - \bar{x}|$, or
ii. Square the differences and sum them

Using the squaring method:

The magnitudes of the deviations will change, e.g. -3 becomes bigger (9). [If any of the values $(x - \bar{x})$
were less than 1 in magnitude, the deviation would become smaller, e.g. 0.4, after
squaring, would become 0.16.]
We then divide the sum of the squared deviations by the number of observations. The result is the variance.

Formula:

$$\text{Variance} = \sigma^2 = \frac{\sum (x - \bar{x})^2}{n}$$

$x$    $\bar{x}$    $x - \bar{x}$    $(x - \bar{x})^2$
1      4            -3               9
2      4            -2               4
2      4            -2               4
6      4            2                4
9      4            5                25
                                     $\sum (x - \bar{x})^2 = 46$

$$\sigma^2 = \frac{46}{5} = 9.2$$

If we take the square root of the variance (i.e. $\sqrt{\sigma^2}$), the result is the standard deviation. Therefore:

$$\text{Standard deviation} = \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} = \sqrt{9.2} \approx 3.03$$
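
The same steps can be sketched in Python: deviations from the mean, their squares, the variance (dividing by n, as in the text) and the standard deviation.

```python
import math

data = [1, 2, 2, 6, 9]
n = len(data)
mean = sum(data) / n                    # 4.0

deviations = [x - mean for x in data]   # these always sum to zero
squared = [d ** 2 for d in deviations]  # 9, 4, 4, 4, 25

variance = sum(squared) / n             # divides by n, as in the text
std_dev = math.sqrt(variance)

print(sum(deviations), variance, round(std_dev, 2))  # 0.0 9.2 3.03
```

Python's standard library offers `statistics.pvariance` and `statistics.pstdev`, which also divide by n.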

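NOTE: A shorter (and simpler) method

$$\sigma^2 = \frac{\sum (x - \bar{x})^2}{n}$$

Expanding the equation (and using $\sum x / n = \bar{x}$):

$$\sigma^2 = \frac{\sum (x^2 - 2x\bar{x} + \bar{x}^2)}{n} = \frac{\sum x^2 - 2\bar{x}\sum x + n\bar{x}^2}{n} = \frac{\sum x^2}{n} - 2\bar{x}^2 + \bar{x}^2 = \frac{\sum x^2}{n} - \bar{x}^2$$

Thus using the formula and the data given:

$x$    $x^2$
1      1
2      4
2      4
6      36
9      81

$\sum x^2 = 126$ and $\bar{x}^2 = 4^2 = 16$

Substituting:

$$\sigma^2 = \frac{\sum x^2}{n} - \bar{x}^2 = \frac{126}{5} - 16 = 25.2 - 16 = 9.2$$

$$\text{Standard deviation} = \sqrt{\sigma^2} = \sqrt{9.2} \approx 3.03$$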
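
As a quick check, the sketch below computes the variance both ways, by the shortcut (mean of the squares less the square of the mean) and by the definition, and confirms that they agree for the data given.

```python
import math

data = [1, 2, 2, 6, 9]
n = len(data)

mean = sum(data) / n                       # 4.0
sum_of_squares = sum(x * x for x in data)  # 126

# Shortcut: mean of the squares minus the square of the mean.
variance_short = sum_of_squares / n - mean ** 2

# Definition: mean squared deviation from the mean.
variance_def = sum((x - mean) ** 2 for x in data) / n

print(round(variance_short, 2), round(variance_def, 2))  # 9.2 9.2
print(round(math.sqrt(variance_short), 2))               # 3.03
```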

Measures of Association
a) Rank Correlation Coefficient (Rs)
The rank correlation investigates the presence or absence of association between variables. Moreover, it
measures the strength or degree of relationship between variables.
Assumptions
i. The data consists of a random sample of n pairs. Each pair of observation represents two
measurements taken on the same object or individual called the unit of association.
ii. Each x (observation) is ranked relative to all other observed values of X (variable e.g. age) from
smallest to largest or the largest to smallest in order of magnitude.
E.g. (ranking from the largest to the smallest)

X    Rank
3    4
6    1
4    3
5    2

iii. Each y (observation) is ranked relative to all other observed values of Y (variable e.g. height) from
the smallest to the largest or the largest to the smallest in order of magnitude.
iv. If ties occur among the x’s or among the y’s, each tied value is assigned the mean of the rank
positions for which it is tied.

Example 1
X    Rank    Positions
3    1       1
6    5       5
4    2       2.5
4    3       2.5
5    4       4
For tie ups = (2 + 3)/2 = 2.5

Example 2
X    Rank    Positions
3    1       1
6    5       5
4    2       3
4    3       3
4    4       3
For tie ups = (2 + 3 + 4)/3 = 3
v. If the data consists of non-numeric observations, they must be capable of being ranked as
described above.
E.g. above average, average, poor, etc
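
The tie rule in assumption iv can be sketched as a small function that ranks from smallest to largest and gives tied values the mean of the positions they occupy. The name `average_ranks` is chosen for this illustration.

```python
def average_ranks(values):
    """Rank from smallest (1) to largest; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + 1 + end + 1) / 2  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


print(average_ranks([3, 6, 4, 4, 5]))  # [1.0, 5.0, 2.5, 2.5, 4.0]
print(average_ranks([3, 6, 4, 4, 4]))  # [1.0, 5.0, 3.0, 3.0, 3.0]
```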
Formula:

$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$$

Where $d_i$ is the difference between the ranks assigned to $x_i$ and $y_i$, and n is the number of observations.

Example:
The following are the number of hours which 10 clients with loans spent in the business per day and the
number of different customers that they served. Calculate rs.

No of hrs    No of diff
X            customers Y
8            56
5            44
11           79
13           72
10           70
5            54
18           94
15           85
2            33
8            65
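Solution:
Ranking each variable from the largest (rank 1) to the smallest (rank 10):

X     Y     Rank of xi    Rank of yi    Rank xi – rank yi (di)    di²
8     56    6.5           7             -0.5                      0.25
5     44    8.5           9             -0.5                      0.25
11    79    4             3             1                         1
13    72    3             4             -1                        1
10    70    5             5             0                         0
5     54    8.5           8             0.5                       0.25
18    94    1             1             0                         0
15    85    2             2             0                         0
2     33    10            10            0                         0
8     65    6.5           6             0.5                       0.25
n = 10                                                            ∑di² = 3

Substituting:

$$r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \frac{6(3)}{10(100 - 1)} = 1 - \frac{18}{990} \approx 0.98$$

A value of 0.98 means that the two variables are highly correlated.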
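
The whole calculation can be sketched in Python using the rank-difference formula given in the text. Both variables are ranked from the largest (rank 1) to the smallest, with tied values sharing the mean of their positions; the function names are chosen for this illustration.

```python
def descending_ranks(values):
    """Rank 1 = largest; tied values share the mean of their positions."""
    ranks = []
    for v in values:
        greater = sum(1 for other in values if other > v)
        equal = sum(1 for other in values if other == v)  # includes v itself
        ranks.append(greater + (equal + 1) / 2)
    return ranks


def spearman_rs(xs, ys):
    """r_s = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = [(rx - ry) ** 2
                 for rx, ry in zip(descending_ranks(xs), descending_ranks(ys))]
    return 1 - 6 * sum(d_squared) / (n * (n ** 2 - 1))


hours = [8, 5, 11, 13, 10, 5, 18, 15, 2, 8]
customers = [56, 44, 79, 72, 70, 54, 94, 85, 33, 65]

print(descending_ranks(hours))
print(round(spearman_rs(hours, customers), 2))  # 0.98
```

Ranking both variables from smallest to largest instead would give the same value, since only the differences between ranks enter the formula.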
Note:
rs ranges from -1 to +1, i.e. $-1 \le r_s \le 1$

If the calculated value of rs tends towards 0, it shows that there is little or no correlation; if it tends towards +1 (or -1), there is a
strong positive (or negative) correlation. We can go ahead and test whether the above finding is actually true statistically. For a layman,
0.98 means that the two variables are highly correlated. A statistician/researcher would like to go further
and test whether there is or there isn’t a correlation between them (hypothesis testing).
