
DATA ANALYSIS IN RESEARCH

Dr. E. Mushi
22ND – 26TH May, 2023
Data Analysis
In this session we shall cover:
1. Quantitative Data Analysis
2. Qualitative Data Analysis
QUANTITATIVE DATA ANALYSIS
 In quantitative research, data analysis begins where
data collection ends, that is, when the instruments of data
collection (e.g. questionnaires) have been completed.
 Quantitative analysis includes:
 (a) data preparation for computer entry,
 (b) entering the data in the computer,
 (c) processing and analysing the data,
 (d) data presentation,
 (e) interpreting the findings and
 (f) drawing conclusions
Data preparation
 Data preparation first involves checking, editing and coding.
 Information collected should be checked and edited so that it
is clear, legible, relevant and appropriate.
 Coding is the process of converting verbal responses to
numerical codes.
 For instance, ‘Male’ may be given the code 1, and ‘Female’
code 2. Coding can be performed before data collection
(pre-coding) or after it (post-coding).
 The numbers expressing the codes (e.g. 1 and 2) are called
values, and the concepts they represent (e.g. ‘Male’ and
‘Female’) are called the value labels.
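The coding step above can be sketched in a few lines of Python. This is only an illustration: the codebook names (`GENDER_LABELS`, `GENDER_CODES`), the helper `code_responses` and the sample answers are assumptions made for the example, not part of the session material.

```python
# Hypothetical codebook: numerical codes (values) mapped to their value labels.
GENDER_LABELS = {1: "Male", 2: "Female"}

# Invert the codebook so that verbal responses can be converted to codes.
GENDER_CODES = {label: value for value, label in GENDER_LABELS.items()}


def code_responses(responses):
    """Convert verbal responses to numerical codes; unknown answers become None."""
    return [GENDER_CODES.get(answer.strip().title()) for answer in responses]


raw_answers = ["Male", "female", "Female", "MALE"]  # invented sample answers
coded = code_responses(raw_answers)
print(coded)
```

Keeping the value labels next to the values makes it easy to translate codes back into words when results are presented.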
Coding cont….
Coding further clarified:
 A systematic way in which to condense extensive data
sets into smaller analyzable units through the
creation of categories and concepts derived from the
data.
 The process by which verbal data are converted into
variables and categories of variables using numbers,
so that the data can be entered into computers for
analysis.
Variables and categories (sub-variables)

Variables:    Gender          Age                    Do you like ice cream?
Categories:   Male, Female    18-25, 26-33, 34-41    Yes, No

Data entry
 Data entry requires access to a computer and to relevant
software, e.g. Excel or SPSS.
 Data entry entails two steps: (a) defining the variables and (b)
entering the data.
 Coded data can either be scanned or entered by hand.
Scanning can be expensive: specialist software is
required, and the questionnaire also needs to be
set up in a ‘scannable’ format.
 On the other hand, scanning is usually quicker, with less
room for human error.
Data processing and analysis
 Data processing is a series of actions or
operations that converts data into useful
information. The processing involves data
cleaning and actual analysis:
Data cleaning:
 Looking at the ‘likeliness’ of responses (e.g. if an ‘age’
value is listed as 555 this is likely to be a keying error).
 Looking for responses to questions that should have
been skipped via routing (e.g. ‘if Q6 = no, go to Q10’ but
Q7-Q9 have been answered)
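The two checks above could be automated roughly as follows. The records, the field names and the plausible age range (0 to 120) are all invented for this sketch; a real study would use its own questionnaire layout and limits.

```python
# Invented survey records; None means the question was left blank.
records = [
    {"id": 1, "age": 34, "Q6": "yes", "Q7": "a", "Q8": "b", "Q9": "c"},
    {"id": 2, "age": 555, "Q6": "no", "Q7": None, "Q8": None, "Q9": None},
    {"id": 3, "age": 27, "Q6": "no", "Q7": "a", "Q8": None, "Q9": None},
]


def implausible_age(record, low=0, high=120):
    """True when the age lies outside an assumed plausible range."""
    return not (low <= record["age"] <= high)


def routing_violation(record):
    """True when Q6 is 'no' but any of Q7-Q9 was answered anyway."""
    skipped = ("Q7", "Q8", "Q9")
    return record["Q6"] == "no" and any(record[q] is not None for q in skipped)


age_errors = [r["id"] for r in records if implausible_age(r)]
routing_errors = [r["id"] for r in records if routing_violation(r)]
print(age_errors, routing_errors)
```

Flagged records still need a human decision: the original questionnaire should be checked before a value is corrected or removed.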
Data cleaning Cont. …
 For example, a variable called GENDER would be expected to have
only two values; or a variable representing height in inches would
be expected to be within reasonable limits.
 It is recommended that you ‘clean’ your data (i.e. correct any
errors in the data set) before carrying out any work on it.
 This involves running a few basic checks:
- Looking for duplicates in any variables which should be unique
(e.g. an ID code).
- Looking at the minimum and maximum values for each variable
(e.g. a response of ‘6’ when 1 to 4 are the only possible options
suggests an error).
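A minimal sketch of these basic checks, using invented ID codes and responses and assuming that 1 to 4 are the only valid options:

```python
from collections import Counter

ids = [101, 102, 103, 102, 104]  # invented ID codes; 102 appears twice
answers = [1, 4, 2, 6, 3]        # invented responses; valid options are 1 to 4

# Duplicates in a variable that should be unique.
duplicate_ids = sorted(value for value, n in Counter(ids).items() if n > 1)

# Minimum and maximum values, and the responses outside the valid options.
lowest, highest = min(answers), max(answers)
out_of_range = [a for a in answers if not 1 <= a <= 4]

print(duplicate_ids, lowest, highest, out_of_range)
```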
Data analysis
 Data analysis refers to the computation of certain
measures along with searching for patterns of
relationship that exist among data groups.
 The process of data analysis aims at determining
whether our observations support the hypotheses we
formulated.
Purpose of Data Analysis
 Describe or summarize data clearly,
 Search for consistent patterns or themes among data,
 Answer your research questions by interpreting the
results of the analysis.
Why use statistics?
 Statistics are powerful tools that help people
understand interesting phenomena.
 “Statistics can offer one method for helping you
make sense of your environment” (Urdan, 2001).
Options for Carrying Out Statistical Analysis
 Technological advances have made the process of
statistically analyzing data easy enough for
non-statisticians to do.
 There are numerous statistical analysis programs
available for analyzing quantitative data, such as
Microsoft Excel, SAS (Statistical Analysis System) and SPSS
(Statistical Package for the Social Sciences).
What is SPSS?
 SPSS stands for Statistical Package for the Social Sciences.
 SPSS is a Windows-based program that can be used
to perform data entry and analysis and to create
tables and graphs.
 SPSS is capable of handling large amounts of data
and can perform most commonly needed analyses.
 SPSS is commonly used in the social sciences and in
the business world.
DESCRIPTIVE STATISTICS
 Descriptive statistics include:
- Numerical counts or frequencies
- Percentages
- Measures of central tendency (mean, mode, median)
- Measures of variability (range, standard deviation,
variance)
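As a small illustration of the first two items, counts and percentages can be obtained as follows. The yes/no responses are invented for the example.

```python
from collections import Counter

responses = ["yes", "no", "yes", "yes", "no", "yes", "yes", "no"]  # invented data

frequencies = Counter(responses)  # numerical counts per category
total = sum(frequencies.values())
percentages = {answer: 100 * n / total for answer, n in frequencies.items()}

for answer in frequencies:
    print(answer, frequencies[answer], f"{percentages[answer]:.1f}%")
```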
Measure of central tendency
 1. The sample mean, or simply the mean
 The mean is calculated by summing all the scores in the
sample and then dividing by the number of scores in the sample.
For the sample of scores (5, 6, 9, 2), the mean is
(5 + 6 + 9 + 2) / 4 = 22 / 4 = 5.5.
 E.g. for the sample (2, 20, 20, 12, 12, 19, 19, 25, 20), the mean
is 149 / 9 = 16.56.
 The mean is taken as the typical score of the sample.
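The calculation can be checked with a short sketch; the function name `mean_score` is chosen here only for illustration.

```python
def mean_score(scores):
    """Sum all the scores and divide by the number of scores."""
    return sum(scores) / len(scores)


first = mean_score([5, 6, 9, 2])
second = mean_score([2, 20, 20, 12, 12, 19, 19, 25, 20])
print(first, round(second, 2))
```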

Measure of central tendency… cont.
 2. The median
 A second measure of central tendency is the median,
defined as the value that lies in the middle of the sample;
that is, it has the same number of scores above as below it.
 E.g. Scores:  2 12 12 19 19 20 20 20 25
      Ranks:   1  2  3  4  5  6  7  8  9
The median is the score ranked 5th, i.e. 19.
 In the above example it was easy to work out the median
as we had an odd number of scores. When you have an odd
number of scores there is always one score that is the
middle one. What if you have an even number of scores?
Then the median is the mean of the two middle scores.
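A minimal sketch of the median for both cases, assuming the usual convention that an even number of scores is handled by averaging the two middle scores. The ten-score list simply adds 26 to the example above.

```python
def median_score(scores):
    """Return the middle score, or the mean of the two middle scores."""
    ordered = sorted(scores)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_case = median_score([2, 12, 12, 19, 19, 20, 20, 20, 25])       # 9 scores
even_case = median_score([2, 12, 12, 19, 19, 20, 20, 20, 25, 26])  # 10 scores
print(odd_case, even_case)
```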
Measure of central tendency… cont.

The mode
A third measure of central tendency is the
mode, which is simply the most frequently
occurring score.

Eg. 2 12 12 19 19 20 20 20 25 26

In the above data set, the mode is 20.
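The mode can be found by counting how often each score occurs. The sketch below returns a list because a data set can have more than one mode; the helper name `modes` is an assumption for the example.

```python
from collections import Counter


def modes(scores):
    """Return every score that occurs with the highest frequency."""
    counts = Counter(scores)
    highest = max(counts.values())
    return sorted(score for score, n in counts.items() if n == highest)


example_modes = modes([2, 12, 12, 19, 19, 20, 20, 20, 25, 26])
print(example_modes)
```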


Measure of central tendency… cont.
 Which measure of central tendency to use?
The answer is: it depends on your data.
 The mean is the most frequently used because it is calculated
from the actual scores themselves, not from the ranks, as
is the case with the median, and not from the frequency of
occurrence, as is the case with the mode.
 However, because the mean uses all the actual scores in
its calculation, it is sensitive to extreme scores.
Measure of central tendency… cont.
 Sensitivity to extreme values
 From this data set: 1 2 3 4 5 6 7 8 9 10
The mean is 5.5
The median is also 5.5
 But if you change one score to make it extreme, the
mean changes substantially:
 E.g. 1 2 3 4 5 6 7 8 9 100
Mean = 14.5
Median = 5.5
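The comparison above can be reproduced with Python's standard `statistics` module, as a quick check of how one extreme score moves the mean but not the median:

```python
from statistics import mean, median

original = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
with_extreme = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # last score made extreme

mean_shift = mean(with_extreme) - mean(original)
median_shift = median(with_extreme) - median(original)
print(mean(original), median(original))
print(mean(with_extreme), median(with_extreme))
```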
Measure of central tendency… cont.
 The population mean
 One way of estimating the population mean is to calculate
the means for a number of samples and then calculate the
mean of these sample means.
 Statisticians have found that this gives a close
approximation of the population mean.
 E.g. You have three samples from the same population. If the
mean for sample A = 5, the mean for sample B = 6 and the mean for
sample C = 8, then calculate the population mean.
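One way to work the example is to average the three sample means. The sketch assumes the three samples are of equal size, so that each sample mean carries the same weight; the sample sizes are not given in the example.

```python
# Sample means from the example above.
sample_means = {"A": 5, "B": 6, "C": 8}

# Estimate of the population mean: the mean of the sample means.
estimate = sum(sample_means.values()) / len(sample_means)
print(round(estimate, 2))
```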
Measure of Variability

1. Range
 The range is simply the difference between the
minimum and maximum scores.
 It gives an indication of the spread of scores by comparing
the minimum score with the maximum score in the
sample or population.
 However, the range does not give us an indication of what
is happening in between these scores.
 It does not really tell us much about the overall shape
of the distribution of the sample of scores.
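To illustrate this limitation, the sketch below uses two invented data sets that share the same range although their scores are distributed very differently:

```python
def score_range(scores):
    """Difference between the maximum and minimum scores."""
    return max(scores) - min(scores)


clustered = [1, 5, 5, 5, 5, 9]   # invented: most scores in the middle
spread_out = [1, 1, 2, 8, 9, 9]  # invented: scores pushed to the extremes

print(score_range(clustered), score_range(spread_out))
```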
Measure of Variability…cont.

 2. Standard deviation
 A more informative measure of the variation in data is
the standard deviation (SD).
 The SD does give us an indication of what is happening
between the two extremes.
 The reason why the SD is able to do this is that it tells
us how much all the scores in a data set vary around
the mean.
Measure of Variability…cont
SD ….cont.
Actual scores               1    4    5    6    9   11
Deviations from the mean   -5   -2   -1    0    3    5
Squared deviations         25    4    1    0    9   25

From the table above, the variance is obtained by
computing the mean of the squared deviations:
Variance = 64 / 6 = 10.67

The SD is obtained by computing the square root of
the variance: SD = √10.67 = 3.27
Measure of Variability…. Cont.
 However, 3.27 describes only the six scores themselves, and
it normally underestimates the population SD.
 In order to get an SD closer to that of the population,
statisticians advise subtracting one from N, the number of
scores, and dividing the sum of the squared
deviations by the result.
 In our case, 6 - 1 = 5.
 Therefore the variance = 64 / 5 = 12.8, and the
SD = √12.8 = 3.58.
 For our data set, SPSS for Windows would
display 3.58 as the SD.
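Both versions of the calculation, dividing by N and dividing by N - 1, can be checked with a short sketch using the six scores from the worked example. The variable names are chosen for illustration.

```python
import math

scores = [1, 4, 5, 6, 9, 11]
n = len(scores)
mean = sum(scores) / n

# Sum of the squared deviations from the mean.
sum_squared = sum((x - mean) ** 2 for x in scores)

# Dividing by N describes these six scores only.
variance_n = sum_squared / n
sd_n = math.sqrt(variance_n)

# Dividing by N - 1 gives the estimate of the population values.
variance_n_minus_1 = sum_squared / (n - 1)
sd_n_minus_1 = math.sqrt(variance_n_minus_1)

print(round(variance_n, 2), round(sd_n, 2))
print(round(variance_n_minus_1, 2), round(sd_n_minus_1, 2))
```

The standard library offers the same two versions as `statistics.pstdev` (divide by N) and `statistics.stdev` (divide by N - 1).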
Other Characteristics of distribution
 1. Normal Distribution
 Many statistical tests are valid only if your data are
distributed in a certain way.
 In everyday life many variables such as height,
weight, shoe size, anxiety levels and exam marks
all tend to be normally distributed, that is, they all
tend to follow a bell-shaped curve.
 Many of the most powerful statistical tools we use assume
that the populations from which our samples are
drawn are normally distributed.
Other Characteristics of distribution
 Normal Distribution … cont.
 For a distribution to be classed as normal it should
have the following characteristics:
- it should be symmetrical about the mean,
- the tails should meet the x-axis at infinity,
- it should be bell-shaped.
 When you have a PERFECT normal distribution,
the mean, median and mode are exactly the
same.
Other Characteristics of distribution, cont.
 Normal distribution … cont.
 When data are “normally distributed”,
the mean, median and mode all lie at the center.
 The Normal Distribution has:
 mean = median = mode
 symmetry about the center
 50% of values less than the mean and 50% greater
than the mean
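These properties can be explored with simulated data. The sketch below draws 10,000 scores from a normal distribution with an invented mean of 100 and SD of 15; because the scores are a random sample, the mean and median come out close to each other rather than exactly equal.

```python
import random
from statistics import mean, median

random.seed(0)  # fixed seed so that the sketch is repeatable
scores = [random.gauss(100, 15) for _ in range(10_000)]  # invented mean and SD

sample_mean = mean(scores)
sample_median = median(scores)
share_below_mean = sum(s < sample_mean for s in scores) / len(scores)

print(round(sample_mean, 1), round(sample_median, 1), round(share_below_mean, 2))
```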
Normal Distribution Curve
[Figure: normal distribution curve with Mean = Median = Mode at the center; horizontal axis marked 3 2 1 0 1 2 3]
Other Characteristics of distribution…cont.

Normal distribution … cont.
 When many naturally occurring variables are plotted,
they are found to be normally distributed.
 It is also generally found that the more scores
from such variables you plot, the more like the
normal distribution they become.
Other Characteristics of distribution.. Cont.

2. Skewness
 Most often, observed deviations from normality are the
result of skewness.
 A distribution that has an extended tail to the right is
known as a positively skewed distribution (Figure below).
 A distribution that has an extended tail to the left is
known as a negatively skewed distribution (Figure below).
Skewness
Skewness… cont.
 In the skewness statistic, a positive value suggests a positively
skewed distribution, whereas a negative value suggests a
negatively skewed distribution.
 A value of zero tells you that your distribution is not skewed in
either direction.
 Values of skewness around 1 (or -1) suggest deviations
from normality which are too extreme for us to use many of the
statistical techniques.
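The sign of the skewness value can be illustrated with a simple moment-based coefficient. This is a sketch under stated assumptions: statistical packages may apply a small-sample adjustment, so their values can differ somewhat from this formula, and the data sets are invented.

```python
def skewness(scores):
    """Moment-based skewness: third central moment over (second moment) ** 1.5."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
right_tail = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # extended tail to the right
left_tail = [-x for x in right_tail]           # mirror image: tail to the left

for data in (symmetric, right_tail, left_tail):
    print(round(skewness(data), 2))
```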
Other Characteristics of distribution.. Cont

 3. Kurtosis /kəːˈtəʊsɪs/
 The kurtosis of a distribution is a measure of
how peaked the distribution is.
 A flat distribution is called platykurtic, a very
peaked distribution is called leptokurtic, and a
distribution between these extremes is called
mesokurtic.
Kurtosis
Skewness & Kurtosis
Kurtosis …
 In the kurtosis statistic, a value of zero tells you that you have
a mesokurtic distribution, like the normal distribution.
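A kurtosis value on this scale (zero for a normal distribution) can be sketched as the fourth central moment divided by the squared second moment, minus 3. The sketch uses the plain moment formula without the small-sample adjustment that statistical packages may apply, and the two data sets are invented.

```python
def excess_kurtosis(scores):
    """Fourth central moment over squared second moment, minus 3."""
    n = len(scores)
    mean = sum(scores) / n
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n
    return m4 / m2 ** 2 - 3


flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]      # evenly spread scores
peaked = [5, 5, 5, 5, 5, 5, 5, 5, 0, 10]    # most scores piled on one value

print(round(excess_kurtosis(flat), 2), round(excess_kurtosis(peaked), 2))
```

On this scale a negative value points to a flat (platykurtic) distribution and a positive value to a peaked (leptokurtic) one.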
Descriptive statistics output from SPSS for Windows
GRAPHIC PRESENTATION OF DATA
Why Worry About Graphics?
 “Data graphics are mainly devices for showing the
obvious to the ignorant”
 “They have to be alive, communicatively dynamic,
decorated, and exaggerated; otherwise all the
dullards will fall asleep in the face of those boring
statistics”
 “Graphics are instruments for reasoning about
quantitative information… They reveal data.” (Tufte
1983)
ORGANIZING AND GRAPHING QUANTITATIVE DATA

 Frequency Distributions
 Distribution Tables
 Relative and Percentage Distributions
 Graphing Grouped Data
Histograms
Polygons
Frequency Distribution Table

Table 2.7 Weekly Earnings of 100 Employees of a Company

Weekly Earnings (dollars)    Number of Employees (f)
401 to 600                     9
601 to 800                    22
801 to 1000                   39
1001 to 1200                  15
1201 to 1400                   9
1401 to 1600                   6

Labels shown on the table:
- Variable: Weekly Earnings (dollars)
- Frequency column: Number of Employees (f)
- Third class: 801 to 1000; frequency of the third class: 39
- Lower limit of the sixth class: 1401; upper limit of the sixth class: 1600
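Relative, percentage and cumulative distributions can be derived from the frequency column of Table 2.7. The sketch below is one way to do it; the variable names are chosen for illustration.

```python
from itertools import accumulate

classes = ["401 to 600", "601 to 800", "801 to 1000",
           "1001 to 1200", "1201 to 1400", "1401 to 1600"]
frequencies = [9, 22, 39, 15, 9, 6]  # number of employees per class (Table 2.7)

total = sum(frequencies)
relative = [f / total for f in frequencies]           # relative frequencies
percentages = [100 * f / total for f in frequencies]  # percentage distribution
cumulative = list(accumulate(frequencies))            # running totals

for row in zip(classes, frequencies, relative, percentages, cumulative):
    print(*row)
```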
Graphic Presentation of Frequency Distribution
The three commonly used graphic forms are:
Histograms
Frequency polygons
Cumulative frequency distribution curves.
Histogram
 Why use a histogram?
- When there is a lot of data.
- When the data are continuous,
e.g. mass, height, volume, time etc.
- When the data are presented in a grouped frequency
distribution, usually in groups or classes that are
UNEQUAL.
Histogram
HISTOGRAM: A graph in which the classes are
marked on the horizontal axis and the class
frequencies on the vertical axis. The class
frequencies are represented by the heights of the
bars and the bars are drawn adjacent to each other.
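As a rough, text-only illustration, the weekly earnings frequencies from Table 2.7 can be drawn as adjacent bars. Note the assumptions: the bars are turned on their side (classes down the page, frequencies along the line) so that they can be printed as plain text, and the helper `text_histogram` is invented for this sketch.

```python
classes = ["401-600", "601-800", "801-1000", "1001-1200", "1201-1400", "1401-1600"]
frequencies = [9, 22, 39, 15, 9, 6]  # class frequencies from Table 2.7


def text_histogram(labels, counts):
    """Return one line per class; the bar length equals the class frequency."""
    width = max(len(label) for label in labels)
    return [f"{label:>{width}} | {'#' * count} {count}"
            for label, count in zip(labels, counts)]


histogram_lines = text_histogram(classes, frequencies)
print("\n".join(histogram_lines))
```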
Frequency Polygon
 A frequency polygon also shows the shape of
a distribution and is similar to a histogram.
 It consists of line segments connecting
the points formed by the intersections of the
class midpoints and the class frequencies.
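The points of a frequency polygon can be computed from the class limits and frequencies. The sketch below uses the weekly earnings classes of Table 2.7 and takes each midpoint as the average of the lower and upper class limits.

```python
class_limits = [(401, 600), (601, 800), (801, 1000),
                (1001, 1200), (1201, 1400), (1401, 1600)]
frequencies = [9, 22, 39, 15, 9, 6]

# Class midpoint = (lower limit + upper limit) / 2.
midpoints = [(lower + upper) / 2 for lower, upper in class_limits]

# Each polygon point pairs a class midpoint with its class frequency.
polygon_points = list(zip(midpoints, frequencies))
print(polygon_points)
```

Joining these points in order with straight line segments gives the polygon.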
Histogram Versus Frequency Polygon
 Both provide a quick picture of the main characteristics of the data
(highs, lows, points of concentration, etc.)
 The histogram has the advantage of depicting each class as a
rectangle, with the height of the rectangular bar representing the
number in each class.
 The frequency polygon has an advantage over the histogram. It
allows us to compare directly two or more frequency distributions.
Cumulative Frequency Distribution
Pie Charts
 Pie charts are used to illustrate percentages or
proportions of a whole. They are particularly useful in
investigating discrete elements of:
- Populations
- Budgets
 But “at best, they allow readers to see crude proportions
among a few elements.” (Booth et al. 1995)
[Figure: Pie chart, City of Tallahassee FY02 Revenues from All Sources:
Utilities (66%), Misc (12%), Taxes (8%), Interdept (7%), InterGvtl (4%),
Interest (1%), Cap Bdgt OH (1%), Fund Balance (1%)]
Scatter Plots/Line Graphs
 Scatter plots and line graphs are used to show
the relation between two quantitative
variables where there is a unique value of the
dependent variable for any value of the
independent variable.
 The independent variable is typically plotted
on the x axis while the dependent variable is
plotted on the y axis
 Line graphs are especially effective at
presenting data that vary continuously
[Figure: Line graph, Bay County's Total Population, 1920-2000.
x-axis: Year (1920 to 2000); y-axis: Population (0 to 160,000).
Source: US Census]
Bar Graphs
 Bar Graphs are appropriate for data that are non‐numerical
and discrete for at least one variable, i.e. they are grouped
into separate categories. There are no dependent or
independent variables. Important features of this type of
graph include:
 Data are collected for discontinuous, non‐numerical
categories (e.g. place, colour, and species) so the bars do not
touch.
 Data values may be entered on or above the bars if you wish.
 Multiple sets of data can be displayed side‐by‐side for direct
comparison (e.g. Males and females of the same age group).
Bar Graphs…..
 Axes may be reversed so that the categories are on
the x‐axis, i.e. the bars can be vertical or horizontal.
 When they are vertical, these graphs are
sometimes called column graphs (MS Excel uses this
name for vertical bar graphs).
[Figure: Bar graph, Sales of Books (in thousands) from Six Branches (B1, B2, B3, B4,
B5 and B6) of a Publishing Company in 2000 and 2001]
On what day did they sell the most chocolate milk?

[Figure: Bar graph, Chocolate Milk Sold. x-axis: Day (Monday to Friday);
y-axis: Amount Sold (0 to 120); values shown on the bars: 112, 76, 72, 53, 33]
How to interpret a Bar Graph?
The bar graph shows John’s students by gender and band
membership.
 How many of John’s students are band members?
 How many of John’s students are not band members?
[Figure: Bar graph with bars for Female band, Female not band, Male band
and Male not band; y-axis from 0 to 7]
QUALITATIVE DATA ANALYSIS
 Qualitative analysis is a research
procedure that
 (a) deals with data presented in textual,
verbal and multi-focus format;
 (b) contains a minimum of quantitative
measurement, standardization and
statistical techniques,
 (c) aims to transform and interpret
qualitative data in a rigorous and
scholarly manner
Timing for qualitative data analysis

 1. Data analysis during data collection:


 In this case, data are collected, coded, conceptually
organized, interrelated, analysed, evaluated and
then used as a spring-board for further sampling,
data collection, processing and analysis, until
saturation is achieved.
 Data collection is thus merged with data analysis.
Timing for qualitative data analysis, cont …
 2. Data analysis after data collection:
 There is always some work left for analysis after
completion of data collection.
 There are also cases where qualitative analysis is wholly
conducted after data collection, e.g. when data are
electronically collected.
 3. Qualitative data analysis during and after data collection:
 For example, while collecting data, researchers conduct some
basic analysis, record the data and then intensify their analysis
when the study is completed by focusing on more specific aspects
of the research question as shown in the transcripts.
Methods for Qualitative data analysis

1. Content Analysis
 Content analysis is a research method used to reduce
large amounts of unstructured textual content into
manageable data relevant to the (evaluation)
research questions.
 Content analysis uses thematic coding in order to
perform a quantitative analysis of particular
occurrences of themes in an unstructured text.
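A very simplified sketch of counting theme occurrences is given below. Everything in it is an assumption made for illustration: the themes, the keywords that signal them and the responses are invented, and real thematic coding relies on the researcher's judgement rather than on keyword matching alone.

```python
import re
from collections import Counter

# Hypothetical codebook: each theme is signalled by a few keywords.
CODEBOOK = {
    "cost": ["expensive", "price", "fees"],
    "access": ["distance", "transport", "far"],
}

# Invented interview responses.
responses = [
    "The fees are too expensive for most families.",
    "The clinic is far and transport is unreliable.",
    "Price is fine but the distance is a problem.",
]


def themes_in(text):
    """Return the set of themes whose keywords appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {theme for theme, keywords in CODEBOOK.items()
            if any(keyword in words for keyword in keywords)}


# Number of responses in which each theme occurs.
theme_counts = Counter(theme for text in responses for theme in themes_in(text))
print(theme_counts)
```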
Example of thematic coding
