Data Analysis Stata and SPSS A Handbook PDF

A HANDBOOK ON
DATA ANALYSIS
USING SPSS AND STATA

Table of contents
A) Introduction to statistics
   Descriptive statistics
   Probability
   Inference
B) Sampling design
C) Data collection techniques
D) Probability distributions
E) Statistical inference
F) Non-parametric statistics
G) Regression and correlation
H) Data management and analysis using SPSS and STATA
I) Report writing skills
INTRODUCTION TO STATISTICS
Statistics is the study of how to organize, analyze and interpret numerical information from data so as to
draw valid and meaningful conclusions.
Data is the raw facts, numbers, and ideas pertaining to an activity of interest.
Information is processed data.
Statistics is divided into two branches
i) Descriptive statistics
This is used when the purpose of an investigation is to describe the data that has been (or will be)
collected.
Suppose a researcher is interested in determining the proportion of voters in his/her village (LC1) who
prefer a certain candidate. The focus of the researcher is his/her village, so he/she will collect data
on all the voters in the village, note whether each voter supports the given candidate and then
calculate the proportion. Because the researcher is using statistical methods merely to describe the
data he/she collects, this is an example of descriptive statistics.
ii) Inferential statistics

This is used when the purpose of the research is not to describe the data that has been collected but
to generalize or make inferences based on it. In the above example, if the researcher were interested
in determining the proportion of voters who favour a certain candidate in his/her whole district, it is
unlikely that he/she would be able, or even want, to collect the relevant data on all voters in the district.
He/she will probably randomly select a smaller group of voters in the district and use inferential
statistics to generalize or make inferences based on it.
The smaller group on which the data is collected is called a sample, whereas the larger group to
which conclusions are generalized (or inferred) is called the population.
In general, two major factors influence the researcher's confidence that what holds true for
the sample also holds true for the population at large.
They are:
i) Size of the sample
ii) The method of sample selection
When the sample is the population, we are in the area of descriptive statistics and the conclusion
will be 100% certain. Thus one of the major goals of inferential statistics is to assess the degree of
certainty of inferences when such inferences are drawn from sample data.
VARIABLES AND CONSTANTS
Variables are characteristics of persons or objects that vary from person to person or object to
object. In our example of the researcher above, the preference for a certain candidate will
vary from voter to voter, hence preference here is a variable.
Characteristics that remain constant from person to person or object to object are called constants.
Whether a characteristic is designated as a variable or a constant depends on the study in question.
In the above study of voter preference, the number of voters in the village (LC1) is a constant (i.e. it
does not change for that village in this particular study).
MEASUREMENT OF VARIABLES
Measurement involves the observation of characteristics on persons or objects and the assignment
of numbers to such persons or objects so that the numbers assigned represent the amounts of the
characteristics possessed. The rules for making number assignments determine the type of
arithmetic operations and comparisons that can be meaningfully made.
There are 4 levels of measurement:
1 Nominal level
This classifies objects into categories, so that two objects are perceived as either similar or not. Examples are short versus
non-short, heavy versus non-heavy, male versus female, etc. Nominal data are not intended for calculations.
2 Ordinal level
Characteristics are considered ordinal if there is an order to them. For example: excellent, very
good, good, poor;
or great success, average success and below-average success.
3 Interval level
This allows for meaningful statements about the amount of difference between points along a
scale, i.e. numbers can be assigned in such a way that equal numerical differences correspond to
equal increments in the property.
4 Ratio level
This allows characteristics to be ordered, differences to be taken and the ratio between 2 data
values to be found.
DISCRETE AND CONTINUOUS VARIABLES
A numerically valued variable is said to be discrete if the values it takes on are integers, or can be
thought of in some unit of measurement in which they are integers.
It is continuous if, whenever it can take on the values a and b, it can also take on every value between a and b.
PROBABILITY
Recall that inferential statistics deals with drawing conclusions about the population from the results
obtained from a sample. Recall also that we are faced with the issue of the "degree of certainty" of
the inferred conclusions.
We use basic probability to quantify this degree of certainty.
Concepts of probability apply to variables that are either discrete or continuous.
THE DISCRETE CASE

Basic probability hinges on experiments.
An experiment is defined as any action that leads to an observable outcome, e.g. tossing a coin or
answering a question on a test.
The observable outcomes, or any combinations of observable outcomes, of an experiment are called
events.
e.g.
experiment : Toss a coin once
outcomes : {H} {T}
experiment : Toss a coin twice
outcomes : {HH} {HT} {TH} {TT}
When all outcomes are equally likely, the probability, or degree of certainty, that a particular event E will occur is denoted P(E), where
P(E) = Number of outcomes in E / Total number of outcomes
Example 1. What is the probability of obtaining 2 heads when a coin is tossed twice?
Solution: there are 4 possible equally likely outcomes, listed as {HH} {HT} {TH} {TT}.
Only one of these results in 2 heads, i.e. {HH}.
Let E be the event "obtaining two heads".
Therefore P(E) = 1/4
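The counting definition of P(E) can be checked by listing the sample space. The short Python sketch below is only an illustration, assuming equally likely outcomes; the helper name `probability` is hypothetical and not part of the handbook.

```python
from fractions import Fraction
from itertools import product


def probability(event, sample_space):
    """P(E) = outcomes in E / total outcomes (assumes equally likely outcomes)."""
    favourable = [outcome for outcome in sample_space if event(outcome)]
    return Fraction(len(favourable), len(sample_space))


# Sample space for tossing a coin twice: HH, HT, TH, TT
two_tosses = ["".join(tosses) for tosses in product("HT", repeat=2)]

# Event E: "obtaining two heads"
p_two_heads = probability(lambda outcome: outcome == "HH", two_tosses)

print(two_tosses)
print(p_two_heads)  # 1/4
```

Using `Fraction` keeps the answer as an exact ratio, which matches the way the examples in this section are written.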
Example 2. In a data set there are 227 males and 273 females. The experiment consists of selecting
one individual from this data set at random.
What is the probability that the individual is
i) A female
ii) A male
iii) A toddler
iv) Either female or male
Solution:
i) P(F) = 273/500
ii) P(M) = 227/500
iii) P(T) = 0/500 = 0
iv) P(M or F) = 500/500 = 1
PROBABILITY RULES
COMPLEMENTARY RULE
In Example 2 above, a female individual cannot also be a male and vice versa. We call these 2
events mutually exclusive or disjoint events.
The event "individual selected is male" is the same as the event "individual selected is not female".
These 2 events are said to be complementary events:
the event "individual selected is male" is the complement of the event "individual selected is female".
If 2 events E1 and E2 are complements of each other then P(E1) = 1 - P(E2)
ADDITIVE RULE
For 2 mutually exclusive events E1 and E2, the probability of the combined event E1 or E2 is P(E1 or E2) and is given
by
P(E1 or E2) = P(E1) + P(E2). This is called the first additive rule of probability.
The second additive rule applies to any 2 events and states that P(E1 or E2) = P(E1) + P(E2) – P(E1 and E2), where
P(E1 and E2) represents the probability of the outcomes that E1 and E2 have in common.
Example 3. A cross tabulation of the number of students in a data set who sat the post-secondary exam,
by region, is given below. Determine the probability of selecting a student who is from the south or
who took the post-secondary exam.

                                     Region
Sat post-secondary exam    NEast   central   south   west   Total
Yes                           96        54     101     60     311
No                            10        97      49     33     189
Total                        106       151     150     93     500

Let E1 be the event "student from south" and
E2 be the event "student took exam".
Now E1 and E2 are not mutually exclusive, as 101 students are from the south and took the exam.
So P(E1 and E2) = 101/500
Therefore P(E1 or E2) = P(E1) + P(E2) – P(E1 and E2) = 150/500 + 311/500 - 101/500 = 360/500 = 0.72
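The second additive rule can be verified directly from the cell counts of Example 3. The following Python sketch is illustrative only; the dictionary layout and variable names are assumptions made for this example.

```python
from fractions import Fraction

# Cell counts from Example 3: rows = sat the exam (yes/no), columns = region
counts = {
    "yes": {"NEast": 96, "central": 54, "south": 101, "west": 60},
    "no": {"NEast": 10, "central": 97, "south": 49, "west": 33},
}
total = sum(sum(row.values()) for row in counts.values())  # 500 students

p_south = Fraction(counts["yes"]["south"] + counts["no"]["south"], total)  # P(E1)
p_exam = Fraction(sum(counts["yes"].values()), total)                      # P(E2)
p_both = Fraction(counts["yes"]["south"], total)                           # P(E1 and E2)

# Second additive rule: subtract the overlap so it is not counted twice
p_south_or_exam = p_south + p_exam - p_both

print(p_south_or_exam, float(p_south_or_exam))  # 18/25 0.72
```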
MULTIPLICATIVE RULE
P(E1 and E2) = P(E1) × P(E2), e.g. tossing a coin twice:
P(HH) = P(H) × P(H) = 1/2 × 1/2 = 1/4
This rule applies when the two events are independent of one another (have no effect on each
other). This rule can be extended to more than 2 events.
CONDITIONAL PROBABILITY
When two events are not independent, the probability of one event depends on whether or not the
other has occurred. We write this as P(E1/E2), i.e. the probability of E1 given that E2 has occurred.
Example 4: A cross tabulation of gender and marijuana usage is given below.

              Marijuana usage
Sex        Never    Yes    Total
Male         185     42      227
Female       223     50      273
Total        408     92      500

a) The probability that a randomly selected student never smoked marijuana, given that the
student is male, is 185/227
b) The probability that a randomly selected student never smoked marijuana, given that the
student is female, is 223/273
c) The probability that a randomly selected student is male, given that the student never
smoked marijuana, is 185/408
In general P(E1/E2) = P(E1 and E2)/P(E2). As an illustration, refer to Example 4(a) above:
E1 = student randomly selected never smoked marijuana, so P(E1) = 408/500
E2 = student is male, so P(E2) = 227/500
Now P(E1 and E2) = 185/500
By substitution P(E1/E2) = P(E1 and E2)/P(E2) = (185/500) × (500/227) = 185/227, as before.
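The three conditional probabilities of Example 4 can be reproduced from the cell counts with the general formula. This Python sketch is a minimal illustration; the function names `p` and `conditional` are hypothetical helpers, not SPSS or STATA commands.

```python
from fractions import Fraction

# Cell counts from Example 4, keyed by (sex, marijuana usage)
counts = {
    ("male", "never"): 185, ("male", "yes"): 42,
    ("female", "never"): 223, ("female", "yes"): 50,
}
total = sum(counts.values())  # 500


def p(event):
    """Probability of an event, given as a predicate on (sex, usage) keys."""
    return Fraction(sum(n for key, n in counts.items() if event(key)), total)


def conditional(e1, e2):
    """P(E1/E2) = P(E1 and E2) / P(E2)."""
    return p(lambda key: e1(key) and e2(key)) / p(e2)


def is_male(key):
    return key[0] == "male"


def never_smoked(key):
    return key[1] == "never"


p_never_given_male = conditional(never_smoked, is_male)  # part (a)
p_male_given_never = conditional(is_male, never_smoked)  # part (c)

print(p_never_given_male, p_male_given_never)  # 185/227 185/408
```

Note that the two answers differ: conditioning on "male" and conditioning on "never smoked" divide the same joint count, 185, by different totals.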
THE CONTINUOUS CASE
Recall from the definition of a continuous random variable that it can take on any value
between two values a and b.
Many variables in the real world have this property. Examples are height, weight, IQ scores, etc.
Probabilities in this case are obtained by using the normal distribution model, which has the following properties:
a) A normal distribution is symmetrical about its mean
b) A normal distribution extends indefinitely to the right and to the left of the mean always
getting closer but never touching the horizontal axis ( i.e observations that are further from
the mean have smaller relative frequencies than observations that are close to the mean)
c) The total area under the normal curve is 1. By symmetry this means that one-half of the area
is to the right of the mean and one-half is to the left.
In general all normal distributions have the same bell-shaped appearance and differ from
each other only in their particular mean and standard deviation values.
The particular normal distribution with mean 0 and standard deviation 1 is called the
standard normal distribution.
In general probabilities can be obtained as proportions of area under the appropriate
normal curve.
To convert a normal distribution to the standard normal distribution we convert values to
Z-score.
Z-score = (value - mean)/(standard deviation)
In SPSS, enter a data value in the top-left cell of the data sheet. Click Transform and then Compute.
Type the variable name prob in the Target Variable box.
Find CDF.NORMAL(q, mean, stddev) in the function list by scrolling down. Highlight this function and use the up arrow
to move it to the Numeric Expression box.
You will get CDF.NORMAL(?,?,?); replace the question marks to give CDF.NORMAL(26, 18.40, 6.92).
The value 0.86 will appear in the second column.
Interpretation
If you randomly select a score from a normal distribution with mean 18.4 and standard deviation
6.92, the probability of obtaining a value below 26 is 0.86.
To use the standard normal curve, first convert 26 to a Z-score, i.e. z = (26 - 18.4)/6.92 = 1.10
Then repeat the above with the function CDFNORM(z).
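For readers without SPSS to hand, the same probability can be cross-checked in Python. This is only a sketch, using the standard-library `statistics.NormalDist` class (Python 3.8 or later) as a stand-in for the SPSS functions; it is not part of the SPSS procedure described above.

```python
from statistics import NormalDist

mean, sd = 18.40, 6.92
value = 26

# Direct route: P(X < 26) for a normal distribution with the given mean and sd
p_below = NormalDist(mu=mean, sigma=sd).cdf(value)

# Standard normal route: convert to a Z-score first, then use mean 0 and sd 1
z = (value - mean) / sd
p_from_z = NormalDist().cdf(z)

print(round(z, 2), round(p_below, 2), round(p_from_z, 2))
```

Both routes give the same probability, which is the point of standardising: every normal distribution can be read off the single standard normal curve.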
DESCRIPTIVE STATISTICS:
COUNTING RESPONSES
Whenever you ask a number of people to answer the same questions, or when you measure the
same characteristics for several people or objects, you want to know how frequently the possible
responses or values occur.
This can be as simple as counting the number of yes or no responses to a question. Or it can
be considerably more complicated, for example if you have asked people to report their annual
income or their ages.
For annual income and age, simply counting the number of times each unique income or
age occurs may not be a useful summary of the data. In such cases you need to resort to other means
to summarize and display the values of one variable at a time.
FREQUENCY TABLE
From this table, you can tell how frequently people gave each response.
It consists of rows which represent the responses given. It shows how many people gave each
response.
It also consists of a part labelled missing which will tell you how many respondents did not select
one of the responses.
The last row gives the total of the respondents who participated in the survey/study.
The frequency table also includes a column which shows the proportion of respondents who gave
each response in terms of percentages. These percentages help to compare various survey results.
The last column gives the cumulative percentage which gives the percentage of people who gave a
response and any response that precedes it in the frequency table. It is the sum of the valid
percentages for that row and all rows before it.
When reporting results based on cases with no missing values, you should also report the
percentage of cases that refused to give an answer. This is because missing values can be a big
problem, especially where many respondents refuse to answer, which makes interpretation of results
difficult.
Consider a study of 100 employees, of whom 55 reported that they were satisfied,
4 rated themselves unsatisfied and 41 refused to answer. That means 55% are satisfied. Now
remove the non-responses and the percentage becomes 93% (55 out of 59). So which is the right conclusion?
It may mean the company is full of satisfied employees, many of whom don't like to answer questions.
It may mean almost half of the employees are unhappy but fear to voice their dissatisfaction.
Always run frequency tables on your variables, because this will help detect
mistakes in the data files: each code captured in data entry will be reported in the
frequency tables.
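The columns of a frequency table (count, percent, valid percent and cumulative percent) can be illustrated with a few lines of Python. This is a sketch only: the function `frequency_table` is a hypothetical helper, `None` is assumed to mark a refusal, and the data simply reproduce the 100-employee scenario above.

```python
from collections import Counter

# Responses reproducing the scenario above; None marks a refusal to answer
responses = ["satisfied"] * 55 + ["unsatisfied"] * 4 + [None] * 41


def frequency_table(values):
    """Return (rows, missing): rows hold value, count, %, valid %, cumulative %."""
    counts = Counter(values)
    n_total = len(values)
    n_missing = counts.get(None, 0)
    n_valid = n_total - n_missing
    rows, cumulative = [], 0.0
    for value, count in counts.items():
        if value is None:
            continue  # missing responses are reported separately
        percent = 100 * count / n_total
        valid_percent = 100 * count / n_valid
        cumulative += valid_percent
        rows.append((value, count, round(percent, 1),
                     round(valid_percent, 1), round(cumulative, 1)))
    missing = (n_missing, round(100 * n_missing / n_total, 1))
    return rows, missing


rows, missing = frequency_table(responses)
for row in rows:
    print(row)
print("missing:", missing)
```

The gap between the percent column (55.0) and the valid percent column (93.2) for "satisfied" is exactly the ambiguity discussed in the scenario.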
PIE CHARTS
This is a visual display of a frequency table. It consists of slices for each row in the frequency table.
The size of the slice depends on the number of cases in the category.
BAR CHARTS
This is a visual display of a frequency table. It consists of bars for each row in the frequency table.
The length of the bar depends on the number of cases in the category. An example of bar chart is
shown below
HISTOGRAM
Some variables produce so many slices or bars that the chart becomes too crowded to be
useful (variables like income and age, mentioned earlier).
In a pie chart or bar chart, no provision is made for a value of a variable that does not occur. For
example, if you have the values 2 and 4 but no 3, the bar or slice for the value 2 will be next to
the bar for the value 4, hence not telling you that values of 3 are missing.
A better display in this case is the histogram. This groups adjacent values together. It is similar to a
bar chart except that each bar represents a range of values. An example of a histogram with a
normal curve superimposed is shown below. The normal curve will be discussed later.
MEASURES OF CENTRAL TENDENCY
MEAN: The sum of all the observations divided by the number of observations, i.e. ∑X/N
MODE: This refers to the observation with the highest frequency in the data set. It is usually
used for variables measured on a nominal scale. It is a useful statistic to report with a frequency
table or a bar chart.
MEDIAN: This is a statistical measure that divides the data into two equal parts. To get the
median of ungrouped data we arrange the data in either ascending or descending
order and then select the middle value.
MEASURES OF DISPERSION
RANGE: (for ungrouped data) This is the distance between the largest and smallest
observations.
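As a quick illustration of the measures defined so far, the Python sketch below computes the mean, median, mode and range for a small made-up data set. The ages are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import statistics

# Hypothetical ages of 8 respondents (illustrative data only)
ages = [23, 25, 25, 31, 38, 42, 25, 31]

mean = sum(ages) / len(ages)         # sum of X divided by N
median = statistics.median(ages)     # middle value of the ordered data
mode = statistics.mode(ages)         # most frequent observation
data_range = max(ages) - min(ages)   # largest minus smallest observation

print(mean, median, mode, data_range)  # 30.0 28.0 25 19
```

With an even number of observations there is no single middle value, so `statistics.median` averages the two middle values (25 and 31 here).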
INTER-QUARTILE RANGE;
VARIANCE AND STANDARD DEVIATION
SKEWNESS AND KURTOSIS:
KURTOSIS
Kurtosis: This is the degree of peakedness or flatness of a probability distribution relative to the
normal distribution with the same variance.
[Figure: diagrams showing mesokurtic, leptokurtic and platykurtic distributions (Examples 1 and 2)]
MODULE 5 STATISTICAL INFERENCE

INTERVAL ESTIMATION
It is difficult, if not impossible, to obtain a single value that truly represents the
population parameter, since different samples yield different point estimates of the
corresponding population parameter. Instead of relying on a single point estimate,
we determine an interval within which we expect the true value to lie with
some level of probability; such an interval is called the CONFIDENCE INTERVAL.
INTERVAL ESTIMATE: This defines an interval within which the true value is expected to lie
with some level of probability.
For instance, the (1 – α)100% confidence interval for the mean follows from the statement
$P\left(-z_{\alpha/2} \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\right) = 1 - \alpha$
where α: level of significance
1 – α: level of confidence.
Manipulating the inequality to isolate μ gives the interval
$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$
Z tells us how many standard deviations a value lies above or below the mean.
(Go to the examples and exercise)
Page 271. 1 & 2
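A confidence interval for the mean with known standard deviation can be sketched in Python as follows. This assumes the usual form, sample mean ± z(α/2) × σ/√n; the function name and the numbers in the usage line are hypothetical and are not taken from the referenced examples.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """(1 - alpha)100% CI for the mean when the population sd (sigma) is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value z(alpha/2)
    margin = z * sigma / sqrt(n)             # z(alpha/2) * standard error
    return sample_mean - margin, sample_mean + margin


# Hypothetical data: sample mean 50, sigma 10, sample size 100
low, high = z_confidence_interval(sample_mean=50.0, sigma=10.0, n=100)
print(round(low, 2), round(high, 2))  # 48.04 51.96
```

Raising the confidence level widens the interval, while a larger sample size narrows it.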
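CONFIDENCE INTERVAL FOR THE MEAN WHEN THE STANDARD DEVIATION IS UNKNOWN
The (1 – α)100% confidence interval for the mean when the population standard deviation is unknown is
based on the t-distribution with n – 1 degrees of freedom, using the sample standard deviation s:
$P\left(-t_{\alpha/2,\,n-1} \le \frac{\bar{x} - \mu}{s/\sqrt{n}} \le t_{\alpha/2,\,n-1}\right) = 1 - \alpha$
Manipulating the inequality gives $\bar{x} \pm t_{\alpha/2,\,n-1}\frac{s}{\sqrt{n}}$.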
Example 19. Page 273.
THE STATISTICAL TEST FOR μ.
Hypothesis testing.
HYPOTHESIS: This refers to a proposition made about a property of a population of
interest, which is then tested against the evidence in the data.
In hypothesis testing there are two kinds of hypotheses.
1. The null hypothesis: the hypothesis that is set up to be tested, and is then either rejected or not rejected. It is a
specific baseline statement and it usually takes such forms as "no effect" or "no
difference".
2. The alternative / research hypothesis: This is the hypothesis being proposed by the
researcher. Researchers often, but not always, expect that the evidence supports the
alternative hypothesis.
In hypothesis testing there are procedures to follow, and these include:
1. Statement of a clear null hypothesis and the corresponding alternative.
2. Specification of the level of significance.
3. Selection of the appropriate test statistic and specification of the rejection criteria.
   Pay attention to whether a test is one-tailed or two-tailed to get the right critical
   value and rejection region, i.e. use α/2 in each tail for a two-tailed test, or α for a one-tailed test.
4. Computation of the test statistic from the observed data, and of the p-value from the
   value of the test statistic.
5. Decision-making: decide whether or not to reject
   the null hypothesis by comparing the criterion set in (3) with the
   test statistic or p-value calculated in (4).
A STATISTICAL TEST FOR MEAN (μ) WHEN σ IS KNOWN OR WHEN THE SAMPLE SIZE N >= 30.
Here we use the Z-score.
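The testing procedure can be sketched in Python for the Z-score case. This is a hedged illustration of a two-tailed test with test statistic z = (sample mean - μ0)/(σ/√n); the function name `z_test` and the numbers used are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-tailed Z test of H0: mu = mu0 against H1: mu != mu0."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))       # test statistic
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # area in both tails
    critical = NormalDist().inv_cdf(1 - alpha / 2)    # uses alpha/2 per tail
    reject_null = abs(z) > critical
    return z, p_value, reject_null


# Hypothetical data: H0 says mu = 50; the sample of 100 has mean 52.5, sigma = 10
z, p_value, reject_null = z_test(sample_mean=52.5, mu0=50.0, sigma=10.0, n=100)
print(round(z, 2), round(p_value, 3), reject_null)  # 2.5 0.012 True
```

Comparing |z| with the critical value and comparing the p-value with α always lead to the same decision; they are two views of the same rejection region.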
A STATISTICAL TEST FOR MEAN (μ) WHEN σ IS UNKNOWN AND THE SAMPLE SIZE N < 30.
Here we use the Student's t distribution, $t_{\alpha/2,\,n-1}$.
If the population standard deviation is not known, the error bound for a population
mean is:
$EBM = t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}}$
where $t_{\alpha/2,\,n-1}$ is the t-score, with n – 1 degrees of freedom, that has an area of α/2 to its right.

MODULE 2
SAMPLING TECHNIQUES
Simple Random Sampling

- Develop a frame, or list, of the elements in the sampled population.
- Select a procedure based on random numbers to ensure that each element in the
sampled population has the same probability of being selected.
- You can use a table of random numbers or spreadsheet functions like RAND() and
RANDBETWEEN() to generate random numbers.
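The same steps can be sketched in Python, whose standard `random` module plays the role of the random number table or the RAND() function. The frame below is hypothetical, and the fixed seed is only there to make the illustration reproducible.

```python
import random

# Step 1: the frame, a list of every element in the sampled population
# (hypothetical frame of 500 numbered elements)
frame = [f"element_{i}" for i in range(1, 501)]

# Step 2: a random procedure giving every element the same chance of selection.
# Random.sample draws without replacement, so no element is chosen twice.
rng = random.Random(42)  # fixed seed so the illustration is reproducible
sample = rng.sample(frame, k=50)

print(len(sample), sample[:3])
```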
Population Mean
If n ≥ 30, the sampling distribution of the sample mean can be approximated by a normal probability distribution.
An interval estimate of μ is given by $\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}}$, where $\sigma_{\bar{x}}$ = standard error of the mean.
When a simple random sample of size n is selected from a finite population of size N, an estimate of
the standard error of the mean is
$s_{\bar{x}} = \sqrt{\frac{N-n}{N}}\left(\frac{s}{\sqrt{n}}\right)$
In this case the interval estimate of the population mean becomes $\bar{x} \pm z_{\alpha/2}\,s_{\bar{x}}$.
It is common practice to use a value of 2 for z, so an approximate 95% CI estimate of the
population mean is $\bar{x} \pm 2 s_{\bar{x}}$.
An approximate 95% CI estimate for the population total is $N\bar{x} \pm 2 N s_{\bar{x}}$.
Example: in a study, a simple random
sample of n = 50 public schools is selected from a population of N = 500 schools. The sample mean is $\bar{x}$ =
22000 sq ft and the sample standard deviation is s = 4000 sq ft.
$s_{\bar{x}} = \sqrt{\frac{500-50}{500}}\left(\frac{4000}{\sqrt{50}}\right) \approx 536.66$
So an approximate 95% CI estimate of the mean is $22000 \pm 2(536.66)$
$\approx 22000 \pm 1073$
The approximate 95% CI estimate of the population total is $500 \times 22000 \pm 2(500)(536.66)$
$= 11000000 \pm 536660$
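The school example can be re-computed with a short Python sketch. It assumes the finite-population standard error s·√((N - n)/N)/√n and takes n = 50, the sample size that reproduces the stated standard error of about 536; the function name `srs_estimates` is hypothetical.

```python
from math import sqrt


def srs_estimates(N, n, xbar, s, multiplier=2):
    """Approximate 95% interval estimates for the mean and total under SRS."""
    # Standard error of the mean with the finite population correction
    se = sqrt((N - n) / N) * s / sqrt(n)
    mean_ci = (xbar - multiplier * se, xbar + multiplier * se)
    total = N * xbar
    # The standard error of the total is N times that of the mean
    total_ci = (total - multiplier * N * se, total + multiplier * N * se)
    return se, mean_ci, total, total_ci


se, mean_ci, total, total_ci = srs_estimates(N=500, n=50, xbar=22000, s=4000)
print(round(se, 2))                       # 536.66
print([round(v) for v in mean_ci])        # [20927, 23073]
print(total, [round(v) for v in total_ci])
```

Note that the margin for the total is N times the margin for the mean, so it is far larger than 1073.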
Determining Sample Size
Recall that the bound on the sampling error, B, is 2 times the estimate of the standard error of the
point estimate, thus
$B = 2\sqrt{\frac{N-n}{N}}\left(\frac{s}{\sqrt{n}}\right)$
Solving for n gives
$n = \frac{N s^2}{N\left(\frac{B^2}{4}\right) + s^2}$
Thus by choosing a value for B, the value of n can be obtained.
Notice that a value of $s^2$ is required in order to compute n.
It can be obtained in one of two ways:
i. Use the results of a pilot survey
ii. Use information from a previous sample
Example: it is required to estimate a population mean, measured in dollars, for the graduates of a particular university. Suppose there are
N = 5000 graduates. We want to develop an approximate 95% CI with a width of at most
$1000. To provide this CI, B = 500.
We also need a value for s; suppose a previous study gave s = $3000.
We now have $n = \frac{5000(3000)^2}{5000\left(\frac{500^2}{4}\right) + 3000^2} = 139.97$, so a sample of 140 graduates is needed.
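The sample size calculation can be checked with a small Python sketch. It assumes the formula n = N·s² / (N·B²/4 + s²), obtained by solving B = 2·√((N - n)/N)·s/√n for n; the function name is hypothetical.

```python
from math import ceil


def srs_sample_size(N, s, B):
    """Sample size for estimating a mean to within a bound B under SRS."""
    return N * s**2 / (N * B**2 / 4 + s**2)


# Values from the university example: N = 5000 graduates, s = 3000, B = 500
n = srs_sample_size(N=5000, s=3000, B=500)
print(round(n, 2), ceil(n))  # 139.97 140
```

The result is rounded up, because rounding down would give a bound slightly larger than the B that was asked for.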
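Stratified Simple Random Sampling
Here the population is divided into H groups called strata. Then for each stratum h a simple
random sample of size $n_h$ is selected. The data from the H simple random samples are
combined to develop an estimate of a population parameter.
The population may be stratified by department, location, age, product type, industry type,
sales level and so on.
Point estimate of population mean
$\bar{x}_{st} = \sum_{h=1}^{H}\left(\frac{N_h}{N}\right)\bar{x}_h$
where
H = number of strata
$\bar{x}_h$ = sample mean of stratum h
$N_h$ = number of elements in the population in stratum h
$N = N_1 + N_2 + \cdots + N_H$ = total number of elements in the population
An estimate of the standard error of the mean is
$s_{\bar{x}_{st}} = \sqrt{\frac{1}{N^2}\sum_{h=1}^{H} N_h (N_h - n_h)\frac{s_h^2}{n_h}}$
where $s_h$ is the sample standard deviation and $n_h$ the sample size in stratum h.
The approximate 95% CI for the mean is $\bar{x}_{st} \pm 2 s_{\bar{x}_{st}}$.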
Suppose a survey of 180 graduates at a certain college yielded the results below:

Stratum h                 Mean     Std dev   N_h    n_h
Accounting                30,000   2,000     500    45
Finance                   28,500   1,700     350    40
Information technology    31,500   2,300     200    30
Marketing                 27,000   1,600     300    35
Operations management     31,000   2,250     150    30

We have $\bar{x}_{st} = \frac{500(30000) + 350(28500) + 200(31500) + 300(27000) + 150(31000)}{1500} = 29350$
as the point estimator of the mean.
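Calculation of the standard error of the mean

Stratum                   h    $N_h (N_h - n_h) s_h^2 / n_h$
Accounting                1    20,222,222,222
Finance                   2     7,839,125,000
Information systems       3     5,995,333,333
Marketing                 4     5,814,857,143
Operations management     5     3,037,500,000
Total                          42,909,037,698

Hence $s_{\bar{x}_{st}} = \sqrt{42909037698 / 1500^2} \approx 138.1$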
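The stratified calculation can be reproduced in Python. This sketch assumes the five columns of the survey table are, in order, the stratum sample mean, the sample standard deviation, the stratum population size N_h and the stratum sample size n_h, and it assumes the standard-error formula with terms N_h(N_h - n_h)s_h²/n_h, which reproduces the per-stratum values listed above.

```python
from math import sqrt

# (stratum, sample mean, sample sd, N_h, n_h), as read from the survey table
strata = [
    ("Accounting", 30000, 2000, 500, 45),
    ("Finance", 28500, 1700, 350, 40),
    ("Information technology", 31500, 2300, 200, 30),
    ("Marketing", 27000, 1600, 300, 35),
    ("Operations management", 31000, 2250, 150, 30),
]

N = sum(N_h for _, _, _, N_h, _ in strata)  # 1500 graduates in total

# Point estimate: stratum means weighted by their share of the population
x_st = sum(N_h * mean for _, mean, _, N_h, _ in strata) / N

# Sum of N_h (N_h - n_h) s_h^2 / n_h over the strata
var_sum = sum(N_h * (N_h - n_h) * sd**2 / n_h for _, _, sd, N_h, n_h in strata)
se = sqrt(var_sum) / N

ci = (x_st - 2 * se, x_st + 2 * se)  # approximate 95% CI
print(x_st, round(var_sum), round(se, 1))
```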
Determining Sample Size

There are 2 approaches;
1. A total sample size n is chosen first. We must then decide how to assign the sample
units to the various strata.
2. First decide how large a sample to take in each stratum, and then sum the stratum sample
sizes to obtain the total sample size.
Since it is often of interest to develop estimates of the mean, total and proportion for the
individual strata, a combination of these two approaches is often employed. The factors
considered most important in making the allocation are:
- The number of elements in each stratum
- The variance of the elements within each stratum
- The cost of selecting elements within each stratum.
Generally, larger samples should be assigned to the larger strata and to strata with larger
variances. Conversely, to get the most information for a given cost, smaller samples should be
allocated to the strata where the cost per unit of sampling is greatest.
The cost of selection can be an important consideration when significant interviewer travel
between sampled units is necessary in some strata but not in others, e.g. rural areas and urban
areas.
In many surveys the cost per unit of sampling is approximately the same for each stratum, e.g.
mail and telephone surveys. In such cases the cost of sampling can be ignored. The
appropriate formula to use is given by
$n_h = n\left(\frac{N_h s_h}{\sum_{h=1}^{H} N_h s_h}\right)$
That is, the number of units allocated to a stratum increases with the stratum size
and standard deviation. Note that we need to first determine the total sample size n. Given a
level of precision B, the following formulae can be used when estimating the population mean
and the population total:
Population mean: $n = \frac{\left(\sum_{h=1}^{H} N_h s_h\right)^2}{N^2\left(\frac{B^2}{4}\right) + \sum_{h=1}^{H} N_h s_h^2}$
Population total: $n = \frac{\left(\sum_{h=1}^{H} N_h s_h\right)^2}{\frac{B^2}{4} + \sum_{h=1}^{H} N_h s_h^2}$
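A minimal Python sketch of the allocation step is given below. It assumes the allocation n_h = n·N_h·s_h / Σ N_h·s_h (often called Neyman allocation), which matches the description that a stratum's share grows with its size and standard deviation, and the corresponding total sample size formula for estimating a mean, n = (Σ N_h·s_h)² / (N²·B²/4 + Σ N_h·s_h²). The stratum sizes and standard deviations are borrowed from the graduate survey table purely for illustration; the function names and the bound B = 300 are hypothetical.

```python
# Stratum name -> (N_h, s_h), borrowed from the graduate survey for illustration
strata = {
    "Accounting": (500, 2000),
    "Finance": (350, 1700),
    "Information technology": (200, 2300),
    "Marketing": (300, 1600),
    "Operations management": (150, 2250),
}


def neyman_allocation(n, strata):
    """Split a total sample size n across strata in proportion to N_h * s_h."""
    weights = {name: N_h * s_h for name, (N_h, s_h) in strata.items()}
    total_weight = sum(weights.values())
    return {name: n * w / total_weight for name, w in weights.items()}


def total_sample_size_for_mean(B, strata):
    """Total n needed to estimate the population mean to within a bound B."""
    N = sum(N_h for N_h, _ in strata.values())
    sum_Ns = sum(N_h * s_h for N_h, s_h in strata.values())
    sum_Ns2 = sum(N_h * s_h**2 for N_h, s_h in strata.values())
    return sum_Ns**2 / (N**2 * B**2 / 4 + sum_Ns2)


allocation = neyman_allocation(180, strata)
for name, n_h in allocation.items():
    print(f"{name}: {n_h:.1f}")
print(round(total_sample_size_for_mean(B=300, strata=strata), 1))
```

In practice the allocated sizes are rounded to whole numbers, and the rounded values should still add up to the intended total.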
Systematic Sampling
This is often used as an alternative to simple random sampling, because it can be time
consuming to select a simple random sample by first finding a random number and then searching
through the frame to locate the corresponding element.
It requires that the defined target population be ordered in some way, e.g. a list, roll or roster.
You need a skip interval, which is determined as the population size divided by the required sample size, N/n.
After identifying the skip interval, a starting point is randomly selected.
Suppose a sample size of 50 is required from a population of 5000 elements. We might
sample one element for every 5000/50 = 100 elements in the population.
A systematic sample for this case would involve randomly selecting one of the first 100
elements from the frame. The remaining sample elements are then identified by starting with
the first sample element and then selecting every 100th element that follows in the frame.
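The procedure for the 5000-element example can be sketched in Python as follows. The function name `systematic_sample` is hypothetical, the frame is simply the numbers 1 to 5000, and the sketch assumes the population size is an exact multiple of the sample size.

```python
import random


def systematic_sample(frame, n, rng=random):
    """Pick a random start in the first k elements, then take every k-th element."""
    k = len(frame) // n        # skip interval: population size / sample size
    start = rng.randrange(k)   # random starting position among the first k
    return [frame[start + i * k] for i in range(n)]


frame = list(range(1, 5001))   # ordered frame of 5000 elements
sample = systematic_sample(frame, 50, random.Random(1))

print(sample[:3], len(sample))
```

Only one random number is needed for the whole sample, which is what makes the method quicker than simple random sampling on a long list.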
Cluster Sampling
This requires that the population be divided into N groups of elements called clusters, such
that each element in the population belongs to one and only one cluster. E.g. suppose we
want to survey registered voters in this country. One approach would be to develop a frame
consisting of all registered voters and then select a simple random sample from this frame.
Alternatively, in cluster sampling we might choose to define the frame as the list of all districts
in the country. In this approach each district is a cluster which consists of a group of
registered voters.
Suppose we select a simple random sample of, say, 10 of these districts. At this point we can
collect data on all registered voters in each of these 10 districts; this approach is called Single-
Stage Cluster sampling. We could instead select a simple random sample of registered voters
from each of these 10 sampled clusters; this approach is called Two-Stage Cluster sampling.
Cluster sampling is similar to stratified sampling in that both methods divide the population
into groups. Cluster sampling tends to provide better results when the elements within each
cluster are not alike (heterogeneous).
In the ideal case, each cluster will be a small-scale version of the entire population, and
hence sampling a small number of clusters would provide good information about the
characteristics of the entire population.
Cluster sampling includes area sampling, where the clusters are counties, townships, cities or
other well-defined geographical sections.
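The difference between single-stage and two-stage cluster sampling can be sketched in Python. Everything in this example is hypothetical: the frame of 40 districts, the 200 voters per district and the second-stage sample of 20 voters are made up for illustration.

```python
import random

rng = random.Random(7)  # fixed seed so the illustration is reproducible

# Hypothetical frame: 40 districts (clusters), each a list of 200 voter IDs
districts = {
    f"district_{d}": [f"d{d}_voter_{v}" for v in range(200)]
    for d in range(40)
}

# Stage 1: a simple random sample of 10 clusters from the list of districts
chosen = rng.sample(sorted(districts), k=10)

# Single-stage: survey every voter in each chosen district
single_stage = [voter for d in chosen for voter in districts[d]]

# Two-stage: a simple random sample of 20 voters within each chosen district
two_stage = [voter for d in chosen for voter in rng.sample(districts[d], k=20)]

print(len(chosen), len(single_stage), len(two_stage))  # 10 2000 200
```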
Quota Sampling
This involves selecting prospective participants according to pre-specified quotas for
demographic characteristics (e.g. age, race, gender, income), specific attitudes (e.g. satisfied/not
satisfied, like/dislike) or specific behaviours (e.g. regular/rare customers, user/non-user). The
purpose of quota sampling is to ensure that pre-specified subgroups of the target population
are represented on the relevant sampling factors. Moreover, surveys frequently use quotas that
have been determined by the nature of the research objectives.
Determining quota sizes for each of the subgroups is somewhat subjective, but usually
percentages can be used. E.g. to get a sample of 1000 customers, if the age group 20-30
contributes about 50% of the target population, then its quota could be 50% of the sample, i.e. 500 customers.
Snow Ball Sampling
This involves identifying and qualifying initial prospective respondents who can in turn help
identify additional people to include in the study. It is also called referral sampling. It is
applicable where:
- The defined target population is small and unique
- Compiling the complete list of sampling units is very difficult
MODULE 3 Methods of data collection
There are various methods of data collection, the most important being observation,
personal interviews and questionnaires.
DIRECT OBSERVATION
This involves enumerators taking observations directly from the sampling units of interest,
e.g. in agricultural surveys enumerators observe and accurately measure the area under cultivation.
NOTE THAT
Observation techniques require prior knowledge of the field of research.
That is, one cannot observe cattle without knowing what cattle look like, or leaf types without
knowing how the different leaves appear.
ADVANTAGES
- It is free from errors due to memory lapse, as enumerators record everything as they
see it.
- Non-response errors are not encountered.
DISADVANTAGES
- It is expensive and time consuming, as it involves moving from one place to another.
- Transport and communication are a problem, especially in less developed countries
where road networks and other means of communication are poor.
- The technique is not always feasible, especially when observation of human behaviour
is involved. This is because people tend to change their behaviour while they are
being observed.
- You cannot get people's attitudes towards something by mere observation.
- Ethical issues concerning the privacy of individuals may be overlooked.
PERSONAL INTERVIEW
Interviewing is a technique that is used to gain an understanding of the underlying reasons
and motivations for people's attitudes, preferences or behaviour.
In this method of data collection, the enumerator is brought into contact with the respondent
(one to one or one to many) and asks him or her (them) questions about the subject under
study.
TYPES OF INTERVIEWS
STRUCTURED
- Based on a carefully worded interview schedule.
- Frequently requires short answers that are ticked off.
- Useful when there are a lot of questions which are not particularly contentious or thought
provoking.
- The respondent may become irritated by having to give over-simplified answers.
SEMI-STRUCTURED
The interview is focused by asking certain questions, but with scope for the respondent to
express himself or herself at length.
UNSTRUCTURED
This is also called an in-depth interview. The interviewer begins by asking a general question
and then encourages the respondent to talk freely. The interviewer uses an
unstructured format, the subsequent direction of the interview being determined by the
respondent's initial reply. The interviewer then probes for elaboration. "Why do you say
that?", "That is interesting, tell me more" and "Would you like to add anything else?" are
typical probes.
ADVANTAGES
- Interaction creates an opportunity for on-the-spot clarification of concepts and of the form of
information sought in the survey. It is very useful where the respondent is not sure of
the kind of responses to give.
- Suitable for both literate and illiterate populations.
- It has a higher response rate than questionnaires.
- Serious approach by the respondent, resulting in accurate information.
- Responses are complete and immediate.
- In-depth questions (probing) are possible.
- The interviewer is in control and can give help if there is a problem.
- Recording equipment can be used.
- It can be used to pilot other methods.
- It can investigate the motives and feelings of the respondent.
DISADVANTAGES
- It involves high expenses on transport and other field-related activities.
- It is prone to interviewer bias. Interviewers may at times ask leading and
suggestive questions. Such questions may bias the respondent.
- There is the problem of memory lapse.
- It is subject to problems like language barriers, non-response and hostility.
- It is also time consuming.
- Limited geographical coverage.
- Respondent bias: a tendency to please or impress, create a false personal image, or end
the interview quickly.
QUESTIONAIRRE METHOD
This particular method of data collection can be explored in four dimensions
 Self administered questionnaire
 Mail questionnaire
 Telephone interviews
 By computer (emails, directly)
 Mail questionnaire
This involves mailing questionnaires to prospective respondents with a list of instructions and
a letter explaining the objective and importance of the survey.
Respondents are expected to fill the questionnaires and return them by mail
Advantages
 Speed and cost reduction as it does not involve movement of people (enumerators)
 It is very effective where sampling unites are scattered
 It is possible to get correct information about sensitive issues since the people fill the
questionnaires privately
 It reduces errors due to interviews bias
 Correct information can be got since consultations can be made
Disadvantages
 It presupposes a high level of literacy among the respondents. This is not usually the
case in most African countries
 It assumes the existence of a good and efficient postal system, which may not be the
case
 There is a high rate of non-response
 Follow-ups are very difficult to conduct. (Some kind of reminder needs to be sent to
prompt the respondent to send back the questionnaire)
 Responses are usually slow, because people fill in the questionnaire at their own
pace
 The wrong person may fill in the questionnaire, thereby biasing the results.
Remember that interviewer bias includes:
 The interviewer asking leading questions
 Writing wrong information for personal reasons or due to stress
 The interviewer's appearance or manner affecting the respondent. For example, a
respondent who is upset by the way the interviewer is dressed may give wrong
information, leading to errors
CORRELATION AND REGRESSION

Correlation Analysis
Correlation definition: This refers to the statistical measure of the linear relationship
between any continuous random variables. It measures the magnitude and direction of the
linear relationship between the variables.
Correlation is divided into two categories:
1) Simple correlation.
2) Multiple correlation.
Simple correlation: This refers to the statistical measure of the linear relationship between
any two (2) continuous random variables, e.g. the physical stature of parents and that of
their offspring, or the demand for a product and its price.
Forms of Simple correlation
Simple correlation can either be linear or nonlinear.
Linear correlation:
N.B: If the relationship between the two variables is a perfect linear relationship, then the
points of the scatter plot fall on a straight line.
[Figures: scatter plots of perfect positive, perfect negative, positive and negative linear
correlation]
Nonlinear correlation:
[Figures: scatter plots of positive nonlinear correlation (Examples 1 and 2) and negative
nonlinear correlation]
N.B:
For positive correlation: Small values of X go with small values of Y while large values of X
go with large values of Y, e.g. the relationship between the age of an individual and his/her
weight.
For negative correlation: Small values of X go with large values of Y while large values of X
go with small values of Y, e.g. the relationship between the quantity demanded of a
product and its price.
Examples of no correlation
[Figures: two scatter plots showing no correlation (Examples 1 and 2)]
When there is zero or no correlation, there is no linear relationship between the variables:
a change in one variable is not accompanied by any systematic linear change in the other.
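The correlation coefficient:
The correlation coefficient is a value that lies between -1 and +1, i.e. -1 ≤ rxy ≤ +1
The closer the absolute value of the correlation coefficient is to 1, the stronger the linear
relationship between the variables, and vice versa. The sign shows the direction of the relationship.
The correlation coefficient can be calculated by using either parametric or nonparametric
techniques.
 One of the parametric techniques is Pearson's correlation coefficient.
Pearson's correlation coefficient is given by:
rxy = [n∑XY − (∑X)(∑Y)] / √{[n∑X² − (∑X)²] [n∑Y² − (∑Y)²]}
The correlation coefficient can also be calculated using the covariance of the variables X and
Y and their variances:
rxy = Cov(X, Y) / √[Var(X) Var(Y)]
Where:
Cov(X, Y) = ∑(Xi − X̄)(Yi − Ȳ) / (n − 1)
Var(X) = ∑(Xi − X̄)² / (n − 1) and Var(Y) = ∑(Yi − Ȳ)² / (n − 1)
 One of the nonparametric techniques is Spearman's rank correlation
coefficient.
Spearman's rank correlation coefficient is given by:
rs = 1 − 6∑di² / [n(n² − 1)]
where di is the difference between the two ranks of the ith pair of observations and n is the
number of pairs (this form of the formula assumes there are no tied ranks).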
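As an illustration, both coefficients can be computed by hand in a few lines of Python. This is only a minimal sketch using the standard raw-sums formula for Pearson's r and the no-ties formula for Spearman's rs; the function names and the price/quantity figures are hypothetical and are not taken from the handbook.

```python
import math


def pearson_r(x, y):
    """Pearson's r from the raw-sums formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


def ranks(values):
    """Rank each value from 1 (smallest) upward; assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rs(x, y):
    """Spearman's rs = 1 - 6*sum(d^2) / (n*(n^2 - 1)), valid when there are no ties."""
    n = len(x)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical data: price of a product and quantity demanded
price = [10, 12, 14, 16, 18, 20]
quantity = [95, 90, 70, 72, 50, 40]

r = pearson_r(price, quantity)
rs = spearman_rs(price, quantity)
print(f"Pearson r   = {r:.3f}")
print(f"Spearman rs = {rs:.3f}")
```

Both values come out strongly negative, as expected for price and quantity demanded. With tied values the simple rank formula no longer applies, and a statistical package should be used instead.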
The correlation coefficient specification table.
CORRELATION COEFFICIENT (absolute value)    COEFFICIENT SPECIFICATION
0.00 - 0.19                                 Very low or no correlation
0.20 - 0.39                                 Slight correlation
0.40 - 0.59                                 Moderate correlation
0.60 - 0.79                                 Substantial correlation
0.80 - 0.99                                 Very high or strong correlation
1                                           Perfect correlation
N.B: The correlation coefficient on its own does not show whether the relationship is
statistically significant, so the probability value (P value) is used to judge the significance
of the relationship between the random variables. If the P value is less than the level of
significance α, there is a significant relationship between the two random variables;
otherwise the relationship is not significant.
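The two ideas above (describing the strength of a coefficient and testing its significance) can be sketched in Python. The labels follow the specification table; the test statistic t = r√[(n − 2)/(1 − r²)] with n − 2 degrees of freedom is the usual t test for a Pearson correlation and is an assumption here, since the handbook does not say which test produces the P value. The function names are hypothetical.

```python
import math


def strength_label(r):
    """Describe |r| using the bands of the specification table."""
    size = abs(r)
    if size >= 1.0:
        return "Perfect correlation"
    if size >= 0.80:
        return "Very high or strong correlation"
    if size >= 0.60:
        return "Substantial correlation"
    if size >= 0.40:
        return "Moderate correlation"
    if size >= 0.20:
        return "Slight correlation"
    return "Very low or no correlation"


def t_statistic(r, n):
    """t statistic for H0: no linear correlation, with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        raise ValueError("t is undefined for a perfect correlation")
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical result: r = 0.5 observed in a sample of 27 pairs
print(strength_label(0.5))
print(round(t_statistic(0.5, 27), 3))
```

The P value is then read from the t distribution with n − 2 degrees of freedom, either from tables or from the output of SPSS or Stata, and compared with α.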
REGRESSION ANALYSIS
Regression: This refers to the statistical measure of the relationship between any two or
more random variables. It is a statistical technique for estimating the relationship among
variables, and it is used to find out whether there is any evidence of a relationship among
the variables of interest, often for the purpose of predicting future values. Regression is
concerned with the study of the dependence of one dependent variable on one or more
independent variables.
N.B: The independent variable can also be called the Explanatory/ Exogenous / Regressor
while the dependent variable can also be called the Explained / Endogenous / Regressand
Simple linear regression: This refers to the statistical measure of the relationship between
two random variables whereby one is the dependent variable and the other is the
independent variable.
In simple regression there is only one independent variable that is assumed to be affecting
the dependent variable. For example, consider the function y = f(x). Here x is the
independent variable while y is the dependent variable.
A linear regression is a statistical method that helps one understand the relationship
between two (or more) variables. It does this in three ways:
1. It uses one variable to predict the value of another variable
2. It tests hypotheses concerning the relationship between two variables
3. It quantifies the strength of the relationship between two variables
[Graph omitted]

As we did in our discussion of linear correlation, we will denote the two variables as
X and Y; X is the independent variable, Y the dependent one. A linear regression assumes
that there is a linear relationship between X and Y, and is given by the following formula:
Yi = b0 + b1Xi + εi for i = 1, …, n
where:
Yi is the ith value of the dependent variable
b0 is the y-intercept
b1 is the slope coefficient
Xi is the ith value of the independent variable
εi is the ith value of an error term

In simple linear regression there are two kinds of equations:
i) The deterministic equation: Yi = α + βXi. (A deterministic equation is a one-to-one
relationship: each value of X gives exactly one value of Y.)
ii) The stochastic equation: Yi = α + βXi + εi
The simple linear regression model used for modelling n data points has only one
independent variable, Xi, and two parameters, the intercept β0 (written α above) and the
slope β1 (written β above).
Consider the straight line:
Y = β0 + β1X
From the above equation β1 is the slope of the line, which shows the average change (increase or
decrease) in the dependent variable given a unit change (increase or decrease) in the independent
variable.
So given a random sample from a population, we estimate the population parameters and
obtain the sample linear regression model:
Ŷi = β̂0 + β̂1Xi
Where the formulas for the least squares estimates are:
β̂1 = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)²
β̂0 = Ȳ − β̂1X̄
The residual (the estimate of the error term), ei = Yi − Ŷi, is the difference between the true
value of the dependent variable, Yi, and the value predicted by the model, Ŷi.
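A short Python sketch of the least squares calculation may help. It assumes the standard estimates (slope = ∑(Xi − X̄)(Yi − Ȳ) / ∑(Xi − X̄)², intercept = Ȳ − slope × X̄) and takes each residual as the observed value minus the predicted value. The function name and the data are hypothetical.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical sample of five observations
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

intercept, slope = fit_simple_ols(x, y)
predicted = [intercept + slope * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

print(f"Fitted line: Y = {intercept:.2f} + {slope:.2f} X")
print("Residuals:", [round(e, 2) for e in residuals])
```

For a least squares line with an intercept the residuals always sum to zero (up to rounding), which is a quick check on the arithmetic.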
COEFFICIENT OF DETERMINATION
The coefficient of determination: This is a measure of how well the regression line fits the
sample observations; it is also known as the measure of goodness of fit and is usually
denoted by R². R² lies between 0 and 1, i.e. 0 ≤ R² ≤ 1
N.B: A value of zero means a poor fit (the line explains none of the variation in Y) while a
value of one (1) means perfect linearity.
The Adjusted R²: The Adjusted R² is similar to R², but it takes into account the number of
predictors used in the model relative to the sample size, so it is always less than or equal
to R².
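The sketch below shows one common way of computing both measures, assuming the usual definitions R² = 1 − SSE/SST and Adjusted R² = 1 − (1 − R²)(n − 1)/(n − k − 1), where n is the sample size and k the number of predictors. The function names, the data and the fitted values (taken from a hypothetical line Ŷ = 2.2 + 0.6X) are illustrative only.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SSE/SST for observed values y and fitted values y_hat."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
    sst = sum((yi - mean_y) ** 2 for yi in y)              # total variation
    return 1 - sse / sst


def adjusted_r_squared(r2, n, k):
    """Penalise R^2 for the number of predictors k, given sample size n."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical observations and the values fitted by Y-hat = 2.2 + 0.6 X
y = [2, 4, 5, 4, 5]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

r2 = r_squared(y, y_hat)
adj_r2 = adjusted_r_squared(r2, n=len(y), k=1)
print(f"R2 = {r2:.3f}, Adjusted R2 = {adj_r2:.3f}")
```

Adding a predictor can never lower R², but it lowers the Adjusted R² unless the new variable improves the fit enough to offset the penalty.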
MULTIPLE REGRESSION
Simple linear regression looks at the dependence of a quantitative variable on another;
however, when the dependent variable depends on more than one independent variable, the
analysis is called multiple regression. For example, the number of children a woman has
may depend on the age of the mother, the number of years spent at school and even the
number of co-wives.
Multiple Regression analysis: This refers to the statistical measure of the relationship
between three or more random variables. In other words it’s a statistical technique that
predicts the value of one dependent variable basing on two or more independent variables.
In multiple regression there are two or more independent random variables that are
assumed to be affecting the dependent variable. For example, consider the function
y = f(x1, x2, x3, …, xn). The x's are the independent variables while y is the dependent variable.
In multiple linear regression analysis a multiple linear regression model is fitted and it takes
the form:
Y = α + β1X1 + β2X2 + β3X3 + … + βnXn + ε
Where:
Y = The dependent variable
Xi = The ith independent variable (i = 1, …, n)
α = The intercept
βi = The coefficient of the ith independent variable
ε = The error / disturbance term
From the above equation each βi is a partial slope: it shows the average change (increase or
decrease) in the dependent variable given a unit change (increase or decrease) in the
independent variable Xi, holding the other independent variables constant.
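To make the model concrete, the following Python sketch estimates the coefficients by solving the normal equations (X'X)b = X'Y with Gaussian elimination. This is one standard way of obtaining least squares estimates, not necessarily the algorithm SPSS or Stata uses internally; the function names are hypothetical, and the data are constructed so that Y = 1 + 2X1 + 3X2 exactly, which makes the result easy to check.

```python
def solve_linear_system(a, b):
    """Solve a*z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - known) / m[i][i]
    return z


def fit_multiple_ols(rows, y):
    """rows holds [x1, x2, ...] per observation; returns [alpha, beta1, beta2, ...]."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated from Y = 1 + 2*X1 + 3*X2 with no error term
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [9, 8, 19, 18, 26]

coefficients = fit_multiple_ols(rows, y)
print([round(c, 6) for c in coefficients])
```

Because the example data contain no error term, the estimates recover the intercept 1 and the slopes 2 and 3 up to rounding. With real data the estimates only approximate the population parameters.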
N.B: Before fitting the multiple regression model, make sure that the independent variables
(the Xi's) are not highly correlated with one another by carrying out correlation analysis
(compute a correlation matrix among the independent variables). If any two independent
variables are highly correlated, use only one of them in the regression analysis, not both.
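The screening step can be sketched in Python as follows. The cut-off of 0.8 for "highly correlated" is an assumption chosen to match the "very high or strong" band of the correlation table; the handbook does not fix a threshold. The variable names and data are hypothetical.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson's r from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def highly_correlated_pairs(predictors, threshold=0.8):
    """Return (name1, name2, r) for every pair of predictors with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical independent variables for five women
predictors = {
    "age": [20, 25, 30, 35, 40],
    "years_married": [1, 5, 11, 14, 21],
    "years_school": [7, 12, 4, 10, 6],
}

flagged = highly_correlated_pairs(predictors)
for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b} are highly correlated (r = {r:.2f}); keep only one")
```

Here only age and years married are flagged, so one of the two would be dropped before fitting the model.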
MODULE 11
WRITING A RESEARCH REPORT
1 OVERVIEW
After completing data analysis, as a researcher you are obliged to conclude your project by writing a
research report. Basically this report contains the following items:
Research question
Relevant literature on the subject
Key findings
Limitations of the study
Writing the research report can be very challenging because it requires demonstrating a clear
understanding of the:
research question
analytical methods
and the findings
The intended audience should be able to understand the data analysis aspects without getting lost in
the details of the data analysis process and tools.
They need to understand the limitations as well as appreciate the insights of the study. Furthermore,
the majority of the audience may not be familiar with all the statistical methods used, so it is
imperative that every analysis is accompanied by a written explanation of the findings. At this report
writing stage, the data analyst becomes a ‘writer/narrator’ instead of an ‘explorer’.
2 WRITING STYLE
The first question to consider is the audience. Not all readers have the same skills or abilities to
understand statistics and therefore the style should be appropriate to the audience.
The following are guidelines:
Scientists writing to other scientists can take for granted that their readers will understand the
fundamental statistical methods used, such as those you have met in this course. However, this
audience will also need the findings and methods summarized so that they can critically evaluate
your research project.
A general audience needs a different approach. Statistical results like ‘coefficients’, ‘standard
errors’, ‘significance tests’, etc. will be meaningless to those who have little or no
understanding of statistics.
Audiences with more skills than yours, e.g. professors, will require an in-depth discussion of the
process used in addition to the most important findings.
In general, scientists and other general audiences are more interested in findings than in the process
of the data analysis.
In all cases, the report will contain statistical output, which should be explained. Try to keep only
analyses that are central to the research question. Also try to use statistical formats that are reader
friendly, in the form of tables, graphs, and written text.
3 STRUCTURE OF THE REPORT

Reports usually differ in structure but the structure below serves as a guide.
Title
Abstract
Introduction
Literature review
Methodology
Findings
Conclusion
References
Title
The title is a concise description of the research project. It is one of the means of attracting
your audience in the first place. If it lacks necessary information or introduces the subject
in an unsatisfactory manner, it will discourage the reader.
It must be captivating and carry enough information about the subject matter.
It is recommended to be between 5 and 10 words.
Avoid magazine or newspaper-style titles, e.g.
Down but not defeated
A struggle against all odds
Someone should save the community
These titles are captivating but do not offer enough information, and they portray the
researcher as an advocate of a cause rather than an objective analyst of data.
For example: who is down but not defeated? What is the problem with the community?
What is the struggle?
Consider these titles
The impact of taxation on disposable income
The democratic class struggle in Uganda 1990-2000
These titles are captivating and they also offer substantial information on the research project.
Abstract
Although this section appears as a major section, it should be written last. This is because it is a
concise summary of the report.
It should contain information about the research question, data, findings, and conclusions of the
study.
You should use words as effectively as possible. It is recommended to be no more than 1 page.
Introduction
This section introduces the rest of the report. It informs the reader of the research question and why
this question is important.
After reading through the introduction, the reader should have a clear idea of the question the writer
is addressing.
A good introduction begins with a phrase like “In this paper I intend to demonstrate that…”. This
brief phrase forces the writer to come to terms with the specific purpose of the paper and to focus the
rest of the report accordingly.
In this section avoid any unsupported statements. If they are to be used, begin by quoting the
source, e.g. “According to Uganda Government National statistics abstract of
2000/2001…”.
The introduction should not serve as a conclusion. Avoid statements like “In this paper I will prove
that poverty accounts highly for rising domestic violence…”. This is because
scientific methodology cannot prove anything; it only provides support for some hypotheses and refutes
other hypotheses. You can rephrase the statement as “In this paper I will examine whether poverty
accounts highly for rising domestic violence…”
The introduction only opens up the research question.
It is recommended to cover 1-3 pages.
Literature review
This section places the study in the context of the body of research. It is basically aimed at
acknowledging other researchers' work and contributions to the collective knowledge about the
subject under study, while informing the reader of how these contributions relate to the current
research project. This section highlights what is known and what is not known about the research
question.
At the bare minimum, a literature review is an overview of the essential findings from articles related
to the research question.
A literature review is significantly improved by creating a structure that links these various articles.
This structure provides the reader with a solid understanding of specific relationships.
As an example, when studying factors that are associated with family violence, the literature review can
be organized by looking first at gender, then age, race/ethnicity and the socio-economic
factors, in that order. This will provide a structure for understanding the specific relationships.
Another approach could be to use comparison by dates, for example examining studies between a base
period and a current period.
Avoid literature reviews that discuss studies in a random order or in the order in which the articles
were read. This is because that sequence does not matter to the audience.
The interest of the audience is in the research question, knowledge that has already been developed
and how the current research project fits into this body of literature.
Methodology
This section contains the methods used. It explains the following:
data collection techniques
data sources
sampling strategies, including sample size and sample characteristics
indicators
modifications made to the data
statistical procedures applied
It should be as concise as the rest of the sections. Its length will depend on the complexity of the data
collection and analysis, but generally it is recommended to be between 2 and 5 pages.
Findings
This section details the results of the study. You should stick to those findings that relate to the
research question, even though other interesting results might appear in the course of analysis.
Include tables, graphs and narrating text. This is because some audiences read tables and graphs and
ignore text, and vice versa (read text and ignore tables).
The findings should be related to the hypotheses. Include any conclusions on the hypotheses.
Depending on the study, this section usually ranges between 6 and 12 pages.
Conclusion
This section relates the findings to the research question. It also discusses the limitations of the
study.
A good conclusion generalizes findings and gives consideration to external validity, i.e. the degree to
which findings accurately describe behavior outside the confines of the study.
A biased sample, for example, threatens external validity. If the sample consists only of students,
avoid generalizing the findings to the whole population.
A good conclusion ends by opening up the research question to further investigation, for example by
saying “In this study I have shown that … Further research needs to be done to find
out if …”
References
This section provides a list of all the references cited in the report.
You should provide enough information to enable the reader to locate the source.
Further reading: Data analysis with SPSS (A first course in applied statistics)
Stephen A Sweet and Karen Grace-Martin 2003
