Rationale…
Limitations of statistics
As a science, statistics has its own limitations
– Deals with only quantitative information
1.2 Data
Scales of measurement…
Types of measurement scales
• Nominal
– Data that represent categories or names
– There is no implied order to the categories of
nominal data.
– No arithmetic and relational operation can be
applied.
– E.g.
• Blood type (A, B, O and AB)
• Eye color (brown, black, blue, etc.)
• Sex (Male, Female)
Types of measurement scales…
• Ordinal
– Categories that can be ranked, but differences
between ranks are not meaningful
– Arithmetic operations are not applicable, but
relational operations are.
– Ordering is the sole property of the ordinal scale.
– E.g.
• Degree of pain (minimal, moderate, severe)
• Rating scales (Excellent, Very good, Good, Fair, poor)
• Letter grade (A, B, C, D and F)
Types of measurement scales…
• Interval
– Data that can be ranked and differences are
meaningful. However, there is no meaningful
zero, so ratios are meaningless.
– Addition and subtraction are meaningful, but ratios
(multiplication and division) are not; relational
operations are also possible.
– E.g.
• IQ
• Temperature in degrees Fahrenheit (30°F is not
twice as warm as 15°F)
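A minimal sketch of why ratios fail on an interval scale: the same two temperatures give a different ratio once they are converted to kelvin, a scale with a true zero. The helper name `fahrenheit_to_kelvin` is introduced here only for illustration.

```python
def fahrenheit_to_kelvin(temp_f: float) -> float:
    """Convert degrees Fahrenheit to kelvin (a scale with a true zero)."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15

# Naive ratio on the Fahrenheit (interval) scale
ratio_f = 30 / 15
# Ratio of the same two temperatures on an absolute (ratio) scale
ratio_k = fahrenheit_to_kelvin(30) / fahrenheit_to_kelvin(15)

print(ratio_f)            # 2.0
print(round(ratio_k, 3))  # 1.032
```

The "two times" disappears after a change of units, which is why the ratio carries no physical meaning on the Fahrenheit scale.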
Types of measurement scales…
• Ratio
– Data can be ranked, differences are
meaningful, and there is a true zero.
– All arithmetic and relational operations are
applicable.
– E.g.
• Age (a 30-year-old is twice as old as a 15-year-old)
• Weight (0 kg means no weight)
• Number of drugs (0 means no drug)
1.7 Sources of data
– Methods of collection
• Personal interview (telephone, face-to-face…)
• Focus group discussion (FGD)
• Mail questionnaires
• Door-to-door survey
• New product registration
Sources of data…
Five stages
Stages in…
• Presentation of the data: The process of reorganization,
classification, compilation… of data
to present it in a meaningful form.
• Analysis of data: The process of extracting
relevant information from the summarized data
• Inference of data: The interpretation and further
observation of the various statistical measures
obtained through the analysis of the data
– And the application of those methods by which
conclusions are formed and inferences made.
1.10 Types of questions
1. Open-ended questions
• Permit free responses
• No list of possible answers is offered to
choose from.
• Mostly used for investigation of
• Facts with which the researcher is not familiar
• Opinions, attitudes, and suggestions of
informants
• Sensitive issues
Types of questions…Example
• Can you describe exactly what the traditional
birth attendant did when your labor started?
2. Close-ended Questions
• Offer a list of possible options/answers
• When designing closed questions you should
try to:
• Make sure lists are complete and mutually exclusive
(no two options can apply at the same time)
• Keep the number of options as few as possible
• It is useful if the range of possible responses
is known
Types of questions…Example
• What is your marital status?
1. Single
2. Married/living together
3. Separated/divorced/widowed
2. No
Steps in designing questionnaire
1. Content
Decide what questions will be needed to
2. Formulating Questions
3. Sequencing of Questions
5. Translation
If the interview will be conducted in one or
more local languages, translate
2. Data presentation
Data presentation
• Then, draw
Bar chart
• is the most widely used graphical method for
describing qualitative data.
• A set of bars representing some magnitude
over time or space.
• The common types of bar chart
– Simple
– Multiple
– Component … etc
Simple bar chart
[Figure: bar chart of percentage (%) by marital status (Single, Married, Divorced, Widowed), with separate bars for Male and Female]
Graphical presentation of data
• The commonly used graphs for
continuous data are
– histogram,
– Frequency polygon
– Ogive (CF graph)…
Histogram
• A graph which displays the data by using
vertical bars of various heights to represent
frequencies.
• Class boundaries are placed along the horizontal
axes.
• Example: Construct a histogram to represent
the previous data
– i.e., (example on grouped freq.distrib.)
Solution
[Figure: histogram of the grouped frequency distribution]
• Objectives
• To understand the data easily
• To facilitate comparison
• To make further statistical analysis
Types of measures of central tendency (MCT)
The Mean ( X̄ )
• The Arithmetic Mean:
• Is defined as the sum of the magnitudes of the
items divided by the number of items
• The mean of X1, X2, X3, …, Xn is denoted by
A.M., m or X̄ and is given by:
$\bar{X} = \dfrac{X_1 + X_2 + \cdots + X_n}{n}$, Or $\bar{X} = \dfrac{1}{n}\sum_{i=1}^{n} X_i$
Mean for Ungrouped data
Mean for grouped data
Example: calculate the mean for the
following data
The Mode ( X̂)
Examples:
The Median ( X̃ )
• In a distribution, median is the value of the
variable which divides it into two equal
halves.
• In an ordered series of data median is an
observation lying exactly in the middle of the
series.
Example:
Find the median of the following numbers.
a) 6, 5, 2, 8, 9, 4
b) 2, 1, 8, 3, 5
Solution:
a) First order the data: 2, 4, 5, 6, 8, 9
Here n=6, which is even
b) Order the data: 1, 2, 3, 5, 8
Here n=5, which is odd
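As a quick check of the two cases, here is a small sketch using Python's standard `statistics` module. For even n the median is the mean of the two middle values; for odd n it is the single middle value. The variable names are chosen only for this illustration.

```python
import statistics

data_a = [6, 5, 2, 8, 9, 4]  # n = 6 (even)
data_b = [2, 1, 8, 3, 5]     # n = 5 (odd)

# Even n: average of the two middle ordered values, 5 and 6
median_a = statistics.median(data_a)
# Odd n: the single middle ordered value
median_b = statistics.median(data_b)

print(median_a, median_b)  # 5.5 3
```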
MV: Measures of variation
Objectives of measures of variation
Types of Measures of Dispersion
The Range
Example: 32 35 36 42 42 43 43 45
Range is 45-32=13
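The same calculation as a one-line sketch in Python (range = largest value minus smallest value):

```python
data = [32, 35, 36, 42, 42, 43, 43, 45]

# Range = maximum observation minus minimum observation
data_range = max(data) - min(data)
print(data_range)  # 13
```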
Mean Deviation
The variance and standard deviation
Population Variance:
• If we divide the variation (the sum of squared
deviations from the mean) by the number of
values in the population, we get the
population variance.
• This variance is the "average squared
deviation from the mean"
Sample Variance
Sample variance formula
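For raw data:
$s^2 = \dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$, Or
$s^2 = \dfrac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$, shorthand formula
Or
$s^2 = \dfrac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n-1}$,
shorthand formula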
Standard deviation
Examples:
• Find the variance and standard deviation of
the following sample data
1. 5, 17, 12, 10.
2. The data is given in the form of frequency
distribution
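For item 1 only (the frequency table for item 2 is not reproduced here), a minimal sketch of the sample variance and standard deviation. It assumes the usual n − 1 divisor for a sample and checks the definitional formula against a shorthand form; the variable names are illustrative.

```python
import math
import statistics

sample = [5, 17, 12, 10]
n = len(sample)
mean = sum(sample) / n  # 11.0

# Definitional formula: sum of squared deviations divided by n - 1
var_def = sum((x - mean) ** 2 for x in sample) / (n - 1)

# Shorthand formula: (sum of squares - n * mean^2) / (n - 1)
var_short = (sum(x * x for x in sample) - n * mean ** 2) / (n - 1)

# Standard deviation is the positive square root of the variance
sd = math.sqrt(var_def)

print(round(var_def, 3), round(var_short, 3), round(sd, 3))
```

Both formulas give 74/3, and `statistics.variance(sample)` returns the same value.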
Cont…
Cont…
Coefficient of Variation (C.V)
Example:
• An analysis of the monthly wages paid to
workers in two departments, Pedi (A) and Ortho (B),
belonging to the same campus gives the
following results
Cont…
Standard Scores (Z-scores)
Cont…
Examples:
Solutions:
Measures of shape
• Measures of skewness
– Skewed to the right
– Skewed to the left
– Symmetric
• Measures of kurtosis
– Leptokurtic
– Mesokurtic
– Platykurtic
• Reading Assignment
4. Probability and its distribution
What is probability?
• It is the chance of an outcome in an experiment.
Important definitions
• Experiment: Any process of observation or measurement
which generates a well defined outcome.
– E.g. The parasite counts of malaria patients entering
hospital
• Probability Experiment:
– It is an experiment that can be repeated a number of
times under similar conditions
– It is possible to enumerate the total number of
outcomes without predicting an individual outcome.
Example: If a fair die is rolled once
– it is possible to list all the possible outcomes
• i.e.{1, 2, 3, 4, 5, 6}
– but it is not possible to predict which outcome will
occur.
Important def…
Approaches to measure probability
– These are:
Classical approach
Example
a) Number 4?
b) An odd number?
c) An even number?
d) Number 8?
Solutions:
• First identify the sample space,
S={1,2,3,4,5,6}; N=n(S)=6
a) Let A be the event of number 4; A={4}
NA= n(A)=1;
P(A)=n(A)/n(S)=1/6
b) Let A be the event of odd numbers; A={1,3,5}
NA= n(A)=3;
P(A)=n(A)/n(S)=3/6=0.5
c) and d)… calculate by yourself
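The classical rule P(A) = n(A)/n(S) used in parts a) and b) can be sketched as follows. The function name `classical_probability` is made up for this illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Sample space for one roll of a fair die
sample_space = {1, 2, 3, 4, 5, 6}

def classical_probability(event: set, space: set) -> Fraction:
    """P(A) = n(A) / n(S), valid when all outcomes are equally likely."""
    return Fraction(len(event & space), len(space))

p_four = classical_probability({4}, sample_space)       # part a)
p_odd = classical_probability({1, 3, 5}, sample_space)  # part b)

print(p_four, p_odd)  # 1/6 1/2
```

The same function can be applied to the events in c) and d) to check your own answers.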
Addition Rule
• The probabilities of all mutually exclusive (and exhaustive)
outcomes of an experiment sum to 1
– For any events A and B;
P(A ∪ B) = P(A) + P(B) − P(A ∩ B) … Addition rule
Example
• Out of 200 seniors at a certain college, 98 are women, 34
are majoring in Health Officer, and 20 Health Officer
majors are women. If one student is chosen at random
from the senior class, what is the probability that the
choice will be either a Health Officer major or a woman?
Solution:
P(HO major or woman) = P(HO major) + P(woman) − P(HO major
and woman)
= 34/200 + 98/200 − 20/200 = 112/200 = 0.56
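The addition rule applied to these counts, as a short sketch with exact fractions (variable names are illustrative):

```python
from fractions import Fraction

total = 200                     # seniors in the class
p_ho = Fraction(34, total)      # P(HO major)
p_woman = Fraction(98, total)   # P(woman)
p_both = Fraction(20, total)    # P(HO major and woman)

# Addition rule: subtract the overlap so it is not counted twice
p_either = p_ho + p_woman - p_both

print(p_either, float(p_either))  # 14/25 0.56
```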
Conditional probability
Example
Solution
A: Survive birth to age 25=0.95
= 0.65/0.95
=0.684
Probability distribution
Random Variable
Cont…
– X(HT)=X(TH)=1,
– X(TT)=0
• X=0, 1, 2
RV are of two types:
1. Discrete random variable:
Cont…
2. Continuous random variable: a variable that can
assume all values between any two given values
Examples:
– Length of time required to complete a kidney transplant
surgery.
Discrete Probability Distribution
or
Properties of a discrete probability distribution
1. 0 ≤ P(X = x) ≤ 1
2. ∑ P(X = x) = 1
Example:
– X(HT)=X(TH)=1,
– X(TT)=0
• X=0, 1, 2
X 0 1 2
P(X=x) 1/4 2/4 1/4
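The table can be reproduced by enumeration. This sketch assumes the experiment is two tosses of a fair coin with X = number of heads, which is consistent with the values X(HT) = X(TH) = 1 and X(TT) = 0 listed above; the full statement of the experiment is not shown here.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two tosses: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

# X = number of heads in each outcome
counts = Counter(outcome.count("H") for outcome in outcomes)

# P(X = x) = (number of outcomes with x heads) / (total outcomes)
distribution = {x: Fraction(c, len(outcomes)) for x, c in sorted(counts.items())}

for x, p in distribution.items():
    print(x, p)
```

The probabilities are 1/4, 1/2 and 1/4, each between 0 and 1 and summing to 1, as the two properties require.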
Example 2:
Solution
• What is the probability that a patient receives exactly 3
diagnostic services?
P(X=3) = 0.031
Probability distributions can also be displayed using a
graph
[Figure: bar graph of probability P(X = x) against the number of diagnostic services, x = 0 to 5]
Intro. to Expectation [ E(X) ]
Example
Solution
E(X) = X1P(X1) + X2P(X2) + … + X6P(X6)
= (0*0.067) + (1*0.229) + (2*0.053) + (3*0.031) + (4*0.010) + (5*0.006)
= 0 + 0.229 + 0.106 + 0.093 + 0.04 + 0.03 = 0.498
Mean and Variance of RV
• Mean of x is E(X)
Where,
Some general rules for expectation
1. E(k) = k
2. Var(k) = 0
3. E(kX) = kE(X)
4. Var(kX) = k² Var(X)
5. E(X+Y) = E(X)+E(Y)
Example:
• Calculate Mean and Variance of the above Example
(Example on the number of diagnostic services a
patient receives)
Solution
• Mean: E(X)
= (0*0.067) + (1*0.229) + (2*0.053) + (3*0.031) + (4*0.010) + (5*0.006)
= 0 + 0.229 + 0.106 + 0.093 + 0.04 + 0.03
= 0.498
Solution cont…
• Variance: Var(X) = E(X²) − [E(X)]²
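A sketch of the mean and variance calculation using the probabilities exactly as printed in the solution. Note that the printed values do not sum to 1, so the entry for x = 0 may be mistyped; because it is multiplied by 0 and 0², it does not affect E(X) or E(X²).

```python
values = [0, 1, 2, 3, 4, 5]
# Probabilities as printed in the worked solution
probs = [0.067, 0.229, 0.053, 0.031, 0.010, 0.006]

# E(X) = sum of x * P(X = x)
mean = sum(x * p for x, p in zip(values, probs))
# E(X^2) = sum of x^2 * P(X = x)
mean_sq = sum(x * x * p for x, p in zip(values, probs))
# Var(X) = E(X^2) - [E(X)]^2
variance = mean_sq - mean ** 2

print(round(mean, 3), round(mean_sq, 3), round(variance, 3))  # 0.498 1.03 0.782
```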
Discrete probability distribution
• Binomial Distribution
• Poisson Distribution
Binomial Distribution
• A binomial experiment is a probability experiment
which satisfies the following four requirements
• Assumptions of a binomial distribution.
1. The experiment consists of “n” identical trials
2. Each trial has only one of the two possible mutually
exclusive outcomes, success or a failure
3. The probability of each outcome does not change
from trial to trial, and
4. The trials are independent, thus we must sample
with replacement
Examples
Binomial Distribution…
• The outcomes of the binomial experiment and the
corresponding probabilities are called Binomial
Distribution.
Example
Solution
– Let X be the number of females in four single births in a
family.
– n=4, p=0.5
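With n = 4 and p = 0.5, the binomial probabilities P(X = k) = C(n, k) p^k (1 − p)^(n − k) can be tabulated with a short sketch. The function name `binomial_pmf` is chosen for this illustration.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 4, 0.5
# P(X = k) for k = 0, 1, ..., 4 female births
pmf = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

print(pmf)  # {0: 0.0625, 1: 0.25, 2: 0.375, 3: 0.25, 4: 0.0625}
```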
Solution cont…
Poisson Distribution
Poisson distribution…
– Car Accidents…
Example
Normal distribution
Where;
Properties of Normal distribution
• It is a continuous distribution
• Mean=Median=Mode
Normal distribution…
Standard Normal distribution
where,
Standard Normal distribution…
• Note that;
Example
Solution
Solution…
d) Exercise ??
5. Sampling
Definitions
Definitions…
The main concern in sampling
• Researchers are
Reasons for Sampling
– Reduced cost
– Greater speed
– Greater accuracy
– Greater scope
Sampling Techniques
– Systematic sampling
– Multi-stage sampling
Probability sampling
Simple Random Sampling(SRS)
Non probability sampling
• It is a sampling technique in which the choice of
individuals for a sample depends on the basis of;
– convenience,
– personal choice or
– Interest…etc
Types of Non probability sampling
• Judgment Sampling
– The person taking the sample has direct or indirect
control over which items are selected for the sample.
• Convenience Sampling
– The decision maker selects a sample from the
population in a manner that is relatively easy and
convenient.
• Quota Sampling
– The decision maker requires the sample to contain a
certain number of items with a given characteristic.
Many political polls are, in part, quota sampling.
Errors in sample survey:
a) Sampling error:
– Measurement
Sample size determination
How many subjects should a researcher study?
Sample size determination…
Sample size determination…
[Figure: balance between precision and cost]
Sample size determination…
Example:
• A prevalence of 10% from a sample size of 20
– would have a 95% CI of 3% to 23%,
– which is not very precise or informative.
• But, a prevalence of 10% from a sample of
size 400
– would have a 95% CI of 7% to 13%,
– which may be considered sufficiently accurate.
Sample size determination depends on the
Sample Size for estimating a Single Proportion
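$n = \dfrac{(z_{\alpha/2})^2 \, p \, q}{d^2}$
Where, p = proportion
q = 1 − p
d = the degree of precision
z_{α/2} = the standard normal value for the chosen confidence level at the α level of significance (1.96 for 95% confidence)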
Example:
$n = \dfrac{(z_{\alpha/2})^2 \, p \, q}{d^2} = \dfrac{1.96^2 \, (0.2)(0.8)}{(0.03)^2} \approx 683$
Example
$n_{\text{final}} = \dfrac{n}{1 + n/N} = \dfrac{683}{1 + 683/5000} \approx 601$
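The two steps above (initial sample size for a proportion, then the correction for a finite population of N = 5000) can be sketched as follows. The function names are hypothetical, z defaults to 1.96 for 95% confidence, and results are rounded up to the next whole subject.

```python
import math

def sample_size_proportion(p: float, d: float, z: float = 1.96) -> float:
    """n = z^2 * p * (1 - p) / d^2 for estimating a single proportion."""
    return z ** 2 * p * (1 - p) / d ** 2

def finite_population_correction(n: float, population_size: int) -> float:
    """Adjusted n = n / (1 + n / N) when the population is small."""
    return n / (1 + n / population_size)

# p = 0.2, d = 0.03 at 95% confidence
n_initial = math.ceil(sample_size_proportion(p=0.2, d=0.03))
# Correct for a source population of 5000
n_corrected = math.ceil(finite_population_correction(n_initial, 5000))

print(n_initial, n_corrected)  # 683 601
```

Calling `sample_size_proportion(p=0.5, d=0.03)` gives about 1067.1, which rounds up to 1068; p = 0.5 yields the largest sample size for a given d.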
Example
$n = \dfrac{(z_{\alpha/2})^2 \, p \, q}{d^2} = \dfrac{1.96^2 \, (0.5)(0.5)}{(0.03)^2} \approx 1068$
Sample Size for Estimating a Single Mean
$n = \dfrac{(z_{\alpha/2})^2 \, \sigma^2}{d^2}$
Where,
n = sample size
σ = standard deviation
d = desired precision (margin of error) = half the
width of the confidence interval (width w = 2d)
Example
$n = \dfrac{(z_{\alpha/2})^2 \, \sigma^2}{d^2} = \dfrac{(1.96)^2 \, (144)}{(2.5)^2} = 88.5 \approx 89$
Example
• Suppose d=1
• Then the sample size increases!
$n = \dfrac{(z_{\alpha/2})^2 \, \sigma^2}{d^2} = \dfrac{(1.96)^2 \, (144)}{1^2} = 553.2 \approx 554$
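A small sketch of the single-mean formula shows how strongly the required sample size depends on d. The function name is hypothetical, the variance 144 is taken from the examples, and results are rounded up.

```python
import math

def sample_size_mean(variance: float, d: float, z: float = 1.96) -> float:
    """n = z^2 * sigma^2 / d^2 for estimating a single mean."""
    return z ** 2 * variance / d ** 2

n_wide = math.ceil(sample_size_mean(variance=144, d=2.5))   # d = 2.5
n_narrow = math.ceil(sample_size_mean(variance=144, d=1))   # d = 1

print(n_wide, n_narrow)  # 89 554
```

Because d is squared in the denominator, tightening the precision from 2.5 to 1 multiplies the sample size by 2.5² = 6.25.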
But the population variance σ² is usually unknown!
6. Statistical Estimation and
Hypothesis Testing
Inference
– Point Estimation
– Interval Estimation
Statistical estimation…
Point Estimation
• It should be unbiased
• It should be consistent
– It is a random variable.
Two types of hypothesis
6. Making decision.
Given:
Steps
Conclusion:
• Reject Ho and conclude that the new diet produces an
improvement in the rats.
Test of Association
• Suppose
– A has r mutually exclusive and exhaustive classes.
Decision Rule:
Reject H0 at the α level of significance if the calculated
value of χ² exceeds the tabulated value of χ² with
(r − 1)(c − 1) degrees of freedom
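The decision rule can be sketched in a few lines. The 2×2 counts below are hypothetical and are not the data of the example that follows; the critical value 3.841 is the tabulated χ² for 1 degree of freedom at α = 0.05.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two factors
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

observed_counts = [[10, 20], [30, 40]]  # hypothetical 2 x 2 table
chi2, df = chi_square_statistic(observed_counts)

critical_value = 3.841  # tabulated chi-square, df = 1, alpha = 0.05
reject_h0 = chi2 > critical_value

print(round(chi2, 4), df, reject_h0)  # 0.7937 1 False
```

For these made-up counts the calculated value is below the tabulated value, so H0 (no association) is not rejected.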
Example
Solution:
Solution…
Solution…
Conclusion
• At 5% level of significance we have evidence to say
there is association between father and son regarding
boldness, based on this sample data.
Exercise
• We will discuss
– Two way scatter plot
3. No correlation (r=0)
5. Perfect negative(r=-1)
Pearson's Correlation…
– Chance
Pearson's Correlation…
BW 57.6 64.9 59.2 60.0 72.8 77.1 82.0 86.2 91.6 99.8
RMR 1325 1365 1342 1316 1382 1439 1536 1466 1519 1639
$r = \dfrac{13{,}510.72}{\sqrt{(1953.56)(102{,}424.9)}} = 0.955$
Interpretation
• There is a strong positive linear relationship between BW
and RMR
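The value of r can be reproduced from the ten (BW, RMR) pairs listed above. This is a plain-Python sketch of the formula r = Sxy / √(Sxx · Syy); the function name `pearson_r` is chosen for illustration.

```python
import math

bw = [57.6, 64.9, 59.2, 60.0, 72.8, 77.1, 82.0, 86.2, 91.6, 99.8]
rmr = [1325, 1365, 1342, 1316, 1382, 1439, 1536, 1466, 1519, 1639]

def pearson_r(x, y):
    """Pearson's correlation coefficient from sums of squares and products."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(bw, rmr)
print(round(r, 3))  # 0.955
```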
Solution…
Thus, y= 913.3729+6.91596*85
y=1501.2295
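The intercept and slope used in this prediction can be recovered by least squares. The sketch below assumes the fitted line is the simple regression of RMR on BW using the ten data pairs from the correlation example; the variable names are illustrative.

```python
bw = [57.6, 64.9, 59.2, 60.0, 72.8, 77.1, 82.0, 86.2, 91.6, 99.8]
rmr = [1325, 1365, 1342, 1316, 1382, 1439, 1536, 1466, 1519, 1639]

n = len(bw)
mean_x, mean_y = sum(bw) / n, sum(rmr) / n

# Least-squares estimates: slope = Sxy / Sxx, intercept = ybar - slope * xbar
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bw, rmr))
sxx = sum((x - mean_x) ** 2 for x in bw)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Predicted RMR for a body weight of 85
predicted = intercept + slope * 85

print(round(intercept, 2), round(slope, 3), round(predicted, 2))
```

The estimates agree with the values 913.3729 and 6.91596 used above, up to rounding.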
Interpretation
$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_p X_p + \varepsilon$
• Where:
– Y = response variable
– Xs = regressors
– b0 = intercept
– b1, b2, …, bp = slopes
– ε = random error component
Example: As a research question
…the SPSS output
…cont
• From the 1st table we can see the correlation
between Cig and CHD
• From the 2nd table again we can see the
ANOVA table
• We are interested in the 3rd table
– We will focus on the Unstandardized predicted and
residual values.
– The model “CHD=27.08+0.45Cig–5.92Exercise”
…cont
The interpretation looks
• The model “CHD=27.08+0.45Cig–5.92Exercise”
• Smoking and exercise are significant factors for CHD
The conclusion will be
• “In the given 21 countries, a 1-cigarette increase in
smoking raises the predicted CHD mortality by 0.45, holding
exercise constant” and “when exercise per week is decreased
by 1, the predicted CHD mortality increases by 5.92, holding
smoking constant”
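A short sketch of how the fitted model turns coefficients into predictions. It shows that a one-unit change in a predictor shifts the prediction by that predictor's coefficient, while 27.53 and 21.16 are predicted values at (Cig = 1, Exercise = 0) and (Cig = 0, Exercise = 1), not changes. The function name and the input values are illustrative.

```python
def predicted_chd(cig: float, exercise: float) -> float:
    """Fitted model from the output: CHD = 27.08 + 0.45*Cig - 5.92*Exercise."""
    return 27.08 + 0.45 * cig - 5.92 * exercise

# Effect of one more cigarette, exercise held fixed (arbitrary baseline)
effect_cig = predicted_chd(6, 2) - predicted_chd(5, 2)
# Effect of one less exercise unit per week, smoking held fixed
effect_less_exercise = predicted_chd(5, 1) - predicted_chd(5, 2)

print(round(effect_cig, 2), round(effect_less_exercise, 2))        # 0.45 5.92
print(round(predicted_chd(1, 0), 2), round(predicted_chd(0, 1), 2))  # 27.53 21.16
```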
8. Logistic Regression
Logistic Regression
• The model;
• The DV;
– Decision about research (0=stop and 1=continue)
• The IV;
– gender (0=F and 1=M)
• The model and the output looks
Meaning???
9. Survival analysis
What is Survival Analysis?
• Survival analysis refers to statistical methods
for analyzing survival data
[Figure: survival curve, % surviving (vertical axis) against month (horizontal axis, 0 to 50)]
Example: Four patients’ survival data are 10,
15+, 35 and 40 months. Estimate the survival
function
[Figure: survival curve, % surviving (vertical axis) against month (horizontal axis, 0 to 50)]
In 1958, the Product-Limit (P-L) method was
introduced by Kaplan and Meier (K-M)
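A minimal sketch of the product-limit estimate for the four patients in the example (10, 15+, 35 and 40 months, where "+" marks a censored observation). The function `kaplan_meier` is written for this illustration only; at each observed death time the survival estimate is multiplied by (1 − deaths / number at risk).

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each observed death time.

    times: follow-up times; events: 1 = death observed, 0 = censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        removed = sum(1 for tt, _ in data if tt == t)
        if deaths:
            # Multiply by the conditional probability of surviving past t
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored subjects both leave the risk set
        n_at_risk -= removed
        i += removed
    return curve

times = [10, 15, 35, 40]
events = [1, 0, 1, 1]  # the patient at 15 months is censored (15+)

km_curve = kaplan_meier(times, events)
print(km_curve)  # [(10, 0.75), (35, 0.375), (40, 0.0)]
```

The censored patient contributes to the risk set at 10 months but causes no drop in the curve at 15 months.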
• For example,
• The sensitivity of mammography for detecting breast
cancer is 90%. This value is interpreted as “90% of
women who have biopsy-proven breast cancer will have a
positive mammogram.”
• The specificity of mammography for detecting breast
cancer is also about 90%. This value is interpreted as
“90% of women who have biopsy-proven absence of
breast cancer will have a negative mammogram.”
• Bias
• Confounding
• Chance
Reliability
• The degree to which results are
consistently measured by any type
of data collection instrument
– medical test
– medical record
– observation
– study questionnaire
Example:
• CONCLUSIONS?
• PROBLEMS?
Conclusions:
RR = 2.13 (1.05 - 12.10) p = 0.01
• P-value:
We have observed an association that is significantly
different from the null value (RR=1): if the null
hypothesis were true, the probability of observing an effect
at least this large by chance alone would be 1 in 100.
• Confidence Interval:
If we did this study 100 times (took 100 different
samples from the target population) approximately
95% of the time the interval would cover the true
population measure.
Conclusions/Problems
RR = 2.13 (1.05 - 12.10) p = 0.01
• Random error
– reflects fluctuations around the true value of a
parameter.
– is essentially attributable to sampling variation,
the extent of which may depend on aspects of the
study design (e.g. sample size) and statistical
characteristics of the estimator (e.g. its variance).
• Systematic or non-random error
– leads to BIAS
– reflects a deviation of results or inferences from
the truth.
– the processes leading to such deviation can be
introduced at any point in an investigation.
Errors and Study Size
(BIAS)
Effect of Bias
• Bias will result in an estimate that is not the same as
the true value.
• Directions of bias:
– Away from the null:
• study RR=8, true RR=2
• study RR=0.5, true RR=0.9
– Towards the null:
• study RR=1.3, true RR=5.0
• Study RR=0.9, true RR=0.4
– “Switchover”:
• Study RR=0.5, true RR=2.0
Internal vs. External Validity
[Figure: scatter of X marks illustrating internal vs. external validity]
Aday, 1996
How can the relative risk or odds ratio be wrong?
Systematic Error
Confounding
Study participant
Study instrument
2) Study implementation:
Quality Assurance & Quality Control
            Disease Yes   Disease No   Total
Test Pos    TP            FP           TP + FP
Test Neg    FN            TN           TN + FN
Total       TP + FN       TN + FP
Sensitivity = TP / (TP + FN)
            Disease Yes   Disease No   Total
Test Pos    TP            FP           TP + FP
Test Neg    FN            TN           TN + FN
Total       TP + FN       TN + FP
Sensitivity = TP / (TP + FN)    Specificity = TN / (TN + FP)
False positive & negative results
• False positives
– burden on HC system
– unnecessary anxiety
– labeling
• False negatives
– delay treatment
– false sense of “security” regarding risk
behaviors
Improving sensitivity and/or specificity
• Sequential testing
– initial test positives examined using other
method
– improves specificity
• Simultaneous tests
– multiple variables assessed at the same time
– improves sensitivity
Measure of Yield
PVP = TP / (TP + FP)
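A sketch of the three measures computed from the cells of the 2×2 table. The counts are hypothetical: 100 diseased and 900 non-diseased people, chosen so that sensitivity and specificity are both 90%, as in the mammography illustration.

```python
def diagnostic_measures(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity, specificity and predictive value positive (PVP)."""
    return {
        "sensitivity": tp / (tp + fn),  # positives among the diseased
        "specificity": tn / (tn + fp),  # negatives among the non-diseased
        "pvp": tp / (tp + fp),          # diseased among the test positives
    }

# Hypothetical counts: 100 diseased, 900 non-diseased
measures = diagnostic_measures(tp=90, fp=90, fn=10, tn=810)
print(measures)  # {'sensitivity': 0.9, 'specificity': 0.9, 'pvp': 0.5}
```

Even with 90% sensitivity and specificity, only half of the positive tests are true positives here, because the disease is uncommon in this made-up population.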
Influences on PVP
– Inter-rater
• % agreement, kappa statistic
– Internal consistency
• Kuder-Richardson 20 (KR-20), Cronbach’s coefficient
alpha
– Test-retest
• Quantified by correlation coefficient
*See Szklo book for more examples*
Assessing agreement between observers,
instruments, etc.
• Percent (observed) agreement
– proportion of measurements that have the same
results by two (or more) methods, expressed as a
percentage
In medical research:
κ > 0.75 excellent
0.40 < κ < 0.75 good
0 < κ < 0.40 marginal/poor
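Percent agreement and the kappa statistic can be sketched for two observers as follows. This assumes the cut-offs above refer to kappa; the 2×2 agreement table is hypothetical and the function name is illustrative.

```python
def agreement_and_kappa(table):
    """Return (observed agreement, Cohen's kappa) for a square rating table.

    table[i][j] = number of subjects rated i by observer 1 and j by observer 2.
    """
    total = sum(sum(row) for row in table)
    # Observed agreement: proportion on the main diagonal
    observed = sum(table[i][i] for i in range(len(table))) / total
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Agreement expected by chance from the marginal totals
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

ratings = [[40, 10], [5, 45]]  # hypothetical counts for two observers
observed, kappa = agreement_and_kappa(ratings)

print(round(observed, 2), round(kappa, 2))  # 0.85 0.7
```

Here 85% observed agreement corresponds to a kappa of 0.70 once chance agreement (50%) is removed, which falls in the "good" band.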
Correlation coefficient
[Figure: three scatter plots of paired observer measurements, panels A, B and C]
• All 3 r=1.0
– A. Both observers get same exact value
– B and C. Systematic differences between
observers, but very reliable differences
Intraclass Correlation Coefficient (ICC)