Objectives
• Define data
• Classify variables.
Definition of Statistics
• Is a field of study concerned with the collection, organization,
summarization, analysis and interpretation of data, and
Cont…
•Biostatistics is the segment of statistics that deals with data
Data
patients (counting).
rate(measurement).
Types of Statistics
1. Descriptive statistics: consists of procedures used to summarize and
• Tables
• Graphs and
2. Inferential Statistics
III. Presentation of data: after the data has been collected and
organized, they are ready for presentation.
B. Quantitative variable
Types of quantitative variables
A. Discrete variable
• Can assume only a finite or countable number of values between any two values.
• These gaps or interruptions indicate the absence of values between particular values
• Episodes of diarrhea/day.
Scales of Measurement
1. Nominal variable
• it consists of naming or classifying observations into various
categories.
• Have unordered categories and no magnitude.
• numbers used to represent categories.
• Numbers help to decide whether the categories are the same or different
(comparisons are = or ≠ ).
• The descriptive summary measure is the proportion of subjects who possess
the attribute.
Examples: drug category: antibiotics, analgesics, diuretics
• Categories can be compared as to whether they are the same or not and put in
order.
• the members of one category are considered lower, worse, or smaller than those in
equal.
• Example: patients may be characterized as:
1. unimproved 2. improved 3. much improved.
• Level of pain (1. mild 2. moderate 3. severe)
Anemia status: 1. mild anemia 2. moderate anemia 3. severe anemia
3. Interval variable
• Distance between any two measurements is known.
• For example, in the Fahrenheit temperature scale, the difference between 70 and 71
degrees is the same as the difference between 32 and 33 degrees.
• But the scale is not a ratio scale (40 degrees Fahrenheit is not twice as much as 20
degrees Fahrenheit).
• No true Zero value.
• Eg: Temperature
4. Ratio data
• Equality of ratios as well as equality of intervals may be
determined.
Objectives
Methods of Data Collection
Data collection methods allow us to systematically collect data
about our subjects of study. They are:
• Observation
• Interviews (Face-to-Face, Telephone)
• self-administered questionnaire
• Focus Group Discussion (FGD)
• Using available information(secondary sources)
− Location
− Time available for data collection
− Infrastructure available (telephones, mail service, internet access).
• Resources(money)
1. Observation
• Involves selecting, watching and recording behaviors of
people or other aspects of the setting in which they occur.
• Includes all methods from simple visual observations to the use
of high level machines and measurements(X-ray machines,
microscope, clinical examinations).
• Guidelines or check list should be prepared for the observations
prior to actual data collection.
A. Face to face-interview
B. By telephone
A. face to face-interview
person.
face to face-interview……
Advantages:
B. Telephone interview
Advantages
• Quick
• Can cover reasonably large number of people
• Wide geographic coverage
• High response rate
• Help can be given to the respondent
• Can record answers
3. Questionnaire method
A. Self administered Questionnaire
• the respondent reads the questions and fills in the answers
by himself.
Advantages
• Can cover a large number of people or organizations
• No interviewer bias
• simpler and cheaper (can be administered to many
persons simultaneously).
B. Mailed Questionnaire Method
• The investigator prepares a questionnaire and sends it by post or
email to the informants together with a polite covering letter.
• The main problems with postal questionnaire are
• Morbidity reports
• Mortality reports
• Epidemic reports
• Laboratory data
• Demographic data (census)
• Official publications of Central Statistical Authority
For example
1. What is the importance of the traditional medicine “tikurmud”?
2. “Can you describe exactly what the traditional birth attendant did
when your labor started?”
3. “What do you think are the reasons for a high drop-out rate of
village health committee members?”
What is your opinion about the care provided by nursing
professionals?
Steps in Designing a Questionnaire
Step1: Content
• Take your objectives and variables as your starting point
Step 2: Formulating
• Formulate one or more questions that will provide the information
needed for each variable.
• Check whether each question measures one thing at a time.
questions.
Cont’d
Step 5: Translation
• If interviews will be conducted in one or more local
languages, the questionnaire has to be translated to
standardize the way questions will be asked.
• After having it translated you should have it retranslated
into the original language.
• You can then compare the two versions for differences and
make a decision concerning the final phrasing of difficult
concepts.
Methods of Data Organization and Presentation
1. THE ORDERED ARRAY
• Is a listing of the values in ascending or descending order.
• It is the first step in organizing data.
Eg. age of the patients: 34, 37, 44, 30, 38, 35, 37,38, 40, 44, 43
Ordered array= 30, 34, 35, 37,37, 38, 38, 40, 43, 44, 44
2. Frequency Distribution
• Is the arrangement of data set using actual values and their
corresponding frequency of occurrence.
• It presents data in a compact form and gives a good overall
picture.
• Frequency: the number of times a
particular value appears.
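The ordered array and the frequency distribution described above can be sketched in a few lines of Python. This is only an illustration using the patient ages from the ordered-array example; the variable names are chosen here and are not part of the original notes.

```python
from collections import Counter

# Ages of the patients, copied from the ordered-array example above.
ages = [34, 37, 44, 30, 38, 35, 37, 38, 40, 44, 43]

# Ordered array: the same values listed in ascending order.
ordered_array = sorted(ages)

# Frequency distribution: each distinct value with its number of occurrences.
frequency = Counter(ordered_array)

print("Ordered array:", ordered_array)
for value in sorted(frequency):
    print(f"age {value}: frequency {frequency[value]}")
```

Sorting first is not required for counting, but it mirrors the order of the steps in the notes.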
1. Categorical frequency Distribution
• Used for organizing and presenting qualitative data(nominal,
or ordinal). e.g. marital status, sex, disease Dx.
Eg: categorical frequency distribution for Sex of year-II
nursing students.
2. Ungrouped frequency distribution
3. Grouped Frequency Distribution
• Used for organizing and presenting large set of quantitative data.
Example:
• Construct a grouped frequency distribution of the following
data on the amount of time (in hours) that 80 college
students devoted to leisure activities during a typical school
week:
23 24 18 14 20 24 24 26 23 21
16 15 19 20 22 14 13 20 19 27
29 22 38 28 34 32 23 19 21 31
16 28 19 18 12 27 15 21 25 16
30 17 22 29 29 18 25 20 16 11
17 12 15 24 25 21 22 17 18 15
21 20 23 18 17 15 16 26 23 22
11 16 18 20 23 19 17 15 20 10
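One possible way to build the grouped frequency distribution for these data, together with relative and cumulative frequencies, is sketched below. The class limits 10-14, 15-19, ..., 35-39 (width 5) are an assumption made here, chosen to agree with the class midpoints and boundaries shown on the axes of the figures in these notes; other class widths are equally valid.

```python
# Leisure-time data (hours per week) for the 80 students listed above.
hours = [
    23, 24, 18, 14, 20, 24, 24, 26, 23, 21,
    16, 15, 19, 20, 22, 14, 13, 20, 19, 27,
    29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
    16, 28, 19, 18, 12, 27, 15, 21, 25, 16,
    30, 17, 22, 29, 29, 18, 25, 20, 16, 11,
    17, 12, 15, 24, 25, 21, 22, 17, 18, 15,
    21, 20, 23, 18, 17, 15, 16, 26, 23, 22,
    11, 16, 18, 20, 23, 19, 17, 15, 20, 10,
]

# Assumed class limits: 10-14, 15-19, ..., 35-39 (class width 5).
class_limits = [(lower, lower + 4) for lower in range(10, 40, 5)]

n = len(hours)
table = []
cumulative = 0
for lower, upper in class_limits:
    # Frequency: number of observations falling inside the class limits.
    freq = sum(1 for h in hours if lower <= h <= upper)
    cumulative += freq
    table.append({
        "class": f"{lower}-{upper}",
        "frequency": freq,
        "relative": freq / n,        # relative frequency
        "cumulative": cumulative,    # "less than" cumulative frequency
    })

for row in table:
    print(f'{row["class"]:>6} {row["frequency"]:>4} '
          f'{row["relative"]:>7.3f} {row["cumulative"]:>4}')
```

The last cumulative frequency should equal the number of observations, which is a quick check that no value fell outside the chosen classes.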
Cumulative and Relative Frequencies
3. Statistical Tables
• A statistical table is systematic presentation of numerical data
in rows and columns.
• Rows are horizontal and columns are vertical arrangements.
Types of Table
Source: Fikru T et al. EPI Coverage in Adami Tulu. Eth J Health Dev
1997;11(2): 109-113
2. Cross tabulation /Two-way table
• Is used to obtain the frequency distribution of one variable by the
subset of another variable.
Table 3: Frequency and percentage distribution of anemia among women of reproductive age in
Ethiopia by age, 2018: Data from 2016 EDHS (n=14,489).
Source: Central Statistical Agency (CSA) [Ethiopia] and ICF. 2016. Ethiopia Demographic and Health Survey 2016.
Addis Ababa, Ethiopia, and Rockville, Maryland, USA: CSA and ICF. Ethiopia, 2017.
3. Higher order table
4. Diagrammatic Presentation of Data
1. Bar Chart
• Are used to represent and compare the frequency distribution
of categorical and discrete variables.
• All the bars must have equal width and the distance between
bars must be equal.
Cont’d
Example of simple bar diagram
[Figure: simple bar diagram. X-axis: Immunization status (Not immunized, Partially immunized, Fully immunized); Y-axis: Number of children, scale 0 to 100.]
[Figure: pie chart with segments FI, NI and PI; segment percentages 27%, 37% and 36%.]
Fig. Immunization status of children in X
3. Histograms
• A histogram is the graph of the frequency distribution of
continuous variables.
• It is constructed on the basis of the following principles:
a) values of the variable under consideration are
represented by the horizontal axis.
• It should be labeled with the name of the variable and the
units of measurement.
b) frequency or relative frequency is represented by vertical
axis(bars)
Cont…
c) For each class in the distribution, a vertical bar (rectangle) is
drawn with:
[Figure: histogram. X-axis: Amount of time spent; Y-axis: Number of students.]
Fig 6: Histogram for amount of time college students devoted to leisure activities
in X college in Y year.
4. Frequency Polygon
• If we join the midpoints of the tops of the adjacent rectangles
of the histogram with line segments, a frequency polygon is
obtained.
• When the polygon is continued to the X-axis just outside the
range of the lengths, the total area under the polygon will be
equal to the total area under the histogram.
[Figure: frequency polygon. X-axis: class midpoints 7, 12, 17, 22, 27, 32, 37, 42; Y-axis: Number of students.]
Fig 7: Frequency polygon curve on time spent for leisure activities by students
5. Ogive or cumulative frequency curve
[Figure: cumulative frequency curve. X-axis: class boundaries 4.5, 9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5; Y-axis: cumulative frequency.]
Fig 8: Cumulative frequency curve for amount of time college students devoted to leisure
activities in X college, in Y year.
6. The line diagram
[Figure: line diagram. X-axis: Year, 1967 to 1979; Y-axis scale 0.0 to 3.0.]
Session objectives
At the end of this session, the student will be able to:
• Identify the different methods of data summarization
Introduction
•In a large data set, it is easy to lose track of the overall picture by
possible.
1. Measures of Central Tendency
•Mean(computed average)
•Median(positional average)
•Mode(frequent observation)
Characteristics for a good average
1. It should be rigidly defined.
2. It should be easy to understand and compute.
3. It should be based on all items in the data.
4. Its definition shall be in the form of a mathematical formula.
5. It should be capable of further algebraic treatment.
6. It should have sampling stability.
7. It should be capable of being used in further statistical
computations or processing.
1. The Arithmetic Mean(computed average)
$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \quad i = 1, 2, 3, \ldots, N$
N = number of observations in the population
• The mean for data organized by ungrouped frequency
distribution is given as:
Mean = $\frac{\sum f_i x_i}{\sum f_i}$
where $f_i$ is the frequency of each observation $x_i$.
Example: The mean of a set of ten temperatures
(7, 9, 11, 4, 7, 13, 9, 6, 11, 13) is found by adding them together
and dividing by 10.
• x̅ = 90/10 = 9
Cont’d
• Example: Find the mean age of the following data.
Age         10   15   20   25   30   Total
Frequency    3    6    5    4    2    20
• Suggested answer:
X̅ = (10×3 + 15×6 + 20×5 + 25×4 + 30×2) / 20 = 380 / 20 = 19
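The mean of an ungrouped frequency distribution, Σfixi / Σfi, can be checked with a short Python sketch using the age and frequency values from the table above (the variable names are illustrative only):

```python
# Ages (x_i) and their frequencies (f_i) from the table above.
ages = [10, 15, 20, 25, 30]
frequencies = [3, 6, 5, 4, 2]

# Numerator: sum of f_i * x_i; denominator: sum of f_i.
total_fx = sum(f * x for f, x in zip(frequencies, ages))
total_f = sum(frequencies)

mean_age = total_fx / total_f
print(f"sum(f*x) = {total_fx}, sum(f) = {total_f}, mean = {mean_age}")
```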
Example
• Suppose the sample shown below consists of birth
weights (in grams) of all live born infants born at a
private hospital in a city during a 1-week period:
$\bar{X} = \frac{1}{n}\sum x_i = \frac{1}{20}(3265 + 3260 + \dots + 2834) = \frac{63{,}338}{20} = 3166.9\ \text{g}$
Characteristics of mean
it is zero.
• It always exists
Disadvantages
•For samples with an odd sample size, there is a unique central point.
•Eg. for sample of size 9, the fifth largest point is the central point in the
sense that 4 points are both smaller and larger than it.
• Median is the $\left(\frac{n+1}{2}\right)^{\text{th}}$ observation if n is odd.
• For samples with an even size, there is no unique central
point and the middle 2 values must be averaged.
• Median is the average of the $\left(\frac{n}{2}\right)^{\text{th}}$ and $\left(\frac{n}{2}+1\right)^{\text{th}}$
observations if n is even.
• The rationale for these definitions is to ensure an equal
number of sample points on both sides of the sample
median.
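The two rules for the median (odd and even sample size) can be sketched as a small Python function. The two data sets used below are hypothetical values invented for illustration; they do not come from these notes.

```python
def median(values):
    """Median by position: the middle value if n is odd,
    the average of the two middle values if n is even."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th observation (index `middle` when counting from 0).
        return ordered[middle]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th observations.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Hypothetical samples, for illustration only.
odd_sample = [7, 3, 9, 5, 1]
even_sample = [7, 3, 9, 5]

print(median(odd_sample))
print(median(even_sample))
```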
Example-1
• Consider the following data, which consists of white blood counts
taken on admission of all patients entering a small hospital on a
given day.
Characteristics
• It is an average of position.
• Uniqueness
Advantages
consideration.
4. Mode
• is an observation that occurs most frequently.
• Used for all data types, but most useful for ordinal and
nominal data
• The mode of a set of data or distribution can be:
No mode: All values appear equal number of times.
Unimodal: If the distribution has only one mode
Bimodal: If the distribution has two modes
Multi-modal: If the distribution has more than two modes.
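The four cases listed above (no mode, unimodal, bimodal, multi-modal) can be sketched in Python as follows. The function names and the example data are hypothetical and only illustrate the classification; "no mode" is taken here to mean that several distinct values all occur equally often.

```python
from collections import Counter


def find_modes(values):
    """Return the modal values; an empty list means 'no mode'."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    highest = max(counts.values())
    # Several distinct values that all occur equally often: no mode.
    if len(counts) > 1 and len(set(counts.values())) == 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


def classify(values):
    """Label a data set by its number of modes."""
    labels = {0: "no mode", 1: "unimodal", 2: "bimodal"}
    return labels.get(len(find_modes(values)), "multi-modal")


# Hypothetical data sets, for illustration only.
print(classify([2, 4, 6, 8]))
print(classify([2, 4, 4, 6]))
print(classify([1, 1, 3, 5, 5]))
print(classify([1, 1, 2, 2, 3, 3, 4]))
```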
Cont’d
• Find the modal values for the following data
Characteristics
• It is an average of position
• It is not affected by extreme values
• It is the most typical and actual value of the
distribution.
Advantages of mode
•It is easily identifiable
•It is usually an “actual value”, so it indicates the precise value of an important part of the
series.
•Disadvantages
– Symmetrical distribution
• A distribution with extreme values at the right
(asymmetric tail extending to the right) is referred to as
“positively skewed” or “skewed to the right.”
• A distribution with extreme values at the left (asymmetric
tail extending out to the left) is referred to as “negatively
skewed” or “skewed to the left.”
• Skewness motivates a researcher to investigate outliers.
Example
[Figure: distribution shapes; surviving labels: "= mode", "> mode" and "< mode".]
2.Measures of Dispersion
Cont’d
• A measure of dispersion conveys information regarding the amount
of variability present in a set of data.
Note:
Consider the following data sets:
Set 1: 60 40 30 50 60 40 70 50
Set 2: 50 49 49 51 48 50 53 50
• The two data sets given above have a mean of 50, but
obviously set 1 is more “spread out” than set 2. How
do we express this numerically?
• The object of measuring this scatter or dispersion is
to obtain a single summary figure which adequately
exhibits whether the distribution is compact or spread
out.
Cont’d
Some of the commonly used measures of dispersion
(variation) are:
1. Range
2. Variance (S²)
3. Standard deviation(SD) and
4. Coefficient of variation(CV).
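As a rough illustration of these four measures, the sketch below applies them to Set 1 and Set 2 from the earlier slide, which share a mean of 50. It uses Python's standard `statistics` module with the sample (n - 1) definitions; the helper name `dispersion` is chosen here for illustration.

```python
import statistics

# The two data sets from the notes; both have a mean of 50.
set_1 = [60, 40, 30, 50, 60, 40, 70, 50]
set_2 = [50, 49, 49, 51, 48, 50, 53, 50]


def dispersion(values):
    """Range, sample variance, sample SD and CV (%) of a data set."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # divides by n - 1
    return {
        "mean": mean,
        "range": max(values) - min(values),
        "variance": statistics.variance(values),
        "sd": sd,
        "cv_percent": 100 * sd / mean,
    }


d1 = dispersion(set_1)
d2 = dispersion(set_2)
print("Set 1:", d1)
print("Set 2:", d2)
```

Every measure comes out larger for Set 1, which expresses numerically that it is more spread out than Set 2.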
1. Range
Cont’d
• Range = Xmax - Xmin
• Where
• It is the sum of the squared deviations of the
measurements about their mean divided by (n - 1).
$S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}$
3. Standard deviation(SD)
• It is the square root of the variance.
• S or SD = $\sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$
• SD = $\sqrt{S^2}$
• Example: systolic blood pressure (SBP) of a sample
of 15 patients is given as follows (mmHg):
101,105,110,114,115,124,125,125,
130,133,135,136,137,140,145
•Find the variance and standard deviation of the above
distribution.
• The mean SBP of the sample is 125 mmHg.
Cont’d
• Variance (sample) = $s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}$
• = [(101-125)² + (105-125)² + … + (145-125)²] / (15-1)
• = 2502/14
• = 178.71 (mmHg)²
• Hence, the standard deviation = $\sqrt{178.71}$ = 13.37 mmHg.
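The SBP calculation above can be reproduced step by step in Python. This is a minimal sketch using the 15 readings from the example; it also compares the hand calculation with `statistics.variance`, which uses the same n - 1 divisor.

```python
import statistics

# Systolic blood pressure readings (mmHg) from the example above.
sbp = [101, 105, 110, 114, 115, 124, 125, 125,
       130, 133, 135, 136, 137, 140, 145]

mean_sbp = statistics.mean(sbp)

# Sum of squared deviations about the mean.
sum_sq = sum((x - mean_sbp) ** 2 for x in sbp)

# Sample variance divides by n - 1; the SD is its square root.
variance = sum_sq / (len(sbp) - 1)
sd = variance ** 0.5

print(f"mean = {mean_sbp}, sum of squares = {sum_sq}")
print(f"variance = {variance:.2f} (mmHg)^2, SD = {sd:.2f} mmHg")
```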
4. The Coefficient of Variation(CV)
Cont’d
• CV expresses the SD as a percentage of the mean.
• $CV = \frac{S}{\bar{X}} \times 100\%$
Cont’d
3. MEASURES OF RELATIVE STANDING
• They show the position of one observation relative to others
in a set of data.
1. Percentiles
2. Quartiles
Quiz
Q. Suppose that the following data set is showing the age of sample
of 10 patients in X hospital.
I. Find (1 point):
c. Interquartile range.
Wu, CMHS, Dep’t of PH
Session objectives
Demography
with respect to all the above features and the causes of such
variation and the effect of all these on health, social and economic
conditions.
• Population: a collection of persons at a specified
point in time who share some characteristics in common.
Sources of Demographic Data
Demographic data can be obtained from:
• Census
1. Census
In a census, data are collected on:
• Sex
• age
• marital status
• education
• place of birth,
• language, fertility, mortality, living conditions (e.g. house-
ownership, type of housing ), religion, etc..
Two techniques of conducting census:
A. DE JURE
• is the counting of people according to the permanent place of
location or residence. E.g., USA.
• excludes temporary residents and visitors, but includes
permanent residents who are temporarily away.
Advantages
• It yields information relatively unaffected by seasonal and other
temporary movements of people (i.e, it gives a picture of the
permanent population).
B. DE FACTO
Advantages
Disadvantages
iii) Registration of vital events
• It is a collection of data on birth, death, marriage and
divorce.
• Changes in population numbers are taking place every
day.
• Additions are made by births or through new arrivals
from outside the area.
• Reductions take place because of deaths, or through
people leaving the area.
4. Health Service Records
• All health institutions report their activities to the MOH
through regional health bureaus.
• The Ministry compiles, analyzes and publishes it.
Tools of demographic measurement
• Ratios
• Proportions
• Rates
Rate=K
Eg. Crude Death Rate: is the number of deaths per 1000 population in
a given year.
2. Specific rate –are rate computed for specific group of population.
• Specific rate includes rates like age specific, sex specific and
occupation specific rates, etc.
Eg. Child Mortality Rate: is the number of deaths of children 1-4 years
of age per 1000 children 1-4 years of age.
Rate, Ratios and Proportions
Learning Objectives
At the end of this session participants should be able to:
Define sample.
Identify the population to be studied
Identify probability and non-probability sampling
methods
Describe common methods of sampling.
What is Sampling?
Researchers often use sample survey
methodology to obtain information about a larger
population by selecting and measuring a sample
from that population.
Population
If the wrong questions are posed to the wrong
people, reliable information will not be obtained,
which leads to wrong conclusions when the results are applied to the
entire population.
Woreda: primary sampling unit (PSU)
Kebele: secondary sampling unit (SSU)
Sub-Kebele: tertiary sampling unit (TSU)
HH: household
• In the first stage, large groups or clusters are
identified and selected. These clusters contain
more population units than are needed for
the final sample.
Learning objectives
At the end of this session, the student will be able to:
• Understand the concepts of sample statistics and population parameters
estimations.
Cont’d
Sample Statistic Population parameter
• X̅ (sample mean) • μ (population mean)
2. Sampling Distribution of means
• Is the distribution of the means of many samples of equal
size n taken from the same population repeatedly.
• Since it is a frequency distribution, it has its own mean and
standard deviation.
• The frequency distributions for statistics are called sampling
distributions because, in repeated sampling, they provide this
information:
• What values of the statistic can occur.
• How often each value occurs.
The sampling distribution of means is found as follows:
4. The result is a series of means of samples of size n.
• If each mean in the series is now treated as an
individual observation and arrayed in a frequency
distribution, one determines the sampling
distribution of means of samples of size n.
• The scores (X̅s) in the sampling distribution of means
are themselves means.
Properties of sampling distribution of means
1. The mean of the sampling distribution of means is the
same as the population mean, μ.
2. The SD of the sampling distribution of means is σ /√n.
3. The shape of the sampling distribution of means is
approximately a normal curve, regardless of the shape of
the population distribution and provided n is large
enough(n>=30) (Central limit theorem).
Standard Error(SE)
• Standard deviation of the sampling distribution of means is
called the standard error.
• $SE = \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
• SE quantifies the variability among means of repeated
samples drawn from that population.
Central Limit Theorem
Cont’d
Possible samples            X̅i (sample mean)
(10, 20) or (20, 10)        15
(10, 30) or (30, 10)        20
(10, 40) or (40, 10)        25
(20, 30) or (30, 20)        25
(20, 40) or (40, 20)        30
(30, 40) or (40, 30)        35
(10, 10)                    10
(20, 20)                    20
(30, 30)                    30
(40, 40)                    40
Cont’d
a) frequency distribution of sample means
Sample mean (X̅i) Frequency (fi)
10 1
15 2
20 3
25 4
30 3
35 2
40 1
Cont’d
b) The mean of the sampling distribution of means
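$= \frac{\sum \bar{X}_i f_i}{\sum f_i} = \frac{10 \times 1 + 15 \times 2 + \dots + 40 \times 1}{16} = \frac{400}{16} = 25$
• The standard deviation of the sampling distribution of
means:
$\sigma_{\bar{X}} = \sqrt{\frac{\sum f_i(\bar{X}_i - \mu)^2}{\sum f_i}}$
• $= \sqrt{\frac{1(10 - 25)^2 + 2(15 - 25)^2 + \dots + 1(40 - 25)^2}{16}}$
• $= \sqrt{\frac{1000}{16}} = \sqrt{62.5} = 7.91$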
For the population given above (10,20,30 and 40)
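a) $\mu = \frac{\sum x_i}{N} = \frac{10 + 20 + 30 + 40}{4} = 25$
b) $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{225 + 25 + 25 + 225}{4} = 125$
• Hence, $\sigma = \sqrt{125} = 11.18$ and $\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{11.18}{\sqrt{2}} = 7.9$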
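The worked example can be verified by listing every possible sample directly. The sketch below assumes, as the table of possible samples implies, that samples of size n = 2 are drawn with replacement from the population (10, 20, 30, 40), giving 16 ordered samples.

```python
from itertools import product
from math import sqrt

population = [10, 20, 30, 40]
n = 2  # sample size, as in the worked example

# All ordered samples of size n drawn with replacement (4^2 = 16 samples).
samples = list(product(population, repeat=n))
sample_means = [sum(s) / n for s in samples]

# Mean and SD of the sampling distribution of means.
mean_of_means = sum(sample_means) / len(sample_means)
se_from_samples = sqrt(
    sum((m - mean_of_means) ** 2 for m in sample_means) / len(sample_means)
)

# Population parameters and the standard error predicted by sigma / sqrt(n).
mu = sum(population) / len(population)
sigma = sqrt(sum((x - mu) ** 2 for x in population) / len(population))
se_from_formula = sigma / sqrt(n)

print(f"mean of sample means = {mean_of_means}, population mean = {mu}")
print(f"SD of sample means = {se_from_samples:.3f}, "
      f"sigma/sqrt(n) = {se_from_formula:.3f}")
```

Both routes give the same standard error, which illustrates properties 1 and 2 of the sampling distribution of means.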
3. Estimation
Statistical estimation
• There are two ways to estimate population values from sample
values.
Interval Estimation………..
• It is a statement that a population parameter has
a value lying between two specified limits, with a
certain confidence level.
• A point estimate does not give any indication on how far
away the parameter lies.
• A more useful method of estimation is to compute an
interval which has a high probability of containing the
parameter.
Confidence interval
parameter.
interval.
Cont’d
• It is usually accepted that there is a 5% (α = 5%) chance that the
range will not include the true population value; this
range is called the 95% confidence interval.
• A 95% confidence interval means that, over repeated sampling,
intervals constructed in this way would contain the true parameter value
95% of the time.
A confidence interval is given as:
Point estimate ± Reliability coefficient x Standard error
2.1 . Confidence interval for a single population mean
• Confidence interval for population mean;
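CI for μ = $\bar{X} \pm Z_{1-\alpha/2} \times \frac{\sigma}{\sqrt{n}}$
$\bar{X}$ is the point estimate of the population mean.
$Z_{1-\alpha/2}$ is the value of Z to the left of which lies 1-α/2 of the area
and to the right of which lies α/2 of the area under the standard
normal curve.
Lower limit: $\bar{X} - Z_{1-\alpha/2} \times \frac{\sigma}{\sqrt{n}}$
Upper limit: $\bar{X} + Z_{1-\alpha/2} \times \frac{\sigma}{\sqrt{n}}$
For a 95% confidence interval the limits are $\bar{X} - 1.96\frac{\sigma}{\sqrt{n}}$ and $\bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$, where $\frac{\sigma}{\sqrt{n}}$ is the standard error of the sample mean.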
Cont’d
• Given points:
• Level of confidence = 95 %, α = 0.05 and α/2 = 0.025
and the value of Z at 1-α/2 is 1.96
• n=100 σ = 4 and x̅ = 14
• Inserting the given values in the formula: 14 ± 1.96 × 4/√100
• The result will be 14 ± 0.784 = (13.2, 14.8)
• Interpretation: A clinician is 95% confident that the
mean weight of all children is between 13.2 and 14.8 kg.
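The interval above can be reproduced with a short Python sketch. It assumes a known population standard deviation σ, as in the example, and obtains the reliability coefficient from the standard normal distribution via `statistics.NormalDist`; the function name is chosen here for illustration.

```python
from math import sqrt
from statistics import NormalDist


def mean_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """CI for a population mean when sigma is known:
    x_bar +/- Z(1 - alpha/2) * sigma / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin


# Values from the example: n = 100, sigma = 4, sample mean = 14.
lower, upper = mean_confidence_interval(x_bar=14, sigma=4, n=100)
print(f"95% CI: ({lower:.1f}, {upper:.1f})")
```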
2.2. Confidence interval for a single population proportion
• Notation: P (or π) = proportion of “successes” in a
population (parameter)
• Q = 1-P = proportion of “failures” in a population
• p = proportion of successes in a sample
• q = 1-p= proportion of “failures” in a sample
• σp= Standard deviation of the sampling distribution
of proportions or Standard error of proportions
• n = size of the sample
Cont’d
• The confidence interval for the population proportion (P) is
given by the formula:
$p \pm (Z_{1-\alpha/2} \times \sigma_p)$
Example
• An epidemiologist is worried about the ever increasing trend of
the peak transmission period and finds that 60 of them are positive
a) A 95% C.I for the population proportion
= 0.4 ± 1.96(0.04) = 0.4 ± 0.078 = (0.322, 0.478).
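The figures in this example can be checked with the sketch below. Two assumptions are made here: the sample size n = 150 is inferred from 60 positives and the sample proportion of 0.4 (it is not stated in the surviving text), and the standard error of the proportion is taken as √(pq/n), the usual large-sample formula.

```python
from math import sqrt

positives = 60
n = 150        # assumption: inferred from 60 positives and p = 0.4
z = 1.96       # reliability coefficient for 95% confidence

p = positives / n
q = 1 - p

# Standard error of the sample proportion (large-sample formula).
se = sqrt(p * q / n)

lower, upper = p - z * se, p + z * se
print(f"p = {p:.2f}, SE = {se:.2f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```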
Sample Size Estimation
• Is deciding on the number of people needed to be
studied in order to answer the study objectives.
To calculate sample size in cross-sectional studies:
• estimate how big the proportion might be (P)-from the
previous study.
• choose the margin of error you will allow in the estimate of
the proportion (say ± w)
• choose the level of confidence that the proportion in the
whole population is indeed between (p-w) and (p+w).
• We can never be 100% sure.
Cont’d
• The minimum sample size required, for a very large population
(N≥10,000) is:
$n = \frac{Z^2\, p(1-p)}{w^2}$
• Where:
• Z=Reliability Coefficient corresponding to confidence level.
• p=Population proportion from previous data
• q=1-p
• w= Margin of error to be tolerated
Example
• A survey is being planned to determine what proportion
of families in a certain area are poor. It is believed that
the proportion cannot be greater than 0.35. A 95
percent confidence interval is desired with w = 0.05.
What size of sample of families should be selected?
• Solution: If the finite population correction can be
ignored, we have
• $n = \frac{1.96^2 \times 0.35 \times 0.65}{0.05^2} = 349.59$, which is rounded up to 350 families.
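The sample size formula for a single proportion can be sketched as a small Python function and applied to the values of this example (p = 0.35, w = 0.05, Z = 1.96). The function name is illustrative, and the finite population correction is ignored, as in the solution above.

```python
from math import ceil


def sample_size_for_proportion(p, w, z=1.96):
    """Minimum sample size n = Z^2 * p * (1 - p) / w^2
    for a very large population (no finite population correction)."""
    return z ** 2 * p * (1 - p) / w ** 2


n_exact = sample_size_for_proportion(p=0.35, w=0.05)

# A sample size must be a whole number, so always round up.
n_required = ceil(n_exact)
print(f"n = {n_exact:.2f}, so at least {n_required} families are needed")
```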
Hypothesis Testing
Learning objectives
At the end of this session, the students will be able to:
• Understand the concepts of null and alternative
hypothesis
• Differentiate between type I and type II errors
• Explain the meaning and application of P – values
• A hypothesis is a statement made about one or more population parameters.
hospital is 5 days.
used.
Hypothesis testing
• It is determining whether or not statements(hypothesis) are
true based on the sample data.
• It is using sample statistics to test hypotheses about population
parameters.
• It is deciding to accept or reject the pre-set hypothesis, using
the sample statistics.
a. Null Hypothesis
b. Alternative Hypothesis
c. Test Statistic
e. Conclusion
Two types of hypothesis
1. Null hypothesis
• Is the main hypothesis which we wish to test.
• It is denoted by the symbol Ho.
2. Alternative hypothesis
• It is denoted by HA or H1.
Rules of thumb for stating the null and alternative
hypothesis
a. The null hypothesis should contain a statement of equality (=, ≥ or ≤).
b. What we hope or expect to be able to conclude as a result of the test
usually should be placed in the alternative hypothesis.
c. The null and alternative hypotheses are complementary.
Example: if we want to answer the question, can we conclude that a certain
population mean age is not 35?
• Suppose we want to know if we can conclude that the population mean age
is greater than 50. Our hypotheses are:
Z-test is used when:
1. sampling is from a normally distributed population and σ is
known.
Two types of Errors in hypothesis testing
It is not always possible to make a correct decision since we are
• α (level of significance) is arbitrarily chosen, equal to a small
number (usually 0.1, .01, .05, etc.)
Correct decision
Fail to reject Ho Type II Error
Define:
α = P(Type I error) = P(rejecting H0 when H0 is true)
β = P(Type II error) = P(failing to reject H0 when H0 is false)
P – Values
• The P-value (or probability value) is the probability of getting
a sample statistic (such as the mean) or a more extreme sample
statistic in the direction of the alternative hypothesis when the
null hypothesis is true.
• Shows the exact probability of getting a test statistic value if
Ho is true.
• It tells how common(>=0.05) or how rare(<0.05) is the
computed value of the test statistic given that H0 is true.
Steps in hypothesis testing
1. Formulate H0 and HA
4. Determine decision rule about when to reject the Ho hypothesis and when
Example-1
Cont’d
5. Calculate test statistic
• The Z score for the random sample of 64 persons of the
village aged 20 to 40 years:
Example-2
n = 10, $\bar{x}$ = 27, σ² = 20
• Z calc = $\frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{27 - 30}{1.41} = -2.12$
Decision: -2.12 < -1.96, so reject Ho
Conclusion: We can conclude that μ is not equal to 30
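The Z-test in Example-2 can be reproduced with the sketch below. It assumes a two-sided test of H0: μ = 30 at α = 0.05, which is what the critical value of 1.96 and the conclusion suggest; the P-value line is an addition made here to connect the result to the P-value slide.

```python
from math import sqrt
from statistics import NormalDist

# Values from Example-2; mu_0 = 30 is the hypothesized mean under H0.
n, x_bar, variance, mu_0 = 10, 27, 20, 30
z_critical = 1.96   # assumption: two-sided test at alpha = 0.05

# Standard error of the mean and the test statistic.
se = sqrt(variance / n)
z_calc = (x_bar - mu_0) / se

# Two-sided decision rule and the matching P-value.
reject_h0 = abs(z_calc) > z_critical
p_value = 2 * NormalDist().cdf(-abs(z_calc))

print(f"Z calc = {z_calc:.2f}, P-value = {p_value:.4f}")
print("Reject Ho" if reject_h0 else "Fail to reject Ho")
```

A P-value below 0.05 leads to the same decision as comparing Z calc with the critical value.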
2. Hypothesis Testing for a single Population Proportion
thank you