
College of Medicine & Health Sciences 

Biostatistics for Health Science Students

Tamirat (Mph,mph epid/bio)


Chapter One: Introduction to Biostatistics

Objectives

After completing this chapter, the student will be able to:

• Define Statistics and Biostatistics.

• Enumerate the importance and limitations of statistics.

• Define data

• Classify variables.

• Differentiate between descriptive and inferential statistics.
Definition of Statistics
• Statistics is a field of study concerned with the collection, organization,
summarization, analysis and interpretation of data, and with the drawing of
inferences about a body of data when only a part of the data is observed.

• Statistics is a group of methods used to collect, analyze, present and
interpret data and to make decisions.
Cont…
• Biostatistics is the segment of statistics that deals with data relating to
living organisms, medicine or health.

• It is the scientific method of collecting, organizing, analyzing and
interpreting biological or medical data.
Data

• Data are measurements or observations obtained from the different members
of a sample or a population for a certain variable or a set of variables.
• They are the quantities (numbers) or qualities (attributes) measured or
observed that are to be collected or analyzed.
• The raw materials of statistics are data.
• They are raw facts or figures resulting from the process of counting or from
taking a measurement.
• The word data is plural; datum is singular.
For example:

• When a pharmacist weighs the dose of a drug (measurement).

• When a hospital administrator counts the number of patients (counting).

• When a nurse weighs a patient (measurement).

• When a nurse measures BP, heart rate and respiratory rate (measurement).
Types of Statistics
1. Descriptive statistics: consists of procedures used to summarize and
describe the important characteristics of data.

• It merely describes, organizes or summarizes the actual data available.

• Examples:
  – Vital statistics (births, deaths, marriages and divorces)
  – Tables
  – Graphs
  – Measures of central tendency and variation
2. Inferential Statistics

• Statistical inference is the process of reaching a conclusion about a
population on the basis of the information contained in a sample.
• Inferential statistics: consists of procedures used to make
inferences about population characteristics from information in a
sample.
• It is the branch of modern statistics that is most relevant to public
health and clinical medicine.
• It builds upon descriptive statistics.
• Conclusions are drawn from the properties of a sample to the properties
of the population. E.g. estimation and hypothesis testing.
Five stages of statistical investigation
I. Collection of data: constitutes the first step in a statistical
investigation.

II. Organization of data: data collected from different sources are put
into an organized form.

III. Presentation of data: after the data have been collected and
organized, they are ready for presentation.

IV. Analysis: after collection, organization and presentation, the next
step is analysis.

V. Interpretation: drawing conclusions from the statistical results.


Uses of statistics/biostatistics

• It presents facts in a precise form.
• Data reduction.
• Measuring the magnitude of variation in data.
• Furnishes a technique of comparison.
• Estimating unknown population characteristics.
• Formulating and testing hypotheses.
• Studying the relationship between two or more variables.
• Planning, conducting and interpreting medical research.
Variable

• A variable is a characteristic that changes or varies over time
and/or across different individuals or objects.
• It is a characteristic that takes on different values for different
persons, places or things.
• It is a quality or quantity which varies from one member of a sample
or population to another.

For example:

• Sex of 4th year midwifery students
• Number of drugs issued/day, number of drugs expired/month
• Heart rate, the weights of preschool children, sex of the patient
• The heights of males, the ages of patients seen in a dental clinic.
Types of variable
A. Qualitative: a variable or characteristic which cannot be
measured in quantitative form but can only be identified by name
or category.
• Takes categories/names as its values.
• Measurements made on qualitative variables convey
information regarding an attribute.
E.g. type of drug issued, place of birth, ethnic group, stage of
breast cancer, degree of pain, type of medical diagnosis.
B. Quantitative variable

• Is one that can be measured and expressed numerically.
• Takes numbers as its values.
• Measurements made on quantitative variables convey
information regarding amount.
• E.g. drug dose, the weights of preschool children, the heights of adult
females.
Types of quantitative variables
A. Discrete variable

• Can assume only a finite or countable number of values between any two values.
• Is characterized by gaps or interruptions in the values that it can assume.
• These gaps or interruptions indicate the absence of values between particular
values that the variable can assume.
• The values of a discrete variable are usually whole numbers.

E.g.
• Number of drugs issued/day
• The number of daily admissions to a general hospital
• Episodes of diarrhea/day
• The number of decayed teeth per child in elementary school.
B. Continuous variable

• Does not possess the gaps or interruptions in the values it can assume.
• May take on any possible value between any two values.
• For any two values you pick, a third value can always be
found between them.
E.g. drug dose, height, weight and skull circumference.
Scales of Measurement

• A scale is the set of all possible values for a given variable.
• Based on their scales of measurement (the values assigned
to them), variables are classified as:
• Nominal
• Ordinal
• Interval
• Ratio
1. Nominal variable
• It consists of naming or classifying observations into various categories.
• The categories are unordered and have no magnitude.
• Numbers may be used to represent the categories.
• The numbers only help to decide whether the categories are the same or
different (comparisons are = or ≠).
• The descriptive summary measure is the proportion of subjects who possess
the attribute.
Examples:
• Drug category: antibiotics, analgesics, diuretics
• Sex: male, female. Religion: Christianity, Islam, Hinduism, etc.
• Marital status: single, married, divorced, widowed.
2. Ordinal variable
• Observations are not only different from category to category but can also be
ranked according to some criterion.
• Categories can be compared as to whether they are the same or not and put in
order.
• The members of one category are considered lower, worse or smaller than those
in another category (possible comparisons are: = or ≠, < or >).
• It cannot be assumed that the differences between adjacent categories are
equal.
• Example: patients may be characterized as:
1. unimproved 2. improved 3. much improved.
• Level of pain: 1. mild 2. moderate 3. severe
• Anemia status: 1. mild anemia 2. moderate anemia 3. severe anemia
3. Interval variable
• The distance between any two measurements is known.
• In interval data, the intervals between values are the same.
• For example, in the Fahrenheit temperature scale, the difference between 70 and 71
degrees is the same as the difference between 32 and 33 degrees.
• But the scale is not a ratio scale (40 degrees Fahrenheit is not twice as hot as 20
degrees Fahrenheit).
• There is no true zero value.
• Comparisons are: = or ≠, < or >, + or −.
• The interval scale, unlike the nominal and ordinal scales, is a truly quantitative scale.
• E.g. temperature
4. Ratio data
• Equality of ratios as well as equality of intervals may be determined.
• All operations are possible (= or ≠, < or >, + or −, × or ÷).
• The values in ratio data have meaningful ratios. For example, age is ratio
data: someone who is 40 is twice as old as someone who is 20.
• In addition to distance, the values allow calculation of ratios.
• E.g. age, height, weight
Methods of Data Collection

Objectives

After completing this chapter, the students will be able to:
• Identify the different methods of data collection
• Understand the advantages and disadvantages of methods of data collection
• Understand types of questions
• List the steps to design a questionnaire
Methods of Data Collection
Data collection methods allow us to systematically collect data
about our subjects of study. They are:

• Observation
• Interviews (face-to-face, telephone)
• Self-administered questionnaire
• Focus group discussion (FGD)
• Use of available information (secondary sources)

NB: Interviews and self-administered questionnaires are probably
the most commonly used research data collection techniques.
Factors affecting the choice of data collection method

• Characteristics of the study population, e.g. literacy
• Access to the population
− Location
− Time available for data collection
− Infrastructure available (telephones, mail service, internet access)
• Resources (money)
1. Observation
• Involves selecting, watching and recording the behaviors of
people or other aspects of the setting in which they occur.
• Includes all methods, from simple visual observation to the use
of high-level machines and measurements (X-ray machines,
microscopes, clinical examinations).
• Guidelines or a checklist should be prepared for the observations
prior to actual data collection.

Advantages: Gives relatively more accurate data on behavior and
activities.
2. Interview

A. Face-to-face interview

B. Telephone interview

A. Face-to-face interview

• Data are collected through oral conversation, meeting the subject in
person.
Face-to-face interview (cont’d)
Advantages:

• Good response rate
• Responses are complete and immediate
• The interviewer can give explanations when necessary
• Recording equipment can be used
• Tone of voice, facial expression and hesitation can be assessed.
B. Telephone interview
Advantages
• Quick
• Can cover reasonably large number of people
• Wide geographic coverage
• High response rate
• Help can be given to the respondent
• Can record answers

3. Questionnaire method
A. Self-administered questionnaire
• The respondent reads the questions and fills in the answers
themselves.
Advantages
• Can cover a large number of people or organizations
• No interviewer bias
• Simpler and cheaper (can be administered to many
persons simultaneously).
B. Mailed Questionnaire Method
• The investigator prepares a questionnaire and sends it by post or
email to the informants together with a polite covering letter.
• The main problems with postal questionnaires are:
– response rates tend to be relatively low,
– less literate subjects may be under-represented.
• Apart from their expense, interviews are preferable to self-administered
questionnaires, provided they are conducted by skilled interviewers.
4. Focus group discussion
• A widely used qualitative technique.
• It is a group discussion of 6-12 persons guided by a facilitator,
during which group members talk freely and spontaneously about a
certain topic.

• Moderator: leads the discussion.



5. Use of documentary sources

• Morbidity reports

• Mortality reports
• Epidemic reports
• Laboratory data
• Demographic data (census)
• Official publications of Central Statistical Authority

• International Publications like Publications by WHO.


Types of questions
A. Closed ended questions
• A question is asked and then a number of possible answers are
provided for the respondent.
• The respondent selects the answer which is appropriate.
• They are useful if the range of possible responses is known.
• Closed-ended questions are particularly useful in obtaining factual
information.
• Sex:    Male [   ] Female [   ]
• Did you watch television last night?     Yes [   ] No [   ]
B. Open ended questions

• Permit free responses that should be recorded in the respondent’s own words.
• The respondent is not given any possible answers to choose from.
• The responses require intensive summarization and possibly coding.
• Useful to obtain information on:
– facts with which the researcher is not very familiar,
– opinions, attitudes and suggestions of informants.
For example
1. “What is the importance of the traditional medicine ‘tikurmud’?”
2. “Can you describe exactly what the traditional birth attendant did
when your labor started?”
3. “What do you think are the reasons for a high drop-out rate of
village health committee members?”
4. “What is your opinion about the care provided by nursing
professionals?”
Steps in Designing a Questionnaire

Step 1: Content
• Take your objectives and variables as your starting point.

Step 2: Formulating
• Formulate one or more questions that will provide the information
needed for each variable.
• Check whether each question measures one thing at a time.
• Avoid leading questions.
– E.g. “The U.S. president believes that universal access to care
is an essential element of health care reform. Do you agree that
universal access should be mandated?”
Cont’d
Step 3: Sequencing of Questions

• The sequence of questions must be logical for the respondent.
• At the beginning of the interview, keep questions concerning
“background variables” (e.g., age, religion, education, marital
status, or occupation) to a minimum.
• Pose more sensitive questions as late as possible in the interview.
• Use simple language.
Step 4: Formatting the Questionnaire

• When you finalize your questionnaire, be sure that:
– Each questionnaire has a heading and space to insert the number and date.
– Sufficient space is provided for answers to open-ended questions.
Cont’d
Step 5: Translation
• If interviews will be conducted in one or more local
languages, the questionnaire has to be translated to
standardize the way questions will be asked.
• After having it translated, you should have it retranslated
into the original language.
• You can then compare the two versions for differences and
make a decision concerning the final phrasing of difficult
concepts.
Methods of Data Organization and Presentation

• The data collected in a survey are called raw data.
• In most cases, useful information is not immediately evident
from the mass of unsorted data.
• Raw data by themselves are of little use until they are processed
by the investigator/researcher.
Organizing and presenting the raw data:
• enhances the understanding of complex data.
• helps to easily determine what information they contain.
1. THE ORDERED ARRAY
• Is a listing of the values in ascending or descending order.
• It is the first step in organizing data.
• It makes it possible to quickly determine the smallest and largest
measurements, and other facts about the data.
• Used for quantitative data sets of small size (around 20 observations).

E.g. ages of the patients: 34, 37, 44, 30, 38, 35, 37, 38, 40, 44, 43

Ordered array = 30, 34, 35, 37, 37, 38, 38, 40, 43, 44, 44
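As a minimal sketch of the ordered array above, Python's built-in `sorted` can arrange the patient ages in ascending or descending order. The variable names are illustrative only.

```python
# Ages of the patients, copied from the example above.
ages = [34, 37, 44, 30, 38, 35, 37, 38, 40, 44, 43]

ordered_array = sorted(ages)                  # ascending order
descending_array = sorted(ages, reverse=True) # descending order

# Once the data are ordered, the extremes are the first and last entries.
smallest, largest = ordered_array[0], ordered_array[-1]

print("Ordered array:", ordered_array)
print("Smallest:", smallest, "Largest:", largest)
```

With the data ordered, the smallest and largest measurements can be read off directly, which is the main purpose of the ordered array.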

2. Frequency Distribution
• Is the arrangement of a data set using the actual values and their
corresponding frequencies of occurrence.
• It presents data in a compact form and gives a good overall
picture.
• Frequency: the number of times a particular value appears in the data.
• Data set: a collection of values for a certain variable.
• Relative frequency: the frequency of each value or category
divided by the total frequency.
There are three basic types of frequency distributions

1. Categorical frequency distribution


2. Ungrouped frequency distribution
3. Grouped frequency distribution

1. Categorical frequency Distribution
• Used for organizing and presenting qualitative data (nominal
or ordinal), e.g. marital status, sex, disease diagnosis.
E.g. categorical frequency distribution for sex of year-II nursing students.

Category of sex    Tally    Frequency (n)    Percentage (%)
1. Male            /////    5                62.5
2. Female          ///      3                37.5
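The table above can be reproduced with a short sketch using `collections.Counter`. The individual records are an assumption made for illustration: only the tallies (5 male, 3 female) come from the slide.

```python
from collections import Counter

# Hypothetical raw observations chosen to match the tallies in the table
# (5 male, 3 female); the individual records are assumed, not given.
sex = ["Male"] * 5 + ["Female"] * 3

frequency = Counter(sex)                 # count of each category
total = sum(frequency.values())
percentage = {cat: 100 * n / total for cat, n in frequency.items()}

print(f"{'Category':<10}{'n':>3}{'%':>8}")
for category, n in frequency.items():
    print(f"{category:<10}{n:>3}{percentage[category]:>8.1f}")
```

The percentage column is simply the relative frequency of each category multiplied by 100.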

2. Ungrouped frequency distribution

• Used for organizing and presenting quantitative variables.

• Presents the frequency of each value of the variable.

Example: Consider the following raw data showing the ages of
individuals and construct an ungrouped frequency distribution.

• 30, 25, 30, 29, 33, 40, 29, 18, 33, 30
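One possible way to build the ungrouped frequency distribution for the ages above is sketched below; it counts each distinct value and lists the values in ascending order. The names are illustrative.

```python
from collections import Counter

# Raw ages from the example above.
ages = [30, 25, 30, 29, 33, 40, 29, 18, 33, 30]

frequency = Counter(ages)
# One (value, frequency) row per distinct age, in ascending order.
table = sorted(frequency.items())

print(f"{'Age':>4}{'Frequency':>11}")
for age, f in table:
    print(f"{age:>4}{f:>11}")
print(f"{'Total':>4}{sum(frequency.values()):>10}")
```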

3. Grouped Frequency Distribution
• Used for organizing and presenting large sets of quantitative data.
• Does not tell the frequency of each value in the data set.
• Tells the frequency of each group.
• We need to divide the data into groups, i.e. non-overlapping class
intervals.
• Each value should be placed in one, and only one, of the intervals.
Example:
• Construct a grouped frequency distribution of the following
data on the amount of time (in hours) that 80 college
students devoted to leisure activities during a typical school
week:
23 24 18 14 20 24 24 26 23 21
16 15 19 20 22 14 13 20 19 27
29 22 38 28 34 32 23 19 21 31
16 28 19 18 12 27 15 21 25 16
30 17 22 29 29 18 25 20 16 11
17 12 15 24 25 21 22 17 18 15
21 20 23 18 17 15 16 26 23 22
11 16 18 20 23 19 17 15 20 10
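A grouped frequency distribution for these 80 values could be tallied as in the sketch below. The class width of 5 and the first lower limit of 10 are assumptions of this sketch (the slide does not state them here); the helper name `grouped_frequency` is hypothetical.

```python
# Leisure time (hours) of the 80 students, copied row by row from above.
hours = [
    23, 24, 18, 14, 20, 24, 24, 26, 23, 21,
    16, 15, 19, 20, 22, 14, 13, 20, 19, 27,
    29, 22, 38, 28, 34, 32, 23, 19, 21, 31,
    16, 28, 19, 18, 12, 27, 15, 21, 25, 16,
    30, 17, 22, 29, 29, 18, 25, 20, 16, 11,
    17, 12, 15, 24, 25, 21, 22, 17, 18, 15,
    21, 20, 23, 18, 17, 15, 16, 26, 23, 22,
    11, 16, 18, 20, 23, 19, 17, 15, 20, 10,
]

FIRST_LOWER_LIMIT = 10  # assumed: the smallest observation
CLASS_WIDTH = 5         # assumed class width


def grouped_frequency(values, first_lower, width):
    """Return {(lower limit, upper limit): frequency} for integer classes."""
    counts = {}
    lower = first_lower
    while lower <= max(values):          # create empty non-overlapping classes
        counts[(lower, lower + width - 1)] = 0
        lower += width
    for v in values:                     # place each value in exactly one class
        start = first_lower + ((v - first_lower) // width) * width
        counts[(start, start + width - 1)] += 1
    return counts


table = grouped_frequency(hours, FIRST_LOWER_LIMIT, CLASS_WIDTH)
for (lo, hi), f in table.items():
    print(f"{lo}-{hi}: {f}")
```

Because the classes do not overlap, every value falls in one and only one interval, and the class frequencies add up to the number of observations.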

Cumulative and Relative Frequencies

• Cumulative frequencies: when the frequencies of two or
more classes are added up, such total frequencies are
called cumulative frequencies.
• Relative frequencies express the frequency of each value
or class as a proportion or percentage of the total frequency.
• Relative cumulative frequencies: when the relative frequencies
of two or more classes are added up, such total
frequencies are called relative cumulative frequencies.
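The three definitions above can be illustrated with a small sketch. The class frequencies used here are an assumption: they are what the 80 leisure-time values give when tallied into classes 10-14, 15-19, ..., 35-39.

```python
from itertools import accumulate

# Assumed class frequencies for the 80 leisure-time values, tallied into
# classes of width 5 starting at 10 (the classes are an assumption).
classes = ["10-14", "15-19", "20-24", "25-29", "30-34", "35-39"]
freq = [8, 28, 27, 12, 4, 1]

total = sum(freq)
cum_freq = list(accumulate(freq))                    # running totals
rel_freq = [100 * f / total for f in freq]           # percent of total
rel_cum_freq = [100 * c / total for c in cum_freq]   # cumulative percent

print(f"{'Class':<8}{'f':>4}{'cum f':>7}{'rel %':>8}{'cum %':>8}")
for row in zip(classes, freq, cum_freq, rel_freq, rel_cum_freq):
    print(f"{row[0]:<8}{row[1]:>4}{row[2]:>7}{row[3]:>8.2f}{row[4]:>8.2f}")
```

The last cumulative frequency always equals the total number of observations, and the last relative cumulative frequency always equals 100%.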

3. Statistical Tables
• A statistical table is a systematic presentation of numerical data
in rows and columns.
• Rows are horizontal and columns are vertical arrangements.
• Can be used for qualitative and quantitative variables.
Types of Table

1. A simple frequency table is used when the individual
observations involve only a single variable.
Table 1: Overall immunization status of children in Adami Tullu
Woreda, Feb. 1995

Immunization status      Frequency    Percent
Not immunized                   75       35.7
Partially immunized             57       27.1
Fully immunized                 78       37.2
Total                          210      100

Source: Fikru T et al. EPI Coverage in Adami Tulu. Eth J Health Dev
1997;11(2): 109-113
2. Cross tabulation /Two-way table
• Is used to obtain the frequency distribution of one variable by the
subset of another variable.
Table 3: Frequency and percentage distribution of anemia among women of reproductive age in
Ethiopia by age, 2018: Data from 2016 EDHS (n=14,489).

                          Anemia level
Age (years)   No anemia n (%)   Mild n (%)      Moderate n (%)   Severe n (%)   Total (n)
15-24         4,406 (78.29)     928 (16.49)     250 (4.44)       44 (0.78)      5,628
25-34         3,682 (74.62)     885 (17.93)     313 (6.34)       55 (1.11)      4,935
35-49         2,976 (75.79)     771 (19.65)     167 (4.25)       12 (0.31)      3,926
Total         11,064 (76.37)    2,584 (17.8)    730 (5)          111 (0.8)      14,489

Source: Central Statistical Agency (CSA) [Ethiopia] and ICF. 2016. Ethiopia Demographic and Health Survey 2016.
Addis Ababa, Ethiopia, and Rockville, Maryland, USA: CSA and ICF, 2017.
3. Higher order table

• It is used to represent three or more variables in a single table.
4. Diagrammatic Presentation of Data

• The choice of a particular form of graph depends on
personal preference and/or the type of the variable.
• Bar charts and pie charts are commonly used for
qualitative or quantitative discrete variables.
• Histograms, frequency polygons and ogive curves are
used for quantitative continuous variables.
1. Bar Chart
• Bar charts are used to represent and compare the frequency distributions
of categorical and discrete variables.

• All the bars must have equal width and the distance between
bars must be equal.
Cont’d
Example of simple bar diagram
[Bar chart: number of children (vertical axis, 0 to 100) by immunization status
(not immunized, partially immunized, fully immunized)]

Fig. 1. Immunization status of children in the District of X, in Y year
2. Pie-chart
• It is a circle divided into sectors so that the areas of the
sectors are proportional to the frequencies.
• It is used for qualitative or quantitative discrete data.

[Pie chart with sectors FI, NI and PI; sector labels 27%, 37% and 36%]
Fig. Immunization status of children in X
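Since the sector areas are proportional to the frequencies, each sector's central angle is its share of 360 degrees. The sketch below assumes the frequencies from Table 1 (n = 210) are the data behind the pie chart; this link is an assumption.

```python
# Frequencies taken from Table 1 (immunization status, n = 210).
freq = {
    "Not immunized": 75,
    "Partially immunized": 57,
    "Fully immunized": 78,
}

total = sum(freq.values())
# Central angle of each sector, proportional to its frequency.
angles = {status: 360 * n / total for status, n in freq.items()}
percent = {status: 100 * n / total for status, n in freq.items()}

for status in freq:
    print(f"{status:<22}{percent[status]:>6.1f}%{angles[status]:>8.1f} degrees")
```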
3. Histograms
• A histogram is the graph of the frequency distribution of
continuous variables.
• It is constructed on the basis of the following principles:
a) The values of the variable under consideration are
represented on the horizontal axis.
• This axis should be labeled with the name of the variable and the
units of measurement.
b) The frequency or relative frequency is represented on the vertical
axis (the height of the bars).
Cont…
c) For each class in the distribution a vertical bar (rectangle) is
drawn with:
• its base on the horizontal axis, extending from one class
boundary to the other class boundary.
• There will never be any gap between the histogram
rectangles (bars).
• The bases of all rectangles are determined by the width of
the class intervals.
• True class limits (class boundaries) are used in a histogram.


• Example: Consider the data on time (in hours) that 80 college
students devoted to leisure activities during a typical school
week:
[Histogram: number of students (vertical axis) against amount of time spent
(horizontal axis)]
Fig 6: Histogram for amount of time college students devoted to leisure activities
in X college in Y year.
4. Frequency Polygon
• If we join the midpoints of the tops of the adjacent rectangles
of the histogram with line segments, a frequency polygon is
obtained.

• When the polygon is continued to the X-axis just outside the
range of the values, the total area under the polygon will be
equal to the total area under the histogram.

• Note that it is not essential to draw a histogram in order to
obtain a frequency polygon.
Cont’d

• It can be drawn without erecting the rectangles of a histogram as
follows:
1) The scale should be marked in the numerical values of the
midpoints of the intervals.
2) Erect ordinates on the midpoints of the intervals, the length or
altitude of an ordinate representing the frequency of the class.
3) Join the tops of the ordinates and extend the connecting lines to
the scale of sizes.
4) Use one additional midpoint with zero frequency at each end
to close the frequency polygon.
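The steps above can be sketched as a computation of the points to plot. The class limits and frequencies are assumptions: they correspond to tallying the 80 leisure-time values into classes of width 5 starting at 10.

```python
# Assumed classes and frequencies for the leisure-time data.
class_limits = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]
freq = [8, 28, 27, 12, 4, 1]

# Class width: distance between consecutive lower limits.
width = class_limits[1][0] - class_limits[0][0]

# Steps 1-2: one ordinate per class, erected at the class midpoint.
midpoints = [(lower + upper) / 2 for lower, upper in class_limits]

# Step 4: add a zero-frequency midpoint at each end to close the polygon.
points = (
    [(midpoints[0] - width, 0)]
    + list(zip(midpoints, freq))
    + [(midpoints[-1] + width, 0)]
)

# Step 3: joining consecutive points with line segments gives the polygon.
for x, y in points:
    print(f"midpoint {x:>5.1f}  frequency {y}")
```

Passing the `points` list to any plotting tool as a connected line would draw the polygon; no histogram is needed.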
Cont’d
• Example: Consider the above data on time spent on leisure
activities.
[Frequency polygon: number of students (vertical axis) against midpoints of
class intervals (horizontal axis: 7, 12, 17, 22, 27, 32, 37, 42)]

Fig 7: Frequency polygon of time spent on leisure activities by students
5. Ogive or cumulative frequency curve

• Used to graph the cumulative frequencies of a distribution.

To construct an ogive curve:

1. Compute the cumulative frequency of the distribution.

2. Prepare a graph with the cumulative frequency on the vertical
axis and the true upper class limits (class boundaries) on the
horizontal axis.

• The true lower limit of the next lower interval, which has a
cumulative frequency of 0, is included in the X-axis scale.
Cont’d
• Example: Consider the above data on time spent on leisure
activities.
[Cumulative frequency curve: cumulative frequency (vertical axis, 0 to 90) against
upper class boundary (horizontal axis: 4.5, 9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5)]

Fig 8: Cumulative frequency curve for amount of time college students devoted to leisure
activities in X college, in Y year.
6. The line diagram

• Is useful for the study of some variables according to the
passage of time.

• The time, in weeks, months or years, is marked along the
horizontal axis.

• The value of the quantity (frequency) that is being studied is
marked on the vertical axis.

• The line graph is suitable for depicting the trend of
a series over a long period.
Cont’d
• Example: Malaria parasite rates as obtained from malaria
seasonal blood survey results, Ethiopia (1967-79 E.C.)
[Line graph: rate (%) on the vertical axis (0.0 to 5.5) against year on the
horizontal axis (1967 to 1979)]

Fig 9: Malaria Parasite Prevalence Rates in Ethiopia, 1967 – 1979 E.C.
Summarizing Data

Session objectives
At the end of this session, the student will be able to:
• Identify the different methods of data summarization
• Compute appropriate summary values for a set of data
• Appreciate the properties and limitations of summary
measures.
Introduction

• Summarizing is the ability to represent or describe data by means of
just a few descriptive measures (values).

• In a large data set, it is easy to lose track of the overall picture by
looking at all the data at once.

• Before making inferences, the data must be summarized as briefly as
possible.

• Although frequency distributions serve useful purposes, there are
many situations that require other types of data summarization.
Notations

• Σ is read as sigma (the Greek capital letter for S) and means
"the sum of".
• Suppose n values of a variable are denoted as $x_1, x_2, x_3, \dots, x_n$;
then $\sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \dots + x_n$, where the
subscript i ranges from 1 up to n.
• Example: Let $x_1 = 2$, $x_2 = 5$, $x_3 = 1$, $x_4 = 4$, $x_5 = 10$, $x_6 = -5$,
$x_7 = 8$
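As an illustration of the notation, the sketch below evaluates a few sums for the seven example values. The choice of which sums to show (the plain sum, the sum of squares and the square of the sum) is an assumption; the slide gives only the values.

```python
# The seven example values x1 ... x7 from above.
x = [2, 5, 1, 4, 10, -5, 8]

n = len(x)
sum_x = sum(x)                             # sum of x_i for i = 1..n
sum_x_squared = sum(v ** 2 for v in x)     # sum of x_i squared
square_of_sum = sum_x ** 2                 # (sum of x_i) squared

print("n =", n)
print("sum of x_i =", sum_x)
print("sum of x_i^2 =", sum_x_squared)
print("(sum of x_i)^2 =", square_of_sum)
```

Note that the sum of the squares and the square of the sum are different quantities, a distinction that matters later when computing the variance.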

1. Measures of Central Tendency

• The tendency of statistical data to get concentrated at
certain values is called "central tendency".

• The various methods of determining the actual value at
which the data tend to concentrate are called measures of
central tendency or averages.

• They are measures which indicate where the middle/center of
the data is.
Importance
• To understand the data easily.
• To facilitate comparison.
• To make further statistical analysis possible.

The three most commonly used measures of central tendency are:

• Mean (computed average)
• Median (positional average)
• Mode (most frequent observation)
Characteristics for a good average
1. It should be rigidly defined.
2. It should be easy to understand and compute.
3. It should be based on all items in the data.
4. Its definition shall be in the form of a mathematical formula.
5. It should be capable of further algebraic treatment.
6. It should have sampling stability.
7. It should be capable of being used in further statistical
computations or processing.

1. The Arithmetic Mean(computed average)

• The arithmetic mean is the sum of all observations or raw
scores divided by the number of observations.
• The sample mean is usually denoted by $\bar{X}$.
• The population mean is denoted by $\mu$.
• It is written in statistical terms as:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad i = 1, 2, 3, \dots, n$$

$x_i$ = values of the observations
n = number of observations (sample size)
Formula for population mean

$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \quad i = 1, 2, 3, \dots, N$$

N = number of observations in the population (population size)
• The mean for data organized in an ungrouped frequency
distribution is given as:

$$\bar{X} = \frac{\sum f_i x_i}{\sum f_i}$$

where $f_i$ is the frequency of each observation $x_i$.
Example: The mean of a set of ten temperatures,
7, 9, 11, 4, 7, 13, 9, 6, 11, 13, is found by adding them together
and dividing by 10.
• $\bar{x} = 90/10 = 9$
Cont’d
• Example: Find the mean age of the following data.

Age (years)    10    15    20    25    30    Total
Frequency       3     6     5     4     2       20

• Suggested answer:

$$\bar{X} = \frac{\sum f_i x_i}{\sum f_i} = \frac{3(10) + 6(15) + 5(20) + 4(25) + 2(30)}{20} = \frac{380}{20} = 19 \text{ years}$$
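Both forms of the mean can be checked with a short sketch: the plain arithmetic mean for the ten temperatures and the frequency-weighted mean for the age table. The function names are illustrative only.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)


def frequency_mean(values, freqs):
    """Mean of an ungrouped frequency distribution: sum(f*x) / sum(f)."""
    return sum(f * x for x, f in zip(values, freqs)) / sum(freqs)


# Ten temperatures from the earlier example.
temperatures = [7, 9, 11, 4, 7, 13, 9, 6, 11, 13]

# Age table from the example above.
ages = [10, 15, 20, 25, 30]
frequencies = [3, 6, 5, 4, 2]

print("Mean temperature:", arithmetic_mean(temperatures))
print("Mean age:", frequency_mean(ages, frequencies), "years")
```

The frequency-weighted form gives the same result as listing every observation individually and taking the plain mean; it only saves writing out repeated values.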

Example
• Suppose the sample shown below consists of the birth
weights (in grams) of all live-born infants born at a
private hospital in a city during a 1-week period:

3265 3323 2581 2759 3260 3649 2841
3248 3245 3200 3609 3314 3484 3031
2838 3101 4146 2069 3541 2834

What is the arithmetic mean for the sample birth weights?
Solution

$$\bar{X} = \frac{1}{n}\sum x_i = \frac{1}{20}(3265 + 3260 + \dots + 2834) = \frac{63{,}338}{20} = 3166.9 \text{ g}$$
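The arithmetic in this solution can be verified with a brief sketch that sums the 20 birth weights listed in the example.

```python
# Birth weights (grams) of the 20 infants, copied from the example.
birth_weights = [
    3265, 3323, 2581, 2759, 3260, 3649, 2841,
    3248, 3245, 3200, 3609, 3314, 3484, 3031,
    2838, 3101, 4146, 2069, 3541, 2834,
]

n = len(birth_weights)
total = sum(birth_weights)
mean_weight = total / n

print(f"n = {n}, sum = {total}, mean = {mean_weight:.1f} g")
```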

Characteristics of mean

• Its value is determined by every item in the data set.

• It is greatly affected by extreme values.

• The sum of the deviations of the observations about
it is zero.

• It is used for discrete or continuous data.

• It should NOT be used for ordinal or nominal data.
Advantages

• It is based on all values given in the distribution.

• It is the most easily understood.

• It is most amenable to algebraic treatment.

• It always exists.
Disadvantages

• It may be greatly affected by extreme items.

• It cannot be calculated for data which are not quantifiable.

• It cannot be used for grouped data with open-ended classes.


2. Median

• Is the number separating the higher half of a sample or a population
from the lower half.

• It is an alternative measure of central location and perhaps second
in popularity to the arithmetic mean.

• The median is defined differently when n is even and when n is odd.

• For samples with an odd sample size, there is a unique central point.

• E.g. for a sample of size 9, the fifth largest point is the central point in the
sense that 4 points are smaller than it and 4 points are larger than it.
• The median is the $\left(\frac{n+1}{2}\right)^{\text{th}}$ observation if n is odd.
• For samples with an even size, there is no unique central
point and the middle 2 values must be averaged.
• The median is the average of the $\left(\frac{n}{2}\right)^{\text{th}}$ and
$\left(\frac{n}{2}+1\right)^{\text{th}}$ observations if n is even.
• The rationale for these definitions is to ensure an equal
number of sample points on both sides of the sample
median.
Example-1
• Consider the following data, which consist of white blood counts
taken on admission of all patients entering a small hospital on a
given day.

• Compute the median white blood count (×10³): 7, 35, 5, 9, 8, 3, 10, 12, 8

• Solution: First, put the data in ascending order as follows:
3, 5, 7, 8, 8, 9, 10, 12, 35.

• Since n = 9 is odd, the sample median is the ((9+1)/2)th = 5th
point in the ordered list, which is equal to 8.
Example-2
• Compute the sample median for the birth weight data (above
table).
• Find the median of the following ages: 25, 18, 27, 10, 8, 30,
42, 20, 53
• Find the median of the following data: 5, 8, 12, 30, 18, 10, 2,
22
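The odd/even rule for the median can be written as a small function and applied to the examples above. This is a sketch of the rule as stated in these slides; the function name and variable names are illustrative.

```python
def median(values):
    """Median by position: the ((n+1)/2)th ordered value if n is odd,
    otherwise the average of the (n/2)th and (n/2 + 1)th ordered values."""
    ordered = sorted(values)      # the data must first be put in order
    n = len(ordered)
    mid = n // 2                  # 0-based index of the upper middle value
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


white_blood_counts = [7, 35, 5, 9, 8, 3, 10, 12, 8]   # Example-1, n = 9
ages = [25, 18, 27, 10, 8, 30, 42, 20, 53]            # n = 9 (odd)
data = [5, 8, 12, 30, 18, 10, 2, 22]                  # n = 8 (even)
birth_weights = [
    3265, 3323, 2581, 2759, 3260, 3649, 2841,
    3248, 3245, 3200, 3609, 3314, 3484, 3031,
    2838, 3101, 4146, 2069, 3541, 2834,
]                                                     # n = 20 (even)

for name, sample in [("WBC", white_blood_counts), ("ages", ages),
                     ("data", data), ("birth weights", birth_weights)]:
    print(f"Median of {name}: {median(sample)}")
```

For the even-sized samples the function averages the two middle values, so the result need not be one of the observed values.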

Characteristics

• It is an average of position.
• It is affected more by the number of items than by extreme
values.
• It is insensitive to very large or very small values.
• It is determined mainly by the middle points in a sample
and is less sensitive to the actual numerical values of the
remaining data points.
• It is used for discrete, continuous or ordinal data.
• It is unique: a data set has only one median.
Advantages

• It is not much disturbed by extreme values.

• It is more typical of the series.

• It may be located even when the data are incomplete, e.g.
when the class intervals are irregular and the final
classes have open ends.

• It can be computed for ratio, interval and ordinal data.
Disadvantages

• It does not take each and every value into consideration.

• It requires the data to be arranged in order.

• It is not as generally familiar as the arithmetic mean.
3. Mode
• Is the observation that occurs most frequently.
• It can be used for all data types, but is most useful for ordinal and
nominal data.
• The mode of a set of data or distribution can be:
No mode: all values appear an equal number of times.
Unimodal: the distribution has only one mode.
Bimodal: the distribution has two modes.
Multi-modal: the distribution has more than two modes.
Cont’d
• Find the modal values for the following data:

a) 22, 66, 69, 70, 73 (no modal value)

b) 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5
(modal value = 3.0 kg).
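A minimal sketch for finding modal values is given below. It follows the convention stated in these slides that a data set in which all values appear an equal number of times has no mode; the function name `modes` is hypothetical and it expects a non-empty list.

```python
from collections import Counter


def modes(values):
    """Return the list of modal values of a non-empty data set.

    An empty list means "no mode": every distinct value occurs
    the same number of times.
    """
    counts = Counter(values)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


data_a = [22, 66, 69, 70, 73]
data_b = [1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]

print("a) modes:", modes(data_a))   # an empty list means no modal value
print("b) modes:", modes(data_b))
```

A result with one entry is unimodal, two entries bimodal, and more than two multi-modal.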

Characteristics

• It is an average of position
• It is not affected by extreme values
• It is the most typical and actual value of the
distribution.

Advantages of mode
• It is easily identifiable.
• It can be applied to qualitative data.
• It is not affected by extreme values.
• It is usually an "actual value": it indicates the precise value of an
important part of the series.

Disadvantages

• It may not be unique.
• It does not make use of every value in the data.
• It is not capable of further mathematical treatment.
• With a small number of items, the mode may not exist.
Skewness
• Skewness is the lack of symmetry (asymmetry) in the
distribution of data.

• It occurs when there are extremely low or extremely high
observations in the data set.

• In skewed data, the mean tends to shift towards the extreme scores.

• Based on the type of skewness, distributions can be:

– Negatively skewed distribution

– Positively skewed distribution

– Symmetrical distribution
• A distribution with extreme values at the right
(asymmetric tail extending to the right) is referred to as
“positively skewed” or “skewed to the right.”
• A distribution with extreme values at the left (asymmetric
tail extending to the left) is referred to as “negatively
skewed” or “skewed to the left.”
• Skewness motivates a researcher to investigate outliers.

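Example
[Figure: three distribution curves. Symmetrical: mean = median = mode. Positively skewed: mean > median > mode. Negatively skewed: mean < median < mode.]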
2. Measures of Dispersion

• Dispersion is the spread of the observations.
• Measures of dispersion tell us how spread out the values of a given variable are.
• There is difference or variation among the values.
• The degree of variation is evaluated by various measures of dispersion.
• They provide a single summary figure which tells whether the distribution is clustered around the center or spread out.
Cont’d
• A measure of dispersion conveys information regarding the amount
of variability present in a set of data.

Note:
1. If all the values are the same → there is no dispersion.
2. If the values differ → there is dispersion.
3. If the values are close to each other → the amount of dispersion is small.
4. If the values are widely scattered → the amount of dispersion is greater.
Look at the dispersion…!
• Hence, a measure of central tendency alone is not sufficient to evaluate a data set.
Consider the following data sets:

Set 1: 60 40 30 50 60 40 70 50
Set 2: 50 49 49 51 48 50 53 50
• The two data sets given above have a mean of 50, but
obviously set 1 is more “spread out” than set 2. How
do we express this numerically?
• The object of measuring this scatter or dispersion is
to obtain a single summary figure which adequately
exhibits whether the distribution is compact or spread
out.

Cont’d
Some of the commonly used measures of dispersion (variation) are:
1. Range
2. Variance (S²)
3. Standard deviation (SD) and
4. Coefficient of variation (CV).
1. Range

• The range is the difference between the largest and the smallest observation in the data.
• The range is a measure of absolute dispersion and as such cannot be usefully employed for comparing the variability of two distributions expressed in different units.
Cont’d
• Range = Xmax − Xmin
• Where
• Xmax = highest (maximum) value in the given distribution.
• Xmin = lowest (minimum) value in the given distribution.

Example: Age data: 43, 66, 61, 64, 65, 38, 59, 57, 57, 50.
• Find the range.
• Range = 66 − 38 = 28
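As a quick check of the range calculation, the sketch below applies the same formula to the age data and to the two data sets (Set 1 and Set 2) compared earlier. The function name `value_range` is only illustrative.

```python
def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

ages = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
set1 = [60, 40, 30, 50, 60, 40, 70, 50]
set2 = [50, 49, 49, 51, 48, 50, 53, 50]

print(value_range(ages))  # 28
print(value_range(set1))  # 40
print(value_range(set2))  # 5
```

Both sets have a mean of 50, but the ranges (40 versus 5) already show that Set 1 is far more spread out.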
2. Variance
• It measures dispersion of the values about their mean.
• It is an average of the squared deviations of individual values from the mean of the data set.
• The variance is a very useful measure of variability because it uses the information provided by every observation.
• Its main disadvantage is that the units of variance are the square of the units of the original observations.
• It is the sum of the squared deviations of the measurements about their mean divided by (n − 1):

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Note that the sum of the deviations of the individual observations of a sample about the sample mean is always 0.
3. Standard deviation(SD)
• It is the square root of the variance.
• It gives a measure of dispersion in the original units.
• The variance represents squared units and, therefore, is not an appropriate measure of dispersion when we wish to express the dispersion in terms of the original units.

• S or SD = $\sqrt{\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$

• SD = $\sqrt{S^2}$
• Example: the systolic blood pressure (SBP) of a sample of 15 patients is given as follows (mmHg):
101, 105, 110, 114, 115, 124, 125, 125, 130, 133, 135, 136, 137, 140, 145
• Find the variance and standard deviation of the above distribution.
• The mean SBP of the sample is 125 mmHg.
Cont’d
• Variance (sample) = $s^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
• = [(101 − 125)² + (105 − 125)² + … + (145 − 125)²] / (15 − 1)
• = 2502/14
• = 178.71 (mmHg)²
• Hence, the standard deviation = √178.71 = 13.37 mmHg.
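The worked SBP example can be reproduced in a few lines of Python. This is a minimal sketch: the helper `sample_variance` is a hypothetical name that implements the (n − 1) formula directly, and the standard library's `statistics.variance` is used only as a cross-check.

```python
from math import sqrt
from statistics import mean, variance

sbp = [101, 105, 110, 114, 115, 124, 125, 125,
       130, 133, 135, 136, 137, 140, 145]

def sample_variance(values):
    """Sum of squared deviations about the mean, divided by (n - 1)."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)

s2 = sample_variance(sbp)   # 2502 / 14
sd = sqrt(s2)

print(mean(sbp))            # 125
print(round(s2, 2))         # 178.71
print(round(sd, 2))         # 13.37
```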

4. The Coefficient of Variation(CV)

• The standard deviation cannot be used directly to compare variation between groups of data that are measured in different units or have very different means.
• A special measure called the coefficient of variation is often used for this purpose.
Cont’d
• CV expresses the SD as a percentage of the mean.
• CV = $\dfrac{S}{\bar{X}} \times 100\%$
• The coefficient of variation is most useful in comparing the variability of several different samples, each with different means.
• CV is a relative measure free from unit of measurement.
Cont’d

• Example: Compute the CV for the birth weight data with X̄ = 3166.9 g and S = 445.3 g.
• Solution: CV = (S / X̄) × 100% = (445.3 / 3166.9) × 100% = 14.1%
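The same calculation as a small Python sketch; the function name `coefficient_of_variation` is only illustrative.

```python
def coefficient_of_variation(sd, mean_value):
    """CV: the standard deviation expressed as a percentage of the mean."""
    return sd / mean_value * 100

cv = coefficient_of_variation(445.3, 3166.9)
print(round(cv, 1))  # 14.1
```

Because the grams cancel in S / X̄, the result is unit-free, which is what allows variability to be compared across variables measured on different scales.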

3. MEASURES OF RELATIVE STANDING
• They show the position of one observation relative to others
in a set of data.

1. Percentiles

2. Quartiles

Quiz
Q. Suppose that the following data set shows the ages of a sample of 10 patients in X hospital.

6, 4, 10, 14, 2, 7, 12, 15, 11, 13

(Sorted: 2, 4, 6, 7, 10, 11, 12, 13, 14, 15)

I. Find (1 point):
a) Mean b) Median c) Mode

II. Find (3 points):
a) Standard deviation b) Coefficient of variation c) Interquartile range
Wu, CMHS, Dep’t of PH

Demography And Health Service Statistics

Session objectives

Upon completion of this session, students will be able to:
• Define the concepts of demography
• Identify different sources of demographic data
• Describe the tools of demographic measurement
• Compute measures of fertility, mortality and health services.
Demography

Demography comes from two Greek words: demos = people and graphein = to write or describe.
• Demography is the study of human population with respect to size, distribution, composition and social mobility, the variation in these features, the causes of such variation, and its effect on health, social and economic conditions.
• Population: a collection of persons at a specified point in time who share some characteristics in common.
Sources of Demographic Data
Demographic data can be obtained from:
• Census

• Registration of vital events (Records)


• Sample surveys

• Health service records

1. Census

• A census is defined as a complete population count at a point in time within a specified geographical area.
• It is a nation-wide counting of the population.
• In Ethiopia, a census has been conducted roughly every ten years since 1984.
In a census, data are collected on:
• sex
• age
• marital status
• education
• place of birth
• language, fertility, mortality, living conditions (e.g. house ownership, type of housing), religion, etc.
Two techniques of conducting census:

A. DE JURE
• It is the counting of people according to their permanent (usual) place of residence, e.g. USA.
• It excludes temporary residents and visitors, but includes permanent residents who are temporarily away.

Advantages
• It yields information relatively unaffected by seasonal and other temporary movements of people (i.e., it gives a picture of the permanent population).
B. DE FACTO

• It refers to counting persons where they are actually present on the day of the census, regardless of their usual place of residence, e.g. Britain.

Advantages
• There is less chance of double counting and of omitting persons from the count.

Disadvantages
• Population figures may be inflated or deflated by tourists, travelling salesmen and other transients.
• In areas with high migration, the registration of vital events is liable to distortion.
2. Surveys

• A survey is a technique based on sampling methods by means of which we obtain specific information from a part of the population considered representative of the whole.
• Surveys are made at a given moment, in a specific territory, with or without periodicity, for the in-depth study of a problem.

E.g. Demographic and Health Survey (DHS).
3. Registration of vital events
• It is a collection of data on birth, death, marriage and
divorce.
• Changes in population numbers are taking place every
day.
• Additions are made by births or through new arrivals
from outside the area.
• Reductions take place because of deaths, or through
people leaving the area.
4. Health Service Records
• All health institutions report their activities to the MOH through regional health bureaus.
• The Ministry compiles, analyzes and publishes these reports.
• They are therefore the major source of health information in Ethiopia.
Tools of demographic measurement

• Ratios

• Proportions

• Rates

1. Ratios

• A ratio is a relation between any two demographic quantities.
• It is the expression of the magnitude of one event in relation to another.
• It is expressed in the form a : b.

Example: Sex ratio = (Number of males / Number of females) × 100


2. Proportion

• A proportion is a type of ratio which quantifies occurrences in relation to the population in which these occurrences take place.
• The numerator is always included in the denominator.
• A proportion is usually expressed as a percentage.

Examples:
• Proportion of males = (Number of males / Total population) × 100
• The proportion of TB cases among the inhabitants of a certain locality.
3. Rates
• A rate measures the occurrence of a particular event in a population during a given time period.
• A rate is a proportion with a time element, i.e., one in which occurrences are quantified over a period of time.
• It is defined per unit of time.

Rate = (Number of events in a specified period / Population at risk in the same period) × K

where K is a constant, usually a power of 10 (100, 1000, 10000, etc.).

• Population at risk: this could be the mid-year population (the population on July 1) or the population at the beginning of the year. The period for a rate is usually a year.
Cont…

Example: Death rate = (No. of deaths in one year / Mid-year population) × 1000

A rate comprises the following elements:
• Numerator
• Denominator
• Time specification and
• Multiplier or constant (100, 1000, 10000, 100,000 etc.)
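The four elements of a rate map directly onto a small function. The sketch below is illustrative only: the function name `rate` and the figures (1,200 deaths, mid-year population 100,000) are hypothetical and do not come from any real data set.

```python
def rate(events, population_at_risk, k=1000):
    """Events during the period per k persons at risk."""
    if population_at_risk <= 0:
        raise ValueError("population at risk must be positive")
    return events * k / population_at_risk

# Hypothetical figures for one calendar year (the time specification).
deaths_in_year = 1200          # numerator
mid_year_population = 100_000  # denominator

crude_death_rate = rate(deaths_in_year, mid_year_population, k=1000)
print(crude_death_rate)  # 12.0 deaths per 1000 population per year
```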
Cont…
Two major types of rates are:

1. Crude rates are those computed for the entire population of a given country or place.

E.g. Crude death rate: the number of deaths per 1000 population in a given year.

2. Specific rates are rates computed for a specific group of the population.
• Specific rates include age-specific, sex-specific and occupation-specific rates, etc.

E.g. Child mortality rate: the number of deaths of children 1-4 years of age per 1000 children 1-4 years of age.
Rate, Ratios and Proportions



Graphing age-sex composition
• Population Pyramid is the tool which graphically presents the
population of an area or country by age and sex at a point in
time.
• The pyramid consists of a series of bars, each drawn
proportionately to represent the relative contribution of each age-
sex group (often in five-year groupings) to the total population.
• By convention, males are shown on the left of the pyramid,
females on the right, young persons at the bottom, and the
elderly at the top.

AMU, CMHS, Department of PH/RH by Gemechu K.
Pyramid

Sampling Methods

Learning Objectives
At the end of this session participants should be able to:
• Define sample.
• Identify the population to be studied.
• Identify probability and non-probability sampling methods.
• Describe common methods of sampling.
What is Sampling?
• Researchers often use sample survey methodology to obtain information about a larger population by selecting and measuring a sample from that population.
• Sampling involves the selection of a number of study units from a defined population.
• Sampling is a process of choosing a section of the population for observation and study.
Why Sample?
• Cost in terms of money, time and manpower
• Accessibility
• Utility, e.g. to do a diagnostic laboratory test you don’t draw the whole of a patient’s blood.
A census is a sample consisting of the entire population.
• Inferences about the population are based on the information from the sample drawn from that population.
• However, due to the variability in the characteristics of the population, scientific sample designs should be applied to select a representative sample.
• Sampling enables us to estimate the characteristics of a population by directly observing a portion of the population.
[Figure: Sample → Information → Population]
• If the wrong questions are posed to the wrong people, reliable information will not be obtained, and applying the results to the entire population will lead to wrong conclusions.
• A main concern in sampling is to ensure that:
– the sample represents the population, and
– the findings can be generalized.
Advantages of sampling:
• Feasibility: Sampling may be the only feasible method of collecting information.
• Reduced cost: Sampling reduces demands on resources such as finance, personnel, and material.
• Greater accuracy: Sampling may lead to more accurate data collection.
• Greater speed: Data can be collected and summarized more quickly.
Disadvantages of sampling:
• There is always a sampling error.
• Sampling may create a feeling of
discrimination within the population.
• It may be inadvisable where every unit in the
population is legally required to have a record
Errors in sampling
• No sample is the exact mirror image of the population.
1) Sampling error: error that arises because only a part of the population is observed, so that sample results differ from the population values by chance.
– It cannot be avoided or totally eliminated.
2) Non-sampling error:
– Observational error
– Respondent error
– Lack of preciseness of definition
– Errors in editing and tabulation of data
• While selecting a sample, there are basic questions:
– What is the group of people (study population) from which we want to draw a sample?
– How many people do we need in our sample?
– How will these people be selected?
• Reference population (or target population): the population of interest to whom the researchers would like to make generalizations.
• Study population: the subset of the target population from which a sample will be drawn.
• Sample: the selected individuals on whom the study is conducted.
• Study unit: the units on which information will be collected: persons, housing units, etc.
A. Probability sampling
• Involves random selection of a sample.
• Every sampling unit has a known and non-zero probability of selection into the sample.
• Involves the selection of a sample from a population, based on chance.
• Probability sampling is:
– more complex,
– more time-consuming and
– usually more costly than non-probability
sampling.
• However, because study samples are
randomly selected and their probability of
inclusion can be calculated,
– reliable estimates can be produced and
– inferences can be made about the population.
Most common probability
sampling methods
1. Simple random sampling
2. Systematic random sampling
3. Stratified random sampling
4. Cluster sampling
5. Multi-stage sampling
1. Simple random sampling
• The required number of individuals are selected at
random from the sampling frame, a list or a
database of all individuals in the population
• Each member of a population has an equal chance
of being included in the sample.
• One has a list of serially numbered sampling units
(1 to N), and then a lottery method is used to select
n individuals out of N
• SRS has certain limitations:
– Requires a sampling frame.
– Difficult if the reference population is dispersed.
– Minority subgroups of interest may not be
selected.
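The lottery method described above can be sketched with Python's standard `random` module. The sampling frame of 500 serially numbered units, the sample size of 20 and the fixed seed are all hypothetical choices made only so the example is reproducible.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units from the sampling frame without replacement,
    giving every unit the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: units numbered 1 to N = 500.
frame = list(range(1, 501))
srs = simple_random_sample(frame, n=20, seed=1)
print(sorted(srs))
```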
2. Systematic random sampling

• Sometimes called interval sampling


• Selection of individuals from the sampling frame
systematically rather than randomly
• Individuals are taken at regular intervals down the list
• The starting point is chosen at random
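A minimal sketch of systematic sampling, assuming the frame size N is a multiple of the sample size n so that the sampling interval k = N / n is a whole number. The frame, the sample size and the seed are hypothetical.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Take every k-th unit down the list after a random start."""
    k = len(frame) // n                      # sampling interval
    if k < 1:
        raise ValueError("sample size exceeds frame size")
    start = random.Random(seed).randrange(k)  # random start in the first interval
    return frame[start::k][:n]

frame = list(range(1, 501))                  # hypothetical frame, N = 500
sys_sample = systematic_sample(frame, n=20, seed=1)
print(sys_sample)
```

Only the starting point is random; once it is fixed, the remaining units follow at the regular interval k = 25.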
3. Stratified random sampling

• It is used when the population is known to be heterogeneous with regard to some factors, and those factors are used for stratification.
• In stratified sampling, the population is divided into homogeneous, mutually exclusive groups called strata.
• A population can be stratified by any variable that is available for all units prior to sampling (e.g., age, sex, province of residence, income, etc.).
• A separate sample is taken independently from each stratum.
• Any of the sampling methods mentioned in this section (and others that exist) can be used to sample within each stratum.
• Stratified sampling ensures an adequate sample size
for sub-groups in the population of interest.
• When a population is stratified, each stratum
becomes an independent population and you will
need to decide the sample size for each stratum.
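One common way to decide the sample size for each stratum is proportional allocation, where each stratum contributes in proportion to its size. The slides do not prescribe an allocation rule, so this choice is an assumption, as are the two strata ("urban", "rural"), their sizes and the seed.

```python
import random

def stratified_sample(strata, n, seed=None):
    """Proportional allocation, then simple random sampling within
    each stratum. Rounding may make the total differ slightly from n."""
    rng = random.Random(seed)
    total = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        n_h = round(n * len(units) / total)   # stratum sample size
        sample[name] = rng.sample(units, n_h)
    return sample

# Hypothetical population of 1000 units in two strata.
strata = {
    "urban": [f"U{i}" for i in range(600)],
    "rural": [f"R{i}" for i in range(400)],
}
strat = stratified_sample(strata, n=50, seed=1)
print({name: len(units) for name, units in strat.items()})
# {'urban': 30, 'rural': 20}
```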
4. Cluster sampling
• Sometimes it is too expensive to carry out SRS
– Population may be large and scattered.
– Complete list of the study population unavailable
– Travel costs can become expensive if interviewers have to
survey people from one end of the country to the other.
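• Cluster sampling is the most widely used method for reducing the cost.
• Ideally the clusters are similar to one another, with each cluster internally heterogeneous (a small-scale picture of the population), unlike stratified sampling, where each stratum is internally homogeneous and the strata differ from one another.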
5. Multi-stage sampling
• Similar to the cluster sampling, except that it
involves picking a sample from within each
chosen cluster, rather than including all units
in the cluster.
• This type of sampling requires at least two
stages.
• The primary sampling unit (PSU) is the sampling unit in the first sampling stage.
• The secondary sampling unit (SSU) is the sampling unit in the second sampling stage, etc.
Multi-stage sampling …
[Figure: Woreda (PSU) → Kebele (SSU) → Sub-Kebele (TSU) → Household (HH)]
• In the first stage, large groups or clusters are
identified and selected. These clusters contain
more population units than are needed for
the final sample.

• In the second stage, population units are picked from within the selected clusters (using any of the possible probability sampling methods) for a final sample.
• If more than two stages are used, the process of choosing population units within clusters continues until there is a final sample.
• You do not need to have a list of all of the units in the population. All you need is a list of clusters and a list of the units in the selected clusters.
• Multi-stage sampling therefore saves a great amount of time and effort by not having to create a list of all the units in a population.
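A two-stage version of this procedure can be sketched as follows. The clusters (ten kebeles of 100 households each), the numbers selected at each stage and the seed are hypothetical, and simple random sampling is assumed at both stages.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: select clusters at random.
    Stage 2: select units at random within each chosen cluster."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {c: rng.sample(clusters[c], n_per_cluster) for c in chosen}

# Hypothetical frame: 10 kebeles, each listing 100 households.
clusters = {
    f"kebele_{k}": [f"hh_{k}_{i}" for i in range(100)]
    for k in range(10)
}
multi = two_stage_sample(clusters, n_clusters=3, n_per_cluster=10, seed=1)
print({kebele: len(households) for kebele, households in multi.items()})
```

Household lists are needed only for the three selected kebeles, not for all ten, which is the saving the last bullet describes.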
B. Non-probability sampling

• In non-probability sampling, every item has an unknown chance of being selected.
• In non-probability sampling, there is an assumption that there is an even distribution of a characteristic of interest within the population.
• This is what makes the researcher believe that any sample would be representative and that, because of that, results will be accurate.
• In probability sampling, by contrast, randomness is a feature of the selection process.
• In non-probability sampling, since elements are chosen arbitrarily, there is no way to estimate the probability of any one element being included in the sample.
• Non-probability methods are quick, inexpensive and convenient.

The most common types of non-probability
sampling

1. Convenience or haphazard sampling


2. Volunteer sampling
3. Judgment/ Purposive sampling
4. Quota sampling
5. Snowball sampling technique
1. Convenience or haphazard sampling

• Convenience sampling is sometimes referred to as haphazard or accidental sampling.
• It is not normally representative of the target
population because sample units are only
selected if they can be accessed easily and
conveniently.
• Although useful applications of the technique
are limited, it can deliver accurate results when
the population is homogeneous.
2. Volunteer sampling
• As the term implies, this type of sampling occurs when people
volunteer to be involved in the study.
• In psychological experiments or pharmaceutical trials (drug
testing), for example, it would be difficult and unethical to enlist
random participants from the general public.
• In these instances, the sample is taken from a group of
volunteers.
• Sometimes, the researcher offers payment to attract
respondents.
• In exchange, the volunteers accept the possibility of a lengthy,
demanding or sometimes unpleasant process.
3. Judgment sampling
• This approach is used when a sample is taken based on certain judgments about the overall population.
• The sample is selected on the basis of knowledge of the research problem, to allow selection of "typical" persons for inclusion in the sample.
• The critical issue here is objectivity: how much can judgment be relied upon to arrive at a typical sample?
4. Quota sampling
• This is one of the most common forms of non-probability
sampling.
• Sampling is done until a specific number of units
(quotas) for various sub-populations have been selected.
• Since there are no rules as to how these quotas are to be
filled, quota sampling is really a means for satisfying
sample size objectives for certain sub-populations.
• Some units may have no chance of selection or the
chance of selection may be unknown.
5. Snowball sampling
• A technique for selecting a research sample
where existing study subjects recruit future
subjects from among their acquaintances.
• Thus the sample group appears to grow like a
rolling snowball.
• This sampling technique is often used in hidden
populations which are difficult for researchers to
access; example populations would be drug
users or commercial sex workers.
Estimation

Learning objectives
At the end of this session, the student will be able to:
• Understand the concepts of sample statistics and population parameters
• Understand the principles of sampling distributions of means and calculate their standard errors.
• Understand estimation and differentiate between point and interval estimations.
• Compute appropriate confidence intervals for population means and proportions, and interpret the findings
• Describe methods of sample size calculation for cross-sectional studies
1. Definition of terms
• Statistical inference is the act of generalizing from a sample to a population with a calculated degree of certainty.
• Statistical inference: drawing conclusions about the whole population on the basis of a sample.
• Sampling is a precondition for statistical inference.
• The sample is randomly selected from the population (= probability sample).
• A parameter is a numerical descriptive measure of a population.
• The population mean (μ) is an example of a parameter.
• A statistic is a numerical descriptive measure of a sample.
• The sample mean (X̄) is an example of a statistic.
• To each sample statistic there is a corresponding population parameter.
Cont’d
Sample statistic                     Population parameter
• X̄ (sample mean)                   • μ (population mean)
• S² (sample variance)               • σ² (population variance)
• S (sample standard deviation)      • σ (population standard deviation)
• p (sample proportion)              • P or π (population proportion)
2. Sampling Distribution of means
• Is the distribution of the means of many samples of equal
size n taken from the same population repeatedly.
• Since it is a frequency distribution, it has its own mean and
standard deviation.
• The frequency distributions for statistics are called sampling
distributions because, in repeated sampling, they provide this
information:
• What values of the statistic can occur.
• How often each value occurs.

Sampling distribution of means are found as follows

1. Obtain a sample of n observations selected completely at random from a large population. Determine their mean and then replace the observations in the population.
2. Obtain another random sample of n observations from the population, determine their mean and again replace the observations.
3. Repeat the sampling procedure indefinitely, calculating the mean of the random sample of n each time and subsequently replacing the observations in the population.
4. The result is a series of means of samples of size n.
• If each mean in the series is now treated as an individual observation and arrayed in a frequency distribution, one determines the sampling distribution of means of samples of size n.
• The scores (X̄s) in the sampling distribution of means are themselves means.
Properties of sampling distribution of means
1. The mean of the sampling distribution of means is the
same as the population mean, μ.
2. The SD of the sampling distribution of means is σ /√n.
3. The shape of the sampling distribution of means is approximately a normal curve, regardless of the shape of the population distribution, provided n is large enough (n ≥ 30) (Central limit theorem).
Standard Error(SE)
• The standard deviation of the sampling distribution of means is called the standard error.
• SE ($\sigma_{\bar{X}}$) = $\dfrac{\sigma}{\sqrt{n}}$
• SE quantifies the variability among means of repeated samples drawn from that population.
Central Limit Theorem

• As long as the samples are large enough (n ≥ 30), the distribution of the sample means will be nearly normal whatever the distribution of the variable in the population.
• When the sampling is done from a non-normally distributed population, the central limit theorem is used.
• The larger the sample size, the better will be the normal approximation to the sampling distribution of the mean.
Example
• Suppose you have a population having four members with
values 10, 20, 30 and 40. If you take all conceivable
samples of size 2 with replacement:
a) What is the frequency distribution of the sample means ?
b) Find the mean and standard deviation of the distribution
(standard error of the mean).

Cont’d
• Possible samples X̅ i ( sample mean )
• (10, 20) or (20, 10) 15
• (10, 30 ) or (30, 10) 20
• (10, 40) or (40, 10) 25
• (20, 30) or (30, 20) 25
• (20, 40) or (40, 20) 30
• (30, 40) or (40, 30) 35
• (10, 10) 10
• (20, 20) 20
• (30, 30) 30
• (40, 40) 40
Cont’d
a) frequency distribution of sample means
sample mean ( Χi ) frequency (fi)
10 1
15 2
20 3
25 4
30 3
35 2
40 1
Cont’d
b) The mean of the sampling distribution of means
= Σ X̄i fi / Σ fi = (10×1 + 15×2 + … + 40×1) / 16 = 400 / 16 = 25
• The standard deviation of the sampling distribution of means:
σX̄ = √[Σ fi (X̄i − μ)² / Σ fi]
• = √{[1×(10 − 25)² + 2×(15 − 25)² + … + 1×(40 − 25)²] / 16}
• = √(1000 / 16) = √62.5 = 7.91
For the population given above (10,20,30 and 40)

a) Find the population mean. Show that the population mean (μ) = the mean of the sampling distribution of means.
b) Find the population standard deviation and show that the standard error of the mean is σX̄ = σ/√n.
a) μ = Σ xi / N = (10 + 20 + 30 + 40) / 4 = 25, which equals the mean of the sampling distribution of means.
b) σ² = Σ(xi − μ)² / N = (225 + 25 + 25 + 225) / 4 = 125
• Hence, σ = √125 = 11.18 and σX̄ = σ/√n = 11.18/√2 = 7.91
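The whole example can be verified by enumerating all 16 samples of size 2 drawn with replacement. This is a sketch using only the Python standard library; the variable names are illustrative.

```python
from collections import Counter
from itertools import product
from math import sqrt
from statistics import mean, pstdev

population = [10, 20, 30, 40]
n = 2

# All ordered samples of size n drawn with replacement (4^2 = 16 samples).
sample_means = [mean(s) for s in product(population, repeat=n)]

freq = Counter(sample_means)                       # frequency distribution
mean_of_means = mean(sample_means)                 # equals the population mean
se_from_means = pstdev(sample_means)               # SD of the 16 sample means
se_from_formula = pstdev(population) / sqrt(n)     # sigma / sqrt(n)

print(sorted(freq.items()))
print(mean_of_means)               # 25
print(round(se_from_means, 2))     # 7.91
print(round(se_from_formula, 2))   # 7.91
```

`pstdev` divides by N rather than N − 1, matching the population formulas used on the slides.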

3. Estimation

• It is estimating population parameters based on sample statistics.
• In short, it is the use of a sample statistic to estimate a population parameter.
• An estimator is the formula or the method used to obtain an estimate.
Statistical estimation
• There are two ways to estimate population values from sample values.

1. Point estimation is using a sample statistic to estimate a population parameter based on a single value.

E.g. if the mean weight of a random sample of newborns is 3.5 kg and we use it to estimate μ, the mean birth weight of all newborns in the population (μ = 3.5 kg), we are making a point estimate.

• Point estimation ignores sampling error.

22-06-18 Dr. Haftom


2. Interval estimation
• It is the act of finding two values that contain the population parameter with a certain confidence level.
• It uses a sample statistic to estimate a population parameter while making allowance for sampling error.
• It consists of two numerical values defining a range of values that most likely includes the parameter being estimated.
Interval Estimation………..
• It is a statement that a population parameter has a value lying between two specified limits with a certain confidence level.
• A point estimate does not give any indication of how far away the parameter lies.
• A more useful method of estimation is to compute an interval which has a high probability of containing the parameter.
Confidence interval

• A confidence interval is an interval (two values) around an estimate of a population parameter.
• It provides a range of values likely to include the “true” population value with a given probability.
• The level of confidence (1 − α) is the probability that the interval contains the population parameter.
• α is the probability that the interval does not contain the parameter.
• Level of confidence is denoted as (1-α )100%.
• Confidence level can never be 100%.
• Increasing the desired confidence level will widen
the confidence interval.

Cont’d
• It is usual to accept a 5% (α = 5%) chance that the range will not include the true population value; the resulting range is called a 95% confidence interval.
• A 95% confidence interval means that, if the sampling were repeated many times, about 95% of the intervals so constructed would contain the true parameter value.
A confidence interval is given as:
Point estimate ± Reliability coefficient × Standard error

Reliability coefficient (Z-score) is a value from the standard normal distribution (the standardized z value) corresponding to the given level of confidence.
• Z = 1.64 (more precisely, 1.645) if your confidence level is 90%.
• Z = 1.96 if your confidence level is 95%.
• Z = 2.58 if your confidence level is 99%.
Standard error: the standard deviation of the sampling distribution of the mean.
Cont’d
• The confidence interval has a lower and upper limit that
can be expressed in the form of:
lower limit = [Estimate – (Z x standard error)]
upper limit=[Estimate + (Z x standard error)]

N.B. The standard error is computed as σ/√n

2.1. Confidence interval for a single population mean
• Confidence interval for the population mean:

CI for μ = $\bar{X} \pm Z_{1-\alpha/2} \times \dfrac{\sigma}{\sqrt{n}}$

• $\bar{X}$ is the point estimate of the population mean.
• $Z_{1-\alpha/2}$ is the value of Z to the left of which lies 1 − α/2 of the area and to the right of which lies α/2 of the area under the standard normal curve.
• Lower limit: $\bar{X} - Z_{1-\alpha/2} \times \sigma/\sqrt{n}$
• Upper limit: $\bar{X} + Z_{1-\alpha/2} \times \sigma/\sqrt{n}$
E.g. 95% confidence interval
[Figure: normal curve centred on X̄ with limits X̄ − 1.96 σ/√n and X̄ + 1.96 σ/√n, where σ/√n is the standard error of the sample mean; there is a 95% chance of finding μ within this interval.]
• The 95% confidence interval gives an interval of values within which there is a 95% chance of finding the true population mean μ.


Example-1
• The mean weight of a sample of 100 children who are 5
years old in a certain locality is found to be 14 kg. A
clinician wants to know the mean weight of all the
children in that locality with 95% confidence interval, if
it is known that the SD for all children is 4kg.

Cont’d
• Given:
• Level of confidence = 95%, α = 0.05 and α/2 = 0.025, so the value of Z at 1 − α/2 is 1.96
• n = 100, σ = 4 and X̄ = 14
• Inserting the given values in the formula: 14 ± 1.96 × 4/√100 = 14 ± 1.96 × 0.4
• The result is 14 ± 0.784 = (13.2, 14.8)
• Interpretation: the clinician is 95% confident that the mean weight of all children is between 13.2 and 14.8 kg.
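The calculation can be sketched in Python for the case where the population SD is known. The function name `ci_mean` is hypothetical; the Z value is taken from the standard library's `statistics.NormalDist` instead of a table, so it is 1.95996… rather than the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist

def ci_mean(xbar, sigma, n, confidence=0.95):
    """Confidence interval for a population mean with known sigma."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # Z at 1 - alpha/2
    margin = z * sigma / sqrt(n)                         # Z x standard error
    return xbar - margin, xbar + margin

lower, upper = ci_mean(xbar=14, sigma=4, n=100, confidence=0.95)
print(round(lower, 1), round(upper, 1))  # 13.2 14.8
```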
2.2. Confidence interval for a single population proportion
• Notation: P (or π) = proportion of “successes” in a
population (parameter)
• Q = 1-P = proportion of “failures” in a population
• p = proportion of successes in a sample
• q = 1-p= proportion of “failures” in a sample
• σp= Standard deviation of the sampling distribution
of proportions or Standard error of proportions
• n = size of the sample

Cont’d
• The confidence interval for the population proportion (P) is given by the formula:
• CI for P = $p \pm Z_{1-\alpha/2} \times \sigma_p$ (that is, $p - Z_{1-\alpha/2} \times \sigma_p$ and $p + Z_{1-\alpha/2} \times \sigma_p$)
• p = sample proportion, $\sigma_p$ = standard error of the proportion.
• $Z_{1-\alpha/2}$ is the Z-value corresponding to the confidence level.
Example
• An epidemiologist is worried about the ever increasing trend of malaria in a certain locality and wants to estimate the proportion of persons infected in the peak malaria transmission period.
• If he takes a random sample of 150 persons in that locality during the peak transmission period and finds that 60 of them are positive for malaria, find a) 95%, b) 90% and c) 99% confidence intervals for the proportion of all infected people in that locality during the peak malaria transmission period.


Solution
• Sample proportion p = 60 / 150 = 0.4
• The standard error of the proportion depends on the population
proportion P.
• However, the population proportion (P) is unknown. In such
situations, the sample proportion p can be used as an approximation:
σp ≈ √(p(1-p)/n) = √(0.4 × 0.6 / 150) = 0.04.

a) A 95% C.I. for the population proportion
= 0.4 ± 1.96 (0.04) = 0.4 ± 0.078 = (0.322, 0.478).

Interpretation: we are 95 percent confident that the proportion
of all infected people is between 0.322 and 0.478.

b) A 90% C.I. for the population proportion
= 0.4 ± 1.64 (0.04) = 0.4 ± 0.066 = (0.334, 0.466).

c) A 99% C.I. for the population proportion
= 0.4 ± 2.58 (0.04) = 0.4 ± 0.103 = (0.297, 0.503).
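The three intervals asked for in the malaria example can be reproduced as follows. This is a sketch only: `proportion_ci` is a hypothetical helper name, and the Z values come from the standard library, so they are slightly more precise than the table values 1.64, 1.96 and 2.58.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence):
    """Normal-approximation confidence interval for a single proportion."""
    p = successes / n                       # sample proportion
    se = sqrt(p * (1 - p) / n)              # standard error, using p in place of P
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p - z * se, p + z * se

# 60 malaria-positive persons out of a random sample of 150
intervals = {}
for level in (0.95, 0.90, 0.99):
    intervals[level] = proportion_ci(60, 150, level)
    low, high = intervals[level]
    print(f"{level:.0%} CI: ({low:.3f}, {high:.3f})")
```

The higher the confidence level, the wider the interval, because a larger Z value multiplies the same standard error.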

Sample Size Estimation
• Is deciding on the number of people needed to be
studied in order to answer the study objectives.

To calculate sample size in cross-sectional studies:
• estimate how big the proportion might be (P), for example from a
previous study.
• choose the margin of error you will allow in the estimate of
the proportion (say ± w)
• choose the level of confidence that the proportion in the
whole population is indeed between (p-w) and (p+w).
• We can never be 100% sure.

Cont’d
• The minimum sample size required, for a very large population
(N ≥ 10,000), is:
n = Z² p(1-p) / w²
• Where:
• Z=Reliability Coefficient corresponding to confidence level.
• p=Population proportion from previous data
• q=1-p
• w= Margin of error to be tolerated

Example
• A survey is being planned to determine what proportion
of families in a certain area are poor. It is believed that
the proportion cannot be greater than 0.35. A 95
percent confidence interval is desired with w = 0.05.
What size of sample of families should be selected?
• Solution: If the finite population correction can be
ignored, we have
• n = (1.96)² × 0.35 × 0.65 / (0.05)² = 349.59, which is rounded up
to 350 families.
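The same sample size calculation can be written as a small function. This is a sketch under the assumptions stated on the slides (very large population, no finite population correction); the function name is illustrative, and Z is fixed at the table value 1.96 so that the result matches the hand calculation.

```python
from math import ceil

def sample_size_for_proportion(p, w, z=1.96):
    """Minimum n = Z^2 * p * (1 - p) / w^2 for estimating a proportion."""
    return z ** 2 * p * (1 - p) / w ** 2

n_raw = sample_size_for_proportion(p=0.35, w=0.05)
n_required = ceil(n_raw)   # always round up to the next whole family
print(f"n = {n_raw:.2f}, so at least {n_required} families are needed")
```

Rounding up rather than to the nearest integer keeps the margin of error at or below w.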
Hypothesis Testing

Learning objectives
At the end of this session, the students will be able to:
• Understand the concepts of null and alternative
hypothesis
• Differentiate between type I and type II errors
• Explain the meaning and application of P – values

• A hypothesis is a statement made about one or more population
parameters.
• A statistical hypothesis is an assumption or a statement, which may
or may not be true, concerning one or more populations.
Examples:
• The average length of stay of patients admitted to the
hospital is 5 days.
• The prevalence of malaria is 60% in the population of X town.
• A certain drug will be effective in 90% of the cases for which it is
used.
Hypothesis testing
• It is determining whether or not statements (hypotheses) are
true based on the sample data.
• It is using sample statistics to test hypotheses about population
parameters.
• It is deciding to reject or accept (fail to reject) the pre-set
hypothesis, using the sample statistics.
• It indicates whether the hypothesis is supported or is not
supported by the available sample data.
• The purpose of hypothesis testing is to aid the clinician,
researcher or administrator in reaching a conclusion
about a population by examining a sample from that
population.

Components in any hypothesis test:

a. Null Hypothesis
b. Alternative Hypothesis
c. Test Statistic
d. Decision rule
e. Conclusion
Two types of hypothesis
1. Null hypothesis
• Is the main hypothesis which we wish to test.
• It is denoted by the symbol Ho.

• Ho is always a statement about a parameter (mean, proportion, etc.
of a population).
• It is the hypothesis of no difference, since it is a statement of
agreement with conditions presumed to be true in the population of
interest.
• Ho is an equality hypothesis (μ = 14) rather than an inequality (μ <
14 or μ > 14).
2. Alternative hypothesis
• It is a statement that will be true, if Ho is rejected.

• It is a hypothesis that states “there is a difference”.

• It is denoted by HA or H1.

Rules of thumb for stating the null and alternative
hypothesis
a. The null hypothesis should contain a statement of equality (=, ≥ or ≤).
b. What we hope or expect to be able to conclude as a result of the test
usually should be placed in the alternative hypothesis.
c. The null and alternative hypotheses are complementary.
Example: if we want to answer the question, can we conclude that a certain
population mean age is not 35?

• Ho: µ = 35 and HA: µ ≠ 35.

• Suppose we want to know if we can conclude that the population mean age
is greater than 50. Our hypotheses are:

Ho: µ ≤ 50 and HA: µ > 50


3. Test statistic

• Is a statistic whose value serves to determine whether to reject or
not to reject Ho.
• It is some statistic that is computed from the data of the sample.
• The test statistic serves as a decision maker, since the decision to
reject or not to reject the Ho depends on the magnitude of the
test statistic.

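General Formula for Test Statistic:
Test statistic = (relevant statistic - hypothesized parameter) /
(standard error of the relevant statistic)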

Z-test is used when:
1. sampling is from a normally distributed population and σ is
known.
2. sampling is from a normally distributed population with unknown σ
and sample size is large (n ≥ 30).
3. sampling is from a non-normally distributed population and
sample size is large (n ≥ 30).
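• Z = (x̅ - µ0) / (σ/√n) (for population mean).

• µ0 is a hypothesized value of a population mean.

• σ/√n is the standard error of x̅ (when σ is unknown and n is large,
the sample standard deviation s is used in place of σ).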


t-test is used when:
• sampling is from a normally distributed population with unknown σ
and sample size is small (n < 30).

4. Decision rule: when (HA: µ ≠ µ0)

If the calculated Z value is positive, the rule says:

• reject HO if Z calculated (Z calc) > Z tabulated (Z tab)

• or accept (fail to reject) HO if Z calculated < Z tabulated.

On the other hand, if the calculated Z value is negative:

• reject HO if Z calculated (Z calc) < Z tabulated (Z tab).

• (Here, both Z calculated and Z tabulated are negative values.)
5. Conclusion
• A random sample of size n is taken and the information from
the sample is used to reject or accept (fail to reject) the null
hypothesis.
• If H0 is rejected, we conclude that HA is true.
• If H0 is not rejected, we conclude that H0 may be true (the sample
does not provide sufficient evidence against it).

Two types of Errors in hypothesis testing
It is not always possible to make a correct decision since we are

dealing with random samples.


1. Type I error is made when Ho is true but rejected.

2. Type II error is made when Ho is false but we fail to reject it .

• α (level of significance) is the probability of a type I error.

• β is the probability of a type II error.

• α (level of significance) is arbitrarily chosen, equal to a small
number (usually 0.10, 0.05 or 0.01).
– when α = 0.10, there is a 10% chance of rejecting a true null
hypothesis.
– when α = 0.05, there is a 5% chance of rejecting a true null
hypothesis.
– when α = 0.01, there is a 1% chance of rejecting a true null
hypothesis.
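The meaning of α can be illustrated with a small simulation: if Ho is really true and we test it many times at α = 0.05, about 5% of the tests should reject it anyway. The sketch below is only an illustration. The population values (mean 130, SD 10), the sample size of 25, the number of trials and the random seed are all assumptions chosen for the demonstration, not values from the slides.

```python
import random
from math import sqrt
from statistics import NormalDist

def type_one_error_rate(alpha=0.05, mu0=130.0, sigma=10.0, n=25,
                        trials=10_000, seed=0):
    """Share of two-sided Z-tests that reject Ho when Ho is actually true."""
    rng = random.Random(seed)                      # fixed seed for repeatability
    z_tab = NormalDist().inv_cdf(1 - alpha / 2)    # critical value, e.g. 1.96
    se = sigma / sqrt(n)                           # standard error of the mean
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population whose true mean equals mu0
        sample = [rng.gauss(mu0, sigma) for _ in range(n)]
        z_calc = (sum(sample) / n - mu0) / se
        if abs(z_calc) > z_tab:
            rejections += 1                        # a type I error
    return rejections / trials

rate = type_one_error_rate()
print(f"Share of true null hypotheses rejected: {rate:.3f}")
```

The printed share should be close to the chosen α; it will not equal it exactly because the simulation itself is based on random samples.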
• Whenever we reject a null hypothesis, there is always the
concomitant risk of committing a type I error (rejecting a true
null hypothesis).
• Whenever we fail to reject a null hypothesis, there is always
the concomitant risk of committing a type II error (failing
to reject a false null hypothesis).
Summary

Type of decision      Ho True             Ho False
Reject Ho             Type I Error        Correct decision
Fail to reject Ho     Correct decision    Type II Error

Define:
α = P(Type I error) = P(rejecting H0 when H0 is true)
β = P(Type II error) = P(failing to reject H0 when H0 is false)
P – Values
• The P-value (or probability value) is the probability of getting
a sample statistic (such as the mean) or a more extreme sample
statistic in the direction of the alternative hypothesis when the
null hypothesis is true.
• Shows the exact probability of getting a test statistic value at
least as extreme as the one observed, if Ho is true.
• It tells how common (P ≥ 0.05) or how rare (P < 0.05) the
computed value of the test statistic is, given that H0 is true.
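For a two-sided Z-test, the P-value can be computed directly from the standard normal distribution. The sketch below assumes a two-sided alternative (HA: µ ≠ µ0); the function name and the Z values in the loop are illustrative only and do not come from the slides.

```python
from statistics import NormalDist

def two_sided_p_value(z_calc):
    """P(|Z| >= |z_calc|) under Ho, for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z_calc)))

# Illustrative Z values: the further from 0, the rarer under Ho
for z in (1.0, 2.0, 3.0):
    p = two_sided_p_value(z)
    verdict = "rare (P < 0.05)" if p < 0.05 else "common (P >= 0.05)"
    print(f"Z = {z:.1f}: P = {p:.4f}, {verdict}")
```

A Z value of exactly 1.96 gives a P-value of about 0.05, which is why 1.96 is the critical value for a two-sided test at α = 0.05.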
Steps in hypothesis testing
1. Formulate H0 and HA
2. Choose a level of significance (α)
3. Select the test statistic (Z or t)
4. Determine the decision rule about when to reject Ho and when
to fail to reject it.
5. Choose a random sample from the population and compute the
appropriate statistic: that is, mean, proportion and so on.
6. Calculate the test statistic and compare it to the critical value
corresponding to the chosen α.
7. Make a decision


1. Hypothesis testing about a single population mean, µ

• One begins with a statement that claims a particular value
for the unknown population mean.

• The statistical inference consists of one of two conclusions:

I) Reject the claim about the population mean (Ho) because
there is sufficient evidence to do so.

II) Do not reject the claim about the population mean,
because there is insufficient evidence to do so.

Example-1

• Assume that in a certain district the mean systolic blood
pressure of persons aged 20 to 40 is 130 mm Hg with a
standard deviation of 10 mm Hg. A random sample of 64
persons aged 20 to 40 from village X of the same district
has a mean systolic blood pressure of 132 mm Hg. Does
the mean systolic blood pressure of the dwellers of the
village (aged 20 to 40) differ from 130 mm Hg, at a 5% level of
significance?
Solution
1. Ho : μ = 130
HA: μ ≠ 130
2. α = .05 ( that is, the probability of rejecting Ho when it
is true).
3. Z-test is appropriate (because n = 64 and σ is known)
4. Decision rule: reject HO if Z calculated < -1.96 or Z
calculated >1.96.

Cont’d
5. Calculate test statistic
• The Z score for the random sample of 64 persons of the
village aged 20 to 40 years:

• Z calc = (132-130) / (10/√64) = 2 / 1.25 = 1.6

• This score falls inside the “fail to reject” region, since
-1.96 < Z calc < 1.96.
Cont’d

6. Decision: Hence, we fail to reject the null hypothesis of the
above example.
7. Conclusion: at the 5% level of significance, the mean systolic
blood pressure of persons (aged 20 to 40) living in village X is not
significantly different from 130 mm Hg.

Example-2

• Researchers are interested in the mean age of a certain
population. A random sample of 10 individuals drawn
from the population of interest has a mean of 27.
Assuming that the population is approximately normally
distributed with variance 20, can we conclude that the
population mean is different from 30 years? (α=0.05).
Solution
1. H0: μ = 30
   HA: μ ≠ 30
2. α = 0.05, so the critical values are -1.96 and 1.96
3. Test statistic is Z (the population is normally distributed and σ
is known)
4. Decision rule:
reject H0 if Z calc < -1.96 or Z calc > 1.96
n = 10, x̅ = 27, σ² = 20, so σ/√n = √(20/10) = 1.41
• Z calc = (x̅ - μ0) / (σ/√n) = (27 - 30) / 1.41 = -2.12
Decision: -2.12 < -1.96, so reject Ho
Conclusion: We can conclude that μ is not equal to 30
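Both worked examples (systolic blood pressure and mean age) follow the same steps, so they can be checked with one small function. This is a sketch for the two-sided case with σ known; the name `z_test_mean` and its return format are illustrative choices, not part of the lecture.

```python
from math import sqrt
from statistics import NormalDist

def z_test_mean(xbar, mu0, sigma, n, alpha=0.05):
    """Two-sided Z-test for a single population mean with known sigma.

    Returns the calculated Z and True if Ho (mu = mu0) is rejected.
    """
    z_calc = (xbar - mu0) / (sigma / sqrt(n))     # test statistic
    z_tab = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    return z_calc, abs(z_calc) > z_tab

# Example-1: mu0 = 130, sigma = 10, n = 64, sample mean 132
z1, reject1 = z_test_mean(xbar=132, mu0=130, sigma=10, n=64)
# Example-2: mu0 = 30, variance 20 (so sigma = sqrt(20)), n = 10, sample mean 27
z2, reject2 = z_test_mean(xbar=27, mu0=30, sigma=sqrt(20), n=10)

print(f"Example-1: Z calc = {z1:.2f}, reject Ho: {reject1}")
print(f"Example-2: Z calc = {z2:.2f}, reject Ho: {reject2}")
```

Comparing |Z calc| with the positive critical value is equivalent to the two separate rules for positive and negative Z given in the decision rule.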
2. Hypothesis Testing for a single Population Proportion

• Hypothesis testing for a population proportion follows steps
similar to those for testing a population mean; the difference is in
how we calculate the test statistic Z:
Z = (p̂ - p0) / √(p0 q0 / n)
• p̂ = sample proportion
• p0 = hypothesized population proportion
• q0 = 1 - p0
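The proportion test statistic can be sketched in the same way. Note that the data here are hypothetical: the hypothesized value p0 = 0.60 echoes the earlier statement that malaria prevalence is 60% in X town, but the sample of 150 persons with 75 positives is invented purely to show the arithmetic.

```python
from math import sqrt
from statistics import NormalDist

def z_test_proportion(successes, n, p0, alpha=0.05):
    """Two-sided Z-test for a single population proportion."""
    p_hat = successes / n                  # sample proportion
    se0 = sqrt(p0 * (1 - p0) / n)          # standard error under Ho, uses p0 and q0
    z_calc = (p_hat - p0) / se0
    z_tab = NormalDist().inv_cdf(1 - alpha / 2)
    return z_calc, abs(z_calc) > z_tab

# Hypothetical data: Ho: P = 0.60, sample of 150 with 75 positives
z_prop, reject_prop = z_test_proportion(successes=75, n=150, p0=0.60)
print(f"Z calc = {z_prop:.2f}, reject Ho: {reject_prop}")
```

Unlike the confidence interval, the standard error here uses the hypothesized p0 rather than the sample proportion, because the test statistic is computed assuming Ho is true.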
Reading assignment
• χ²-test and t-test

thank you
