Intro to Biostats

Chapter 1
Introduction to Biostatistics
• Name: Jiregna Indalu
• E-mail: jiregnaindalu@gmail.com
• Mob.: 0926-987846
Introduction
• What is statistics?
• Statistics: A field of study concerned with:
– collection, organization, analysis, summarization
and interpretation of numerical data, and
– the drawing of inferences about a body of data
when only a small part of the data is observed.
• Statistics helps us use numbers to
communicate ideas.
• Statisticians try to interpret and communicate
the results to others.
Cont.
· Biostatistics: The application of statistical
methods to the fields of biological and medical
sciences.
· Concerned with interpretation of biological data
& the communication of information derived
from these data.
· Has a central role in medical investigations.
Uses of biostatistics
• Provide methods of organizing information
• Assessment of health status
• Health program evaluation
• Resource allocation
• Magnitude of association
– Strong vs weak association between exposure
and outcome
Cont.
• Assessing risk factors
– Cause & effect relationship
• Evaluation of a new vaccine or drug
– What can be concluded if the proportion of people free
from the disease is greater among the vaccinated than
the unvaccinated?
– How effective is the vaccine (drug)?
– Is the effect due to chance or some bias?
• Drawing of inferences
– Information from sample to population
What does biostatistics cover?
The best way to learn about biostatistics is to follow the flow of a research project from inception to the final publication:
• Research Planning
• Design
• Execution (Data collection)
• Data Processing
• Data Analysis
• Presentation
• Interpretation
• Publication
Variable
• A characteristic that takes on different values in different persons, places, or things.
• For example:
– heart rate,
– the heights of adult males,
– the weights of preschool children,
– the ages of patients seen in a dental clinic.
Types of variables
Quantitative variables
• Can be measured in the usual sense.
• For example:
– the heights of adult males,
– the weights of preschool children,
– the ages of patients seen in a dental clinic.
Qualitative variables
• Many characteristics cannot be measured numerically. Some of them can be ordered or ranked.
• For example:
– classification of people into socio-economic groups,
– social classes based on income, education, etc.
Types of variables & scale of measurement
• Quantitative (numerical) variables: interval or ratio scale
• Qualitative (categorical) variables: nominal or ordinal scale
Types of Statistics
1. Descriptive statistics:
• Ways of organizing and summarizing data.
• Help to identify the general features and trends in a set of data and to extract useful information.
• Also very important in conveying the final results of a study.
• Example: tables, graphs, numerical summary measures
Cont.
2. Inferential statistics:
• Methods used for drawing conclusions about a population based on the information obtained from a sample of observations drawn from that population.
• Example: principles of probability, estimation, confidence intervals, comparison of two or more means or proportions, hypothesis testing, etc.
Data
• Data are numbers which can be obtained by
measurement or counting
• The raw material for statistics
• Can be obtained from:

– Routinely kept records, literature
– Surveys
– Counting
– Experiments
– Reports
– Observation
– Etc
Types of Data
1. Primary data: collected from the items or
individual respondents directly by the
researcher for the purpose of a study.
2. Secondary data: data that were previously collected, and possibly statistically treated, by other people or organizations, and that are now used by someone else for a different purpose.
Population and Sample
• Population:
– Refers to any collection of objects.
• Target population:
– A collection of items that have something in common
for which we wish to draw conclusions at a particular
time.
• E.g., All hospitals in Ethiopia.
– The whole group of interest.
Cont.
Study Population:
• The subset of the target population that has at least some chance of
being sampled.
• The specific population group from which samples are drawn and
data are collected.
Sample:
• A subset of a study population, about which information is actually
obtained.
• The individuals who are actually measured and comprise the actual
data.
Cont.
E.g.: In a study of the prevalence of HIV among adolescents in Ethiopia, a random sample of adolescents in Ayder sub-city of Mekelle was included.
• Target population: All adolescents in Ethiopia
• Study population: All adolescents in Mekelle
• Sample: Adolescents in Ayder sub-city who were included in the study
[Figure: nested diagram showing the Sample inside the Study Population inside the Target Population]
Generalizability
• Is a two-stage procedure:
• We need to be able to generalize from:
– the sample to the study population, &
– then from the study population to the target
population
• If the sample is not representative of the
population, the conclusions are restricted to
the sample & don’t have general applicability
Parameter and Statistic
• Parameter: A descriptive measure computed
from the data of a population.
– E.g., the mean (µ) age of the target population
• Statistic: A descriptive measure computed from the data of a sample.
– E.g., the sample mean age ($\bar{x}$)
Sampling and Sampling
Distributions
Cont.
• Researchers often use sample survey methodology to
obtain information about a larger population by
selecting and measuring a sample from that population.
• Since the population is usually too large to measure in full, we rely on the information collected from the sample.
• Inferences about the population are based on the information from the sample drawn from that population.
Cont.
• A sample is a collection of individuals selected from a larger population.
• Sampling enables us to estimate the characteristics of a population by directly observing a portion of the population.
Cont.
[Figure: diagram relating the sample, the information it provides, and the population]
Steps needed to select a sample and ensure that
this sample will fulfill its goals.
1. Establish the study's objectives

– The first step in planning a useful and efficient survey is
to specify the objectives with as much detail as
possible.
– Without objectives, the survey is unlikely to generate
valuable results.
– Clarifying the aims of the survey is critical to its
ultimate success.
– The initial users and uses of the data should be
identified at this stage.
Cont.
2. Define the target population
– The target population is the total population for which the

information is required.
– Specifically, the target population is defined by the following

characteristics:
• Nature of data required
• Geographic location
• Reference period
• Other characteristics, such as socio-demographic characteristics
Cont.
3. Decide on the data to be collected
– The data requirements of the survey must be established.
– To ensure that the requirements are operationally sound, the necessary data
terms and definitions also need to be determined.
4. Set the level of precision
– There is a level of uncertainty associated with estimates coming

from a sample.
– The sample-to-sample variation is what causes the sampling error.
– Researchers can estimate the sampling error associated with a
particular sampling plan, and try to minimize it.
Cont.
5. Decide on the methods of measurement
– Choose the measuring instrument and the method of approach to the population.
– Data about a person’s state of health may be obtained from statements that he/she makes or from a medical examination.
– The survey may employ a self-administered questionnaire or an interview.
6. Prepare the sampling frame
– A list of all members of the population
– The elements must not overlap
Sampling
• The process of selecting a portion of the

population to represent the entire population.
• A main concern in sampling:
– Ensure that the sample represents the population,
and
– The findings can be generalized.
Advantages of sampling:
• Feasibility: Sampling may be the only feasible method of
collecting information.
• Reduced cost: Sampling reduces demands on resources such as finance, personnel, and material.
• Greater accuracy: Sampling may lead to more accurate data collection.
• Sampling error: Precise allowance can be made for sampling error.
• Greater speed: Data can be collected and summarized more quickly.
Disadvantages of sampling:
• There is always a sampling error.
• Sampling may create a feeling of
discrimination within the population.
• Sampling may be inadvisable where every unit
in the population is legally required to have a
record.
Errors in sampling
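1) Sampling error: The difference between a sample estimate and the true population value that arises because only part of the population is observed.
– It cannot be avoided or totally eliminated, but it can be reduced by increasing the sample size.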
2) Non-sampling error:
- Observational error
- Respondent error
- Lack of preciseness of definition
- Errors in editing and tabulation of data
Sampling Methods
Two broad divisions:
A. Probability sampling methods
B. Non-probability sampling methods
Probability sampling
• Involves random selection of a sample
• A sample is obtained in a way that ensures every member of the population has a known, non-zero probability of being included in the sample.
• The method chosen depends on a number of factors, such as
– the available sampling frame,
– how spread out the population is,
– how costly it is to survey members of the population
Most common probability
sampling methods
1. Simple random sampling

2. Systematic random sampling
3. Stratified random sampling
4. Cluster sampling
5. Multi-stage sampling
1. Simple random sampling
• Involves random selection
• Each member of a population has an equal
chance of being included in the sample.
• To use a SRS method:
– Make a numbered list of all the units in the
population
– Each unit should be numbered from 1 to N
(where N is the size of the population)
– Select the required number.
Cont.
• The randomness of the sample is ensured by:
– use of the “lottery” method
– a table of random numbers
Example
• Suppose your school has 500 students and
you need to conduct a short survey on the
quality of the food served in the cafeteria.
• You decide that a sample of 10 students

should be sufficient for your purposes.
• In order to get your sample, you assign a

number from 1 to 500 to each student in your
school.
Cont.
• To select the sample, you use a table of

randomly generated numbers.
• Pick a starting point in the table (a row and

column number) and look at the random
numbers that appear there. In this case, since
the data run into three digits, the random
numbers would need to contain three digits as
well.
Cont.
• Ignore 000 and all random numbers greater than 500 because they do not correspond to any of the students in the school.
• Remember that the sample is drawn without replacement, so if a number recurs, skip over it and use the next random number.
• The first 10 different numbers between 001 and 500 make up your sample.
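The cafeteria survey above can also be drawn by computer instead of a random number table. The following is a minimal sketch, assuming Python's standard `random` module stands in for the table; the function name `simple_random_sample` and the seed value are illustrative choices, not part of the original slides.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw unit numbers from a frame numbered 1..N, without replacement."""
    rng = random.Random(seed)  # seed only makes the illustration repeatable
    # range(1, N + 1) plays the role of the numbered list of students
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

# 500 students, sample of 10, as in the example
sample = simple_random_sample(500, 10, seed=2024)
print(sample)
```

Because `random.sample` never returns the same unit twice, the "skip repeated numbers" rule is handled automatically.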
Cont.
• SRS has certain limitations:

– Requires a sampling frame.
– Difficult if the reference population is dispersed.
– Minority subgroups of interest may not be
selected.
2. Systematic random sampling
• Sometimes called interval sampling,
systematic sampling means that there is a gap,
or interval, between each selected unit in the
sample
• After a random start, the selection is systematic rather than random.
Cont.
• Useful if the reference population is arranged in some order:
– Order of registration of patients
– Numerical order of house numbers
– Students’ registration books
• Individuals are taken at fixed intervals (every kth) based on the sampling fraction; e.g., if the sample includes 20% of the population, then every fifth individual is taken.
Steps in systematic random sampling
1. Number the units on your frame from 1 to N (where N is the total population size).
2. Determine the sampling interval (K) by dividing the number of units in the population by the desired sample size.
3. Select a number between one and K at random. This number is called the random start and is the first number included in your sample.
4. Select every Kth unit after that first number.
Note: Systematic sampling should not be used when a cyclic repetition is inherent in the sampling frame.
Example
• To select a sample of 100 from a population of 400,
you would need a sampling interval of 400 ÷ 100 = 4.
• Therefore, K = 4.
• You will need to select one unit out of every four units
to end up with a total of 100 units in your sample.
• Select a number between 1 and 4 from a table of

random numbers.
Cont.
• If you choose 3, the third unit on your frame
would be the first unit included in your
sample;
• The sample would then consist of the following 100 units: 3, 7, 11, 15, 19, ..., 395, 399 (up to N, which is 400 in this case).
Cont.
• Using the above example, you can see that
with a systematic sample approach there are
only four possible samples that can be
selected, corresponding to the four possible
random starts:
A. 1, 5, 9, 13...393, 397
B. 2, 6, 10, 14...394, 398
C. 3, 7, 11, 15...395, 399
D. 4, 8, 12, 16...396, 400
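The systematic sampling steps and the N = 400, n = 100 example can be sketched in a few lines. This is an illustrative sketch only: the function name `systematic_sample` is hypothetical, and it assumes N is an exact multiple of n so that the interval K is a whole number.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every K-th unit from a frame numbered 1..N after a random start."""
    k = population_size // sample_size  # sampling interval K (assumes N is a multiple of n)
    if start is None:
        # random start: a number between 1 and K, inclusive
        start = random.Random(seed).randint(1, k)
    return list(range(start, population_size + 1, k))[:sample_size]

# Random start of 3, as in the example: units 3, 7, 11, ..., 399
sample = systematic_sample(400, 100, start=3)
print(sample[:5], "...", sample[-2:])
```

Calling the function with `start=1`, `2`, `3` and `4` reproduces the four possible samples A to D listed above.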
3. Stratified random sampling
• It is used when the population is known to be heterogeneous with regard to some factors, and those factors are used for stratification.
• Using stratified sampling, the population is divided into homogeneous, mutually exclusive groups called strata.
• A population can be stratified by any variable that is available for all units prior to sampling (e.g., age, sex, province of residence, income, etc.).
• A separate sample is taken independently from each stratum.
Why do we need to create strata?
• Stratification can make the sampling strategy more efficient.
• A larger sample is required to get a more accurate estimate if a characteristic varies greatly from one unit to the other.
• For example, if every person in a population had the same salary, then a sample of one individual would be enough to get a precise estimate of the average salary.
• Because units within a stratum are similar, a relatively small sample from each stratum can give a precise estimate.
Cont.
• Equal allocation:
– Allocate an equal sample size to each stratum
• Proportionate allocation:
$n_j = \frac{n}{N} N_j$, $j = 1, 2, \ldots, k$, where $k$ is the number of strata and
– $n_j$ is the sample size of the jth stratum
– $N_j$ is the population size of the jth stratum
– $n = n_1 + n_2 + \cdots + n_k$ is the total sample size
– $N = N_1 + N_2 + \cdots + N_k$ is the total population size
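As a small worked illustration of proportionate allocation, the sketch below gives each stratum a share of the sample equal to its share of the population. The stratum sizes are hypothetical numbers chosen for the example, and the function name is an assumption of this sketch.

```python
def proportionate_allocation(stratum_sizes, total_sample_size):
    """Return n_j = (n / N) * N_j for each stratum, rounded to whole units."""
    population_size = sum(stratum_sizes)  # N = N_1 + ... + N_k
    return [round(total_sample_size * n_stratum / population_size)
            for n_stratum in stratum_sizes]

# Hypothetical population of 1,000 split into three strata; total sample n = 100
strata = [500, 300, 200]
allocation = proportionate_allocation(strata, 100)
print(allocation)  # [50, 30, 20]
```

When the shares are not whole numbers, rounding can make the allocated sizes add up to slightly more or less than n, so the result may need a manual adjustment.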
4. Cluster sampling
• Sometimes it is too expensive to spread a sample
across the population as a whole.
• Travel costs can become expensive if interviewers have

to survey people from one end of the country to the
other.
• To reduce costs, researchers may choose a cluster

sampling technique
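• Ideally, each cluster is internally heterogeneous (a small-scale picture of the population) and the clusters are similar to one another. This is the reverse of stratified sampling, where each stratum is internally homogeneous and the strata differ from one another.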
Steps in cluster sampling
• Cluster sampling divides the population into groups or clusters.
• A number of clusters are selected randomly to represent the

total population, and then all units within selected clusters are
included in the sample.
• No units from non-selected clusters are included in the sample

—they are represented by those from selected clusters.
• This differs from stratified sampling, where some units are

selected from each group.
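The cluster sampling steps above can be sketched as follows. This is only an illustration: the frame of class sections is hypothetical, and `cluster_sample` is a name invented for this sketch.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """clusters maps a cluster name to the list of units it contains."""
    rng = random.Random(seed)
    # Randomly select whole clusters from the list of cluster names
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Every unit of a selected cluster enters the sample; other clusters contribute nothing
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical frame: 6 class sections with 5 students each
sections = {f"section_{s}": [f"{s}{i}" for i in range(1, 6)] for s in "ABCDEF"}
sample = cluster_sample(sections, n_clusters=2, seed=1)
print(sample)
```

Only a list of clusters is needed to draw the sample, which is why the method is convenient when no list of individual units exists.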
Example
• In a school-based study, we assume that the sections of a school are similar to one another.
• We can randomly select sections and include all students of the selected sections only.
Cont.
• Sometimes a list of all units in the population is not available,
while a list of all clusters is either available or easy to create.
• In most cases, the main drawback is a loss of efficiency when

compared with SRS.
• It is usually better to survey a large number of small clusters

instead of a small number of large clusters.
– This is because neighboring units tend to be more alike, resulting in a
sample that does not represent the whole spectrum of opinions or
situations present in the overall population.
• Another drawback to cluster sampling is that you do not have total control
over the final sample size.
5. Multi-stage sampling
• Similar to the cluster sampling.
• But it involves picking a sample from within each chosen cluster,

rather than including all units in the cluster.
• This type of sampling requires at least two stages.
• In the first stage, large groups or clusters are identified and selected.
• In the second stage, population units are picked from within the
selected clusters (using any of the possible probability sampling
methods) for a final sample.
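A two-stage version of the procedure can be sketched as below. It assumes simple random sampling at both stages, which is only one of the possible choices; the villages, household numbers and the function name `two_stage_sample` are hypothetical.

```python
import random

def two_stage_sample(clusters, n_clusters, units_per_cluster, seed=None):
    """Stage 1: random clusters. Stage 2: a simple random sample inside each one."""
    rng = random.Random(seed)
    # Stage 1: only a list of clusters is needed here
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Stage 2: units are listed only for the clusters that were selected
    return {
        name: rng.sample(clusters[name], min(units_per_cluster, len(clusters[name])))
        for name in chosen
    }

# Hypothetical frame: 5 villages, each with households numbered 1 to 20
villages = {f"village_{v}": list(range(1, 21)) for v in range(1, 6)}
sample = two_stage_sample(villages, n_clusters=2, units_per_cluster=4, seed=7)
print(sample)
```

Unlike one-stage cluster sampling, the final sample size is fixed in advance (here 2 × 4 = 8 households).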
Cont.
• If more than two stages are used, the process of choosing
population units within clusters continues until there is a final
sample.
• Also, you do not need to have a list of all of the units in the
population. All you need is a list of clusters and list of the units
in the selected clusters.
• Admittedly, more information is needed in this type of sample

than what is required in cluster sampling. However, multi-stage
sampling still saves a great amount of time and effort by not
having to create a list of all the units in a population.
B. Non-probability sampling
• The difference between probability and non-
probability sampling has to do with a basic
assumption about the nature of the population
under study.
• In probability sampling, every item has a known

chance of being selected.
• In non-probability sampling, there is an assumption

that there is an even distribution of a characteristic
of interest within the population.
Cont.
• In non-probability sampling, since elements

are chosen arbitrarily, there is no way to
estimate the probability of any one element
being included in the sample.
• Also, no assurance is given that each item has
a chance of being included, making it
impossible either to estimate sampling
variability or to identify possible bias
Cont.
• Reliability cannot be measured in non-probability sampling;
the only way to address data quality is to compare some of
the survey results with available information about the
population.
• Still, there is no assurance that the estimates will meet an

acceptable level of error.
• Researchers are reluctant to use these methods because

there is no way to measure the precision of the resulting
sample.
Cont.
• Despite these drawbacks, non-probability

sampling methods can be useful when
descriptive comments about the sample itself
are desired.
• There are also other circumstances, such as some research studies, in which it is infeasible or impractical to conduct probability sampling.
The most common types of non-
probability sampling
1. Convenience or haphazard sampling

2. Volunteer sampling
3. Judgment sampling
4. Quota sampling
5. Snowball sampling technique
1. Convenience or haphazard sampling
• Convenience sampling is sometimes

referred to as haphazard or accidental
sampling.
• It is not normally representative of the
target population because sample units are
only selected if they can be accessed easily
and conveniently.
Cont.
• The obvious advantage is that the method is

easy to use, but that advantage is greatly
offset by the presence of bias.
• Although useful applications of the technique

are limited, it can deliver accurate results
when the population is homogeneous.
Cont.
• For example, a scientist could use this method to

determine whether a lake is polluted or not.
• Assuming that the lake water is well-mixed, any

sample would yield similar information.
• A scientist could safely draw water anywhere on the

lake without bothering about whether or not the
sample is representative
2. Volunteer sampling
• As the term implies, this type of sampling occurs
when people volunteer to be involved in the study.
• In psychological experiments or pharmaceutical

trials (drug testing), for example, it would be
difficult and unethical to enlist random
participants from the general public.
• In these instances, the sample is taken from a

group of volunteers.
Cont.
• Sometimes, the researcher offers payment to

attract respondents.
• In exchange, the volunteers accept the

possibility of a lengthy, demanding or
sometimes unpleasant process.
Cont.
• Sampling voluntary participants as opposed to
the general population may introduce strong
biases.
• Often in opinion polling, only the people who

care strongly enough about the subject tend
to respond.
• The silent majority does not typically

respond, resulting in large selection bias.
3. Judgment sampling
• This approach is used when a sample is taken based
on certain judgments about the overall population.
• The underlying assumption is that the investigator

will select units that are characteristic of the
population.
• The critical issue here is objectivity: how much can

judgment be relied upon to arrive at a typical
sample?
Cont.
• Judgment sampling is subject to the
researcher's biases and is perhaps even more
biased than haphazard sampling.
• Since any preconceptions the researcher may

have are reflected in the sample, large biases
can be introduced if these preconceptions are
inaccurate.
Cont.
• Researchers often use this method in exploratory
studies like pre-testing of questionnaires and focus
groups.
• They also prefer to use this method in laboratory settings where the choice of experimental subjects (e.g., animal, human) reflects the investigator's pre-existing beliefs about the population.
• One advantage of judgment sampling is the
reduced cost and time involved in acquiring the
sample.
4. Quota sampling
• This is one of the most common forms of
non-probability sampling.
• Sampling is done until a specific number of
units (quotas) for various sub-populations
have been selected.
• Since there are no rules as to how these
quotas are to be filled, quota sampling is
really a means for satisfying sample size
objectives for certain sub-populations.
Cont.
• As with all other non-probability sampling

methods, in order to make inferences about
the population, it is necessary to assume that
persons selected are similar to those not
selected.
• Such strong assumptions are rarely valid.
Cont.
• The main argument against quota sampling is

that it does not meet the basic requirement of
randomness.
• Some units may have no chance of selection
or the chance of selection may be unknown.
• Therefore, the sample may be biased.
• Quota sampling is generally less expensive
than random sampling.
Cont.
• It is also easy to administer, especially considering that the tasks of listing the whole population, randomly selecting the sample and following up on non-respondents can be omitted from the procedure.
• Quota sampling is an effective sampling method when information is urgently required, and it can be carried out without sampling frames.
• In many cases where the population has no suitable frame, quota sampling may be the only appropriate sampling method.
5. Snowball sampling
• A technique for selecting a research sample
where existing study subjects recruit future
subjects from among their acquaintances.
• Thus the sample group appears to grow like
a rolling snowball.
Cont.
• This sampling technique is often used in hidden
populations which are difficult for researchers to
access; example populations would be drug users or
commercial sex workers.
• Because sample members are not selected from a
sampling frame, snowball samples are subject to
numerous biases. For example, people who have
many friends are more likely to be recruited into the
sample.
Estimation
• Up until this point, we have assumed that the values
of the parameters of a probability distribution are
known.
• In the real world, the values of these population

parameters are usually not known
• Instead, we must try to say something about the way

in which a random variable is distributed using the
information contained in a sample of observations
• The process of drawing conclusions about an entire
population based on the data in a sample is known as
statistical inference.
• Methods of inference usually fall into one of two broad

categories:
** Estimation or Hypothesis testing **
• For now, we will focus on using the observations in a

sample to estimate a population parameter
Estimation
• It is concerned with estimating the values of
specific population parameters based on
sample statistics.
• It is about using information in a sample to
make estimates of the characteristics
(parameters) of the source population.
Estimation, Estimator & Estimate
♣ Estimation is the computation of a statistic from sample

data, often yielding a value that is an approximation (guess)
of its target, an unknown true population parameter value.
♣ The statistic itself is called an estimator and can be of two

types - point estimator or interval estimator.
♣ The value or values that the estimator assumes are called

estimates.
Point versus Interval Estimators
• Point estimation involves the calculation of a single
number to estimate the population parameter
• Interval estimation specifies a range of reasonable

values for the parameter
 Thus,
– A point estimate is of the form: [ Value ],
– Whereas, an interval estimate is of the form:
[ lower limit, upper limit ]
1. Point Estimate
• A single numerical value used to estimate
the corresponding population parameter.
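Sample Statistics are Estimators of Population Parameters
Sample statistic (estimator) | Population parameter
Sample mean, $\bar{x}$ | $\mu$
Sample variance, $S^2$ | $\sigma^2$
Sample proportion, $\hat{p}$ | $P$ or $\pi$
Sample odds ratio, $\widehat{OR}$ | $OR$
Sample relative risk, $\widehat{RR}$ | $RR$
Sample correlation coefficient, $r$ | $\rho$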
2. Interval Estimation
• Interval estimation specifies a range of reasonable values for the population
parameter based on a point estimate.
• A confidence interval (CI) is a particular type of interval estimator. A CI:
– Gives a plausible range of values that is likely to include the “true” (population) value with a given confidence level.
– Also gives information about the precision of an estimate.
– Indicates less certainty the wider it is.
– Can also answer the question of whether or not an association exists.
Confidence Level
• Confidence Level
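– The degree of confidence that the interval contains the unknown population parameter
• $P(L \le \theta \le U) = 1 - \alpha$, where $\theta$ is the population parameter and $L$ and $U$ are the lower and upper limits of the interval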
Estimation for Single Population
1. CI for a Single Population Mean
(normally distributed)
A. Known variance (large sample size)
• There are 3 elements to a CI:

1. Point estimate
2. SE of the point estimate
3. Confidence coefficient
• Consider the task of computing a CI estimate of μ for

a population distribution that is normal with σ
known.
• Available are data from a random sample of size = n.
Cont.
Assumptions
– Population standard deviation ($\sigma$) is known
– Population is normally distributed
• A 100(1-α)% C.I. for μ is: $\bar{x} \pm z_{1-\alpha/2} \, \frac{\sigma}{\sqrt{n}}$
• α is chosen by the researcher; the most common values of α are 0.05, 0.01 and 0.1.
Margin of Error
(Precision of the estimate)
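The formula for this slide did not survive extraction. As a hedged reconstruction, assuming the known-variance case used on the surrounding slides, the margin of error is the half-width of the interval, that is, the quantity added to and subtracted from the sample mean:

```latex
E = z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}},
\qquad
\text{CI} = \bar{x} \pm E
```

For instance, with $\sigma^2 = 2.25$ and $n = 20$ at the 95% level, $E = 1.96\sqrt{2.25/20} \approx 0.66$.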
Cont.
– As n increases, the length of the CI decreases.
– As σ increases, the length of the CI increases.
– As the confidence level increases (α decreases), the length of the CI increases.
Example:
1. Waiting times (in hours) at a particular hospital are believed to be approximately normally distributed with a variance of 2.25 hr².
a. A sample of 20 outpatients revealed a mean waiting

time of 1.52 hours. Construct the 95% CI for the
estimate of the population mean.
b. Suppose that the mean of 1.52 hours had resulted

from a sample of 32 patients. Find the 95% CI.
c. What effect does larger sample size have on the CI?
Solution:
a. $1.52 \pm 1.96\sqrt{\frac{2.25}{20}} = 1.52 \pm 1.96(0.335) = 1.52 \pm 0.66 = (0.86, 2.18)$
• We are 95% confident that the true mean waiting time is between 0.86 and 2.18 hrs.
 An incorrect interpretation is that there is 95%

probability that this interval contains the true
population mean.
Cont.
b. $1.52 \pm 1.96\sqrt{\frac{2.25}{32}} = 1.52 \pm 1.96(0.265) = 1.52 \pm 0.52 = (1.00, 2.04)$
c. A larger sample size makes the CI narrower (more precise).
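The two intervals in this example can be checked with a short script. This is a minimal sketch that assumes the known-variance formula (sample mean ± z × σ/√n) with z = 1.96; the function name is illustrative. It carries full precision throughout, so its last digit can differ slightly from hand calculations that round the standard error first.

```python
import math

def z_confidence_interval(mean, variance, n, z=1.96):
    """CI for a mean when the population variance is known."""
    standard_error = math.sqrt(variance / n)
    margin = z * standard_error  # margin of error (half-width of the interval)
    return mean - margin, mean + margin

# Parts (a) and (b): same mean and variance, sample sizes 20 and 32
for n in (20, 32):
    lower, upper = z_confidence_interval(1.52, 2.25, n)
    print(f"n = {n}: ({lower:.2f}, {upper:.2f})")
```

The interval for n = 32 is narrower than the one for n = 20, which is the point of part (c).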
Cont.
B. Unknown variance (small sample size, n ≤ 30)
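• What if σ for the underlying population is unknown and the sample size is small?
• As an alternative, we use Student’s t distribution, replacing σ with the sample standard deviation s: $\bar{x} \pm t_{n-1,\,1-\alpha/2} \, \frac{s}{\sqrt{n}}$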
Example
• Standard error =
• t-value at 90% CL at 19 df =1.729
Exercise
• Compute a 95% CI for the mean birth
weight based on n = 10, sample mean =
116.9 oz and s =21.70.
• From the t table, $t_{9,\,0.975} = 2.262$
• Ans: (101.4, 132.4)
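The exercise answer can be verified with a few lines of code. This sketch assumes the usual t-based interval (sample mean ± t × s/√n) and takes the critical value 2.262 from the t table, as the exercise does, rather than computing it; the function name is illustrative.

```python
import math

def t_confidence_interval(mean, sd, n, t_crit):
    """CI for a mean when the population variance is unknown (t distribution)."""
    margin = t_crit * sd / math.sqrt(n)  # t value times the estimated standard error
    return mean - margin, mean + margin

# Birth weight exercise: n = 10, mean = 116.9 oz, s = 21.70, t(9, 0.975) = 2.262
lower, upper = t_confidence_interval(116.9, 21.70, 10, t_crit=2.262)
print(f"({lower:.1f}, {upper:.1f})")  # (101.4, 132.4)
```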
2. CIs for single population proportion, p
• Is based on three elements of CI

– Point estimate
– SE of point estimate
– Confidence coefficient
Example 1
• A random sample of 100 people shows that 25
are left-handed. Form a 95% CI for the true
proportion of left-handers.
Interpretation
Example 2
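• Suppose that among 10,000 female operating-room nurses, 60 women have developed breast cancer over five years. Find the 95% CI for p based on the point estimate.
• Point estimate = 60/10,000 = 0.006
• The 95% CI for p is given by the interval: $\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.006 \pm 1.96\sqrt{\frac{0.006(0.994)}{10{,}000}}$
• The 95% CI for p is: (0.0045, 0.0075)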
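Both proportion examples can be worked through with the sketch below. It assumes the large-sample (normal approximation) interval, p̂ ± z × √(p̂(1 − p̂)/n), built from the three elements listed earlier (point estimate, SE and confidence coefficient); the slides' own formula was not recoverable, so treat this as one reasonable reading rather than the author's exact method. The function name is illustrative.

```python
import math

def proportion_confidence_interval(successes, n, z=1.96):
    """Large-sample CI for a single population proportion."""
    p_hat = successes / n                            # point estimate
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin

# Example 1: 25 left-handers among 100 people
print(proportion_confidence_interval(25, 100))
# Example 2: 60 breast cancer cases among 10,000 nurses
print(proportion_confidence_interval(60, 10_000))
```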
Hypothesis Testing
• The purpose of hypothesis testing is to aid the researcher in reaching a decision (conclusion) concerning a population by examining a sample from that population.
Hypothesis
• Is a statement about one or more populations
• Is a claim (assumption) about a population parameter
Examples of Research Hypotheses
Population Mean
• The average length of stay of patients
admitted to the hospital is five days
• The mean birth weight of babies delivered by

mothers with low SES is lower than those
from higher SES.
• Etc
Types of Hypothesis
1. The Null Hypothesis, H0
· Is a statement claiming that there is no difference between the

hypothesized value and the population value.
· (The effect of interest is zero = no difference)
· States the assumption (hypothesis) to be tested
· H0 is always about a population parameter, not about a sample statistic
· Begin with the assumption that the Ho is true

– Similar to the notion of innocent until proven guilty
Cont.
2. The Alternative Hypothesis, HA
• Is a statement of what we will believe is true if our

sample data causes us to reject Ho.
• Is generally the hypothesis that is believed (or needs

to be supported) by the researcher.
• Is a statement that disagrees (opposes) with Ho

(The effect of interest is not zero)
Steps in Hypothesis Testing
1. Formulate the appropriate statistical hypotheses
clearly
• Specify H0 and HA
H0: μ = μ0    H0: μ ≤ μ0    H0: μ ≥ μ0
HA: μ ≠ μ0    HA: μ > μ0    HA: μ < μ0
two-tailed    one-tailed    one-tailed
2. State the assumptions necessary for computing
probabilities
• A distribution is approximately normal (Gaussian)
• Variance is known or unknown
Cont.
3. Select a sample and collect data
• Categorical, continuous
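4. Decide on the appropriate test statistic for the hypothesis. E.g., for one population mean:
$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ (σ known) OR $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ (σ unknown)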
Cont.
5. Specify the desired level of significance for the statistical test (α = 0.05, 0.01, etc.)
6. Determine the critical value.
– A value the test statistic must attain to be declared significant.
– E.g., at α = 0.05 for a Z test: ±1.96 (two-tailed); 1.645 or -1.645 (one-tailed)
7. Obtain sample evidence and compute the
test statistic
8. Reach a decision and draw the conclusion
• If Ho is rejected, we conclude that HA is true
(or accepted).
• If Ho is not rejected, we conclude that Ho may
be true.
Rules for Stating Statistical Hypotheses
1. One population
• Indication of equality (either =, ≤ or ≥) must
appear in Ho.
Ho: μ = μo, HA: μ ≠ μo
Ho: P = Po, HA: P ≠ Po
• Can we conclude that a certain population mean
is
– not 50?
Ho: μ = 50 and HA: μ ≠ 50
– greater than 50?
Ho: μ ≤ 50 HA: μ > 50
Cont.
• Can we conclude that the proportion of

patients with leukemia who survive more
than six years is not 60%?
Ho: P = 0.6 HA: P ≠ 0.6
Statistical Decision
• Reject Ho if the value of the test statistic
that we compute from our sample is one of
the values in the rejection region
• Don’t reject Ho if the computed value of the

test statistic is one of the values in the non-
rejection region.
112
Another way to state conclusion
• Reject Ho if P-value < α
• Do not reject Ho if P-value ≥ α
• The P-value is the probability of obtaining a test statistic as extreme as or more extreme than the one actually obtained, if Ho is true.
• The larger the absolute value of the test statistic, the smaller the P-value. Equivalently, the smaller the P-value, the stronger the evidence against Ho.
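The P-value rule can be illustrated with a minimal sketch, assuming the test statistic follows a standard normal distribution under Ho. The function names `normal_cdf` and `p_value` are illustrative and not part of the slides; the Z values in the loop are made up for demonstration.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def p_value(z: float, tail: str = "two") -> float:
    """P-value of a Z statistic; tail is 'two', 'lower' or 'upper'."""
    if tail == "two":
        return 2.0 * (1.0 - normal_cdf(abs(z)))
    if tail == "lower":
        return normal_cdf(z)
    if tail == "upper":
        return 1.0 - normal_cdf(z)
    raise ValueError("tail must be 'two', 'lower' or 'upper'")

alpha = 0.05
# Hypothetical Z values: a larger |Z| gives a smaller P-value.
for z in (1.0, 1.96, 2.5):
    p = p_value(z)
    decision = "reject Ho" if p < alpha else "do not reject Ho"
    print(f"Z = {z:.2f}  P-value = {p:.4f}  -> {decision}")
```

The same Z value gives half the two-tailed P-value when the test is one-tailed in the direction of the observed effect.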
113
Types of Errors in Hypothesis Tests
• Whenever we reject or fail to reject Ho, we risk committing an error.
• Two types of errors can be committed:
– Type I Error
– Type II Error
114
Type I Error
• The probability of a type I error is the
probability of rejecting the Ho when it is true
• The probability of type I error is α
• Called level of significance of the test
• Set by researcher in advance
115
Type II Error
• The error committed when a false Ho is not
rejected
• The probability of Type II Error is β
• Usually unknown but larger than α
116
Cont.
Action              Reality
(Conclusion)        Ho True                      Ho False
Do not reject Ho    Correct action               Type II error (β)
                    (Prob. = 1-α)                (Prob. = β = 1-Power)
Reject Ho           Type I error (α)             Correct action
                    (Prob. = α = Sign. level)    (Prob. = Power = 1-β)
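The quantities in the table can be computed for a concrete case. The following is a sketch only, assuming a two-sided Z test of a single mean with known σ. The function `power_two_sided_z` and the numbers passed to it are illustrative assumptions, not taken from the slides.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def power_two_sided_z(mu0, mu_true, sigma, n, z_crit=1.96):
    """Probability of rejecting Ho: mu = mu0 when the true mean is mu_true."""
    shift = (mu_true - mu0) / (sigma / math.sqrt(n))
    # With the true mean, Z ~ Normal(shift, 1); Ho is rejected when |Z| >= z_crit.
    return normal_cdf(-z_crit - shift) + 1.0 - normal_cdf(z_crit - shift)

# Hypothetical setting: Ho: mu = 30, sigma^2 = 20, n = 10.
sigma = math.sqrt(20.0)
type_one = power_two_sided_z(30.0, 30.0, sigma, 10)   # Ho true: rejection rate = alpha
power = power_two_sided_z(30.0, 27.0, sigma, 10)      # Ho false: true mean is 27
beta = 1.0 - power                                    # Type II error probability
print(f"alpha = {type_one:.3f}, power = {power:.3f}, beta = {beta:.3f}")
```

When the true mean equals the hypothesized mean, the rejection probability reduces to α, which matches the "Reject Ho / Ho True" cell.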
117
Type I & II Error Relationship
118
Hypothesis Testing of a Single Mean
(Normally Distributed)
119
Known Variance
120
Example: Two-Tailed Test
1. A simple random sample of 10 people from a certain
population has a mean age of 27. Can we conclude that
the mean age of the population is not 30? The variance is
known to be 20. Let α = 0.05.
• Answer: "Yes we can, if we can reject the Ho that it is 30."
A. Data
n = 10, sample mean = 27, σ² = 20, α = 0.05
B. Assumptions
Simple random sample
Normally distributed population
121
Cont.
C. Hypotheses
Ho: µ = 30
HA: µ ≠ 30
D. Test statistic
As the population variance is known, we use Z
as the test statistic.
122
Cont.
E. Decision Rule
• Reject Ho if the Z value falls in the rejection region.
• Don't reject Ho if the Z value falls in the non-rejection region.
• Because of the structure of Ho, it is a two-tailed test. Therefore, reject Ho if Z ≤ -1.96 or Z ≥ 1.96.
123
F. Calculation of test statistic
$Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \frac{27 - 30}{\sqrt{20}/\sqrt{10}} = \frac{-3}{1.4142} = -2.12$
G. Statistical decision
We reject Ho because Z = -2.12 is in the rejection region. The value is significant at α = 5%.
H. Conclusion
We conclude that µ is not 30. P-value = 0.0340
A Z value of -2.12 corresponds to a tail area of 0.0170. Since there are two parts to the rejection region in a two-tailed test, the P-value is twice this, which is 0.0340.
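The worked example can be checked numerically. This is a minimal sketch using only the data given in the example (n = 10, sample mean 27, variance 20, hypothesized mean 30); the variable names are illustrative.

```python
import math

n = 10
sample_mean = 27.0
mu0 = 30.0            # hypothesized population mean under Ho
variance = 20.0       # known population variance
z_crit = 1.96         # two-tailed critical value at alpha = 0.05

standard_error = math.sqrt(variance / n)
z = (sample_mean - mu0) / standard_error

# Two-tailed P-value: 2 * P(Z >= |z|) = erfc(|z| / sqrt(2))
p_two_tailed = math.erfc(abs(z) / math.sqrt(2.0))
reject = abs(z) >= z_crit

print(f"Z = {z:.2f}, two-tailed P-value = {p_two_tailed:.4f}")
print("Reject Ho" if reject else "Do not reject Ho")
```

The P-value computed from the unrounded Z may differ in the last digit from a value read from a Z table at -2.12.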
124
Hypothesis test using
confidence interval
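• A problem like the above example can also be solved using a confidence interval.
• A confidence interval will show that the hypothesized value of µ (30) does not fall within the boundaries of the interval. However, it will not give a probability.
• Confidence interval
$\bar{x} \pm Z_{\alpha/2}\,\sigma/\sqrt{n} = 27 \pm 1.96(1.4142) = (24.23, 29.77)$
Since 30 lies outside this interval, Ho is rejected at α = 0.05.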
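A short sketch of the confidence-interval approach, reusing the data of the preceding example (n = 10, sample mean 27, variance 20) and assuming a 95% interval with known variance. Variable names are illustrative.

```python
import math

n = 10
sample_mean = 27.0
variance = 20.0
hypothesized_mean = 30.0
z_crit = 1.96  # for a 95% two-sided interval

margin = z_crit * math.sqrt(variance / n)
lower = sample_mean - margin
upper = sample_mean + margin

# Ho: mu = 30 is rejected at alpha = 0.05 if 30 lies outside the 95% CI.
inside = lower <= hypothesized_mean <= upper
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
print("Do not reject Ho" if inside else "Reject Ho")
```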
125
Example: One -Tailed Test
• A simple random sample of 10 people from a certain
population has a mean age of 27. Can we conclude that the
mean age of the population is less than 30? The variance is
known to be 20. Let α = 0.05.
• Data
n = 10, sample mean = 27, σ² = 20, α = 0.05
• Hypotheses
Ho: µ ≥ 30, HA: µ < 30
126
• Test statistic
$Z = \frac{27 - 30}{\sqrt{20}/\sqrt{10}} = -2.12$
• Rejection Region
Lower tail test
• With α = 0.05 and the inequality, we have the entire rejection region at
the left. The critical value will be Z = -1.645. Reject Ho if Z < -1.645.
127
Cont.
• Statistical decision
– We reject the Ho because -2.12 < -1.645.
• Conclusion
– We conclude that µ < 30.
– p = .0170 this time because it is only a one tail test and not a two tail test.
128
Unknown Variance
• In most practical applications the standard deviation of the underlying population is not known.
• In this case, σ can be estimated by the sample standard deviation s.
• If the underlying population is normally distributed, then the test statistic is:
$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$, with n - 1 degrees of freedom
129
Example: Two-Tailed Test
• A simple random sample of 14 people from a certain population gives
a sample mean body mass index (BMI) of 30.5 and sd of 10.64. Can we
conclude that the BMI is not 35 at α 5%?
• Ho: µ = 35, HA: µ ≠35
• Test statistic
$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{30.5 - 35}{10.64/\sqrt{14}} = -1.58$
• If the assumptions are correct and Ho is true, the test statistic follows
Student's t distribution with 13 degrees of freedom.
130
Cont.
• Decision rule
– We have a two tailed test. With α = 0.05 it means that each tail is 0.025. The
critical t values with 13 df are -2.1604 and 2.1604.
– We reject Ho if the t ≤ -2.1604 or t ≥ 2.1604.
• Do not reject Ho because -1.58 is not in the rejection region. Based on the data of the sample, it is possible that µ = 35. P-value = 0.1375
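The BMI example can be reproduced with a small sketch. It uses the data and the critical value 2.1604 quoted in the example; the P-value is not recomputed here because the standard library has no t distribution. Variable names are illustrative.

```python
import math

n = 14
sample_mean = 30.5
sample_sd = 10.64
mu0 = 35.0
t_crit = 2.1604  # critical t for 13 df, alpha = 0.05 two-tailed (from the example)

standard_error = sample_sd / math.sqrt(n)
t_stat = (sample_mean - mu0) / standard_error
degrees_of_freedom = n - 1
reject = abs(t_stat) >= t_crit

print(f"t = {t_stat:.2f} with {degrees_of_freedom} df")
print("Reject Ho" if reject else "Do not reject Ho")
```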
131
Sampling from a population that is not
normally distributed
• Here, we do not know if the population displays a
normal distribution.
• However, with a large sample size, we know from the Central Limit Theorem that the sampling distribution of the sample mean is approximately normal.
132
Cont.
• With a large sample, we can use Z as the test statistic calculated using the sample sd:
$Z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$
133
Hypothesis Tests for Proportions
• Involves categorical values
• Two possible outcomes
– “Success” (possesses a certain characteristic)
– “Failure” (does not possess that characteristic)
• Fraction or proportion of population in the “success” category is denoted by p
134
Proportions
135
Hypothesis Testing about a Single Population
Proportion
(Normal Approximation to Binomial Distribution)
136
137
Example
• We are interested in the probability of developing asthma
over a given one-year period for children 0 to 4 years of age
whose mothers smoke in the home. In the general
population of 0 to 4-year-olds, the annual incidence of
asthma is 1.4%. If 10 cases of asthma are observed over a
single year in a sample of 500 children whose mothers
smoke, can we conclude that this is different from the
underlying probability of p0 = 0.014? α = 5%
H0 : p = 0.014
HA: p ≠ 0.014
138
Cont.
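• The test statistic is given by:
$Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = \frac{0.02 - 0.014}{\sqrt{0.014(0.986)/500}} = 1.14$, where $\hat{p} = 10/500 = 0.02$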
139
Cont.
• The critical value of Zα/2 at α=5% is ±1.96.
• Don't reject Ho since Z (= 1.14) lies in the non-rejection region between ±1.96.
• P-value = 0.2548
• We do not have sufficient evidence to conclude that the probability of developing asthma for children whose mothers smoke in the home is different from the probability in the general population.
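The asthma example can be sketched in code, assuming the normal approximation to the binomial. The check `n * p0 * (1 - p0) >= 5` is a common rule of thumb for that approximation and is an assumption here, not something stated on the slides. Variable names are illustrative.

```python
import math

cases = 10
n = 500
p0 = 0.014       # incidence in the general population (Ho)
z_crit = 1.96    # two-tailed critical value at alpha = 0.05

p_hat = cases / n
# Rule-of-thumb check for the normal approximation (assumed threshold of 5).
approximation_ok = n * p0 * (1.0 - p0) >= 5

standard_error = math.sqrt(p0 * (1.0 - p0) / n)
z = (p_hat - p0) / standard_error
p_two_tailed = math.erfc(abs(z) / math.sqrt(2.0))
reject = abs(z) >= z_crit

print(f"p_hat = {p_hat:.3f}, Z = {z:.2f}, two-tailed P-value = {p_two_tailed:.3f}")
print("Reject Ho" if reject else "Do not reject Ho")
```

The P-value from the unrounded Z is close to 0.25 and can differ slightly in later digits from a table-based value.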
140
Sample size determination
• Sample Size: The number of study subjects selected to
represent a given study population.
• In estimating a certain characteristic of a population, sample size calculations are important to ensure that estimates are obtained with required precision or confidence
• Should be sufficient to represent the characteristics of interest of the study population.
141
Cont.
Sample size determination depends on the:

– Objective of the study
– Design of the study
• Descriptive/Analytic
– Accuracy of the measurements to be made
– Degree of precision required for generalization
– Plan for statistical analysis
– Degree of confidence with which to conclude
142
Cont.
• The feasible sample size is also determined by the availability of resources:
– time
– manpower
– transport
– available facility, and
– money
143
Sample size for single sample
A. Sample size for estimating a single population mean
B. Sample size to estimate a single population proportion
144
A. Sample size for estimating a single
population mean
$n = \frac{Z_{\alpha/2}^2\,\sigma^2}{d^2}$
• where d = margin of error
= absolute precision
= half of the width (w) of the CI
• d is written as e in some textbooks
145
Examples:
1. Find the minimum sample size needed to
estimate the drop in heart rate (µ) for a new
study using a higher dose of propranolol than the
standard one. We require that the two-sided 95%
CI for µ be no wider than 5 beats per minute and
the sample sd for change in heart rate equals 10
beats per minute.
n = (1.96)² (10)² / (2.5)² ≈ 61.5, rounded up to 62 patients
146
2. Suppose that for a certain group of cancer patients, we
are interested in estimating the mean age at diagnosis.
We would like a 95% CI of 5 years wide. If the population
SD is 12 years, how large should our sample be?
147
Cont.
• Suppose d=1
• Then the sample size increases
148
Cont.
3. A hospital director wishes to estimate the mean weight of babies born in the hospital. How large a sample of birth records should be taken if she/he wants a 95% CI of 0.5 wide? Assume that a reasonable estimate of σ is 2. Ans: 246 birth records.
149
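The sample size examples for a single mean can be reproduced with a short sketch, assuming the formula n = Z²σ²/d² with d equal to half the desired CI width and the result rounded up to the next whole subject. The function name `sample_size_mean` is illustrative.

```python
import math

def sample_size_mean(sigma: float, d: float, z: float = 1.96) -> int:
    """Minimum n so that the CI for a mean has margin of error at most d."""
    return math.ceil((z * sigma / d) ** 2)

# Example 1: CI no wider than 5 beats/min (d = 2.5), sd = 10
n_heart_rate = sample_size_mean(sigma=10, d=2.5)
# Example 2: CI 5 years wide (d = 2.5), SD = 12 years
n_age = sample_size_mean(sigma=12, d=2.5)
# Example 2 with a tighter margin, d = 1
n_age_tight = sample_size_mean(sigma=12, d=1)
# Example 3: CI 0.5 wide (d = 0.25), sigma = 2
n_birth_weight = sample_size_mean(sigma=2, d=0.25)

print(n_heart_rate, n_age, n_age_tight, n_birth_weight)
```

Halving d roughly quadruples n, because d enters the formula squared.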
But the population σ² is most of the time unknown
As a result, it has to be estimated from:
• Pilot or preliminary sample:
– Select a pilot sample and estimate σ² with the sample variance, s²
• Previous or similar studies
• Previous or similar studies
150
B. Sample size to estimate a single population proportion
$n = \frac{Z_{\alpha/2}^2\,p(1 - p)}{d^2}$
151
Cont.
1. Suppose that you are interested in knowing the proportion of infants who breastfed >18 months of age in a rural area. Suppose that in a similar area, the proportion (p) of breastfed infants was found to be 0.20. What sample size is required to estimate the true proportion within ±3 percentage points with 95% confidence? Let p = 0.20, d = 0.03, α = 5%.
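A minimal sketch for the breastfeeding example, assuming the usual single-proportion formula n = Z²p(1 - p)/d² and rounding up. The function name `sample_size_proportion` is illustrative.

```python
import math

def sample_size_proportion(p: float, d: float, z: float = 1.96) -> int:
    """Minimum n to estimate a proportion p within +/- d at the given Z."""
    return math.ceil(z ** 2 * p * (1.0 - p) / d ** 2)

n_breastfed = sample_size_proportion(p=0.20, d=0.03)
print(n_breastfed)

# With no prior estimate, p = 0.5 gives the largest (most conservative) n,
# because p * (1 - p) is maximized at 0.5.
n_conservative = sample_size_proportion(p=0.5, d=0.03)
print(n_conservative)
```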
152
Sample Size: Two Samples
A. Estimation of the difference between two population means
B. Estimation of the difference between two population proportions
153
A. Sample size for estimating a difference in
two means
154
B. Sample size for estimating a difference
in two proportions
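The formulas for these two slides did not survive in the text. The sketch below assumes the commonly used textbook forms for equal group sizes, n per group = Z²(σ1² + σ2²)/d² for means and n per group = Z²[p1(1 - p1) + p2(1 - p2)]/d² for proportions; these may differ from what the original slides showed. Function names and input values are hypothetical.

```python
import math

def n_per_group_two_means(sigma1, sigma2, d, z=1.96):
    """Per-group n to estimate a difference in means within +/- d (assumed formula)."""
    return math.ceil(z ** 2 * (sigma1 ** 2 + sigma2 ** 2) / d ** 2)

def n_per_group_two_proportions(p1, p2, d, z=1.96):
    """Per-group n to estimate a difference in proportions within +/- d (assumed formula)."""
    return math.ceil(z ** 2 * (p1 * (1.0 - p1) + p2 * (1.0 - p2)) / d ** 2)

# Hypothetical inputs chosen only for illustration.
n_means = n_per_group_two_means(sigma1=10, sigma2=10, d=5)
n_props = n_per_group_two_proportions(p1=0.2, p2=0.3, d=0.1)
print(n_means, n_props)
```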
155
Data check entry
• One of the first steps to proper data screening is to ensure the data are correct
– Check each person's entry individually
• Makes sense for a small data set or a proper data checking procedure
• Can be too costly, so…
– the range of the data should be checked
156
Data Screening
157
Normality
• All of the continuous data we are covering
need to follow a normal curve
• Skewness (univariate) – this represents the asymmetry of the distribution of the data
158
Cont.
• The skewness statistic ($S_{skewness}$) and its standard error ($SE_{skewness}$) are output by SPSS
$Z_{skewness} = \frac{S_{skewness}}{SE_{skewness}}$
$|Z_{skewness}| > 3.2$: violation of skewness assumption
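The skewness check can be sketched outside SPSS. This example assumes the adjusted Fisher-Pearson skewness statistic and the standard error formula sqrt(6n(n-1)/((n-2)(n+1)(n+3))), which are the forms generally documented for SPSS; treat both as assumptions. The function name and the sample data are hypothetical.

```python
import math

def skewness_z(values):
    """Return (skewness, SE of skewness, Z skewness) for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    # Adjusted Fisher-Pearson skewness (assumed to match the SPSS statistic).
    skew = n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in values)
    # Standard error of skewness (assumed SPSS formula); depends only on n.
    se = math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return skew, se, skew / se

# Hypothetical right-skewed data.
data = [1, 1, 2, 2, 3, 10]
skew, se, z = skewness_z(data)
violated = abs(z) > 3.2
print(f"skewness = {skew:.3f}, SE = {se:.3f}, Z = {z:.3f}, violation: {violated}")
```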
159
Cont.
• Kurtosis (univariate) – is how peaked the data are; the kurtosis statistic is output by SPSS
• Kurtosis standard error
$Z_{kurtosis} = \frac{S_{kurtosis}}{SE_{kurtosis}}$
$|Z_{kurtosis}| > 3.2$: violation of kurtosis assumption
– for most statistics the skewness assumption is more important than the kurtosis assumption
160
Outliers
• technically it is a data point outside of your distribution; so potentially detrimental because it may have undue effect on the distribution
161
Linearity
• relationships among variables are linear in
nature; assumption in most analyses
162
Homoscedasticity
• For grouped data this is the same as
homogeneity of variance
• For ungrouped data – variability for one variable is the same at all levels of another variable (no variance interaction)
163
Multicollinearity/Singularity
• If correlations between two variables are excessive
(e.g. 0.95) then this represents multicollinearity
• If correlation is 1 then you have singularity
• Often Multicollinearity/Singularity occurs in data because one variable is a near duplicate of another
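A screening step for excessive correlations can be sketched as follows. It assumes Pearson correlation and uses the 0.95 cutoff mentioned above as an example threshold. The function names and the small data set are hypothetical.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def flag_collinear_pairs(columns, threshold=0.95):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson_r(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged

# Hypothetical data: weight recorded twice in different units is a duplicate
# variable, so its correlation is 1 (singularity).
weight_kg = [60, 72, 55, 80, 68]
columns = {
    "weight_kg": weight_kg,
    "weight_lb": [w * 2.2046 for w in weight_kg],
    "height_cm": [170, 165, 172, 168, 175],
}
flagged = flag_collinear_pairs(columns)
for a, b, r in flagged:
    print(f"{a} vs {b}: r = {r:.3f}")
```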
164
