
Chapter 1

Introduction to Biostatistics

• Name: Jiregna Indalu
• E-mail: jiregnaindalu@gmail.com
• Mob.: 0926-987846

Introduction
• What is statistics?
• Statistics: A field of study concerned with:
– collection, organization, analysis, summarization
and interpretation of numerical data, and
– the drawing of inferences about a body of data
when only a small part of the data is observed.
• Statistics helps us use numbers to
communicate ideas.
• Statisticians try to interpret and communicate
the results to others.
Cont.
• Biostatistics: The application of statistical
methods to the fields of biological and medical
sciences.
• Concerned with the interpretation of biological data
and the communication of information derived
from these data.
• Has a central role in medical investigations.

Uses of biostatistics
• Provide methods of organizing information
• Assessment of health status
• Health program evaluation
• Resource allocation
• Magnitude of association
– Strong vs weak association between exposure
and outcome

Cont.
• Assessing risk factors
– Cause & effect relationship
• Evaluation of a new vaccine or drug
– What can be concluded if the proportion of people free
from the disease is greater among the vaccinated than
the unvaccinated?
– How effective is the vaccine (drug)?
– Is the effect due to chance or some bias?
• Drawing of inferences
– Information from sample to population

What does biostatistics cover?
The best way to learn about biostatistics is to follow the flow of
research from inception to the final publication:

1. Research planning
2. Design
3. Execution (data collection)
4. Data processing
5. Data analysis
6. Presentation
7. Interpretation
8. Publication

Variable
• A characteristic that takes on different values in
different persons, places, or things.
For example:
- heart rate,
- the heights of adult males,
- the weights of preschool children,
- the ages of patients seen in a dental clinic.

Types of variables

Quantitative variables
• Can be measured in the usual sense.
• For example:
– the heights of adult males,
– the weights of preschool children,
– the ages of patients seen in a dental clinic.

Qualitative variables
• Many characteristics are not capable of being measured. Some
of them can be ordered or ranked.
• For example:
– classification of people into socio-economic groups,
– social classes based on income, education, etc.

Types of variables &
scale of measurement

Quantitative variables (numerical)
– Interval
– Ratio

Qualitative variables (categorical)
– Nominal
– Ordinal

Types of Statistics
1. Descriptive statistics:

• Ways of organizing and summarizing data.
• Help to identify the general features and trends in a set
of data and to extract useful information.
• Also very important in conveying the final results of a
study.
• Example: tables, graphs, numerical summary measures

Cont.
2. Inferential statistics:

• Methods used for drawing conclusions about a
population based on the information obtained
from a sample of observations drawn from that
population.
• Example: principles of probability, estimation,
confidence intervals, comparison of two or more
means or proportions, hypothesis testing, etc.

Data
• Data are numbers which can be obtained by
measurement or counting

• The raw material for statistics

• Can be obtained from:
– Routinely kept records, literature
– Surveys
– Counting
– Experiments
– Reports
– Observation
– Etc.

Types of Data
1. Primary data: collected from the items or
individual respondents directly by the
researcher for the purpose of a study.

2. Secondary data: data that were collected, and possibly
statistically treated, by other people or organizations, and
that are now used by someone else for a different purpose.

Population and Sample

• Population:
– Refers to any collection of objects.

• Target population:
– A collection of items that have something in common
for which we wish to draw conclusions at a particular
time.
• E.g., All hospitals in Ethiopia.
– The whole group of interest.

Cont.
Study Population:

• The subset of the target population that has at least some chance of
being sampled.
• The specific population group from which samples are drawn and
data are collected.
Sample:
• A subset of a study population, about which information is actually
obtained.

• The individuals who are actually measured and comprise the actual
data.

Cont.
E.g.: In a study of the prevalence of HIV among adolescents in
Ethiopia, a random sample of adolescents in Ayder sub-city of
Mekelle was included.
• Target population: All adolescents in Ethiopia
• Study population: All adolescents in Mekelle
• Sample: Adolescents in Ayder sub-city who were included in
the study
[Figure: nested diagram with the sample inside the study population, inside the target population]

Generalizability
• Is a two-stage procedure:
• We need to be able to generalize from:
– the sample to the study population, &
– then from the study population to the target
population
• If the sample is not representative of the
population, the conclusions are restricted to
the sample & don’t have general applicability

Parameter and Statistic
• Parameter: A descriptive measure computed
from the data of a population.
– E.g., the mean (µ) age of the target population

• Statistic: A descriptive measure computed
from the data of a sample.
– E.g., sample mean age (x̄)

Sampling and Sampling
Distributions

Cont.
• Researchers often use sample survey methodology to
obtain information about a larger population by
selecting and measuring a sample from that population.

• Since the population is often too large to study in full, we rely
on the information collected from the sample.
• Inferences about the population are based on the
information from the sample drawn from that
population.

Cont.

• A sample is a collection of individuals selected
from a larger population.
• Sampling enables us to estimate the
characteristics of a population by directly
observing a portion of the population.

Cont.

[Figure: diagram relating the population, the sample, and the information obtained from the sample]

Steps needed to select a sample and ensure that
this sample will fulfill its goals.

1. Establish the study's objectives


– The first step in planning a useful and efficient survey is
to specify the objectives with as much detail as
possible.
– Without objectives, the survey is unlikely to generate
valuable results.
– Clarifying the aims of the survey is critical to its
ultimate success.
– The initial users and uses of the data should be
identified at this stage.

Cont.
2. Define the target population

– The target population is the total population for which the
information is required.
– Specifically, the target population is defined by the following
characteristics:
• Nature of data required
• Geographic location
• Reference period
• Other characteristics, such as socio-demographic characteristics

Cont.
3. Decide on the data to be collected
– The data requirements of the survey must be established.

– To ensure that the requirements are operationally sound, the necessary data
terms and definitions also need to be determined.

4. Set the level of precision

– There is a level of uncertainty associated with estimates coming
from a sample.
– The sample-to-sample variation is what causes the sampling error.
– Researchers can estimate the sampling error associated with a
particular sampling plan, and try to minimize it.

Cont.
5. Decide on the methods of measurement
– Choose the measuring instrument and the method of approach to the
population.
– Data about a person’s state of health may be obtained from
statements that he/she makes or from a medical examination.
– The survey may employ a self-administered questionnaire or an
interview.

6. Prepare the sampling frame
– A list of all members of the population
– The elements must not overlap
Sampling

• The process of selecting a portion of the
population to represent the entire population.
• A main concern in sampling:
– Ensure that the sample represents the population,
and
– The findings can be generalized.

Advantages of sampling:
• Feasibility: Sampling may be the only feasible method of
collecting information.

• Reduced cost: Sampling reduces demands on resources such as
finance, personnel, and material.
• Greater accuracy: Sampling may lead to better accuracy in
data collection.
• Sampling error: Precise allowance can be made for sampling error.
• Greater speed: Data can be collected and summarized more
quickly.

Disadvantages of sampling:
• There is always a sampling error.
• Sampling may create a feeling of
discrimination within the population.
• Sampling may be inadvisable where every unit
in the population is legally required to have a
record.

Errors in sampling
1) Sampling error: Error that arises because only a sample, rather
than the whole population, is observed.
– It cannot be avoided or totally eliminated.

2) Non-sampling error:
- Observational error
- Respondent error
- Lack of preciseness of definition
- Errors in editing and tabulation of data
Sampling Methods

Two broad divisions:

A. Probability sampling methods

B. Non-probability sampling methods

Probability sampling
• Involves random selection of a sample

• A sample is obtained in a way that ensures that every
member of the population has a known, nonzero
probability of being included in the sample.
• The method chosen depends on a number of factors,
such as
– the available sampling frame,
– how spread out the population is,
– how costly it is to survey members of the population

Most common probability
sampling methods

1. Simple random sampling
2. Systematic random sampling
3. Stratified random sampling
4. Cluster sampling
5. Multi-stage sampling

1. Simple random sampling
• Involves random selection
• Each member of a population has an equal
chance of being included in the sample.
• To use a SRS method:
– Make a numbered list of all the units in the
population
– Each unit should be numbered from 1 to N
(where N is the size of the population)
– Select the required number.

Cont.

• The randomness of the sample is ensured
by:
– use of “lottery” methods
– a table of random numbers

Example
• Suppose your school has 500 students and
you need to conduct a short survey on the
quality of the food served in the cafeteria.

• You decide that a sample of 10 students
should be sufficient for your purposes.
• In order to get your sample, you assign a
number from 1 to 500 to each student in your
school.

Cont.

• To select the sample, you use a table of
randomly generated numbers.
• Pick a starting point in the table (a row and
column number) and look at the random
numbers that appear there. In this case, since
the student numbers run to three digits, the random
numbers need to contain three digits as
well.

Cont.
• Ignore all random numbers greater than 500 because they do
not correspond to any of the students in the school.
• Remember that the sampling is without replacement, so
if a number recurs, skip over it and use the next random
number.
• The first 10 different numbers between 001 and 500
make up your sample.
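
The same selection can be sketched in code, with a pseudo-random number generator standing in for the printed table of random numbers. This is only an illustration: the function name `simple_random_sample` and the `seed` argument are assumptions introduced here, not part of the lecture.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Pick `sample_size` distinct unit numbers from 1..population_size.

    Every unit has the same chance of selection, and sampling is
    without replacement, so no number can appear twice.
    """
    rng = random.Random(seed)                  # seed only makes the run repeatable
    units = range(1, population_size + 1)      # the numbered list (sampling frame)
    return sorted(rng.sample(units, sample_size))

# Cafeteria survey: 10 students out of 500 (seed value is arbitrary).
sample = simple_random_sample(500, 10, seed=1)
print(sample)
```

Because `random.sample` never repeats a unit, the "skip a number if it recurs" rule is handled automatically.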

Cont.

• SRS has certain limitations:


– Requires a sampling frame.
– Difficult if the reference population is dispersed.
– Minority subgroups of interest may not be
selected.

2. Systematic random sampling
• Sometimes called interval sampling,
systematic sampling means that there is a gap,
or interval, between each selected unit in the
sample

• The selection is systematic rather than
random.

Cont.

• Useful if the reference population is
arranged in some order:
– Order of registration of patients
– Numerical order of house numbers
– Students’ registration books
• Individuals are taken at fixed intervals (every kth)
based on the sampling fraction; e.g., if the
sample includes 20% of the population, then every fifth individual is taken.
Steps in systematic random sampling
1. Number the units on your frame from 1 to N (where N is the total
population size).

2. Determine the sampling interval (K) by dividing the number of units in
the population by the desired sample size.

3. Select a number between one and K at random. This number is called
the random start and would be the first number included in your
sample.

4. Select every Kth unit after that first number.

Note: Systematic sampling should not be used when a cyclic repetition is
inherent in the sampling frame.

Example
• To select a sample of 100 from a population of 400,
you would need a sampling interval of 400 ÷ 100 = 4.

• Therefore, K = 4.

• You will need to select one unit out of every four units
to end up with a total of 100 units in your sample.

• Select a number between 1 and 4 from a table of
random numbers.

Cont.
• If you choose 3, the third unit on your frame
would be the first unit included in your
sample;

• The sample might consist of the following
units to make up a sample of 100: 3, 7, 11, 15,
19, ..., 395, 399 (up to N, which is 400 in this
case).

Cont.
• Using the above example, you can see that
with a systematic sample approach there are
only four possible samples that can be
selected, corresponding to the four possible
random starts:
A. 1, 5, 9, 13...393, 397
B. 2, 6, 10, 14...394, 398
C. 3, 7, 11, 15...395, 399
D. 4, 8, 12, 16...396, 400
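
The four steps can be sketched in a few lines of code. The function name `systematic_sample` and its `start` and `seed` arguments are assumptions made for this illustration, and the sketch assumes N is an exact multiple of the sample size, as in the example.

```python
import random

def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every K-th unit from a frame numbered 1..population_size."""
    k = population_size // sample_size              # sampling interval K
    if start is None:
        start = random.Random(seed).randint(1, k)   # random start between 1 and K
    # Take the random start, then every K-th unit after it.
    return list(range(start, population_size + 1, k))[:sample_size]

# Example from the slides: N = 400, n = 100, random start = 3.
sample_c = systematic_sample(400, 100, start=3)
print(sample_c[:5], "...", sample_c[-2:])   # [3, 7, 11, 15, 19] ... [395, 399]
```

Only the random start is left to chance, which is why only K different samples are possible.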
3. Stratified random sampling

• It is used when the population is known to be
heterogeneous with regard to some factors, and those factors
are used for stratification.
• Using stratified sampling, the population is divided into
homogeneous, mutually exclusive groups called strata.
• A population can be stratified by any variable that is available
for all units prior to sampling (e.g., age, sex, province of
residence, income, etc.).
• A separate sample is taken independently from each stratum.

Why do we need to create strata?

• Stratification can make the sampling strategy more efficient.
• A larger sample is required to get a more accurate
estimate if a characteristic varies greatly from one
unit to the other.
• For example, if every person in a population had the
same salary, then a sample of one individual would be
enough to get a precise estimate of the average
salary.

Cont.
• Equal allocation:
– Allocate equal sample size to each stratum
• Proportionate allocation:
$$n_j = \frac{n}{N}\, N_j, \qquad j = 1, 2, \ldots, k$$
where k is the number of strata and
– $n_j$ is the sample size of the jth stratum
– $N_j$ is the population size of the jth stratum
– $n = n_1 + n_2 + \cdots + n_k$ is the total sample size
– $N = N_1 + N_2 + \cdots + N_k$ is the total population size
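
As a small worked illustration of proportionate allocation, the sketch below applies the formula to three strata. The stratum sizes (500, 300, 200) and the total sample size of 100 are made-up numbers, and the function name is an assumption introduced here.

```python
def proportionate_allocation(stratum_sizes, total_sample_size):
    """Return n_j = (n / N) * N_j for each stratum j."""
    N = sum(stratum_sizes)                                  # total population size
    return [total_sample_size * N_j / N for N_j in stratum_sizes]

# Hypothetical strata of 500, 300 and 200 units; total sample n = 100.
allocation = proportionate_allocation([500, 300, 200], 100)
print(allocation)   # [50.0, 30.0, 20.0]
```

In practice the n_j are rarely whole numbers and have to be rounded, so the rounded stratum samples may not add up exactly to n.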
4. Cluster sampling
• Sometimes it is too expensive to spread a sample
across the population as a whole.

• Travel costs can become expensive if interviewers have
to survey people from one end of the country to the
other.
• To reduce costs, researchers may choose a cluster
sampling technique.
• The clusters should be similar to one another, unlike stratified
sampling, where the strata differ from one another.

Steps in cluster sampling
• Cluster sampling divides the population into groups or clusters.

• A number of clusters are selected randomly to represent the
total population, and then all units within selected clusters are
included in the sample.
• No units from non-selected clusters are included in the sample;
they are represented by those from selected clusters.
• This differs from stratified sampling, where some units are
selected from each group.

Example

• In a school-based study, we assume students
of the same school are homogeneous.
• We can randomly select sections and include
all students of the selected sections only.
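
The school example can be sketched as follows. The section names and student IDs are invented for illustration, and `cluster_sample` is a hypothetical helper, not a library function.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters and keep every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)        # pick cluster names
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units

# Hypothetical school with four sections (clusters) of students.
sections = {
    "Section A": ["A1", "A2", "A3"],
    "Section B": ["B1", "B2"],
    "Section C": ["C1", "C2", "C3", "C4"],
    "Section D": ["D1", "D2"],
}
chosen, students = cluster_sample(sections, n_clusters=2, seed=7)
print(chosen, students)
```

Because sections differ in size, the number of students in the final sample depends on which sections happen to be drawn.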

Cont.
• Sometimes a list of all units in the population is not available,
while a list of all clusters is either available or easy to create.

• In most cases, the main drawback is a loss of efficiency when
compared with SRS.
• It is usually better to survey a large number of small clusters
instead of a small number of large clusters.
– This is because neighboring units tend to be more alike, resulting in a
sample that does not represent the whole spectrum of opinions or
situations present in the overall population.
• Another drawback to cluster sampling is that you do not have total control
over the final sample size.
5. Multi-stage sampling
• Similar to cluster sampling.
• But it involves picking a sample from within each chosen cluster,
rather than including all units in the cluster.

• This type of sampling requires at least two stages.

• In the first stage, large groups or clusters are identified and selected.

• In the second stage, population units are picked from within the
selected clusters (using any of the possible probability sampling
methods) for a final sample.
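
A two-stage version can be sketched as below, using simple random sampling at both stages. The village and household labels are invented, and `two_stage_sample` is a hypothetical helper written only for this illustration.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: choose clusters at random. Stage 2: sample units within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)            # stage 1
    result = {}
    for name in chosen:                                          # stage 2
        units = clusters[name]
        result[name] = rng.sample(units, min(n_per_cluster, len(units)))
    return result

# Hypothetical frame: a list of villages, each with a list of households.
villages = {
    "Village 1": ["H1", "H2", "H3", "H4", "H5"],
    "Village 2": ["H6", "H7", "H8"],
    "Village 3": ["H9", "H10", "H11", "H12"],
}
picked = two_stage_sample(villages, n_clusters=2, n_per_cluster=2, seed=3)
print(picked)
```

Only the selected villages need a household list, which is the saving described above.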

Cont.
• If more than two stages are used, the process of choosing
population units within clusters continues until there is a final
sample.

• Also, you do not need to have a list of all of the units in the
population. All you need is a list of clusters and list of the units
in the selected clusters.

• Admittedly, more information is needed in this type of sample
than what is required in cluster sampling. However, multi-stage
sampling still saves a great amount of time and effort by not
having to create a list of all the units in a population.

B. Non-probability sampling
• The difference between probability and non-
probability sampling has to do with a basic
assumption about the nature of the population
under study.

• In probability sampling, every item has a known
chance of being selected.
• In non-probability sampling, there is an assumption
that there is an even distribution of a characteristic
of interest within the population.

Cont.

• In non-probability sampling, since elements
are chosen arbitrarily, there is no way to
estimate the probability of any one element
being included in the sample.
• Also, no assurance is given that each item has
a chance of being included, making it
impossible either to estimate sampling
variability or to identify possible bias.

Cont.
• Reliability cannot be measured in non-probability sampling;
the only way to address data quality is to compare some of
the survey results with available information about the
population.

• Still, there is no assurance that the estimates will meet an
acceptable level of error.
• Researchers are reluctant to use these methods because
there is no way to measure the precision of the resulting
sample.
Cont.

• Despite these drawbacks, non-probability
sampling methods can be useful when
descriptive comments about the sample itself
are desired.
• There are also other circumstances, such as
some research studies, in which it is infeasible or
impractical to conduct probability sampling.

The most common types of non-
probability sampling

1. Convenience or haphazard sampling
2. Volunteer sampling
3. Judgment sampling
4. Quota sampling
5. Snowball sampling technique

1. Convenience or haphazard sampling

• Convenience sampling is sometimes
referred to as haphazard or accidental
sampling.
• It is not normally representative of the
target population because sample units are
only selected if they can be accessed easily
and conveniently.

Cont.

• The obvious advantage is that the method is
easy to use, but that advantage is greatly
offset by the presence of bias.
• Although useful applications of the technique
are limited, it can deliver accurate results
when the population is homogeneous.

Cont.

• For example, a scientist could use this method to
determine whether a lake is polluted or not.
• Assuming that the lake water is well-mixed, any
sample would yield similar information.
• A scientist could safely draw water anywhere on the
lake without bothering about whether or not the
sample is representative.

2. Volunteer sampling
• As the term implies, this type of sampling occurs
when people volunteer to be involved in the study.

• In psychological experiments or pharmaceutical
trials (drug testing), for example, it would be
difficult and unethical to enlist random
participants from the general public.
• In these instances, the sample is taken from a
group of volunteers.

Cont.

• Sometimes, the researcher offers payment to
attract respondents.
• In exchange, the volunteers accept the
possibility of a lengthy, demanding or
sometimes unpleasant process.

Cont.
• Sampling voluntary participants as opposed to
the general population may introduce strong
biases.

• Often in opinion polling, only the people who
care strongly enough about the subject tend
to respond.
• The silent majority does not typically
respond, resulting in large selection bias.
3. Judgment sampling
• This approach is used when a sample is taken based
on certain judgments about the overall population.

• The underlying assumption is that the investigator
will select units that are characteristic of the
population.
• The critical issue here is objectivity: how much can
judgment be relied upon to arrive at a typical
sample?

Cont.
• Judgment sampling is subject to the
researcher's biases and is perhaps even more
biased than haphazard sampling.

• Since any preconceptions the researcher may
have are reflected in the sample, large biases
can be introduced if these preconceptions are
inaccurate.

Cont.
• Researchers often use this method in exploratory
studies like pre-testing of questionnaires and focus
groups.
• They also prefer to use this method in laboratory
settings where the choice of experimental subjects
(i.e., animal, human) reflects the investigator's pre-
existing beliefs about the population.
• One advantage of judgment sampling is the
reduced cost and time involved in acquiring the
sample.
4. Quota sampling
• This is one of the most common forms of
non-probability sampling.
• Sampling is done until a specific number of
units (quotas) for various sub-populations
have been selected.
• Since there are no rules as to how these
quotas are to be filled, quota sampling is
really a means for satisfying sample size
objectives for certain sub-populations.

Cont.

• As with all other non-probability sampling
methods, in order to make inferences about
the population, it is necessary to assume that
persons selected are similar to those not
selected.
• Such strong assumptions are rarely valid.

Cont.

• The main argument against quota sampling is
that it does not meet the basic requirement of
randomness.
• Some units may have no chance of selection
or the chance of selection may be unknown.
• Therefore, the sample may be biased.
• Quota sampling is generally less expensive
than random sampling.

Cont.
• It is also easy to administer, especially considering that
the tasks of listing the whole population, randomly
selecting the sample and following up on
non-respondents can be omitted from the procedure.
• Quota sampling is an effective sampling method
when information is urgently required, and it can be
carried out without sampling frames.
• In many cases where the population has no
suitable frame, quota sampling may be the only
appropriate sampling method.

5. Snowball sampling
• A technique for selecting a research sample
where existing study subjects recruit future
subjects from among their acquaintances.
• Thus the sample group appears to grow like
a rolling snowball.

Cont.
• This sampling technique is often used in hidden
populations which are difficult for researchers to
access; example populations would be drug users or
commercial sex workers.
• Because sample members are not selected from a
sampling frame, snowball samples are subject to
numerous biases. For example, people who have
many friends are more likely to be recruited into the
sample.

Estimation
• Up until this point, we have assumed that the values
of the parameters of a probability distribution are
known.

• In the real world, the values of these population
parameters are usually not known.
• Instead, we must try to say something about the way
in which a random variable is distributed using the
information contained in a sample of observations.
• The process of drawing conclusions about an entire
population based on the data in a sample is known as
statistical inference.
• Methods of inference usually fall into one of two broad
categories: estimation or hypothesis testing.
• For now, we will focus on using the observations in a
sample to estimate a population parameter.

Estimation
• It is concerned with estimating the values of
specific population parameters based on
sample statistics.
• It is about using information in a sample to
make estimates of the characteristics
(parameters) of the source population.

Estimation, Estimator & Estimate

♣ Estimation is the computation of a statistic from sample
data, often yielding a value that is an approximation (guess)
of its target, an unknown true population parameter value.
♣ The statistic itself is called an estimator and can be of two
types: point estimator or interval estimator.
♣ The value or values that the estimator assumes are called
estimates.

Point versus Interval Estimators
• Point estimation involves the calculation of a single
number to estimate the population parameter

• Interval estimation specifies a range of reasonable
values for the parameter.
• Thus,
– a point estimate is of the form: [ value ],
– whereas an interval estimate is of the form:
[ lower limit, upper limit ]

1. Point Estimate
• A single numerical value used to estimate
the corresponding population parameter.
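Sample statistics are estimators of population parameters:

Sample statistic (estimator)             Population parameter
Sample mean, $\bar{x}$                   $\mu$
Sample variance, $S^2$                   $\sigma^2$
Sample proportion, $p$                   $P$ or $\pi$
Sample odds ratio, $\widehat{OR}$        $OR$
Sample relative risk, $\widehat{RR}$     $RR$
Sample correlation coefficient, $r$      $\rho$
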
2. Interval Estimation
• Interval estimation specifies a range of reasonable values for the population
parameter based on a point estimate.

• A confidence interval (CI) is a particular type of interval estimator.
• CIs give a plausible range of values likely to include the
“true” (population) value with a given confidence level.
• CIs also give information about the precision of an estimate.
• Wider CIs indicate less certainty.
• CIs can also answer the question of whether or not an association exists.

Confidence Level
• Confidence Level
– The confidence with which the interval will contain
the unknown population parameter.

• $P(L \le \text{parameter} \le U) = 1 - \alpha$, where L and U are the lower and upper limits of the interval

Estimation for Single Population

1. CI for a Single Population Mean
(normally distributed)
A. Known variance (large sample size)

• There are 3 elements to a CI:
1. Point estimate
2. SE of the point estimate
3. Confidence coefficient
• Consider the task of computing a CI estimate of μ for
a population distribution that is normal with σ
known.
• Available are data from a random sample of size n.

Cont.
Assumptions
• Population standard deviation (σ) is known
• Population is normally distributed

• A 100(1−α)% CI for μ is:
$$\bar{x} \pm z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$
• α is chosen by the researcher; the most common values of α are 0.05,
0.01 and 0.1.

Margin of Error
(Precision of the estimate)

Cont.
• As n increases, the length of the CI decreases.
• As σ increases, the length of the CI increases.
• As the confidence level increases (α
decreases), the length of the CI increases.
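
These three statements can be read off the margin of error. The expression below is the standard one for a mean with σ known; it is restated here as a reminder and is not copied from the slide, whose formula did not survive extraction.

```latex
E = z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}, \qquad
\text{CI} = \bar{x} \pm E, \qquad
\text{length of CI} = 2E
```

E shrinks as n grows, and grows with σ and with the confidence coefficient $z_{1-\alpha/2}$.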

Example:
1. Waiting times (in hours) at a particular hospital are
believed to be approximately normally distributed
with a variance of 2.25 hr².

a. A sample of 20 outpatients revealed a mean waiting
time of 1.52 hours. Construct the 95% CI for the
estimate of the population mean.

b. Suppose that the mean of 1.52 hours had resulted
from a sample of 32 patients. Find the 95% CI.

c. What effect does a larger sample size have on the CI?

Solution:

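a. $1.52 \pm 1.96\sqrt{\dfrac{2.25}{20}} \approx 1.52 \pm 1.96(0.33) \approx 1.52 \pm 0.65 = (0.87, 2.17)$

• We are 95% confident that the true mean waiting
time is between 0.87 and 2.17 hrs.
• An incorrect interpretation is that there is a 95%
probability that this interval contains the true
population mean. The population mean is fixed, so a given
interval either contains it or does not; the 95% refers to the
proportion of such intervals, over repeated samples, that contain it.
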
Cont.
b. $1.52 \pm 1.96\sqrt{\dfrac{2.25}{32}} \approx 1.52 \pm 1.96(0.27) \approx 1.52 \pm 0.53 = (0.99, 2.05)$
c. A larger sample size makes the CI
narrower (more precise).
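
The two intervals can be checked with a short script. The function name `z_confidence_interval` is an assumption made for this sketch, and 1.96 is the usual 95% confidence coefficient.

```python
import math

def z_confidence_interval(mean, variance, n, z=1.96):
    """CI for a population mean when the population variance is known."""
    se = math.sqrt(variance / n)      # standard error of the sample mean
    margin = z * se                   # margin of error
    return mean - margin, mean + margin

lo_20, hi_20 = z_confidence_interval(1.52, 2.25, 20)
lo_32, hi_32 = z_confidence_interval(1.52, 2.25, 32)
print(round(lo_20, 2), round(hi_20, 2))   # 0.86 2.18
print(round(lo_32, 2), round(hi_32, 2))   # 1.0 2.04
```

The results differ slightly from the hand calculation in the second decimal place because the slides round the standard error to two decimals before multiplying.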

Cont.
B. Unknown variance (small sample size, n ≤ 30)
• What if σ for the underlying population is
unknown and the sample size is small?
• As an alternative we use Student’s t
distribution.

Example

• Standard error =
• t-value at 90% CL at 19 df =1.729

Exercise
• Compute a 95% CI for the mean birth
weight based on n = 10, sample mean =
116.9 oz and s = 21.70 oz.

• From the t table, $t_{9,\,0.975} = 2.262$

• Ans: (101.4, 132.4)
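
The answer can be verified with a few lines of code. The critical value 2.262 is taken from the exercise rather than computed, and the function name is an assumption made for this sketch.

```python
import math

def t_confidence_interval(mean, sd, n, t_crit):
    """CI for a mean when sigma is unknown; t_crit comes from a t table (n - 1 df)."""
    margin = t_crit * sd / math.sqrt(n)   # t value times the estimated standard error
    return mean - margin, mean + margin

lo, hi = t_confidence_interval(116.9, 21.70, 10, 2.262)
print(round(lo, 1), round(hi, 1))   # 101.4 132.4
```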

2. CIs for single population proportion, p

• Is based on the three elements of a CI:
– Point estimate
– SE of point estimate
– Confidence coefficient

Example 1
• A random sample of 100 people shows that 25
are left-handed. Form a 95% CI for the true
proportion of left-handers.

Interpretation

Example 2
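• Suppose that among 10,000 female operating-room nurses,
60 women have developed breast cancer over five years. Find
the 95% CI for p based on the point estimate.
• Point estimate = 60/10,000 = 0.006
• The 95% CI for p is given by the interval:
$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.006 \pm 1.96\sqrt{\frac{0.006(0.994)}{10{,}000}}$$
• The 95% CI for p is: (0.0045, 0.0075)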
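
Both proportion examples can be worked with the usual large-sample (normal approximation) interval, p̂ ± z·√(p̂(1 − p̂)/n). The sketch below assumes that is the interval intended; the function name `proportion_ci` is introduced here for illustration.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Large-sample (normal approximation) CI for a population proportion."""
    p_hat = successes / n                        # point estimate
    se = math.sqrt(p_hat * (1 - p_hat) / n)      # SE of the point estimate
    return p_hat - z * se, p_hat + z * se

# Example 1: 25 left-handers in a sample of 100.
lo1, hi1 = proportion_ci(25, 100)
# Example 2: 60 breast cancer cases among 10,000 nurses.
lo2, hi2 = proportion_ci(60, 10_000)
print(round(lo1, 3), round(hi1, 3))
print(round(lo2, 4), round(hi2, 4))
```

The approximation is usually considered reasonable when both n·p̂ and n·(1 − p̂) are not too small, which holds in both examples.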

Hypothesis Testing

• The purpose of hypothesis testing is to aid
the researcher in reaching a decision
(conclusion) concerning a population by
examining a sample from that population.
Hypothesis
• Is a statement about one or more populations
• Is a claim (assumption) about a population
parameter

Examples of Research Hypotheses
Population Mean
• The average length of stay of patients
admitted to the hospital is five days

• The mean birth weight of babies delivered by
mothers with low SES is lower than that of babies
delivered by mothers with higher SES.
• Etc.

Types of Hypothesis
1. The Null Hypothesis, H0

• Is a statement claiming that there is no difference between the
hypothesized value and the population value.
• (The effect of interest is zero = no difference)
• States the assumption (hypothesis) to be tested
• Ho is always about a population parameter, not about a sample statistic
• Begin with the assumption that Ho is true
– Similar to the notion of innocent until proven guilty

Cont.
2. The Alternative Hypothesis, HA

• Is a statement of what we will believe is true if our
sample data cause us to reject Ho.
• Is generally the hypothesis that is believed (or needs
to be supported) by the researcher.
• Is a statement that disagrees with (opposes) Ho
(the effect of interest is not zero).

Steps in Hypothesis Testing
1. Formulate the appropriate statistical hypotheses
clearly
• Specify Ho and HA

Ho: μ = μo      Ho: μ ≤ μo      Ho: μ ≥ μo
HA: μ ≠ μo      HA: μ > μo      HA: μ < μo
two-tailed      one-tailed      one-tailed

2. State the assumptions necessary for computing
probabilities
• The distribution is approximately normal (Gaussian)
• Variance is known or unknown

Cont.
3. Select a sample and collect data
• Categorical, continuous

4. Decide on the appropriate test statistic for


the hypothesis. E.g., One population

OR

Cont.
5. Specify the desired level of significance for
the statistical test (α = 0.05, 0.01, etc.)
6. Determine the critical value.
– A value the test statistic must attain to be
declared significant.
– E.g., for a z test at α = 0.05: ±1.96 (two-tailed); 1.645 or −1.645 (one-tailed)

7. Obtain sample evidence and compute the
test statistic
8. Reach a decision and draw the conclusion
• If Ho is rejected, we conclude that HA is true
(or accepted).
• If Ho is not rejected, we conclude that Ho may
be true.

Rules for Stating Statistical Hypotheses

1. One population
• Indication of equality (either =, ≤ or ≥) must
appear in Ho.
Ho: μ = μo, HA: μ ≠ μo
Ho: P = Po, HA: P ≠ Po
• Can we conclude that a certain population mean
is
– not 50?
Ho: μ = 50 and HA: μ ≠ 50
– greater than 50?
Ho: μ ≤ 50 HA: μ > 50
Cont.

• Can we conclude that the proportion of
patients with leukemia who survive more
than six years is not 60%?
Ho: P = 0.6 HA: P ≠ 0.6

Statistical Decision
• Reject Ho if the value of the test statistic
that we compute from our sample is one of
the values in the rejection region

• Don’t reject Ho if the computed value of the test statistic is one of the values in the non-rejection region.

112
Another way to state conclusion

• Reject Ho if P-value < α

• Do not reject Ho if P-value ≥ α

• The P-value is the probability of obtaining a test statistic as extreme as or more extreme than the actual test statistic obtained, if the Ho is true

• The larger the absolute value of the test statistic, the smaller the P-value. OR, the smaller the P-value, the stronger the evidence against the Ho.

113
Types of Errors in Hypothesis Tests

• Whenever we reject or fail to reject the Ho, we may commit an error.

• Two types of errors can be committed.

– Type I Error
– Type II Error

114
Type I Error
• The probability of a type I error is the
probability of rejecting the Ho when it is true

• The probability of type I error is α

• Called level of significance of the test

• Set by researcher in advance

115
Type II Error
• The error committed when a false Ho is not
rejected

• The probability of Type II Error is β

• Usually unknown but larger than α

116
Cont.

Action (Conclusion)    Reality: Ho True               Reality: Ho False

Do not reject Ho       Correct action                 Type II error
                       (Prob. = 1-α)                  (Prob. = β = 1-Power)

Reject Ho              Type I error                   Correct action
                       (Prob. = α = Sign. level)      (Prob. = Power = 1-β)

117
Type I & II Error Relationship
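The trade-off between the two error types can be illustrated numerically. The sketch below computes the power (1-β) of a two-tailed one-sample Z test. The function name `z_test_power` and the assumed true mean of 27 are hypothetical, chosen only for illustration; the other numbers are borrowed from the known-variance example in this chapter.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a two-tailed one-sample Z test when the true mean is mu1."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Under the true mean mu1, Z is normal with this mean and sd 1
    shift = (mu1 - mu0) / (sigma / sqrt(n))
    # Probability that Z lands in the lower or the upper rejection region
    return std_normal.cdf(-z_crit - shift) + 1 - std_normal.cdf(z_crit - shift)


sigma = sqrt(20)  # known variance of 20
for alpha in (0.05, 0.01):
    power = z_test_power(mu0=30, mu1=27, sigma=sigma, n=10, alpha=alpha)
    print(f"alpha = {alpha}: power = {power:.3f}, beta = {1 - power:.3f}")
```

For a fixed sample size, lowering α from 0.05 to 0.01 lowers the power and therefore raises β. When the true mean equals the hypothesized mean, the function returns α itself, which is the Type I error probability.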

118
Hypothesis Testing of a Single Mean
(Normally Distributed)

119
Known Variance

120
Example: Two-Tailed Test
1. A simple random sample of 10 people from a certain
population has a mean age of 27. Can we conclude that
the mean age of the population is not 30? The variance is
known to be 20. Let α = 0.05.

• Answer: "Yes we can, if we can reject the Ho that it is 30."

A. Data
n = 10, sample mean = 27, σ² = 20, α = 0.05
B. Assumptions
Simple random sample
Normally distributed population

121
Cont.
C. Hypotheses
Ho: µ = 30
HA: µ ≠ 30
D. Test statistic
As the population variance is known, we use Z
as the test statistic.

122
Cont.
E. Decision Rule
• Reject Ho if the Z value falls in the rejection region.
• Don’t reject Ho if the Z value falls in the non-rejection region.
• Because of the structure of Ho, it is a two-tailed test. Therefore,
reject Ho if Z ≤ -1.96 or Z ≥ 1.96.

123
F. Calculation of test statistic
$Z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \dfrac{27 - 30}{\sqrt{20/10}} = \dfrac{-3}{1.4142} = -2.12$

G. Statistical decision
We reject the Ho because Z = -2.12 is in the rejection
region. The value is significant at the 5% level (α = 0.05).
H. Conclusion
We conclude that µ is not 30. P-value = 0.0340

A Z value of -2.12 corresponds to a tail area of 0.0170. Since there are two
parts to the rejection region in a two-tailed test, the P-value is twice this,
which is 0.0340.
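The calculation in steps F to H can be checked with a short script. This is a sketch using the Python standard library; the function `one_sample_z_test` and its `tail` argument are hypothetical names introduced for this example.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z_test(sample_mean, mu0, variance, n, tail="two"):
    """Z statistic and P-value for a single mean with known variance."""
    standard_error = sqrt(variance / n)
    z = (sample_mean - mu0) / standard_error
    std_normal = NormalDist()
    if tail == "two":
        p_value = 2 * (1 - std_normal.cdf(abs(z)))  # both tails
    elif tail == "lower":
        p_value = std_normal.cdf(z)                 # left tail only
    elif tail == "upper":
        p_value = 1 - std_normal.cdf(z)             # right tail only
    else:
        raise ValueError("tail must be 'two', 'lower' or 'upper'")
    return z, p_value


z, p = one_sample_z_test(sample_mean=27, mu0=30, variance=20, n=10)
print(f"Z = {z:.2f}, two-tailed P-value = {p:.4f}")
print("Reject Ho" if p < 0.05 else "Do not reject Ho")
```

Because the script does not round Z to -2.12 before looking up the area, its P-value can differ from the table-based 0.0340 in the fourth decimal place. Passing `tail="lower"` gives the one-tailed version of the same test.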

124
Hypothesis test using
confidence interval
• A problem like the above example can also be solved
using a confidence interval.

• The confidence interval will show that the hypothesized value (µ = 30) does not fall within the boundaries of the interval, so Ho is rejected. However, it will not give a P-value.
• Confidence interval
$\bar{x} \pm Z_{\alpha/2}\,\sigma/\sqrt{n} = 27 \pm 1.96(1.4142) = 27 \pm 2.77 = (24.23,\ 29.77)$
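The confidence-interval approach for the age example (n = 10, sample mean 27, variance 20) can be sketched as follows. The function name `z_confidence_interval` is a hypothetical name used for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(sample_mean, variance, n, confidence=0.95):
    """Confidence interval for a mean when the population variance is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(variance / n)  # margin of error
    return sample_mean - margin, sample_mean + margin


lower, upper = z_confidence_interval(sample_mean=27, variance=20, n=10)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")

mu0 = 30  # hypothesized mean
if lower <= mu0 <= upper:
    print("mu0 is inside the interval: do not reject Ho")
else:
    print("mu0 is outside the interval: reject Ho")
```

The hypothesized mean of 30 lies outside the interval, which agrees with the decision reached by the two-tailed Z test at α = 0.05.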

125
Example: One -Tailed Test
• A simple random sample of 10 people from a certain
population has a mean age of 27.  Can we conclude that the
mean age of the population is less than 30?  The variance is
known to be 20.  Let α = 0.05.
• Data
n = 10, sample mean = 27, σ² = 20, α = 0.05
• Hypotheses
Ho: µ ≥ 30, HA: µ < 30

126
• Test statistic
$Z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} = \dfrac{27 - 30}{\sqrt{20/10}} = -2.12$

• Rejection Region

Lower tail test

• With α = 0.05 and the inequality, we have the entire rejection region at
the left. The critical value will be Z = -1.645.  Reject Ho if Z < -1.645.

127
Cont.

• Statistical decision
– We reject the Ho because -2.12 < -1.645.

• Conclusion
– We conclude that µ < 30.
– P-value = 0.0170 this time, because it is a one-tailed test rather than a two-tailed test.

128
Unknown Variance

• In most practical applications the standard deviation of the underlying population is not known
• In this case, σ can be estimated by the
sample standard deviation s.
• If the underlying population is normally
distributed, then the test statistic is:
$t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$, with n - 1 degrees of freedom

129
Example: Two-Tailed Test
• A simple random sample of 14 people from a certain population gives
a sample mean body mass index (BMI) of 30.5 and sd of 10.64. Can we
conclude that the mean BMI is not 35 at α = 5%?

• Ho: µ = 35, HA: µ ≠ 35

• Test statistic
$t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} = \dfrac{30.5 - 35}{10.64/\sqrt{14}} = \dfrac{-4.5}{2.8437} = -1.58$

• If the assumptions are correct and Ho is true, the test statistic follows
Student's t distribution with 13 degrees of freedom.

130
Cont.
• Decision rule
– We have a two tailed test.  With α = 0.05 it means that each tail is 0.025. The
critical t values with 13 df are -2.1604 and 2.1604.
– We reject Ho if the t ≤ -2.1604 or t ≥ 2.1604.

• Do not reject Ho because -1.58 is not in the rejection region. Based on the data of the sample, it is possible that µ = 35. P-value = 0.1375
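The BMI example can be verified numerically. The Python standard library has no t distribution, so this sketch integrates the t density with Simpson's rule to get the two-sided P-value; a statistics package would normally be used instead. The helper names `t_pdf` and `two_sided_t_p_value` are hypothetical.

```python
from math import gamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_t_p_value(t, df, steps=2000):
    """Two-sided P-value: 1 minus the area between -|t| and |t|.

    The area from 0 to |t| is found by Simpson's rule (steps must be even).
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total  # the density is symmetric about 0
    return 1 - central_area


n, sample_mean, sd, mu0 = 14, 30.5, 10.64, 35
t_stat = (sample_mean - mu0) / (sd / sqrt(n))
p_value = two_sided_t_p_value(t_stat, df=n - 1)
print(f"t = {t_stat:.2f}, df = {n - 1}, two-sided P-value = {p_value:.4f}")
print("Reject Ho" if p_value < 0.05 else "Do not reject Ho")
```

As a sanity check, feeding the critical value 2.1604 with 13 df into `two_sided_t_p_value` should return approximately 0.05.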

131
Sampling from a population that is not
normally distributed
• Here, we do not know if the population displays a
normal distribution.

• However, with a large sample size, we know from the Central Limit Theorem that the sampling distribution of the sample mean is approximately normal.

132
Cont.
• With a large sample, we can use Z as the test
statistic calculated using the sample sd:
$Z = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$

133
Hypothesis Tests for Proportions
• Involves categorical values

• Two possible outcomes

– “Success” (possesses a certain characteristic)
– “Failure” (does not possess that characteristic)

• Fraction or proportion of population in the “success” category is denoted by p

134
Proportions

135
Hypothesis Testing about a Single Population
Proportion
(Normal Approximation to Binomial Distribution)

136
137
Example
• We are interested in the probability of developing asthma
over a given one-year period for children 0 to 4 years of age
whose mothers smoke in the home. In the general
population of 0 to 4-year-olds, the annual incidence of
asthma is 1.4%. If 10 cases of asthma are observed over a
single year in a sample of 500 children whose mothers
smoke, can we conclude that this is different from the
underlying probability of p0 = 0.014? α = 5%

H0 : p = 0.014
HA: p ≠ 0.014

138
Cont.
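• The test statistic is given by:
$Z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}} = \dfrac{0.02 - 0.014}{\sqrt{0.014(0.986)/500}} = \dfrac{0.006}{0.00525} = 1.14$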

139
Cont.
• The critical value of Zα/2 at α=5% is ±1.96.

• Don’t reject Ho since Z (= 1.14) is in the non-rejection region between ±1.96.

• P-value ≈ 0.254

• We do not have sufficient evidence to conclude that the probability of developing asthma for children whose mothers smoke in the home is different from the probability in the general population
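The asthma example can be reproduced with a few lines of code. This is a sketch of the normal-approximation test for a single proportion; the function name `one_sample_proportion_z_test` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_proportion_z_test(cases, n, p0):
    """Two-tailed Z test for one proportion (normal approximation)."""
    p_hat = cases / n                         # observed sample proportion
    standard_error = sqrt(p0 * (1 - p0) / n)  # computed under Ho
    z = (p_hat - p0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_hat, z, p_value


p_hat, z, p_value = one_sample_proportion_z_test(cases=10, n=500, p0=0.014)
print(f"p-hat = {p_hat:.3f}, Z = {z:.2f}, P-value = {p_value:.3f}")
print("Reject Ho" if p_value < 0.05 else "Do not reject Ho")
```

The script gives Z of about 1.14 and a two-tailed P-value of about 0.25, so Ho is not rejected at α = 5%.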

140
Sample size determination
• Sample Size: The number of study subjects selected to
represent a given study population.

• In estimating a certain characteristic of a population, sample size calculations are important to ensure that estimates are obtained with the required precision or confidence

• Should be sufficient to represent the characteristics of interest of the study population.
141
Cont.

Sample size determination depends on the:


– Objective of the study
– Design of the study
• Descriptive/Analytic
– Accuracy of the measurements to be made
– Degree of precision required for generalization
– Plan for statistical analysis
– Degree of confidence with which to conclude

142
Cont.

• The feasible sample size is also determined by the availability of resources:
– time
– manpower
– transport
– available facility, and
– money

143
Sample size for single sample

A. Sample size for estimating a single population mean

B. Sample size to estimate a single population proportion

144
A. Sample size for estimating a single
population mean

$n = \dfrac{(Z_{\alpha/2})^2\,\sigma^2}{d^2}$

• where d = Margin of error
= Absolute precision
= Half of the width (w) of CI
(d is written as e in some textbooks)

145
Examples:
1. Find the minimum sample size needed to
estimate the drop in heart rate (µ) for a new
study using a higher dose of propranolol than the
standard one. We require that the two-sided 95%
CI for µ be no wider than 5 beats per minute and
the sample sd for change in heart rate equals 10
beats per minute.
$n = \dfrac{(1.96)^2 (10)^2}{(2.5)^2} = 61.5 \approx 62$ patients

146
2. Suppose that for a certain group of cancer patients, we
are interested in estimating the mean age at diagnosis.
We would like a 95% CI that is 5 years wide. If the population
SD is 12 years, how large should our sample be?
With d = 2.5: $n = \dfrac{(1.96)^2 (12)^2}{(2.5)^2} = 88.5 \approx 89$ patients

147
Cont.

• Suppose d = 1
• Then the sample size increases: $n = \dfrac{(1.96)^2 (12)^2}{(1)^2} = 553.2 \approx 554$

148
Cont.

3. A hospital director wishes to estimate the mean weight of babies born in the hospital. How large a sample of birth records should be taken if she/he wants a 95% CI that is 0.5 wide? Assume that a reasonable estimate of σ is 2. Ans: 246 birth records.
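The three examples above all use the same formula, n = (Z σ / d)², rounded up to the next whole subject. The sketch below applies it to each case, taking d as half of the requested CI width. The function name `sample_size_for_mean` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_mean(sigma, d, confidence=0.95):
    """Minimum n so that the CI for a mean has margin of error d."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / d) ** 2)  # always round up


# Example 1: heart rate, sd = 10, CI width 5 so d = 2.5
print(sample_size_for_mean(sigma=10, d=2.5))
# Example 2: age at diagnosis, SD = 12, CI width 5 so d = 2.5
print(sample_size_for_mean(sigma=12, d=2.5))
# Example 2 with a tighter margin of error, d = 1
print(sample_size_for_mean(sigma=12, d=1))
# Example 3: birth weight, sigma = 2, CI width 0.5 so d = 0.25
print(sample_size_for_mean(sigma=2, d=0.25))
```

Halving d roughly quadruples the required sample size, because d enters the formula squared.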

149
But the population σ² is most of
the time unknown
As a result, it has to be estimated from:
• Pilot or preliminary sample:
– Select a pilot sample and estimate σ² with
the sample variance, s²
• Previous or similar studies

150
B. Sample size to estimate a single
population proportion

$n = \dfrac{(Z_{\alpha/2})^2\,p(1 - p)}{d^2}$

151
Cont.

1. Suppose that you are interested to know the proportion of infants who are breastfed beyond 18 months of age in a rural area. Suppose that in a similar area, the proportion (p) of breastfed infants was found to be 0.20. What sample size is required to estimate the true proportion within ±3 percentage points with 95% confidence? Let p = 0.20, d = 0.03, α = 5%
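A sketch of the calculation for this example, assuming the usual formula for a single proportion, n = Z² p(1 - p) / d², rounded up. The function name `sample_size_for_proportion` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(p, d, confidence=0.95):
    """Minimum n to estimate a proportion p within margin of error d."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / d ** 2)  # always round up


n = sample_size_for_proportion(p=0.20, d=0.03)
print(n)  # number of infants required
```

With p = 0.20 and d = 0.03 the formula gives about 683 infants. Using p = 0.5 gives the largest possible n for a given d, since p(1 - p) is greatest at 0.5.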

152
Sample Size: Two Samples

A. Estimation of the difference between two


population means

B. Estimation of the difference between two


population proportions

153
A. Sample size for estimating a difference in
two means

154
B. Sample size for estimating a difference
in two proportions

155
Data entry check

• One of the first steps in proper data screening is to ensure that the data are correct

– Check each person’s entry individually

• Makes sense for a small data set or when a proper data-checking procedure is in place

• Can be too costly, so…

– The range of each variable should be checked
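A range check of this kind is easy to automate. The sketch below flags values that are missing or fall outside a plausible range. The records, variable names and limits are all hypothetical and would need to be set for the actual study.

```python
def find_out_of_range(records, valid_ranges):
    """Return (row index, variable, value) for each missing or implausible value."""
    problems = []
    for i, record in enumerate(records):
        for variable, (low, high) in valid_ranges.items():
            value = record.get(variable)
            if value is None or not (low <= value <= high):
                problems.append((i, variable, value))
    return problems


# Hypothetical data set: the second row has a likely typing error in age,
# the third row has a missing heart rate.
records = [
    {"id": 1, "age": 34, "heart_rate": 72},
    {"id": 2, "age": 210, "heart_rate": 80},
    {"id": 3, "age": 5, "heart_rate": None},
]
valid_ranges = {"age": (0, 120), "heart_rate": (30, 220)}

problems = find_out_of_range(records, valid_ranges)
for row, variable, value in problems:
    print(f"row {row}: {variable} = {value}")
```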


156
Data Screening

157
Normality
• All of the continuous data we are covering
need to follow a normal curve

• Skewness (univariate) – this represents the asymmetry of the distribution (a longer tail on one side than on the other)

158
Cont.
• The skewness statistic (S) and its standard error (SE) are output by SPSS

$Z_{skewness} = \dfrac{S_{skewness}}{SE_{skewness}}$

$|Z_{skewness}| > 3.2$: violation of skewness assumption

159
Cont.
• Kurtosis (univariate) – is how peaked the data are; the kurtosis statistic and its standard error are output by SPSS

$Z_{kurtosis} = \dfrac{S_{kurtosis}}{SE_{kurtosis}}$

$|Z_{kurtosis}| > 3.2$: violation of kurtosis assumption

– for most statistics the skewness assumption is more important than the kurtosis assumption
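The Z checks for skewness and kurtosis can be sketched without SPSS. The adjusted sample formulas and standard errors below are the ones commonly used by statistical packages; that they match SPSS output exactly is an assumption here. The function name `shape_z_scores` and the sample data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def shape_z_scores(data):
    """Sample skewness, excess kurtosis, their standard errors and Z values."""
    n = len(data)
    if n < 4:
        raise ValueError("need at least 4 observations")
    m, s = mean(data), stdev(data)
    z = [(x - m) / s for x in data]  # standardized values
    skew = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
    kurt = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))
    se_skew = sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return {
        "skewness": skew, "se_skewness": se_skew, "z_skewness": skew / se_skew,
        "kurtosis": kurt, "se_kurtosis": se_kurt, "z_kurtosis": kurt / se_kurt,
    }


# Hypothetical right-skewed sample (one very large value)
sample = [2, 3, 3, 4, 4, 4, 5, 5, 6, 30]
result = shape_z_scores(sample)
for name in ("z_skewness", "z_kurtosis"):
    flag = "violation" if abs(result[name]) > 3.2 else "ok"
    print(f"{name} = {result[name]:.2f} ({flag})")
```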

160
Outliers

• Technically, an outlier is a data point outside of your distribution; it is potentially detrimental because it may have an undue effect on the distribution

161
Linearity
• relationships among variables are linear in
nature; assumption in most analyses

162
Homoscedasticity
• For grouped data this is the same as
homogeneity of variance

• For ungrouped data – variability for one variable is the same at all levels of another variable (no variance interaction)

163
Multicollinearity/Singularity
• If correlations between two variables are excessive
(e.g. 0.95) then this represents multicollinearity

• If correlation is 1 then you have singularity

• Often Multicollinearity/Singularity occurs in data because one variable is a near duplicate of another
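A simple screen for multicollinearity is to compute the pairwise correlations and flag any pair at or above a cut-off. The sketch below uses the 0.95 example threshold from this slide. The variable names and values are hypothetical, with weight recorded twice in different units to mimic a near-duplicate variable.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def flag_collinear_pairs(variables, threshold=0.95):
    """Return (name1, name2, r) for every pair with |r| >= threshold."""
    names = list(variables)
    flagged = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson_r(variables[first], variables[second])
            if abs(r) >= threshold:
                flagged.append((first, second, round(r, 3)))
    return flagged


# Hypothetical data: weight_lb is a rounded conversion of weight_kg
variables = {
    "weight_kg": [60, 72, 55, 80, 68],
    "weight_lb": [132, 158, 121, 176, 150],
    "age": [34, 25, 41, 30, 52],
}
flagged = flag_collinear_pairs(variables)
print(flagged)
```

Only the two weight variables are flagged; one of them would normally be dropped before analysis.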

164
