
STAT 314: BIOSTATISTICS

Course Outline

1. Elementary concept of random sampling, stratified sampling, systematic sampling and cluster sampling.

2. Combinatory mathematics (definitions for factorial, combinations and permutations)

3. Statistical hypothesis: definition; types of statistical hypotheses (null and alternative); types of statistical errors (Type I and Type II)

4. Test of Goodness of Fit

5. Relationship test (chi-square test)

6. Association Test (Analysis of Variance: ANOVA)

7. Non-Parametric Tests

8. Statistical software (R, SPSS, SAS, STATA): one of them

Reference Books

1. Ken Black. Business Statistics for Contemporary Decision Making. Wiley.

2. H. J. Larson. Introduction to Probability Theory and Statistical Inference. 3rd ed. Wiley, 1982.

3. W. G. Cochran. Sampling Techniques. Wiley.

CHAPTER 1

ELEMENTARY CONCEPT OF RANDOM SAMPLING, STRATIFIED SAMPLING, SYSTEMATIC SAMPLING AND CLUSTER
SAMPLING.

Sampling terms

Sampling: a means of gathering useful information about a population. Data are gathered from samples
and conclusions are drawn about the population as a part of the inferential statistics process.
Population: a population signifies the units that we are interested in studying. These units could be people, cases or pieces of data.
Unit: the element of the sample selected from the population. Consider a study examining the effect of health care facilities in a community on prenatal care: what is the unit of analysis, the health facility or the individual woman?

Sample: A subset of the population selected for the study.

Sampling techniques: these are ways to help select units; that is, methods for creating samples.

Sampling frame
A sampling frame is a clear and concise description of the population under study, from which the population units can be identified unambiguously and contacted, if desired, for the purpose of the study.

Census: the procedure of systematically counting, acquiring and recording information about the members of a given population.

Survey: A survey is a research method used for collecting data from a predefined group of respondents to gain
information and insights into various topics of interest.

Characteristics of Good sampling

• Meet the requirements of the study objectives

• Provides reliable results

• Clearly understandable

• Manageable/realistic: could be implemented

• Time consideration: reasonable and timely

• Cost consideration: it should be economical

• Interpretation: accurate, representative

Advantages of Sampling

• Greater economy

• Shorter time-lag

• Greater scope

• Higher quality of work

• Evaluation of reliability

• Helps drawing statistical inferences from analytical surveys


Types of sampling

(A) Probability Samples
(i) Simple random sampling: every element of the population has an equal probability of being chosen. A sample selected in this way is said to be unbiased; incorrect sampling, however, may introduce bias.

(ii) Systematic sampling: select the first element from the list at random and then every kth element thereafter, where k is known as the sampling interval or skip.

(iii) Stratified sampling: the population is partitioned into mutually exclusive groups, called strata, and sampling is performed separately within each stratum. The aim is for the population to be homogeneous within each stratum and heterogeneous between strata; that is, members of the population are divided into homogeneous, non-overlapping subgroups and then sampled. The principal objective of stratification is to reduce sampling error.

(iv) Cluster sampling: select whole clusters at random. You want each cluster to be as heterogeneous as possible, so that every cluster is a good small-scale representation of the population.
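The four probability sampling schemes above can be sketched with Python's standard library. This is an illustration only: the 100-unit sampling frame, the stratum labels, and the cluster sizes are all hypothetical.

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible
population = list(range(1, 101))  # hypothetical sampling frame of 100 units

# (i) Simple random sampling: every unit has an equal chance of selection
srs = random.sample(population, 10)

# (ii) Systematic sampling: random start, then every k-th unit (k = N / n)
k = len(population) // 10            # sampling interval (skip)
start = random.randrange(k)          # random start within the first interval
systematic = population[start::k]

# (iii) Stratified sampling: sample separately within each (hypothetical) stratum
strata = {"low": population[:50], "high": population[50:]}
stratified = [u for s in strata.values() for u in random.sample(s, 5)]

# (iv) Cluster sampling: partition the frame into clusters, select whole clusters
clusters = [population[i:i + 20] for i in range(0, 100, 20)]  # 5 clusters of 20
cluster_sample = [u for c in random.sample(clusters, 2) for u in c]

print(len(srs), len(systematic), len(stratified), len(cluster_sample))
```

Note how cluster sampling keeps every unit of the chosen clusters, whereas stratified sampling takes a sub-sample from every stratum.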

(B) Non-Probability Sampling


(i) Judgment sampling

(ii) Purposive sampling

(iii) Convenience sampling

(iv) Snow-ball sampling

Advantages of probability sampling


• Provides a quantitative measure of the extent of variation due to random effects

• Provides data of known quality

• Provides data in timely fashion

• Provides acceptable data at minimum cost

• Better control over nonsampling sources of errors

• Mathematical statistics and probability can be


applied to analyze and interpret the data

Disadvantages of Non-probability Sampling


• Units are purposively selected, so no confidence statements can be made

• Selection bias likely

• Bias unknown

• No mathematical properties

• Should not be undertaken when scientific inference is the goal

• Provides false economy

Limitations of Sampling

• Constructing a sampling frame may require complete enumeration

• Sampling errors may be high in small areas

• May not be appropriate for the study objectives/questions

• Representativeness may be vague or controversial

CHAPTER 2: COMBINATORY MATHEMATICS

(Definitions for factorial, combinations and permutations)


Send HAND OUT
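While the definitions themselves are in the handout, the three quantities named above can be computed directly with Python's standard math module, which serves as a quick numerical check:

```python
import math

# Factorial: n! = n * (n-1) * ... * 1, with 0! = 1
print(math.factorial(5))   # 120

# Permutations: ordered arrangements of r items from n, P(n, r) = n! / (n - r)!
print(math.perm(5, 2))     # 20

# Combinations: unordered selections of r items from n, C(n, r) = n! / (r! (n - r)!)
print(math.comb(5, 2))     # 10
```

Note that every combination of 2 items corresponds to 2! = 2 permutations, which is why P(5, 2) = 2 × C(5, 2).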

CHAPTER THREE: STATISTICAL HYPOTHESIS

A foremost statistical mechanism for decision-making is the hypothesis test. It is a tool in statistics used to “prove” or “disprove” claims. It enables business analysts to structure problems in such a way that they can use statistical evidence to test various theories about business phenomena.

Applications range from determining whether a production line process is out of control to providing conclusive evidence that a new management leadership approach is significantly more effective than an old one.

The three types of hypotheses are

- Research hypothesis

- Statistical hypothesis

- Substantive hypothesis

All statistical hypotheses consist of two parts.

(a)The null hypothesis (H0)states that there is nothing new happening, the old theory is still true, the old standard
is correct. It represents a theory that has been put forward, either because it is believed to be true or because it is
to be used as a basis for argument, but has not been proved. For example, in the study of the effects of a new finance
policy on the performance of a company, the null hypothesis might be that the new policy is no better, on average,
than the current policy. We would write

𝐻0: There is no difference between the two financial policies on average.

(b)The alternative hypothesis (H1) states that the new theory is true, there are new standards, the system is out of
control. It is a statement of what a statistical hypothesis test is set up to establish. For example, Two years ago, the
proportion of infected plants was 37%. We believe that a treatment has helped, and we want to test the claim that
there has been a reduction in the proportion of infected plants.

The final conclusion once the test has been carried out is always given in terms of the null hypothesis. The two
possible conclusions are:

- Reject 𝐻0 in favor of 𝐻1

- Fail to reject 𝐻0

Concluding “Fail to reject H0” does not necessarily mean that the null hypothesis is true; it only suggests that there is not sufficient evidence against H0 in favor of H1.

Rejecting the null hypothesis, then, suggests that the alternative hypothesis is likely to be true.

Type (I) and Type (II) Errors

When making a conclusion in hypothesis testing two types of errors can be made.
Type I Error: A type I error occurs when the null hypothesis is rejected when it is in fact true; that is, 𝐻0 is wrongly
rejected. For example, in the study of the effects of a new finance policy on the performance of a company, the
null hypothesis might be that the new policy is no better, on average, than the current policy. That is: 𝐻0: There is
no difference between the two financial policies on average. A type I error would occur if we concluded that the
two policies produced different effects when in fact there was no difference between them.

Type II Error: A type II error occurs when the null hypothesis 𝐻0, is not rejected when it is in fact false. For
example, in the study of the effects of a new finance policy on the performance of a company, the null hypothesis
might be that the new policy is no better, on average, than the current policy. That is: 𝐻0: There is no difference
between the two financial policies on average. A type II error would occur if it was concluded that the two policies
produced the same effect, i.e. there is no difference between the two policies on average, when in fact they
produced different ones.

A type II error is frequently due to sample sizes being too small.

A type I error is often considered to be more serious, and therefore more important to avoid, than a type II error.

Significance Level

The significance level of a statistical hypothesis test is a fixed probability of wrongly rejecting the null hypothesis
H0, if it is in fact true.

It is the probability of a type I error and is set by the investigator in relation to the consequences of such an error.
That is, we want to make the significance level as small as possible in order to protect the null hypothesis and to
prevent, as far as possible, the investigator from inadvertently making false claims.

The significance level is usually denoted by α

P- Value

The probability value (p-value) of a statistical hypothesis test is the probability of obtaining a value of the test statistic as extreme as, or more extreme than, that observed by chance alone, if the null hypothesis H0 is true.

Test Statistic

A test statistic is a quantity calculated from our sample of data. Its value is used to decide whether or not the null
hypothesis should be rejected in our hypothesis test. The choice of a test statistic will depend on the assumed
probability model and the hypotheses under question.

Critical Value(s) and Region

The critical value(s) for a hypothesis test is a threshold to which the value of the test statistic in a sample is
compared to determine whether or not the null hypothesis is rejected. The critical value for any hypothesis test
depends on the significance level at which the test is carried out, and whether the test is one-sided or two-sided.

The critical region, or rejection region, is a set of values of the test statistic for which the null hypothesis is rejected
in a hypothesis test.

That is, the sample space for the test statistic is partitioned into two regions; one region (the critical region) will
lead us to reject the null hypothesis H0, the other will not. So, if the observed value of the test statistic is a
member of the critical region, we conclude "Reject H0"; if it is not a member of the critical region then we
conclude "Fail to reject H0".

One-sided Test

A one-sided test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, 𝐻0
are located entirely in one tail of the probability distribution. In other words, the critical region for a one-sided test
is the set of values less than the critical value of the test, or the set of values greater than the critical value of the
test.

A one-sided test is also referred to as a one-tailed test of significance. The choice between a one-sided and a two-
sided test is determined by the purpose of the investigation or prior reasons for using a one-sided test.

Example

Suppose we wanted to test a manufacturer’s claim that there are, on average, 50 matches in a box. We could set
up the following hypotheses.

𝐻0: µ = 50,

Against

𝐻1: µ < 50 or 𝐻1: µ > 50

Either of these two alternative hypotheses would lead to a one-sided test.

Presumably, we would want to test the null hypothesis against the first alternative hypothesis, since it would be useful to know if there are likely to be fewer than 50 matches, on average, in a box (no one would complain if they got the correct number of matches in a box, or more).

Two-Sided Test

A two-sided test is a statistical hypothesis test in which the values for which we can reject the null hypothesis, H0
are located in both tails of the probability distribution.

In other words, the critical region for a two-sided test is the set of values less than a first critical value of the test
and the set of values greater than a second critical value of the test.

A two-sided test is also referred to as a two-tailed test of significance. The choice between a one-sided test and a
two-sided test is determined by the purpose of the investigation or prior reasons for using a one-sided test.

Example

Suppose again that we wanted to test a manufacturer’s claim that there are, on average, 50 matches in a box. As in the previous example, testing H0: µ = 50 against H1: µ < 50 or H1: µ > 50 would lead to a one-sided test.

Yet another alternative hypothesis could be tested against the same null, leading this time to a two-sided test:

𝐻0: µ = 50

𝐻1: µ ≠ 50
Here, nothing specific can be said about the average number of matches in a box; only that, if we could reject the
null hypothesis in our test, we would know that the average number of matches in a box is likely to be less than or
greater than 50.
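The practical difference between the one-sided and two-sided versions of the match-box test lies in the critical values. The sketch below assumes a z test at α = 0.05 (using the standard normal distribution is an assumption made for illustration; with a small sample and unknown σ, a t distribution would be used instead):

```python
from statistics import NormalDist

alpha = 0.05
z = NormalDist()  # standard normal distribution

# One-sided test (H1: mu < 50): all of alpha sits in the lower tail
lower_critical = z.inv_cdf(alpha)        # about -1.645; reject H0 if z_obs < this

# Two-sided test (H1: mu != 50): alpha is split equally between both tails
two_sided = z.inv_cdf(1 - alpha / 2)     # about 1.960; reject H0 if |z_obs| > this

print(round(lower_critical, 3), round(two_sided, 3))
```

Because the two-sided test splits α across both tails, its critical value (1.960) is further from zero than the one-sided value (1.645): the same observed statistic can be significant one-sided but not two-sided.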

Steps in Conducting Hypothesis Testing

i. Begin by stating the claim or hypothesis that is being tested. Also form a statement for the case that the hypothesis is false. These are H0 and H1.

ii. Choose the desired significance level 𝛼. The values 0.05 and 0.01 are common values used for alpha,
but any positive number between 0 and 0.50 could be used for a significance level.

iii. Determine which statistic and distribution to use. The type of distribution is dictated by features of the data. Common test statistics include the z score, t score, chi-square (χ²) and F statistics.

iv. Compute the test statistic and the p-value for this statistic. Here we have to consider whether we are conducting a two-tailed test (typically when the alternative hypothesis contains a “not equal to” symbol) or a one-tailed test (typically used when an inequality is involved in the statement of the alternative hypothesis).

v. If the p-value is less than the set significance level α, we reject the null hypothesis and the alternative hypothesis stands. If the p-value is not less than α, we fail to reject the null hypothesis. This does not prove that the null hypothesis is true; it only means there is insufficient evidence against it.

vi. We now state the results of the hypothesis test in such a way that the original claim is addressed.
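The six steps can be walked through on the match-box example. The sample of box counts below is hypothetical, and the population standard deviation σ is assumed known so that a z test applies (in practice, with σ unknown, a t test on the sample standard deviation would be used):

```python
from math import sqrt
from statistics import NormalDist, mean

# Step i:  H0: mu = 50 vs H1: mu < 50 (one-tailed, lower tail)
# Step ii: choose the significance level
alpha = 0.05
# Step iii: a z statistic applies because sigma is assumed known (hypothetical value)
sigma = 2.0
counts = [48, 50, 47, 49, 51, 48, 49, 50, 47, 48]  # hypothetical box counts
n = len(counts)
# Step iv: compute the test statistic and the one-tailed p-value
z = (mean(counts) - 50) / (sigma / sqrt(n))
p_value = NormalDist().cdf(z)  # lower-tail area, matching H1: mu < 50
# Step v: compare the p-value with alpha
reject = p_value < alpha
# Step vi: state the result in terms of the original claim
print(f"z = {z:.3f}, p = {p_value:.4f}, reject H0: {reject}")
```

For these hypothetical data the sample mean is 48.7, giving z ≈ -2.055 and a p-value below 0.05, so the manufacturer's claim of 50 matches per box would be rejected.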

CONTINUATION

***check handout***

CHAPTER FOUR

TEST OF GOODNESS OF FIT

This is a hypothesis test used to check whether the data “fit” a particular distribution or not. It uses the chi-square statistic. It compares the expected, or theoretical, frequencies of categories from a population distribution to the observed, or actual, frequencies from a distribution to determine whether there is a difference between what was expected and what was observed. The test statistic is given by

χ² = Σ (fo − fe)² / fe, with df = k − 1,

where fo is an observed frequency, fe the corresponding expected frequency, and k the number of categories (when no parameters are estimated from the data).
Example

How would you rate the level of service that Ukulima Sacco provides?

The distribution of responses to this question was as follows: Excellent 8%, Pretty good 47%, Only fair 34%, Poor 11%.

To validate these results, the Sacco manager interviews 207 randomly selected customers as they leave the Sacco premises during a given time period.

She asks the customers how they would rate the level of service at the Sacco from which they had just exited. The
response categories are excellent, pretty good, only fair, and poor. The observed responses from this study are
given in the table below.

Response        Frequency (fo)
Excellent       21
Pretty good     109
Only fair       62
Poor            15

Use a chi-square goodness-of-fit test and the hypothesis-testing steps to determine whether the observed frequencies of responses from this survey are the same as the frequencies that would be expected on the basis of the national survey.

Solution

Step 1: The hypotheses for this example follow.
H0: The observed distribution of service ratings is the same as the national survey distribution.
H1: The observed distribution of service ratings is not the same as the national survey distribution.


The chi-square goodness-of-fit can then be calculated, as shown
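A sketch of the calculation for the Sacco example, applying χ² = Σ(fo − fe)²/fe with df = k − 1 = 3 (the α = 0.05 critical value of 7.815 is taken from standard chi-square tables):

```python
observed = [21, 109, 62, 15]             # excellent, pretty good, only fair, poor
proportions = [0.08, 0.47, 0.34, 0.11]   # national survey distribution (H0)
n = sum(observed)                        # 207 customers interviewed

# Expected frequency of each category under H0: national proportion times n
expected = [p * n for p in proportions]
chi_square = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

critical = 7.815                         # chi-square critical value, df = 3, alpha = 0.05
print(round(chi_square, 3), chi_square > critical)
```

The statistic works out to about 6.249, which is below 7.815, so we fail to reject H0: the observed ratings are consistent with the national survey distribution.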

Example 2
Solution
Exercises
1. Use a chi-square goodness-of-fit test to determine whether the observed frequencies are distributed the same as the expected frequencies (α = 0.05).

Chi-Square Test of Independence

The chi-square goodness-of-fit test is used to analyze the distribution of frequencies for categories of one variable,
such as age or number of bank arrivals, to determine whether the distribution of these frequencies is the same as
some hypothesized or expected distribution.

The chi-square test of independence can be used to analyze the frequencies of two variables with multiple categories to determine whether the two variables are independent.

The null hypothesis for a chi-square test of independence is that the two variables are independent. The alternative hypothesis is that the variables are not independent. This test is one-tailed. The degrees of freedom are df = (r − 1)(c − 1), where r and c are the numbers of rows and columns of the contingency table.

Example 1

Suppose you want to determine whether certain types of products sell better in certain geographic locations than others. Consider the accompanying data on the number of sales of three products in three regions. Test the hypothesis of independence between type of product and region.

Solution
Thus, we would reject the null hypothesis that there is no relationship between type of product and region. Our data tell us there is a statistically significant relationship between type of product and region.
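A sketch of the full calculation on a 3 × 3 sales table. The counts below are invented for illustration (they are not the data of Example 1); each expected count under independence is row total × column total ÷ grand total:

```python
# Hypothetical counts: rows = products A, B, C; columns = regions 1, 2, 3
table = [[40, 30, 30],
         [35, 45, 20],
         [25, 25, 50]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)

# chi-square = sum over cells of (observed - expected)^2 / expected
chi_square = sum(
    (table[i][j] - row_totals[i] * col_totals[j] / grand) ** 2
    / (row_totals[i] * col_totals[j] / grand)
    for i in range(3) for j in range(3)
)

df = (3 - 1) * (3 - 1)   # (r - 1)(c - 1) = 4
critical = 9.488         # chi-square critical value, df = 4, alpha = 0.05
print(round(chi_square, 2), chi_square > critical)
```

Here χ² = 24.0 exceeds 9.488, so for these hypothetical counts we would likewise reject independence between product and region.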

Exercise 8

A company packages a particular product in cans of three different sizes, each one using a different production
line. Most cans conform to specifications, but a quality control engineer has identified the following reasons for
non-conformance: (1) blemish on can; (2) crack in can; (3) improper pull tab location; (4) pull tab missing; (5) other.
A sample of nonconforming units is selected from each of the three lines, and each unit is categorized according to
reason for nonconformity, resulting in the following contingency table data:

CHAPTER 6

ASSOCIATION TEST (ANALYSIS OF VARIANCE: ANOVA)

Analysis of variance, popularly known as ANOVA, is used to compare the means in cases where there are more
than two groups.
When we have only two samples we can use the t-test to compare the means, but running repeated t-tests across more than two samples inflates the chance of a Type I error. If we compare only two means, the t-test (independent samples) gives the same result as ANOVA.

The null and alternative hypotheses are

H0: μ1 = μ2 = ⋯ = μk

H1: μi ≠ μj for at least one pair i ≠ j
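The one-way ANOVA F statistic can be computed from scratch as F = MS_between / MS_within, with df1 = k − 1 and df2 = n − k. The three groups below are hypothetical stand-ins for recruitment plans:

```python
from statistics import mean

# Hypothetical recruitment counts under three plans (four interns per plan)
groups = [[12, 15, 14, 11], [18, 20, 17, 19], [13, 14, 12, 15]]

k = len(groups)                          # number of groups
n = sum(len(g) for g in groups)          # total number of observations
grand_mean = mean(x for g in groups for x in g)

# Between-groups (treatment) and within-groups (error) sums of squares
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

ms_between = ss_between / (k - 1)        # mean square between, df1 = k - 1
ms_within = ss_within / (n - k)          # mean square within, df2 = n - k
f_stat = ms_between / ms_within

print(round(f_stat, 2))
```

For these hypothetical data F ≈ 16.65, well above the α = 0.05 critical value of about 4.26 for (2, 9) degrees of freedom, so H0 (equal group means) would be rejected.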

Example

Malawi Greenhouse Suppliers has four main plans for recruiting farmers. The data below show the number of
farmers recruited under each plan by 23 interns. Do the plans differ in mean achievement?
Solution
Exercise

Do the data indicate a difference among branches? Use a level of significance of 0.05.
CHAPTER 7

NON-PARAMETRIC TESTS
