
BASIC COMMON CONCEPTS IN

STATISTICS

BY
Zzigwa Marvin & Samuel Kibirango
WHAT IS STATISTICS?

• Statistics is the science of creating, developing, and applying techniques such that
decisions can be made and evaluated in the face of uncertainty.

• It involves the collection, analysis and interpretation of numerical data.

• It involves the collection of data by means of experiments or surveys conducted according to various principles, and the drawing of conclusions (or inferences) from data through the use of various procedures known as statistical analysis.
BASIC COMMON CONCEPTS IN STATISTICS
Population (universe)
• A population generally means the actual number of people in the demographic sense. However, statisticians stretch the word to mean any group of objects, animate or inanimate, which are the subject of interest to the investigator.
• In general the size of a population, if finite, is given by N.
Sample
• Sometimes it is not possible to study the whole population of the objects of our interest. It becomes necessary to
choose a few representative members of that population which will be expected to have adequate characteristics of
the universe.
• Rigorous methods of sampling are used to ensure that all members are represented by the resulting sample, and that
the sample has the representative characteristics of the entire population.
• A sample is a smaller, manageable version that contains characteristics of a larger population. Samples are used when
population sizes are too large for the test to include all possible members or observations. A sample is therefore a
subset of a population.
• For reasons such as cost and time, it is usually necessary to study a sample from a population to obtain information about the population. Inferences about a population are based on sample data only.
• In general, the size of a sample is given by n.
Parameter

• Any numerical or descriptive measure of a population is referred to as a parameter. Since a parameter is a measure from the entire population, it is a fixed value.

Statistic

• Any numerical or descriptive measure of a sample is referred to as a statistic. Usually, a given population parameter corresponds to a sample statistic. A statistic varies from sample to sample and within samples.

Estimate

• A sample statistic ESTIMATES a given population parameter. Since it is generally not possible, or very difficult to
calculate the value of the population parameter directly, we calculate a sample statistic that corresponds to that
parameter and it is hopefully close to its true value.

Experiment/Trial/event

• An experiment is the process by which an observation (or measurement) is obtained. The experiment, when performed, is often called a trial. An event is the outcome of an experiment. Example: A coin is tossed. This is an experiment. The outcome is either ‘Heads’ or ‘Tails’. The outcome that is obtained in the experiment is the event.

Variable (variate)

• A variable is a phenomenon which changes with respect to the influence of other phenomena or objects.

• This changing aspect of the phenomenon is what makes it a variable, because it varies with respect to changes in other phenomena.

• Commonly we talk about two types of variables in statistics, i.e., dependent variables and independent variables. The former vary because of other phenomena which exist outside the domain of the investigation. Their magnitude therefore depends on the magnitude or the presence of other variables, hence the term dependent variable.

• Independent variables are those phenomena which are capable of influencing other
variables. For instance, the weight of an individual is a dependent variable, while the
dietary habits which influence that weight are the independent variable.

• Different sorts of variables are encountered by statisticians and it is desirable to distinguish among them. This distinction is illustrated in the diagram below.

Variate
    Quantitative
        Discrete
        Continuous
    Qualitative
        Ordinal
        Nominal/categorical

a) A quantitative variate includes data that represents the amount or quantity of something, and measurement is made over a range of values. The observations of a quantitative variate possess a natural order or ranking.

i) A continuous quantitative variate is one for which all values in some range are possible e.g., height, weight, etc.

ii) A discrete quantitative variate is one for which only certain values are possible. They are usually consecutive
integers; for example, size of household, number of insects in an insect trap, etc.

b) A qualitative variate describes a characteristic or attribute of an individual. It includes data that can be categorized
but not quantified.

i) Ordinal variates deal with relative differences e.g:

- Short – Average – Tall

- Bad – Good – Excellent

• Variates of this type are often represented by ‘scores’ such as 1 – 2 – 3, and as such are erroneously treated statistically as discrete variables.

ii) Nominal/categorical variates are associated with some quality, characteristic or attribute which the variable possesses; for example, eye color (blue – grey – green – brown); vehicle type (bus – truck – car – bicycle); etc.
Random variate

• The observations of a random variate can assume any of the possible values in a population with equal likelihood.

Model

• A model is an attempt to construct a simplified explanation of the complex reality in a manner which can be easily
understood, and whose nature can generally be said to be possible at all times.

• Models are often used to describe phenomena in social sciences because the interaction of variables is of a
complex nature and it is impossible to run controlled experiments to determine the nature of the interaction of
such variables. For example, the habits of any consumer population can only be modeled, because we cannot lock
any member or members of the population or any two members of the population in a laboratory for a long time
to study their consumer habits.
Survey
• In a survey, data is collected over various locations, with no control over the factors under study. In a survey,
extraneous factors cannot be held constant.
Hypothesis
A hypothesis is an educated guess of the nature of the relationship between variables which is capable of being verified
during the process of modeling or experimentation. Hypotheses can never be proven; they are only verified.
TYPES OF STUDIES

• Comparative vs absolute

In a comparative experiment, two or more procedures or treatments are tested against one another by
means of some criterion for performance. In an absolute experiment, a single procedure is studied
and/or the purpose is to obtain a single certain measurement.

• Objective experimentation

Here, the focus is to: (i) verify a hypothesis, and (ii) generate ‘empirical knowledge’ (knowledge based on observation or experience rather than theory). Such a study uses the ‘empirical method’.

• Observational experiments

These are carried out to demonstrate (usually visually) the effectiveness or otherwise of particular
practices or recommendations
DESCRIPTIVE STATISTICS
▪ Descriptive statistics are used to describe the basic features of the data in a study.

▪ They provide simple summaries about the sample and the measures.

▪ Together with simple graphics analysis, they form the basis of virtually every quantitative analysis
of data.
▪ With inferential statistics, by contrast, we try to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to try to infer from the sample data what the population might think.
▪ How to properly describe data through statistics and graphs is important. Typically, there are two
general types of statistic that are used to describe data:
▪ Measures of central tendency: these are ways of describing the central position of a frequency
distribution for a group of data. There are three major types of estimates of central tendency:
mode, median, and mean.
▪ Measures of spread: these are ways of summarizing a group of data by describing how spread out
the scores are. For example, the mean score of our 100 students may be 65 out of 100. However, not
all students will have scored 65 marks. Rather, their scores will be spread out. Some will be lower
and others higher.
▪ Measures of spread help us to summarize how spread out these scores are. To describe this spread,
a number of statistics are available to us, including the range, quartiles, variance and standard
deviation.
GRAPHICAL METHODS OF DATA DESCRIPTION
▪ Data can be described using types of graphs depending on the type of data (variable)
(a) Continuous data
- Histograms
- Line graphs (frequency polygons)
- Cumulative frequency curves (ogive)
- Boxplot
(b) Discrete data
- Column charts
- Bar graphs
- Pie charts
(c) Qualitative data (nominal and ordinal)
- Column charts
- Bar graphs
- Pie charts
▪ Consider results of a survey in which the gross income of 8,644 households is obtained. In raw form, such information would be somewhat meaningless because of its bulk. In such situations it is necessary to summarize the data in such a way that some ‘picture’ is obtained. The table below is the result of such a summary.

Income (‘000 UGX)    Number of households (frequency)
<2,000               1,406
2,000 -              4,352
4,000 -              1,833
6,000 -              489
10,000 -             163
20,000 +             406
Total                8,644
GUIDELINES FOR THE GROUPING OF DATA

(a) Whenever possible ensure that class intervals are the same for all classes.

(b) Ensure that each item (observation or measurement) is allocated to one class
only.

MEAN
• The mean (informally, the “average”) is found by adding all of the numbers together and dividing by the number of items in the set: (10 + 10 + 20 + 40 + 70) / 5 = 30.
MEAN FOR GROUPED DATA

• Example: The following table gives the frequency distribution of the number of
orders received each day during the past 50 days at the office of a mail-order
company. Calculate the mean.
▪ Start by getting class boundaries.
▪ Then get the midpoint X of each class by adding the class limits and dividing by 2.
▪ The mean/average: $\bar{X} = \frac{\sum fX}{n} = \frac{832}{50} = 16.64$
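The grouped-mean steps above can be sketched in Python. The slide's frequency table is not reproduced in this text, so the class limits and frequencies below are assumptions, chosen only so that they give Σ fX = 832 and n = 50 as in the worked solution; `grouped_mean` is an illustrative helper name.

```python
def grouped_mean(midpoints, freqs):
    """Mean of grouped data: sum(f * X) / n, with X the class midpoints."""
    n = sum(freqs)
    return sum(f * x for f, x in zip(freqs, midpoints)) / n

# Assumed class limits and frequencies (not taken from the slide's table).
class_limits = [(10, 12), (13, 15), (16, 18), (19, 21)]
freqs = [4, 12, 20, 14]

# Midpoint X of each class: add the class limits and divide by 2.
midpoints = [(low + high) / 2 for low, high in class_limits]

mean_orders = grouped_mean(midpoints, freqs)
print(round(mean_orders, 2))  # 16.64
```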
MEDIAN

▪ The median is the middle number in a data set. To find the median, list
your data points in ascending order and then find the middle number.
The middle number in this set is 28 as there are 4 numbers below it and
4 numbers above:
23, 24, 26, 26, 28, 29, 30, 31, 33
▪ Note: If you have an even set of numbers, average the middle two to find the median. For example, the median of this set of numbers is 28.5, i.e. (28 + 29) / 2:
23, 24, 26, 26, 28, 29, 30, 31, 33, 34
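As a quick check of both cases, Python's standard `statistics` module can compute the median of the two data sets listed above; the variable names are illustrative only.

```python
import statistics

odd_set = [23, 24, 26, 26, 28, 29, 30, 31, 33]       # 9 values
even_set = [23, 24, 26, 26, 28, 29, 30, 31, 33, 34]  # 10 values

# Odd count: the single middle value.
median_odd = statistics.median(odd_set)
# Even count: the average of the two middle values.
median_even = statistics.median(even_set)

print(median_odd, median_even)  # 28 28.5
```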
MEDIAN FOR GROUPED DATA

▪ Step 1: Construct the cumulative frequency distribution.
▪ Step 2: Decide the class that contains the median. The median class is the first class whose cumulative frequency is at least n/2.
▪ Step 3: Find the median by using the following formula:
$\text{Median} = L_m + \left(\frac{\frac{n}{2} - F}{f_m}\right) i$
▪ Where:
n = the total frequency
F = the cumulative frequency before class median
i = the class width
Lm = the lower boundary of the class median
𝑓𝑚 = the frequency of the class median
▪ Based on the grouped data below, find the median:
SOLUTION

▪ First step: Construct the cumulative frequency distribution.
▪ Median class: $\frac{n}{2} = \frac{50}{2} = 25$, which is in the third class.
▪ So, F = 22, $f_m$ = 12, $L_m$ = 20.5 and i = 10.
▪ Therefore, $\text{Median} = L_m + \left(\frac{\frac{n}{2} - F}{f_m}\right) i = 20.5 + \left(\frac{25 - 22}{12}\right) 10 = 23$
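The three steps can be sketched as a small Python function. The grouped-data table is not reproduced in this text, so the boundaries and frequencies below are assumptions reconstructed from the values quoted in the solution (F = 22, f_m = 12, L_m = 20.5, i = 10, n = 50); in particular the last class frequency is a guess that makes the total 50.

```python
from itertools import accumulate

def grouped_median(lower_boundaries, freqs, width):
    """Median = L_m + ((n/2 - F) / f_m) * i for equal-width classes."""
    n = sum(freqs)
    cumulative = list(accumulate(freqs))
    # Median class: first class whose cumulative frequency is at least n/2.
    m = next(k for k, c in enumerate(cumulative) if c >= n / 2)
    cum_before = cumulative[m - 1] if m > 0 else 0
    return lower_boundaries[m] + (n / 2 - cum_before) / freqs[m] * width

# Assumed table (reconstructed, not copied from the slide).
lower_boundaries = [0.5, 10.5, 20.5, 30.5, 40.5]
freqs = [8, 14, 12, 9, 7]

median = grouped_median(lower_boundaries, freqs, width=10)
print(median)  # 23.0
```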
QUARTILES

▪ Using the same method of calculation as in the median, we can get the Q1 and Q3 equations as follows:
▪ $Q_1 = L_{Q_1} + \left(\frac{\frac{n}{4} - F}{f_{Q_1}}\right) i$ and $Q_3 = L_{Q_3} + \left(\frac{\frac{3n}{4} - F}{f_{Q_3}}\right) i$
▪ Example: Based on the grouped data below, find the interquartile range.
SOLUTION
▪ First step: Construct the cumulative frequency distribution.
▪ Second step: Determine Q1 and Q3.
▪ Q1 class: $\frac{n}{4} = \frac{50}{4} = 12.5$
▪ The Q1 class is the 2nd class.
▪ Therefore, $Q_1 = L_{Q_1} + \left(\frac{\frac{n}{4} - F}{f_{Q_1}}\right) i = 10.5 + \left(\frac{12.5 - 8}{14}\right) 10 = 13.7143$
▪ Q3 class: $\frac{3n}{4} = \frac{3(50)}{4} = 37.5$
▪ The Q3 class is the 4th class.
▪ Therefore, $Q_3 = L_{Q_3} + \left(\frac{\frac{3n}{4} - F}{f_{Q_3}}\right) i = 30.5 + \left(\frac{37.5 - 34}{9}\right) 10 = 34.3889$
INTERQUARTILE RANGE (IQR)

▪ IQR = Q3 - Q1

▪ Calculate the IQR:
▪ IQR = Q3 - Q1 = 34.3889 – 13.7143 = 20.6746
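The quartile formulas differ from the median formula only in the target position (n/4 or 3n/4 instead of n/2), so one helper can cover all of them. This is a sketch; the table values are assumptions reconstructed from the numbers quoted in the solution, and `grouped_quantile` is a hypothetical name.

```python
from itertools import accumulate

def grouped_quantile(p, lower_boundaries, freqs, width):
    """Quantile at proportion p: L + ((p*n - F) / f) * i."""
    n = sum(freqs)
    target = p * n
    cumulative = list(accumulate(freqs))
    # Quantile class: first class whose cumulative frequency reaches p*n.
    k = next(j for j, c in enumerate(cumulative) if c >= target)
    cum_before = cumulative[k - 1] if k > 0 else 0
    return lower_boundaries[k] + (target - cum_before) / freqs[k] * width

# Assumed table (reconstructed, not copied from the slide).
lower_boundaries = [0.5, 10.5, 20.5, 30.5, 40.5]
freqs = [8, 14, 12, 9, 7]

q1 = grouped_quantile(0.25, lower_boundaries, freqs, width=10)
q3 = grouped_quantile(0.75, lower_boundaries, freqs, width=10)
iqr = q3 - q1
print(round(q1, 4), round(q3, 4), round(iqr, 4))  # 13.7143 34.3889 20.6746
```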
MODE
Mode is the value that has the highest frequency in a data set.
For example, the mode in the set 21, 21, 21, 23, 24, 26, 26, 28, 29, 30, 31 and 33 is 21.
▪ For grouped data, the class mode (or modal class) is the class with the highest frequency.
▪ To find the mode for grouped data, use the following formula:
▪ $\text{Mode} = L_{mo} + \left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right) i$
▪ Where
▪ i is the class width
▪ $\Delta_1$ is the difference between the frequency of the modal class and the frequency of the class before the modal class.
▪ $\Delta_2$ is the difference between the frequency of the modal class and the frequency of the class after the modal class.
▪ $L_{mo}$ is the lower boundary of the class mode.
▪ Based on the grouped data below, find the mode
SOLUTION
Based on the table above,

▪ $L_{mo}$ = 10.5, $\Delta_1$ = (14 - 8) = 6, $\Delta_2$ = (14 - 12) = 2 and i = 10

▪ $\text{Mode} = 10.5 + \left(\frac{6}{6 + 2}\right) 10 = 18$
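A minimal sketch of the grouped-mode formula in Python, taking the first difference against the class before the modal class and the second against the class after it. The table values are assumptions reconstructed from the frequencies quoted in the solution (8, 14 and 12 around the modal class); the remaining classes are guesses.

```python
def grouped_mode(lower_boundaries, freqs, width):
    """Mode = L_mo + (d1 / (d1 + d2)) * i for equal-width classes."""
    k = max(range(len(freqs)), key=freqs.__getitem__)      # modal class index
    before = freqs[k - 1] if k > 0 else 0
    after = freqs[k + 1] if k < len(freqs) - 1 else 0
    d1 = freqs[k] - before   # modal class minus the class before it
    d2 = freqs[k] - after    # modal class minus the class after it
    return lower_boundaries[k] + d1 / (d1 + d2) * width

# Assumed table (reconstructed, not copied from the slide).
lower_boundaries = [0.5, 10.5, 20.5, 30.5, 40.5]
freqs = [8, 14, 12, 9, 7]

mode = grouped_mode(lower_boundaries, freqs, width=10)
print(mode)  # 18.0, i.e. 10.5 + (6 / 8) * 10
```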

Mode can also be obtained from a histogram.


Step 1: Identify the modal class and the bar
representing it
Step 2: Draw two cross lines as shown in the
diagram.
Step 3: Drop a perpendicular from the intersection of the two lines until it touches the horizontal axis.
Step 4: Read the mode from the horizontal axis.
VARIANCE AND STANDARD DEVIATION OF GROUPED DATA

▪ Population variance: $\sigma^2 = \frac{\sum fX^2 - \frac{(\sum fX)^2}{N}}{N}$
▪ Variance of sample data: $s^2 = \frac{\sum fX^2 - \frac{(\sum fX)^2}{n}}{n - 1}$
▪ Standard deviation
▪ Population: $\sigma = \sqrt{\sigma^2}$
▪ Sample: $s = \sqrt{s^2}$

Example
▪ Find the variance and standard deviation for the following data
▪ Variance: $s^2 = \frac{\sum fX^2 - \frac{(\sum fX)^2}{n}}{n - 1} = \frac{14216 - \frac{832^2}{50}}{50 - 1} = 7.5820$
▪ Standard deviation: $s = \sqrt{s^2} = \sqrt{7.5820} = 2.75$
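The sample variance shortcut formula can be sketched as follows. The data table is not reproduced in this text, so the midpoints and frequencies are assumptions, chosen only so that Σ fX = 832, Σ fX² = 14216 and n = 50 as in the worked solution.

```python
from math import sqrt

def grouped_sample_variance(midpoints, freqs):
    """s^2 = (sum(f*X^2) - (sum(f*X))^2 / n) / (n - 1)."""
    n = sum(freqs)
    sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
    sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)

# Assumed midpoints and frequencies (not taken from the slide's table).
midpoints = [11, 14, 17, 20]
freqs = [4, 12, 20, 14]

variance = grouped_sample_variance(midpoints, freqs)
std_dev = sqrt(variance)
print(round(variance, 4), round(std_dev, 2))  # 7.582 2.75
```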
To design a rural multilane road, vehicle speed surveys were carried out on Entebbe Expressway as shown in the table below.

Speed group (km/hr)    Observations (frequency) F
36-40                  4
41-45                  6
46-50                  8
51-55                  29
56-60                  50
61-65                  63
66-70                  45
71-75                  29
76-80                  90
81-85                  19
86-90                  12
91-95                  10
96-100                 9
101-105                2

Determine:
1) Mean speed
2) Median speed
3) Interquartile speed
4) Mode
5) Standard deviation
6) 85th percentile speed (design speed)
PROBABILITY

▪ Probability is the chance of an event happening.

▪ A probability experiment is the process that leads to well defined results called outcomes.

▪ A sample space is the set of all possible outcomes of a probability experiment.

▪ An event is one or more outcomes of a probability experiment.

▪ The probability of any event E is

$P(E) = \frac{\text{Number of outcomes in } E}{\text{Total number of outcomes in the sample space}}$

▪ This probability is denoted by

$P(E) = \frac{n(E)}{n(S)}$

▪ Probabilities can be expressed as fractions, decimals or percentages.


PROBABILITY RULES

1. The probability of any event E is a number (either a fraction or decimal) between zero and one inclusive. This is denoted by $0 \le P(E) \le 1$.

2. If an event cannot occur, its probability is zero.

3. If an event is certain, then the probability of E is one.

4. The sum of the probabilities of all outcomes in the sample space is one.


ADDITION RULE
- The first addition rule is used when two events are mutually exclusive. For example; When two
events A and B are mutually exclusive, the probability that A or B will occur is given by;

P (A or B) =P(A)+P(B)

Example

▪ A day of the week is selected at random; find the probability that it is a weekend day.

$P(\text{Saturday or Sunday}) = P(\text{Saturday}) + P(\text{Sunday}) = \frac{1}{7} + \frac{1}{7} = \frac{2}{7}$

- The second addition rule is used when two events are not mutually exclusive. If A and B are not mutually exclusive, the probability that A or B will occur is given by;

▪ P (A or B) = P (A)+P (B)-P (A and B)

▪ The second formula also works for mutually exclusive events, since P(A and B) is then always zero because both events cannot happen at the same time.
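Both addition rules can be checked by counting outcomes. The sketch below uses Python sets and exact fractions; the weekend example comes from the text, while the die events (an even roll, a roll greater than 3) are a hypothetical illustration of events that are not mutually exclusive.

```python
from fractions import Fraction

def prob(event, sample_space):
    """P(E) = n(E) / n(S) for equally likely outcomes."""
    return Fraction(len(event), len(sample_space))

# Rule 1: mutually exclusive events (a day cannot be both Sat and Sun).
days = {"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"}
p_weekend = prob({"Sat"}, days) + prob({"Sun"}, days)

# Rule 2: events that overlap (hypothetical single die roll).
die = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}
greater_than_3 = {4, 5, 6}
p_either = (prob(even, die) + prob(greater_than_3, die)
            - prob(even & greater_than_3, die))

print(p_weekend, p_either)  # 2/7 2/3
```

Counting the union directly, `prob(even | greater_than_3, die)`, gives the same 2/3, which is why the overlap has to be subtracted once.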
COMPLEMENT RULE

▪ The complement of an event is the set of outcomes that are not included in the outcomes of the
event. If the probability of an event is known, the probability of its complement can be got by
subtracting it from one since one is the highest probability. Likewise, if the probability of the
complement is known, the probability of the event can be got by subtracting the complement
probability from one. The complement of event A is denoted by a bar on top of A.

▪ The formula is given by;

▪ $P(\bar{A}) = 1 - P(A)$; or, $P(A) = 1 - P(\bar{A})$; or, $P(A) + P(\bar{A}) = 1$

Example
▪ The probability that beans will germinate is $\frac{1}{5}$; find the probability that beans will not germinate.

Solution

▪ $P(\text{beans not germinating}) = 1 - P(\text{beans germinating}) = 1 - \frac{1}{5} = \frac{4}{5}$
MULTIPLICATIVE RULE

▪ The multiplicative rules are used to find probability of two or more events that occur in sequence.
For example if a coin is tossed and a die is rolled, one can get the probability of getting a tail on the
coin and then getting a two in the die. These two events are independent since the outcome of
tossing a coin does not affect the outcome of rolling a die.
- Multiplication rule 1
▪ When two events are independent, the probability of both occurring is given by;

▪ P(A and B)=P (A).P (B)


Example
▪ A coin is tossed and a die is rolled, find the probability of getting a tail on the coin and a Five on
the die.
Solution
▪ $P(\text{Tail and a Five}) = P(\text{Tail}) \cdot P(\text{Five}) = \frac{1}{2} \cdot \frac{1}{6} = \frac{1}{12}$
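The result can be confirmed by listing the whole sample space of the coin-and-die experiment and counting. This is only an illustrative sketch; the variable names are not from the slides.

```python
from fractions import Fraction
from itertools import product

coin = ["H", "T"]
die = [1, 2, 3, 4, 5, 6]

# Every (coin, die) pair is one equally likely outcome: 2 * 6 = 12 in total.
sample_space = list(product(coin, die))
favourable = [outcome for outcome in sample_space if outcome == ("T", 5)]

p_by_counting = Fraction(len(favourable), len(sample_space))
p_by_rule = Fraction(1, 2) * Fraction(1, 6)   # P(Tail) * P(Five)

print(p_by_counting, p_by_rule)  # 1/12 1/12
```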

- Multiplicative rule 2

▪ This is used to find probability when the outcome of the first event affects the outcome of the
second event. These are called dependent events. The formula for probability is given by;

▪ P (A and B) = P(A).P (B/A)

▪ P (B/A) is the probability that B occurs given that A has already occurred.


CONDITIONAL PROBABILITY

▪ The conditional probability of an event A in relation to B is the probability that A occurs given that B has already occurred.

▪ The conditional probability of an event can be found by dividing both sides of the equation for multiplication rule 2 by the probability of B:

▪ $P(A \text{ and } B) = P(B) \cdot P(A/B)$

▪ $\frac{P(A \text{ and } B)}{P(B)} = \frac{P(B) \cdot P(A/B)}{P(B)}$

▪ $P(A/B) = \frac{P(A \text{ and } B)}{P(B)}$
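A small numerical illustration of the conditional probability formula, using a hypothetical single die roll (the events A and B below are not from the slides).

```python
from fractions import Fraction

die = {1, 2, 3, 4, 5, 6}
a = {2, 4, 6}   # A: the roll is even (hypothetical event)
b = {4, 5, 6}   # B: the roll is greater than 3 (hypothetical event)

def prob(event):
    """P(E) = n(E) / n(S) for one roll of a fair die."""
    return Fraction(len(event), len(die))

# P(A/B) = P(A and B) / P(B)
p_a_given_b = prob(a & b) / prob(b)

print(p_a_given_b)  # 2/3
```

Multiplying back, `prob(b) * p_a_given_b` recovers P(A and B) = 1/3, which is multiplication rule 2.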
BINOMIAL PROBABILITY DISTRIBUTION

▪ A binomial distribution consists of the outcomes of a binomial experiment and their corresponding probabilities. In binomial experiments, the outcomes are classified as either successes or failures. A binomial experiment is a probability experiment that satisfies the following requirements;

- Each trial can only have two outcomes or outcomes can be reduced to two outcomes.

- There must be a fixed number of trials.

- The outcome of each trial must be independent of each other.

- The probability of a success must remain the same for each trial.
The binomial probability distribution formula is expressed as follows;

$P(X) = \frac{n!}{(n - X)!\,X!} \cdot p^X q^{n - X}$

Where P(S) is the symbol for probability of success

P(F) is the symbol for probability of failure

p is the numerical probability of success, P(S) = p

q is the numerical probability of failure, P(F) = q = 1 - p

n is the number of trials and X is the number of successes in n trials, $0 \le X \le n$

The mean, variance and standard deviation for the binomial distribution are given by;

Mean, $\mu = n \cdot p$

Variance, $\sigma^2 = n \cdot p \cdot q$

Standard deviation, $\sigma = \sqrt{n \cdot p \cdot q}$
▪ A survey showed that 30% of teenage consumers receive their spending money from part-time jobs. If 5 teenagers are selected at random, find the probability that at least 3 of them will have part-time jobs.
Solution
▪ To find the probability that at least 3 have part-time jobs, we find the individual probabilities of 3, 4 or 5 and add them together to get the total probability.

▪ $P(3) = \frac{5!}{(5 - 3)!\,3!} \cdot (0.3)^3 (0.7)^2 = 0.132$

▪ $P(4) = \frac{5!}{(5 - 4)!\,4!} \cdot (0.3)^4 (0.7)^1 = 0.028$

▪ $P(5) = \frac{5!}{(5 - 5)!\,5!} \cdot (0.3)^5 (0.7)^0 = 0.002$

▪ Hence P(at least 3 teenagers have part-time jobs) = 0.132 + 0.028 + 0.002 = 0.162
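The same calculation can be sketched in Python with `math.comb` for the n!/((n − X)! X!) term. Without rounding each term first, the total comes out as 0.16308, which agrees with the hand calculation to within rounding.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p^x * q^(n - x), with q = 1 - p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 5, 0.3

# "At least 3" means X = 3, 4 or 5.
p_at_least_3 = sum(binomial_pmf(x, n, p) for x in (3, 4, 5))

mean = n * p                      # mu = n.p
variance = n * p * (1 - p)        # sigma^2 = n.p.q
std_dev = sqrt(variance)          # sigma = sqrt(n.p.q)

print(round(p_at_least_3, 5))  # 0.16308
```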
NORMAL DISTRIBUTIONS
NORMAL DISTRIBUTION
▪ The normal distribution is the continuous, symmetric and bell shaped distribution of a variable. The
standard normal distribution is a normal distribution with a mean of zero and standard deviation one.

▪ A standard normal curve provides both a visual and a numerical representation of how a given variable is
distributed across a population

▪ A normal distribution curve has the following properties;

- It is bell shaped

- The mean, median and mode are equal and located at the center of the distribution.

- It is unimodal; it only has one mode.

- The curve is symmetrical about the mean.

- It is continuous; there are no gaps or holes.

- The curve never touches the x-axis

- The total area under the curve is 1 (or 100%)


▪ When the data values are evenly distributed about the mean, the distribution is said to be
symmetrical and when the majority of the values fall either to the left or right of the mean, then
the distribution is skewed.
▪ When the majority of the data values fall to the right of the mean, then the distribution is said to
be negatively skewed.
▪ When the majority of the data values fall to the left of the mean, then the distribution is said to be
positively skewed.
STANDARD SCORES
▪ These are also known as Z-scores and they represent the number of standard deviations a data value falls
above or below the mean. A Z-Score is obtained by subtracting the mean from the data value and dividing the
result by the standard deviation.

▪ If the Z-score is positive, then the data value is above the mean, and if it is negative, the data value is below the mean. This helps in comparing relative standards.

▪ The symbol for Z-score is Z and the formula is expressed as;

▪ $Z = \frac{\text{Data value} - \text{Mean}}{\text{Standard deviation}}$

▪ For samples, the formula is given by;

▪ $Z = \frac{X - \bar{X}}{s}$, where X is the data value, $\bar{X}$ is the sample mean and s is the sample standard deviation.

▪ For populations, the formula is given by;

▪ $Z = \frac{X - \mu}{\sigma}$, where µ is the population mean, X is the data value and σ is the population standard deviation.
▪ A farmer harvested 65 kg of Nakati from a plot of land mulched with grass; the mean harvest for that plot was 50 kg and the standard deviation was 10. The same farmer harvested 30 kg of Nakati from a plot mulched with timber dust. The mean for this plot was 25 kg and the standard deviation was 5. Compare the farmer’s relative harvest in the two plots.

Solution

▪ For grass mulch;

▪ $Z = \frac{X - \bar{X}}{s} = \frac{65 - 50}{10} = 1.5$

▪ For the timber dust;

▪ $Z = \frac{X - \bar{X}}{s} = \frac{30 - 25}{5} = 1.0$

▪ Since the Z-score for grass is larger, the farmer’s relative harvest in the grass mulched plot is higher than the relative harvest in the timber dust mulched plot.
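The comparison can be written as a one-line helper; `z_score` is an illustrative name, and the figures are the ones from the example.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations a value lies above or below the mean."""
    return (value - mean) / std_dev

z_grass = z_score(65, 50, 10)    # grass-mulched plot
z_timber = z_score(30, 25, 5)    # timber-dust-mulched plot

print(z_grass, z_timber)  # 1.5 1.0
```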
FINDING THE AREA UNDER THE NORMAL CURVE

▪ The area under the entire normal curve, no matter its precise shape, is assigned the value
1.0. All partial areas under the normal curve are thus decimal numbers between 0 and 1 and
can be converted easily to percentages by multiplying them by 100.
▪ For solutions of problems using a standard normal distribution, a four step process is
recommended;
▪ 1. Find the Z- value

▪ 2. Draw a picture(curve)

▪ 3. Shade the desired area

▪ 4. The next step depends on the position of the shaded area on the curve.

▪ NOTE:

▪ The z-table gives the area between 0 and any z-value.

▪ All areas are positive.

▪ If the shaded area is between 0 and any z-value, look up the z-value in the z-table to get the area.
▪ Find the area under the standard normal curve between z=0 and z=2.34
▪ Solution
▪ Since the z-table gives the area between 0 and any z-value, look for 2.3 in the left column and 0.04 in the top row.
▪ The value where the column and row meet is the answer.
▪ Hence the area is 0.4904 or 49.04%.
▪ When the shaded area is in either tail of the curve, look up the z-value in the table and then subtract the area from 0.5000.
▪ Find the area to the left of z=-1.93 (z<-1.93)
Solution
▪ Again the z-table gives the area for positive z-values. But from the symmetric property, the area of the negative z-value is the same as the area of the positive z-value.
▪ Look for the area between 0 and +1.93, which is 0.4732.
▪ Subtract that area from 0.5000 and the difference is the required area.
▪ 0.5000-0.4732 = 0.0268 or 2.68%
▪ When the shaded area lies between any two z values on the same side of the mean, look up the z values to get the areas, then subtract the smaller area from the larger area.
▪ Find the area between z=2.00 and z=2.47
Solution
▪ For this situation, look for the area from z=0 to z=2.47, which is equal to 0.4932, and the area from z=0 to z=2.00, which is 0.4772.
▪ The difference is the desired area, which is 0.4932-0.4772 = 0.0160 or 1.60%.
▪ When the shaded area is to the left of a z-value that is greater than the mean, look up the z-value to get the area and add 0.5000 to the area.
▪ Find the area to the left of z=1.99
Solution
▪ Since the z-table only gives the area between z=0 and z=1.99, you add 0.5000 (one half) to the table area.
▪ The area between z=0 and z=1.99 is 0.4767.
▪ Therefore, the total area = 0.4767+0.5000 = 0.9767 or 97.67%.
▪ If the shaded area lies in both tails of the curve, look up the z values to get the areas, subtract each area from 0.5000 and add the two answers.
▪ Find the area to the right of z=2.43 and to the left of z=-3.01
Solution
▪ The area to the right of 2.43 is 0.5000-0.4925 = 0.0075.
▪ The area to the left of -3.01 is 0.5000-0.4987 = 0.0013.
▪ The total area then is 0.0075+0.0013 = 0.0088 or 0.88%.
▪ When the shaded areas are on opposite sides of the mean, look up the z values to get the areas and add up the areas.
▪ Find the area between z=1.68 and z=-1.37
Solution
▪ Since the two areas are on opposite sides of z=0, we find both areas and add them up.
▪ The area between z=0 and z=1.68 is 0.4535.
▪ The area between z=0 and z=-1.37 is 0.4147.
▪ Hence the total area is 0.4535+0.4147 = 0.8682 or 86.82%.
▪ When the shaded area is to the right of any z value and the z-value is less than the mean, look up the z value to get the area and add 0.5000 to get the total area.
EXAMPLE
▪ Find the area to the right of z=-1.16
Solution
▪ The area between z=0 and z=-1.16 is 0.3770.
▪ Hence the total area is 0.5000+0.3770 = 0.8770 or 87.70%.
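The table look-ups above can be cross-checked numerically. The sketch below does not use a z-table; instead it computes the cumulative area Φ(z) to the left of z from the error function in Python's `math` module, so results may differ from table values in the fourth decimal place. The helper names are illustrative.

```python
from math import erf, sqrt

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def area_between(z1, z2):
    """Area under the standard normal curve between two z-values."""
    low, high = sorted((z1, z2))
    return phi(high) - phi(low)

between_0_and_234 = area_between(0, 2.34)       # about 0.4904
left_tail = phi(-1.93)                          # about 0.0268
same_side = area_between(2.00, 2.47)            # about 0.0160
left_of_199 = phi(1.99)                         # about 0.9767
both_tails = (1 - phi(2.43)) + phi(-3.01)       # about 0.0088 to 0.0089
opposite_sides = area_between(-1.37, 1.68)      # about 0.8682
right_of_neg = 1 - phi(-1.16)                   # about 0.8770
```

Because `phi` already gives the full area to the left of z, the "add 0.5000" and "subtract from 0.5000" steps needed with the table are not required here.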
APPLICATIONS OF NORMAL DISTRIBUTIONS

▪ The standard normal distribution curve can be used to solve a wide variety of practical
problems.
▪ The only requirement is that the variable must be normally or approximately normally
distributed.
▪ To solve problems by using the standard normal distribution, we transform the original variable to a standard normal distribution variable by using the formula;
▪ $Z = \frac{\text{Data value} - \text{Mean}}{\text{Standard deviation}}$ or $Z = \frac{X - \mu}{\sigma}$
▪ The mean number of hours a year two student spends on a computer is 3.1 hours per day. Assume the standard deviation is 0.5 hours. Find the percentage of students who spend less than 3.5 hours on the computer. Assume the variable is normally distributed.
Solution
▪ Step 1: Draw the figure showing the area.
▪ Step 2: Find the z-value corresponding to 3.5: $Z = \frac{X - \mu}{\sigma} = \frac{3.5 - 3.1}{0.5} = 0.80$
▪ Step 3: Find the area using the z-table. The area between z=0 and z=0.8 is 0.2881.
▪ Since the area under the curve to the left of z=0.8 is desired, we add 0.5000 to 0.2881 to get 0.7881.
▪ Hence, 78.81% of year two students spend less than 3.5 hours per day on the computer.
▪ Each week, an average household earns 28 pounds from backyard gardening. Assume the standard deviation is 2 pounds. If a household is selected at random, find the probability of it generating between 27 and 31 pounds per week.
Solution
▪ We look for two z-values:
▪ $Z_1 = \frac{X - \mu}{\sigma} = \frac{27 - 28}{2} = -\frac{1}{2} = -0.5$
▪ $Z_2 = \frac{X - \mu}{\sigma} = \frac{31 - 28}{2} = \frac{3}{2} = 1.5$
▪ We then find the appropriate area using the z-table.
▪ The area between z=0 and z=-0.5 is 0.1915.
▪ The area between z=0 and z=1.5 is 0.4332.
▪ The desired area is 0.1915+0.4332 = 0.6247.
▪ The probability that a randomly selected household earns between 27 and 31 pounds is 62.47%.
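Both worked examples can be reproduced with `statistics.NormalDist` from the Python standard library (Python 3.8 or later), which handles the conversion to z-values internally. This is a sketch for checking the table-based answers, not the method used in the slides.

```python
from statistics import NormalDist

# Hours on the computer: mean 3.1, standard deviation 0.5.
hours = NormalDist(mu=3.1, sigma=0.5)
p_less_than_3_5 = hours.cdf(3.5)             # area to the left of X = 3.5

# Weekly earnings: mean 28, standard deviation 2.
earnings = NormalDist(mu=28, sigma=2)
p_27_to_31 = earnings.cdf(31) - earnings.cdf(27)

print(round(p_less_than_3_5, 4), round(p_27_to_31, 4))  # 0.7881 0.6247
```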
Calculate the sample statistics for College Station precipitation.
Determine the probability of receiving rainfall between 35 and 40 inches in 1980 using the rainfall data given. Assume it is normally distributed.

Year    Precipitation (inches)
1970    33.9
1971    31.7
1972    31.5
1973    59.6
1974    50.5
1975    38.8
1976    43.4
1977    28.7
1978    32.0
1979    51.8
CONFIDENCE INTERVALS
▪ A confidence interval, in statistics, is a range of values within which a population parameter is expected to fall for a certain proportion of times.

▪ Confidence intervals measure the degree of uncertainty or certainty in a sampling method.

▪ Since populations are usually big, the values got from samples are just estimates of the true parameter.

▪ An important question in estimation is that of the sample size: how large should it be to make an accurate estimate?

▪ A point estimate is therefore used to infer the population mean from the sample mean.

▪ A point estimate is a specific numerical value estimate of a parameter. The best point estimate of the population mean µ is the sample mean $\bar{X}$.

▪ Sample means, rather than modes or medians, are used in estimating the population mean simply because means of samples vary less than other statistics (mode and median).

▪ Sample measures used to estimate population measures are called estimators.


Properties of a good Estimator

▪ It should be unbiased

▪ It should be consistent

▪ It should be relatively efficient (it should have the smallest variance)


▪ An interval estimate of a parameter is an interval or range of values used to estimate the parameter.
▪ E.g., the interval estimate for the average age of year two students is 26.9 < µ < 27.7, or 27.3 ± 0.4 years.
▪ The confidence level of an interval estimate of a parameter is the probability that the interval estimate will contain the parameter, assuming that a large number of samples are selected.
▪ A confidence interval (CI) is therefore a specific interval estimate of a parameter determined by using data obtained from the sample and by using the specific confidence level of the estimate.
▪ Three common CIs are used: 90%, 95% and 99% CIs.
▪ The central limit theorem states that when the sample size is large, approximately 95% of the sample means will fall within ±1.96 standard errors of the population mean;
▪ $\mu \pm 1.96 \frac{\sigma}{\sqrt{n}}$
▪ If a specific sample mean $\bar{X}$ is selected, there is a 95% probability that it falls within the range of $\mu \pm 1.96 \frac{\sigma}{\sqrt{n}}$.

▪ Likewise, there is a 95% probability that the interval specified by $\bar{X} \pm 1.96 \frac{\sigma}{\sqrt{n}}$ will contain µ, which can also be stated as;
▪ $\bar{X} - 1.96 \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96 \frac{\sigma}{\sqrt{n}}$

▪ The value used for the 95% CI is obtained from the bottom of the t-distribution table.

▪ Since other CIs are also used, the symbol $z_{\alpha/2}$ is used in the general formula.

▪ $\alpha$ is the error term (1 - confidence level) in decimals.

▪ At 95% CL, $\alpha$ = 1-0.95 = 0.05, at 99% CL, $\alpha$ = 0.01 and at 90% CL, $\alpha$ = 0.1
▪ The general formula for CI is given by;
▪ $\bar{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$

▪ However, this formula only works when σ is known or n ≥ 30.

▪ For 90% CL, $z_{\alpha/2}$ = 1.65; for 95% CL, $z_{\alpha/2}$ = 1.96 and for 99% CL, $z_{\alpha/2}$ = 2.58.
▪ The term $z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$ is called the maximum error of estimate.

▪ This is the likely difference between the point estimate of a parameter and the actual value of the parameter.
Example
▪ A survey of 30 adults showed that the mean age of a person’s vehicle is 5.6 years. Assuming the standard deviation of the population is 0.8 years, find the best point estimate of the population mean and the 99% confidence interval of the population mean.

▪ Solution

▪ The best point estimate of the population mean is 5.6 years.

▪ $\bar{X} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$

▪ $5.6 - 2.58 \frac{0.8}{\sqrt{30}} < \mu < 5.6 + 2.58 \frac{0.8}{\sqrt{30}}$

▪ 5.6-0.38 < µ < 5.6+0.38

▪ 5.22 < µ < 5.98

▪ Hence, one can be 99% confident that the mean age of all vehicles is between 5.2 and 6.0 years, based on 30 vehicles.
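The interval in this example can be reproduced with a few lines of Python. The function name is illustrative, and the critical value 2.58 is passed in by hand as in the slides rather than computed.

```python
from math import sqrt

def z_confidence_interval(sample_mean, sigma, n, z):
    """Return (lower, upper) = sample_mean -/+ z * sigma / sqrt(n)."""
    max_error = z * sigma / sqrt(n)   # maximum error of estimate
    return sample_mean - max_error, sample_mean + max_error

# Vehicle-age example: mean 5.6, sigma 0.8, n = 30, 99% CL so z = 2.58.
lower, upper = z_confidence_interval(5.6, 0.8, 30, 2.58)

print(round(lower, 2), round(upper, 2))  # 5.22 5.98
```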
Sample size

 Sample size determination is closely related to statistical estimation.


 How large a sample size should be depends on the maximum error
of estimate, the population standard deviation and the degree of
confidence.
 The sample size formula is derived from the maximum error of
estimate formula: $E = z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$

 Solving for n: $E\sqrt{n} = z_{\alpha/2}\,\sigma$

 Hence $n = \left(\dfrac{z_{\alpha/2}\,\sigma}{E}\right)^{2}$
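 A minimal sketch of the sample size formula, assuming Python's standard library. The function name and the numbers in the usage line (σ = 0.8, E = 0.2, 99% confidence) are hypothetical and do not come from the slides. The result is rounded up because a fractional observation cannot be sampled.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_mean(sigma, max_error, confidence):
    """Smallest n for which z_{alpha/2} * sigma / sqrt(n) <= max_error."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n = (z * sigma / max_error) ** 2
    return ceil(n)  # round up so the error does not exceed max_error


# Hypothetical inputs: sigma = 0.8, desired maximum error 0.2, 99% confidence
print(sample_size_for_mean(0.8, 0.2, 0.99))
```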
Confidence interval for the Mean when 𝝈 is unknown and n<30
 When the population standard deviation is not known and the sample size is less than 30,
the sample standard deviation can be used and for this case, a t-distribution is used.

 Characteristics of the t Distribution

 It is bell shaped

 It is symmetric about the mean

 The mean, mode and median are equal to 0 and located at the center of the
distribution.

 The curve never touches the x-axis.

 The variance is greater than one.

 It is a family of curves based on the concept of degrees of freedom, which is related to
sample size.

 As the sample size increases, the t distribution approaches the standard normal
distribution.
The t Distribution
 Degrees of freedom (d.f) are the number of values that are free to vary after a
sample statistic has been computed.

 For example, if the mean of 10 values is 20, then 9 out of the 10 values are free to
vary; the tenth is fixed because the values must sum to 200.

 Hence the degrees of freedom are 10-1=9 (subtracting 1 from the sample size; d.f
= n-1).

 The formula for a specific confidence interval for the mean when σ is unknown
and n<30 is:
 $\bar{X} - t_{\alpha/2}\,\dfrac{s}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2}\,\dfrac{s}{\sqrt{n}}$
Example 1

 Find the $t_{\alpha/2}$ value for a 95% confidence interval when the sample size is 22.

 Solution

 The d.f = 22-1=21

 Look for 21 in the left column of the t distribution table and 95% in the row
labeled confidence intervals.

 The intersection where the two meet gives the value for $t_{\alpha/2}$, which is 2.080.

 Note: At the bottom of the t distribution table, where d.f = ∞, the $z_{\alpha/2}$ values can be
found.
Example 2

 Ten randomly selected plants were studied and their height was measured. The
mean height was 0.32 cm and the standard deviation was 0.08 cm. Find the 95% CI of
the mean height, assuming that the variable is approximately normally distributed.

 Solution

 Since 𝜎 is unknown, s replaces it and d.f = 10-1=9

 The t distribution is used and with 9 d.f at 95% CL, $t_{\alpha/2}$ = 2.262
 Using $\bar{X} - t_{\alpha/2}\,\dfrac{s}{\sqrt{n}} < \mu < \bar{X} + t_{\alpha/2}\,\dfrac{s}{\sqrt{n}}$,

 $0.32 - (2.262)\,\dfrac{0.08}{\sqrt{10}} < \mu < 0.32 + (2.262)\,\dfrac{0.08}{\sqrt{10}}$

 0.32 − 0.057 < µ < 0.32 + 0.057

 0.26 < µ < 0.38

 Hence, one can be 95% confident that the mean height of the population lies between 0.26 and
0.38 cm, based on a sample of ten plants.
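 The same steps can be sketched in Python. The function name `t_confidence_interval` is an assumption for illustration. Python's standard library has no t distribution, so the critical value 2.262 is passed in as read from the t table (9 d.f, 95% CL).

```python
from math import sqrt


def t_confidence_interval(sample_mean, s, n, t_crit):
    """Return (lower, upper) for the mean, using a t critical value from a table."""
    max_error = t_crit * s / sqrt(n)  # t_{alpha/2} * s / sqrt(n)
    return sample_mean - max_error, sample_mean + max_error


# Plant-height example: mean 0.32 cm, s = 0.08 cm, n = 10, t = 2.262 (9 d.f, 95%)
lower, upper = t_confidence_interval(0.32, 0.08, 10, t_crit=2.262)
print(round(lower, 2), round(upper, 2))
```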
Confidence intervals and sample size for proportions

 A proportion represents a part of a whole. It can be expressed as a fraction,
decimal or percentage.

 Proportions also represent probabilities and they can be obtained from samples
or populations.

 The following symbols will be used;

 p – population proportion

 $\hat{p}$ (p hat) – sample proportion

 For the sample proportion, $\hat{p} = \dfrac{x}{n}$ and $\hat{q} = \dfrac{n-x}{n}$ or $1 - \hat{p}$

 Where x is the number of sample units that possess the characteristics of interest
and n is the sample size.
For example

 In a study, 200 people were asked if they were satisfied with their
jobs; 162 said they were.
 In this case, n = 200, x = 162, $\hat{p} = \dfrac{x}{n} = \dfrac{162}{200} = 0.81$ or 81%.

 81% of the people were satisfied with their jobs.

 The proportion of people who did not respond favourably when
asked is $\hat{q} = 1 - \hat{p}$ = 1 − 0.81 = 0.19 or 19%.
Confidence intervals for proportions

 To construct a confidence interval about a proportion, one must
use the maximum error of estimate: $E = z_{\alpha/2}\sqrt{\dfrac{\hat{p}\hat{q}}{n}}$

 Confidence intervals about a proportion must meet the criteria np ≥ 5 and nq ≥ 5

 The formula for a specific CI for a proportion is:

 $\hat{p} - z_{\alpha/2}\sqrt{\dfrac{\hat{p}\hat{q}}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\dfrac{\hat{p}\hat{q}}{n}}$

 We round off to three decimal places.


Example
 A sample of 500 Food Technician applications included 60 from men. Find the 90% CI of the
true proportion of men who applied for the position.

 Solution

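 Since α = 1 − 0.90 = 0.10, $z_{\alpha/2}$ = 1.65

 Substituting in the formula $\hat{p} - z_{\alpha/2}\sqrt{\dfrac{\hat{p}\hat{q}}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\dfrac{\hat{p}\hat{q}}{n}}$

 $\hat{p} = \dfrac{x}{n} = \dfrac{60}{500} = 0.12$, $\hat{q} = 1 - \hat{p}$ = 1 − 0.12 = 0.88

 $0.12 - 1.65\sqrt{\dfrac{(0.12)(0.88)}{500}} < p < 0.12 + 1.65\sqrt{\dfrac{(0.12)(0.88)}{500}}$

 0.12 − 0.024 < p < 0.12 + 0.024

 0.096 < p < 0.144

 9.6% < p < 14.4%

 Hence, one can be 90% confident that the percentage of applicants who were men is
between 9.6% and 14.4%.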
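 A short Python sketch of the proportion interval, assuming the standard library's `statistics.NormalDist` for the critical value (about 1.645 instead of the table's rounded 1.65). The function name is illustrative only. The check for at least 5 expected successes and failures follows the criteria stated earlier.

```python
from math import sqrt
from statistics import NormalDist


def proportion_confidence_interval(x, n, confidence):
    """Return (lower, upper) for a population proportion."""
    p_hat = x / n
    q_hat = 1 - p_hat
    # The normal approximation needs at least 5 expected successes and failures
    if n * p_hat < 5 or n * q_hat < 5:
        raise ValueError("need n*p_hat >= 5 and n*q_hat >= 5")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    max_error = z * sqrt(p_hat * q_hat / n)
    return p_hat - max_error, p_hat + max_error


# Applications example: 60 men among 500 applications, 90% confidence
lower, upper = proportion_confidence_interval(60, 500, 0.90)
print(round(lower, 3), round(upper, 3))
```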
Sample size for proportions

 This is derived from $E = z_{\alpha/2}\sqrt{\dfrac{\hat{p}\hat{q}}{n}}$

 $n = \hat{p}\hat{q}\left(\dfrac{z_{\alpha/2}}{E}\right)^{2}$
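 The formula can be sketched as follows. The function name and the inputs in the usage line (a prior estimate of 0.12, a maximum error of 0.02, 90% confidence) are hypothetical values chosen for illustration; they are not from the slides.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(p_hat, max_error, confidence):
    """Smallest n that keeps the maximum error of estimate within max_error."""
    q_hat = 1 - p_hat
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(p_hat * q_hat * (z / max_error) ** 2)  # always round up


# Hypothetical inputs: prior estimate 0.12, maximum error 0.02, 90% confidence
print(sample_size_for_proportion(0.12, 0.02, 0.90))
```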
Confidence intervals for Variance and Standard Deviations
 To calculate these CIs, we use the chi-square distribution.

 The chi-square variable is similar to the t variable in that its distribution is a family of curves
based on the number of degrees of freedom.

 The symbol for chi-square is $\chi^{2}$

 A $\chi^{2}$ variable cannot be negative, and the distributions are positively skewed.


 Two different values are used in the formula for CI. One value is found on the left
side of the table and the other on the right.

 For example, to find the table values corresponding to the 95% CI, one must first
change 95% to a decimal and subtract from 1 (1-0.95 = 0.05).

 Then divide the answer by 2 (α/2 = 0.05/2 = 0.025). This is the column on the right side
of the table used to get values for $\chi^{2}_{\text{right}}$.

 To get values for $\chi^{2}_{\text{left}}$, subtract the value of α/2 from 1 (1 − 0.05/2 = 0.975).

 Finally, find the appropriate row corresponding to the degrees of freedom (n-1)

 A similar procedure is used for a 90 and 99% CI.


Example

 Find the values for $\chi^{2}_{\text{right}}$ and $\chi^{2}_{\text{left}}$ for a 90% CI when n=25.

 Solution

 To find $\chi^{2}_{\text{right}}$, subtract 1 − 0.90 = 0.10 and divide it by 2 to get 0.05.

 To find $\chi^{2}_{\text{left}}$, subtract 1 − 0.05 to get 0.95.

 Hence, use the 0.95 and 0.05 columns and the row corresponding to
24 d.f.

 $\chi^{2}_{\text{right}}$ = 36.415

 $\chi^{2}_{\text{left}}$ = 13.848
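Formula for Confidence Interval for a Variance and
Standard Deviation

 For Variance, the CI is given by $\dfrac{(n-1)s^{2}}{\chi^{2}_{\text{right}}} < \sigma^{2} < \dfrac{(n-1)s^{2}}{\chi^{2}_{\text{left}}}$

 For Standard Deviation, the CI is expressed as $\sqrt{\dfrac{(n-1)s^{2}}{\chi^{2}_{\text{right}}}} < \sigma < \sqrt{\dfrac{(n-1)s^{2}}{\chi^{2}_{\text{left}}}}$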

 Example

 Find the 95% confidence interval for the variance and standard deviation of the
nitrogen content of a fertilizer if a sample of 20 packs has a standard deviation of
1.6 milligrams.
Solution

 Since α = 0.05, α/2 = 0.05/2 = 0.025 and the other value is 0.975.

 The d.f is 20 − 1 = 19

 The corresponding values of $\chi^{2}_{\text{right}}$ and $\chi^{2}_{\text{left}}$ are 32.852 and 8.907 respectively.
 Substituting in the formula for variance, $\dfrac{(n-1)s^{2}}{\chi^{2}_{\text{right}}} < \sigma^{2} < \dfrac{(n-1)s^{2}}{\chi^{2}_{\text{left}}}$

 $\dfrac{(20-1)(1.6)^{2}}{32.852} < \sigma^{2} < \dfrac{(20-1)(1.6)^{2}}{8.907}$

 1.5 < σ² < 5.5

 Hence, one can be 95% confident that the true variance for the nitrogen content lies between
1.5 and 5.5.
 For standard deviation,

 $\sqrt{\dfrac{(n-1)s^{2}}{\chi^{2}_{\text{right}}}} < \sigma < \sqrt{\dfrac{(n-1)s^{2}}{\chi^{2}_{\text{left}}}}$

 $\sqrt{1.5} < \sigma < \sqrt{5.5}$

 1.2 < σ < 2.3

 Hence, one can be 95% confident that the true standard deviation for the nitrogen
content lies between 1.2 and 2.3 milligrams based on a sample of 20 packs.
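 The fertilizer example can be checked with a few lines of Python. The function name is an assumption for illustration. The two chi-square values are passed in as read from the table (19 d.f), since the standard library has no chi-square distribution.

```python
from math import sqrt


def variance_confidence_interval(s, n, chi2_right, chi2_left):
    """Return (lower, upper) for the population variance."""
    numerator = (n - 1) * s ** 2
    # Dividing by the larger table value gives the lower limit
    return numerator / chi2_right, numerator / chi2_left


# Fertilizer example: s = 1.6 mg, n = 20, table values for 19 d.f at 95%
var_lo, var_hi = variance_confidence_interval(1.6, 20, chi2_right=32.852, chi2_left=8.907)
sd_lo, sd_hi = sqrt(var_lo), sqrt(var_hi)  # limits for the standard deviation
print(round(var_lo, 1), round(var_hi, 1))
print(round(sd_lo, 1), round(sd_hi, 1))
```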
HYPOTHESIS TESTING
Hypothesis Testing

 One of the cornerstones of the scientific method of
investigation is the clear statement of the problem through the
formulation of hypotheses.
 This is usually followed by a clear statement of the nature of the
research which will be involved in hypothesis testing and a clear
decision rule.
 There are 3 methods used in hypothesis testing and these are;
 1. The traditional method
 2. The p-value method
 3. The confidence interval method
The traditional method of Hypothesis Testing

 This has been used since the hypothesis testing method was formulated.

 Steps taken in Hypothesis testing using the traditional method

 Every hypothesis-testing situation begins with the statement of a hypothesis.

 There are two types of hypotheses for each situation, the null hypothesis and the
alternative hypothesis.

 The null hypothesis symbolized by H0, is a statistical hypothesis that states that
there is no difference between a parameter and a specific value, or that there is
no difference between two parameters.

 The alternative hypothesis symbolized by H1, is a statistical hypothesis that states
that there is a difference between a parameter and a specific value, or states
that there is a difference between two parameters.
 An illustration on how hypotheses should be stated is shown in three different
studies;
 SITUATION A
 A medical researcher is interested in finding out whether a new medication will
have any undesirable side effects by looking at the pulse rate of patients who
take the medication. Will the pulse rate increase, decrease or remain the same
after the patient takes the medication?
 The researcher knows that the mean pulse rate for the population under study
is 82 beats per minute, so the hypotheses for this situation are;
 H0 : µ = 82 and H1 : µ ≠ 82
 The null hypothesis specifies that the mean will remain unchanged, and the
alternative hypothesis states that it will be different.
 This test is called a two-tailed test since the possible side effects could either
raise or lower the pulse rate.
SITUATION B

 A chemist invents an additive to increase the shelf life of food. If the mean
shelf life is 36 days, then her hypotheses are;

 H0 : µ ≤ 36 and H1 : µ > 36

 In this situation, the chemist is only interested in increasing the food shelf life,
so her alternative hypothesis is that the mean shelf life will be greater than 36
days.

 This test is called a right-tailed test since it only looks at an increase.


SITUATION C

 A contractor wishes to lower heating bills by using a special type of
insulation. If the average monthly heating bill is $78, her
hypotheses are;

 H0 : µ ≥ $78 and H1 : µ < $78

 This test is a left-tailed test, since the contractor is interested only in
lowering heating costs.
Symbols used in stating hypotheses

 Equal to =
 Not equal to ≠
 Greater than >
 Less than <
 Greater than or equal to ≥
 Less than or equal to ≤

 For a two-tailed test, H0 : µ = k and H1 : µ ≠ k
 For a right-tailed test, H0 : µ ≤ k and H1 : µ > k
 For a left-tailed test, H0 : µ ≥ k and H1 : µ < k
 Where k is any specified number.
 A statistical test uses data obtained from a sample to make a decision about
whether the null hypothesis should be rejected or not.
 The numerical value obtained from a statistical test is called the test value.
 In this type of statistical test, the sample mean is computed and compared to
the population mean.
 Then a decision is made to reject or not to reject the null hypothesis based on
the value obtained from the statistical test.
 If the difference is significant, the null hypothesis is rejected and if not, the null
hypothesis is not rejected.
 There are always two possibilities for a correct decision and two possibilities for an
incorrect decision.
 Type I error occurs if one rejects the null hypothesis when it is true.
 Type II error occurs if one does not reject the null hypothesis when it is false.
Type I and type II errors
 The level of significance is the maximum probability of committing a type I error.
 This probability is symbolized by α; that is, P(type I error) = α.

 The probability of a type II error is symbolized by β but in most hypothesis testing situations, β cannot be easily computed.

 However, α and β are related in that decreasing one increases the other.

 Statisticians generally agree on using three arbitrary significance levels: the 0.10, 0.05, and 0.01 levels. If the null hypothesis is
rejected, the probability of a type I error will be 10%, 5% or 1%, depending on the significance level used.

 After the significance level is chosen, a critical value (CV) is selected from a table for the appropriate test. If a z-test is used, the
z-table is consulted to find the critical value.

 The critical value determines the critical and noncritical regions. It separates the critical region from the noncritical region.

 The critical or rejection region is the range of values of the test value that indicates that there is a significant difference and that
the null hypothesis should be rejected.

 The noncritical or nonrejection region is the range of values of the test value that indicates that the difference was probably due
to chance and that the null hypothesis should not be rejected.
 The CV can be on the right side of the mean or on the left side for a one-tailed
test and this depends on the inequality sign of the alternative hypothesis.

 A one-tailed test indicates that the null hypothesis should be rejected when the
test value is in the critical region on one side of the mean.

 In a two-tailed test, the null hypothesis should be rejected when the test value is
in either of the two critical regions.

 For a two-tailed test, the critical region must be split into two equal parts. If α
= 0.01, then one-half of the area, 0.005, must be to the right of the mean and the
other half to the left of the mean.

 In this case, the area to be found in the z-table is (0.5000-0.005)=0.4950 and the
critical values are +2.58 and -2.58.
Procedure for finding CVs for specific values using the z-table

 Step 1: Draw the figure and indicate the appropriate area.

 Step 2: For a one-tailed test, subtract the area equivalent to α from 0.5000. For
the two-tailed test, subtract the area equivalent to α/2 from 0.5000.

 Step 3: Look for the area in the z-table corresponding to the value obtained in step
2. If the exact value cannot be found in the table, use the closest value.

 Step 4: Find the z-value that corresponds to the area; this will be the critical
value.

 Step 5: Determine the sign of the CV. For a one-tailed test, if the test is left-tailed,
the CV will be negative and if the test is right-tailed, the CV will be positive. For a
two-tailed test, one value will be positive and the other negative.
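 As a cross-check on the table lookup, the critical values can be computed directly. This is a sketch that assumes Python's standard library `statistics.NormalDist`; the function name and the `tail` argument are illustrative choices, not part of the procedure above. Exact values differ slightly from table values: for α = 0.05 one-tailed the exact value is about 1.645, which the table lookup rounds to 1.65.

```python
from statistics import NormalDist


def z_critical_values(alpha, tail):
    """Return the critical value(s) of a z test; tail is 'left', 'right' or 'two'."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if tail == "two":
        z = std_normal.inv_cdf(1 - alpha / 2)  # alpha is split between both tails
        return (-z, z)
    if tail == "right":
        return (std_normal.inv_cdf(1 - alpha),)  # positive critical value
    if tail == "left":
        return (std_normal.inv_cdf(alpha),)  # negative critical value
    raise ValueError("tail must be 'left', 'right' or 'two'")


print(z_critical_values(0.01, "two"))
print(z_critical_values(0.05, "right"))
```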
z Test for a mean

 Many hypotheses are tested using a statistical test based on the following general
formula:
 Test value = $\dfrac{(\text{observed value}) - (\text{expected value})}{\text{standard error}}$

 The z test is a statistical test for the mean of a population. It can be used when n ≥ 30.
 $z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$

 Where
 $\bar{X}$ = sample mean,
 µ = hypothesized population mean,
 σ = population standard deviation,
 n = sample size
Procedure for hypothesis testing using the traditional
method

 Step 1: State the hypotheses and identify the claim

 Step 2: Find the CVs from the appropriate table

 Step 3: Compute the test value

 Step 4: Make the decision to reject or not to reject the null hypothesis

 Step 5: Summarize the results.


Example
 A researcher reports that the average salary of junior agronomists is more than
$42,000. A sample of 30 junior agronomists has a mean salary of $43,260. At α =
0.05, test the claim that junior agronomists earn more than $42,000 a year. The standard
deviation of the population is $5230.

 Solution

 Step 1: State the hypotheses and identify the claim

H0 : µ ≤ $42,000 and H1 : µ > $42,000 (claim)

 Step 2: Find the CV. Since α = 0.05 and the test is a right-tailed test, the CV is z=+1.65

 Step 3: Compute the test value

 $z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \dfrac{43{,}260 - 42{,}000}{5230/\sqrt{30}} = 1.32$
 Step 4: Make the decision. Since the test value, +1.32, is less than the critical value, +1.65,
and is not in the critical region, the decision is not to reject the null hypothesis.

 Step 5: Summarize the results. There is not enough evidence to support the claim that
junior agronomists earn more than $42,000 a year
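 The test value and decision can be reproduced in Python. This is a sketch; the function name `z_test_value` is an assumption, and the critical value 1.65 is the table value used in step 2.

```python
from math import sqrt


def z_test_value(sample_mean, mu0, sigma, n):
    """Test value: (observed - expected) / standard error."""
    return (sample_mean - mu0) / (sigma / sqrt(n))


# Salary example: sample mean 43,260, hypothesized mean 42,000, sigma 5230, n = 30
z = z_test_value(43260, 42000, 5230, 30)
critical_value = 1.65             # right-tailed test at alpha = 0.05 (from the z-table)
reject_null = z > critical_value  # reject only if z falls in the critical region
print(round(z, 2), reject_null)
```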
p-Value method for Hypothesis Testing

 The p-value (probability value) is the probability of getting a sample statistic (e.g. the
mean) at least as extreme as the one observed, in the direction of the alternative hypothesis,
when the null hypothesis is true.

 The p-values of the z-test can be found by use of the z-table.

 First find the area under the normal distribution curve corresponding to the z-value,
then subtract this area from 0.5000 to get the p-value.

 Note: To get the p-value for a two-tailed test, we double the area after subtracting.

 Decision Rule when using a p-value

 If the p-value ≤ α, reject the null hypothesis

 If the p-value> α, do not reject the null hypothesis


Procedure for hypothesis testing using the p-value method

 Step 1: State the hypotheses and identify the claim

 Step 2: Compute the test value

 Step 3: Find the p-value

 Step 4: Make the decision to reject or not to reject the null hypothesis

 Step 5: Summarize the results.


Example
 A researcher wishes to test the claim that the average age of year two students is greater
than 24 years. She selects a sample of 36 students and finds the mean of the sample to be
24.7 years, with a standard deviation of 2 years. Is there evidence to support the claim at
α=0.05? Use the p-value method.
 Solution
 Step 1: H0 : µ ≤ 24 and H1 : µ > 24 (claim)
 Step 2: $z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \dfrac{24.7 - 24}{2/\sqrt{36}} = 2.10$

 Step 3: Find the p-value. Using the z-table, the corresponding area for z=2.10 is 0.4821. Subtract this value
from 0.5000 to get the area in the right tail. 0.5000-0.4821 = 0.0179
Hence the p-value is 0.0179
 Step 4: Since the p-value is less than α (0.05), the decision is to reject the null hypothesis.
 Step 5: There is enough evidence to support the claim that the average age of year two students is greater
than 24 years.
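 The p-value can also be computed without a table. A minimal sketch, assuming the standard library's `statistics.NormalDist`; the function name is illustrative. For a right-tailed test the p-value is the area to the right of the test value, that is, 1 minus the cumulative area up to z.

```python
from math import sqrt
from statistics import NormalDist


def right_tailed_p_value(sample_mean, mu0, sd, n):
    """Return (z, p) for a right-tailed z test of the mean."""
    z = (sample_mean - mu0) / (sd / sqrt(n))
    p_value = 1 - NormalDist().cdf(z)  # area in the right tail beyond z
    return z, p_value


# Student-age example: mean 24.7, hypothesized mean 24, standard deviation 2, n = 36
z, p_value = right_tailed_p_value(24.7, 24, 2, 36)
alpha = 0.05
print(round(z, 2), round(p_value, 4), p_value <= alpha)
```

For a two-tailed test the tail area would be doubled, as noted earlier.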
THE CHI-SQUARE STATISTIC
THE CHI-SQUARE TEST
 In some research situations one may be confronted with qualitative variables
where the Normal Distribution is meaningless because of the small sizes of
samples involved, or because the quantities being measured cannot be
quantified in exact terms.

 Qualities such as marital status, stem color of a plant, sex of the respondents
being observed, etc., are the relevant variables; instead of exact numbers which
we have used to measure variables all our lives.

 Chi-square is a distribution-free statistic which works at lower levels of
measurement like the nominal and ordinal levels.
The Chi Square Distribution

 The chi-square distribution is used extensively in testing the
independence of any two distributions.
 The question of interest is whether or not the observed proportions
of a qualitative attribute in a specific experimental/survey situation
are identical to those of an expected distribution.
 For this comparison to be done, the observed frequencies of the subject
under study are compared to those that would be expected in similar
situations, which are called the expected frequencies.
 The differences are then noted and manipulated statistically in
order to test whether such differences are significant.
 The expression for the Chi-Square statistic is: $\chi^{2} = \sum_{i=1}^{k} \dfrac{(O_i - E_i)^{2}}{E_i}$
Where:
 The symbol χ is the Greek letter chi. The expression “χ²” is read “chi-squared”; it is the
test statistic used in statistical testing and analysis of qualitative samples.

$O_i$ = The observed frequency of the characteristic of interest.

$E_i$ = The expected frequency of the characteristic of interest.

k = The number of classes, each pairing an observed frequency $O_i$ with an
expected frequency $E_i$.

i = The index of an individual pair of observed and expected frequencies.

 The chi-square statistic is such that if the observed values differ very little from the
expected values, then the chi-square value is very small. On the other hand, when the
differences are large, the chi-square value is also very large.

 This means that the mode of this distribution tends to be located among the fewest
observations.
An outline of the general procedure for a chi-squared test

 State the null and the alternative hypotheses

 Get the Chi-square tabulated using the chi-square table and degrees of
freedom, where degrees of freedom = n − 1 and n is the number of classes
 Get the Chi-square calculated using the formula $\chi^{2} = \sum_{i=1}^{k} \dfrac{(O_i - E_i)^{2}}{E_i}$

 Compare Chi-square tabulated and Chi-squared calculated

 If the Chi-squared calculated is greater than chi-squared tabulated,
reject the null hypothesis

 If the Chi-squared calculated is less than chi-squared tabulated, do not
reject the null hypothesis.
Example
 In a trial production run, 5 machines produced the following quantities of
defective items:

Machine   No. of defective items
A         25
B         24
C         20
D         28
E         23
Total     120

 Are the machines equally efficient?


Solution
 We compute the chi-square value and compare it with the tabulated value at the
0.05 level of significance.

 H0: All machines are equally efficient

 H1: Not all machines are equally efficient.


 Expected value, E = µ = $\dfrac{120}{5}$ = 24

Machine   No. of defective items (O)   (O − E)          (O − E)²
A         25                           (25−24) = 1      1
B         24                           (24−24) = 0      0
C         20                           (20−24) = −4     16
D         28                           (28−24) = 4      16
E         23                           (23−24) = −1     1
Total     120                                           34

 ∑(O − E)² = 34

 $\chi^{2} = \sum_{i=1}^{k} \dfrac{(O_i - E_i)^{2}}{E_i}$

 χ² = 34/24, since every expected value equals 24

 χ² calculated = 1.42

 To get χ² tabulated, we use degrees of freedom and since we have
five classes, the degrees of freedom is 5 − 1 = 4.

 Since the level of significance is 0.05, χ² tabulated is 9.49.

 Since χ² calculated is less than χ² tabulated, we do not reject the null hypothesis;
there is not enough evidence to conclude that the machines differ in efficiency.
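 The machine example can be reproduced with a short Python sketch. The function name is an assumption for illustration; the tabulated value 9.49 (4 d.f, 0.05 level) is taken from the solution above.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [25, 24, 20, 28, 23]  # defective items for machines A to E
# Under H0 every machine is expected to produce the same number of defectives
expected = [sum(observed) / len(observed)] * len(observed)

chi2_calculated = chi_square_statistic(observed, expected)
chi2_tabulated = 9.49            # 4 degrees of freedom, 0.05 level of significance
reject_null = chi2_calculated > chi2_tabulated
print(round(chi2_calculated, 2), reject_null)
```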
CORRELATION AND REGRESSION
▪ Another area of inferential statistics involves determining whether a relationship between two or
more variables exists.
▪ Correlation is a statistical method used to determine whether a relationship between variables
exists while a regression describes the nature of this relationship (whether positive, negative,
linear or nonlinear).
▪ This topic gives a statistical answer to the following questions;

1. Are two or more variables related?


2. If so, what is the strength of the relationship?
3. What type of relationship exists?
4. What kind of predictions can be made from the relationship?
▪ To answer the first two questions, a correlation coefficient is used. For example, there are many
variables that contribute to heart disease such as smoking, lack of exercise, heredity and diet. Of all
of these, some contribute more than others and therefore variables with higher coefficients are
regarded as more important than others.
▪ To answer the third question, we ascertain the type of relationship and there are two
types of relationships: simple and multiple.
▪ In a simple relationship, there are two variables; an independent variable and a
dependent variable.
▪ A simple relationship analysis is called simple regression and there is only one
independent variable that is used to predict the dependent variable.
▪ In a multiple regression (multiple relationship), two or more independent variables
predict one dependent variable.
▪ Simple relationships can also be positive or negative. A positive relationship exists
when both variables increase or decrease at the same time. In a negative relationship,
one variable increases as the other variable decreases and vice versa.
▪ Finally, the fourth question asks what type of predictions can be made. Predictions are
made daily and in all areas.
SCATTER PLOTS
▪ In simple correlation and regression studies, the researcher collects data on two variables to see
whether a relationship exists.
▪ For example, if a researcher wishes to see whether there is a relationship between number of
hours of study and test scores on an exam, she must select a random sample of students,
determine the hours studied and obtain their grades on the exam as shown below;
▪ The two variables of this study are the independent variable (can be controlled or manipulated) and the
dependent variable (cannot be controlled or manipulated).
▪ The number of hours of study is the independent variable also referred to as the x variable. The
grade that the student receives is the dependent variable also referred to as the y variable.
▪ This is because the grade a student receives depends on the number of hours of study.
Student   Hours of study (x)   Grade (y)
A         6                    82
B         2                    63
C         1                    57
D         5                    88
E         2                    68
▪ The dependent and independent variables can be plotted on a graph called a scatter plot. The
independent variable, x, is plotted on the horizontal axis and the dependent variable, y, is plotted
on the vertical axis.
▪ A scatter plot therefore is a graph of ordered pairs (x,y) of numbers of the independent and
dependent variables.
▪ It is a visual way of describing the nature of relationship between the independent and dependent
variables.
Example
▪ Construct a scatter plot for the data obtained in a study of age and blood pressure of six randomly
selected subjects;
Subject Age Pressure
A 43 128
B 48 120
C 56 135
D 61 143
E 67 141
F 70 152
▪ Step 1: Draw and label the x and y-axes.

▪ Step 2: Plot each point on the graph.


CORRELATION COEFFICIENT
▪ The correlation coefficient computed from the sample data measures the strength and direction of a
linear relationship between two variables.
▪ The symbol for the sample correlation coefficient is r and the symbol for the population correlation
coefficient is ρ (Greek letter rho).
▪ The range of a correlation coefficient is from -1 to +1. If there is a strong positive relationship
between the variables, the value of r will be close to +1 and if it is a strong negative relationship, r
will be close to -1.
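▪ The formula for a correlation coefficient is given by:
▪ $r = \dfrac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}$, where n is the number of data pairs.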

Rounding rule for correlation coefficient


▪ Round the value of r to three decimal places.
▪ For the previous example, compute the correlation coefficient and interpret the nature and
strength of the relationship between age and pressure.
▪ Step 1: Make a table and find the values of xy, x² and y².

Subject   Age, x    Pressure, y   xy            x²            y²
A         43        128           5,504         1,849         16,384
B         48        120           5,760         2,304         14,400
C         56        135           7,560         3,136         18,225
D         61        143           8,723         3,721         20,449
E         67        141           9,447         4,489         19,881
F         70        152           10,640        4,900         23,104
Sum       ∑x=345    ∑y=819        ∑xy=47,634    ∑x²=20,399    ∑y²=112,443
▪ Step 2: Substitute in the formula and solve for r

▪ $r = \dfrac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]\left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}$

▪ $r = \dfrac{(6)(47{,}634) - (345)(819)}{\sqrt{\left[(6)(20{,}399) - (345)^{2}\right]\left[(6)(112{,}443) - (819)^{2}\right]}}$ (apply the order of operations, BODMAS)

▪ = 0.897

▪ The correlation coefficient suggests a strong positive relationship between age and
blood pressure.
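▪ The computation can be sketched in Python from the raw data. The function name `pearson_r` is an assumption for illustration; the body follows the sums-based formula given above instead of calling a library routine.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient computed from the column sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


ages = [43, 48, 56, 61, 67, 70]
pressures = [128, 120, 135, 143, 141, 152]
r = pearson_r(ages, pressures)
print(round(r, 3))
```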
SIGNIFICANCE OF THE CORRELATION COEFFICIENT
▪ Since the value of r is computed from data obtained from samples, there are two
possibilities when r is not equal to zero:
▪ Either the value of r is high enough to conclude that there is a significant linear
relationship between the variables, or the value of r was due to chance.
▪ To make this decision, a hypothesis testing procedure is used;

▪ Step 1: State the hypothesis

▪ Step2: Find the critical values

▪ Step 3: Compute the test value

▪ Step 4: Make the decision

▪ Step 5: Summarize the results

▪ The population correlation coefficient ρ is the correlation computed by using all
possible pairs of data values (x,y) taken from a population.
▪ In hypothesis testing, one of these statements is true:

▪ H0 : ρ = 0 The null hypothesis means that there is no correlation between the x and y
variables in the population.
▪ H1 : ρ ≠ 0 This alternative hypothesis means that there is a significant correlation
between the variables in the population.
▪ When the null hypothesis is rejected at a given level, it means that there is a significant
difference between the value of r and 0.
▪ When the null hypothesis is not rejected, it means that the value of r is not significantly
different from 0 and is probably due to chance.
▪ Formula for the t-test for the correlation coefficient

▪ $t = r\sqrt{\dfrac{n-2}{1-r^{2}}}$

▪ With degrees of freedom equal to n − 2.


▪ Although hypothesis tests can be one-tailed, most hypotheses involving correlation
coefficient are two-tailed.
▪ H0 : ρ = 0 and H1 : ρ ≠ 0

▪ The two-tailed critical values are used and they are found in the t distribution table.

▪ Also, when testing the significance of a correlation coefficient, both the x and y variables
must come from a normally distributed population.
Example
▪ Test the significance of the correlation coefficient from the previous example. Use α =
0.05 and r was 0.897.
Solution
▪ Step 1: State the hypotheses

▪ H0 : ρ = 0 and H1 : ρ ≠ 0
▪ Step 2: Find the critical values. Since α = 0.05 and there are 6 − 2 = 4 degrees of freedom, the critical values
obtained from the t distribution table are ±2.776.

▪ Step 3: Compute the test value

▪ $t = r\sqrt{\dfrac{n-2}{1-r^{2}}} = (0.897)\sqrt{\dfrac{6-2}{1-(0.897)^{2}}} = 4.059$

▪ Step 4: Make the decision. Reject the null hypothesis since the test value falls in the critical region.
▪ Step 5: Summarize the results. There is a significant relationship between the variables
of blood pressure and age.
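▪ The test value and decision can be checked with a short Python sketch. The function name is an assumption for illustration, and the critical value 2.776 is the table value from step 2 (4 d.f, two-tailed, α = 0.05).

```python
from math import sqrt


def correlation_t_value(r, n):
    """t test value for a sample correlation coefficient, with n - 2 d.f."""
    return r * sqrt((n - 2) / (1 - r ** 2))


t = correlation_t_value(0.897, 6)
critical_value = 2.776               # from the t table: 4 d.f, two-tailed, alpha = 0.05
significant = abs(t) > critical_value  # two-tailed: compare |t| with the critical value
print(round(t, 3), significant)
```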
