
Inferential Statistics
Objectives
At the end of this course students will be able to:

 Define inferential statistics

 Know statistical estimation

 Understand hypothesis testing & the types of errors in decision making

 Use test statistics to examine hypotheses about population parameters
Inference

Use a random sample to learn something about a larger population
Inferential Statistics

 Inferential statistics: statistical methods used for drawing conclusions about a population based on the information obtained from a sample of observations drawn from that population
Inferential Statistics
 Involves
– Estimation
– Hypothesis testing

 Purpose
– Make decisions about population characteristics

Inferential statistics comprises estimation (point estimation and interval estimation) and hypothesis testing (one-sample and two-sample tests).
Statistical Estimation
 Estimation is the process of determining a likely value of a population parameter, based on information collected from the sample
 Estimation is the use of sample statistics to estimate the corresponding population parameters
 The objective of estimation is to determine the approximate value of an unknown population parameter on the basis of a sample statistic
Sample Statistics as Estimators of Population Parameters

 A sample statistic is a numerical measure of a summary characteristic of a sample. A population parameter is a numerical measure of a summary characteristic of a population.

 An estimator of a population parameter is a sample statistic used to estimate or predict the population parameter

 An estimate of a parameter is a particular numerical value of a sample statistic obtained through sampling.
Estimation
Every member of the population has the same chance of being selected in the sample. A random sample is drawn from the population (with its unknown parameter), and a statistic computed from the sample is used for estimation.
Estimation
Estimation is of two kinds: point estimation and interval estimation.
Point and Interval Estimates
 A point estimate is a single value used as an estimate of a population parameter

 An interval estimate is a range or interval of numbers believed to include the unknown population parameter with a certain degree of assurance

 The point estimate is always within the interval estimate

The interval estimate runs from the lower confidence limit, through the point estimate, to the upper confidence limit.
Estimation Process
The population mean μ is unknown. From a random sample we obtain a point estimate (e.g. x̄ = 50) and an interval estimate: "I am 95% confident that μ is between 40 and 60."
Point estimation

 A single numerical value used to estimate the corresponding population parameter
 Gives little information about how close the value is to the unknown population parameter
 Example: the sample mean x̄ = 3 is a point estimate of the unknown population mean
Sample statistics & their corresponding population parameters

Statistic (from sample)      Parameter (from entire population)
Mean: x̄                      estimates μ
Variance: s²                 estimates σ²
Standard deviation: s        estimates σ
Proportion: p                estimates π
Properties of good estimate
a) Unbiasedness: An estimator is said to be unbiased if its expected value is equal to the population parameter it estimates.

 For example, since E(x̄) = μ, the sample mean is an unbiased estimator of the population mean
 The mean of any single sample will probably not equal the population mean, but the average of the means of repeated independent samples from a population will equal the population mean.
Properties of good estimate

b) Minimum variance: An estimator which has a minimum standard error is a good estimator
 For a symmetrical distribution the mean has the minimum standard error
 If the distribution is skewed, the median has the minimum standard error
Properties of good estimate

c) Consistency: An estimator is said to be consistent if its probability of being close to the parameter it estimates increases as the sample size increases

Consistency: the sampling distribution concentrates around the parameter as the sample size grows (e.g. n = 10 vs n = 100).
Interval estimation

 A single-valued estimate conveys little information about the actual value of the population parameter or about the accuracy of the estimate
 The probability of getting a sample statistic value that is exactly equal to the corresponding population parameter is usually quite small
Interval estimation
 It is not reasonable to assume that a sample statistic
value is exactly equal to the corresponding population
parameter
 An interval estimate which locates the population
parameter within an interval, with a level of
confidence is needed
Confidence Interval or Interval Estimate

 A confidence interval or interval estimate is a range or interval of numbers believed to include an unknown population parameter
 Confidence interval: provides a range of values of the estimate likely to include the "true" population parameter with a given probability

 A confidence interval or interval estimate has two


components:
A range or interval of values
An associated level of confidence
Confidence Level
1. The probability that the unknown population parameter falls within the interval
2. Denoted (1 – α)
• α is the probability that the parameter is not within the interval
3. Typical values are 99%, 95%, 90%
CI for population mean:
There are different conditions to be considered to construct confidence intervals of the population mean.
1. Large-sample size and when σ is known

 For sufficiently large sample size n > 30, the sampling distribution of the sample mean is approximately normal
 A 100(1-α)% C.I. for μ is:

x̄ ± z_{α/2} σ/√n, i.e. (x̄ - z_{α/2} σ/√n, x̄ + z_{α/2} σ/√n)

 α is to be chosen by the researcher; the most common values of α are 0.05, 0.01, 0.001 and 0.1
CI for population mean:
2. Large-sample size and when σ is unknown
 Whenever σ is not known (and the population is assumed normal), the correct distribution to use is the t distribution with n-1 degrees of freedom. However, for large degrees of freedom, the t distribution is approximated well by the Z distribution

 A large-sample 100(1-α)% C.I. for μ is:

x̄ ± z_{α/2} s/√n

 Note that when σ is unknown, s is a good approximation of σ
Example

 An epidemiologist studied the blood glucose level of a random sample of 100 patients. The mean was 170, with a SD of 10. Construct the 95% CI for the population mean.

Solution

x̄ ± z_{α/2} s/√n = 170 ± 1.96 × 10/√100 = (168.04, 171.96)

 We are 95% sure that the mean blood glucose level of the population lies between 168.04 and 171.96
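The arithmetic above can be checked with a short script. This is a sketch, not part of the course material; the helper name z_confidence_interval is my own.

```python
# Large-sample z-based confidence interval: mean +/- z * s / sqrt(n).
# Values (n=100, mean=170, s=10, z=1.96) come from the example above.
import math

def z_confidence_interval(mean, s, n, z=1.96):
    margin = z * s / math.sqrt(n)
    return mean - margin, mean + margin

low, high = z_confidence_interval(170, 10, 100)
print(round(low, 2), round(high, 2))  # 168.04 171.96, matching the slide
```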
CI for population mean:
3. Small sample size (n < 30) and when σ is unknown
 If the population standard deviation is unknown, then the sample means from samples of size n are t-distributed with n-1 degrees of freedom

 A 100(1-α)% C.I. for μ is:

x̄ ± t_{α/2, n-1} s/√n
Example: The average earnings per share (EPS) for 10 industrial stocks randomly selected from those listed on the Dow-Jones Industrial Average was found to be x̄ = 1.85 with a standard deviation of s = 0.395. Calculate a 99% confidence interval for the average EPS of all the industrials listed on the DJIA.

Solution:
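The slide leaves the solution blank; a hedged sketch of the computation follows. The critical value t_{0.005, df=9} ≈ 3.250 is taken from a standard t-table, and the helper name is my own.

```python
# Small-sample t-based CI: mean +/- t_crit * s / sqrt(n), here with
# n=10, mean=1.85, s=0.395 from the EPS example above.
import math

def t_confidence_interval(mean, s, n, t_crit):
    margin = t_crit * s / math.sqrt(n)
    return mean - margin, mean + margin

low, high = t_confidence_interval(1.85, 0.395, 10, 3.250)
print(round(low, 3), round(high, 3))  # roughly (1.444, 2.256)
```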
Example: A random sample of 900 workers
showed an average height of 67 inches with a
standard deviation of 5 inches.

A. Find a 95% confidence interval of the mean


height of all workers

B. Find a 99% confidence interval of the mean


height of all workers
Solution:
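The workers-height solution is also left blank on the slide; a sketch under the usual table values z_{0.025} ≈ 1.96 and z_{0.005} ≈ 2.576 (the helper name is my own):

```python
# Large-sample z-based CI for n=900, mean=67, s=5 from the example above.
import math

def z_ci(mean, s, n, z):
    m = z * s / math.sqrt(n)
    return round(mean - m, 2), round(mean + m, 2)

ci95 = z_ci(67, 5, 900, 1.96)   # 95% confidence
ci99 = z_ci(67, 5, 900, 2.576)  # 99% confidence (wider interval)
print(ci95, ci99)  # (66.67, 67.33) (66.57, 67.43)
```

Note the 99% interval is wider than the 95% one, as the margin-of-error discussion later in the deck predicts.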
Example: Suppose we want to estimate a 95%
confidence interval for the average quarterly returns
of all fixed-income funds in the Ethiopia. We draw a
sample of 100 observations and calculate the sample
mean to be 0.05 and the standard deviation 0.03. We
assume that those returns are normally distributed
with known variance.

Solution:
Example:

1. An economist is interested in studying the incomes of


consumers in a particular country. The population standard
deviation is known to be $1,000. A random sample of 50
individuals resulted in a mean income of $15,000. Construct
the 95% confidence interval ?

2. An auditor, examining a total of 820 accounts receivable of a


corporation, took a random sample of 60 of them. The sample
mean was $127 and the sample standard deviation was $43.
Find a 99% confidence interval for the population mean.
CI for a population proportion: Large-sample size

 For sufficiently large samples, the sampling distribution of the proportion p is approximately normal
 A 100(1-α)% CI for π is:

p ± z_{α/2} √(p(1-p)/n)

 A sample is considered large enough when both n·p and n·q are greater than 5, where q = 1-p.
Example
• In a sample of 400 people who were questioned regarding their participation in sports, 160 said that they did participate. Construct a 98% confidence interval for π, the proportion of people in the population who participate in sports.
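A sketch of the computation (not given on the slide), using the table value z_{0.01} ≈ 2.326 for 98% confidence; the helper name is my own.

```python
# Large-sample CI for a proportion: p +/- z * sqrt(p(1-p)/n),
# with 160 successes out of n=400 from the example above.
import math

def proportion_ci(successes, n, z):
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

low, high = proportion_ci(160, 400, 2.326)
print(round(low, 3), round(high, 3))  # roughly (0.343, 0.457)
```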
Exercise:

1. In a survey of 300 automobile drivers in one city, 123 reported that they wear seat belts regularly. Estimate the seat belt rate of the city and a 95% confidence interval for the true population proportion.

Sample size determination
 Common questions:
– "How many subjects should I study?"
– Too small a sample: waste of time and resources; results have no practical use
– Too large a sample: waste of resources; data quality compromised; any small difference can be statistically significant
When deciding on sample size:

 Precision is related to the confidence level, the width of the CI, and the margin of error
Factors Affecting Margin of Error

 Margin of error is determined by n, s and α
– As n increases, the width of the CI decreases
– As s increases, the width of the CI increases
– As the confidence level increases (α decreases), the width of the CI increases
Reducing the Margin of Error

ME = z_{α/2} σ/√n

 The margin of error can be reduced if
– the standard deviation is lower (s ↓)
– the sample size is increased (n ↑)
– the confidence level is decreased ((1 – α) ↓)


Sample size determination depends on:

 Objective of the study


 Design of the study
 Degree of precision or accuracy – the allowed
deviation from the true population parameter (can be
within 1% to 5%)
 Degree of confidence level required
 Availability of resources
Estimation of single mean

n = (z_{α/2})² σ² / d²

Where:
n = sample size
σ = population standard deviation, if known
d = desired degree of precision = half of the width of the confidence interval
z = the standard normal value at the level of confidence desired, usually at the 95% confidence level
Example
 Suppose that for a certain group of cancer patients, we are interested in estimating the mean age at diagnosis. We would like a 95% CI of 5 years wide. If the population SD is 12 years, how large should our sample be?

n = (z_{α/2})² σ² / d² = (1.96)² × 144 / (2.5)² = 88.5 ≈ 89
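The same calculation, scripted as a sketch (the helper name is my own); note that sample sizes are always rounded up.

```python
# Sample size for estimating a mean: n = z^2 * sigma^2 / d^2, rounded up.
# Values (z=1.96, sigma=12, d=2.5) come from the cancer-age example above.
import math

def sample_size_mean(z, sigma, d):
    return math.ceil((z ** 2) * (sigma ** 2) / (d ** 2))

print(sample_size_mean(1.96, 12, 2.5))  # 89, matching the example
```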
But the population σ is most of the time unknown
As a result, it has to be estimated from:
 A pilot or preliminary sample:
– Select a pilot sample and estimate σ with the sample standard deviation, s
 Similar studies
Estimation of single proportion

n = (z_{α/2})² p q / d²

Where:
n = sample size
p = estimated proportion
q = 1-p
d = desired degree of precision
z = the standard normal value at the level of confidence desired, usually at the 95% confidence level
Example
A) Suppose that you are interested to know the proportion of infants who breastfed >18 months of age in a rural area. Suppose that in a similar area, the proportion (p) of breastfed infants was found to be 0.20. What sample size is required to estimate the true proportion within ±3% with 95% confidence?

n = (z_{α/2})² p q / d² = (1.96)² × (0.2)(0.8) / (0.03)² ≈ 683
Example
B) If the above sample is to be taken from a relatively small population (say N = 3000), the required minimum sample will be obtained from the above estimate by making an adjustment (if the population is less than 10,000 then a smaller sample size may be required).

n_final = n / (1 + n/N) = 683 / (1 + 683/3000) ≈ 557
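Both steps above, sketched in code (helper names are my own): the proportion formula first, then the finite-population adjustment.

```python
# Sample size for a proportion, n = z^2 * p * q / d^2, then the
# finite-population correction n0 / (1 + n0/N), both rounded up.
import math

def sample_size_proportion(z, p, d):
    return math.ceil((z ** 2) * p * (1 - p) / (d ** 2))

def finite_population_adjust(n0, N):
    return math.ceil(n0 / (1 + n0 / N))

n0 = sample_size_proportion(1.96, 0.20, 0.03)
print(n0)                                  # 683, matching part A
print(finite_population_adjust(n0, 3000))  # 557, matching part B
```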
 An estimate of p is not always available
 However, the formula may also be used for sample size
calculation based on various assumptions for the values of p.

Note: if no prior information about the proportion (p),


assume p=q=0.5
• Example 1: Calculate the sample size for a population of 100,000. Take the confidence level as 95% and the margin of error as 5%.
• Solution:
• To find: sample size for a population of 100,000. We will calculate the sample size first for an infinite population and then adjust it to the required population size.
Given: z = 1.960, p = 0.5, M = 0.05
• Use the sample size formula, then adjust the sample size for the required population as in Example 1.
• Example 3: Using the sample size formula, find the sample size for a survey where confidence level = 95%, standard deviation = 0.5, and margin of error = ±5%.
• Solution:
• The sample size can be calculated as = (z-score)² × SD × (1-SD) / (margin of error)²
= ((1.96)² × 0.5 × 0.5) / (0.05)²
= (3.8416 × 0.25) / 0.0025
= 0.9604 / 0.0025
= 384.16
• Thus, you will need 385 respondents for this survey.
Hypothesis testing

 A statistical method that uses sample data to evaluate a hypothesis about a population parameter.
 It is intended to help researchers differentiate between real and random patterns in the data (i.e. it involves conducting a test of statistical significance and quantifying the degree to which chance or sampling variability may account for the results observed in a particular study)
What is a Hypothesis?
 A hypothesis is a claim (assumption) about the true value of an unknown population parameter (e.g. "I claim the mean GPA of this class is μ = 3.5")
– The parameter may be a population mean, proportion, correlation coefficient, ...
– Must be stated before analysis
Hypothesis testing
 The purpose of hypothesis testing is to determine whether enough statistical evidence exists to enable us to conclude that a belief or hypothesis about a parameter is reasonable
 Examples
– Is a new drug effective in curing a certain disease? A sample of patients is randomly selected. Half of them are given the new drug while half are given the standard drug. Then, the improvement in the patients' conditions is measured and compared
Hypothesis Testing Process
Assume the population mean age is 50 (H0: μ = 50). Identify the population and take a sample. Is x̄ = 20 likely if μ = 50? No, not likely, so REJECT H0.
Steps in hypothesis testing

1) State the statistical hypotheses
 There are two hypotheses:
- Null hypothesis
- Alternative hypothesis
State the null hypothesis

 Null hypothesis – called the hypothesis of no difference or no association or no effect
 States that "there's no difference" between the hypothesized value and the population parameter value
 Is always about a population parameter, not about a sample
Null Hypothesis: H0

 The null hypothesis (denoted by H0) is a statement that the value of a population parameter (such as proportion, mean, or standard deviation) is equal to some claimed value.
 Always contains the "=", "≤" or "≥" sign
 We test the null hypothesis directly.
 Either reject H0 or fail to reject H0.
State the alternative hypothesis
 Alternative to the null hypothesis
 Says there's a difference between the hypothesized value and the population parameter value
 It is what we are trying to prove, i.e. the reason for the research question.
Alternative Hypothesis: H1 or HA

 The alternative hypothesis (denoted by H1 or HA) is the statement that the parameter has a value that somehow differs from the null hypothesis.

 The symbolic form of the alternative hypothesis must use one of these symbols: ≠, < or >.

 May or may not be accepted
Hypothesis
Example: Consider the population mean

H0: μ = μ0
HA: μ ≠ μ0

Two-tailed
Example:
A. Is the mean SBP of the population different from 120 mmHg?

- H0: The mean SBP of the population is not different from 120 mmHg (H0: μ = 120).

- HA: The mean SBP of the population is different from 120 mmHg (HA: μ ≠ 120).
Errors in making Decision
1. Type I Error
– Probability of rejecting a true null hypothesis
– Probability of accepting a false alternative hypothesis
– The probability of a Type I error is α (alpha)
• Called the level of significance

2. Type II Error
– Probability of failing to reject a false null hypothesis
– Probability of rejecting a true alternative hypothesis
– The probability of a Type II error is β (beta)
Type I & II Errors Have an Inverse Relationship
If you reduce the probability of one error, the other one increases, everything else being unchanged.
Factors Affecting Type II Error
 Significance level α
– β increases when α decreases

 Population standard deviation σ
– β increases when σ increases

 Sample size n
– β increases when n decreases
Controlling Type I and Type II Errors
 For any fixed α, an increase in the sample size n will cause a decrease in β
 For any fixed sample size n, a decrease in α will cause an increase in β. Conversely, an increase in α will cause a decrease in β.
 To decrease both α and β, increase the sample size.
Power of a statistical test
 The power of a statistical test is the probability of rejecting H0 when H0 is really false. Thus power = 1-β.
 Clearly, if the test maximizes power, it minimizes the probability of a Type II error, β.
Summary:
Elements of a Hypothesis Test
Null Hypothesis (H0)
– A theory about the values of one or more population
parameters. The status quo.
Alternative Hypothesis (Ha)
– A theory that contradicts the null hypothesis. The theory
generally represents that which we will accept only when
sufficient evidence exists to establish its truth.
Test Statistic
– A sample statistic used to decide whether to reject the null hypothesis. In general,

test statistic = (Estimate - Hypothesized parameter) / Standard error
Summary:
Elements of a Hypothesis Test
Critical Value
– A value to which the test statistic is compared at some particular significance level (usually at α = .01, .05, .10)
Rejection Region
– The numerical values of the test statistic for which the null hypothesis will be rejected.
– The probability is α that the rejection region will contain the test statistic when the null hypothesis is true, leading to a Type I error. α is usually chosen to be small (.01, .05, .10) and is the level of significance of the test.
Summary of One- and Two-Tail Tests

One-tail test (left tail): H0: μ ≥ μ0, HA: μ < μ0
Two-tail test: H0: μ = μ0, HA: μ ≠ μ0
One-tail test (right tail): H0: μ ≤ μ0, HA: μ > μ0
Summary: Rejection Regions
 Two-tail hypothesis (HA: μ ≠ μ0): reject the null hypothesis if |z| > z_{α/2} (rejection regions of area α/2 in each tail).
 One-tail hypothesis (HA: μ < μ0): reject the null hypothesis if z < -z_α.
 One-tail hypothesis (HA: μ > μ0): reject the null hypothesis if z > z_α.
Summary: Type I and Type II Errors
Example: Two-Tail Test

Q. Does an average box of cereal contain 368 grams of cereal? A random sample of 25 boxes showed x̄ = 372.5. The company has specified σ to be 15 grams. Test at the α = 0.05 level.

Example Solution: Two-Tail Test
H0: μ = 368
H1: μ ≠ 368
σ = 15, n = 25, α = 0.05; the Z-test is appropriate

Test statistic:
Z = (x̄ - μ) / (σ/√n) = (372.5 - 368) / (15/√25) = 1.50

Critical value: ±1.96
Decision: Do not reject H0 at α = .05, since 1.50 lies between -1.96 and 1.96
Conclusion: There is no evidence that the true mean is not 368
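The decision rule above can be sketched in code (the helper name is my own; the critical value ±1.96 is the slide's).

```python
# Two-tailed one-sample z-test: Z = (xbar - mu0) / (sigma / sqrt(n)),
# using n=25, xbar=372.5, mu0=368, sigma=15 from the cereal example.
import math

def z_statistic(xbar, mu0, sigma, n):
    return (xbar - mu0) / (sigma / math.sqrt(n))

z = z_statistic(372.5, 368, 15, 25)
reject = abs(z) > 1.96  # two-tailed critical value at alpha = 0.05
print(round(z, 2), reject)  # 1.5 False -> do not reject H0
```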
Example: Two-Tailed Test
Does an average box of cereal contain 368 grams of cereal? A random sample of 36 boxes had a mean of 372.5 and a standard deviation of 12 grams. Test at the .05 level of significance.

Solution
• H0: μ = 368
• HA: μ ≠ 368
• α = 0.05, df = 36-1 = 35
• Critical value: ±2.042

Test statistic:
t* = (x̄ - μ) / (s/√n) = (372.5 - 368) / (12/√36) = 2.25

0.02 < p-value < 0.05
Decision: Reject H0 since p-value < α = .05 and t* > t-critical
Conclusion: There is evidence the population average is not 368 but is greater
Example: One-Tail Test

Q. Does an average box of cereal contain more than 368 grams of cereal? A random sample of 25 boxes showed x̄ = 372.5. The company has specified σ to be 15 grams. Test at the α = 0.05 level.

Solution
H0: μ ≤ 368
H1: μ > 368
α = 0.05, n = 25
Critical value: 1.645

Test statistic:
Z = (x̄ - μ) / (σ/√n) = (372.5 - 368) / (15/√25) = 1.50

Decision: Do not reject H0 at α = .05, since 1.50 < 1.645
Conclusion: No evidence that the true mean is more than 368
p-Value Solution

 The p-value is P(Z ≥ 1.50) = 0.0668. Use the alternative hypothesis to find the direction of the rejection region; here it is the upper tail, beyond the Z value of the sample statistic, 1.50.

Since (p-value = 0.0668) ≥ (α = 0.05), do not reject H0: the test statistic 1.50 is in the non-rejection region (critical value 1.645).
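The upper-tail p-value can be computed without a normal table, e.g. via the standard identity P(Z ≥ z) = ½·erfc(z/√2). This is a sketch; the helper name is my own.

```python
# Upper-tail p-value for a standard normal test statistic.
import math

def upper_tail_p(z):
    """P(Z >= z) for the standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

p = upper_tail_p(1.50)
print(round(p, 4))  # 0.0668, matching the slide
```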
Example: One-Tailed Test

 Is the average capacity of the batteries less than 140 ampere-hours? A random sample of 20 batteries had a mean of 138.47 and a standard deviation of 2.66. Assume a normal distribution. Test at the .05 level of significance.

Solution
• H0: μ = 140
• HA: μ < 140
• α = 0.05, df = 20-1 = 19
• Critical value: -1.729

Test statistic:
t* = (x̄ - μ) / (s/√n) = (138.47 - 140) / (2.66/√20) = -2.57

• 0.005 < p-value < 0.01
Decision: Reject H0 since p-value < α and t* < t-critical
Conclusion: There is evidence the population average is less than 140
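The t statistic above can be reproduced with the same helper pattern (the helper name is my own; the left-tail critical value -1.729 for df = 19 is the slide's table value).

```python
# Left-tailed one-sample t-test for the battery example:
# n=20, xbar=138.47, mu0=140, s=2.66.
import math

def t_statistic(xbar, mu0, s, n):
    return (xbar - mu0) / (s / math.sqrt(n))

t = t_statistic(138.47, 140, 2.66, 20)
reject = t < -1.729  # left-tail critical value, df = 19, alpha = 0.05
print(round(t, 2), reject)  # -2.57 True -> reject H0
```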
Example

 In a survey of diabetics in a large city, it was found that 100 out of 400 have diabetic foot. Can we conclude that 20 percent of diabetics in the sampled population have diabetic foot? Test at the α = 0.05 significance level.

Solution
H0: π = 0.20
H1: π ≠ 0.20

Z = (p - π) / √(π(1-π)/n) = (0.25 - 0.20) / √(0.20(1-0.20)/400) = 2.50

Critical value: ±1.96
Decision: We have sufficient evidence to reject the H0 value of 20%. We conclude that in the population of diabetics the proportion who have diabetic foot does not equal 0.20
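The proportion test above, sketched as code (the helper name is my own). Note the standard error uses the hypothesized proportion π0, as in the slide's formula.

```python
# One-sample z-test for a proportion: z = (p - pi0) / sqrt(pi0(1-pi0)/n),
# with 100 successes out of n=400 and pi0=0.20 from the example above.
import math

def proportion_z(successes, n, pi0):
    p = successes / n
    se = math.sqrt(pi0 * (1 - pi0) / n)
    return (p - pi0) / se

z = proportion_z(100, 400, 0.20)
print(round(z, 2), abs(z) > 1.96)  # 2.5 True -> reject H0
```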
Inference About a Population Variance

 Sometimes we are interested in making inference about the variability of processes.
– Examples:
• The consistency of a production process for quality control purposes.

Inference About a Population Variance

 A common goal in business and industry is to improve the quality of goods or services by reducing variation. Quality control engineers want to ensure that a product has an acceptable mean, but they also want to produce items of consistent quality so that there will be few defects.
 To draw inference about variability, the parameter of interest is σ².
Inference About a Population Variance

 The sample variance s² is an unbiased, consistent and efficient point estimator for σ².
 The statistic

χ² = (n-1)s² / σ², with d.f. = n-1

has a distribution called chi-squared, if the population is normally distributed (its shape depends on the degrees of freedom, e.g. d.f. = 1, 5, 10).
Properties of Chi-Square Distribution

 All values of χ² are non-negative, and the distribution is not symmetric

 There is a different distribution for each number of degrees of freedom

 The critical values are found in the table using n – 1 degrees of freedom.
Properties of Chi-Square Distribution - cont
There is a different distribution for each number of df (e.g. the distributions for 10 and 20 df differ).
Measures of association
Chi-Square

 Tests two categorical variables for independence
 Consider an r×c contingency table with variable A in the rows (categories A1, A2, A3, ...) and variable B in the columns (categories B1, B2, B3, B4, ...), with row totals, column totals, and a grand total
where:
r = number of rows (number of categories of variable A)
c = number of columns (number of categories of variable B)
Chi-Square
 Hypothesis to be tested:
H0: There is no association between the
row and column variables
HA: There is an association
or
H0: The row and column variables are
independent
HA: The two variables are dependent
 Test Statistic: χ 2 - test with df= (r -1)x(c -1)
Chi-Square( 2) - test

where:
Oij -Observed frequency of i th row and jth column
i th row total×jth column total R i ×C j
E ij = =
grand total n
R i -Marginal total of the i th row
C j -Marginal total of the jth column
n-Grand total
An alternative method to calculate Chi-Square for a 2×2 table

Exposure   Outcome: Yes   Outcome: No   Total
Yes        a              b             r1
No         c              d             r2
Total      c1             c2            n

χ² = n(ad - bc)² / (r1 r2 c1 c2)

 Remember that the Chi-Square test should be applied to counts and not percentages
Characteristics of the Chi-Square Distribution
1. It is not symmetric.
2. The shape of the chi-square distribution depends upon the degrees of freedom, just like Student's t-distribution.
3. As the number of degrees of freedom increases, the chi-square distribution becomes more symmetric, as is illustrated in the following figure (see next slide).
4. The values are non-negative. That is, the values of χ² are greater than or equal to 0.
The Chi-Square Distribution
Assumptions of the χ² test
 For the chi-square independence test to be used, the
following must be true
o The observed frequencies must be obtained by using
a random sample
o No expected frequency should be less than 1,
and no more than 20% of the expected
frequencies should be less than 5.
Critical values for chi-square:

 Critical values are found in the table by first locating the row corresponding to the appropriate number of degrees of freedom (where df = n – 1). Next, the significance level α is used to determine the correct column.
Steps

 Step 1: Determine the null and alternative hypotheses

H0: The two variables are independent
HA: The two variables are associated

 Step 2: Select a level of significance α based upon the seriousness of making a Type I error. The level of significance is used to determine the critical value. All chi-square tests for independence are right-tailed tests, so the critical value is taken with (r-1)×(c-1) degrees of freedom. The shaded region at the right represents the critical region or rejection region.

 Step 3: Calculate the expected frequencies for the contingency table cells and verify that the requirements are satisfied.

Expected frequency E_{r,c} = (sum of row r) × (sum of column c) / sample size

(1) all expected frequencies are greater than or equal to 1 (all E_ij > 1)

(2) no more than 20% of the expected frequencies are less than 5.

 If the conditions listed above are satisfied, then…

 Step 4: Compute the test statistic

χ² = Σ (O - E)² / E

where O represents the observed frequencies and E represents the expected frequencies

 Step 5: Make a decision to reject or fail to reject the null hypothesis
- Compare the critical value to the test statistic

 Step 6: State the conclusion
Example
 A researcher wishes to determine whether there is a relationship between the gender of an individual and the amount of alcohol consumed. A sample of 68 people is selected, and the following data are obtained. At α = 0.10, can the researcher conclude that alcohol consumption is related to gender?

 Results of observed frequencies

Alcohol consumption
Gender        Low   Moderate   High   Row total
Male          10    9          8      27
Female        13    16         12     41
Column total  23    25         20     68
Solution
Step 1 State the hypotheses

H0: The amount of alcohol that a person consumes is independent of the individual's gender
HA: The amount of alcohol that a person consumes is dependent on the individual's gender

Step 2 Find the critical value: the critical value is 4.605, since the degrees of freedom are (2-1)(3-1) = 2

Step 3 Compute the test value: First, compute the expected frequencies.

E1,1 = (27)(23)/68 = 9.13     E2,1 = (41)(23)/68 = 13.87
E1,2 = (27)(25)/68 = 9.93     E2,2 = (41)(25)/68 = 15.07
E1,3 = (27)(20)/68 = 7.94     E2,3 = (41)(20)/68 = 12.06
 The completed table of expected frequencies:

Alcohol consumption
Gender        Low    Moderate   High    Row total
Male          9.13   9.93       7.94    27
Female        13.87  15.07      12.06   41
Column total  23     25         20      68

Then, the test value is

χ² = Σ (O - E)² / E
   = (10 - 9.13)²/9.13 + (9 - 9.93)²/9.93 + (8 - 7.94)²/7.94
   + (13 - 13.87)²/13.87 + (16 - 15.07)²/15.07 + (12 - 12.06)²/12.06
   = 0.283
Step 4 Make the decision: Do not reject the null hypothesis, since 0.283 < 4.605

Step 5 Conclusion: There is not enough evidence to support the claim that the amount of alcohol a person consumes is dependent on the individual's gender
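The whole computation can be cross-checked with a short script (a sketch; the function name is my own). Using exact rather than table-rounded expected counts gives about 0.281, close to the slide's 0.283.

```python
# Chi-square test of independence from a table of observed counts.
def chi_square_independence(table):
    """table: list of rows of observed counts. Returns the chi-square statistic."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand  # E_ij = R_i * C_j / n
            stat += (obs - exp) ** 2 / exp
    return stat

observed = [[10, 9, 8], [13, 16, 12]]  # gender x alcohol consumption
stat = chi_square_independence(observed)
print(round(stat, 2))  # 0.28, well below the critical value 4.605
```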
B. Quantitative variables
1. Correlation
 Measures the relative strength of the linear
relationship between two variables
 The degree of association or correlation between two
variables is measured by the correlation coefficient.
Example
 If a researcher wishes to see whether there is a relationship between number of hours studied and test scores on an exam, data are obtained from a random sample of students as shown:

Student   Hours studied   % Grade
A         6               82
B         2               63
C         1               57
D         5               88
E         3               68
F         2               75
Scatter Plot

A scatter plot of grade (%) against hours studied suggests a positive relationship between hours of study and grades.
Correlation Coefficient
Measures the strength and direction of a relationship between two variables.

r = covariance(x,y) / √(var x · var y)
  = [Σ(x_i - x̄)(y_i - ȳ)/(n-1)] / √{[Σ(x_i - x̄)²/(n-1)] · [Σ(y_i - ȳ)²/(n-1)]}

 An equivalent formula:

r = Σ(x_i - x̄)(y_i - ȳ) / √[Σ(x_i - x̄)² · Σ(y_i - ȳ)²]
Correlation Coefficient
 Unit-less
 The value of the correlation coefficient always ranges from -1 to +1
 The closer to -1, the stronger the negative linear relationship between the variables
 The closer to +1, the stronger the positive linear relationship between the variables
 The closer to 0, the weaker the linear relationship. However, there could be a non-linear relationship between the variables.
Strength of relationship

On the scale from -1 to +1: values near -1 indicate a strong negative linear relationship, values near 0 no linear relationship, and values near +1 a strong positive linear relationship.
Strength of relationship
 Correlation from 0 to 0.25 (or 0 to –0.25) indicate
little or no relationship
 those from 0.25 to 0.5 (or –0.25 to –0.50)
indicate a fair degree of relationship;
 those from 0.50 to 0.75 (or –0.50 to –0.75) a
moderate to good relationship; and
 those greater than 0.75 (or –0.75 to –1.00) a very
good to excellent relationship.
Scatter Plots of Data with Various Correlation Coefficients

Scatter plots illustrate r = -1, r = -.6, r = 0, r = +1, r = +.3, and r = 0.

Linear Correlation
Scatter plots contrast linear relationships with curvilinear relationships, strong relationships with weak relationships, and the case of no relationship.
Example
The following table shows the Systolic blood pressure and body
weight from a sample of twenty young adults
Subject SBP(mmHg) Weight(kg) Subject SBP(mmHg) Weight(kg)

1 106 60 11 94 48
2 111 42 12 97 46
3 115 53 13 96 39
4 102 49 14 115 66
5 126 67 15 79 39
6 85 47 16 108 51
7 125 62 17 97 57
8 103 48 18 96 37
9 108 49 19 95 49
10 98 55 20 95 37

 Correlation coefficient ?
( x  x ) ( y  y ) ( x  x )( y  y ) ( x x )2 (y-y)2
Subject Weight(x) SBP(y)
1 60 106 9.95 3.45 34.33 99.00 11.90
2 42 111 -8.05 8.45 -68.02 64.80 71.40
3 53 115 2.95 12.45 36.73 8.70 155.00
4 49 102 -1.05 -0.55 0.58 1.10 0.30
5 67 126 16.95 23.45 397.48 287.30 549.90
6 47 85 -3.05 -17.55 53.53 9.30 308.00
7 62 125 11.95 22.45 268.28 142.80 504.00
8 48 103 -205 0.45 -0.92 4.20 0.20
9 49 108 -1.05 5.45 -5.72 1.10 29.70
10 55 98 4.95 -4.55 -22.52 24.50 20.70
Cont…
Subject Weight(x) SBP(y)
( x  x ) ( y  y ) ( x  x )( y  y ) ( x  x )2 (y-y)2
11 48 94 -2.05 -8.55 17.53 4.20 73.10
12 46 97 -4.05 -5.55 22.48 16.40 30.80
13 39 96 -11.05 -6.55 72.38 122.10 42.90
14 66 115 15.95 12.45 198.58 254.40 155.00
15 39 79 -11.05 -23.55 260.23 122.10 554.60
16 51 108 0.95 5.45 5.18 0.90 29.70
17 57 97 6.95 -5.55 -38.57 48.30 30.80
18 37 96 -13.05 -6.55 85.48 170.30 42.90
19 49 95 -1.05 -7.55 7.93 1.10 57.00
20 37 95 -13.05 -7.55 98.53 170.30 57.00

Total 1,001 2,051 1,423.45 1,552.90 2,724.90


    r = Σ(xi - x̄)(yi - ȳ) / √[Σ(xi - x̄)² · Σ(yi - ȳ)²]

    r = 1423.45 / √(1552.90 × 2724.90) = 0.69
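The hand computation above can be reproduced with a short Python sketch using only the standard library (the variable names are our own):

```python
import math

# Weight (x, kg) and systolic BP (y, mmHg) for the 20 subjects above
weight = [60, 42, 53, 49, 67, 47, 62, 48, 49, 55,
          48, 46, 39, 66, 39, 51, 57, 37, 49, 37]
sbp = [106, 111, 115, 102, 126, 85, 125, 103, 108, 98,
       94, 97, 96, 115, 79, 108, 97, 96, 95, 95]

n = len(weight)
x_bar = sum(weight) / n   # 50.05
y_bar = sum(sbp) / n      # 102.55

# Sums of squares and cross-products
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, sbp))
sxx = sum((x - x_bar) ** 2 for x in weight)
syy = sum((y - y_bar) ** 2 for y in sbp)

r = sxy / math.sqrt(sxx * syy)
print(round(r, 2))        # 0.69, matching the hand calculation
```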
2. Regression
 Regression analysis is used to predict the value of one variable
(the dependent variable) on the basis of other variables (the
independent variables).
o Dependent variable: denoted Y
o Independent variables: denoted X1, X2, …, Xk

Prediction
 If you know something about X, this knowledge helps you
predict something about Y.
Linear Regression

 The linear regression model is used to describe the linear
relationship between the dependent variable (y) and the
independent variable (x)
 Both the dependent variable (y) and the independent variable
(x) are quantitative
Simple Linear Regression

 Deals with one dependent variable (Y) and one
independent variable (X)
 Explanatory and response variables are continuous
 Relationship between Y and X is described by a linear
function
The Simple Linear Regression Model

The simple linear regression model:

    y = α + β x + ε     or     μy|x = α + β x

where α + β x is the nonrandom (systematic) component and ε is the
random component.

Here y is the dependent (response) variable, the variable we wish to
explain or predict; x is the independent (explanatory) variable, also
called the predictor variable; and ε is the error term, the only
random component in the model and thus the only source of randomness
in y.

 μy|x is the expected value of y at a given x
 α is the intercept of the systematic component of the regression
relationship.
 β is the slope of the systematic component.
Picturing the Simple Linear Regression Model

 [Regression plot of Y against X showing the line μy|x = α + β x,
with intercept α, slope β, and an error ε drawn as the vertical
distance from an observed point to the line.]

The simple linear regression model posits an exact linear relationship
between the expected or average value of Y, the dependent variable,
and X, the independent or predictor variable:

    μy|x = α + β x

Actual observed values of Y (y) differ from the expected value (μy|x)
by an unexplained or random error (ε):

    y = μy|x + ε = α + β x + ε
Simple Linear Regression equation…

 Expected value of y at a given level of x:

    E(y|x) = α + β x

 Regression coefficient β
– Measures the association between y and x
– Amount by which y changes on average when x
changes by one unit
Assumptions of the Simple Linear Regression Model

LINE assumptions of the Simple Linear Regression Model:
 Linearity: the relationship between X and Y is a straight-line
(linear) relationship.
 Independence: the observations are independent; the errors are
uncorrelated in successive observations.
 Normality: the errors are normally distributed with mean 0.
 Equal variance: the errors have constant variance σ². That is:
ε ~ N(0, σ²)

 [Plot of the regression line μy|x = α + β x with identical normal
distributions of errors, all centered on the regression line.]
Regression Picture

 [Scatter plot with fitted line ŷi = α + β xi. For each observation,
A is the distance from yi to ȳ, B the distance from ŷi to ȳ, and C
the distance from ŷi to yi. Least squares estimation gives the line
(β) that minimizes ΣC².]

    Σ(yi - ȳ)²  =  Σ(ŷi - ȳ)²  +  Σ(ŷi - yi)²
       A²              B²              C²
     SStotal         SSreg         SSresidual

 SStotal: total squared distance of the observations from the mean
of y (total variation)
 SSreg: distance from the regression line to the mean of y
(variability due to x, the regression)
 SSresidual: variance around the regression line; additional
variability not explained by x, which is what the least squares
method aims to minimize
The equation of straight line
 The equation of a straight line is

    y = α + β x

 where the coefficient α is the intercept of the line on the y-axis
and β is the gradient or slope of the line. The slope β is also
referred to as the regression coefficient. The regression line is
obtained using the method of least squares.
 The values of α and β are calculated so that the sum of the
squared deviations of each point from the fitted line is a minimum
(the method of least squares).
 The formulas for calculating α and β are given by

    β = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²        α = ȳ - β x̄
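Applying these formulas to the weight/SBP data from the worked example gives the fitted line. A minimal sketch (the printed estimates are computed by us, not taken from the slides):

```python
# Least-squares fit of SBP (y) on weight (x) for the example data above
weight = [60, 42, 53, 49, 67, 47, 62, 48, 49, 55,
          48, 46, 39, 66, 39, 51, 57, 37, 49, 37]
sbp = [106, 111, 115, 102, 126, 85, 125, 103, 108, 98,
       94, 97, 96, 115, 79, 108, 97, 96, 95, 95]

n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(sbp) / n

# beta = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, sbp))
        / sum((x - x_bar) ** 2 for x in weight))
# alpha = ȳ - β·x̄, so the fitted line passes through (x̄, ȳ)
alpha = y_bar - beta * x_bar

print(round(alpha, 2), round(beta, 2))   # intercept ≈ 56.67, slope ≈ 0.92
```

So each additional kilogram of body weight is associated with roughly a 0.9 mmHg higher expected SBP in this sample.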
How Good is the Regression?
The coefficient of determination, R², is a descriptive measure of the
strength of the regression relationship, a measure of how well the
regression line fits the data.

 [Plot decomposing, for one point, the total deviation (y - ȳ) into
the unexplained deviation (y - ŷ) (error) and the explained
deviation (ŷ - ȳ) (regression).]

    (y - ȳ)   =   (y - ŷ)    +   (ŷ - ȳ)
     Total      Unexplained     Explained
    deviation    deviation      deviation
                  (error)      (regression)

    Σ(y - ȳ)²  =  Σ(y - ŷ)²  +  Σ(ŷ - ȳ)²
       SST     =     SSE     +     SSR

    R² = SSR / SST = 1 - SSE / SST

R² gives the percentage of total variation explained by the
regression.
Coefficient of Determination

    R² = Regression sum of squares / Total sum of squares
       = Σ(ŷ - ȳ)² / Σ(y - ȳ)²

 R² is a measure of linear association between x
and y (0 ≤ R² ≤ 1)
 R² measures the proportion of the total
variation in y which is explained by x
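Using the same weight/SBP data once more, a sketch computing R² from the sums-of-squares decomposition (for a single predictor, R² also equals the square of the correlation coefficient, here r ≈ 0.69):

```python
# R² from the sums-of-squares decomposition, weight/SBP example data
weight = [60, 42, 53, 49, 67, 47, 62, 48, 49, 55,
          48, 46, 39, 66, 39, 51, 57, 37, 49, 37]
sbp = [106, 111, 115, 102, 126, 85, 125, 103, 108, 98,
       94, 97, 96, 115, 79, 108, 97, 96, 95, 95]

n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(sbp) / n
beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, sbp))
        / sum((x - x_bar) ** 2 for x in weight))
alpha = y_bar - beta * x_bar

y_hat = [alpha + beta * x for x in weight]             # fitted values

sst = sum((y - y_bar) ** 2 for y in sbp)               # total
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained
sse = sum((y - yh) ** 2 for y, yh in zip(sbp, y_hat))  # unexplained

r2 = ssr / sst                          # equivalently 1 - sse/sst
assert abs(sst - (sse + ssr)) < 1e-6    # SST = SSE + SSR
print(round(r2, 2))                     # ≈ 0.48
```

That is, weight explains roughly half of the variation in SBP in this sample.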
Multiple Linear Regression
 More than one predictor…

    E(y) = α + β1X1 + β2X2 + β3X3 + … + βkXk

 Each regression coefficient is the amount of change
in the outcome variable that would be expected per
one-unit change of the predictor, if all other variables
in the model were held constant.
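To illustrate the "held constant" interpretation, a toy sketch with two predictors. The coefficients here are hypothetical, invented for illustration, not estimated from any data:

```python
# Hypothetical fitted model with two predictors (made-up coefficients):
#   E(y) = alpha + b1*x1 + b2*x2
alpha, b1, b2 = 40.0, 0.9, 1.5

def predict(x1, x2):
    """Expected outcome under the hypothetical model."""
    return alpha + b1 * x1 + b2 * x2

# Raising x1 by one unit while holding x2 fixed changes the expected
# outcome by exactly b1, whatever value x2 is held at.
diff = predict(51, 30) - predict(50, 30)
assert abs(diff - b1) < 1e-9
```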
Checking the assumptions: Residual Analysis

    ei = Yi - Ŷi

 The residual for observation i, ei, is the difference between its
observed and predicted value
 Check the assumptions of regression by examining the
residuals
– Examine for linearity assumption
– Examine for constant variance for all levels of X (homoscedasticity)
– Evaluate normal distribution assumption
– Evaluate independence assumption

 Graphical Analysis of Residuals
– Can plot residuals vs. X
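Before plotting, two algebraic properties of least-squares residuals can be checked numerically: with an intercept in the model, the residuals sum to zero and are uncorrelated with x. A sketch on the weight/SBP example:

```python
# Residuals e_i = y_i - (alpha + beta*x_i) for the weight/SBP fit
weight = [60, 42, 53, 49, 67, 47, 62, 48, 49, 55,
          48, 46, 39, 66, 39, 51, 57, 37, 49, 37]
sbp = [106, 111, 115, 102, 126, 85, 125, 103, 108, 98,
       94, 97, 96, 115, 79, 108, 97, 96, 95, 95]

n = len(weight)
x_bar = sum(weight) / n
y_bar = sum(sbp) / n
beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, sbp))
        / sum((x - x_bar) ** 2 for x in weight))
alpha = y_bar - beta * x_bar

residuals = [y - (alpha + beta * x) for x, y in zip(weight, sbp)]

# Algebraic consequences of least squares with an intercept:
assert abs(sum(residuals)) < 1e-6                                 # sum to zero
assert abs(sum(e * x for e, x in zip(residuals, weight))) < 1e-6  # orthogonal to x
```

Patterns in these residuals, plotted against x, are what the diagnostic plots below look for.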
Residual Analysis for Linearity

 [Paired plots of Y vs. x and residuals vs. x: a curved residual
pattern indicates the relationship is not linear; a patternless band
around zero indicates linearity.]
Residual Analysis for Homoscedasticity

 [Paired plots of Y vs. x and residuals vs. x: a fanning spread of
residuals indicates non-constant variance; an even band indicates
constant variance.]
Residual Analysis for Independence

 [Plots of residuals vs. X: a systematic pattern over X indicates
the residuals are not independent; a random scatter indicates
independence.]
Logistic Regression

 Logistic Regression is used when the outcome
variable is categorical
 The independent variables could be either
categorical or continuous
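Concretely, the logistic model relates a binary outcome to a predictor through the logit: log(p/(1-p)) = α + βx, so p = 1/(1 + e^-(α+βx)). A minimal sketch with hypothetical coefficients (not estimated from any data), showing that exp(β) is the odds ratio for a one-unit increase in x:

```python
import math

# Hypothetical coefficients, for illustration only
alpha, beta = -2.0, 0.5

def prob(x):
    """P(outcome = 1 | x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# A one-unit increase in x multiplies the odds by exp(beta),
# regardless of the starting value of x.
assert abs(odds(prob(3)) / odds(prob(2)) - math.exp(beta)) < 1e-9
```

This multiplicative odds interpretation is why logistic regression coefficients are usually reported as odds ratios.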
Multiple Logistic Regression

 A multiple logistic regression model can be used to
adjust for the effect of other variables when
assessing the association between the exposure (E)
and disease (D) variables
Thank you!