
DATA SCIENCE

UNIT – 2
PARAMETRIC TESTING
Types Of Hypothesis Testing
Difference Between Parametric and Non-Parametric Tests
Unpaired t-test and Paired t-test
Parametric Testing
Z TEST OF HYPOTHESIS FOR THE MEAN
(KNOWN 𝜎) – Two Tail Test
• The Z-test statistic measures the difference between the sample
mean, X̄, and the population mean, 𝜇, in units of the standard error,
when the population standard deviation, 𝜎, is known.
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$$
THE SIX-STEP METHOD OF HYPOTHESIS TESTING
1. State the null hypothesis, 𝐻0 , and the alternative hypothesis, 𝐻1
2. Choose the level of significance, 𝛼, and the sample size, n. The level of
significance is based on the relative importance of the risks of committing Type
I and Type II errors in the problem.
3. Determine the appropriate test statistic and sampling distribution.
4. Determine the critical values that divide the rejection and nonrejection
regions.
5. Collect the sample data and compute the value of the test statistic.
6. Make the statistical decision and state the managerial conclusion. If the test
statistic falls into the nonrejection region, you do not reject the null hypothesis,
𝐻0 . If the test statistic falls into the rejection region, you reject the null
hypothesis. The managerial conclusion is written in the context of the real-
world problem.
Example
• Assume the following
• 𝜇 = 368, X̄ = 372.5, 𝜎 = 15, 𝑛 = 25
• Level of significance 𝛼 = 0.05
• 𝐻0 : 𝜇 = 368
• 𝐻1 : 𝜇 ≠ 368
• Decision rule
• Reject 𝐻0 if Z > +1.96 or if Z < -1.96
• Otherwise do not reject 𝐻0
• $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{372.5 - 368}{15 / \sqrt{25}} = +1.50$
• Because the test statistic Z = +1.50 is between -1.96 and +1.96, you do not reject 𝐻0
• To take into account the possibility of a Type II error, you state the conclusion as
there is insufficient evidence that the mean fill is different from 368 grams.
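The critical value approach for this fill example can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the helper name `z_statistic` is hypothetical, and the critical value is taken from the standard library's `statistics.NormalDist` instead of a printed table.

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(sample_mean, mu0, sigma, n):
    """Z = (sample mean - hypothesized mean) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / sqrt(n))

alpha = 0.05
z = z_statistic(sample_mean=372.5, mu0=368, sigma=15, n=25)

# Two-tail test: alpha is split equally between the two tails.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

reject_h0 = abs(z) > z_crit
print(f"Z = {z:.2f}, critical values = +/-{z_crit:.2f}, reject H0: {reject_h0}")
```

Splitting 𝛼 across both tails is what produces the familiar ±1.96 cutoffs at the 0.05 level.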
Practice Question - 01
Practice Question - 02
Practice Question - 03
• Example 3: A teacher claims that the mean score of students in his
class is greater than 82 with a standard deviation of 20. If a sample of
81 students was selected with a mean score of 90 then check if there
is enough evidence to support this claim at a 0.05 significance level.
Practice Question - 04
Practice Question - 05
You are the manager of a fast-food restaurant. You want to determine
whether the population mean waiting time to place an order has changed
in the past month from its previous population mean value of 4.5
minutes. From past experience, you can assume that the population is
normally distributed with a population standard deviation of 1.2
minutes. You select a sample of 25 orders during a one-hour period. The
sample mean is 5.1 minutes.
Determine whether there is evidence at the 0.05 level of significance
that the population mean waiting time to place an order has changed in
the past month from its previous population mean value of 4.5 minutes.
Solution:
• Step 1: 𝐻0 : 𝜇 = 4.5, 𝐻1 : 𝜇 ≠ 4.5
• Step 2: n = 25, Level of significance 𝛼 = 0.05
• Step 3: Because 𝜎 is known, you use the normal distribution and the Z test
statistic
• Step 4: Decision rule
• Reject 𝐻0 if Z > +1.96 or if Z < -1.96
• Otherwise do not reject 𝐻0
• Step 5: Z = (5.1 – 4.5)/(1.2/√25) = 0.6/0.24 = 2.50
• Step 6: Because Z = 2.50 > 1.96, you reject the null hypothesis. You conclude that
there is evidence that the population mean waiting time to place an order
has changed from its previous value of 4.5 minutes. The mean waiting time
for customers is longer now than it was last month.
Practice Question - 06
• A company that manufactures chocolate bars is particularly concerned
that the mean weight of a chocolate bar not be greater than 6.03
ounces. Past experience allows you to assume that the standard
deviation is 0.02 ounces. A sample of 50 chocolate bars is selected,
and the sample mean is 6.034 ounces. Using the 𝛼 = 0.01 level of
significance, is there evidence that the population mean weight of the
chocolate bars is greater than 6.03 ounces?
Solution:
• 𝐻0 : 𝜇 ≤ 6.03, 𝐻1 : 𝜇 > 6.03
• n = 50, 𝛼 = 0.01
• Use the Z test statistic
• Decision rule: Reject 𝐻0 if Z > 2.33; otherwise do not reject 𝐻0
• 𝑛 = 50, 𝜇 = 6.03, X̄ = 6.034, 𝜎 = 0.02
• $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{6.034 - 6.03}{0.02 / \sqrt{50}} = 1.414$
• Because Z = 1.414 < 2.33, you do not reject the null hypothesis. There is
insufficient evidence to conclude that the population mean weight is greater
than 6.03 ounces.
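For an upper-tail test the whole of 𝛼 sits in the right tail, which is where the 2.33 cutoff comes from. The following sketch (variable names are illustrative assumptions, and the cutoff is computed with `statistics.NormalDist` instead of being read from a table) reproduces the chocolate bar calculation:

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar, alpha = 6.03, 0.02, 50, 6.034, 0.01

# Test statistic for the sample of 50 bars.
z = (xbar - mu0) / (sigma / sqrt(n))

# Upper-tail test: the entire alpha is placed in the right tail.
z_crit = NormalDist().inv_cdf(1 - alpha)

reject_h0 = z > z_crit
print(f"Z = {z:.3f}, critical value = {z_crit:.2f}, reject H0: {reject_h0}")
```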
The p-Value Approach to Hypothesis Testing
• The p-value is the probability of getting a test statistic equal to or
more extreme than the sample result, given that the null hypothesis,
𝐻0 , is true.
• The p-value is often referred to as the observed level of significance.
• The decision rules for rejecting 𝐻0 in the p-value approach are
• If the p-value is greater than or equal to 𝛼, do not reject the null hypothesis.
• If the p-value is less than 𝛼, reject the null hypothesis.
THE FIVE-STEP p-VALUE APPROACH TO HYPOTHESIS TESTING
1. State the null hypothesis, 𝐻0 , and the alternative hypothesis, 𝐻1
2. Choose the level of significance, 𝛼, and the sample size, n. The level of
significance is based on the relative importance of the risks of committing
Type I and Type II errors in the problem.
3. Determine the appropriate test statistic and sampling distribution.
4. Collect the sample data, compute the value of the test statistic, and
compute the p-value.
5. Make the statistical decision and state the managerial conclusion. If the
p-value is greater than or equal to 𝛼, you do not reject the null hypothesis,
𝐻0 . If the p-value is less than 𝛼, you reject the null hypothesis. Remember
the mantra: If the p-value is low, then 𝐻0 must go. The managerial
conclusion is written in the context of the real-world problem.
• Example of a normal table lookup: for Z = 1.43, the cumulative probability is P(Z < 1.43) = 0.9236
Example Try
• You are the manager of a fast-food restaurant. You want to determine whether the population mean waiting
time to place an order has changed in the past month from its previous population mean value of 4.5
minutes. From past experience, you can assume that the population is normally distributed with a
population standard deviation of 1.2 minutes. You select a sample of 25 orders during a one-hour period.
The sample mean is 5.1 minutes. Use the five-step p-value approach to determine whether there is evidence at the
0.05 level of significance that the population mean waiting time to place an order has changed in the past
month from its previous population mean value of 4.5 minutes.
Answer:
• Step 1: 𝐻0 : 𝜇 = 4.5, 𝐻1 : 𝜇 ≠ 4.5
• Step 2: n = 25, Level of significance 𝛼 = 0.05
• Step 3: Because 𝜎 is known, you use the normal distribution and the Z test statistic
• Step 4: Z = (5.1 – 4.5)/(1.2/√25) = 2.50. The probability of a value below +2.50 is 0.9938. Therefore, the
probability of a value above +2.50 is 1 - 0.9938 = 0.0062. Thus, the p-value for this two-tail test
is 0.0062 + 0.0062 = 0.0124
• Step 5: Because the p-value = 0.0124 < 𝛼 = 0.05, you reject the null hypothesis. You conclude that there is evidence
that the population mean waiting time to place an order has changed from its previous population mean
value of 4.5 minutes. The mean waiting time for customers is longer now than it was last month.
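The two-tail p-value above can be checked numerically. The sketch below is an illustration only (variable names are assumptions, not from the slides); it uses the standard normal CDF from Python's `statistics.NormalDist` in place of the printed table.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar, alpha = 4.5, 1.2, 25, 5.1, 0.05

z = (xbar - mu0) / (sigma / sqrt(n))

# Two-tail p-value: probability of a value at least as extreme in either tail.
upper_tail = 1 - NormalDist().cdf(abs(z))
p_value = 2 * upper_tail

reject_h0 = p_value < alpha
print(f"Z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```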
ONE-TAIL TESTS
• The Critical Value Approach
• Suppose you wish to determine whether the mean freezing point of milk is
less than -0.545
• n = 25, 𝛼 = 0.05
• 𝐻0 : 𝜇 ≥ −0.545, 𝐻1 : 𝜇 < −0.545
• 𝜎 is known, so use the Z test statistic
• Decision rule: Reject 𝐻0 if Z < −1.645; otherwise do not reject 𝐻0
• 𝜎 = 0.008, X̄ = −0.550
• $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{-0.550 - (-0.545)}{0.008 / \sqrt{25}} = -3.125$
• Because Z = −3.125 < −1.645, you reject the null hypothesis
ONE-TAIL TESTS
• The p- Value Approach
• Suppose you wish to determine whether the mean freezing point of milk is less than -0.545
• n = 25, 𝛼 = 0.05
• 𝐻0 : 𝜇 ≥ −0.545, 𝐻1 : 𝜇 < −0.545
• 𝜎 is known, so use the Z test statistic
• 𝜎 = 0.008, X̄ = −0.550
• $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{-0.550 - (-0.545)}{0.008 / \sqrt{25}} = -3.125$
• To compute the p-value, you need to find the probability that the Z value will be less than the
test statistic of -3.125
• The probability that the Z value will be less than -3.125 is 0.0009
• The p-value of 0.0009 is less than 𝛼 = 0.05. You reject 𝐻0. You conclude that the mean
freezing point of the milk provided is less than -0.545. The company should pursue an
investigation of the milk supplier because the mean freezing point is significantly less than
what is expected to occur by chance.
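For a lower-tail test the p-value is simply the area to the left of the test statistic. A minimal sketch for the milk example, with illustrative variable names and `statistics.NormalDist` standing in for the normal table:

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n, xbar, alpha = -0.545, 0.008, 25, -0.550, 0.05

z = (xbar - mu0) / (sigma / sqrt(n))

# Lower-tail test: p-value is the probability of a Z value below the statistic.
p_value = NormalDist().cdf(z)

reject_h0 = p_value < alpha
print(f"Z = {z:.3f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```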
Z TEST OF HYPOTHESIS FOR THE PROPORTION
• ONE SAMPLE Z TEST FOR THE PROPORTION
$$Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}}$$
Where
$$p = \frac{X}{n} = \frac{\text{number of successes in the sample}}{\text{sample size}}$$
𝜋 = hypothesized proportion of successes in the population
𝑝 = sample proportion of successes
Example – Z-Value
• Test whether the proportion of independent grocery owners who view
Wal-Mart as their biggest competitive threat is 0.50
• 𝐻0 : 𝜋 = 0.50, 𝐻1 : 𝜋 ≠ 0.50
• n = 151, X = 78
• p = X/n = 78/151 = 0.5166
• Select the 𝛼 = 0.05 level of significance
• Decision rule: Reject 𝐻0 if Z < −1.96 or Z > +1.96; otherwise do not reject 𝐻0
• Z = (78/151 − 0.50)/√(0.50(1 − 0.50)/151) = 0.4069
• Because −1.96 < Z = 0.4069 < +1.96, you do not reject 𝐻0
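The Z value in this example can be reproduced from the counts. The sketch below is illustrative only; `z_proportion` is a hypothetical helper name, and the standard error uses the hypothesized proportion 𝜋, as in the formula for the one-sample Z test for the proportion.

```python
from math import sqrt
from statistics import NormalDist

def z_proportion(successes, n, pi0):
    """Z = (p - pi0) / sqrt(pi0 * (1 - pi0) / n), with p = successes / n."""
    p = successes / n
    return (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

alpha = 0.05
z = z_proportion(successes=78, n=151, pi0=0.50)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tail cutoff

reject_h0 = abs(z) > z_crit
print(f"Z = {z:.4f}, reject H0: {reject_h0}")
```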
ONE SAMPLE Z TEST FOR THE PROPORTION
Steps: One-sample Z test for Proportion
Steps: (Contd…)
Dr. K. Adi Narayana Reddy, Assoc.Prof, DSAI, IFHE


Using the Normal Distribution to Approximate the Binomial Distribution
Example:
Solution
Contd…(Solution)
Contd…(Solution)
Practice Question
Solution:
Question:
• The average on a statistics test was 78 with a standard deviation of 8. If the
test scores are normally distributed, find the probability that a student
receives a test score greater than 85.
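One way to approach this question is to standardize the score and take the upper-tail area of the normal distribution. The sketch below is an illustration under that reading of the question; the variable names are assumptions.

```python
from statistics import NormalDist

scores = NormalDist(mu=78, sigma=8)

# Standardized score: how many standard deviations 85 lies above the mean.
z = (85 - scores.mean) / scores.stdev

# P(X > 85) is the area to the right of 85 under the normal curve.
prob_above_85 = 1 - scores.cdf(85)

print(f"Z = {z:.3f}, P(X > 85) = {prob_above_85:.4f}")
```

The result is a little over 19%, consistent with 85 lying just under one standard deviation above the mean.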
Example – p-value approach
• A fast-food chain has developed a new process to ensure that orders at the drive-through are
filled correctly. The previous process filled orders correctly 85% of the time. Based on a sample of
100 orders using the new process, 94 were filled correctly. At the 0.01 level of significance, can
you conclude that the new process has increased the proportion of orders filled correctly?
Answer:
• H0: 𝜋 ≤ 0.85 (that is, the proportion of orders filled correctly is less than or equal to 0.85)
• H1: 𝜋 > 0.85 (that is, the proportion of orders filled correctly is greater than 0.85)
• p = X/n = 94/100 = 0.94
• $Z = \frac{p - \pi}{\sqrt{\frac{\pi(1 - \pi)}{n}}} = \frac{0.94 - 0.85}{\sqrt{\frac{0.85(1 - 0.85)}{100}}} = 2.52$
• The p-value for Z > 2.52 is 0.0059.
• Using the critical value approach, you reject H0 if Z > 2.33. Using the p-value approach, you reject
H0 if the p-value < 0.01. Because Z = 2.52 > 2.33 or the p-value = 0.0059 < 0.01, you reject H0. You
have evidence that the new process has increased the proportion of correct orders above 0.85.
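The upper-tail p-value for the drive-through example can be verified with a short sketch. Variable names are illustrative assumptions; the computed p-value differs slightly in the last digit from a table lookup because Z is not rounded to 2.52 first.

```python
from math import sqrt
from statistics import NormalDist

x, n, pi0, alpha = 94, 100, 0.85, 0.01

p = x / n
z = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

# Upper-tail test: p-value is the area to the right of the test statistic.
p_value = 1 - NormalDist().cdf(z)

reject_h0 = p_value < alpha
print(f"Z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```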
Paired t Test to compare hA, hB
One sample t-test
Real life example of one sample t-test
Example
Solution
Practice Question:01
Solution:
Practice Question:02
Solution:02
Practice Question:03
Solution:03
Practice Question:
A professor wants to know if her introductory statistics class has a good
grasp of basic math. Six students are chosen at random from the class and
given a math proficiency test. The professor wants the class to be able to
score above 70 on the test. The six students get scores of 62, 92, 75, 68,
83, and 95. Can the professor have 90 percent confidence that the mean
score for the class on the test would be above 70?
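A possible way to set this up is as a one-sample, upper-tail t test with 𝐻0 : 𝜇 ≤ 70, 𝐻1 : 𝜇 > 70 and 𝛼 = 0.10 (this reading of "90 percent confidence" is an assumption). The sketch below computes the t statistic with the standard library; the critical value 1.476 for 5 degrees of freedom is taken from a standard t table and hard-coded.

```python
from math import sqrt
from statistics import mean, stdev

scores = [62, 92, 75, 68, 83, 95]
mu0 = 70
n = len(scores)

x_bar = mean(scores)
s = stdev(scores)  # sample standard deviation (divides by n - 1)

# One-sample t statistic: sigma is unknown, so s replaces it.
t = (x_bar - mu0) / (s / sqrt(n))

# Upper-tail critical value for alpha = 0.10 with n - 1 = 5 degrees of
# freedom, read from a standard t table (assumed, not computed here).
t_crit = 1.476

reject_h0 = t > t_crit
print(f"mean = {x_bar:.2f}, s = {s:.2f}, t = {t:.3f}, reject H0: {reject_h0}")
```

Under these assumptions the statistic exceeds the table value, so the sample supports a class mean above 70 at the 0.10 level.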
ANOVA
• Analysis of Variance (ANOVA) is also called Fisher's analysis of variance
• It is a statistical technique that uses a significance test to determine
whether the means of two or more groups differ
• The ANOVA test is performed by comparing two types of variation:
the variation between the sample means and the variation
within each of the samples
ANOVA(Analysis Of Variance)
ANOVA-Basic Definitions
• Sum of squares within the groups
$$SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2$$
• Sum of squares between the groups
$$SSB = \sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X})^2$$
Where n_i is the number of observations in group i, k is the number of groups, and N is the total number of observations
X̄ = overall mean
X̄_i = mean of the i-th group
• Mean sum of squares within the groups
$$MSW = \frac{SSW}{N - k}$$ (degrees of freedom = N − k)
• Mean sum of squares between the groups
$$MSB = \frac{SSB}{k - 1}$$ (degrees of freedom = k − 1)
• Total sum of squares
$$SST = SSW + SSB$$
• F-statistic
$$F = \frac{MSB}{MSW}$$
ANOVA
Mean score for 0mg = (9 + 8 + 7 + 8 + 8 + 9 + 8)/7 = 8.14
Mean score for 50mg = (7 + 6 + 6 + 7 + 8 + 7 + 6)/7 = 6.71
Mean score for 100mg = (4 + 3 + 2 + 3 + 4 + 3 + 2)/7 = 3
Overall mean score = (8.14 + 6.71 + 3)/3 = 5.95


• We then compare the F-statistic to the critical F-value for the
appropriate significance level and degrees of freedom, to determine
whether the difference in test scores between the groups is
significant.
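The F-statistic for the three dosage groups is not worked out on the slide, so the following sketch computes it directly from the SSW, SSB, MSW and MSB definitions. The scores are the ones listed in the group-mean calculations; the function name `one_way_anova` is a hypothetical helper, not a library call.

```python
from statistics import mean

groups = {
    "0mg":   [9, 8, 7, 8, 8, 9, 8],
    "50mg":  [7, 6, 6, 7, 8, 7, 6],
    "100mg": [4, 3, 2, 3, 4, 3, 2],
}

def one_way_anova(samples):
    """Return (SSB, SSW, F) for a list of groups of observations."""
    all_values = [x for g in samples for x in g]
    grand_mean = mean(all_values)
    k, total_n = len(samples), len(all_values)
    # Between-group variation: group means around the overall mean.
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in samples)
    # Within-group variation: observations around their own group mean.
    ssw = sum((x - mean(g)) ** 2 for g in samples for x in g)
    msb = ssb / (k - 1)
    msw = ssw / (total_n - k)
    return ssb, ssw, msb / msw

ssb, ssw, f_stat = one_way_anova(list(groups.values()))
print(f"SSB = {ssb:.2f}, SSW = {ssw:.2f}, F = {f_stat:.2f}")
```

Here the degrees of freedom are k − 1 = 2 and N − k = 18, which are the values to use when looking up the critical F-value.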
Practice Questions-01
Solution
Practice Questions-02
Practice Questions-03
Solution
ANOVA
ANOVA: Step:01
Group A   Group B   Group C
80        75        70
85        80        75
90        85        80
95        90        85

Mean score for Group A = (80 + 85 + 90 + 95)/4 = 87.5
Mean score for Group B = (75 + 80 + 85 + 90)/4 = 82.5
Mean score for Group C = (70 + 75 + 80 + 85)/4 = 77.5
Overall mean score = (87.5 + 82.5 + 77.5)/3 = 82.5


Step:02
Now we want to test whether there is a significant difference in test
scores between the three groups using ANOVA.
To do this, we first calculate the total variation in the data, which is the
sum of the squared deviations of each score from the overall mean:
Total variation = [(80-82.5)^2 + (85-82.5)^2 + (90-82.5)^2 + (95-82.5)^2
+ (75-82.5)^2 + (80-82.5)^2 + (85-82.5)^2 + (90-82.5)^2 + (70-82.5)^2 +
(75-82.5)^2 + (80-82.5)^2 + (85-82.5)^2] = 225 + 125 + 225 = 575
Step:03
• Next, we partition the total variation into two parts: variation
between groups and variation within groups.
• The degrees of freedom for the variation between groups is the
number of groups minus 1, which is 2.
• The degrees of freedom for the variation within groups is the total
number of scores minus the number of groups, which is 9.
Step:04
• Variation within groups = [(80-87.5)^2 + (85-87.5)^2 + (90-87.5)^2 +
(95-87.5)^2 + (75-82.5)^2 + (80-82.5)^2 + (85-82.5)^2 + (90-82.5)^2 +
(70-77.5)^2 + (75-77.5)^2 + (80-77.5)^2 + (85-77.5)^2] = 125 + 125 + 125 = 375
• Variation between groups = (total variation) – (variation within groups)
= 575 – 375 = 200
• Finally, we calculate the F-statistic by dividing the variation between
groups by the variation within groups, each scaled by its respective
degrees of freedom:
F = (200/2) / (375/9) = 2.40
• We then compare the F-statistic to the critical F-value for the
appropriate significance level and degrees of freedom, to determine
whether the difference in test scores between the groups is
significant.
Practice question:01
Solution:01
Practice Questions-02
Solution:02
Contd…
Contd…

2678 - 2470 = 208
Practice question:03
Solution:03
F- distribution
Step:01- Calculate Degree of freedom
Step:02- Calculate Grand Total(G)
Step:03- Calculate correction factor(C)
Step:04- Calculate Total Sum Square(SS.Total)
Step:05- Calculate Sum of squares between the group
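The step titles above follow the correction-factor shortcut for one-way ANOVA. The formulas are not shown on the slides, so the sketch below assumes the standard ones: C = G²/N, SS.Total = ΣX² − C, and SS between = Σ(Tᵢ²/nᵢ) − C, where Tᵢ is the total of group i. It applies them to the Group A/B/C scores; all variable names are illustrative.

```python
groups = {
    "A": [80, 85, 90, 95],
    "B": [75, 80, 85, 90],
    "C": [70, 75, 80, 85],
}

# Step 01: degrees of freedom
k = len(groups)
total_n = sum(len(g) for g in groups.values())
df_between, df_within = k - 1, total_n - k

# Step 02: grand total G
grand_total = sum(sum(g) for g in groups.values())

# Step 03: correction factor C = G^2 / N (assumed standard formula)
correction = grand_total ** 2 / total_n

# Step 04: total sum of squares = sum of squared scores minus C
ss_total = sum(x ** 2 for g in groups.values() for x in g) - correction

# Step 05: sum of squares between groups = sum(T_i^2 / n_i) minus C
ss_between = sum(sum(g) ** 2 / len(g) for g in groups.values()) - correction

# Within-group sum of squares follows by subtraction.
ss_within = ss_total - ss_between
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"G = {grand_total}, C = {correction}, SS.Total = {ss_total}, "
      f"SS between = {ss_between}, SS within = {ss_within}, F = {f_stat:.2f}")
```

The shortcut avoids computing each deviation from the mean, which is why it is popular for hand calculation; it gives the same sums of squares as the definition-based approach.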
MANOVA
• The Multivariate Analysis of Variance (MANOVA) is an extension of
the ANOVA
• While ANOVA deals with only one dependent variable (DV), MANOVA accounts for
multiple DVs at once
• It tests whether there are mean differences across groups on
multiple DVs; it is suitable for testing related DVs, e.g., testing
depression, anxiety, and stress across groups in one go
Assumptions
1. Normality (Shapiro-Wilk)
2. Univariate Outliers (Boxplots)
3. Multivariate Outliers (Mahalanobis Distances)
4. Multicollinearity (Correlation)
5. Linearity (Scatterplot)
6. Homogeneity of variance-covariance matrices (Box’s M)
