10620A Lecture8 H2023
Statistics
Lecture 8
Introduction to linear regression + Review of the midterm
Lecture 8: linear regression
OBJECTIVES OF THE LECTURE
01 Consolidate the knowledge acquired on linear regression
02 Learn new concepts: we’re taking a break
03 Further your knowledge
04 Participate in synchronous activities
Consolidation
• Introduction to regression
Regression, what is it?
(capsule 1)
[Figure: regression line]
Linear signifies: constant rate of change
Objectives
Predict the value of future observations
Lectures L1 to L11, by theme:
• Key founding principles: Introduction and planning of a study; Sampling variability and estimation; Hypothesis testing
• Statistical inference analysis techniques: Comparison of means (one, two, several); Independence tests; Linear regression
• Predictive statistics: Linear regression
Types of variables
(capsule 2)
Simple model: $Y = b_0 + b_1 X$
Multiple model: $Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$
Y: quantitative variable (with many possible values)
X: quantitative OR indicator (binary 0-1) variables
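As a minimal sketch of how the simple and multiple models turn explanatory variables into a predicted value, the Python snippet below evaluates b0 + b1 X1 + ... + bk Xk for one observation. The coefficients and the variables (an area and a garage indicator) are hypothetical, chosen only for illustration; they do not come from the lecture.

```python
def predict(b0, coefs, xs):
    """Return b0 + b1*x1 + ... + bk*xk for a single observation."""
    if len(coefs) != len(xs):
        raise ValueError("need exactly one coefficient per explanatory variable")
    return b0 + sum(b * x for b, x in zip(coefs, xs))


# Hypothetical coefficients (assumptions for illustration only).
b0 = 50.0        # intercept
b_area = 2.0     # slope for a quantitative variable X1 (e.g. an area)
b_garage = 10.0  # slope for an indicator variable X2 (1 = garage, 0 = no garage)

# Simple model: one quantitative explanatory variable.
simple = predict(b0, [b_area], [100])

# Multiple model: the indicator shifts the prediction by b_garage when it equals 1.
with_garage = predict(b0, [b_area, b_garage], [100, 1])
without_garage = predict(b0, [b_area, b_garage], [100, 0])

print(simple, with_garage, without_garage)
```

With an indicator variable, the coefficient is simply the difference in the predicted value between the two groups, all other variables being equal.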
Exercises
In the two examples on the following pages, quotes were taken from articles in La
Presse. For each of them:
Exercise 1 (taken from La Presse)
Source: La Presse
Question: "Why does it take 10 times more elected officials to lead Montreal
than a city in Western Canada?"
Hypothesis: "Some defenders of our system argue that reducing the number of
elected officials would not lead to substantial savings. According to them, the
remaining elected officials would need larger teams to serve the population,
which would command equally large salaries."
Exercise 1 (continued)
Study conducted:
"To test this hypothesis, [researchers] looked at how the cost of running
municipal councils in Canada varies with the number of elected officials."
"In other words, they [the researchers] sought to verify whether the addition of
an elected official tends to inflate the budget of a city because of the needs
that this new elected official creates, essentially."
"In other words, the decrease in the number of elected officials is not offset by
an increase in the support staff of other elected officials, or hardly."
Exercise 1 (continued)
Exercise 2 (taken from La Presse)
Source: La Presse
Exercise 2 (continued)
Study conducted:
"In collaboration with data specialists from the accounting firm Raymond
Chabot Grant Thornton, Val-des-Cerfs developed a computer calculation last
year which proved to be of impressive precision."
"To develop its tool, the accounting firm compiled 300 different factors in the
record of each of the 60,000 students who have studied at the CSVDC since
2002.
These data include academic results as well as statistics relating to financial
aid, absenteeism, disciplinary measures and frequent changes of address."
Exercise 2 (continued)
"By examining all students at the end of Grade 6, the model correctly identified
92% of students who would drop out of Secondary III.
At the end of the 2017-18 school year, the model identified approximately 90
students who entered high school this fall and who are at risk of dropping
out."
Exercise 2 (continued)
A. Give the targeted objective.
Estimation of a regression model
(capsule 3)
[Figure: observed price, price estimated by the model, and error (residual) of the data point]
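The idea of an estimated price and a residual for each data point can be sketched in a few lines of Python. The snippet assumes the usual least-squares criterion (the slide does not name the estimation method) and uses a small hypothetical data set; the names `area` and `price` are illustrative only.

```python
def fit_simple_regression(x, y):
    """Least-squares estimates (b0, b1) for the simple model Y = b0 + b1*X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical data, for illustration only.
area = [1, 2, 3, 4, 5]
price = [3, 5, 4, 7, 6]

b0, b1 = fit_simple_regression(area, price)

# Price estimated by the model for each data point.
estimated = [b0 + b1 * xi for xi in area]

# Residual = observed price minus estimated price.
residuals = [yi - yhat for yi, yhat in zip(price, estimated)]

print(b0, b1)
print(residuals)
```

Under least squares with an intercept, the residuals sum to zero, which is a quick sanity check on a fitted model.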
Review of the midterm exam
Midterm exam
Warning
Section 1 – Unbiased estimators
It is known that the sample proportion is an unbiased estimator of the population proportion when
calculated from a simple random sample with replacement.
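Unbiasedness can be illustrated by simulation: over many simple random samples drawn with replacement, the average of the sample proportions should land very close to the population proportion. The sketch below uses a hypothetical population with a true proportion of 0.30; the population, sample size, and seed are assumptions made for the example.

```python
import random


def sample_proportion(population, n, rng):
    """Proportion of 1s in a simple random sample of size n drawn with replacement."""
    sample = [rng.choice(population) for _ in range(n)]
    return sum(sample) / n


rng = random.Random(2023)  # fixed seed so the run is reproducible

# Hypothetical population: 30 units with the characteristic (1), 70 without (0).
population = [1] * 30 + [0] * 70
true_p = sum(population) / len(population)

# Repeat the sampling many times and average the estimates.
estimates = [sample_proportion(population, 50, rng) for _ in range(5000)]
mean_estimate = sum(estimates) / len(estimates)

print(true_p, mean_estimate)
```

Individual estimates vary from sample to sample; it is their long-run average that matches the population proportion.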
Section 1 – Link between intervals and tests
Knowing that the confidence interval at the 95% confidence level, for the estimate of the
difference between two proportions is (0.66,0.80), what is the appropriate conclusion for a
hypothesis test, at the 5% significance level?
2) Since the value 0.80 > 0.05, we cannot conclude that there is a difference between the two
proportions.
3) Since the interval does not contain the value 0, we can conclude that there is a difference
between the two proportions.
4) Since the center of the interval is positive, i.e., 0.73, we can conclude that there is a
difference between the two proportions.
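The link between a 95% confidence interval and a two-sided test at the 5% level can be written as a one-line rule: the hypothesis of no difference is rejected exactly when the interval excludes the hypothesized value (here 0). The Python sketch below applies that rule to the interval from the question; the function name is hypothetical.

```python
def reject_no_difference(ci_lower, ci_upper, null_value=0.0):
    """Reject H0 (difference = null_value) when the interval excludes null_value.

    Assumes the interval's confidence level (e.g. 95%) matches the
    significance level of a two-sided test (e.g. 5%).
    """
    return not (ci_lower <= null_value <= ci_upper)


# Interval from the question: (0.66, 0.80) for the difference between two proportions.
reject = reject_no_difference(0.66, 0.80)

# A hypothetical interval that straddles 0 would not lead to rejection.
no_reject = reject_no_difference(-0.05, 0.10)

print(reject, no_reject)
```

The comparison is always with the hypothesized value of the parameter, never with the significance level itself.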
Section 1 – Choice of test
As part of a pilot study to assess the effectiveness of three advertising
concepts (let's call them A, B, and C), 90 consumers were asked to view
and successively rate each of them on a Likert scale (*) out of nine
points.
Data preview
Section 1 – Choice of test
Section 2 – Context
The Saint-Simon Hot Air Balloon Festival takes place annually on the last weekend (Friday to Sunday) of September.
Each year, the organizers collect data on a random sample of festival attendees who fly in a hot air balloon in the
hope of making improvements the following year.
The table below shows some summary statistics for the sample of 104 festival attendees surveyed at the 2022
edition.
Table 1: Descriptive statistics on waiting time (minutes) for a hot air balloon flight.
Section 2 – Variable under study
The average waiting time for all attendees who flew in a hot air balloon flight on
Sunday.
Section 2 – Variable under study
The proportion of all attendees who flew in a hot air balloon on Sunday.
Section 2 – Interpretation of a CI
1) 95% of the intervals obtained in a similar way will contain the value of the point
estimate, and it is impossible to know if the interval obtained is part of this 95%.
2) 95% of the intervals obtained in a similar way will contain the true value of the
parameter, and it is impossible to know if the interval obtained is part of this 95%.
3) 95% of the values of the parameter are between the bounds of the interval
obtained.
4) Since the value of the point estimate lies within the bounds of the obtained interval,
we know that the obtained interval contains the true value of the parameter.
5) Since the value of the point estimate lies outside the bounds of the obtained
interval, we know that the obtained interval does not contain the true value of the
parameter.
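The "95% of the intervals obtained in a similar way" reading can be checked by simulation: build many intervals from independent samples and count how often they contain the true parameter. The sketch below does this for a mean, using the normal approximation (1.96) and a hypothetical population of waiting times; the true mean, standard deviation, sample size, and seed are all assumptions.

```python
import math
import random


def mean_ci_95(sample):
    """Approximate 95% confidence interval for a mean (normal approximation)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in sample) / (n - 1))
    half_width = 1.96 * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


rng = random.Random(8)  # fixed seed so the run is reproducible
true_mean = 20.0        # hypothetical true average waiting time (minutes)
n_rep = 2000

covered = 0
for _ in range(n_rep):
    sample = [rng.gauss(true_mean, 5.0) for _ in range(100)]
    lo, hi = mean_ci_95(sample)
    if lo <= true_mean <= hi:
        covered += 1

coverage = covered / n_rep
print(coverage)
```

The observed coverage should be close to 0.95. For any single interval, however, there is no way to tell from the data alone whether it is one of those that contain the true value.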
Section 3 – Context
A few months ago, a major coffee shop chain launched a prepaid purchasing card that customers can use at any of
the chain's locations. Customers can store anywhere from $5 to $500 on the card and recharge it at any time. Three
types of cards are available, each with different privileges.
Recently, a major advertising campaign was conducted to increase the number of visits from their prepaid card
customers.
A sample of prepaid card customers was selected, and the following information is collected:
Section 3 – Choice of a hypothesis test
Among prepaid card customers with an annual income of $60,000 or less, is the
proportion of customers eligible for Golden Star status greater than the proportion who
are not?
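Asking whether one proportion exceeds its complement amounts to asking whether that proportion is greater than 0.5, which suggests a one-sided test on a single proportion. The Python sketch below shows the arithmetic of such a z-test under the normal approximation; the counts (60 eligible customers out of 100) are hypothetical and are not the exam data.

```python
import math
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0=0.5):
    """One-sided z-test of H0: p = p0 against H1: p > p0 (normal approximation)."""
    p_hat = successes / n
    # The standard error is computed under H0, i.e. with p0.
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    p_value = 1 - NormalDist().cdf(z)
    return p_hat, z, p_value


# Hypothetical counts, for illustration only.
p_hat, z, p_value = one_proportion_z_test(60, 100)
print(p_hat, z, p_value)
```

The normal approximation is only reasonable when the sample is large enough, which is where the conditions of validity of the test come in.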
Section 3 – Point estimates
[Table: cross-tabulation with rows Income60 (No, Yes) and columns (No, Yes); cell values not shown]
Section 3 – Conditions of validity
Reminder: The conditions of validity of a hypothesis test
Conditions of validity:
Section 3 – Reducing the error to which we are exposed
Which of the following strategies would reduce the risk of the type of error to which we are
exposed if we do not reject H0?
1) None.
2) Decrease the significance level.
3) Increase the significance level.
4) Increase the sample size.
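When H0 is not rejected, the error we are exposed to is a Type II error (failing to reject a false H0). As a rough sketch of how its probability responds to the significance level and the sample size, the Python snippet below computes it for a one-sided z-test on a mean. The effect size, standard deviation, and sample sizes are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist


def type_ii_error(effect, sigma, n, alpha):
    """P(not rejecting H0) for a one-sided z-test of H0: mu = mu0 vs H1: mu > mu0,
    when the true mean is actually mu0 + effect."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)  # rejection threshold on the z scale
    return z.cdf(z_crit - effect * math.sqrt(n) / sigma)


# Hypothetical baseline: effect 0.5, sigma 2, n = 50, alpha = 5%.
beta_base = type_ii_error(0.5, 2.0, 50, 0.05)

beta_larger_n = type_ii_error(0.5, 2.0, 200, 0.05)      # larger sample
beta_larger_alpha = type_ii_error(0.5, 2.0, 50, 0.10)   # higher significance level
beta_smaller_alpha = type_ii_error(0.5, 2.0, 50, 0.01)  # lower significance level

print(beta_base, beta_larger_n, beta_larger_alpha, beta_smaller_alpha)
```

In this setting, a larger sample lowers the Type II error probability without changing the Type I risk, whereas raising the significance level lowers it only at the cost of a higher Type I risk.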
Section 3 – Choice of a hypothesis test
Is the Americano prepaid card more popular among customers of the coffee shop chain
with a college degree than among customers without one?
Section 3 – Choice of a hypothesis test
Do coffee shop customers with a college degree store a similar amount on their card, on average,
as those without a college degree?
Section 3 – Choice of a hypothesis test
Can we say that the advertising campaign increased, on average, the number of visits by
a prepaid card customer to the coffee shop chain?
Section 4 – Context
A supermarket chain tests three different types of
promotions simultaneously. The objective is to
determine if the three types of promotions have the
same impact, i.e. a similar increase in sales.
Section 4 – Conclusion of the pairwise tests
Considering the descriptive statistics and test results, which conclusion would be
most appropriate (using an alpha level of 5%)?
Recommendation
Section 4 – Type of Study
Based on the findings of your analysis, is it possible to say that the type of
promotion is the cause of differences in sales increase?
1) No. This is an observational study. There could be factors other than the type
of promotion that explain the differences between the three groups.
2) No. Since the types of promotions are not different from each other, there is
no reason to draw such a conclusion.
3) No. Since the validity conditions of the test are not met, it is impossible to say
that the type of promotion causes differences in sales increase.
4) Yes, the experiment carried out allows us to conclude that it is the type of
promotion that makes the difference in the increase of sales.
Work to do
Work to do
Feedback and recommendation for the project
Feedback and recommendation
The grading of the projects is not yet finished for all the sections.
Feedback and recommendation
First, introduce your research question.
Clearly identify the population(s), variable(s), parameter(s) at the beginning of
the presentation.
Present the data collection methodology in a clear but concise manner.
Represent the results of your sample with an appropriate graph (related to the
hypotheses).
Specify the hypothesis test used and its conclusion, but do not present a
screenshot of the template.
If the hypothesis test presented in your report is not appropriate,
present the new test.
In the limitations, the names of the biases should be clearly cited and
described where they occur in the study.