10620A

Statistics
Lecture 8
Introduction to linear regression
+
Review of the midterm
Lecture 8: linear regression
OBJECTIVES OF THE LECTURE

01 CONSOLIDATE THE KNOWLEDGE ACQUIRED ON LINEAR REGRESSION
   Objectives targeted by regression + identification of variables + application spectrum

02 LEARN NEW CONCEPTS: FURTHER YOUR KNOWLEDGE

03 WE’RE TAKING A BREAK
   We review the exam questions with the poorest performance

04 PARTICIPATE IN SYNCHRONOUS ACTIVITIES

2
Consolidation
• Introduction to regression

3
Regression, what is it?
(capsule 1)

Linear regression is a collection of statistical methods for studying or
exploiting the linear relationship between variables.

[Figure: regression line]

Linear means: constant rate of change.

Objectives
• Predict the value of future observations
• Test the presence of a relationship (inference) and understand it


Overall organization of the course

Lectures L1 to L11:

Statistical inference
  Key founding principles
    Introduction and planning of a study
    Sampling variability and estimation
    Hypothesis testing
  Analysis techniques
    Comparison of means (one, two, several)
    Independence tests
    Linear regression
Predictive statistics
    Linear regression

5
Types of variables
(capsule 2)

Simple model: $Y = b_0 + b_1 X$
Multiple model: $Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$

$Y$: dependent or response variable
• This is the variable whose values we want to explain or predict.
• Quantitative variable (with many possible values).

$X$ or $X_1, \dots, X_k$: explanatory or independent variables
• Used to explain or predict.
• Quantitative OR indicator (binary 0-1) variables.
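The two model forms can be sketched in a few lines of Python. This is only an illustration: the coefficient values, and the idea of explaining a house price (in thousands of dollars) by its area and by a garage indicator, are hypothetical and do not come from the course data.

```python
def predict_simple(b0, b1, x):
    """Simple model: Y = b0 + b1 * X."""
    return b0 + b1 * x


def predict_multiple(b0, coefs, xs):
    """Multiple model: Y = b0 + b1*X1 + ... + bk*Xk."""
    if len(coefs) != len(xs):
        raise ValueError("one coefficient is needed per explanatory variable")
    return b0 + sum(b * x for b, x in zip(coefs, xs))


# Hypothetical coefficients (assumption, for illustration only):
# area is a quantitative variable, garage is an indicator (0 = no, 1 = yes).
b0, b_area, b_garage = 50.0, 2.0, 15.0

without_garage = predict_multiple(b0, [b_area, b_garage], [100, 0])
with_garage = predict_multiple(b0, [b_area, b_garage], [100, 1])

# The coefficient of an indicator variable is the gap between the two groups.
print(with_garage - without_garage)
```

With an indicator variable, the coefficient reads as the difference in the predicted response between the group coded 1 and the group coded 0, all other variables being equal.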

6
Exercises

In the two examples on the following pages, quotes were taken from articles in La
Presse. For each of them:

A. State the targeted objective:
   • predicting future observations
   OR
   • testing the presence of a relationship

B. Identify the dependent variable (Y) and the explanatory variable(s) (X or Xs).

C. If linear regression is not appropriate, say why.

7
Exercise 1 (taken from La Presse)
Source: La Presse

Context and motivation:

Question: "Why does it take 10 times more elected officials to lead Montreal
than a city in Western Canada?"

Hypothesis: "Some defenders of our system argue that reducing the number of
elected officials would not lead to substantial savings. According to them, the
remaining elected officials would need larger teams to serve the population,
which would command equally large salaries."

8
Exercise 1 (continued)
Study conducted:

"To test this hypothesis, [researchers] looked at how the cost of running
municipal councils in Canada varies with the number of elected officials."

"In other words, they [the researchers] sought to verify whether the addition of
an elected official tends to inflate the budget of a city because of the needs
that this new elected official creates, essentially."

One result of the study:

"Thus, a drop in the number of elected officials by 1% in a city leads to an
almost equivalent drop in council spending (0.8 to 0.9%).

In other words, the decrease in the number of elected officials is not offset by
an increase in the support staff of other elected officials, or hardly."

9
Exercise 1 (continued)

A. Give the targeted objective.

B. Identify the dependent variable (Y) and the explanatory variable(s) (X or Xs).

C. If linear regression is not appropriate, say why.

10
Exercise 2 (taken from La Presse)

Context and motivation:

"According to the director general of the CSVDC [Val-des-Cerfs School
Board], many students have dropped out without the teachers ever
considering them to be at high risk."

"[The director] acknowledges that it is rarely easy for teachers to target
students who are at risk of dropping out of school."

Source: La Presse

11
Exercise 2 (continued)

Study conducted:

"In collaboration with data specialists from the accounting firm Raymond
Chabot Grant Thornton, Val-des-Cerfs developed a computer calculation last
year which proved to be of impressive precision."

"To develop its tool, the accounting firm compiled 300 different factors in the
record of each of the 60,000 students who have studied at the CSVDC since
2002.
These data include academic results as well as statistics relating to financial
aid, absenteeism, disciplinary measures and frequent changes of address."

12
Exercise 2 (continued)

Results of the study:

"By examining all students at the end of Grade 6, the model correctly identified
92% of students who would drop out of Secondary III.

At the end of the 2017-18 school year, the model identified approximately 90
students who entered high school this fall and who are at risk of dropping
out."

13
Exercise 2 (continued)
A. Give the targeted objective.

B. Identify the dependent variable (Y) and the explanatory variable(s) (X or Xs).

C. If linear regression is not appropriate, say why.

14
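Estimation of a regression model
(capsule 3)

• Idea: We minimize the global estimation error.

• Specifically, we minimize the sum of squares of the errors at each point:

$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

where $y_i$ is the observed price, $\hat{y}_i$ is the price estimated by the model, and
$e_i = y_i - \hat{y}_i$ is the error (residual) of the data point.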
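The idea of minimizing the sum of squared errors can be sketched for the simple model as follows. This is a minimal illustration using the standard closed-form least squares solution; the data (areas and prices) and all function names are hypothetical and are not taken from the course material.

```python
def fit_simple_regression(xs, ys):
    """Least squares estimates (b0, b1) for the simple model Y = b0 + b1 * X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_of_squared_errors(b0, b1, xs, ys):
    """Sum over all points of (observed value - value estimated by the model)^2."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data (assumption): area in square metres, price in $1000.
areas = [50, 70, 80, 100, 120]
prices = [150, 200, 210, 270, 300]

b0, b1 = fit_simple_regression(areas, prices)
sse_best = sum_of_squared_errors(b0, b1, areas, prices)

# Any other line gives a larger sum of squared errors.
print(sse_best < sum_of_squared_errors(b0 + 1.0, b1, areas, prices))
print(sse_best < sum_of_squared_errors(b0, b1 + 0.1, areas, prices))
```

Moving either coefficient away from the fitted value increases the sum of squared errors, which is what "minimizing the global estimation error" means here.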

15
Review of the midterm
exam

16
Midterm exam

Section   Theme                                                    Weighting
1         Multiple choice questions                                38%
2         Estimation of a parameter                                24%
3         Choosing, conducting and concluding a hypothesis test    20%
4         Welch's test and recommendation                          18%

17
Warning

The next few slides are designed to review important concepts that were tested
on the midterm exam.

The questions selected are not necessarily the ones you answered in your exam,
nor are they necessarily the ones from this session's in-class exam.

The questions are linked to your personalized Excel file, and it is impossible
to show all the solutions here.

18
Section 1 – Unbiased estimators

It is known that the sample proportion $\hat{p}$ is an unbiased estimator of the population
proportion $p$ when calculated from a simple random sample with replacement.

Check the false statements.

1) The value of $\hat{p}$ varies depending on the sample selected.
2) The average value of $\hat{p}$ over all possible samples is $p$.
3) The estimator provides on average the right value.
4) Since it is unbiased, the estimator always gives exactly the desired value, i.e. $\hat{p} = p$.
5) The selected sample has no impact on the value of $\hat{p}$.
6) $\hat{p}$ takes different values depending on the sample selected, but these oscillate around
their average value, which is $p$.
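What "unbiased" means can be checked exactly on a tiny example by listing every possible sample. The sketch below assumes a hypothetical population of five units (1 = has the characteristic, 0 = does not) and samples of size 3 drawn with replacement; `p_hat` stands for the sample proportion and `p` for the population proportion.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population (assumption): 3 units out of 5 have the characteristic.
population = [1, 0, 0, 1, 1]
p = Fraction(sum(population), len(population))

n = 3
# All ordered samples of size n drawn with replacement; each is equally likely.
samples = list(product(population, repeat=n))
p_hats = [Fraction(sum(sample), n) for sample in samples]

# Average of the sample proportion over all possible samples.
mean_p_hat = sum(p_hats) / len(p_hats)

print(len(samples))               # number of possible samples
print(sorted(set(p_hats)))        # the sample proportion changes with the sample
print(mean_p_hat == p)            # yet its average equals the population proportion
```

In this example no single sample gives exactly the population proportion, yet the average over all samples matches it. Unbiasedness is a statement about the average, not about any individual sample.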

19
Section 1 – Link between intervals and tests

Knowing that the confidence interval at the 95% confidence level, for the estimate of the
difference between two proportions is (0.66,0.80), what is the appropriate conclusion for a
hypothesis test, at the 5% significance level?

1) It is impossible to know. It is absolutely necessary to do a hypothesis test to calculate the
p-value to compare with the 5% significance level.

2) Since the value 0.80 > 0.05, we cannot conclude that there is a difference between the two
proportions.

3) Since the interval does not contain the value 0, we can conclude that there is a difference
between the two proportions.

4) Since the center of the interval is positive, i.e., 0.73, we can conclude that there is a
difference between the two proportions.
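The link between a confidence interval and a two-sided test can be written as a one-line rule. The sketch below is a simplified illustration: it assumes the interval and the test are built at matching levels (95% and 5%) from the same estimate and standard error. The function name is hypothetical.

```python
def rejects_null_from_ci(lower, upper, null_value=0.0):
    """Two-sided test read off a confidence interval.

    H0 (parameter == null_value) is rejected at level alpha when the
    (1 - alpha) confidence interval does not contain null_value.
    """
    return not (lower <= null_value <= upper)


# Interval from the question: difference between two proportions.
print(rejects_null_from_ci(0.66, 0.80))

# A hypothetical interval that contains 0 leads to no rejection.
print(rejects_null_from_ci(-0.10, 0.20))
```

Only the position of the null value relative to the bounds matters. Comparing a bound with 0.05, or looking at the sign of the centre, plays no role in the rule.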

20
Section 1 – Choice of test
As part of a pilot study to assess the effectiveness of three advertising
concepts (let's call them A, B, and C), 90 consumers were asked to view
and successively rate each of them on a Likert scale (*) out of nine
points.

Data preview

With which test(s) would it be appropriate to address this question?

Are the scores for different advertising concepts different?

21
Section 1 – Choice of test

Are the scores for different advertising concepts different?

With which test(s) would it be appropriate to address this question?

1) Welch’s test to compare the mean scores of the 3 concepts:
   $H_0: \mu_A = \mu_B = \mu_C$ VS $H_1$: at least one difference

2) Pairwise tests to compare the mean scores two-by-two
   (Test 1, Test 2, Test 3: one per pair of concepts).

3) Tests on the correlation coefficient to determine whether there is
   a dependency between the scores of each pair of concepts.

4) Tests on one proportion (Test 1, Test 2, Test 3),
   where $p$ = proportion of people who prefer concept A, B or C.

22
Section 2 – Context
The Saint-Simon Hot Air Balloon Festival takes place annually on the last weekend (Friday to Sunday) of September.
Each year, the organizers collect data on a random sample of festival attendees who fly in a hot air balloon in the
hope of making improvements the following year.
The table below shows some summary statistics for the sample of 104 festival attendees surveyed at the 2022
edition.

Table 1: Descriptive statistics on waiting time (minutes) for a hot air balloon flight.

                              Friday   Saturday   Sunday   All Three Days
Mean (minutes)                23.13    63.59      47.48    46.57
Stdev (minutes)               5.04     20.62      16.40    23.92
Minimum (minutes)             15       29         21       15
Maximum (minutes)             32       98         76       98
Number of attendees sampled   35       47         22       104
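As an illustration of estimating a parameter from Table 1, the sketch below computes a 95% confidence interval for the mean waiting time on Sunday from the summary statistics. It assumes a Student t interval is appropriate, and it hard-codes the critical value 2.080 (95% level, 21 degrees of freedom) as read from a t table. The personalized exam files may have used other data or another template.

```python
from math import sqrt

# Summary statistics for Sunday, taken from Table 1.
mean_sunday = 47.48   # minutes
sd_sunday = 16.40     # minutes
n_sunday = 22

# Assumption: t critical value for a 95% interval with n - 1 = 21 degrees of freedom.
t_crit = 2.080

standard_error = sd_sunday / sqrt(n_sunday)
margin = t_crit * standard_error
ci = (mean_sunday - margin, mean_sunday + margin)

print(f"95% CI for the Sunday mean: ({ci[0]:.2f}, {ci[1]:.2f}) minutes")
```

Under these assumptions the interval runs from roughly 40.2 to 54.8 minutes. The relatively small Sunday sample (22 attendees) is what makes it fairly wide.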

23
Section 2 – Variable under study

Considering that we want to estimate the following parameter

The average waiting time for all attendees who flew in a hot air balloon flight on
Sunday.

What is the variable under study when estimating this parameter?

1) The day of the week


2) Waiting time
3) The mean waiting time
4) The proportion of attendees who flew in a hot air balloon on this day
5) The number of attendees who flew in a hot air balloon on this day
6) A measure that indicates 1 if the flight took place on this day and indicates 0
otherwise

24
Section 2 – Variable under study

Considering that we want to estimate the following parameter

The proportion of all attendees who flew in a hot air balloon on Sunday.

What is the variable under study when estimating this parameter?

1) The day of the week


2) Waiting time
3) The mean waiting time
4) The proportion of attendees who flew in a hot air balloon on this day
5) The number of attendees who flew in a hot air balloon on this day
6) A measure that indicates 1 if the flight took place on this day and indicates 0
otherwise

25
Section 2 – Interpretation of a CI

Which sentence corresponds to a correct interpretation of the obtained confidence
interval?

1) 95% of the intervals obtained in a similar way will contain the value of the point
estimate, and it is impossible to know if the interval obtained is part of this 95%.
2) 95% of the intervals obtained in a similar way will contain the true value of the
parameter, and it is impossible to know if the interval obtained is part of this 95%.
3) 95% of the values of the parameter are between the bounds of the interval
obtained.
4) Since the value of the point estimate lies within the bounds of the obtained interval,
we know that the obtained interval contains the true value of the parameter.
5) Since the value of the point estimate lies outside the bounds of the obtained
interval, we know that the obtained interval does not contain the true value of the
parameter.
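The "95% of the intervals obtained in a similar way" wording can be illustrated by simulation. The sketch below is only a demonstration under simplifying assumptions: a hypothetical Normal population with known standard deviation, a z-based interval, and a fixed random seed. None of these values come from the exam.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical population parameters (assumption).
true_mean, sigma, n = 50.0, 10.0, 30

z = NormalDist().inv_cdf(0.975)          # about 1.96 for a 95% interval
half_width = z * sigma / sqrt(n)

repetitions = 2000
covered = 0
for _ in range(repetitions):
    sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
    x_bar = mean(sample)
    # The interval always contains its own point estimate x_bar;
    # the real question is whether it contains the true parameter.
    if x_bar - half_width <= true_mean <= x_bar + half_width:
        covered += 1

coverage = covered / repetitions
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

In practice only one interval is computed and the true mean is unknown, so there is no way to tell whether that particular interval is one of those that cover it.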

26
Section 3 – Context
A few months ago, a major coffee shop chain launched a prepaid purchasing card that customers can use at any of
the chain's locations. Customers can store anywhere from $5 to $500 on the card and recharge it at any time. Three
types of cards are available, each with different privileges.

Recently, a major advertising campaign was conducted to increase the number of visits from their prepaid card
customers.

A sample of prepaid card customers was selected, and the following information was collected:

Variable | Description | Values
Amount | Amount stored on the customer's prepaid card at the time of purchase ($). |
Card | Type of prepaid card | 1. Cappuccino, 2. Americano, 3. Double espresso
Visits1 | Number of visits made by the customer to one of the chain's locations in the month preceding the advertising campaign. |
Visits2 | Number of visits made by the customer to one of the chain's locations in the month following the advertising campaign. |
GoldenStar | Indicator variable specifying whether the client is eligible for "Golden Star" status. | 0. No, 1. Yes
Diploma | Indicator variable specifying whether the client has a college degree. | 0. No, 1. Yes
Income60 | Indicator variable specifying whether the client's annual income is $60,000 or less. | 0. No, 1. Yes

27
Section 3 – Choice of a hypothesis test

Among prepaid card customers with an annual income of $60,000 or less, is the
proportion of customers eligible for Golden Star status greater than the proportion who
are not?

What is the appropriate test to answer the question?

1) Comparison test of a mean / proportion against a predetermined value


2) Test of comparison of two means / proportions - case of independent samples
3) Test of comparison of two means / proportions - case of paired data
4) None of the tests listed are adequate

28
Section 3 – Point estimates

[Table: cross-tabulation with rows Income60 (No, Yes) and columns (No, Yes); the counts are not shown here]

29
Section 3 – Conditions of validity

Are the conditions of validity of the test met?

1) Yes, the sample(s) are large enough.


2) No. There are expected numbers below 5.
3) Yes, if we assume that the variable under study follows a Normal distribution and no
otherwise.
4) No. The sample size(s) are really too small.
5) Yes, the sample(s) are large enough AND the variable under study is definitely
normally distributed. Both conditions must be met for the test to be valid.

30
Reminder: The conditions of validity of a hypothesis test

Conditions of validity:

1. The sample is large.

OR

2. The variable under study follows the Normal distribution.

Reminder: To know if condition 2 is met, the entire population would
need to be analyzed.

31
Section 3 – Reducing the error to which we are exposed

Which of the following strategies would reduce the type of error to which one is
exposed if we don’t reject H0?

1) None.
2) Decrease the significance level.
3) Increase the significance level.
4) Increase the sample size.

32
Section 3 – Choice of a hypothesis test

Is the Americano prepaid card more popular among customers of the coffee shop chain
with a college degree than among customers without one?

What is the appropriate test to answer the question?

1) Comparison test of a mean / proportion against a predetermined value


2) Test of comparison of two means / proportions - case of independent samples
3) Test of comparison of two means / proportions - case of paired data
4) None of the tests listed are adequate

33
Section 3 – Choice of a hypothesis test

Do coffee shop customers with a college degree store a similar amount on their card, on
average, to those without a college degree?

What is the appropriate test to answer the question?

1) Comparison test of a mean / proportion against a predetermined value


2) Test of comparison of two means / proportions - case of independent samples
3) Test of comparison of two means / proportions - case of paired data
4) None of the tests listed are adequate

34
Section 3 – Choice of a hypothesis test

Can we say that the advertising campaign increased, on average, the number of visits by
a prepaid card customer to the coffee shop chain?

What is the appropriate test to answer the question?

1) Comparison test of a mean / proportion against a predetermined value


2) Test of comparison of two means / proportions - case of independent samples
3) Test of comparison of two means / proportions - case of paired data
4) None of the tests listed are adequate
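One of the options listed above, the comparison of two means with paired data, can be sketched as follows for before/after measurements on the same customers. The visit counts below are hypothetical (the real values sit in each student's personalized file), and the sketch only computes the test statistic, not the p-value.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical data (assumption): visits by the same 8 customers
# in the month before (Visits1) and after (Visits2) the campaign.
visits_before = [4, 6, 3, 5, 7, 2, 5, 4]
visits_after = [5, 8, 3, 6, 9, 4, 5, 6]

# With paired data, the analysis is done on the per-customer differences.
differences = [after - before for after, before in zip(visits_after, visits_before)]

d_bar = mean(differences)            # mean difference
s_d = stdev(differences)             # sample standard deviation of the differences
n = len(differences)

# t statistic for H0: mean difference = 0 (n - 1 degrees of freedom).
t_stat = d_bar / (s_d / sqrt(n))

print(f"mean difference = {d_bar:.2f}, t = {t_stat:.3f}, df = {n - 1}")
```

Working on the differences removes the customer-to-customer variability, which is why a paired analysis treats the two columns as one sample of differences rather than two independent samples.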

35
Section 4 – Context
A supermarket chain tests three different types of promotions simultaneously. The
objective is to determine if the three types of promotions have the same impact, i.e. a
similar increase in sales.

In concrete terms, a sample of supermarkets is selected and then randomly divided into
three groups. Finally, for each of the supermarkets, we calculate the difference (in
thousands of dollars) in sales (sales after the promotion minus sales before the
promotion).

Consider the following averages:

μ1 = the average difference in sales for promotion 1.
μ2 = the average difference in sales for promotion 2.
μ3 = the average difference in sales for promotion 3.

            Promo1   Promo2   Promo3
Mean        3.3639   4.8306   2.8861
Std. Dev.   1.1675   3.1606   0.6406

Pair     Stat. t   DF   P-Val.   H-B Corr.
P2 P3    3.6177    38   0.0009   0.0026
P1 P2   -2.6118    44   0.0123   0.0245
P1 P3    3.1527    54   0.0358   0.0358

36
Section 4 – Conclusion of the pairwise tests

            Promo1   Promo2   Promo3
Mean        3.3639   4.8306   2.8861
Std. Dev.   1.1675   3.1606   0.6406

Pair     Stat. t   DF   P-Val.   H-B Corr.
P2 P3    3.6177    38   0.0009   0.0026
P1 P2   -2.6118    44   0.0123   0.0245
P1 P3    3.1527    54   0.0358   0.0358

Considering the descriptive statistics and test results, which conclusion would be
most appropriate (using an alpha level of 5%)?
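The "H-B Corr." column can be approximately reproduced from the raw p-values, assuming it stands for the Holm-Bonferroni correction (the slide does not spell the abbreviation out). The sketch below applies that correction to the rounded p-values from the table, so small differences in the last digit are expected.

```python
def holm_bonferroni(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # smallest p-value first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Adjusted p-values must not decrease along the ordering.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Rounded p-values from the pairwise comparison table.
raw = {"P2 vs P3": 0.0009, "P1 vs P2": 0.0123, "P1 vs P3": 0.0358}
adjusted = dict(zip(raw, holm_bonferroni(list(raw.values()))))

alpha = 0.05
for pair, p_adj in adjusted.items():
    print(f"{pair}: adjusted p = {p_adj:.4f}, significant at 5%: {p_adj < alpha}")
```

It is the corrected p-values that are compared with the 5% level. The direction of each difference is then read from the descriptive statistics.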

37
Recommendation

Pairwise comparisons identify pairs of means that are statistically significantly
different.

Descriptive statistics are used to determine which recommendations to make
based on the objectives.

What should you consider before making your recommendation?

38
Section 4 – Type of Study

Based on the findings of your analysis, is it possible to say that the type of
promotion is the cause of differences in sales increase?
1) No. This is an observational study. There could be factors other than the type
of promotion that explain the differences between the three groups.
2) No. Since the types of promotions are not different from each other, there is
no reason to draw such a conclusion.
3) No. Since the validity conditions of the test are not met, it is impossible to say
that the type of promotion causes differences in sales increase.
4) Yes, the experiment carried out allows us to conclude that it is the type of
promotion that makes the difference in the increase of sales.

39
Work to do

40
Work to do

• Prepare for Lecture 9:
  o Five capsules on linear regression (inference component, the basics);
  o Complete the notes;
  o Do the exercises;
  o Take the quiz.

• Prepare the presentation for the project.

Your team should create a video (no longer than FIVE minutes)
presenting your study using whatever medium you choose (e.g. YouTube,
Vimeo, Google Drive link) so that the video can be shared with the class.
It is expected that your analyses and results will be corrected as
needed, based on the feedback provided by the teacher. Refer to the
explanatory document for all the details.

41
Feedback and
recommendation for the
project
42
Feedback and recommendation

• The correction of the projects is not yet finished for all the sections.

• Grades will be published in HEC en ligne as soon as possible.

• Some teams lost points for depth of analysis.

43
Feedback and recommendation
• First, introduce your research question.
• Clearly identify the population(s), variable(s), parameter(s) at the beginning of
  the presentation.
• Present the data collection methodology in a clear but concise manner.
• Represent the results of your sample with an appropriate graph (related to the
  hypotheses).
• Specify the hypothesis test used and its conclusion but do not present a
  screenshot of the template.
• If the hypothesis test presented in your report is not appropriate,
  present the new test.
• In the limitations, the names of the biases should be clearly cited and
  described where they occur in the study.

44
