
Elements of Statistics and Probability

STA 201

S M Rajib Hossain

MNS, BRAC University

Lecture-8
Regression Analysis

Regression analysis is a set of statistical methods used for the estimation of
relationships between a dependent variable and one or more independent
variables.

For example,

✓ It can be used to estimate the relationship between reckless driving and
the total number of road accidents.
✓ It can be used to estimate the effect on sales of spending a certain
amount of money on advertising.

Purpose of regression analysis:

✓ Cause-and-effect relationship.
✓ Prediction.

Types of variables in regression analysis:

✓ Dependent variables.
✓ Independent variables.

Dependent variables: The variables whose values are influenced or are to be
predicted.

The dependent variable is also known as the response, regressand or explained
variable.

Independent variables: The variables which influence the values of the
dependent variables or are used for prediction.

The independent variable is also known as the explanatory variable, predictor,
covariate or regressor.

Types of regression equation

✓ Simple regression equation: A regression equation containing only one
independent variable is called a simple regression equation.
✓ Multiple regression equation: A regression equation containing more
than one independent variable is called a multiple regression equation.

Simple linear regression equation

Let Y be the response variable, which we want to explain by a single
explanatory variable X.

The basic model is 𝑦 = 𝛼 + 𝛽𝑥 + 𝜀

✓ y is the value of the dependent variable for any given value of the
independent variable (x).
✓ 𝛼 is the intercept, the expected value of y when x is 0.
✓ 𝛽 is the regression coefficient (slope): how much we expect y to
change when x increases by one unit.
✓ x is the independent variable (the variable we expect is influencing y).

Here, 𝛼 and 𝛽 are parameters that must be estimated. The symbol 𝜀
represents the random error term. This does not mean that a mistake is being
made. It is simply a symbol used to indicate the absence of an exact
relationship between x and y.

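Estimated/fitted equation is $\hat{y} = \hat{\alpha} + \hat{\beta}x$

Here, $\hat{\beta} = \dfrac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}$

$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$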
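
The notes state these estimators without derivation. A short sketch of where they come from, assuming the estimates are obtained by ordinary least squares (the usual route to these formulas; the symbol $S$ for the sum of squared errors is introduced here only for illustration):

```latex
% Sum of squared errors to be minimised over \alpha and \beta
S(\alpha, \beta) = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2

% Setting both partial derivatives to zero gives the normal equations
\frac{\partial S}{\partial \alpha} = -2 \sum_{i=1}^{n} (y_i - \alpha - \beta x_i) = 0
\quad\Longrightarrow\quad
\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}

\frac{\partial S}{\partial \beta} = -2 \sum_{i=1}^{n} x_i (y_i - \alpha - \beta x_i) = 0
\quad\Longrightarrow\quad
\sum_{i=1}^{n} x_i y_i - n\hat{\alpha}\bar{x} - \hat{\beta} \sum_{i=1}^{n} x_i^2 = 0

% Substituting \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} into the second equation
\hat{\beta} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}
```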

Example 1: Exam Scores and Study Hours


Suppose you want to determine if there is a relationship between the number
of hours a student studies (independent variable) and their exam scores
(dependent variable). You collect data from a sample of students and want to
fit a simple linear regression model to the data.

Study hours (X)     2    3    4    5    6
Exam scores (Y)    60   70   75   80   85

Solution:
$x_i$    $y_i$    $x_i y_i$    $x_i^2$
2        60       120          4
3        70       210          9
4        75       300          16
5        80       400          25
6        85       510          36

$\sum_{i=1}^{n} x_i = 20$, $\sum_{i=1}^{n} y_i = 370$, $\sum_{i=1}^{n} x_i y_i = 1540$, $\sum_{i=1}^{n} x_i^2 = 90$

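Here, $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n} = \dfrac{20}{5} = 4$ and $\bar{y} = \dfrac{\sum_{i=1}^{n} y_i}{n} = \dfrac{370}{5} = 74$

We know, $\hat{\beta} = \dfrac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} = \dfrac{1540 - 5 \times 4 \times 74}{90 - 5 \times 4^2} = \dfrac{60}{10} = 6$

$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} = 74 - 6 \times 4 = 50$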

Fitted equation is 𝑦̂ = 50 + 6𝑥
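
The hand calculation can be checked with a few lines of Python. This is a minimal sketch using only the sum formulas for the slope and intercept; the function name `fit_simple_regression` and the variable names are illustrative choices, not part of the lecture.

```python
def fit_simple_regression(x, y):
    """Return (alpha_hat, beta_hat) using the sum formulas for simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    # Slope: (sum of x*y - n * x_bar * y_bar) / (sum of x^2 - n * x_bar^2)
    beta_hat = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    # Intercept: y_bar - beta_hat * x_bar
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Data from Example 1
study_hours = [2, 3, 4, 5, 6]
exam_scores = [60, 70, 75, 80, 85]

alpha_hat, beta_hat = fit_simple_regression(study_hours, exam_scores)
print(f"y_hat = {alpha_hat:g} + {beta_hat:g}x")  # y_hat = 50 + 6x
```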

Interpretation:

✓ The slope (𝛽̂ = 6) indicates that, on average, for each additional hour
of study, the exam score is expected to increase by approximately 6
points.
✓ The intercept (𝛼̂ = 50) suggests that if a student doesn't study at all (0
hours), their expected exam score is around 50.

For a study time of x = 8 hours (say),

𝑦̂ = 50 + 6 × 8 = 98
So, the predicted exam score is 98.
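
As a rough illustration of how the fitted line is used, the sketch below evaluates 𝑦̂ = 50 + 6𝑥 at each observed study time, lists the residuals (observed minus fitted), and repeats the prediction for 8 hours. The helper name `predict` is hypothetical. Note that x = 8 lies outside the observed range of 2 to 6 hours, so this prediction is an extrapolation.

```python
# Data from Example 1
study_hours = [2, 3, 4, 5, 6]
exam_scores = [60, 70, 75, 80, 85]


def predict(x):
    """Fitted line from Example 1: y_hat = 50 + 6x."""
    return 50 + 6 * x


fitted = [predict(x) for x in study_hours]                 # [62, 68, 74, 80, 86]
residuals = [y - f for y, f in zip(exam_scores, fitted)]   # [-2, 2, 1, 0, -1]

print(sum(residuals))  # 0: the residuals of this fitted line cancel out
print(predict(8))      # 98
```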

Coefficient of Determination: The coefficient of determination, $r^2$, is the
square of the Pearson correlation coefficient (r). It is the proportion of
variation in the observed values of the dependent variable explained by the
independent variable. The coefficient of determination always lies between 0
and 1. A value of $r^2$ near 0 suggests that the regression equation is not
very useful for making predictions, whereas a value of $r^2$ near 1 suggests
that the regression equation is quite useful for making predictions.
For the above example, r ≈ 0.9864 and $r^2$ ≈ 0.973, which means that about
97.3% of the variation in the exam scores can be explained by the study hours.
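
Evaluated directly on the five observations of Example 1, the Pearson formula gives r ≈ 0.986 and $r^2$ ≈ 0.973. The sketch below shows that calculation; the names `s_xy`, `s_xx` and `s_yy` (corrected sums of products and squares) are illustrative and do not appear in the lecture.

```python
import math

# Data from Example 1
x = [2, 3, 4, 5, 6]
y = [60, 70, 75, 80, 85]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Corrected sums of products and squares
s_xy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar   # 60
s_xx = sum(xi ** 2 for xi in x) - n * x_bar ** 2                  # 10
s_yy = sum(yi ** 2 for yi in y) - n * y_bar ** 2                  # 370

# Pearson correlation coefficient and coefficient of determination
r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = r ** 2

print(round(r, 4), round(r_squared, 4))  # 0.9864 0.973
```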

Problem: You are analyzing the relationship between the number of
visitors on a website and the time it takes for the website to load:

Number of visitors     100  150  200  250  300
Load time (seconds)     20   30   40   50   60

Problem: You want to predict customer satisfaction based on the response
time of a customer support system:

Response time (minutes)          10   20   30   40   50
Customer satisfaction (point)    80   70   60   50   40

Multiple regression equation


Suppose we have k independent variables $x_1, x_2, \ldots, x_k$ and we want to
explain Y.

The basic model is $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$

Where, $\beta_0$ is the intercept of the regression equation.

$\beta_j$ (j = 1, 2, …, k) are called the partial regression coefficients.

The parameter $\beta_j$ represents the expected change in the response y per
unit change in $x_j$ when all the remaining regressor variables
$x_i$ ($i \neq j$) are held constant. For this reason the parameters $\beta_j$
(j = 1, 2, …, k) are called the partial regression coefficients.
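
The notes do not say how the coefficients of the multiple regression model are estimated. The sketch below assumes ordinary least squares and solves the normal equations (X'X)b = X'y by Gauss-Jordan elimination in plain Python. The function name `fit_multiple_regression` is hypothetical, and the data are made up and noise-free, generated from y = 1 + 2x₁ + 3x₂ so that the known coefficients should be recovered up to rounding.

```python
def fit_multiple_regression(rows, y):
    """Least-squares estimates [b0, b1, ..., bk] via the normal equations."""
    # Prepend a constant 1 to each observation so that b0 acts as the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    p = len(X[0])
    # Build the augmented system [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in X) for j in range(p)]
        + [sum(r[i] * yi for r, yi in zip(X, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("regressors are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    # The last column now holds the solution.
    return [A[i][p] for i in range(p)]


# Hypothetical, noise-free data generated from y = 1 + 2*x1 + 3*x2
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coefficients = fit_multiple_regression(rows, y)
print([round(b, 6) for b in coefficients])
```

Here the second coefficient (about 2) is the expected change in y per unit change in x₁ with x₂ held constant, which is the partial regression coefficient interpretation given above.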
