
Regression Analysis

• Though Carl F. Gauss (1777–1855) is regarded as the father of regression, the term
“Regression” was first used in 1877 by Francis Galton.
• Regression is a technique concerned with predicting some variables by knowing
others.
• It is the study of the relationship between variables, i.e. the process of predicting
variable Y using variable X.
• Regression tells you how values of Y change as a function of changes in values of X.
• Regression analysis measures the nature and extent of the relationship between two
or more variables, and thus enables us to make predictions.
• Regression is the measure of the average relationship between two or more
variables.
Correlation and Regression
• Correlation describes the strength of a linear relationship between two variables
• Linear means “straight line”
• Regression tells us how to draw the straight line described by the correlation
Degree & Nature of Relationship
• Correlation is a measure of degree of relationship between two variables X & Y
• Regression studies the nature of relationship between the variables so that one
may be able to predict the value of one variable on the basis of another.
Cause & Effect Relationship:
• Correlation does not always assume cause and effect relationship between two
variables.
Correlation and Regression
• Regression clearly expresses the cause and effect relationship between two variables.
• The independent variable is the cause and the dependent variable is the effect.
Prediction
• Correlation doesn’t help in making predictions.
• Regression enables us to make predictions using the regression line.
Symmetric
• Correlation coefficients are symmetrical, i.e. rxy = ryx
• Regression coefficients are not symmetrical, i.e. bxy ≠ byx
Origin & Scale
• Correlation is independent of the change of origin and scale
• Regression coefficient is independent of change of origin but not of scale
Types of Regression Analysis
Regression analysis can be classified into three pairs of types:
1. Simple & Multiple Regression
2. Linear & Non Linear Regression
3. Partial & Total Regression
Multiple Regression
• In multiple regression, several explanatory variables work together
to explain the response.
• Following our principles of data analysis, we first look at the
distribution of each variable to be used in multiple regression to
determine if there are any unusual patterns that may be important in
building our regression analysis and then at relationships among the
variables.
Example of Multiple Regression
• In a study of direct operating cost, Y, for 67 branch offices of a consumer finance
chain, four independent variables were considered:
• X1: Average size of loan outstanding during the year,
• X2: Average number of loans outstanding,
• X3: Total number of new loan applications processed, and
• X4: Office salary scale index.
• The model for this example is
• Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε
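The study's actual data and estimates are not reproduced here, so the following is only a minimal sketch of fitting such a four-predictor model by least squares; the synthetic data and coefficient values are assumptions for illustration.

```python
# Sketch: fit Y = b0 + b1*X1 + b2*X2 + b3*X3 + b4*X4 on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n = 67                                             # 67 branch offices, as in the example
X = rng.uniform(1, 10, size=(n, 4))                # columns stand in for X1..X4
beta_true = np.array([2.0, 1.5, -0.5, 3.0, 0.8])   # assumed [intercept, b1..b4]
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0, 1, n)

A = np.column_stack([np.ones(n), X])               # design matrix with intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)       # least-squares estimates
print("estimated coefficients:", coef.round(2))
```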
Linear Regression
• In linear regression, the model specification is that the dependent
variable, yi, is a linear combination of the parameters (but need not
be linear in the independent variables).
• For example, in simple linear regression for modelling n data points
there is one independent variable, xi, and two parameters, β0 and β1:
• y = β0 + β1x1 + ε
• First-Order Model with One Predictor Variable
• General Linear Model: models in which the parameters (β0, β1, . . . ,
βp) all have exponents of one are called linear models.
Total Regression
• A total regression analysis is one which is made to study the effect of all the
important variables on one another.
• For example, when the effect of advertising expenditure, income of the people,
and price of the goods on the volume of sales are measured, it is a case of total
regression analysis.
• In a such cases, the regression equation takes the following forms like that of a
multiple regression analysis.
• S = f (A, I, P) and       
• X= f ( Y, Z, P) etc.
• This type of regression analysis is usually made in the field of business and
economics where values of a variable are effected by multiplicity of causes.
Partial Regression
• A partial regression analysis, on the other hand, is one which is made
to study the effect of one, or two relevant variables (excluding the
irrelevant one) on another variable.
• The equation of such a regression takes the following form like the
total simple regression analysis:
• Y = f (X but not of Z and P)
• S = f ( advertisement but not of price and income of the people)
Simple regression
• In simple linear regression we studied the relationship between one
explanatory variable and one response variable.
• Simple regression analysis is a statistical tool that gives us the ability to
estimate the mathematical relationship between a dependent variable (usually
called y) and an independent variable (usually called x).
• The dependent variable is the variable for which we want to make a prediction.
• General regression model:
• Yi = β0 + β1Xi + εi
1. β0 and β1 are parameters
2. X is a known constant
3. Deviations εi are independent N(0, σ²)
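As a minimal illustration of these assumptions, the sketch below simulates data from this model; the values chosen for β0, β1 and σ are assumptions, not from the text.

```python
# Simulate Y_i = b0 + b1*X_i + e_i with independent N(0, sigma^2) deviations.
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 2.0, 0.5, 1.0      # assumed parameter values
x = np.arange(1, 21, dtype=float)        # X treated as known constants
eps = rng.normal(0, sigma, size=x.size)  # independent N(0, sigma^2) errors
y = beta0 + beta1 * x + eps
print(y[:5].round(2))
```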
Simple Linear Regression
• Simple Linear Regression has 3 parts
1. Regression Lines
2. Regression Equations
3. Regression Coefficients
Regression Lines
• The regression line shows the average relationship between two
variables.
• It is also called Line of Best Fit
• If two variables X & Y are given, we shall have two regression lines:
1. Regression Line of X on Y
2. Regression Line of Y on X
• The regression Line of Y on X gives the most probable values of Y
given the values of X and regression Line of X on Y gives the most
probable values of X given the values of Y.
• Thus there are two regression lines
Regression Lines
• Nature of Regression Lines
• When there is either perfect positive or perfect negative correlation
between the two variables, i.e. r = ±1, the two regression lines will
coincide, i.e. we will have one line.
• The further the two regression lines are from each other, the lesser the
degree of correlation; the nearer the regression lines are to each
other, the greater the degree of correlation.
• If the variables are independent, r = 0, and the two regression lines
intersect each other at 90°.
• If the regression lines rise upward from left to right, the correlation is
positive.
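A small numerical sketch of this behaviour, using a made-up sample: the product of the two slopes equals r², so it is exactly 1 when r = ±1 (the lines coincide) and shrinks as the lines move apart.

```python
# Made-up sample; byx * bxy = r^2, so the lines coincide only when r = ±1.
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 4, 5, 4, 6], dtype=float)
x, y = X - X.mean(), Y - Y.mean()        # deviations from the means

byx = (x * y).sum() / (x ** 2).sum()     # slope of the Y-on-X line
bxy = (x * y).sum() / (y ** 2).sum()     # slope of the X-on-Y line
r2 = (x * y).sum() ** 2 / ((x ** 2).sum() * (y ** 2).sum())
print(np.isclose(byx * bxy, r2))         # True: product of slopes is r^2

# With perfectly correlated data (r = 1) the product is exactly 1,
# i.e. the two regression lines are one and the same line.
Yp = 2 * X + 1
yp = Yp - Yp.mean()
print(((x * yp).sum() / (x ** 2).sum()) * ((x * yp).sum() / (yp ** 2).sum()))
```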
Regression Equations
• Regression equations are the algebraic expressions of the regression lines.
• Since there are two regression lines, there are two regression equations:
• The regression equation of X on Y is used to describe the variation in the values of X
given changes in the values of Y, and
• the regression equation of Y on X is used to describe the variation in the values of Y
given changes in the values of X.
• The regression equation of Y on X is expressed as
• Y = a + bX
• where Y is the dependent variable to be estimated and X is the independent variable.
• In this equation, a and b are two unknown constants (fixed numerical values) which
determine the position of the line completely.
Regression Equations
• These constants are called the parameters of the line.
• If the value of either or both of them is changed, another line is
determined.
• The parameter ‘a’ determines the level of the fitted line (i.e. the
distance of the line directly above or below the origin).
• The parameter ‘b’ determines the slope of the line (i.e. the change in
Y for a unit change in X).
• If the values of the constants ‘a’ and ‘b’ are obtained, the line is
completely determined. But the question is how to obtain these values.
• The answer is provided by the method of least squares.
Regression Equations
• Least Squares Method
• The least squares method states that the line should be drawn through
the plotted points in such a manner that the sum of the squares of the
vertical deviations of the actual Y values from the estimated Y values
is the least; in other words, in order to obtain the line which fits the
points best, Σ(Y − Ŷ)² should be the minimum.
• Such a line is known as the line of best fit.
• Using differential calculus, this minimization yields two equations
that can be solved simultaneously.
Regression Equations
• If solved simultaneously, the following equations yield the values of the
parameters such that the least squares requirement is fulfilled:
ΣY = Na + bΣX ………………………….(i)
ΣXY = aΣX + bΣX² ………………........(ii)
These equations are usually called normal equations. In the equations, Σ
indicates totals which are computed from the observed pairs of values of the
two variables X and Y to which the least squares estimating line is to
be fitted, and N is the total number of observed pairs of values.
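As a quick illustration, the two normal equations form a 2×2 linear system in a and b; the sketch below solves it with numpy on made-up data values.

```python
# Solve the normal equations  ΣY = Na + bΣX  and  ΣXY = aΣX + bΣX².
import numpy as np

X = np.array([2, 4, 6, 8], dtype=float)   # illustrative data
Y = np.array([3, 7, 8, 12], dtype=float)
N = len(X)

A = np.array([[N,       X.sum()],
              [X.sum(), (X ** 2).sum()]])
rhs = np.array([Y.sum(), (X * Y).sum()])
a, b = np.linalg.solve(A, rhs)            # a = 0.5, b = 1.4 for this sample
print(f"Y = {a:.3f} + {b:.3f} X")
```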
Regression Equations
• Regression Equation of X on Y
• The regression equation of X on Y is expressed as
• X = a + bY
• To determine the values of a and b, the following two normal equations
are to be solved simultaneously:
ΣX = Na + bΣY ………………………….(i)
ΣXY = aΣY + bΣY² ………………........(ii)
Exercise
• Calculate the regression equations of X on Y and Y on X from the
following data
X: 1 2 3 4 5
Y: 2 5 3 8 7
Solution:
X        Y        X²        Y²        XY
1        2        1         4         2
2        5        4         25        10
3        3        9         9         9
4        8        16        64        32
5        7        25        49        35
ΣX = 15  ΣY = 25  ΣX² = 55  ΣY² = 151  ΣXY = 88
Regression Equations
• The regression equation of X on Y is given by
• X = a + bY
The normal equations are
ΣX = Na + bΣY ………………………….(i)
ΣXY = aΣY + bΣY² ………………........(ii)
Substituting the values, we get
15 = 5a + 25b …………………………….. (i)
88 = 25a + 151b …………………………... (ii)
Regression Equations
Solving (i) and (ii), we get
a = 0.5 and b =0.5
Hence the required regression of X on Y is given by
X = 0.5 + 0.5Y
Regression Equations
• The regression equation of Y on X is given by
• Y = a + bX
• The normal equations are
ΣY = Na + bΣX ………………………….(iii)
ΣXY = aΣX + bΣX² ………………........(iv)
Substituting the values, we get
25 = 5a + 15b …………………………. (iii)
88 = 15a + 55b …………………………. (iv)
Regression Equations
Solving (iii) and (iv) we get,
a = 1.10 and b = 1.3
Hence the required regression equation of Y on X is given by
Y = 1.10 + 1.30 X
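The worked results can be checked by solving both pairs of normal equations with numpy; this is only a verification sketch, not part of the original exercise.

```python
# Verify both regression equations from the exercise data.
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 5, 3, 8, 7], dtype=float)

# Y on X: solve  ΣY = Na + bΣX,  ΣXY = aΣX + bΣX²
a_yx, b_yx = np.linalg.solve(
    [[len(X), X.sum()], [X.sum(), (X ** 2).sum()]],
    [Y.sum(), (X * Y).sum()])
print(f"Y = {a_yx:.2f} + {b_yx:.2f} X")   # Y = 1.10 + 1.30 X

# X on Y: solve  ΣX = Na + bΣY,  ΣXY = aΣY + bΣY²
a_xy, b_xy = np.linalg.solve(
    [[len(Y), Y.sum()], [Y.sum(), (Y ** 2).sum()]],
    [X.sum(), (X * Y).sum()])
print(f"X = {a_xy:.2f} + {b_xy:.2f} Y")   # X = 0.50 + 0.50 Y
```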
Deviations taken from actual Means of X and Y
• The calculations by the direct method shown above are quite
cumbersome when the values of X and Y are large.
• For simplification, instead of dealing with the actual values of X
and Y, we take the deviations of the X and Y series from their
respective means.
• In such a case, the equation Y = a + bX is changed to
• Y − Ȳ = byx(X − X̄)
• The value of byx can be obtained as follows:
• byx = Σxy / Σx², where x = (X − X̄) and y = (Y − Ȳ)
Deviations taken from actual Means of X and Y
• The two normal equations written earlier, when expressed in terms
of x and y, become
Σy = Na + bΣx ………………………….(i)
Σxy = aΣx + bΣx² ………………........(ii)
• Since Σx = 0 and Σy = 0 [deviations being taken from the means],
• equation (i) reduces to Na = 0, therefore a = 0, and
• equation (ii) reduces to Σxy = bΣx².
• So b (or byx) = Σxy / Σx²
Deviations taken from actual Means of X and Y
• After obtaining the value of byx, the regression equation can easily be
written in terms of X and Y by substituting (X − X̄) for x and (Y − Ȳ)
for y.
• Similarly, the regression equation X = a + bY is reduced to
• X − X̄ = bxy(Y − Ȳ), and the value of bxy can be similarly obtained as
• bxy = Σxy / Σy²
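Applied to the exercise data, this deviation shortcut reproduces the coefficients found earlier by the direct method; a brief sketch:

```python
# Deviations-from-means shortcut: byx = Σxy/Σx², bxy = Σxy/Σy².
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 5, 3, 8, 7], dtype=float)
x, y = X - X.mean(), Y - Y.mean()        # x = X - X̄, y = Y - Ȳ

byx = (x * y).sum() / (x ** 2).sum()     # 1.30, matching the direct method
bxy = (x * y).sum() / (y ** 2).sum()     # 0.50
print(byx, bxy)
print(Y.mean() - byx * X.mean())         # intercept a = Ȳ - byx·X̄ = 1.10
```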
Regression Coefficients
• The quantity b in the regression equation is called the “regression coefficient”
or slope coefficient.
• A regression coefficient measures the average change in the value of one
variable for a unit change in the value of the other variable; it thus
represents the slope of the regression line.
• There are two regression coefficients:
• Regression coefficient of Y on X
• Regression coefficient of X on Y
Regression Coefficients
• The regression coefficient of X on Y is represented by bxy.
• It measures the amount of change in X corresponding to a unit change in Y.
The regression coefficient of X on Y is given by
• bxy = r(σx / σy)
• When deviations are taken from the means of X and Y, the regression coefficient
is obtained by
• bxy = Σxy / Σy²
Regression Coefficients
• The regression coefficient of Y on X is represented by byx.
• It measures the amount of change in Y corresponding to a unit change in X.
The regression coefficient of Y on X is given by
• byx = r(σy / σx)
• When deviations are taken from the means of X and Y, the regression coefficient
is obtained by
• byx = Σxy / Σx²
Properties of Regression Coefficients
• The coefficient of correlation is the geometric mean of the regression
coefficients, i.e. r = ±√(bxy · byx)
• Both regression coefficients must have the same algebraic sign.
• The coefficient of correlation must have the same sign as the
regression coefficients.
• Both regression coefficients cannot be greater than unity (their
product bxy · byx = r² ≤ 1).
• The arithmetic mean of the two regression coefficients is equal to or greater
than the correlation coefficient, i.e. (bxy + byx)/2 ≥ r, as the sketch
below verifies numerically.
• A regression coefficient is independent of change of origin but not of
scale.
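The geometric-mean and arithmetic-mean properties can be checked on the exercise data:

```python
# Check r = √(bxy·byx) and (bxy + byx)/2 ≥ r on the exercise data.
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 5, 3, 8, 7], dtype=float)
x, y = X - X.mean(), Y - Y.mean()

byx = (x * y).sum() / (x ** 2).sum()     # 1.30
bxy = (x * y).sum() / (y ** 2).sum()     # 0.50
r = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())   # ≈ 0.806

print(np.isclose(r, np.sqrt(bxy * byx))) # geometric-mean property -> True
print((bxy + byx) / 2 >= r)              # arithmetic-mean property -> True
```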
Residual
• A residual is the difference between the observed value yi and the
corresponding fitted value ŷi:
• ei = yi − ŷi
• Residuals are highly useful for studying whether a given regression
model is appropriate for the data at hand.
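For the exercise data and the fitted line Y = 1.10 + 1.30X, the residuals can be computed directly; note that they sum to zero, a consequence of least squares fitting.

```python
# Residuals e_i = y_i - ŷ_i for the fitted line Y = 1.10 + 1.30X.
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2, 5, 3, 8, 7], dtype=float)
y_hat = 1.10 + 1.30 * X
residuals = Y - y_hat
print(residuals)          # [-0.4  1.3 -2.   1.7 -0.6]; they sum to 0
```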
Usefulness of Regression Analysis
1. It is one of the most commonly used tools for business analysis.
2. It is easy to use and applies to many situations.
3. Regression analysis helps in establishing a functional Relationship
between two or more variables.
4. Since most problems in economics and business involve cause and effect
relationships, regression analysis is a highly valuable tool in economic and
business research.
5. Regression analysis predicts the values of dependent variables from the
values of independent variables.
6. We can calculate coefficient of correlation (r) and coefficient of
determination (R2) with the help of regression.
Variable Selection Procedures
An F test is used to test whether the addition of x2 to a model involving
x1 (or the deletion of x2 from a model involving x1 and x2) is
statistically significant. The p-value corresponding to the F statistic is
the criterion used to determine whether a variable should be added or deleted,
as the sketch below illustrates.
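The text gives no worked example, so the following sketch computes the partial F statistic and its p-value for adding x2 to a model containing x1, using synthetic data and the standard nested-model formula F = ((SSE_reduced − SSE_full)/q) / (SSE_full/df_full); all names and data values are assumptions for illustration.

```python
# Partial F test for adding x2 to a model that already contains x1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(0, 1, n)   # assumed true model

def sse(design, y):
    """Residual sum of squares from a least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return ((y - design @ coef) ** 2).sum()

ones = np.ones(n)
sse_reduced = sse(np.column_stack([ones, x1]), y)      # model with x1 only
sse_full = sse(np.column_stack([ones, x1, x2]), y)     # model with x1 and x2

q, df_full = 1, n - 3                    # 1 added term; n - (intercept + 2 slopes)
F = ((sse_reduced - sse_full) / q) / (sse_full / df_full)
p_value = stats.f.sf(F, q, df_full)      # compare with chosen significance level
print(F, p_value)
```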
F Test
• A test based on a test statistic which follows the F-distribution is
called an F test.
• It tests the significance of the difference between the variances of two
populations, based on small samples drawn from those populations.
• F = S1² / S2²
• In the F-ratio we always take the larger of the two variance estimates in the
numerator and the smaller in the denominator.
• The degrees of freedom are (n1 − 1) and (n2 − 1).
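A minimal sketch of this two-variance F test, with made-up samples and scipy used only for the p-value:

```python
# Two-sample variance F test: larger estimate in the numerator,
# degrees of freedom (n1 - 1, n2 - 1); data values are made up.
import numpy as np
from scipy import stats

s1 = np.array([12.0, 15.0, 11.0, 14.0, 13.0])        # sample 1
s2 = np.array([10.0, 10.5, 11.0, 9.5, 10.0, 10.5])   # sample 2

v1, v2 = s1.var(ddof=1), s2.var(ddof=1)   # unbiased sample variances
F = max(v1, v2) / min(v1, v2)             # larger estimate on top
dfn = (len(s1) if v1 >= v2 else len(s2)) - 1
dfd = (len(s2) if v1 >= v2 else len(s1)) - 1
p_value = stats.f.sf(F, dfn, dfd)         # one-tailed p-value
print(F, p_value)
```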
F Test
• As an inferential statistic, it is also used to determine whether
significant differences exist among three or more groups or variables
collected in experimental research.
