
REGRESSION

Regression is a statistical technique that helps in quantifying the relationship between
interrelated economic variables. The first step involves estimating the coefficient of the
independent variable and then measuring the reliability of the estimated coefficient. This
requires formulating a hypothesis, and based on the hypothesis, we can create a function.
If a manager wants to determine the relationship between the firm’s advertisement expenditures
and its sales revenue, he will carry out a test of hypothesis, assuming that higher advertising
expenditures lead to higher sales for the firm. The manager collects data on advertising
expenditure and on sales revenue over a specific period of time. This hypothesis can be translated
into the mathematical function −
Y = A + Bx
Where Y is sales, x is the advertisement expenditure, and A and B are constants.
After translating the hypothesis into a function, the next step is to find the relationship
between the dependent and independent variables. The value of the dependent variable is of most
importance to researchers and depends on the values of other variables. The independent variable is
used to explain the variation in the dependent variable. Regression can be classified into two types −
 Simple regression − One independent variable
 Multiple regression − Several independent variables

Simple Regression:
Simple linear regression is a statistical method that allows us to summarize and study
relationships between two continuous (quantitative) variables:

One variable, denoted x, is regarded as the predictor, explanatory, or independent variable.

The other variable, denoted y, is regarded as the response, outcome, or dependent variable.

Because the other terms are used less frequently today, we'll use the "predictor" and "response"
terms to refer to the variables encountered in this course. The other terms are mentioned only to
make you aware of them should you encounter them. Simple linear regression gets its adjective
"simple," because it concerns the study of only one predictor variable. In contrast, multiple linear
regression, which we study later in this course, gets its adjective "multiple," because it concerns
the study of two or more predictor variables.

As you may remember, the relationship between degrees Fahrenheit and degrees Celsius is
known to be:

F = (9/5)C + 32
That is, if you know the temperature in degrees Celsius, you can use this equation to determine
the temperature in degrees Fahrenheit exactly.
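For instance, this deterministic relationship can be written as a simple Python function (the temperatures below are just example inputs):

def celsius_to_fahrenheit(c):
    # Deterministic relationship: F = (9/5)C + 32
    return 9 / 5 * c + 32

print(celsius_to_fahrenheit(100))  # 212.0 (boiling point of water)
print(celsius_to_fahrenheit(0))    # 32.0  (freezing point of water)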

Here are some examples of other deterministic relationships that students from previous
semesters have shared:

Circumference = π × diameter

Hooke's Law: Y = α + βX, where Y = amount of stretch in a spring, and X = applied weight.

Ohm's Law: I = V/r, where V = voltage applied, r = resistance, and I = current.

Boyle's Law: For a constant temperature, P = α/V, where P = pressure, α = constant for each gas,
and V = volume of gas.

For each of these deterministic relationships, the equation exactly describes the relationship
between the two variables. This course does not examine deterministic relationships. Instead, we
are interested in statistical relationships, in which the relationship between the variables is not
perfect.

Here is an example of a statistical relationship. The response variable y is the mortality due to
skin cancer (number of deaths per 10 million people) and the predictor variable x is the latitude
(degrees North) at the center of each of 49 states in the U.S. (skincancer.txt) (The data were
compiled in the 1950s, so Alaska and Hawaii were not yet states, and Washington, D.C. is
included in the data set even though it is not technically a state.)

Following are the steps to build up regression analysis −

 Specify the regression model
 Obtain data on variables
 Estimate the quantitative relationships
 Test the statistical significance of the results
 Usage of results in decision-making
Formula for simple regression is −
Y = a + bX + u
Y= dependent variable
X= independent variable
a= intercept
b= slope
u= random factor
Cross-sectional data provides information on a group of entities at a given time, whereas time
series data provides information on one entity over time. Estimating the regression equation
involves finding the best linear relationship between the dependent and the independent
variables.

Method of Ordinary Least Squares (OLS):


The ordinary least squares method is a statistical method designed to fit a line through a scatter
of points in such a way that the sum of the squared deviations of the points from the line is
minimized. Software packages usually perform OLS estimation.
Y = a + bX
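As a minimal sketch of the idea (the data below is made up for illustration; in practice a software package would do the estimation):

import numpy as np

# Made-up example data: advertising expenditure (X) and sales revenue (Y)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# OLS closed-form estimates for the line Y = a + bX
b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
a = Y.mean() - b * X.mean()

print("intercept a =", round(a, 3), "slope b =", round(b, 3))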

Co-efficient of Determination (R2):


The coefficient of determination is a measure that indicates the percentage of the variation in the
dependent variable that is due to variations in the independent variables. R2 is a measure of the
goodness of fit of the model. It is built from the following sums of squares −

Total Sum of Squares (TSS):


Sum of the squared deviations of the sample values of Y from the mean of Y.
TSS = SUM (Yi − Ȳ)²
Yi = observed value of the dependent variable
Ȳ = mean of the dependent variable
i = number of observations

Regression Sum of Squares (RSS):


Sum of the squared deviations of the estimated values of Y from the mean of Y.
RSS = SUM (Ỷi − Ȳ)²
Ỷi = estimated value of Y
Ȳ = mean of the dependent variable
i = number of observations

Error Sum of Squares (ESS):


Sum of the squared deviations of the sample values of Y from the estimated values of Y.
ESS = SUM (Yi − Ỷi)²
Ỷi = estimated value of Y
Yi = observed value of the dependent variable
i = number of observations

R2 = RSS / TSS = 1 − ESS / TSS
R2 measures the proportion of the total deviation of Y from its mean which is explained by the
regression model. The closer the R2 is to unity, the greater the explanatory power of the
regression equation. An R2 close to 0 indicates that the regression equation will have very little
explanatory power.
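A minimal sketch of these quantities in Python, using made-up data and numpy’s polyfit for the OLS line (any OLS routine would do):

import numpy as np

# Made-up example data and an OLS line fitted with numpy
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
b, a = np.polyfit(X, Y, 1)             # slope and intercept of the OLS line

Y_hat = a + b * X                      # estimated (fitted) values of Y

TSS = np.sum((Y - Y.mean()) ** 2)      # total sum of squares
RSS = np.sum((Y_hat - Y.mean()) ** 2)  # regression sum of squares
ESS = np.sum((Y - Y_hat) ** 2)         # error sum of squares

print(RSS / TSS)                       # R-squared
print(1 - ESS / TSS)                   # same value via the second formula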
5 Types of Regression and their properties
Linear and Logistic regressions are usually the first modeling algorithms
that people learn for Machine Learning and Data Science. Both are great
since they’re easy to use and interpret. However, their inherent simplicity
also comes with a few drawbacks and in many cases, they’re not really
the best choice of regression model. There are in fact several different
types of regressions, each with their own strengths and weaknesses.

In this post, we’re going to look at five of the most common types of
regression algorithms and their properties. We’ll soon find that many of
them are biased toward working well in certain types of situations and with
certain types of data. In the end, this post will give you a few more tools
in your regression toolbox and give greater insight into regression
models as a whole!

Linear Regression:

Regression is a technique used to model and analyze the relationships
between variables and oftentimes how they contribute and are related to
producing a particular outcome together. A linear regression refers to a
regression model that is completely made up of linear variables.
Beginning with the simple case, Single Variable Linear Regression is a
technique used to model the relationship between a single input
independent variable (feature variable) and an output dependent variable
using a linear model, i.e. a line.

The more general case is Multi-Variable Linear Regression, where a
model is created for the relationship between multiple independent input
variables (feature variables) and an output dependent variable. The
model remains linear in that the output is a linear combination of the
input variables. We can model a multi-variable linear regression as the
following:

Y = a_1*X_1 + a_2*X_2 + a_3*X_3 + … + a_n*X_n + b

Where a_n are the coefficients, X_n are the variables, and b is the bias.

As we can see, this function does not include any non-linearities and so
is only suited for modeling linearly separable data. It is quite easy to
understand as we are simply weighting the importance of each feature
variable X_n using the coefficient weights a_n. We determine these
weights a_n and the bias b using Stochastic Gradient Descent (SGD).
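As a rough sketch of how SGD could estimate such weights (the data, learning rate, and number of epochs below are made up for illustration; in practice a library implementation would normally be used):

import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 samples, 3 feature variables, with known weights and bias
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

a = np.zeros(3)   # coefficient weights a_n
b = 0.0           # bias
lr = 0.01         # learning rate

for epoch in range(50):
    for i in rng.permutation(len(X)):    # visit samples one at a time, in random order
        error = X[i] @ a + b - y[i]      # prediction error for this sample
        a -= lr * error * X[i]           # gradient step for the weights
        b -= lr * error                  # gradient step for the bias

print(a, b)  # should end up close to [2.0, -1.0, 0.5] and 3.0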
Linear regression quantifies the relationship between one or more predictor
variable(s) and one outcome variable. For example, it can be used to quantify the
relative impacts of age, gender, and diet (the predictor variables) on height (the
outcome variable).
Example of simple linear regression:
The table below shows some data from the early days of the Italian clothing
company Benetton. Each row in the table shows Benetton’s sales for a year and the
amount spent on advertising that year. In this case, our outcome of interest is sales
—it is what we want to predict. If we use advertising as the predictor variable,
linear regression estimates that Sales = 168 + 23 Advertising. That is, if advertising
expenditure is increased by one million Euro, then sales will be expected to
increase by 23 million Euros, and if there was no advertising we would expect
sales of 168 million Euros.
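The fitted equation from this example can be used directly for prediction; a minimal Python sketch (only the equation stated above is used, no new data):

# Fitted equation from the Benetton example: Sales = 168 + 23 * Advertising
# (both sales and advertising are measured in millions of Euros)
def predicted_sales(advertising):
    return 168 + 23 * advertising

print(predicted_sales(0))   # 168 -> expected sales with no advertising
print(predicted_sales(1))   # 191 -> one extra million of advertising adds 23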
Illustration of how Gradient Descent finds the optimal parameters for a Linear Regression

A few key points about Linear Regression:


 Fast and easy to model and is particularly useful when the
relationship to be modeled is not extremely complex and if
you don’t have a lot of data.

 Very intuitive to understand and interpret.

 Linear Regression is very sensitive to outliers.

Polynomial Regression:

When we want to create a model that is suitable for handling non-linearly
separable data, we will need to use a polynomial regression. In this
regression technique, the best fit line is not a straight line. It is rather a
curve that fits the data points. For a polynomial regression, the
power of some independent variables is more than 1. For example, we
can have something like:

Y = a_1*X_1 + a_2*(X_2)² + a_3*(X_3)⁴ + … + a_n*X_n + b

We can have some variables with exponents and others without, and we can also
select the exact exponent we want for each variable. However, selecting
the exact exponent of each variable naturally requires some knowledge
of how the data relates to the output. See the illustration below for a
visual comparison of linear vs polynomial regression.
Linear vs Polynomial Regression with data that is non-linearly separable
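A minimal sketch of the same comparison (the cubic data below is made up for illustration; numpy’s polyfit is used to fit both a straight line and a degree-3 polynomial):

import numpy as np

rng = np.random.default_rng(1)

# Made-up data that follows a curve rather than a straight line
x = np.linspace(-3, 3, 50)
y = 0.5 * x**3 - x + rng.normal(scale=0.5, size=x.size)

line = np.poly1d(np.polyfit(x, y, deg=1))   # degree-1 fit: a straight line, underfits
cubic = np.poly1d(np.polyfit(x, y, deg=3))  # degree-3 fit: a curve through the data

print(line)
print(cubic)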

A few key points about Polynomial Regression:

 Able to model non-linearly separable data; linear regression can’t do this. It is
much more flexible in general and can model some fairly complex relationships.

 Full control over the modeling of feature variables (which exponent to set).

 Requires careful design. Need some knowledge of the data in order to select the
best exponents.

 Prone to overfitting if exponents are poorly selected.


Ridge Regression:

A standard linear or polynomial regression will fail in the case where
there is high collinearity among the feature variables. Collinearity is the
existence of near-linear relationships among the independent variables.
The presence of high collinearity can be determined in a few different
ways:

 A regression coefficient is not significant even though, theoretically, that
variable should be highly correlated with Y.
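Where such collinearity is present, ridge regression penalizes large coefficient values and typically yields more stable estimates than plain OLS. A minimal sketch using scikit-learn, with made-up data and an arbitrary penalty strength alpha:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)

# Made-up data with two almost perfectly collinear feature variables
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)   # near-copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=100)

print(LinearRegression().fit(X, y).coef_)    # OLS coefficients can be unstable here
print(Ridge(alpha=1.0).fit(X, y).coef_)      # the L2 penalty shrinks and stabilizes them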
Evaluating the Regression Coefficients

For evaluating the regression coefficients, a sample from the population is used rather than the
entire population. It is important to make assumptions about the population based on the sample
and to make a judgment about how good these assumptions are. Each sample from the population
generates its own intercept. To calculate the statistical difference, the following methods can be
used −
Two-tailed test −
Null Hypothesis: H0: b = 0
Alternative Hypothesis: Ha: b ≠ 0
One-tailed test −
Null Hypothesis: H0: b > 0 (or b < 0)
Alternative Hypothesis: Ha: b < 0 (or b > 0)
Test statistic −
t = (b − E(b)) / SEb
b = estimated coefficient
E(b) = 0 (under the null hypothesis)
SEb = standard error of the coefficient
The value of t depends on the degrees of freedom, whether the test is one- or two-tailed, and the
level of significance. To determine the critical value of t, a t-table can be used. Then comes the
comparison of the t-value with the critical value. One needs to reject the null hypothesis if the
absolute value of the test statistic is greater than or equal to the critical t-value. Do not reject the
null hypothesis if the absolute value of the test statistic is less than the critical t-value.
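As a minimal sketch of this test (the coefficient, standard error, and sample size below are hypothetical; scipy supplies the critical t-value):

from scipy import stats

# Hypothetical values for one estimated coefficient
b = 23.0       # estimated coefficient
se_b = 8.0     # standard error of the coefficient
n, k = 40, 1   # observations and number of independent variables

t = (b - 0) / se_b                                # t = (b - E(b)) / SEb, with E(b) = 0 under H0
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - k - 1)  # two-tailed critical value at 5% significance

print(t, t_crit, abs(t) >= t_crit)                # reject H0 when |t| >= critical value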

Multiple Regression Analysis:


Unlike in simple regression, in multiple regression analysis the coefficients indicate the change in
the dependent variable for a unit change in an independent variable, assuming the values of the
other variables are constant.
The test of statistical significance is called the F-test. The F-test is useful as it measures the
statistical significance of the entire regression equation rather than of just an individual
coefficient. Under the null hypothesis, there is no relationship between the dependent variable
and the independent variables of the population.
The formula is − H0: b1 = b2 = b3 = …. = bk = 0
That is, no relationship exists between the dependent variable and the k independent variables for
the population.
F-test statistic −
$$F \;=\; \frac{R^2/k}{(1-R^2)/(n-k-1)}$$
The critical value of F depends on the numerator and denominator degrees of freedom and the
level of significance. An F-table can be used to determine the critical F-value (F*). Comparing the
F-value with the critical value −
If F > F*, we need to reject the null hypothesis.
If F < F*, do not reject the null hypothesis, as there is no significant relationship between the
dependent variable and the independent variables.
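A minimal sketch of this F-test (the R², n, and k values below are hypothetical; scipy supplies the critical F-value):

from scipy import stats

# Hypothetical regression summary values
r2 = 0.75      # coefficient of determination of the fitted equation
n, k = 30, 3   # observations and number of independent variables

F = (r2 / k) / ((1 - r2) / (n - k - 1))               # F statistic from the formula above
F_crit = stats.f.ppf(1 - 0.05, dfn=k, dfd=n - k - 1)  # critical F value at 5% significance

print(F, F_crit, F > F_crit)                          # reject H0 when F exceeds the critical value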
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical
technique that uses several explanatory variables to predict the outcome of a response variable.
The goal of multiple linear regression (MLR) is to model the linear relationship between the
explanatory (independent) variables and response (dependent) variable. In essence, multiple
regression is the extension of ordinary least-squares (OLS) regression that involves more than
one explanatory variable.
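As a minimal sketch of MLR in practice (the data below is made up; scikit-learn's LinearRegression performs the OLS estimation):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

# Made-up data: three explanatory variables, one response variable
X = rng.normal(size=(100, 3))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.5 * X[:, 2] + 4.0 + rng.normal(scale=0.3, size=100)

model = LinearRegression().fit(X, y)    # OLS with several explanatory variables
print(model.intercept_, model.coef_)    # fitted intercept and one coefficient per variable
print(model.score(X, y))                # R-squared of the fitted model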
