
Regression analysis

Regression is a statistical method that allows modeling relationships between a


dependent variable and one or more independent variables.
A regression analysis makes it possible to infer or predict one variable
based on one or more other variables.
For example, you might be interested in what influences a person's salary. To
find this out, you could look at a person's level of education, weekly working
hours and age.
• You could then investigate whether these three variables have
an influence on a person's salary.
• If so, you could predict a person's salary from the highest
education level, the weekly working hours and the age.
What are dependent and independent variables?
• The variable to be inferred is called the dependent
variable (criterion). The variables used for prediction are
called independent variables (predictors).
• Thus, in the example above, salary is the dependent variable
and highest educational attainment, weekly hours worked, and
age are the independent variables.
When do I use a regression analysis?
• A regression analysis can pursue two goals: on the one hand, the influence of
one or more variables on another variable can be measured; on the other hand,
the regression can be used to predict a variable from one or more other
variables. For example:
1) Measurement of the influence of one or more variables on another variable
• What influences children's ability to concentrate?
• Do the educational level of the parents and the place of residence affect the
future educational attainments of children?
2) Prediction of a variable by one or more other variables
• How long does a patient stay in the hospital?
• What product is a person most likely to buy from an online store?
• The regression analysis thus provides information about how the value of the
dependent variable changes if one of the independent variables is changed.
Types of regression analysis
• Regression analyses are divided into simple linear
regression, multiple linear regression and logistic regression.
• The type of regression analysis to use depends on the
number of independent variables and the scale of measurement of
the dependent variable.
• Independent variable of the regression
• No matter which regression is calculated, the scale level of the independent
variables can take any form (metric, ordinal or nominal). However, if there is an
ordinal or nominal variable with more than two values, so-called dummy variables
must be formed.
• Dummy variables and Reference category
• When an independent variable is categorical, it is encoded as a set of binary dummy
variables before being included in the regression model.
• When dummy variables are created, a variable with several categories is made into several variables with only
2 categories each.

• One of the categories is set as the reference category and a new variable is created for each of the remaining
categories.

• Let's take an example to illustrate this. Suppose you are studying the effect of education level (a categorical
variable with three levels: high school, college, and graduate) on salary. In order to include this categorical
variable in a regression model, it needs to be encoded as dummy variables.

• Let's say we use high school as the reference category and create two dummy
variables: is_college and is_graduate. The variable is_college, for example, will take a value of 1 if the
individual has a college degree and 0 otherwise.
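
A minimal sketch of this encoding with pandas (the library choice is an assumption; the data values are made up):

import pandas as pd

# Hypothetical salary data with a three-level education variable
df = pd.DataFrame({
    "education": ["high school", "college", "graduate", "college", "high school"],
    "salary": [42000, 55000, 70000, 58000, 40000],
})

# List "high school" first so that drop_first removes it,
# making it the reference category.
df["education"] = pd.Categorical(
    df["education"], categories=["high school", "college", "graduate"]
)
dummies = pd.get_dummies(df["education"], prefix="is", drop_first=True)
df = pd.concat([df.drop(columns="education"), dummies], axis=1)
print(df.columns.tolist())  # ['salary', 'is_college', 'is_graduate']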
• If you only want to use one variable for prediction, a simple
regression is used.

• If you use more than one variable, you need to perform a multiple
regression.
• If the dependent variable is nominally scaled, a logistic
regression must be calculated.
• If the dependent variable is metrically scaled, a linear regression is
used.
• Whether a linear or a non-linear regression is used depends on the
relationship itself.
• In order to perform a linear regression, a linear relationship between
the independent variables and the dependent variable is necessary.
Simple Linear Regression
• The goal of a simple linear regression is to predict the value of a
dependent variable based on an independent variable.
• The greater the linear relationship between the independent variable
and the dependent variable, the more accurate the prediction.
• Equivalently, the greater the proportion of the dependent variable's
variance that can be explained by the independent variable, the more
accurate the prediction. Visually, the relationship
between the variables can be shown in a scatter plot.
• The greater the linear relationship between the dependent and
independent variables, the more the data points lie on a straight line.
• The task of simple linear regression is to determine precisely the
straight line that best describes the linear relationship between
the dependent and independent variable.
• In linear regression analysis, a straight line is drawn in the scatter
plot. To determine this straight line, linear regression uses
the method of least squares.
The regression line can be described by the following equation:

ŷ = a + b · x

• Definition of "Regression coefficients":
• a: point of intersection with the y-axis (intercept)
• b: gradient of the straight line (slope)
• ŷ: the respective estimate of the y-value
• This means that for each x-value the corresponding y-value is
estimated. In our example, this means that the height of people is used
to estimate their weight.
If all points (measured values) were exactly on one straight line, the estimate would be perfect. However, this is
almost never the case, so in most cases a straight line must be found that is as close as possible
to the individual data points. The aim is thus to keep the estimation error as small as possible,
so that the distance between the estimated value and the true value is as small as possible. This distance or
error is called the "residual", abbreviated "e" (error), and can be represented by the Greek letter epsilon (ε).
When calculating the regression line, the regression coefficients (a and b) are determined so
that the sum of the squared residuals is minimal (OLS, "Ordinary Least Squares").
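
As a minimal sketch of ordinary least squares for simple linear regression (made-up height and weight values, following the running example):

import numpy as np

# Hypothetical height (cm) and weight (kg) measurements
x = np.array([160, 165, 170, 175, 180, 185], dtype=float)
y = np.array([55.0, 61.0, 64.0, 70.0, 74.0, 80.0])

# Closed-form OLS solution: b = cov(x, y) / var(x), a = mean(y) - b * mean(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

y_hat = a + b * x        # estimated y-values on the regression line
residuals = y - y_hat    # the errors e whose squared sum OLS minimizes
print(f"a = {a:.2f}, b = {b:.2f}, sum of squared residuals = {np.sum(residuals**2):.2f}")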
• The regression coefficient b can have different signs, which are
interpreted as follows:
• b > 0: there is a positive correlation between x and y (the greater x, the
greater y)
• b < 0: there is a negative correlation between x and y (the greater x, the
smaller y)
• b = 0: there is no correlation between x and y
• Standardized regression coefficients are usually designated by the letter
"beta". These are values that are comparable with each other.
• Here the unit of measurement of the variable is no longer important. The
standardized regression coefficient (beta) is automatically output by
DATAtab.
Multiple Linear Regression
• Unlike simple linear regression, multiple linear regression allows more than
one independent variable to be considered.
• The goal is to estimate a variable based on several other variables. The
variable to be estimated is called the dependent variable (criterion).
• The variables that are used for the prediction are called independent
variables (predictors).
• Multiple linear regression is frequently used in empirical social research as
well as in market research. In both areas it is of interest to find out what
influence different factors have on a variable.
• For example, what determinants influence a person's health or purchasing
behavior?
• Marketing example:
• For a video streaming service, you are asked to predict how many times a
month a person streams videos. For this you receive a data set of user
data (age, income, gender, ...).
• Medical example:
• You want to find out which factors have an influence on the
cholesterol level of patients. For this purpose, you analyze a patient
data set with cholesterol level, age, hours of sport per week and so on.
• The equation necessary for the calculation of a multiple regression with k independent variables is:

ŷ = a + b1·x1 + b2·x2 + … + bk·xk

The coefficients can now be interpreted similarly to the simple linear regression equation. If all independent variables
are 0, the resulting value is a.

If an independent variable changes by one unit, the associated coefficient indicates by how much the
dependent variable changes.

So if the independent variable xi increases by one unit, the dependent variable y increases by bi.
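
A minimal sketch of fitting such a model for the medical example, using made-up data and the statsmodels library (the library choice is an assumption; any OLS implementation works):

import numpy as np
import statsmodels.api as sm

# Hypothetical patient data: cholesterol as a function of age and sport hours
rng = np.random.default_rng(0)
age = rng.uniform(20, 70, 50)
sport = rng.uniform(0, 10, 50)
cholesterol = 150 + 1.2 * age - 3.0 * sport + rng.normal(0, 10, 50)

X = sm.add_constant(np.column_stack([age, sport]))  # adds the intercept a
model = sm.OLS(cholesterol, X).fit()
print(model.params)    # a, b1 (age), b2 (sport hours per week)
print(model.rsquared)  # coefficient of determination R2 (next section)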
Coefficient of determination
• In order to find out how well the regression model can predict or
explain the dependent variable, two main measures are used. This is
on the one hand the coefficient of determination R2 and on the other
hand the standard estimation error.
• The coefficient of determination R2, also known as the variance
explanation, indicates how large the portion of the variance is that can
be explained by the independent variables.
• The more variance can be explained, the better the regression model
is. In order to calculate R2, the variance of the estimated value is
related to the variance in the observed values.
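
In formula form (standard notation, not spelled out in the original), this reads:

R2 = Var(ŷ) / Var(y) = 1 − Σ(yi − ŷi)² / Σ(yi − ȳ)²

where ŷi are the estimated values and ȳ is the mean of the observed values.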
Adjusted R2
The coefficient of determination R2 is influenced by the number of independent variables used.

The more independent variables are included in the regression model, the greater R2 becomes.
To take this into account, the adjusted R2 is used.
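
A standard formula for the adjusted R2 (with n observations and k independent variables) is:

adjusted R2 = 1 − (1 − R2) · (n − 1) / (n − k − 1)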
Standard estimation error
• The standard estimation error is the standard deviation of the
estimation error. This gives an impression of how much the prediction
differs from the correct value.
• Graphically interpreted, the standard estimation error is the dispersion
of the observed values around the regression line.
• The coefficient of determination and the standard estimation error are
used for simple and multiple linear regression.
Assumptions of Linear Regression
• In order to interpret the results of the regression analysis
meaningfully, certain conditions must be met.
• Linearity: There must be a linear relationship between the
dependent and independent variables.
• Homoscedasticity: The residuals must have a constant variance.
• Normality: Normally distributed error
• No multicollinearity: No high correlation between the independent
variables
• No autocorrelation: The error component should have no
autocorrelation
Linearity
• In linear regression, a straight line is drawn through the data.
This straight line should represent all points as well as
possible. If the points are distributed in a non-linear way, the
straight line cannot fulfill this task.
• In the upper left graph, there is a linear relationship between the
dependent and the independent variable, hence the regression line can be
meaningfully put in. In the right graph you can see that there is a clearly
non-linear relationship between the dependent and the independent
variable.
• Therefore it is not possible to place the regression line through the points in
a meaningful way. For that reason, the coefficients cannot be
meaningfully interpreted and the prediction errors may be larger than expected.
• Therefore it is important to check beforehand, whether a linear
relationship between the dependent variable and each of the independent
variables exists. This is usually checked graphically.
Homoscedasticity
• Since in practice the regression model never exactly predicts the
dependent variable, there is always an error. This very error must
have a constant variance over the predicted range.
To test homoscedasticity, i.e. the constant variance of the residuals, the dependent variable is plotted
on the x-axis and the error on the y-axis.

Now the error should scatter evenly over the entire range. If this is the case, homoscedasticity is
present. If this is not the case, heteroscedasticity is present.

In the case of heteroscedasticity, the error has different variances, depending on the value range of the
dependent variable.
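
As a minimal sketch of this check (made-up data; the plot orientation follows the description above):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data and a simple OLS fit, just to obtain residuals
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.8 * x + rng.normal(0, 1, 100)
b, a = np.polyfit(x, y, 1)       # slope and intercept
residuals = y - (a + b * x)

# Dependent variable on the x-axis, residuals on the y-axis;
# an even band of scatter suggests homoscedasticity.
plt.scatter(y, residuals, s=12)
plt.axhline(0, linestyle="--", color="gray")
plt.xlabel("dependent variable y")
plt.ylabel("residual e")
plt.show()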

Multicollinearity
Multicollinearity means that two or more independent variables are strongly correlated with one another. The problem
with multicollinearity is that the effects of the individual independent variables cannot be clearly separated from
one another.
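
A common way to quantify multicollinearity (not covered in the text above) is the variance inflation factor (VIF); a minimal sketch with statsmodels, using made-up data:

import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors; x2 is deliberately almost a copy of x1
rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)  # strongly correlated with x1
x3 = rng.normal(size=100)
X = pd.DataFrame({"const": 1.0, "x1": x1, "x2": x2, "x3": x3})

# A common rule of thumb flags VIF values above about 5 to 10
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))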
Significance test and Regression
• The regression analysis is often carried out in order to make statements about
the population based on a sample.
• Therefore, the regression coefficients are calculated using the data from the sample. To rule out
the possibility that the regression coefficients are merely random and would take completely different
values in another sample, the results are statistically checked with a significance test. This test takes
place at two levels.
• Significance test for the whole regression model
• Significance test for the regression coefficients
• It should be noted, however, that the assumptions in the previous section must be met.

Significance test for the regression model

• Here it is checked whether the coefficient of determination R2 in the population differs from zero.
The null hypothesis is therefore that the coefficient of determination R2 in the population is zero.
To confirm or reject the null hypothesis, the following F-test is calculated:

F = (R2 / k) / ((1 − R2) / (n − k − 1))
• The calculated F-value must now be compared with the critical F-
value. If the calculated F-value is greater than the critical F-value, the
null hypothesis is rejected and R2 deviates from zero in the
population. The critical F-value can be read from the F-distribution
table. The numerator degrees of freedom are k and the denominator
degrees of freedom are n − k − 1.
• Significance test for the regression coefficients
• The next step is to check which variables have a significant
contribution to the prediction of the dependent variable. This is done
by checking whether the slopes (regression coefficients) also differ
from zero in the population.
• To check this, the following test statistic is calculated:

t = bj / sb_j

• where bj is the jth regression coefficient and sb_j is the standard error
of bj. This test statistic is t-distributed with n − k − 1 degrees of freedom.
The critical t-value can be read from the t-distribution table.
• As an example of linear regression, a model is set up that predicts the
body weight of a person.
• The dependent variable is thus the body weight, while the height, age
and gender are chosen as independent variables. The following example
data set is available:
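
The original data set is not reproduced here. As a minimal sketch with made-up values, the model could be fitted like this (statsmodels' formula API dummy-codes the gender variable automatically):

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for the example data set
df = pd.DataFrame({
    "weight": [62, 75, 58, 80, 69, 90, 54, 72],
    "height": [165, 178, 160, 183, 170, 188, 158, 175],
    "age":    [25, 34, 22, 45, 30, 41, 28, 36],
    "gender": ["f", "m", "f", "m", "f", "m", "f", "m"],
})

model = smf.ols("weight ~ height + age + gender", data=df).fit()
print(model.summary())  # coefficients with t-tests, overall F-test, R2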
Polynomial Regression
• Polynomial regression is a form of linear regression in which, because of
a non-linear relationship between the dependent and independent
variables, polynomial terms are added to the linear regression model.
• In polynomial regression, the relationship between the dependent
variable and the independent variable is modeled as an nth-degree
polynomial function. When the polynomial is of degree 2, it is called
a quadratic model; when the degree of a polynomial is 3, it is called a
cubic model, and so on.
• Suppose we have a dataset where variable X represents the
independent data and Y is the dependent data. Before feeding the data
to a model in the preprocessing stage, we convert the input
variable into polynomial terms of some degree.
• Consider an example: if the input value is 35 and the degree of the
polynomial is 2, we compute 35 to the power 0, 35 to the power 1 and
35 to the power 2. This helps capture the non-linear relationship in
the data. The polynomial equation then becomes:
• y = a0 + a1·x1 + a2·x1^2 + … + an·x1^n
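
A minimal sketch of this workflow with scikit-learn (the library choice is an assumption; the data are made up):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Hypothetical data with a quadratic relationship
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, (80, 1))
y = 1.0 + 0.5 * X[:, 0] + 0.3 * X[:, 0] ** 2 + rng.normal(0, 1, 80)

# Expand x into [x, x^2]; the power-0 term becomes the model's intercept
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.transform([[35.0]]))  # [[35., 1225.]] - the terms for input 35

model = LinearRegression().fit(X_poly, y)
print(model.intercept_, model.coef_)  # approximately a0 and (a1, a2)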
Logistic Regression
• Logistic regression is a special case of regression analysis and is
used when the dependent variable is nominally scaled. This is the
case, for example, with the variable purchase decision, which has the
two values buys a product and does not buy a product.

• With logistic regression, it is now possible to explain the dependent
variable or to estimate the probability of occurrence of the categories of
the variable.
• Business example:
• For an online retailer, you need to predict which product a particular
customer is most likely to buy. For this, you receive a data set with
past visitors and their purchases from the online retailer.
• Medical example:
• You want to investigate whether a person is susceptible to a certain
disease or not. For this purpose, you receive a data set with
diseased and non-diseased persons as well as other medical
parameters.
• Political example:
• Would a person vote for party A if there were elections next
weekend?
What is a logistic regression?
• In the basic form of logistic regression, dichotomous variables
(0 or 1) can be predicted. For this purpose, the probability of the
occurrence of value 1 (=characteristic present) is estimated.
• In medicine, for example, a frequent application is to find out which
variables have an influence on a disease.
• In this case, 0 could stand for not diseased and 1 for diseased.
Subsequently, the influence of age, gender and smoking status
(smoker or not) on this particular disease could be examined.
• Logistic regression and probabilities
• In linear regression, the independent variables (e.g., age and gender)
are used to estimate the specific value of the dependent variable (e.g.,
body weight).
• In logistic regression, on the other hand, the dependent variable is
dichotomous (0 or 1) and the probability that expression 1 occurs is
estimated. Returning to the example above, this means: how likely is
it that the disease is present if the person under consideration has a
certain age, sex and smoking status?
Calculate logistic regression
• To build a logistic regression model, the linear regression equation is
used as the starting point.
• However, if a linear regression were simply calculated for solving a
logistic regression, the following result would appear graphically:
• As can be seen in the graph, values between plus and minus
infinity can occur. The goal of logistic regression, however, is to estimate
the probability of occurrence and not the value of the variable itself. Therefore,
this equation must still be transformed.
• To do this, it is necessary to restrict the value range for the prediction to the
range between 0 and 1. To ensure that only values between 0 and 1 are possible,
the logistic function f is used.
• Logistic function
• The logistic model is based on the logistic function. The special thing about the
logistic function is that for values between minus and plus infinity, it always
assumes only values between 0 and 1:

f(z) = 1 / (1 + e^(−z))

• So the logistic function is perfect to describe the probability P(y=1). If
the logistic function is now applied to the regression equation above,
the result is:

P(y=1) = 1 / (1 + e^(−(a + b1·x1 + b2·x2 + … + bk·xk)))
This now ensures that no matter in which range the x values are located, only values between 0 and
1 will come out. The new graph now looks like this:
• The probability that, for given values of the independent variables,
the dichotomous dependent variable y is 1 or 0 is given by:

P(y=1) = 1 / (1 + e^(−(a + b1·x1 + … + bk·xk))) and P(y=0) = 1 − P(y=1)

To calculate the probability of a person being sick or not using the logistic regression for the example above,
the model parameters b1, b2, b3 and a must first be determined. Once these have been determined, the
equation for the example above is:

P(diseased) = 1 / (1 + e^(−(a + b1·age + b2·gender + b3·smoking status)))
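
A minimal sketch of fitting such a model with scikit-learn, using made-up data for the disease example (variable names are illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: age, gender (0/1), smoking status (0/1) -> diseased (0/1)
rng = np.random.default_rng(4)
n = 200
X = np.column_stack([
    rng.uniform(20, 80, n),   # age
    rng.integers(0, 2, n),    # gender
    rng.integers(0, 2, n),    # smoking status
])
z = -6 + 0.08 * X[:, 0] + 0.4 * X[:, 1] + 1.2 * X[:, 2]
y = rng.binomial(1, 1 / (1 + np.exp(-z)))  # simulate 0/1 outcomes

model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)      # a and b1, b2, b3
print(model.predict_proba([[55, 1, 1]]))  # P(y=0), P(y=1) for one person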
The Likelihood Function
• To understand the maximum likelihood method, we introduce
the likelihood function L. L is a function of the unknown parameters in
the model; in the case of logistic regression these are b1, …, bn and a. Therefore
we can also write L(b1, …, bn, a), or L(θ) if the parameters are
summarized in θ.

• L(θ) now indicates how probable it is that the observed data occur.
With the change of θ, the probability that the data will occur as
observed changes.
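
For n observed data points, the likelihood in binary logistic regression takes the standard form (not spelled out in the original):

L(θ) = Π over i = 1 … n of P(yi = 1)^yi · (1 − P(yi = 1))^(1 − yi)

Maximizing L(θ), or equivalently its logarithm, yields the maximum likelihood estimates of the parameters.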
Multinomial logistic regression
• As long as the dependent variable has two characteristics
(e.g. male, female), i.e. is dichotomous, binary logistic regression is
used. However, if the dependent variable has more than two instances,
e.g. which mobility concept describes a person's journey to work
(car, public transport, bicycle), multinomial logistic regression must be
used.
• Each expression of the mobility variable (car, public transport, bicycle) is
transformed into a new variable. The one variable mobility concept
becomes the three new variables:
• car is used
• public transport is used
• bicycle is used
• Each of these new variables then only has the two
expressions yes or no, e.g. the variable car is used only has the two
answer options yes or no (either it is used or not). Thus, for the one
variable "mobility concept" with three values, there are three new
variables with two values each: yes and no (0 and 1).
• Three logistic regression models are now created for these three
variables.
• Chi-square test and Logistic Regression
• In the case of logistic regression, the Chi-square test tells us whether
the model as a whole is significant or not.

Here two models are compared: in one model all independent variables are used, and in the other model the
independent variables are not used.
Now the Chi-square test compares how good the prediction is when the independent variables are used and how
good it is when they are not used.
The Chi-square test then tells us whether there is a significant difference between these two results. The null hypothesis
is that both models are the same. If the p-value is less than 0.05, this null hypothesis is rejected.
Example logistic regression
• As an example for the logistic regression, the purchasing behavior in an
online shop is examined. The aim is to determine the influencing factors
that lead a person to buy immediately, at a later time or not at all from the
online shop after visiting the website. The online shop provides the data
collected for this purpose.
• The dependent variable therefore has the following three characteristics:
• Buy now
• Buy later
• Don't buy
• Gender, age, income and time spent in the online shop are available as
independent variables.
LINEAR REGRESSION VS LOGISTIC REGRESSION

• Linear regression is used to handle regression problems; logistic regression is used to handle classification problems.
• Linear regression provides a continuous output; logistic regression provides a discrete output.
• The purpose of linear regression is to find the best-fitted line; logistic regression goes one step further and fits the line's values to a sigmoid curve.
• The loss function in linear regression is based on the mean squared error, whereas logistic regression uses maximum likelihood estimation.
• Linear regression is used to predict a continuous dependent variable from a given set of independent variables; logistic regression is used to predict a categorical dependent variable.
• The outputs of linear regression are continuous values, such as price or age; the outputs of logistic regression are categorical values, such as 0 or 1, yes or no.
