• One of the categories is set as the reference category and a new variable is created for each of the remaining
categories.
• Let's take an example to illustrate this. Suppose you are studying the effect of education level (a categorical
variable with three levels: high school, college, and graduate) on salary. In order to include this categorical
variable in a regression model, it needs to be encoded as dummy variables.
• Let's say we use high school as the reference category and create two dummy
variables: is_college and is_graduate. The variable is_college, for example, will take a value of 1 if the
individual has a college degree and 0 otherwise.
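As a small sketch in plain Python (the variable names are illustrative, not from any particular library), the dummy coding described above could look like this:

```python
def dummy_encode(education):
    """Dummy-code the education level, with "high school" as the
    reference category (the reference gets 0 on both dummies)."""
    return {
        "is_college": 1 if education == "college" else 0,
        "is_graduate": 1 if education == "graduate" else 0,
    }
```

For a person with a college degree this yields is_college = 1, is_graduate = 0; for the reference category high school both dummies are 0.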
• If you only want to use one variable for prediction, a simple
regression is used.
• If you use more than one variable, you need to perform a multiple
regression.
• If the dependent variable is nominally scaled, a logistic
regression must be calculated.
• If the dependent variable is metrically scaled, a linear regression is
used.
• Whether a linear or a non-linear regression is used depends on the
relationship itself.
• In order to perform a linear regression, a linear relationship between
the independent variables and the dependent variable is necessary.
Simple Linear Regression
• The goal of a simple linear regression is to predict the value of a
dependent variable based on an independent variable.
• The stronger the linear relationship between the independent variable
and the dependent variable, the more accurate the prediction.
• This goes along with the fact that the greater the proportion of the
dependent variable's variance that can be explained by the independent
variable, the more accurate the prediction. Visually, the relationship
between the variables can be shown in a scatter plot.
• The greater the linear relationship between the dependent and
independent variables, the more the data points lie on a straight line.
• The task of simple linear regression is to determine the
straight line that best describes the linear relationship between
the dependent and independent variable.
• In linear regression analysis, a straight line is drawn in the scatter
plot. To determine this straight line, linear regression uses
the method of least squares.
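The least-squares line has well-known closed-form estimates, b = cov(x, y) / var(x) and a = mean(y) - b·mean(x). A minimal sketch in plain Python (function name is illustrative):

```python
def fit_simple_regression(xs, ys):
    """Least-squares estimates for the line y = a + b*x:
    b = cov(x, y) / var(x), a = mean(y) - b * mean(x)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b
```

For data that lie exactly on y = 1 + 2x, the sketch recovers a = 1 and b = 2.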
The regression line can be described by the following equation:
ŷ = a + b1·x1 + b2·x2 + … + bn·xn
Here a is the intercept and the bi are the regression coefficients. If an independent variable changes by one unit, the associated coefficient indicates by how much the
dependent variable changes.
So if the independent variable xi increases by one unit, the dependent variable y increases by bi.
Coefficient of determination
• In order to find out how well the regression model can predict or
explain the dependent variable, two main measures are used: the
coefficient of determination R2 on the one hand and the standard
estimation error on the other.
• The coefficient of determination R2, also known as the variance
explanation, indicates how large the portion of the variance is that can
be explained by the independent variables.
• The more variance can be explained, the better the regression model
is. In order to calculate R2, the variance of the estimated values is
related to the variance of the observed values.
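For least-squares fits this ratio of variances equals the more common form R2 = 1 - SSE/SST, which is easy to compute from observed values and model predictions. A minimal sketch (names are illustrative):

```python
def r_squared(ys, preds):
    """Coefficient of determination: R^2 = 1 - SSE/SST,
    i.e. the share of the observed variance explained by the model."""
    my = sum(ys) / len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))  # unexplained
    sst = sum((y - my) ** 2 for y in ys)                # total
    return 1 - sse / sst
```

Perfect predictions give R2 = 1; a model that always predicts the mean gives R2 = 0.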
Adjusted R2
The coefficient of determination R2 is influenced by the number of independent variables used.
The more independent variables are included in the regression model, the greater the explained variance R2 becomes, even if the additional variables contribute little.
To take this into account, the adjusted R2 is used.
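The usual adjustment formula is 1 - (1 - R2)·(n - 1)/(n - k - 1), where n is the sample size and k the number of independent variables. A sketch (function name is hypothetical):

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 penalizes additional predictors:
    1 - (1 - R^2) * (n - 1) / (n - k - 1),
    with n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)
```

The adjusted value is always at most the raw R2, and the gap grows as more variables are added relative to the sample size.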
Standard estimation error
• The standard estimation error is the standard deviation of the
estimation error. It gives an impression of how much, on average, the
predictions differ from the observed values.
• Graphically interpreted, the standard estimation error is the dispersion
of the observed values around the regression line.
• The coefficient of determination and the standard estimation error are
used for simple and multiple linear regression.
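A hedged sketch of the standard estimation error as sqrt(SSE / (n - k - 1)) for a model with k independent variables (this degrees-of-freedom convention is one common choice):

```python
import math

def standard_estimation_error(ys, preds, k=1):
    """Standard deviation of the residuals:
    sqrt(SSE / (n - k - 1)), with k independent variables
    (k=1 corresponds to simple linear regression)."""
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))
    return math.sqrt(sse / (len(ys) - k - 1))
```

Perfect predictions give an error of 0; larger scatter of the observed values around the regression line increases it.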
Assumptions of Linear Regression
• In order to interpret the results of the regression analysis
meaningfully, certain conditions must be met.
• Linearity: There must be a linear relationship between the
dependent and independent variables.
• Homoscedasticity: The residuals must have a constant variance.
• Normality: Normally distributed error
• No multicollinearity: No high correlation between the independent
variables
• No auto-correlation: The error component should have no auto
correlation
Linearity
• In linear regression, a straight line is drawn through the data.
This straight line should represent all points as well as
possible. If the points are distributed in a non-linear way, the
straight line cannot fulfill this task.
• In the upper left graph, there is a linear relationship between the
dependent and the independent variable, so the regression line can be
drawn in meaningfully. In the right graph, you can see a clearly
non-linear relationship between the dependent and the independent
variable.
• There, it is not possible to put the regression line through the points in
a meaningful way. For that reason, the coefficients of the regression
model cannot be meaningfully interpreted, and the prediction errors may
be larger than expected.
• Therefore it is important to check beforehand, whether a linear
relationship between the dependent variable and each of the independent
variables exists. This is usually checked graphically.
Homoscedasticity
• Since in practice the regression model never exactly predicts the
dependent variable, there is always an error. This very error must
have a constant variance over the predicted range.
To test homoscedasticity, i.e. the constant variance of the residuals, the dependent variable is plotted
on the x-axis and the error on the y-axis.
Now the error should scatter evenly over the entire range. If this is the case, homoscedasticity is
present. If this is not the case, heteroscedasticity is present.
In the case of heteroscedasticity, the error has different variances, depending on the value range of the
dependent variable.
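As a very crude numeric proxy for the graphical check (a sketch only, not a formal test such as Breusch-Pagan): split the observations at the median prediction and compare the residual variances of the two halves; a ratio far from 1 hints at heteroscedasticity.

```python
def variance_ratio(preds, residuals):
    """Crude homoscedasticity check: sort observations by prediction,
    split at the median, and compare residual variances of the halves."""
    pairs = sorted(zip(preds, residuals))
    half = len(pairs) // 2

    def var(res):
        m = sum(res) / len(res)
        return sum((r - m) ** 2 for r in res) / len(res)

    low = var([r for _, r in pairs[:half]])
    high = var([r for _, r in pairs[half:]])
    return high / low
```

If the residual spread grows with the predicted value, the ratio rises well above 1.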
Multicollinearity
It means that two or more independent variables are strongly correlated with one another. The problem
with multicollinearity is that the effects of each independent variable cannot be clearly separated from
one another.
Significance test and Regression
• The regression analysis is often carried out in order to make statements about
the population based on a sample.
• Therefore, the regression coefficients are calculated using the data from the sample. To rule out
the possibility that the regression coefficients are just random and would take completely different
values in another sample, the results are statistically checked with a significance test. This test takes
place at two levels.
• Significance test for the whole regression model
• Significance test for the regression coefficients
• It should be noted, however, that the assumptions in the previous section must be met.
To calculate the probability of a person being sick or not using logistic regression for the example above,
the model parameters b1, b2, b3 and a must first be determined. Once these have been determined, the
equation for the example above is:
P(sick) = 1 / (1 + e^-(b1·x1 + b2·x2 + b3·x3 + a))
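Assuming the general logistic form P(sick) = 1 / (1 + e^-(b1·x1 + b2·x2 + b3·x3 + a)), the probability can be computed like this; in practice the coefficient values would come from the fitted model, and all names here are illustrative.

```python
import math

def sickness_probability(x1, x2, x3, b1, b2, b3, a):
    """Logistic model: P(sick) = 1 / (1 + e^-(b1*x1 + b2*x2 + b3*x3 + a)).
    Coefficient values are placeholders for fitted parameters."""
    z = b1 * x1 + b2 * x2 + b3 * x3 + a
    return 1 / (1 + math.exp(-z))
```

At z = 0 the model returns 0.5; larger positive z pushes the probability toward 1, negative z toward 0.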
The Likelihood Function
• To understand the maximum likelihood method, we introduce
the likelihood function L. L is a function of the unknown parameters in
the model, in case of logistic regression these are b1,... bn, a. Therefore
we can also write L(b1,... bn, a) or L(θ) if the parameters are
summarized in θ.
• L(θ) now indicates how probable it is that the observed data occur.
With the change of θ, the probability that the data will occur as
observed changes.
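For a one-predictor logistic model with θ = (b, a), the log of L(θ) can be sketched as follows; working with the log-likelihood rather than the raw product of probabilities is the standard numerically safer choice.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def log_likelihood(theta, xs, ys):
    """log L(theta) for a one-predictor logistic model, theta = (b, a);
    ys are the observed 0/1 outcomes:
    sum of y*log(p) + (1 - y)*log(1 - p) over all observations."""
    b, a = theta
    ll = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b * x + a)
        ll += y * math.log(p) + (1 - y) * math.log(1 - p)
    return ll
```

A maximum likelihood fit then searches for the θ that makes this value as large as possible, i.e. the parameters under which the observed data are most probable.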
Multinomial logistic regression
• As long as the dependent variable has two values
(e.g. male, female), i.e. is dichotomous, binary logistic regression is
used. However, if the dependent variable has more than two values,
e.g. which mobility concept describes a person's journey to work
(car, public transport, bicycle), multinomial logistic regression must be
used.
• Each value of the mobility variable (car, public transport, bicycle) is
transformed into a new variable. The one variable mobility concept
becomes the three new variables:
• car is used
• public transport is used
• bicycle is used
• Each of these new variables then has only the two
values yes or no; e.g. the variable car is used has only the two
answer options yes or no (either it is used or it is not). Thus, for the one
variable "mobility concept" with three values, there are three new
variables with two values each: yes and no (0 and 1).
• Three logistic regression models are now created for these three
variables.
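The transformation of the single mobility variable into three binary variables (the one-vs-rest setup described above) can be sketched as:

```python
def one_vs_rest(values, categories=("car", "public transport", "bicycle")):
    """Turn one categorical variable into one binary (0/1) variable per
    category, as used in the one-vs-rest multinomial setup."""
    return {
        f"{cat} is used": [1 if v == cat else 0 for v in values]
        for cat in categories
    }
```

Each resulting 0/1 column then gets its own binary logistic regression model.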
Chi2 Test and Logistic Regression
• In the case of logistic regression, the Chi-square test tells whether
the model is overall significant or not.
Here two models are compared. In one model all independent variables are used, and in the other model the
independent variables are not used.
Now the Chi-square test compares how good the prediction is when the independent variables are used and how
good it is when they are not used.
The Chi-square test now tells us if there is a significant difference between these two results. The null hypothesis
is that both models are the same. If the p-value is less than 0.05, this null hypothesis is rejected.
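This comparison boils down to the likelihood-ratio statistic 2·(LL_full − LL_null), which follows a Chi-square distribution under the null hypothesis. The sketch below hardcodes 3.841, the 5% critical value for one degree of freedom (i.e. one added independent variable); for more variables a different cutoff would apply.

```python
def likelihood_ratio_test(ll_null, ll_full, critical_value=3.841):
    """Likelihood-ratio Chi-square statistic: 2 * (ll_full - ll_null).
    critical_value 3.841 is the 5% cutoff for 1 degree of freedom."""
    chi2 = 2 * (ll_full - ll_null)
    return chi2, chi2 > critical_value
```

If the full model raises the log-likelihood enough, the statistic exceeds the cutoff and the null hypothesis that both models are the same is rejected.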
Example logistic regression
• As an example for the logistic regression, the purchasing behavior in an
online shop is examined. The aim is to determine the influencing factors
that lead a person to buy immediately, at a later time or not at all from the
online shop after visiting the website. The online shop provides the data
collected for this purpose.
• The dependent variable therefore has the following three characteristics:
• Buy now
• Buy later
• Don't buy
• Gender, age, income and time spent in the online shop are available as
independent variables.
LINEAR REGRESSION VS LOGISTIC REGRESSION
• Linear regression is used to handle regression problems; logistic regression is used to handle classification problems.
• Linear regression provides a continuous output; logistic regression provides a discrete output.
• The purpose of linear regression is to find the best-fitted line; logistic regression goes one step further and fits the line values to the sigmoid curve.
• The loss function in linear regression is calculated with the mean squared error, whereas for logistic regression it is maximum likelihood estimation.
• Linear regression is used to predict a continuous dependent variable from a given set of independent variables; logistic regression is used to predict a categorical dependent variable from a given set of independent variables.
• The outputs of linear regression are continuous values, such as price or age; the outputs of logistic regression are categorical values, such as 0 or 1, yes or no.