
Regression and Probability

1. What is Regression?
Regression analysis is a statistical method that helps us analyse and understand the relationship between two or more variables of interest.
The process adopted to perform regression analysis helps us understand which factors are important, which factors can be ignored, and how
they influence each other.
Two types of variable are used in regression analysis:
1. Dependent Variable: this is the variable that we are trying to understand or forecast.
2. Independent Variables: these are the factors that influence the target variable and provide us with information
regarding their relationship with it.

2. General Uses of Regression Analysis


Regression analysis is used for prediction and forecasting, and it overlaps substantially with the field of machine learning. This statistical
method is used across different industries, such as:
• Financial industry - understand trends in stock prices, forecast prices, and evaluate risks in the insurance domain.
• Marketing - understand the effectiveness of marketing campaigns, and forecast pricing and sales of a product.
• Manufacturing - evaluate the relationships among the variables that determine engine design, in order to deliver better performance.
• Medicine - forecast how different combinations of medicines will perform when preparing generic medicines for diseases.

3. Terminologies used in Regression Analysis


The commonly used terminologies of Regression Analysis are: Outliers, Multicollinearity, Heteroscedasticity, Underfit and Overfit.
Outliers
Suppose there is an observation in the dataset that has a very high or very low value compared to the other observations in the data, i.e. it
does not appear to belong to the population. Such an observation is called an outlier. In simple words, it is an extreme value. An outlier is a
problem because it often hampers the results we get.
Multicollinearity
When the independent variables are highly correlated with each other, the variables are said to be multicollinear. Many regression
techniques assume that multicollinearity is not present in the dataset, because it causes problems in ranking variables by their
importance and makes it difficult to select the most important independent variables.
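As a quick illustration (a minimal sketch in Python, assuming pandas is available; the feature names and values here are invented), a pairwise correlation matrix is a simple first check for multicollinearity:

```python
# Minimal sketch: spotting multicollinearity with a correlation matrix.
# The feature names and data are made up for illustration only.
import pandas as pd

df = pd.DataFrame({
    "rainfall":   [100, 120, 90, 150, 130],
    "irrigation": [ 98, 119, 92, 148, 131],   # nearly a copy of rainfall
    "sunlight":   [  8,   6,  9,   5,   7],
})

# Pairwise Pearson correlations; values near +1 or -1 between
# independent variables signal multicollinearity.
print(df.corr().round(2))
```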
Heteroscedasticity
When the variability of the target variable is not constant across values of the independent variable, this is called heteroscedasticity. Example: as one's
income increases, the variability of food consumption will increase. A poorer person will spend a rather constant amount by always eating
inexpensive food; a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher
incomes display a greater variability of food consumption.
Underfit and Overfit
When we use unnecessary explanatory variables, it can lead to overfitting. Overfitting means that our algorithm works well on the training
set but performs poorly on the test set. It is also known as a problem of high variance.
When our algorithm works so poorly that it is unable to fit even the training set well, it is said to underfit the data. This is also known as a
problem of high bias.

4. Types of Regression
1. Linear Regression
The simplest of all regression types is Linear Regression, which tries to establish a relationship between the independent and dependent
variables. The dependent variable considered here is always a continuous variable.


Examples of Independent & Dependent Variables:


• x is Rainfall and y is Crop Yield
• x is Advertising Expense and y is Sales
• x is Sales of Goods and y is GDP
If there is a single independent variable, the model is known as Simple Linear Regression.

If there are multiple independent variables, it is called Multiple Linear Regression.

2. Polynomial Regression
This type of regression technique is used to model nonlinear equations by taking polynomial functions of independent variables.
In situations where the relationship between the dependent and independent variable appears to be non-linear, a curve fits the data better
than a straight line, and we can deploy Polynomial Regression models.

Thus a polynomial of degree k in one variable is written as: y = β₀ + β₁X + β₂X² + ⋯ + βₖXᵏ + ε


Here we can create new features X₁ = x, X₂ = x², …, Xₖ = xᵏ and fit a linear regression in the same manner. In the case of multiple
variables, say X₁ and X₂, we can create a third new feature (say X₃) which is the product of X₁ and X₂, i.e.
X₃ = X₁ · X₂
The main drawback of this type of regression model is that creating unnecessary extra features or fitting polynomials of a higher degree
may lead to overfitting of the model.
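As a minimal sketch of this feature-expansion idea (Python with NumPy; the data are synthetic), a degree-2 polynomial can be fitted with ordinary least squares on the expanded features:

```python
# Sketch: degree-2 polynomial regression via feature expansion,
# i.e. X1 = x, X2 = x^2, followed by ordinary linear regression.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(0, 0.3, size=x.shape)  # synthetic data

# Design matrix with columns [1, x, x^2]; lstsq returns the beta coefficients.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", beta.round(2))   # close to [1.0, 2.0, -0.5]
```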

3. Logistic Regression
Logistic Regression, also known as the logit or maximum-entropy classifier, is a supervised learning method for classification. It establishes
a relation between a dependent class variable and the independent variables using regression.
The dependent variable is categorical, i.e. it can take only integral values representing different classes. The probabilities describing the
possible outcomes of a query point are modelled using a logistic function. This model belongs to the family of discriminative classifiers,
which rely on attributes that discriminate the classes well. The basic model is used when we have 2 classes of dependent variable; when
there are more than 2 classes, another form of the method (described below) helps us predict the target variable.
There are two broad categories of Logistic Regression algorithms:
1. Binary Logistic Regression, when the dependent variable is strictly binary.
2. Multinomial Logistic Regression, when the dependent variable has multiple categories.
There are two types of Multinomial Logistic Regression:
1. Ordered Multinomial Logistic Regression (the dependent variable has ordered values).
2. Nominal Multinomial Logistic Regression (the dependent variable has unordered categories).
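As a minimal sketch (Python with scikit-learn, assuming it is installed; the toy data are invented), binary logistic regression can be fitted and queried for class probabilities:

```python
# Sketch: binary logistic regression on a toy dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression

# One feature (e.g. hours studied), binary outcome (fail/pass); values invented.
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
# predict_proba returns the class probabilities modelled by the logistic function.
print(clf.predict_proba([[4.5]]))   # probabilities for classes 0 and 1
```

For more than two classes, the same scikit-learn estimator also handles the multinomial case.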

5. Simple Linear Regression Model


As the model is used to predict the dependent variable, the relationship between the variables can be written in the form y = β₀ + β₁x + ε, where β₀ is the intercept, β₁ the slope, and ε the random error term.


The main factor considered as part of regression analysis is understanding the variance between the variables. To understand the
variance, we need to understand the measures of variation.
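As a minimal sketch of fitting this model (Python with NumPy; the data values are invented), the textbook closed-form estimates b₁ = cov(x, y)/var(x) and b₀ = ȳ − b₁x̄ can be computed directly:

```python
# Sketch: closed-form estimates for simple linear regression y = b0 + b1 * x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.1, 6.2, 7.9, 10.1])   # roughly y = 2x

# b1 = sample covariance of (x, y) divided by sample variance of x.
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
print(f"intercept b0 = {b0:.2f}, slope b1 = {b1:.2f}")
```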

6. Assumptions of Linear Regression


Assumptions of Linear Regression are:
1. Linear Relationship
2. Normality
3. No or Little Multicollinearity
4. No Autocorrelation in errors
5. Homoscedasticity
With these assumptions considered while building the model, we can build the model and make our predictions for the dependent variable. For
any type of machine learning model, we need a metric to judge whether the variables considered for the model are correct. In the case of
regression analysis, the statistical measure that evaluates the model is called the coefficient of determination, represented as r².
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent
variable. The higher the value of r², the better the model with the independent variables being considered.
r² = SSR / SST, where SSR is the regression (explained) sum of squares, SST is the total sum of squares, and the value of r² lies in the range 0 ≤ r² ≤ 1.
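As a minimal sketch (Python with NumPy; invented data), r² can be computed from SSR and SST after fitting the best-fit line:

```python
# Sketch: computing r^2 = SSR / SST for a fitted simple linear regression.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.1, 6.2, 7.9, 10.1])

b1, b0 = np.polyfit(x, y, deg=1)        # slope and intercept of the best-fit line
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)       # total variation in y
ssr = np.sum((y_hat - y.mean()) ** 2)   # variation explained by the regression
print("r^2 =", ssr / sst)               # lies between 0 and 1
```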

7. Regression: Linear Relationship


Simple linear regression is a type of regression analysis where the number of independent variables is one and there is a linear relationship
between the independent (x) and dependent (y) variable. In a scatter plot of such data, the fitted line is referred to as the best-fit straight line.

8. Two Simultaneous Equations


A system of two simultaneous equations with two unknowns (for example, y₁ = mx₁ + c and y₂ = mx₂ + c, with unknowns m and c) can be solved exactly.

A system of three simultaneous equations with the same two unknowns has, in general, no exact solution.


9. Overdetermined System
A system of equations is considered overdetermined if there are more equations than unknowns. An overdetermined system is almost always
inconsistent (it has no solution) when constructed with random coefficients. However, an overdetermined system will have solutions in some
cases, for example if some equation occurs several times in the system, or if some equations are linear combinations of the others.
With two unknowns and two observations, the system can be solved exactly. An additional observation leads to an overdetermined system.
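As a minimal sketch (Python with NumPy; the coefficients are invented), NumPy's least-squares solver handles an overdetermined system of three equations in two unknowns:

```python
# Sketch: three equations, two unknowns -- an overdetermined system.
# np.linalg.lstsq returns the least-squares solution when no exact one exists.
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])        # three observations, two unknowns
b = np.array([2.0, 3.1, 3.9])

x, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print("least-squares solution:", x)
print("sum of squared residuals:", residuals)
```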

10. How can an Overdetermined System be solved?


An overdetermined system can be solved through a noise model:

• The noise model gives the mismatch between the model and the data.
• A Gaussian model is justified by appeal to the central limit theorem.
• Other models are also possible (e.g. Student-t for heavy tails).
• Maximum likelihood with Gaussian noise leads to least squares.

11. The Gaussian Density


The Gaussian distribution (also known as the normal distribution) is a bell-shaped curve, and it is assumed that during any measurement values will
follow a normal distribution, with an equal number of measurements above and below the mean value.
The mean is usually represented by μ and the variance by σ² (σ is the standard deviation). The graph of a Gaussian distribution is symmetric
about the mean.


As an example, a Gaussian PDF with mean μ = 1.7 and variance σ² = 0.0225 (so σ = 0.15) could represent the heights of a population of students.
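As a minimal sketch (Python with NumPy and SciPy, assuming both are installed), the density from this example can be evaluated directly:

```python
# Sketch: evaluating the Gaussian PDF from the text (mu = 1.7, sigma^2 = 0.0225).
import numpy as np
from scipy.stats import norm

mu, sigma = 1.7, np.sqrt(0.0225)                         # sigma = 0.15
heights = np.linspace(1.4, 2.0, 5)
print(norm.pdf(heights, loc=mu, scale=sigma).round(3))   # density at each height
print(norm.cdf(mu, loc=mu, scale=sigma))                 # 0.5: half the mass lies below the mean
```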

12. Two Important Gaussian Properties


Two important properties of the Gaussian are the sum of Gaussians and the scaling of a Gaussian: the sum of independent Gaussian variables is itself Gaussian, with the means and variances adding, and scaling a Gaussian variable by a factor w gives another Gaussian, with the mean scaled by w and the variance by w².
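As a minimal sketch (Python with NumPy), both properties can be checked empirically with large random samples:

```python
# Sketch: empirical check of the two Gaussian properties.
import numpy as np

rng = np.random.default_rng(0)
y1 = rng.normal(1.0, 2.0, size=1_000_000)   # N(1, 4)
y2 = rng.normal(3.0, 1.0, size=1_000_000)   # N(3, 1)

s = y1 + y2                    # sum property: should be ~ N(4, 5)
print(s.mean(), s.var())       # approx 4 and 5

w = 0.5
z = w * y1                     # scaling property: should be ~ N(0.5, 1)
print(z.mean(), z.var())       # approx 0.5 and 1
```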

13. A Probabilistic Process


Probabilistic systems are models of systems that involve quantitative information about uncertainty. Probabilities in discrete probabilistic
systems appear as labels on transitions between states. For example, in a Markov chain a transition from one state to another is taken with a
given probability.
Probabilistic models are widely used in text mining and applications range from topic modeling, language modeling, document classification
and clustering to information extraction.

14. How can the Likelihood of a Data Point or of a whole Data Set be determined?
The likelihood (or likelihood function) of a set of parameters given the observed data is defined as the probability of all the observations
under those parameter values.
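As a minimal sketch (Python with NumPy and SciPy; the observations are invented), the likelihood of a data set under a Gaussian model with independent observations is the product of the individual point densities:

```python
# Sketch: data set likelihood as a product of Gaussian point densities.
import numpy as np
from scipy.stats import norm

data = np.array([1.62, 1.75, 1.68, 1.83, 1.70])   # invented observations
mu, sigma = 1.7, 0.15                              # candidate parameter values

point_likelihoods = norm.pdf(data, loc=mu, scale=sigma)
print("data set likelihood:", point_likelihoods.prod())
```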


15. Log Likelihood Function


The log-likelihood is the expression that software such as Minitab maximizes to determine optimal values of the estimated coefficients (β). Log-likelihood
values cannot be used alone as an index of fit because they are a function of sample size, but they can be used to compare the fit of different
coefficients.
Linear regression is a classical model for predicting a numerical quantity. The coefficients of a linear regression model can be estimated by
minimizing a negative log-likelihood function, following maximum likelihood estimation. The negative log-likelihood function can also be used to
derive the least-squares solution to linear regression.
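As a minimal sketch (Python with NumPy and SciPy; the data are invented), minimizing the Gaussian negative log-likelihood numerically recovers the least-squares coefficients:

```python
# Sketch: for linear regression with Gaussian noise, the negative log-likelihood
#   -log L = (n/2) * log(2*pi*sigma^2) + SSE / (2*sigma^2)
# depends on the coefficients only through the sum of squared errors (SSE),
# so minimizing it recovers the least-squares fit.
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.1, 6.2, 7.9, 10.1])

def neg_log_likelihood(params, sigma=1.0):
    b0, b1 = params
    resid = y - (b0 + b1 * x)
    n = len(y)
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum(resid**2) / (2 * sigma**2)

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(result.x)   # matches the least-squares intercept and slope
```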

16. Error Function


The error function is the function we try to minimize. It can be confusing that the literature sometimes uses the same term once the function
being minimized has been differentiated. Just remember that we wish to minimize the error in our model, and the function that helps us
achieve this is the error function.

17. Sum of Squares Error


The sum of squared errors, or SSE, is a preliminary statistical calculation that leads to other statistics, such as r². The error is the difference
between the observed value and the predicted value, and we usually want to minimize it. The smaller the error, the better the estimation power
of the regression.
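As a minimal sketch (Python with NumPy; the values are invented), SSE is a one-line computation:

```python
# Sketch: SSE is the sum of squared differences between observed and predicted values.
import numpy as np

y_obs  = np.array([2.1, 4.1, 6.2, 7.9, 10.1])
y_pred = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

sse = np.sum((y_obs - y_pred) ** 2)
print("SSE =", sse)
```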

18. Learning as Optimization: How can Optimization find the gradient m and intercept c?
Machine learning optimization is the process of adjusting a model's parameters (here, the gradient m and intercept c) in order to minimize the
cost function, using one of the optimization techniques.
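As a minimal sketch (Python with NumPy; the data, learning rate, and iteration count are illustrative choices, not tuned values), gradient descent on the mean-squared-error cost can find m and c:

```python
# Sketch: gradient descent on the MSE cost to find the gradient m and
# intercept c of y = m*x + c.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.1, 6.2, 7.9, 10.1])

m, c = 0.0, 0.0
lr = 0.01          # learning rate (illustrative choice)
n = len(x)

for step in range(5000):
    error = (m * x + c) - y
    # Gradients of the MSE cost (1/n) * sum(error^2) w.r.t. m and c.
    grad_m = (2.0 / n) * np.sum(error * x)
    grad_c = (2.0 / n) * np.sum(error)
    m -= lr * grad_m
    c -= lr * grad_c

print(f"m = {m:.2f}, c = {c:.2f}")   # close to the least-squares fit
```

With too large a learning rate the cost would increase from step to step rather than decrease, which is exactly the situation discussed in the next section.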


19. What is a Cost Function? If the Cost Function increases, is it Good or Bad?


A cost function is a function that measures the performance of a machine learning model for given data. It quantifies the error
between predicted values and expected values and presents it in the form of a single real number. Depending on the problem, the cost function
can be formed in many different ways.
In ML, cost functions are used to estimate how badly models are performing. Put simply, a cost function is a measure of how wrong the
model is in terms of its ability to estimate the relationship between X and y. This is typically expressed as a difference or distance between
the predicted value and the actual value.
An increase in the cost function is bad: it indicates instability and a reduced likelihood that convergence will occur.
