
Comparison of Regression and Machine Learning Models

Regression and machine learning models are related concepts but differ in various aspects.
Here's a comparison to delineate the distinctions:

1. Definition:
o Regression: A statistical method used to model the relationship between a
dependent variable and one or more independent variables by fitting a
mathematical equation to observed data.
o Machine Learning Models: A broader category of algorithms designed to
recognize patterns within data, learn from these patterns, and make predictions
or decisions without being explicitly programmed for the task.
2. Methodology:
o Regression: Often employs a linear approach, where a line of best fit is used
to predict output values.
o Machine Learning Models: Can include various methodologies, including
but not limited to linear models, clustering, neural networks, decision trees,
etc.
3. Complexity:
o Regression: Generally simpler, focusing on the relationship between
variables.
o Machine Learning Models: Often more complex, can model nonlinear
relationships, multiple interactions, and high-dimensional data.
4. Flexibility:
o Regression: Usually assumes a specific form for the relationship between
variables (e.g., linear).
o Machine Learning Models: Offers a more flexible approach, allowing for
different kinds of relationships, including nonlinear ones.
5. Application:
o Regression: Primarily used for prediction, forecasting, or determining the
strength and character of the relationship between one dependent variable and
one or more independent variables.
o Machine Learning Models: Can be used for a wide range of applications,
including classification, clustering, recommendation, anomaly detection, and
more.
6. Interpretability:
o Regression: Typically more interpretable, especially in the case of linear
regression, where the relationship between variables can be easily understood.
o Machine Learning Models: Some models, especially complex ones like deep
neural networks, are often considered "black boxes" because their decision-
making process is less transparent or interpretable.
7. Assumptions:
o Regression: Often operates under specific assumptions such as linearity,
independence, homoscedasticity, and normality.
o Machine Learning Models: May not require such strict assumptions,
providing more flexibility in handling various types of data.
8. Scope:
o Regression: More constrained in its application, primarily dealing with
relationships between variables.
o Machine Learning Models: Broad in scope, encompassing various
techniques and approaches for different types of data and tasks.
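
To make the contrast concrete, here is a minimal Python sketch (using made-up synthetic data) that fits an ordinary linear regression and a decision tree to the same one-variable dataset; the linear model yields a single readable equation, while the tree trades that interpretability for extra flexibility.

#Minimal sketch: linear regression vs. a more flexible machine learning model
#(synthetic, made-up data, illustrative only)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))                #single feature
y = 2.0 * X.ravel() + 3.0 + rng.normal(0, 1, 200)    #roughly linear target

lin = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

#The linear model exposes an interpretable equation: y = slope*x + intercept
print("linear model: slope =", lin.coef_[0], ", intercept =", lin.intercept_)
#The tree is more flexible but offers no single equation to read off
print("tree prediction at x = 5:", tree.predict([[5.0]])[0])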
Linear Regression

In Machine Learning,
 Linear Regression is a supervised machine learning algorithm.
 It tries to find out the best linear relationship that describes the data you have.
 It assumes that there exists a linear relationship between a dependent variable and
independent variable(s).
 The dependent variable of a linear regression model takes continuous values, i.e.,
real numbers.

Representing Linear Regression Model-

A linear regression model represents the linear relationship between a dependent variable and
independent variable(s) via a sloped straight line.

The sloped straight line representing the linear relationship that fits the given data best is
called the regression line.
It is also called the best fit line.

Types of Linear Regression-


Based on the number of independent variables, there are two types of linear regression-

1. Simple Linear Regression


2. Multiple Linear Regression

1. Simple Linear Regression-

In simple linear regression, the dependent variable depends only on a single independent
variable.

For simple linear regression, the form of the model is-


Y = β0 + β1X

Here,
 Y is a dependent variable.
 X is an independent variable.
 β0 and β1 are the regression coefficients.
 β0 is the intercept or bias term that sets the offset of the line.
 β1 is the slope or weight that specifies the factor by which X has an impact on Y.

The following three cases are possible-

Case-01: β1 < 0

 It indicates that variable X has a negative impact on Y.


 If X increases, Y will decrease and vice-versa.
Case-02: β1 = 0

 It indicates that variable X has no impact on Y.


 If X changes, there will be no change in Y.

Case-03: β1 > 0

 It indicates that variable X has a positive impact on Y.


 If X increases, Y will increase and vice-versa.
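
For a quick illustration with made-up numbers: in Y = 10 − 2X (β1 < 0), raising X from 1 to 2 lowers Y from 8 to 6; in Y = 10 + 0X (β1 = 0), Y stays at 10 whatever the value of X; and in Y = 10 + 2X (β1 > 0), raising X from 1 to 2 raises Y from 12 to 14.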
2. Multiple Linear Regression-

In multiple linear regression, the dependent variable depends on more than one independent
variable.

For multiple linear regression, the form of the model is-


Y = β0 + β1X1 + β2X2 + β3X3 + …… + βnXn

Here,
 Y is a dependent variable.
 X1, X2, …., Xn are independent variables.
 β0, β1,…, βn are the regression coefficients.
 βj (1 <= j <= n) is the slope or weight that specifies the factor by which Xj has an impact on
Y.
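
As a small illustrative sketch (with made-up numbers constructed so that Y = 1 + 2X1 + 2.5X2 exactly), a multiple linear regression can be fitted in Python as follows:

#Minimal multiple linear regression sketch with made-up data
import numpy as np
from sklearn.linear_model import LinearRegression

#Two independent variables (columns X1, X2) and one dependent variable Y
X = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 6]])
y = np.array([8.0, 7.5, 17.0, 16.5, 26.0])   #equals 1 + 2*X1 + 2.5*X2

model = LinearRegression().fit(X, y)
print("intercept (β0):", model.intercept_)       #expected: 1.0
print("coefficients (β1, β2):", model.coef_)     #expected: [2.0, 2.5]
print("prediction for X1=6, X2=5:", model.predict([[6, 5]])[0])   #expected: 25.5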

Describe the key assumptions underlying linear regression models and


explain their importance.

Answer:
Linear regression is one of the most widely used statistical methods, allowing researchers to
quantify the relationship between independent and dependent variables. To obtain unbiased
and efficient estimates, several key assumptions must be met:

1. Linearity: The relationship between the independent and dependent variables should
be linear. This means that a change in the predictor leads to a constant change in the
response variable. Checking scatter plots of variables can often help visualise this
relationship.
2. Independence: The residuals (or errors), which are the differences between the
observed and predicted values, should be independent of each other. This assumption
is particularly crucial for time series data where observations might be correlated over
time.
3. Homoscedasticity: The variance of the residuals should remain constant across all
levels of the independent variables. If the spread of residuals varies at different levels
of the independent variables, it's a sign of heteroscedasticity, which can undermine
the efficiency of the linear regression model.
4. Normality of Residuals: For any fixed value of the independent variables, the
dependent variable should be normally distributed. This is crucial for hypothesis
testing. This doesn't mean that the variables themselves need to be normally
distributed, but rather the residuals.
5. No Multicollinearity: Multicollinearity arises when two or more independent
variables in the regression model are highly correlated, making it difficult to isolate
the individual effect of each variable. It doesn't affect the fit of the model but can
make it hard to determine which variable is having an impact. Techniques like the
Variance Inflation Factor (VIF) can be used to detect multicollinearity.

Importance:
Meeting these assumptions ensures that the parameter estimates of the regression model are
unbiased, reliable, and efficient. Violations of these assumptions can lead to misleading
results, reduced prediction accuracy, and incorrect inferences about relationships between
variables. Hence, checking and, if necessary, remedying violations of these assumptions
should be a fundamental step in the linear regression modelling process.
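
As a practical check on assumption 5, the Variance Inflation Factor can be computed directly. The following is a minimal sketch with made-up data, assuming the statsmodels package is available:

#Minimal sketch: detecting multicollinearity with the Variance Inflation Factor
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.1, size=100)   #deliberately correlated with x1
x3 = rng.normal(size=100)                         #independent predictor
X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

#A VIF above roughly 5-10 is a common rule of thumb for problematic collinearity
for i, col in enumerate(X.columns):
    print(col, round(variance_inflation_factor(X.values, i), 2))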
LINEAR REGRESSION PROBLEM

Problem:

You are given the following data representing the advertising budget (in thousands of dollars)
and the corresponding sales (in thousands of units) for a product:

Advertising Budget: X=[2,4,6,8,10]

Sales: Y=[4,7,9,11,15]

Find a linear relationship between the advertising budget and sales.
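
A sketch of one way to solve this by hand, using the same least-squares formulas worked through in the exam-score problem later in this section:

Here, n = 5, Σx = 30, Σy = 46, Σxy = 328, Σx² = 220.

β1 = [5(328) − (30)(46)] / [5(220) − (30)²] = (1640 − 1380) / (1100 − 900) = 260/200 = 1.3
β0 = [Σy − β1(Σx)] / n = [46 − 1.3(30)] / 5 = 7/5 = 1.4

So the fitted line is Y = 1.4 + 1.3X: each additional thousand dollars of advertising is associated with roughly 1,300 additional units sold.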

This simple linear regression example illustrates how you can establish a linear relationship
between two variables and use that relationship for prediction. In real-world applications,
additional steps and considerations may be necessary, including data preprocessing, feature
selection, model validation, and more.
Problem:

A small fitness center wants to understand the relationship between the number of hours
spent in the gym and weight loss (in pounds) for its members over a one-month period. They
collect the following data:

Hours in Gym (X): [5,10,15,20,25]

Weight Loss (Y): [3,6,9,12,15]

The fitness center wants to use this information to predict the weight loss for a member who
spends 18 hours in the gym.
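
A sketch of the same least-squares calculation for this data (here the points happen to lie exactly on a line):

n = 5, Σx = 75, Σy = 45, Σxy = 825, Σx² = 1375

β1 = [5(825) − (75)(45)] / [5(1375) − (75)²] = (4125 − 3375) / (6875 − 5625) = 750/1250 = 0.6
β0 = [45 − 0.6(75)] / 5 = 0

So Y = 0.6X, and the predicted weight loss for a member who spends 18 hours in the gym is 0.6 × 18 = 10.8 pounds.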
Problem: Given the following data on the number of hours studied and the exam scores
obtained by a group of students:

Hours Studied Exam Score (%)


1 50
2 55
3 60
4 68
5 73
6 80
7 88

Predict the exam score of a student who studies for 8 hours using linear regression.

Solution:

1. State the hypothesis: We assume a linear relationship between hours studied and
exam score:

Y = β0 + β1X

Where:
 Y is the exam score.
 X is the number of hours studied.
 β0 is the y-intercept.
 β1 is the slope of the line.

2. Compute the slope (β1) using the formula:

β1 = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]

Given the data: Σx = 28, Σy = 474, Σxy = 2073, Σx² = 140, n = 7

β1 = [7(2073) − (28)(474)] / [7(140) − (28)²]
β1 = (14511 − 13272) / (980 − 784)
β1 = 1239 / 196
β1 ≈ 6.321

3. Compute the y-intercept (β0):

β0 = [Σy − β1(Σx)] / n
β0 = [474 − 6.321(28)] / 7
β0 = (474 − 177) / 7
β0 ≈ 42.429

4. Predict the exam score for a student who studies 8 hours:

Y = 42.429 + 6.321(8)
Y = 42.429 + 50.571
Y = 93.0

So, according to our linear regression model, a student who studies for 8 hours is predicted to
score about 93% on the exam.
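
As a quick cross-check of the hand calculation above, the same line can be fitted with NumPy; this is a small sketch assuming NumPy is installed:

#Quick sketch to verify the hand calculation with numpy
import numpy as np

hours = np.array([1, 2, 3, 4, 5, 6, 7])
scores = np.array([50, 55, 60, 68, 73, 80, 88])

#polyfit with degree 1 returns [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(hours, scores, 1)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("predicted score for 8 hours:", round(intercept + slope * 8, 1))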

PROGRAM

#Import required Libraries


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#read the dataset (adjust the path to wherever kc_house_data.csv is stored)
df = pd.read_csv("/content/kc_house_data.csv")

#visualizing the correlations between numeric columns using a heatmap
#(numeric_only=True skips non-numeric columns such as the date field)
plt.figure()
sns.heatmap(df.corr(numeric_only=True), cmap='coolwarm')
plt.show()
#selecting the required parameters
area=df['sqft_living']
price=df['price']
x=np.array(area).reshape(-1,1)
y=np.array(price)
#split the data into training and testing datasets
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=0)
y_train=y_train.reshape(-1,1)
y_test=y_test.reshape(-1,1)

#fit the model over the training dataset


from sklearn.linear_model import LinearRegression
model =LinearRegression()
model.fit(x_train, y_train)
#calculate intercept and coefficient
print(model.intercept_)
print(model.coef_)
pred=model.predict(x_test)
predictions=pred.reshape(-1,1)
#calculate mean squared error and root mean squared error to evaluate model performance
from sklearn.metrics import mean_squared_error
print('MSE:', mean_squared_error(y_test,predictions))
print('RMSE:',np.sqrt(mean_squared_error(y_test,predictions)))
OUTPUT:

[-48536.69005829]
[[284.14771038]]
MSE: 62014619472.34492
RMSE: 249027.34683633628
Question:
Describe the Ridge and Lasso regression models, highlighting their similarities,
differences, and use cases.

Answer:
Ridge regression (also known as Tikhonov regularisation) and Lasso regression (Least
Absolute Shrinkage and Selection Operator) are both regularisation techniques used to
prevent overfitting in linear regression models. They both add a penalty to the ordinary least
squares (OLS) objective, which can help in handling multicollinearity and reducing model
complexity.

1. Similarities:
o Both Ridge and Lasso are linear models that use a penalty term to shrink the
coefficients of less important features towards zero.
o They both aim to prevent overfitting by adding a constraint to the magnitude
of coefficients in the model.
o Both regularisation methods require the selection of a hyperparameter, often
denoted as λ, that determines the strength of the regularization. As λ increases,
the penalty on the coefficients also increases.
2. Differences:
o Penalty Term: Ridge regression adds an L2 penalty, which is the squared
magnitude of the coefficients; its penalty term is λΣβ². Lasso
regression, on the other hand, uses an L1 penalty, which is the absolute value
of the magnitude of the coefficients; its penalty term is λΣ|β|.
o Feature Selection: One of the most distinguishing characteristics of Lasso is
that it can reduce some feature coefficients to exactly zero, effectively
excluding them from the model, thereby performing feature selection. Ridge
regression, in contrast, will shrink coefficients close to zero, but they will
rarely be exactly zero.
o Computation: Ridge has a closed-form solution, whereas Lasso does not.
This makes Ridge computationally more straightforward in some cases, but
Lasso might require iterative methods like coordinate descent.
3. Use Cases:
o Ridge regression is particularly useful when all features have an influence on
the output variable, but the model needs to be regularised to prevent
overfitting.
o Lasso regression is beneficial when we believe many features are irrelevant
or redundant and want the model to automatically perform feature selection.

In summary, Ridge and Lasso regression are powerful techniques to enhance the prediction
accuracy and interpretability of regression models, especially in scenarios where there are
many predictors or multicollinearity is a concern.
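
To illustrate the behaviour described above, here is a minimal scikit-learn sketch on made-up data (alpha plays the role of the regularisation strength λ):

#Minimal sketch comparing Ridge and Lasso coefficients on synthetic data
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
#only the first two features actually matter in this synthetic target
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

#Ridge shrinks all coefficients but rarely makes them exactly zero;
#Lasso typically zeroes out the irrelevant features (feature selection)
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))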
