
Image by Mirko Grisendi from Pixabay

A Practical Approach to Linear Regression in Machine Learning
A Hands-On Beginner’s Guide to Linear Regression

Ashwin Raj · Published in Towards Data Science · 10 min read · May 8, 2020

Imagine having the ability to predict real-world outcomes based on just one feature.
Sounds a bit magical, right? Well, this magic is called Simple Linear Regression, and
it's a fundamental tool in the world of Machine Learning. But don't let the term
intimidate you - it's much simpler than you may think.


In this article, we will start by exploring the mathematical foundations of Linear
Regression. After that, we'll dive into some key concepts like the hypothesis and cost
function, and to top it off, we'll put our newfound knowledge into action by creating a
simple regression model. So let's get ready for an exciting journey!

If you find this blog helpful, do consider giving it some claps so that Medium knows
you're enjoying what you're reading. For more exciting Machine Learning
Recipes, make sure to follow me. Let's delve straight into our topic!

1. Understanding Simple Linear Regression


Linear Regression is a supervised learning algorithm that is generally used when
the value to be predicted is of a continuous, quantitative nature. It tries to establish a
relationship between the dependent variable 'y' and one or more related
independent variables 'x' using what is referred to as the best-fit line.

One of the most common examples of a linear regression model is
predicting the price of a house by analyzing sales data from that region.

Image from CDOT Wiki

Linear Regression can be classified into two main categories - Simple Linear
Regression and Multiple Linear Regression. The former centers around just one
independent variable, while the latter extends its reach to multiple
independent variables, creating a multidimensional landscape for predictive
modeling. In this article, we will be discussing Simple Linear Regression.

1.1 Mathematics Behind Simple Linear Regression


The crux of Simple Linear Regression is to derive a best-fit line that captures the
relationship between a single independent variable and the dependent variable.
The best-fit line minimizes the overall distance of the data points from the
regression line, ensuring a precise alignment with the trend.

Statistically, the mathematical equation that approximately models a linear
relationship between a dependent variable y and an independent variable x is

y = mx + c

wherein 'm' is the slope of the line and 'c' is the intercept; together they are
referred to as the 'Model's Coefficients'. This equation is the basis for any Linear
Regression model and is often referred to as the Hypothesis Function.

The goal of most machine learning algorithms is to construct a model, i.e. the
hypothesis (H), to estimate the dependent variable based on our independent
variables such that it minimizes the Loss Function (Residual Sum of Squares).

Aptly termed the Least Squares Method, this approach seeks the
optimal values for the model parameters that minimize the loss function/RSS.
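To see how this works in practice, here is a minimal sketch of the closed-form least squares estimates using NumPy (the helper function name and toy data are mine, for illustration):

import numpy as np

def least_squares_fit(x, y):
    """Closed-form least squares estimates for the line y = m*x + c."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x
    m = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    # Intercept: the fitted line passes through the point of means
    c = y_mean - m * x_mean
    return m, c

# Toy data sampled around the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
print(least_squares_fit(x, y))  # approximately (1.94, 1.15)

For these toy points, the estimates come out close to the slope of 2 and intercept of 1 that the data was sampled around.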

1.2 Exploring the Cost Functions


While coefficients provide insights into the relationship between variables, we
need metrics to quantify how accurately these linear models capture the
underlying data patterns. The most common metrics are MSE and the R² Score.

Image by Rohan Vij from Medium


One of the primary metrics used to assess the model's performance is MSE, or Mean
Squared Error. This metric quantifies the average squared difference between the actual
observed values and the values predicted by the regression line. A lower MSE
indicates that the model's predictions are closer to the actual values.

The R-squared (R²) Score, also known as the coefficient of determination, is
another vital metric that measures how well the model explains the
variability in the dependent variable. It typically ranges from 0 to 1, with a higher value
indicating a better fit, though it does not explicitly account for overfitting.
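As a quick sketch of what these metrics compute (the helper functions below are mine, for illustration; in practice you would use the implementations in sklearn.metrics):

import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared residuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - RSS/TSS."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - rss / tss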

Beyond these, you may also explore other metrics, such as the Residual Sum of
Squares and Residual Analysis, that offer a more comprehensive evaluation of
model performance. Read more about regression evaluation metrics here.

2. Assumptions in Simple Linear Regression


Behind its core functionality, Simple Linear Regression rests on a set of
foundational assumptions that underpin its reliability and performance. These
assumptions serve as the bedrock upon which the model's predictions are built,
ensuring that its insights accurately reflect real-world dynamics.

1. Linear Relationship: The Simple Linear Regression algorithm expects that the
relationship between the independent variable and the dependent variable is linear
and can be adequately approximated by a simple straight-line equation.

2. Homoscedasticity: Also referred to as the assumption of constant variance, this
assumption requires that the variance of the model errors remain
consistent across all levels of the independent variable, i.e. the predictor variable.

3. Normality of Residuals: This assumption signifies that the residuals should follow
a normal distribution with a mean of zero. If they do not, confidence intervals may
become too wide or too narrow, and some statistical tests may not hold true
(see the diagnostic sketch after this list).



Another important assumption is that there should be no multicollinearity.
Multicollinearity emerges when the independent variables are correlated with one
another. In Simple Linear Regression this is not a concern, as there is only one predictor.
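Below is a minimal sketch of how the homoscedasticity and normality assumptions can be checked visually once a model has been fit (the function name is mine, and it assumes you already have actual and predicted values):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def residual_diagnostics(y_true, y_pred):
    """Visual checks for homoscedasticity and normality of residuals."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    # Residuals vs predictions: a constant band around zero suggests
    # homoscedasticity; a funnel shape suggests it is violated
    ax1.scatter(y_pred, residuals, alpha=0.5)
    ax1.axhline(0, color="red", linestyle="--")
    ax1.set(xlabel="Predicted values", ylabel="Residuals",
            title="Residuals vs Predicted")

    # Q-Q plot: points close to the diagonal suggest normal residuals
    stats.probplot(residuals, dist="norm", plot=ax2)
    ax2.set_title("Normal Q-Q Plot of Residuals")

    plt.tight_layout()
    plt.show()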

3. Building a Simple Linear Regression Model


Having explored the fundamental concepts of Simple Linear Regression, it is time to
put theory into practice by constructing your very own Simple Linear Regression
model. Let's jump straight into the practical implementation.

In this tutorial, we're going to create a diamond price prediction model using Simple
Linear Regression. We'll train this model using the Diamonds dataset found on
Kaggle. You can grab the dataset from this GitHub repo.

Here, I will be using Google Colab to build this model. You can also use other IDEs
like PyCharm, VS Code, or Jupyter Notebook to follow along with this tutorial.

3.1 Importing Libraries and Loading the Dataset


Let's begin by importing the Python packages that are crucial to building the model.
You don't have to gather all the tools at once; we'll start with essential libraries like
Pandas, NumPy, Seaborn, and Matplotlib. Instead of typing out their full names
every time, we create short aliases using the 'as' Python keyword.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline


If you're working in a Jupyter Notebook, consider including the %matplotlib inline
magic command. It's a nifty tool that lets you view plots right within the notebook,
eliminating the need to call plt.show() each time you generate a plot.

Screenshot by Author from Kaggle

Once we've loaded these packages, the next task is to fetch the dataset and load the
data. For this, we'll employ the read_csv method from the pandas library. When
specifying the dataset's location, if the file is located in a different directory, you
will need to provide the relative or absolute path to the file.

data = pd.read_csv("datasets/diamonds.csv")

Although CSV files are most commonly used for machine learning tasks, JSON files
and Excel spreadsheets can also be used as datasets. The only distinction is to use the
read_json() and read_excel() functions when working with such files.
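For instance (hypothetical file names, assuming such files exist alongside the CSV):

data_from_json = pd.read_json("datasets/diamonds.json")
data_from_excel = pd.read_excel("datasets/diamonds.xlsx")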

3.2 Data Exploration and Data Cleaning


When dealing with a new dataset, it's always a good idea to take a quick peek at the
first few rows of the data. You can do this using the head() function. By default it
displays the first 5 rows, but it can be tuned to show more or fewer rows.

print(data.head(3))


When dealing with unseen datasets, ensuring data quality is paramount. Real-world
datasets often come with a few missing values, duplicate records, and outliers.
Inaccurate or incomplete information can lead to skewed results and analysis, so
before we put our model to the test, we need to clean our dataset.

data.fillna(data.mean(numeric_only=True), inplace=True)
data.drop_duplicates(inplace=True)

One of the most common challenges when working with real-world datasets is
dealing with missing data. To handle such missing values, a good approach is to
impute them with the mean of that column. This approach is particularly useful
when the missingness is random and unlikely to introduce bias.

Another challenge that we might come across when working with real-world data is
managing duplicate entries. To tackle this issue, a straightforward method is to
eliminate these duplicates using the drop_duplicates() function.

numeric_cols = data.select_dtypes(include=np.number).columns

Q1 = data[numeric_cols].quantile(0.25)
Q3 = data[numeric_cols].quantile(0.75)

IQR = Q3 - Q1

# Keep only rows whose numeric values fall within the 1.5 * IQR fences
data = data[
    ~((data[numeric_cols] < (Q1 - 1.5 * IQR)) |
      (data[numeric_cols] > (Q3 + 1.5 * IQR))).any(axis=1)
]

Outliers are data points that deviate significantly from the rest of the dataset. These
may indicate errors, anomalies, or unique occurrences. One approach to managing
outliers is to filter out data points that lie beyond the IQR fences (i.e. below Q1 -
1.5 * IQR or above Q3 + 1.5 * IQR, wherein IQR = Q3 - Q1).

3.3 Data Visualization and Feature Engineering


Numbers and statistics alone may not always provide the complete narrative.
Visualizing your data brings it to life, revealing patterns and relationships that might
otherwise remain hidden. In this section, we explore a few visualization techniques
that can help us gain deeper insights into the intricacies of our data.


Image by Author

Matplotlib and Seaborn are excellent libraries for visualizing data.
Some of the commonly used visualization techniques are described below:

1. Pair Plot: Pair plots paint a matrix of scatter plots, revealing how variables
correlate. They are often used to spot trends, clusters, or even outliers, and to
study the relationship between the predictor and the predictand.

sns.pairplot(
    data, x_vars=['carat'], y_vars=['price'], height=12, kind='scatter'
)

plt.xlabel('Carat')
plt.ylabel('Price')
plt.title('Diamond Price Prediction - Carat vs Price')

plt.show()

2. Heatmap: Heatmaps offer a color-coded matrix representation of the data. They
are particularly handy when dealing with multiple predictor variables, helping you
identify potential multicollinearity through the color gradients.

plt.figure(figsize=(12, 8))

sns.heatmap(
    data.corr(numeric_only=True), annot=True, cmap="coolwarm", linewidths=0.5
)
plt.title("Feature Correlation Heatmap")

plt.show()

3. Violin Plot: Violin plots are used to study the distribution of data across multiple
levels, much like a violin's shape. They are often used to inspect the density of
values, especially when exploring the spread of the dependent variable.

plt.figure(figsize=(10, 6))

sns.violinplot(x="carat", y="price", data=data)

plt.title("Violin Plot of Price by Carat")
plt.show()

Other important visualizations include the box plot for distributions and outliers, the
regression plot for model alignment, and the bar plot for categorical insights.

3.4 Model Development and Evaluation


Now that we've performed the basic data exploration and cleaning tasks, it's time to
construct and evaluate our simple linear regression model. In this blog, we will be
training a model to predict diamond prices based on carat weight.


Image from Machine Learning Recipes WebApp

To start off, divide the data into two parts: one for training the model and the other
for testing its performance. A common practice is to allocate 70% to 80% of the
data for training and reserve the remaining portion for validation.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Feature and target: predict price from carat weight, as described above
X = data[['carat']]
y = data['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42  # seed assumed; original was cut off
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

To ensure fair treatment, we use StandardScaler to standardize the features'
distribution, making sure no feature disproportionately influences the model.

Note that only the feature variables need to be scaled. Also, in simple linear
regression, since there is only a single feature involved, scaling may be skipped.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

linear_regression_model = LinearRegression()
linear_regression_model.fit(X_train_scaled, y_train)

y_pred = linear_regression_model.predict(X_test_scaled)

print("Mean Squared Error: ", mean_squared_error(y_test, y_pred))
print("R2 Score: ", r2_score(y_test, y_pred))

Next, we need to identify the best-fit line that represents the relationship between
the predictor and the predictand. Our aim is to minimize the difference between the
predicted and actual values by adjusting the slope and intercept of the line.
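With scikit-learn, these fitted coefficients can be read directly off the trained model, as a quick sanity check:

# The learned parameters of the hypothesis y = m*x + c
print("Slope (m):", linear_regression_model.coef_[0])
print("Intercept (c):", linear_regression_model.intercept_)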


Image by Rohan Vij from Medium

Finally, we have successfully built the model, and now it's time to put it through its
paces and see how it holds up. For this, we calculate the MSE and the R² Score of the
model. Our aim here is to minimize the MSE and maximize the R² Score.
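Once you are happy with these scores, the model can be used for new predictions. A hypothetical example (the 0.5-carat value is mine, for illustration), reusing the model and scaler from above:

# Estimate the price of a hypothetical 0.5-carat diamond
new_diamond = pd.DataFrame({"carat": [0.5]})
predicted_price = linear_regression_model.predict(scaler.transform(new_diamond))
print("Predicted price:", predicted_price[0])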

4. Limitations and Extensions


While Simple Linear Regression is a powerful tool, it's crucial to recognize its
limitations. One such limitation is its reliance on a linear relationship between
variables. In real-world scenarios, data may often exhibit complex behaviors that
simple linear regression might fail to capture accurately.

Additionally, this method is sensitive to outliers. Outliers can influence the slope
and intercept of the regression line, potentially generating skewed
predictions. It's crucial to preprocess the data diligently to mitigate this issue.

However, simple linear regression is not the only linear algorithm. There are
extensions that tackle the limitations of such models. Polynomial regression, for
instance, introduces curves and bends, accommodating nonlinear patterns.
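As a minimal sketch of that extension, a degree-2 polynomial regression can be built on the same data with a scikit-learn pipeline (the degree is an arbitrary choice for illustration):

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Expand the single feature into [1, x, x^2], then fit a linear model on it
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_train, y_train)

print("Polynomial R2 Score:", poly_model.score(X_test, y_test))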

Similarly, Multiple Linear Regression takes on multiple predictors, making it
suitable for analyzing multifaceted relationships. Time Series Regression, on the
other hand, can be used when dealing with dynamic, time-dependent data.

5. Conclusion and Final Thoughts


From predicting sales based on advertisement spending to understanding the
relationship between study time and exam scores, the applications of simple linear
regression are diverse and span multiple industries, from finance to healthcare.


Though useful, simple linear regression is just a stepping stone to understanding
more complicated models and algorithms. From handling nonlinear relationships to
embracing multiple predictors, the regression journey continues, ever expanding
its horizons.

With that, we have reached the end of this article. You have chosen the right career
at exactly the right time. If you have any questions or if you believe I have made a
mistake, feel free to get in touch with me over LinkedIn. Read more such Machine
Learning Recipes here. Happy Learning!

