
Image by Mirko Grisendi from Pixabay

A Practical Approach to Linear Regression in Machine Learning
A Hands-On Beginner’s Guide to Linear Regression

Ashwin Raj · Published in Towards Data Science · 10 min read · May 8, 2020

Imagine having the ability to predict real-world outcomes based on just one feature.
Sounds a bit magical, right? Well, this magic is called Simple Linear Regression, and
it's a fundamental tool in the world of Machine Learning. But don't let the term
intimidate you - it's much simpler than you may think.


In this article, we will start by exploring the mathematical foundations of Linear
Regression. After that, we'll dive into some key concepts like the hypothesis and cost
function, and to top it off, we'll put our newfound knowledge into action by creating a
simple regression model. So let's get ready for an exciting journey!

If you find this blog helpful, do consider giving it some claps so that Medium knows
you're enjoying what you're reading. For more exciting Machine Learning
Recipes, make sure to follow me. Let's delve straight into our topic!

1. Understanding Simple Linear Regression


Linear Regression is a supervised learning algorithm that is generally used when
the value to be predicted is of a continuous, quantitative nature. It tries to establish a
relationship between the dependent variable 'y' and one or more related
independent variables 'x' using what is referred to as the best-fit line.

One of the most common examples of a linear regression model is
predicting the price of a house by analyzing sales data from that region.

Image from CDOT Wiki

Linear Regression can be classified into two main categories - Simple Linear
Regression and Multiple Linear Regression. The former centers around just one
independent variable, while the latter extends its reach to multiple
independent variables, creating a multidimensional landscape for predictive
modeling. In this article, we will be discussing Simple Linear Regression.

1.1 Mathematics Behind Simple Linear Regression


The crux of Simple Linear Regression is to derive a best-fit line that captures the
relationship between a single independent variable and the dependent variable.
The best-fit line minimizes the overall distance of the data points from the
regression line, ensuring a precise alignment with the trend.

Statistically, the mathematical equation that approximately models a linear
relationship between a dependent variable y and an independent variable x is

y = mx + c

wherein 'm' is the slope of the line and 'c' is the intercept; together they are
referred to as the 'Model's Coefficients'. This equation is the basis for any Linear
Regression model and is often referred to as the Hypothesis Function.

The goal of most machine learning algorithms is to construct a model, i.e. the
hypothesis (H), to estimate the dependent variable based on our independent
variables such that it minimizes the Loss Function (Residual Sum of Squares).

Aptly termed the Least Squares Method, this approach seeks the
optimal values for the model parameters that minimize the loss function/RSS.
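To see how this works in practice, here is a minimal sketch of the closed-form least squares estimates using NumPy (the helper function name and toy data are mine, for illustration):

import numpy as np

def least_squares_fit(x, y):
    """Closed-form least squares estimates for the line y = m*x + c."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: covariance of x and y divided by the variance of x
    m = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    # Intercept: the fitted line passes through the point of means
    c = y_mean - m * x_mean
    return m, c

# Toy data sampled around the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])
print(least_squares_fit(x, y))  # approximately (1.94, 1.15)

For these toy points, the estimates come out close to the slope of 2 and intercept of 1 that the data was sampled around.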

1.2 Exploring the Cost Functions


While coefficients provide insights into the relationship between variables, we
need metrics to quantify how accurately these linear models capture the
underlying data patterns. The most common metrics are MSE and the R² Score.

Image by Rohan Vij from Medium


One of the primary metrics used to assess the model's performance is MSE, or Mean
Squared Error. This metric quantifies the average squared difference between the actual
observed values and the values predicted by the regression line. A lower MSE
indicates that the model's predictions are closer to the actual values.

The R-squared (R²) Score, also known as the coefficient of determination, is
another vital metric that measures how well the model explains the
variability in the dependent variable. It typically ranges from 0 to 1, with a higher value
indicating a better fit, though it does not explicitly account for overfitting.
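As a quick sketch of what these metrics compute (the helper functions below are mine, for illustration; in practice you would use the implementations in sklearn.metrics):

import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the average of the squared residuals."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - RSS/TSS."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    rss = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - rss / tss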

Beyond these, you may also explore other metrics, such as the Residual Sum of
Squares and Residual Analysis, that offer a more comprehensive evaluation of
model performance. Read more about regression evaluation metrics here.

2. Assumptions in Simple Linear Regression


Behind its core functionality, Simple Linear Regression rests on a set of
foundational assumptions that underpin its reliability and performance. These
assumptions serve as the bedrock upon which the model's predictions are built,
ensuring that its insights accurately reflect real-world dynamics.

1. Linear Relationship: The Simple Linear Regression algorithm expects that the
relationship between the independent variable and the dependent variable is linear
and can be adequately approximated by a simple straight-line equation.

2. Homoscedasticity: Also referred to as the assumption of constant variance, this
assumption requires that the variance of the model errors remain
consistent across all levels of the independent variable, i.e. the predictor variable.

3. Normality of Residuals: This assumption signifies that the residuals should follow
a normal distribution with a mean of zero. If they do not, confidence intervals may
become too wide or too narrow, and some statistical tests may not hold true
(see the diagnostic sketch after this list).



Another important assumption is that there should be no multicollinearity.
Multicollinearity emerges when the independent variables are correlated with one
another. In Simple Linear Regression this is not a concern, as there is only one predictor.
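Below is a minimal sketch of how the homoscedasticity and normality assumptions can be checked visually once a model has been fit (the function name is mine, and it assumes you already have actual and predicted values):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def residual_diagnostics(y_true, y_pred):
    """Visual checks for homoscedasticity and normality of residuals."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

    # Residuals vs predictions: a constant band around zero suggests
    # homoscedasticity; a funnel shape suggests it is violated
    ax1.scatter(y_pred, residuals, alpha=0.5)
    ax1.axhline(0, color="red", linestyle="--")
    ax1.set(xlabel="Predicted values", ylabel="Residuals",
            title="Residuals vs Predicted")

    # Q-Q plot: points close to the diagonal suggest normal residuals
    stats.probplot(residuals, dist="norm", plot=ax2)
    ax2.set_title("Normal Q-Q Plot of Residuals")

    plt.tight_layout()
    plt.show()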

3. Building a Simple Linear Regression Model


Having explored the fundamental concepts of Simple Linear Regression, it is time to
put theory into practice by constructing your very own Simple Linear Regression
model. Let's jump straight into the practical implementation.

In this tutorial, we're going to create a diamond price prediction model using Simple
Linear Regression. We'll train this model using the Diamonds dataset found on
Kaggle. You can grab the dataset from this GitHub repo.

Here, I will be using Google Colab to build this model. You can also use other IDEs
like PyCharm, VS Code, or Jupyter Notebook to follow along with this tutorial.

3.1 Importing Libraries and Loading the Dataset


Let's begin by importing the Python packages that are crucial to building the model.
You don't have to gather all the tools at once; we'll start with essential libraries like
Pandas, NumPy, Seaborn, and Matplotlib. Instead of typing out their full names
every time, we create short aliases using the 'as' Python keyword.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline


If you're working in a Jupyter Notebook, consider including the %matplotlib inline
magic command. It's a nifty tool that lets you view plots right within the notebook,
eliminating the need to call plt.show() each time you generate a plot.

Screenshot by Author from Kaggle

Once we've loaded these packages, the next task is to fetch the dataset and load the
data. For this, we'll employ the read_csv method from the pandas library. When
specifying the dataset's location, if the file is located in a different directory, you
will need to provide the relative or absolute path to the file.

data = pd.read_csv("datasets/diamonds.csv")

Although CSV files are most commonly used for machine learning tasks, JSON files
and Excel spreadsheets can also be used as datasets. The only distinction is to use the
read_json() and read_excel() functions when working with such files.
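For instance (hypothetical file names, assuming such files exist alongside the CSV):

data_from_json = pd.read_json("datasets/diamonds.json")
data_from_excel = pd.read_excel("datasets/diamonds.xlsx")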

3.2 Data Exploration and Data Cleaning


When dealing with a new dataset, it's always a good idea to take a quick peek at the
first few rows of the data. You can do this using the head() function. By default it
displays the first 5 rows, but it can be tuned to show more or fewer rows.

print(data.head(3))


When dealing with unseen datasets, ensuring data quality is paramount. Real-world
datasets often come with a few missing values, duplicate records, and outliers.
Inaccurate or incomplete information can lead to skewed results and analysis, so
before we put our model to the test, we need to clean our dataset.

data.fillna(data.mean(numeric_only=True), inplace=True)
data.drop_duplicates(inplace=True)

One of the most common challenges when working with real-world datasets is
dealing with missing data. To handle such missing values, a good approach is to
impute them with the mean of that column. This approach is particularly useful
when the missingness is random and unlikely to introduce bias.

Another challenge that we might come across when working with real-world data is
managing duplicate entries. To tackle this issue, a straightforward method is to
eliminate these duplicates using the drop_duplicates() function.

numeric_cols = data.select_dtypes(include=np.number).columns

Q1 = data[numeric_cols].quantile(0.25)
Q3 = data[numeric_cols].quantile(0.75)

IQR = Q3 - Q1

# Keep only rows whose numeric values fall within the 1.5 * IQR fences
data = data[
    ~((data[numeric_cols] < (Q1 - 1.5 * IQR)) |
      (data[numeric_cols] > (Q3 + 1.5 * IQR))).any(axis=1)
]

Outliers are data points that deviate significantly from the rest of the dataset. These
may indicate errors, anomalies, or unique occurrences. One approach to managing
outliers is to filter out data points that lie beyond the IQR fences (i.e. below Q1 -
1.5 * IQR or above Q3 + 1.5 * IQR, wherein IQR = Q3 - Q1).

3.3 Data Visualization and Feature Engineering


Numbers and statistics alone may not always provide the complete narrative.
Visualizing your data brings it to life, revealing patterns and relationships that might
otherwise remain hidden. In this section, we explore a few visualization techniques
that can help us gain deeper insights into the intricacies of our data.


Image by Author

Matplotlib and Seaborn are excellent libraries for visualizing data.
Some of the commonly used visualization techniques are described below:

1. Pair Plot: Pair plots paint a matrix of scatter plots, revealing how variables
correlate. They are often used to spot trends, clusters, or even outliers, and to
study the relationship between the predictor and the predictand.

sns.pairplot(
    data, x_vars=['carat'], y_vars=['price'], height=12, kind='scatter'
)

plt.xlabel('Carat')
plt.ylabel('Price')
plt.title('Diamond Price Prediction - Carat vs Price')

plt.show()

2. Heatmap: Heatmaps offer a color-coded matrix representation of the data. They
are particularly handy when dealing with multiple predictor variables, helping you
identify potential multicollinearity through the color gradients.

plt.figure(figsize=(12, 8))

sns.heatmap(
    data.corr(numeric_only=True), annot=True, cmap="coolwarm", linewidths=0.5
)
plt.title("Feature Correlation Heatmap")

plt.show()

3. Violin Plot: Violin plots are used to study the distribution of data across multiple
levels, much like a violin's shape. They are often used to inspect the density of
values, especially when exploring the spread of the dependent variable.

plt.figure(figsize=(10, 6))

sns.violinplot(x="carat", y="price", data=data)

plt.title("Violin Plot of Price by Carat")
plt.show()

Other important visualizations include the box plot for distributions and outliers, the
regression plot for model alignment, and the bar plot for categorical insights.

3.4 Model Development and Evaluation


Now that we've performed the basic data exploration and cleaning tasks, it's time to
construct and evaluate our simple linear regression model. In this blog, we will be
training a model to predict diamond prices based on carat weight.


Image from Machine Learning Recipes WebApp

To start off, divide the data into two parts: one for training the model and the other
for testing its performance. A common practice is to allocate 70% to 80% of the
data for training and reserve the remaining portion for validation.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Feature and target: predict price from carat weight, as described above
X = data[['carat']]
y = data['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42  # seed assumed; original was cut off
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

To ensure fair treatment, we use StandardScaler to standardize the features'
distribution, making sure no feature disproportionately influences the model.

Note that only the feature variables need to be scaled. Also, in simple linear
regression, since there is only a single feature involved, scaling may be skipped.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

linear_regression_model = LinearRegression()
linear_regression_model.fit(X_train_scaled, y_train)

y_pred = linear_regression_model.predict(X_test_scaled)

print("Mean Squared Error: ", mean_squared_error(y_test, y_pred))
print("R2 Score: ", r2_score(y_test, y_pred))

Next, we need to identify the best-fit line that represents the relationship between
the predictor and the predictand. Our aim is to minimize the difference between the
predicted and actual values by adjusting the slope and intercept of the line.
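With scikit-learn, these fitted coefficients can be read directly off the trained model, as a quick sanity check:

# The learned parameters of the hypothesis y = m*x + c
print("Slope (m):", linear_regression_model.coef_[0])
print("Intercept (c):", linear_regression_model.intercept_)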


Image by Rohan Vij from Medium

Finally, we have successfully built the model, and now it's time to put it through its
paces and see how it holds up. For this, we calculate the MSE and the R² Score of the
model. Our aim here is to minimize the MSE and maximize the R² Score.
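Once you are happy with these scores, the model can be used for new predictions. A hypothetical example (the 0.5-carat value is mine, for illustration), reusing the model and scaler from above:

# Estimate the price of a hypothetical 0.5-carat diamond
new_diamond = pd.DataFrame({"carat": [0.5]})
predicted_price = linear_regression_model.predict(scaler.transform(new_diamond))
print("Predicted price:", predicted_price[0])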

4. Limitations and Extensions


While Simple Linear Regression is a powerful tool, it's crucial to recognize its
limitations. One such limitation is its reliance on a linear relationship between
variables. In real-world scenarios, data may often exhibit complex behaviors that
simple linear regression might fail to capture accurately.

Additionally, this method is sensitive to outliers. Outliers can influence the slope
and intercept of the regression line, potentially generating skewed
predictions. It's crucial to preprocess the data diligently to mitigate this issue.

However, simple linear regression is not the only linear algorithm. There are
extensions that tackle the limitations of such models. Polynomial regression, for
instance, introduces curves and bends, accommodating nonlinear patterns.
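As a minimal sketch of that extension, a degree-2 polynomial regression can be built on the same data with a scikit-learn pipeline (the degree is an arbitrary choice for illustration):

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Expand the single feature into [1, x, x^2], then fit a linear model on it
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_train, y_train)

print("Polynomial R2 Score:", poly_model.score(X_test, y_test))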

Similarly, Multiple Linear Regression takes on multiple predictors, making it
suitable for analyzing multifaceted relationships. Time Series Regression, on the
other hand, can be used when dealing with dynamic, time-dependent data.

5. Conclusion and Final Thoughts


From predicting sales based on advertisement spending to understanding the
relationship between study time and exam scores, the applications of simple linear
regression are diverse and span multiple industries, from finance to healthcare.


Though useful, simple linear regression is just a stepping stone to understanding
more complicated models and algorithms. From handling nonlinear relationships to
embracing multiple predictors, the regression journey continues, ever expanding
its horizons.

With that, we have reached the end of this article. You have chosen the right career
at exactly the right time. If you have any questions or if you believe I have made a
mistake, feel free to get in touch with me over LinkedIn. Read more such Machine
Learning Recipes here. Happy Learning!

