datagy
In this tutorial, you’ll learn the fundamentals of linear regression in Scikit-Learn.
Throughout this tutorial, you’ll use an insurance dataset to predict the insurance charges that a client will
accumulate, based on a number of different factors. You’ll learn how to model the linear relationship between
a single independent variable and a dependent variable, as well as between multiple independent variables
and a single dependent variable.
https://datagy.io/python-sklearn-linear-regression/?utm_source=newsletter&utm_medium=email&utm_campaign=name_lets_build_another_machine… 1/22
4/27/23, 8:09 AM Linear Regression in Scikit-Learn (sklearn): An Introduction • datagy
You may recall from high-school math that the equation for a linear relationship is: y = m(x) + b . In
machine learning, m is often referred to as the weight of a relationship and b is referred to as the bias.
This relationship is referred to as a univariate linear regression because there is only a single independent
variable. In many cases, the dependent variable can’t be predicted well from a single independent
variable. In these cases, there will be multiple independent variables influencing the dependent variable.
This can often be modeled as shown below:
y = w1(x1) + w2(x2) + ... + wn(xn) + b
Here, each independent variable x has its own weight w, and the weighted terms, together with a single
bias b, determine the resulting dependent variable.
In the image below, you can see the line of best fit being applied to some data. The more linear a
relationship, the more accurately the line of best fit will describe it.
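To make the weight and bias concrete, here is a minimal sketch that recovers them from a handful of points. The data is hypothetical and was built to follow y = 2x + 1 exactly; NumPy's polyfit is used here only as a convenient stand-in for a line-fitting routine.

```python
import numpy as np

# Hypothetical data that follows y = 2x + 1 exactly (no noise)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# A degree-1 polynomial fit returns the slope (weight) and intercept (bias)
m, b = np.polyfit(x, y, deg=1)
print(f"weight m = {m:.2f}, bias b = {b:.2f}")
```

With noisy data the recovered weight and bias would only approximate the true values.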
Let’s get started with learning how to implement linear regression in Python using Scikit-Learn!
To explore the data, let’s load the dataset as a Pandas DataFrame and print out the first five rows using
the .head() method.
import pandas as pd

# Load the insurance dataset from the datagy GitHub repository
df = pd.read_csv('https://raw.githubusercontent.com/datagy/data/main/insurance.csv')
print(df.head())
# Returns:
By printing out the first five rows of the dataset, you can see that the dataset has seven columns: age, sex, bmi, children, smoker, region, and charges.
For this tutorial, you’ll be exploring the relationship between the first six variables and the charges
variable. Specifically, you’ll explore how the numeric features impact the charges incurred by a client.
You’ll notice I specified numeric variables here. This is because a regression model can only be fitted on
numeric variables. While there are ways to convert categorical data into numeric form, a full treatment of
them is outside the scope of this tutorial.
Before going any further, let’s take a closer look at the dataset. Let’s confirm that the numeric features
are in fact stored as numeric data types and check whether any data is missing. This
can be done by applying the .info() method:
# Returns:
# <class 'pandas.core.frame.DataFrame'>
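As a rough illustration of the same check, the sketch below runs .info() on a tiny, hypothetical three-row sample that reuses the tutorial's column names (the values are made up), and then lists the numeric columns and the number of missing values per column.

```python
import pandas as pd

# Hypothetical three-row sample; the values are invented for illustration
sample = pd.DataFrame({
    "age": [19, 33, 45],
    "bmi": [27.9, 22.7, 30.1],
    "children": [0, 1, 2],
    "smoker": ["yes", "no", "no"],
    "charges": [16884.92, 4449.46, 8240.59],
})

# Prints each column's dtype and its count of non-null values
sample.info()

# Columns stored with a numeric dtype
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Number of missing values in each column
missing_per_column = sample.isna().sum()

print(numeric_cols)
print(missing_per_column)
```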
From this, you can see that the age , bmi , and children features are numeric, and that the
charges target variable is also numeric.
Pandas makes it very easy to calculate the coefficient of correlation between all numeric variables in a
dataset using the .corr() method. Let’s apply the method to the DataFrame and see what it returns:
# numeric_only=True skips the text columns (needed in pandas 2.0 and later)
print(df.corr(numeric_only=True))
# Returns:
# age bmi children charges
# age 1.000000 0.109272 0.042469 0.299008
# bmi 0.109272 1.000000 0.012759 0.198341
From this, you can see that the strongest relationship exists between the age and charges variables.
At roughly 0.30, it’s still a fairly weak relationship. Let’s see what other insights we can get from the data.
The data is widely scattered, so linear relationships may be harder to identify.
However, if you look closely, you can see some level of stratification. For example, the pairplots
for charges and age as well as charges and BMI show separate clusters of data.
Let’s see whether one of the categorical variables can help explain these clusters.
Because the smoker variable is a binary variable (either “yes” or “no”), let’s split the data by that
variable. This can be done by passing in the hue= parameter.
Adding hue to our Seaborn pairplot allows us to see trends in data for linear regression
From this, you can see that there are clear differences in the charges of clients that smoke or don’t smoke.
Let’s take a closer look at the relationship between the age and charges variables. This can be done
This is great! Aside from a few outliers, there’s a clear, linear-looking trend between age and charges
for non-smokers. Remember that when you first calculated the correlations, the one between age and charges was
the strongest, but it was still a weak relationship. Now that you know that smoking is a strong determinant
of charges, let’s filter the DataFrame to only non-smokers and see if this makes a difference in correlation.
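The filtering step can be sketched as follows. This example uses synthetic data rather than the insurance dataset: the column names match the tutorial, but the values, the random seed, and the size of the smoker effect are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the insurance data (values are invented)
rng = np.random.default_rng(0)
n = 200
age = rng.integers(18, 65, size=n)
smoker = rng.choice(["yes", "no"], size=n, p=[0.2, 0.8])

# Charges rise with age, jump for smokers, and include random noise
charges = 250 * age + np.where(smoker == "yes", 24000, 0) + rng.normal(0, 2000, size=n)
df = pd.DataFrame({"age": age, "smoker": smoker, "charges": charges})

# Correlation between age and charges across everyone
overall = df["age"].corr(df["charges"])

# Keep only the rows where smoker is "no", then recompute the correlation
non_smokers = df[df["smoker"] == "no"]
filtered = non_smokers["age"].corr(non_smokers["charges"])

print(f"all clients: {overall:.2f}, non-smokers only: {filtered:.2f}")
```

On the real insurance data, the tutorial reports the following correlations for non-smokers: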
# Returns:
# age bmi children charges
# age 1.000000 0.122638 0.033395 0.627947
The correlation between age and charges increased from about 0.30 to about 0.63 when filtering to only
non-smokers. Let’s focus on non-smokers for the rest of the tutorial, since we’re more likely to be able to find
strong, linear relationships for them.
But how do we know what the line looks like? This is where linear regression comes into play! Using linear
regression, you can find the line of best fit, i.e., the line that best represents the data.
Linear regression minimizes the error between the line and the actual data points using a process
called ordinary least squares. In this process, the line of best fit is the line that minimizes the sum of the
squared vertical distances between the predicted values and the true data points.
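For a single feature, ordinary least squares has a simple closed-form solution. The sketch below applies it to five hypothetical points and checks that nudging the slope or intercept away from the solution increases the sum of squared errors. The data and the helper name sse are illustrative assumptions.

```python
import numpy as np

# Hypothetical points scattered around a line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form ordinary least squares for one feature:
# slope = covariance(x, y) / variance(x), intercept from the means
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()


def sse(slope, intercept):
    """Sum of squared vertical distances between the line and the points."""
    return float(np.sum((y - (slope * x + intercept)) ** 2))


print(f"slope = {m:.3f}, intercept = {b:.3f}")
print(f"SSE at the solution: {sse(m, b):.4f}")
print(f"SSE with a nudged slope: {sse(m + 0.1, b):.4f}")
```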
Let’s begin by importing the LinearRegression class from Scikit-Learn’s linear_model . You can
then instantiate a new LinearRegression object. In this case, it’s been called model .
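A minimal sketch of that step is shown here. The variable name model follows the text; checking for the coef_ attribute is only an illustration that nothing has been learned before fitting.

```python
from sklearn.linear_model import LinearRegression

# Create an unfitted linear regression estimator
model = LinearRegression()

# Learned attributes such as coef_ only exist after calling fit()
print(hasattr(model, "coef_"))
```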
This object also has a number of methods. One of these is the fit() method, which is used to fit a
linear model to data. Let’s learn a little bit about this method by calling the help() function on it:
# Returns:
# Help on method fit in module sklearn.linear_model._base:
# Parameters
# ----------
# X : {array-like, sparse matrix} of shape (n_samples, n_features)
# Training data.
# .. versionadded:: 0.17
# parameter *sample_weight* support to LinearRegression.
# Returns
# -------
# self : object
# Fitted Estimator.
From the help documentation, you can see that the method expects two arrays: X and y . X is
expected to be a two-dimensional array (as denoted by the capital X), while y is expected to be one-
dimensional.
As with other machine-learning models, X will be the features of the dataset, while y will be the
target of the dataset. In this case, we’ll start off by only looking at a single feature: age . Let’s convert
age to a DataFrame and parse out charges into a Series.
X = non_smokers[['age']]
y = non_smokers['charges']
In the code above, you used double square brackets to return a DataFrame for the variable X . We can
confirm the types by using the type() function:
# Returns:
# The type of X is <class 'pandas.core.frame.DataFrame'>
# The type of y is <class 'pandas.core.series.Series'>
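The difference between the two selections also shows up in their shapes. The sketch below uses a hypothetical three-row non_smokers frame (the values are made up) to show that double brackets give a two-dimensional object and single brackets give a one-dimensional one.

```python
import pandas as pd

# Hypothetical stand-in for the filtered DataFrame
non_smokers = pd.DataFrame({
    "age": [19, 33, 45],
    "charges": [1700.0, 4400.0, 8200.0],
})

X = non_smokers[["age"]]    # double brackets: DataFrame with one column
y = non_smokers["charges"]  # single brackets: Series

print(type(X).__name__, X.shape)
print(type(y).__name__, y.shape)
```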
Now that we know that X is two-dimensional and y is one-dimensional, we can create our training and
testing datasets.
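One common way to do this is Scikit-Learn's train_test_split() function. The sketch below is an assumption about how the split might look, not the tutorial's exact call: the data is synthetic, and the test_size and random_state values are chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with 40 rows (values are invented)
data = pd.DataFrame({
    "age": np.arange(20, 60),
    "charges": np.arange(20, 60) * 250.0,
})
X = data[["age"]]
y = data["charges"]

# Hold out 25% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
```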
Now that our datasets are split, we can use the .fit() method to fit our data. Let’s pass these variables
in to create a fitted model. We’ll use the training datasets to create our fitted model.
Now that our model has been fitted, we can use our testing data to see how accurate the model is. Because
our labels are continuous in the case of regression, we can use a number of different metrics. The table
below breaks down a few of these:
Metric | Range | Interpretation
R squared | 0-1 (larger is better) | The proportion of the variance in the predicted variable ( y ) that can be explained by the features ( X )
Root mean squared error (RMSE) | 0+ (lower is better) | A representation of the average distance between the observed data values and the predicted data values
Scikit-learn comes with all of these evaluation metrics built-in. We can import them from the metrics
module. Let’s load them, predict our values based on the testing variables, and evaluate the effectiveness
of our model.
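The full fit, predict, and evaluate sequence might look like the sketch below. It runs on synthetic data, so the split settings, the seed, and the variable names predictions, r2, and rmse are assumptions. The RMSE is computed by taking the square root of the mean squared error, which behaves the same way across Scikit-Learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the non-smoker data (values are invented)
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=200)
charges = 250 * age + rng.normal(0, 2000, size=200)
X = pd.DataFrame({"age": age})
y = pd.Series(charges, name="charges")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit on the training data only
model = LinearRegression()
model.fit(X_train, y_train)

# Predict charges for the held-out rows
predictions = model.predict(X_test)

# Compare the predictions with the true held-out values
r2 = r2_score(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))

print("The r2 is:", r2)
print("The rmse is:", rmse)
```

On the real insurance data, the tutorial's model reports the following: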
# Returns:
# The r2 is: 0.37371317540937177
# The rmse is: 4706.59088022
These results aren’t ideal. The r2 value is less than 0.4, meaning that our line of best fit explains less
than 40% of the variance in charges. However, based on what we saw in the data, there are a number of
outliers in the dataset. Because both the fitted line and the r2 value are sensitive to outliers, they may
account for some of the error.
Let’s see if we can improve our model by including more variables into the mix.
In this section, you’ll learn how to conduct linear regression using multiple variables. In this case, rather
than fitting a line, you’re fitting a plane (or, with more than two features, a hyperplane). However, the model is still
referred to as linear because the prediction is a weighted sum of the features.
Scikit-Learn makes it very easy to create these models. Remember, when you first fitted your model, you
passed in a two-dimensional array X_train . That array only had one column. However, you can simply
pass in an array of multiple columns to fit your data to multiple variables. Let’s see how this is done:
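As a sketch of the idea, the example below fits one model on a single column and another on two columns of a synthetic dataset. The column names mirror the tutorial, but the data and the choice of age and bmi as features are assumptions. The scores are computed on the training data, where adding a feature can never lower r2; on held-out data it can.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data in which both age and bmi influence charges
rng = np.random.default_rng(1)
n = 200
X = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.uniform(18, 40, size=n),
})
y = 240 * X["age"] + 370 * X["bmi"] + rng.normal(0, 1500, size=n)

# Same estimator, different number of columns passed to fit()
one_feature = LinearRegression().fit(X[["age"]], y)
two_features = LinearRegression().fit(X[["age", "bmi"]], y)

# score() returns r2 for the data it is given (here, the training data)
r2_one = one_feature.score(X[["age"]], y)
r2_two = two_features.score(X[["age", "bmi"]], y)

print(f"one feature: {r2_one:.3f}, two features: {r2_two:.3f}")
```

On the real insurance data, the tutorial reports the following test-set results: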
# Returns:
# The r2 is: 0.3711113278927346
# The rmse is: 4867.23495571
It looks like our results have actually become worse! Knowing that smoking has a large influence on the
data, we can convert the smoker column into a numerical column. Since this is a binary question, we can
convert the value of 'yes' to 1 and 'no' to 0 . Following that, we can simply pass in the data and
evaluate our model:
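The conversion can be sketched with the pandas .map() method. The three rows below are hypothetical, and the new column name smoker_int is an assumption made for this example.

```python
import pandas as pd

# Hypothetical sample with the tutorial's column names
df = pd.DataFrame({
    "age": [19, 33, 45],
    "bmi": [27.9, 22.7, 30.1],
    "smoker": ["yes", "no", "no"],
})

# Map the two text labels onto 1 and 0
df["smoker_int"] = df["smoker"].map({"yes": 1, "no": 0})

# The encoded column can now sit alongside the other numeric features
X = df[["age", "bmi", "smoker_int"]]
print(X)
```

Fitted on the real data with the encoded column included, the tutorial's model reports: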
# Returns:
# The r2 is: 0.7455266762675961
# The rmse is: 6177.07010254
In this case, our r2 value increased, but so did our errors. Logically, this makes sense. We are now fitting
the model on a dataset with a much larger spread, since it includes smokers again. Because of this, the model
can explain a larger share of the variance while the errors, measured in dollars, are also larger.
Tip: if you want to show the root mean squared error, you can pass the squared=False argument to
the mean_squared_error() function. By default, the squared= parameter is set to True ,
meaning that the mean squared error is returned. Note that scikit-learn 1.4 deprecated the squared=
parameter in favor of a separate root_mean_squared_error() function.
If you’re satisfied with the model, you can actually turn it into a function. With this function,
you can then pass in new data points to make predictions about what a person’s charges may be. Let’s
see how you can do this.
The number of coefficients will match the number of features being passed in. Let’s see what they look like:
# Returns:
# [ 238.47905621 370.5876659 23627.93402865]
# -12061.849365383008
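A fitted LinearRegression object exposes these values through its coef_ and intercept_ attributes. The sketch below uses a noise-free synthetic target built from known weights, so the fit should recover them almost exactly. The weights, the seed, and the smoker_int column name are assumptions for illustration and are unrelated to the numbers above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features (values are invented)
rng = np.random.default_rng(2)
n = 50
X = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "bmi": rng.uniform(18, 40, size=n),
    "smoker_int": rng.integers(0, 2, size=n),
})

# Noise-free target built from known weights and a known bias
y = 200 * X["age"] + 300 * X["bmi"] + 20000 * X["smoker_int"] - 10000

model = LinearRegression().fit(X, y)

print(model.coef_)       # one weight per feature, in column order
print(model.intercept_)  # the bias term
```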
We can easily turn this into a predictive function to return the predicted charges a person will incur
based on their age, BMI, and whether or not they smoke. Let’s create this function now:
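One way to write such a function, assuming it simply applies the coefficients and intercept printed above to its three inputs, is sketched here. Only the name calculate_charges comes from the tutorial; the parameter names age, bmi, and smoker are assumptions, and the coefficients are copied at their printed precision.

```python
def calculate_charges(age, bmi, smoker):
    """Predict charges from age, BMI, and smoker status (1 = smoker, 0 = non-smoker)."""
    # Weights and bias copied from the fitted model's printed output
    return (
        (age * 238.47905621)
        + (bmi * 370.5876659)
        + (smoker * 23627.93402865)
        - 12061.849365383008
    )
```

Because the weights are hard-coded, the function would need to be updated whenever the model is refitted.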
Now, say we have a person who is 33, has a BMI of 22, and doesn’t smoke. We could simply pass in the
following arguments:
# Predicting charges
print(calculate_charges(33, 22, 0))
# Returns: 3960.8881392049407
In the case above, the person would likely have just under $4,000 of charges!
Exercises
It’s time to check your learning. Try and complete the exercises below. If you need a hint or want to check
your solution, simply toggle the question.
Question 1
Question 2
How would you include the sex variable in the regression analysis?
You could convert the values to 0 and 1, since the column contains only two distinct values.
Linear regression involves fitting a line to data that best represents the relationship between a
dependent and independent variable
Similarly, multivariate linear regression can model the linear relationship between multiple independent
variables and a dependent variable
The Scikit-Learn library provides a LinearRegression class to fit and predict data
Additional Resources
To learn more about related topics, check out the tutorials below:
Mary
REPLY
June 26, 2022 at 11:51 pm
Thanks for the tutorial! I found one edit. The last time you reference rmse you need to append
squared=False. Otherwise you end up with a crazy big number (the mse). Thanks again — this
helped me learn.
Nik
REPLY
July 1, 2022 at 8:30 am
Thanks so much, Mary! I’ll make note of that in the tutorial :).
Luise
REPLY
November 11, 2022 at 8:22 am
Thank you so much for this tutorial! This was exactly what I was looking for, a step-by-step guide
through the code, always explaining what you’re doing and why.