
Data Exploration

Data exploration is a crucial step in machine learning that involves analyzing and understanding
the data before building a predictive model.

It is the process of understanding the characteristics of the data, identifying patterns, relationships,
and outliers, and gaining insights from the data.

The importance of data exploration lies in the fact that it helps in building an accurate and robust
predictive model. By exploring the data, we can identify the underlying patterns and relationships
that exist within the data, which can be leveraged to build a more accurate model. Additionally, data
exploration can help us identify and deal with outliers and missing data, which can significantly
affect the performance of the model.

Furthermore, data exploration can help us make informed decisions about which features to include
in our model, and how to preprocess the data. For example, we can identify which features are most
important in predicting the target variable, and we can use this information to perform feature
selection or feature engineering.
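
For example, one simple way to get a first impression of feature importance is to look at each numeric feature's correlation with the target. The short sketch below is only an illustration; it assumes the House_Price.csv dataset used later in this exercise, with price as the target variable.

import pandas as pd

df = pd.read_csv('House_Price.csv')

# correlation of every numeric feature with the target, strongest first
numeric_df = df.select_dtypes(include='number')
corr_with_target = numeric_df.corr()['price'].drop('price')
print(corr_with_target.sort_values(key=abs, ascending=False))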

Overall, data exploration is a critical step in the machine learning process, as it enables us to
understand the data better and make informed decisions that can lead to more accurate and robust
predictive models.

Univariate Analysis
Univariate analysis is a method of analyzing data that looks at only one variable at a time. It is used
to describe the characteristics of a single variable, such as its frequency distribution, mean, median,
mode, variance, and standard deviation.

Univariate analysis is useful for gaining insights into a single variable and for identifying any
unusual observations that might be present in the data. It is also a useful preliminary step for more
complex multivariate analysis, which involves analyzing multiple variables simultaneously.

Some common techniques used in univariate analysis include frequency tables, histograms, box
plots, and summary statistics. These techniques can help to identify patterns, outliers, and other
key characteristics of the variable under investigation.
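
As a small illustration of these techniques, the sketch below computes summary statistics, a frequency table, and a box plot for single variables. It assumes the House_Price.csv dataset introduced below; the column names (price, airport) come from that dataset.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('House_Price.csv')

# summary statistics for a single numeric variable
print(df['price'].describe())

# frequency table for a single categorical variable
print(df['airport'].value_counts())

# box plot of a single numeric variable (useful for spotting outliers)
sns.boxplot(data=df, x='price')
plt.show()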

In this exercise, we are going to perform univariate analysis on the house price dataset
(House_Price.csv).

First, we are going to load our dataset (House_Price.csv).

# import necessary libraries / modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# load the dataset
df = pd.read_csv('House_Price.csv')

# display the top five rows
df.head()

   price  crime_rate  resid_area  air_qual  room_num   age  dist1  dist2  dist3  ...
0   24.0     0.00632       32.31     0.538     6.575  65.2   4.35   3.81   4.18  ...
1   21.6     0.02731       37.07     0.469     6.421  78.9   4.99   4.70   5.12  ...
2   34.7     0.02729       37.07     0.469     7.185  61.1   5.03   4.86   5.01  ...
3   33.4     0.03237       32.18     0.458     6.998  45.8   6.21   5.93   6.16  ...
4   36.2     0.06905       32.18     0.458     7.147  54.2   6.16   5.86   6.37  ...

We use the read_csv() function of the pandas library and pass the name of the dataset file as the
argument. The function returns a data frame (data in tabular format), which is stored in the
variable called df.

We then use the head() function of the pandas library to display the first 5 records (samples or observations).

# display information about the data set
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 506 non-null float64
1 crime_rate 506 non-null float64
2 resid_area 506 non-null float64
3 air_qual 506 non-null float64
4 room_num 506 non-null float64
5 age 506 non-null float64
6 dist1 506 non-null float64
7 dist2 506 non-null float64
8 dist3 506 non-null float64
9 dist4 506 non-null float64
10 teachers 506 non-null float64
11 poor_prop 506 non-null float64
12 airport 506 non-null object
13 n_hos_beds 498 non-null float64
14 n_hot_rooms 506 non-null float64
15 waterbody 506 non-null object
16 rainfall 506 non-null int64
17 bus_ter 506 non-null object
18 parks 506 non-null float64
dtypes: float64(15), int64(1), object(3)
memory usage: 75.2+ KB

You can also use the info() function of the pandas library to display all the variables and their
corresponding data types.

Second, we are going to perform univariate analysis using summary statistics.

# display summary statistics
df.describe()

            price  crime_rate  resid_area    air_qual    room_num         age       dist1
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000
mean    22.528854    3.613524   41.136779    0.554695    6.284634   68.574901    3.971996
std      9.182176    8.601545    6.860353    0.115878    0.702617   28.148861    2.108532
min      5.000000    0.006320   30.460000    0.385000    3.561000    2.900000    1.130000
25%     17.025000    0.082045   35.190000    0.449000    5.885500   45.025000    2.270000
50%     21.200000    0.256510   39.690000    0.538000    6.208500   77.500000    3.385000
75%     25.000000    3.677083   48.100000    0.624000    6.623500   94.075000    5.367500
max     50.000000   88.976200   57.740000    0.871000    8.780000  100.000000   12.320000

The describe() function of the pandas library will display the summary statistics.

Taking a look at our data, we can observe that all the variables consist of 506 samples except the
variable n_hos_beds, which has 498. This means that the variable n_hos_beds has 506 - 498 = 8
missing values.

For the crime_rate variable, we observe that there is no significant difference between the minimum
value (0.006) and the 25th percentile (0.082). However, there is a large difference between the maximum
value (88.976) and the 75th percentile (3.677), which means the distribution is right-skewed.

The same is true for the n_hot_rooms variable: there is a large difference between the
75th percentile (14.170) and the maximum value (101.120).
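
A quick way to confirm this numerically is to compare the median with the maximum and to compute the skewness statistic. The sketch below assumes the df loaded above; a skewness value well above zero indicates a right-skewed distribution.

# quantify the right skew suggested by the summary statistics
for col in ['crime_rate', 'n_hot_rooms']:
    print(col,
          '| median =', df[col].median(),
          '| max =', df[col].max(),
          '| skewness =', round(df[col].skew(), 2))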

Next, we visualise the data using a distribution plot and a scatter plot.

We can easily create these visualisations using the histplot() and jointplot() functions of the seaborn
library. We could also do this with matplotlib, but it would require more code and is a bit trickier.

# generate a histogram or distribution plot
sns.histplot(data=df, x='crime_rate')

<AxesSubplot:xlabel='crime_rate', ylabel='Count'>

[Figure: histogram of the crime_rate variable]

The figure above is called a histogram or distribution plot. Histograms are commonly used in data
analysis to visualize the shape, center, and spread of the data distribution. They can reveal patterns,
such as skewness, symmetry, and multimodality, that may not be apparent from summary statistics
alone.

To generate a distribution plot or histogram, we use the histplot() function of the seaborn library. We
pass in the data (in this case df) and the x value (in this case the crime_rate variable). We can see
from the histogram that the distribution is right-skewed.

# generate a scatter plot and histogram
sns.jointplot(data=df, x='n_hot_rooms', y='price')

<seaborn.axisgrid.JointGrid at 0x7f82e8826d30>

[Figure: joint plot of n_hot_rooms against price]

The figure above is a jointplot (combination of distribution plot and scatter plot).

A scatter plot is a type of data visualization that displays the relationship between two continuous
variables. Each data point in a scatter plot is represented by a point on the graph, with one variable
plotted along the x-axis and the other variable plotted along the y-axis.

Scatter plots are useful for exploring the relationship between two variables and can help identify
trends, patterns, and potential outliers.
To generate a scatter plot and histogram together, we use the jointplot() function of the seaborn library and
pass in the data frame (df), the x value (the n_hot_rooms variable) and the y value (the price variable).

We can see from the scatter plot that most of the data points lie between 0 and 20 on the x-axis
(n_hot_rooms), while two data points lie somewhere between 80 and 100. These two points are
outliers (extreme values).

# generate a distribution plot for a categorical variable
sns.countplot(data=df, x='airport')

<AxesSubplot:xlabel='airport', ylabel='count'>

[Figure: count plot of the airport variable]

The figure above is a countplot of the categorical variable airport.

Count plots are useful for exploring the distribution of categorical data and for identifying the most
common categories in the data. They can also be used to compare the frequency of different
categories within a single variable or between multiple variables.

To generate a count plot, we use the countplot() function of the seaborn library and pass in the
data frame and the x value as arguments.

We can see from the count plot that there are more occurrences of YES as compared to NO.

Data Preprocessing

Data preprocessing is the process of transforming raw data into a more useful and understandable
format. It involves cleaning, organizing, and transforming the data to make it suitable for analysis.

The following are some common techniques used in data preprocessing:


Data cleaning: This involves handling missing data, dealing with outliers, and removing
duplicates from the dataset.

Data transformation: This involves scaling, normalization, and encoding of the data. Scaling is
used to rescale data values to a specific range. Normalization is used to transform data
values to a standard scale. Encoding is used to convert categorical data into numerical data.

Data reduction: This involves reducing the number of features or variables in the dataset to
improve the efficiency of the analysis.

Data integration: This involves combining data from multiple sources to create a unified
dataset.

Data discretization: This involves converting continuous data into discrete categories to
simplify analysis.

Data aggregation: This involves combining data from multiple records or rows into a single
record or row to simplify analysis.

Overall, data preprocessing is an essential step in data analysis that helps to ensure that the data is
accurate, complete, and suitable for analysis.

Check for missing values

df.isnull().sum()

price 0
crime_rate 0
resid_area 0
air_qual 0
room_num 0
age 0
dist1 0
dist2 0
dist3 0
dist4 0
teachers 0
poor_prop 0
airport 0
n_hos_beds 8
n_hot_rooms 0
waterbody 0
rainfall 0
bus_ter 0
parks 0
dtype: int64
We use the isnull() function of pandas to check missing values for each column and the sum()
function to sum up the number of missing values in each column.

Based on the output, only the n_hos_beds variable has missing values: there are eight (8)
missing values.

There are different techniques for treating missing values in a dataset. One technique is
to remove the missing values. Another is to impute replacement values.

If a variable has a large number of missing values, it is often better to remove it; otherwise, you can
impute replacement values. Since there are only 8 missing values, we are going to impute values. You can use the
mean, median, or mode of the variable; in our example we are going to use the median value.

df['n_hos_beds'].fillna(df['n_hos_beds'].median(), inplace=True)

# verify if there are no more missing values
df.isnull().sum()

price 0
crime_rate 0
resid_area 0
air_qual 0
room_num 0
age 0
dist1 0
dist2 0
dist3 0
dist4 0
teachers 0
poor_prop 0
airport 0
n_hos_beds 0
n_hot_rooms 0
waterbody 0
rainfall 0
bus_ter 0
parks 0
dtype: int64

We use the fillna() function of pandas to fill in missing values and pass a value as the argument; in our case
we use the median of the variable n_hos_beds. We then verify that there are no more missing values. We
can see from the result that there are no more missing values, indicated by the zero (0) counts.

Transform categorical values into a format that is suitable for machine learning using one-hot encoding.
One hot encoding is a technique used to convert categorical variables into a format that can be
used by machine learning algorithms to improve the accuracy of the analysis. In this technique,
each categorical variable is transformed into a set of binary variables (i.e., 0 or 1) representing the
presence or absence of the variable in a particular observation.

For example, suppose we have a dataset of fruits that includes a categorical variable called "color"
with values "red," "green," and "blue." One hot encoding would create three new binary variables
called "color_red," "color_green," and "color_blue." If an observation has a "red" fruit, the "color_red"
variable would be 1, while the other two variables would be 0.

One hot encoding is necessary because many machine learning algorithms cannot handle
categorical data directly. By transforming categorical variables into binary variables, we can convert
them into a format that can be used by these algorithms.

There are various libraries in Python, such as Scikit-learn and Pandas, that provide one hot encoding
functionality. The process of one hot encoding involves identifying the categorical variable and
using the appropriate function to perform the encoding.
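
As a small, self-contained illustration of the fruit example above (the data here is made up purely for demonstration), the get_dummies() function of pandas produces one binary indicator column per category:

import pandas as pd

# hypothetical fruit data, only to illustrate one-hot encoding
fruits = pd.DataFrame({'fruit': ['apple', 'kiwi', 'plum'],
                       'color': ['red', 'green', 'blue']})

encoded = pd.get_dummies(fruits, columns=['color'])
print(encoded)
# resulting columns: fruit, color_blue, color_green, color_red (binary indicators)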

Based on the output, we have three (3) categorical variables (variables with object dtype), namely
airport, waterbody, and bus_ter.

## one hot encode the 3 variables and then store the result in a new data frame
encoded_vars = pd.get_dummies(data=df, columns=['airport', 'waterbody', 'bus_ter'])

## display the first five rows
encoded_vars.head()

   price  crime_rate  resid_area  air_qual  room_num   age  dist1  dist2  dist3  dist4  ...
0   24.0     0.00632       32.31     0.538     6.575  65.2   4.35   3.81   4.18   4.01  ...
1   21.6     0.02731       37.07     0.469     6.421  78.9   4.99   4.70   5.12   5.06  ...
2   34.7     0.02729       37.07     0.469     7.185  61.1   5.03   4.86   5.01   4.97  ...
3   33.4     0.03237       32.18     0.458     6.998  45.8   6.21   5.93   6.16   5.96  ...

Rescaling variables

Rescaling is an important preprocessing step in many machine learning algorithms. Rescaling or
normalization refers to the process of transforming the features of a dataset to a common scale so
that they have similar ranges or distributions.

The most common rescaling methods are MinMaxScaler and StandardScaler. The MinMaxScaler
scales the values of the features to a specified range (usually 0 to 1), while the StandardScaler
scales the values to have zero mean and unit variance.

Rescaling is necessary because many machine learning algorithms use distance measures
between the features to calculate similarities or differences between observations. Features that
are on different scales can lead to biased estimates and poor performance of the algorithm.
Rescaling the features to a common scale can prevent this issue and improve the accuracy of the
algorithm.

Rescaling should be done after splitting the dataset into training and testing sets. The fit method of
the scaler should be applied to the training set only, and the transform method should be applied to
both the training and testing sets. This ensures that the rescaling is based only on the training set
and is not influenced by the testing set.
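
A minimal sketch of that recommended pattern is shown below, using StandardScaler for variety (the same applies to MinMaxScaler). It assumes a feature matrix X and target y, which are constructed later in this exercise; for simplicity, the exercise below instead fits the scaler on the full dataset before splitting.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# split first, then scale, so the test set does not influence the scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on the training set only
X_test_scaled = scaler.transform(X_test)         # reuse the training-set statistics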

## extract all numeric input variables
numeric_vars = encoded_vars.iloc[:, 1:16]
numeric_vars

     crime_rate  resid_area  air_qual  room_num   age  dist1  dist2  dist3  dist4  ...
0       0.00632       32.31     0.538     6.575  65.2   4.35   3.81   4.18   4.01  ...
1       0.02731       37.07     0.469     6.421  78.9   4.99   4.70   5.12   5.06  ...
2       0.02729       37.07     0.469     7.185  61.1   5.03   4.86   5.01   4.97  ...
3       0.03237       32.18     0.458     6.998  45.8   6.21   5.93   6.16   5.96  ...
4       0.06905       32.18     0.458     7.147  54.2   6.16   5.86   6.37   5.86  ...
..          ...         ...       ...       ...   ...    ...    ...    ...    ...  ...
501     0.06263       41.93     0.573     6.593  69.1   2.64   2.45   2.76   2.06  ...
502     0.04527       41.93     0.573     6.120  76.7   2.44   2.11   2.46   2.14  ...
503     0.06076       41.93     0.573     6.976  91.0   2.34   2.06   2.29   1.98  ...
504     0.10959       41.93     0.573     6.794  89.3   2.54   2.31   2.40   2.31  ...
505     0.04741       41.93     0.573     6.030  80.8   2.72   2.24   2.64   2.42  ...

506 rows × 15 columns

from sklearn.preprocessing import MinMaxScaler

## create a MinMaxScaler object
scaler = MinMaxScaler()

## fit and transform the DataFrame using the scaler
scaled_data = scaler.fit_transform(numeric_vars)

# convert the scaled data back to a DataFrame
scaled_df = pd.DataFrame(scaled_data, columns=numeric_vars.columns)

## display the first 5 rows
scaled_df.head()

   crime_rate  resid_area  air_qual  room_num       age     dist1     dist2     dist3  ...
0    0.000000    0.067815  0.314815  0.577505  0.641607  0.287757  0.262489  0.271262  ...
1    0.000236    0.242302  0.172840  0.547998  0.782698  0.344951  0.343324  0.355416  ...
2    0.000236    0.242302  0.172840  0.694386  0.599382  0.348525  0.357856  0.345568  ...
3    0.000293    0.063050  0.150206  0.658555  0.441813  0.453977  0.455041  0.448523  ...
4    0.000705    0.063050  0.150206  0.687105  0.528321  0.449508  0.448683  0.467323  ...

This code uses the MinMaxScaler class from the sklearn.preprocessing module to rescale the
numeric variables in a DataFrame called numeric_vars.

Then, the fit_transform method is used to fit the scaler to the data and transform the numeric
variables in numeric_vars using the scaling formula:

scaled_value = (original_value - min_value) / (max_value - min_value)

The resulting scaled values are stored in a NumPy array called scaled_data.

Finally, the scaled_data array is converted back to a DataFrame called scaled_df using the
pd.DataFrame() constructor, and the column names are set to the original column names of
numeric_vars.
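
To make the formula concrete, here is a small illustrative check for one column (room_num), comparing a manual min-max computation with the scaler's output; the two results should match.

# manual min-max scaling of one column, for comparison with scaled_df
col = 'room_num'
col_min, col_max = numeric_vars[col].min(), numeric_vars[col].max()
manual = (numeric_vars[col] - col_min) / (col_max - col_min)

print(manual.head())
print(scaled_df[col].head())   # should match the manual values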

We now remove all the numeric input variables from the encoded_vars dataset and then
concatenate the result with the scaled_df data frame.

## drop all numeric input variables from the encoded_vars dataframe and keep the remaining columns
temp_df = encoded_vars.drop(numeric_vars.columns, axis=1)

## combine temp_df and scaled_df
prep_df = pd.concat([scaled_df, temp_df], axis=1)

## display the first 5 rows
prep_df.head()

   crime_rate  resid_area  air_qual  room_num       age     dist1     dist2     dist3  ...
0    0.000000    0.067815  0.314815  0.577505  0.641607  0.287757  0.262489  0.271262  ...
1    0.000236    0.242302  0.172840  0.547998  0.782698  0.344951  0.343324  0.355416  ...
2    0.000236    0.242302  0.172840  0.694386  0.599382  0.348525  0.357856  0.345568  ...
3    0.000293    0.063050  0.150206  0.658555  0.441813  0.453977  0.455041  0.448523  ...
4    0.000705    0.063050  0.150206  0.687105  0.528321  0.449508  0.448683  0.467323  ...

Move the price column to the last column

## remove the column from its current position and store it in a variable
col = prep_df.pop('price')

## add the column back to the DataFrame in the last position
prep_df = prep_df.assign(price=col)

## display
prep_df.head()

   crime_rate  resid_area  air_qual  room_num       age     dist1     dist2     dist3  ...
0    0.000000    0.067815  0.314815  0.577505  0.641607  0.287757  0.262489  0.271262  ...
1    0.000236    0.242302  0.172840  0.547998  0.782698  0.344951  0.343324  0.355416  ...
2    0.000236    0.242302  0.172840  0.694386  0.599382  0.348525  0.357856  0.345568  ...
3    0.000293    0.063050  0.150206  0.658555  0.441813  0.453977  0.455041  0.448523  ...
4    0.000705    0.063050  0.150206  0.687105  0.528321  0.449508  0.448683  0.467323  ...

This code moves the 'price' column to the last column position in a DataFrame called prep_df.
The pop() method is used to remove the 'price' column from its current position in prep_df and store
it in a variable called col. The assign() method is then used to add the col Series back to the
DataFrame with the original column name, which moves it to the last column position.

The resulting DataFrame prep_df will have the 'price' column as the last column. The head() method
is then used to display the top five rows of the modified prep_df.

This will show the first five rows of prep_df with the 'price' column moved to the last position.

Split the data set

Separate the input variables (features) from the output variable (labels).

## assign all input variables to variable X
X = prep_df.drop('price', axis=1)

## display the top five rows
X.head()

   crime_rate  resid_area  air_qual  room_num       age     dist1     dist2     dist3  ...
0    0.000000    0.067815  0.314815  0.577505  0.641607  0.287757  0.262489  0.271262  ...
1    0.000236    0.242302  0.172840  0.547998  0.782698  0.344951  0.343324  0.355416  ...
2    0.000236    0.242302  0.172840  0.694386  0.599382  0.348525  0.357856  0.345568  ...
3    0.000293    0.063050  0.150206  0.658555  0.441813  0.453977  0.455041  0.448523  ...

This code assigns all the input variables in a DataFrame called prep_df to a variable called X, by
dropping the 'price' column along the axis=1 (columns) using the drop method.

The resulting DataFrame X will have all the columns of prep_df except for the 'price' column.

## assign the price to the variable y
y = prep_df['price']

## display the first 5 values
y.head()

0 24.0
1 21.6
2 34.7
3 33.4
4 36.2
Name: price, dtype: float64

Split the data into train and test sets using an 80:20 ratio.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

## display the shape
print('The shape of X_train: ', X_train.shape)
print('The shape of X_test: ', X_test.shape)
print('The shape of y_train: ', y_train.shape)
print('The shape of y_test: ', y_test.shape)

The shape of X_train: (404, 19)
The shape of X_test: (102, 19)
The shape of y_train: (404,)
The shape of y_test: (102,)

This code uses the train_test_split function from the sklearn.model_selection module to split the
input variables (X) and target variable (y) into training and testing sets.

The train_test_split function takes four arguments:

X: the input variables as a DataFrame or NumPy array
y: the target variable as a DataFrame or NumPy array
test_size: the proportion of the data to use for testing (e.g. 0.20 means 20% of the data will be used for testing)
random_state: a random seed for reproducibility

The function returns four variables:

X_train: the training input variables
X_test: the testing input variables
y_train: the training target variable
y_test: the testing target variable

In this example, the input variables (X) and target variable (y) are split into 80% training data and
20% testing data, with a random seed of 42 for reproducibility. The resulting variables X_train,
X_test, y_train, and y_test are then used for model training and evaluation.

Build and Train a linear regression model


from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

LinearRegression()

This code creates a new LinearRegression object called lr_model using the LinearRegression class
from the sklearn.linear_model module. The fit() method is then called on the lr_model object,
passing in the training data X_train and y_train.

The fit() method trains the linear regression model on the training data, using the input features
X_train and target variable y_train. The model learns the coefficients of the linear equation that best
fits the data, and these coefficients are stored internally in the lr_model object.

After the model is trained, it can be used to make predictions on new data by calling the predict()
method on the lr_model object. For example, if you have a new set of input features X_test, you
could make predictions using the trained model like this:

Evaluate the model using Mean Squared Error (MSE) and the Goodness of Fit (R-squared)

from sklearn.metrics import mean_squared_error, r2_score

y_pred = lr_model.predict(X_test)

# calculate the MSE
mse = mean_squared_error(y_test, y_pred)

## calculate the r2_score
r2 = r2_score(y_test, y_pred)

## display the mse and r2 score
print('The mean squared error is: ', mse)
print('The r2 score is: ', r2)

The mean squared error is:  26.039825947061992
The r2 score is:  0.6468901356802383

This code calculates two evaluation metrics for a linear regression model: the mean squared error
(MSE) and the R-squared (R2) score.

First, the predict() method is called on the lr_model object, passing in the test data X_test to
generate predicted target values y_pred.

Then, the mean_squared_error() function and the r2_score() function from the sklearn.metrics
module are used to calculate the MSE and R2 score, respectively, by comparing the predicted target
values y_pred with the true target values y_test.

Finally, the calculated MSE and R2 score are displayed to the console using the print() function.

The MSE represents the average squared difference between the predicted target values and the
true target values. It is a measure of how well the model is able to predict the target variable. The
lower the MSE, the better the model is performing.

The R2 score represents the proportion of the variance in the target variable that is explained by the
model. It is a measure of how well the model fits the data. The R2 score typically ranges from 0 to 1, with
higher values indicating a better fit. A score of 1 indicates that the model perfectly predicts the
target variable, while a score of 0 indicates that the model does not explain any of the variance in
the target variable (it can even be negative for models that perform worse than simply predicting the mean).
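
To connect these definitions back to the numbers above, both metrics can also be computed directly with NumPy. The sketch below assumes y_test and y_pred from the evaluation cell above and should reproduce the sklearn results.

import numpy as np

y_true = np.asarray(y_test)

# MSE: average squared difference between predictions and true values
mse_manual = np.mean((y_true - y_pred) ** 2)

# R-squared: 1 minus the ratio of residual variance to total variance
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, r2_manual)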

## display the weights/coefficients of x
print(lr_model.coef_)

[ -9.02559248  -0.65139272  -8.49483278  23.73643291  -1.22930362
   0.68884694   8.57556687 -15.32264485  -8.95360836   8.42431684
 -19.89676348   1.74230734   2.21051122   1.65579831   1.87080362
   0.9805717   -1.26025099  -0.50491955  -0.94267398]

This code displays the weights or coefficients of the input features (independent variables) in the
linear regression model. The coef_ attribute of the lr_model object contains an array of coefficients,
where each element corresponds to a different input feature.

By calling print(lr_model.coef_), we can output the array of coefficients to the console. The
coefficients represent the change in the target variable (dependent variable) for a one-unit change
in the corresponding input feature, while holding all other input features constant.

For example, if the input feature at index 0 has a coefficient of 2.5, it means that a one-unit increase
in that feature results in a 2.5-unit increase in the target variable, while holding all other input
features constant. The sign of the coefficient indicates the direction of the relationship between the
input feature and the target variable: a positive coefficient means that the feature has a positive
effect on the target variable, while a negative coefficient means that the feature has a negative
effect.
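
Because the raw array is hard to read, a common trick (illustrative, not part of the original notebook) is to pair each coefficient with the name of its input feature:

# pair each coefficient with the corresponding input feature name
coef_table = pd.Series(lr_model.coef_, index=X_train.columns).sort_values()
print(coef_table)
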
## display the intercept
print(lr_model.intercept_)

18.60881220400154

This code displays the intercept of the linear regression model. The intercept, also known as the
bias term, is the value of the target variable when all input features are equal to zero.

The intercept can be accessed using the intercept_ attribute of the lr_model object. By calling
print(lr_model.intercept_), we can output the intercept value to the console.

For example, if the intercept is 10, it means that when all input features are zero, the predicted value
of the target variable is 10.

In linear regression, the bias term, also known as the intercept, is a constant term that represents
the value of the target variable when all input features (independent variables) are equal to zero. It
is called the "bias" because it allows the linear regression line to shift up or down, thereby
introducing a bias into the predictions.

The bias term is important because without it, the linear regression line would have to pass through
the origin (0,0) of the plot, which is not always the case in real-world data. For example, if we were
modeling the price of a house as a function of its size and age, it would not make sense for the
price to be zero when the house is brand new and has zero square footage.

The bias term is learned along with the coefficients of the input features during the training of the
linear regression model. It is included in the linear equation that describes the relationship between
the input features and the target variable, and is used to shift the line up or down as needed to
better fit the data.
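
As an illustrative check of how the intercept and coefficients combine, the model's prediction for a single sample can be reproduced manually as the intercept plus the dot product of the coefficients with that sample's feature values:

import numpy as np

# reconstruct the prediction for the first test sample by hand
first_row = X_test.iloc[0].to_numpy(dtype=float)
manual_pred = lr_model.intercept_ + np.dot(lr_model.coef_, first_row)

print(manual_pred)                              # manual linear-equation result
print(lr_model.predict(X_test.iloc[[0]])[0])    # should match the model's prediction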
