y = b0 + b1*x1 + b2*x2 + ... + bn*xn
In the equation, y is the single dependent variable, whose value depends on more than one
independent variable (i.e. x1, x2, ..., xn).
For example, you can predict students' performance in an exam based on their
revision time, class attendance, previous results, test anxiety, and gender. Here the
dependent variable (exam performance) is calculated from more than one
independent variable. So this is the kind of task where you can use a Multiple Linear
Regression model.
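As a toy sketch of the equation above (hypothetical coefficients and feature values, purely for illustration), the prediction can be computed with NumPy:

```python
import numpy as np

# Hypothetical coefficients b0..b3 and two sample rows of features x1..x3
b0 = 1.5
b = np.array([2.0, -0.5, 3.0])      # b1, b2, b3
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 0.0, 1.0]])     # each row: x1, x2, x3

# y = b0 + b1*x1 + b2*x2 + ... + bn*xn, vectorized over the rows
y = b0 + X @ b
print(y)  # [11.5 12.5]
```

This is exactly what a fitted model computes at prediction time, with the b values learned from data.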
From this dataset, we are required to build a model that predicts the profit earned by a
startup from its various expenditures: R & D Spend, Administration Spend, and
Marketing Spend. Clearly, this is a multiple linear regression problem,
as there is more than one independent variable.
WWW.LTBPTECH.IN
CONTACT@LTBPTECH.IN LTBPTECH@GMAIL.COM
MOB:8318234647 MOB:7398721672
LTBP SOFTWARE SOLUTIONS AND SERVICES PVT. LTD.
Let's take Profit as the dependent variable and put it in the equation as y, with the other
attributes as the independent variables:

Profit = b0 + b1*(R & D Spend) + b2*(Administration Spend) + b3*(Marketing Spend) + b4*(State)

From this equation, the regression process should be a bit clearer.
Now, let's build the model, starting with the data preprocessing step. Here we take Profit
as the dependent variable vector y, and the other independent variables as the feature matrix X.
import numpy as np
import pandas as pd

dataset = pd.read_csv('Startups_Ltbp.csv')
X = dataset.iloc[:, :-1].values
# or
X = dataset.iloc[:, :4].values
# or
X = dataset.iloc[:, [0, 1, 2, 3]].values
y = dataset.iloc[:, -1].values
# or
y = dataset.iloc[:, 4].values
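The three ways of selecting X (and the two ways of selecting y) are equivalent. A quick check on a toy stand-in DataFrame (hypothetical column names, since the real CSV isn't shown here) confirms this:

```python
import pandas as pd

# Toy stand-in for the startup dataset: 4 feature columns + Profit
dataset = pd.DataFrame({
    "R&D Spend": [1.0, 2.0],
    "Administration": [3.0, 4.0],
    "Marketing Spend": [5.0, 6.0],
    "State": ["New York", "California"],
    "Profit": [10.0, 20.0],
})

X1 = dataset.iloc[:, :-1].values
X2 = dataset.iloc[:, :4].values
X3 = dataset.iloc[:, [0, 1, 2, 3]].values
y1 = dataset.iloc[:, -1].values
y2 = dataset.iloc[:, 4].values

print((X1 == X2).all() and (X1 == X3).all() and (y1 == y2).all())  # True
```

All three X selections mean "every column except the last one"; the two y selections both mean "the last column".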
The dataset contains one categorical variable, so we need to encode it into dummy
variables.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [-1])],
                       remainder="passthrough")
X = ct.fit_transform(X)
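On toy data (hypothetical values, not the real dataset), you can see that the transformer places the new dummy columns first and passes the remaining columns through unchanged:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy feature matrix: two numeric columns plus a categorical last column
X = np.array([[10.0, 20.0, "New York"],
              [30.0, 40.0, "California"]], dtype=object)

ct = ColumnTransformer(transformers=[("encoder", OneHotEncoder(), [-1])],
                       remainder="passthrough")
X = ct.fit_transform(X)

# Categories are sorted, so column 0 is California and column 1 is New York,
# followed by the two passthrough numeric columns
print(X.shape)  # (2, 4)
```

The one categorical column became two dummy columns, so X grew from 3 columns to 4.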
Dummy Variable Trap: The code above will create two dummy variables (as the categorical
variable has two categories), and our linear equation would use both of them. But this
creates a problem: the two dummy variables are perfectly correlated (one's value can be
predicted from the other), which causes multicollinearity, a phenomenon where an
independent variable can be predicted from one or more of the other independent
variables. When multicollinearity exists, the model cannot properly distinguish the effects
of the variables, and therefore produces unreliable estimates. This problem is known as
the Dummy Variable Trap.
To solve this problem, you should always keep all dummy variables except one from the
dummy variable set.
# Avoid the dummy variable trap by dropping the first dummy column
X = X[:, 1:]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
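As a self-contained sanity check (synthetic data standing in for the startup CSV), fitting recovers the intercept and coefficients of a known linear relationship:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((100, 3))                    # three synthetic features
y = 4.0 + X @ np.array([2.0, -1.0, 0.5])   # known noiseless linear relationship

regressor = LinearRegression()
regressor.fit(X, y)

print(round(float(regressor.intercept_), 3))  # 4.0
print(np.round(regressor.coef_, 3))           # [ 2.  -1.   0.5]
```

With noiseless data the fit is exact; on real data the learned b values only approximate the true relationship.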
Let's evaluate how well our model predicts outcomes on the test data.
y_pred = regressor.predict(X_test)
Here you can see our model has made some close predictions and also some bad ones.
You can improve the quality of the predictions with model-building techniques such as
Backward Elimination and Forward Selection.
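To put a number on prediction quality, the R² score is a common metric. A minimal sketch with hypothetical stand-in values (the real code would pass the y_test and y_pred arrays from above):

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical stand-ins for y_test and y_pred
y_true = np.array([100.0, 150.0, 200.0, 120.0])
y_hat = np.array([98.0, 155.0, 190.0, 130.0])

# R^2 = 1 - SS_res / SS_tot; 1.0 means perfect predictions
score = r2_score(y_true, y_hat)
print(round(score, 2))  # 0.96
```

An R² close to 1 means the model explains most of the variance in the test targets; techniques like Backward Elimination aim to raise it by removing uninformative features.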