Import Library
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Import Dataset
df = pd.read_csv('https://github.com/ybifoundation/Dataset/raw/main/MPG.csv')
Explore Data
df.head()
    mpg  cylinders  displacement  horsepower  weight  acceleration  model_year  origin     name
0   ...        ...           ...         ...     ...           ...         ...     ...  chevrol
1  15.0          8         350.0         ...     ...           ...         ...     ...     buic
2   ...        ...           ...         ...     ...           ...         ...     ...   plymou
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64
 2   displacement  398 non-null    float64
 3   horsepower    392 non-null    float64
 4   weight        398 non-null    int64
 5   acceleration  398 non-null    float64
 6   model_year    398 non-null    int64
 7   origin        398 non-null    object
 8   name          398 non-null    object
dtypes: float64(4), int64(3), object(2)
df.dropna(inplace=True)
df['model_year'] = pd.to_datetime(df['model_year'], format='%y')
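The `format='%y'` argument reads each value as a two-digit year, so every row ends up on 1 January of that year. A minimal sketch on a toy series follows. The values are illustrative rather than taken from the dataset, and the cast to `str` is an assumption about what behaves most predictably across pandas versions.

```python
import pandas as pd

# Toy two-digit model years (illustrative values only).
years = pd.Series([70, 76, 82])

# Cast to str so the '%y' directive is applied to text such as '70'.
parsed = pd.to_datetime(years.astype(str), format='%y')

# Each entry is midnight on 1 January of the parsed year.
print(parsed.dt.year.tolist())
```

Two-digit years from 69 to 99 map to 1969 through 1999, which is why model years in the 70s and 80s come out in the expected century.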
df.describe(include='all')
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:1: FutureWarning: Treating dateti
"""Entry point for launching an IPython kernel.
                  mpg  cylinders  displacement  horsepower       weight  acceleration           model_year
unique            NaN        NaN           NaN         NaN          NaN           NaN                   13
top               NaN        NaN           NaN         NaN          NaN           NaN  1973-01-01 00:00:00
first             NaN        NaN           NaN         NaN          NaN           NaN           … 00:00:00
last              NaN        NaN           NaN         NaN          NaN           NaN  1982-01-01 00:00:00
...
25%         17.000000   4.000000    105.000000   75.000000  2225.250000     13.775000                  NaN
50%         22.750000   4.000000    151.000000   93.500000  2803.500000     15.500000                  NaN
75%         29.000000   8.000000    275.750000  126.000000  3614.750000     17.025000                  NaN
df.corr()
                   mpg  cylinders  displacement  horsepower    weight  acceleration
mpg           1.000000  -0.777618     -0.805127   -0.778427 -0.832244      0.423329
cylinders    -0.777618   1.000000      0.950823    0.842983  0.897527     -0.504683
displacement -0.805127   0.950823      1.000000    0.897257  0.932994     -0.543800
...
acceleration  0.423329  -0.504683     -0.543800   -0.689196 -0.416839      1.000000
df['origin'].value_counts()
usa       245
japan      79
europe     68
Visualize Data
sns.pairplot(df, x_vars= ['displacement', 'horsepower', 'weight', 'acceleration', 'mpg'], y_vars=
<seaborn.axisgrid.PairGrid at 0x7f8745b9a390>
Define y and X
df.columns
Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model_year', 'origin', 'name'],
      dtype='object')
y = df['mpg']
X = df[['horsepower', 'weight']]
# For each X, calculate VIF and save in dataframe
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
vif.round(1)
   VIF Factor    features
0        32.2  horsepower
1        32.2      weight
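A VIF is 1 / (1 - R²), where R² comes from regressing one feature on the others. The sketch below computes it by hand with NumPy on synthetic data, once with an intercept in the auxiliary regression and once without. As far as I understand, `variance_inflation_factor` does not add a constant itself, so passing `X.values` without one corresponds to the second variant. The helper `vif_by_hand` and the generated columns are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np

def vif_by_hand(X, add_intercept=True):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for i in range(X.shape[1]):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        if add_intercept:
            others = np.column_stack([np.ones(len(X)), others])
            total = np.sum((target - target.mean()) ** 2)  # centered sum of squares
        else:
            total = np.sum(target ** 2)                    # uncentered sum of squares
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / total
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
# Synthetic stand-ins for two positive, correlated features on different scales.
a = rng.normal(100.0, 30.0, size=500)
b = 3000.0 + 20.0 * (a - 100.0) + rng.normal(0.0, 400.0, size=500)
X_demo = np.column_stack([a, b])

vif_centered = vif_by_hand(X_demo, add_intercept=True)
vif_uncentered = vif_by_hand(X_demo, add_intercept=False)
print(vif_centered.round(1), vif_uncentered.round(1))
```

With only two features the uncentered values are identical for both columns, which matches the pair of equal 32.2 entries above. The centered values are usually the ones quoted when judging multicollinearity, so adding a constant column before calling the statsmodels function may be worth trying.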
Train Test Split Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, random_state = 2529)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((274, 2), (118, 2), (274,), (118,))
Scaling Data
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)  # reuse the mean and scale learned from X_train
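The usual pattern is to fit the scaler on the training split only and then apply those same statistics to the test split, so no information from the test rows leaks into preprocessing. A minimal sketch on synthetic arrays (the names `train`, `test`, and `scaler` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-ins for the train and test feature matrices.
train = rng.normal(loc=[100.0, 3000.0], scale=[40.0, 800.0], size=(200, 2))
test = rng.normal(loc=[100.0, 3000.0], scale=[40.0, 800.0], size=(50, 2))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean_ and scale_ from train
test_scaled = scaler.transform(test)        # reuses those training statistics

# Train columns have mean zero; test columns are only close to zero,
# because they were shifted by the training mean, not their own.
print(train_scaled.mean(axis=0).round(3), test_scaled.mean(axis=0).round(3))
```

Calling `fit_transform` on the test split would instead standardize it with its own mean and standard deviation, so train and test rows would no longer share a scale.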
Linear Regression Model
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
LinearRegression()
lr.intercept_
23.577737226277375
lr.coef_
array([-1.83276106, -4.89794393])
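Because the model was fit on standardized features, each coefficient is the change in mpg per one standard deviation of that feature, and the intercept equals the mean of the training target. The sketch below, on synthetic data with a hypothetical `ols` helper, shows how such coefficients can be converted back to original units by dividing by each feature's standard deviation.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data: the target depends linearly on two features with very different scales.
X_raw = rng.normal(loc=[100.0, 3000.0], scale=[40.0, 800.0], size=(300, 2))
y_demo = 45.0 - 0.05 * X_raw[:, 0] - 0.006 * X_raw[:, 1] + rng.normal(0.0, 2.0, 300)

mu, sigma = X_raw.mean(axis=0), X_raw.std(axis=0)
X_std = (X_raw - mu) / sigma

def ols(X, y):
    """Least-squares fit with an intercept; returns (intercept, coefficients)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]

b0_std, coef_std = ols(X_std, y_demo)
b0_raw, coef_raw = ols(X_raw, y_demo)

# Undo the scaling: per-unit slope = per-standard-deviation slope / standard deviation.
coef_back = coef_std / sigma
b0_back = b0_std - np.sum(coef_std * mu / sigma)
print(coef_back, coef_raw)
```

This is why a standardized weight coefficient of a few mpg per standard deviation and a raw-unit coefficient of a few thousandths of an mpg per unit of weight can describe the same relationship.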
Predict Test Data
y_pred = lr.predict(X_test)
Model Accuracy
from sklearn.metrics import mean_absolute_percentage_error, r2_score
mean_absolute_percentage_error(y_test, y_pred)
0.15224796342732086
r2_score(y_test, y_pred)
0.7032406165122396
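Both metrics are simple to compute by hand, which helps when reading the numbers above: scikit-learn reports MAPE as a fraction, so 0.152 means predictions are off by about 15.2% on average, and R² is one minus the ratio of residual to total sum of squares. A minimal sketch with made-up values (the helpers `mape` and `r2` are hypothetical names):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions, chosen so the arithmetic is easy to follow.
y_true_demo = np.array([20.0, 25.0, 40.0])
y_pred_demo = np.array([22.0, 25.0, 36.0])

mape_demo = mape(y_true_demo, y_pred_demo)  # (0.1 + 0.0 + 0.1) / 3
r2_demo = r2(y_true_demo, y_pred_demo)      # 1 - 20 / (650 / 3)
print(round(mape_demo, 4), round(r2_demo, 4))
```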
Significant Variables
import statsmodels.api as sm
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.706
Model:                            OLS   Adj. R-squared:                  0.705
Method:                 Least Squares   F-statistic:                     467.9
Date:                Fri, 22 Jul 2022   Prob (F-statistic):          3.06e-104
Time:                        02:33:16   Log-Likelihood:                -1121.0
No. Observations:                 392   AIC:                             2248.
Df Model:                           2
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
...
weight        -0.0058      0.001    -11.535      0.000      -0.007      -0.005
==============================================================================
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               45.973
Skew:                           0.683   Prob(JB):                     1.04e-10
Kurtosis:                       3.974   Cond. No.                     1.15e+04
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.15e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/tsatools.py:117: FutureWarning: In a future version of pandas all arguments of con
x = pd.concat(x[::order], 1)
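The large condition number in warning [2] can come from feature scale as much as from correlation, since this OLS fit uses the unscaled `horsepower` and `weight` columns next to a column of ones. The sketch below, on synthetic data, compares the condition number of a raw design matrix with the same matrix after standardization. It assumes the reported figure behaves like the 2-norm condition number that `np.linalg.cond` returns.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
# Synthetic stand-ins: two correlated features on large, different scales.
hp = rng.normal(100.0, 40.0, size=n)
wt = 3000.0 + 18.0 * (hp - 100.0) + rng.normal(0.0, 450.0, size=n)

# Design matrix with a constant column, as sm.add_constant would build it.
design_raw = np.column_stack([np.ones(n), hp, wt])

# Same features after standardization.
feats = np.column_stack([hp, wt])
feats_std = (feats - feats.mean(axis=0)) / feats.std(axis=0)
design_std = np.column_stack([np.ones(n), feats_std])

cond_raw = np.linalg.cond(design_raw)
cond_std = np.linalg.cond(design_std)
print(f"raw: {cond_raw:.3g}, standardized: {cond_std:.3g}")
```

If the standardized value is still large, correlation between the features is the more likely cause; if it drops sharply, most of the warning was about units.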
Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_train2 = poly.fit_transform(X_train)
X_train2 = pd.DataFrame(X_train2, columns=['bias', 'horsepower', 'weight', 'square of horsepower',
X_train2
     bias  horsepower    weight  square of horsepower         …         …
...
1     1.0   -0.903062 -1.150355              0.815521  1.038842  1.323317
2     1.0   -0.687147  0.704394              0.472171 -0.484022  0.496171
...
271   1.0   -0.363274 -0.306116              0.131968  0.111204  0.093707
274 rows × 6 columns
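For two input features a and b, `PolynomialFeatures(degree=2)` produces six columns in the order 1, a, b, a², a·b, b². The rows above follow that pattern: in row 1, 0.815521 is the square of -0.903062 and 1.038842 is the product of the two scaled features. A minimal sketch on one toy row (the values 2 and 3 are illustrative):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One toy row with two features, a = 2 and b = 3.
row = np.array([[2.0, 3.0]])

expanded = PolynomialFeatures(degree=2).fit_transform(row)

# Columns: bias, a, b, a^2, a*b, b^2
print(expanded)  # [[1. 2. 3. 4. 6. 9.]]
```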
X_test2 = poly.transform(X_test)
X_test2 = pd.DataFrame(X_test2, columns=['bias', 'horsepower', 'weight', 'square of horsepower', 'h
lr.fit(X_train2, y_train)
LinearRegression()
lr.intercept_
array([22.09857341])
lr.coef_
array([[ 0. , -3.81240238, -4.16928337, -0.04228002, 2.21938086,
-0.41596167]])
y_pred_poly = lr.predict(X_test2)
Model Accuracy
from sklearn.metrics import mean_absolute_percentage_error, r2_score
mean_absolute_percentage_error(y_test, y_pred_poly)
0.13053963468368052
r2_score(y_test, y_pred_poly)
0.7427946235072436