
Final Assignment

BIG DATA PROGRAMMING
PRACTICUM

BY:
NAME : SASKYA LIDAYANI
NIM : F1A220099
GROUP : I (ONE)

UNDERGRADUATE PROGRAM IN STATISTICS


FACULTY OF MATHEMATICS AND NATURAL SCIENCES
UNIVERSITAS HALU OLEO
KENDARI
2023
Question:
1. Based on the kc_house_data dataset, create data visualizations and an analysis
using SVM and Naive Bayes classification in Python!
Answer:
1. SVM Classification
Program: Importing the Required Packages
# import the required libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Program: Import the Dataset

# import the dataset
df = pd.read_csv('kc_house_data1.csv')
df.shape
df.head()

Output: [table of the first five rows of the dataset]
Program: Inspect the Variables

df.columns

Output:
Index(['id', 'bedrooms', 'sqft_living', 'sqft_lot',
'waterfront', 'view', 'condition', 'grade', 'sqft_above',
'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
'sqft_living15', 'sqft_lot15'], dtype='object')

Program: Check for Missing Values

# check for missing values
df.isnull().sum()
Output:
id 0
bedrooms 0
sqft_living 0
sqft_lot 0
waterfront 0
view 0
condition 0
grade 0
sqft_above 0
sqft_basement 0
yr_built 0
yr_renovated 0
zipcode 0
sqft_living15 0
sqft_lot15 0
dtype: int64

Program: Descriptive Statistics

# view the descriptive statistics of the dataset
round(df.describe(), 2)

Output: [table of descriptive statistics, rounded to two decimals]
Program: Histogram Visualization

# plot histograms to inspect the distribution of the data
plt.figure(figsize=(24, 20))
plt.subplot(4, 2, 1)
fig = df['id'].hist(bins=20)
fig.set_xlabel('id')
fig.set_ylabel('count')
plt.subplot(4, 2, 2)
fig = df['sqft_above'].hist(bins=20)
fig.set_xlabel('sqft_above')
fig.set_ylabel('count')
plt.subplot(4, 2, 3)
fig = df['sqft_basement'].hist(bins=20)
fig.set_xlabel('sqft_basement')
fig.set_ylabel('count')
plt.subplot(4, 2, 4)
fig = df['yr_built'].hist(bins=20)
fig.set_xlabel('yr_built')
fig.set_ylabel('count')
plt.subplot(4, 2, 5)
fig = df['sqft_living15'].hist(bins=20)
fig.set_xlabel('sqft_living15')
fig.set_ylabel('count')

Output: [histograms of id, sqft_above, sqft_basement, yr_built, and sqft_living15]
Program: Define the Features and the Target
# the binary waterfront indicator (0/1) is the dependent variable, as the
# shapes and value counts in the outputs below show; all other columns are features
X = df.drop(['waterfront'], axis=1)
y = df['waterfront']
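
Since the waterfront target is heavily imbalanced (only a handful of positives appear in the test sets below), it can help to check the class distribution before modelling. A minimal sketch, not part of the original report:

Program (illustrative sketch):
# count how many houses fall in each class of the binary target
print(df['waterfront'].value_counts())
print(df['waterfront'].value_counts(normalize=True))  # as proportions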

Program: Creating the Train and Test Sets

# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print('X_train', X_train.shape)
print('X_test', X_test.shape)
print('y_train', y_train.shape)
print('y_test', y_test.shape)

Output:
X_train (11223, 14)
X_test (2806, 14)
y_train (11223,)
y_test (2806,)
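
Because only a small fraction of houses have waterfront = 1, a plain random split can leave very few positives in the test set. As a hedged alternative (not what the report ran), train_test_split accepts a stratify argument that preserves the class ratio in both subsets:

Program (illustrative sketch):
# stratified variant of the split above; keeps the 0/1 ratio
# identical in the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)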
Program: Training the SVM Classifier
# build an SVM classifier with a linear kernel
classifier = SVC(kernel='linear')

# fit the classifier on the training data
classifier.fit(X_train, y_train)

# predict the testing data and store the result in y_predict
y_predict = classifier.predict(X_test)

# display the classification report
print(classification_report(y_test, y_predict))

Output:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2790
           1       0.75      0.38      0.50        16

    accuracy                           1.00      2806
   macro avg       0.87      0.69      0.75      2806
weighted avg       1.00      1.00      1.00      2806

Program: Confusion Matrix of the Testing Data
# compare the testing data with the predictions in a confusion matrix
cm = confusion_matrix(y_test, y_predict)

# plot the confusion matrix
%matplotlib inline
plt.figure(figsize=(10, 7))
sns.heatmap(cm, annot=True)
plt.xlabel('Predicted')
plt.ylabel('Truth')

Output: [heatmap of the confusion matrix]
Interpretation:
Because the 'waterfront' column is used as the dependent variable, there are two categories, 0 and 1. The y-axis (Truth) of the heatmap above shows the actual values in the dataset, consisting of 0 and 1, while the x-axis (Predicted) shows the predictions produced by the model, also 0 and 1.
At coordinates (0,0) and (0,1) are the values 2788 and 2, meaning that when the actual value is '0', the classification model predicted '0' 2788 times and '1' 2 times. Then at coordinates (1,0) and (1,1) are the values 10 and 6, meaning that when the actual value is '1', the model predicted '0' 10 times and '1' 6 times.
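
The per-class figures in the classification report above can be reproduced by hand from these four counts; a small sketch for class 1:

Program (illustrative sketch):
# counts read off the confusion matrix above
tp, fp, fn = 6, 2, 10                  # true 1 & pred 1, true 0 & pred 1, true 1 & pred 0
precision = tp / (tp + fp)             # 6 / 8  = 0.75
recall = tp / (tp + fn)                # 6 / 16 = 0.375 ~ 0.38
f1 = 2 * precision * recall / (precision + recall)   # 0.50
print(precision, recall, f1)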
2. Naive Bayes Classification
Program: Importing the Packages
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # for data visualization purposes
import seaborn as sns  # for statistical data visualization
%matplotlib inline

Program: Import the Dataset

df = pd.read_csv('kc_house_data1.csv')
df.shape
df.head()

Output: [table of the first five rows of the dataset]
Program: Identify the Numerical Variables

# list every column whose dtype is not object
numerical = [var for var in df.columns if df[var].dtype != 'O']
print('There are {} numerical variables\n'.format(len(numerical)))
print('The numerical variables are :', numerical)
df[numerical].head()

Output: [list of the numerical variables and the first five rows of those columns]
Program: Define the Features and the Target
# waterfront (0/1) is again the binary target
X = df.drop(['waterfront'], axis=1)
y = df['waterfront']

Program: Creating the Train and Test Sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.shape, X_test.shape
X_train.dtypes

Output:
id int64
bedrooms int64
sqft_living int64
sqft_lot int64
view int64
condition int64
grade int64
sqft_above int64
sqft_basement int64
yr_built int64
yr_renovated int64
zipcode int64
sqft_living15 int64
sqft_lot15 int64
dtype: object

Program: Feature Scaling with RobustScaler
cols = X_train.columns
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_train = pd.DataFrame(X_train, columns=cols)
X_test = pd.DataFrame(X_test, columns=cols)
X_train.head()

Output: [first five rows of the scaled training set]
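
RobustScaler centres each feature on its median and divides by the interquartile range, so extreme values (common in variables like sqft_lot) distort the scaling less than with StandardScaler. A minimal manual equivalent for a single column, illustrative only:

Program (illustrative sketch):
import numpy as np

x = df['sqft_living'].to_numpy(dtype=float)
q25, q50, q75 = np.percentile(x, [25, 50, 75])
x_scaled = (x - q50) / (q75 - q25)   # matches RobustScaler's default settings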
Program: Training the Gaussian Naive Bayes Classifier
# train a Gaussian Naive Bayes classifier on the training set
from sklearn.naive_bayes import GaussianNB

# instantiate the model
gnb = GaussianNB()

# fit the model and predict the testing data
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

y_pred

Output:
array([0, 0, 0, ..., 0, 0, 0])
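
Beyond the hard 0/1 predictions, GaussianNB also exposes class probabilities, which show how confident the model is on each row. A short sketch, not in the original report:

Program (illustrative sketch):
# probability of class 0 and class 1 for the first five test rows;
# columns are ordered as in gnb.classes_
print(gnb.predict_proba(X_test)[:5])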

Program: Accuracy Scores
from sklearn.metrics import accuracy_score

print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred)))
y_pred_train = gnb.predict(X_train)
print('Training-set accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train)))
print('Training set score: {:.4f}'.format(gnb.score(X_train, y_train)))
print('Test set score: {:.4f}'.format(gnb.score(X_test, y_test)))
y_test.value_counts()

Output:
Model accuracy score: 0.9641
Training-set accuracy score: 0.9665
Training set score: 0.9665
Test set score: 0.9641
0    4183
1      26
Name: waterfront, dtype: int64

Program: Null Accuracy and Confusion Matrix
# null accuracy: the accuracy of always predicting the majority class
null_accuracy = (4183 / (4183 + 26))

print('Null accuracy score: {0:0.4f}'.format(null_accuracy))

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix\n\n', cm)
# in scikit-learn, rows are the true classes and columns the predictions;
# class 1 (waterfront) is treated as the positive class
print('\nTrue Positives(TP) = ', cm[1,1])
print('\nTrue Negatives(TN) = ', cm[0,0])
print('\nFalse Positives(FP) = ', cm[0,1])
print('\nFalse Negatives(FN) = ', cm[1,0])
Output:
Null accuracy score: 0.9938
Confusion matrix

 [[4035  148]
 [   3   23]]

True Positives(TP) =  23

True Negatives(TN) =  4035

False Positives(FP) =  148

False Negatives(FN) =  3
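
The headline metrics follow directly from these four cells; a quick sketch that reproduces the scores reported elsewhere in this section:

Program (illustrative sketch):
# class 1 (waterfront) is the positive class
TP, TN, FP, FN = 23, 4035, 148, 3
accuracy = (TP + TN) / (TP + TN + FP + FN)   # 4058 / 4209 ~ 0.9641
precision = TP / (TP + FP)                   # 23 / 171  ~ 0.13
recall = TP / (TP + FN)                      # 23 / 26   ~ 0.88
print(accuracy, precision, recall)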

Program: Confusion Matrix Visualization

# label rows as actual classes and columns as predictions,
# matching scikit-learn's confusion-matrix layout
cm_matrix = pd.DataFrame(data=cm,
                         index=['Actual Negative:0', 'Actual Positive:1'],
                         columns=['Predicted Negative:0', 'Predicted Positive:1'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')

Output: [annotated heatmap of the confusion matrix]
Program: Classification Report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Output:
              precision    recall  f1-score   support

           0       1.00      0.96      0.98      4183
           1       0.13      0.88      0.23        26

    accuracy                           0.96      4209
   macro avg       0.57      0.92      0.61      4209
weighted avg       0.99      0.96      0.98      4209

Interpretation:
For class 0 (no waterfront) the model reaches a precision of 1.00 and a recall of 0.96, while for class 1 (waterfront) precision drops to 0.13 even though recall is high at 0.88: the model finds most of the rare waterfront houses but produces many false positives (148) in doing so. The overall accuracy of 0.96 is below the null accuracy of 0.9938, so on this heavily imbalanced target the Gaussian Naive Bayes model trades a little overall accuracy for much better recall on the rare class.
