
UAS Data Intelligent

Muhammad Bariklana (18611079)

Problem A.1
In [10]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

In [11]: from sklearn.cluster import KMeans


from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

In [15]: # Read the data
student = pd.read_csv('student-mat.csv', sep=";")
student.head()

Out[15]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel free

0 GP F 18 U GT3 A 4 4 at_home teacher ... 4

1 GP F 17 U GT3 T 1 1 at_home other ... 5

2 GP F 15 U LE3 T 1 1 at_home other ... 4

3 GP F 15 U GT3 T 4 2 health services ... 3

4 GP F 16 U GT3 T 3 3 other other ... 4

5 rows × 33 columns
In [13]: student.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 school 395 non-null object
1 sex 395 non-null object
2 age 395 non-null int64
3 address 395 non-null object
4 famsize 395 non-null object
5 Pstatus 395 non-null object
6 Medu 395 non-null int64
7 Fedu 395 non-null int64
8 Mjob 395 non-null object
9 Fjob 395 non-null object
10 reason 395 non-null object
11 guardian 395 non-null object
12 traveltime 395 non-null int64
13 studytime 395 non-null int64
14 failures 395 non-null int64
15 schoolsup 395 non-null object
16 famsup 395 non-null object
17 paid 395 non-null object
18 activities 395 non-null object
19 nursery 395 non-null object
20 higher 395 non-null object
21 internet 395 non-null object
22 romantic 395 non-null object
23 famrel 395 non-null int64
24 freetime 395 non-null int64
25 goout 395 non-null int64
26 Dalc 395 non-null int64
27 Walc 395 non-null int64
28 health 395 non-null int64
29 absences 395 non-null int64
30 G1 395 non-null int64
31 G2 395 non-null int64
32 G3 395 non-null int64
dtypes: int64(16), object(17)
memory usage: 102.0+ KB

What is the data size?

In [16]: student.shape

Out[16]: (395, 33)

What is the percentage ratio of female and male students?


In [20]: student[student["sex"].str.contains(r'F')]

Out[20]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel f

0 GP F 18 U GT3 A 4 4 at_home teacher ... 4

1 GP F 17 U GT3 T 1 1 at_home other ... 5

2 GP F 15 U LE3 T 1 1 at_home other ... 4

3 GP F 15 U GT3 T 4 2 health services ... 3

4 GP F 16 U GT3 T 3 3 other other ... 4

... ... ... ... ... ... ... ... ... ... ... ... ...

385 MS F 18 R GT3 T 2 2 at_home other ... 5

386 MS F 18 R GT3 T 4 4 teacher at_home ... 4

387 MS F 19 R GT3 T 2 3 services other ... 5

388 MS F 18 U LE3 T 3 1 teacher services ... 4

389 MS F 18 U GT3 T 1 1 other other ... 1

208 rows × 33 columns


In [21]: student[student["sex"].str.contains(r'M')]

Out[21]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel f

5 GP M 16 U LE3 T 4 3 services other ... 5

6 GP M 16 U LE3 T 2 2 other other ... 4

8 GP M 15 U LE3 A 3 2 services other ... 4

9 GP M 15 U GT3 T 3 4 other other ... 5

12 GP M 15 U LE3 T 4 4 health services ... 4

... ... ... ... ... ... ... ... ... ... ... ... ...

390 MS M 20 U LE3 A 2 2 services services ... 5

391 MS M 17 U LE3 T 3 1 services services ... 2

392 MS M 21 R GT3 T 1 1 other other ... 5

393 MS M 18 R LE3 T 3 2 services other ... 4

394 MS M 19 U LE3 T 1 1 other at_home ... 3

187 rows × 33 columns
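
The two filters give the raw counts (208 female and 187 male students). A minimal sketch that turns them into the requested percentages:

In [ ]: # Share of female and male students in percent (expected: F ≈ 52.7%, M ≈ 47.3%)
student["sex"].value_counts(normalize=True) * 100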

Utilize the information on family size and the parents' cohabitation status: what is the
probability that a student is not the only child in their family?

In [61]: databaru = student[(student["famsize"] == "GT3") & (student["Pstatus"] == "T")]
databaru.shape

Out[61]: (260, 33)
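
The filter returns 260 of the 395 students who live with both parents in a family of more than three members; dividing by the total number of students gives the requested probability. A minimal sketch:

In [ ]: # P(famsize > 3 and parents living together) = 260/395 ≈ 0.658
len(databaru) / len(student)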

Below is the information about students' parents.

Read the metadata to find variable definitions.


In [22]: # The number of mothers who work in health care is greater than the number of fathers who work in the same field. (TRUE)
student[student["Mjob"].str.contains(r'health')]

Out[22]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel free

3 GP F 15 U GT3 T 4 2 health services ... 3

12 GP M 15 U LE3 T 4 4 health services ... 4

15 GP F 16 U GT3 T 4 4 health other ... 4

19 GP M 16 U LE3 T 4 3 health other ... 3

21 GP M 15 U GT3 T 4 4 health health ... 5

27 GP M 15 U GT3 T 4 2 health services ... 2

30 GP M 15 U GT3 T 4 4 health services ... 5

47 GP M 16 U GT3 T 4 3 health services ... 4

51 GP F 15 U LE3 T 4 2 health other ... 4

52 GP M 15 U LE3 A 4 2 health health ... 5

60 GP F 16 R GT3 T 4 4 health teacher ... 2

68 GP F 15 R LE3 T 2 2 health services ... 4

109 GP F 16 U LE3 T 4 4 health health ... 5

114 GP M 15 R GT3 T 2 1 health services ... 5

123 GP M 16 U GT3 T 4 4 health other ... 3

146 GP F 15 U GT3 T 3 2 health services ... 3

167 GP F 16 U GT3 T 4 2 health services ... 4

169 GP F 16 U GT3 T 4 4 health health ... 4

188 GP F 17 U GT3 A 3 3 health other ... 3

200 GP F 16 U GT3 T 4 3 health other ... 4

230 GP F 17 U LE3 T 4 3 health other ... 3

233 GP M 16 U GT3 T 4 4 health other ... 4

240 GP M 17 U LE3 T 4 3 health other ... 2

255 GP M 17 U LE3 T 1 1 health other ... 4

268 GP M 18 U GT3 T 4 2 health other ... 5

278 GP F 18 U GT3 T 4 4 health health ... 2

291 GP F 17 U GT3 T 4 3 health services ... 4

295 GP M 17 U GT3 T 3 3 health other ... 4

296 GP F 19 U GT3 T 4 4 health other ... 2

300 GP F 18 U LE3 A 4 4 health other ... 4

303 GP F 17 U GT3 T 3 2 health health ... 5

348 GP F 17 U GT3 T 4 3 health other ... 4

351 MS M 17 U GT3 T 3 3 health other ... 4

376 MS F 20 U GT3 T 4 2 health other ... 5

34 rows × 33 columns
In [23]: student[student["Fjob"].str.contains(r'health')]

Out[23]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel fre

10 GP F 15 U GT3 T 4 4 teacher health ... 3

21 GP M 15 U GT3 T 4 4 health health ... 5

24 GP F 15 R GT3 T 2 4 services health ... 4

38 GP F 15 R GT3 T 3 4 services health ... 4

52 GP M 15 U LE3 A 4 2 health health ... 5

57 GP M 15 U GT3 T 4 4 teacher health ... 3

63 GP F 16 U GT3 T 4 3 teacher health ... 3

89 GP M 16 U LE3 A 4 4 teacher health ... 4

94 GP M 15 U LE3 T 2 2 services health ... 4

105 GP F 15 U GT3 A 3 3 other health ... 4

109 GP F 16 U LE3 T 4 4 health health ... 5

122 GP F 16 U LE3 T 2 4 other health ... 4

169 GP F 16 U GT3 T 4 4 health health ... 4

217 GP M 18 U LE3 T 3 3 services health ... 3

274 GP F 17 U GT3 T 2 4 at_home health ... 4

278 GP F 18 U GT3 T 4 4 health health ... 2

303 GP F 17 U GT3 T 3 2 health health ... 5

314 GP F 19 U GT3 T 1 1 at_home health ... 4

18 rows × 33 columns
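
The two filters return 34 mothers versus 18 fathers working in health care, which confirms the statement. The same comparison can be made with boolean sums; a minimal sketch:

In [ ]: # Number of mothers vs fathers working in health care (expected: 34 vs 18)
(student["Mjob"] == "health").sum(), (student["Fjob"] == "health").sum()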

In [30]: # Fewer than 10% of mothers have primary education or below.
primary = student[(student["Medu"] <= 1)]
primary.shape

Out[30]: (62, 33)

In [102]: 62/395*100

Out[102]: 15.69620253164557

In [31]: # Most of the fathers have higher education.
nol = student[(student["Fedu"] == 0)]
nol.shape

Out[31]: (2, 33)

In [32]: satu = student[(student["Fedu"] == 1)]


satu.shape

Out[32]: (82, 33)

In [33]: dua = student[(student["Fedu"] == 2)]


dua.shape

Out[33]: (115, 33)

In [34]: tiga = student[(student["Fedu"] == 3)]


tiga.shape

Out[34]: (100, 33)

In [35]: empat = student[(student["Fedu"] == 4)]


empat.shape

Out[35]: (96, 33)
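
Counting each Fedu level separately works, but a single value_counts call gives the whole distribution (0: 2, 1: 82, 2: 115, 3: 100, 4: 96) and shows that level 4 (higher education) is not the majority; a minimal sketch:

In [ ]: # Distribution of the father's education level
student["Fedu"].value_counts().sort_index()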


What can be inferred from the two-tailed paired t-test for the first and second grades?

In [38]: import numpy as np


from scipy import stats
import matplotlib.pyplot as plt

In [41]: # Test for a difference in means between two paired populations
np.random.seed(456)
G1 = stats.norm.rvs(loc=8, scale=10, size=50)
G2 = stats.norm.rvs(loc=8, scale=10, size=50)

a, b = stats.ttest_rel(G1, G2)

In [42]: a, b

Out[42]: (-0.22111035345211244, 0.8259253278353127)
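
Note that the cell above simulates G1 and G2 from random normal draws rather than using the dataset's own grades. A minimal sketch of the paired t-test on the actual columns, assuming that is what the question intends:

In [ ]: # Two-tailed paired t-test on the real first- and second-period grades
stat, p = stats.ttest_rel(student["G1"], student["G2"])
stat, p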

What can be inferred from the Kendall Tau correlation test for study time and
frequency of going out with friends?

In [49]: # Pearson's correlation test (note: the question asks for Kendall's tau; see below)
data1 = student["studytime"]
data2 = student["goout"]
stat, p = stats.pearsonr(data1, data2)

stat, p

Out[49]: (-0.06390367501441135, 0.2050369462596782)
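
The question asks about Kendall's tau, while the cell above runs Pearson's test; a minimal sketch of the Kendall tau test on the same columns:

In [ ]: # Kendall's tau correlation between study time and going out with friends
stat, p = stats.kendalltau(data1, data2)
stat, p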

What can be inferred from the normality test for the final grade?

In [55]: # Example of the Shapiro-Wilk test

G3 = student["G3"]
stat, p = stats.shapiro(G3)

# Optional: only used to display the result
print('stat=%.3f, p=%.3f' % (stat, p))

if p > 0.05:
    print('Probably Normal')
else:
    print('Probably not Normal')

stat=0.929, p=0.000
Probably not Normal

What can be inferred from the one-way ANOVA test for the grades?

In [56]: G1 = student["G1"]
G2 = student["G2"]
G3 = student["G3"]

stats.f_oneway(G1, G2, G3)

Out[56]: F_onewayResult(statistic=1.5873114383378757, pvalue=0.20491015955616096)
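
Since G1, G2, and G3 are three measurements on the same students, a repeated-measures procedure such as the Friedman test is arguably a better fit than one-way ANOVA; a minimal sketch:

In [ ]: # Friedman test: non-parametric repeated-measures comparison of the three grade periods
stats.friedmanchisquare(G1, G2, G3)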


Find the error in the following list comprehension, which creates a new list of the
reasons for choosing the school in which the term "course" is substituted with
"curriculum".

In [65]: list_baru = [baris if "course" not in baris else "curriculum" for baris in student['reason']]
print(list_baru)

['curriculum', 'curriculum', 'other', 'home', 'home', 'reputation', 'home', 'home',
'home', 'home', 'reputation', 'reputation', 'curriculum', 'curriculum', 'home', 'home',
'reputation', 'reputation', 'curriculum', 'home', 'reputation', 'other', 'curriculum',
'reputation', 'curriculum', 'home', 'home', 'other', 'home', 'home', 'home',
'reputation', 'curriculum', 'curriculum', 'home', 'other', 'home', 'reputation',
'curriculum', 'reputation', ...] (395 entries in total)
Problem A.2
In [68]: covid = pd.read_excel('data covid INA.xlsx')
covid.head()

Out[68]:
No Provinsi Case RR CFR PR TR

0 1 Aceh 8776 0.817912 0.041 0.0437 22.9

1 2 Bali 18263 0.907354 0.029 0.0588 17.0

2 3 Banten 19161 0.552320 0.023 0.0358 27.9

3 4 Bangka Belitung 2638 0.750569 0.016 0.0956 10.5

4 5 Bengkulu 3779 0.790156 0.031 0.2049 4.9

In [76]: covid.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 No 34 non-null int64
1 Provinsi 34 non-null object
2 Case 34 non-null int64
3 RR 34 non-null float64
4 CFR 34 non-null float64
5 PR 34 non-null float64
6 TR 34 non-null float64
dtypes: float64(4), int64(2), object(1)
memory usage: 2.0+ KB

In [79]: # Select the variables to be clustered (RR, CFR, PR)
covid_x = covid.iloc[:, 3:6]
covid_x.head()

Out[79]:
RR CFR PR

0 0.817912 0.041 0.0437

1 0.907354 0.029 0.0588

2 0.552320 0.023 0.0358

3 0.750569 0.016 0.0956

4 0.790156 0.031 0.2049


In [80]: # Convert the data-frame variables into a NumPy array
x_array = np.array(covid_x)
x_array

Out[80]: array([[0.81791249, 0.041 , 0.0437 ],


[0.90735367, 0.029 , 0.0588 ],
[0.55231982, 0.023 , 0.0358 ],
[0.75056861, 0.016 , 0.0956 ],
[0.79015613, 0.031 , 0.2049 ],
[0.66731634, 0.022 , 0.0305 ],
[0.90322915, 0.017 , 0.0673 ],
[0.77294398, 0.017 , 0.0551 ],
[0.8493994 , 0.013 , 1. ],
[0.68126408, 0.044 , 0.0645 ],
[0.8604622 , 0.07 , 0.1114 ],
[0.90310442, 0.009 , 0.2769 ],
[0.83986882, 0.027 , 0.1758 ],
[0.79884485, 0.027 , 0.2754 ],
[0.9056507 , 0.038 , 0.1901 ],
[0.6059322 , 0.014 , 0.0876 ],
[0.88373722, 0.025 , 0.203 ],
[0.80221843, 0.048 , 0.0859 ],
[0.8169665 , 0.051 , 0.0868 ],
[0.87099891, 0.022 , 0.2515 ],
[0.73097007, 0.032 , 1. ],
[0.85203917, 0.037 , 0.1732 ],
[0.8808642 , 0.019 , 0.0268 ],
[0.86858625, 0.018 , 0.139 ],
[0.55492813, 0.03 , 0.1261 ],
[0.70549582, 0.044 , 0.0328 ],
[0.92903024, 0.024 , 1. ],
[0.83853797, 0.032 , 0.1484 ],
[0.79972518, 0.015 , 0.182 ],
[0.92090954, 0.017 , 1. ],
[0.57974589, 0.011 , 0.1702 ],
[0.78159204, 0.019 , 0.0225 ],
[0.56445993, 0.024 , 0.1005 ],
[0.88425926, 0.027 , 0.1298 ]])

In [81]: # Rescale the variables to a common [0, 1] range
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x_array)
In [92]: x_scaled

Out[92]: array([[0.70503139, 0.52459016, 0.02168798],


[0.94245827, 0.32786885, 0.03713555],
[0. , 0.2295082 , 0.01360614],
[0.52626311, 0.1147541 , 0.07478261],
[0.63135049, 0.36065574, 0.18659847],
[0.30526505, 0.21311475, 0.00818414],
[0.9315095 , 0.13114754, 0.0458312 ],
[0.58565984, 0.13114754, 0.03335038],
[0.78861527, 0.06557377, 1. ],
[0.34229015, 0.57377049, 0.04296675],
[0.81798211, 1. , 0.09094629],
[0.9311784 , 0. , 0.26025575],
[0.76331577, 0.29508197, 0.15682864],
[0.65441523, 0.29508197, 0.25872123],
[0.93793763, 0.47540984, 0.1714578 ],
[0.14231724, 0.08196721, 0.06659847],
[0.87976701, 0.26229508, 0.18465473],
[0.66337059, 0.63934426, 0.06485934],
[0.7025202 , 0.68852459, 0.06578005],
[0.84595242, 0.21311475, 0.2342711 ],
[0.47423763, 0.37704918, 1. ],
[0.79562268, 0.45901639, 0.1541688 ],
[0.87214041, 0.16393443, 0.00439898],
[0.83954787, 0.14754098, 0.11918159],
[0.00692393, 0.3442623 , 0.10598465],
[0.40661472, 0.57377049, 0.01053708],
[1. , 0.24590164, 1. ],
[0.75978295, 0.37704918, 0.12879795],
[0.65675211, 0.09836066, 0.16317136],
[0.97844313, 0.13114754, 1. ],
[0.07280413, 0.03278689, 0.15109974],
[0.60861662, 0.16393443, 0. ],
[0.03222665, 0.24590164, 0.0797954 ],
[0.8811528 , 0.29508197, 0.10976982]])

Analyze the RR, CFR, and PR of each province by employing k-means clustering.
Group the provinces into 6 clusters, setting 12345 as the random state. Which
function do we need for this analysis?

In [83]: # Define and configure the k-means model
kmeans = KMeans(n_clusters=6, random_state=12345)

In [84]: # Fit the model to the scaled data
kmeans.fit(x_scaled)

Out[84]: KMeans(n_clusters=6, random_state=12345)

In [85]: # Display the cluster centers
print(kmeans.cluster_centers_)

[[0.59432292 0.12704918 0.06782609]


[0.81032401 0.20491803 1. ]
[0.09325617 0.19125683 0.07087809]
[0.8332951 0.27166276 0.14657654]
[0.37445243 0.57377049 0.02675192]
[0.72222607 0.71311475 0.06081841]]

In [86]: # Display the cluster labels
print(kmeans.labels_)

[5 3 2 0 3 2 3 0 1 4 5 3 3 3 3 2 3 5 5 3 1 3 3 3 2 4 1 3 0 1 2 0 2 3]
In [87]: # Add a "kluster" column to the covid data frame
covid["kluster"] = kmeans.labels_
covid.head()

Out[87]:
No Provinsi Case RR CFR PR TR kluster

0 1 Aceh 8776 0.817912 0.041 0.0437 22.9 5

1 2 Bali 18263 0.907354 0.029 0.0588 17.0 3

2 3 Banten 19161 0.552320 0.023 0.0358 27.9 2

3 4 Bangka Belitung 2638 0.750569 0.016 0.0956 10.5 0

4 5 Bengkulu 3779 0.790156 0.031 0.2049 4.9 3

In [91]: # Visualize the clustering result (x: scaled CFR, y: scaled RR)

fig, ax = plt.subplots()
sct = ax.scatter(x_scaled[:,1], x_scaled[:,0], s=100,
                 c=covid.kluster, marker="o", alpha=0.5)

centers = kmeans.cluster_centers_
ax.scatter(centers[:,1], centers[:,0], c='blue', s=200, alpha=0.5);

plt.title("K-Means Clustering Result")
plt.xlabel("CFR (scaled)")
plt.ylabel("RR (scaled)")

plt.show()

In [89]: aa = pd.DataFrame(x_scaled)
aa.head()

Out[89]:
0 1 2

0 0.705031 0.524590 0.021688

1 0.942458 0.327869 0.037136

2 0.000000 0.229508 0.013606

3 0.526263 0.114754 0.074783

4 0.631350 0.360656 0.186598


In [90]: plt.figure(figsize=[8,8])

sns.scatterplot(x=0, y=1, hue=covid.kluster,
                palette="Set2", s=100, alpha=0.7, data=aa)

# plot the centers on the same axes (column 0 = RR on x, column 1 = CFR on y)
sns.scatterplot(x=centers[:,0], y=centers[:,1],
                color="k", s=200, alpha=0.5);

Which cluster has the highest number of members?

In [96]: covid["kluster"].value_counts()

Out[96]: 3 14
2 6
5 4
1 4
0 4
4 2
Name: kluster, dtype: int64

In the near future, the government will implement large-scale social restrictions
(PSBB/Pembatasan Sosial Berskala Besar) for Jawa and Bali. Check all correct
answers about the Jawa and Bali provinces based on the clustering result.
In [103]: covid

Out[103]:
No Provinsi Case RR CFR PR TR kluster

0 1 Aceh 8776 0.817912 0.041 0.0437 22.9 5

1 2 Bali 18263 0.907354 0.029 0.0588 17.0 3

2 3 Banten 19161 0.552320 0.023 0.0358 27.9 2

3 4 Bangka Belitung 2638 0.750569 0.016 0.0956 10.5 0

4 5 Bengkulu 3779 0.790156 0.031 0.2049 4.9 3

5 6 DI Yogyakarta 13340 0.667316 0.022 0.0305 32.8 2

6 7 DKI Jakarta 192899 0.903229 0.017 0.0673 14.9 3

7 8 Jambi 3356 0.772944 0.017 0.0551 18.2 0

8 9 Jawa Barat 89661 0.849399 0.013 1.0000 1.0 1

9 10 Jawa Tengah 86545 0.681264 0.044 0.0645 15.5 4

10 11 Jawa Timur 87797 0.860462 0.070 0.1114 9.0 5

11 12 Kalimantan Barat 3189 0.903104 0.009 0.2769 3.6 3

12 13 Kalimantan Timur 28358 0.839869 0.027 0.1758 5.7 3

13 14 Kalimantan Tengah 10042 0.798845 0.027 0.2754 3.6 3

14 15 Kalimantan Selatan 15591 0.905651 0.038 0.1901 5.3 3

15 16 Kalimantan Utara 4248 0.605932 0.014 0.0876 11.4 2

16 17 Kepulauan Riau 7139 0.883737 0.025 0.2030 4.9 3

17 18 Nusa Tenggara Barat 5860 0.802218 0.048 0.0859 11.6 5

18 19 Sumatera Selatan 12118 0.816966 0.051 0.0868 11.5 5

19 20 Sumatera Barat 23806 0.870999 0.022 0.2515 4.0 3

20 21 Sulawesi Utara 9958 0.730970 0.032 1.0000 1.0 1

21 22 Sumatera Utara 18586 0.852039 0.037 0.1732 5.8 3

22 23 Sulawesi Tenggara 8100 0.880864 0.019 0.0268 37.3 3

23 24 Sulawesi Selatan 33931 0.868586 0.018 0.1390 7.2 3

24 25 Sulawesi Tengah 3896 0.554928 0.030 0.1261 7.9 2

25 26 Lampung 6696 0.705496 0.044 0.0328 30.5 4

26 27 Riau 25532 0.929030 0.024 1.0000 1.0 1

27 28 Maluku Utara 2818 0.838538 0.032 0.1484 6.7 3

28 29 Maluku 5822 0.799725 0.015 0.1820 5.5 0

29 30 Papua Barat 6069 0.920910 0.017 1.0000 1.0 1

30 31 Papua 13380 0.579746 0.011 0.1702 5.9 2

31 32 Sulawesi Barat 2010 0.781592 0.019 0.0225 44.4 0

32 33 Nusa Tenggara Timur 2296 0.564460 0.024 0.1005 9.9 2

33 34 Gorontalo 3888 0.884259 0.027 0.1298 7.7 3

Based on the mean of the variables in every cluster:


In [99]: covid[['RR', 'kluster']].groupby(['kluster'], as_index=False).mean().sort_values(by='RR')

Out[99]:
kluster RR

2 2 0.587450

4 4 0.693380

0 0 0.776207

5 5 0.824390

1 1 0.857577

3 3 0.866231

In [98]: covid[['CFR', 'kluster']].groupby(['kluster'], as_index=False).mean().sort_values(by='CFR')

Out[98]:
kluster CFR

0 0 0.016750

2 2 0.020667

1 1 0.021500

3 3 0.025571

4 4 0.044000

5 5 0.052500

In [101]: covid[['RR', 'kluster']].groupby(['kluster'], as_index=False).mean().sort_values(by='RR')

Out[101]:
kluster RR

2 2 0.587450

4 4 0.693380

0 0 0.776207

5 5 0.824390

1 1 0.857577

3 3 0.866231

In [100]: covid[['PR', 'kluster']].groupby(['kluster'], as_index=False).mean().sort_values(by='PR')

Out[100]:
kluster PR

4 4 0.048650

5 5 0.081950

0 0 0.088800

2 2 0.091783

3 3 0.165779

1 1 1.000000
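
The separate group-bys above can be collapsed into a single call that reports the mean of all three rates per cluster at once; a minimal sketch:

In [ ]: # Mean RR, CFR and PR for every cluster in one table
covid.groupby("kluster")[["RR", "CFR", "PR"]].mean()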

Visualization

In [109]: Provinsi = covid["Provinsi"]
Kluster = covid["kluster"]

In [115]: # Use plt.bar(x, y, color=...) to create a bar chart
plt.bar(Kluster, Provinsi, color="c")

plt.title("EPS Perusahaan")
plt.xlabel("Kode Finansial Perusahaan")
plt.ylabel("Earning Per Share (EPS)")

plt.show()

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-115-119a9b178618> in <module>
1 # Use plt.bar(x, y, color=...) to create a bar chart
----> 2 plt.bar(Kluster , color="c")
3
4 plt.title("EPS Perusahaan")
5 plt.xlabel("Kode Finansial Perusahaan")

TypeError: bar() missing 1 required positional argument: 'height'
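
The call fails because plt.bar needs a numeric height (and the axis labels were carried over from a different exercise). A minimal sketch of a working bar chart, assuming the intent is to show how many provinces fall into each cluster:

In [ ]: # Number of provinces per cluster
counts = covid["kluster"].value_counts().sort_index()
plt.bar(counts.index, counts.values, color="c")
plt.title("Cluster sizes")
plt.xlabel("Cluster")
plt.ylabel("Number of provinces")
plt.show()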

Exercises
1. Add an analysis to determine the best number of clusters using the WSS (elbow), silhouette, or gap-statistic
method
2. Create a map visualization of the clustered data

In [116]: # Function that returns the WSS score for k values from 1 to kmax
def calculate_WSS(points, kmax):
    sse = []
    for k in range(1, kmax+1):
        kmeans = KMeans(n_clusters = k).fit(points)
        centroids = kmeans.cluster_centers_
        pred_clusters = kmeans.predict(points)
        curr_sse = 0

        # calculate the squared Euclidean distance of each point from its cluster center
        # and add it to the current WSS (note: only the first two features are used here)
        for i in range(len(points)):
            curr_center = centroids[pred_clusters[i]]
            curr_sse += (points[i, 0] - curr_center[0]) ** 2 + (points[i, 1] - curr_center[1]) ** 2

        sse.append(curr_sse)
    return sse

In [117]: yy = calculate_WSS(x_scaled, 15)


yy

Out[117]: [4.551804861427149,
4.364361585947376,
2.2004193180017224,
1.3044658871052162,
1.02667947037841,
0.8365685591334261,
0.6970081493729405,
0.5068498726364605,
0.38836713771764475,
0.33791062338638983,
0.26167833918931815,
0.2339398836613525,
0.15760260406598947,
0.14533961930401096,
0.10248637498594232]

In [118]: xx = np.arange(1, 16, 1)


xx

Out[118]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])


In [119]: plt.figure(figsize=[7,7])
plt.plot(xx, yy, "b-o")
plt.show()

Based on the WSS (elbow) plot, the curve has largely flattened by 7 or 8 clusters, so 7 clusters can be used.
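
As a cross-check, scikit-learn already reports the within-cluster sum of squares of a fitted model through its inertia_ attribute, computed over all three scaled features (the helper above uses only the first two); a minimal sketch:

In [ ]: # WSS (inertia) over all three scaled features for k = 1..15
wss = [KMeans(n_clusters=k, random_state=12345).fit(x_scaled).inertia_ for k in range(1, 16)]
plt.plot(range(1, 16), wss, "b-o")
plt.show()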

In [22]: sil = []
kmax = 10

# dissimilarity would not be defined for a single cluster, thus the minimum number of clusters should be 2
for k in range(2, kmax+1):
    kmeans = KMeans(n_clusters = k, random_state = 123).fit(x_scaled)
    labels = kmeans.labels_
    sil.append(silhouette_score(x_scaled, labels, metric = 'euclidean'))

In [23]: yy2 = sil


xx2 = np.arange(2,11,1)

In [24]: plt.plot(xx2, yy2)
plt.xlabel("Number of clusters")
plt.ylabel("Silhouette score")

plt.show()
Based on the silhouette plot, the score is highest when 2 clusters are used. The second-best value is reached at
k = 8 and the third-best at k = 7, so 7 or 8 clusters can be used.
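
The exercise also mentions the gap statistic, which scikit-learn does not provide out of the box. Below is a rough sketch, assuming a uniform reference distribution drawn inside the bounding box of the scaled data; the helper gap_statistic and the parameter B (number of reference sets) are hypothetical additions, not part of the original notebook. Larger gap values indicate better-separated clusterings.

In [ ]: # Rough gap-statistic sketch: compare log(WSS) of the data with log(WSS) of
# B uniform reference data sets drawn inside the bounding box of x_scaled
def gap_statistic(X, kmax=10, B=10, random_state=123):
    rng = np.random.default_rng(random_state)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps = []
    for k in range(1, kmax + 1):
        # within-cluster dispersion of the real data
        Wk = KMeans(n_clusters=k, random_state=random_state).fit(X).inertia_
        # expected dispersion under the uniform reference data sets
        ref_Wk = [KMeans(n_clusters=k, random_state=random_state)
                  .fit(rng.uniform(mins, maxs, size=X.shape)).inertia_
                  for _ in range(B)]
        gaps.append(np.mean(np.log(ref_Wk)) - np.log(Wk))
    return gaps

gaps = gap_statistic(x_scaled, kmax=10)
plt.plot(range(1, 11), gaps, "b-o")
plt.xlabel("Number of clusters")
plt.ylabel("Gap statistic")
plt.show()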

In [25]: # ritel (a retail-location data set from a separate exercise) is assumed to be
# available from an earlier session; it is not read anywhere in this notebook
BBox = ((ritel.Longitude.min(), ritel.Longitude.max(),
         ritel.Latitude.min(), ritel.Latitude.max()))

In [26]: BBox

Out[26]: (110.03157900000001, 110.91825700000001, -7.340652, -6.891375)

In [27]: semarang = plt.imread("semarang.png")

---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-27-1e4f7780defc> in <module>
----> 1 semarang = plt.imread("semarang.png")

~\anaconda3\lib\site-packages\matplotlib\pyplot.py in imread(fname, format)


2059 @docstring.copy(matplotlib.image.imread)
2060 def imread(fname, format=None):
-> 2061 return matplotlib.image.imread(fname, format)
2062
2063

~\anaconda3\lib\site-packages\matplotlib\image.py in imread(fname, format)


1472 fd = BytesIO(request.urlopen(fname).read())
1473 return _png.read_png(fd)
-> 1474 with cbook.open_file_cm(fname, "rb") as file:
1475 return _png.read_png(file)
1476

~\anaconda3\lib\contextlib.py in __enter__(self)
111 del self.args, self.kwds, self.func
112 try:
--> 113 return next(self.gen)
114 except StopIteration:
115 raise RuntimeError("generator didn't yield") from None

~\anaconda3\lib\site-packages\matplotlib\cbook\__init__.py in open_file_cm(path_or_file, mode, encoding)
    416 def open_file_cm(path_or_file, mode="r", encoding=None):
    417     r"""Pass through file objects and context-manage `.PathLike`\s."""
--> 418     fh, opened = to_filehandle(path_or_file, mode, True, encoding)
    419     if opened:
    420         with fh:

~\anaconda3\lib\site-packages\matplotlib\cbook\__init__.py in to_filehandle(fname, flag, return_opened, encoding)
401 fh = bz2.BZ2File(fname, flag)
402 else:
--> 403 fh = open(fname, flag, encoding=encoding)
404 opened = True
405 elif hasattr(fname, 'seek'):

FileNotFoundError: [Errno 2] No such file or directory: 'semarang.png'


In [28]: fig, ax = plt.subplots(figsize=(15,15))
ax.scatter(ritel.Longitude, ritel.Latitude, zorder=1, alpha=1, c=ritel.kluster, s=20)
ax.set_title('Plotting Spatial Data on Semarang Map')
ax.set_xlim(110.0300, 110.919)
ax.set_ylim(-7.339, -6.890)
ax.imshow(semarang, zorder=0, extent=BBox, aspect='equal', alpha=0.3)

---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-28-833c51c5e732> in <module>
4 ax.set_xlim(110.0300, 110.919)
5 ax.set_ylim(-7.339, -6.890)
----> 6 ax.imshow(semarang, zorder=0, extent = BBox, aspect= 'equal', alpha = 0.3)

NameError: name 'semarang' is not defined

In [29]: plt.scatter(ritel.Longitude, ritel.Latitude, zorder=1, alpha=0.8, c='blue', s=15)

Out[29]: <matplotlib.collections.PathCollection at 0x13688f19fd0>


