Kmeans Sklearn

UAS Data Intelligent
Muhammad Bariklana (18611079)
Problem A.1
In [10]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
In [11]: from sklearn.cluster import KMeans

from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler
In [15]: # Read the data
student = pd.read_csv('student-mat.csv', sep=";")
student.head()
Out[15]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel free
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4
1 GP F 17 U GT3 T 1 1 at_home other ... 5
2 GP F 15 U LE3 T 1 1 at_home other ... 4
3 GP F 15 U GT3 T 4 2 health services ... 3
4 GP F 16 U GT3 T 3 3 other other ... 4
5 rows × 33 columns
In [13]: student.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 school 395 non-null object
1 sex 395 non-null object
2 age 395 non-null int64
3 address 395 non-null object
4 famsize 395 non-null object
5 Pstatus 395 non-null object
6 Medu 395 non-null int64
7 Fedu 395 non-null int64
8 Mjob 395 non-null object
9 Fjob 395 non-null object
10 reason 395 non-null object
11 guardian 395 non-null object
12 traveltime 395 non-null int64
13 studytime 395 non-null int64
14 failures 395 non-null int64
15 schoolsup 395 non-null object
16 famsup 395 non-null object
17 paid 395 non-null object
18 activities 395 non-null object
19 nursery 395 non-null object
20 higher 395 non-null object
21 internet 395 non-null object
22 romantic 395 non-null object
23 famrel 395 non-null int64
24 freetime 395 non-null int64
25 goout 395 non-null int64
26 Dalc 395 non-null int64
27 Walc 395 non-null int64
28 health 395 non-null int64
29 absences 395 non-null int64
30 G1 395 non-null int64
dtypes: int64(16), object(17)
memory usage: 102.0+ KB
What is the data size?
In [16]: student.shape
Out[16]: (395, 33)
What is the percentage ratio of female and male students?

In [20]: student[student["sex"].str.contains(r'F')]
Out[20]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel f
0 GP F 18 U GT3 A 4 4 at_home teacher ... 4
1 GP F 17 U GT3 T 1 1 at_home other ... 5
2 GP F 15 U LE3 T 1 1 at_home other ... 4
3 GP F 15 U GT3 T 4 2 health services ... 3
4 GP F 16 U GT3 T 3 3 other other ... 4
... ... ... ... ... ... ... ... ... ... ... ... ...
385 MS F 18 R GT3 T 2 2 at_home other ... 5
386 MS F 18 R GT3 T 4 4 teacher at_home ... 4
387 MS F 19 R GT3 T 2 3 services other ... 5
388 MS F 18 U LE3 T 3 1 teacher services ... 4
389 MS F 18 U GT3 T 1 1 other other ... 1
208 rows × 33 columns
What is the percentage ratio of female and male students?
In [21]: student[student["sex"].str.contains(r'M')]
Out[21]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel f
5 GP M 16 U LE3 T 4 3 services other ... 5
6 GP M 16 U LE3 T 2 2 other other ... 4
8 GP M 15 U LE3 A 3 2 services other ... 4
9 GP M 15 U GT3 T 3 4 other other ... 5
12 GP M 15 U LE3 T 4 4 health services ... 4
... ... ... ... ... ... ... ... ... ... ... ... ...
390 MS M 20 U LE3 A 2 2 services services ... 5
391 MS M 17 U LE3 T 3 1 services services ... 2
392 MS M 21 R GT3 T 1 1 other other ... 5
393 MS M 18 R LE3 T 3 2 services other ... 4
394 MS M 19 U LE3 T 1 1 other at_home ... 3
187 rows × 33 columns
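The two filters return the female and male rows separately, but the percentage split can be read from a single call. A minimal sketch, assuming the two filtered frames contain 208 (F) and 187 (M) rows; the `sex` Series is a stand-in built from those counts, and with the real file it would simply be `student["sex"]`:

```python
import pandas as pd

# Stand-in for student["sex"], rebuilt from the assumed row counts
# (208 female, 187 male); replace with student["sex"] on the real data.
sex = pd.Series(["F"] * 208 + ["M"] * 187, name="sex")

# normalize=True returns proportions instead of raw counts
pct = sex.value_counts(normalize=True).mul(100).round(2)
print(pct)
```

`value_counts` avoids the regular-expression match of `str.contains`, which would also pick up any value that merely contains the letter.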
Use the information on family size and the parents' cohabitation status: what is the
probability that a student is not the only child in their family?
In [61]: databaru = student[(student["famsize"] == "GT3") & (student["Pstatus"] == "T")]
databaru.shape
Out[61]: (260, 33)
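The filter only counts rows; the probability is that count divided by the number of students. A hedged sketch of the arithmetic, assuming that a family with more than three members (`GT3`) whose parents live together (`T`) must include at least two children. The toy frame is made up purely to show the boolean-mask idiom:

```python
import pandas as pd

# Counts taken from the notebook outputs: Out[61] and Out[16]
n_match, n_total = 260, 395
p_not_only_child = n_match / n_total
print(f"{p_not_only_child:.3f}")

# Same idea on a DataFrame: the mean of a boolean mask is the share of True rows.
# `toy` is illustrative data, not the student file.
toy = pd.DataFrame({"famsize": ["GT3", "GT3", "LE3", "GT3"],
                    "Pstatus": ["T", "A", "T", "T"]})
mask = (toy["famsize"] == "GT3") & (toy["Pstatus"] == "T")
toy_share = mask.mean()
```

Under this assumption the figure is a lower bound, since students in other family configurations may also have siblings.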
Below is the information about students' parents.
Read the metadata to find variable definitions.

In [22]: # The number of mothers who work in health care is more than the number of fathers who work in the same field. (TRUE)
student[student["Mjob"].str.contains(r'health')]
Out[22]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel free
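3 GP F 15 U GT3 T 4 2 health services ... 3
12 GP M 15 U LE3 T 4 4 health services ... 4
15 GP F 16 U GT3 T 4 4 health other ... 4
19 GP M 16 U LE3 T 4 3 health other ... 3
21 GP M 15 U GT3 T 4 4 health health ... 5
27 GP M 15 U GT3 T 4 2 health services ... 2
30 GP M 15 U GT3 T 4 4 health services ... 5
47 GP M 16 U GT3 T 4 3 health services ... 4
51 GP F 15 U LE3 T 4 2 health other ... 4
52 GP M 15 U LE3 A 4 2 health health ... 5
60 GP F 16 R GT3 T 4 4 health teacher ... 2
68 GP F 15 R LE3 T 2 2 health services ... 4
109 GP F 16 U LE3 T 4 4 health health ... 5
114 GP M 15 R GT3 T 2 1 health services ... 5
123 GP M 16 U GT3 T 4 4 health other ... 3
146 GP F 15 U GT3 T 3 2 health services ... 3
167 GP F 16 U GT3 T 4 2 health services ... 4
169 GP F 16 U GT3 T 4 4 health health ... 4
188 GP F 17 U GT3 A 3 3 health other ... 3
200 GP F 16 U GT3 T 4 3 health other ... 4
230 GP F 17 U LE3 T 4 3 health other ... 3
233 GP M 16 U GT3 T 4 4 health other ... 4
240 GP M 17 U LE3 T 4 3 health other ... 2
255 GP M 17 U LE3 T 1 1 health other ... 4
300 GP F 18 U LE3 A 4 4 health other ... 4
351 MS M 17 U GT3 T 3 3 health other ... 4
376 MS F 20 U GT3 T 4 2 health other ... 5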
In [23]: student[student["Fjob"].str.contains(r'health')]
Out[23]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel fre
10 GP F 15 U GT3 T 4 4 teacher health ... 3
21 GP M 15 U GT3 T 4 4 health health ... 5
24 GP F 15 R GT3 T 2 4 services health ... 4
38 GP F 15 R GT3 T 3 4 services health ... 4
52 GP M 15 U LE3 A 4 2 health health ... 5
57 GP M 15 U GT3 T 4 4 teacher health ... 3
63 GP F 16 U GT3 T 4 3 teacher health ... 3
89 GP M 16 U LE3 A 4 4 teacher health ... 4
94 GP M 15 U LE3 T 2 2 services health ... 4
105 GP F 15 U GT3 A 3 3 other health ... 4
109 GP F 16 U LE3 T 4 4 health health ... 5
122 GP F 16 U LE3 T 2 4 other health ... 4
217 GP M 18 U LE3 T 3 3 services health ... 3
274 GP F 17 U GT3 T 2 4 at_home health ... 4
314 GP F 19 U GT3 T 1 1 at_home health ... 4
In [30]: # Fewer than 10% of mothers have primary education or below. (FALSE: 62 of 395 is about 15.7%)
primary = student[(student["Medu"] <= 1)]
primary.shape
Out[30]: (62, 33)
In [102]: 62/395*100
Out[102]: 15.69620253164557
In [31]: # Most fathers have higher education.
nol = student[(student["Fedu"] == 0)]
nol.shape
Out[31]: (2, 33)
In [32]: satu = student[(student["Fedu"] == 1)]

satu.shape
Out[32]: (82, 33)
In [33]: dua = student[(student["Fedu"] == 2)]

dua.shape
Out[33]: (115, 33)
In [34]: tiga = student[(student["Fedu"] == 3)]

tiga.shape
Out[34]: (100, 33)
In [35]: empat = student[(student["Fedu"] == 4)]

empat.shape
Out[35]: (96, 33)
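The five filters can be summarised as shares of the whole sample, which makes the "most fathers" claim easy to check. A minimal sketch, assuming the counts printed in Out[31] to Out[35]; on the real data the same table would come from `student["Fedu"].value_counts().sort_index()`:

```python
import pandas as pd

# Counts per Fedu level, copied from Out[31]..Out[35]
fedu_counts = pd.Series({0: 2, 1: 82, 2: 115, 3: 100, 4: 96}, name="Fedu")

# Share of each education level in percent
fedu_share = (fedu_counts / fedu_counts.sum() * 100).round(1)
print(fedu_share)

# Level with the largest share
most_common_level = fedu_share.idxmax()
```

With these counts, level 4 (higher education) accounts for roughly a quarter of fathers, so it is not a majority.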

What can be inferred from the two-tailed paired t-test for the first and second grades?
In [38]: import numpy as np

from scipy import stats
import matplotlib.pyplot as plt
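In [41]: # Paired test for the difference in means of two populations
# NOTE: G1 and G2 below are simulated normal samples, not the student["G1"] and student["G2"] columns
np.random.seed(456)
G1 = stats.norm.rvs(loc=8,scale=10,size=50)
G2 = stats.norm.rvs(loc=8,scale=10,size=50)
a, b = stats.ttest_rel(G1,G2)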
In [42]: a, b
Out[42]: (-0.22111035345211244, 0.8259253278353127)
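The cell above draws G1 and G2 from a random normal generator, so its p-value says nothing about the students. A hedged sketch of how the same `stats.ttest_rel` call could be pointed at grade columns instead; `paired_grade_test` and the toy frame are illustrative names and data, not part of the notebook:

```python
import pandas as pd
from scipy import stats

def paired_grade_test(df, first="G1", second="G2"):
    """Two-sided paired t-test between two grade columns of the same students."""
    result = stats.ttest_rel(df[first], df[second])
    return result.statistic, result.pvalue

# Toy data standing in for the student frame (hypothetical values)
toy = pd.DataFrame({"G1": [10, 12, 9, 15, 11, 14],
                    "G2": [11, 12, 8, 16, 13, 14]})
t_stat, p_value = paired_grade_test(toy)
print(t_stat, p_value)
```

On the real data the call would be `paired_grade_test(student)`; a p-value below the chosen significance level would indicate that the mean grade differs between the two periods.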
What can be inferred from the Kendall Tau correlation test for study time and
frequency of going out with friends?
In [49]: # Pearson correlation test (note: the question asks for Kendall's tau, see stats.kendalltau)
data1 = student["studytime"]
data2 = student["goout"]
stat, p = stats.pearsonr(data1, data2)
stat, p
Out[49]: (-0.06390367501441135, 0.2050369462596782)
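The question asks for Kendall's tau, while the cell above computes Pearson's r. SciPy provides the rank-based test as `stats.kendalltau`, which suits ordinal variables such as `studytime` and `goout`. A minimal sketch on made-up data, since the result on the real columns is not shown in the notebook:

```python
import pandas as pd
from scipy import stats

# Toy ordinal data (hypothetical); on the real file use
# student["studytime"] and student["goout"].
toy = pd.DataFrame({"studytime": [1, 2, 2, 3, 4, 1, 3, 2],
                    "goout":     [4, 3, 3, 2, 1, 5, 2, 4]})

# Returns the tau statistic and the two-sided p-value
tau, p_value = stats.kendalltau(toy["studytime"], toy["goout"])
print(tau, p_value)
```

A tau close to zero with a large p-value would mean there is no evidence of a monotonic association between the two variables.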
What can be inferred from the normality test for the final grade?
In [55]: # Example of the Shapiro-Wilk test
G3 = student["G3"]
stat, p = stats.shapiro(G3)
# Optional: only used to display the result
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably Normal')
else:
    print('Probably not Normal')
stat=0.929, p=0.000
Probably not Normal
What can be inferred from one way ANOVA test for the grades?
In [56]: G1 = student["G1"]
G2 = student["G2"]
G3 = student["G3"]
stats.f_oneway(G1, G2, G3)
Out[56]: F_onewayResult(statistic=1.5873114383378757, pvalue=0.20491015955616096)

Find the error in the following list comprehension, which creates a new list of the
reasons for choosing the school in which the term "course" is substituted with
"curriculum".
In [65]: list_baru = [baris if "course" not in baris else "curriculum" for baris in student['reason']]
print(list_baru)
['curriculum', 'curriculum', 'other', 'home', 'home', 'reputation', 'home',

'home', 'home', 'home', 'reputation', 'reputation', 'curriculum', 'curriculu
m', 'home', 'home', 'reputation', 'reputation', 'curriculum', 'home', 'reputa
tion', 'other', 'curriculum', 'reputation', 'curriculum', 'home', 'home', 'ot
her', 'home', 'home', 'home', 'reputation', 'curriculum', 'curriculum', 'hom
e', 'other', 'home', 'reputation', 'curriculum', 'reputation', 'home', 'hom
e', 'curriculum', 'curriculum', 'curriculum', 'curriculum', 'home', 'reputati
on', 'home', 'other', 'curriculum', 'other', 'other', 'curriculum', 'other',
'other', 'reputation', 'reputation', 'home', 'curriculum', 'other', 'curricul
um', 'reputation', 'home', 'reputation', 'curriculum', 'reputation', 'curricu
lum', 'reputation', 'reputation', 'reputation', 'curriculum', 'reputation',
'reputation', 'home', 'home', 'curriculum', 'reputation', 'home', 'curriculu
m', 'curriculum', 'home', 'reputation', 'home', 'home', 'reputation', 'curric
ulum', 'reputation', 'reputation', 'reputation', 'home', 'reputation', 'hom
e', 'home', 'reputation', 'home', 'reputation', 'curriculum', 'reputation',
'curriculum', 'other', 'other', 'curriculum', 'home', 'curriculum', 'reputati
on', 'curriculum', 'home', 'home', 'other', 'curriculum', 'reputation', 'hom
e', 'curriculum', 'reputation', 'curriculum', 'reputation', 'home', 'curricul
um', 'reputation', 'curriculum', 'home', 'curriculum', 'curriculum', 'home',
'home', 'home', 'curriculum', 'reputation', 'curriculum', 'curriculum', 'curr
iculum', 'curriculum', 'curriculum', 'curriculum', 'curriculum', 'curriculu
m', 'curriculum', 'curriculum', 'curriculum', 'curriculum', 'reputation', 'cu
rriculum', 'curriculum', 'home', 'curriculum', 'home', 'curriculum', 'curricu
lum', 'curriculum', 'curriculum', 'curriculum', 'reputation', 'home', 'curric
ulum', 'curriculum', 'reputation', 'curriculum', 'curriculum', 'curriculum',
'curriculum', 'curriculum', 'curriculum', 'curriculum', 'curriculum', 'curric
ulum', 'curriculum', 'home', 'home', 'reputation', 'curriculum', 'reputatio
n', 'reputation', 'home', 'reputation', 'curriculum', 'reputation', 'reputati
on', 'other', 'curriculum', 'home', 'home', 'reputation', 'reputation', 'repu
tation', 'other', 'other', 'curriculum', 'reputation', 'home', 'curriculum',
'curriculum', 'other', 'reputation', 'home', 'curriculum', 'home', 'home', 'h
ome', 'reputation', 'home', 'reputation', 'curriculum', 'reputation', 'reputa
tion', 'home', 'curriculum', 'other', 'home', 'reputation', 'reputation', 'ho
me', 'reputation', 'home', 'other', 'reputation', 'reputation', 'home', 'hom
e', 'curriculum', 'reputation', 'reputation', 'other', 'home', 'home', 'reput
ation', 'curriculum', 'reputation', 'curriculum', 'curriculum', 'reputation',
'curriculum', 'reputation', 'reputation', 'home', 'reputation', 'home', 'hom
e', 'curriculum', 'reputation', 'curriculum', 'curriculum', 'curriculum', 'cu
rriculum', 'curriculum', 'curriculum', 'curriculum', 'other', 'curriculum',
'other', 'curriculum', 'reputation', 'other', 'curriculum', 'curriculum', 'cu
rriculum', 'reputation', 'reputation', 'home', 'curriculum', 'home', 'curricu
lum', 'curriculum', 'home', 'home', 'reputation', 'other', 'reputation', 'rep
utation', 'reputation', 'home', 'reputation', 'home', 'home', 'reputation',
'curriculum', 'home', 'home', 'reputation', 'curriculum', 'home', 'home', 're
putation', 'home', 'curriculum', 'reputation', 'other', 'reputation', 'reputa
tion', 'reputation', 'home', 'reputation', 'reputation', 'reputation', 'reput
ation', 'home', 'reputation', 'home', 'reputation', 'home', 'home', 'home',
'reputation', 'reputation', 'home', 'reputation', 'curriculum', 'reputation',
'reputation', 'reputation', 'home', 'other', 'curriculum', 'reputation', 'hom
e', 'reputation', 'curriculum', 'curriculum', 'curriculum', 'curriculum', 'cu
rriculum', 'curriculum', 'curriculum', 'curriculum', 'home', 'curriculum', 'r
eputation', 'curriculum', 'curriculum', 'curriculum', 'curriculum', 'curricul
um', 'home', 'home', 'curriculum', 'curriculum', 'home', 'home', 'home', 'hom
e', 'home', 'home', 'home', 'home', 'curriculum', 'other', 'curriculum', 'cur
riculum', 'reputation', 'curriculum', 'home', 'curriculum', 'curriculum', 'ho
me', 'home', 'curriculum', 'other', 'reputation', 'home', 'curriculum', 'curr
iculum', 'other', 'other', 'curriculum', 'curriculum', 'curriculum', 'other',
'reputation', 'curriculum', 'other', 'home', 'other', 'home', 'curriculum',
'reputation', 'home', 'curriculum', 'curriculum', 'home', 'reputation', 'hom
e', 'other', 'home', 'other', 'home', 'other', 'reputation', 'curriculum', 'c
urriculum', 'curriculum', 'curriculum', 'curriculum', 'curriculum', 'curricul
um', 'curriculum']
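The faulty comprehension from the question is not reproduced in the notebook, so the following is only a sketch of the usual pitfall: a conditional expression (before `for`, with `else`) keeps every item, whereas a filter clause (after `for`, without `else`) drops the items that do not match. The `reasons` list is a small illustrative sample:

```python
reasons = ["course", "other", "home", "reputation", "course"]

# Conditional expression: placed before `for`, needs an `else`, keeps every item
replaced = ["curriculum" if r == "course" else r for r in reasons]

# Filter clause: placed after `for`, has no `else`, drops non-matching items
only_course = [r for r in reasons if r == "course"]

print(replaced)
print(only_course)
```

For a pandas column, `student['reason'].replace("course", "curriculum")` would give the same substitution without a comprehension.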
Problem A.2
In [68]: covid = pd.read_excel('data covid INA.xlsx')
covid.head()
Out[68]:
No Provinsi Case RR CFR PR TR
0 1 Aceh 8776 0.817912 0.041 0.0437 22.9
1 2 Bali 18263 0.907354 0.029 0.0588 17.0
2 3 Banten 19161 0.552320 0.023 0.0358 27.9
3 4 Bangka Belitung 2638 0.750569 0.016 0.0956 10.5
4 5 Bengkulu 3779 0.790156 0.031 0.2049 4.9
In [76]: covid.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 No 34 non-null int64
1 Provinsi 34 non-null object
2 Case 34 non-null int64
3 RR 34 non-null float64
4 CFR 34 non-null float64
5 PR 34 non-null float64
6 TR 34 non-null float64
dtypes: float64(4), int64(2), object(1)
memory usage: 2.0+ KB
In [79]: # Select the variables to cluster
covid_x = covid.iloc[:, 3:6]
covid_x.head()
Out[79]:
RR CFR PR
0 0.817912 0.041 0.0437
1 0.907354 0.029 0.0588
2 0.552320 0.023 0.0358
3 0.750569 0.016 0.0956
4 0.790156 0.031 0.2049

In [80]: # Convert the data frame to an array
x_array = np.array(covid_x)
x_array
Out[80]: array([[0.81791249, 0.041 , 0.0437 ],

[0.90735367, 0.029 , 0.0588 ],
[0.55231982, 0.023 , 0.0358 ],
[0.75056861, 0.016 , 0.0956 ],
[0.79015613, 0.031 , 0.2049 ],
[0.66731634, 0.022 , 0.0305 ],
[0.90322915, 0.017 , 0.0673 ],
[0.77294398, 0.017 , 0.0551 ],
[0.8493994 , 0.013 , 1. ],
[0.68126408, 0.044 , 0.0645 ],
[0.8604622 , 0.07 , 0.1114 ],
[0.90310442, 0.009 , 0.2769 ],
[0.83986882, 0.027 , 0.1758 ],
[0.79884485, 0.027 , 0.2754 ],
[0.9056507 , 0.038 , 0.1901 ],
[0.6059322 , 0.014 , 0.0876 ],
[0.88373722, 0.025 , 0.203 ],
[0.80221843, 0.048 , 0.0859 ],
[0.8169665 , 0.051 , 0.0868 ],
[0.87099891, 0.022 , 0.2515 ],
[0.73097007, 0.032 , 1. ],
[0.85203917, 0.037 , 0.1732 ],
[0.8808642 , 0.019 , 0.0268 ],
[0.86858625, 0.018 , 0.139 ],
[0.55492813, 0.03 , 0.1261 ],
[0.70549582, 0.044 , 0.0328 ],
[0.92903024, 0.024 , 1. ],
[0.83853797, 0.032 , 0.1484 ],
[0.79972518, 0.015 , 0.182 ],
[0.92090954, 0.017 , 1. ],
[0.57974589, 0.011 , 0.1702 ],
[0.78159204, 0.019 , 0.0225 ],
[0.56445993, 0.024 , 0.1005 ],
[0.88425926, 0.027 , 0.1298 ]])
In [81]: # Rescale each variable to the range [0, 1]
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x_array)
In [92]: x_scaled
Out[92]: array([[0.70503139, 0.52459016, 0.02168798],

[0.94245827, 0.32786885, 0.03713555],
[0. , 0.2295082 , 0.01360614],
[0.52626311, 0.1147541 , 0.07478261],
[0.63135049, 0.36065574, 0.18659847],
[0.30526505, 0.21311475, 0.00818414],
[0.9315095 , 0.13114754, 0.0458312 ],
[0.58565984, 0.13114754, 0.03335038],
[0.78861527, 0.06557377, 1. ],
[0.34229015, 0.57377049, 0.04296675],
[0.81798211, 1. , 0.09094629],
[0.9311784 , 0. , 0.26025575],
[0.76331577, 0.29508197, 0.15682864],
[0.65441523, 0.29508197, 0.25872123],
[0.93793763, 0.47540984, 0.1714578 ],
[0.14231724, 0.08196721, 0.06659847],
[0.87976701, 0.26229508, 0.18465473],
[0.66337059, 0.63934426, 0.06485934],
[0.7025202 , 0.68852459, 0.06578005],
[0.84595242, 0.21311475, 0.2342711 ],
[0.47423763, 0.37704918, 1. ],
[0.79562268, 0.45901639, 0.1541688 ],
[0.87214041, 0.16393443, 0.00439898],
[0.83954787, 0.14754098, 0.11918159],
[0.00692393, 0.3442623 , 0.10598465],
[0.40661472, 0.57377049, 0.01053708],
[1. , 0.24590164, 1. ],
[0.75978295, 0.37704918, 0.12879795],
[0.65675211, 0.09836066, 0.16317136],
[0.97844313, 0.13114754, 1. ],
[0.07280413, 0.03278689, 0.15109974],
[0.60861662, 0.16393443, 0. ],
[0.03222665, 0.24590164, 0.0797954 ],
[0.8811528 , 0.29508197, 0.10976982]])
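`MinMaxScaler` maps each column to [0, 1] with (x - min) / (max - min). As a quick sanity check, the first scaled RR value can be recomputed by hand; the three numbers below are copied from the `x_array` output (Aceh, the column minimum, and the column maximum):

```python
# RR for Aceh, the smallest RR (Banten) and the largest RR (Riau),
# copied from the x_array output
rr_aceh, rr_min, rr_max = 0.81791249, 0.55231982, 0.92903024

# Min-max scaling of a single value
rr_aceh_scaled = (rr_aceh - rr_min) / (rr_max - rr_min)
print(round(rr_aceh_scaled, 6))
```

The result agrees with the first entry of `x_scaled` (0.70503139) up to rounding of the printed inputs.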
Analyze the RR, CFR, and PR of each province by employing k-means clustering.
Group the provinces into 6 clusters, setting 12345 as the random state. What
function do we need for this analysis?
In [83]: # Define and configure the KMeans estimator
kmeans = KMeans(n_clusters = 6, random_state=12345)
In [84]: # Fit the clusters to the data
kmeans.fit(x_scaled)
Out[84]: KMeans(n_clusters=6, random_state=12345)
In [85]: # Show the cluster centers
print(kmeans.cluster_centers_)
[[0.59432292 0.12704918 0.06782609]

[0.81032401 0.20491803 1. ]
[0.09325617 0.19125683 0.07087809]
[0.8332951 0.27166276 0.14657654]
[0.37445243 0.57377049 0.02675192]
[0.72222607 0.71311475 0.06081841]]
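The centers are expressed in scaled units, which makes them hard to read as RR, CFR, or PR values. One way to bring them back to the original units is `MinMaxScaler.inverse_transform`. A minimal sketch on synthetic data; `X` is a random stand-in for `x_array`, and `n_init=10` is set explicitly as an assumption to keep behaviour stable across scikit-learn versions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for x_array (34 rows, 3 features)
rng = np.random.default_rng(0)
X = rng.random((34, 3))

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

km = KMeans(n_clusters=6, n_init=10, random_state=12345).fit(X_scaled)

# Map the centers from the [0, 1] scale back to the original units
centers_original = scaler.inverse_transform(km.cluster_centers_)
print(centers_original)
```

In the notebook the equivalent call would be `scaler.inverse_transform(kmeans.cluster_centers_)`.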
In [86]: # Show the cluster labels
print(kmeans.labels_)
[5 3 2 0 3 2 3 0 1 4 5 3 3 3 3 2 3 5 5 3 1 3 3 3 2 4 1 3 0 1 2 0 2 3]
In [87]: # Add a "kluster" column to the covid data frame
covid["kluster"] = kmeans.labels_
covid.head()
Out[87]:
No Provinsi Case RR CFR PR TR kluster
0 1 Aceh 8776 0.817912 0.041 0.0437 22.9 5
1 2 Bali 18263 0.907354 0.029 0.0588 17.0 3
2 3 Banten 19161 0.552320 0.023 0.0358 27.9 2
3 4 Bangka Belitung 2638 0.750569 0.016 0.0956 10.5 0
4 5 Bengkulu 3779 0.790156 0.031 0.2049 4.9 3
In [91]: # Visualize the clustering result (x axis: scaled CFR, y axis: scaled RR)
fig, ax = plt.subplots()
sct = ax.scatter(x_scaled[:,1], x_scaled[:,0], s = 100,
                 c = covid.kluster, marker = "o", alpha = 0.5)
centers = kmeans.cluster_centers_
ax.scatter(centers[:,1], centers[:,0], c='blue', s=200, alpha=0.5);
plt.title("K-Means Clustering Result")
plt.xlabel("Scaled CFR")
plt.ylabel("Scaled RR")
plt.show()
In [89]: aa = pd.DataFrame(x_scaled)
aa.head()
Out[89]:
0 1 2
0 0.705031 0.524590 0.021688
1 0.942458 0.327869 0.037136
2 0.000000 0.229508 0.013606
3 0.526263 0.114754 0.074783
4 0.631350 0.360656 0.186598

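In [90]: plt.figure(figsize=[8,8])
# Columns of aa: 0 = scaled RR, 1 = scaled CFR
sns.scatterplot(x = 0, y = 1, hue = covid.kluster,
                palette="Set2", s = 100, alpha = 0.7, data = aa)
# Plot the centers on the same axes (RR on x, CFR on y)
sns.scatterplot(x = centers[:,0], y = centers[:,1],
                color = "k", s = 200, alpha = 0.5);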
Which cluster has the highest number of members?
In [96]: covid["kluster"].value_counts()
Out[96]: 3 14
2 6
5 4
1 4
0 4
4 2
Name: kluster, dtype: int64
In the near future, the government will implement large-scale social restrictions
(PSBB/Pembatasan Sosial Berskala Besar) for Jawa and Bali. Check all correct
answers about the Jawa and Bali provinces based on the clustering result.
In [103]: covid
Out[103]:
No Provinsi Case RR CFR PR TR kluster
0 1 Aceh 8776 0.817912 0.041 0.0437 22.9 5
1 2 Bali 18263 0.907354 0.029 0.0588 17.0 3
2 3 Banten 19161 0.552320 0.023 0.0358 27.9 2
3 4 Bangka Belitung 2638 0.750569 0.016 0.0956 10.5 0
4 5 Bengkulu 3779 0.790156 0.031 0.2049 4.9 3
5 6 DI Yogyakarta 13340 0.667316 0.022 0.0305 32.8 2
6 7 DKI Jakarta 192899 0.903229 0.017 0.0673 14.9 3
7 8 Jambi 3356 0.772944 0.017 0.0551 18.2 0
8 9 Jawa Barat 89661 0.849399 0.013 1.0000 1.0 1
9 10 Jawa Tengah 86545 0.681264 0.044 0.0645 15.5 4
10 11 Jawa Timur 87797 0.860462 0.070 0.1114 9.0 5
11 12 Kalimantan Barat 3189 0.903104 0.009 0.2769 3.6 3
12 13 Kalimantan Timur 28358 0.839869 0.027 0.1758 5.7 3
13 14 Kalimantan Tengah 10042 0.798845 0.027 0.2754 3.6 3
14 15 Kalimantan Selatan 15591 0.905651 0.038 0.1901 5.3 3
15 16 Kalimantan Utara 4248 0.605932 0.014 0.0876 11.4 2
16 17 Kepulauan Riau 7139 0.883737 0.025 0.2030 4.9 3
17 18 Nusa Tenggara Barat 5860 0.802218 0.048 0.0859 11.6 5
18 19 Sumatera Selatan 12118 0.816966 0.051 0.0868 11.5 5
19 20 Sumatera Barat 23806 0.870999 0.022 0.2515 4.0 3
20 21 Sulawesi Utara 9958 0.730970 0.032 1.0000 1.0 1
21 22 Sumatera Utara 18586 0.852039 0.037 0.1732 5.8 3
22 23 Sulawesi Tenggara 8100 0.880864 0.019 0.0268 37.3 3
23 24 Sulawesi Selatan 33931 0.868586 0.018 0.1390 7.2 3
24 25 Sulawesi Tengah 3896 0.554928 0.030 0.1261 7.9 2
25 26 Lampung 6696 0.705496 0.044 0.0328 30.5 4
26 27 Riau 25532 0.929030 0.024 1.0000 1.0 1
27 28 Maluku Utara 2818 0.838538 0.032 0.1484 6.7 3
28 29 Maluku 5822 0.799725 0.015 0.1820 5.5 0
29 30 Papua Barat 6069 0.920910 0.017 1.0000 1.0 1
30 31 Papua 13380 0.579746 0.011 0.1702 5.9 2
31 32 Sulawesi Barat 2010 0.781592 0.019 0.0225 44.4 0
32 33 Nusa Tenggara Timur 2296 0.564460 0.024 0.1005 9.9 2
33 34 Gorontalo 3888 0.884259 0.027 0.1298 7.7 3
Based on the mean of each variable in every cluster.

In [99]: covid[['RR', 'kluster']].groupby(['kluster'], as_index=False).mean().sort_values(by='RR')
Out[99]:
kluster RR
2 2 0.587450
4 4 0.693380
0 0 0.776207
5 5 0.824390
1 1 0.857577
3 3 0.866231
In [98]: covid[['CFR', 'kluster']].groupby(['kluster'], as_index=False).mean().sort_values(by='CFR')
Out[98]:
kluster CFR
0 0 0.016750
2 2 0.020667
1 1 0.021500
3 3 0.025571
4 4 0.044000
5 5 0.052500
In [101]: covid[['RR', 'kluster']].groupby(['kluster'], as_index=False).mean().sort_values(by='RR')
Out[101]:
kluster RR
2 2 0.587450
4 4 0.693380
0 0 0.776207
5 5 0.824390
1 1 0.857577
3 3 0.866231
In [100]: covid[['PR', 'kluster']].groupby(['kluster'], as_index=False).mean().sort_values(by='PR')
Out[100]:
kluster PR
4 4 0.048650
5 5 0.081950
0 0 0.088800
2 2 0.091783
3 3 0.165779
1 1 1.000000
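The per-variable tables can also be produced in one pass by selecting all three columns before taking the mean. A minimal sketch using only the first five rows of the `covid` table shown earlier, so the numbers below are not the full cluster means; `covid_head` is an illustrative name:

```python
import pandas as pd

# First five provinces, copied from the covid.head() output with cluster labels
covid_head = pd.DataFrame({
    "Provinsi": ["Aceh", "Bali", "Banten", "Bangka Belitung", "Bengkulu"],
    "RR": [0.817912, 0.907354, 0.552320, 0.750569, 0.790156],
    "CFR": [0.041, 0.029, 0.023, 0.016, 0.031],
    "PR": [0.0437, 0.0588, 0.0358, 0.0956, 0.2049],
    "kluster": [5, 3, 2, 0, 3],
})

# One row per cluster, one column per variable
cluster_means = covid_head.groupby("kluster")[["RR", "CFR", "PR"]].mean()
print(cluster_means)
```

On the full frame, `covid.groupby("kluster")[["RR", "CFR", "PR"]].mean()` would put the three tables side by side for easier comparison.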
Visualization
In [109]: Provinsi = covid["Provinsi"]

Kluster = covid["kluster"]
In [115]: # Use plt.bar(x, height, color=...) to draw the number of provinces in each cluster
counts = Kluster.value_counts().sort_index()
plt.bar(counts.index, counts.values, color="c")
plt.title("Number of Provinces per Cluster")
plt.xlabel("Cluster")
plt.ylabel("Number of provinces")
plt.show()
Exercises
1. Add an analysis to determine the best number of clusters using the WSS, silhouette, or gap
statistic method.
2. Create a map visualization of the clustering results.
In [116]: # Return the WSS score for each k from 1 to kmax
def calculate_WSS(points, kmax):
    sse = []
    for k in range(1, kmax+1):
        kmeans = KMeans(n_clusters = k).fit(points)
        centroids = kmeans.cluster_centers_
        pred_clusters = kmeans.predict(points)
        curr_sse = 0
        # add the squared Euclidean distance of each point from its cluster center to the current WSS
        # NOTE: only the first two features (columns 0 and 1) enter this sum
        for i in range(len(points)):
            curr_center = centroids[pred_clusters[i]]
            curr_sse += (points[i, 0] - curr_center[0]) ** 2 + (points[i, 1] - curr_center[1]) ** 2
        sse.append(curr_sse)
    return sse
In [117]: yy = calculate_WSS(x_scaled, 15)

yy
Out[117]: [4.551804861427149,
4.364361585947376,
2.2004193180017224,
1.3044658871052162,
1.02667947037841,
0.8365685591334261,
0.6970081493729405,
0.5068498726364605,
0.38836713771764475,
0.33791062338638983,
0.26167833918931815,
0.2339398836613525,
0.15760260406598947,
0.14533961930401096,
0.10248637498594232]
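scikit-learn already stores the within-cluster sum of squares of a fitted model in its `inertia_` attribute, computed over all features. A hedged sketch on synthetic data (`points` is a random stand-in for `x_scaled`) showing that a manual sum over every column matches it:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for x_scaled (34 rows, 3 features)
rng = np.random.default_rng(0)
points = rng.random((34, 3))

km = KMeans(n_clusters=6, n_init=10, random_state=12345).fit(points)

# Manual WSS: squared distance of each point to its assigned center, all features
diffs = points - km.cluster_centers_[km.labels_]
manual_wss = float((diffs ** 2).sum())

print(manual_wss, km.inertia_)
```

Collecting `inertia_` for each k, with a fixed `random_state`, would give a reproducible elbow curve that accounts for all three variables.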
In [118]: xx = np.arange(1, 16, 1)

xx
Out[118]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])

In [119]: plt.figure(figsize=[7,7])
plt.plot(xx, yy, "b-o")
plt.show()
Based on the WSS plot, the curve has flattened out sufficiently by 7 or 8 clusters, so
7 clusters can be used.
In [22]: sil = []
kmax = 10
# dissimilarity is not defined for a single cluster, so the minimum number of clusters is 2
for k in range(2, kmax+1):
    kmeans = KMeans(n_clusters = k, random_state = 123).fit(x_scaled)
    labels = kmeans.labels_
    sil.append(silhouette_score(x_scaled, labels, metric = 'euclidean'))
In [23]: yy2 = sil

xx2 = np.arange(2,11,1)
In [24]: plt.plot(xx2, yy2)
plt.xlabel("Number of clusters")
plt.ylabel("silhouette_score")
plt.show()
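Reading the best k off the plot can be complemented by picking the maximum score programmatically. A minimal sketch on synthetic, well-separated blobs (not the covid data), where the names `ks`, `scores`, and `best_k` are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three clearly separated synthetic groups
X, _ = make_blobs(n_samples=90, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

ks = list(range(2, 8))
scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=123).fit_predict(X)
    scores.append(silhouette_score(X, labels, metric="euclidean"))

# k with the highest average silhouette score
best_k = ks[int(np.argmax(scores))]
print(best_k)
```

In the notebook the same selection would be `xx2[int(np.argmax(yy2))]`.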
Based on the silhouette plot, the score is highest with 2 clusters. The second-best value is reached at
k = 8 and the third-best at k = 7, so 7 or 8 clusters can be used.
In [25]: BBox = ((ritel.Longitude.min(), ritel.Longitude.max(),
                  ritel.Latitude.min(), ritel.Latitude.max()))
In [26]: BBox
Out[26]: (110.03157900000001, 110.91825700000001, -7.340652, -6.891375)
In [27]: semarang = plt.imread("semarang.png")
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-27-1e4f7780defc> in <module>
----> 1 semarang = plt.imread("semarang.png")
~\anaconda3\lib\site-packages\matplotlib\pyplot.py in imread(fname, format)

2059 @docstring.copy(matplotlib.image.imread)
2060 def imread(fname, format=None):
-> 2061 return matplotlib.image.imread(fname, format)
2062
2063
~\anaconda3\lib\site-packages\matplotlib\image.py in imread(fname, format)

1472 fd = BytesIO(request.urlopen(fname).read())
1473 return _png.read_png(fd)
-> 1474 with cbook.open_file_cm(fname, "rb") as file:
1475 return _png.read_png(file)
1476
~\anaconda3\lib\contextlib.py in __enter__(self)
111 del self.args, self.kwds, self.func
112 try:
--> 113 return next(self.gen)
114 except StopIteration:
115 raise RuntimeError("generator didn't yield") from None
~\anaconda3\lib\site-packages\matplotlib\cbook\__init__.py in open_file_cm(path_or_file, mode, encoding)
416 def open_file_cm(path_or_file, mode="r", encoding=None):
417 r"""Pass through file objects and context-manage `.PathLike`\s."""
--> 418 fh, opened = to_filehandle(path_or_file, mode, True, encoding)
419 if opened:
420 with fh:
~\anaconda3\lib\site-packages\matplotlib\cbook\__init__.py in to_filehandle(fname, flag, return_opened, encoding)
401 fh = bz2.BZ2File(fname, flag)
402 else:
--> 403 fh = open(fname, flag, encoding=encoding)
404 opened = True
405 elif hasattr(fname, 'seek'):
FileNotFoundError: [Errno 2] No such file or directory: 'semarang.png'

In [28]: fig, ax = plt.subplots(figsize = (15,15))
ax.scatter(ritel.Longitude, ritel.Latitude, zorder=1, alpha= 1, c=ritel.kluster, s=20)
ax.set_title('Plotting Spatial Data on Semarang Map')
ax.set_xlim(110.0300, 110.919)
ax.set_ylim(-7.339, -6.890)
ax.imshow(semarang, zorder=0, extent = BBox, aspect= 'equal', alpha = 0.3)
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-28-833c51c5e732> in <module>
4 ax.set_xlim(110.0300, 110.919)
5 ax.set_ylim(-7.339, -6.890)
----> 6 ax.imshow(semarang, zorder=0, extent = BBox, aspect= 'equal', alpha = 0.3)
NameError: name 'semarang' is not defined
In [29]: plt.scatter(ritel.Longitude, ritel.Latitude, zorder=1, alpha= 0.8, c='blue', s=15)
Out[29]: <matplotlib.collections.PathCollection at 0x13688f19fd0>


