Kmeans Sklearn
Problem A.1
In [10]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
Out[15]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel free
5 rows × 33 columns
In [13]: student.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 school 395 non-null object
1 sex 395 non-null object
2 age 395 non-null int64
3 address 395 non-null object
4 famsize 395 non-null object
5 Pstatus 395 non-null object
6 Medu 395 non-null int64
7 Fedu 395 non-null int64
8 Mjob 395 non-null object
9 Fjob 395 non-null object
10 reason 395 non-null object
11 guardian 395 non-null object
12 traveltime 395 non-null int64
13 studytime 395 non-null int64
14 failures 395 non-null int64
15 schoolsup 395 non-null object
16 famsup 395 non-null object
17 paid 395 non-null object
18 activities 395 non-null object
19 nursery 395 non-null object
20 higher 395 non-null object
21 internet 395 non-null object
22 romantic 395 non-null object
23 famrel 395 non-null int64
24 freetime 395 non-null int64
25 goout 395 non-null int64
26 Dalc 395 non-null int64
27 Walc 395 non-null int64
28 health 395 non-null int64
29 absences 395 non-null int64
30 G1 395 non-null int64
31 G2 395 non-null int64
32 G3 395 non-null int64
dtypes: int64(16), object(17)
memory usage: 102.0+ KB
In [16]: student.shape
Out[20]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel f
... ... ... ... ... ... ... ... ... ... ... ... ...
In [21]: student[student["sex"].str.contains(r'M')]
Out[21]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel f
... ... ... ... ... ... ... ... ... ... ... ... ...
Using the information on family size and the parents' cohabitation status: what is the
probability that a student is not the only child in their family?
Out[22]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel free
34 rows × 33 columns
In [23]: student[student["Fjob"].str.contains(r'health')]
Out[23]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob ... famrel fre
18 rows × 33 columns
In [102]: 62/395*100
Out[102]: 15.69620253164557
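from scipy import stats
np.random.seed(456)
# Two simulated samples of 50 grades drawn from the same normal distribution
G1 = stats.norm.rvs(loc=8, scale=10, size=50)
G2 = stats.norm.rvs(loc=8, scale=10, size=50)
# Paired t-test: a is the t statistic, b is the p-value
a, b = stats.ttest_rel(G1, G2)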
In [42]: a, b
What can be inferred from the Kendall Tau correlation test for study time and
frequency of going out with friends?
data1 = student["studytime"]
data2 = student["goout"]
# Kendall's tau rank correlation, as named in the question
stat, p = stats.kendalltau(data1, data2)
stat, p
What can be inferred from the normality test for the final grade?
G3 = student["G3"]
stat, p = stats.shapiro(G3)
stat=0.929, p=0.000
Probably not Normal
What can be inferred from one way ANOVA test for the grades?
In [56]: G1 = student["G1"]
G2 = student["G2"]
G3 = student["G3"]
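The cell above only selects the three grade columns. A minimal sketch of how the one-way ANOVA itself might be run with `scipy.stats.f_oneway` is shown here. The arrays `g1`, `g2`, `g3` are synthetic stand-ins (an assumption, not the student data), and the 0.05 threshold is a conventional choice rather than something stated in the notebook. Note that one-way ANOVA assumes independent groups, which is itself a simplification when the three grades come from the same students.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the three grade columns (assumption: not the real data).
rng = np.random.default_rng(0)
g1 = rng.normal(loc=11.0, scale=3.0, size=395)
g2 = rng.normal(loc=10.7, scale=3.5, size=395)
g3 = rng.normal(loc=10.4, scale=4.5, size=395)

# One-way ANOVA: the null hypothesis says all three group means are equal.
f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f"F={f_stat:.3f}, p={p_value:.3f}")

alpha = 0.05  # conventional significance level (assumption)
if p_value < alpha:
    print("Reject H0: at least one mean differs")
else:
    print("Fail to reject H0: no evidence that the means differ")
```

With the real data, `student["G1"]`, `student["G2"]` and `student["G3"]` would take the place of the synthetic arrays.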
In [65]: list_baru = [baris if "course" not in baris else "curriculum" for baris in student['reason']]
print(list_baru)
Out[68]:
No Provinsi Case RR CFR PR TR
In [76]: covid.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 No 34 non-null int64
1 Provinsi 34 non-null object
2 Case 34 non-null int64
3 RR 34 non-null float64
4 CFR 34 non-null float64
5 PR 34 non-null float64
6 TR 34 non-null float64
dtypes: float64(4), int64(2), object(1)
memory usage: 2.0+ KB
Out[79]:
RR CFR PR
Analyze the RR, CFR, and PR of each province by employing k-means clustering.
Group the province into 6 clusters by setting 12345 as the random state. What
function do we need to process this analysis?
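The function the question points to is `KMeans` from `sklearn.cluster`. The sketch below shows one plausible way to produce `x_scaled` and the fitted `kmeans` object used in the following cells. The data frame `demo` is synthetic, and the use of `MinMaxScaler` is an assumption, since the scaling cell is not shown in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 34-province table (assumption: not the real data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "RR": rng.uniform(0.5, 0.95, size=34),
    "CFR": rng.uniform(0.01, 0.06, size=34),
    "PR": rng.uniform(0.03, 0.30, size=34),
})

# Rescale the three rates to [0, 1] so that no column dominates the distance.
# MinMaxScaler is an assumption; the notebook does not show which scaler it used.
x_scaled = MinMaxScaler().fit_transform(demo[["RR", "CFR", "PR"]])

# Six clusters with the random state given in the question.
kmeans = KMeans(n_clusters=6, random_state=12345, n_init=10).fit(x_scaled)
print(kmeans.labels_)
print(kmeans.cluster_centers_.shape)  # (6, 3)
```

Setting `n_init` explicitly keeps the behaviour the same across scikit-learn versions, where the default has changed.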
[5 3 2 0 3 2 3 0 1 4 5 3 3 3 3 2 3 5 5 3 1 3 3 3 2 4 1 3 0 1 2 0 2 3]
In [87]: # Add a "kluster" column to the covid data frame
covid["kluster"] = kmeans.labels_
covid.head()
Out[87]:
No Provinsi Case RR CFR PR TR kluster
fig, ax = plt.subplots()
sct = ax.scatter(x_scaled[:,1], x_scaled[:,0], s = 100,
c = covid.kluster, marker = "o", alpha = 0.5)
centers = kmeans.cluster_centers_
ax.scatter(centers[:,1], centers[:,0], c='blue', s=200, alpha=0.5);
plt.show()
In [89]: aa = pd.DataFrame(x_scaled)
aa.head()
Out[89]:
0 1 2
sns.scatterplot(x = centers[:,1], y = centers[:,0],
color = "k", s = 200, alpha = 0.5);
In [96]: covid["kluster"].value_counts()
Out[96]: 3 14
2 6
5 4
1 4
0 4
4 2
Name: kluster, dtype: int64
In the near future, the government will implement the large-scale restriction
(PSBB/Pembatasan Sosial Berskala Besar) for Jawa and Bali. Check all correct
answers based on the result of clustering about Jawa and Bali province.
In [103]: covid
Out[103]:
No Provinsi Case RR CFR PR TR kluster
Out[99]:
kluster RR
2 2 0.587450
4 4 0.693380
0 0 0.776207
5 5 0.824390
1 1 0.857577
3 3 0.866231
Out[98]:
kluster CFR
0 0 0.016750
2 2 0.020667
1 1 0.021500
3 3 0.025571
4 4 0.044000
5 5 0.052500
Out[100]:
kluster PR
4 4 0.048650
5 5 0.081950
0 0 0.088800
2 2 0.091783
3 3 0.165779
1 1 1.000000
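The cells that produced the per-cluster tables above are not shown. One way such a table could be built is to group by `kluster`, take the mean of a column, and sort by it. The sketch below uses a small made-up data frame `demo` (an assumption) and is not necessarily the author's exact call.

```python
import pandas as pd

# Tiny made-up example (assumption: not the real province data).
demo = pd.DataFrame({
    "kluster": [0, 0, 1, 1, 2, 2],
    "RR": [0.80, 0.70, 0.90, 0.85, 0.60, 0.55],
    "CFR": [0.02, 0.01, 0.02, 0.03, 0.02, 0.02],
})

# Mean RR per cluster, sorted from lowest to highest.
rr_by_cluster = (
    demo.groupby("kluster", as_index=False)["RR"]
    .mean()
    .sort_values("RR")
)
print(rr_by_cluster)
```

Replacing `"RR"` with `"CFR"` or `"PR"` gives the other two tables in the same layout.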
Visualization
plt.title("EPS Perusahaan")
plt.xlabel("Kode Finansial Perusahaan")
plt.ylabel("Earning Per Share (EPS)")
plt.show()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-115-119a9b178618> in <module>
1 # Use plt.bar(x, y, color=...) to draw a bar chart
----> 2 plt.bar(Kluster , color="c")
3
4 plt.title("EPS Perusahaan")
5 plt.xlabel("Kode Finansial Perusahaan")
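The TypeError above arises because `plt.bar` needs both the bar positions and the bar heights, while the call passes a single positional argument. What the cell was meant to plot is not clear from the notebook, so the sketch below assumes the goal is a bar chart of cluster sizes, using the counts reported by `covid["kluster"].value_counts()` earlier. The title and axis labels are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Cluster sizes copied from the value_counts() output shown earlier.
sizes = pd.Series({3: 14, 2: 6, 5: 4, 1: 4, 0: 4, 4: 2}).sort_index()

fig, ax = plt.subplots()
# First argument: bar positions (cluster labels); second: bar heights (counts).
bars = ax.bar(sizes.index.astype(str), sizes.values, color="c")
ax.set_title("Provinces per cluster")
ax.set_xlabel("Cluster")
ax.set_ylabel("Number of provinces")
```

In a notebook, `plt.show()` after these lines would display the figure.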
Exercises
1. Add an analysis to determine the best number of clusters using the WSS, silhouette, or gap statistic method
2. Visualize the clustering results on a map
sse.append(curr_sse)
return sse
Out[117]: [4.551804861427149,
4.364361585947376,
2.2004193180017224,
1.3044658871052162,
1.02667947037841,
0.8365685591334261,
0.6970081493729405,
0.5068498726364605,
0.38836713771764475,
0.33791062338638983,
0.26167833918931815,
0.2339398836613525,
0.15760260406598947,
0.14533961930401096,
0.10248637498594232]
Based on the WSS plot, the curve has flattened out sufficiently by 7 or 8 clusters, so 7 clusters
can be used.
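Only the last two lines of the WSS function survive in the notebook. A minimal sketch of the same idea follows, using the `inertia_` attribute of a fitted `KMeans` model (the within-cluster sum of squared distances) instead of a hand-written loop. The function name `calculate_wss`, the data `x_demo`, and `random_state=123` are assumptions for illustration. `kmax=15` matches the 15 values listed above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled feature matrix (assumption: not the real data).
rng = np.random.default_rng(0)
x_demo = rng.uniform(size=(34, 3))

def calculate_wss(points, kmax):
    """Return the within-cluster sum of squares for k = 1..kmax."""
    sse = []
    for k in range(1, kmax + 1):
        model = KMeans(n_clusters=k, random_state=123, n_init=10).fit(points)
        # inertia_ is the sum of squared distances to the nearest centroid.
        sse.append(model.inertia_)
    return sse

wss = calculate_wss(x_demo, kmax=15)
print(wss)
```

Plotting `wss` against `range(1, 16)` gives the elbow curve that the paragraph above interprets.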
In [22]: sil = []
kmax = 10
# dissimilarity would not be defined for a single cluster, thus, minimum number of clusters should be 2
for k in range(2, kmax+1):
    kmeans = KMeans(n_clusters = k, random_state = 123).fit(x_scaled)
    labels = kmeans.labels_
    sil.append(silhouette_score(x_scaled, labels, metric = 'euclidean'))
plt.show()
Based on the silhouette plot, the score is highest with 2 clusters. The second-best value is reached at
k = 8 and the third-best at k = 7, so either 7 or 8 clusters can be used.
In [26]: BBox
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
<ipython-input-27-1e4f7780defc> in <module>
----> 1 semarang = plt.imread("semarang.png")
~\anaconda3\lib\contextlib.py in __enter__(self)
111 del self.args, self.kwds, self.func
112 try:
--> 113 return next(self.gen)
114 except StopIteration:
115 raise RuntimeError("generator didn't yield") from None
~\anaconda3\lib\site-packages\matplotlib\cbook\__init__.py in open_file_cm(path_or_file, mode, encoding)
416 def open_file_cm(path_or_file, mode="r", encoding=None):
417 r"""Pass through file objects and context-manage `.PathLike`\s."""
--> 418 fh, opened = to_filehandle(path_or_file, mode, True, encoding)
419 if opened:
420 with fh:
~\anaconda3\lib\site-packages\matplotlib\cbook\__init__.py in to_filehandle(fname, flag, return_opened, encoding)
401 fh = bz2.BZ2File(fname, flag)
402 else:
--> 403 fh = open(fname, flag, encoding=encoding)
404 opened = True
405 elif hasattr(fname, 'seek'):
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-28-833c51c5e732> in <module>
4 ax.set_xlim(110.0300, 110.919)
5 ax.set_ylim(-7.339, -6.890)
----> 6 ax.imshow(semarang, zorder=0, extent = BBox, aspect= 'equal', alpha = 0.3)
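The NameError follows directly from the FileNotFoundError: `plt.imread("semarang.png")` failed, so `semarang` was never assigned. A minimal sketch of a guard that checks for the file before drawing the background is shown here. The value of `BBox` is an assumption reconstructed from the `set_xlim` and `set_ylim` calls in the traceback, since its actual output is not shown.

```python
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

image_path = Path("semarang.png")
# Assumption: (lon_min, lon_max, lat_min, lat_max) taken from the axis limits.
BBox = (110.0300, 110.919, -7.339, -6.890)

fig, ax = plt.subplots()
ax.set_xlim(BBox[0], BBox[1])
ax.set_ylim(BBox[2], BBox[3])

if image_path.exists():
    # Draw the map image behind the data only when the file is present.
    semarang = plt.imread(str(image_path))
    ax.imshow(semarang, zorder=0, extent=BBox, aspect="equal", alpha=0.3)
    has_background = True
else:
    # Report the full path searched so the missing file is easy to locate.
    print(f"Background image not found: {image_path.resolve()}")
    has_background = False
```

Printing the resolved path makes it clear which working directory the notebook searched, which is the usual cause of this error.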