https://github.com/nt27web/WebMining-Clustering
Data Exploration
Let’s look at the data at hand. The data is in CSV format.
Data Attributes:
First, I loaded the CSV file into a pandas DataFrame:
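import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display  # built in when running inside Jupyter
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

data = pd.read_csv('HR-Employee-Attrition.csv')
# Size of the dataset and missing values per column
display(data.shape)
display(data.isnull().sum())
# Summary of the columns of interest
display(data['Age'].describe())
display(data['DailyRate'].describe())
display(data['EducationField'].unique())
display(data['YearsAtCompany'].describe())
display(data['YearsInCurrentRole'].describe())
display(data['YearsSinceLastPromotion'].describe())
display(data['YearsWithCurrManager'].describe())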
display(len(data))
f_data = pd.DataFrame(data, columns=['Attrition', 'DailyRate', 'EducationField',
                                     'YearsAtCompany', 'YearsInCurrentRole',
                                     'YearsSinceLastPromotion',
                                     'YearsWithCurrManager'])
display(f_data.head())
# Keep only the employees who left, then drop the now-constant Attrition column
m_data = f_data[f_data['Attrition'] == 'Yes']
f_data = m_data.drop(['Attrition'], axis=1)
display(len(f_data))
X = f_data
y = f_data['EducationField']
# Encode the EducationField strings as integer codes
le = LabelEncoder()
X['EducationField'] = le.fit_transform(X['EducationField'])
y = le.transform(y)
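To make the encoding step above concrete, here is a minimal pure-Python sketch of the mapping that `LabelEncoder` produces: the distinct values are sorted and each one is assigned its position in that order. The helper name `label_encode` and the category names are illustrative assumptions, not values taken from the dataset output.

```python
def label_encode(values):
    """Map each distinct value to its index in sorted order (illustrative helper)."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[v] for v in values], classes


# Made-up category names, used only for illustration
codes, classes = label_encode(["Medical", "Marketing", "Medical", "Other"])
print(classes)  # ['Marketing', 'Medical', 'Other']
print(codes)    # [1, 0, 1, 2]
```

The integer codes carry no meaningful order; they only distinguish the categories.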
# Scale every feature to the [0, 1] range, keeping the column names
cols = X.columns
ms = MinMaxScaler()
X = ms.fit_transform(X)
X = pd.DataFrame(X, columns=cols)
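As a rough illustration of what the scaling step above does to each column, the sketch below applies the min-max formula, (value - min) / (max - min), to a plain list. The function name `min_max_scale` is hypothetical, and returning 0.0 for a constant column is a choice made here for the sketch.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range (illustrative helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to rescale, so map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([2, 4, 10])
print(scaled)  # [0.0, 0.25, 1.0]
```

Scaling matters for K-Means because the algorithm relies on distances, so a column with a large numeric range such as DailyRate would otherwise dominate the year-based columns.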
k_means = KMeans(n_clusters=2, random_state=0)
y_k_means = k_means.fit_predict(X)
labels = k_means.labels_
# check how many of the samples were correctly labeled
correct_labels = sum(y == labels)
print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))
print('Accuracy score: {0:0.2f}%'.format((correct_labels * 100) / float(y.size)))
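One thing to keep in mind when reading the accuracy figure: the cluster IDs that K-Means returns are arbitrary, so an element-wise comparison with the encoded labels depends on how the clusters happen to be numbered. The sketch below uses made-up labels (not the attrition data) to show the effect, and compares the raw match count with `adjusted_rand_score` from scikit-learn, which ignores the numbering. Using that metric here is a suggestion, not part of the original analysis.

```python
from sklearn.metrics import adjusted_rand_score

# Made-up labels: both lists describe the same grouping,
# but the cluster IDs are numbered differently.
true_labels = [0, 0, 1, 1, 2, 2]
cluster_ids = [2, 2, 0, 0, 1, 1]

# Element-wise comparison, as in the accuracy check above
raw_matches = sum(t == c for t, c in zip(true_labels, cluster_ids))

# Permutation-invariant agreement between the two groupings
ari = adjusted_rand_score(true_labels, cluster_ids)

print(raw_matches)  # 0
print(ari)          # 1.0
```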
When K=3
plt.scatter(X['YearsAtCompany'], X['YearsWithCurrManager'], c=y_k_means, cmap='rainbow')
plt.show()
When K=4
When K=5
X = DailyRate, Y = YearsAtCompany
X = EducationField, Y = YearsAtCompany
X = YearsAtCompany, Y = YearsWithCurrManager
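The runs for the different K values differ only in `n_clusters`, so they can be produced in one loop. The following is a minimal sketch under stated assumptions: `X_demo` is random stand-in data rather than the HR attrition dataset, `cluster_for_ks` is a hypothetical helper, and `n_init=10` is set explicitly only to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Stand-in data: random values in [0, 1], NOT the HR attrition dataset
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.random((60, 2)),
    columns=["YearsAtCompany", "YearsWithCurrManager"],
)


def cluster_for_ks(features, ks=(2, 3, 4, 5)):
    """Fit one KMeans model per k and return {k: cluster labels}."""
    results = {}
    for k in ks:
        model = KMeans(n_clusters=k, random_state=0, n_init=10)
        results[k] = model.fit_predict(features)
    return results


labels_by_k = cluster_for_ks(X_demo)
for k, cluster_ids in labels_by_k.items():
    # Number of samples assigned to each cluster for this k
    print(k, np.bincount(cluster_ids))
```

Each entry of `labels_by_k` can be passed as the `c=` argument of `plt.scatter`, with any of the axis pairs listed above, to draw one plot per K.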