You are on page 1of 19

Machine Learning

Lab File

Name:- Mohit Kumar Choudhary


Branch:- CSE
Entry number:- 18BCS042

Course Coordinator:- Dr. Sakshi Arora


Lab Coordinator:- Miss Vippon Preet Kour

1
INDEX

S.No. Exercise
1 ML_Exercise_1

2 ML_Exercise_2

3 ML_Exercise_3

4 ML_Exercise_4

5 ML_Exercise_5

2
Exercise 1
1. Using head () function print the raw values of the data, ie, top n rows.
2. Using tail () function print the values of the data i.e., last n rows.
3. Check the dimensionality of the data by using the shape () function.
4. Get each attributes data type by using dtypes property.
5. Find the statistical summary of data with the help of describe () method.

In [4]:
import pandas as pd

In [5]:
data = pd.read_csv('/Users/mohitchoudhary/Desktop/train.csv')

In [6]:

data.head(5)
Out[6]:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

Braund, Mr. Owen


0 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Harris

Cumings, Mrs. John


1 2 1 1 Bradle y (Florence female 38.0 1 0 PC 17599 71.2833 C85 C
Briggs Th...

STON/O2.
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 NaN S
3101282

Futrelle, Mrs. Jacques


3 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
Heath (Lily May Peel)

Allen, Mr. William


4 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
Henry

In [7]:
data.tail(5)

Out[7]:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S

Graham, Miss. Margaret


887 888 1 1 female 19.0 0 0 112053 30.00 B42 S
Edith

Johnston, Miss. W./C.


888 889 0 3 female NaN 1 2 23.45 NaN S
Catherine Helen "Carrie" 6607

889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C

890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q

In [8]:
data.shape
Out[8]:

(891, 12)
In [9]:

data.dtypes
Out[9]:

PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object

In [10]:
data.describe()
Out[10]:

PassengerId Survived Pclass Age SibSp Parch Fare

count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000

mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208

std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429

min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000

25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400

50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200

75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000

max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

In [ ]:

In [ ]:
Exercise 2
1. Plot histogram for datasets using hist() function.
2. Plot density plots for a dataset for understanding the attribute distribution.
3. Plot box and whisker plots for a dataset for understanding the attribute distribution.
4. Plot multivariate plots(corelation matrix plot and scattered matrix plot) for a dataset.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = pd.read_csv('/Users/mohitchoudhary/Desktop/train.csv')

In [3]:
data.head()
Out[3]:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked

Braund, Mr. Owen


0 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN S
Harris

Cumings, Mrs. John


1 2 1 1 Bradle y (Florence female 38.0 1 0 PC 17599 71.2833 C85 C
Briggs Th...

STON/O2.
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 7.9250 NaN S
3101282

Futrelle, Mrs. Jacques


3 4 1 1 female 35.0 1 0 113803 53.1000 C123 S
Heath (Lily May Peel)

Allen, Mr. William


4 5 0 3 male 35.0 0 0 373450 8.0500 NaN S
Henry

In [4]:
data.plot(kind = 'hist', subplots = True, layout = (3,3))
plt.show()

/Users/mohitchoudhary/anaconda3/lib/python3.8/site-
packages/pandas/plotting/_matplotlib/tools. py:307: MatplotlibDeprecationWarning:
The rowNum attribute was deprecated in Matplotlib 3.2 and will be removed two minor relea
ses later. Use ax.get_subplotspec().rowspan.start instead.
layout[ax.rowNum, ax.colNum] = ax.get_visible()
/Users/mohitchoudhary/anaconda3/lib/python3.8/site-
packages/pandas/plotting/_matplotlib/tools. py:307: MatplotlibDeprecationWarning:
The colNum attribute was deprecated in Matplotlib 3.2 and will be removed two minor relea
ses later. Use ax.get_subplotspec().colspan.start instead.
layout[ax.rowNum, ax.colNum] = ax.get_visible()
/Users/mohitchoudhary/anaconda3/lib/python3.8/site-
packages/pandas/plotting/_matplotlib/tools. py:313: MatplotlibDeprecationWarning:
The rowNum attribute was deprecated in Matplotlib 3.2 and will be removed two minor relea
ses later. Use ax.get_subplotspec().rowspan.start instead.
if not layout[ax.rowNum + 1, ax.colNum]:
/Users/mohitchoudhary/anaconda3/lib/python3.8/site-
packages/pandas/plotting/_matplotlib/tools. py:313: MatplotlibDeprecationWarning:
The colNum attribute was deprecated in Matplotlib 3.2 and will be removed two minor relea
ses later. Use ax.get_subplotspec().colspan.start instead.
if not layout[ax.rowNum + 1, ax.colNum]:
In [5]:
data.plot(kind = 'density', subplots = True, layout = (3,3))
plt.show()

/Users/mohitchoudhary/anaconda3/lib/python3.8/site-
packages/pandas/plotting/_matplotlib/tools. py:307: MatplotlibDeprecationWarning:
The rowNum attribute was deprecated in Matplotlib 3.2 and will be removed two minor relea
ses later. Use ax.get_subplotspec().rowspan.start instead.
layout[ax.rowNum, ax.colNum] = ax.get_visible()
/Users/mohitchoudhary/anaconda3/lib/python3.8/site-
packages/pandas/plotting/_matplotlib/tools. py:307: MatplotlibDeprecationWarning:
The colNum attribute was deprecated in Matplotlib 3.2 and will be removed two minor relea
ses later. Use ax.get_subplotspec().colspan.start instead.
layout[ax.rowNum, ax.colNum] = ax.get_visible()
/Users/mohitchoudhary/anaconda3/lib/python3.8/site-
packages/pandas/plotting/_matplotlib/tools. py:313: MatplotlibDeprecationWarning:
The rowNum attribute was deprecated in Matplotlib 3.2 and will be removed two minor relea
ses later. Use ax.get_subplotspec().rowspan.start instead.
if not layout[ax.rowNum + 1, ax.colNum]:
/Users/mohitchoudhary/anaconda3/lib/python3.8/site-
packages/pandas/plotting/_matplotlib/tools. py:313: MatplotlibDeprecationWarning:
The colNum attribute was deprecated in Matplotlib 3.2 and will be removed two minor relea
ses later. Use ax.get_subplotspec().colspan.start instead.
if not layout[ax.rowNum + 1, ax.colNum]:

In [6]:

data.boxplot(figsize = (10,10))
plt.show()
In [7]:
corr = data.corr()
plt.figure(figsize=(12,8))
sns.heatmap(corr, cmap="Greens",annot=True)

Out[7]:

<AxesSubplot:>

In [8]:

sns.pairplot(data=data)
plt.show()
Ridge Regression
In a multiple LR, there are many variables at play. This sometimes poses a problem of choosing the wrong variables for the ML, which gives undesirable output as a result. Ridge regression is used in order to
overcome this. This method is a regularisation technique in which an extra variable (tuning parameter) is added and optimised to offset the effect of multiple variables in LR (in the statistical context, it is referred
to as „noise‟).

Ridge regression essentially is an instance of LR with regularisation. Mathematically, the model with ridge regression is given by

Y = XB + e

where Y is the dependent variable(label), X is the independent variable (features), B represents all the regression coefficients and e represents the residuals (the extra variables‟ effect). Based on this, the
variables are now standardised by subtracting the respective means and dividing by their standard deviations.

The tuning parameter is now included in the ridge regression model as part of regularisation. It is denoted by the symbol ƛ. Higher the value of ƛ, the residual sum of squares tend to be zero. Lower the ƛ, the
solutions conform to least square method. In simpler words, this parameter decides the effect of coefficients. ƛ is found out using a technique called cross-validation. (More mathematical details on ridge
regression can be found here).

Lasso Regression
Least absolute shrinkage and selection operator, abbreviated as LASSO or lasso, is an LR technique which also performs regularisation on variables in consideration. In fact, it almost shares a similar statistical
analysis evident in ridge regression, except it differs in the regularisation values. This means, it considers the absolute values of the sum of the regression coefficients (hence the term was coined on this
„shrinkage‟ feature). It even sets the coefficients to zero thus reducing the errors completely. In the ridge equation mentioned earlier, the „e‟ component has absolute values instead of squared values.

This method was proposed by Professor Robert Tibshirani from the University of Toronto, Canada. He said, “The Lasso minimises the residual sum of squares to the sum of the absolute value of the coefficients
being less than a constant. Because of the nature of this constraint, it tends to produce some coefficients that are exactly 0 and hence gives interpretable models”.

In his journal article titled Regression Shrinkage and Selection via the Lasso, Tibshirani gives an account of this technique with respect to various other statistical models such as subset selection and ridge
regression. He goes on to say that lasso can even be extended to generalised regression models and tree-based models. In fact, this technique provides possibilities of even conducting statistical estimations.
Ridge Regression
In [1]: from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

In [2]: boston = load_boston()


x, y = boston.data, boston.target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

In [3]: alphas = [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1,0.5, 1]

In [4]: for a in alphas:


model = Ridge(alpha=a, normalize=True).fit(x,y)
score = model.score(x, y)
pred_y = model.predict(x)
mse = mean_squared_error(y, pred_y)
print("Alpha:{0:.6f}, R2:{1:.3f}, MSE:{2:.2f}, RMSE:{3:.2f}"
.format(a, score, mse, np.sqrt(mse)))

Alpha:0.000001, R2:0.741, MSE:21.89, RMSE:4.68


Alpha:0.000010, R2:0.741, MSE:21.89, RMSE:4.68
Alpha:0.000100, R2:0.741, MSE:21.89, RMSE:4.68
Alpha:0.001000, R2:0.741, MSE:21.90, RMSE:4.68
Alpha:0.010000, R2:0.740, MSE:21.92, RMSE:4.68
Alpha:0.100000, R2:0.732, MSE:22.66, RMSE:4.76
Alpha:0.500000, R2:0.686, MSE:26.48, RMSE:5.15
Alpha:1.000000, R2:0.635, MSE:30.81, RMSE:5.55

In [5]: ridge_mod=Ridge(alpha=0.01, normalize=True).fit(xtrain,ytrain)


ypred = ridge_mod.predict(xtest)
score = model.score(xtest,ytest)
mse = mean_squared_error(ytest,ypred)
print("R2:{0:.3f}, MSE:{1:.2f}, RMSE:{2:.2f}"
.format(score, mse,np.sqrt(mse)))

R2:0.601, MSE:24.49, RMSE:4.95

In [6]: x_ax = range(len(xtest))


plt.scatter(x_ax, ytest, s=5, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()

In [ ]:
Lasso Regression
In [1]: from sklearn.datasets import load_boston
from sklearn.linear_model import Lasso, LassoCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt

In [2]: boston = load_boston()


x, y = boston.data, boston.target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

In [3]: model=Lasso().fit(x, y)
print(model)
Lasso(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)

Lasso()
Out[3]: Lasso()

In [4]: score = model.score(x, y)


ypred = model.predict(xtest)
mse = mean_squared_error(ytest, ypred)
print("Alpha:{0:.2f}, R2:{1:.2f}, MSE:{2:.2f}, RMSE:{3:.2f}"
.format(model.alpha, score, mse, np.sqrt(mse)))

Alpha:1.00, R2:0.68, MSE:27.89, RMSE:5.28

In [5]: x_ax = range(len(xtest))


plt.scatter(x_ax, ytest, s=5, color="blue", label="original")
plt.plot(x_ax, ypred,lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()

In [ ]:
Exercise 4
1. Perform Scaling on the dataset using MinMaxScaler class/library.
2. Perform Normalization of the data by using Normalizer class/library
A. L1 Normalization
B. L2 Normalization
3. Perform binarization on the dataset using Binarize class/library.
4. Perform standardization on the data using StandardScaler class/library.

In [1]:
# Scaling the data using MinMaxScaler
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
data = asarray([[100, 0.001],
[8, 0.05],
[50, 0.005],
[88, 0.07],
[4, 0.1]])
print(data)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data)
print(scaled)

[[1.0e+02 1.0e-03]
[8.0e+00 5.0e-02]
[5.0e+01 5.0e-03]
[8.8e+01 7.0e-02]
[4.0e+00 1.0e-01]]
[[1. 0. ]
[0.04166667 0.49494949]
[0.47916667 0.04040404]
[0.875 0.6969697 ]
[0. 1. ]]

In [4]:
# L1 normalization
from sklearn.preprocessing import Normalizer
data = [[4, 1, 2, 2],
[1, 3, 9, 3],
[5, 7, 5, 1]]
transformer = Normalizer(norm='l1').fit(data)
l1_normalized = transformer.transform(data)
print(l1_normalized)

[[0.44444444 0.11111111 0.22222222 0.22222222]


[0.0625 0.1875 0.5625 0.1875 ]
[0.27777778 0.38888889 0.27777778 0.05555556]]

In [5]:
# L2 normalization
from sklearn.preprocessing import Normalizer
data = [[4, 1, 2, 2],
[1, 3, 9, 3],
[5, 7, 5, 1]]
transformer = Normalizer(norm='l2').fit(data)
l1_normalized = transformer.transform(data)
print(l1_normalized)

[[0.8 0.2 0.4 0.4]


[0.1 0.3 0.9 0.3]
[0.5 0.7 0.5 0.1]]
In [6]:

# Binarization
from sklearn.preprocessing import binarize
data = [[ 1., -1., 2.],
[ 2., 0., 0.],
[ 0., 1., -1.]]
binarized_data = binarize(data)
print(binarized_data)

[[1. 0. 1.]
[1. 0. 0.]
[0. 1. 0.]]

In [3]:
# Standardizing the data using StandardScaler
from numpy import asarray
from sklearn.preprocessing import StandardScaler
data = asarray([[100, 0.001],
[8, 0.05],
[50, 0.005],
[88, 0.07],
[4, 0.1]])
print(data)
# define standard scaler
scaler = StandardScaler()
# transform data
scaled = scaler.fit_transform(data)
print(scaled)

[[1.0e+02 1.0e-03]
[8.0e+00 5.0e-02]
[5.0e+01 5.0e-03]
[8.8e+01 7.0e-02]
[4.0e+00 1.0e-01]]
[[ 1.26398112 -1.16389967]
[-1.06174414 0.12639634]
[ 0. -1.05856939]
[ 0.96062565 0.65304778]
[-1.16286263 1.44302493]]
Exercise 5
1. Select features from a dataset using Univariate Selection.
2. Select features from a dataset using Recursive Feature Elimination.
3. Select features from a dataset using Principal Component Analysis.
4. Select features from a dataset using Feature Importance.

In [1]:

# Feature Selection using Univariate Selection


from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
filename = '/Users/mohitchoudhary/Downloads/pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
test = SelectKBest(score_func=f_classif, k=4)
fit = test.fit(X, Y)
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
print(features[0:5,:])

[ 39.67 213.162 3.257 4.304 13.281 71.772 23.871 46.141]


[[ 6. 148. 33.6 50. ]
[ 1. 85. 26.6 31. ]
[ 8. 183. 23.3 32. ]
[ 1. 89. 28.1 21. ]
[ 0. 137. 43.1 33. ]]

In [3]:

# Feature Extraction using RFE


from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
filename = "/Users/mohitchoudhary/Downloads/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
model = LogisticRegression(solver='lbfgs')
rfe = RFE(model, 3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

Num Features: 3
Selected Features: [ True False False False False True True False]
Feature Ranking: [1 2 4 5 6 1 1 3]

/Users/mohitchoudhary/anaconda3/lib/python3.8/site-
packages/sklearn/utils/validation.py:67: Fu tureWarning: Pass n_features_to_select=3 as
keyword args. From version 0.25 passing these as positional arguments will result in an
error
warnings.warn("Pass {} as keyword args. From version 0.25 "
/Users/mohitchoudhary/anaconda3/lib/python3.8/site-
packages/sklearn/linear_model/_logistic.py: 762: ConvergenceWarning: lbfgs failed to
converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
n_iter_i = _check_optimize_result(

In [4]:
# Feature Extraction using PCA
import numpy
from pandas import read_csv
from sklearn.decomposition import PCA
filename = "/Users/mohitchoudhary/Downloads/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
pca = PCA(n_components=3)
fit = pca.fit(X)
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)

Explained Variance: [0.889 0.062 0.026]


[[-2.022e-03 9.781e-02 1.609e-02 6.076e-02 9.931e-01 1.401e-02
5.372e-04 -3.565e-03]
[-2.265e-02 -9.722e-01 -1.419e-01 5.786e-02 9.463e-02 -4.697e-02
-8.168e-04 -1.402e-01]
[-2.246e-02 1.434e-01 -9.225e-01 -3.070e-01 2.098e-02 -1.324e-01
-6.400e-04 -1.255e-01]]

In [5]:
# Feature Extraction using Feature Importance
from pandas import read_csv
from sklearn.ensemble import ExtraTreesClassifier
filename = "/Users/mohitchoudhary/Downloads/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
model = ExtraTreesClassifier(n_estimators=10)
model.fit(X, Y)
print(model.feature_importances_)

[0.107 0.241 0.096 0.074 0.079 0.133 0.124 0.146]

In [ ]:
Exercise 6
1. Using K-means clustering algorithm perform clustering on any two datasets
2. Using mean shift clustering algorithm perform clustering on any two datasets
3. Using gaussian mixture model perform clustering on any two datasets
4. Provide a comparison between the various clustering algorithms on different datasets

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, MeanShift
from sklearn.mixture import GaussianMixture

In [2]:
# K-Means on first dataset
df =
pd.read_csv('/Users/mohitchoudhary/Downloads/Iris.csv')
x = df.iloc[:, [0,1,2,3]].values
kmeans = KMeans(n_clusters=3)
y = kmeans.fit_predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')

Out[2]:

<matplotlib.collections.PathCollection at 0x7f832c67e250>

In [3]:
# K-Means on second dataset
df =
pd.read_csv('/Users/mohitchoudhary/Downloads/Mall_Customers.csv')
df = df.drop(['Genre'], axis = 1)
x = df.iloc[:, [0,1,2]].values
kmeans = KMeans(n_clusters=3)
y = kmeans.fit_predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')
Out[3]:

<matplotlib.collections.PathCollection at 0x7f832c47df70>
In [4]:
# Mean Shift on first dataset
df =
pd.read_csv('/Users/mohitchoudhary/Downloads/Iris.csv')
x = df.iloc[:, [0,1,2,3]].values
meanshift = MeanShift()
meanshift.fit(x)
y = meanshift.predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')
Out[4]:

<matplotlib.collections.PathCollection at 0x7f832c865ee0>

In [5]:
# Mean Shift on second dataset
df =
pd.read_csv('/Users/mohitchoudhary/Downloads/Mall_Customers.csv')
df = df.drop(['Genre'], axis = 1)
x = df.iloc[:, [0,1,2]].values
meanshift = MeanShift()
meanshift.fit(x)
y = meanshift.predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')
Out[5]:

<matplotlib.collections.PathCollection at 0x7f832c6e2940>

In [6]:
# Gaussian Mixture on first dataset
df =
pd.read_csv('/Users/mohitchoudhary/Downloads/Iris.csv')
x = df.iloc[:, [0,1,2,3]].values
gmm = GaussianMixture(n_components=3)
gmm.fit(x)
y = gmm.predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')

Out[6]:

<matplotlib.collections.PathCollection at 0x7f83271526a0>

In [7]:

# Gaussian Mixture on second dataset


df =
pd.read_csv('/Users/mohitchoudhary/Downloads/Mall_Customers.csv')
df = df.drop(['Genre'], axis = 1)
x = df.iloc[:, [0,1,2]].values
gmm = GaussianMixture(n_components=3)
gmm.fit(x)
y = gmm.predict(x)
plt.scatter(x[:, 0], x[:, 1], c=y, cmap='rainbow')
Out[7]:

<matplotlib.collections.PathCollection at 0x7f832c0884c0>

K-Means
Kmeans algorithm is an iterative algorithm that tries to partition the dataset into Kpre-defined distinct non-
overlapping subgroups (clusters) where each data point belongs to only one group. It tries to make the intra -
cluster data points as similar as possible while also keeping the clusters as different (far) as possible. It assigns
data points to a cluster such that the sum of the squared distance between the data points and the cluster‟s
centroid (arithmetic mean of all the data points that belong to that cluster) is at the minimum. The less variation
we have within clusters, the more homogeneous (similar) the data points are within the same cluster. The way
kmeans algorithm works is as follows:

1. Specify number of clusters K.


2. Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids
without replacement.
3. Keep iterating until there is no change to the centroids. i.e assignment of data points to clusters isn‟t
changing.
Compute the sum of the squared distance between data points and all centroids.
Assign each data point to the closest cluster (centroid).
Compute the centroids for the clusters by taking the average of the all data points that belong to each
cluster.

Mean Shift
Mean Shift is a hierarchical clustering algorithm. In contrast to supervised machine learning algorithms,
clustering attempts to group data without having first been train on labeled data. Clustering is used in a wide
variety of applications such as search engines, academic rankings and medicine. As opposed to K-Means, when
using Mean Shift, you don‟t need to know the number of categories (clusters) beforehand. The downside to
Mean Shift is that it is computationally expensive — O(n²).

How it works
1. Define a window (bandwidth of the kernel) and place the window on a data point.
2. Calculate the mean for all the points in the window.
3. Move the center of the window to the location of the mean.
4. Repeat steps 2 and 3 until there is convergence.

Guassian Mixture Model


As the name implies, a Gaussian mixture model involves the mixture (i.e. superposition) of multiple Gaussian
distributions. For the sake of explanation, suppose we had three distributions made up of samples from three
distinct classes. The blue Gaussian represents the level of education of people that make up the lower class.
The red Gaussian represents the level of education of people that make up the middle class, and the green
Gaussian represents the level of education of people that make up the upper class. Not knowing what samples
came from which class, our goal will be to use Gaussian Mixture Models to assign the data points to the
appropriate cluster. After training the model, we‟d ideally end up with three distributions on the same axis. Then,
depending on the level of education of a given sample (where it is located on the axis), we‟d place it in one of the
three categories. Every distribution is multiplied by a weight π to account for the fact that we do not have an
equal number of samples from each category. In other words, we might only have included 1000 people from the
upper class and 100,000 people from the middle class. Since, we‟re dealing with probabilities, the weights
should add to 1, when summed. If we decided to add another dimension such as the number of children, then, it
might look something like this.

You might also like