
Using Python to Evaluate Machine Learning Algorithms

Tutorial
Manu Madhavan
Research Scholar, NIT Calicut

1 Introduction
This tutorial discusses some fundamental machine learning problems and their implementation in Python. We will start with a quick introduction to Python and the packages required for setting up a machine learning environment. The problems we discuss include: linear regression, polynomial regression, support vector machines, K-means clustering and principal component analysis. Among these five, the first three follow a supervised learning strategy, whereas the remaining two use unsupervised methods.

2 Acknowledgement
This tutorial uses examples from various web resources. I deeply acknowledge the contributors.

3 Python for Machine Learning


Jean Francois Puget, from IBM’s machine learning department, expressed his opinion that Python is the most popular language for AI and ML, based on a search of job-posting trends on indeed.com.
Introducing Python programming is beyond the scope of this tutorial. For the basics of Python programming, you may refer to the Python Tutorial. A basic grasp of Python syntax and data structures (list, set, dictionary, string and files) is sufficient to follow this tutorial.
Machine learning applications make heavy use of the following Python packages.

• NumPy: the fundamental package for scientific computing, which contains useful linear algebra, Fourier transform, and random number capabilities.

• Pandas: offers data structures and operations for manipulating numerical tables and time
series.

• Scikit-learn: a free machine learning library for the Python programming language. It features various classification, regression and clustering algorithms, including support vector machines.

• Matplotlib: the Python package for visualizing data.


3.1 Setting the Environment


We are using the ANACONDA package manager to set up the environment. The open-source version of Anaconda is a high-performance distribution of Python and R and includes over 100 of the most popular Python, R and Scala packages for data science. Additionally, you will have access to over 720 packages that can easily be installed with conda, Anaconda's package, dependency and environment manager.
You can also use other environments like Google Colab.
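Once the environment is ready, a quick way to confirm that the required packages are available is to import them and print their versions (a minimal sanity check, not part of the original examples):

In [ ]: import numpy as np
        import pandas as pd
        import sklearn
        import matplotlib

        # print the installed versions of the packages used in this tutorial
        print("numpy:", np.__version__)
        print("pandas:", pd.__version__)
        print("scikit-learn:", sklearn.__version__)
        print("matplotlib:", matplotlib.__version__)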

4 Linear Regression
Regression is a supervised learning method used to predict a scalar value from the data. Linear regression defines a relationship between a dependent variable (Y) and an independent variable (X), which is simply written as

y = mx + b

where y is the dependent variable, m is the scale factor or coefficient, b is the bias coefficient and x is the independent variable. The goal is to draw the line of best fit between X and Y, which estimates the relationship between them. We can use the ordinary least squares method to estimate these coefficients.
Let’s start working on an example to understand how this works.
We are going to use a dataset containing the head size and brain weight of different people.
Let’s first load the dataset into a pandas DataFrame.
In [2]: import pandas as pd
data = pd.read_csv('regression.csv')
print ("Size of dataset",data.shape)
print ("Column names:",list(data))
print ("Sample data")
data.head()

('Size of dataset', (237, 4))


('Column names:', ['Gender', 'Age Range', 'Head Size(cmˆ3)', 'Brain Weight(grams)'])
Sample data

Out[2]: Gender Age Range Head Size(cmˆ3) Brain Weight(grams)


0 1 1 4512 1530
1 1 1 3738 1297
2 1 1 4261 1335
3 1 1 3777 1282
4 1 1 4177 1590

Let’s visualize the data by plotting the relationship between head size and brain weight.
In [6]: from matplotlib import pyplot as plt
%matplotlib inline

X = data['Head Size(cm^3)']
Y = data['Brain Weight(grams)']
plt.scatter(X,Y)
plt.xlabel('Head size')
plt.ylabel('Brain weight')


Out[6]: <matplotlib.text.Text at 0x7f6a9a3afa90>

The task is to find the line that best fits the above scatter plot so that we can predict the response for any new feature value. This line is called the regression line. It is the line that passes as close as possible to the scatter points, minimising the error, i.e. the distance from each point to the line.
The total error of the linear model is the sum of the squared errors of the individual points:
$$\sum_{i=1}^{n} (r_i)^2 \qquad (1)$$

where $r_i$ is the distance of the $i$th point from the line.


To minimize this error, we can compute the coefficients of the line $y = mx + b$ as:

$$m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \qquad (2)$$

$$b = \bar{y} - m\bar{x} \qquad (3)$$
In [8]: import numpy as np
# mean of our inputs and outputs
x_mean = np.mean(X)
y_mean = np.mean(Y)

#total number of values


n = len(X)

In [9]: #now we are calculating m and b


numerator =0
denominator=0

for i in range(n):
    numerator += (X[i] - x_mean) * (Y[i] - y_mean)
    denominator += (X[i] - x_mean) ** 2
m = numerator/denominator
b = y_mean - m*x_mean

print ("Coefficients (m,b):", (m,b))

('Coefficients (m,b):', (0.26342933948939945, 325.57342104944223))

So, our estimation is: Brainweights = 325.57342104944223 + 0.26342933948939945 ∗ Headsize
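With m and b estimated, predicting the brain weight for a new head size is a one-line computation. A small illustrative check, using a hypothetical head size of 4000 cm³:

In [ ]: head_size = 4000.0   # hypothetical head size in cm^3
        predicted_weight = b + m * head_size
        print("Predicted brain weight (grams):", predicted_weight)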


In [13]: X = data['Head Size(cm^3)']
Y = data['Brain Weight(grams)']
plt.scatter(X,Y,label="data points")
plt.xlabel('Head size')
plt.ylabel('Brain weight')

#take evenly spaced points from X


x = np.linspace(np.min(X), np.max(X), 1000)
y = b + m * x
plt.plot(x, y, color='red', label="regression line")
plt.legend()

Out[13]: <matplotlib.legend.Legend at 0x7f6a98775e50>

We can calculate the accuracy of the model by computing the root mean square error (RMSE). Let $y_i$ be the actual value and $y_i'$ the value predicted from the regression coefficients; then

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - y_i')^2} \qquad (4)$$


In [16]: rmse = 0
for i in range(n):
    y_pred = b + m * X[i]
    rmse += (Y[i] - y_pred) ** 2

rmse = np.sqrt(rmse/n)
print (rmse)

72.1206213783709
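The same RMSE can also be computed without an explicit loop, using NumPy's vectorized operations on the X and Y series defined earlier (an equivalent sketch, assuming the m and b computed above):

In [ ]: y_pred_all = b + m * X.values                       # predictions for every sample
        rmse_vec = np.sqrt(np.mean((Y.values - y_pred_all) ** 2))
        print(rmse_vec)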

4.1 Linear regression using scikitlearn


You can easily compute the linear regression using the Python package scikit-learn (sklearn) as follows.
In [35]: data = pd.read_csv('regression.csv')
X1 = pd.DataFrame(data['Head Size(cm^3)'])
Y1 = pd.DataFrame(data['Brain Weight(grams)'])

from sklearn.linear_model import LinearRegression


lin = LinearRegression()
lin.fit(X1, Y1)

print(lin.intercept_, lin.coef_)


(array([325.57342105]), array([[0.26342934]]))
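The fitted scikit-learn model can be evaluated in the same way. A short sketch using the mean_squared_error and r2_score helpers from sklearn.metrics:

In [ ]: import numpy as np
        from sklearn.metrics import mean_squared_error, r2_score

        Y1_pred = lin.predict(X1)
        print("RMSE:", np.sqrt(mean_squared_error(Y1, Y1_pred)))
        print("R^2:", r2_score(Y1, Y1_pred))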

5 Polynomial Regression
Polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. This type of regression is useful for describing non-linear phenomena.
The basic goal of regression analysis is to model the expected value of a dependent variable y in terms of the value of an independent variable x. In simple regression we used the equation $y = b + ax + e$, where b is the intercept, a is the slope and e is the error.
For multiple variables, we can write $y = b + a_1 x_1 + a_2 x_2 + \dots + a_n x_n + e$.
Let’s do the same example using polynomial regression.
(The first step is to load the data.)
In [63]: data = pd.read_csv('regression.csv')
X = pd.DataFrame(data['Head Size(cm^3)'])
Y = pd.DataFrame(data['Brain Weight(grams)'])
print (X.shape, Y.shape)

((237, 1), (237, 1))

To convert the original features into their higher order terms we will use the PolynomialFea-
tures class provided by scikit-learn.
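To see what the transform produces, here is a tiny illustration on a hypothetical single-feature array; with degree 2, each value x is expanded into the columns [1, x, x²]:

In [ ]: from sklearn.preprocessing import PolynomialFeatures
        import numpy as np

        demo = np.array([[1.0], [2.0], [3.0]])   # hypothetical single-feature data
        print(PolynomialFeatures(degree=2).fit_transform(demo))
        # each row becomes [1, x, x^2]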
In [67]: from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

poly = PolynomialFeatures(degree = 2)
X_poly = poly.fit_transform(X)

poly.fit(X_poly, Y)

lin2 = LinearRegression()
lin2.fit(X_poly, Y)

Out[67]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)


In [68]: plt.scatter(X, Y, color = 'blue')
plt.plot(X, lin2.predict(poly.fit_transform(X)),color='red')
plt.show()

In [69]: rmse = 0
for xi,yi in zip(X['Head Size(cm^3)'],Y['Brain Weight(grams)']):
    # fit_transform expects a 2-D array, so wrap the scalar value
    y_pred = lin2.predict(poly.fit_transform([[xi]]))
    rmse += (yi - y_pred) ** 2

rmse = np.sqrt(rmse/n)
print (rmse)

[[71.88842975]]

6 Support Vector Machine Classifier


Support Vector Machine (SVM) is a supervised machine learning algorithm capable of perform-
ing classification, regression and even outlier detection. The linear SVM classifier works by
drawing a straight line between two classes. All the data points that fall on one side of the
line will be labeled as one class and all the points that fall on the other side will be labeled as

the second. SVM chooses the decision boundary that maximizes the distance from the nearest data points of all the classes. These nearest points, which determine this margin, are called support vectors.

To understand the detailed mathematics behind SVM, you are encouraged to refer to this blog.
In this tutorial, we will discuss how to perform SVM classification using Python. We are using the Pima Indians Diabetes Database. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
First load the dataset.
In [70]: import pandas as pd
data = pd.read_csv('diabetes.csv')
print(data.shape)
print ("sample data")
data.head()

(768, 9)
sample data

Out[70]: Pregnancies Glucose BloodPressure SkinThickness Insulin BMI \


0 6 148 72 35 0 33.6
1 1 85 66 29 0 26.6
2 8 183 64 0 0 23.3
3 1 89 66 23 94 28.1
4 0 137 40 35 168 43.1

DiabetesPedigreeFunction Age Outcome


0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1
Now we are going to separate the features and the labels into the variables X (matrix) and y (vector).


In [72]: labels=data['Outcome']
features=data.drop(['Outcome'],axis=1)
print (list(features))
X=features.values
y=labels.values

['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI',


'DiabetesPedigreeFunction', 'Age']

The next step is to split the dataset into training and test sets.
In [75]: from sklearn.model_selection import train_test_split
# (sklearn.cross_validation is removed in newer scikit-learn versions)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=10)

Since we are going to perform a classification task, we will use the support vector classifier
class, which is written as SVC in the Scikit-Learn’s svm library.
In [76]: from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')

The next step is training. The fit method of the SVC class is called to train the algorithm on the training data, which is passed as a parameter to the fit method. Execute the following code to train the algorithm.
In [77]: svclassifier.fit(X_train, y_train)

Out[77]: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,


decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
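The kernel is not restricted to 'linear'. As an illustrative sketch (not part of the original experiment), a non-linear RBF kernel can be trained and scored on the same split in exactly the same way:

In [ ]: # Illustrative only: RBF-kernel SVM on the same train/test split
        rbf_classifier = SVC(kernel='rbf', gamma='auto')
        rbf_classifier.fit(X_train, y_train)
        print(rbf_classifier.score(X_test, y_test))   # mean accuracy on the test set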

Now predict the classes on the test dataset.


In [78]: y_pred = svclassifier.predict(X_test)

The classification performance can be evaluated as follows.


In [79]: from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[146  16]
 [ 47  45]]
             precision    recall  f1-score   support

          0       0.76      0.90      0.82       162
          1       0.74      0.49      0.59        92

avg / total       0.75      0.75      0.74       254
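A single overall accuracy figure can also be obtained with accuracy_score from sklearn.metrics (a small addition to the metrics already printed):

In [ ]: from sklearn.metrics import accuracy_score
        print("Accuracy:", accuracy_score(y_test, y_pred))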


In [82]: labels = ['0', '1']


fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(confusion_matrix(y_test,y_pred))
plt.title('Confusion matrix of the classifier')
fig.colorbar(cax)
ax.set_xticklabels([''] + labels)
ax.set_yticklabels([''] + labels)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

7 Clustering using K-Means


In simple words, clustering is the process of segregating data points with similar traits and assigning them to clusters. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different. How the similarity between data points is measured depends on the clustering algorithm.
The K-means algorithm is an iterative algorithm that tries to partition the dataset into K predefined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group. It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster’s centroid (the arithmetic mean of all the data points that belong to that cluster) is at a minimum.
7.1 K-Means Algorithm

Let $X = \{x_1, x_2, \ldots, x_n\}$ be the set of data points and $V = \{v_1, v_2, \ldots, v_k\}$ be the set of cluster centers.

1. Randomly select k cluster centers.
2. Calculate the distance between each data point and every cluster center.
3. Assign each data point to the cluster center closest to it.
4. Recalculate each cluster center using

$$v_i = \frac{1}{c_i} \sum_{j=1}^{c_i} x_j, \qquad (5)$$

where $c_i$ is the number of data points in cluster $i$.
5. Recalculate the distance between each data point and the newly obtained cluster centers.
6. If no data point was reassigned then stop; otherwise repeat from step 3.

A minimal NumPy sketch of these steps is given below.
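The sketch is illustrative only; it assumes the data are the rows of a numeric NumPy array X, picks random initial centers, and does not handle empty clusters:

In [ ]: import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    # Step 1: pick k random data points as the initial centers
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Steps 2-3: assign each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Step 4: recompute each center as the mean of its assigned points
        new_centers = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        # Steps 5-6: stop when the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign, centers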
Let us try an example. We will implement k-means clustering using a famous dataset: the Iris dataset. This dataset contains 3 classes of 50 instances each, and each class refers to a type of iris plant. The dataset has four features: sepal length, sepal width, petal length, and petal width. The fifth column is the species, which holds the class label for each plant.
First load dataset.
In [83]: import pandas as pd
data = pd.read_csv('iris.csv')
print(data.shape)
print (data.head())

(150, 5)
sepal.length sepal.width petal.length petal.width variety
0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa

Now let’s visualize the data.


In [84]: from sklearn.datasets import load_iris
iris = load_iris()

from matplotlib import pyplot as plt

# The indices of the features that we are plotting


x_index = 0
y_index = 1

# this formatter will label the colorbar with the correct target names
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])

plt.figure(figsize=(5, 4))
plt.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.xlabel(iris.feature_names[x_index])
plt.ylabel(iris.feature_names[y_index])

plt.tight_layout()
plt.show()


Now we select all four features (sepal length, sepal width, petal length, and petal width) of the
dataset in a variable called x so that we can train our model with these features.
In [107]: x = data.iloc[:, [0,1,2,3]].values
x[:10]

Out[107]: array([[5.1, 3.5, 1.4, 0.2],


[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1]])

Now apply the k-means algorithm from scikit-learn.


In [108]: from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(x)
clusters = kmeans.predict(x)
print (clusters)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2

2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 2 2 2 2 0 2 2 2 0 2 2 2 0 2
2 0]
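Since the iris data also carries the true species labels, the clustering can be compared against them with a label-agnostic score such as the adjusted Rand index (a quick sanity check, not part of the k-means algorithm itself):

In [ ]: from sklearn.metrics import adjusted_rand_score
        print("Adjusted Rand index:", adjusted_rand_score(data['variety'], clusters))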

In [114]: #Print the cluster centers


centers = kmeans.cluster_centers_
print(centers)

[[5.9016129 2.7483871 4.39354839 1.43387097]


[5.006 3.428 1.462 0.246 ]
[6.85 3.07368421 5.74210526 2.07105263]]

In [115]: import matplotlib.pyplot as plt


plt.scatter(x[:,0],x[:,1], c=clusters,s=50, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.5);

8 Dimensionality Reduction using PCA


Principal component analysis (PCA) is a fast and flexible unsupervised method for dimensionality reduction in data.
To project high-dimensional data into a new, reduced-dimensional space, we first find the principal components. The eigenvectors of the data's covariance matrix serve as the principal components. Let the input be an m × n data matrix (i.e. points in an n-dimensional space). Compute the covariance matrix of the data and find its eigenvalues and eigenvectors. Sort the eigenvalues in decreasing order and select the eigenvectors corresponding to the first k eigenvalues (k < n). Project the data onto these k eigenvectors; the new axes capture the directions of largest variance.
Let’s work through a toy example.
We create a 200 × 2 matrix of random numbers and visualize it.


In [119]: import pandas as pd


import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(1)
X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
X.shape

Out[119]: (200, 2)
In [117]: plt.scatter(X[:, 0], X[:, 1])

Out[117]: <matplotlib.collections.PathCollection at 0x7f6a7ecccb50>

X is currently 2-dimensional. Our objective is to project these data points onto a 1-D space. We apply the sklearn PCA implementation.
In [120]: from sklearn.decomposition import PCA
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape:", X.shape)
print("transformed shape:", X_pca.shape)

('original shape:', (200, 2))


('transformed shape:', (200, 1))

In [122]: print("PCA Componets:", pca.components_)


print("PCA explained variance:",pca.explained_variance_)

('PCA Components:', array([[0.94446029, 0.32862557]]))


('PCA explained variance:', array([0.75871884]))
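The fraction of total variance captured by the retained component is also exposed by the fitted object; this is a useful check when deciding how many components to keep:

In [ ]: # fraction of the total variance explained by the kept component(s)
        print("Explained variance ratio:", pca.explained_variance_ratio_)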


In [121]: X_new = pca.inverse_transform(X_pca)


plt.scatter(X[:, 0], X[:, 1], alpha=0.2)
plt.scatter(X_new[:, 0], X_new[:, 1], alpha=0.8)

Out[121]: <matplotlib.collections.PathCollection at 0x7f6a7ec0e990>

9 References
1. Tom Mitchell, Machine Learning

2. Abdul Hafeez Abdul Raheem, Linear Regression from scratch, Blog

3. Animesh Agarwal, Polynomial Regression, Blog

4. Machine Learning with Python, Online Course

