Tutorial
Manu Madhavan
Research Scholar, NIT Calicut
1 Introduction
This tutorial discusses some fundamental machine learning problems and their implementation in Python. We will start with a quick introduction to Python and the packages required for setting up a machine learning environment. The problems we discuss include: linear regression, polynomial regression, support vector machines, k-means clustering, and principal component analysis. Among these five, the first three follow a supervised learning strategy, whereas the remaining two use unsupervised methods.
2 Acknowledgement
This tutorial uses examples from various web resources. I deeply acknowledge the contributors.
3 Python Packages
• Pandas: offers data structures and operations for manipulating numerical tables and time series.
• Scikit-learn: a free machine learning library for the Python programming language. It features various classification, regression, and clustering algorithms, including support vector machines.
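To confirm the environment is set up, a quick check (a minimal sketch; any recent versions of these packages will do) is to import them and print their versions:
In [ ]: # confirm the packages are installed by printing their versions
        import pandas
        import sklearn
        print(pandas.__version__, sklearn.__version__)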
4 Linear Regression
Regression is a supervised learning method used to predict a scalar value from the data. Linear regression defines a relationship between a dependent variable (Y) and an independent variable (X), which is simply written as
y = mx + b
where y is the dependent variable, m is the scale factor or coefficient, b is the bias coefficient, and x is the independent variable. The goal is to draw the line of best fit between X and Y which estimates the relationship between them. We can use the ordinary least squares method to estimate these coefficients.
Let's work through an example to understand how this works.
We are going to use a dataset containing the head sizes and brain weights of different people.
Let's first load the dataset into a pandas DataFrame.
In [2]: import pandas as pd
data = pd.read_csv('regression.csv')
print ("Size of dataset",data.shape)
print ("Column names:",list(data))
print ("Sample data")
data.head()
Let’s visualize the data by plotting the relationship between head size and brain weight.
In [6]: from matplotlib import pyplot as plt
%matplotlib inline
X = data['Head Size(cm^3)']
Y = data['Brain Weight(grams)']
plt.scatter(X,Y)
plt.xlabel('Head size')
plt.ylabel('Brain weight')
The task is to find the line that best fits the above scatter plot, so that we can predict the response for any new feature value. This line is called the regression line: it passes as close as possible to the scatter points, minimizing the error, which is the distance from each point to the line.
The total error of the linear model is the sum of the squared errors (residuals) of all points:

$$\sum_{i=1}^{n} r_i^2 \quad (1)$$

Minimizing this error with the ordinary least squares method gives closed-form estimates for the coefficients:

$$m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \quad (2)$$

$$b = \bar{y} - m\bar{x} \quad (3)$$
In [8]: import numpy as np
        # mean of our inputs and outputs
        x_mean = np.mean(X)
        y_mean = np.mean(Y)
        # number of data points
        n = len(X)
        # least-squares estimates of the slope (m) and intercept (b)
        numerator = 0
        denominator = 0
        for i in range(n):
            numerator += (X[i] - x_mean) * (Y[i] - y_mean)
            denominator += (X[i] - x_mean) ** 2
        m = numerator / denominator
        b = y_mean - m * x_mean
We can calculate the accuracy of the model with the root mean square error (RMSE). Let $y_i$ be the actual value and $\hat{y}_i$ the value predicted by the regression coefficients; then
$$\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \quad (4)$$
In [16]: rmse = 0
         for i in range(n):
             # predicted value from the fitted line
             y_pred = b + m * X[i]
             rmse += (Y[i] - y_pred) ** 2
         rmse = np.sqrt(rmse / n)
         print (rmse)
72.1206213783709
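As a cross-check (a minimal sketch using scikit-learn's metrics module, with the m, b, X, and Y defined above), the same value can be computed with mean_squared_error:
In [ ]: from sklearn.metrics import mean_squared_error
        # predictions of the fitted line for every head size
        y_pred_all = b + m * X
        print(np.sqrt(mean_squared_error(Y, y_pred_all)))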
We can also obtain these coefficients with scikit-learn's LinearRegression (a minimal equivalent of the loop above):
In [ ]: from sklearn.linear_model import LinearRegression
        lin = LinearRegression()
        # scikit-learn expects a 2-D feature array
        lin.fit(X.values.reshape(-1, 1), Y)
        print(lin.intercept_, lin.coef_)
5 Polynomial Regression
Polynomial regression is a form of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an nth-degree polynomial. This type of regression is useful for describing non-linear phenomena.
The basic goal of regression analysis is to model the expected value of a dependent variable y in terms of the value of an independent variable x. In simple regression, we use the equation y = b + ax + e, where b is the intercept, a is the slope, and e is the error. With multiple variables, this becomes y = b + a1 x1 + a2 x2 + ... + an xn + e. Polynomial regression is the special case where the extra variables are powers of a single x, i.e., y = b + a1 x + a2 x^2 + ... + an x^n + e, so an ordinary linear model fitted on these power features captures a non-linear curve.
Let's redo the same example using polynomial regression. The first step is to load the data:
In [63]: data = pd.read_csv('regression.csv')
X = pd.DataFrame(data['Head Size(cm^3)'])
Y = pd.DataFrame(data['Brain Weight(grams)'])
print (X.shape, Y.shape)
To convert the original features into their higher-order terms, we will use the PolynomialFeatures class provided by scikit-learn.
In [67]: from sklearn.preprocessing import PolynomialFeatures
         from sklearn.linear_model import LinearRegression
         # expand x into the polynomial features [1, x, x^2]
         poly = PolynomialFeatures(degree = 2)
         X_poly = poly.fit_transform(X)
         # fit an ordinary linear model on the expanded features
         lin2 = LinearRegression()
         lin2.fit(X_poly, Y)
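To see what the model learned, we can plot its predictions over the scatter plot (a short sketch using the variables defined above; the sorting just makes the curve draw left to right):
In [ ]: # sort head sizes so the fitted curve is drawn left to right
        x_line = np.sort(X['Head Size(cm^3)'].values).reshape(-1, 1)
        y_line = lin2.predict(poly.transform(x_line))
        plt.scatter(X['Head Size(cm^3)'], Y['Brain Weight(grams)'], alpha=0.4)
        plt.plot(x_line, y_line, color='red')
        plt.xlabel('Head size')
        plt.ylabel('Brain weight')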
In [69]: rmse = 0
         n = len(X)
         for xi, yi in zip(X['Head Size(cm^3)'], Y['Brain Weight(grams)']):
             # expand the single value into polynomial features before predicting
             y_pred = lin2.predict(poly.transform([[xi]]))
             rmse += (yi - y_pred) ** 2
         rmse = np.sqrt(rmse / n)
         print (rmse)
[[71.88842975]]
6 Support Vector Machine Classifier
A support vector machine (SVM) chooses the decision boundary that maximizes the distance from the nearest data points of all the classes. These nearest points, which determine the position of the boundary, are called support vectors. To understand the complex mathematics behind SVM, you are encouraged to refer to this blog.
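As a quick illustration of support vectors (a minimal sketch on made-up 2-D data, separate from the example that follows), we can fit a linear SVC on two separable point clouds and circle the points it selects as support vectors:
In [ ]: import numpy as np
        from matplotlib import pyplot as plt
        from sklearn.svm import SVC
        # two linearly separable point clouds (toy data)
        rng = np.random.RandomState(0)
        pts = np.r_[rng.randn(20, 2) - [2, 2], rng.randn(20, 2) + [2, 2]]
        cls = [0] * 20 + [1] * 20
        clf = SVC(kernel='linear')
        clf.fit(pts, cls)
        plt.scatter(pts[:, 0], pts[:, 1], c=cls)
        # circle the support vectors the model selected
        sv = clf.support_vectors_
        plt.scatter(sv[:, 0], sv[:, 1], s=120, facecolors='none', edgecolors='k')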
In this tutorial, we will discuss how to perform SVM classification using Python. We use the Pima Indians Diabetes Database. The objective with this dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset.
First, load the dataset.
In [70]: import pandas as pd
         data = pd.read_csv('diabetes.csv')
         print(data.shape)
         print ("sample data")
         data.head()
(768, 9)
sample data
In [72]: # separate the class labels from the feature columns
         labels = data['Outcome']
         features = data.drop(['Outcome'], axis=1)
         print (list(features))
         X = features.values
         y = labels.values
The next step is to split the dataset into training and test sets.
In [75]: from sklearn.model_selection import train_test_split
         # hold out one third of the data for testing
         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                             random_state=10)
Since we are going to perform a classification task, we will use the support vector classifier class, which is written as SVC in Scikit-Learn's svm library.
In [76]: from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
The next step is training. The fit method of the SVC class is called to train the algorithm on the training data, which is passed as a parameter. Execute the following code to train the algorithm.
In [77]: svclassifier.fit(X_train, y_train)
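To evaluate the model, we predict labels for the test set and print the confusion matrix and classification report (a minimal evaluation cell using scikit-learn's metrics module):
In [ ]: from sklearn.metrics import classification_report, confusion_matrix
        y_pred = svclassifier.predict(X_test)
        print(confusion_matrix(y_test, y_pred))
        print(classification_report(y_test, y_pred))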
[[146 16]
[ 47 45]]
precision recall f1-score support
7 Clustering using K-Means
K-means partitions the data into k clusters by repeating the following steps:
1. Randomly select k cluster centers.
2. Calculate the distance between each data point and the cluster centers.
3. Assign each data point to the cluster center whose distance from the data point is the minimum over all the cluster centers.
4. Recalculate the new cluster center using:

$$v_i = \frac{1}{c_i} \sum_{j=1}^{c_i} x_j, \quad (5)$$

where $c_i$ is the number of data points in cluster $v_i$.
5. Recalculate the distance between each data point and the newly obtained cluster centers.
6. If no data point was reassigned then stop; otherwise repeat from step 3 (see the sketch below).
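The following is a minimal sketch of one assignment-and-update iteration on made-up 2-D points (the data and variable names are illustrative):
In [ ]: import numpy as np
        # four toy points and two initial cluster centers
        pts = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
        centers = np.array([[0.0, 0.0], [10.0, 10.0]])
        # assignment step: each point goes to its nearest center
        dists = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # update step: each center becomes the mean of its points (eq. 5)
        centers = np.array([pts[assign == k].mean(axis=0) for k in range(2)])
        print(assign)
        print(centers)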
Let us try an example. We will implement k-means clustering on a famous dataset: the Iris dataset. This dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. The dataset has four features: sepal length, sepal width, petal length, and petal width. The fifth column gives the species of each plant.
First, load the dataset.
In [83]: import pandas as pd
data = pd.read_csv('iris.csv')
print(data.shape)
print (data.head())
(150, 5)
sepal.length sepal.width petal.length petal.width variety
0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa
We can visualize two of the features, coloring each point by its class. Here we use scikit-learn's built-in copy of the Iris dataset, since it provides numeric targets and feature names (the two feature indices chosen are illustrative):
In [ ]: from sklearn import datasets
        iris = datasets.load_iris()
        # plot sepal length (index 0) against sepal width (index 1)
        x_index = 0
        y_index = 1
        # this formatter will label the colorbar with the correct target names
        formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])
        plt.figure(figsize=(5, 4))
        plt.scatter(iris.data[:, x_index], iris.data[:, y_index], c=iris.target)
        plt.colorbar(ticks=[0, 1, 2], format=formatter)
        plt.xlabel(iris.feature_names[x_index])
        plt.ylabel(iris.feature_names[y_index])
        plt.tight_layout()
        plt.show()
Now we select all four features (sepal length, sepal width, petal length, and petal width) of the
dataset in a variable called x so that we can train our model with these features.
In [107]: x = data.iloc[:, [0,1,2,3]].values
x[:10]
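Now fit a k-means model with three clusters (a minimal sketch using scikit-learn's KMeans; n_clusters matches the three species, random_state is illustrative, and the cluster numbering may differ between runs) and print the cluster assigned to each sample:
In [ ]: from sklearn.cluster import KMeans
        # three clusters, matching the three iris species
        kmeans = KMeans(n_clusters=3, random_state=0)
        y_kmeans = kmeans.fit_predict(x)
        print(y_kmeans)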
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 2 2 2 2 0 2 2 2 2
 2 2 0 0 2 2 2 2 0 2 0 2 0 2 2 0 0 2 2 2 2 2 0 2 2 2 2 0 2 2 2 0 2 2 2 0 2
 2 0]
8 Dimensionality Reduction using PCA
Principal component analysis (PCA) reduces the number of dimensions in the data by projecting it onto the directions that capture the most variance. First, generate a random two-dimensional dataset:
In [119]: rng = np.random.RandomState(1)
          X = np.dot(rng.rand(2, 2), rng.randn(2, 200)).T
          X.shape
Out[119]: (200, 2)
In [117]: plt.scatter(X[:, 0], X[:, 1])
X is currently two-dimensional. Our objective is to project these data points onto a one-dimensional space. We apply scikit-learn's PCA implementation:
In [120]: from sklearn.decomposition import PCA
pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.transform(X)
print("original shape:", X.shape)
print("transformed shape:", X_pca.shape)