
Journey of Machine Learning

There are so many algorithms available and it can feel overwhelming when algorithm names are thrown
around and you are expected to just know what they are and where they fit.

In this post I want to give you two ways to think about and categorize the algorithms you may come
across in the field.

 The first is a grouping of algorithms by the learning style.


 The second is a grouping of algorithms by similarity in form or function (like grouping similar animals
together).
Both approaches are useful, but we will focus in on the grouping of algorithms by similarity and go on a
tour of a variety of different algorithm types.

After reading this post, you will have a much better understanding of the most popular machine learning
algorithms for supervised learning and how they are related.
A cool example of an ensemble of lines of best fit. Weak members are grey, the combined prediction is
red.
Plot from Wikipedia, licensed under public domain.

Algorithms Grouped by Learning Style

There are different ways an algorithm can model a problem based on its interaction with the experience
or environment or whatever we want to call the input data.

It is popular in machine learning and artificial intelligence textbooks to first consider the learning styles
that an algorithm can adopt.

There are only a few main learning styles or learning models that an algorithm can have and we’ll go
through them here with a few examples of algorithms and problem types that they suit.

This taxonomy or way of organizing machine learning algorithms is useful because it forces you to think
about the roles of the input data and the model preparation process and select one that is the most
appropriate for your problem in order to get the best result.

Let’s take a look at three different learning styles in machine learning algorithms:

Supervised Learning

Input data is called training data and has a known label or result
such as spam/not-spam or a stock price at a time.

A model is prepared through a training process where it is required to make predictions and is corrected
when those predictions are wrong. The training process continues until the model achieves a desired
level of accuracy on the training data.
Example problems are classification and regression.

Example algorithms include Logistic Regression and the Back Propagation Neural Network.

Unsupervised Learning

Input data is not labelled and does not have a known result.

A model is prepared by deducing structures present in the input data. This may be to extract general
rules. It may be through a mathematical process to systematically reduce redundancy, or it may be
to organize data by similarity.

Example problems are clustering, dimensionality reduction and association rule learning.

Example algorithms include: the Apriori algorithm and k-Means.

Semi-Supervised Learning
Input data is a mixture of labelled and unlabelled examples.
There is a desired prediction problem but the model must learn the structures to organize the data as well
as make predictions.

Example problems are classification and regression.

Example algorithms are extensions to other flexible methods that make assumptions about how to model
the unlabelled data.

Overview

When crunching data to model business decisions, you are most typically using supervised and
unsupervised learning methods.

A hot topic at the moment is semi-supervised learning methods in areas such as image classification
where there are large datasets with very few labelled examples.

Algorithms Grouped By Similarity

Algorithms are often grouped by similarity in terms of their function (how they work). For example, tree-
based methods, and neural network inspired methods.

I think this is the most useful way to group algorithms and it is the approach we will use here.

This is a useful grouping method, but it is not perfect. There are still algorithms that could just as easily fit
into multiple categories like Learning Vector Quantization that is both a neural network inspired method
and an instance-based method. There are also categories that have the same name that describes the
problem and the class of algorithm such as Regression and Clustering.
We could handle these cases by listing algorithms twice or by selecting the group that subjectively is the
“best” fit. I like this latter approach of not duplicating algorithms to keep things simple.

In this section I list many of the popular machine learning algorithms grouped the way I think is the most
intuitive. It is not exhaustive in either the groups or the algorithms, but I think it is representative and will
be useful to you to get an idea of the lay of the land.

Please Note: There is a strong bias towards algorithms used for classification and regression, the two
most prevalent supervised machine learning problems you will encounter.
If you know of an algorithm or a group of algorithms not listed, put it in the comments and share it with us.
Let’s dive in.

Regression Algorithms

Regression is concerned with modelling the relationship between variables that is iteratively refined
using a measure of error in the predictions made by the model.

Regression methods are a workhorse of statistics and have been co-opted into statistical machine learning.
This may be confusing because we can use regression to refer to the class of problem and the class of
algorithm. Really, regression is a process.

The most popular regression algorithms are:

 Ordinary Least Squares Regression (OLSR)


 Linear Regression
 Logistic Regression
 Stepwise Regression
 Multivariate Adaptive Regression Splines (MARS)
 Locally Estimated Scatterplot Smoothing (LOESS)
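
To make the "measure of error" idea concrete, here is a minimal sketch of ordinary least squares for a single input variable in plain Python. It uses the closed-form solution rather than an iterative refinement, and the function name fit_line and the toy data are assumptions made up for this illustration.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) that minimise the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# hypothetical noise-free data generated from y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)

# the error measure being minimised: sum of squared residuals
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
print(intercept, slope, sse)
```

On this toy data the fitted line recovers the generating intercept and slope, so the sum of squared errors is zero.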
Instance-based Algorithms
Instance-based learning models a decision problem with instances or
examples of training data that are deemed important or required to the model.

Such methods typically build up a database of example data and compare new data to the database
using a similarity measure in order to find the best match and make a prediction. For this reason,
instance-based methods are also called winner-take-all methods and memory-based learning. Focus is
put on representation of the stored instances and similarity measures used between instances.

The most popular instance-based algorithms are:

 k-Nearest Neighbour (kNN)


 Learning Vector Quantization (LVQ)
 Self-Organizing Map (SOM)
 Locally Weighted Learning (LWL)
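
The "database of examples plus similarity measure" idea can be sketched in a few lines. This is a minimal k-Nearest Neighbour classifier in plain Python using Euclidean distance; the name knn_predict and the toy points are assumptions for the example, not taken from any library.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train is a list of (features, label); return the majority label
    among the k stored instances closest to query."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# hypothetical stored instances: two well-separated groups
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
         ((5.0, 5.0), "b"), ((5.2, 4.9), "b"), ((4.8, 5.1), "b")]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

Note that nothing is "trained" here: all the work happens at prediction time, which is why the stored representation and the distance function matter so much.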

Regularization Algorithms
An extension made to another method (typically regression methods)
that penalizes models based on their complexity, favoring simpler models that are also better at
generalizing.

I have listed regularization algorithms separately here because they are popular, powerful and generally
simple modifications made to other methods.

The most popular regularization algorithms are:

 Ridge Regression
 Least Absolute Shrinkage and Selection Operator (LASSO)
 Elastic Net
 Least-Angle Regression (LARS)
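
As a rough illustration of how a complexity penalty favours simpler models, here is a sketch of ridge regression for one input variable. I am assuming the intercept is left unpenalized, in which case the slope has the closed form Sxy / (Sxx + lambda); the name ridge_slope and the data are hypothetical.

```python
def ridge_slope(xs, ys, lam):
    """Slope of a one-variable ridge fit with an unpenalized intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / (sxx + lam)

# hypothetical data generated from y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

# a larger penalty shrinks the slope towards zero
for lam in (0.0, 5.0, 50.0):
    print(lam, ridge_slope(xs, ys, lam))
```

With no penalty the result is the ordinary least squares slope; as the penalty grows, the coefficient is pulled towards zero.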
Decision Tree Algorithms

Decision tree methods construct a model of decisions made based on actual values of attributes in the
data.

Decisions fork in tree structures until a prediction decision is made for a given record. Decision trees are
trained on data for classification and regression problems. Decision trees are often fast and accurate and
a big favorite in machine learning.

The most popular decision tree algorithms are:

 Classification and Regression Tree (CART)


 Iterative Dichotomiser 3 (ID3)
 C4.5 and C5.0 (different versions of a powerful approach)
 Chi-squared Automatic Interaction Detection (CHAID)
 Decision Stump
 M5
 Conditional Decision Trees
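
To show what "a decision based on the actual value of an attribute" looks like, here is a sketch of a decision stump for regression: a single split chosen to minimise the sum of squared errors on each side. The helper names sse and best_split and the data are assumptions for the illustration.

```python
def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, loss) for the rule x <= threshold with lowest loss."""
    best = (None, float("inf"))
    candidates = sorted(set(xs))
    # try the midpoint between each pair of adjacent attribute values
    for lo, hi in zip(candidates, candidates[1:]):
        threshold = (lo + hi) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        loss = sse(left) + sse(right)
        if loss < best[1]:
            best = (threshold, loss)
    return best

# hypothetical data with an obvious jump between x = 3 and x = 10
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.0, 5.0, 9.0, 9.0, 9.0]
print(best_split(xs, ys))
```

A full tree simply repeats this search recursively inside each side of the split.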
Bayesian Algorithms

Bayesian methods are those that explicitly apply Bayes’ Theorem for problems such as classification and
regression.

The most popular Bayesian algorithms are:

 Naive Bayes
 Gaussian Naive Bayes
 Multinomial Naive Bayes
 Averaged One-Dependence Estimators (AODE)
 Bayesian Belief Network (BBN)
 Bayesian Network (BN)
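
Here is a small worked example of Bayes' Theorem for a two-class problem, in the spirit of the spam/not-spam example earlier. All the probabilities are made-up numbers and the function name posterior is an assumption for the sketch.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """P(class | evidence) for a two-class problem via Bayes' Theorem."""
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence

# hypothetical numbers: 20% of mail is spam; the word "offer" appears in
# 60% of spam messages and in 5% of non-spam messages
p_spam_given_offer = posterior(prior=0.2, likelihood=0.6,
                               likelihood_given_not=0.05)
print(p_spam_given_offer)
```

Naive Bayes applies the same calculation with one likelihood term per attribute, multiplied together under the assumption that the attributes are independent given the class.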
Clustering Algorithms

Clustering, like regression, describes the class of problem and the class of methods.

Clustering methods are typically organized by the modelling approaches such as centroid-based and
hierarchical. All methods are concerned with using the inherent structures in the data to best organize the
data into groups of maximum commonality.

The most popular clustering algorithms are:

 k-Means
 k-Medians
 Expectation Maximisation (EM)
 Hierarchical Clustering

Association Rule Learning Algorithms

Association rule learning methods extract rules that best
explain observed relationships between variables in data.
These rules can discover important and commercially useful associations in large multidimensional
datasets that can be exploited by an organisation.

The most popular association rule learning algorithms are:

 Apriori algorithm
 Eclat algorithm
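
Algorithms such as Apriori rank candidate rules by measures like support and confidence. The sketch below only computes those two measures on a handful of made-up transactions; it is not an implementation of Apriori or Eclat, and the function names are assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# hypothetical shopping baskets
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
```

Here "bread and milk" appears in half of the baskets, and two out of the three baskets containing bread also contain milk.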

Artificial Neural Network Algorithms

Artificial Neural Networks are models that are inspired by the structure
and/or function of biological neural networks.

They are a class of pattern matching methods that are commonly used for regression and classification
problems but are really an enormous subfield comprising hundreds of algorithms and variations for all
manner of problem types.

Note that I have separated out Deep Learning from neural networks because of the massive growth and
popularity in the field. Here we are concerned with the more classical methods.

The most popular artificial neural network algorithms are:

 Perceptron
 Back-Propagation
 Hopfield Network
 Radial Basis Function Network (RBFN)
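
The Perceptron is the simplest of these to write down. Below is a minimal sketch of the perceptron learning rule on the logical AND problem; the function names, the learning rate and the number of epochs are assumptions chosen for the example.

```python
def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """samples is a list of ((x1, x2), target) with targets 0 or 1."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            error = target - output
            # weights only change when the prediction is wrong
            w[0] += learning_rate * error * x1
            w[1] += learning_rate * error * x2
            b += learning_rate * error
    return w, b

def predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(and_gate)
outputs = [predict(w, b, x) for x, _ in and_gate]
print(outputs)
```

AND is linearly separable, so the rule converges; a single perceptron cannot learn XOR, which is one motivation for multi-layer networks trained with Back-Propagation.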
Deep Learning Algorithms

Deep Learning methods are a modern update to Artificial Neural Networks that exploit abundant cheap
computation.

They are concerned with building much larger and more complex neural networks, and as commented
above, many methods are concerned with semi-supervised learning problems where large datasets
contain very little labelled data.

The most popular deep learning algorithms are:

 Deep Boltzmann Machine (DBM)


 Deep Belief Networks (DBN)
 Convolutional Neural Network (CNN)
 Stacked Auto-Encoders
Dimensionality Reduction Algorithms

Like clustering methods, dimensionality reduction methods seek and exploit the inherent structure in the
data, but in this case in an unsupervised manner in order to summarise or describe data using less
information.

This can be useful to visualize high-dimensional data or to simplify data which can then be used in a
supervised learning method. Many of these methods can be adapted for use in classification and
regression.

 Principal Component Analysis (PCA)


 Principal Component Regression (PCR)
 Partial Least Squares Regression (PLSR)
 Sammon Mapping
 Multidimensional Scaling (MDS)
 Projection Pursuit
 Linear Discriminant Analysis (LDA)
 Mixture Discriminant Analysis (MDA)
 Quadratic Discriminant Analysis (QDA)
 Flexible Discriminant Analysis (FDA)
Ensemble Algorithms

Ensemble methods are models composed of multiple weaker models that are independently trained and
whose predictions are combined in some way to make the overall prediction.

Much effort is put into what types of weak learners to combine and the ways in which to combine them.
This is a very powerful class of techniques and as such is very popular.

 Boosting
 Bootstrapped Aggregation (Bagging)
 AdaBoost
 Stacked Generalization (blending)
 Gradient Boosting Machines (GBM)
 Gradient Boosted Regression Trees (GBRT)
 Random Forest
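
In the spirit of the "ensemble of lines of best fit" picture at the top of the post, here is a rough sketch of Bootstrapped Aggregation in plain Python: each weak member is a straight line fitted to a bootstrap resample, and the ensemble prediction is the average of the members. The function names, the seed, the number of models and the noisy toy data are all assumptions for the illustration.

```python
import random

def fit_line(points):
    """Least-squares (intercept, slope) for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx else 0.0   # flat line if all x are identical
    return mean_y - slope * mean_x, slope

def bagged_predict(points, x_new, n_models=25, seed=0):
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        # bootstrap resample: draw len(points) points with replacement
        sample = [rng.choice(points) for _ in points]
        intercept, slope = fit_line(sample)
        predictions.append(intercept + slope * x_new)
    # combine the weak members by averaging
    return sum(predictions) / len(predictions)

# hypothetical data: y = 2x + 1 with a small alternating disturbance
points = [(float(x), 2.0 * x + 1.0 + (0.5 if x % 2 == 0 else -0.5))
          for x in range(10)]
combined = bagged_predict(points, 4.0)
print(combined)
```

Averaging smooths out the differences between the individual lines, so the combined prediction stays close to the underlying trend even though each member saw a different resample.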
Machine Learning with R

R is the most popular platform for applied machine learning. When you want to get serious with applied
machine learning you will find your way into R.
It is very powerful because so many machine learning algorithms are provided. A problem is that the
algorithms are all provided by third parties, which makes their usage very inconsistent. This slows you
down, a lot, because you have to learn how to model data and how to make predictions with each
algorithm in each package, again and again.

In this post you will discover how you can overcome this difficulty with machine learning algorithms in R,
with pre-prepared recipes that follow a consistent structure.
The R ecosystem is enormous. Open source third party packages provide this power, allowing academics
and professionals to get the most powerful algorithms available into the hands of us practitioners.

A problem that I experienced when starting out with R was that the usage of each algorithm differs from
package to package. This inconsistency also extends to the documentation, with some providing worked
examples for classification but ignoring regression and others not providing examples at all.

All this means that if you want to try a few different algorithms from different packages, you must spend
time figuring out how to fit and make predictions with each method in turn. This takes a lot of time,
especially with the spotty examples and vignettes.

I summarize these difficulties as follows:

 Inconsistent: Algorithm implementations vary in the way a model is fit to data and the way a model is
used to generate predictions. This means that you have to study each package and each algorithm
implementation just to put together a working example, let alone adapt it to your problem.
 Decentralized: Algorithms are implemented across different packages and it can be hard to locate which
packages provide an implementation of the algorithm you need, let alone which package provides the
most popular implementation. Additionally, the documentation for one package may be spread across
multiple help files, websites and vignettes. This means you have to do a lot of searching just to locate an
algorithm, let alone compile a list of algorithms from which you can choose.
 Incomplete: Algorithm documentation is almost always only partially complete. An example usage may
or may not be provided and, if it is, it may or may not be demonstrated on a canonical problem. This
means you have no obvious way to quickly understand how to use an implementation.
 Complexity: Algorithms vary in their complexity of implementation and description. This can take its toll
on you as you jump from package to package. You want to focus on how to get the most from the
algorithm and its parameters, and not burn energy on parsing reams of PDFs just to get a hello world.

Linear Regression in R

You can copy and paste the recipes in this post to make a jump-start on your own problem or to learn and
practice with linear regression in R.
Ordinary Least Squares Regression

Ordinary Least Squares (OLS) regression is a linear model that seeks to find a set of coefficients for a
line/hyper-plane that minimise the sum of the squared errors.

# load data
data(longley)
# fit model
fit<- lm(Employed~., longley)
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, longley)
# summarize accuracy
rmse <- sqrt(mean((longley$Employed - predictions)^2))
print(rmse)
Stepwise Linear Regression

Stepwise Linear Regression is a method that makes use of linear regression to discover which subset of
attributes in the dataset result in the best performing model. It is step-wise because each iteration of the
method makes a change to the set of attributes and creates a model to evaluate the performance of the
set.

# load data
data(longley)
# fit model
base<- lm(Employed~., longley)
# summarize the fit
summary(base)
# perform step-wise feature selection
fit<- step(base)
# summarize the selected model
summary(fit)
# make predictions
predictions<- predict(fit, longley)
# summarize accuracy
rmse <- sqrt(mean((longley$Employed - predictions)^2))
print(rmse)

Principal Component Regression

Principal Component Regression (PCR) creates a linear regression model using the outputs of a Principal
Component Analysis (PCA) to estimate the coefficients of the model. PCR is useful when the data has
highly-correlated predictors.

# load the package


library(pls)
# load data
data(longley)
# fit model
fit<- pcr(Employed~., data=longley, validation="CV")
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, longley, ncomp=6)
# summarize accuracy
rmse <- sqrt(mean((longley$Employed - predictions)^2))
print(rmse)

Partial Least Squares Regression

Partial Least Squares (PLS) Regression creates a linear model of the data in a transformed projection of
problem space. Like PCR, PLS is appropriate for data with highly-correlated predictors.

# load the package


library(pls)
# load data
data(longley)
# fit model
fit<- plsr(Employed~., data=longley, validation="CV")
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, longley, ncomp=6)
# summarize accuracy
rmse <- sqrt(mean((longley$Employed - predictions)^2))
print(rmse)

Multivariate Adaptive Regression Splines

Multivariate Adaptive Regression Splines (MARS) is a non-parametric regression method that models
multiple nonlinearities in data using hinge functions (functions with a kink in them).
# load the package
library(earth)
# load data
data(longley)
# fit model
fit<- earth(Employed~., longley)
# summarize the fit
summary(fit)
# summarize the importance of input variables
evimp(fit)
# make predictions
predictions<- predict(fit, longley)
# summarize accuracy
rmse <- sqrt(mean((longley$Employed - predictions)^2))
print(rmse)

Support Vector Machine

Support Vector Machines (SVM) are a class of methods, developed originally for classification, that find
support points that best separate classes. SVM for regression is called Support Vector Regression
(SVR).

# load the package


library(kernlab)
# load data
data(longley)
# fit model
fit<- ksvm(Employed~., longley)
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, longley)
# summarize accuracy
rmse <- sqrt(mean((longley$Employed - predictions)^2))
print(rmse)

k-Nearest Neighbor

The k-Nearest Neighbor (kNN) method does not create a model; instead, it creates predictions from close
data on demand when a prediction is required. A similarity measure (such as Euclidean distance) is used
to locate close data in order to make predictions.

# load the package


library(caret)
# load data
data(longley)
# fit model
fit<- knnreg(longley[,1:6], longley[,7], k=3)
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, longley[,1:6])
# summarize accuracy
rmse <- sqrt(mean((longley$Employed - predictions)^2))
print(rmse)

Neural Network

A Neural Network (NN) is a graph of computational units that receive inputs and transfer the result into an
output that is passed on. The units are ordered into layers to connect the features of an input vector to the
features of an output vector. With training, such as the Back-Propagation algorithm, neural networks can
be designed and trained to model the underlying relationship in data.

# load the package


library(nnet)
# load data
data(longley)
x <- longley[,1:6]
y <- longley[,7]
# fit model
fit<- nnet(Employed~., longley, size=12, maxit=500, linout=T, decay=0.01)
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, x, type="raw")
# summarize accuracy
rmse <- sqrt(mean((y - predictions)^2))
print(rmse)
Classification and Regression Trees

Classification and Regression Trees (CART) split attributes based on values that minimize a loss function,
such as sum of squared errors.

The following recipe demonstrates the recursive partitioning decision tree method on the longley dataset.

# load the package


library(rpart)
# load data
data(longley)
# fit model
fit<- rpart(Employed~., data=longley, control=rpart.control(minsplit=5))
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, longley[,1:6])
# summarize accuracy
rmse <- sqrt(mean((longley$Employed - predictions)^2))
print(rmse)

Conditional Decision Trees

Conditional Decision Trees are created using statistical tests to select split points on attributes rather
than a loss function.

The following recipe demonstrates the conditional inference trees method on the longley dataset.

# load the package


library(party)
# load data
data(longley)
# fit model
fit<- ctree(Employed~., data=longley, controls=ctree_control(minsplit=2,minbucket=2,testtype="Univariate"))
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, longley[,1:6])
# summarize accuracy
rmse <- sqrt(mean((longley$Employed - predictions)^2))
print(rmse)

Model Trees

Model Trees create a decision tree and use a linear model at each node to make a prediction rather than
using an average value.

The following recipe demonstrates the M5P Model Tree method on the longley dataset.

Rule System

Rule Systems can be created by extracting and simplifying the rules from a decision tree.

The following recipe demonstrates the M5Rules Rule System on the longley dataset.

# load the package


library(RWeka)
# load data
data(longley)
# fit model
fit<- M5Rules(Employed~., data=longley)
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, longley[,1:6])
# summarize accuracy
rmse <- sqrt(mean((longley$Employed - predictions)^2))
print(rmse)

Bagging CART

Bootstrapped Aggregation (Bagging) is an ensemble method that creates multiple models of the same
type from different sub-samples of the same dataset. The predictions from each separate model are
combined together to provide a superior result. This approach has proven particularly effective for
high-variance methods such as decision trees.

The following recipe demonstrates bagging applied to the recursive partitioning decision tree.

# load the packages (rpart provides rpart.control)
library(ipred)
library(rpart)
# load data
data(longley)
# fit model
fit<- bagging(Employed~., data=longley, control=rpart.control(minsplit=5))
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, longley[,1:6])
# summarize accuracy
rmse <- sqrt(mean((longley$Employed - predictions)^2))
print(rmse)
Random Forest

Random Forest is a variation on Bagging of decision trees that reduces the attributes available for
making a tree at each decision point to a random sub-sample. This further increases the variation
between the individual trees, so more trees are required.

# load the package


library(randomForest)
# load data
data(longley)
# fit model
fit<- randomForest(Employed~., data=longley)
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, longley[,1:6])
# summarize accuracy
rmse <- sqrt(mean((longley$Employed - predictions)^2))
print(rmse)

Gradient Boosted Machine


Boosting is an ensemble method, developed for classification to reduce bias, in which models are added
to learn the misclassification errors of the existing models. It has been generalized and adapted in the
form of Gradient Boosted Machines (GBM) for use with CART decision trees for classification and
regression.

# load the package


library(gbm)
# load data
data(longley)
# fit model
fit<- gbm(Employed~., data=longley, distribution="gaussian")
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, longley)
# summarize accuracy
rmse <- sqrt(mean((longley$Employed - predictions)^2))
print(rmse)

Cubist

Cubist decision trees are another ensemble method. They are constructed like model trees but involve a
boosting-like procedure called committees that are rule-like models.

# load the package


library(Cubist)
# load data
data(longley)
# fit model
fit<- cubist(longley[,1:6], longley[,7])
# summarize the fit
summary(fit)
# make predictions
predictions<- predict(fit, longley[,1:6])
# summarize accuracy
rmse <- sqrt(mean((longley$Employed - predictions)^2))
print(rmse)

Machine Learning Using Python


In the picture, we used the following definitions for the notations:

 $a_i^{(j)}$ : "activation" of unit $i$ in layer $j$
 $\Theta^{(j)}$ : matrix of weights controlling the function mapping from layer $j$ to layer $j+1$

Here are the computations represented by the NN picture above:

$$a_0^{(2)} = g(\Theta_{00}^{(1)} x_0 + \Theta_{01}^{(1)} x_1 + \Theta_{02}^{(1)} x_2) = g(\Theta_0^T x) = g(z_0^{(2)})$$

$$a_1^{(2)} = g(\Theta_{10}^{(1)} x_0 + \Theta_{11}^{(1)} x_1 + \Theta_{12}^{(1)} x_2) = g(\Theta_1^T x) = g(z_1^{(2)})$$

$$a_2^{(2)} = g(\Theta_{20}^{(1)} x_0 + \Theta_{21}^{(1)} x_1 + \Theta_{22}^{(1)} x_2) = g(\Theta_2^T x) = g(z_2^{(2)})$$

$$h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)} a_0^{(2)} + \Theta_{11}^{(2)} a_1^{(2)} + \Theta_{12}^{(2)} a_2^{(2)})$$

In the equations, $g$ is the sigmoid function, a special case of the logistic function, defined by the
formula:

$$g(z) = \frac{1}{1 + e^{-z}}$$

Sigmoid functions

One of the reasons to use the sigmoid function (also called the logistic function) is that it was the first
one to be used. Its derivative also has a very good property. In a lot of weight update algorithms, we
need to know a derivative (sometimes even higher order derivatives). These can all be expressed as
products of $f$ and $1-f$. In fact, it's the only class of functions that satisfies $f'(t) = f(t)(1 - f(t))$.
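
As a quick sanity check of the derivative property, the sketch below compares $g(z)(1-g(z))$ against a central finite-difference estimate of the derivative. The helper names g_prime and numeric_derivative and the step size are assumptions for the illustration.

```python
import math

def g(z):
    """Sigmoid (logistic) function."""
    return 1.0 / (1.0 + math.exp(-z))

def g_prime(z):
    """Derivative written as a product of g and 1 - g."""
    return g(z) * (1.0 - g(z))

def numeric_derivative(f, z, h=1e-6):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)

for z in (-2.0, 0.0, 1.5):
    print(z, g_prime(z), numeric_derivative(g, z))
```

The two columns agree to many decimal places, and at $z = 0$ the slope is exactly one quarter because $g(0) = 0.5$.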

However, usually the weights are much more important than the particular function chosen. These
sigmoid functions are very similar, and the output differences are small. See the plot on the Wikipedia
page "Sigmoid function"; note that all functions in it are normalized in such a way that their slope at the
origin is 1.

Forward Propagation

If we use matrix notation, the equations of the previous section become:

$$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix}, \quad z^{(2)} = \begin{bmatrix} z_0^{(2)} \\ z_1^{(2)} \\ z_2^{(2)} \end{bmatrix}$$

$$z^{(2)} = \Theta^{(1)} x = \Theta^{(1)} a^{(1)}$$

$$a^{(2)} = g(z^{(2)})$$

$$a_0^{(2)} = 1.0$$

$$z^{(3)} = \Theta^{(2)} a^{(2)}$$

$$h_\Theta(x) = a^{(3)} = g(z^{(3)})$$
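
The matrix equations above can be traced step by step in plain Python. This is only a sketch: the weights are hand-picked, hypothetical values (chosen so that the network behaves roughly like XOR), the bias input $x_0$ is assumed to be 1, and matvec is a small helper written for the example.

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(theta, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in theta]

def forward(theta1, theta2, x1, x2):
    a1 = [1.0, x1, x2]           # a^(1) = x, with bias unit x_0 = 1
    z2 = matvec(theta1, a1)      # z^(2) = Theta^(1) a^(1)
    a2 = [g(z) for z in z2]      # a^(2) = g(z^(2))
    a2[0] = 1.0                  # a_0^(2) = 1.0 (hidden-layer bias unit)
    z3 = matvec(theta2, a2)      # z^(3) = Theta^(2) a^(2)
    return [g(z) for z in z3]    # h(x) = a^(3) = g(z^(3))

# hypothetical hand-picked weights
theta1 = [[0.0, 0.0, 0.0],       # row 0 is overwritten by the bias unit
          [-10.0, 20.0, 20.0],   # roughly "x1 OR x2"
          [30.0, -20.0, -20.0]]  # roughly "NOT (x1 AND x2)"
theta2 = [[-30.0, 20.0, 20.0]]   # roughly "AND" of the two hidden units

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), forward(theta1, theta2, x1, x2)[0])
```

The outputs are close to 0 for equal inputs and close to 1 for unequal inputs, which is the same XOR behaviour the trained network in the Code section is meant to learn.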

Back Propagation (Gradient computation)

The backpropagation learning algorithm can be divided into two phases: propagation and weight update.
- from wiki - Backpropagation.

1. Phase 1: Propagation
Each propagation involves the following steps:
o Forward propagation of a training pattern's input through the neural network in order to
generate the propagation's output activations.
o Backward propagation of the propagation's output activations through the neural network
using the training pattern target in order to generate the deltas of all output and hidden
neurons.
2. Phase 2: Weight update
For each weight-synapse follow the following steps:
o Multiply its output delta and input activation to get the gradient of the weight.
o Subtract a ratio (percentage) of the gradient from the weight.

This ratio (percentage) influences the speed and quality of learning; it is called the learning rate.
The greater the ratio, the faster the neuron trains; the lower the ratio, the more accurate the
training is. The sign of the gradient of a weight indicates where the error is increasing; this is why
the weight must be updated in the opposite direction.

Repeat phase 1 and 2 until the performance of the network is satisfactory.
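
To get a feel for how the learning rate trades speed against stability, here is a toy sketch that applies the "subtract a ratio of the gradient" rule to a single weight. The error function $E(w) = (w - 3)^2$, the step count and the rates are all made up for the illustration.

```python
def descend(learning_rate, steps=20, w=0.0):
    """Minimise the toy error E(w) = (w - 3)^2 by gradient descent."""
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)      # dE/dw
        w -= learning_rate * gradient   # move against the gradient
    return w

for rate in (0.01, 0.1, 0.5, 1.1):
    print(rate, descend(rate))
```

Small rates creep towards the minimum at $w = 3$, a moderate rate gets there quickly, and a rate that is too large overshoots further on every step and diverges.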

If we denote the error of node $j$ in layer $l$ as $\delta_j^{(l)}$, then for our output unit ($L=3$) it
becomes the activation minus the actual value:

$$\delta_j^{(3)} = a_j^{(3)} - y_j = h_\Theta(x) - y_j$$

If we use a vector form, it is:

$$\delta^{(3)} = a^{(3)} - y$$

$$\delta^{(2)} = (\Theta^{(2)})^T \delta^{(3)} \cdot g'(z^{(2)})$$

where

$$g'(z^{(2)}) = a^{(2)} \cdot (1 - a^{(2)})$$

and the products involving $g'$ are taken element-wise.

Note that we do not have a $\delta^{(1)}$ term because that's the input layer and the values are the ones
that we observed and they are being used as a training set. So, there are no errors associated with the
input.

Also, the derivative of the cost function can be written like this:

$$\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = a_j^{(l)} \delta_i^{(l+1)}$$

We use this value to update weights and we can multiply learning rate before we adjust the weight.

self.weights[i] += learning_rate * layer.T.dot(delta)

where the layer in the code is actually $a^{(l)}$.
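
One way to convince yourself that the delta formulas give the right gradient is a numerical check on a tiny network. The sketch below makes several assumptions that are not stated in the text: the cost $J$ is the cross-entropy for a single training example (the cost for which the output delta is the activation minus the actual value), $\Theta^{(1)}$ has one row per non-bias hidden unit, and $\Theta^{(2)}$ is stored as a flat list because there is one output. All weights and data are hypothetical.

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(theta1, theta2, x):
    a1 = [1.0] + list(x)                               # input plus bias unit
    a2 = [1.0] + [g(sum(w * a for w, a in zip(row, a1))) for row in theta1]
    a3 = g(sum(w * a for w, a in zip(theta2, a2)))     # single output unit
    return a1, a2, a3

def cost(theta1, theta2, x, y):
    """Assumed cost: cross-entropy for one example."""
    h = forward(theta1, theta2, x)[2]
    return -(y * math.log(h) + (1.0 - y) * math.log(1.0 - h))

def backprop(theta1, theta2, x, y):
    a1, a2, a3 = forward(theta1, theta2, x)
    delta3 = a3 - y                                    # output delta
    # hidden deltas for the non-bias units i = 1, 2
    delta2 = [theta2[i] * delta3 * a2[i] * (1.0 - a2[i]) for i in (1, 2)]
    grad2 = [a * delta3 for a in a2]                   # a_j^(2) * delta^(3)
    grad1 = [[a * d for a in a1] for d in delta2]      # a_j^(1) * delta_i^(2)
    return grad1, grad2

theta1 = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]
theta2 = [0.7, -0.8, 0.9]
x, y = (1.0, 0.0), 1.0
grad1, grad2 = backprop(theta1, theta2, x, y)

# finite-difference estimate for one weight of the output layer
eps = 1e-6
up = [theta2[0], theta2[1] + eps, theta2[2]]
down = [theta2[0], theta2[1] - eps, theta2[2]]
numeric = (cost(theta1, up, x, y) - cost(theta1, down, x, y)) / (2.0 * eps)
print(grad2[1], numeric)
```

The analytic and numerical values should agree closely; the same comparison can be repeated for any entry of either weight matrix.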

Code


import numpy as np

def sigmoid(x):
    return 1.0/(1.0 + np.exp(-x))

def sigmoid_prime(a):
    # derivative written in terms of the activation a = sigmoid(z),
    # because fit() passes in activations rather than z values
    return a*(1.0 - a)

def tanh(x):
    return np.tanh(x)

def tanh_prime(a):
    # derivative written in terms of the activation a = tanh(z)
    return 1.0 - a**2

class NeuralNetwork:
    def __init__(self, layers, activation='tanh'):
        if activation == 'sigmoid':
            self.activation = sigmoid
            self.activation_prime = sigmoid_prime
        elif activation == 'tanh':
            self.activation = tanh
            self.activation_prime = tanh_prime

        # Set weights
        self.weights = []
        # layers = [2,2,1]
        # range of weight values (-1,1)
        # input and hidden layers - random((2+1, 2+1)) : 3 x 3
        for i in range(1, len(layers) - 1):
            r = 2*np.random.random((layers[i-1] + 1, layers[i] + 1)) - 1
            self.weights.append(r)
        # output layer - random((2+1, 1)) : 3 x 1
        r = 2*np.random.random((layers[i] + 1, layers[i+1])) - 1
        self.weights.append(r)

    def fit(self, X, y, learning_rate=0.2, epochs=100000):
        # Add column of ones to X
        # This is to add the bias unit to the input layer
        ones = np.atleast_2d(np.ones(X.shape[0]))
        X = np.concatenate((ones.T, X), axis=1)

        for k in range(epochs):
            # pick one training sample at random
            i = np.random.randint(X.shape[0])
            a = [X[i]]

            # forward propagation
            for l in range(len(self.weights)):
                dot_value = np.dot(a[l], self.weights[l])
                activation = self.activation(dot_value)
                a.append(activation)
            # output layer
            error = y[i] - a[-1]
            deltas = [error * self.activation_prime(a[-1])]

            # we need to begin at the second to last layer
            # (a layer before the output layer)
            for l in range(len(a) - 2, 0, -1):
                deltas.append(deltas[-1].dot(self.weights[l].T)*self.activation_prime(a[l]))

            # reverse
            # [level3(output)->level2(hidden)] => [level2(hidden)->level3(output)]
            deltas.reverse()

            # backpropagation
            # 1. Multiply its output delta and input activation
            #    to get the gradient of the weight.
            # 2. Subtract a ratio (percentage) of the gradient from the weight.
            #    (error is target - output, so adding here moves against the gradient)
            for i in range(len(self.weights)):
                layer = np.atleast_2d(a[i])
                delta = np.atleast_2d(deltas[i])
                self.weights[i] += learning_rate * layer.T.dot(delta)

            if k % 10000 == 0:
                print('epochs:', k)

    def predict(self, x):
        # prepend the bias unit to the 1-D input vector
        a = np.concatenate((np.ones(1), np.array(x)))
        for l in range(0, len(self.weights)):
            a = self.activation(np.dot(a, self.weights[l]))
        return a

if __name__ == '__main__':

    nn = NeuralNetwork([2,2,1])
    X = np.array([[0, 0],
                  [0, 1],
                  [1, 0],
                  [1, 1]])
    y = np.array([0, 1, 1, 0])
    nn.fit(X, y)
    for e in X:
        print(e, nn.predict(e))

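Output from one run (the tuple-style lines below come from Python 2's print; exact values vary with the
random initialization):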

epochs: 0
epochs: 10000
epochs: 20000
epochs: 30000
epochs: 40000
epochs: 50000
epochs: 60000
epochs: 70000
epochs: 80000
epochs: 90000
(array([0, 0]), array([ 9.14891326e-05]))
(array([0, 1]), array([ 0.99557796]))
(array([1, 0]), array([ 0.99707463]))
(array([1, 1]), array([ 0.00090973]))

k-Means Clustering

k-Means clustering finds the centers of clusters and groups the input samples around those centers.

k-Means Clustering is a partitioning method which partitions data into k mutually exclusive clusters, and
returns the index of the cluster to which it has assigned each observation. Unlike hierarchical clustering,
k-means clustering operates on actual observations (rather than the larger set of dissimilarity measures),
and creates a single level of clusters. The distinctions mean that k-means clustering is often more
suitable than hierarchical clustering for large amounts of data.
The following description for the steps is from wiki - K-means_clustering.

Step 1
k initial "means" (in this case k=3) are randomly generated within the data domain.

Step 2
k clusters are created by associating every observation with the nearest mean. The partitions here
represent the Voronoi diagram generated by the means.

Step 3
The centroid of each of the k clusters becomes the new mean.
Step 4
Steps 2 and 3 are repeated until convergence has been reached.
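
The four steps above can be sketched for one-dimensional data in plain Python. This is a minimal illustration, not OpenCV's implementation: the function name kmeans_1d, the seed, the iteration cap and the rule of keeping the old mean for an empty cluster are all assumptions made for the example.

```python
import random

def kmeans_1d(values, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    lo, hi = min(values), max(values)
    # Step 1: k initial means generated at random within the data domain
    means = [rng.uniform(lo, hi) for _ in range(k)]
    labels = []
    for _ in range(max_iter):
        # Step 2: associate every observation with the nearest mean
        labels = [min(range(k), key=lambda j: abs(v - means[j])) for v in values]
        # Step 3: the centroid of each cluster becomes the new mean
        new_means = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_means.append(sum(members) / len(members) if members else means[j])
        # Step 4: repeat until the means stop changing
        if new_means == means:
            break
        means = new_means
    return means, labels

# hypothetical data: two separated groups of 25 values each
values = list(range(25, 100, 3)) + list(range(175, 250, 3))
means, labels = kmeans_1d(values, 2)
print(sorted(means))
```

On data this well separated, each group ends up with its own label and the two means settle on the group averages.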

K-Means Clustering in OpenCV

In OpenCV, we use cv2.kmeans(), which is defined like this:

cv2.kmeans(data, K, criteria, attempts, flags[, bestLabels[, centers]])

The parameters are:

 data : Data for clustering.


 K : Number of clusters to split the set by.
 criteria : The algorithm termination criteria, that is, the maximum number of iterations and/or the
desired accuracy. The accuracy is specified as criteria.epsilon. As soon as each of the cluster
centers moves by less than criteria.epsilon on some iteration, the algorithm stops.
 attempts : The number of times the algorithm is executed using different initial
labellings. The algorithm returns the labels that yield the best compactness (in Python, the
compactness is the first return value).
 flags :
o KMEANS_RANDOM_CENTERS - selects random initial centers in each attempt.
o KMEANS_PP_CENTERS - uses kmeans++ center initialization by Arthur and
Vassilvitskii.
o KMEANS_USE_INITIAL_LABELS - during the first (and possibly the only) attempt, use
the user-supplied labels instead of computing them from the initial centers. For the
second and further attempts, use the random or semi-random centers. Use one of the
KMEANS_*_CENTERS flags to specify the exact method.
 centers : Output matrix of the cluster centers, one row per each cluster center.

1-D data (one feature)


In this section, we'll play with a data set which has only one feature. The data is one-dimensional, and
clustering can be decided by one parameter.

We generated two groups of random numbers in two separated ranges, each group with 25 numbers as
shown in the picture below:

The code:

import numpy as np
import cv2
from matplotlib import pyplot as plt

x = np.random.randint(25,100,25)    # 25 random integers in [25,100)
y = np.random.randint(175,250,25)   # 25 random integers in [175,250)
z = np.hstack((x,y))                # z.shape : (50,)
z = z.reshape((50,1))               # reshape to a column vector
z = np.float32(z)                   # cv2.kmeans() expects float32 data
plt.hist(z,256,[0,256])
plt.show()

Now, we're going to apply the cv2.kmeans() function to the data with the following parameters:

# Define criteria = ( type, max_iter = 10 , epsilon = 1.0 )
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)

# Set flags
flags = cv2.KMEANS_RANDOM_CENTERS

# Apply KMeans
compactness,labels,centers = cv2.kmeans(z,2,None,criteria,10,flags)

In the code, the criteria mean that the algorithm stops and returns the answer as soon as 10
iterations have been run or an accuracy of epsilon = 1.0 has been reached, whichever comes first.

Here are the meanings of the return values from cv2.kmeans() function:

 compactness : The sum of squared distances from each point to its corresponding center.
 labels : The label array, in which each element is marked '0', '1', and so on, to indicate the
cluster of the corresponding sample.
 centers : The array of cluster centers.
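
To make the meaning of the compactness value concrete, the following minimal sketch recomputes it by hand for 1-D data. The helper name compactness and the toy values are hypothetical (they are not the random data used in this section), and the sketch assumes squared Euclidean distance as stated in the description above.

```python
def compactness(points, labels, centers):
    """Sum of squared distances from each 1-D point to its assigned center."""
    return sum((p - centers[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical toy data: three points per cluster
points  = [30.0, 40.0, 50.0, 200.0, 210.0, 220.0]
labels  = [0, 0, 0, 1, 1, 1]
centers = [40.0, 210.0]

print(compactness(points, labels, centers))
```

A smaller value means the points lie closer to their centers, which is why the attempt with the lowest compactness is the one that is kept.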

For the current case, it gives us centers as:

centers: [[ 72.08000183] [ 217.03999329]]

Here is the final code for 1-D data:

import numpy as np
import cv2
from matplotlib import pyplot as plt

x = np.random.randint(25,100,25)    # 25 random integers in [25,100)
y = np.random.randint(175,250,25)   # 25 random integers in [175,250)
z = np.hstack((x,y))                # z.shape : (50,)
z = z.reshape((50,1))               # reshape to a column vector
z = np.float32(z)                   # cv2.kmeans() expects float32 data

# Define criteria = ( type, max_iter = 10 , epsilon = 1.0 )
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)

# Set flags
flags = cv2.KMEANS_RANDOM_CENTERS

# Apply KMeans
compactness,labels,centers = cv2.kmeans(z,2,None,criteria,10,flags)

print('centers: %s' %centers)

# Split the samples by the cluster label assigned to them
A = z[labels==0]
B = z[labels==1]

# initial plot
plt.subplot(211)
plt.hist(z,256,[0,256])

# Now plot 'A' in red, 'B' in blue, 'centers' in yellow
plt.subplot(212)
plt.hist(A,256,[0,256],color = 'r')
plt.hist(B,256,[0,256],color = 'b')
plt.hist(centers,32,[0,256],color = 'y')

plt.show()

We can see that the data has been clustered: in this run, the centroid of A is about 72 and the
centroid of B is about 217, which match the centers returned by the cv2.kmeans() function.

Signal Processing with NumPy arrays in iPython

iPython

IPython currently provides the following features (wiki-iPython):

 Powerful interactive shells (terminal and Qt-based).
 A browser-based notebook with support for code, text, mathematical expressions, inline plots and
other rich media.
 Support for interactive data visualization and use of GUI toolkits.
 Flexible, embeddable interpreters to load into one's own projects.
 Easy to use, high performance tools for parallel computing.

Basic Signals - boxcar

We'll make a simple boxcar with np.zeros() and np.ones().

We start with a simple command that launches a Python environment, ipython --pylab:

$ ipython --pylab
Python 2.7.5+ (default, Feb 27 2014, 19:37:08)
Type "copyright", "credits" or "license" for more information.

IPython 0.13.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

Welcome to pylab, a matplotlib-based Python environment [backend: TkAgg].
For more information, type 'help(pylab)'.

In [1]:

Let's play with arrays first:

In [1]: zeros
Out[1]:

In [2]: zeros(100)
Out[2]:
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [3]: ones(10)
Out[3]: array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

Now, we want to concatenate these to make a boxcar. To learn more about a command, we can add '?'
after it. IPython then shows details about the command as well as some usage samples:

In [4]: concatenate?

The following information is displayed in a separate help window. When we close that window by
typing 'q', our command shell reappears:

Type: builtin_function_or_method
String Form:<built-in function concatenate>
Docstring:
concatenate((a1, a2, ...), axis=0)

Join a sequence of arrays together.

Parameters
----------
a1, a2, ... : sequence of array_like
    The arrays must have the same shape, except in the dimension
    corresponding to `axis` (the first, by default).
axis : int, optional
    The axis along which the arrays will be joined. Default is 0.

Now, the concatenation:

In [5]: concatenate ( (zeros(100), ones(30), zeros(100)) )
Out[5]:
array([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0.])

Actually, we want to assign the array to a variable:

In [6]: data = concatenate ( (zeros(100), ones(30), zeros(100)) )

In [7]: plot(data)
Out[7]: [<matplotlib.lines.Line2D at 0x3024450>]

Then we get the picture below:

Note that the values in the array are used as the y-values. For the x-axis, the sample numbers are
used: we have 100+30+100=230 samples.

Hamming filter

In [8]: hamming(40)
Out[8]:
array([ 0.08 , 0.08595688, 0.10367324, 0.13269023, 0.17225633,
        0.2213468 , 0.27869022, 0.34280142, 0.41201997, 0.48455313,
        0.55852233, 0.63201182, 0.70311825, 0.77 , 0.83092487,
        0.88431494, 0.92878744, 0.96319054, 0.98663324, 0.99850836,
        0.99850836, 0.98663324, 0.96319054, 0.92878744, 0.88431494,
        0.83092487, 0.77 , 0.70311825, 0.63201182, 0.55852233,
        0.48455313, 0.41201997, 0.34280142, 0.27869022, 0.2213468 ,
        0.17225633, 0.13269023, 0.10367324, 0.08595688, 0.08 ])

In [9]: plot(hamming(40))
Out[9]: []

Then, we get the following picture:

The new curve was drawn on top of the old plot because we did not close the plot window before we
plotted again.

So, let's close this window, and plot the Hamming filter again:

In [10]: plot(hamming(40))
Out[10]: []

Now, we have a plot with only the Hamming filter:

However, if we want to apply the filter to the other signal, we need to normalize the filter.

Normalizing the Hamming filter

We'll normalize the filter by dividing it by the sum of the filter values:

In [11]: sum(hamming(40))
Out[11]: 21.139999999999997

In [12]: norm = hamming(40) / sum(hamming(40))

In [13]: plot(norm)
Out[13]: []

Then we get the normalized Hamming filter:

Note that the peak value is much lower than that of the previous Hamming filter.
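
To show where these numbers come from, here is a minimal plain-Python sketch of the window and its normalization. It assumes the standard Hamming definition w(n) = 0.54 - 0.46*cos(2*pi*n/(M-1)), which is the formula documented for NumPy's hamming(). The list-based helper below is only an illustration, not the NumPy implementation.

```python
import math

def hamming(m):
    """Hamming window of length m: w(n) = 0.54 - 0.46*cos(2*pi*n/(m-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (m - 1)) for n in range(m)]

window = hamming(40)
total = sum(window)                    # sum of the raw filter values
norm = [w / total for w in window]     # normalized filter, sums to 1

print(round(total, 6))
print(round(sum(norm), 6))
print(max(window), max(norm))
```

Because the normalized values sum to 1, convolving a signal with them acts as a weighted average: the filter smooths the signal without changing its overall level.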


Convolve

Now we convolve the filter with the boxcar data:

In [14]: convolve(data, norm)
Out[14]:
array([ 0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0.0037843 , 0.00785037, 0.0127545 , 0.01903124, 0.0271796 ,
        0.03765012, 0.05083319, 0.06704896, 0.08653903, 0.10946018,
        0.13588035, 0.16577684, 0.19903693, 0.23546077, 0.27476658,
        0.31659794, 0.36053301, 0.40609548, 0.45276687, 0.5 ,
        0.54723313, 0.59390452, 0.63946699, 0.68340206, 0.72523342,
        0.76453923, 0.80096307, 0.83422316, 0.86411965, 0.89053982,
        0.90967668, 0.92510066, 0.93641231, 0.94331865, 0.94564081,
        0.94331865, 0.93641231, 0.92510066, 0.90967668, 0.89053982,
        0.86411965, 0.83422316, 0.80096307, 0.76453923, 0.72523342,
        0.68340206, 0.63946699, 0.59390452, 0.54723313, 0.5 ,
        0.45276687, 0.40609548, 0.36053301, 0.31659794, 0.27476658,
        0.23546077, 0.19903693, 0.16577684, 0.13588035, 0.10946018,
        0.08653903, 0.06704896, 0.05083319, 0.03765012, 0.0271796 ,
        0.01903124, 0.0127545 , 0.00785037, 0.0037843 , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. , 0. ,
        0. , 0. , 0. , 0. ])

In [15]:

If we plot the result, it looks like this:
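
The convolution itself is a sum of shifted, scaled copies of the filter. The following plain-Python sketch (an illustration only, not the NumPy implementation) rebuilds the boxcar and the normalized Hamming filter and computes the full convolution. The helper names hamming and convolve are reused here for readability, and the window formula is the standard Hamming definition assumed above.

```python
import math

def hamming(m):
    """Hamming window of length m: w(n) = 0.54 - 0.46*cos(2*pi*n/(m-1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (m - 1)) for n in range(m)]

def convolve(a, v):
    """Full discrete convolution; the result has len(a) + len(v) - 1 samples."""
    out = [0.0] * (len(a) + len(v) - 1)
    for i, x in enumerate(a):
        if x == 0.0:
            continue                      # zero samples contribute nothing
        for j, y in enumerate(v):
            out[i + j] += x * y           # add a shifted, scaled copy of v
    return out

data = [0.0] * 100 + [1.0] * 30 + [0.0] * 100    # the boxcar: 230 samples
window = hamming(40)
total = sum(window)
norm = [w / total for w in window]               # normalized filter

smoothed = convolve(data, norm)
peak = max(smoothed)
print(len(smoothed), smoothed.index(peak), round(peak, 6))
```

The output is longer than the input (230 + 40 - 1 samples), and the sharp edges of the boxcar turn into smooth ramps. The peak stays below 1 because the 30-sample boxcar is narrower than the 40-sample filter.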


Random Signal and FFT

We can also generate random signals.

In [16]: randn(20)
Out[16]:
array([ 1.04943265, -1.10655086, 2.65115344, 0.37216869, -1.81576411,
        -0.72413097, -0.74942176, -0.38411864, 0.04484577, 0.31172929,
        -0.78896659, -0.47319888, -0.61743916, -0.77453274, -2.34677579,
        -0.61554551, 0.93594151, -1.06451913, -1.13374368, 1.16034994])

If we do an FFT of 100 random numbers:

In [17]: random_data = randn(100)

In [18]: plot(fft.fft(random_data))
/usr/lib/python2.7/dist-packages/numpy/core/numeric.py:320: ComplexWarning: Casting complex values to real discards the imaginary part
  return array(a, dtype, copy=False, order=order)
Out[18]: []

In [19]: plot(random_data)
Out[19]: []

In [20]:

Blue: FFT, green: random data. As the warning above says, plot() discards the imaginary part, so the
blue curve shows only the real part of the FFT.
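
Since the FFT returns complex numbers, it is common to plot the magnitude instead of passing the complex array to plot() directly. The sketch below illustrates this with a naive discrete Fourier transform in plain Python, which computes the same transform as fft.fft() but much more slowly. The cosine test signal is a hypothetical choice, used instead of random data because its spectrum is known in advance. In the pylab session, the NumPy equivalent would be plot(abs(fft.fft(random_data))).

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

# Hypothetical test signal: a cosine with 5 cycles over 100 samples
n = 100
signal = [math.cos(2 * math.pi * 5 * t / n) for t in range(n)]

spectrum = dft(signal)                    # complex values
magnitude = [abs(c) for c in spectrum]    # real values that are safe to plot

# Look for the strongest component in the first half of the spectrum
peak_bin = max(range(n // 2), key=lambda k: magnitude[k])
print(peak_bin, round(magnitude[peak_bin], 3))
```

For a real input signal the magnitude spectrum is symmetric, so the same peak also appears mirrored in the second half of the output.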



Video Capture

The following code captures video from either a live stream or a file. To capture a video, we need
to create a VideoCapture object. Its argument can be either the device index or the name of a video
file. The device index is just a number that specifies which camera to use. Normally only one
camera is connected (index = 0); a second camera would have index = 1.

# capture.py
import numpy as np
import cv2

# Capture video from camera
# cap = cv2.VideoCapture(0)

# Capture video from file
cap = cv2.VideoCapture('BlueUmbrella.webm')

while cap.isOpened():
    # Capture frame-by-frame
    ret, frame = cap.read()

    # Stop when no frame is returned (end of the file or a read error)
    if not ret:
        break

    # Our operations on the frame come here
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Display the resulting frame
    cv2.imshow('frame',gray)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

# When everything is done, release the capture
cap.release()
cv2.destroyAllWindows()

Tracking Blue Object

In this section, we want to track the blue umbrella in a Pixar movie clip. How?

After converting the BGR image to HSV, we extract the blue-colored object. We convert to the HSV
color space because a color is easier to represent there than in the RGB color space. Here are the
steps to take:

1. Take each frame of the video.
2. Convert from the BGR to the HSV color space.
3. Threshold the HSV image for a range of blue color.
4. Extract the blue object alone. We can then do whatever we want with that image.
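
To see why a hue range around 120 selects blue, the following sketch converts a few pure colors to HSV using only the Python standard library. It assumes OpenCV's 8-bit HSV convention (H is stored as degrees divided by 2, so it runs from 0 to 179, while S and V run from 0 to 255). The helper names bgr_to_opencv_hsv and in_range are hypothetical, and the thresholds are the ones used in the code of this section.

```python
import colorsys

def bgr_to_opencv_hsv(b, g, r):
    """Convert an 8-bit BGR pixel to HSV on OpenCV's 8-bit scale (assumed convention)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return round(h * 180) % 180, round(s * 255), round(v * 255)

def in_range(hsv, lower, upper):
    """True when every channel lies within [lower, upper], like a one-pixel inRange."""
    return all(lo <= c <= hi for c, lo, hi in zip(hsv, lower, upper))

lower_blue = (110, 50, 50)
upper_blue = (130, 255, 255)

for name, bgr in [("blue", (255, 0, 0)), ("red", (0, 0, 255)), ("green", (0, 255, 0))]:
    hsv = bgr_to_opencv_hsv(*bgr)
    print(name, hsv, in_range(hsv, lower_blue, upper_blue))
```

In HSV a single channel (the hue) identifies the color, so one interval on H is enough to select blue. The lower bounds on S and V only exclude pixels that are too gray or too dark.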

Code - Tracking Blue Object


# cspace.py
import cv2
import numpy as np

cap = cv2.VideoCapture('BlueUmbrella.webm')

while True:
    # Take each frame
    ret, frame = cap.read()

    # Stop when no frame is returned (end of the file or a read error)
    if not ret:
        break

    # Convert BGR to HSV
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

    # Define range of blue color in HSV
    lower_blue = np.array([110,50,50])
    upper_blue = np.array([130,255,255])

    # Threshold the HSV image to get only blue colors
    mask = cv2.inRange(hsv, lower_blue, upper_blue)

    # Bitwise-AND mask and original image
    res = cv2.bitwise_and(frame,frame, mask= mask)

    cv2.imshow('frame',frame)
    cv2.imshow('mask',mask)
    cv2.imshow('res',res)

    # Exit when the Esc key (code 27) is pressed
    k = cv2.waitKey(5) & 0xFF
    if k == 27:
        break

cap.release()
cv2.destroyAllWindows()

Here is the output from the code:

1. Original image
2. Threshold the HSV image to get only blue colors - mask
3. Bitwise-AND mask and original image - res

Original files in other formats:

 cspace_blue.mp4
 cspace_blue.webm
 cspace_blue.ogv