
Machine learning: Crash Course

Kshitij Sharma
Department of Computer Science
Faculty of Information Technology and Electrical Engineering
What?
Why?
Machine Learning is the future

• Why Machine Learning Is The Future Of Business Culture?


How?
Supervised vs Unsupervised?
Feature vs variables?
Feature extraction
Feature space reduction
• Space required to store the data is reduced as the number of features comes down.
• Fewer features lead to less computation/training time.
• Some algorithms do not perform well when we have a large number of features, so reducing these features needs to happen for the algorithm to be useful.
Feature space reduction
• It takes care of multicollinearity by removing redundant features.
Such features are highly correlated to each other. There is no point in
• It helps in visualizing data. It is very difficult to visualize data in bigger feature spaces, so reducing the space to 2D or 3D may allow for plotting and observing patterns more clearly.
Feature space reduction
• Missing Values Ratio
  • Data columns with too many missing values are unlikely to carry much useful information.
  • Data columns whose number of missing values is greater than a given threshold can be removed.
• Low Variance Filter
  • Data columns with little change in the data carry little information.
  • Variance is range dependent; therefore normalization is required before applying this technique.
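The two filters above can be sketched in a few lines of base R. This is a minimal illustration, not a definitive implementation: the toy data frame, the column names and both thresholds (0.5 and 0.05) are assumptions chosen for the example.

```r
# Hypothetical toy data: 'b' is mostly missing, 'c' is almost constant.
toy <- data.frame(
  a = 1:50,
  b = c(rep(NA, 40), 1:10),
  c = c(rep(10, 49), 11),
  d = rep(c(0.2, 0.9, 0.4, 0.1, 0.8), 10)
)

# Missing values ratio: drop columns whose share of NAs exceeds a threshold.
missingThreshold <- 0.5   # assumed threshold
missingRatio <- colMeans(is.na(toy))
kept <- toy[, missingRatio <= missingThreshold, drop = FALSE]

# Low variance filter: min-max normalise first, because variance is range dependent.
minMax <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
normalised <- as.data.frame(lapply(kept, minMax))
variances <- sapply(normalised, var, na.rm = TRUE)

varianceThreshold <- 0.05  # assumed threshold
reduced <- kept[, variances > varianceThreshold, drop = FALSE]
names(reduced)
```

In practice both thresholds would be tuned to the data set rather than fixed in advance.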
Feature space reduction
• High Correlation Filter
  • Data columns with very similar trends are also likely to carry very similar information.
  • Only one of them will suffice to feed the machine learning model.
library(corrgram)
corrgram(<dataFrame>,
order=TRUE,
lower.panel=panel.shade,
upper.panel=panel.pie,
text.panel=panel.txt)
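Beyond inspecting the correlogram visually, the filter itself might look like the following base R sketch. The synthetic columns and the 0.9 cut-off are assumptions for illustration; the slides do not prescribe a threshold.

```r
# Synthetic data (assumed): x2 is almost a rescaled copy of x1, x3 is independent.
set.seed(1234)
n <- 100
x1 <- rnorm(n)
x2 <- 2 * x1 + rnorm(n, sd = 0.01)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

# Absolute pairwise correlations; zero the upper triangle and diagonal
# so that each pair of columns is inspected only once.
corMatrix <- abs(cor(toy))
corMatrix[upper.tri(corMatrix, diag = TRUE)] <- 0

# Drop every column that is highly correlated with an earlier column.
threshold <- 0.9  # assumed cut-off
highlyCorrelated <- apply(corMatrix, 1, function(r) any(r > threshold))
toDrop <- names(which(highlyCorrelated))
reduced <- toy[, !(names(toy) %in% toDrop), drop = FALSE]
names(reduced)
```

Keeping the earlier column of each pair is an arbitrary choice; which one to keep could also depend on interpretability or data quality.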
Feature space reduction
• Forward Selection • Backward Elimination
[Figure: feature subsets (Feature 1 to Feature 5) evaluated at each step of forward selection and backward elimination]
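Both search strategies can be tried with `step()` from base R. This is a sketch under assumptions: the `mtcars` data set, the four candidate features and the use of AIC as the selection criterion are choices made here for illustration, since the slides do not name a criterion.

```r
data(mtcars)

# Candidate features (an assumed choice) for predicting fuel consumption.
fullModel <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
emptyModel <- lm(mpg ~ 1, data = mtcars)

# Forward selection: start with no features, add one at a time while AIC improves.
forwardModel <- step(emptyModel, scope = formula(fullModel),
                     direction = "forward", trace = 0)

# Backward elimination: start with all features, remove one at a time while AIC improves.
backwardModel <- step(fullModel, direction = "backward", trace = 0)

forwardFeatures <- attr(terms(forwardModel), "term.labels")
backwardFeatures <- attr(terms(backwardModel), "term.labels")
forwardFeatures
backwardFeatures
```

The two strategies are greedy and are not guaranteed to end with the same feature subset.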
Prediction/classification
Classification
• Centroid based
  • Compute centroids
  • Compute distances
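The two steps of centroid-based classification (compute centroids, then compute distances) could be sketched as follows. The use of the `iris` data, the 80/20 split and Euclidean distance are assumptions made for this example.

```r
data(iris)
set.seed(1234)
trainIdx <- sample(1:nrow(iris), 0.8 * nrow(iris))
trainX <- iris[trainIdx, 1:4]
trainY <- iris[trainIdx, 5]
testX <- iris[-trainIdx, 1:4]
testY <- iris[-trainIdx, 5]

# Step 1: compute one centroid (mean feature vector) per class.
centroids <- aggregate(trainX, by = list(class = trainY), FUN = mean)
centroidMatrix <- as.matrix(centroids[, -1])

# Step 2: compute the Euclidean distance from a point to every centroid
# and return the class of the closest one.
predictCentroid <- function(x) {
  d <- apply(centroidMatrix, 1, function(centre) sqrt(sum((x - centre)^2)))
  as.character(centroids$class[which.min(d)])
}

predicted <- apply(as.matrix(testX), 1, predictCentroid)
centroidAccuracy <- mean(predicted == as.character(testY))
centroidAccuracy
```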
Classification
• Nearest neighbour
• K-nearest neighbour
# load data (data() makes the iris data frame available; its return value is not the data)
data(iris)
summary(iris)

# normalise the four numeric features to the range 0 to 1
normalisedData <- as.data.frame(
  lapply(iris[, c(1, 2, 3, 4)],
         function(x) {
           (x - min(x)) / (max(x) - min(x))
         }))

summary(normalisedData)
# randomly generate 80% of the indices from the original data
# (seed set first so that the split is reproducible)
set.seed(1234)
trainRandom <- sample(1:nrow(iris), 0.8 * nrow(iris))

# divide the original data into training and testing sets
trainData <- normalisedData[trainRandom, ]
testData <- normalisedData[-trainRandom, ]
targetClass <- iris[trainRandom, 5]
testClass <- iris[-trainRandom, 5]

# load the package class and run the knn classification
library(class)
set.seed(1234)
prediction <- knn(trainData, testData, cl = targetClass, k = 13)

# accuracy function: percentage of correct predictions in a confusion matrix
# (defined before it is used in the loop below)
accuracy <- function(x)
{
  sum(diag(x)) / sum(x) * 100
}

# create confusion matrix and compute accuracy
tab <- table(prediction, testClass)
accuracy(tab)

# try with different values of "k"
acc <- numeric()
for (i in 3:15)
{
  set.seed(1234)
  prediction <- knn(trainData, testData, cl = targetClass, k = i)
  tab <- table(prediction, testClass)
  acc[i - 2] <- accuracy(tab)
}
d1 <- data.frame(neighbours = 3:15, accuracy = acc)

# plot
library(ggplot2)
ggplot(d1, aes(x = neighbours, y = accuracy)) + geom_line() + geom_point() + theme_bw()
Classification
• Decision tree
# read data
library(dplyr)
set.seed(6789)
titanic <- read.csv('https://raw.githubusercontent.com/guru99-edu/R-Programming/master/titanic_data.csv')
head(titanic)
# clean data
# Drop useless information
# "pclass" and "survived" columns get labels instead of numbers (easy to understand)
# remove NAs
# (the pipe operator must end a line, not start one, for R to continue the expression)
cleanData <- titanic %>%
  select(-c(home.dest, cabin, name, x, ticket)) %>%
  mutate(pclass = factor(
           pclass, levels = c(1, 2, 3),
           labels = c('Upper', 'Middle', 'Lower')),
         survived = factor(
           survived, levels = c(0, 1),
           labels = c('No', 'Yes'))) %>%
  na.omit()
glimpse(cleanData)
#Split into training and testing
set.seed(1234)
trainRandom <- sample(1:nrow(cleanData), 0.8 * nrow(cleanData))
trainData <- cleanData[trainRandom,]
testData <- cleanData[-trainRandom,]

# train the model and plot
library(rpart)
library(rpart.plot)
treeModel <- rpart(survived~., data = trainData, method = 'class')
rpart.plot(treeModel, extra = 106)
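# test the model
# (the result is stored as "predicted" so that it does not shadow the predict() function;
# accuracy() is the helper function defined in the kNN example)
predicted <- predict(treeModel, testData, type = 'class')
tab <- table(testData$survived, predicted)
accuracy(tab)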
Acc=79.38

# tune the parameters
control <- rpart.control(
  minsplit = 4,
  minbucket = round(5 / 3),
  maxdepth = 3,
  cp = 0)

# run the model with tuning
tuneModel <- rpart(survived~., data = trainData, method = 'class', control = control)
# predict with the tuned model (not with the untuned treeModel)
predicted <- predict(tuneModel, testData, type = 'class')
tab <- table(testData$survived, predicted)
tab
# accuracy in percent
testAccuracy <- 100 * sum(diag(tab)) / sum(tab)
testAccuracy
Acc=80.53
Prediction/classification
• Regression
• Y = 14.81X + 88.33
[Figure: Sales plotted against Money on ads]
Prediction/classification
• Problem with linear regression
[Figure: Sales plotted against Money on ads, linear regression]
Prediction/classification
• Polynomial Regression
• Y = 14.81X + 2.89X^2 + 88.33
[Figure: Sales plotted against Money on ads]
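To see how a linear and a polynomial fit compare, here is a small sketch with `lm()`. The data are synthetic and noise-free, generated from the polynomial shown on the slide purely for illustration; the variable names `ads` and `sales` are assumptions.

```r
# Synthetic data generated from Y = 14.81X + 2.89X^2 + 88.33 (no noise).
ads <- 1:20
sales <- 88.33 + 14.81 * ads + 2.89 * ads^2
toy <- data.frame(ads, sales)

# Linear regression: sales as a straight-line function of ads.
linearFit <- lm(sales ~ ads, data = toy)

# Polynomial regression: add a squared term with I().
polyFit <- lm(sales ~ ads + I(ads^2), data = toy)
coef(polyFit)

# Root mean squared residual of each fit on the same data.
linearRmse <- sqrt(mean(residuals(linearFit)^2))
polyRmse <- sqrt(mean(residuals(polyFit)^2))
c(linear = linearRmse, polynomial = polyRmse)
```

On real, noisy data a higher-degree polynomial will always fit the training data at least as well, so the comparison should be made on held-out data.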
Prediction/classification Quality
Error = original - predicted
• Mean Absolute Error (MAE)
• Root Mean Squared Error (RMSE)
Prediction/classification Quality
• R-squared
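The regression quality measures named on these slides (MAE, RMSE and R-squared) can be computed directly from the error vector. The four values below are made-up numbers for illustration only.

```r
# Hypothetical original and predicted values.
original <- c(3, 5, 7, 9)
predicted <- c(2, 5, 8, 11)

# Error = original - predicted
error <- original - predicted

# Mean Absolute Error: average size of the errors.
mae <- mean(abs(error))

# Root Mean Squared Error: penalises large errors more strongly.
rmse <- sqrt(mean(error^2))

# R-squared: 1 minus the ratio of residual to total sum of squares.
rSquared <- 1 - sum(error^2) / sum((original - mean(original))^2)

c(MAE = mae, RMSE = rmse, R2 = rSquared)
```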
Prediction/classification Quality
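• Precision, Recall, Accuracy

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$$

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$$

$$\text{Accuracy} = \frac{\text{True Positive} + \text{True Negative}}{\text{Total}}$$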
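As a quick check of the three definitions, they can be computed from a confusion matrix built with `table()`. The labels below are invented for the example, with "Yes" treated as the positive class.

```r
# Hypothetical true labels and predictions for eight cases.
actual <- factor(c("Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"),
                 levels = c("No", "Yes"))
predicted <- factor(c("Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes"),
                    levels = c("No", "Yes"))

# Rows are predictions, columns are true labels.
tab <- table(predicted, actual)

truePositive <- tab["Yes", "Yes"]
falsePositive <- tab["Yes", "No"]
falseNegative <- tab["No", "Yes"]
trueNegative <- tab["No", "No"]

precision <- truePositive / (truePositive + falsePositive)
recall <- truePositive / (truePositive + falseNegative)
accuracyValue <- (truePositive + trueNegative) / sum(tab)

c(precision = precision, recall = recall, accuracy = accuracyValue)
```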
One package to rule them all - caret
library(caret)
names(getModelInfo())

More than 200 prediction/classification algorithms are supported
Caret
# training control: 10-fold cross-validation, repeated 10 times
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
# this is just for the cross-validation

# to see the tuning parameters of an individual algorithm, run:
modelLookup("<name of the model>")

# for a list of models supported by the caret package, and to find the names to pass to "modelLookup", run:
names(getModelInfo())
Caret
train(
formula,
data = …,
method=" … ",
trControl = …,
preProcess = …,
tuneGrid = …,
metric = …
)

predict(<trained model>,<test Data>)

confusionMatrix(<predicted values/labels>, <original values/labels>)


Naïve bayes
Naïve bayes

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$

• P(c | x): posterior probability
• P(x | c): likelihood
• P(c): class prior probability
• P(x): prior probability of predictor
Naïve bayes
Outlook Temp Humidity Windy Play
RAINY HOT HIGH FALSE NO
RAINY HOT HIGH TRUE NO
OVERCAST HOT HIGH FALSE YES
SUNNY MILD HIGH FALSE YES
SUNNY COOL NORMAL FALSE YES
SUNNY COOL NORMAL TRUE NO
OVERCAST COOL NORMAL TRUE YES
RAINY MILD HIGH FALSE NO
RAINY COOL NORMAL FALSE YES
SUNNY MILD NORMAL FALSE YES
RAINY MILD NORMAL TRUE YES
OVERCAST MILD HIGH TRUE YES
OVERCAST HOT NORMAL FALSE YES
SUNNY MILD HIGH TRUE NO
Naïve bayes
Frequency table
         YES NO
SUNNY    3   2
OVERCAST 4   0
RAINY    2   3
Likelihood table
         YES NO
SUNNY    3/9 2/5
OVERCAST 4/9 0/5
RAINY    2/9 3/5
Naïve bayes
         YES  NO
SUNNY    3/9  2/5  5/14
OVERCAST 4/9  0/5  4/14
RAINY    2/9  3/5  5/14
         9/14 5/14
P(SUNNY|YES) = 3/9 = 0.33
P(YES) = 9/14 = 0.64
P(SUNNY) = 5/14 = 0.36
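Naïve bayes

$$P(\text{YES} \mid \text{SUNNY}) = \frac{P(\text{SUNNY} \mid \text{YES}) \times P(\text{YES})}{P(\text{SUNNY})} = \frac{0.33 \times 0.64}{0.36} = 0.60$$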
Naïve bayes
Likelihood tables
         YES NO
SUNNY    3/9 2/5
OVERCAST 4/9 0/5
RAINY    2/9 3/5

         YES NO
HOT      2/9 2/5
MILD     4/9 2/5
COOL     3/9 1/5

         YES NO
HIGH     3/9 4/5
NORMAL   6/9 1/5

         YES NO
FALSE    6/9 2/5
TRUE     3/9 3/5

Outlook Temp Humidity Windy Play


RAINY COOL HIGH TRUE ?
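Naïve bayes

$$P(\text{YES} \mid X) = P(\text{RAINY} \mid \text{YES}) \times P(\text{COOL} \mid \text{YES}) \times P(\text{HIGH} \mid \text{YES}) \times P(\text{TRUE} \mid \text{YES}) \times P(\text{YES})$$

$$P(\text{NO} \mid X) = P(\text{RAINY} \mid \text{NO}) \times P(\text{COOL} \mid \text{NO}) \times P(\text{HIGH} \mid \text{NO}) \times P(\text{TRUE} \mid \text{NO}) \times P(\text{NO})$$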
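The two products can be evaluated with the likelihoods from the tables for X = (RAINY, COOL, HIGH, TRUE). The sketch below is plain arithmetic; the final normalisation step, which rescales the two scores so that they sum to 1, is an addition made here and is not taken from the slides.

```r
# Likelihoods read from the tables: P(RAINY|.), P(COOL|.), P(HIGH|.), P(TRUE|.), then the class prior.
scoreYes <- (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
scoreNo  <- (3/5) * (1/5) * (4/5) * (3/5) * (5/14)

# Normalise so that the two posteriors sum to 1.
posteriorYes <- scoreYes / (scoreYes + scoreNo)
posteriorNo  <- scoreNo / (scoreYes + scoreNo)

round(c(scoreYes = scoreYes, scoreNo = scoreNo), 4)
round(c(YES = posteriorYes, NO = posteriorNo), 2)

# The predicted class is the one with the larger posterior.
predictedPlay <- if (posteriorYes > posteriorNo) "YES" else "NO"
predictedPlay
```

With these numbers the score for NO (about 0.0206) is larger than the score for YES (about 0.0053), so the instance is classified as Play = NO.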
Naïve bayes - Implementation
library(dplyr)
library(ggplot2)
library(caret)

# load data and remove NAs
data("PimaIndiansDiabetes2", package = "mlbench")
PimaIndiansDiabetes2 <- na.omit(PimaIndiansDiabetes2)

# split data
set.seed(1234)
trainingSamples <- PimaIndiansDiabetes2$diabetes %>% createDataPartition(p = 0.8, list = FALSE)
trainData <- PimaIndiansDiabetes2[trainingSamples, ]
testData <- PimaIndiansDiabetes2[-trainingSamples, ]
Naïve bayes - Implementation
#train
model <- train(
diabetes ~.,
data = trainData,
method = "nb",
trControl =
trainControl("cv", number = 10))

#test
predictions <-predict(model,testData)

#evaluate
confusionMatrix(predictions,testData$diabetes)
Naïve bayes - Implementation
# parameter tuning
modelLookup("nb")

gridControl <- expand.grid(
  usekernel = c(TRUE, FALSE),
  fL = 0:5,
  adjust = seq(0, 5, by = 1))

modelTune <- train(
  diabetes ~ .,
  data = trainData,
  method = "nb",
  trControl = trainControl("cv", number = 10),
  tuneGrid = gridControl)
Naïve bayes – When to use?
plot(modelTune)
# training performance
confusionMatrix(modelTune)

# test with the best model
predictions <- predict(modelTune, testData)

# evaluate the best model
confusionMatrix(predictions, testData$diabetes)
Support vector machines
Support vector machines – kernel trick
SVM - implementation

# linear kernel shown; for the other kernels use method = "svmPoly" or "svmRadial"
# together with tuneGrid = svmPolyGrid or svmRadialGrid
trainSVMLinear <- train(
  score ~ .,
  data = trainData[, featureSet],
  method = "svmLinear",
  trControl = fitControl,
  preProcess = c("center", "scale"),
  tuneGrid = svmLinearGrid,
  metric = "RMSE")
SVM - implementation

# tuning grids for the three kernels
svmLinearGrid <- expand.grid(
  C = seq(0, 1, length = 10))

svmPolyGrid <- expand.grid(
  C = seq(0, 1, length = 10),
  degree = c(1:10),
  scale = seq(0, 1, length = 10))

svmRadialGrid <- expand.grid(
  C = seq(0, 1, length = 10),
  sigma = seq(0, 1, length = 25))
SVM – When to use?
Neural networks
Basic Architectures
Neural networks
Basic Architecture
• Human Brain
  • Thalamocortical System
    • 3 Million Neurons
    • 476 Million Synapses
  • Full Brain
    • 10 Billion Neurons
    • 1,000 Trillion Synapses
• Artificial Neural network
  • ResNet 152 layers
    • 60 Million Synapses
Neural networks
Basic Architecture
Biological NN vs Artificial NN
• Synapses: biological networks have 10 million times more synapses.
• Topology: the biological topology is asynchronous and not constructed in layers.
• Learning algorithm: for our biological networks we don't know; the learning algorithm for artificial neural networks is backpropagation.
• Power consumption: human brains are much more efficient.
• Learning stages: in biological neural networks you never really stop learning; in artificial networks there is often a distinct training stage and a distinct testing stage, when you release the thing in the wild.
Neural networks
Basic Architecture

Perceptron without Bias Perceptron with Bias


Neural networks
Single computational layer: Perceptron
Classification

Regression

The weights w_i are calculated using gradient descent techniques.
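A minimal sketch of how the weights of a single perceptron with bias might be learned. It uses the classic perceptron update rule, which can be read as stochastic gradient descent on the perceptron criterion; the logical OR data set, the learning rate and the number of epochs are assumptions chosen for illustration.

```r
# Toy classification task (assumed): logical OR of two binary inputs.
X <- matrix(c(0, 0,
              0, 1,
              1, 0,
              1, 1), ncol = 2, byrow = TRUE)
y <- c(0, 1, 1, 1)

# Step activation: output 1 when the weighted sum is positive, else 0.
stepActivation <- function(z) as.numeric(z > 0)

w <- c(0, 0)        # weights, one per input
b <- 0              # bias
learningRate <- 1

for (epoch in 1:20) {
  for (i in 1:nrow(X)) {
    yHat <- stepActivation(sum(w * X[i, ]) + b)
    err <- y[i] - yHat
    # Move the weights and the bias in the direction that reduces the error.
    w <- w + learningRate * err * X[i, ]
    b <- b + learningRate * err
  }
}

predictions <- stepActivation(X %*% w + b)
predictions
```

This rule only converges when the classes are linearly separable, which is the case for OR but not, for example, for XOR.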


Neural networks
Multi-layer Neural Networks

Multi-layer NN without Bias Multi-layer NN with Bias


Neural networks
Multi-layer Neural Networks

Multi-layer NN scalar notation Multi-layer NN vector notation


Basics of neural networks
Activation functions
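For reference, three commonly used activation functions can be written in a few lines of R. Which functions the slide actually showed is not recoverable, so the selection here (sigmoid, tanh, ReLU) is an assumption.

```r
# Sigmoid: squashes any input into the range (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

# ReLU: passes positive inputs through and sets negative inputs to 0.
relu <- function(z) pmax(0, z)

# tanh is built into R and squashes inputs into the range (-1, 1).
z <- c(-2, 0, 2)
sigmoid(z)
tanh(z)
relu(z)
```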
Neural networks - caret
Deep Neural networks
Neural networks – When to use?
Resources
• Machine Learning Yearning – Andrew Ng
• The hundred page machine learning book – Andriy Burkov
• Machine Learning by Tom M Mitchell
Thank you!!
Questions??
