
UNIT – IV: Machine learning for information security

What is Machine Learning?


Machine Learning is a subset of artificial intelligence that focuses mainly on enabling machines to learn from experience and to make predictions based on that experience.

What does it do?


It enables computers or machines to make data-driven decisions rather than being explicitly programmed to carry out a certain task. These programs or algorithms are designed in such a way that they learn and improve over time as they are exposed to new data.

Types of Machine Learning


Machine learning is sub-categorized into three types:
Supervised Learning – Train Me!
Unsupervised Learning – I am self-sufficient in learning
Reinforcement Learning – My life My rules! (Hit & Trial)

What is Supervised Learning?


Supervised Learning is the type of learning in which the process is guided by a teacher. We have a dataset that acts as the teacher, and its role is to train the model or the machine. Once the model is trained, it can start making predictions or decisions when new data is given to it.
What is Unsupervised Learning?
The model learns through observation and finds structures in the data. Once the model is given a dataset, it automatically finds patterns and relationships in the dataset by creating clusters in it. What it cannot do is add labels to the clusters; it cannot say "this is a group of apples" or "this is a group of mangoes", but it will separate the apples from the mangoes.
Suppose we present images of apples, bananas and mangoes to the model. Based on patterns and relationships, it creates clusters and divides the dataset into them. Now, if new data is fed to the model, it adds it to one of the created clusters.
What is Reinforcement Learning?
It is the ability of an agent to interact with the environment and find out what the best outcome is. It follows the concept of the hit-and-trial (trial-and-error) method. The agent is rewarded or penalized with a point for a correct or a wrong answer, and on the basis of the positive reward points gained, the model trains itself. Again, once trained, it is ready to make predictions on new data presented to it.
How does Machine Learning Work?
A Machine Learning algorithm is trained on a training data set to create a model. When new input data is introduced to the ML algorithm, it makes a prediction on the basis of that model.
The prediction is evaluated for accuracy and if the accuracy is acceptable, the Machine
Learning algorithm is deployed. If the accuracy is not acceptable, the Machine Learning algorithm
is trained again and again with an augmented training data set.

List of Common Machine Learning Algorithms


Here is the list of commonly used machine learning algorithms. These algorithms can be
applied to almost any data problem:
1. Linear Regression
It is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on continuous variable(s). Here, we establish a relationship between the independent and dependent variables by fitting a best-fit line. This best-fit line is known as the regression line and is represented by the linear equation Y = a*X + b.
The best way to understand linear regression is to relive this experience of childhood. Let
us say, you ask a child in fifth grade to arrange people in his class by increasing order of weight,
without asking them their weights! What do you think the child will do? He / she would likely
look (visually analyze) at the height and build of people and arrange them using a combination of
these visible parameters. This is linear regression in real life! The child has actually figured out
that height and build would be correlated to the weight by a relationship, which looks like the
equation above.
In this equation:
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
The coefficients a and b are derived by minimizing the sum of squared differences between the data points and the regression line.

As an example, suppose we have identified the best-fit line with the equation y = 0.2811*x + 13.9. Using this equation, we can predict the weight of a person once we know their height.
Linear Regression is mainly of two types: Simple Linear Regression and Multiple Linear Regression. Simple Linear Regression is characterized by one independent variable, while Multiple Linear Regression (as the name suggests) is characterized by multiple (more than one) independent variables. While finding the best-fit line, you can also fit a polynomial or curvilinear function; this is known as polynomial or curvilinear regression.
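As an illustrative sketch (not part of the original notes), the following Python snippet fits a simple linear regression of weight on height using scikit-learn; the data values are invented purely for demonstration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented height (cm) and weight (kg) data, for illustration only
heights = np.array([[150], [160], [170], [180], [190]])  # independent variable X
weights = np.array([50, 56, 63, 70, 78])                 # dependent variable Y

model = LinearRegression()
model.fit(heights, weights)

print("slope a:", model.coef_[0])        # the coefficient a in Y = a*X + b
print("intercept b:", model.intercept_)  # the intercept b
print("predicted weight at 175 cm:", model.predict([[175]])[0])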
2. Logistic Regression
Don’t get confused by its name! It is a classification algorithm, not a regression algorithm. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variable(s). In simple words, it predicts the probability of occurrence of an event by fitting data to a logit function. Hence, it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1 (as expected).
Again, let us try and understand this through a simple example.
Let’s say your friend gives you a puzzle to solve. There are only two outcome scenarios: either you solve it or you don’t. Now imagine that you are given a wide range of puzzles and quizzes in an attempt to understand which subjects you are good at. The outcome of this study would be something like this: if you are given a tenth-grade trigonometry problem, you are 70% likely to solve it. On the other hand, if it is a fifth-grade history question, the probability of answering it is only 30%. This is what Logistic Regression provides you.
Coming to the math, the log odds of the outcome is modeled as a linear combination of the
predictor variables.
odds = p / (1 - p) = probability of event occurrence / probability of event not occurring
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1*X1 + b2*X2 + b3*X3 + ... + bk*Xk

Above, p is the probability of presence of the characteristic of interest. Logistic regression chooses the parameters that maximize the likelihood of observing the sample values, rather than the parameters that minimize the sum of squared errors (as in ordinary regression).
Now, you may ask, why take a log? For the sake of simplicity, let’s just say that this is one of the best mathematical ways to replicate a step function. I could go into more detail, but that would defeat the purpose of this article.
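To make the idea concrete, here is a minimal sketch (assuming scikit-learn; the puzzle data is invented) that fits a logistic regression and outputs a probability between 0 and 1, as described above.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: hours of practice vs. whether the puzzle was solved (1) or not (0)
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
solved = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()
clf.fit(hours, solved)

# The model outputs a probability of the positive class for new data
print("P(solve | 2.75 hours):", clf.predict_proba([[2.75]])[0, 1])
print("slope b1 and intercept b0:", clf.coef_[0][0], clf.intercept_[0])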
3. Decision Tree
This is one of my favorite algorithms and I use it quite frequently. It is a type of supervised learning algorithm that is mostly used for classification problems. Surprisingly, it works for both categorical and continuous dependent variables. In this algorithm, we split the population into two or more homogeneous sets, based on the most significant attributes/independent variables, so as to make the groups as distinct as possible. For more details, you can read: Decision Tree Simplified.
In the classic example of deciding whether players 'will play or not', the population is classified into four different groups based on multiple attributes. To split the population into different heterogeneous groups, the algorithm uses various techniques such as Gini, Information Gain, Chi-square and entropy.
The best way to understand how a decision tree works is to play Jezzball, a classic game from Microsoft. Essentially, you have a room with moving balls and you need to build walls so that the maximum area gets cleared of balls. So, every time you split the room with a wall, you are trying to create two different populations within the same room. Decision trees work in a very similar fashion, by dividing a population into groups that are as different as possible.
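As a rough illustration (using scikit-learn and its built-in Iris dataset, neither of which is part of the original notes), the snippet below trains a small decision tree; the criterion parameter selects Gini or entropy (information gain) as the splitting technique mentioned above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# criterion="gini" or criterion="entropy" chooses the splitting measure
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))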
4. SVM (Support Vector Machine)
It is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.
For example, if we only had two features, such as the Height and Hair length of an individual, we would first plot these two variables in two-dimensional space, where each point has two co-ordinates. (The points that lie closest to the separating boundary are known as support vectors.)
Now, we find a line that splits the data between the two differently classified groups. This will be the line for which the distance to the closest point in each of the two groups is largest.

The line which best splits the data into the two differently classified groups is the one whose two closest points are farthest from it; this line is our classifier. Then, depending on which side of the line a new test point lands, we assign it to the corresponding class.
What is Support Vector Machine?
The objective of the support vector machine algorithm is to find a hyperplane in an N-
dimensional space (N — the number of features) that distinctly classifies the data points.
To separate the two classes of data points, there are many possible hyperplanes that could
be chosen. Our objective is to find a plane that has the maximum margin, i.e. the maximum distance
between data points of both classes. Maximizing the margin distance provides some reinforcement
so that future data points can be classified with more confidence.

Hyperplanes and Support Vectors

Hyperplanes are decision boundaries that help classify the data points. Data points falling
on either side of the hyperplane can be attributed to different classes. Also, the dimension of the
hyperplane depends upon the number of features. If the number of input features is 2, then the
hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-
dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.
Support vectors are data points that are closer to the hyperplane and influence the position
and orientation of the hyperplane. Using these support vectors, we maximize the margin of the
classifier. Deleting the support vectors will change the position of the hyperplane. These are the
points that help us build our SVM.
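The following is a minimal sketch (assuming scikit-learn; the height/hair-length numbers are invented) of fitting a linear SVM and inspecting its support vectors, the points closest to the hyperplane described above.

import numpy as np
from sklearn.svm import SVC

# Invented two-feature data: [height_cm, hair_length_cm] with two classes
X = np.array([[150, 30], [155, 28], [160, 25], [175, 5], [180, 4], [185, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# The support vectors are the training points closest to the separating hyperplane
print("support vectors:", clf.support_vectors_)
print("prediction for [165, 15]:", clf.predict([[165, 15]]))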
Large Margin Intuition
In logistic regression, we take the output of the linear function and squash it into the range [0, 1] using the sigmoid function. If the squashed value is greater than a threshold value (0.5), we assign the label 1; otherwise we assign the label 0. In SVM, we take the output of the linear function directly: if that output is greater than 1, we identify the point with one class, and if the output is less than -1, we identify it with the other class. Since the threshold values are changed to 1 and -1 in SVM, we obtain this reinforcement range of values ([-1, 1]), which acts as the margin.
Cost Function and Gradient Updates
In the SVM algorithm, we are looking to maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is the hinge loss:
c(x, y, f(x)) = max(0, 1 - y * f(x))
The cost is 0 if the predicted value and the actual value are of the same sign (and the point lies outside the margin). If they are not, we calculate the loss value. We also add a regularization parameter to the cost function; its objective is to balance margin maximization against the loss. After adding the regularization term, the cost function looks like:
J(w) = λ * ||w||^2 + (1/n) * Σ_i max(0, 1 - y_i * (w · x_i))

Now that we have the loss function, we take partial derivatives with respect to the weights to find the gradients. Using the gradients, we can update our weights.
When there is no misclassification, i.e. our model correctly predicts the class of a data point (with sufficient margin), we only update the weights with the gradient of the regularization term: w ← w - α * (2λw).
When there is a misclassification, i.e. our model makes a mistake on the prediction of the class of a data point, we include the hinge-loss gradient along with the regularization term in the update: w ← w - α * (2λw - y_i * x_i).
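A small NumPy sketch of the update rule just described (an illustrative implementation under the assumptions above, not the original article's code): labels must be -1/+1, lr is the learning rate and lam the regularization parameter.

import numpy as np

def train_linear_svm(X, y, lr=0.001, lam=0.01, epochs=1000):
    # y must contain -1/+1 labels; the bias is folded into the weight vector
    X = np.hstack([X, np.ones((X.shape[0], 1))])
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(w, xi) >= 1:
                # correct prediction with enough margin: only the regularization gradient
                w -= lr * (2 * lam * w)
            else:
                # misclassified or inside the margin: include the hinge-loss gradient as well
                w -= lr * (2 * lam * w - yi * xi)
    return w

# Invented, linearly separable toy data
X = np.array([[2.0, 3.0], [1.0, 1.5], [7.0, 8.0], [8.0, 8.5]])
y = np.array([-1, -1, 1, 1])
w = train_linear_svm(X, y)
print("learned weights (last entry is the bias):", w)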

5. Naive Bayes
It is a classification technique based on Bayes’ theorem with an assumption of
independence between predictors. In simple terms, a Naive Bayes classifier assumes that the
presence of a particular feature in a class is unrelated to the presence of any other feature. For
example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter.
Even if these features depend on each other or upon the existence of the other features, a naive
Bayes classifier would consider all of these properties to independently contribute to the
probability that this fruit is an apple.
The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its simplicity, Naive Bayes is known to outperform even highly sophisticated classification methods.
Bayes theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c):
P(c|x) = P(x|c) * P(c) / P(x)

Here,
P (c | x) is the posterior probability of class (target) given predictor (attribute).
P(c) is the prior probability of class.
P (x |c) is the likelihood which is the probability of predictor given class.
P(x) is the prior probability of predictor.
Example: Let’s understand it using an example. Suppose we have a training data set of weather conditions and a corresponding target variable ‘Play’. We need to classify whether players will play or not based on the weather condition. Let’s follow the steps below.
Step 1: Convert the data set to frequency table
Step 2: Create Likelihood table by finding the probabilities like Overcast probability = 0.29 and
probability of playing is 0.64.

Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each class. The
class with the highest posterior probability is the outcome of prediction.
Problem: Players will play if the weather is sunny. Is this statement correct?
We can solve it using above discussed method, so P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P
(Sunny)
Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P(Yes)= 9/14 = 0.64
Now, P(Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher posterior probability, so the prediction is Yes.
Naive Bayes uses a similar method to predict the probabilities of different classes based on various attributes. This algorithm is mostly used in text classification and in problems with multiple classes.
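The worked calculation above can be reproduced in a few lines of plain Python (an illustrative sketch using the same numbers from the example):

# Values taken from the worked example above
p_sunny_given_yes = 3 / 9   # P(Sunny | Yes)
p_yes = 9 / 14              # prior P(Yes)
p_sunny = 5 / 14            # prior P(Sunny)

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print("P(Yes | Sunny) =", round(p_yes_given_sunny, 2))  # about 0.60, so the predicted class is Yes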
6. kNN (k- Nearest Neighbors) (Instance based Learning)
It can be used for both classification and regression problems. However, it is more widely used for classification problems in industry. K nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k nearest neighbors. The case is assigned to the class most common amongst its K nearest neighbors, as measured by a distance function.
These distance functions can be Euclidean, Manhattan, Minkowski or Hamming distance. The first three are used for continuous variables and the fourth (Hamming) for categorical variables. If K = 1, the case is simply assigned to the class of its nearest neighbor. At times, choosing K turns out to be a challenge when performing kNN modeling.
KNN can easily be mapped to our real lives. If you want to learn about a person about whom you have no information, you might find out about their close friends and the circles they move in, and gain access to their information that way!
Things to consider before selecting kNN:
➢ KNN is computationally expensive
➢ Variables should be normalized, else higher-range variables can bias it (see the sketch below)
➢ More work is needed in the pre-processing stage before applying kNN, e.g. outlier and noise removal
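Here is a minimal sketch of kNN with scikit-learn (not part of the original notes), including a normalization step, since un-scaled variables with large ranges can bias the distance, as noted above.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale features first, then vote among the 5 nearest neighbors (Minkowski/Euclidean distance)
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))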
7. K-Means
It is a type of unsupervised algorithm which solves the clustering problem. Its procedure follows a simple and easy way to classify a given data set into a certain number of clusters (assume k clusters). Data points inside a cluster are homogeneous, and heterogeneous with respect to the other clusters.
Remember figuring out shapes from ink blots? K-means is somewhat similar to this activity. You look at the shape and spread to decipher how many different clusters/populations are present!
How K-means forms clusters:
1. K-means picks k points, one for each cluster, known as centroids.
2. Each data point forms a cluster with the closest centroid, i.e. k clusters are formed.
3. The centroid of each cluster is recomputed from the existing cluster members; these become the new centroids.
4. Steps 2 and 3 are repeated: each data point is reassigned to the closest new centroid. This process is repeated until convergence, i.e. until the centroids no longer change.
How to determine value of K:
In K-means, we have clusters and each cluster has its own centroid. The sum of squared differences between the centroid and the data points within a cluster constitutes the within-cluster sum of squares for that cluster. When the within-cluster sums of squares of all the clusters are added together, we get the total within-cluster sum of squares for the cluster solution.
We know that as the number of clusters increases, this value keeps decreasing, but if you plot the result you may see that the sum of squared distances decreases sharply up to some value of k and then much more slowly after that. This 'elbow' is where we find the optimum number of clusters.
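A brief sketch (assuming scikit-learn and synthetic data, neither of which is part of the original notes) showing K-means in practice; the inertia_ attribute is the total within-cluster sum of squares discussed above, and watching where it flattens out gives the 'elbow'.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)  # synthetic data with 4 natural clusters

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    # inertia_ drops sharply until k reaches the natural number of clusters, then flattens
    print(f"k={k}: total within-cluster sum of squares = {km.inertia_:.1f}")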
8. Random Forest
Random Forest is a trademark term for an ensemble of decision trees. In Random Forest, we have a collection of decision trees (hence the name "Forest"). To classify a new object based on its attributes, each tree gives a classification and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
Each tree is planted & grown as follows:
If the number of cases in the training set is N, then a sample of N cases is taken at random, but with replacement. This sample will be the training set for growing the tree.
If there are M input variables, a number m<<M is specified such that at each node, m variables are
selected at random out of the M and the best split on these m is used to split the node. The value
of m is held constant during the forest growing.
Each tree is grown to the largest extent possible. There is no pruning.
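As an illustrative sketch (scikit-learn, not part of the original notes): each tree sees a bootstrap sample of the rows and a random subset m of the M features at every split (max_features), and predictions are combined by majority vote.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 unpruned trees; max_features="sqrt" controls m, the number of features tried at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))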
Algorithms Grouped by Similarity
Algorithms are often grouped by similarity in terms of their function (how they work). For
example, tree-based methods, and neural network inspired methods.
This is a useful grouping method, but it is not perfect. There are still algorithms that could
just as easily fit into multiple categories like Learning Vector Quantization that is both a neural
network inspired method and an instance-based method. There are also categories that have the
same name that describe the problem and the class of algorithm such as Regression and Clustering.
Please Note: There is a strong bias towards algorithms used for classification and regression, the
two most prevalent supervised machine learning problems you will encounter.
Regression Algorithms
Regression is concerned with modeling the relationship between variables that is iteratively
refined using a measure of error in the predictions made by the model.
Regression methods are a workhorse of statistics and have been co-opted into statistical
machine learning. This may be confusing because we can use regression to refer to the class of
problem and the class of algorithm. Really, regression is a process.
The most popular regression algorithms are:
➢ Ordinary Least Squares Regression (OLSR)
➢ Linear Regression
➢ Logistic Regression
➢ Stepwise Regression
➢ Multivariate Adaptive Regression Splines (MARS)
➢ Locally Estimated Scatterplot Smoothing (LOESS)
Instance-based Algorithms
The instance-based learning model is a decision problem built around instances or examples of training data that are deemed important or required by the model.
Such methods typically build up a database of example data and compare new data to the
database using a similarity measure in order to find the best match and make a prediction. For this
reason, instance-based methods are also called winner-take-all methods and memory-based
learning. Focus is put on the representation of the stored instances and similarity measures used
between instances.
The most popular instance-based algorithms are:
➢ k-Nearest Neighbor (kNN)
➢ Learning Vector Quantization (LVQ)
➢ Self-Organizing Map (SOM)
➢ Locally Weighted Learning (LWL)
➢ Support Vector Machines (SVM)
Regularization Algorithms
An extension made to another method (typically regression methods) that penalizes models
based on their complexity, favoring simpler models that are also better at generalizing.
I have listed regularization algorithms separately here because they are popular, powerful
and generally simple modifications made to other methods.
The most popular regularization algorithms are:
➢ Ridge Regression
➢ Least Absolute Shrinkage and Selection Operator (LASSO)
➢ Elastic Net
➢ Least-Angle Regression (LARS)
Decision Tree Algorithms
Decision tree methods construct a model of decisions made based on actual values of
attributes in the data.
Decisions fork in tree structures until a prediction decision is made for a given record.
Decision trees are trained on data for classification and regression problems. Decision trees are
often fast and accurate and a big favorite in machine learning.
The most popular decision tree algorithms are:
➢ Classification and Regression Tree (CART)
➢ Iterative Dichotomiser 3 (ID3)
➢ C4.5 and C5.0 (different versions of a powerful approach)
➢ Chi-squared Automatic Interaction Detection (CHAID)
➢ Decision Stump
➢ M5
➢ Conditional Decision Trees
Bayesian Algorithms
Bayesian methods are those that explicitly apply Bayes’ Theorem for problems such as
classification and regression.
The most popular Bayesian algorithms are:
➢ Naive Bayes
➢ Gaussian Naive Bayes
➢ Multinomial Naive Bayes
➢ Averaged One-Dependence Estimators (AODE)
➢ Bayesian Belief Network (BBN)
➢ Bayesian Network (BN)
Clustering Algorithms
Clustering, like regression, describes both a class of problem and a class of methods.
Clustering methods are typically organized by their modeling approach, such as centroid-based and hierarchical. All methods are concerned with using the inherent structures in the data to best organize the data into groups of maximum commonality.
The most popular clustering algorithms are:
➢ k-Means
➢ k-Medians
➢ Expectation Maximization (EM)
➢ Hierarchical Clustering
Association Rule Learning Algorithms
Association rule learning methods extract rules that best explain observed relationships between variables in data.
These rules can discover important and commercially useful associations in large
multidimensional datasets that can be exploited by an organization.
The most popular association rule learning algorithms are:
➢ Apriori algorithm
➢ Eclat algorithm
Artificial Neural Network Algorithms
Artificial Neural Networks are models that are inspired by the structure and/or function of
biological neural networks.
They are a class of pattern-matching methods that are commonly used for regression and classification problems, but are really an enormous subfield comprising hundreds of algorithms and variations for all manner of problem types.
Note that I have separated out Deep Learning from neural networks because of the massive
growth and popularity in the field. Here we are concerned with the more classical methods.
The most popular artificial neural network algorithms are:
➢ Perceptron
➢ Multilayer Perceptrons (MLP)
➢ Back-Propagation
➢ Stochastic Gradient Descent
➢ Hopfield Network
➢ Radial Basis Function Network (RBFN)
Deep Learning Algorithms
Deep Learning methods are a modern update to Artificial Neural Networks that exploit
abundant cheap computation.
They are concerned with building much larger and more complex neural networks and, as commented on above, many methods are concerned with very large datasets of labelled analog data, such as image, text, audio, and video.
The most popular deep learning algorithms are:
➢ Convolutional Neural Network (CNN)
➢ Recurrent Neural Networks (RNNs)
➢ Long Short-Term Memory Networks (LSTMs)
➢ Stacked Auto-Encoders
➢ Deep Boltzmann Machine (DBM)
➢ Deep Belief Networks (DBN)
Dimensionality Reduction Algorithms
Like clustering methods, dimensionality reduction methods seek and exploit the inherent structure in the data, but in this case in an unsupervised manner, in order to summarize or describe the data using less information.
This can be useful for visualizing high-dimensional data or for simplifying data which can then be used in a supervised learning method. Many of these methods can be adapted for use in classification and regression.
➢ Principal Component Analysis (PCA)
➢ Principal Component Regression (PCR)
➢ Partial Least Squares Regression (PLSR)
➢ Sammon Mapping
➢ Multidimensional Scaling (MDS)
➢ Projection Pursuit
➢ Linear Discriminant Analysis (LDA)
➢ Mixture Discriminant Analysis (MDA)
➢ Quadratic Discriminant Analysis (QDA)
➢ Flexible Discriminant Analysis (FDA)
Ensemble Algorithms
Ensemble methods are models composed of multiple weaker models that are independently
trained and whose predictions are combined in some way to make the overall prediction.
Much effort is put into what types of weak learners to combine and the ways in which to
combine them.
This is a very powerful class of techniques and as such is very popular.
➢ Boosting
➢ Bootstrapped Aggregation (Bagging)
➢ AdaBoost
➢ Weighted Average (Blending)
➢ Stacked Generalization (Stacking)
➢ Gradient Boosting Machines (GBM)
➢ Gradient Boosted Regression Trees (GBRT)
➢ Random Forest
How to evaluate the performance of ML algorithms
Evaluating your machine learning algorithm is an essential part of any project. Your model may give satisfying results when evaluated using one metric, say accuracy score, but poor results when evaluated against other metrics, such as logarithmic loss. Most of the time we use classification accuracy to measure the performance of our model; however, it is not enough to truly judge the model. In this post, we will cover the different types of evaluation metrics available.
➢ Classification Accuracy
➢ Logarithmic Loss
➢ Confusion Matrix
➢ Area under Curve
➢ F1 Score
➢ Mean Absolute Error
➢ Mean Squared Error
Classification Accuracy
Classification Accuracy is what we usually mean when we use the term accuracy. It is the ratio of the number of correct predictions to the total number of input samples:
Accuracy = (Number of correct predictions) / (Total number of predictions made)

It works well only if there is an equal number of samples belonging to each class.
For example, consider a training set with 98% samples of class A and 2% samples of class B. A model can then easily achieve 98% training accuracy by simply predicting every training sample as class A.
When the same model is tested on a test set with 60% samples of class A and 40% samples of class B, the test accuracy drops to 60%. Classification Accuracy is great, but it can give us a false sense of achieving high accuracy.
The real problem arises when the cost of misclassifying the minority class samples is very high. If we deal with a rare but fatal disease, the cost of failing to diagnose the disease in a sick person is much higher than the cost of sending a healthy person for more tests.
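A tiny sketch (invented labels, scikit-learn assumed) of exactly this pitfall: a model that always predicts the majority class still scores 98% accuracy.

import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0] * 98 + [1] * 2)   # 98% class A (0), 2% class B (1)
y_pred = np.zeros(100, dtype=int)       # a "model" that always predicts class A

print("accuracy:", accuracy_score(y_true, y_pred))  # 0.98, yet class B is never detected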
Logarithmic Loss
Logarithmic Loss, or Log Loss, works by penalizing false classifications. It works well for multi-class classification. When working with Log Loss, the classifier must assign a probability to each class for every sample. Suppose there are N samples belonging to M classes; then the Log Loss is calculated as below:
Log Loss = -(1/N) * Σ_i Σ_j y_ij * log(p_ij)
where,
y_ij indicates whether sample i belongs to class j or not
p_ij indicates the probability of sample i belonging to class j
Log Loss has no upper bound and lies in the range [0, ∞). A Log Loss nearer to 0 indicates higher accuracy, whereas a Log Loss further from 0 indicates lower accuracy.
In general, minimizing Log Loss gives greater accuracy for the classifier.
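A short sketch (invented values, scikit-learn assumed) of computing Log Loss from per-class probabilities:

from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]              # true classes for four samples
y_prob = [[0.9, 0.1],              # p_ij: predicted probability of each class per sample
          [0.2, 0.8],
          [0.4, 0.6],
          [0.7, 0.3]]
print("log loss:", log_loss(y_true, y_prob))  # closer to 0 is better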
Confusion Matrix
Confusion Matrix as the name suggests gives us a matrix as output and describes the complete
performance of the model.
Let’s assume we have a binary classification problem, with samples belonging to two classes: YES or NO. We also have a classifier which predicts a class for a given input sample. Testing our model on 165 samples yields a 2x2 matrix built from the four terms below.

There are 4 important terms:


True Positives: The cases in which we predicted YES and the actual output was also YES.
True Negatives: The cases in which we predicted NO and the actual output was NO.
False Positives: The cases in which we predicted YES and the actual output was NO.
False Negatives: The cases in which we predicted NO and the actual output was YES.
Accuracy for the matrix can be calculated by dividing the sum of the values on the “main diagonal” (True Positives + True Negatives) by the total number of samples, i.e. Accuracy = (TP + TN) / (TP + TN + FP + FN).
Confusion Matrix forms the basis for the other types of metrics.
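A minimal sketch (invented predictions, scikit-learn assumed) of building the matrix and reading off the four terms:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # 1 = YES, 0 = NO
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
print("accuracy:", (tp + tn) / (tp + tn + fp + fn))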
Area Under Curve
Area Under Curve (AUC) is one of the most widely used metrics for evaluation. It is used for binary classification problems. The AUC of a classifier is equal to the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. Before defining AUC, let us understand two basic terms:
True Positive Rate (Sensitivity): True Positive Rate is defined as TP/ (FN+TP). True Positive
Rate corresponds to the proportion of positive data points that are correctly considered as positive,
with respect to all positive data points.
False Positive Rate (1 - Specificity): False Positive Rate is defined as FP / (FP+TN). False Positive Rate corresponds to the proportion of negative data points that are mistakenly considered as positive, with respect to all negative data points.
False Positive Rate and True Positive Rate both have values in the range [0, 1]. FPR and TPR are both computed at a series of threshold values such as (0.00, 0.02, 0.04, ..., 1.00), and a curve is drawn. AUC is the area under this curve, the plot of True Positive Rate against False Positive Rate over thresholds in [0, 1].

As is evident, AUC has a range of [0, 1]. The greater the value, the better the performance of our model.
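A brief sketch (invented scores, scikit-learn assumed) of computing the ROC curve points and the AUC:

from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5]  # classifier scores / probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # TPR and FPR at each threshold
print("AUC:", roc_auc_score(y_true, y_score))      # lies in [0, 1]; higher is better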
F1 Score
F1 Score is used to measure a test’s accuracy.
F1 Score is the harmonic mean of precision and recall. The range for F1 Score is [0, 1]. It tells you how precise your classifier is (how many of the instances it labels positive are actually positive), as well as how robust it is (it does not miss a significant number of positive instances).
High precision but low recall gives you an extremely precise classifier, but one that misses a large number of instances that are difficult to classify. The greater the F1 Score, the better the performance of our model. Mathematically, it can be expressed as:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
F1 Score tries to find the balance between precision and recall.
Precision: It is the number of correct positive results divided by the number of positive results predicted by the classifier: Precision = TP / (TP + FP).

Recall: It is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive): Recall = TP / (TP + FN).
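A compact sketch (reusing the invented predictions from the confusion-matrix example, scikit-learn assumed) tying the three quantities together:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)   # TP / (TP + FP)
r = recall_score(y_true, y_pred)      # TP / (TP + FN)
print("precision:", p, "recall:", r)
print("F1:", f1_score(y_true, y_pred))             # harmonic mean of precision and recall
print("check: 2*p*r/(p+r) =", 2 * p * r / (p + r))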

Mean Absolute Error


Mean Absolute Error is the average of the absolute differences between the original values and the predicted values. It gives us a measure of how far the predictions were from the actual output. However, it does not give us any idea of the direction of the error, i.e. whether we are under-predicting or over-predicting the data. Mathematically, it is represented as:
MAE = (1/N) * Σ_i |y_i - ŷ_i|

Mean Squared Error


Mean Squared Error (MSE) is quite similar to Mean Absolute Error, the only difference being that MSE takes the average of the squares of the differences between the original values and the predicted values:
MSE = (1/N) * Σ_i (y_i - ŷ_i)^2
The advantage of MSE is that its gradient is easier to compute, whereas the absolute value in Mean Absolute Error is not differentiable at zero and needs special handling. Since we take the square of the error, the effect of larger errors becomes more pronounced than that of smaller errors, so the model can focus more on the larger errors.
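Finally, a short sketch (invented regression values, scikit-learn assumed) computing both error metrics:

from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print("MAE:", mean_absolute_error(y_true, y_pred))  # average absolute error
print("MSE:", mean_squared_error(y_true, y_pred))   # average squared error; penalizes large errors more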
