
What is a Bayesian Belief Network?

A Bayesian Belief Network (also called a Bayesian network, Bayes network, belief network, decision network, or Bayesian model) is a Probabilistic Graphical Model (PGM) that represents conditional dependencies between random variables through a Directed Acyclic Graph (DAG).

It consists of two parts:

o Directed Acyclic Graph
o Table of conditional probabilities.

The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an Influence Diagram.

A Bayesian network graph is made up of nodes and arcs (directed links), where:

o Each node corresponds to a random variable, and a variable can be continuous or discrete.
o Arcs or directed arrows represent causal relationships or conditional probabilities between random variables. These directed links connect pairs of nodes in the graph. A link indicates that one node directly influences the other; if there is no directed link, the nodes are independent of each other.
o In the diagram above, A, B, C, and D are random variables represented by the nodes of the network graph.
o If we consider node B, which is connected with node A by a directed arrow, then node A is called the parent of node B.
o Node C is independent of node A.

What do we use Bayesian Networks for?

Bayesian networks are applied in many fields: disease diagnosis, optimized web search, spam filtering, gene regulatory networks, and so on, and this list can be extended.

The main objective of these networks is to understand the structure of causal relations. To clarify this, let's consider a disease diagnosis problem. Given symptoms and the diseases they result from, we construct our belief network, and when a new patient arrives, we can infer which disease or diseases the patient may have by providing a probability for each disease. Similarly, such causal relations can be constructed for other problems, and inference techniques can be applied to obtain interesting results.
Mathematical Definition of Belief Networks

The probabilities in a belief network are calculated by the following factorization of the joint distribution:

P(X1, X2, ..., Xn) = P(X1 | Parents(X1)) * P(X2 | Parents(X2)) * ... * P(Xn | Parents(Xn))

As you can see from the formula, to calculate the joint distribution we need the conditional probabilities indicated by the network. Furthermore, once we have the joint distribution, we can start to ask interesting questions. For example, in the first example we can ask for the probability of "RAIN" given that "SEASON" is "WINTER" and "DOG BARK" is "TRUE".

Explanation of Bayesian network:

Let's understand the Bayesian network through an example by creating a directed acyclic graph:

Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm reliably responds to a burglary but also responds to minor earthquakes. Harry has two neighbors, David and Sophia, who have taken the responsibility to inform Harry at work when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he gets confused with the phone ringing and calls then too. On the other hand, Sophia likes to listen to loud music, so sometimes she misses hearing the alarm.

Here we would like to compute the probability of the burglar alarm event.

Problem:

Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and both David and Sophia called Harry.

Solution:

o The Bayesian network for the above problem is given below. The network structure shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the alarm going off, whereas David's and Sophia's calls depend on the alarm probability.
o The network represents our assumptions that the neighbors do not directly perceive the burglary, do not notice the minor earthquake, and do not confer before calling.
o The conditional distribution for each node is given as a conditional probability table, or CPT.
o Each row in a CPT must sum to 1 because the entries in the row represent an exhaustive set of cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents contains 2^k probabilities. Hence, if there are two parents, the CPT will contain 4 probability values.

List of all events occurring in this network:

o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)

We can write the event of the problem statement in the form of the probability P[D, S, A, B, E], and we can rewrite this probability statement using the joint probability distribution:

P[D, S, A, B, E] = P[D | S, A, B, E] * P[S, A, B, E]

= P[D | S, A, B, E] * P[S | A, B, E] * P[A, B, E]

= P[D | A] * P[S | A, B, E] * P[A, B, E]

= P[D | A] * P[S | A] * P[A | B, E] * P[B, E]

= P[D | A] * P[S | A] * P[A | B, E] * P[B | E] * P[E]


Let's take the observed probabilities for the Burglary and Earthquake components:

P(B = True) = 0.002, which is the probability of a burglary.

P(B = False) = 0.998, which is the probability of no burglary.

P(E = True) = 0.001, which is the probability of a minor earthquake.

P(E = False) = 0.999, which is the probability that no earthquake occurred.

We can provide the conditional probabilities as per the tables below:

Conditional probability table for Alarm (A):

The conditional probability of Alarm A depends on Burglary and Earthquake:

B      E      P(A = True)   P(A = False)
True   True   0.94          0.06
True   False  0.95          0.05
False  True   0.31          0.69
False  False  0.001         0.999


Conditional probability table for David calls (D):

The conditional probability that David calls depends on the probability of the alarm.

A      P(D = True)   P(D = False)
True   0.91          0.09
False  0.05          0.95

Conditional probability table for Sophia calls (S):

The conditional probability that Sophia calls depends on her parent node "Alarm."

A      P(S = True)   P(S = False)
True   0.75          0.25
False  0.02          0.98

From the formula of the joint distribution, we can write the problem statement in the form of a probability distribution:

P(S, D, A, ¬B, ¬E) = P(S | A) * P(D | A) * P(A | ¬B ^ ¬E) * P(¬B) * P(¬E)

= 0.75 * 0.91 * 0.001 * 0.998 * 0.999

= 0.00068045
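The same calculation can be carried out programmatically. Below is a minimal sketch (my own illustration, not part of the original text) that encodes the CPTs above as Python dictionaries and evaluates the factored joint probability.

```python
# Encode the CPTs of the burglary network and evaluate
# P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|B,E) * P(B) * P(E).

P_B = {True: 0.002, False: 0.998}                      # prior on Burglary
P_E = {True: 0.001, False: 0.999}                      # prior on Earthquake
P_A = {(True, True): 0.94, (True, False): 0.95,        # P(A=True | B, E)
       (False, True): 0.31, (False, False): 0.001}
P_D = {True: 0.91, False: 0.05}                        # P(D=True | A)
P_S = {True: 0.75, False: 0.02}                        # P(S=True | A)

def joint(d, s, a, b, e):
    """Joint probability of one full assignment, using the network factorization."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_d = P_D[a] if d else 1 - P_D[a]
    p_s = P_S[a] if s else 1 - P_S[a]
    return p_d * p_s * p_a * P_B[b] * P_E[e]

# Alarm sounded, no burglary, no earthquake, both neighbors called:
print(joint(d=True, s=True, a=True, b=False, e=False))   # ≈ 0.00068045
```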
Introduction To Machine Learning
Machine learning (ML) is the study of computer algorithms that improve automatically
through experience. It is seen as a subset of artificial intelligence. Machine learning
algorithms build a mathematical model based on sample data, known as "training data", in
order to make predictions or decisions without being explicitly programmed to do
so. Machine learning algorithms are used in a wide variety of applications, such as email
filtering and computer vision, where it is difficult or infeasible to develop conventional
algorithms to perform the needed tasks.

Machine learning is closely related to computational statistics, which focuses on making


predictions using computers. The study of mathematical optimization delivers methods,
theory and application domains to the field of machine learning. Data mining is a related field
of study, focusing on exploratory data analysis through unsupervised learning. In its
application across business problems, machine learning is also referred to as predictive
analytics.

What is Machine Learning?

Arthur Samuel, a pioneer in the field of artificial intelligence and computer gaming, coined the term "Machine Learning". He defined machine learning as the "field of study that gives computers the capability to learn without being explicitly programmed".

In layman's terms, Machine Learning (ML) can be explained as automating and improving the learning process of computers based on their experiences, without being explicitly programmed, i.e. without any human assistance. The process starts with feeding good-quality data and then training our machines (computers) by building machine learning models using the data and different algorithms. The choice of algorithms depends on what type of data we have and what kind of task we are trying to automate.

Example: Training of students for an exam.

While preparing for exams, students don't simply cram the subject but try to learn it with complete understanding. Before the examination, they feed their machine (brain) with a good amount of high-quality data (questions and answers from different books, teachers' notes, or online video lectures). In effect, they are training their brain with inputs as well as outputs, i.e. what kind of approach or logic they have to apply to solve different kinds of questions. Each time they solve practice test papers, they measure their performance (accuracy/score) by comparing their answers with the given answer key. Gradually, the performance keeps increasing and they gain more confidence in the adopted approach. That is how models are actually built: train the machine with data (both inputs and outputs are given to the model), and when the time comes, test it on data (with inputs only) and obtain the model's score by comparing its answers with the actual outputs, which were not fed to it during training. Researchers are working assiduously to improve algorithms and techniques so that these models perform even better.

Basic Difference between ML and Traditional Programming

 Traditional Programming: We feed in DATA (input) + PROGRAM (logic), run it on the machine, and get the output.
 Machine Learning: We feed in DATA (input) + OUTPUT, run it on the machine during training, and the machine creates its own program (logic), which can be evaluated during testing.

How does ML work?
 Gathering past data in any form suitable for processing. The better the quality of the data, the more suitable it will be for modelling.
 Data processing – sometimes the collected data is in raw form and needs to be pre-processed.
Example: some tuples may have missing values for certain attributes; in this case, the missing values have to be filled with suitable values in order to perform machine learning or any form of data mining.
Missing values for numerical attributes, such as the price of a house, may be replaced with the mean value of the attribute, whereas missing values for categorical attributes may be replaced with the mode (most frequent value) of the attribute. This depends on the kind of preprocessing we apply. If the data is in the form of text or images, then converting it to numerical form (a list, array, or matrix) is required. In short, the data has to be made relevant and consistent, and converted into a format understandable by the machine.
 Dividing the input data into training, cross-validation, and test sets. The ratio between the respective sets is typically 6:2:2 (see the sketch after this list).
 Building models with suitable algorithms and techniques on the training set.
 Testing our conceptualized model with data that was not fed to the model at the time of training, and evaluating its performance using metrics such as F1 score, precision, and recall.
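As a minimal sketch (my own illustration, not from the original text), the 6:2:2 train / cross-validation / test split described above can be done with NumPy as follows.

```python
import numpy as np

def split_622(X, y, seed=0):
    """Shuffle the data and split it into 60% train, 20% validation, 20% test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(0.6 * len(X))
    n_val = int(0.2 * len(X))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])

# Example with dummy data: 100 samples, 5 features each.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)
(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_622(X, y)
print(len(train_X), len(val_X), len(test_X))   # 60 20 20
```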

Supervised Learning

Supervised learning is the most popular paradigm for machine learning. It is the easiest to

understand and the simplest to implement. It is very similar to teaching a child with the use of

flash cards.
Given data in the form of examples with labels, we can feed a learning algorithm these
example-label pairs one by one, allowing the algorithm to predict the label for each example,
and giving it feedback as to whether it predicted the right answer or not. Over time, the
algorithm will learn to approximate the exact nature of the relationship between examples and
their labels. When fully-trained, the supervised learning algorithm will be able to observe a
new, never-before-seen example and predict a good label for it.

Supervised learning is often described as task-oriented because of this. It is highly focused on

a singular task, feeding more and more examples to the algorithm until it can accurately

perform on that task. This is the learning type that you will most likely encounter, as it is

exhibited in many of the following common applications:

 Advertisement Popularity: Selecting advertisements that will perform well is often a supervised learning task. Many of the ads you see as you browse the internet are placed there because a learning algorithm said that they were of reasonable popularity (and clickability). Furthermore, their placement on a certain site or alongside a certain query (if you find yourself using a search engine) is largely due to a learned algorithm saying that the matching between ad and placement will be effective.

 Spam Classification: If you use a modern email system, chances are you’ve encountered

a spam filter. That spam filter is a supervised learning system. Fed email examples and

labels (spam/not spam), these systems learn how to preemptively filter out malicious

emails so that their user is not harassed by them. Many of these also behave in such a way

that a user can provide new labels to the system and it can learn user preference.

 Face Recognition: Do you use Facebook? Most likely your face has been used in a supervised learning algorithm that is trained to recognize your face. Having a system that takes a photo, finds faces, and guesses who is in the photo (suggesting a tag) is a supervised process. It has multiple layers to it, finding faces and then identifying them, but is still supervised nonetheless.
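To make the idea of learning from example-label pairs concrete, here is a minimal sketch (my own illustration, not from the original text) of a tiny supervised text classifier in the spirit of the spam example above, using scikit-learn; the example emails and labels are made up for demonstration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Example-label pairs: each email comes with a "spam" / "not spam" label.
emails = ["win a free prize now", "meeting at 10am tomorrow",
          "cheap loans click here", "lunch with the team today"]
labels = ["spam", "not spam", "spam", "not spam"]

# Vectorize the text and fit a classifier on the labeled examples.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(emails, labels)

# Predict a label for a never-before-seen example.
print(model.predict(["claim your free prize"]))
```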


Unsupervised Learning

Unsupervised learning is very much the opposite of supervised learning. It features no labels.

Instead, our algorithm would be fed a lot of data and given the tools to understand the

properties of the data. From there, it can learn to group, cluster, and/or organize the data in a

way such that a human (or other intelligent algorithm) can come in and make sense of the

newly organized data.

What makes unsupervised learning such an interesting area is that an overwhelming majority
of data in this world is unlabelled. Having intelligent algorithms that can take our terabytes
and terabytes of unlabelled data and make sense of it is a huge source of potential profit for
many industries. That alone could help boost productivity in a number of fields.

For example, what if we had a large database of every research paper ever published, and unsupervised learning algorithms that knew how to group them in such a way that you were always aware of the current progress within a particular domain of research? Now imagine you begin a research project yourself, hooking your work into this network that the algorithm can see. As you write your work up and take notes, the algorithm makes suggestions to you about related works, works you may wish to cite, and works that may even help you push that domain of research forward. With such a tool, your productivity could be boosted enormously.

Because unsupervised learning is based upon the data and its properties, we can say that unsupervised learning is data-driven. The outcomes of an unsupervised learning task are controlled by the data and the way it is formatted. Some areas where you might see unsupervised learning crop up are:

 Recommender Systems: If you’ve ever used YouTube or Netflix, you’ve most likely
encountered a video recommendation system. These systems are oftentimes placed in the unsupervised domain. We know things about videos, maybe their length, their genre, etc.
We also know the watch history of many users. Taking into account users that have
watched similar videos as you and then enjoyed other videos that you have yet to see, a
recommender system can see this relationship in the data and prompt you with such a
suggestion.

 Buying Habits: It is likely that your buying habits are contained in a database somewhere
and that data is being bought and sold actively at this time. These buying habits can be
used in unsupervised learning algorithms to group customers into similar purchasing
segments. This helps companies market to these grouped segments and can even resemble
recommender systems.

 Grouping User Logs: Less user facing, but still very relevant, we can use unsupervised
learning to group user logs and issues. This can help companies identify central themes to
issues their customers face and rectify these issues, through improving a product or
designing an FAQ to handle common issues. Either way, it is something that is actively
done and if you’ve ever submitted an issue with a product or submitted a bug report, it is
likely that it was fed to an unsupervised learning algorithm to cluster it with other similar
issues.
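As a minimal sketch (my own illustration, not from the original text) of unsupervised clustering like the customer-segmentation example above, here is k-means applied to made-up purchase features.

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row: [average order value, orders per month] for one customer.
purchases = np.array([[12.0, 1], [15.0, 2], [300.0, 4],
                      [280.0, 5], [14.0, 1], [310.0, 6]])

# No labels are given; k-means groups the customers purely from the data.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(purchases)
print(kmeans.labels_)           # cluster id per customer, e.g. [0 0 1 1 0 1]
print(kmeans.cluster_centers_)  # typical purchasing profile of each segment
```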
Reinforcement Learning

Reinforcement Learning (RL) is a type of machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences.

Reinforcement learning uses rewards and punishments as signals for positive and negative behaviour.

In the case of reinforcement learning, the goal is to find a suitable action model that would maximize the total cumulative reward of the agent. The figure below illustrates the action-reward feedback loop of a generic RL model.

Some key terms that describe the basic elements of an RL problem are:

1. Environment — physical world in which the agent operates
2. State — current situation of the agent
3. Reward — feedback from the environment
4. Policy — method to map the agent's state to actions
5. Value — future reward that an agent would receive by taking an action in a particular state

Markov decision process (MDP)


A Markov decision process (MDP) is a discrete-time stochastic control process. It provides a mathematical framework for modelling decision making in situations where outcomes are partly random and partly under the control of a decision maker. MDPs are useful for studying optimization problems solved via dynamic programming, and they are used in robotics, automatic control, economics, manufacturing, and gaming.
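To illustrate how an agent can learn a policy for a simple MDP by trial and error, here is a minimal sketch (my own illustration, not from the original text) of tabular Q-learning on a made-up 1-D corridor of 5 states, where the agent moves left or right and gets a reward of 1 only on reaching the rightmost state.

```python
import random

N_STATES, ACTIONS = 5, [0, 1]          # action 0 = left, 1 = right
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration
Q = [[0.0, 0.0] for _ in range(N_STATES)]

for _ in range(500):                   # episodes
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy action selection
        a = random.choice(ACTIONS) if random.random() < epsilon else max(ACTIONS, key=lambda x: Q[s][x])
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print([round(max(q), 2) for q in Q])   # learned state values increase toward the goal
```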
********************************************************
********************************************************
Categorization of machine learning tasks on the
basis of required Output

1. Classification : When inputs are divided into two or more classes, and the learner must
produce a model that assigns unseen inputs to one or more (multi-label classification) of
these classes. This is typically tackled in a supervised way. Spam filtering is an example
of classification, where the inputs are email (or other) messages and the classes are
“spam” and “not spam”.
2. Regression : Also a supervised problem; the case when the outputs are continuous rather than discrete.
3. Clustering : When a set of inputs is to be divided into groups. Unlike in classification,
the groups are not known beforehand, making this typically an unsupervised task.

Supervised Algorithm
1. Linear Regression (used for Regression)
2. Logistic Regression (used for classification)
3. Support Vector Machine
4. Decision Tree
5. KNN (K-Nearest Neighbour)
6. Naïve Bayes
7. Convolutional Neural Network(CNN)

Unsupervised Learning Algorithm


 K-means
 Self-organizing map(SOM)

Reinforcement Learning Algorithm


 Q-learning

*********************************************

Designing a Learning System

1. Data collection
2. Data pre-processing
 Data splitting (80:20 ratio)
3. Building a model (choose an algorithm)
4. Train the model (fit the training dataset)
5. Evaluate the model (check the performance of the model)
6. Parameter tuning (epochs, learning rate)
7. Make predictions
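A minimal end-to-end sketch of these design steps (my own illustration, not from the original text) using scikit-learn and its built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                        # 1. data collection
X_train, X_test, y_train, y_test = train_test_split(     # 2. data splitting (80:20)
    X, y, test_size=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0)           # 3. choose an algorithm
model.fit(X_train, y_train)                              # 4. train the model

preds = model.predict(X_test)                            # 7. make predictions
print("accuracy:", accuracy_score(y_test, preds))        # 5. evaluate the model

# 6. parameter tuning: search over a hyperparameter (tree depth here).
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"max_depth": [2, 3, 4, 5]}, cv=5).fit(X_train, y_train)
print("best depth:", grid.best_params_)
```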

********************************************************
********************************************************
Statistical Learning Method
Statistics is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. Statistics is a collection of tools that you can use to get answers to important questions about data.
You can use descriptive statistical methods to transform raw observations into information that you can understand and share.
Statistical learning is a set of tools for understanding data. These tools broadly come under two classes: supervised learning and unsupervised learning. The main ideas in this technique are data and hypotheses. Here, data is the evidence, i.e. instantiations of some or all of the random variables describing the domain. Bayesian learning calculates the probability of each hypothesis given the data and makes predictions on that basis.

As a machine learning practitioner, you must have an understanding of statistical methods.

Raw observations alone are data, but they are not information or knowledge. Data raises
questions, such as:

1. What is the most common or expected observation?


2. What are the limits on the observations?
3. What does the data look like?
4. What variables are most relevant?
5. What is the difference between two experiments?
6. Are the differences real or the result of noise in the data?

It would be fair to say that statistical methods are required to effectively work through a
machine learning predictive modelling project.

Below are 10 examples of where statistical methods are used in an applied machine
learning project.

 Problem Framing: requires the use of exploratory data analysis and data mining.
 Data Understanding: requires the use of summary statistics and data visualization.
 Data Cleaning: requires the use of outlier detection, imputation, and more.
 Data Selection: requires the use of data sampling and feature selection methods.
 Data Preparation: requires the use of data transforms, scaling, encoding, and much more.
 Model Evaluation: requires experimental design and resampling methods.
 Model Configuration: requires the use of statistical hypothesis tests and estimation statistics.
 Model Selection: requires the use of statistical hypothesis tests and estimation statistics.
 Model Presentation: requires the use of estimation statistics such as confidence intervals.
 Model Predictions: requires the use of estimation statistics such as prediction intervals.

***************************************************************************
***********************************************************************

Learning with complete data (Assignment question)

Data are complete when each data point contains values for every variable in the probability model being learned. Complete data greatly simplify the problem of learning the parameters of a complex model.

Example:
Naive Bayes Classifier
1. The Naive Bayes algorithm is a supervised learning algorithm, which is based on Bayes' theorem and used for solving classification problems.
2. It is mainly used in text classification, which involves high-dimensional training datasets.
3. The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.
4. It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
5. Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.

Naive Bayes classifier in mathematical form

Using Bayes' theorem and the assumption that the features are conditionally independent given the class, the classifier picks the class y that maximizes

P(y | x1, ..., xn) ∝ P(y) * P(x1 | y) * P(x2 | y) * ... * P(xn | y)

Example (numerical):

Problem –

Novel instance:

X = (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong)

Determine whether X will play tennis or not.
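A minimal sketch (my own illustration, not from the original text) of naive Bayes classification on a made-up miniature weather dataset in the spirit of this problem; with complete data, the parameters are simply estimated by counting.

```python
from collections import Counter, defaultdict

# Made-up training rows: (outlook, temperature, humidity, wind, play tennis?)
data = [
    ("sunny", "hot", "high", "weak", "no"),
    ("sunny", "hot", "high", "strong", "no"),
    ("overcast", "hot", "high", "weak", "yes"),
    ("rain", "mild", "high", "weak", "yes"),
    ("rain", "cool", "normal", "weak", "yes"),
    ("rain", "cool", "normal", "strong", "no"),
    ("overcast", "cool", "normal", "strong", "yes"),
    ("sunny", "mild", "high", "weak", "no"),
]

labels = [row[-1] for row in data]
prior = Counter(labels)                        # class counts for P(y)

# Conditional counts: cond[feature_index][(value, y)] = count
cond = defaultdict(Counter)
for row in data:
    y = row[-1]
    for i, v in enumerate(row[:-1]):
        cond[i][(v, y)] += 1

def score(x, y):
    """Unnormalized P(y) * prod_i P(x_i | y), with simple add-one smoothing."""
    p = prior[y] / len(data)
    for i, v in enumerate(x):
        p *= (cond[i][(v, y)] + 1) / (prior[y] + 2)
    return p

x = ("sunny", "cool", "high", "strong")
print({y: round(score(x, y), 5) for y in prior})   # the class with the higher score wins
```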

Advantages of the Naïve Bayes Classifier:

o Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
o It can be used for binary as well as multi-class classification.
o It performs well on multi-class predictions compared to other algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:


o Naive Bayes assumes that all features are independent or unrelated, so it cannot learn
the relationship between features.

Applications of Naïve Bayes Classifier:


o It is used for Credit Scoring.
o It is used in medical data classification.
o It can be used in real-time predictions because Naïve Bayes Classifier is an eager
learner.
o It is used in Text classification such as Spam filtering and Sentiment analysis.

Learning with hidden data (Assignment Question)

This concerns learning probabilistic models in which some variables are hidden, that is, not observed. Models with hidden variables are sometimes called latent variable models. The EM algorithm is a very general solution to this kind of problem and goes very well with probabilistic graphical models.
Expectation-Maximization Algorithm (EM Algorithm)
The Expectation-Maximization algorithm can be used for latent variables (variables that are not directly observable and are actually inferred from the values of the other, observed variables) in order to predict their values, with the condition that the general form of the probability distribution governing those latent variables is known to us. This algorithm is at the base of many unsupervised clustering algorithms in the field of machine learning.

It was explained, proposed, and given its name in a paper published in 1977 by Arthur Dempster, Nan Laird, and Donald Rubin. It is used to find local maximum-likelihood parameters of a statistical model in cases where latent variables are involved and the data is missing or incomplete.
Algorithm:
1. Given a set of incomplete data, consider a set of starting parameters.
2. Expectation step (E – step): Using the observed available data of the dataset,
estimate (guess) the values of the missing data.
3. Maximization step (M – step): Complete data generated after the expectation
(E) step is used in order to update the parameters.
4. Repeat step 2 and step 3 until convergence.

The essence of the Expectation-Maximization algorithm is to use the available observed data of the dataset to estimate the missing data, and then to use that completed data to update the values of the parameters. Let us understand the EM algorithm in detail.

 Initially, a set of initial values of the parameters is considered. A set of incomplete observed data is given to the system, with the assumption that the observed data come from a specific model.
 The next step is the "Expectation" step, or E-step. In this step, we use the observed data to estimate or guess the values of the missing or incomplete data. It is basically used to update the variables.
 The next step is the "Maximization" step, or M-step. In this step, we use the complete data generated in the preceding Expectation step to update the values of the parameters. It is basically used to update the hypothesis.
 Finally, in the fourth step, it is checked whether the values are converging or not; if yes, we stop, otherwise we repeat step 2 and step 3, i.e. the Expectation and Maximization steps, until convergence occurs.
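As a minimal sketch (my own illustration, not from the original text) of these E- and M-steps, here is EM fitting a two-component 1-D Gaussian mixture, where the component assignment of each point is the hidden (latent) variable; the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussians; the true assignments are "hidden".
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

# Initial guesses for the parameters (means, variances, mixing weights).
mu, var, pi = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibilities = posterior probability of each component per point.
    r = pi * gauss(x[:, None], mu, var)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate the parameters using the responsibilities as soft counts.
    n_k = r.sum(axis=0)
    mu = (r * x[:, None]).sum(axis=0) / n_k
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_k
    pi = n_k / len(x)

print(np.round(mu, 2), np.round(var, 2), np.round(pi, 2))  # ≈ means [0, 5], equal weights
```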

Flow chart for EM algorithm –

Usage of EM algorithm –
 It can be used to fill the missing data in a sample.
 It can be used as the basis of unsupervised learning of clusters.
 It can be used for the purpose of estimating the parameters of Hidden Markov
Model (HMM).
 It can be used for discovering the values of latent variables.
Advantages of EM algorithm –
 It is always guaranteed that likelihood will increase with each iteration.
 The E-step and M-step are often pretty easy for many problems in terms of
implementation.
 Solutions to the M-steps often exist in the closed form.
Disadvantages of EM algorithm –
 It has slow convergence.
 It converges only to a local optimum.
 It requires both the forward and backward probabilities (numerical optimization requires only the forward probability).

*********************************************

Decision Tree Algorithm


o Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but mostly it is preferred for solving classification problems. It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
o In a decision tree, there are two kinds of nodes: the Decision Node and the Leaf Node. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
o The decisions or tests are performed on the basis of the features of the given dataset.
o It is a graphical representation for obtaining all the possible solutions to a problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with a root node, which expands into further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification and Regression Tree algorithm, or the ID3 (Iterative Dichotomiser 3) algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.

Types of Decision Trees


Types of decision trees are based on the type of target variable we have. It can be of two
types:

1. Categorical Variable Decision Tree: a decision tree that has a categorical target variable is called a categorical variable decision tree.

2. Continuous Variable Decision Tree: a decision tree that has a continuous target variable is called a continuous variable decision tree.

Example:- Let’s say we have a problem to predict whether a customer will pay his renewal
premium with an insurance company (yes/ no). Here we know that the income of customers
is a significant variable but the insurance company does not have income details for all
customers. Now, as we know this is an important variable, then we can build a decision tree
to predict customer income based on occupation, product, and various other variables. In this
case, we are predicting values for the continuous variables.
Important Terminology related to Decision Trees
1. Root Node: It represents the entire population or sample and this further gets divided
into two or more homogeneous sets.

2. Splitting: It is a process of dividing a node into two or more sub-nodes.

3. Decision Node: When a sub-node splits into further sub-nodes, then it is called the
decision node.

4. Leaf / Terminal Node: Nodes that do not split are called leaf or terminal nodes.

5. Pruning: When we remove sub-nodes of a decision node, the process is called pruning. You can say it is the opposite of splitting.

6. Branch / Sub-Tree: A subsection of the entire tree is called branch or sub-tree.

7. Parent and Child Node: A node, which is divided into sub-nodes, is called a parent
node of sub-nodes whereas sub-nodes are the child of a parent node.

DECISION TREE REPRESENTATION

Decision trees classify instances by sorting them down the tree from the root to some leaf
node, which provides the classification of the instance. Each node in the tree specifies a test
of some attribute of the instance, and each branch descending from that node corresponds to
one of the possible values for this attribute. An instance is classified by starting at the root
node of the tree, testing the attribute specified by this node, then moving down the tree
branch corresponding to the value of the attribute in the given example. This process is then
repeated for the subtree rooted at the new node.

For example, the new instance

X = (Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)

would be sorted down the leftmost branch of the decision tree shown and would therefore be classified as a negative instance (i.e., the tree predicts PlayTennis = No).
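A minimal sketch (my own illustration, not from the original text) of sorting an instance down a small hand-built decision tree, represented as nested dictionaries; the tree below is an assumed PlayTennis-style fragment used only for illustration.

```python
tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, instance):
    """Walk from the root, testing the attribute at each node, until a leaf is reached."""
    while isinstance(node, dict):
        attribute = next(iter(node))                 # attribute tested at this node
        node = node[attribute][instance[attribute]]  # follow the matching branch
    return node                                      # leaf = predicted label

x = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"}
print(classify(tree, x))   # "No" -> PlayTennis = No for this instance
```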
Steps in the ID3 (Iterative Dichotomiser 3) algorithm:

1. Calculate the entropy of the dataset.
2. For each attribute, calculate the information gain obtained by splitting on it.
3. Select the attribute with the highest information gain and make it the decision node.
4. Split the dataset on the selected attribute and repeat the process recursively on each subset, until all instances in a subset belong to the same class or no attributes remain.

Attribute Selection Measures


While implementing a Decision tree, the main issue arises that how to select the best attribute
for the root node and for sub-nodes. So, to solve such problems there is a technique which is
called as Attribute selection measure or ASM. By this measurement, we can easily select
the best attribute for the nodes of the tree. There are two popular techniques for ASM, which
are:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of the change in entropy after the segmentation of a dataset based on an attribute.
o It calculates how much information a feature provides us about the class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the formula below:
Information Gain(S, A) = Entropy(S) − Σ (|Sv| / |S|) * Entropy(Sv), summed over each value v of attribute A

i.e. Gain = Entropy(S) − I(A), where I(A) is the average (weighted) information of the subsets produced by splitting on attribute A.

2. Entropy:

Entropy is a metric that measures the impurity in a given attribute; it specifies the randomness in the data. For a Boolean classification it can be calculated as:

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

or, in general,

Entropy(S) = −Σ pi log2 pi, summed over each class i

where:

o S = the set of samples
o P(yes) = probability of yes
o P(no) = probability of no

3. Gini Index:
o The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
o It is calculated by subtracting the sum of the squared probabilities of each class from one.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o The CART algorithm only creates binary splits, and it uses the Gini index to create them.
o The Gini index can be calculated using the formula below:

Gini Index = 1 − Σ (Pj)^2, summed over each class j
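A minimal sketch (my own illustration, not from the original text) of the three attribute selection measures above, computed from class counts; the example counts at the end are assumed values for illustration only.

```python
from math import log2

def entropy(counts):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gini(counts):
    """Gini index = 1 - sum_j p_j^2 over the class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def information_gain(parent_counts, child_counts_list):
    """Gain = Entropy(S) - weighted average entropy of the child subsets."""
    total = sum(parent_counts)
    weighted = sum(sum(child) / total * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

# Example: 9 "yes" / 5 "no" overall; splitting on an attribute yields three subsets.
print(round(entropy([9, 5]), 3))                                      # ≈ 0.940
print(round(gini([9, 5]), 3))                                         # ≈ 0.459
print(round(information_gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # ≈ 0.247
```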

Pruning: Getting an Optimal Decision tree


Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.

A tree that is too large increases the risk of overfitting, while a tree that is too small may not capture all the important features of the dataset. Pruning is therefore a technique that decreases the size of the learned tree without reducing accuracy. There are mainly two types of tree pruning techniques used:

o Cost Complexity Pruning
o Reduced Error Pruning

Advantages of the Decision Tree


o It is simple to understand, as it follows the same process that a human follows while making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes of a problem.
o There is less requirement for data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree can contain many layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest algorithm.
o For more class labels, the computational complexity of the decision tree may increase.
