Machine Learning
(Course Code : CSDLO6021)
(Department Level Optional Course - II)

ISBN : 978-81-941546-7-8
Price : Rs. 245/-

University Paper Solutions | eBooks (PDF download) available on the Android app
Tech-Neo Publications - A Sachin Shah Venture
University of Mumbai

Machine Learning
(Course Code : CSDLO6021)
Department Level Optional Course - II

Semester VI - Computer Engineering

Strictly as per the Revised Syllabus (REV-2016 'CBCGS' Scheme) of Mumbai University w.e.f. academic year 2018-2019

Dr. Vaikole Shubhangi Liladhar
Associate Prof. (Ph.D.)
Datta Meghe College of Engineering, Sector-3, Airoli,
Navi Mumbai - 400708

Dr. Sawarkar Sudhirkumar Deoraoji
Professor and Principal
Datta Meghe College of Engineering,
Navi Mumbai - 400708

Tech-Neo Publications (M6-14A)
Where Authors Inspire
A Sachin Shah Venture
Machine Learning (Course Code : CSDLO6021)
(Semester 6 / Computer - For Rev. Syll. 2018-2019)

Authors : Dr. Vaikole Shubhangi Liladhar, Dr. Sawarkar Sudhirkumar Deoraoji

First Edition for Rev Syllabus : February 2021

Tech-Neo ID : M6-14

Copyright (c) by Authors. All rights reserved.
No part of this publication may be reproduced, copied, or stored in a retrieval system, distributed or transmitted in any form or by any means, including photocopy, recording, or other electronic or mechanical methods, without the prior written permission of the Publisher.
This book is sold subject to the condition that it shall not, by way of trade or otherwise, be lent, resold, hired out, or otherwise circulated without the publisher's prior written consent, in any form of binding or cover other than that in which it is published, and without a similar condition (including this condition) being imposed on the subsequent purchaser, and without limiting the rights under the copyright reserved above.

Published by
Mr. Sachin S. Shah
Managing Director, B.E. (Industrial Electronics)
An Alumnus of IIM Ahmedabad

Address
Tech-Neo Publications LLP
Sr. No. 38/1, Behind Pari Company, Khedekar Industrial Estate, Narhe, Pune - 411041, Maharashtra.
Website : www.techneobooks.com

Printed at
Image Offset (Mr. Rahul Shah)
Dugane Ind. Area, Survey No. 28/25, Dhayari, Near Pari Company, Pune - 411041, Maharashtra State, India.
E-mail : rahulshahimage@gmail.com

About Managing Director... Mr. Sachin Shah
- Over 25 years of experience in Academic Publishing.
- With over two and a half decades of experience in bringing out more than 1200 titles in Engineering, Polytechnic, Pharmacy, Sciences and Information Technology, Sachin Shah is a name synonymous with quality and innovative content.
A driven Educationalist...
1. A B.E. in Industrial Electronics (1992 Batch) from Bharati Vidyapeeth College of Engineering, affiliated to the University of Pune.
2. An Alumnus of IIM Ahmedabad.
3. A Co-Author of a bestselling book on "Engineering Mathematics" for Polytechnic Students of Maharashtra State.
4. Sachin has, for over a decade, been working as a Consultant for Higher Education in the USA and several other countries.
With a path-breaking career...
A publishing career that started with handwritten cyclostyled notes back in 1992; Sachin Shah has to his credit the setting up and expansion of one of the leading companies in higher education publishing.
An experienced professional and an expert...
An energetic, creative and resourceful professional, Sachin Shah's extensive experience of closely working with the most eminent authors of the publishing industry ensures high standards of quality content. This ability has helped students attain better understanding and in-depth knowledge of the subject.
A visionary...
A gregarious person, Sachin Shah is a thought leader who has been simplifying learning and bridging the gap between the best authors in the publishing industry and the student community for decades.
ith

Dedicated to

The Readers of this Book

- Authors

To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Preface

We are glad to present the New Edition of this book titled "Machine Learning".

This book covers the revised syllabus of the Computer Engineering (Semester-6) course of Mumbai University, which has been effective since the academic year 2018-2019.

We have divided the subject into small chapters so that the topics can be arranged and understood properly. The topics within the chapters have been arranged in a proper sequence to ensure a smooth flow of the subject.

A large number of solved examples have been included. So, we are sure that this book will cater to all your needs for this subject.

We are thankful to Shri. Sachin Shah for the encouragement and support that he has extended to us. We are also thankful to the staff members of Tech-Neo Publications and others for their efforts to make this book as good as it is. We have jointly made every possible effort to eliminate all the errors in this book. However, if you find any, please let us know, because that will help us improve further.

We are also thankful to our family members and friends for their patience and encouragement.

- Authors
Syllabus...

Course Code : CSDLO6021    Course Name : Machine Learning    Credits Assigned : 4

Prerequisites : Data Structures, Basic Probability and Statistics, Algorithms

Course Objectives :
1. To introduce students to the basic concepts and techniques of Machine Learning.
2. To become familiar with regression methods, classification methods, clustering methods.
3. To become familiar with Dimensionality Reduction Techniques.

Course Outcomes : Students will be able to
1. Gain knowledge about basic concepts of Machine Learning.
2. Identify machine learning techniques suitable for a given problem.
3. Solve the problems using various machine learning techniques.
4. Apply Dimensionality Reduction techniques.
5. Design applications using machine learning techniques.

Module | Detailed Contents | Hrs.
1 | Introduction to Machine Learning : Machine Learning, Types of Machine Learning, Issues in Machine Learning, Application of Machine Learning, Steps in developing a Machine Learning Application. (Refer chapter 1) | 6
2 | Introduction to Neural Network : Introduction - Fundamental concept - Evolution of Neural Networks - Biological Neuron, Artificial Neural Networks, NN architecture, Activation functions, McCulloch-Pitts Model. (Refer chapter 2) | 8
3 | Introduction to Optimization Techniques : Derivative based optimization - Steepest Descent, Newton method. Derivative free optimization - Random Search, Down Hill Simplex. (Refer chapter 3) | 6
4 | Learning with Regression and Trees : Learning with Regression : Linear Regression, Logistic Regression. Learning with Trees : Decision Trees, Constructing Decision Trees using Gini Index, Classification and Regression Trees (CART). (Refer chapter 4) | 10
5 | Learning with Classification and Clustering : 5.1 Classification : Rule based classification, classification by Bayesian Belief networks, Hidden Markov Models. Support Vector Machine : Maximum Margin Linear Separators, Quadratic Programming solution to finding maximum margin separators, Kernels for learning non-linear functions. 5.2 Clustering : Expectation Maximization Algorithm, ... clustering, Radial Basis functions. (Refer chapter 5) | 14
6 | Dimensionality Reduction : Dimensionality Reduction Techniques, Principal Component Analysis, Independent Component Analysis, Single value decomposition. (Refer chapter 6) |
© Chapter 1 : Introduction to Machine Learning .......... 1-1 to 1-46
© Chapter 2 : Introduction to Neural Network .......... 2-1 to 2-26
© Chapter 3 : Introduction to Optimization Techniques
© Chapter 4 : Learning with Regression and Trees
© Chapter 5 : Learning with Classification and Clustering .......... 5-1 to 5-65
© Chapter 6 : Dimensionality Reduction .......... 6-1 to 6-19
Chapter 1
Introduction to Machine Learning

University Prescribed Syllabus

Machine Learning, Types of Machine Learning, Issues in Machine Learning, Application of Machine Learning, Steps in developing a Machine Learning Application.

1.1  Machine Learning .......... 1-2
1.2  Key Terminology .......... 1-3
1.3  Types of Machine Learning .......... 1-5
1.4  Issues in Machine Learning
1.5  How to Choose The Right Algorithm
1.6  Steps in developing a Machine Learning Application
1.7  Applications of Machine Learning
1.8  University Questions and Answers
     Multiple Choice Questions
     Chapter Ends
1.1  MACHINE LEARNING

- A machine that is intellectually as capable as humans has always attracted writers and early computer scientists, who were excited about artificial intelligence and machine learning.
- The first machine learning system was developed in the 1950s. In 1952, Samuel developed a program to play checkers. The program was able to observe positions in the game and learn a model that gives better moves for a player.
- In 1957, Frank Rosenblatt designed the Perceptron, which is a simple classifier, but when it is combined in large numbers, in a network, it becomes a powerful tool.
- Minsky, in 1969, came up with a limitation of the perceptron. He showed that the X-OR problem could not be represented by a perceptron and that such inseparable data distributions cannot be handled; following Minsky's work, neural network research went dormant until the 1980s.
- Machine learning became very famous in the 1990s due to the introduction of statistics. The combination of computer science and statistics led to probabilistic approaches in Artificial Intelligence. The area further shifted to data-driven techniques. As huge amounts of data became available, scientists started to design intelligent systems that are able to analyze and learn from data.
- Machine learning is a category of Artificial Intelligence. In machine learning, computers have the ability to learn by themselves; explicit programming is not required.
- Machine learning focuses on the study and development of algorithms that can learn from data and also make predictions on data.
- Machine learning is defined by Tom Mitchell as : "A program learns from experience 'E' with respect to some class of tasks 'T' and performance measure 'P', if its performance on tasks in 'T', as measured by 'P', improves with 'E'."
- 'E' represents the past experienced data and 'T' represents the tasks such as prediction, classification, etc. As an example of 'P', we might want to increase accuracy in prediction.
- Machine learning mainly focuses on the design and development of computer programs that can teach themselves to grow and change when exposed to new data.
- Using machine learning we can collect information from a dataset by asking the computer to make some sense of the data. Machine learning is turning data into information.

Fig. 1.1.1 : Machine Learning (data is given to the computer, which produces the program / output)

- Machine learning can be applied to many applications, from politics to geosciences. It is a tool that can be applied to extract some information from data and also take some action on the data; any problem of this kind can benefit from machine learning methods.

- Some of the applications are spam filtering in email, face recognition, product recommendations from Amazon.com and handwritten digit recognition.
- In detecting spam email, checking for the occurrence of a single word will not be very helpful. But by checking the occurrences of certain words used together, combined with the length of the email and other parameters, you can get a much clearer idea of whether the email is spam or not.

Fig. 1.1.2 : Schematic diagram of Machine Learning (a Learner and a Reasoner built on Models and Background Knowledge)

- Machine learning is used by most companies to increase productivity, forecast weather, improve business decisions, detect disease and do many more things.
- Machine learning uses statistics. There are many problems where the solution is not deterministic. There are certain problems for which we don't have much information and also don't have enough computing power to model the problem properly. For these problems we need statistics; an example of such a problem is the prediction of the motivation and behavior of humans, which is currently very difficult to model.
- Machine learning = Take data + understand it + process it + extract value from it + visualize it + communicate it

1.2  KEY TERMINOLOGY

Expert System
- An expert system is a system which is developed using some training set, testing set, knowledge representation, features, algorithm and classification terminology.
(i)   Training Set : A training set comprises training examples which will be used to train machine learning algorithms.
(ii)  Testing Set : To test machine learning algorithms, what is usually done is to have a training set of data and a separate dataset, called a test set.
(iii) Knowledge Representation : Knowledge representation may be stored in the form of a set of rules. It may be an example from the training set or a probability distribution.
(iv)  Features : Important properties or attributes.
(v)   Classification : We classify the data based on features.

Process : Suppose we want to use a machine learning algorithm for classification. The next step is to train the algorithm, or allow it to learn. To train the algorithm we give as input quality data called a training set. Each training example has some features and one target variable. The target variable is what we will be trying to predict with our machine learning algorithms. In a training dataset the target variable is known. The machine learns by finding some relationship between the target variable and the features. In classification tasks the target variables are known as classes. It is assumed that there will be a limited number of classes.
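The training/testing process described above can be illustrated with a minimal sketch. The toy feature values, the class labels and the choice of a decision tree below are illustrative assumptions, not part of the text.

# Minimal sketch of the train/test process: features plus a known target variable,
# a training set, a separate test set, and an accuracy check (illustrative data).
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X = [[0, 1], [1, 1], [0, 0], [1, 0], [1, 2], [0, 2]]   # features of each example
y = ["spam", "spam", "ham", "ham", "spam", "ham"]      # known classes (target variable)

# Keep part of the labelled data aside as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

clf = DecisionTreeClassifier()      # the learning algorithm
clf.fit(X_train, y_train)           # train: learn the feature-target relationship

# Test: compare predictions on unseen examples with their known classes.
print("Predicted :", list(clf.predict(X_test)))
print("Actual    :", y_test)
print("Accuracy  :", clf.score(X_test, y_test))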
- The class or target variable that the training example belongs to is then compared to the predicted value, and we can get an idea about the accuracy of the trained algorithm.
- Example : First we will see some terminologies that are frequently used in machine learning methods. Let's take an example where we want to design a classification system that will classify the instances into either Acceptable or Unacceptable. This kind of system is a fascinating topic often related with machine learning, called expert systems.
- Four features of the various cars are stored in Table 1.2.1. The features or attributes selected are Buying_Price, Maintenance_Price, Lug_Boot and Safety. Examples belonging to Table 1.2.1 comprise these features; each row represents a record.
- In Table 1.2.1 all the features are categorical in nature and take limited disjoint values. The first two features represent the buying price and the maintenance price of a car, such as high, medium and low. The third feature shows the luggage capacity of a car as small, medium or big. The fourth feature represents whether the car has safety measures or not, and takes the value low, medium or high.
- Classification is one of the important tasks in machine learning. In this application we want to evaluate a car out of a group of other cars. Suppose we have all the information about a car's Buying_Price, Maintenance_Price, Lug_Boot and Safety. The classification method is used to evaluate a given car as Acceptable or Unacceptable. Many machine learning algorithms can be used for classification. The target or response variable in this example is the evaluation of a car.
- Suppose we have selected a machine learning algorithm to use for classification. The main task in classification is to train the algorithm, or allow it to learn. We give the experienced data as the input to train the algorithm; this is called training data.
- Let's assume the training dataset contains 14 training records, as in Table 1.2.1. Each training record has four features and one target or response variable, as shown in Fig. 1.2.1. The machine learning algorithm is used to predict the target variable.
- In a classification task the target variable takes a discrete value, and in the task of regression its value could be continuous.
- In a training dataset we have the value of the target variable. The relationship that exists between the features and the target variable is used by the machine for learning. The target variable here is the evaluation of the car. Classes are the target variables in the classification task. In classification systems it is assumed that the classes are of a limited number.
- Attributes or features are the individual values that, when combined with other features, make up a training example. These are usually the columns in a training or test set.
- A training dataset and a testing dataset are used to test machine learning algorithms. First the training dataset is given as input to the program. The program uses this data to learn. Next, the test set is given to the program. The program decides which class each instance of test data belongs to. The predicted output is compared with the actual output, and we can get an idea about the accuracy of the algorithm. There are best practices for using all the information in the training dataset and the test dataset.
- Assume that in the car evaluation system we have tested the program and it meets the desired level of accuracy. Knowledge representation is used to check what the machine has learned. There are many ways in which knowledge can be represented. We can use a set of rules or a probability distribution to represent the knowledge.
- Many algorithms represent the knowledge in a form which is more interpretable to humans than others. In some situations we may not want to build an expert system but are interested only in the knowledge representation that is acquired from training a machine learning algorithm.

Table 1.2.1 : Car evaluation classification based on four features

Buying_Price | Maintenance_Price | Lug_Boot | Safety | Evaluation
High         | High              | Small    | High   | Unacceptable
High         | High              | Small    | Low    | Unacceptable
Medium       | High              | Small    | High   | Acceptable
Low          | Medium            | Small    | High   | Acceptable
Low          | Low               | Big      | High   | Acceptable
Low          | Low               | Big      | Low    | Unacceptable
Medium       | Low               | Big      | Low    | Acceptable
High         | Medium            | Small    | High   | Unacceptable
High         | Low               | Big      | High   | Acceptable
Low          | Medium            | Big      | High   | Acceptable
High         | Medium            | Big      | Low    | Acceptable
Medium       | Medium            | Small    | Low    | Acceptable
Medium       | High              | Big      | High   | Acceptable
Low          | Medium            | Small    | Low    | Unacceptable

Example record : Buying_Price = Low, Maintenance_Price = Low, Lug_Boot = Big, Safety = High  ->  Evaluation = Acceptable
(The first four columns are the Features; the Evaluation column is the Target Variable.)

Fig. 1.2.1 : Features and target variable identified
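As a hedged illustration (not the authors' code), the categorical data of Table 1.2.1 can be integer-encoded and given to a classifier; the encoding scheme and the choice of a decision tree are assumptions made only for this sketch.

# Sketch: training a classifier on a few rows of Table 1.2.1.
from sklearn.tree import DecisionTreeClassifier

price = {"Low": 0, "Medium": 1, "High": 2}     # Buying_Price / Maintenance_Price
size  = {"Small": 0, "Big": 1}                 # Lug_Boot
level = {"Low": 0, "Medium": 1, "High": 2}     # Safety

rows = [  # (Buying_Price, Maintenance_Price, Lug_Boot, Safety, Evaluation)
    ("High",   "High",   "Small", "High", "Unacceptable"),
    ("High",   "High",   "Small", "Low",  "Unacceptable"),
    ("Medium", "High",   "Small", "High", "Acceptable"),
    ("Low",    "Medium", "Small", "High", "Acceptable"),
    # ... remaining rows of Table 1.2.1 ...
]

X = [[price[b], price[m], size[l], level[s]] for b, m, l, s, _ in rows]
y = [row[-1] for row in rows]

clf = DecisionTreeClassifier().fit(X, y)

# Predict the example of Fig. 1.2.1 : (Low, Low, Big, High)
print(clf.predict([[price["Low"], price["Low"], size["Big"], level["High"]]]))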

1.3  TYPES OF MACHINE LEARNING

Some of the main types of machine learning are :

1. Supervised Learning
- In this type of learning we use data which comprises an input and a corresponding output. For every instance of data we have an input 'X' and a corresponding output 'Y'. From this, the ML system will build a model so that, given a new observation 'X', it will try to find out what the corresponding 'Y' is.
- In supervised learning the training data is labelled with correct answers, e.g. "spam" or "ham". The two most important types of supervised learning are classification (where the outputs are discrete labels, as in spam filtering) and regression (where the outputs are real-valued).

Fig. 1.3.1 : Supervised Learning (labelled input-output pairs are given to the learning algorithm, which predicts the output Y for a new input X)
Machine Learning (MU-Sem

2. Unsupervised Learning
- In unsupervised learning you are only given the input data 'X'; there is no label attached to the data. Given the different data points, you may want to form clusters or want to find some pattern. Two important unsupervised learning tasks are dimension reduction and clustering.

Fig. 1.3.2 : Unsupervised Learning (the learning algorithm groups the unlabelled data into clusters)
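A minimal clustering sketch follows; the two-dimensional points and the choice of the k-means algorithm are illustrative assumptions, not prescribed by the text.

# Sketch: grouping unlabelled points into clusters with k-means.
from sklearn.cluster import KMeans

# Only inputs 'X' are given; no output labels.
X = [[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],    # one natural group
     [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]]    # another natural group

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # cluster assignment discovered for each point
print(km.cluster_centers_)   # centre of each discovered cluster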

3. Reinforcement Learning
- In reinforcement learning you have an agent who is acting in an environment, and you want to find out what action the agent must take based on the reward or penalty that the agent gets. In this, an agent (e.g., a robot or a controller) seeks to learn the optimal actions to take based on the outcomes of past actions.

Fig. 1.3.3 : Reinforcement Learning (the agent takes an action; the environment returns the next state and a reward)
;
4. Semi-supervised Learning
- It is a combination of supervised and unsupervised learning. In this there is some amount of labelled training data and also a large amount of unlabelled data, and you try to come up with a learning algorithm that works even when the training data is not fully labelled.

- In a classification task, the aim is to predict the class of an instance of data. Another method in machine learning is regression.
- Regression is the prediction of a numeric value. An example of regression is to draw a best-fit line which passes through some data points in order to generalize the data points.
- Classification and regression are examples of supervised learning. These types of problems are called supervised because we are asking the algorithm what to predict.
- The exact opposite of supervised learning is a task called unsupervised learning. In unsupervised learning, the target value or label is not given for the data.
- A problem in which similar items are grouped together is called clustering. In unsupervised learning, we may also want to find statistical values that describe the data. This is called density estimation.
- Another unsupervised task is reducing data from many attributes to a small number so that we can properly visualize it in two or three dimensions.



Table 1.3.1 : Supervised learning tasks

k-Nearest Neighbours    | Linear
Naive Bayes             | Locally weighted linear
Support Vector Machines | Ridge
Decision Trees          | Lasso

Table 1.3.2 : Unsupervised learning tasks

k-Means | Expectation Maximization
DBSCAN  | Parzen Window
1.4  ISSUES IN MACHINE LEARNING

- Which algorithm do we have to select to learn general target functions from a specific training dataset? What should be the settings for a particular algorithm, so as to converge to the desired function, given sufficient training data? Which algorithms perform best for which types of problems and representations?
- How much training data is sufficient? What general bounds can be found to relate the confidence in learned hypotheses to the amount of training experience and the character of the learner's hypothesis space?
- At which time, and in what manner, is prior knowledge held by the learner used to guide the process of generalizing from examples? If we have approximately correct knowledge, will it be helpful even when it is only approximately correct?
- What is the best strategy for choosing a useful next training experience, and how does the choice of this strategy alter the complexity of the learning problem?
- What is the best approach to reduce the task of learning to one or more function approximation problems? What specific functions should the system attempt to learn? Can this process itself be automated?
- How can the learner automatically alter its representation to improve the knowledge representation and to learn the target function?
1.5  HOW TO CHOOSE THE RIGHT ALGORITHM?

With all the different algorithms available in machine learning, how can you select which one to use? First you need to focus on your goal : what are you trying to get out of this? What data do you have, or can you collect? Secondly, you have to consider the data.

1. Goal : If you are trying to predict or forecast a target value, then you need to look into supervised learning. Otherwise, you have to use unsupervised learning.
   (a) If you have chosen supervised learning, then next you need to focus on : what is your target value?
       - If the target value is discrete (e.g. Yes/No, 1/2/3, A/B/C), then use Classification.
       - If the target value is continuous, i.e. a range of numeric values (e.g. 0 to 100, -99 to 99), then use Regression.
   (b) If you have chosen unsupervised learning, then next you need to focus on : what is your aim?

mp)
Machine Learning (MU-Sem 6-Co
       - If you want to fit your data into some discrete groups, then use Clustering.
       - If you want to find a numerical estimate of how strong the fit into each group is, then use a density estimation algorithm.

2. Data : Are the features continuous or nominal? Are there missing values in the features? If yes, what is the reason for the missing values? Are there outliers in the data? All of these characteristics of your data can help you narrow the algorithm selection process.

Table 1.5.1 : Selection of Algorithm

           | Supervised Learning | Unsupervised Learning
Discrete   | Classification      | Clustering
Continuous | Regression          | Density Estimation
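Table 1.5.1 can be read as a small decision rule. The helper below is only an illustrative sketch of that logic; the function name and its parameters are assumptions, not part of the text.

# Sketch of the selection logic of Table 1.5.1 (illustrative only).
def choose_method(predict_target: bool, discrete: bool) -> str:
    """predict_target -> supervised learning; discrete vs. continuous picks the row."""
    if predict_target:                                          # supervised learning
        return "Classification" if discrete else "Regression"
    return "Clustering" if discrete else "Density estimation"  # unsupervised learning

print(choose_method(predict_target=True,  discrete=False))   # -> Regression
print(choose_method(predict_target=False, discrete=True))    # -> Clustering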

1.6  STEPS IN DEVELOPING A MACHINE LEARNING APPLICATION

1. Collection of Data
   You could collect the samples by extracting data from a website, or :
   - from an RSS feed or an API,
   - from a device, e.g. to collect wind speed measurements,
   - from publicly available data.

2. Preparation of the input data
   - Once you have the input data, you need to check whether it is in a useable format or not.
   - Some algorithms can accept target variables and features as strings; some need them to be integers.
   - Some algorithms accept features in a special format.

3. Analyse the input data
   - Look at the data you have prepared in a text editor to check that the collection and preparation steps are working properly and that you don't have a bunch of empty values.
   - You can also look at the data to find out whether you can see any patterns, or whether there is anything obvious, such as a few data points that greatly differ from the remaining set of the data.
   - Plotting the data in 1, 2 or 3 dimensions can also help. Distil multiple dimensions down to 2 or 3 so that you can visualize the data.
   - The importance of this step is that it makes you understand that you don't have any garbage values coming in.
8). : Machine Leaming (MU-Sem 6-Comp) Introduction to Machine Learning ...Page no (1 -9)
Module
5. Train the algorithm
   - Good clean data from the earlier steps is given as input to the algorithm. The algorithm extracts information or knowledge. This knowledge is mostly stored in a format that is readily useable by the machine for the next two steps.
   - In the case of unsupervised learning, the training step is not there because the target value is not present. The complete data is used in the next step.

6. Test the algorithm
   - In this step the information learned in the previous step is used. When you are checking an algorithm, you will test it to find out whether it works properly or not. In the supervised case, you have some known values that can be used to evaluate the algorithm.
   - In the case of unsupervised learning, you may have to use some other metrics to evaluate the success. In either case, if you are not satisfied, you can go back to step 4, change some things and test again.
   - Often the problem occurs in the collection or preparation of the data, and you will have to go back to step 1.

7. Use it
   In this step a real program is developed to do some task, and once again it is checked whether all the previous steps worked as you expected. You might encounter some new data and have to revisit steps 1-5.

Fig. 1.6.1 : Typical example of a Machine Learning Application (Training phase : labelled input goes through a feature extractor and the features train the machine learning model. Testing phase : features of a new input are given to the trained model to obtain the output.)
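The steps above (see Fig. 1.6.1) map naturally onto a short script. This is a generic sketch with an assumed file name (data.csv) and assumed scikit-learn calls, not the authors' implementation.

# Sketch of the workflow: collect -> prepare -> analyse -> train -> test -> use.
import csv
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Steps 1-2. Collect and prepare the input data ('data.csv' is a hypothetical file:
# numeric feature columns followed by a class label in the last column).
with open("data.csv") as f:
    rows = [r for r in csv.reader(f) if r and all(r)]   # drop rows with empty values

# Step 3. Analyse: separate the features from the target variable.
X = [[float(v) for v in r[:-1]] for r in rows]
y = [r[-1] for r in rows]

# Steps 5-6. Train on one part of the data, test on the held-out part.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)
print("Test accuracy :", model.score(X_test, y_test))

# Step 7. Use it: the trained model can now classify new, unseen inputs.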

1.7  APPLICATIONS OF MACHINE LEARNING

1. Learning Associations
- A supermarket chain is an example of a retail application of machine learning : basket analysis, which is finding associations between products bought by customers.
- If people who buy P typically also buy Q, and there is a customer who buys Q but does not buy P, he or she is a potential P customer. Once we identify such customers, we can target them for cross-selling.
- In finding an association rule, we are interested in learning a conditional probability of the form P(Q | P), where Q is the product we would like to condition on P, which is the product (or products) which we know the customer has already purchased.

P(Milk | Bread) = 0.7

- It implies that 70% of customers who buy bread also buy milk.
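The conditional probability P(Milk | Bread) can be estimated directly from transaction data. A minimal sketch follows; the transactions are made up purely for illustration.

# Sketch: estimating the association-rule confidence P(Milk | Bread).
transactions = [
    {"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread"},
    {"milk"}, {"bread", "milk"}, {"bread", "butter"},
]

bread_baskets  = [t for t in transactions if "bread" in t]
bread_and_milk = [t for t in bread_baskets if "milk" in t]

confidence = len(bread_and_milk) / len(bread_baskets)   # P(milk | bread)
print(confidence)   # 3 of the 5 bread baskets also contain milk -> 0.6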

2. Classification
- A credit is an amount of money loaned by a financial institution. It is important for the bank to be able to predict in advance the risk associated with a loan, which is the probability that the customer will default and not pay the whole amount back.
- In credit scoring, the bank calculates the risk given the amount of credit and the information about the customer (income, savings, collaterals, profession, age, past financial history). The aim is to infer a general rule from this data, coding the association between a customer's attributes and his risk.
- The Machine Learning system fits a model to the past data to be able to calculate the risk for a new application, and then decides to accept or refuse it accordingly. For example :

      IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk

Fig. 1.7.1 : A discriminant separating low-risk and high-risk customers in the income-savings plane

- Other classification examples are optical character recognition, face recognition, medical diagnosis, speech recognition and biometrics.

Izl
ab
3. Regression
- Suppose we want to design a system that can predict the price of a flat. Let's take as inputs the area of the flat, the location, the purchase year and other information that affects the rate of the flat. The output is the price of the flat. The applications where the output is numeric are regression problems.
- Let X represent the flat features and Y the price of the flat. We can collect training data by surveying past purchase transactions, and the Machine Learning algorithm fits a function to this data to learn Y as a function of X, for suitable values of w and w0 :

      Y = w * X + w0

Fig. 1.7.2 : Regression for prediction of the price of a flat
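A least-squares fit of the line Y = w * X + w0 can be sketched in a few lines; the flat areas and prices below are invented for illustration only.

# Sketch: fitting Y = w*X + w0 to (area, price) pairs with ordinary least squares.
import numpy as np

area  = np.array([30.0, 45.0, 60.0, 75.0, 90.0])   # X : flat area (sq. m), assumed values
price = np.array([3.1, 4.4, 5.8, 7.2, 8.5])        # Y : price, assumed values

# np.polyfit of degree 1 returns the slope w and the intercept w0.
w, w0 = np.polyfit(area, price, 1)
print("w =", round(w, 3), " w0 =", round(w0, 3))

new_area = 50.0
print("Predicted price :", round(w * new_area + w0, 2))  # use the learned function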
Module

Reinforcement Learning
- There are some applications where the output of the system is a sequence of actions. In such applications the sequence of correct actions, rather than a single action, is important in order to reach the goal. An action is said to be good if it is part of a good policy. The machine learning program generates a policy and learns from past good action sequences. Such methods are called reinforcement learning methods.
- A good example of reinforcement learning is chess playing. In artificial intelligence and machine learning, one of the research areas is game playing. Games can be described easily, but at the same time they are difficult to play well. Let's take the example of chess, which has a limited number of rules, but the game is very difficult because for each state there can be a large number of possible moves.
- Another application of reinforcement learning is robot navigation. The robot can move in all possible directions at any point of time. The algorithm should reach the goal state from an initial state by learning the correct sequence of actions after conducting a number of trial runs.
- When the system has unreliable and partial sensory information, it makes reinforcement learning complex. Let's take the example of a robot with incomplete camera information : here the robot does not know its exact location.
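A common way to learn such action sequences from rewards is the Q-learning update rule. The tiny corridor world, reward values and learning-rate settings below are assumptions made only for this sketch, not the book's method.

# Sketch: tabular Q-learning on a 1-D corridor of 5 cells; reaching cell 4 gives reward 1.
import random

n_states, actions = 5, [-1, +1]          # move left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma = 0.5, 0.9                  # learning rate and discount factor (assumed)

for _ in range(500):                     # trial runs (episodes)
    s = 0
    while s != 4:
        a = random.choice(actions)                       # explore randomly
        s_next = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s_next == 4 else 0.0                  # reward only at the goal
        best_next = max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # Q-learning update
        s = s_next

# The learned greedy policy prefers moving right (towards the goal) in the non-terminal cells.
print([max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states)])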

1.8  UNIVERSITY QUESTIONS AND ANSWERS

> May 2015
Q. 1  What are the issues in Machine learning ? (Ans. : Refer section 1.4)  (5 Marks)

> May 2016
Q. 2  What are the key tasks of Machine Learning ? (Ans. : Refer sections 1.2 and 1.3)  (5 Marks)
Q. 3  Explain the steps required for selecting the right machine learning algorithm. (Ans. : Refer section 1.5)  (8 Marks)
Q. 4  Write a short note on : Machine learning applications. (Ans. : Refer section 1.7)  (10 Marks)

> May 2017
Q. 5  What is Machine learning ? Explain how supervised learning is different from unsupervised learning. (Ans. : Refer sections 1.1 and 1.3)
Q. 6  Write a short note on : Machine learning applications. (Ans. : Refer section 1.7)

> May 2019
Q. 7  What is Machine learning ? How is it different from data mining ? (Ans. : Refer section 1.1)  (5 Marks)
Q. 8  Explain the steps of developing Machine Learning applications. (Ans. : Refer section 1.6)  (10 Marks)

> Dec. 2019
Q. 9  Define Machine learning and explain with an example the importance of Machine Learning. (Ans. : Refer section 1.1)  (5 Marks)
(Ans. : Refer section 1.1)
____
Multiple Choice Questions

Q. 1.1  What is Machine Learning ?
(a) The autonomous acquisition of knowledge through the use of computer programs  (b) The selective acquisition of knowledge through the use of computer programs  (c) The autonomous acquisition of knowledge through the use of manual programs  (d) The selective acquisition of knowledge through the use of manual programs        Ans. : (a)
Explanation : Definition of machine learning - the system automatically learns from previous experience using an algorithm.

Q. 1.2  Different learning methods do not include
(a) Memorization  (b) Analogy  (c) Deduction  (d) Introduction        Ans. : (d)
Explanation : Memory (previous experience), logic and inference (deduction) are required.

Q. 1.3  Which of these is not a type of machine learning ?
(a) Supervised Learning  (b) Unsupervised Learning  (c) Reinforcement learning  (d) Semi-unsupervised Learning        Ans. : (d)
Explanation : Supervised, unsupervised, reinforcement and hybrid are types of learning.

Q. 1.4  Which of the following is not an application of learning ?
(a) Data mining  (b) World wide web  (c) Speech recognition  (d) Data manipulation        Ans. : (d)
Explanation : Data manipulation is a data preprocessing task.

Q. 1.5  Fraud Detection, Image Classification and Diagnostics are applications in
(a) Unsupervised Learning  (b) Supervised Learning  (c) Reinforcement learning  (d) Inductive learning        Ans. : (d)
Explanation : In inductive learning the learner discovers rules by observing examples.

Q. 1.6  Car price prediction is an example of
(a) Supervised Learning  (b) Unsupervised Learning  (c) Active Learning  (d) Reinforcement learning        Ans. : (a)
Explanation : In car price prediction, a car database is used to find the relationship between car attributes and car price, and using this the system is trained.

Q. 1.7  What is the function of Unsupervised Learning ?
(a) Find clusters of the data  (b) Classification  (c) Predict time series  (d) Regression        Ans. : (a)
Explanation : In unsupervised learning, grouping is done based on common attributes. The remaining three options are examples of supervised learning.

Q. 1.8  In unsupervised learning
(a) Specific output values are given  (b) Specific output values are not given  (c) No specific inputs are given  (d) Both inputs and outputs are given        Ans. : (b)
Explanation : In unsupervised learning, since a supervisor is not present, we do not have an idea about the expected output. The system learns only from the input.

Q. 1.9  Spam mail filtering is a
(a) Classification problem  (b) Clustering problem  (c) Classification and Clustering problem  (d) Time Series problem        Ans. : (a)
Explanation : In spam mail filtering, classification is done based on training examples. Training examples contain the attributes of a mail and the output (spam or not).

Q. 1.10  Data used to optimize the parameter settings of a supervised learner model is
(a) Training Data  (b) Test Data  (c) Verification Data  (d) Validation Data        Ans. : (d)
Explanation : Using validation data it is tested whether the model is giving correct and optimized output or not.
Ino (1-12
(10 Marka)
Machino Learning (MU-Sem 6-Comp) Introduction ta Machine Leaming ...Page no (1-13)
Q. 1.11  In regression the output is
(a) Continuous  (b) Discrete  (c) May be discrete or continuous  (d) Continuous and always lies in a finite range        Ans. : (a)
Explanation : In regression the output is numerical.

Q. 1.12  Machine Learning is a branch of
(a) Natural Language Processing  (b) Artificial Intelligence  (c) Java  (d) ...        Ans. : (b)
Explanation : ML is a category of AI.

Q. 1.13  A shop owner has a store that stocks a variety of fabrics. When fabric is brought to the store, various types of fabrics may be mixed together. The shop owner wants a model that will sort the fabric according to type. Which model will be efficient/accurate for this task ?
(a) Machine learning model  (b) Feature based classification technique  (c) Computer vision  (d) Fuzzy Logic        Ans. : (a)
Explanation : A machine learning model will be efficient since, once the model is trained, it can be reused in the future.

Q. 1.14  To understand the role of machine learning in public health and safety, and the cultural, societal, and environmental considerations in determining the non-functional requirements of products and processes, which type of learning can be used ?
(a) Supervised Learning  (b) Unsupervised Learning  (c) Competitive Learning  (d) Reinforcement Learning        Ans. : (a)
Explanation : Supervised learning, as in these cases we know the expected output.

Q. 1.15  Suppose you want to design a system for waste management. First the garbage is collected and sent to the main server for analysis. The main server compares the categories of garbage and an appropriate disposal method is selected. For comparison of garbage type, which machine learning method will you use ?
(a) Regression  (b) Classification  (c) Clustering  (d) Dimensionality Reduction        Ans. : (b)
Explanation : Classification - depending on the attributes the waste will be segregated, and we also know the output.

Q. 1.16  You are given reviews of a few movies marked as positive, negative or neutral. Classifying reviews of a new movie is an example of
(a) Supervised learning  (b) Unsupervised learning  (c) Semi-supervised learning  (d) Reinforcement learning        Ans. : (a)
Explanation : Supervised learning, as in this case we know the expected output.

Q. 1.17  Imagine a newly born baby starts to learn walking. It will try to find a suitable policy to learn walking after repeated falling and getting up. Specify what type of ML algorithm is best suited to do the same.
(a) Supervised  (b) Unsupervised  (c) Reinforcement  (d) Semi-supervised        Ans. : (c)
Explanation : In this case the child learns using the concept of punishment (falling down) and reward (walking properly), which are the characteristics of reinforcement learning.

Q. 1.18  An automated vehicle is an example of
(a) Supervised Learning  (b) Unsupervised Learning  (c) Active Learning  (d) Reinforcement learning        Ans. : (a)
Explanation : Supervised learning, as in this case we know the expected output.

Q. 1.19  Real-time decisions, Game AI and Learning tasks are applications in
(a) Active Learning  (b) Supervised Learning  (c) Reinforcement learning  (d) Unsupervised Learning        Ans. : (c)
Explanation : Reinforcement learning, since in these cases the system learns from punishment and reward.

Q. 1.20  In which of the following types of learning does the teacher return reward and punishment to the learner ?
(a) Active Learning  (b) Reinforcement learning  (c) Unsupervised Learning  (d) Supervised Learning        Ans. : (b)
Explanation : Definition and working of Reinforcement learning.

fle
the sample
Q. 1.21  Computational learning theory analyses the sample complexity and computational complexity of
(a) Unsupervised Learning  (b) Inductive learning  (c) Forced based learning  (d) Weak learning        Ans. : (b)
Explanation : Since in inductive learning the system learns from observed examples.

Q. 1.22  Which of the following is an example of active learning ?
(a) News recommender system  (b) Dust cleaning machine  (c) Automated vehicle  (d) Speech recognition        Ans. : (a)
Explanation : In news recommendation, items are presented to the user to learn more about their preferences (their likes or dislikes), which is where active learning comes in.

Q. 1.23  Which of the following is also called exploratory learning ?
(a) Supervised Learning  (b) Reinforcement learning  (c) Unsupervised Learning  (d) Active Learning        Ans. : (c)
Explanation : Since unsupervised learning identifies structure within data.

Q. 1.24  Supervised Learning and Unsupervised Learning both require at least one
(a) Hidden attribute  (b) Output attribute  (c) Input attribute  (d) Categorical attribute        Ans. : (a)
Explanation : Since a hidden attribute preserves the robustness of semantic attributes and inherits the discrimination ability of visual features.

Q. 1.25  What is the function of Supervised Learning ?
(a) Grouping of data  (b) Find centroid of data  (c) Find relationship between input and output  (d) Learn from punishment and reward        Ans. : (c)
Explanation : In supervised learning the model is trained based on the relationship between input and output.

Q. 1.26  Which modifies the performance element so that it makes better decisions ?
(a) Performance element  (b) Changing element  (c) Learning element  (d) Hearing element        Ans. : (c)
Explanation : The learning element learns from the sample evidences so that performance can be improved.

Q. 1.27  Why is Google very successful in Machine Learning ?
(a) They have more data than other companies  (b) They have better algorithms  (c) They have better training sets  (d) They work on low level data features        Ans. : (a)
Explanation : For better generalisation and accuracy, more training data is required.

Q. 1.28  Which of the following is not a Machine learning discipline ?
(a) Information theory  (b) Neurostatistics  (c) Optimization  (d) Physics        Ans. : (b)
Explanation : Game theory, control theory, operations research, information theory, optimization, swarm intelligence and genetic algorithms are disciplines of ML.

Q. 1.29  What are the three essential components of a learning system ?
(a) Model, gradient descent, learning algorithm  (b) Error function, model, learning algorithm  (c) Accuracy, Sensitivity, Specificity  (d) Model, error function, cost function        Ans. : (b)
Explanation : The learning algorithm trains the model using the error function.

Q. 1.30  You are reviewing papers for the World's Fanciest Machine Learning Conference, and you see submissions with the following claims. Which ones would you consider accepting ?
(a) My method achieves a training error lower than all previous methods.
(b) My method achieves a test error lower than all previous methods.
(c) My method achieves a test error lower than all previous methods (with the regularization parameter chosen to minimize the cross-validation error).
(d) My method achieves a cross-validation error lower than all previous methods.        Ans. : (c)
Explanation : A test error lower than all previous methods, when the regularization parameter is chosen so as to minimize the cross-validation error.

Q. 1.31  What is true about Machine Learning ?
(a) Machine Learning (ML) is a field of computer science
(b) ML is a type of artificial intelligence that extracts patterns out of raw data by using an algorithm or method
Machine Learning (MU-Sem 6-Comp) Introduction to Machine Learning ...Page no (1-15)


(c) The main focus of ML is to allow computer systems to learn from experience without being explicitly programmed or human intervention
(d) All of the above        Ans. : (d)
Explanation : All the statements are true about Machine Learning.

Q. 1.32  ML is a field of AI consisting of learning algorithms that
(a) Improve their performance  (b) At executing some task  (c) Over time with experience  (d) All of the above        Ans. : (d)
Explanation : ML is a field of AI consisting of learning algorithms that improve their performance (P), at executing some task (T), over time with experience (E).

Q. 1.33  Which of the following are ML methods ?
(a) Based on human supervision  (b) Supervised Learning  (c) Semi-reinforcement Learning  (d) All of the above        Ans. : (a)
Explanation : The various ML methods based on some broad categories are : based on human supervision, Unsupervised Learning, Semi-supervised Learning and Reinforcement Learning.

Q. 1.34  Which of the following is NOT supervised learning ?
(a) PCA  (b) Decision Tree  (c) Linear Regression  (d) Naive Bayesian        Ans. : (a)
Explanation : Principal Component Analysis (PCA) is not a predictive analysis tool; it is a data pre-processing tool. It helps in picking out the most relevant linear combination of variables and using them in our predictive model. PCA is a technique for reducing the dimensionality of large datasets, increasing interpretability while at the same time minimizing information loss.

Q. 1.35  Which of the following is a good test dataset characteristic ?
(a) Large enough to yield meaningful results  (b) Is representative of the dataset as a whole  (c) Both (a) and (b)  (d) None of the above        Ans. : (c)
Explanation : For better results, more records as well as meaningful records are required.

Q. 1.36  How do you handle missing or corrupted data in a dataset ?
(a) Drop missing rows or columns  (b) Replace missing values with mean/median/mode  (c) Assign a unique category to missing values  (d) All of the above        Ans. : (d)
Explanation : All of the above methods are used to handle missing data.

Q. 1.37  When performing regression or classification, which of the following is the correct way to preprocess the data ?
(a) Normalize the data -> PCA -> training  (b) PCA -> normalize PCA output -> training  (c) Normalize the data -> PCA -> normalize PCA output -> training  (d) None of the above        Ans. : (a)
Explanation : First preprocessing is done, then data reduction is done, and finally training is performed.

Q. 1.38  What is unsupervised learning ?
(a) Features of the group are explicitly stated  (b) The number of groups may be known  (c) Neither the features nor the number of groups is known  (d) None of the mentioned        Ans. : (c)
Explanation : Basic definition of unsupervised learning.

Q. 1.39  What is the main point of difference between human and machine intelligence ?
(a) Humans perceive everything as a pattern while machines perceive it merely as data  (b) Humans have emotions  (c) Humans have more IQ and intellect  (d) Humans have sense organs        Ans. : (a)
Explanation : Humans perceive everything as a pattern, while a machine (say a computer) is dumb and everything is just data for it.

Q. 1.40  Choose the options that are correct regarding machine learning (ML) and artificial intelligence (AI).
(a) ML is an alternate way of programming intelligent machines.  (b) All options are correct.  (c) ML is a set of techniques that turns a dataset into a software.  (d) AI is a software that can emulate the human mind.        Ans. : (b)
Explanation : Since all three options are correct.
Q. 1.41  Which of the following sentences is FALSE regarding regression ?
(a) It relates inputs to outputs.  (b) It is used for prediction.  (c) It may be used for interpretation.  (d) It discovers causal relationships.        Ans. : (d)
Explanation : Regression methods do not have the ability to find causal relationships.

Q. 1.42  Which of the following is/are one of the important step(s) to pre-process the text in NLP based projects ?
1. Stemming   2. Stop word removal   3. Object standardization
(a) 1 and 2  (b) 1 and 3  (c) 2 and 3  (d) 1, 2 and 3        Ans. : (d)
Explanation : Stemming is a rudimentary rule-based process of stripping the suffixes ("ing", "ly", "es", "s" etc.) from a word. Stop words are those words which have no relevance to the context of the data, for example is/am/are. Object standardization is also one of the good ways to pre-process the text.

Q. 1.43  The factors that affect the performance of a learner system do not include
(a) Representation scheme used  (b) Training scenario  (c) Type of feedback  (d) Good data structures        Ans. : (d)
Explanation : Factors that affect the performance of a learner system do not include good data structures.

Q. 1.44  The model being trained with the data in one single batch is known as
(a) Batch learning  (b) Offline learning  (c) Both (a) and (b)  (d) None of the above        Ans. : (c)
Explanation : In both methods the data is trained in a single batch.

Q. 1.45  In general, to have a well-defined learning problem, we must identify which of the following ?
(a) The class of tasks  (b) The measure of performance to be improved  (c) The source of experience  (d) All of the above        Ans. : (d)
Explanation : Since all components are necessary to define a learning problem.

Q. 1.46  Successful applications of ML include
(a) Learning to recognize spoken words  (b) Learning to drive an autonomous vehicle  (c) Learning to classify new astronomical structures  (d) All of the above        Ans. : (d)
Explanation : All are applications of ML.

Q. 1.47  Designing a machine learning approach involves
(a) Choosing the type of training experience  (b) Choosing the target function to be learned  (c) Choosing a representation for the target function  (d) All of the above        Ans. : (d)
Explanation : Since all these actions are necessary to design an ML system.

Q. 1.48  Real-time decisions, Game AI, Learning tasks, Skill acquisition, and Robot navigation are applications of which of the following ?
(a) Supervised Learning : Classification  (b) Reinforcement Learning  (c) Unsupervised Learning : Clustering  (d) Unsupervised Learning : Regression        Ans. : (b)
Explanation : Since these applications learn using punishment and reward.

Q. 1.49  Targeted marketing, Recommender systems, and Customer segmentation are applications in which of the following ?
(a) Supervised Learning : Classification  (b) Unsupervised Learning : Clustering  (c) Unsupervised Learning : Regression  (d) Reinforcement Learning        Ans. : (b)
Explanation : Since in these applications data is grouped together based on common characteristics.

Q. 1.50  If we want the output to be numeric and we have an idea about the expected output, which method should we use ?
(a) Classification  (b) Regression  (c) Clustering  (d) Density estimation        Ans. : (b)
Explanation : Basic definition of regression.
a
ChapterEnds...
gag

“wr
-

Chapter 2
Introduction to Neural Network

University Prescribed Syllabus

Introduction - Fundamental concept - Evolution of Neural Networks - Biological Neuron, Artificial Neural Networks, NN architecture, Activation functions, McCulloch-Pitts Model.

2.1   Introduction ............................................................................. 2-2
2.2   Biological Neuron ..................................................................... 2-2
2.3   Basic ANN Model / McCulloch-Pitts Model ............................. 2-3
2.4   Difference between Biological Neuron and Artificial Neuron ... 2-5
2.5   Types of Activation Functions / Transfer Functions ................ 2-5
2.6   Neural Network Architecture .................................................... 2-7
2.7   Types of Learning ..................................................................... 2-9
2.8   Perceptron Learning Rule ......................................................... 2-12
2.9   Single Layer Perceptron Learning ............................................ 2-16
2.10  University Questions and Answers .......................................... 2-20
      Multiple Choice Questions ........................................................ 2-21
      Chapter Ends ............................................................................. 2-26

2.1 INTRODUCTION

- An Artificial Neural Network (ANN) is inspired by the biological nervous system. The way in which a biological system such as the brain processes information is also the manner in which an ANN processes information.
- ANN is also called an information processing paradigm. It resembles the brain in two respects :
  1. A learning process is used to acquire knowledge from the environment.
  2. The acquired knowledge is stored using inter-neuron connection strengths known as synaptic weights.
- The information processing system is composed of a large number of highly interconnected processing elements (neurons) which work together to solve specific problems.
- In a biological system, learning involves adjustments to the synaptic connections that exist between the neurons; learning is carried out in an ANN in the same manner.
- ANN can be used in many applications. Pattern extraction and detection of trends is a tedious process for humans and for other computer techniques. Neural networks, with their remarkable ability to derive meaning from complicated or imprecise data, can be used to extract patterns and detect trends.
- One of the features of ANN is adaptive learning : data is given for training and the network uses this data to learn how to perform the task. ANN computations may be carried out in parallel, and special hardware devices are being designed and manufactured which take advantage of this capability.
- Other applications of ANN include Signal Processing, Pattern Recognition, Speech Recognition, Data Compression, Computer Vision and Game Playing.

2.2 BIOLOGICAL NEURON

- The neuron is the elementary nerve cell and the basic unit of the nervous system. A neuron has an information processing ability.
- Each neuron has three main regions : the cell body or soma, the axon and the dendrites. The soma contains the nucleus and processes the information. The axon is a long fibre that serves as a transmission line. The end part of the axon splits into a fine arborization that ends in small bulbs called synapses, almost touching the dendrites of the neighbouring neuron.

Fig. 2.2.1 : Biological Neuron Model

- Dendrites accept the input from the neighbouring neuron through the axon. Dendrites look like a tree structure and receive signals from other neurons. The synapse is the electro-chemical contact between the organs; they do not physically touch because they are separated by a cleft, and the signals are sent through chemical interaction. The neuron sending the signal is called the pre-synaptic cell and the neuron receiving the signal is called the post-synaptic cell.
- The electric signals that the neurons use to convey the information of the brain are all identical. The brain can determine which type of information is being received based on the path of the signal; it analyses all patterns of signals sent, and from that information it interprets the type of information received.
- There are different types of biological neurons. When neurons are classified by the processes they carry out, they are classified as unipolar neurons, bipolar neurons and multipolar neurons.

- Unipolar neurons have a single process; their dendrites and axon are located on the same stem. These neurons are found in invertebrates. Bipolar neurons have two processes; their dendrites and axon are two separate processes. Multipolar neurons are commonly found in mammals; some examples of these neurons are spinal motor neurons, pyramidal cells and Purkinje cells.
- When biological neurons are classified by function they fall into three categories. The first group is the sensory neurons; these neurons provide all the information for perception and motor coordination. The second group provides information to muscles and glands; these are called motor neurons. The last group, the interneurons, contains all other neurons and has two subclasses. One group, called relay or projection interneurons, is usually found in the brain and connects different parts of it. The other group, called local interneurons, is only used in local circuits.

2.3 BASIC ANN MODEL / McCULLOCH-PITTS MODEL

McCulloch and Pitts proposed a computational model that resembles the biological neuron. These neurons were represented as models of biological networks, turned into conceptual components for circuits that could perform computational tasks. The basic model of the artificial neuron is founded upon the functionality of the biological neuron; an artificial neuron is a mathematical function that resembles the biological neuron.
Neuron Model

- A neuron with a scalar input and no bias appears below. Fig. 2.3.1 shows a simple artificial neural net with n input neurons (X1, X2, ..., Xn) and one output neuron (Y). The interconnection weights are given by W1, W2, ..., Wn.

Fig. 2.3.1 : Artificial Neuron Model

- The output of the above model is given as
  Yi = 1  if net_i = Σj Wij * Xj >= T
     = 0  if net_i = Σj Wij * Xj < T
  where 'i' represents the output neuron and 'j' represents the input neuron.
- The scalar input X is transmitted through a connection that multiplies its strength by the scalar weight W to form the product W * X, again a scalar. The weighted input W * X is the only argument of the transfer function f, which produces the scalar output Y. The neuron may also have a scalar bias b. You can view the bias as simply being added to the product W * X. The bias is much like a weight, except that it has a constant input of 1.
- The transfer function net input n, again a scalar, is the sum of the weighted input W * X and the bias b. This sum is the argument of the transfer function f. Here f is a transfer function, typically a step function or a sigmoid function, that takes the argument n and produces the output Y. Note that W and b are both adjustable scalar parameters of the neuron.
- The central idea of neural networks is that such parameters can be adjusted so that the network exhibits some desired or interesting behaviour. Thus, you can train the network to do a particular job by adjusting the weight or bias parameters, or perhaps the network itself will adjust these parameters to achieve some desired end.
- As previously noted, the bias b is an adjustable (scalar) parameter of the neuron. It is not an input. However, the constant 1 that drives the bias is an input and must be treated as such when you consider the linear dependence of input vectors in linear filters.


Example 1 : Simulation of the NOT gate using the McCulloch-Pitts model

The truth table of the NOT gate is as follows :

Input X | Output Y
   1    |    0
   0    |    1

We assume the weight as W.
For the first row (i.e. input 1) the net value is W * X = W * 1 = W. According to the McCulloch-Pitts model, if the output is 0 then the net value must be less than the threshold : W < T.
For the second row (i.e. input 0) the net value is W * X = W * 0 = 0. According to the McCulloch-Pitts model, if the output is 1 then the net value must be greater than or equal to the threshold : 0 >= T.

Now we have two conditions :
1. W < T      2. 0 >= T

Now select values of W and T such that the above conditions are satisfied. One possible choice is T = -0.8, W = -1.

Using these values the NOT gate is represented as :

Fig. : NOT gate realised with a single McCulloch-Pitts neuron (W = -1, T = -0.8)
Example 2 : Simulation of the AND gate using the McCulloch-Pitts model

The truth table of the AND gate is as follows :

X1 | X2 | Y
 0 |  0 | 0
 0 |  1 | 0
 1 |  0 | 0
 1 |  1 | 1

We assume the weights as W1 for X1 and W2 for X2.
For the first row, the net value is (W1 * 0) + (W2 * 0) = 0. According to the McCulloch-Pitts model, if the output is 0 then the net value must be less than the threshold : 0 < T.
For the second row, the net value is (W1 * 0) + (W2 * 1) = W2. If the output is 0 then the net value must be less than the threshold : W2 < T.
For the third row, the net value is (W1 * 1) + (W2 * 0) = W1. If the output is 0 then the net value must be less than the threshold : W1 < T.

For the fourth row, the net value is (W1 * 1) + (W2 * 1) = W1 + W2. According to the McCulloch-Pitts model, if the output is 1 then the net value must be greater than or equal to the threshold : W1 + W2 >= T.

Now we have four conditions :
1. 0 < T      2. W2 < T      3. W1 < T      4. W1 + W2 >= T

Now select values of W1, W2 and T such that the above conditions are satisfied. Let us assume T = 0.8, W1 = 0.3, W2 = 0.5.

Using these values the AND gate is represented as :

Fig. : AND gate realised with a single McCulloch-Pitts neuron (W1 = 0.3, W2 = 0.5, T = 0.8)
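The threshold logic used in both examples is easy to verify programmatically. The following Python sketch is illustrative only; the function name mp_neuron and the specific NOT-gate choice T = -0.8 are our own (any W, T satisfying the derived conditions would do).

    # Minimal McCulloch-Pitts unit: fires (outputs 1) when the weighted sum reaches the threshold T.
    def mp_neuron(inputs, weights, T):
        net = sum(w * x for w, x in zip(weights, inputs))
        return 1 if net >= T else 0

    # NOT gate with W = -1, T = -0.8 (one choice satisfying W < T and 0 >= T)
    for x in (0, 1):
        print("NOT", x, "->", mp_neuron([x], [-1], -0.8))

    # AND gate with W1 = 0.3, W2 = 0.5, T = 0.8 (satisfies 0 < T, W1 < T, W2 < T, W1 + W2 >= T)
    for x1 in (0, 1):
        for x2 in (0, 1):
            print("AND", x1, x2, "->", mp_neuron([x1, x2], [0.3, 0.5], 0.8))

Running the sketch reproduces the two truth tables, confirming that the chosen weights and thresholds satisfy all the inequalities derived above.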

2.4 DIFFERENCE BETWEEN BIOLOGICAL NEURON AND ARTIFICIAL NEURON

Sr. No. | Points of Difference              | Biological NN                                   | Artificial NN
1       | Processing elements               | 10^14 synapses                                  | 10^8 transistors
2       | Speed                             | Slow                                            | Fast
3       | Processing                        | Parallel execution                              | One by one
4       | Size and complexity of operation  | More (highly complex)                           | Less; complex operations are difficult to implement
5       | Fault tolerance                   | Exists                                          | Does not exist
6       | Storage                           | If new data is added, old data is not erased    | Old data may be erased (overwritten)
7       | Control mechanism                 | Every neuron acts independently                 | CPU
2.5 TYPES OF ACTIVATION FUNCTIONS / TRANSFER FUNCTIONS

The activation function f(x) is used to give the output of a neuron in terms of the local field x (the net value). The various activation functions are as follows.

1. Linear
Output = net
The linear activation function gives an output equal to the input (net) value. The MATLAB toolbox has a function, purelin, to realise this transfer function : a = purelin(n). Neurons of this type are used as linear approximators in linear filters.

2. Hard limit / Unipolar binary
Output = 0   if net < 0
       = 1   if net >= 0
The hard-limit transfer function limits the output of the neuron to either 0, if the net input argument n is less than 0, or 1, if n is greater than or equal to 0 : a = hardlim(n). This function is used in Perceptrons, to create neurons that make classification decisions.

3. Symmetrical hard limit / Bipolar binary
Output = -1  if net < 0
       =  1  if net >= 0
The symmetrical hard-limit transfer function limits the output of the neuron to either -1, if the net input argument n is less than 0, or 1, if n is greater than or equal to 0 : a = hardlims(n). This function is also used in Perceptrons, to create neurons that make classification decisions.

4. Saturating linear
Output = 0    if net < 0
       = net  if 0 <= net <= 1
       = 1    if net > 1
a = satlin(n)

5. Symmetrical saturating linear
Output = -1   if net < -1
       = net  if -1 <= net <= 1
       =  1   if net > 1
a = satlins(n)
The symbol in the square box to the right of each transfer function graph represents the associated transfer function. These icons replace the general f in the boxes of network diagrams to show the particular transfer function being used.

6. Unipolar continuous (log-sigmoid)
Output = 1 / (1 + exp(-λ * net))
The sigmoid transfer function takes an input, which may have any value between plus and minus infinity, and squashes the output into the range 0 to 1 : a = logsig(n). This transfer function is commonly used in backpropagation networks, in part because it is differentiable.

7. Tansig / Bipolar continuous
Output = 2 / (1 + exp(-λ * net)) - 1
The tan-sigmoid transfer function takes an input, which may have any value between plus and minus infinity, and squashes the output into the range -1 to 1 : a = tansig(n).
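For reference, the transfer functions listed above can be written compactly in Python. This is a sketch using NumPy; the Python function names simply mirror the MATLAB names quoted in the text and are not part of any library.

    import numpy as np

    def purelin(n):   return n                                    # linear
    def hardlim(n):   return np.where(n >= 0, 1, 0)               # unipolar binary
    def hardlims(n):  return np.where(n >= 0, 1, -1)              # bipolar binary
    def satlin(n):    return np.clip(n, 0, 1)                     # saturating linear
    def satlins(n):   return np.clip(n, -1, 1)                    # symmetrical saturating linear
    def logsig(n, lam=1.0):  return 1.0 / (1.0 + np.exp(-lam * n))        # unipolar continuous
    def tansig(n, lam=1.0):  return 2.0 / (1.0 + np.exp(-lam * n)) - 1.0  # bipolar continuous

    n = np.array([-1.8, -0.5, 0.0, 0.5, 1.8])
    print(logsig(n))   # values squashed into the range (0, 1)
    print(tansig(n))   # values squashed into the range (-1, 1)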

2.6 NEURAL NETWORK ARCHITECTURE

Fig. 2.6.1 : Network architecture (Feed forward networks : Single Layer Perceptron Network, Multi Layer Perceptron Network, Radial Basis Function Network; Feedback / Recurrent networks : Competitive Network, Self Organizing Map Network, Hopfield Network)

1. Feed forward Network

In this type of network the neurons are connected in a forward direction. A connection is allowed from layer i to layer j if and only if i < j.

The input vector, output vector and the weight matrix of the network are represented as follows :

Input vector,  X = [X1, X2, X3, ..., Xn]
Output vector, Y = [Y1, Y2, Y3, ..., Ym]

Weight matrix, W = [ W11 W12 ... W1n
                     W21 W22 ... W2n
                     ...
                     Wm1 Wm2 ... Wmn ]

Fig. 2.6.2 : Single Layer Feed forward Network

The net input to the j-th neuron is calculated using the following equation :

net_j = Σi Wji * Xi      for j = 1 to m and i = 1 to n

Once the net input is calculated, the output of the j-th neuron is calculated by applying the activation function to the net input as follows :

Yj = f(net_j)
1.1 Single Layer Perceptron Network

Fig. 2.6.2 represents the single layer Perceptron network. In this type of network only the input and output layers are present. It consists of a single layer, where the inputs are directly connected to the outputs via a series of weights. The sum of the products of the weights and the inputs is calculated in each neuron node; if the value is above some threshold (generally 0) the neuron fires and gives the activated output (generally 1), otherwise it gives the inhibited value (generally -1).
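The net-input and output equations of this layer translate directly into a couple of lines of NumPy. The sketch below uses illustrative weights and inputs of our own choosing, not values from the text.

    import numpy as np

    def single_layer_forward(W, X, f):
        # W has one row per output neuron and one column per input: net_j = sum_i W[j, i] * X[i]
        net = W @ X
        return f(net)

    hardlims = lambda n: np.where(n >= 0, 1, -1)   # fire (+1) at or above the threshold 0, else -1

    W = np.array([[0.5, -0.2, 0.1],                # weights of output neuron 1 (illustrative)
                  [0.3,  0.8, -0.4]])              # weights of output neuron 2 (illustrative)
    X = np.array([1.0, 0.5, -1.0])

    print(single_layer_forward(W, X, hardlims))    # one output per output neuron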

1.2 Multi Layer Perceptron Network

This type of network, besides having the input and output layers, also has one or more hidden layers in between the input and output layers.

Fig. 2.6.3 : Multi Layer Perceptron Network (input layer, hidden layer 1, hidden layer 2, output layer)

1.3 Radial Basis Function Network

In this type of network a single hidden layer is present.

Fig. 2.6.4 : Radial Basis Function (radially connected) neural network (input layer, hidden layer, output layer)
2. Feedback / Recurrent Network

Fig. 2.6.5 : Feedback / Recurrent Network

2.1 Competitive Networks
In this type of network the neurons of the output layer compete between themselves to find the maximum output.

2.2 Self Organising Map (SOM)
The input neurons activate the closest output neuron.

2.3 Hopfield Network
Each neuron is connected to every other neuron but not back to itself.

2.7 TYPES OF LEARNING

Learning in biological systems involves adjustments to the synaptic connections that exist between the neurons. This is true of ANNs as well; learning in a neural network means setting or updating the weights.

1. Supervised Learning
In this type of learning, when an input is applied the supervisor provides a desired response. The difference between the actual response (O) and the desired response (d), called the error measure, is used to correct the network parameters.
Fig. 2.7.1 : Supervised Learning

2. Unsupervised Learning
In this type of learning a supervisor is not present, so there is no idea or guess of the output. The network modifies its weights based on patterns of the input and/or output.
Fig. 2.7.2 : Unsupervised Learning

3. Hybrid Learning
In this type a combination of supervised and unsupervised learning is used.

4. Competitive Learning
In this type the output neurons compete between themselves. The neuron having the maximum response is declared the winner neuron, and only the weights of the winner neuron are modified; the others remain unchanged.

Example 2.7.1 : A single layer NN is to be designed with 6 inputs and 2 outputs. The outputs are to be limited to, and continuous over, the range 0 to 1. Answer the following questions.
1. How many neurons are required?
2. What are the dimensions of the weight matrix?
3. What kind of transfer function should be used?

Solution :
1. For each input and each output one neuron is required, hence a total of 8 neurons are required.
2. In the weight matrix the number of rows represents the output neurons and the number of columns represents the input neurons. The first row represents all the incoming weights of the first output neuron and the second row represents all the incoming weights of the second output neuron. Hence the matrix dimension is 2 x 6 :

W = [ W11 W12 W13 W14 W15 W16
      W21 W22 W23 W24 W25 W26 ]

3. The outputs are to be limited to and continuous over the range 0 to 1, hence we may use the unipolar continuous function :

f(net) = 1 / (1 + exp(-λ * net))
Example 2.7.2 : Given a 2-input neuron with the following parameters : b = 1.2, W = [3, 2], X = [-5, 6]. Calculate the neuron's output for the following transfer functions.
1. Hard limit   2. Symmetrical hard limit   3. Linear   4. Saturating linear   5. Symmetrical saturating linear   6. Unipolar continuous   7. Bipolar continuous

Solution :
First we calculate the net value, which is normally net = Σj Wj * Xj. But in this particular problem a bias value is given. Bias is a value which is added to the net to improve the performance of the network, so we use the following formula :

net = Σj Wj * Xj + b = (-5 * 3) + (6 * 2) + 1.2 = -1.8

Now we calculate the final output for the given activation functions.
1. Hard limit
Output = 0 if net < 0; = 1 if net >= 0.    Hence, Y = 0

2. Symmetrical hard limit
Output = -1 if net < 0; = 1 if net >= 0.    Hence, Y = -1

3. Linear
Output = net.    Hence, Y = -1.8

4. Saturating linear
Output = 0 if net < 0; = net if 0 <= net <= 1; = 1 if net > 1.    Hence, Y = 0

5. Symmetrical saturating linear
Output = -1 if net < -1; = net if -1 <= net <= 1; = 1 if net > 1.    Hence, Y = -1

6. Unipolar continuous
Output = 1 / (1 + exp(-λ * net))
We assume the value of λ equal to 1 (if the value of λ is not given, assume the standard value 1).    Hence, Y = 0.1418

7. Bipolar continuous
Output = 2 / (1 + exp(-λ * net)) - 1.    Hence, Y = -0.716 ≈ -0.72
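The seven outputs above can be cross-checked with a short script. This is a sketch that re-implements the transfer functions of Section 2.5 inline; λ is assumed to be 1, as in the hand calculation.

    import math

    net = (-5 * 3) + (6 * 2) + 1.2               # = -1.8

    hard_limit   = 0 if net < 0 else 1
    sym_hard     = -1 if net < 0 else 1
    linear       = net
    sat_lin      = min(max(net, 0), 1)
    sym_sat_lin  = min(max(net, -1), 1)
    unipolar     = 1 / (1 + math.exp(-net))       # lambda assumed = 1
    bipolar      = 2 / (1 + math.exp(-net)) - 1

    print(hard_limit, sym_hard, linear, sat_lin, sym_sat_lin, unipolar, bipolar)
    # expected: 0, -1, -1.8, 0, -1, ~0.142, ~-0.72 (agrees with the hand calculation up to rounding)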


Example 2.7.3 : Compute the output of the following network using the unipolar continuous activation function.

Fig. : A 2-2-1 feed forward network with inputs (0, 1), hidden biases -2.82 and -2.74, and output bias -2.86

Solution : First we calculate the net input and output of the hidden nodes (weights as read from the figure) :

H1_net = (-4.83 * 0) + (-4.83 * 1) - 2.82 = -7.65
H1_output = 1 / (1 + exp(7.65)) = 4.758 x 10^-4

H2_net = (4.63 * 0) + (4.6 * 1) - 2.74 = 1.86
H2_output = 1 / (1 + exp(-1.86)) = 0.865

Now we calculate the net input and output of the output node :

O_net = (4.758 x 10^-4 * 5.73) + (0.865 * 5.83) - 2.86 ≈ 2.186
O_output = 1 / (1 + exp(-2.186)) ≈ 0.899
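The same two-step calculation can be scripted. The sketch below hard-codes the weights and biases as read from the partially legible figure (hidden weights -4.83/-4.83 and 4.63/4.6, output weights 5.73/5.83), so treat those exact numbers as an assumption.

    import math

    sigmoid = lambda n: 1 / (1 + math.exp(-n))    # unipolar continuous, lambda = 1

    x1, x2 = 0, 1                                  # network inputs

    # hidden layer (weights and biases as read from the figure - an assumption)
    h1 = sigmoid(-4.83 * x1 - 4.83 * x2 - 2.82)    # ~ 4.76e-4
    h2 = sigmoid( 4.63 * x1 + 4.60 * x2 - 2.74)    # ~ 0.865

    # output layer
    o = sigmoid(5.73 * h1 + 5.83 * h2 - 2.86)      # ~ 0.90
    print(h1, h2, o)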

2.8 PERCEPTRON LEARNING RULE

Perceptron learning is a supervised type of learning, as the desired response is present. It is applicable only to binary types of neurons (activation functions). The learning signal is the difference between the actual output and the desired output of the neuron, and it is used to update the weights.

Fig. 2.8.1 : Perceptron learning rule

Learning signal, r = di - oi
where oi is the output of the i-th neuron and di is the desired response.

Weight increment, ΔWij = C * [di - oi] * Xj
where C is a constant (learning rate), X is the input and j = 1 to n.

Wnew = Wold + ΔWij

In Perceptron learning the weights are updated only if di ≠ oi. Let us assume :
if di = 1 and oi = -1, then ΔWij = C * [di - oi] * Xj = C * [1 - (-1)] * Xj = 2 C Xj
if di = -1 and oi = 1, then ΔWij = C * [di - oi] * Xj = C * [-1 - 1] * Xj = -2 C Xj

Hence, ΔWij = ±2 C Xj
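A one-line implementation of this update is given below. This is a sketch; the bipolar sign activation and the variable names are our own, chosen to match the examples that follow.

    import numpy as np

    def sgn(net):
        return 1 if net >= 0 else -1

    def perceptron_update(W, X, d, C):
        """Perceptron learning rule: W_new = W + C*(d - o)*X, applied only when d != o."""
        o = sgn(np.dot(W, X))
        if d != o:
            W = W + C * (d - o) * X     # (d - o) is +2 or -2, so the step is +/-2*C*X
        return W

    # first step of Example 2.8.1: gives [0.8, -0.6, 0, 0.7]
    print(perceptron_update(np.array([1., -1., 0., 0.5]),
                            np.array([1., -2., 0., -1.]), d=-1, C=0.1))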
Example 2.8.1 : The initial weight vector W1 needs to be trained using three input vectors X1, X2 and X3 and their desired responses. Find the final weight vector using the Perceptron learning rule.

W1 = [1, -1, 0, 0.5]T,  X1 = [1, -2, 0, -1]T,  X2 = [0, 1.5, -0.5, -1]T,  X3 = [-1, 1, 0.5, -1]T,  C = 0.1,  d1 = -1,  d2 = -1,  d3 = 1

Solution :
Perceptron learning is applicable only to binary functions, and in this problem the desired responses are given in 1 and -1 format, hence we solve this problem using the bipolar binary function.

Bipolar binary : output = -1 if net < 0; = 1 if net >= 0.

Step 1 : When X1 is applied
net1 = W1 · X1 = [1  -1  0  0.5] · [1, -2, 0, -1]T = 1 + 2 + 0 - 0.5 = 2.5
o1 = f(net1) = 1; as d1 ≠ o1,
W2 = W1 + C [d1 - o1] X1 = [1, -1, 0, 0.5]T + 0.1(-1 - 1)[1, -2, 0, -1]T = [0.8, -0.6, 0, 0.7]T

Step 2 : When X2 is applied
net2 = W2 · X2 = [0.8  -0.6  0  0.7] · [0, 1.5, -0.5, -1]T = -0.9 - 0.7 = -1.6
o2 = f(net2) = -1 as net < 0.
Since d2 = o2, weight updation is not required, so W3 = W2.

Step 3 : When X3 is applied
net3 = W3 · X3 = [0.8  -0.6  0  0.7] · [-1, 1, 0.5, -1]T = -0.8 - 0.6 + 0 - 0.7 = -2.1
o3 = f(net3) = -1; as d3 ≠ o3,
W4 = W3 + C [d3 - o3] X3 = [0.8, -0.6, 0, 0.7]T + 0.1(1 - (-1))[-1, 1, 0.5, -1]T = [0.6, -0.4, 0.1, 0.5]T

This is the required final weight vector.
Example 2.8.2 : Implement the Perceptron learning rule.

W1 = [0, 1, 0]T,  X1 = [2, 1, -1]T,  X2 = [0, -1, -1]T,  C = 1,  d1 = -1,  d2 = 1
Repeat the training sequence until two correct responses in a row are achieved.

Solution :
Perceptron learning is applicable only to binary functions, and in this problem the desired responses are given in 1 and -1 format, hence we solve this problem using the bipolar binary function.

Bipolar binary : output = -1 if net < 0; = 1 if net >= 0.

Step 1 : When X1 is applied
net1 = W1 · X1 = [0  1  0] · [2, 1, -1]T = 1
o1 = f(net1) = 1; as d1 ≠ o1,
W2 = W1 + C [d1 - o1] X1 = [0, 1, 0]T + 1(-1 - 1)[2, 1, -1]T = [-4, -1, 2]T

Step 2 : When X2 is applied
net2 = W2 · X2 = [-4  -1  2] · [0, -1, -1]T = 1 - 2 = -1
o2 = f(net2) = -1 as net < 0; as d2 ≠ o2,
W3 = W2 + C [d2 - o2] X2 = [-4, -1, 2]T + 1(1 - (-1))[0, -1, -1]T = [-4, -3, 0]T

In the problem it is given that we have to repeat the training until two correct responses in a row are achieved, so we again apply X1.
Step 3 : When X1 is applied
net3 = W3 · X1 = [-4  -3  0] · [2, 1, -1]T = -8 - 3 = -11
o3 = f(net3) = -1. As d1 = o3, weight updation is not required : W4 = W3.

Step 4 : When X2 is applied
net4 = W4 · X2 = [-4  -3  0] · [0, -1, -1]T = 3
o4 = f(net4) = 1. As d2 = o4, weight updation is not required : W5 = W4.

Thus we have obtained the correct response in a row two times.

Example 2.8.3 : A single neuron network using f(net) = sgn(net) has been trained using the pairs Xi, di given below. Find the initial weight vector.

W4 = [3, 2, 6, 1]T,  X1 = [1, -2, 3, -1]T,  X2 = [0, -1, -1, -1]T,  X3 = [-2, 0, -3, -1]T,  C = 1,  d1 = -1,  d2 = 1,  d3 = -1

Solution : This problem is different from the previous problems; we have to find the initial weight vector from the final weight vector by undoing the updates in reverse order.

Step 1 :
W4 = W3 + ΔW3, which can be written as W3 = W4 - ΔW3.
As we know ΔW = ±2 C X, and since d3 = -1 we take the negative sign :
ΔW3 = -2 C X3 = [4, 0, 6, 2]T
W3 = W4 - ΔW3 = [-1, 2, 0, -1]T

Step 2 :
W3 = W2 + ΔW2, which can be written as W2 = W3 - ΔW2.
Since d2 = 1 we take the positive sign :
ΔW2 = +2 C X2 = [0, -2, -2, -2]T
W2 = W3 - ΔW2 = [-1, 4, 2, 1]T

Step 3 :
W2 = W1 + ΔW1, which can be written as W1 = W2 - ΔW1.
Since d1 = -1 we take the negative sign :
ΔW1 = -2 C X1 = [-2, 4, -6, 2]T
W1 = W2 - ΔW1 = [1, 0, 8, -1]T

This is the required initial weight vector.
2.9 SINGLE LAYER PERCEPTRON LEARNING

Perceptron Architecture

One type of NN system is based on the "Perceptron". A Perceptron computes a weighted sum of its inputs; if the sum is greater than a certain threshold its output is 1, else -1.

Fig. 2.9.1 : Perceptron Architecture

The output of the neuron is 1/0 or 1/-1; thus each neuron in the network divides the input space into two regions, and it is useful to determine the boundary between these regions. Let us see an example. For a two-input Perceptron with inputs P1 and P2, the output of the network is given by

O = hardlim(W11 * P1 + W12 * P2 + b)

The input vectors for which the net input is zero determine the decision boundary :

W11 * P1 + W12 * P2 + b = 0

Let us take W11 = W12 = 1 and b = -1 and substitute these values in the above equation :

P1 + P2 - 1 = 0

To draw the decision boundary we need to find the intercepts with the P1 and P2 axes. Substituting P1 = 0 gives P2 = 1, i.e. the point (0, 1); substituting P2 = 0 gives P1 = 1, i.e. the point (1, 0).

Now, to find which region belongs to output 1, pick one point, say (2, 0), and substitute it in the following equation :

O = hardlim(W'P + b) = hardlim([1  1][2  0]T + (-1)) = hardlim(1) = 1

The decision boundary is always orthogonal to the weight vector, and the weight vector always points towards the region where the neuron output is 1.

Example 2.9.1 : Implement a Perceptron network for the AND function using the concept of decision boundary.
P1 = [0 0], t1 = 0;  P2 = [0 1], t2 = 0;  P3 = [1 0], t3 = 0;  P4 = [1 1], t4 = 1

Solution :
First we plot the given points. For the point P4 the target (desired response) is 1, which is represented as a filled circle; for the remaining points the target is 0, represented as empty circles. After plotting the points, the next step is to draw a decision boundary that divides the points into two regions according to their desired responses (i.e. output 1 and output 0).

As we know, the weight vector is orthogonal to the decision boundary and points towards the region where the neuron output is 1 (in this case towards P4(1, 1)). Hence we take W = [2 2].

To find the value of the bias, pick one point on the decision boundary, say P = [1.5 0]. Substituting these values in the equation W'P + b = 0 gives b = -3.

We test the network for the calculated values. Let us take P1 :
O = hardlim(W'P + b) = hardlim([2  2][0  0]T + (-3)) = hardlim(-3) = 0
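The chosen W = [2 2] and b = -3 can be checked against all four training points with a few lines of Python (a sketch; the loop and names are our own).

    import numpy as np

    hardlim = lambda n: 1 if n >= 0 else 0

    W, b = np.array([2, 2]), -3
    for P, t in [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]:
        o = hardlim(np.dot(W, P) + b)
        print(P, "target", t, "output", o)    # every point falls on the correct side of the boundary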
Example 2.9.2 : Solve the following classification problem with the Perceptron learning rule. Apply each input vector in order, for as many repetitions as it takes to ensure that the problem is solved. Draw a graph of the problem only after you have found a solution.

W1 = [0 0], b1 = 0;  P1 = [2 2], t1 = 0;  P2 = [1 -2], t2 = 1;  P3 = [-2 2], t3 = 0;  P4 = [-1 1], t4 = 1

Solution :
In this problem the desired responses are given in 1/0 format, so we solve it using the unipolar binary (hard limit) function.

Iteration 1

Step 1 : When P1 is applied
O1 = hardlim(W1 * P1 + b1) = hardlim(0) = 1
Error, e = t1 - O1 = 0 - 1 = -1
W2 = W1 + e * P1 = [0 0] + (-1)[2 2] = [-2 -2]
b2 = b1 + e = 0 + (-1) = -1

Step 2 : When P2 is applied
O2 = hardlim(W2 * P2 + b2) = hardlim(1) = 1
Error, e = t2 - O2 = 1 - 1 = 0
Since the error is 0, weight and bias updation is not required : W3 = W2, b3 = b2.

Step 3 : When P3 is applied
O3 = hardlim(W3 * P3 + b3) = hardlim(-1) = 0
Error, e = t3 - O3 = 0 - 0 = 0
Since the error is 0, weight and bias updation is not required : W4 = W3, b4 = b3.

Step 4 : When P4 is applied
O4 = hardlim(W4 * P4 + b4) = hardlim(-1) = 0
Error, e = t4 - O4 = 1 - 0 = 1
W5 = W4 + e * P4 = [-2 -2] + 1[-1 1] = [-3 -1]
b5 = b4 + e = -1 + 1 = 0

We have to repeat the iterations until all input vectors are correctly classified (i.e. error = 0 for all the inputs).
Iteration 2

Step 5 : When P1 is applied
O5 = hardlim(W5 * P1 + b5) = hardlim(-8) = 0
Error, e = t1 - O5 = 0 - 0 = 0
Since the error is 0, weight and bias updation is not required : W6 = W5, b6 = b5.

Step 6 : When P2 is applied
O6 = hardlim(W6 * P2 + b6) = hardlim(-1) = 0
Error, e = t2 - O6 = 1 - 0 = 1
W7 = W6 + e * P2 = [-3 -1] + 1[1 -2] = [-2 -3]
b7 = b6 + e = 0 + 1 = 1

Step 7 : When P3 is applied
O7 = hardlim(W7 * P3 + b7) = hardlim(-1) = 0
Error, e = t3 - O7 = 0
Since the error is 0, weight and bias updation is not required : W8 = W7, b8 = b7.

Step 8 : When P4 is applied
O8 = hardlim(W8 * P4 + b8) = hardlim(0) = 1
Error, e = t4 - O8 = 1 - 1 = 0
Since the error is 0, weight and bias updation is not required : W9 = W8, b9 = b8.

In this iteration e = 0 for P1, P3 and P4 but not for P2, hence we go for Iteration 3.

Iteration 3

Step 9 : When P1 is applied
O9 = hardlim(W9 * P1 + b9) = hardlim(-9) = 0
Error, e = t1 - O9 = 0
Since the error is 0, weight and bias updation is not required : W10 = W9,

b10 = b9.

Step 10 : When P2 is applied
O10 = hardlim(W10 * P2 + b10) = hardlim(5) = 1
Error, e = t2 - O10 = 1 - 1 = 0
Since the error is 0, weight and bias updation is not required : W11 = W10, b11 = b10.

Now for P2 also we are getting the error e = 0. In this iteration e = 0 for all input vectors, so we can say that we have found a solution.

The final weight vector and bias value are as follows :
W = [-2 -3],  b = 1

Now we substitute these values in the following equation to find the equation of the decision boundary :
W11 * P1 + W12 * P2 + b = 0
-2 P1 - 3 P2 + 1 = 0

To draw the decision boundary we need to find the intercepting points.
By substituting P1 = 0 we get P2 = 0.33, i.e. point 1 = (0, 0.33).
By substituting P2 = 0 we get P1 = 0.5, i.e. point 2 = (0.5, 0).

Fig. : Decision boundary for Example 2.9.2

From the diagram we can say that the decision boundary classifies the input vectors into two classes corresponding to output 1 and output 0.
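The whole iteration above can be reproduced with a small training loop. This is a sketch of the Perceptron rule used in this example; the variable names and the fixed pass limit are our own.

    import numpy as np

    hardlim = lambda n: 1 if n >= 0 else 0

    P = [np.array([2, 2]), np.array([1, -2]), np.array([-2, 2]), np.array([-1, 1])]
    t = [0, 1, 0, 1]

    W, b = np.zeros(2), 0.0
    for _ in range(10):                        # a few passes are enough for this problem
        errors = 0
        for p, target in zip(P, t):
            e = target - hardlim(np.dot(W, p) + b)
            if e != 0:
                W, b = W + e * p, b + e        # Perceptron rule: W <- W + e*P, b <- b + e
                errors += 1
        if errors == 0:                        # stop once a full pass is error-free
            break

    print(W, b)                                # converges to W = [-2, -3], b = 1, as derived above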

2.10 UNIVERSITY QUESTIONS AND ANSWERS

May 2017

Q. 1  Explain hard limit and soft limit activation functions. (Ans. : Refer section 2.5) (5 Marks)
Q. 2  Explain the McCulloch-Pitts neuron model with the help of an example. (Ans. : Refer section 2.3)
Q. 3  What is learning in neural networks? Differentiate between supervised and unsupervised learning. (Ans. : Refer section 2.7)

May 2019

Q. 4  Determine weights and threshold for the given data using the McCulloch-Pitts neuron model. Plot all data points and show the separating hyper-plane. (Ans. : Refer section 2.3) (5 Marks)

X1 | X2 | Output
 0 |  0 |  0
 0 |  1 |  1
 1 |  0 |  1
 1 |  1 |  0

NOTE : For XOR use (X1 AND (NOT X2)) OR ((NOT X1) AND X2)

Dec. 2019

Q. 5  Write a short note on the McCulloch-Pitts neuron model. (Ans. : Refer section 2.3) (5 Marks)

Multiple Choice Questions

Q. 2.1 Biological networks are superior to AI networks because of
(a) Robustness and fault tolerance   (b) Collective computation   (c) Flexibility   (d) Correctness    Ans. : (a)
Explanation : BNN are more robust and fault tolerant (information is not lost completely).

Q. 2.2 Soma in a biological neural network maps to ____ in an artificial neural network.
(a) Input   (b) Node   (c) Output   (d) Interconnections    Ans. : (b)
Explanation : A node collects all inputs for processing, the same as the soma in a BNN.

Q. 2.3 ____ are tree-like branches, responsible for receiving the information from the other neurons a neuron is connected to.
(a) Soma   (b) Axon   (c) Dendrites   (d) Synapse    Ans. : (c)
Explanation : The dendrite is the part of the neuron that accepts the input.

Q. 2.4 ____ is just like a cable through which neurons send the information.
(a) Axon   (b) Dendrites   (c) Soma   (d) Synapse    Ans. : (a)
Explanation : The axon is the part of the neuron that sends the output.

Q. 2.5 ____ is a non-recurrent network having processing units/nodes in layers, where all the nodes in a layer are connected with the nodes of the next layers.
(a) Feedback network   (b) Recurrent network   (c) Feed forward network   (d) Fully recurrent network    Ans. : (c)
Explanation : In a feed forward network connections are allowed only in the forward direction.

Q. 2.6 During the training of an ANN under ____ learning, the input vectors of similar type are combined to form clusters.
(a) Supervised   (b) Unsupervised   (c) Reinforcement   (d) Both supervised and unsupervised    Ans. : (b)
Explanation : Data is grouped together based on common characteristics in unsupervised learning.

Q. 2.7 The maximum time involved in layer calculation is in
(a) Input layer computation   (b) Output layer computation   (c) Hidden layer computation   (d) Equal effort in each layer    Ans. : (c)
Explanation : The main processing is done in the hidden layers only.

Q. 2.8 Which one is a type of linear activation function?
(a) Identity   (b) Unipolar continuous   (c) Bipolar continuous   (d) Binary    Ans. : (a)
Explanation : In the identity function the input and output are the same.
Q. 2.9 Artificial Neural Networks are used in
(a) Unsupervised learning models
(b) Supervised learning models
(c) Both unsupervised and supervised learning models
(d) Neither unsupervised nor supervised learning models    Ans. : (c)
Explanation : ANN applications are developed using both types of learning methods.

Q. 2.10 The training set of data in supervised learning includes
(a) Input   (b) Output   (c) Both input and output   (d) Neither input nor output    Ans. : (c)
Explanation : In the training data both input and output are present, which are used to learn.

Q. 2.11 ____ is the connection between the axon and other neurons' dendrites.
(a) Soma   (b) Axon   (c) Dendrites   (d) Synapse    Ans. : (d)
Explanation : The synapse is the electro-chemical contact between the organs.

Q. 2.12 Which is true for neural networks?
1) They have a set of nodes and connections.
2) Each node computes its weighted input.
3) They have the ability to learn by example.
4) The training time does not depend on the size of the network.
(a) 1, 3, 4   (b) 1, 2, 3   (c) 2, 3, 4   (d) 1, 2, 3, 4    Ans. : (b)
Explanation : The training time depends on the size of the network.

Q. 2.13 The following gate cannot be modelled with a single neuron :
(a) 3-input AND gate   (b) 3-input XOR gate   (c) NOT gate   (d) All can be easily modelled    Ans. : (b)
Explanation : To design an XOR gate we require AND, NOT and OR gates (it is not linearly separable).

Q. 2.14 Steps followed in training a perceptron are listed below. What is the correct sequence of the steps?
1. For a sample input, compute an output
2. Initialize the weights of the perceptron randomly
3. Go to the next batch of the dataset
4. If the prediction does not match the output, change the weights
(a) 1, 4, 3, 2   (b) 2, 1, 4, 3   (c) 1, 2, 3, 4   (d) 2, 3, 1, 4    Ans. : (b)
Explanation : In a perceptron the output is calculated using the input and the initialized weights; if the output is not the same as the expected output then the weights are updated and the next iteration is performed.

Q. 2.15 Processing of an ANN depends upon
1) Network topology   2) Adjustment of weights, or learning   3) Activation functions
(a) 1, 2   (b) 2, 3   (c) 1, 3   (d) 1, 2, 3    Ans. : (d)
Explanation : All the components are used for processing.

Q. 2.16 Suppose you have to design a system where you want to perform word prediction, also known as language modelling. You are to take the output from the previous state and also the input at each step to predict the next word. The inputs at each step are the words for which the next words are to be predicted. Can we use a Recurrent Neural Network for the design?
(a) Yes   (b) No    Ans. : (a)
Explanation : Yes; in an RNN the output is fed back to the input until we get the desired result.

Q. 2.17 Suppose you want to predict cyber bullying so that parents can run this system in the background, and when children are watching any video/audio/site, if it contains any offensive content then the site is blocked or the contents are blurred. According to you, which method will give the best result?
(a) Long Short Term Memory   (b) Convolution Neural Network   (c) Recurrent Neural Network   (d) Artificial Neural Network    Ans. : (a)
Explanation : LSTM, as the audio or video is to be converted into a sentence model and then given to the network.

Q. 2.18 Can you represent the following Boolean function with a single logistic threshold unit?

A | B | F(A, B)
1 | 1 |   0
0 | 0 |   0
1 | 0 |   1
0 | 1 |   0

(a) Yes   (b) No    Ans. : (a)
Explanation : Yes, you can represent this function with a single logistic threshold unit, since it is linearly separable.
Q. 2.19 Consider the neural network below. Find the appropriate weights w0, w1 and w2 to represent the AND function. Threshold function = {1 if output > 0; 0 otherwise}. x0 and x1 are the inputs and b0 = 1 is the bias.

Fig. Q. 2.19 : output = threshold(w0*x0 + w1*x1 + w2*b0)

(a) w0 = 1, w1 = 1, w2 = 1
(b) w0 = 1, w1 = 1, w2 = -1
(c) w0 = -1, w1 = -1, w2 = -1
(d) w0 = 2, w1 = -2, w2 = -1    Ans. : (b)
Explanation : For x0 = 1, x1 = 1 and b0 = 1, option (b) gives 1*1 + 1*1 + (-1) = 1 > 0, i.e. output 1, and gives 0 for the other input combinations.

Q. 2.20 Which combination of weights makes the network represent the OR function?
(a) w0 = 1, w1 = 1, w2 = 0
(b) w0 = 1, w1 = 1, w2 = 1
(c) w0 = 1, w1 = 1, w2 = -1
(d) w0 = -1, w1 = -1, w2 = -1    Ans. : (a)
Explanation : For x0 = 1, x1 = 1 and b0 = 1, option (a) gives 1*1 + 1*1 + 0 = 2 > 0, i.e. output 1, while for x0 = x1 = 0 it gives 0.

Q. 2.21 Suppose you are to design a system where you want to perform word prediction, also known as language modelling. You are to take the output from the previous state and also the input at each step to predict the next word. The inputs at each step are the words for which the next words are to be predicted. Which of the following neural networks would you use?
(a) Multi Layer Perceptron   (b) Recurrent Neural Network   (c) Convolution Neural Network   (d) Perceptron    Ans. : (b)
Explanation : Recurrent Neural Networks (RNN) are a type of neural network where the output from the previous step is fed as input to the current step.

Q. 2.22 You are given the task of predicting the price of a house given various features such as the number of rooms, area (sqft), etc. How many neurons should you have at the output?
(a) 3   (b) 2   (c) 1   (d) 4    Ans. : (c)
Explanation : The price of a house is a single value, hence one neuron is enough.

Q. 2.23 Which one of the following activation functions can be used at the output layer for the task given in Question 2.22?
(a) Sigmoid   (b) ReLU   (c) Softmax   (d) Identity    Ans. : (b)
Explanation : Since we do not have a particular range of values to predict at the output, we can use the ReLU activation function for the best interpretability of the result. ReLU(x) = max(0, x) also gives the same value for positive outputs.

Q. 2.24 Which of the following is true?
(i) Neural networks learn by example
(ii) On average, neural networks have lower computation rates than conventional computers
(iii) Neural networks imitate the way the human brain works
(a) (i) and (ii) are true   (b) (i), (ii) and (iii) are true   (c) Only (i) and (iii) are true   (d) Only (ii) is true    Ans. : (b)
Explanation : All the statements are true for ANN.

Q. 2.25 The McCulloch-Pitts model is a form of
(a) Unsupervised learning model   (b) Supervised learning model   (c) Stochastic learning model   (d) Reinforcement learning model    Ans. : (b)
Explanation : The McCulloch-Pitts model works on the basis of supervised learning.

Q. 2.26 In a neural network, which one of the following techniques is NOT useful to reduce overfitting?
(a) Dropout   (b) Regularization   (c) Batch normalization   (d) Adding more layers    Ans. : (d)
Explanation : Adding more layers does not help to reduce overfitting.

Q. 2.27 One possible neuron specification requires a minimum of how many neurons to solve the AND problem?
(a) Two neurons   (b) A single neuron   (c) Three neurons   (d) Four neurons    Ans. : (b)
Explanation : A single neuron with two inputs and one output is sufficient.
Q. 2.28 Those neurons which respond strongly to the stimuli have their weights updated in the following learning method :
(a) Competitive learning   (b) Stochastic learning   (c) Hebbian learning   (d) Gradient descent learning    Ans. : (a)
Explanation : In competitive learning only the neuron with the maximum output is considered (winner-takes-all).

Q. 2.29 We have decided to use a neural network to solve this problem. We have two choices : either to train a separate neural network for each of the diseases, or to train a single neural network with one output neuron for each disease, but with a shared hidden layer. Which method do you prefer? (There are dependencies between the diseases.)
(a) First   (b) Second   (c) Both   (d) None    Ans. : (b)
Explanation :
1. A neural network with a shared hidden layer can capture dependencies between diseases. It can be shown that in some cases, when there is a dependency between the output nodes, having a shared hidden layer can improve the accuracy.
2. If there were no dependency between the diseases (output neurons), then we would prefer to have a separate neural network for each disease.

Q. 2.30 Let's say you are using activation function X in the hidden layers of a neural network. At a particular neuron, for a given input, you get the output "-0.0001". Which of the following activation functions could X represent?
(a) ReLU   (b) tanh   (c) Sigmoid   (d) None of these    Ans. : (b)
Explanation : The function is tanh, because the output range of this function is (-1, 1).

Q. 2.31 Why do we need (artificial) neural networks?
(a) to solve tasks like machine vision and natural language processing
(b) to apply heuristic search methods to find solutions of problems
(c) to make smart, human-interactive and user friendly systems
(d) all of the mentioned    Ans. : (d)
Explanation : These are the basic aims that a neural network achieves.

Q. 2.32 What is the auto-association task in neural networks?
(a) find the relation between two consecutive inputs   (b) related to the storage and recall task   (c) predicting the future inputs   (d) none of the mentioned    Ans. : (b)
Explanation : This is the basic definition of auto-association in neural networks.

Q. 2.33 When is the cell said to be fired?
(a) if the potential of the body reaches a steady threshold value   (b) if there is an impulse reaction   (c) during the upbeat of the heart   (d) none of the mentioned    Ans. : (a)
Explanation : A cell is said to be fired if and only if the potential of the body reaches a certain steady threshold value.

Q. 2.34 The cell body of a neuron can be analogous to what mathematical operation?
(a) summing   (b) differentiator   (c) integrator   (d) none of the mentioned    Ans. : (a)
Explanation : The adding of potential (due to neural fluid) at different parts of the neuron is the reason for its firing.

Q. 2.35 What is Hebb's rule of learning?
(a) the system learns from its past mistakes
(b) the system recalls previous reference inputs and respective ideal outputs
(c) the strength of the neural connection gets modified accordingly
(d) none of the mentioned    Ans. : (c)
Explanation : The strength of a neuron to fire in the future increases if it is fired repeatedly.

Q. 2.36 What is the feature of ANNs due to which they can deal with noisy, fuzzy, inconsistent data?
(a) associative nature of networks   (b) distributive nature of networks   (c) both associative and distributive   (d) none of the mentioned    Ans. : (c)
Explanation : These are general characteristics of ANNs.

Q. 2.37 What was the name of the first model which could perform a weighted sum of inputs?
(a) McCulloch-Pitts neuron model   (b) Marvin Minsky neuron model   (c) Hopfield model of neuron   (d) None of the mentioned    Ans. : (a)
Explanation : The McCulloch-Pitts neuron model can perform a weighted sum of inputs followed by a threshold logic operation.
Q. 2.38 Who developed the first learning machine in which connection strengths could be adapted automatically?
(a) McCulloch-Pitts   (b) Marvin Minsky   (c) Hopfield   (d) none of the mentioned    Ans. : (b)
Explanation : In 1954 Marvin Minsky developed the first learning machine in which connection strengths could be adapted automatically and efficiently.

Q. 2.39 What is an activation value?
(a) weighted sum of inputs   (b) threshold value   (c) main input to the neuron   (d) none of the mentioned    Ans. : (a)
Explanation : This is the definition of the activation value.

Q. 2.40 The process of adjusting the weights is known as
(a) activation   (b) synchronization   (c) learning   (d) none of the mentioned    Ans. : (c)
Explanation : This is the basic definition of learning in neural networks.

Q. 2.41 Which of the following models has the ability to learn?
(a) Pitts model   (b) Rosenblatt perceptron model   (c) both Rosenblatt and Pitts models   (d) neither Rosenblatt nor Pitts    Ans. : (b)
Explanation : The weights are fixed in the Pitts model but adjustable in Rosenblatt's.

Q. 2.42 When both inputs are 1, what will be the output of the Pitts-model NAND gate?
(a) 0   (b) 1   (c) either 0 or 1   (d) z    Ans. : (a)
Explanation : According to the truth table of a NAND gate.

Q. 2.43 Connections across the layers in standard topologies, and among the units within a layer, can be organised
(a) in a feedforward manner   (b) in a feedback manner   (c) both feedforward and feedback   (d) either feedforward or feedback    Ans. : (d)
Explanation : Connections across the layers in standard topologies can be in a feedforward manner or in a feedback manner, but not both.

Q. 2.44 If the change in the weight vector is represented by Δwij, what does it mean?
(a) it describes the change in the weight vector for the i-th processing unit, taking the j-th input into account
(b) it describes the change in the weight vector for the j-th processing unit, taking the i-th input into account
(c) it describes the change in the weight vector for the j-th and i-th processing units
(d) none of the mentioned    Ans. : (a)
Explanation : According to the weight updation formula.

Q. 2.45 What are models in neural networks?
(a) a mathematical representation of our understanding   (b) a representation of biological neural networks   (c) both   (d) none of the mentioned    Ans. : (c)
Explanation : A model should be close to our biological neural systems, so that we can have high efficiency in machines too.

Q. 2.46 What is competitive learning?
(a) learning laws which modulate the difference between the synaptic weight and the output signal
(b) learning laws which modulate the difference between the synaptic weight and the activation value
(c) learning laws which modulate the difference between the actual output and the desired output
(d) none of the mentioned    Ans. : (a)
Explanation : Competitive learning laws modulate the difference between the synaptic weight and the output signal.

Q. 2.47 What are the advantages of neural networks over conventional computers?
(i) They have the ability to learn
(ii) They are more fault tolerant
(iii) They are more suited for real-time operation due to their high computational power
(a) (i) and (ii)   (b) (ii) and (iii)   (c) Only (i)   (d) All    Ans. : (d)
Explanation : All the statements are true for NN.

Q. 2.48 Which of the following gives non-linearity to a neural network?
(a) Gradient descent   (b) Bias   (c) ReLU activation function   (d) None    Ans. : (c)
Explanation : An activation function such as ReLU gives non-linearity to the neural network.

Q. 2.49 For a fully-connected network with one hidden layer, increasing the number of hidden units should have what effect on bias and variance?
(a) Decrease bias, increase variance
(b) Increase bias, increase variance
(c) Increase bias, decrease variance
(d) No change    Ans. : (a)
Explanation : Adding more hidden units should decrease bias and increase variance. In general, more complicated models result in lower bias but larger variance, and adding more hidden units certainly makes the model more complex.

Q. 2.50 Error back propagation uses which learning rule?
(a) Hebbian learning   (b) Perceptron learning   (c) Delta learning   (d) Competitive learning    Ans. : (c)
Explanation : The delta learning rule is used to update the weights of the hidden and output layers using the error function.

Chapter Ends...

Chapter 3
Introduction to Optimization Techniques

University Prescribed Syllabus

Derivative based optimization - Steepest Descent, Newton method.
Derivative free optimization - Random Search, Down Hill Simplex.

3.1     Introduction
3.2     Derivative Based Optimization
3.2.1   Gradient based Methods
3.2.2   Method of Steepest Descent
3.3
3.4

Machine Learning (MU-Sem 6-Com

pl 3.1 INTRODUCTION

Methods are used to minimize the scalar function of a number of variables. In Unconstrainey |
— Optimization
Want tg ;
optimization, the variables are not restricted by inequalities or equality relationships. Scalar function which we
|
minimize is called as an objective function.

The objective function is an error function for feed forward network and energy function for recurrent network, For the”
further topics we need the terms, Hessian matrix and gradient vector, so first we will see the basic concept of these two
terms.

Gradient Vector - Let’s take a scalar function E(x) of a vectorial variable x, defined as an n-elements column vectorx is |
denoted as follows,

— The gradient vector of E(X) with respect to column vector x is denoted as GE (X)

OE(X) = [dE/dX, dE/dX).......... dE/dX,]

— Hessian Matrix — For a scalar function E(x), a matrix of second derivatives called the Hessian matrix is defined as”
follows

0°E(X) = OE (X)]
dE GE | | WE
dx. dx, dx, dx, dx,

dE gE | | WE
dx,dx; dx" dx, dx,
d°E(X) =

dE dE WE
dx, dx, dx, dx, “ axe
n

fe z This Hessian matrix is often denoted by H.

- Fora objective function E (X) ofa single


variable X the condition for a minimum
at X = X* are as follows,
dE (X*)/dX =0 and d’E(X*)/dx?>0
For a objective func-tion E (X) ofa vectorial variable
x the condition fora minimum
at X = X* are as follows,
dE
aei(X*)
sust= , Fequa ires the gradient
i vector at a mininiimum of E (X) to be
r a null vector and ' @°E (X*) is posi
€ssian matrix at a minimum
of E (x) to be positive defi
tive defini :
nite
- If X* is to be minimum of o
; E (X), any infi nitely s al chang " tte : i :
requires the following : Sy small change AX at X» should result in E ( X¥* + AX ) > E(X*). i ‘
1. The gradient vector at xX*
ishes and makes the linear
vanis yor ob
term in the equation of E
. ’
i

2. The Hessian matrix (X) to zero.


at X* is positive def
inite

: (MU-New Syllabus wef


academic year 18-19) (Mg
14)
Tech-Neo Publications...A SACHIN SHAH ventule

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
Pas

Machine Leaming (MU-Sem 6-Com,

ry - — ee Spectrum . methods exists for unconstrained optimization, methods can be broadly categorized in
sage _.
terms= of the derivative informatiation that iis, used (Derivative based optimization)or is not used (Derivative free
hig, oN 4
‘y optimization).

Ney py 3.2 DERIVATIVE BASED OPTIMIZATION


M
%“h
Hayy 73. 3.2.1 Gradient based Methods
nt local optimization then it is better to use gradient based
: : se
hen the objective function is smooth and if we need efficie
- Whe e objecti ioni

lum Mp optimization method.

- — Gradient based optimization method determines the search


direction using the function’s derivative
objective
information. It is a first-order optimization algorithm.
— At the current point the step is taken in the direction

negative of the gradient of the objective function at that


Modul
gradient Jodule
UFIN is defy point to find a local minimum of a function. In ata
ascent method the step is taken in the positive directi
on of [3
the gradient to approach a local maximum of that
Fig. 3.2.1 : Gradient based method
function.
then
it in a neighborhood of a point ‘a’ ,
ded and we are able to differentiate
— When areal valued function E (X) is provi a basic
gradient of Eat ‘a’, — 2 E(a). This is
from ‘a’ in the direction of the negative
E (X) decreases fastest if one goes
nt.
concept that is used in Gradient desce

— Gt follows that, ifb =a-7 0 E(a)


er, then E(a) 2 E(b).
for 7 > Oasmali enough numb
input space efficiently. With this
E, we often resor t to an iterative algorithm to explore the
Due to the compiexity of Xp, X,, Xz such

a guess Xo for a local mini
mum of E, and considers the sequence
in mind , one start s with
observation
that
x n+l
= X,—- 1, 9 E(a),n 20

We have
E(X,) 2 E (X,) 2 E (Xp) 2 ++ « i
step size 7) |S
Note that the value of the
ows
(X,) conver ges to the desired local minimum.
- So hopefully the sequence
; allowed to change at every iteration. the plane, and that its graph has
a bow!
in the Fig. 3.2.1. Here E is assumed to be defined on at
- This process is illustr ated t. An arrow originating
alto”
that is, the region s on which the value of E is constan
contour lines,
posit shape. The curves are the
at that point.
of the negative gradient one of the stopping criteria is
, ' a point shows the direction . 41 . Pe aller shan a
at proced ures are typica lly repeat ed until on
on, the descent
pt - For minimizing the objective functi of the gradient vector 'S §
value is sufficiently small. The length
)7 satisfied such as, the objective function
g time is exceeded.
specified value or The specified computin
Venture
ons...A SACHIN SHAH
Bl Tech-Neo Publicati
18-19) (M6-14)
(MU-New Syllabus w.e.f academic year

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning (MU-Sem 6-Comp)


hog
to the contour Hine polo through that point. We
= Note that the (negative) gradient ata point ts oethapomml
gradient descent leads us to the bottom of the bowl,
that dy, to the potat where (he value of the funetlon fn inn

whet (te I We { (er val


- Gradient methods are genenl ly TEE effle lent
\
Higher order methods, such tS Newt ton! s method, are only re ally sultat los hen the xeeot raer : i Mt lon Is mae
4
second order Information, using. numerteat (ifferentiation,
and easily calculated, because enleuhidon of |
formation about the slope ot the functlon to dictnte a direction
computationally expensive, Gradient methods use int
search where the minimum is thought to lie.

WS 3.2.2 Method of Steepest Descent — '

— "In the method of steepest descent, the successive adjustments applied to the weight vector ware in the direction of.
steepest descent i.c., ina direction opposite of gradient vector,
Wecan write, 2 = OE (W) i
P
‘Accordingly, steepest descent method is formally i
described by, }

¥ Wnt) = W(n)-y
ge (0) | Bt

: Where 11 is positive constant and g (n) is gradient veetor evaluated at point W (1). When going from Heration n ton+t)
‘the algorithm applies the correction as, ‘
{
AW(n) = Wnt 1)- WW)
=- Ne (n) \

To show that the formulation of the steepest descent algorithm satisfies the condition for iterative descent, we use a first,
order Taylor series expansion around w(n) to approximate ¢ (W (nF 1)) as,
1
i
oe
e(W (n+ 1)) = €(W(n)) +2 (n) AW (n)
|
|
- Substituting value of AW (n) we gel,

€(W(n+1)) = €(Wim)- 12 (0) gi) =e (Wn) - 2 (0)?


Which shows that for a positive learning parameter 1 the cost function is decreased
as the algorithm progresses fon
one iteration to the next? The method of steepest descent converges to the
optimal solution slowly, i
Algorithm :
Start with arbitrary point X,, set iteration number i= 1.
i
> Step1: Find the search direction S, as S,=-Vf,=V fi (x)

Step2: ep2:Calcul
Calculate i
the optimum SS
step length A= ene and a new point as : i
. : tool

Xie = X; +A 15
0 Step3 :_ Test t optimality for new point Xi, by V(x, ) = 0
nlf met stop, otherwise repeat
step | forthe new point,

(MU-New Syllabus w.e.f academic year


18-19) (M6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

no (3-5)
Ws j Learning (MU-Sem 6-Comp)
nj tyRig te ‘i Machine—=—— SETechniques ...Page
Introduction to Optimization
.
r infy ‘ fin Example 3.2.1: Minimize f(X;, Xz) = X; — X2 + ax +2X, Xo4 x
Ca] ati, 4 Starting from the point X, = (0, 0)
th
og “itt, ‘, 1 Solution:
I
ae of
“tay
i\ a
We will calculate gradient of f as, V, = ae
ax,
re ; Take a derivative of given equation w.r.t X, and X,
i 7 the g vy, -{ 1% +2X,
. -142X, +2, (1)
Now we will calculate Hessian matrix as,
vf _ar
Ox, OX, X, 42
of at “L2 2
i eration», Ox, 21x Ox
1 . 2
Moduie
Iteration 1
[33
At X, = (0, 0) :
Cent, We use: » Step1: Find S, at X, , substitute the yalue(0, 0) in Equation 1 and take a negation of that,
-1
S; = -vEK)=[ 1 ]
b Step2: Computed, at X,

siT Sy em)
-1 1 ']
Aa == OI
7
S, HS) (1 ul 5 i]
progressesit
Hence the new point is,
-1
X, = x +4,8)=| l |

> Step3: Check optimality by substituting value of X, in Equation 1.


-1 [ 0 ]
VE(X,) =| _1 | *Lo

So X, is not optimum, go to next iteration.

Iteration 2
At X, = ¢1,-))
b Stepi: Find S, at X,, substitute the value(- 1, — 1) in Equation 1 and take a negation of that,
1
S, = -viom=| | |

A (MU-New Syllabus w.e.f academic year 18-19) (M6-14) [Bl Tech-Neo Publications...A SACHIN SHAH Venture

At we :
ee aS. e

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419 T
Machine t paring (MU-Sem ¢ 6-Comp) Introduction to Optimization Techniques ...Page no a4

& Step2: Compute Ant Ny


Te
sts;

oN
A, 3
7 Stisy
Henee the new point is,
- 0,8
Ny = Nat A 81=[ 12 ]

Equation 1
% Step ds Cheek optimality by substituting yalue of X, in
ane 0.2 |
Wiis) = [ve leLo
So X, is not optimum, go to next iteration

Iteration 3

AUX, = (-0.8, 1.2)


in Equation j and take a negation of that,
b Step 1: Find S,at X) substitute the value(- 0.8, 1.2)
-0.2
S, = _vr(x=| 02 |
d Step2: Computed, at X,
r
S, S,
= T ==|
S, HS 3

Hence the new point is,

x= %hSe[ 14]
1.
b Step3: Check optimality by substituting value of X, in Equation

craxy = [293 HE]


So X, is not optimum, go to next iteration

Iteration 4

At X,; = C114)
and take a negation of that,
» Step1: Find S, at X,, substitute the value(— 1, 1.4) in Equation 1
0.2
S, = _ vex) =[ 05 |
¥ Step 2: Compute A, at X,
Ay = 1S
Hence the new point is,
— 0.96 ]
Xs; = X+AS.=|_ 1.44

ture
[al Tech-Neo Publications...A SACHIN SHAH Ven
(MU-New Syllabus w.e.f academic year 18-19) (M6-14)

—— |
Scanned with CamScanner
To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

:
Machine Learni one
ng (MU-Sem 6-Comp) .
Introduction to Optimi
> Step3: zationoI Techni niq
Check optimality by ues ...Page no (3-7
substituting value of
X; in Equation 1
VE(X5) — 0.04 =
= ~ 0041-12 |

So X, is optimum.

%& 3.2.3. Newton’s Method

In the gradient based method


only, the gradient vector j
required.
- {n Classical Newton’s
method the descent
direction is determined by using
the second
order derivative of the available obje
ctive
function E. In Fig. 3.2.2 gradient descent
method is represented using green line and
Newton's method is represented using
red line
(A)
for minimizing a function (with small
CI
Step sizes), Newton's method uses curvature
information to take a more direct route.

- Newton's method is an iterative method for


finding roots of equations. Generally,
Newton's method
Xo
is used to find critical
points of differentiable functions, which are
the zero’s of the derivative function. Fig. 3.2.2 : Comparison of gradient descent and Newton’s method
— From an initial guess x9 a sequence x, is constructed which converges towards x* such that f'( x*) = 0. This x* is called
a Stationary point of f (.)

— The second order Taylor expansion f; (x) of function f(.) around x, (where Ax = x — x,) is

fp(X,+AX) = fp(X)=
fp (X) + F(X) AX + V2 £(X) AX 2

Attains its extremism when its derivative with respect to Ax is equal to zero, i.e. when Ax solves the linear equation :

£(X,) +£(X,)AX = 0
coefficients. Thus, provided
— Considering the right-hand side of the al bove equation as a quadratic in Ax, with constant
that f(x) is a twice-differentiable function well
approximated by its second order Taylor expansion and the initial guess
is chosen close enough to Xx,

AX =X-X,=-(f(%,)/£%))
— The sequence (X,) defined by :
Xi.) = X,— (£(X) (Xp) ) 9 1 = 0, Dyer
. ; =0.
will converge towards a root of f’, ie. X* for which f (Xx)

Venture
er ; : 14 Tech-Neo Publications..A SACHIN SHAH
- (MU-New Syllabus w.e.f academic year 18-19) (M624)

Scanned with CamScanner


— To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419 1
_ Introduction to Optinizatlon Techniques u(y
Machine Learning (MU-Sem 6-Comp)

Algorithm

Start with arbitrary polnt X), set Heratlon auniber jel


‘ wh . ng of Jaccobaln matrl
of Jaccobaln
!
mal x (Newton' sWale
sep),
®
:
Step ts Mad the search dlrectlon as Ve (X)) and Cateutate the Inverse

> Step 2: Calculate the new polnt as,


‘ lime

Nua? Xx “d, Vi

% Step Js. 9) 0
‘Test optimatlty for new polut X;,, by VE (xy,
new point
If met stop, otherwise repeat sep) 1 for the
——~"7 a
Xy—Xyh BX 2X Xyt xy
Example 3.2.2: Mininilzo ((X4, Xa)
Starting from tho polnt X, = (0, 0)

M sotution:
ot
7 | Oxy
We will calculate gradient of fas, VPS) 5p
iy


Take a derivative of given equation w.e.t X; and X)
if
og :
, | ' /_ «= |+aX +2% | wll
i - os VE = | 1 42X, 42K) 4
Now we will culculate V¥, by substituting (0, 0) in above equation

Wi, = ('.]
Now we will calculate Jaccobian matrix as,
ar art
Ox; OX) X) ry 9
J=ll = 4 . -|
or ot 2 2/1
OX) XOX;
7 ” .

sppertcct, Hl
Now we will find J
rol—
rvi—

ble
ni

Iteration 1

AUX, = (0,0)

b Step I :

ee a e
rJ—

vi—

2 ! a
ntute
- (MU-New Syllabus wef academic year 18-19) (M6-14)
s
rech
Teal eNeo Publications..A SACHIN SHAH Ve
posse

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning (MU


-Sem 6-Comp)
» Step2: Introduction to Optimizat
Check optimality by substituting value ion Techniques ---Page
of X, in Equation 1 no (3-9)
= =
¢ 2) 0

So X, is Optimum.

»_3.3__ DERIVATIVE FREE OPTIMIZA


TION
~ Derivative free optimi
zation is a subject
of mathematical optimi
information is unavailab zation. It may refer
le or methods that do to problems for which
not use derivatives derivative
@ 3.35.1 Random Search

Random search explor


es the Parameter space
find the optimal point of an objective function.
that minimizes the obj Se quentially in a Seemingly
ective function, random fashion to
Algorithm
I. Choose a start point x as
the current point. Set init
. ial bias b equal to a zero vec
2. Add a bias term b and a rand tor.
om vector dx to the Current
the new point at x +b 4 dx. point x in the input space and evaluate the Module
objective function at r 3
a
3. iff(x+b+4 dx) < f(x), Set the Current
t

point x equal to x + b + dx
6 otherwise go to next Step. and the bias b equal to 0.2 b
+ 0.4 dx, go to step
4. iff(e+b-— dx) < f(x), Set the Current point
x equal to x + b — dx and the
6 otherwise go to next Step. bias b equal to b — 0.4 dx, go
to step
5. Set the bias equal to 0.5b and 80 to step
6.
6. Stop if the maximum number of function
evaluations is reached, otherwise 80 back to
step 2 to find new point.
Initialisation

Selecta random dx

N N

Y Y
x =x+b+dx x=x+b-dx b=0.5b
b=0.2b+0.4dx b = b-0.4 dx 2

Stop

Fig. 3.3.1 : Flowchart of Random Search

ZL (MU-New Syllabus w.e.f academic year 18-19 [el Tech-Neo Publications...A SACHIN SHAH Venture
) (M6-14)

Scanned with CamScanner


B
To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
oo
Introduction to Optimization Techniques ...Page no 34
Machine Learning (MU-Sem 6-Comp)

23. 3.3.2. Down Hill Simplex Algorithm


ion. This technique uses Beomet,
The Downhill Simplex algorithm is a multi-dimensional technique of optimizat
relationships to support in finding function minimums.

- The main superiority of this tec hnique is that in this technique calculation of
the derivative of the function jg nes

required. This technique generates its pseudo-derivative by assessing enough points for each independent variable
the function for defining derivative.

The main component of this technique is a simplex.


Simplex is a geometrical object that has n + I vertices, here
n represents the number of independent variables.
: If there is a problem of a one dimensional | ~
- Example
then the simplex will contain two points and _
optimization
1-D simplex 2-D simplex 3D simplex
represent a line segment. In a two dimensional optimization,
a three pointed triangle would be used as shown in
Fig. 3.3.2 : Example of Low order Simplexes
=e

Fig. 3.3.2.
ee:

derivative calculation. If
The downhill simplex optimization technique only function assessing is required instead of
the space is N- dimensional then a polyhedron with N + 1 vertices is used as a simplex. Initial
simplex is define by
oS

each iteration four operations


4 selecting the N + 1 point. The technique updates the worst point during each iteration. In
£
are performed as reflection, expansion, one-dimensional contraction, and multiple contractions
yf

help of
The worst point for which objective function value is maximum is moved to a point which is reflected with the
remaining N points in Reflection. In expansion method tries to expand the simplex along this line if this point is beter
as compared to the best point.

In Contraction the simplex is contracted along one dimension from the maximum point if the new point is not better as
compared to the previous point. If the new point is not good as compared to the previous points then the simplex is
contracted along all dimensions toward the best point and steps down the valley. In the downhill simplex search
technique these operations are applied serially during each iteration till the method finds the optimal solution. In the
downhill simplex technique following operations are applied serially :
condition.
Arrange points with the rank and do the relabeling of N +1 points. Points are ranked according to following

F (Py, ;)>-->F(P,)>£P,)
Create initial P_ point with the help of reflection.

P. = P+ (P,— Pui)

Here P, is calculated by taking the centroid of the N best points in the vertices of the simplex.
Py, , is reptaced by P,, if following condition satisfied

f (P\) <f (P)<f (Py)


A new point P, is created with the help of expansion if following condition is satisfied
Condition: f(P,)<f(P,)

[al Tech-Neo Publications...A SACHIN SHAH Venture


(MU-New Syllabus w.e.f academic year 18-19) (M6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Macht
Machine Learning (
ing (MU-S 5
em 6-Comp) Introduction to Optimization Techniques ...Page no (3-11)
P, = P,+ B* (P,-P.)
&y
Vo Py 41 is replaced by P,, if following condition is satisfied, else Py, , is replaced by P,
t Condition : f (P,) < f (P,),
On. . ‘
H 1
r |
= Create a new point P, with the help of contraction if following condition is satisfied.

4 Condition : f (P,) > f (Px)


P, = P.+¥* (Pyii—P,)-
- Py, is replaced by P, if following condition is satisfied
Condition : f (P,) <f (Py \)s
- Contract along all dimensions toward P, if following condition is satisfied.
Condition : f (P,) > f (Py, ))-

— Assess the objective function values at the N new vertices as,

imp P, = P,+n* (P,-P,)

exe mH 3.4 UNIVERSITY QUESTIONS AND ANSWERS

lai, = May 2018 t


3 I
defines . : (10 Marks)
; 4 Q.1 Write short note on Derivative based optimization. (Ans. : Refer section 3.2)
eral,
> May 2019

seh Q.2 Describe Down Hill Simplex method. Why is it called Derivative Free method 2
(5 Marks)
tiske (Ans. : Refer sections 3.3 and 3.3.2)
2
, Starting from the point X, = (0, 0). (10 Marks)
Q.3 = Minimize f(X,, Xz) = 4X, —2 X_ + 2X, +2 X, Xo+ x,
: Refer section 3.2.2)
t bet Using steepest descent method (Perform only two iterations) (Ans.
techniques.
simple Q.4 Differentiate : Derivative Based and Derivative free optimization
(5 Marks)
ox x (Ans. : Refer sections 3.2 and 3.3)
on! = D> Dec. 2019
techniques. Explain Steepest Descent method for
Q.5 List some advantages of derivative-based optimization

on™ optimization. (Ans. : Refer sections 3.2 and 3.2.2) (10 Marks)
section 3.3.2) (5 Marks)
Q.6 Write short note on Down Hill Simplex Method. (Ans. : Refer
Q.3.2 For Gradient Decent (GD) and Stochastic Gradient
Multiple Choice Questions Decent (SGD) which of the following sentence is
correct?
Q. 3.1 What type of constraint in the feasible basic solution (a) You update a set of parameters in an iterative manner
must satisfy for simplex method ? to minimize the error function.
in your
(a) Non-Negativity (b) Negativity (b) You have to run through all the samples
Y Ans. : (a) a paramet er in each
(c) Basic (d) Common training set for a single update of
Explanation : Non negative points are uscd for simplex iteration for SGD.

method. (c) You either use the entire data or a subset of training
data to update a parameter in each iteration in GD.

Tech-Neo Publications...A SACHIN SHAH Venture


“(MU-New Syllabus w.e.f academic year 18-19) (M6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

i
Machine Learning (MU-Sem 6-Comp) Introduction to Optimization Tech :b
niques -..Page no
(d) You update a set of parameters in parallel manner to |
Q. 3.8 Convergence
3-12)
‘minimize the error function,
speed of Derivative Free Optimization t
~ Ans. : (a) method is i
Explanation : In SGD for cach iteration you choose the (a) Fast (b) Slow
batch which is generally contain the random sample of
(c) Very Fast ' () Medium ¥ Ans.:(y
data But in case of GD each iteration
contain the all of Explanation : more number Of iterations
the training observations. are required in
derivative free methods
since they work on random
Q.3.3. Which onc of the following is incorrect w.r.t. basis
Derivative
based optimization ? Q.3.9 Derivative based Optimization method uses
(a) Uses derivative information with objecti (a) Heuristic Search (b) Bes
ve function t First Search
(b) Slow convergence (c) Breadth First Search
(d) Depth First Search
(c) Follows mathematical methodology
: (a)
Y Ans.
(d) Fast Convergence Expianation =: — Heuristic
Y Ans. : () function (derivative |
Explanation : Derivative based optimization methods information) is used for optimizat
ion.
: converges very fast.
Q.3.10 This is being used for evaluation in De Tivative Based
Q.3.4 In Classical Newton’s method Optimization method
the descent direction is
determined by which method? (a) Only Objective Function
(a) First order derivative of the (b) Derivative Information
function with Objective Function
(b) Partial order derivative of (c) Only Derivative informat
the avail able objective ion i
function (d) Only Objective Information
(c) Gradient method Ans. :(b)
Explazation : Derivative of objective function is _
(d) Second order derivative calculated in this method,
of the available objective
.
function
~ Ans. : (d) Q.3.11 As the iterations are performe
Explanation : According d cost to update Parameters —
to working of the Newton in method of stcepest descent
method. method is
(a) increases * (b)
Q.3.5 decreases
In Derivativé free ‘Optimizatio (c) constant
i
n methods points are (d)- fluctuates
selected based on which criteria Y Ans. : (b)
Explanation : As number Of
(a) Minimum value iterations is progressed cost
(b)
Maximum value to update weight parameters ‘
decreases.
(c) Fitness Vaiue (d) Error value ;
¥ Ans,: (ce) Q. 3.12 Evolutionary conc
Explanation : Derivative ept is used in
free optimization methods
works on the concept of surv (a) Derivative free method
ival of the fittest. It denotes
how good the point is. (b) Derivative based meth
od
(c) Both the methods
’ When Gradient information
is used it i Ss called as which
type of optimization (d) None of the methods
¥ Ans. : (a)
(a) Derivative free optimiza Explanation : Derivative free metho
tion. ds like genetic
(b) Derivative based opti algorithm is based on evolutiona
mization ry concept.
(c) Constrained Optimiza
tion
. () Optimization
Q. 3.13. IF g(2) is the sigmoid function, then its derivative with
Y Ans. : (b) respect to z may be written in
Explanation term of g(z) as
in derivative based
derivative information optimization (@) g(z)(g(z) - 1) (b) g(z)(1 + g(2))
is used,
() B21: +82) 4) g(z\(1- (2)
’) '
Q.3.7 Which One is not a der
ivative Free optimizat Explanation : Formul laws
V Ans‘omo. (@id
(a) Genetic alg ion method? a to calculate derivativ
e of sigm Pie
orithm (b) Downhill simplex function.
(c) Simulated Ann method,
ealing
‘ (d) Gradient
Q.3.14 son ifif wewe 5° solve:
We can et multiple local optimum solutions
Descent method
Explanation ; Gra
4 linear regression problem by minimizing
dient based meth
Y Ans.: (d) the sum. s
based opti Mization od is a derivativ Squared errors using gradient descent.
e oy
method , (a) True (b) False
Explanation
v Anes
: We will not ‘ imum
get multiple optim
Solution,

: (Mi

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning (MU-Sem 6-Comp) Introduction to Optimization Techniques . ..Page no


ae

Q. 3.15 The error function most suited for gradient descent using Explanation : When the function is at a maximum, its
logistic regression is second derivative has a negative value and mot a positive
(a) The entropy function (b) The squared error value.
Pad

(c) The cross-entropy function is the function x - 6x - 2


Q. 3.20 For what value of x,
(d) The number of mistakes Y Ans. : (b)
en

minimized ?
Explanation : Mean squared error function is used. (a)0 (b) | (c)5 (d) 3 Y Ans. : (b)
4

Q. 3.16 Suppose your model is overfitting. Which of the Explanation : Probably the easiest way to solve the
following is NOT a valid way to try and reduce the problem is to recognize that the second derivative is
overfitting ? positive. So the furction is a minimum at x = 0, and the
correct answer is (b) This is also the absolute minimum
(a) Increase the amount of training data.
because the first derivative is zero only at a single point.
a

(b) Improve the optimisation algorithm being used for


Using an initial estimate of 0, using Newton's method,
error minimisation.
the first iteration would converge to the optimal solution.
(c) Decrease the model complexity.
Q. 3.21. Which of the following is incorrect ?
(d) Reduce the noise in the training data. v Ans. : (b)
(a) Direct search methods are useful when the
Explanation : Improving optimization algorithm does
optimization function is not differentiable
not help in reducing overfittng.
(b) The gradient of f(x,y) is the a vector pointing in the
Q. 3.17 To find the minimum or the maximum of a function, we direction of the steepest slope at that point.
set the gradient to zero because :___ (c) The Hessian is the Jacobian Matrix of second-order
(a) The value of the gradient at extrema of a function is partial derivatives ofa function.
always zero (d) The second derivative of the optimization function is
to determine if we have reached an optimal
(b) Depends on the type of problem used
point. ¥ Ans. : (d)
(c) Both (a) and (b)
v Ans. : (a) Explanation : The statement “The second derivative of
(d) None of the above
the optimization function is used to determine if we have
Explanation : According to working of gradient based
reached an optimal point” is incorrect. The second
optimization methods.
derivative tells us if we the optimum we have reached is
for using
Which of the following is NOT required
a maximum or minimum.
Q. 3.18
aN

Newton’s method for optimization ? Q. 3.22 An initial estimate of an optimal solution is given to be
region
(a) The lower bound for the search used in conjunction with the steepest ascent method to
(b) Twice differentiable optimization
function determine the maximum of the function. Which of the
(c) The function to be optimized following statements is correct?

(d) A good initial estimate that is


reasonably close to the (a) The function to be optimized must be differentiable.
v Ans. : (a) (b) If the initial estimate is different than the optimal
optimal Solution
bracketing solution, then the magnitude of the gradient is
Explanation : Newton’s method is not a
bracketing methods nonzero.
method but an open method. Only
for the search region. (c) As more iterations are performed, the function values
require a lower (or upper) bound
Newton’s method. of the solutions at the end of cach subsequent
Therefore a is not required for using
iteration must be increasing.
ORRECT?
Q.3.19 Which of the following statements isINC (d) All 3 statements are correct.
v Ans. : (d)
negative, then i x isa
(a) If the second derivative at i x is Explanation : All statements are correct
for steepest
maximum. ascent method.
then i x is an
(b) If the first derivative at i x is zero, of the Hessian
optimum. Q. 3.23 What are the gradient and the determinant
optimuny?
(c) Ifix is a maximum, then the second derivative at i x of the function f(x, y) = x y" at its global
(H) > 0
is positive. (a) VE = Oi + Oj and determinant
tive (H) = 0
(d) The value of the function can be positive or nega (b) VF =0i+ Oj and determinant
Y Ans. : (c) (c) Wf=li+lj and determinant
(H) < 0
as any optima.
(H) = 0 v Ans. : (a)
(d) Wf=li +1j and determinant

SHAH Venture
Tech-Neo Publications...A SACHIN

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Leaming (MU-Sem 6-Comp)


Explanation : When the global optimum is reached, Explanation : Starting
travel in any dircetion would: increase/deerease the Newton's method.
function value, therefore the magnitude of the gradient
must be 0. Atany optiniim, the hessian must be positive, Q. 3.31 Newton's method would probably require fey
therefore the correct answer Is a. iterations than the gradient method, but each ite
Tation
would be much more costly.
Q. 3.24 Determine the gradient of the function - Jy" - dy +6 vAns.)
() Tre (b) False
at point (0, 0)?
Explanation : According to property of Newton,
(a) Uf=2i- 4j (by) Vr=0i- jf
method.
(c) P= 0i + 0j dd) Vt=-4i-4j
Q. 3.32 Gradient of a continuous and differentiable
Y Ans. : (b) Function
Explanation : At point (0.0), we calculate the gradient at
: if (wv) is zero ata minimum
this point assy = A= AOH=0,
(b) is non-zero ata maximum
a. sy - 4 = - 40) = 4 = 4 which are used to (c) is zero ata saddle point

determine the gradient as Vf= Oi = 4] (d) decreases as you get closer to the minimum
Y Ans.
: (a), (0), (a
— Q.3.25 Determine
the determinant of hessian of the function
x- 2 — dy +6 at point (0. 0)? Explanation =: Gradient’ of a continuous and:
differentiable function is not a non-zero at a maximum,
@2(b 4 () O @W -S8 Ans. : (d)
Explanation : To determine the Hessian, the second Q. 3.33 Let us say that we have computed the gradient of our
partial derivatives are determined and evaluated as cost function and stored it in a vector g. What is the cost

follows ne
dt ae cr a. of one gradient descent update given the gradient?
2, ay =-4 dydx = 0. The resulting
(a) O(D) (b) O(N)
Hessian matrix and its determinant are
> (c) O(ND) (d) O(ND?) Y Ans,: (a)
H= [5 “J and determinant (H) =— 8. Explanation : According to working of gradient descent
method.
Q.3.26 In descent methods, the particular choice of search
direction does not matter so much. Q. 3.34 Which of the following statements is FALSE ?
(a) True (b) False. Y Ans,
: (b) (a) Multidimensional direct search methods are similar
Explanation : Selection of search direction makes an to one-dimensional direct search methods.
impact. (b) Enumerating all possible solutions in a search space
Q.3.27 and selecting the optimal solutions is an effective .
Im descent methods, the particular choice of line search
é

method for problems with very high dimensional


does not matter so much.
solution spaces.
(a) True (b) False Y Ans. : (a)
(c) Multidimensional direct search methods do nol
Explanation : Selection of line search does not make
require a twice differentiable function as a
any impact.
optimization function
f
Q. 3.28 When the gradient descent method is started from a point (d) Genetic Algorithms belong to the family of
near the solution, it will converge very quickly. multidimensional direct search methods.
(a) True (b) False Y Ans. : (b) : (b)
Y Ans.
Explanation : Starting point is near or far does not Explanation : Problems with very high dimension@l
matter in gradient descent method.
solution spaces are very large and therefore it is
Q. 3.29 Newton's method with step size h= | always works. computationally difficult to enumerate the search space.
(a) True (vb) False Y Ans. : (b) Q. 3.35 Which of the following statements is FALSE?
Explanation : It will not work for every
scenario,
(a) Multidimensional direct search methods require
Q. 3.30 When Newton's method is Started from upper and lower bound for their search region.
a point near the
solution, it will converge very quickly. i
(b) Coordinate :
cycling method :
relies on siingle:
(a) True (b) Fatse Y Ans. : (a) dimensional search methods to determine an opm”
solution along each coordinate direction iteratively-
(MU-New Syllabus w.e.f academic year 18-19) (M6-14
) te Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Toh, | Machine
Machine Learning
Learning | (MU-Sem 6-Comp) Introduction to Optimization Techniques ...Page no (3-15)
et hy Sac. (c) (@) Ih.the:
the optimization function is twice differentiable, () smooth (d) (a) and (b) Y Ans. : (c)
nM by multidimensional direct search methods cannot be Explanation : According to definition of cost function of
\ é used to find an optimal solution. optimization problem
\ (d) Multidimensional direct search methods are not . ;
) Pr a guaranteed to find the global optimum. Q.3.41 The feasible region for the inequality constraints with
Py ' Y Ans. :(©) respect to equality constants
| Explanation idj :
: Multidimensional direct search methods i eases
M2) tncredses () deer
qd ifr can be used with any function to find optimal (c) does notchange (d) none of the above
Ly solutions.
If the functions are twice differentiable v Ans.: (a)
there are more
computationally efficient techniques for optimization of Explanation : Standard procedure of constraints.
these functions. Q. 3.42 Which of the following is a objective function in
derivative based optimization for feed forward
/Q.3.36 If we want to find the value of the variable ‘x’ that
network ?.
tothem. maximizes a differentiable function f(x), then we
hy should: (a) Error function (b) Energy function
"hy Find (c) Cost function (d) Fitness function
of : (a) Find ‘x’
‘x from rom 4,
2 =0
=
Y Ans. : (a)
NON-2ery4
“ny a Explanation : Derivative of error function is used in
(b) Find ‘x’ from 4, = 0 and check that that second
feed forward network optimization.
puted trp derivative is positive
Q. 3.43 When the objective function is smooth and if we need
ede 4
(c) Find df = © and
et‘x’ from dx check that that second efficient local optimization then it is better to use
* BLVEn then, derivative is negative (a) Gradient based optimization method.
\(N) ' (d) Find ‘x’ from second derivative = 0 (b) Derivative based optimization method
v Ans. : (c)
(ND (ND) ! Explanation : oe (c) Both of above methods
Find ‘x’ from a. 0 and check that that
vorkinof gmt second derivative is negative.
(d) None of the above v Ans. : (a)
Explanation : Gradient based optimization method gives
\Q.3.37 Each optimization problem must have certain parameters better solution if objective function is smooth.
rents isFALE called
(b) dummy variables Q. 3.44 Gradient at a point is to the contour line
search elit (a) linear variables
none of the above going through that point.
search met (c) design variables (d)
Y Ans. : (c) (a) Parallel (b) Orthogonal
solutions
bles are requi red for every (c) None of above
al sotuticst! Explanation : Design varia (d) All above Y Ans.: (b)
vith ve Ww optimization problems. Explanation : According to property of gradient.

Q.3.38 A “s type” constraint expressed in the standard form is Q. 3.45 In Random search method, if f (x + b + dx) < f(x),
Set
' search # active at a design point if it has
semtiab (a) zero value (b) more than zero value (a) x=x+b+dx andb=0.2b+0.4 dx
(c) less than zero value (d) (a) and (c) a
(b) x=x+b-—dand
x b=b-0.4dx
v Ans. : (a)
oe
Explanation : Standard representation of constraint. (c) b=0.5b
(d) None of above Y Ans. : (a)
B: 3.39 Maximization of f(x) is equivalent to minimization of
if
Explanation : According to working of random search
yith wil (a) -f(x) (b+) 1/f() method.
iat? é (c) VIG) (d) noneoftheabove W Ans. : (a) Q. 3.46 Down hill simplex method uses which Telationship____
ent i Explanation : Standard procedure of constraints. (a) Input-output (b) Geometric
fe 3.40 ania problem oa, functions are (c) Actual output-desired output (d) None of above
: roblem 1s referred to as

4oo (a) rough a5 nonsmooth


» Bee)
fun New Syllabus w.e.f academic
year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Introd uction to Optimization Techni


ques ...
Machine Learning
following which operation is performe
_ Explanation _+ Downhill ‘simplex method is a Q. 3.48 Out of
multi- dimiensional technique that uses geometric downhill simplex search
relationship. (a) Reflection (b) Contraction

(d) All of above


Q. 3.47 Simplex in downhill simplex search method is an object (c) Expansion
_ with how many vertices Explanation :

‘aa (b) «n+l during each iteration. In each iteration four operation
‘ performed as reflection, expansion, one-dimensi
(c) n-1 | (d) n+2 v Ans.
: (b)
contraction, and multiple contractions.
Explanation : Simplex is a geometrical object that has
n + 1 vertices, here n represents the number of

independent variables:

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Chapter...
Learning with Regression
and Trees
University Prescribed, Syllabus sissies
Learning “with Regression : Linear Regression, Logistic Regression. Learning with Trees : Decision Trees,
Constructing Decision Trees using Gini Index, Classification and Regression Trees (CART).

4.1 Linear Regression


tee eee
4.1.1 Simple Linear Regression .....ssssssccsssesessseseessenecsseneestenessenenesssuunannnnnnnengygnananuanennseeneesenegssseneee
seseseeessessscesessesnesennesscecesseeeseaaeensaneenes
44 Multiple Linear REgressiOn.....-----..---+-ssesseeeeees
4.2 evample of Linear REQreSSION ..-----.:--sesssecneeeseeenstennessnnansnsnnanserensets
UExample 4.2.4 [UEP EaEs 10 Marks

4.3 Logistic Regression


4.4 Decision trees .......:ssesee
45 Constructing Decision Trees..
Using ID3
46 Example of Classification Tree ;
Gini Index
4.7 Example of Decision Tree Using
EN 15, 12 Marks
UExample 4.7.4 (UUBAU
DURE 17, 10Marks
UExample 4.7.5
UExample 4.7.6 DUE May 19, 10 Marks
Tree (CART}..---
48 Classification and Regression
49
4.10
ccousumansunmansoenensece’
neessennesssouccscuuanets gars sacauscu
socuunancanes™ cenn
eucncanccousunnenseeeee

nensnannsouennennees®
vupesnnssccesaunnssaaaago
t
penenneanancessensaerenen
auneessncone® sunnnnennececeessnanmessneneet®
aapewmneeessssncennsaeas

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
Learning (MU-Sem 6-Com Learning with Regression and Trees ...Page no (4
Machine
achine g : = P

vi 4.1 LINEAR REGRESSION


t with X
is regression. In regression set of records are presen
— One of the most important supervised learning tasks
if you want to predict Y from an unknown X this|
values and this values are used to learn a function, so that
Y, So, a function is required which predicts Y given X_:
function can be used. In regression we have to find value of
continuous in case of regression.
. ‘ Re 9g ression Models
Seat
— Here Y is called as criterion variable and X is
- |
called as predictor variable. There arc many,
Simple 7 Multiple
types of functions or modules which can be
I _ I
used for regression. Linear function is the
= :
simplest type of function. Here, X may be a
: Linear Non-Linear Linear Non-Linear|
‘ %? 5
single feature or multiple features representing
the problem.
Fig. 4.1.1 : Types of Regression Models

4.1.1 Simple Linear Regression

— Let’s see simple regression first, in this X contains a single feature. In multiple regressions, X contains more than
feature. In simple regression training records are plotted as value ofX vs. value of Y. Next task is to find a functi
that if a random unknown X value is given we can predict Y. There are different types of functions that can be
linear regression, we assume that the function is linear as shown in Fig. 4.1.2(a).
yi tty tt |

LLETi PLL 4
(a) Linear Function (b) Non-Linear Function

Fig. 4.1.2 : Types of regression functions

—In Linear regression a best fitted straight line is drawn.


which passes through the points called as the regression rs
*~* line. The line shown in Fig. 4.1.2 is the. regression line. | 6
ol. || Rearession line shows the calculated values of Y for
re Le
a. mat each possible value of X. Errors in the prediction
eee is l 4 A
a shown by the vertical tines drawn from the Points to the r+ > ae ]
Seo Fegression line. Error in prediction
is less when the point
S very near to the regression line as
shown by first and
Second point i im Fig. 4.1.3. When the point is
far from the Pi
~ “regression fine th en the error in predicti Pe
on is high, as , itot *|
shown by the third and fourth point to | oPé 1h;
in Fig. 4.1.3,

(MU-New Syllabus wef academic year 18-19) (M6-14) w Tech-Neo


- Publications...A
Ications...A SACHIN
SAC SHAHi

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419


Machine Lea rning (MU-Sem = 6-Comp) Learning with Regression and Trees ...Page no (4-3)
ae
n the¢ value sof mal ted value is the
The di ifference betwee of point and the predicted value is called as error of prediction. Predic
value of point on the line.
predicte yl
*s take an example, the are shown in Table 4.1.1. From
the
the errors of prediction (Y-Y’)
hescepredicted . values; (Y’) and
able
ne we
sy can
In say
sal that,
hat the
tl icted Y value as 3.2. The error of prediction is 0.8.
aa y tha, the second point has Y value as 4 and a pred
Table 4.1.1 : Regression data.

Sr.No. | X | ¥ | y’ | y-y’ | cy-y’?


I 2/2/22/-02] 0.04
2 4/4/32] 08 | 0.64
3 6/3]43]-1.3] 1.69
4 816154] 06 | 0.36
squared errors of prediction. This crite rion is used
The best-fit line is called as the line that will minimize the sum of the
in the last column of Table 4.1.1.
onM to draw the line in Fig, 4.1.3. The squared crrors of prediction are shown
ion Model Lo represented using the following equation,
8 _ ‘The regression line is
Yo =aX+bte
b shows the Y intercept and €
value, a represents the slope of the line,
in the above equation Y' represents the predicted
Here we assume that mean value of r ando
m error is 0, so the equation becomes,
X containsne is the random error.
y =aX+b
ask is to finde ression Parameters
Table 4.1.2 : Calcuiation of Reg
nections thatcal
xy | x’
4/4 Module
r4
16 | 16
18 | 36
48 | 64
Total 86 120

on is,
- The regression line equati
y’ =aX+b
20x15 _ 9 55
n DXY-DLXLY _ 4% 86=
120-400 4x
nD X7- (LX)
s

I x 20) = |
b = 4 (XY -ax LX)=4 (15 -0.55

Now the equation for the linc becomes,


Y = 055X+2.2
For X = 2,

Y’ = (0.55) (2) +1 =2.2

ions... SACHIN SHAH Venture


A [a] rech-neo Publicat
at i AU-New Syllabus
a w.ef academic year 18-19) (M6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

mp)
Machine Learning (MU-Sem 6-Co

For X =4,
Y’ = (0.55) (4)+ 1 =3.2

Y’ = (0.55) (6)+1=43

phe sas
= (0.55) (8) +1=5.4

Ya. 4.1.2 Multiple Linear Regression


|
|
a extension to sim
number of features. We can also say it is

cea
"In Multiple linear regressions there are two or more
linear regression.
4
]
ing equation, .
The regression line is represented using the follow
s
Yo = Oyt O Xpt Oy, Xy + = +a,X,+¢€

In the above equation Y’ is the predicted value, X, X, _. X, are the predictors, ¢ is random error and ;
Op, O,, %, 0, are regression coefficients.
i i El
ip 4.2 _EXAMBLES OF LINEAR REGRESSION
in table below:
re expenditure of an organization (in thousand) for every month is shown
Example 4.2.1: The
X (Month) 1/2/;3]4)]5

Y (Expenditure) | 12 | 19 | 29 | 37 | 45

Find regression line, Y = a X + b using least square method.


Estimate the expenditure of company in 6” month using line as a model.
vw Solution :

Sr.No. | X | Y | XY |X’
1 1/12) 12] 1
2 2] 19 | 38 | 4
3 3 | 29 | 87 | 9
4 4 | 37 | 148] 16
5 5 | 45 | 225 | 25
Total | 15 | 142 | 510 | 55
(a) The equation for the regression line is,
,
Y =aX+b

_BLXY-UXDY 5x510-15x 142 |


ndx’-(xy 5 K55-225 84
b = in (DY ~ax DX) =2 (142-84 x 15) =3.2

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications..A:SACHINSH


s ublications...A’ H

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

ing (MU-Sem 6-Com


Machine Leaming (WE P) —— __ Learning with Regression and Trees ...Page no (4-5)
Now the equation for the line becomes,
,
Y 8.4X+3.2

(b) In" month


,
Y 8.4x6+3.2

Y 53.6 Thousand
Example 4.2.2: Consider the set of data as {(- 1, - 1}, (2, 2), (3, 2)} (a)
Find the equation of regression line.
(b) Draw the scatter plot of data and regression line.
a Alen, M] Solution:
Sr.No. | X | Y | XY

wu
|e
1 -l]-1] 1

ola]
2 2 2 4
3 3 | 2] 6
IS random : Total 4 3 Il | 14

(a) The equation for the regression line is,


,
' Y =aX+b

n>,XY-DX
a = A
SO LY 3x1l-4x3 = 0.807
below: n> X?-(xy¥ 3x 14-16

ly —ax DX)= : (3 - 0.807 x 4) = -0.076


Ce

"

Now the equation for the line becomes,


Module
Y’ = 0.8 0.076
X -07 | 4
|
X - 0.076 and the given data points.
(b) Now we can plot the regression line given by equation Y’ = 0.807
Sr. No. | X y’
1 —1 | —0.883
2 2 1.54
3 3 2.34

of data
Fig, Ex. 4.2.2 : Scatter plot

fe] Tech-Neo Publications..A SACHIN SHAH Venture


MU-New Syllabus w.e.f academic year 18-19) (M6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning (MU-Sem 6-Comp) Learning with Regression and Trees ...Page no (4.
seule ewesg MU - May 17, 10 Marks
course. Use the method a
Following table shows the midterm and final exam grades obtained for students in a database pet
of a student who received 86 in the mid term exam.
squares using regression to predict the final exam grade
72150 | 81 | 74] 94] 86 | 59 | 83 | 86 | 33 | 88 | 81
Midterm exam(X)
84 | 53 | 77 | 78 | 90 | 75 | 49 | 79 | 77 | 52 | 74 | 90
Final exam(Y)

vi Solution :
Sr.No.| X | Y | XY | X’
1 72 | 34 | 6048 | 5184
2 |'50 | 53 | 2650 | 2500
3 81 | 77 | 6237 | 6561
4 74 | 78 5772 5476
5 94 | 90 | 8460 | 8836
6 86 | 75 | 6450 | 7396
7 59 | 49 | 2891 | 3481
8 83 | 79 | 6557 | 6889 :
9 | 86 | 77 | 6622 | 7396
10 | 33 | 52 | 1716 | 1089
Il 88 | 74 | 6512 | 7744 * ;
12 | 81 | 90 | 7290 | 6561 |
Total | 887 | 878 | 67205 | 69113
The equation for the regression line is,

Y =aX+b

a _ BLXV-2KEY _ 65 :
n xX? (2X)
b = 4( Dy -ax Dx) =25.12

Now the equation for the line becomes,

Y’ = 0.65X + 25.12
The final exam grade of a student who received 86 in the mid term exam,

Y = 0.65 x 86+ 25.12


Y’ = 81.02
Set -)-e ees MU - Dec. 19, 10 Marks i
Given the following data for the sales of car of an automobile company for six consecutive years. Predict the sales for n
two consecutive years. ..
Years (x) | 2013 | 2014 | 2015 | 2016 | 2017 | 2018
Sales (y) | 110 | 100 | 250 | 275 | 230 | 300
Mw Solution :

We will take t = x — 2013

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Vent

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Leaming (MU-Sem 6-Comp)


Learning with Regression and
Sr. No,
SSTree s .--Page no (4-7)
O
t Y ty | ¢
0 9 lol tmlo Te
1 | 100 | 100 Fl
2 | 2| 250 [soo | a |
3/3 | 275 | g25 | 9)
4 | 4| 230 | 920 Ta]
S__|5 | 300 | 1500 las
Total_| ts [1265 | 3845 [55
The equation for the regres
sion line is,
,
Y = at+b
a = 39

b = 113.33
Now the equation for the
line becomes,
Y’ = 394113,33
The sale of company for next
two years, X = 2019, 1=6
,
Y 39 x 6 + 113.33

Y 347.33
X= 2020,t=7
,
Y = 39x74 113.33
,
Y = 386.33

mt 4.3 LOGISTIC REGRESSION _

Suppose we have different training data which belongs to


two different classes, and we have to design a system that will
i ene yhich data is from which class. The output of this function
will be a real value and it is not Suitable for
classific
testy ation
ti method. ‘ We can use another function
eon on this linear function so that we can use the result for classification.
In logistic regression logistic function or the sigmoid can be used,
semoidfuneti
Let's first see what is the meaning of Logistic or Sigmoid function. | me:
=
In Logistic regression classifier we will take features as
the input. These features are multiplied with the vein ]
coefficients and we add the product. This is called as the !
net input that will be given to the sigmoid inet a aT

classified as eeeede
edhoie land anything below 0.5 isene
classified as 0. Fig. 4.3.1 : Logistic or Sigmoid Function

The net input to the sigmoid function is,


Z = bo +b, X) + by X2 + +b, xX,
+ 9, .
~ Inthe net input equation x : represe nts the input data or the features. When we use optiimi icient b, Classifier will
mized coefficien |
; imiza
be successful, Optimized bcan be calculated using the optimizatition concept.
o’ P

Tech-Neo Publications...A SACHIN SHAH Venture


(MU-New Syllabus w.ef academic year 18-19) (M6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Learning with Reg ression and Trees ,,,


__
Machine Learning (MU-Sem 6-Comp)

Gradient Ascent Method


in the direction of the gradient to find the
maximum point on a function,
— $n gradient ascent method we move 3
of (x, y) * oe:
ox
|
_
| -
aray
V E(x, =
P=) . a :
ay
1%
X represents the amount by which updation is applied in x direction and Of (,,a
_ On the above equation of (x, ,yya
y) direction c
in y
represents the amount by which updation applied
asc.
towards the direction of gradient incre
=. The gradient operator will always point
b = b+axVF(b)
— _ This step is repeated until we reach stopping criterion.

Pseudo code

~ Start using the logistic coefficient, all set to 0


Repeat no. of times
‘Find the gradient of complete dataset
. Update, logistic coefficient = logistic coefficient + alpha x gradient
_ Return the logistic coefficient

Gradient Descent Method

— Ingradient descent method we move in the opposite direction of the gradient to find the minimum point on a function _
- The gradient operator will always point opposite to the direction of gradient increase.
b = b-axVf(b)
- _ This step is repeated until we reach stopping criterion.
— -Using above mentioned methods optimized b is calculated, net input is calculated and then the prediction is caleult
by giving the net input to the function,
1
Prediction = —X%
l+e

— Finally the data points are classified as,

Class = 1 if Prediction>0.5
|

0 else L

— The classified data using the decision boundary is as shown


in Fig. 4.3.2..
— Logistic regression is used to model a relationship between |
input variables and a target variable. Let's take an example of
a house management system, we can usc logistic regression
to model the relationship between the parameters such
as the
‘total monthly income of the family, various liabilitie
s,
monthly expenditure ofa family to predict if
inonthly savings a
(investment) can be done or not.
Fig. 4.3.2 : Sample of output of Logisticcree
(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A‘SACHIN SHAH ven

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

— There are three types of logistic regression


as below.
Binary Logistic Regression

- Binary logistic regression is used


if the target variable is binary.
— The application of this method
can be above mentioned house
management example.
or ( ‘ Nominal Logistic Regression

- The examples of this type of


Tegression could be diffe
Mechanical, Civil, Electronics etc.
). ? 2

Ordinal Logistic Regression

- Ordinal Logistic Regression is used


if there are three or more classes with
specific orders.
— The examples of this type of regression could
be how customers rate the taste of food using
a scale of 1-3 (Bad, Good,
and Excel lent).

Example 4.3.1: A Bank has to decide whether to sanction loan or


not based on two attributes as person's income and
savings. The data is given in the following table where his
1 represents loan is sanctioned and 0 represents
loan is not sanctioned. Predict whether a Person 3 will get a loan
or not having annual income as 12.5 lakhs
and savings as 10 lakhs.
Person | Annual income in lakhs (x,) Savings in lakhs (x,) | Loan sanctioned? (y)
funete
1 14.5 12.5 1
2 8.5 4.5

_
0

vj Solution :
Module
Initially assume logistic regression coefficients by = b, = b, = 0
cal! For 1 row, x, = 14.5, x, = 12.5 andy = 1
Now we will calculate prediction for the first row,
Prediction = 1/(1 +e% C (by + by XX, + by X X,)))

Prediction 1/( +e4 (-(0+ 0x 14.5 +0x 12.5)))

. ;
Prediction = 0.5
calculate the new coefficient values using a simple update equation. Ideal values for alpha are from 0.1 to
Ni ow we willill calculate
0.3. Let’s take alpha as 0.3. For bg by default input is 1.
b.,, prediction
+ alpha x (y— prediction) x aa n x (1 - prediction) x input
Drew = ‘old

0.5 x (1 - 0.5) x 1.0 = ; 0.0375


.
b ‘Onew ~= 0 +03* x (1 =0.5) x
(1 - 0.5) x 14.5 = 0.5437.
b Incew = 0 +.0.3* x (1 — 0.5) x 0.5 x

x 12.5 0.46
b 2new = 940.3x
— * (10.5) x 0.5 x (1 ~ 0.5)
‘oti
Now we will calculate prediction for the sec ‘ond row,
so a +b x x, + by X X2)))

Prediction = 1/ (1 +e° Oot ia Tech-Neo Publications...A SACHIN SHAH Venture


fe AMU-New Syllabus w.e.f academic year 1 8-19) (M6-14)

Scanned with CamScanner


Learnin g with Regres sion -and Trees ...Page no 4-19 oe
To Learn
Machine Learning (MU-Se m 6-Com
The Art Of Cyber Security & Ethical Hacking Contact Telegram @crystal1419

75 X 4.5)))
Prediction = 1/(1 +e (- (0.0375 + 0.54735 x 8.5 + 0.468

Prediction = 0.99

icient values
Now we will calculate the new coeff
ction x (1 — prediction) x input
bo, new = Day + alpha x (y — prediction) x predi

Boney = 0.0375 + 0.3 x (00.99) x 0.99 x (1 - 0.99) x 1.0 = 0.034


Dino = 0.54735 +.0.3 x {0 — 0.99) x 0.99 x (1 - 0.99) x 8.5 = 0.523
Boney = 0.46875 + 0.3 x (0 - 0.99) x 0.99 x (1 - 0.99) x 4.5 = 0.456
Now we will use these values for prediction
Prediction 1/(1 +e (— (by +b, X x, +b, XX)

Prediction = 1/(1 +e (- (0.034 + 0.523 x 12.5 + 0.456 x 10)))


Prediction = 0.99

Since prediction 2 0.5

Prediction for person 3 is, his loan will be sanctioned.


of iterations are also applied to get accuracy. Here
Note : Generally huge amount of data is used for training and number
and a single iteration is shown.
for example purpose only two records for training are taken

yl 4.4 DECISION TREES

cation and prediction. The attractiveness of decision


- Decision trees are very strong and most suitable tools for classifi
n trees represent rules.
trees is due to the fact that, in contrast to neural network, decisio
may be achieved. By comparing the records
Rules are represented using linguistic variables so that user interpretability
the record belongs to.
with the rules one can easily find a particular category to which
only thing that matters in such situations we do
In some applications, the accuracy of a classification or prediction is the
the ability to explain the reason for a decision is
not necessarily care how or why the model works. In other situations,
g professionals, so that they can use this
crucial, in marketing one has described the customer segments to marketin
knowledge to start a victorious marketing campaign.
knowledge and for this we need good descriptions.
This domain expert must acknowledge and approve this discovered
quality of interpretability (ID3).
There are a variety of algorithms for building decision trees that share the desirable

Where Decision Tree is applicable?


:
Decision tree method is mainly used for the tasks that possess the following properties

— The tasks or the problems in which the records are represented by attribute-value pairs.
re’ attribute the value is f
Records are represented by a fixed set of attribute and their value Example : For ‘temperatu
‘hot’. :
then decision tree learning becomes
When there are small numbers of disjoint possible values for each attribute,
very simple.

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications..A SACHIN SHAH Ventur

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
:
Machine Learning (MU-Sem 6-Comp)
Learning with Regressio
n and Trees ...Page
Example: Temperature attrib no (4-114
ute takes three values as hot, mil
d and cold
Basic decision tree algorithm may be extend
ed to allow real valued att
ributes as well.
Example: we can define floating point temperature,
— Anapplication where the target function takes
discrete output values
In Decision tree methods an easiest situation exists,
if there are only two possible classes
Example: Yes or No

When there are m ore than two possible


i output classes then decisi ion tree meth
ods can also be easily extended
A more significant extension allows
learning target functions with real val
wo
decision trees in: this area is not frequent. ued outputs, although 8 the applicati
pplication of

The tasks or the problems where the basic


requirement is the disjunctive descriptors.
Decision trees naturally represent disjunctiv
e expressions.
— Incertain cases where the training data may contai
n errors.
Decision tree learning methods are tolerant to errors that can be a classification error of training records or
— attribute-value representation error.
et
j amy = si .
The training data may be incomplete as there are missing attribute values.

—_ Although some training records have unknown values, decision tree methods can be used.
—_—
2. Decision Tree Representation

— Decision tree is a classifier which is represented in the form of a tree structure where each node is either a leaf
veness dt
node or a decision node.

«ht ° Leaf node represents the value of the target or response attribute (class) of examples. re

spent ° Decision node represents some test to be carried out on a single attribute-value, with one branch and sub tree ull
! for each possible outcome of the test.
chsi — Decision tree generates regression or classification models in the form of a tree structure. Decision tree divides a

on jut? dataset into smaller subsets with increase in depth of tree.


they a 2 rn final
hamnehes is a tree with decision nodes and leaf nodes. A decision node (e.g., Buying_Price) has two or
decision (enetreeHigh,
Medium and Low). Leaf node (eB Evaluation)
shows a Classification or decision. The

goa topmost decision node in a tree which represen the best predictor is called — —— trees can be used

. |
atl to represent categorical as well as numerica High
re set of records
© Root Node : It represents enti fi. m mn . ~
ag: in divided into two or
or dataset and thisis 1sis aga
more similar sets.
woe I Low, High
i © Splitting : Splitting procedure is used to divide Sma) Big
: din .
a node into two or more sub-nodes depending [Unacceptable] [Acceptable| [unacceptable] [Acceptable
on the criteria.

[El rech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Learning with Regres sion and Trees ...Page no (4-12)


Machine Learning (MU-Sem 6-Comp) er eee e:_ _ eS

which is divided into more sub-nodes.


° Decision Node : A decision node is a sub-node
further divided or a node with no children.
oO Leaf/ Terminal Node : Leaf node is a node which is not
sub-nodes and sub-nodes are called ag
° Parent and Child Node : Parent node is a node, which is split into
child of parent node.
decision tree.
° Branch / Sub-Tree : A branch or sub-trec is a sub part of
n trees by removing nodes.
° Pruning : Pruning method is used to reduce the size of decisio

3.. Attribute Selection Measure

1. Gini Index

— Allattributes are assumed to be continuous valued.

— Itis assumed that there exist several possible split values for each attribute.
+= Gini index method can be modified for categorical attributes.

— Gini is used in Classification and Regression Tree (CART).

‘Tf a data set T contains example from n classes, gini index, gini (T) is defined as,

gini (T) = 1- x, = 1(P)° . (4.4.1)

_ Inthe above equation Pj represents the relative frequency of classj in T.

After splitting T into two subsets T, and with sizes N, and N,, gini index of split data is,
N, i
gini (T3) w» (4.4.2)
BiNk.sprinn(T).== Vein Ty) ro 7

The attribute with smallest gini split (T) is selected to split the node.

2. Information Gain (D3)


In this method all attributes are assumed to be categorical. The method can be modified for continuous valued
attributes. Here we select the attribute with highest information gain.
Assume there are 2classes P and N. Let the set of records S contain p records of class P and n records of class N.
.. The amount of information required to decide if a random record in S belongs to P or N is defined as,

T(p,n) = —- sea) los (525)-( sia) boee( ) (4.4.3)


.....8,}
Assume that using attribute A, a set S will be partitioned in to sets (5), So,
objects in all
If S, has p, records of P and n; records of N, the entropy or the expected information required to classify
subtrees S; is,

_ E(A) = 5)= a 1 (pn) (4.4.4)


object in S
—' Entropy (E): Expected amount of information (in bits) needed to assign a class to a randomly drawn
under the optimal shortest length code.

Tech-Neo Publications...A SACHIN SHAH Venture


(MU-New Syllabus w.e.f academic year 18-19) (M6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
T tremtoumeranyneesaeantin
omit

- Gain (A): Measures reduction in entro


‘ + US i .
(maximum Gain). © of split. Choose Split that achieves Most reduct ton
Gain(A) = 1 (p, n-E (A)
~

Avold Overfitting in classification (Tree pruning) (4.4.5)

The generated tree may overfit the training data


- If there aare too maany branches
: then. some may reflect
anomalies duc to Noise
or outliers
- Overfitting result in poor accuracy
for unseen samples,
There are two approaches to avoid Overfittin
g, prune the tree so that it is not too specific
Prepruning (prune while building tree)

Stop onstru
top tree constr ction carly do not divide
ucti ivi a node iiff thisthi would result in the goodne
ss measure falling below threshold,
Postpruning (prune after building tree)

Fully constructed tree get a sequence of progre


ssively pruned trees,

Strengths of Decision Tree Method


4A:
— Able to generate understandable rules.

— Performs classification without requiring much computation.

- Able to handle both continuous and categorical variables.


bd, — Decision tree clearly indicates which fields are most important for prediction or classification.

Weakness of Decision Tree Method

- Not suitable for prediction of continuous attribute.

— Perform poorly with many class and small data.

— Computationally expensive to train.

4.5 CONSTRUCTING DECISION TREES


ates through
ation of the algorithm, it iter
the orig inal se! 5 as the root node. On each iter
ori thm stastarts with
ithm
The e ID3 alg
algor
e ntropy (or information gain) of that attribute. Algorithm next
the set § and calcu‘ates the
every unused attribute of
y (or larg’ est information gain) value.
selects the attribute which has the smallest entrop
is between 20 K and 40 K,
Income is less than 20 K , Income
The set § is then divided by the chosen attribute (8. called for each subset,
duce subsets of the data. The algorithm is recursively
Income is greater than 40 K) to produce
; d before.
considering the attributes which are not selecte
se situations -
. i . 1 Cc an be one of the .
The stopping criteria for recarsion ss (+ or —); then the node is converted into a leaf node and
the sam e cla
the subset belongs to
© When all records in
labelled with the class of the records.
[fe Tech-Neo Publications..A SACHIN SHAH Venture
: 19) (M6-14)
(MU-New Syllabus w.e.f acade mic year 18

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning (MU-Sem 6-Comp)


Learning with Regression and Trees ...Page no (4-14
CO

o When we have selected all the attributes, but the records still do not belong to the same class (some are + and

some are —), then the node is converted into a leaf node and labelled with the most frequent class of the records in
the subset

o When there are no records in the subset, this is due to the non coverage of a specific attribute value for the record

in the parent set, for example if there was no record with income = 40 K. Then a leaf node is generated, and
;
labelled with the most frequent class of the record in the parent set.
— Decision tree is generated with each non-terminal node representing the selected attribute on which the
data was Split,
and terminal nodes representing the class label of the final subset of this branch.

Summary |

Entropy of each and every attribute is calculated using the data set
- Divide the set S into subsets using the attribute for which the resulting entropy (after splitting) is minimum (or,
equivalently, information gain is maximum) 4

— Make a decision tree node containing that attribute ~ |


— Recurse on subsets using remaining attributes.

Pseudocode

ID3 (Records, Target_Attribute, Attributes)


i Generate a root node for the tree
e
‘Teall records are positive, Return the single-node tree Rool, with ‘+ ‘label.
‘fall records are negative, Return the single-node tree Root, with *-’ label.
If number of predicting attributes is empty, then return the single node tree Rool, and label with most frequen! value of the target
‘attribute in the records.
_ Otherwise Begin

pA + The Attribute that best classifies'records.


t Decision Tree attribute for Root ‘A’.
f For each possible value, v, of A,

‘Add a new tree branch below Root, corresponding to the test A = v,.
Let Record (v) be the subset of records that have the value v, for A
pal Record (v) és empty .
fe Then below this new branch add a leaf node and label with most frequent target value in the records
Else below this new branch add the sub tree ID3 (Records (v,), Target_Altribute,

Mu-
(MU-New Syllabus w.e.f academic year 18-19) (M6-14) + at
fal Tech-Neo Publications..A SACHIN SHAH Venture
Ve a

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

aching Leaming U-Sem 6-Comp) — —_ Learning with Regression and Trees ...Pa 4
aS7.6 _ EXAMPLE OF CLASSIFICATION TREE USING ID3 SSS
4.6.1: Suppose we want ID3 to evaluate car :
example —— is “Should we accept car?” which canbe
acer imaccepisbie OF nol. The target
Buying Price | Maintenance_ Price Lug_Boot | Safety | Evaluation?
High High Small High | Unacceptable
High High Small Low | Unacceptable
Medium High Small High | Acceptable
Low Medium Small High | Acceptable
Low Low Big High | Acceptable
Low Low Big Low | Unacceptable
Medium Low Big Low | Acceptable
High Medium Small High | Unacceptable
High Low Big High | Acceptable
Low Medium Big High | Acceptable
High Medium Big Low Acceptable
Medium Medium Small Low Acceptable
Medium High Big High | Acceptable
Low Medium Smail Low | Unacceptable

VJ Solution :

Class P : Evaluation = “Acceptable”


Class N : Evaluation = “Unacceptable”
Total records = 14 a
4
= 5
No. of records with Acceptable = 9 and Unacceptable

19.2) = (ta) 98e(50)-(o =ta


ptn (5))
0,940 (14 } 108 =
14) ~ (7)
9 log, ( =)-
1(9, 5) - (<3)
W

» Step1

1. Compute entropy for Buying_Price


For Buying_Price = High

say «10.2 (3)e(3)-(3)e(3)-0"


p, = 2andn,= 3

dium and Low.


Similarly we will calculate I (pj 9 ) fo ¢ Me
(e seemernecmescemmarene.pseroy

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
«gy

Learning with Regressi


lyIg} |
Machine Learning (MU-Sem 6-Comp) on and Trees See
y :
pty
E(A) = — & ptn Lp.)
5 4
E(Buying_Price) = (3) 1(2.3) + (4) 1 (4,0) +( a ) 1 (3,2) = 0.694

Gain(S, Buying_Price) = I (p.n)-E (Buying_Price) = ().940-0.694 = 0.246


Buying_Price

Similarly, Gain (S, Maintenance_'Price) = 0.029,


Gain (S, Lug_Boot) = 0.151, Gain (S, Safety) = 0.048
Since Buying_Price is the highest we select Buying_Price as the
root node.
> Step 2
As attribute Buying Price at root, we have to decide on remaining tree attribute for High branch.
Buying_Price | Maintenance_ Price | Lug_Boot | Safety Evaluation?

High High Small High | Unacceptable


High High Small Low | Unacceptable
High Medium Small High | Unacceptable
High Low Big High Acceptable
High Medium Big Low Acceptable

No. of records with Acceptable = 2 and Unacceptable = 3

I(p, n) (st) 96(35)-(Ga ) wes (525)


1(2, 3) -(3)1e:(3) in -(
(3)=$)097
I. Compute entropy for Maintenance_ Price

Maintenance_Price | p, | n, | I (P,, n, )
High 0} 2 0
Medium | l
Low 1 | 0 0
E(Maintenance_ Price) = (2) 1 (0,2) + (2) Td1,l)+ (5) 1(1,0) =0.4

Gain(S High’ Maintenance_ Price) =f (p, n) —


E (Maintenance_ Price) = 0.971-0.4=
0.571
2. Compute entropy for Lu g_Boot
P, =O and n, =3 .
1(P,,n;)=1(0,3)=0

Lug Boot | p, | n, | 1(P,,n,)


3 Small 0] 3 0
Big |2/0| 9

(MU-New Syllabus w.e¢


academic year 18-19) (M6
-14) La Tech-Neo Publications...A SACHIN SHAH Venture

i iin a

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

E (Lug_Boot) = (2)10, 3+(2


) 1(2,0)=0
Gain (S sions Lug_Boot)
=I (p, n)-E (Lug_Boot)
= 0.97] _g— 0.971
3. Compute entropy for Safety

Pj) = 1 and n,=2


1(p,n;) = ¥(1,2)=0.918

E (Safety) (2)ha,24 (2)1a, 1) =0.951


i

Gain(S yis,, Safety) = I(p,n)-E (Safety) = 0.971 -


0,951 =0.02
Since Lug_Boot is the highest
we select Lug_Boot
as a next node below High bran
ch.

d Step 3
Consider now only Maintenance_ Price
and Safety for Buying_Price = Medium
Buying_Price Maintenance_ Price | Lug_Boot Safety | Evaluation?
Medium High Small High | Acceptable
Medium Low Big Low | Acceptable
Medium Medium Small Low | Acceptable Module
Medium High Big High | Acceptable f 4)

Since for any combination of values of aintenance ing Pn ool


Price and Safety, Evaluation? value is Acceptable, so Heh
Low
we can directly write down the answer as Acceptable. Medium

[Acceptabie}

Small Big

» Step 4
=
Consider now only Maintenance_ Price and Safety for Buying_Price = Low
| Evaluation?
Buying_Price | Maintenance_ Price | Lug_Boot | Safety
Medium Small High Acceptable
Low
Low Big High Acceptable
Low
Low Big Low | Unacceptable
Li
Big High Acceptable
7 Medium
Medium Small Low | Unacceptable
=

P, = 3 and n,=2
icatiions..A SA CHIN SHAH Venture
Tech-Neo Publicat
(MU-New Syllabus w.ef academic year 18-19) (M6-14) (a e

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning (MU-Sem 6-Comp) Leaming with Regression and Trees ...Page no 4-18)
SS SS)
1.) = 1(3,2)=0.970
|. Compute entropy for Safety

Safety | pj | n | Tn)
High | 3 | 0 0
Low | 0 | 2 0

2
E (Safety) . (2)te, 0 +(F)10, 2)=0

Gain (S ,.y» Safety) = 1 (p, n)— E (Safety) = 0.970 - 0 = 0.970

2. Compute entropy for Maintenance_ Price

Maintenance_ Price | p; | n, | I (P,,n, )


High o}o 0
Medium 2) 1 0.918
Low 1] 1
3 2
E(Maintenance_ Price) = (2) 1(0, 0) + (3) 1(2, 1) + (2) (1, 1) =0.951

Gain (S ,,,. Maintenance_ Price) =I (p, n) — E (Maintenance_ Price) = 0.970 - 0,951 = 0.019

Since, Safety is the highest we select Safety below Low branch.

Now we will check the value of ‘Evaluation?’ from the database, for all branches.

o Buying_Price = High and Lug_Boot


= Small —> Evaluation? = Unacceptable
Buying_Price
o Buying_Price = High and Lug_Boot ; 7 oN
. . High Low
= Big —> Evaluation? = Acceptable Medium

o Buying Price = Low and Safety | Lug Boot Acceptable| | Safety


f = Low = —> Evaluation? = Unacceptable
ion? = Small Big _ High

o _-Buying_Price = Low and Safety


= High —> Evaluation? = Acceptable

*. Final Decision Tree is


Buying_Price

High Low
Medium

Lug_Boot | Acceptable| | Safety |

Small Big Low High

[Unacceptable] |Acceptable| |Unacceptable| [Acceptable]

(MU-New Syllabus w.e. academic year 18-19) (M6-14) ta Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

ample 4.6.2: Suppose we want ID3 to decide whether the loan is to be sanctioned or
not. The target classification is
“Should we sanction loan?” which can be yes or
no.
Customer no | Spending_ Habit | Collateral
Income | Credit_Score | Sanction?
|
| High None Low Bad No
2 High None Medium Unknown No
3 Low None Medium Unknown No |
4 Low None Low Unknown No
5 Low None High Unknown Yes
6 Low Sufficient | High Unknown Yes
7 Low None Medium Bad No
8 Low Sufficient High Bad No
9 Low None High Good Yes
10 High Sufficient | High Good Yes
ul High None Low Good No
12 High None Medium Good No
Solution:
")” Class P: Sanction = “Yes”
Class N : Sanction = “No”
Total records = 12
No. of records with Yes = 4 and No=8

I(p,n) = ~ (2) wes(s25)- (Sta) (sta)

(we
mm 1(4,8) = - (5) tors(-3)-(45)-to22( 45) =0.922

a
Module

/ Step tl

p Compute entropy for Spending_Habit

4 For Spending_Habit = High

© ten «10.02-(4)ne($m3)
)-(§07)
/ p,= Landn,=4

Similarly, we will calculate I (p,.n,) for Low.

Spending_Habit | p, | ™ | 1 (p.m)
High t{ 4] o720
Low 3 | 4+] 0.985

E(A) = 2 BT ni)

/} { E (Spending_Habit) (3) x 0721+ (4) x 0.985 = 0.874

v4 ai
tNew Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications_A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

with Hegrenadion and Troan ,,,Pa ;


Machine Leaming (MU-Som 6-Comp) = Learningwit ous — Ray |

Gain(S, Spending _Habin = 1 (p.n) - (Spending labin = 0,022 — O.87E = O.018

2. Compute entropy for Collateral a

Collateral) P| He) |

None 7) Ont
Sufficient | 2] 1 | 0.018
TO
BoA) =X = 1 ned |
\
9 3 }

E (Collateral) = ( 12 ) X 0.77 + (4) x 0,918 = 0,806


{
Gain(S, Collateral) = 1 (p.n)-E (Collateral) = 0.922 = 0.806 = O16

3. Compute entropy for Income

Income | py | 1 | EP)
Low QO] 3 0 ‘ i
Medium | 0 | 4 0)

High 4} 0.721

PrN,
E(A) >=!
ptn I(p, 0)

\ 4 5
_E (Income) . (3) xO+ (4) x0+ (3) x 0.721 = 0.3
I

Gain(S, Income) = I (p,n) — E (Income) = 0.922 - 0.3 = 0.622

© 4, Compute entropy for Credit_Score

Credit_Score | p,; | 9, | I (py n,) i


Bad 0] 3 0 :
Unknown | 2 | 3 0.97
Good 2] 2 l
v
. + n.
E(A) = 3 =F 1(p,,,)
i=l
. 3
E (Credit_Score)' = 3) x0+ (3) x 0.97 + (5) x | =0.734

Gain(S, Credit_Score) I(p,n)-E (Credit_Score)

= 0,922 - 0.734 = 0.988 he


‘Since Income is the highest we select Income as the root node.
F

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications..A ‘SACHIN SHAH vertu

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Leaming with Regression and Tress ...Page 10 (4-21


» step?
as ani jbute Income at root, we have
tn Gécide
deci on Temaining
ink tree attribute
i for Low branch.
Consider only Spending Habit. Collateral end Credit_Score for Income = Low
[ Customer no | Spending Habit | Collateral | Income | Credit Score | Sanction? |
l i High | Nowe | Low | Ba | No |
4 I Low i None | Low | Unknown I No |
uN | High | Noze | Low | God | No |
Since for amy combination of values of
Spending_ Habit. Collateral and Credit_Score, Sanction?
yalu2 is No for Income = Low, so we can directly write
down the answer as No

& Step 3
Consider only Spending Habit. Collateral and Credit_Score Income = Medium

Customer no | Spending Habit | Collateral | Income | Credit_Score | Sanction?

2 | High | None | Medium | Unknown | No


3 | Low | None | Medium | Unknown | No

Low | None Medium | Bad | No


7
2 | High None | Medium| Good | _No
it.
Since for any combination of values of Spending_Hab
No for
Collateral and Credit_Score, Sanction? value is
write down the answer as No
Income = Medium, so we can directly

» Step4
and Credit_Score Income = High
Consider only Spending_Habit, Collateral
| Sanction?
| Collateral | Income | Credit_Score
Customer no | Spending_ Habit Yes
None High Unknown
5 Low
Unknown Yes
| Sufficient High
6 Low
Bad No
Sufficient High
8 Low
Good Yes
High
9 Yes
High Good
High Sufficient |
10

No. of records with Yes = 4 and No= 1


igh
I(p, n) “2-)ion(25)- _{ (gta)
- -(<23) n(sta) —— oaloea)
p+no/ Jlog. “*\ptn
~
~~
|
4 4 -(4)5
w8.($) log2\,(4)=0721
5
1(2,3) = _(4)

[tal Tech-Neo Publications..A SACHIN SHAH Venture


U-New Syllabus w.ef academic year 18-19) (M6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

no
Learning with Regression and Trees ...Page

Compute entropy for Spending_Habit _—_——————

it | py | I (pj 0, )
Spending_Habi
Low 3/1] 0811

High 1 | 0 0

E(A) z= PF Lo.md
4 1) x 0 =_ 0.648
E(Spending_Habit) = (4) ce +(5
= 6.073
— E (Spending_Habil) = 0.721 - 0.648
Gain (Spigns Spending_Habit) = 1 (p, n)
2. Compute entropy for Collateral
Collateral | p; | ™% I(p,, n; )
None 0 0

Sufficient 1 0.918

. E(A) = Eel a * 1(Pis mi)

2 3 _
E (Collateral) = (2)x0+($)xos18=05 5

Gain (Syigy, Collateral) = I (p, n) - E (Collateral) = 0.721 — 0.55 = 0.171

3. Compute entropy for Credit_Score


Credit_Score | p,; | ™; | I(p,,n,)
Unknown 0 0
Bad 0; 1 0
Good 0 0

E(A) on “Lp, m)
E=1

E (Credit_Score)
(5)<o*O+\+(s)x
(5) o+(3 5
5 )XO+| )x0=0

Gain (Syigq, Credit_Score) = I (p, a) — E (Credit_Score) = 0.721 -0 = 0.721


Since Credit_Score is the highest we select Credit_Score as a next node below
High branch.
| Income

Low
Medium

| No | “No Credit_Score

2S Good

(MU-New Syllabus w.e.f academic year 18-19) (M6-14)


[e) Tech-Neo Publications...A SACHIN SH

Scanned with CamScanner


\
To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning (MU-Sem 6-Comp)


Learning with Regression and Tre ge no (4-23)
Now we will check the value of ‘Sanction?’ from
the database, for all b h
S¢, for al branches, ao
o Income = High and Credit_Score = Bad -> Sanction? = Nc
27=No

o Income = High and Credit_Score = Unknown -> Sanction? = Y,


; ?= Yes
o Income = High and Credit_Score = Good -> Sanction? = Yes
— Final Decision Tree is

Medium

Example 4.6.3 : Suppose we want ID3 to decide whether the car will be stolen or not. The target classification is “car is
stolen?” which can be Yes or No.

Car no | Colour | Type Origin | Stolen?


1 Red_ | Sports | Domestic Yes
2 Red | Sports | Domestic No
3 Red Sports | Domestic Yes
4 Yellow | Sports | Domestic No
5 Yellow | Sports | Imported Yes
6 | Yellow| SUV | Imponed | No |
7 Yellow | SUV | Imported Yes

8 Yellow | SUV | Domestic No

9 Red SUV | Imported No

10° Red Sports | Imported Yes

Vv Solution :
Class P: Stolen = “Yes” Class N: Stolen = “No”

Total records = 10
No. of records with Yes = 5 and No=5
_ (sta) toe(sta)
I(p,n) = (<t5) log, (583)

16.5 =-()me(8)-(é)ee(t)=
* Step1
1. Compute entropy for Colour
Colour | p, | | 1 (Pr)
Red 3{2{ 097!
Yellow| 2] 3| 0971|
Venture _
ations-A SACHIN SHAH
/ (el Tech-Neo Public
) (M6-14)
(MU-New Syllabus w.e.f academic year 18-19

Is

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Learning with Regression and Trees ...Page no (4-2 ri . '


Machine Learning (MU-Sem score -
= a 2
=

E(A) = 5 no
ven “1 (p,m)
i=l
5 5 x 0.971 = 0.971
E (Colour) = (5) x 0.971 + (3)

Gain(S, Colour) = I(p,n)-E (Colour) = 1 -0.971 = 0.029

2. Compute entropy for Type


Type Pi | ™% I (Pi n, )

Sports | 4 | 2 | 0.923
suV | 1/3] O81I

E(A),== z me
pin = 5 (p, n)

6 4 ,
+( 4) x08 | =0.878
E (Type) (5) x0.923
"

Gain(S, Type) = I (p, n)-— E (Type) = | — 0.878 = 0.1218

"3. Compute entropy for Origin


Origin Pp, | m, | 1 (pp my)
Domestic | 2 | 3 0.971
Imported | 3 | 2 | 0.971

‘E(A) = ran
oom “(py n)
I

_E (Origin) (3) xoon+ (5) xoon = 0.971


Sports SUV

Gain(S, Origin) = I(p,n)-E (Origin) = | - 0.971 = 0.029

Since Type is the highest we select Type as the root node.

» Step2
branch.
As attribute Type at root, we have to decide on remaining tree attribute for Sports
Consider only Colour and Origin for Type = Sports
Car no | Colour | Type | Origin | Stolen?
1 Red | Sports | Domestic Yes

2 Red | Sports | Domestic No


3 Red | Sports | Domestic Yes
4 Yellow | Sports | Domestic No ‘
5 Yellow | Sports | Imported Yes :
i | 10 Red | Sports | Imported | Yes A
;
No. Of records with Yes = 4 and No =2

(M6-14)
(MU-New Syllabus w.e.f academic year 18-19)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

bi ing (MU- C
chine Learning (MU-Sem 6-Comp) Learning with Regression and Trees ...Page no (4-25)

T(p.n) = (25) 108 (s2=)-(s2:) ton (s)


1(4,2) = -(#) log, (3) -(2) log, (2) = 0.923

Compute entropy for Colour


Colour | p, | n, | I(p,n,)
Red 3) 1 0.811
Yellow | 1| 1 1
v P. +
i, + 0;
E(A) = ee pen I (p;, 0)

4 2
E(Colour) = (2) x 0.811 +(2) x 1=0.873

Gain (Ssponsx Colour) = I (p, n) — E (Colour) = 0.923 — 0.873 = 0.05

Compute entropy for Origin

Origin | Pp, | | 1@,n,)


i)
l Domestic 2 I
Imported 0 0

.
pen 1 (Pim)
E(A) = = Hei
p, +n,
a suv

E (Origin) = (2)x 1 +(2)x0=0.666

= 0.257 Domestic Imported Module


Modal
F Gain (Ssyony Origin) = I (p, n) — E (Origin) = 0.923 - 0.666 14
ode below Sports branch.
' Since Origin is the highest we select as a nextn
\

Step 3
chosen, w e have
As attribute Type and Origin is already
attribute for SUV
to decide on only remaining Colour
branch.
from the
Now we will check the value of ‘Stolen?’
database, for all branches,
= Domestic, Stolen? Domestic
nt o For, Type = Sports and Origin
pee = Yes as well as No
most common class. In
So for this type of case we hav e to select the
Yes as well as No, s oO we can
this example there are 2 instances for
| select any one. Let’s we select No.
o For, Type = Sports and Origin = Imported, Stolen? = Yes
n? = No
o For, Type = SUV and Colour = Red, Stole
well as No
= Yellow, Stolen? = Yes as
For, Type = SUV and Colour

SACHIN SHAH Venture


Tech-Neo Publications..A

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning (MU-Sem 6-Comp) _ Learning


i
wih ith Regression
Sor and Trees ...P.
$22.00 (4-25
— So for this type of case we have to sclect the most common class. In this example there are 2 instances for Ng and|

instance of Yes, so we will select No.


- Final Decision Tree is

Domestic

> 4.7 EXAMPLE OF DECISION TREE USING GINI INDEX

Example 4.7.1: Create a decision tree using Gini Index to classify following dataset.

Sr.No. | Income Age Own Car


1 Very High | Young Yes
2 High Medium Yes
3 Low Young No
4 High Medium Yes

5 Very High | Medium Yes


6 Medium Young Yes
7 High Old Yes
8 Medium | Medium No
9 Low Medium No
10 Low Old No
11 High Young Yes

12 Medium Old No

MI Solution:

- Inthis example there are two classes Yes and No.


No. of records for Yes = 7
No. of records for No = 5

Total No. of records = 12


— Now we will calculate Gini of the complete database as,

Gini(T) = 1— ((4) + (4)) =0.48


Next we will calculate Split for all attributes, i.e. Income and Age.

ture
(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH ven

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

gini (Very High) +7< ginj (High)


+
= gini (Low)

a
+=4 ginj (Medium)

(3) ) J+ PL) 8b-@)AN.ap-@y-@


y

Te
ee ee
Age Split gini(y
(Yong) +=> ini (Mediu
m) +> gin (Old)

el-@+@))} al-(G)-@))-
8l-(G-)]

eee
__- Split value of Income is smallest, so We will select Income

<TR
as root node.
nod
Income

Very High High} Low Medium

——
- From the database we can see that,

o Own Car = Yes for Income = Very High, so we can


directly write down ‘Yes’ for Very High branch.
o Own Car= Yes for Income = High, so we can directly write down “Yes’ for High branch.

o Own Car=No for Income = Low, so we can directly write down ‘No’ for Low branch.

— Since Income is taken as root node, now we have to decide on the Age attribute, so we will take Age as next node Module
below Medium branch. _
| Income

Medium

.7 From the database we can see that,


write down “Yes’ for Young branch.
©
= jum an d Agege== Young, so we can directly
Own Car = Yes for Income = Medium so we can directly write down ‘No’ for medium branch.
= Medium,
© Own Car =No for Inco me = Medium and Age for Old branch.
Age= Old, so we can directly write down ‘No’
.
ium and Ag
© Own Car=No for Income = Med

SACHIN SHAH Venture


[Ea tech-Neo Publications..A
: -19) (M6-14)
“New Syllabus w.e.f academic year 18-19) (

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Leaming (MU-Sem 6-Comp)

Waal Decision ‘Tree is,

Very High Medium

Medium

Example 4.7.2: Stock market involving only Discrete ranges has profit as categorical value {up, down}.Use Ginj inde,
method to draw classification tree. .

Age | Competition Type Profit

old Yes software | down

‘, , , old No ‘software | down

old No hardware | down


mid Yes software | down
mid Yes hardware | down

; mid No hardware up

: ‘ ; mid No software |. up

i new Yes software up

* i new No hardware | up
, aw new no software up

i j Solution :
bay — — Inthis example there are two classes down and up.
aN No. of records for down = 5
No. of records for up = 5
Total No. of records = 10

cwicn = [1-((3)«(3))05
- Now we will calculate Gini of the complete database as,

— Next we will calculate Split for all attributes, i.e. Age, Competition and Type.

Age
. 3 4
Split = To gini (old) +79o Bini (mid) * gini (new) :

= gol (5) +(3) ) Jeol (GG) +)


3
Ji [-() +)]-
3yY 4 2y 2 3 3y oe

Tech-Neo Publications...A SACHIN SHAH ventute :


(MU-New Syllabus w.e.f academic year 18-19) (M6-14)
cael

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

- Machine Learnin

Competition

Spiit sini (yes) +=


i gini (no)

4[-(G)-QNeeh-

—<s
AN

“—~
+
Type

Split &
10 ini (software) +—=4 gini
(hardware)

vol'-(C5) +(+8
f)-()
2+3)
])
Split valuc of Age is sm allest,
so we will select Age 4s root
v

node.
From the database we can
see that, Profit = Down
for Age = Old, so we can directly write
node. down ‘Down’ for Old branch

Next we will check for Age = mid



Age | Competition Type Profit
mid Yes software | Down
mid Yes hardware | Down
mid No hardware | Up
mid No software | Up
Now we will calculate Split for only Competition and Type
attributes.
Competition

Split 2 ini (yes) +>42 gini (no)

| Type
2:-(()-@Q)C-(@O}
Split 2...
BIN

gini (software) + 4 gini (hardware)

aL -((a) +(2)) J+al-(Ga)+@)+0)


ly

Split value of Competition is smallest, so we will select


Competition as mext node below mid branch.
- From the database we can see _ that, Profit = Up
for Age = New, so we can directly write down ‘Up’ for New
branch node.
— _ Now we will check the value of ‘Profit’ from the database, for all branches, :
° Age = Mid and Competition = No -> Profit = Up
°° Age =Mid and Competition= Yes -> Profit = Down

Tech-Neo Publications..A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning (MU-Sem 6-Com Learning with Regression and Trees . Pa @ no 4.39

Final Decision Tree is,

Old Mid New

[competition]

No Yes

Example 4.7.3 : Suppose we want Gini index to decide whether the car will be stolen or not.-The target classification is “cat
is stolen?” which can be Yes or No.
Carno | Cojour | Type Origin | Stolen?
1 Red Sports | Domestic Yes
2 Red | Sports | Domestic No
3 Red | Sports | Domestic Yes
4 Yellow | Sports | Domestic No
5 Yellow | Sports | Imported Yes
6 Yellow | SUV | Imported No
7 Yellow {| SUV | Imported Yes
8 Yellow | SUV | Domestic No
9 Red SUV_ | Imported No
10 Red | Sports | Imported Yes
vw Solution:

In this example there are two classes Yes and No.


No. of records for Yes = 5
No. of records for No = 5
Total No. of records = 10

Now we will calculate Gini of the complete database as,

omen = 1-(1-(($) «(§)))e0s


Next we will calculate Split for all attributes, i.e. Colour, Type and Origin.
Colour

Split +».= 24

= eevol '-((3)3Y,(2Y
+(2)) J5[1-((B)+(3))
5 2V 73
Jeo
Type
Split = 6 oat 4 —
plit = 49 sini (Sports)
+ 75 gini (SUV)

- A (O+@)4h-(G-@)ee)e
(MU-New Syllabus w.e.f academic year 18-19) (M6-14)
ra
[Bl rech-Neo Publications._A SACHIN SHAH Ventu®..
2

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

origin

Split 4, == >10 Z£ gini (Domestic)


omestic) +>
+ 79 gini (importea)

10( [ 1- a]5
(3) 5
3)’ )]
+ (3) 25 !-((3)(2
*vol 2 )) Jen Sports su
u

Split value of Type is smallest, so we Will select T. ° “s


CI
y Pe as root node.
Next we will check for Type = Spons
i

As attr ibute T ype


p
at roo t, * we have to dec i1 de
aining oO n rem
Consider only Colour tre € attribut
ib} e for Sports
and Origin for Type = branch
Sp Sports

= no Origin | Stolen?
Domestic Yes
2 Domestic } No
3 Domestic Yes
4 Yellow Domestic No
5 Yellow | Sports Imported Yes
10 Red | Sports Imported Yes
Colour
. 4... 2..,
Split = 6 sini (Red) + 6 Bini (Yellow)

= éL-((24)+
-(G)
G@ +G))
))Jean
Origin> suv Module
- 4 4 4
Split = 6 gini (Domestic) + é gini (Imported)

4-4)
Split value of Origin is smallest, so we will select Origin as next node.
Domestic Imported

Next we will check for Type = SUV


for SUV branch.
As attribute Type and Origin is already chosen, we have to decide on only remaining Colour attribute
Now we will check the value of ‘Stolen?’ from the
database, for all branches,
. _ Sports
For, Type = Sports and Origin = Domestic, Stolen? = Yes
as well as No
most Velow
~ So for this type of case we have to select° the Domestic
tan ces for Imported Red
common class. In this example there are 2 ins
Let’s we
Yes as well as No, so we can select any one.
select No,

0 For, Type = Sports and Origin = Import


ed, Stolen? = Yes
Red, Stolen? = No
© For, Type = SUV and Colour =
Venture
(el Tech-Neo Publications...A SACHIN SHAH
-14)
(MU-New Syllabus w.e.f academic year 18-19) (6-14)

Scanned with CamScanner


» To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine je “earni
Lea Ing (MU-Sem
- 6-Comp)
-( Learning with Regression and Trees .,.Pagg no ay

9 For, Type = SUV and Colour= Yellow, Stolen? = Yes as well as No


~ So for this type of case we have to select the most common class (Since all attributes are already Considered),Inj
example there are 2 instances for No and | instance of Yes, so we will select No. '

— Final Decision Tree is

Domestic

> cue wad MU - May 15, 12 Marks ;


Create a decision tree for the attribute “class” using the respective values :

Eyecolour | Married Sex Hairlength | class


Brown yes Male Long Football
Blue yes Male Short Football
_ ‘ Brown yes Male Long Football
Brown no Female Long Netball
‘ a co, . Brown no Female Long Netball
Blue no Male Long Football
Brown no Female Long Netball
Brown no Male Short Football
Brown yes Female Short Netball
Brown no Female Long Netball
Blue no Male Long Football
Blue no Male Short Football
M4 Solution : "

In this example there are two classes Football


and Netball.
‘No. Of records for Football = 7
No. Of records for Netball = 5

Total No. Ofrecords =12


Now we will calculate Gini of the compl
ete database as,

Gini (T) = 1- (63) +(3) ) = 0.48

(MU-New Syllabus w.e.f academic year 18-19) (M6-14)


Tech-Neo Publications...’ SACHIN SHAH venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

“4, lachine Leaming (MU-Sem 6-Comp)


\ —o
" ——— _Learning with Regrassion and Traas Pacaga no (4-33
Next we will calculate Split for all attributes, i.e, Eyecolor, Married, Sex and Hairlength 208 Pagano ( ).

Eyecolor->
.
. 3 4
Split = 75 gini (Brown) +75
12 ini (Blue)

4
£-()-O)L@-@))eo
| { Married->

NX Split = Tp Bini (yes) + pini (no)

he,
AL @OV ALG)
S~~ s .
Split
7
= 7512 gini (male) +>2 gini (female)

7
= 7¥
= L(G) +3)f0\)\]. ) eS 5 [1-(()+(2)
oy sy
) Jeo
Hairlength->
: 8 4
‘Ie Split = Tp gini i (long) + ne gini (short)

Nez =)
3 + ) i-((3)
By +(4)fly = 0.458

hed Split value of Sex is smallest, so we will select Sex as root node. Module
r
—_—_—

a U4
a) female
Fe
ie
; From the database we can see that,
ee class = Football for Sex = male, so we can directly write down ‘Football’ for male branch.

& class = Netball for Sex = female, so we can directly write down ‘Netball’ for female branch.

© Final decision tree is,

Ny
fr male female

ationss...A SACHIN SHAH Venture


rec - Neo Publicication
[Bal Tech-
(M6-14)
(MU-New Syllabus w.e.f academic year 18-19)

iaten’. Te

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning (MU-Sem 6-Comp) _ Learning with Regression and Trees ...Page no (4-34
3 eee
see ey MU - May 17, 10Marks
‘For a Sunburn dataset given below, construct a decision tree
Height Weight Location Class
Name Hair
Light No Yes
Swati Blonde Average

Tall Average Yes No


Sunita Blonde
Average Yes No
Anita , Brown Short
Average No Yes
Lata Blonde Short
Heavy No Yes
Radha ' Red Average
Heavy No No
Maya Brown Tall
Heavy No No
Leena Brown Average *
Light Yes No
Rina Blonde Short

Mv Solution :
Hair, Height, Weight and Location.
We will calculate Split for all attributes, i.e.

Hair->
a Bit = ae 3 1,gini

-O)-G-O)
(Red)
gini (Blonde) + g gini (Brown) +

111-280 4

Hei ght->
2... 3...

-O)-alO-O)
2 sini (Average) + gin (Tall) +g gint (Short)

[OOH
Split =

8 3
nv

/
f
Weight->
af
2 3, + g ini (Heavy)
3.
* N
\ Split = § gini (Light) +3 sini (Average)
2
2 2

40-(@+@))
2
+3) J J-
-(Q)-@))
2

8b
2
5) +(5) JJ +ebt-Ua)
:

| 21
= 2[1-((2) +(3) Jel (4-2) :

‘ :
Location-> 3
5... 344
(Yes)
gsm (No) +g gin!

511-2) «4 -(@@)1-%
Split =

llest, so w
Split value of Hair is sma

is
(M6-14) Tech-Neo Publications...A SACHIN SHAH
w.e.f academic y ear 18-19)
(Mu-New Syllabus

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Learning wit .
ey 7 BI ith Regression and Trees ...Page no (4-35
Now we will split the remaining attributes considering
& Blonde d ala,

Height->
split = ei ni (Average) + 7 gini (Tall) 424 Bini (Short)

LO -O)40-()
4 (9) *4aL!- ($) +(4))]4 +3 [1-((4) +(2)) Joa

Weight->
split = = sini (Light) + gini (Average)

-2[1-(G)-G) 4 [-(O+-O)] -0s


Location->

Split = 2Zeini (No) +2 4 Bini (Yes)


2 2
= 3 1-((8) «(8 ) aL -() +G))] =
Split value of Location is smallest, so we will select Location node below Blonde branch.

From the database we can see that,

can directly write down ‘Yes’ for No branch.


class = Yes for Hair = Blonde and Location = No , so we
Module
so we can directly write down ‘No’ for Yes branch.
class = No for Hair = Blonde and Location = Yes,
14.
directly write down ‘Yes’ for Red branch.
class = Yes for Hair = Red , so we can
directly write down ‘No’ for Hair branch.
class = No for Hair = Brown, so we can

Final Decision Tree is,

Blonde/ Brown

Venture
fal Tech-Neo Publications...A SACHIN SHAH

(MU-New Syllabus wat academic year 18-19) (m6-14)


wt vi

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning (MU-Sem 6-Comp)


UExample 4.7.6 [UEIVEVEER DES
indexes and determine,
For a Sunburn dataset given below, construct a decision tree For the following data, Calculate Gini
which attribute is root attribute and generate two level deep deci sion tree.
Sr. No. Income Defauiting | .Creditscore | | Location”
Low High High bad
1
Low High High good
2
High High High bad
3
Medium Medium High bad
4
Medium Low Low bad
5
6 Medium Low Low good

7 High Low Low good


8 Low Medium High bad
9 Low Low . Low bad
10 Medium Medium Low bad
1 Low Medium Low good Yes
12- High Medium High good Yes
13 High High Low bad No
i 14 Medium Medium High good Yes
M ‘solution:
We will calculate Split for all attributes, i.e. Income, Defaulting, Creditscore and Location.

ncome-> , Split = = gini (Low) + 4 gini (High) + gini (Medium)

= &[1-(()+( ))8l1-(2)+G ell


4... 6... . 4
3
5 )+( 25 ¥) ]dae
Defaulting-> Split = 74 ini (High) + 14 Bini (Medium) + 4 gini (Low) = 0.438

7 wa. 7...
Creditscore-> Split = 74 sini (High) + 74 sini (Low) = 0.493

8...
Location-> —_Split = 74 Bini (bad) +e gini (good)

§L-( QV} L(+]-aas


Split value of Location is smallest, so we will select Location as root node.

bad good

Now we will split the bad branch considering Temaining attribu


tes
> tit = 2 in 23 |
Income-> Split = g gini (Low) + § gini (High) + §gini (Medium) = 0.295 :
tino. Pan 3... 3... . 2
Defaulting-> Split = gini (High) + g 8ini (Medium) +9 gini (Low) = 0.34
:
Creditscore->
» 47. 4
Split = ¥ gini (High) + gini (Low) = 0.25

(MU-New syllabus we academic year 18-19) (M6-14) Tech-Neo Publications..A SACHIN SHAH Ventut®

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

U-Sem 6-Comp)
machine Learning ; (M . : Learnini ge
no (4-37)
arhing with Regression and Trees ...Pa
vit wale of Creditecore is’sinallest, 80 We will ill sele
se l ecl Cred
red its
lit ‘core node c belc
belo Ww bad bran’ch a
i brane

bad good

Creditscora

Low

Now we will split the good brancls considering remaining Fattribut es

2
2

a
Income->
Split
t
= Zgini
6 ent (Low) + 6 sini (High) + Ggini (Medium) = 0,295

. . low, 2
Defaulting-> Split = G sini (High) =0
+ gini (Medium) + gini (Low)
Split value of Defaulting is smallest, so we will select Defaulting node below good branch

Since only one attribute is remaining, we can directly select Income below creditscore= High branch

For Location = bad and creditscore = High and Income = Low, Giveloan= No

Giveloan= Yes
For Location = bad and creditscore = High and Income = Medium,
High, Giveloan= Yes
For Location = bad and ereditscore = High and Income =
oan= No
For Location = bad and ereditscore = Low, Givel
High, Giveloan= No
| = 0.392 For Location = good and Defaulting =
Modulee
= Low, Giveloan= Yes
For Location = good and Defaulting |
14
ng = Medium, Giveloan= yes
For Location = good and Defaulti

SACHIN SHAH Venture


(3 Tech-Neo Publications...A
14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

)
Learning with =Ro rossion and Troos ...Page no 4-38
—!.
Machine Leaming (MU-Sem 6-Co
mp)

CLASSIFICATION AND REGRESSION TREE (CART)


a,
M1 4.8
y arinble. Mainly the target variable
onging to the largel
aset into classes bel
— — Classification trees are used to divide t! he dat orical classification trees are used.
can be yes or no, When the Gtr vet varhible type is Caley ‘
has wo classes that
oescontii nuous 24 in that case
eee sl on
PeEPress!
regres trees are used, Let's take an
or
*
Variable is; numeric
— — In certaini applications the target
+
isks Where We Want lo predict
problems or &
of a Mat, Hence regression trees are used for
example of prediction of price
some data instead of classifying the dita.
d chassific ation tree. Let's take an example
of an
on the similari ty of the data the records are classified ina standar
— — Based
we have (wo variables, Income and marital sau is th predict ita person is going to
Income tax evades. In this example are married does not evade the
that 85¢¢ of people who
[n our training data if it showed
evade the income tax or not.
tree. Entropy or Ginny index is used in
split the dat a here and Marit al status becomes a root node in
income tax, we
classification trees.
nse ¥ ariable does not have classes soa
tree is to fita model, The target or respo
— — The main basic working of regression
dat ais split at various split points
nt variable to the target variable. Then the
regression model is fit using each independe
tiking the square of the
sum of squared errors (SSE) is calculated by
for each independent variable. At cach split point
which is having minimum
criteria for root node is to select the node
difference between predicted and actua I value, The
using the recursive procedure,
SSE among all split point errors. The further tree is built

yl 4.9 EXAMPLE OF REGRESSION TREE

Example 1

} ‘ Buying_Price | Lug_Boot | Sufety | Maintenance_ Price?


(In thousand)

‘ | Low Small High 25


j | Low Small Low 30

R) Medium Small High 46


N High Small High 45
High Big High 52
High Big Low 23
Medium Big Low 43

Low Small High 35


Low Big High 38

High Big High 46


Low Big Low 48
Medium Smuall Low 52
Medium Big High dd

High Small Low 30

—— we

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications... SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

ES acti? Leaming (MU-Sem 6-Comp


> al Le. . . |

2ming with Regression and Trees ...Page no (4-39


ee
sis ndard deviation

ision tree is built wp top down 1 f, from root .


A decision .
with tae
similar values.
values. We use SD to calculate homo node and
geneity of
involved partitioning 6 th the data data iinto subsets that contain instances
5 = . a numerical sample.
= (x-p)

sD Reduction

It is based on the decrease in SD after a dataset is Split on an attributute, Constructing a tree is all about finding attribute
SDR.
that returns highest

p Step1:
SD (Maintenance_Price?) = 9,32

» Step2:

The dataset is then split on the different attribute.SD for each branch is calculated. The resulting SD is subtracted from
SD before split.

Maintenance_ Price(SD)
Low 7.78
Buying_Price Medium 3.49
High 10.87
SD (Maintenance_Price, Buying_Price) = P (Low) SD (Low) + P (Medium) SD (Medium) + P (High) SD (High)
5 4 5
=14 * 7.78 +74 * 3.49 +14 * 10.87 = 7.66

SDR = SD (Maintenance_ Price) — SD (Maintenance_ Price, Buying_Price) = 9.32 — 7.66 = 1.66

Maintenance_ Price (SD)

Small 9.36

Lug_Boot Big 8.37


SD (Big)
SD (Maintenance_ Price, Lug_Boot) = P (Small) SD (Small) + P (Big)
=14 7 Leys
= 8.86
x 9.36 + 14 x 8.3 7
= 0.46
tenance_ Price, Lug_Boot) = 9.32 — 8.86
SDR = SD (Maintenance_ P. rice) — SD (Main

Maintenance_ Price (SD)

High 71.87
Low
10.59
Safety
Safety) = P (High) SD (High) + P (Low) SD (Low)
SD (Maintenance_ Price,
=x8 787 4—6 x 10.59 = 9.02
. 14 10.5
14
= 0.3
ntenance_ Price, Safety) = 9.32 — 9.02
SDR = SD (Maintenance_ Price) - SD (Mal

[Ral rech-Neo Publications...A SACHIN SHAH Venture


(MU-New syllabus w.e.f academic year 18-19) (M6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
4
Learning with Regression and Trees ...Page no (4-40)
Machine Learning (MU-Sem 6-Comp) =

we select Buying_Price
SDR of Buying_Price is highest so
Buying_Price as our root node.

For example if there are less than five


To avoid over fitting we should terminate unnecessary building branches.
les: s than 5% of the entire data set. I prefer to apply the first one, 1
instances in the sub data set or standard deviation can be
this termination condition is satisfied
will terminate the branch if there are less than 5 instances in current sub data set. If
then i will calculate average of sub data set.
» Step 2(a):

_ Now we will consider the records of ‘High’.


Buying_Price | Lug_Boot | Safety | Maintenance_ Price?
(in thousand)

High Small High 45


High Big High * 52
High Big Low 23
High Big High "46
High Small Low 30

‘For Buying_Price = High, SD = 10.87

We will calculate SDR of only Lug_Boot and Safety


Maintenance_ Price (SD)
Small 75

——
Lug_Boot
; : Big 12.49
SD(High, Lug_Boot) = P (Small) SD (Small) + P (Big) SD (Big)

ES
2 3
=5% 75+ 5% 12.49 = 10.49

SDR = SD(High) — SD (Maintenance_ Price, Lug_Boot) = 10.87 — 10.49 = 0.38

Maintenance_ Price (SD)


‘ High 3.09
Safety Low 3.50
SD (High, Safety) = P (High) SD (High) + P (Low) SD (Low)
3 2
=5X3.09+¢ x 3.5 = 3.25

SDR = SD (High) — SD (Maintenance_ Price, Safety) = 10.87 — 3.25 = 7.62

SDR of Safety is highest so we select Safety as next node below High branch.
‘o For Buying_Price = High and Safety = High, we can Buying_Price
High ;
directly write down the answer. Low
Medium
o For Buying _Price = High and Safety = Low, we can
_ directly write down the answer. 46.3

— To write down the answer we take average of values of following records, High

o For Buying_Price = High and Safety = High,

Maintenance _Price = (45 + 52 + 46) /3 = 47.7

(MU-New Syllabus w.e-f academic year 18-19) (M6-14) el Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning (MU-Sem 6-Comp) Learning with Regression and Trees ...Page no (4-41)
ing_Price = Hi = :D: 30) /2 = 26.5
Dre Uh, q o For Buying_Price = High and Safety = Low, Maintenance _Price = (23 +

in ‘ey by, o hy, » Step 3:


ign ashy 4 Now we will consider the records of ‘Medium’
ThyMine hi! 4 i 1
Buying_Price Lug_Boot | Safety Maintenance_ Price?
4 (in thousand)
Medium Small High 46
| Medium Big Low 43
Medium Small Low 52
Medium Big High 44

Buying_Price
For Buying_Price = Medium, we can directly write
down the answer as 46.3. The answer is calculated by taking
the average of values of Maintenance_ Price for Medium
records (average of 46, 43, 52, and 44).

» Step4:

Now we will consider the records of ‘Low’.


Buying_Price | Lug_Boot | Safety | Maintenance_ Price?
(in thousand)

Low Small High -o.


Low Small Low 30 Module

“a3
Low Small High 35

tow | Big | Wien


TA”

Low Big Low


[8
48
mall
For Buying_Price = Low, SD = 7.78
Now only Lug_Boot attribute is remaining, so we can directly take Lug_Boot as a next node below Low branch.
- To write down the answer we take average of values of following records,
30 + 35) /3 = 30
o For Buying_Price = Low and Lug_Boot = Small, Maintenance_Price = (25 +
= (38 + 48) /2 = 43.
o For Buying_Price = Low and Lug_Boot = Big, values of Maintenance_Price
— Final Regression Tree is,
Buying_Price

Scanned with CamScanner


ee —————————
To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419 -
and Trees ...Page
Learning with Regression <= no (4-42)
Machine Learning (MU-Sem 6-Comp) —— =

Example 2
Outlook | Temp | Humidity Windy | Hours Play?
Day |
1 | Rainy Hot High False | 25
2 | Rainy Hot High True 30
3 | Overcast | Hot High False | 46
4 | Sunny Mild | High False | 45
5 | Sunny Cool | Normal False | 52

6 | Sunny Cool | Normal True 23


_7 | Overcast | Coo! | Normal True 43
8 | Rainy Mild | High False | 35
9 | Rainy Cool | Normal False | 38
10 | Sunny | Mild | Normal False | 46
[1 | Rainy Mild | Normal True 48
12 | Overcast | Mild) | High True 52
13. | Overcast | Hot Normal False | 44
#4 | Sunny Mild | High True 30

Standard deviation
the data into subsets that contain instances
A decision tree is built up top down from root node and involved partitioning
with similar values. We use SD to calculate homogeneity of a numerical sample.
I, 2

it | SD,S =\ 2ce—w) =9.32

SD Reduction
ute. Constructing a tree is all about finding attribute
It is based on the decrease in SD after a dataset is split on an attrib
an

that retums highest SDR.

» Step1:
ee Le

SD (Hours Play?) = 9.32

» Step2:
resulting SD is subtracted from
The dataset is then split on the different attribute.SD for each branch is calculated. The
SD before split.
Hours Play(SD)

Rainy 1.78
Overcast 3.49
Sunny 10.87
Outlook
+ P (Sunny) SD (Sunny)
SD (HoursPlay, Outlook) = P (Rainy) SD (Rainy) + P (Overcast) SD (Overcast)
4
=3 x 7.78 +74% 3.49 +a xX 10.87 = 7.66

SDR = SD (Hours Play) — SD (Hours Play, Outlook) = 9.32 — 7.66 = 1.66

el Tech-Neo Publications..A SACHIN SHAH Venture


(MU-New Syllabus w.e.f academic year 18-19) (M6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

=-——
4 4
=7Jq 14 x 10.51 +—14 x 8.95 +776 % 7.65 = 8.84

SDR=S
P (Hours Play) ~ SD(Hours Play, Temp) = 9.32 - 8.84 = 0.48
Hours Play (SD)
High
Humidity
on
8.3
SD (Hours
(H Play, a
Humidity)
=7
= p (High) SD (High) + P (Normal) SD (Normal)
a 7
=14 x 9.36 +55 8.=37
8.86

SDR = SD (Hours Play) ~ SD (Hours Play, Humidity) = 9.32 — 8.86 = 0.46


|
Hours Play(SD)
CU that hits False 7.87
Windy True 10.59
SD (HoursPlay, Windy) = P (False) SD (False) + P (True) SD (True)
8 6
= 14% 7.87 + 14 X 10.59
= 9.02

SDR = SD (Hours Play) — SD (Hours Play, Windy) = 9.32 — 9.02 = 0.3

id SDR ° outlook is highest so we select outlook as Modu


thr our root node.
|aNe Sunny Rainy 4 |
Overcast

> Step 2(a):


Now we will consider the records of ‘sunny’.
Outlook | Temp |:Humidity | Windy Hours Play?
Day
4 | Sunny | Mildi Highi _ False 45
ft (
Cool | Normal False 52
5 Sunny
% * ne |

6
Sunny Cool | Normal | True | 23
{|__—_—
|Oe
Sunn
10 ||su y y Mild Normal | False __| 46
nn

y
YL, |}
| Mia_|
14 | Sunny .
Hi
High True | 30

For Outlook = Sunny, SD = 10.87

SACHIN SHAH Vent| ure


Tech-Neo Publications...A
. -19) (M 6-14)
(
(MU-New Syllabus w.e.f academic year 18-19)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Learning with Regression and Trees ...Page no(4-44)


Machine Learning (MU-Sem 6-Comp)

We will calculate SDR of only Temp, Humidity and Windy


Hours Play (SD)

Cool 14.5

Temp Mild 731


SD (Mild)
.SD (Sunny, Temp) = P (Cool) SD (Cool) + P (Mild)
2
=5x145+5 3 _
x 7.31= 10.18
= 0.69
SDR = SD(Sunny) — SD (Hours Play, Temp) = 10.87 — 10.18

Hours Play (SD)

High 75

Humidity Normal 12.49


SD (Sunny, Humidity) = P (High) SD (High) + P (Normal) SD (Normal)

i i | ae 3B
=5x75+5% age
12.49 = 10.49

SDR = SD (Sunny) — SD (Hours Play, Humidity) = 10.87 — 10.49 = 0.38

Hours Play (SD)

False 3.09

Windy True 3.50


SD (Sunny, Windy) = P (False) SD (False) + P (True) SD (True)

=2x 3.09 +2x 3.5 =3.25


SDR = SD (Sunny) — SD (Hours Play, Windy) = 10.87 — 3.25 = 7.62

SDR of windy is highest so we select windy as next node below sunny branch.

o For Outlook = sunny and Windy = false, we can directly write


Rainy
’ down the answer.

o For Outlook = sunny and Windy = true, we can directly write


down the answer False

— To write down the answer we take average of values of following


records, .

‘oO. For Outlook = sunny and Windy = false, Hours Play = (45 + 52 + 46) /3 = 47.7

o For Outlook = sunny and Windy = true, Hours Play = (23 + 30) /2 = 26.5

"> Step3:
‘ , F x .
* . .
.Now we will consider the records of ‘overcast’. For Overcast we will directly write down the answer.

us w.e.f academic year 18-19) (M6-14) . Tech-Neo Publications...A’ SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning (MU-Sern 6-Comp) Learning with Regression and Trees ...Page no (4-45)
Day | Outlook | ‘Temp Humidity | Windy | Hours Play?
3 | Overcast | Hot High False | 46

7 | Overcast | Cool | Normal Truce 43


12 | Overcast | Mild High Truce 52
13. | Overcast | Hot Normal False | 44

The answer is calculated by taking the average of


values of Hours Play for overcast records (average
of 46, 43, 52, and 44),

> Step 4:
Now we will consider the records of ‘Rainy’.
Day | Outlook | Temp | Humidity | Windy | Hours Play?
I Rainy Hot High False 25
2 Rainy Hot High True 30
8 Rainy Mild High False 35
9 Rainy Cool Normal False 38
I Rainy Mild Normal True 48

For Outlook = Rainy, SD = 7.78 M odule


Now we will calculate SDR of only Humidity and Temp
Hours Play (SD)
Hot 2.5
Temp Cool 0
Mild 6.5
(Cool) + P (Mild) SD (Mild)
SD (Rainy, Temp) = P (Hot) SD (Hot) + P (Cool) SD
2 ! 2 -
#5 x25 +5 KO+5 x 6.5 =3.6

SDR = SD (Rainy) — SD (Rainy Play, Temp) = 7.78 — 3.6 = 4.18


Hours Play(SD)
a
a

High 5
Humidity
Normal 5

SD (Sunny, Humidity) =
P (High) SD (High) + P (Normal) SD (Normal)
2
=2 x5+5% 5=5
rs Play, Humidity) = 7.78 - 5
= 2.28
SDR = SD (Rainy) - SD (Hou

Tech-Neo Publications..A SACHIN SHAH Venture


(MU-New Syllabus w.e.f academic year 18-19) (M6-14)

Scanned with CamScanner


| To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419 °Y
(4 4p
(MU-Sem 6-Comp) Learning with 1 Regression and Trees ...Page no
Machine Learning
=
— SDRof Temp is highest so we select Temp as next node below rainy branch.

o For Outlook = Rainy and Temp = Cool, we can directly write down the answer.
the answer.
© For Outlook = Rainy and Temp = Hot, we can directly write down

© For Outlook = Rainy and Temp = Mild, we can directly write down the answer.

— To write down the answer we take average of values of following records,

o For Outlook = Rainy and Temp = Hot, Hours Play = (25 + 30) /2= 27.5

© For Outlook = Rainy and Temp = Mild, Hours Play = (35 + 48) /2 = 41.5

o For Outlook = Rainy, Temp = Cool, Hours Play = 38

— Final Regression Tree is,


Outlook

Mild

415

> 4.10 UNIVERSITY QUESTIONS AND ANSWERS

© May 2015
Q.1 Create a decision tree for the attribute “class” using the respective values. (Ans. : Refer Example 4.7.4) (12 Marks)

Eyecolour | Marrled | Sex | Halrlength | class


Brown yes male Long Football

Blue yes male Short Football

} Brown yes male Long Football

Brown no female Long Netball i


Brown no female Long Netball 8
Blue no male Long Football E
Brown no female Long Netball :
Brown no male Short Football F
| Brown yes female Short Netball P:
Brown no female Long Netball :
_ Blue no male Long Football
Blue no malo Short Football
»Q.2 Write short note on Issues in Decision Tree (Ans. : Refer section 4. 4)
(10 Marks }

MU-New Syllabus w.ef academic year 18-19) (M6-14) [al rech-Neo Publications...A SACHIN SHAH Venture -

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
a sLeamng (MU-Sem 6Coms)
: ne :
Leaming with Regressi ion and Trees .. .Page no (4-47
® expiain Regression kine, ScatsNet plot, Error in predic
andtio
Best npesting line. (Ans. : Refer section 4.1) (4 Marks)

Sgression Technique. (Ans. :


Refer section 4.1)
(5 Marks)

For a Sunbumn dataset


a 5 , given bel
“ ow + Qastuct a decision tree, (Ans.
: Refer Example 4.7.5) (10 Marks)
, | ame
— Hair
& | Height | Weigh
eig t | ;
Location | Class
1 sii onde |
es j Cur
Average | Light | No | Yes
| F | Sunita {
Ven on,
Blonde | Tal | Average | Yi€s
| j Anta | Brow | : 3 N Oo
ta Town Show | Average | Yes No
1 Lata | Blonde {| Short | Average No Yes
q | acha | Red | Averece | Heavy
ac
2 No Yes
| Maya | Brown | Tall | Heavy No No
| Leena | Brown | Average | Heavy | No No
; | Rina | Blonde =] Shot S| Light S| Sess No
| @6 Folowing table shows the midterm and final exam grades obtained for students in a datab
ase course. Use the
egression to predict the final exam grade of a student who received 86 in the midterm
exam. (Ans. : Refer Example 4.2.3) (10 Marks)
Midterm exam (Xx) | 72 | 50 | 81 | 74
| 9+ | es | 59 | 83 | a6 | 33 | 88 | a1
Final exam (Y) |s+|s3|77| 78 a 75 | 48 | 79 | 77 | 52 | 74 | 90

rite short note


Ah nts
on: Logistic
so
Regression. (Ans. : Refer section 4.3) (10 Marks)

Module
}
= ‘
Q@8 Fora Sunbum deteset given below, construct @ decision tree For the following
determines which atiribute is root attribute and generate two
(Ans. : Refer Example 4.7.6)
data, Calculate Gini indexes and
level deep decision tree.
(10 Marks)
alll
Defaulting | Creditscore | Location Give Loan?
Sr.No. | Income
High High bad No
- 1 { Low
| Low High High good No
ee 2
High High High bad Yes
Be 5
Medium Medium High bad Yes
i ; |
Low bad No
Bs
e. =5 | Medium Low Low good Yes
Beka, §
: 6 | Medium Low good Yes
.”—d*~S , Low
‘sae
t r
ee
=
bad No
N
High
=
High
||
3}
4
8 _Low Medium
Low Low bad 9
bad No
Yes
Low
| og
10
fe
| Medium Medium good
tedium Low
| tT Low | ___ Medium __ High good Yes

|e | High | MR Low bad ne


13 {High | SS good Yes
|__-High_
14 | medum __
| _Medum
fel Tech-Neo Publications..A SACHIN SHAH Venture
Syllabus w.ef academic year 18-19) (6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning (MU-Sem 6-Comp) (Learning with Regression and Trees ...Page no (4-48)
the steps.
Q.9 — , Explain how regression problem can be solved using St leepest descent method. Write down

(Ans. : Refer section 4.3)

>. Dec. 2019

Q.10 Given the following data for the sales of car of an automobilei company for sixi consecutive
i years. . Predi ct the sales for
“next two consecutive years. (Ans. : Refer Example 4.2.4) (10 Mar ks)
Years | 2013 | 2014 2015 | 2016 | 2017 | 201 8
Sales | 110 | 100 | 250 | 275 | 230 | 300

: Explanation : Y = a + bx where a is the Y-Intercepy


Multiple Choice Questions and b is the slope of the line. a and b coefficients are
; required. :
Q.4.1 Linear Regression is represented by following equation Q.4°5 Linear regression belongs to which type of oar |
. .
(a) theta : bX' where a is X-intercept and b
ou. dbis Slope of learning algorithm?
is lope (a) Supervised Learning (b) Hybrid Learning q

(b) Y = a + bX where a is the slope of the line and b is (c) Unsupervised learning ;
X-Intercept (d) Reinforcement Learning v Ans, : (a)
(c) Y =a + bX where a is the Y-Intercept and b is the Explanation : Best fitting line, is drawn from input and
‘Slope of the line output points which is used for prediction.
(d) Y¥ =a + bX where a is the slope of the line and b is Q.4.6 Choose the appropriate method used to find best fit the :
the Y-Intercept Ans. : (c) data in logistic regression.
Explanation : Definition of linear regression
(a) Jaccard Distance (b) Least Square ame
A correlation between age and percentage of getting
(c) Maximum likelihood (d) Pearson coefficient
infected with COVID-19 is — 1. What you can interpret ;
¥ Ans. : (0) FB
‘rom this _ . . ; Explanation : Maximum likelinood(probability) is used
(a) Age is a good predictor and is negatively correlated E
to decide the class in logistic regression. i
(b) Age is a good predictor and it is positively correlate
d .
(c) Age is not good precitor and it is negatively Q. 4.7 Choose the “propriate method used to find best fit the
conelated » data in linear regressi on. a
" @) Age is not good predictor and it is positively related (a) Least Square error (b) Maximum likelihood
Y Ans, : (a) (c) R-square (d) Pearson coefficient
Explanation : Age is a important factor
in COVID SO,
Wwe can say age is a good predictor and it is negatively aa
correlated. Explanation : Least square error is used to draw
regression line. ay
In linear regression, which plot should be used to Q.4.8 On given data logistic regression model is build. Tt has
Fepresent the relationship between dependent variable
and independent Variable? training accuracy as X and testing accuracy as Y.
features are added in this already build model how Ifitwill
new
Se ~@) Box plot (b) * Bar graph effects on accuracy. Choose appropriate one.
+, © Scatter 4) Plot Histogram — ¥ Ans. : (c) (a) Testing accuracy will decrease ;
—renation 3 Scatter plot is used to Plot predicted
Output vs j : (b) Training accuracy will decrease
: .
ha ae (c) Testing accuracy will increase Hf ne
How Many ‘coefficients do you need to estimate in a (d) Training accuracy witl increase or remain sam
Simple linear regression model.?
0 () s ©) 1 @ 2. Y Ans. : (d) Explanation : If we add more features to a train
‘then training accuracy will increase
or remain same
We academic year 18
-19) (M6-14)
Tech-Neo Publications...
A SACHIN SH/ Vent

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

m of Leaming wiWith
To find the minimum or the maximu Regression and Trees ...
Page no (4-49
to zero because
a
function, we
sett he gradient Q. 4.15 Find the Statement which
i is true in case of gini inde
(a) The value of the gradient at extrema of x
(a) The gingini jindex is pj
is always equal to zero the function attribute is biased towards multivalued
(b) Depends upon the type of problem (b) Gini lindex
i does not favours equal sized partitions
(c) The value of the gradient at extrema of th (c) siardidoas,
Gini inde X favour t €st when impurity
i one is there 4in both
e function
is always equal to maximum
‘)
(d) The value of the gradient at extrema of the functi (d)d When the number of classes is large Gini index
is a
Netion good choice.
is always equal to minimum YA Y Ans. wt: (a)(a
Ns. (a) Explanation : Property of gini index method.
Explanation : Property of gradient,
Q. 4.16 The Family of Decision Tree learning algorithm is
Q. 4.10 Which
a of the following iss a disad
Sadvantage of decisi
cision (a) Unsupervised learning model a
(b) Supervised learning model
(a) Factor analysis
(c) Stochastic leaming Model
(b) Decision trees are robust to outliers
(d) Reinforcement learning Model ¥ Ans. : (b)
(c) Decision trees are prone to overfit Explanation : In decision tree training data contains
Ne
(d) Decision trees are not prone to overfit input as well as output.
Y Ans. :(c) Q. 4.17 Temperature Prediction is
Explanation : If we continue to split all the branches (a) Classification problem (b) Regression Problem
then it will lead to overfitting. (c) Clustering problem (d) Astrological Problem
4)
Find the statement which is not true in case of gini index.
¥ Ans. : (b)
nd Q. 4.11
Explanation : Since we are predicting temperature
(a) Thegini index is biased towards multivalued attribute
(numerical value).
he
(b) Gini index does not favour equal sized partitions
(c) Gini index favour test when purity is there in both Q. 4.18 How many Coefficients are needed to estimate a simple
partitions. linear regression model with one independent variable ?
&) 2 @3 @d4 v Ans. : (b)
Gini index is not (@~l
(d) When the number of classes is large we
¥ Ans. : (b) Explanation : Along with one independent variable
)
a good choice.
size partitions. will require Y-Intercept and slope of the
line. Module
Explanation : Gini index favours equal price of house
Q. 4.19 You want to design a system to predict the
actual is nearby
0.4.12 Which one of these is not a
tree based learner? and after prediction if predicted and
ed for the loan
(b) D3 same then deal is registered and process
(a) CART deed is done. The
he
(d) Random forest approval. Once loan is approved sale
(c) Bayesian Classifier : (c) selling price of a house depends on
the following factors.
v Ans. of bedrooms, number
For eg. It depends on the number
on baye’s the year the house was
Explanation : Bayesian classifier is based of kitchen, number of bathrooms,
of the lot. Also the facilities
theorem. built and the square footage
‘a) Loan sanction depends
that are available near the house.
ent cas es is___— customer. Given these factors,
Ww Q. 4.13 Finding the number of Covid Pati on the credit history of of
of the house is an example
(a) Classification model (6) Astr
Regrolog on one
essiical Mode
predicting the selling price
task and also what
process you will follow to
(d)
as (c) Clustering Model Y Ans. : (b)
which
design? Classification, deal
aw . ients ( surveys Binary
ill of pats (a) Locality roval, sale deed
Explanation : Since we are finding number dit sco ring, loan app
registration, cre l classification, deal
numerical output). survey, Multileve
(b) Locality it scoring, loan approval, sale deed
registration, cred sion,, deal
rat, Logistic regression is __— survey,
!
Simple linear regres
dee d
(c) Locality Joan approval, sale
(a) Supervised Regression gi st ra ti on , credit scoring, deal
re regressio n,
Multiple linear sale deed
(b) Supervised Classification (d) Locality survey, t scoring, loan approval,
fe gistration, credi Y Ans. : (d)
(c) Unsupervised Learning
(d) Reinforcement learning
on iso AH Venture
Explanation : Logistic regressi pu bl ic ations..A SACHIN SH
output is 4! Te ch -N eo
deciding class. In this expected [e l
6-14)
~ (MU-Ne
New Syllabus w.e.f academic year 18-19) (M

|
Scanned with CamScanner
To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

L titay Willy Legian


ast ad Chetty Age Tt (A:b0)
Machine Leaming (MU-Sem 6-Comp) deanna aan Wetted itt Wut dae teal
Te hid Wi
Multiple Hint gE t a ly
Explanation: Locality wane, foe pratent ten woes VE THe $yfetinnie (CMe TULIA
senting htt TON CITA Eig
regression, deal Wepistration, eayatit enaeanent OE Bete naperE ATH
tea t sa ve d
(uititetic XY awd yricatahioith anbeveygt OV UNE Ala UTA At hed, inti Taye Ceo Uti Wit
approval, sale deed As price
thant on . Ninel, ele thie (elle,
(0/1) is predicted based ant mow Abvavnn cyan Etat teaplet vliverltien Cee
Li atl:
Mtv alitaset ae ventional betaine, Prapuens the PM
Q. 4.20 Suppose there are five instaners, \ DAA eli it [nt wyeyy
shown a the taille vases ACUTE (tect teailin t Kolaviney Guaebatter
having three Feats. ps and ay Rove, Caan roe deer) peerletl, wate WHE era
ee att rt action
below !
itvvnield evict ast Hwee rive, ait tlh heer WHlelt Why
Instances: Dy au]? N Tt
tthe tuesdavnennny Heer ad ete EO
| : Lo} dif al povllewal label a Ade (eet paced Ela eitenye titty
he lait vlvedttars A UIE yanne cessed fo mec Tee pail
2 ad ly
anny votre qarcyponaend Whonnne’
3 we 40 MD Vv Nau
E (il)
md ub we) Aol é
d AL] AT] 28 ned te eeacateles itn Dat Tita
Hyptoation boca
5 §.0 | S27 08 (est Erte én 1)

In order to find the dependence between two varlables Q. aad


la
Phe selling pote ot ie tenia atepreentee ce Tie Hal liye
we use the Pearson's Conchition factory, Bor axnunple, lt alopuentil an the ainber ad
Coefficient. Based on your understanding of Corelation horlrvonia, Aber ab kite wiiber af nvteecncatiiy, thie
Coefficient, choose the corre! tyqe OP tht
yet (le Treacy wit Prvatt avnved anes peqqtaane! Coeati
(a) A strong positive convlation between pam q fot Giver thee fineland, prediediayy the welling, plea al
(b) A strong negative correlation between |! and qa tho homes ome evimphe at whteh ade?
(c) A Weak positive correlation between p and 6, (0) Wnty eleadtieatian

(d) A weak negative cormelation between pam t. (bh) Matiilabel ela tiation
¥v Ans. ttn) (8) Sdnypte Cia nila
Explanation : Correlation coctlicient: value of extetly GD Muatiple tiene nyeadtan ¥ Annet (l)
1.0 means there is a perfect’ positive relationship Baplanatton r Shree we ane peddle tion vee tine garlic Ht ba
between the two variables. Fora positive inerease tone a evample ot ramadan, Selly peter alegre ant
variable, there is also a positive increase in the second trvmaltdpte: Cetin so TE RHI ate Hiner recenein,
variable. A value of -E means there ix a perfect uepative
relationship between the Qwo variables, ‘This shows (hat Q. 4.24 A feat FE ean take certain vatner Ay Fh GD and TE
the variables move in opposite directions for a positive rypresents grade ot tents Crane cottyye, Wiloleth al tlie
increase in one variable, there is vdeerease in the second followliy statement by true ti fallow ligy eae?
variable. While the strength of the relationship: varies: in QW) Beate PE is ar exdaat nenidinal var titihe,
ytte
degree based ow the absolute value of the correlation (b) Poattne EE Idan exiiiple ab ontinal vielubles
coefficient. Ce) Hedoesnt belong tomy of the above eiitegorys
(Doth of these ¥ An 1(4)
Q. 4.21 The sales of a company (in thousands) for each month
are shown in table below : Eaphiuation : Ordlaat ville ane the varlibtes whilvlt
Find least square regression line, y = ax +b, Use Tine asa have some order In thede categorie, Hor example, peal
model to estimate the sales of company in the 6" month, A should be considered ay tiple goracte thin grade Hh,
X(month) | | | 2.{ 3 | 4] 5 Q, 4.25 Whiel of the fatlowliy bv tiie far a deetslon tee?
Y(sales) | 12} 19 | 29] 37] 45 Qt) Adeelston give tea exaiple oft Ulnvear alow un
(a) 84 (b) 60 (b) The catropy of anode qypleatly: decreas a We ay
(c) 74 (d) 80 ¥ Ans. (0) dlown a deedton tree,
Explanation : 84, from calculation of regression Ine (©) Hitropy ts rieastine of gndty,
Y=aX+b (Aw attribute with tower mutant information alyouild
¥ An th)
be preferred to other ati bates,
4 22 Suppose you want to develop multi classificntion
Exphimitlon Property ofa deelton tive,
problem and you are only allowed to use binary logistic
» classifiers to solve a multi-class classification problem,
Given a training set with 2 classes, this classifier_cun cust EIS

e
rs} Tech-Neo Publicatlons..A SACHIN SHAH Ventur

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
a nemeecias

Machine Learning (MU-Sem 6-Comp)


og

Learning with Regression and Trees ...Page no (4-51)


oe,

——S———————
Suppose you gol a situation where you find that linear
Q, 4.26 IG(S, Road Type) = 0, IG(S, Speed Limit) = |
egression model is underfitting the data in such situation
Since the decision tree is constructed on the feature
which of the following options woluld you consider?

having maximum information gain, (a) is the correct


ot A

(a) Will add more features


ee

answer.
(b) Will strat intoducing higher degree features
Q. 4.30 Consider a simple linear regression model with One
FT
eee Fe

' (c) Will remove some features


independent variable (X). The output variable is Y. The
(d) None of above Y Ans, : (a) equation is : Y = aX + b where a is the slope and b is the
Explaaation : In case of under-fitting, you need to intercept. If we change the input variable (X) by 1 unit,
we

induce more features in variable space or you can add by how much output variable (Y) will change?
aa

some polynomial degree variables to make the model (a) 1 unit (b) By slope
more complex to be able to fit the data better,
(c) By intercept (d) None v Ans. : (c)
Consider the dataset, S given below : Explanation : Equation for simple linear regression:
Q. 4.27
Y =a+bxX. Now if we increase the value
Elevation | Road Type | Speed Limit | Speed + 1) Le.
ofX by | then the value of Y would be a + b (x
steep Uneven Yes Slow value of Y will get incremented by b.
steep Smooth Yes Slow y
Q. 4.31 The following table shows the results of a recentl
correlat ion of the number of
flat Uneven No Fast conducted study on the
acute
hours spent driving with the risk of developing
steep Smooth No Fast fit line for this
backache. Find the equatio n of the best
Elevation, Road Type and speed Limit are the features data.
and Speed is the target label that we want to predict.
No of hrs(x) | Risk on a scale {y)
Find the entropy of the dataset, S as given above:
10 95
(a) 0.5 (b)O (c) 1 (d) 0.7 v Ans. : (c)
9 80
Explanation : For a dataset, S with C many classes, the
entropy of the set, S given by H(S) is defined as : 2 10
\ H(S) = X p, log p, where p, is the probability of an 15 50
the element of S belonging to a class. In this case,
10 45
wyatt P(slow) = 0.5; P(fast) = 0.5 and hence H(S) = | Module
ES 16 98 nde
Q. 4.28 Find the information Gain if the dataset is split at the
11 38
eA feature “Elevation” :
0.675 (d) 0.325 v Ans. : (d) 16 93
antl (a) 1 (b) O(c)
The feature, Elevation has 2 Choose which of the options is correct?
vias Explanation
a split on the featur e
values = {Steep, Flat]. For (a) y =3.39x + 11.62 (b) Y=4.69x + 12.58
a having feature
Elevation, one subtree would be of inputs (c) Y =4.59x + 12.58 (d) Y=3.59x + 10.58
g featur e, Flat. For the feature,
we Steep and the other havin Y Ans. : (c)
5 les, out of which P(slo w) = 2/3
Steep, there are 3 examp
ni py (Steep ) = 0.9 and Explanation : For each x calculate the value of Y using
and P(fast) = 1/3 Thus, Entro
the given equations. Then calculate error for each
Entropy(Flat) = 0.
equation. Equation with the lowest error is the desired
(1/4) *0)
Thus Information Gain = | — ((3/4) * 0.9 + answer.
= 1 — 0.675 = 0.325
Q. 4.32 Pruning is a technique that reduces the size of decision
Q. 4.29 Find the feature on which the parent node must be
trees by removing sections of the tree that provide little
chosen to split the dataset, S based on information gain power to classify instances. This is done in order to
avoid
(a) Speed Limit (b) Road Type
(a) overfitting (b) underfitting
(c) Elevation
Y Ans. : (a)
(c) Both (d) None v Ans. : (a)
we
Explanation : Using the Information Gain formula, Explanation : Pruning reduces the complexity of the
s.
can find the information gain for each of the feature final classifier, and hence improves predictive accuracy
The values should be IG(S, Elevation) = 0.325, by the reduction of overfitting.

[B) Tech-Neo Publications...A SACHIN SHAH Venture


(MU-New Syllabus w.e.f academic year 18-19) (M6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
Learning with Regression and Tr ees...
Machine Learnin MU-Sem 6-Comp
For Explanation: When model is trained using infinite
Q. 4.33 We define a concept as the conjunction of literals. number of training examples then the model exhibits
t
the: given instance space, X, find the specific concep
lower variance in data.
using Find-S algorithm.
| Sky: | Airtemp | Humidity,| Wind | Enjoys port Q. 4.39 As the number of training examples goes to infinity,
your model trained on that data will have
sunny warm High Strong Yes
(a) Lower bias (b) Higher bias
sunny warm ’ High Strong Yes Y Ans, (c)
(c) Same bias
rainy cold. Normal | Strong No Explanation : There will be no changes in bias.
‘| sunny warm High Weak Yes
Q. 4.40 The error function most suited for gradient descent using
cloudy cold Normal | Strong No logistic regression is
‘ means don’t care and ‘0’ indicates no value (a) The entropy function
(a) <Sunny,Cold,High,Strong> (b) The squared error
(b) <Sunny, 0, ?, Strong> (c) The cross-entropy function
(c) <Sunny, Warm, High, ?> (d) The number of mistakes ¥ Ans. : (c)
(d) <Sunny, 0, Warm, Strong> Vv Ans.: (c) Explanation : For logistic regression, the cross-entropy
function (loss function or cost function) is convex. A
‘ . Explanation : For Sunny,Warm,High,? The class is Yes.
convex function has just one minimum; there are no local
Find the number of instances possible in X using the minima to get stuck in, so gradient descent starting from
possible values that can be seen in the table given in any point is guaranteed to find the minimum. Since the
above Question. Cross Entropy cost function is convex a variety of local
(a) 12 '(b) 48 optimization schemes can be more easily used to
(c) 36 (d) 24 v Ans. : (d) properly ininimize it. For this reason the Cross Entropy
Explanation :we count possible values for each attribute cost is used more often in practice for logistic regression
and multiply them, 3*2*2*2 = 24. than is the logistic Least Squares cost.

Q. 4.35 Find the number of syntactically distinct hypotheses Q. 4.41 Regarding bias and variance,. which of the following
given in the table Question 33. statements are true?

Aint : Each attribute can have 2 more values : ? and 0 (a) Models which overfit are more likely to have high
bias
(a) 160 (b) 320
(c) 80: (d) 40 Y Ans. : (b) (b) Models which overfit are more likely to have low
bias
Explanation : For each attribute possible values as
(c) Models which overfit are more likely to have high
3*2*2*2 and each attribute can have 2 more values so,
variance
5*4*4*4 = 320
(d) Models which overfit are more likely to have low
Q. 4.36 We can get multiple local optimum solutions if we solve variance Y Ans.
: (b), (c)
"a linear regression problem by minimizing the sum of
Explanation : The bias of a classifier gets reduced when
squared errors using gradient descent.
the training set error lowers down to zero causing low
(a) True (b)- False Y Ans. : (b) bias, while due to overfitting che gap between the training
Explanation : we do not get multiple optimum solution. error and test, error becomes higher, causing high
_ Q. 4.37 When a decision tree is grown to full cept, it is more
variance.
likely tofit the noise in the data. . Q. 4.42 Consider a binary classification problem. SupposeI have
(a) True. (b). False. v Ans. : (a) trained a model on a linearly separable training set, and
. Explanation : As tree grows to full depth, possibility of now I get a new labeled data point which is correctly
Tedundant and noisy data gets increased. classified by the model, and far away from the decision
boundary. If I now add this new point to my earlier
_ As the number of training examples goes to infinity, training set and re-train, in which cases is the leat
your model trained on that data will have
decision boundary likely to change?
(a) Lower variance (b) Higher variance (a) When my model is a perceptron. :
(c) Same variance Y Ans. : (a) (b) When my model is'logistic regression. | if

. (MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A'SACHIN SHAH Venture

Scanned with CamScanner


nn cde ve cp nn eS
To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Sem 6-Comp)
ine Learning (MU- Learning with Re ression and Trees ...P
(c) When my model is an SVM, age no (4-53
(d) When my model is HMM. ExplPlanatio
ion n : After addiing a feature in
¥ Ans. : (b) feature spac
e,
wh
y ether that feature iiss i Important or
unimportant features
Explanation : If we add new point to already trained e R-squared always increase.

Fr wth th ain pence, pe


model and we are using logistic regression then the
Q. 4.47
decision boundary will change.
sion tree algorithm?
Q, 4.43 When doing teast-squares regression with regularisation 1, Number of samples used for split
(assuming that the optimisation can be done exactly), 2. Depth of tree
increasing the value of the regularisation parameter
3. Samples for leaf

(a) 1 and2 (b) 2and3
(a) will never decrease the training error.
(c) | and3 (d) Can’t say ¥ Ans. : (d)
(b) will never increase the training error.
Explanation : For all three options a, b, and c, it is not
(c) will never decrease the testing error. necessary that if you increase the value of parameter the
(d) will never increase the testing error. Y Ans. : (a) performance may increase. For example, if we have a
Explanation : Increasing
the value of regularization very high value of depth of tree, the resulting (ree may
parameter will not decrease training error. overfit the data, and would not generalize well. On the
other hand, if we have a very low value, the tree may
Q. 4.44 High entropy means that the partitions in classification underfit the data. So, we can’t say for sure that “higher is
are better”.
(a) pure (b) not pure
Q. 4.48 In Regression tree which method is used
(c) useful (d) useless Y Ans.
: (b)
(a) Gini index (b) ID3
Explanation : Entropy is a measure of the randomness in
(c) Standard deviation reduction
the information being processed. The higher the entropy, Y Ans. : (c)
(d) None of above
the harder it is to draw any conclusions from that
Explanation : Standard deviation reduction is calculated
information. Bt is a measure of disorder or purity or is
less for all attributes and based on that values tree
unpredictability or uncertainty. Low entropy means
constructed.
uncertain and high entropy means more uncertain.
four attributes plus Q. 4.49 In Regression tree output attribute is
Q. 4.45 A machine learning problem involves
and 2 possible values (a) discrete (b) categorical
aclass. The attributes have 3, 2, 2, v Ans. : : (c) . 4
values. How many (c) continuous (d) all of above
each. The class has 3 possible
examp les are there? Explanation : In regression tree output is continuous
maximum possible different
(b) 24 whereas in classification tree it is categorical.
(a) 12
(d) 72 ¥ Ans. : (d)
(c) 48 Q. 4.50 Another name for output attribute
ble different examples are
Explanation : Maximum possi ute and
(a) predictive variable
ble values of each attrib
the products of the possi (b) Independent variable
3*2*2*2*3=72
the number of classes; (c) estimated variable
a linear regression
Q. 4.46 Adding a non-important feature to (d) dependent variable Y Ans. : (a)
in.
model may result Explanation : Since we predict the output attribute.
re 2. Decrease in R-square
1. Increase in R-squa
(b) Only 2 is correct
(a) Only 1 is correct
(d) None of these ¥ Ans. : (a)
(c) Either | or 2
Chapter Ends...
o00

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

support Vector Machine : Maximum Margin Linear Se


Parators, Quadratic Programming solution to finding maximum
margin separators, Kernels for learning non-linear func tions.
Clustering : Expectation Maximization Algorithm » Supervised learning
after clustering, Radial Basis functions.

Rule based classification


" 541 Example of Rule based Classifier...
5.1.2

5.1.3

5.1.4

52 Classification by Backpropagation ....ccsscsestenstatatatntntntatensiisieianenunauisipusiessiiieeseeeecs 56


5.21 Generalised Delta Learning RUC .....sseeceseescssrsenssntentatiatnsinprnietnstiniiatistinisesisstieseseesees
eee 5-6
5.2.2 Error Back Propagation Training
53 Bayesian Belief Network... esusescsssssssasecseceseeecees.
5.3.1 BayeS TREOPEM.........ceceseeseessesssecnssneesssssssssessssesusanecsuserereemeesess

54

5.4.1 Markov MOdels......ccesssssssesesssess


UExample 5.4.3 EUIURA VEN MEMMU ME ..--.s.essccessscesssessessettnsessastnsesuerstsssssastesanestastsstssssiuaaseseseuseeeecc. 5-16

SE
Scanned with CamScanner
To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

5.4.2 Main issues usi


ng HMMs

5.5.1. Maximum Margin Linear Separators


. + ar
ator deneeeeeessenssens
vencecenenmarneesees eet
5.5.2 Quadratic Programming Solution to Find Maximum Margin Sep
eves “Saay
5.5.3 Kemels for Leaming Non-Linear FUNctions .cvesscsscessen ese

5.5.4 Rules for the Kernel FUMCtion ..csssccsssssesssecvsesseseesernsennrnnnnn eT

5.6

5.6.2 Hierarchical Clustering


nessestsernrenr
rene csn
nen oceneessn e
TES
ETC
SIME UONPY
UExample 5.6.9 PUIUBU
UExample 5.6.10 BXTUBSDUE) me RAi ORE UCD «------esceeeererererrrerrtreet

5.6.3 Expectation - Maximization Algorithm .ve-cmwere er smnmuertnsnttnintnnttttntn, 5.51


5.6.4 Supervised Learning after Clustering.....-.--sesscsscrererrt
5.6.5 Radial Basis Function ~
5.6.6 RBF Learning Strategies
5.7

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

{ classifier classifies records usin


whasc¢ g 4 set of IE-THE :
pole ; *SSLOTTIE-THEN rules, The rules can be expressed in the following form
condition) + :
\ ynere LHS of above rule is known as an antecedent or condition
RHS of above rule is known as rule consequent

condition is a conjunction of attribute, Here Condition consists Of one or more attribute tests which are logically
ANDed-
y represents the class label

assume a rule RI,


al: IF Buying_Price = high AND Maintenance_Price = high
AND Safety = low THEN Car_evaluation = unacceptable

gule RI can also be rewrilten as:


: +a — high) A . . : .
gi; (Buying_Price = high) * (Maintenance_Price = high) * (Safety = low) —+ Car_evaluation = unacceptable
Ifthe antecedent is true for a given record, then the Consequent is given as output.

2 5.11 Example of Rule based Classifier

Buying_Price | Maintenance_Price Lug Boot | Safety | Evaluation?


High High Small | High | Unacceptable
High High Small Low | Unacceptable
Medium High Small | High | Acceptable
Low Medium Small High | Acceptable
Low Low Big High | Acceptable
Low Low Big Low | Unacceptable
Medium Low Big Low | Acceptable
High Medium Small High | Unacceptable
High Low Big High | Acceptable
Low Medium Big High | Acceptable
High Medium Big | Low | Acceptable Module
Medium Medium Small Low | Acceptable 5
Medium High Big High | Acceptable
Low Medium Small Low | Unacceptable

- RI: (Buying-Price = high) * (Maintenance_Price = high) “ (Safety = low) — Car_evaluation = unacceptable

- R2:(Buying-Price = medium) * ( Maintenance_Price = high) * (Lug_Boot = big) — Car_evaluation = acceptable

- R3:(Buying-Price = high) * (Lug_Boot = big) + Car_evaluation = unacceptable

- R4:(Maintenance_Price = medium) * (Lug_Boot = big) * (Safety = high) — Car_evaluation = acceptable

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
MachCihin
e Learnin
MU-Sem 6 “
-CoCom
7 5.11.2 o
Application of Rule based Classifier

medium
Car | triggers rule
R1 unacceptable
Car 2 triggers rule
R2 > acceptable
Car 3 triggers both R3
and R4
Car 4 triggers none of
the rules
ran 5.1.3 .
Characteristics of Rule based Classifier
1, Mutually Exclusive Rules : Rule based Classifier comprises of mutually exclusive rules where the Tule
ate
in dependent of each other, Each and every record is covered by at most one rule.
Solution : Arrange rules in the
order
Arrangement of Rules in the Ord
er
Rules are assigned a priority and based on this they are arranged and ranks are associated. When a test record is Blvenas
a input to the classifier, a label of the class with highest priority triggered rule is assigned. If the test record does not trigger
any of the rules then a default class is assigned.

Rule-based ordering

In rule based ordering individual rules are ranked based on their quality.

— RI: (Buying-Price = high) (Maintenance_Price = high) * (Safety = low) > Car_evaluation = unacceptable

— R2: (Buying-Price = medium) * (Maintenance_Price = high) * (Lug_Boot = big) + Car_evaluation = acceptable


-— R3: (Buying-Price = high) * (Lug_Boot = big) + Car_evaluation = unacceptable

— — R4: (Maintenance_Price = medium) * (Lug_Boot = big) * (Safety = high) — Car_evaluation = acceptable

Car | Buying_Price | Maintenance_ Price | Lug_Boot | Safety | Evaluation?


3 High medium big high 9

— Car 3 triggers rule R3 first — unacceptable

Class-based ordering

In class based ordering rules which belong to the same class are grouped together.
— R2: (Buying-Price = medium) * (Maintenance_Price = high) “ (Lug_Boot = big) ++ Car_evaluation = acceptable

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Ventue

Scanned with CamScanner


7
=
To Learn The Art
: Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
: 0a 9 With Classi;
BS (a
Maintenance: .
bo .
= dium) (Lug_ Boot == big) 4 (Saf
e: e mel ety Te=hieeeeen
uyiinno-gPr-Picreice == hi hig i‘nt
C _
callon and Clustering Page!
RI: gh) * (Mainten P r i ce = high ~Y
= h) ~Ca i mace
ance ) » (S afety = | lo w) — C a

r_evaluati =
pp... on unacceptab
Ra: Car Buying Price
big) ~ Car_evaly at ‘
ion = Un
Cx
acceptable
s an
le
Maintenance Pri
3 High ice
Safety | Evaluation?
st triggers rule R4 first + acceptab
le
high ”
exhaustive Rules : It said to have a
complete
snibute values combination,
7 ae *
Every instance is ro
CoVerage for th e
ed Classifier if it accounts for each doable
fed with a mi mab
nimum ofe one
tule.
ns + Usea default class
gotutio
m 5.1.4 Building Classification Rules
, pirect Method : Sequential Cov
ering Algorithm
Training data is used
to extract IF-THEN
Tules in
it is not requii re
d toc on struct
a decis 10n t tree Initia
1 | y . In sequentia
iven class ers § l covering
covers m num algorithm
maximu number of ’ C )
the records .
of t lat class,
1D.

. as records. This is often as a result of the path to every


leaf in a decision tree corresponds to a rule.

The Decision tree induction will be thoug


ht of as learning a group of rules at the
same time.
Algorithm

(a) Start using an empty rule.

(b) Grow a rule using the, learn one rule at a time method. Either we can use general to specifi
c or specific to general
strategy.

(c) Remove training records covered by the rule, otherwise the next rule will be exactly
same as that of the previous
tule.

(d) Repeat above two steps until stopping criteria is reached. Stopping criteria can be computation of Module
significant gain.
2. Indirect Method : From Decision Trees 5
In the Indirect method we will learn how to generate a rule-based
classifier by extracting IF-THEN rules from a
decision tree.

Points that we need to remember while extracting a rule from a decision tree :

© For each path from the root to the leaf node one rule is created.

o Each splitting criterion is logically ANDed to form a rule condition or antecedent.

© The leaf node represents the class prediction, forming the rule consequent.

MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

% 5.2.1 Generalised Delta Learning Rule erj (hidden) Layer k


La

th e weigh ¥ (0Uutp
We will derive a general expression for Layer! (input)
not an output Z
that 1s
increment AV, for any layer of neurons —s |
layer pa |
Iti layer
The above is a simplified diagram for the mu put) Layer j (hi den) Layer k (outpyy
perceptron network which represents the three lay
er input
payer i (in the outpy, a
(i), hidden (j) and output (k)- laye rand O represents the
e hidden W represents the weight 7
he output of th rand Clo
At the input layer input Z is applied, Y represen's t and the hidden
lay¢
present between the input
output layer. V is the weight vector
present between the hidden and the output layer.
(5
layer
According to —ve gradient descent formula for hidden
AV, = -ndE/dV,
We miy write the term dE /d Vj, as +(5.29
“<)
dE/d Vi = dE/dnet, * dnet/dV,
resented as dY,
can be rep
(dE /dnet, ) is the error signal for hidden layer which
i.e, Zi
dnetj/dV,, represent the input applied at the input layer
we ue
By substituting this values in Equation (5.2.2) and Equation (5.2.1)
AV, = 1 *dY,*Zi “—
As we are saying that dY, = — dE/dnet,

We can write this term as


B24
dY, = —dE/dY, * dY,/dnet,...
js the differentiation of error w.rt
nd term is nothi ng but the fj (net) and the first term
In the above cquation the seco
As we know error is given as E = (d - oy
dEMY, = d/dY,(%¥ (4 - f(net, (y)’ )
~¥ {d, - O, ) dd, f (net, (y))

~¥ (dO, ) fF Cnet, J * Cd (net, YAY)


WW

as Wk, which is
ry . .

ent In the output layer and d (net, /dY,


f
i

(net, ) = d,, whic h is



the error

pres
We can write (d, - O,) * f
,

the hidden and output layer.


the weight vector present between

Thus we will get


dEAY, = - ¥ da Way

(4) we will get


By substituting this in Equation
dY, = (Dd Wk) fi (net)
year 18-19) (M6-14) Tech-Neo Publications..A SACHIN SHAH Venture
(MU-New Syllabus w. e.f academic

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
cubstituting this in Equation (3) we Will get
-
By? _ ats
AV,, = n*Zi* fj (net,) * r dy, Wkj)

pete ff (net) =¥’(l- 0°)


for bipolar and Oc.
W O) for unipolar function. This is
the Generalised Delta learning
rule.
5.2.2 Error Back Propagation Training
Bo
pack propagation, the network learns durin
r g atraining
iy Eregist the input is. appli.ed . Phase. The Steps followed dur
ing learning are :
to the input layer to calcu
late the Output of the hidden layer. The output of the hidden
pecomes the input of the next layer. Finally, layer
the output of the Output layer is calculated.
The desired and the actual output at the outp
ut layer are Compared with each other and an error signal is generated.
The error of the output layer is back propa
1

3. ;
etwork can be properly adjusted. Baled to the hidden layer so that the weights connected in each layer of the
n
when Back propagation network is trained for the
correct classificati on for a
to the network in order to check training data, then a testing data is applied
if the unseen patterns are correctly
cl assified or not.

Initialise weights W and


V

Submit pattern Z and Comput


e layers
response, Y =f(V* Z) and
O =f (W* Y)

Compute cycle error


E=E+%(d-o/ Besh
Begin 1

training Calculate error dO, dY training


cycle dO = (d, - O,) * F( net, ) x
dY = W; *dO*Y"
!
Adjust weights of output layer as
W=W+n*d0"y'

Adjust weights of hidden layer as


V=Van*dy'z

More patterns Yes


in the training set

Mal tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Example 5.2.1 ; Classify the input of following network usin


21 -
Welt W=-1.V=| set Jee

Vv Solution :

Feedforward stage

First, we will calculate the output of Hidden Layer as,

Y, { (0.6 *2+1*0)=0.76

Y, f(-1+0.6*14+2*0)=04

Now we will calculate the output of Output Layer as,


O = £(0.76*(-1)+0.4* 1+ 1) =02
Back propagation of error
First, we will calculate the Error of the Output Layer as,

)*02)=
dO = (t-Oyf (0) =(0.9-0.2(1-0. 0.112
.2*
Now we will calculate the Error of the Hidden Layer as,

a¥; = dO # (= 1) #£ (Y,) = 0.112 * 1) * 0.76 (1 - 0.76) = ~ 0.02


= 0.026
dy, = dO * (1) *f (Y,) = 0.112 * (1) * 0.4 (1 - 0.4)

Updation of weights

First we will update the weights between the Hidden and the Output Layer as,
W,, = Wy, +1 *dO* Y,=-1 +0.3 *0.112 * 0.76 = - 0.975
Wi. = W,, + *dO* Y,=1+0.3 * 0.112 * 0.4 = 1.013

Bias is updated as
40.3 O=-
W, = Wy+n*d 1
* 0.112 =-0.996
Now we will update the weights between the Hidden and the Input Layer as,
V,, = Vy tn *dY,* Z, =2 40.3 * (- 0.02)* 0.6 = 2.0036
V.tn*dY, *Z,=1+0.3 * (-0.02)*0=1
"
x<

Vx = Vo tn *d¥, *Z,=1+0.3 * 0.026 *0.6= 1.3

Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
LearnjNing
witi h Classification and Cluste
= Var + *dY, #7 25 ring ...
240.3% 0.02
7 26* 0=2

pias i updtated as
Vo. = Viotn *dY, =~
é>
1+0.3* 0.026 =
=-0.992
<7 BAYESIAN BELIEF NETWORK
yi ———

5.3! Bayes Theorem


a

To understand the Bayesian belief network first we will reyj Ise the
. miiian « h Concepts of
the conditional probability is represen Bayes theorem.
ted in the followin
8 Way.
P(A, C)
P(C/A) =
P(A)
P(A,C)
P (A/C)
P(C)
The Bayes theorem is defined using the following
equation

P(C/A) =
r(S)P@
ciP(A)

Now we will see the simple Bayes Theorem


example
Given : A shop owner understands that due to
Rain there is inc Tease in sale of
umbrella 50% of the time. Prior
probability of rainy weather is 1/50,000. Prior Probability
of an y Customer purchasing umbrella is 1/20. If a customer
has purchased umbrella, what's the Probability the weather is rai
ny?
U
P(E )p R
P(R/U) = R nf 0.5% 1/50,000
P(U) ~ T7290 = 9.0002
3, 5.3.2 Bayesian Classifiers

- InBayesian classifier each attribute and class label is considered as a random


variable. When the records with attributes
(A,, Az... A,) are given as a input to the classifier the goal
is to predict the class C. Specifically we can say that, we
want to find the value of C that maximizes P (C | Aj, Agy...5 Ag):

- Now the question is, can we estimate P (C | A,, A2,...,A,) directly Module

ut
from data. The approach used is to compute the
posterior probability P(C | Aj, A3,...,A,) for all values of C using the Bayes theorem.
P(A,, Aj, ..., A, 1C) P(C)
P(CIA,,A;,...,A,) = P(A, Ayu A,)

- Now we have to select the value of C that maximizes P(C | A,, Ay,..., A,). It will be as good as selecting
the C that
gives a maximum value for P (A;, A,, sees A, 1 C) P(C). Next question is calculation of P (A,, Agyeses A, 1 C), that we
will see in the following sections.

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) al Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

5.3.3 Naive Bayes Classifier wa Ay IC) ean he representa,Nteq in; he


her. S° Pp (Ap Ay:
. sach ot
hen
When class
¢ is given
ey assume attributes are independe nt of eacl
tollowing way,
| ’ Ay Ag AIC) =P (AIG) P (AIG) P(A! C;) ) as aclass Cy
assisiglgned
cord 1s+. as
e te st re
PAI C)) is calculated for all classes C, and attributes Aj. Th

PCC) *P (A C)) is maximum.


Example 1: Naive Bayes Classifier salary- | Savings? q (Y/N)
- :pivorced) Saat?||
| 125k N
|Person id | Loan(Y/N) Status($:SingleM:Married,D L

| Y = 100k N
2 M

| : - g5k | Yl
4 ,

;
5
: 7 EE
; a}
NY - fF2z0k |ON
85k Y
8 N S a
75k N
9 N MT
10 N s 90k_|___* __
Class: P(C) = N./N

P(N) = 7/10, P(Y) = 3/10

If the attributes are discrete then the probability is calculated as:


P(A, IC) = LAIN,
having attribute A, and belongs to class C,
Here | A, | represents the number of records

Examples : P (Status = MIN) = 4/7, P (Loan = Yly)=0


combination of A
If the attributes are continu ous then the probability is calcu
lated using Normal distribution for each

and C,
(x-py
P (A/C) Le
6
u"

210°

= 2975
For (Salary, N): mean = 110 and variance
(20-110)
| — "ye 2075
P (Salary= 120/N) = Yon« 2075 e "5 = (),0072

is given for testing


Suppose a record X = (Loan = N, M, Salary = 120K)

P (Loan= YIN) = 3/7

P (Loan = NIN) 4/7

Tech-Neo Publications...A SACHIN SHAH Venture


(MU-New Syllabus wef academic year 18-19) (M6-14)

walkie:

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

puoan= NIY) = I
= 27
= SIN)
p (status
= 17
= DIN)
p (Status
= 4/7
= MIN)
p (Status
= 27
= SIY)
p (Status
= 17
DIY)
p (Status =
0
MIY) =
p (Status =
For salary : If Savings = N : mean = 110, variance = 2975

if savings = Y : mean = 90, variance = 25

p(x | Savings = N)
= P(Loan=NI Savings = N) x P (Status =
MI Savings = N) x P (Salary = 120K | Savings = N)
= 417x4/7 x 0.0072 = 0.0024

p(xi Savings = Y)
= P(Loan=N1 Savings = Y) x P (Status = M | Savings = Y) x P (Salary = 120K | Savings
= Y)
= 1x0x1.2x10° =0
since P (XN) P (N) > P (XIY) PCY)

Therefore P (NIX) > P (YIX) = > Savings = N for the given record.
prample 2 : Naive Bayes Classifier
x

We want to classify record < Red, Domestic, SUV>


Car no Colour Type Origin Stolen?
1 Red Sports Domestic Yes
2 Red Sports Domestic No
3 Red Sports Domestic Yes
4 Yellow Sports Domestic No
5 Yellow Sports Imported Module
Yes
6 Yellow SUV Imported No
5
7 Yellow SUV Imported Yes
8 Yellow SUV Domestic No
9 Red SUV Imported No
10 Red Sports Imported Yes

P(yes) = 5/10, P (no) = 5/10

Color-> P (Red/yes) = 3/5, P (Red/no) = 2/5

P(Yellow/yes) = 2/5, P (Yellow/no) = 3/5


(MU-New Syllabus w.e.f academic year 18-19) (M6-14) [el Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Type-
YPe-> P(SUV/yes) = 3/5, P (SUV/no)
= 2/5
P(Sports/yes) = 2/5, P (Sports/no) = 3/5

1
Origin-> P (Domestic/yes) = 3/5, P (Domestic/no) = 2/5
P(Imported/yes) = 2/5, P (Imported/no) = 3/5
P (Red. Domestic, SUV/Y)*P(Y)
1/5*5/10
P (Red/yes)* P (Domestic/yes)* P (SUV/yes) *P(Y) = 3/5*2/5*
= 0.024
P (Red, Domestic, SUV/N)*P
(N)

P (Red/no)* P (Domestic/no)* P (SUV/no)*P (N) = 9/5 #3/5*3/5*5/10


il

= 0.072

P (Red, Domestic, SUV /N)*P (N)> P (Red, Domestic, SUV /Y)*P(Y)


So, record <Red, Domestic, SUV> is classified as Stolen = No

UExample 5.3.1
KGB MU - Dec. 19, 10 Marks ifier toto find
=Cool, Wind= Strong use naive Bayes classifier
fi whether the
For a unknown tuple t =<Outll ook =Sunny , Temper ature
sa
class for PlayTennis is yes or no. The dataset is given below
Outlook [wind
Temperature | Play Tennis _
Sunny Hot Weak |_____ Ne
Hot Strong | No
Sunny
Hot Weak Yes
Overcast
Rain Mild Weak | Yes__
Rain Cool Weak | Yes__J
Cool Strong No
Rain
Cool Strong Yes
Overcast
Mild Weak No
Sunny
Cool Weak Yes
Sunny
Mild Weak Yes
Rain
Mild Strong Yes
Sunny
Mild Strong Yes
Overcast
Hot Weak Yes
Overcast
Mild Strong No
Rain

ow Solution :

P (Sunny, Cool, Strong /Yes)*P(Yes)

P (Sunny/yes)* P (Cool/yes)* P (Strong/yes)*P(Yes) = 3/5*1/5*3/5*5/14

= 0.0257

P (Sunny, Cool, Strong /No)*P(No)


P (Sunny/no)* P (Cool/no)* P (Strong/no)*P(no) = 2/9*3/9*3/9*9/14

0.0158
i

Tech-Neo Publications...A SACHIN SHAH Venture


(MU-New Syllabus w.e-f academic year 18-19) (M6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical
Learn;
Hacking
. Contact Telegram - @crystal1419
with Classification and Clustering --.
armi*Png
} Strong IN)
anys Cool, Strong /Y)*P (Y)>P (Su: nny, » CooCool,
(N)
-ecord ze sunny, Cool, Strong > is classified as Play Tenpj
nis = Yes
so
3.4 payesian Belief Network
5e °

-
AB wey ., twork is an uncor mmon gs Ort of
chart (c:
(called a coordinated diagram) together w i
r table of probabilities. Ss. The graph
Ta ith oe
7 :
comprises of arcs and nodes. The discrete or continuous va ni jables ar
rangement . are
causal relationshi
i ps
ee
ed by
the nodes. The between the variables are represented
cteprosen
by the arcs .

_
. y

Belief Networks empower us tc » model and


son .
esian reason about uncertainty, Bayesian Belief Networks suits
'
Bay ss a
ties b ased on objective data as well as subje . e probabilities. The important utilization of Bayesian Belief
“ apili
l
rovab cliv

networks js the use of revising probabilities based on the observati


, on of actual
actual events.

3.2: We need to find the probability that ‘Ram is held up!


ample °-

Sham held up Ram held up

Sham held up | Bus strike


T F
T 0.2 | 0.8
0.3 | 0.7

Ram held up | Bus strike


T F
T 0.9 | 0.1

F 0.4 | 0.6

Bus strike
T F
0.2 0.8
Module
5

uv Solution :
strike) * P (No bus strike)
e) * P (Bus strike) + P (Ram held up! No bus
P(Ram held up) = P (Ram held up! Bus strik

= (0.9* 0.2) + (0.1 * 0.8) = 0.26


ity
This is called the marginal probabil of
the light of actual observations
ion of Bay esi an Beli ef Net wor ks is in revising probabilities in
The most vital utilizat
events,
= T’ can be used.
e is ab ke. In this case evidence ‘bus strike
Let's, for example, we know ther us stri
ng hel d up (0.9) and Sham being
bilit y tables alrea dy te Il us the rev ised probabilities for Ram bei
The conditional proba
held up (0.2),
HIN SHAH Venture
al ec h-Neo Publications..A SAC
Tec
(MU-New Syllabus w.e.f academic year 18-19) (M6-14)
Scanned with CamScanner
.Page no S14
To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram
n and - Cl ustering ..
@crystal1419
lo
sificat
w i th clas
Lea r n i n Id up.
a t h a t
sps amatiis he vo determine
Rerv
re is a bus strike put do know
we do not know if the
Suppose, however, that
and we10 canCe use
‘Ram held up = true’ ’
3

the evi den ce that


Then we can enter

(a) The (revised) probability that there is a bus strike, and


|
0.69
The (revised) probability
that Sham will held up * 0.2) 0.26 =
(b) held up) = (0.9
P (bus sr
ike) / P (RAM s Strike)
P (bus strike! Ram held up) = P (Ram held up! bus strike) * s st ri ke) * P (No bu
(c)
ld up! No bu
held up! bus strike) * P (bus st rike) + P (Sham he |
(d) P (Sham held up) = P (Sham
* 0.31) = 0.38
= (0.2 * 0.69) + (0.8
propagation.
lle das
on the arrival of the evi dences, this is ca
Probabilities arc updated based er.
at it is du
e to rain or sprinkl
We need to find the re as on th
Example 5.3.3 : It is observed that grass is wet.

Sprinkler

False False

False | True
True False

True True

IV] Solution :
on using the given graph.
First we will write the joint probability distribution equati
R) P (S|R)P(R)
Joint probability distribution: P (G, S, R) = P (GIS,
er)
We need to find the reason of wet grass (Rain or Sprinkl
to rain,
First we will find the probability of wet grass due
P(G=T,R=T)
=T)
P(R=TIG P(G=T)

Lseqrp P(G=T, S, R=T)


Xs, Re (T, F) P(G=T,S,R)

SACHIN SHAH Venture


[el rech-Neo Publications..A
year 18-19) (M6-14)
(MU-New Syllabus w.e.f academic

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
Learming wi;
Ng with Classification and Clustering ---

now &
ach probability term is to be written in th e form of io ows,
P(G=TIS=T.R-E NER J0int probability distribution as foll
gz T,S= T.R=T) = T :
7 =TIR=

= 0.99* 0.01 * 0.2=0,00198


_
pl +=
Dee ae
probability j
gimilatly we will calculate all probabilities and final ity is calculated as
0.001 584, ed as below,
_ = 0.00198, oH + 0.1
pRetiG=t T . Dare + 0.1584 = 35 77
tet + Orr
e will find the probability of wet 8ra
Now ¥
grasss due to sprpri
j nkler,
P(G=T,S=T) 5
= SE
p(SeTIG=T) = —pegaqy G=TS=TR) = 64,23
Xs Re (T, F) P (G= T. S, R)

Nek ) the . reason for the Wet grass is “sprinkler is on”.


gneR S=T!G=
(R =T | G= T

pa MARKOV MODEL
* 5.4.1 Markov Models

tha he o.
arkov model is a discrete finite system 1 called as initial state.
has N distin ct states. The model ,starts at time t=~
that ilities that are
ystem moves from present state to next state with the transition probab
increas
step increa th syste ystem me
seses the
‘“ the time
As the time “te Suc
assigned fo present state, < uch type of system is known ass a discrete, ; or finiteite Markov model
Markov model.
:
-arete Markov Model every a, indica li
probabiility
Ty a, ) indicates the probab actstion to state j . from state i.. The a, are stored in
of transi
In Discrete
s are indicated by vector P-
A (a,} matrix. p, Is Hhe prod ibility to begin from a given state i. These start probabilitie
vy -
is th z rn ba ili
— [¢ atrix.

arkov Model Property

Attt | time the state of the system depends only on the state of the system at time t

P(X 41 = X41 PX,=X, Xyip =X, Xp =X) Xy= Xp)

=P (X,.1 = det Xr = Bd)

s is “stationary”
ins : Pro bab ili tie s ind ependent oft when proces
Markov Cha Mod
= %41 | X= X) = Pi
So, for all t, P(X.)
ut wi
ility with which the system will be present in the next system witho
This can be inferred as Py, represents the probab
ent system is in state ¢.
depending on the value of 1, if the pres
has purc hased milk, then there is a 90% chance that his next purchase will also be milk.
Example 5.4.1: Suppose a person 80% chance that his next purchase will also be bread.
d then there is an
If the same person purchased brea
y purc hased milk, what is the probability that he will purchase bread two
Let's assume that a person currentl
hases from now?
purchases from now and three purc
0.1
— Cus 0.8
00m)
0.2

Tech-Neo Publications..A SACHIN SHAH Venture


[MU-New Syllabus w.e.f academic year 18-19) (M6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Vv Solution :

[ 0.9 0.1 |
A
0.2 0.8 0.17 |
0.90 O.1 09 O01 ]_ 0.83
0.66 ,
|= 0.34
A’ = 0.2 0.8 0.2 0.8 ,
T; ne ywalis 0.17
sat.
hase brea d two purc hasesacs from
purc
The probability of he will 0. ;
0.9 O.1 0.83 0.17 0 438 0.562
=
\' _ . .
Lo2 08 JL 0.34 0.66 A
“=
s of now js 0.219
The probability of he will purchase bread three purchase from
s the probability for the Observation
nny, what i
Example 5.4.2: Given that the weather on day
i
1 (t= 1) 8 : nny. The states a re represented as, Rainy. ,
, su
y, rainy, sunny, ee
O = sunny, sunny, sunny, rain
transition matrix Is given as A.
Cloudy: 2, Sunny: 3. The
04 03 03

Az=| 0.2 06 0.2


0.1 0.1 08

M Solution:
) P (213) P (312)
p (313) P (313) p (113) P (II) P GIL
|
P(OlModel) = P( Dy De Ie By ba te ees
1 #0.4%03 0.1 *0.2= 1.536x 10°"
Ey hay yy Ay, YY YY ayyityy= 1.0 # 0.8 * 0.8 * 0.
UExample 5.4.3 TIS y’. Transition
states: ‘Rain’ and ‘Dr
DEA eee
Consider Markov chain model for ‘Rain’ and ‘Dry’ is shown in following fig ure Two
probabilities : P(‘Rain''Rain’) = 0.2, P(Dry''Rain’) = 0.65, P(‘Rain'I'Dry’) = 0- ; ‘Rain’, ‘Dry’}.
say P(‘Rain’) = 0.4, P(‘Dry) = 0.6.Caloulate a probability of a sequence of states {‘Dry’, "Ral n’,
0.2 Sain) a3
——— bp 33 0.7
)..
= &

Mi
P (O11 Model) P (Dry, Rain, Rain , Dry | Model) = P (Dry) P (Rain! Dry) P (Rain! Rain) P (Dry! Rain)
0.6 * 0.3 * 0.2 * 0.65 = 0.0234

5.4.2 Main issues using HMMs

1. Evaluation Problem : Given observation sequence O = O,, O,, ....., 0; and A how to compute P (O 1).

For example, calculation of probability of observing HTTHHHT.


0.9 0.1
A = [ 03 07 | b, (H) = 0.5, b, (T) = 0.5, by (H) = 0.8, by (T) = 0.2, 2, = 0.6, my = 0.4

Here A represents transition probability matrix and b represents emission probabilities

Suppose state sequence Q = FFBFFBB

Then, P (OIA) by (H) b, (T) by (T) b;, (H) b, (H) by (H) by (T)
0.5 * 0.5 * 0.2 * 0.5 * 0.5 * 0.8 * 0.2

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
pase state sequence Q = BFBFBRR
Then. P (O1A) = by CH) b-(T) b,

0.8 * 0.5 *02% nee


(OCH) beat) b, (H) ba(T)
O.8 * OR 0.2
got each given state sequence Q the prob ability of
of &* - O wilt be diffe
Tent.
cow the question is how likely are we lo get this
No path Q.

2 FFBFFBB = Fr arr Orn Aor M7 ayy ag = 0.6 * 0.9 «6 1*03%9


=r . MSN OI* OE 0.7
(=
5 BFBFBBB = Ts nr Orn Apr Ap App dyy = 0.4 #0349) « 03% 0.1*0.7*07
ch given path Q has its own probability, So the uestion is how to
find P (QA)
solution :
- rd Procedure : The
forward algorithm is
most!
aein
s a specific state when we know
Y used in applications that need us to determine the probability of
bei 2
about the sequ ence of observat ions
jnitialisation = 4 (i) = 7, b, (0)
N

Oh (J) x a (i) aly (0,,,)


"

i=]

Induction :
O HTTHHHT

a, (F) 7; be (H) = 0.6*0.5 = 0,2


a, (B) = ™,b, (H) =0.4*0.8 =0.48
Maximum probability is of B, so B is selected.
@,(F) = [0 (F) app + (B) ag.) b, (T) = (0.2 * 0.9 + 0.48 * 0.3) * 0.5 = 0.162
a, (B) [ct (F) apg + &, (B) agg] by (T) = (0.2 * 0.1 + 0.48 * 0.7) * 0.2 = 0.0172

Maximum probability is of F, so F is selected.

Similarly @ to &, are calculated, Based on this path Q is decided. Once we get Q we can easily calculate P (O | A).

Backward procedure
Y~

Module
Initialisation : B; (i) = |
N 5
Induction:
B, (i) = LX a,b,O.1 8410)
j=l
O = HTTHHHT
NII T ais 6)

B.(F) = |
SANA AERA

B,(B)= 1
B.,(F) = agg by (T) * 1 + apg bp (T) *1=0.9* 0.5 + 0.10.2 =0.47
* 05 +0.7* 0.2 = 0.29
B_,(B) = agrby (T) * 1 + agg by (T) * 1 = 0.3
te Tech-Neo Publications...A SACHIN SHAH Venture
(MU-New Syllabus w.ef academic year 18-19) (M6-14)
Sa ta i

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Maximum Prob
AbITiLY is Of F, so F is
selected,
Suey
Bk ) = , by CH)
Ay B | (: F
+ ay)
y by (HD * BB)
= 0.9 * 0.5 * 0.47 + 0.1 * 0.8 * 0.29 = 0.2
34
fp »(F) =
Any by CH) * CF) + agg
by (H) * BB)
= 0.3" 0.5 * 0.47 +0.7 * 0.8 * 0,29 = 0.2329
Maximum Probability is
of F, so F is selected.
Similarly, § ito get Q we can easily calculate P (Q | h)
74re calculated, Based on this path Q is decided. Once we
D. .
er 7c OQ = Gq).
te

ecoding Problem ; Given observation sequence O and 4 how to choose state sequenc Q= GiGi.

For example, what is hidden


coin behind each flip.
M Solution :

Forward-Backward
algorithm

. a, (1) By (i
y,Q) = N ORD

x a (i) BW
Jel
y, (F) and y, (B) is calculated and maximum is selected.

Based on this path sequence Q is decided

Learning Problem : How to estimate A = (A, B, M) so as


to maximize P (Q!A)
For example, How to estimate coin parameter A.
Mi Solution :

Use EM algorithm

Guess initial HMM parameters

E step : Compute distribution over paths

M step : Compute max likelihood parameters

But how do we do this efficiently

Also known as the Baum-Welch algorithm

Compute probability of each state at each position using forward and backward probabilities

— (Expected) observation counts

Compute probability of each pair of states at each pair of consecutive positions i and i + / using forward (i) and
backward (i + 1)

— (Expected) transition counts

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

5.501 Maximum Margin Linear Separator


,
ector machine jy 4 lype of
ot V
sup?
. points
ints MUD EHV
iseCdd ley
dat arAF UNSEEN (NOL from Nn the the ie:. « ths
raining dataset) suppor Mat Can f Xe Used fry
Classification
of regression, Even if the 4
‘e take
pet's tak an exampl e 48¢l that belgn
of dat , A VOCIOF maching
Bs 9 Wo different ¢ ichine clasy ifies the data prop
gata js separated from each othe pro e nly.
r Properly, Jy this Clee
Ate ony 5.
‘ t
ych a WAY that the inpu “BONG, and the dist ribution of data 18 proper means
SPaWeece jgte div
ot, ide
« d m
into two regi e WE CAN drawaW » i pht , the
"
5wa pint o t OT is Mrai fine /
(decision boundary) on the gra
s that be ph in
at Dloelngosngstoto one “Bion,
one catelegory Jigs
Ca
OD One si
jies ON the opposite side, Such lype of de Of the
1c! data jg Call deCC
ejISION boy
ed ay line ndary and the data points
ary Y separable of other category
sy data,

Linear Data
Non-linear Data x
Inseparable Data
Fig. 5.5.1; Diff
erent types of
data

We want that our classifier should


be designed in a manner that if a data
then we will be more confiden point is far away from the decision bou
t about the prediction ndary
we have made.
We would like to find the data point near
to the %
separating hyperplane and also make
sure that this
point should be far away from poe
Support
the separating line as
possible. This is called as margin. We would like to
ee
find the greatest possible margin, because if we
trained
o
our classifier on limited data or made a mistake, we
would want it to be as robust as possible.
oO :
Module

1 asi
Support vectors are the points which are nearest to the
separating hyperplane. We have to maximize the 7
ista the support vectors and the
a
separating a
line. Fig. 5.5.2 : Support vectors and Margin
Distance between a hyperplane (w, b) and a point x is calculated as,
Iw'x+bl
Distance = Iwill

(MU-New Syllabus w.e.f academic year 18-19) (M6-14)4 [al rech-Neo Publications..A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419 ..-Page ng 5-29
Cluster fication and

5.5.2 Quadratic Programming Solution to Find Maximum Margi in Separator

A hyperplane is formally defined by the following notation as,

FQ) = wy4b
In above equation, w represents the weight vector and b represents the bias.

By eralt .
scaling the values of w and b we can represent the optimal in mM, any ways. As a matter of “ONE non
hyperplane
“mong all the possible notations of the hyperplane the one selected is
Iwxebl = |
Here x represents the training records closest to the hyper eral the training5 records that are closest t, the
plane. In g¢ n
is called as the can
onical hyperplane.
hyperplane are called as support vectors. This notation
The distance between a point x and a hyperplane (w, b) is given by the result of geometry as follows,

Distance = Iwx+bl
Theil
WwW

support vector is given as,


i nce to the e suppo
In general, the numerator is equal to one for the canonical hyperplane and dista

Distance, = WYX+bI_
1 I

Ilwll Ilwll
Margin is twice the distance to neare Xp
st samples
9
M =
Ilwll

Ultimately, the task of maximizing M is same as


compared to the task of minimizing a function L (w)
Subject to some conditions. The conditions used to
model the requirements for correct classification of
all training samples x, by the hyperplane are formally
Stated as,

‘ Ls,
min L(w) = 5 Ilwi subject to y, (w'x +b) 2 I foralli.
w,b

Where y, represents the labels of training Fig. 5.5.3 : Solution to find maximum margin

This is a problem of Lagrangian optimization that can be solved using Lagrange multiplier to calculate weight vector
“w’ and the bias ‘b’ of the optimal hyperplane.

Let's assume that we have 2 classes of 2 dimensional data to separate. Let's also assume that each class consist of only
one point

These points are


X, = A,=(3,3)
4 B, = (6, 6)

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) [a] Tech-Neo Publications_A SACHIN SHAH Venture

Scanned with CamScanner


Learning (MU-Sem 6-Comp
ne
hyper
To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
ma plane that separates theses > e
" and th A MIASSCS
] .
fiw) = Sli

_constralints are,
ne &
c(web)
P = y lw
YIWK, +b] 1>q
ce, (web) = Tlwx, +h. 1>0
c,(w, b) = ~TIwxy+bl-139

wexts WE put equation into for


m ofLagrangian
4
L (w, b. m) f(w )—m, © (w b)
;
"

M)C,(w, b)
= 2‘ wll" ~m, (w
. x, +b) - 1)~m
,,
2 ~ (Wx, +b)
~ 1)
=
2
Filwil —
™M (OWX) + by — 1) +m
, 2 ((wx 2 +b)+1
+1)
Ww ¢ solve for the gradient of Lagrangian
VL (w. b, m) Vi(w) —m
, Ver (w, b) +m
2°Ve3
C,(w, b =(
(W, by)
—|; w,
( 3 b, m) Ww-m + m,
MV 1X, x 7 0
5 (5.5.1)
ap L (webs m) = —m,+m,=0
;
(5.5.2)
= L(w,b,m) = (wx, +b)-1=09
dA,
(5.5.3)
17 (w,b,m) = (wx,+b)+1=0
(5.5.4)
Equating Equation (5.5.3) and (5.5.4), we get
(wx, +b)-1 = (wx,+b)+1
(wx,)-1 = (wx,)+1

(wx,;)—(WX,) = 2

w(xX,;-X,) = 2

w is divided into parts as, Module


Ww

W(x,-x,) = 2

(w,, W,) [(3, 3)- (6, 6)] = 2


(w,, W)) [(- 3,-3)] = 2

—3w,-3w, = 2

w, = —(0.67+w,) (5.5.5)

MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Venture
de

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
Machine Learni g (MU-Sem 6-Camp)

Adding values to Equation (5.5.1) and combining with Equation (55.2)

(4). Wy)=m, C1. 1+ m, (2,2)


=0

From Equation (5.4.2)

m, = m;

(Wy, Wy) - m, (3, 3) + mA, (6,6) =0


(w,.W3)4+m, (3.3) = 0 ASS ¢
w,+3m, = 0 A557
5
w,+3m, = 0 |
Equating these we get,

Putting this in Equation (5.5.5)

Ww, = Wi=- 0.34

Putting this in either Equation (5.5.6) or Equation (5.5.7) will give

m, = m,=0.11

And finally, using this in Equation (5.5.3) and Equation (5.5.4)


b = 1-(wx,) or =-—1-(wx,)
- 0.34), (6, 6)) = 3.04
= 1 ~((-0.34, — 0.34), (3, 3)) or= 1 - (- 0.34,

5.5.3 Kernels for Learning Non-Linear Functions


solution to this
linearly separab le data. Support vector machine provides the
Linear classifiers are able to separat e only
hyperplane is
space into a featur e space that contai ns non linear features. A
problem by transforming an input
the same.
constructed in the feature space so that other equatio ns remain
g a high dimensional
known as non-li near suppor t vector machin e. Here we se parate the data linearly usin
This is also
res ult is going to be non linear if we
of variables are used for this purpose. The
space. Kernel functions with its own set
convert this back to the original feature space.
® 0

space
Fig. 5.5.4 : Mapping of Input space to Feature

ssed with
Particularly, the data is preproce
X — (x)

ic yea r 18-19) (M6-14) [Bal recth-neo Publications..A SACHIN SHAH Venture


(MU-New Syllabus w.e.f academ

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
non? (x) is mapped to y
t
and F(x) = wW®(x)+b

Fig. 5.5.5: i
. Mapping of one feature space to another feature space
:
transforms the data inj to an easily
Kernel functiojon understandable form. This is done via mapping input spacect to
e space. In support ve aching 3
of this 1s
calculated of two vectors and the result
cor machine inner Product is
ano Aon ber. Wh .
always 4 single number. When we replace this inner product by a kernel it is called as kernel trick.
here are different algorithms that use different kinds of kernel functions. Among the different types of kernel functions
such as nonlinear, a radial basis function (RBF), polynomial, and sigmoid, RBF is mostly used. The reason for this
is along the X-axis radial basis function gives the localized and finite response

The number of support vectors will be determined based on the different criteria’s such as what is the complexity of the

model, how much slack is allowed. One or more than one support vectors need to be defined for every complications in
the final model from the input space. Support vector machines output compromises of support vectors and alpha. This is
used to specify the effect of support vectors on the final decision.
_ qf we select the model with high complexity it will result in to over fitting. For better generalization if large margin is
selected then it may Iead to incorrect classification. And accuracy depends on the trade-off between these two selections
criteria. If we over fit the data then the range of support vectors may vary from very less to each single point. This
tradeoff is controlled through the selection of kernel and its parameters. Module
_ Insupport vector machine the data points are tested by taking the dot product of each support vector with the test point. 5
Hence the computational complexity increases, if we increase the number of support vectors. Classification of test
points will be faster if we have less number of support vectors.

%, 5.5.4 Rules for the Kernel Function


y

Kernel function or a window is defined as follows: 1

If Ixil< 1 then K(x)=1 else 0.

This kernel function is shown by the Fig. 5.5.6, 1 1 ) x


K((z-x,/h) = 1 Fig. 5.5.6

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

For a tixed y alue ~


of P Xu the functi
Fig. 5.5.7 on lakes the valu
e as 1 as shown
In
ByY select
selecting the “gument
of K (), window can
Centerd at the©
POINpo; LX, and to be moved to be :
be of radius h ,
‘ Figa
. 5.5.7
5.5.5 Different Types of
SVM Kernels
| > olynomial
>
kernel

Pol ynomiailal kernel


kemel ;js Mostly used .
in image processing methods.
»,
Polynomia.l Ker
, » :
nel is represented as,

Kix. xy) = (xo x,t 1)?


HeCre P rep
“pre
ressent
ents the degree of the
polynomial.
2. Gaussian kernel

There are some - . ‘ .


applications where prior knowledge is. not available.

For thisa typ eC of
P
a plic ¢ations Gaussian kere] Mel is;.
used,

Gaussian kernel is
defined as,

K (x,y) =
ix-yir
vo{ S22")
<0

Gaussian radial basis function


(RBF)
This is also used for the applications where prior knowledge is not availabl
e.
Gaussian radial basis function is defined
as,
K (x,,%,) = exp(yl x-y Ir) for y>0
Sometimes it is parametrized using the value ofy as 1/20"

4. Laplace RBF kernel

Laplace RBF kernel is defined as,


Ilx-yll
K (x,y) = exp} -
o

5. Hyperbolic tangent kernel

Hyperbolic tangent kernel is used in neural networks.

It is defined as,

K (x,,X,)
ay
= tanh (kx, * K+), for some (not every) k > 0 andc <0,

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) [ial rech-Neo Publications..A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

b oid kernel can be used as a Proxy for ie


sistiol ecural Networks,
as-
iis defined K (x,y) = tanh (a x"y +c)

sel gunction of the first kind Kerne}


c

fine d as.
JI i, de
Jie (Gilx~y i)
K(x,¥) =
Ix-ypeorh

were represents the Bessel function of first type

anova radial basis kernel

be used
In regression problems this kernel can
ed as,
Wis defin n

Kxy) = = exp (~o (x'~ yy 2d)


k=]

<F5 CLUSTERING

Genre sence ec Tren Torrence


eee ee ees
cet ofof da.
itionine aa set
proces: of partitioning
ion : “Clustering £ is a process ee Se we ie eet ;
0oODefinition data in a set of meaningful sub-classes, called as’
fe ee nin tnitnneN nw sina wEvSsneaiaee oobShenadiwesuss-dexesUstenathens Sule i
inclustering we group the “similar” objects in one cluster and “dissimilar” objects in another cluster.

va 5.6.1 K- means Clustering

To solve the well known clustering problem K-means is used, which is one of the simplest unsupervised learning

algorithms. Given data set is classified assuming some prior number of clusters through a simple and easy procedure. In odule
k- means clustering for each Cluster one centroid is defined. Total there are k centroids. 5
of centroids. To get the
The centroids should be defined in a tricky way because result differs based on the location
much as possible. Next, each point from the
better results we need to place the centroids far away from each other as
for all the points.
given data set is stored in a group with closest centroid. This process is repeated
step new k centroids are calculated again from the
The first step is finished when all points are grouped. In the next
result of the earlier step.
process
done for the data points and closest new centroids. This
After finding these new k centroids, a new grouping is
is done iteratively.

Venture
[el Tech-Neo Publications..A SACHIN SHAH
(MU-New Syllabus w.e.f academic year 18-19) (M6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
The aim of this algorithry ivy y
anothe ;
defined as follow, i
The process is repeated untess and until no data point mov es from one
ction ‘th
Minimize an objective function such as sum of a squared CTF fur
k x

Je ¥ M94 her
oe Na ‘entre C
x) and the cluster ¢ r Wis
( int
j . Een ta ape’ na data

P
“|

Here tx,- CW shows the selected distance measure betwee nters


onters:
pnrmesntalinncat ihe di nenective cluster c€
representation of the distance of the n data points from their respective

The algorithm is comprises of the following steps }


ie : ' cluster.
Identify the K centroids for the given data points that we W ant to €
seat :
1,
has the nearest centroid.
2. Store each data point in the group that
centroids. ; oo
3. l data
When aall di points hav been stored, redefine the > KK cent
nts have other. The result of this process is the
. the no data points : move from . one . grou 10 al
4. Repea t a9
Steps 2 and 3 until - 7
zed can be cal cul ate d.
metric to be minimi bal minimum obj
ective
clusters from which the rreSsponding to glo .
solution i s o
arantee the the most opunimalnal solu
co
The k-means algori
ans algorithm
algo does. not guarantee al random selection of cluster center,
Ini t!
can be be proved ese willWj always terminate,
red thatth: the process - of times to reduce this‘ effect,
function, altho hough iitt can e
fora nu mb
affects the performance of the algorithm. The k-means algorithm is appli ed data points
nd we know thal th. e ,
: i s a
5. ee X Of f the same clas:m e class are present @
Let's assume
ass th:hat n sample data points
T E POIS Xie Xaver B n r i. x can be stored in cluster j, if
J of the data points |
n clu ste
belongs to k clusters, k <n. Let m, represents the mean
Ix —m, ll is the minimum of all the k distances.

The k-means procedure is shown below:


Select initial values for the means m,, M3, ».. M

Until no data point moves from one group to another

© — Use the calculated means to group the data points into clusters

o Forifrom 1 tok

Mean of all of the samples for cluster i is used to


\nitialise number of cluster K
replace M, with the

o end_for ; No
Calculate/Initialise
end_until Centroid

— The K-means algorithm is implemented in three steps.

- Iter
Iterate ata point
stable ( =(= no data
untilil stable 2
poi move group) Distance
. objectsto
centroids
1. Determine the centroid coordinate

Determine the distance of cach data point to the


to

Grouping
Ou based
A on
centroid minimum distance

3. Group the data points based on minimum


distance.

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


ae — -

; Given
{ 2, 4,The
To Learn 10, Art
12: 3.Of20, 30, ops
Cyber Len
Security & Ethical Hacking Contact Telegram - @crystal1419
5} Aemene rent
“om ’ oh
eo eon: JMET ig We 2
anol assign ineans: M, = 3, 1m, = 4

pmbers hich are close to mean m,


qre n “4a Te fTouped Inte me
st
agai cal culate new Mean for new chustey gtou p. SEER, ard cahers int
(2.3). k, = {4, 10, 12, 20, 39, 3
Ki = W, 2S) m, =?
Ss Mm = 16
(2, 3.4}. ky = (10. 12, 20. 30, 11,25) m
Ki ~
kK, = (2.3, 3,4, 10}, ‘ ky = (12, 20, 29 elt, +25) m, 13m, =}
==4.75,m, 2
2.3.4, 10.195 12]. ks = 120, 30,25)m, 19 6
, s (
kK, 1=7,m,=25
final clusters
= (2, 3, 4, 10, 1, 12}, k,= {20, 30, 25)

Randomly assign alternative values to each cluster


k= (10, 2, 3, 30, 25}. ky= 4, 12, 20, 11,31) m, =
= 14m,
= 156
Re assign

K, = (2: 3,4, 10, 11, 12), k, = (20, 25, 30, 31}


m, =7, m, = 26.5

Re assign
oe (2, 3,4, 10, 11, 12}, k; = {20, 25, 30, 31) m, =7, m, = 26.5

Final clusters

K, = (2.3.4, 10, 11, 12}, ky = (20, 25, 30, 31)


example 5.6.3 : Let's assume that we have 4 types of items and
each item
i has 2 attributes or features. We need to grou
these items in to k = 2 groups of items based on the two features.
. ° _
Object | Attribute 1 (x) | Attri
bute 2(y)
Number of parts | Colour code
Item 1 1 1
jana > ;
Module

Item 4 5 4

VI Solution :

lhitial value of centroid

Suppose we use item | and 2 as the first centroids, c, = (J, 1) andc, = (2, 1)

The distance of item 1 = (1, 1) toc, = (J, 1) and with c, = (2, 1) is calculated as,

D= (1-1) +(1-1) =0 |

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications..A SACHIN SHAH Venture

Scanned with CamScanner


— d Clustering ...Page no 5.
Machine
To Learn Learni Art
Learnin
The MU-Sem 6-ComSecurity & Ethical Hacking
Of Cyber Learning with Cla
Contact gsification- @crystal1419
Telegram 82 28 :

D= Va-2 40-9 =1 |
The distance of item 2= (2.1) toe, =(1, 1) and with c, = (2, 1) is calculated as. |

D= V2Q-N+0-1
=1
D= V2-27+0-1y
=0
The distance of item 3 1) is calculated as,
= (4,3) toc, =(1, 1) and with c, = (2

D= Va-1+G-1F =361
I
D = Va-27+@-1 =283
The distance of item 4 = (5,4) toc, = (1, 1) and with c, = (2. 1) is calculated as,

D= V6 +G-1=s5
Ul

D= V(5-2))4+G-1)'=4.24
Objects-centroids dis
tance

b° = | * 1361 5 ome? group|


10 283 4.24 c,=(2,1) group2
: group 2.
between group 1 and group
To find the cluster of each item we consider the minimum Euclidian distance
Fr abov . s
om the above object centroid distance matrix we can see,
Item 1 has minimum distance for group], so we cluster item | in group 1.

Item 2 has minimum distance for group 2, so we cluster item 2 in group 2.

Item 3 has minimum distance for group 2, so we cluster item 3 in group 2.


Item 4 has minimum distance for group 2, so we cluster item 4 in group 2.

Object Clustering
G! piooe)
~ O11]
Iteration 1 : Determine centroids

C, has only one member thus c, = (1, 1) remains same.

C, = (24+4+45/3,14+3 44/3)
= (11/3, 8/3)

The distance of item | =(1, 1) toc, = (J, 1) and with c, = (11/3, 8/3) is calculated as,

D= VU- 1) +(1-1y =0

D= VC - 11/3) + (1 - 8/3) =3.41

The distance of item 2 = (2, 1) toc, = (1, 1) and with c, = (11/3, 8/3) is calculated as,

D= V2-)'+0-1=1
D = V2- 113) + - 8/3) = 2.36

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
Fn = of item 3 = (4, 3) toc, =(1, 1) and ihe =
os (2. 1) is caloulg
te p= Va- IY +(3-1)°
=36)

D= V(4- 113) + 3-87
= 0.47
fitem 4 = (5,4) toc, =(1, 1) and with c, = (2 \
. gistane® ° )is calculated as,
rh D= Vi5-17 + @=1y <5
D = V(5- 11/3) + (4-87=3)
1.89
ids distance
centro!
on p= | O 1 361 5 7S =4.1) group1
3.41 236 047 189 |. _(18
2= 3) group2

pove object centroid distance matrix we can see,


ea
roe ) . + otc
1 has minimum distance for group], so we cluster item 1
om -
in group | .
Ite :
2 has minimum distance for group I, so we cluster item 2 in group I
> jem?= has minimum distance for group 2, so we cluster item 3 in group
2.
prem - .
c 4 has minimum distance for group 2, so we cluster item 4 in group 2.
Item
ystering

oniot atnefet d 22 ]
0011

rmine centroids
tionson +2! *Dete
Cc, (1 + 2/2, 1 + 1/2)
= (3/2, 1)

Cc; (4 + 5/2, 3 + 4/2)


= (9/2, 7/2)

of item 1 = (1, 1) toc, = (3/2, 1) and with c, = (9/2, 7/2) is calculated as,
The distance
p= Vd-32+0-1=05
p= VU-92y7 +(1-72y =43
is calculated as,
The distance of item 2 = (2, 1) toc, = (3/2, 1) and with c, = (9/2, 7/2) Module
p= VO-32)+0-1)=05
D = V2-92y+0-72y =3.54
is calculated as,
The distance of item 3 = (4, 3) toc, = (3/2, 1) and with c, = (9/2, 7/2)

D (4-3/2) + G- 1) = 3.20
"

D= VG-92y) +G-72y =0.71


The distance of item 4 = (5, 4) to c, = (3/2, 1) and with c, = (9/2, 7/2) is calculated as,

D= V6-32)+4-1) =461

MU-New Syllabus w.e.f academic year 18-19) (M6-14) [el Tech-Neo Publications_A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

D= (5-972) +14 ~7ny = 0.71


Objects-centroids distance
1 ) group |

rib
Cc
'

—_—
2 be 0.5 3.20 4.61
4.3 3.54 0.71 0.71 m( 1) group2

Nhe
C,=\2°2

From the above object centroid distance matrix we can sce,


| in group iF
~ — Item | has minimum distance for group1, so we cluster item
— Item 2 has minimum distance for group |, so we cluster item 2 in group

Item 3 has minimum distance for group 2, so we cluster item 3 in group 2-

— ‘
Item 4 has minimum distance for group 2, so we cluster item 4 in‘ group 2.

Object Clustering

5 1100
G =
0011
Obj
*=G!, Objects ,
does not move from
clusters
group any more. So, the finalal clusters are a 5 follows:
G"=G_,
— — Item 1 and 2 are clustered in group |

— — Item 3 and 4 are clustered in group 2


points into 3
has 2 features . Clu ster the data
Example 5.6.4: Suppose we have eight data points and each data point
clusters using k-means algorithm.
Data points | Attribute 1(x) Attribute 2(y)

1 2 10

2 2 5
3 8 4
4 5 8
5 7 5
6 6 4
7 1 2

8 4 9

M Solution :

Initial value of centroid

Suppose we use data points 1, 4 and 7 as the first centroids, ¢, = (2, 10), c, = (5, 8) and c, = (1, 2)

The distance of data point | = (2, 10) toc, = (2, 10), ¢, = (5, 8) and with ¢, = (1, 2) is,

D= (2-2) +10- 10) =0


D= (2-5) +(10-8) =3.61

D= (2-1)
+ (10-2) = 8.06

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) [al Tech-Neo Publications..A SACHIN SHAH Venture

a A aia

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

qne D V2-2)' 415-10 <5


D = VQ-5)46—245 84
D= VO-17 46-27-3456
ea of data point 1 = (8, 4) to “1 = (2,1
0), © =(5, 8) and with 5 (1, 2/is,
D= (8-2) + (4-107 = 8 49
D = V(8-5)+(4—8) <5
D = V1) +42) = 7.28
aes of data point I = (5, 8) toc, = (2, 10),
¢, =(5, 8) and with c, = (1,2) js
D= VO-27+l=@-3,6 .
D= V(S-5) +(8-8y=9
D= VS - 1° + (8-27 = 72)

she distance of data point 1 = (7, 5) toc, = (2, 10), ¢, =(5, 8) and with c, = (1, 2) is ’
D = Y(7-2)'+(5-10)°=7.07
D= V(7-5)4+(5-8)
= 3.6]

D= V7- +62) = 671

qhe distance of data point I = (6, 4) to c, = (2, 10), c, = (5, 8) and with ¢, = (1, 2) is,
D= V6-2)+4-10¢=72)
D= V6-5)+@-8 =4.12
D = V6-1'+G-2)=539
The distance of data point 1 = (1, 2) toc, = (2, 10), c, = (5, 8) and with c,=(1,2)is,

D= Vd - 2) +210) =8.06

D= Vo -57 +8 =721

D= (1-1) +(2-2)y'=0

The distance of data point | = (4, 9) to c, = (2, 10), c, = (5, 8) and with c,=(1, 2)is,
Module
D= Vi4-27+0- 10)
= 2.24 5
D= Vi4-5) 419-8) = 1.4

D (4-1)
+ (9-2) =7.62

Objects-centroids distance

0 5 848 3.61 7.07 7.21 8.06 2.24 ]¢,=(2.10) group1


0
D = 3.61 4.24 5 O 3.61 4.12 7.21 14 | c,=(5,8) group2
8.06 3.16 7.28 7.21 671 5.39 0 7.62 Jc,=(1,2) group 3

MU-New Syllabus w.ef academic year 18-19) (M6-14) Tech-Neo Publications..A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

omp
Machine Learning (MU-Sem 6-C
rix we can sec,
troid distance mal +4
From the above object cen
<A|
distance for grou
— Data point 1 has mi inimum of

group3, s¢ oe
m dist ance for
Data point2 has minimu
16

owe €
imu m d ista nce for gro up 2,8
— Data point 3 has min
soe for group
2
sO
wec
da? soint 39 grouP
4 has min imu m dist ance
— Data point clustet a
ta
16in gr0t p
for gro up2,+ so we
m dist ance
— Data point 5 has minimu Juster data poin qin
2, so we € us group
distance for group“»
- Data poin (6 has minimum ter data
point

for gr ou -
p 3,8 owe clus
ce group 2.©
- Data point 7 has minimu m distan ata poin ,sinn
; a: imu m
; ance for group2, $0 We © luster d
dist
— Data point 8 has min

Object Clustering
10000000

ge} ooriri1o!
901000010

Iteration 1 : Determine centroids


c, = (2, 10) remains same.
Clhas only one member thus
44.9/5) = (6)
Cc; (B4547 464454484544
, 3.5)
C, (2+ 1/2, 5+ 2/2) =(15
group |
Objects-centroids distance = (2, 10)
8.06 2.24
7.07 7.21 group 2
c, = (6, 6)
5 8.48 3.61
0
141 2 640 3
a

5.66 4.12 2.83 2.24 group 3


p' = | 1.58 6.04 c,= (15. 3.5)
6.25 5.7 57 4.52
6.52 1.58

Object Clustering
1000000 1

01000010

Iteration 2 : Determine centroids

C, =)
(2 + 4/2, 10 + 9/2 (3, 9.5)

(6.5, 5.25)
C, = (8 +5474 6/4,4+48+5+4/4) =

C,
(2+ 1/2, 5 + 2/2) = (1.5, 35)
group |
1.12 2.35 7.43 2.5 6.02 6.26 7.76 1.12 7 ¢,=@, 9.5)
2_ | 654 4.51 1.95 3.13 0.56 1.35 6.38 7.68 c, = (6.5, 5.25)
group 2

652 1.58 6.52 5.7 5.7 452 1.58 6.04 J c,=(1.5,3.5) group3

Tech-Neo Publications...A SACHIN SHAH Vent


(MU-New Syllabus w.e.f academic year 18-19) (M6-14)
eon enture

Scanned with CamScanner


P
ing (MU-Sem 6-Comp
To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

uct”

G=]/00101 494

mine centroids
‘ 3: Deter
lion Cc, == (2 +5443, 10494
8/3)Y7 == (3,67
(35.67, 9 )
Ge (8+7 +4 673,445 44/3)
=(7 4.33)
C; = (2+ 12,5 + 2/2) = (1.5 3.5)

1.95 4.33 6.61


D = 6.01 5.04 ths an ‘ 67 1.05
ls ee
6.52 1.58 nee
6.52 ‘7§7 5.7
5 452 1.58
tt 6.04
eon c,=(1.5,3.5)
{en group
oa 3

oti ioc clustering


10010004
G’ = 00101109
0100001090
eG 2 ObjecMhiects
does not move from group
g any more. So, the final clusters
are as follows:

pata points 1, 4 and 8 are clustered in group |

pata points 3.5 and 6 are clustered in group 2

3
pata points=2 and 7 are clustered in group
mple 5.6-5 MU - May 15, 10 Marks
yexamp
K-means algorithm on given data for k = 3. Use c, (2), c2 (16) and C, (38) as initial cluster centres.
31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30
pata :2, 4: 6. 9.
o Solution -
¢, =2,C,= 16,c,= 38

The numbers which are close to mean are grouped into respective clusters.

ky =3)
f= (24.6. 1,30)
(12, 15, 16, 14, 21, 23, 25), ks= (335, Modul
for new cluster group.
Again calculate new mean 5
c, = 3.75, ¢) = 18,c,;=
;
32

New clusters

3). ky = (12, 15, 16, 14, 21, 23, 25), ks =(31,35,


= (24,6, ¢2 = 18, ¢3 =32
30} c, = 3.75,

Clusters remains unchanged

Final clusters

ks = (31, 35, 30}


6, 3}, k, = {12, 15, 16, 14, 21, 23, 25),
K,=(2,4,

MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

CTA RT MU - May 16, 10 Marks yas initial cluster centres.

Apply K-means algorithm on given data for k = 3. Use c, (2), ¢) (16) and ¢3 (38
14, 21, 23, 25, 30
Data : 2, 4, 6, 3,31, 12, 15, 16, 38, 35,

MW Solution :
¢,= 38
¢, = 2.¢)= 16,

The numbers which are close to mean are grouped into resp ective clusters.

ky = (2.4.6.3),
ky = (12, 15, 16, 14, 21, 23, 25), ka= (31, 35, 30)
Again calculate new mean for new cluster group.

Cy = 3.75, ¢,= 18, ¢,= 32

New clusters
= 32
Ky = (24
3},,6,
ky = (12, 15, 16, 14, 21, 23, 25), ky= (31,3530) = 3.75, C) = 18,
, %
Clusters remains unchanged

Final clusters
K, = (2,4,6.3
k= (12,
),15, 16, 14, 21, 23, 25}, ks= (31, 35, 30]
t& 5.6.2 Hierarchical Clustering

Agglomerative Hierarchical Clustering


¢ cluster. In the next step, pairs of clusters
— In agglomerative clustering initially each data point is considered as a singl
merge din to a single cluster.
At the enda
are merged or agglomerated. This step is repeated until all clusters have b een
<__——— Root: One node
single cluster remains that contains all the data points.

— Hierarchical clustering algorithms works in top-


down manner or bottom-up manner. Hierarchical
clustering is known as Hierarchical agglomerative
clustering. |

— In agglomerative, clustering is represented as a Leaf: Individual


os . P A Bc cE F+—
where each merge is clusters
dendogram as in Fig. 5.6.1
represented by a horizontal line Fig. 5.6.1 : Dendogram
level, then each connected forms a
— Acclustering of the data objects is obtained by cutting the dendogram at the desired
cluster.
:
— The basic steps of Agglomerative hierarchical clustering are as follows

1. Compute the proximity matrix (distance matrix) 2. Assume each data point as a cluster.

3. Repeat 4. Merge the two nearest clusters.

5. Update the proximity matrix 6. Umtil only a single cluster remains

In Agglomerative hierarchical clustering proximity matrix is symmetric i.e., the number on lower half will be same as
the numbers on top half.

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications..A SACHIN SHAH Venture

Scanned with CamScanner


——
To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
aawith Llassitication and Clustering ..;
roaches to defining the distance between ely lets
Ye a distinguish th
e differg gle linkage:
oe.
oso! and Average linkage clusters. rorithm’s 1... 510
pif ete yinkage
} co? sina Zee: the distance between two Clusters
js
it 7 ine"ad 10 be eq ual to shortest dist ance from any Object and their measured
features
of one ¢ Juster to any member of other cluster,
> ie :
con”
ne" c ald (ii) object i > cluster r and object
j cluster g Compute distance matrix

the distance between two clusters is i


0 “‘i j tolinkage.
mp jete be equal to greatest distance from any Set abject as cluster
“of! ide fone cluster to any member of other cluster,
er ©
em a i— cluster r and object j cluster 5
“lax {d QJ ), object
yt «linkage, We consider the distance between any
era .
0 ers A and B is taken to be equal to average of all
/ ster wr ts oe Merge 2 closest clusters
we «betwee? pairs of object i in A andj in B.ie., mean
- gances
gs petween elements of each other.
ce
gist" Update distance matrix
Mean (dG. i object i cluster rand object j—> cluster s
pt fh $)7
7: The table below shows the six data points. Use all link methods to find clusters. Use Euclidian distance
mle?" measure.

Xty
D,/04 | 053
Dz | 0.22 | 0.38
Dg | 0.35 | 0.32
Dg | 0.26 | 0.19
Ds | 0.08 | 0.41
Dg | 0.45 | 0.30

| solution -

twe will solve using single linkage


firs
0.38) is,
The distance of data point D, = (0.4, 0.53) to D, = (0.22,
D= Jo4 ~ 0,22)" + (0.53 — 0.38)" = 0.24 Modul
spe distance of data point D = (0.4, 0.53) to D, = (0:35, 0:32) is, 5
(0.4 - 0.35)" + (0.53 - 0.32)" = 0.22
to D, = (0.26, 0.19) is,
The distance of data point D, = (0.4, 0.53)
D= (0.4 ~ 0.26) + (0.53 - 0.19)" = 0.37
to Dy = (0.08, 0.41) is,
The distance of data point D, = (0.4, 0.53)

D=V0.4 — 0.08)" + (0.53 - 0.41) = 0.34


is,
The distance of data point D, = (0.4, 0.53) to D, = (0.45, 0.30)
pb = VO4—045) + (0.53 - 0.30) = 0.23

MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


.
To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram usterin rf
- @crystal1419 -36
Cl
oe an d
Learning
(MU- sifiifcicaattil
Machine ing with clas

Similarly we will calculate all distances.

Distance matrix

D, 0

D, | 0.24] 0
D, | 0.22 | 0.15 | 9
D, | 0.37 | 0.20 | 0.15 | 9
0
D, | 0.34 | 0.14 | 0.28 | 0.29
0.25 | 0.11 | 0.22 0.39 | 0
D, | 0.23 |
Dd,2 D, Ds D, Ds
D, nis two in one cluster and recalculate distance
0.11 is smallest. D, and Dy have smallest distance. So, W e combine this
matrix.
. 0.23) = 0.22
Distance ((D,, D,), D,) = min (distance (D,, D,), distance (Dg, Dy) = ™n (0.22,
: ‘ 0.25) = 0.15
Distance ((D,, D,), D,) = min (distance (D,, D,), distance (D,, D,)) = mn (0.15,
: 22 = 0.15
Distance ((D,. D,), D,) = min (distance (D,, D,), distance (D,, D4) = ™) (0.15, 0.22)
‘ . 39) = 0.28
Distance ((D,. D,), D,) = min (distance (D,, D,), distance (D,, Dx) = mn (0.28, 0.39)
Similarly we will calculate all distances.

Distance matrix

D, 0 |
D, | 0.24] 0 |_|
(DD) [0.22] 015] 9 |_|
D, | 037/020} 015 | 0
D, [0.34] 014] 0.28 | 0.29] 0
D, D; (Dy.D,) Dy Ds
. : : and recalculate
‘ distance
0.14 is smallest. D, and D, have smallest distance, So, we combine this two in one cluster and
matrix.
(D,, D,))
Distance ((D,, D,), (D,, Ds)) = min (distance (D,, D,), distance (D,. D,), distance (D,, D,), distance
= min (0.15, 0.25, 0.28, 0.29) = 0.15

Similarly, we will calculate all distances.

Distance matrix
D, 0
(D,,D,) | 0.24] 0

(D,,D,) | 0.22 | 0.15 0

b, | 037] 020 | O15 | 0


D, (DD, (D,D,) Dy
0.15 is smallest. (D,, D,) and (D,, D,) as well as D, and (D,, D,) have smallest distance. We can pick either one.

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) [el Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

D>,
- gmallest. » We combi
ysis i Mbine thisj two in: one cluster and recalcu
alculate
0. ce matt’ .
(2 p
3 ie at rix

Dea D 1 (D,, D,,


Gana
D,, D, D,)

row as <ingle cluster remains (D;, Ds, Ds, Dy, D,, D,)
, e represent the final dendogram for single linkage as,
ext Ww
—_——_
Root. One node

D3 D6 D4 D5 p4 D1~— Leaf: Individual


Clusters
ji | :
we will solve using complete linkage
matrix
piste? D,| 0
D, | 0.24] 0
D; | 0.22 | 0.15] 0
D, | 0.37 | 0.20] 0.151 0
Dy | 0.34 | 0.14 | 0.28 [0.291 0
Dg | 0.23 | 0.25 | 0.11 | 0.22 039 | 0
D, D, D, D, D; Dy

0.11 is smallest. D, and D, have smallest distance. So, we combine this two in one cluster and recalculate distance
matrix.
Distance ((D3, Dg), D,) = max (distance (D5, D,), distance (D,, D,)) = max (0.22, 0.23) = 0.23 Module

Similarly, we will calculate all distances, 5


Patace matrix
D, |0
D, | 024] 0
(D,, D,) | 0.23 | 0.25 0
D, 0.37 | 0.20} 0.22 0
D, 0.34 | 0.14 | 0.39 | 0.29) 0
D, D, (WD, D, Ds

MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
with Classitic ation and Cluster
Machine Learning (MU-Sem 6-Comp) Leamin jyster and recalculate distancy
: in one ¢
. . this two
0.14 is smallest. D, and D, have smallest distance. So, we combine
matrix.
Distance matrix
D, 0
(D,, D,) | 0.34
(Dy, D,) | 0.23 | 0.39
D, |037} 0.29
D, (DD, (D.Dd_ Ps
one cluster and recalculate
bi ne these (W o in
We com
0.22 is smallest. Here (D,, D,) and D, have smallest distance. So,
distance matrix.
Distance matrix

D, 0
(DD) | 0.34
(Dy Dg, Dy) | 0.37 0.39
D, (Dy Dy Dy) (Dy Dor Pa)
teri calculate distance
. these: two in on e cluster and re
combine
‘ and D, have smallest distance so, we
0.34 is smallest. (D,, D,)
matrix,
Distance matrix en
(D,, Ds, Dy) 0 1
(Dy, Dg, Dy) 0.39
(D,,Ds,D,) (Dy, Dy Da)
Now a single cluster remains (D,, D;, D,, Dy, D,, D,)
Next, we represent the final dendogram for complete linkage as,

D3 D6 D4 D2 ODS D1

Now we will solve using average linkage


Distance matrix
D, | 0
D, | 0.24 | 0
D, | 0.22 | 0.15 | 0
D, | 0.37 | 0.20 | 0.15 | 0
D, | 0.34 | 0.14 | 0.28 | 0.29 | 0
D, | 0.23 | 0.25 | 0.11 | 0.22 | 0.39 | 0
D D, D, D Di DB

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) fl Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

0. x
m init” 2 (Ds Dg)» D,)= 1/2 (dista nee (D,, .
Sotus. (DD
we will calculate all distances, # Y= 120.29 4 0.23) = 0.23
t
icet matrix

ya”

(D Dn't , D
iis gmallest. D, and Dy have smallest distance So, we comb a D Ds
‘om ae :
ine this two in one cluster and recalculate distance
~
™malls:ix
_ranice ma trix
pista

D, have smallest distance. So, w i is .


(D3, Do) and D, € combine this two in one cluster and recalculate distance matrix
pistance matrix
D, 0

(D,, Ds) 0.24 0

(D;, Dg, Dy) | 0.27 | 0.26 0

D, (D,, Ds) (D;, Dy, Dy)

0.24 is smallest. (D,, D,) and D, have smallest distance. So, we combine this two in one cluster and recalculate distance
matrix.
Distance matrix
(D,, Ds, Dy) 0 0
(DyD,.D,)| 0.26 0 Module
(D;,Ds,D,) (D3, Dy, Dy) )
Now a single cluster remains (D,, D,, D,, D3, Dg, Dy)
Next, we represent the final dendogram for average linkage as,

D3 D6 D4 D2 D5 D1

Venture
Tech-Neo Publications...A SACHIN SHAH
ademic year 18-19) (M6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
Machine Learning (MU-Sem 6-Comp) foll
owing distance matrix and dray,
e on the
Example 5.6.8: Apply single linkage, complete linkage and average linkag
dendogram.

Mi Solution :

First we will solve using single linkage


Distance matrix

P,

P,

Py 3| 0
P, 7|0
p,}9}8{5)4]9
P, Pp, Py Py Ps fsPy Ps .
juster and recalculate distance matrix,
clus
2 is smallest. P, and P, have smallest distance. So, we combine this two in one
Distance ((P,, P,), P,) = min (distance (P, P,), distance (P;, P3)) = min (6, 3)=3

Similarly, we will calculate all distances.

Distance matrix

(P;. Py)
P,
P, 9 0
P, 8 4] 0

(P,,P;) Py Pa Ps
. . ar r and recalculate distance
So, we combine this two in one cluste
3 is smallest. (P,. P;) and P, have smallest distance.
matnx.
= 1 4 7
(P>, P,). distance (P,. P,)) = min (9, 7)
: ?
.
Distance ((P,, P;, P,), P,)) = min (distance (P,. P,), distance
Similarly, we will calculate all distances.

Distance matrix

(P,, P,P) 0
P, 7 0
P, 5 4] 0
(P,,P3P,) Py Ps

4 is smallest. P, and P, have smallest distance.

Distance matrix

(P,, P>, Py) 0

(Py, Ps) 5 0
(P,, P;, Py) (P,, Ps)

Tech-Neo Publications..A SACHIN SHAH Venture


(MU-New Syllabus w.e.f academic year 18-19) (M6-14)

Scanned with CamScanner


MU-Sem 6-Comp)
To Learn
ingle cluster The Art
remains (P,, Of
P,, Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
Py, P,P,
sint 2's

present the final dendogram for cin). Cation ang Chis


ee
We rep “MIE linkage os
as
Ne"
arte

_) solve using complete linkage .


we will
Ww
Ne rix
pir rane
mat

.— emallest. P, and P, have smallest dist ance, So,


JS > we combinIne
thi S two 1i
. S Tec alculate
pista
nN one Cluster distance
ce (CF Ir P,), P,) ), distance (P and é i mat
= max 3, P;)) = i
max (6, 3) = 6
(distance (F il 3

similarly, We will calculate all distances, | |

pistance matrix

is smallest. P, and P., have smallest dis : . .


4is sm 4 5 tance. So, we combine this two in one cluster and recalculate distance matrix

Distance matrix Module

ee P, 6 0
_ oma
(Py, Ps) 10 7 0
(P;,Ps) Py (P,, Ps)
6 is smallest. (P,, P,) and P, have smallest distance. So, we combine
this two in one cluster and recalculate distance
matrix,

(MU-
New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Distance matrix

(P,P, P,) 0
(P,, Ps) 10 0
(P,, P», Py) | (Par Ps)
Now s
a sing l €- Clu ster
] ust re Mai
a ns (P P Pa k Py, P, ,P,)

xt, We represent the fina


l dendogram for complete
linkage as,

No; w we will solve


P1 P2 P3 P4
| PS
usin g average linkage
Distance matrix

2
6 310
10 7| 0
9 5 4|0
P, P, Py Py Ps
2 is smallest. P, and P, have smallest distance. So, we
combine this two in one cluster and recalculate distan
ce matrix,
Distance ((P,, P,), P,) = 1/2 (distance (P,, P,), distan
ce (P,, P,)) = 1/2 (6, 3) = 4.5
Similarly, we will calculate all distan
ces.
Distance matrix

(P,P) | 0
P, 45 | 0
P, 95 | 7] 0
Ps 85 | 5} 4{ 0
(P,, P3) Py P, Ps

4 is smallest. P, and P, have smallest distance. So, we combine this two in one cluster and recalculate distance matrix,
Distance matrix

(P;, P3) 0
P, 45 |0
(PyPs)| 9 | 6] 0
(P,,P;) Py (Py, Ps)

(MU-New Syllabus w.e.f academic year 18-19) (M6-1


4) Tech-Neo Publications. SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

4, (Py Pp) and P, have smallest distance, g, eatin and Clustering


...P
» We comb;
ne this two in one cluster and recalculate distance

ce (P,, P,, P,)

io" (P,,P,) |8
|
(P,, P,, P;) (P,, P,)

ale cluster remains (P), P>, P3, P,, Ps)


_g sifle
Now
I represent the final dendogram for average linkage as,
er w €

P1 P2 P3 pg ps
6.9 MU- Ma 16, 10 Marks

(ET
le thm on
jo merative clustering algori given data and draw dendrogram. Show three clusters with its allocated points.
- : - —— :
singlee link method.

A| 9 [v2 | Vo |v VE I i |
Bl v2 | o |] Ww [3] lye
C | Vio | Ve o |W lysl 2
D | Viz | v5 0 2 3
E}] VB | 1 | v5 2 | 0 | Via
F | ¥20 | 8 2 3 113 | 0

a Solution :

Distance matrix

A 0
B | 1.414 0 Modu
C | 3.162 | 2.828] 0 : 5 2
D | 4.123 1 2.236 | 0 ZZ
E | 2.236 1 2.236 | 2 | 0
F | 4.472 | 4.242] 2 3 | 3.6 0
A B Cc D E F

1 is smallest. B, D and B, E have smallest distance. We can select anyone. So, we combine
B, D in one cluster and
recalculate distance matrix using single linkage.

MU-New Syllabus wef academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning (MU-Sem

Distance matrix -
a eg rermsmemernend

A 0 —_——
BD] 14s] 0 | oe
Cc | 3.162 | 2.236] 0
E [226] 1 | 2.236) 0
F | 4472] 3 2 | 3.6 :
A Bo cE F
.
re scalculate d distance matrix :
ate
i £s aU ; sJuster
one clus and
1 is smallest, B, D and E have smallest distance. So, we combine this (wo 1?

Distance matrix

B,D,E | 1.414 0

Cc 3.162 1

F | 4472] 3

] is smallest. B, D, E and C are combined together.

Distance matrix

A 0
B,D,E,C | 1.414 0
F 4.472 2 0
A BD,E,C F
:
In the questions three clusters are asked with their allocated points. Three clusters are A, (B, D, E, C) and
ar
F,
UExample 5.6.10 [UEIEYEIARO ir
For the given set of points identify clusters using complete link and average link using Agglomerative clustering.

A |B
P, 1 1
P2 | 1.5 | 1.5
P3 5
P, 3
P, | 4
Ps, | 3 | 35
| Solution :

First we will solve using complete linkage

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) [el Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

mallest
P4 and P6 have smallest dist P6
ance, We ¢; an
7 sate distance matrix using
15is § 7 ‘ select CCl ;anyo
ne, So, we
complete linkage, combine thi
¢ cule 5 in one cluster and
te matrix

sons smallest. P1 and P2 have smallest


distance, So, we combine
0. this {wo in one cluster and
recalculate distance
matrix.

sagnce matrix -————___


pista Pip2{ 0 To
P3 | 5.6561 0
P4,P6 | 5.201 | 2.236] 9
PS | 4.242 | 1.414 | 1.118
0
PI,P2) P3P4.P6 Ops

1,118 is smallest. P4, P6 and P5 are combined together.


Pistance matrix
P1,P2 0
P3 5.656 | 0
Module
P4,P5,P6 | 5.201 | 2.236 0
PI,P2 P3 ~—-P4,P5,P6
5
2.236 is smallest. P4, P5, P6 and P3 are combined together.
P1,P2 0
P3,P4,P5,P6 | 5.656 0
P1,P2 P3,P4,P5,P6

Next we will combine all clusters in a single cluster.

Now we will solve using average linkage.

. Syllabus w.e.f academic year 18-19) (M6-14)


MUNew BY Tech-Neo ications..A..A S SACHIN SHAH Venture
Tech- Publications.

Scanned with CamScanner


Machine Leamin MU-Sem
To Learn 6-Com
The Art )
Of Cyber Leaming Contact
Security & Ethical Hacking with Classification
Telegramand- @crystal1419
Clustering ...Page no 5-46

Distance matrix

Pl} 0
P2 | 0.707 | 0 _4 =
P3 | 5.656 | 4.949 | 0
P4 | 3.605 | 2.915 | 2.236] 0
PS } 4.242 | 3.835] 1414] 1 0
Po | 5201] 25 | 1.802 | 05 | 1.118 | 0
Pl =p2)o ps PH PS PG
O.5 1.5 isis smallest. P4 and P6 have smallest distance. We can select anyone .So,
~. wewe combine
combine this in one clust er and
recalculate distance matrix using comp
lete linkage.
Distance matrix

Pl 0
P2 | 0.707 0
P3 | 5.656 | 4.949 0
P4.P6 | 4.403 | 2.707 | 2.019 0
PS | 4.242 | 3.535 | 1.414 | 1.059 | 0
Pl P2 P3 -P4,P6 PS

0.707 is smallest. P] and P2 have smallest distance. So, we combine this two in one cluster and recalculate distance
matrix.

Distance matrix

P1,P2
P3 | 5.302 0

P4,P6 | 3.55 | 2.019 0

P5 | 3.888 | 1.414 | 1.059 | 0


PI,P2 P3 P4,P6 PS

1.059 is smallest. P4, P6 and PS are combined together.

Distance matrix

P1,P2 0
P3 5.302 0
P4,P5,P6 | 3.66 | 1.817 0
P1,P2 P3 P4,P5,P6

2.236 is smallest. P4,P5,P6 and P3 are combined together.


P1,P2 0
P3,P4,P5,P6 | 4.07 0
P1,P2 P3,P4,P5,P6

.
Next we will combine all clusters in a single cluster

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) [al Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

fe ps nodels s, The method is executed in an iterative way.y- | In the p 8 Method. This method is based on the concept of
j zit” «are calculated that hasae the maxim
: um likelih
oe ood of ils attributTesence
es of miss;Sing data the probability
ility distrib
distribution
ef
8.
input 10 the EM algorithm as the dat
ie

ste?
‘i Maxim.‘aat
iz
ion step is executed, in which the param
eters tre,
Of the Probability distribution of each model ;
e are again
wiated: : lgorithm i
cal .
stopping criteria of the algorithm is, when the distribution Parameters conver ge or reach
. Convergence is guarantee . the maximum number of
ns-
d as the algori thm increases the likelihoo
d at each iteration until it reaches the
erat? local
asi gorithm is executed in the following Way :

: re EMjnitialisation-step
al : Model's , ,
parameters are assigned to random values,
EXP ectation-step : Assign points to the model that
fits each one best
M aximization-step : Update the parameters of the model using
the Points assi gned
in the earlier step
values converge
Iterarate until parameter
st
Consider a set of
arting parameters given a set of incomplete (observed) data. Assume observed data come from a
specific model.
Use these to “estimate” the missing data. Formu
late some parameters for that model. Use this
to guess the missing
value (E step).

0 Use “Complete” data to update parameters. From missing data and observed data find the m ost likely parameters
(M step).
oO
Repeat step 2 and 3 until convergence.

Now let’s understand EM algorithm with the help of example. In example 1 we will see what au se
of EM algorithm is
Jin the example 2 we will see how to use EM algorithm.
fl
ample 5.6.11 : Suppose Coin A and B is used for tossing. Each coin is tossed 10 times. Following table
observation sequence of getting H and T when coin A and B is used. s the show
What is the Probability of getting H if
coin A and B is used?
Module
Number of Toss wd 5 7
Coin Used for Toss | 1|2/3|4/5]617/8]91
]| 10
B H}|H}H]|H/|HI TI TI TIT] T
A H|H/]|H|H]|H|JH|HIHI TI T
A H}H|H|H]H]|H|H|H|H] T
B H}/H}|H|H/I T/T] T/T ITI] T
A H}|H/]H|H|H/|H|H] TIT] T
ee

sademic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

I solution:
Let's first calculate number of H and T for each coin as follows mem

Number of Toss _—— Coin B


Coin Used for Toss |} 1. | 2/3 {4/5 ]6 [7 8 | 9 10 | Coin oT

B ninlulululr tr {e{ tet —-


A Hi} H| HH Tlafata pty! 6H,
A winlalalulal alee Le
B uinlululr{ Tr} title a
A nlalululalufuft{ tet | 7H, aT | ____
Total ee

of get er 24
Probability is used, P,= a6 * 0.
robability of getting Head when coin A

Asi . (
Probability of getting Head when coin B is used, Py=SHI = 0,45
calculate the
not given then how we will
In this example if the coin state is hidden ie. whether coin A or B is used 1s
probability?
coin is tossed 10 times. Following table shows the
A and B is used for tossi ng. Each is not
Example 5.6.12 : Suppose Coin of getting H and T for each roun
d. But which coin is used for
whic h round
observation sequence
and B?
known. Then how to calculate the probability of getting H for coin A
Number ofToss
7 | 8/9 | 10
Round number | 1 | 2 | 3 | 4
0 HIH/H{H}H| T/T TIT] T
H|H|H/|H|H|H]|H]H |T | T
1
| H| T
2 H/H|H/|H/]H/H]|H]H
HIH/H/H/T/T{ TI TIT T
3
HI H|H/H|H/H/H/ TIT T
4

M Solution :
is used.
state is not known (Coin A or B) then EM algorithm
When only observation sequence is known, but the
Now Let's solve this example using EM algorithm.
Assume P, = 0.6 and P, = 0.5

Round 0: In round 0 there are 5 H, 5 T and total tosses are 10.

B coin as,
Now we will calculate probability of using A and
A (P,)"(1- Pa)” "= (0.6)(1 = 0.6)" * = 0.00079626
il}

B (P,)( = Py) " - 0.5)!" * = 0.0009765


"= (0.5)(1
Now we will apply Normalization,

Ay = Xap=0-45
ALL

Tech-Neo Publications...A SACHIN SHAH Venture


(MU-New Syllabus w.e.f academic year 18-19) (M6-14)

Scanned with CamScanner


To Learn The BArt Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
= 0.55
By = A+B
| al calculate probability of getting H and T for Coin 4
w an dB
yo" Ay = Ay* Number of H=0,45* § = 9 5¢ Send,
A; = Ay* Number of T=0.45
# 5 _ 5 25
By, = By * Number of H = 0.5
5* 5=2.75
Ay = By * Number ofT = 0.55*
5 — 2.75

round 1 there are 8 H, 2 T and total tosses are 10


:in
B coin as,
io _yewill calculate probability of using A and
“a: .
d I ‘

3 ed-Pa” “= (0.6)'(1 - 0.6)"""* = 0.002687


pal Pa) = 05) -0.5)""* = 0.000975
we will apply Normalization,
Now
A
Ay = A+B>%73
B
By = A+B 9:27

xow; We will calculate probability of getting H and T for Coin


A and B for round | as ;
Ay = An * Number of H =0.73* 8 = 5,84

A, = Ay * Number of T = 0.73* 2 = 1,46

By, = By * Number of H =0.27* 8 =2.16

Ay By * Number of T = 0.27* 2 = 0.54

pound
und 2: 2: In round 2 there are 9 H, 1 T and total tosses are 10.
Now we will calculate probability of using A and B coin as,

Ae (PyU-P,) = (0.6) (1 -0.6)""-* = 0.004031


B= (Pa (1- Py) = (0.5)(1 -0.5)'""* = 0.0009765
Now we will apply Normalization,
Mod5
An = A
A+B = 0.80

By = 44p=0.20B a
Now we will calculate probability of getting H and T for Coin A and B for round 2 as,

Ay = Ay * Number ofH = 0.80 * 9 = 7.2

A; = Ay * Number
of T=0.80 * 1 =0.8
By, = By * Number of H=0.2*9=1.8
Ay = By * Number of
T = 0.2 * | = 0.2

MU, ae
” New Syllabus w.e.f academic year 18-19) (M6-14) [al Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
tion and Clustering ...Page no (5-59
an Classifica
(MU-Sem 6-Com Learning with C
Machine Learning
Round 3: In round 3 there are 4 H, 6 T and total tosses are 10.

Now we will calculate probability of using A and B coin as,

A = (P,)"(1-P,)8- "= (0.6)(1 - 0.6)'""* = 0.0005308


B = (P,)'(1-P,)* "= (0.5)*(1 - 0.5)'°~* = 0.0009765
Now we will apply Normalization,
A
Ay = 7yp=035

By = 44B p= 0.65
; as,
Now we will calculate probability of getting H and T for Coin A and B for round 3

Ay = Ay * Number of H = 0.35 * 4= 1.4


Ay = Ay * Number of T=0.35 *6=2.1
B, = By * Number of H = 0.65 * 4 = 2.6
Ay = By * Number of T = 0.65 * 6 = 3.9

Round 4: In round 4 there are 7H, 3 T and total tosses are 10.

Now we will calculate probability of using A and B coin as,

A = (P,)(L-PA)Y "= (0.6)'(1 - 0.6)" 7 = 0.001792


"= 0.5)’ (1 - 0.5)!" 7 = 0.0009765
B = (Py)"(—Py)®
Now we will apply Normalization,

Ay =
A
AGB =O

B
By = 7, p= 035
Now we will calculate probability of getting H and T for Coin A and B for round 4 as,
Ay = Ay * Number of H = 0.65 * 7 = 4.55
A; = of T = 0.65 * 3 = 1.95
Ag * Number
By, = By * Number of H = 0.35 * 7 = 2.45
A, = By * Number of T= 0.35 * 3 = 1.05
Now we will calculate P, and P, by summarizing the results of round 0 to round 4,
Coin A Coin B
Ay | Ay | By | By
0 2.25 | 2.25] 2.75 | 2.75

J 5.84 | 146] 2.16 | 0.54

2 7.2 | 08 18 | 0.2

3 1.4 2.1 26 | 39

4 4.55 | 1.95 | 2.45 | 1.05

Total | 21.24 | 8.56 | 11.76 | 8.44

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
ting H when Coin A is used, P,=
of £¢ 2H . + yp =0.71
>

Apillt
, hen Coin B is used, P,= =H
wy 0 r
yililY
getting Hw TH+Ep = 0.58
pab values of P, and P, second iteration j
ce new SaPplied, This
jap these f P, and P, and then that will be the fina)
ot" iq the values ° Probabilities ess and until there will

Py supervised Learning after Clustering


ob
} $. svaneuibe
jced learning the target variable is not known, The re]
rvis ationship b
+ . etween j‘ , : .
{0 past h e network parameters. "Put and output pattern is studied
ate #
10 u ing
ised learning &
gives the output as a label OF a tar: :
&et which can be used :
past urchased history of customers, we can divid e th
n the P’ e Customers Y Supervised learning. For example.
in to
st ¢ ses particular items frequently. Then this result can different Broups such as customers who
be Used for crosss sel};
Ng purpose,
/ -nother example of document clustering. Suppose we have
rg soe 9 d to sports, fashion 4 set of docum .
5 $a ‘ashion ¢ ati n. This: 2 .
a and educatio type of problem c an ents that contains different news
: as re key b € Solved by grou
ceywords. After pi i the d
using g the the clc uster learning, a numb
i er of Clusters are crea
sna c cotet d ibasen
d ion keyword
ne
gjqnilantve
ly t vill contai Atte
n similar docu . me
nts terms. After
Creati‘ ng the cluste
ch clus rs seman lic features
can be used to identify
cluster’
ters depend on supervised model like SVM to make accurate
categorizations,
these
raim is to create
more accurate categorizations beca
_ in th is OU sent can be related to use if we want tot €st a new
these categorizati document we should know
izati ons or not, if
this. doe ume
6.5 Radial Basis Function} |
g 5
ho
dial Basis Function network is a multiplayer feed forward network. In RBF ‘n’ number of input neurons and ‘m’
Ra I

per of output neurons are present. Single hidden layer exists


num between input and Output layer.

Hypothetical connection is present between the input layer and the hidden layer, whereas weighted connections are
p
resent
>
between hidden layer and output layer.

Weights present in all interconnections are updated using the


training | a Igorithm ° Inputlayer Hidden layer Output layer
Radial basis function network, represented in Fig. 5.6.2, can be
: oT +5
shown as a mapping : RW > NR

- Let the input vector is P € X' and prototype of the input vector
isCe KR" (1 Si<u) be the. The output of each RBF unit is as
MC ewe nene

follows:

R(P) = R,(IP-Cll)i=1....... u (5.6.1)


Where (ll ll) indicates the Euclidean norm on the input space. Fig. 5.6.2 : Radial Connected neural network

MiNlew Sylabus wef academic year 18-19) (M6-14) Tech-Neo Publications... SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Learning (hat it js
‘ancien 5 du e (0 the reason
: ear pais (uncle
. ; fang
~ocestan function : mostly used among all passible
| radial ba
~ Generally, the Gaussian is
(5
1.6.2)
factorizable. Hence
7/0) |
RP) = exp - CPG)
Where o, represents the width of the i" RBE unit.

— RBF neural network's i" output y, (P) is,


(5,0,3)
ul

y(P) = LY R(PYXwGid
=] ho cani ive output is W Gj i).
pll ficld to the i"
ath ngth of the | rece
is w(j, 0), R, = 1, and weight or stre
of j”
the outp ut
can say from
nthe following analysis.acteWe
where, Bias Pa
; i, :
] work _ complexitwe y is reduced by not taking bias in to consideratio
n | near
= Net char rized by a li
uts of radi al basi s func tion cl assifier are
outp
ie

.3) that the


Equation (5.6.2) and Equation(5.6
discriminant function.
19 the k-dimensional
separability of classes on
— They create linear decision boundar ies in the output space. Consequently,
e er by th ¢ no e ir transformati
nl in
fi ey i
s: seperability 1s ere

at ed
space strongly affects the performance of RBF classifier. Thi
carried out by the RBF units. a complex pattern classification
s The orem slales that
Theorem on the separability of patterns also called as Cover’ able than in a low-

li ke ly to be lincarly se¢P ar
onal space nonlinearly is more
problem represented in a high-dimensi inpul §| pace. If
we increase the
dime nsio n of
zr where ris the the case of smal‘ l
dimensional space, the number of Gaussian nodes u especially, 1 1 .
. e of ove fitting.
i
Gausssian
numberor of of Gaus units then it‘ may result in poor gene ralization becaus
training sets .
of hyperspheres,
num ber of sub spa ces which are in the form
RBF neural network classifies the input space int © a
— ing approaches
ork s to sol ve the problems An this cluster
ly used in RBF neural netw
Clustering algorithms are wide also called as unsupervis
ed learning algorithms.
the patte rns is nol used hence they are
category information about

Ya. 5.6.6 RBF Learning Strategies

design of RBF network.


Learning strategies are followed in the

1. Fixed Centers selected at random


ion of centers may be
tion defi ning the acti vati on func tion of hidden unit s. The locat
Assume fixed radial basis func
chos en randomly from training data set.

RBE centered at t; is defined as,


5 -m(Ix-t1")
G(ix-
- w
I") =—_
exp
Hemax)

ance between selected centers


is numb er of centr es and dy (may represents maximum Euclidian dist
Where m
ted as
Gaussian matrix is construc

G= (Wy, (x), W(X), b)

SHAH Venture
18-19) (M6-14) aI Tech-Neo Publications...A SACHIN
(MU-New Syllabus w.e.f academic year

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
ing to fits ( selects
7 to second selected center and ed cente r
and y, (x) represents the hidden
yo en u represents the bias,
invyerse matr ix is constructed as

se gi = (GG) 'S

oo blem for XOR the i :


: gre designing the pro © mput and desired Fesponses are as follows

Tint
nput nattera
pattern |

(1,1)
(0,1)
(0,0)
(1,0)

qi calculate the Euclidian distance between all the centres,


Wwe wi

ijian distance between (1, 1) and (0, 1) centres


= Va-07+0-1) =Vi
gucl 101

(1, 1) and (0, 0) centres


gyclidian distance between
(1-0) +(1-1) = 2
and (1, 0) centres
guclidian distance between (1, 1)
= Va-N+0-0y =I
(0, 1) and (0, 0) centres
guclidian d istance between
J(o-0y7 +(1-0y =V1
(1, 0) centres
Fuclidian distance between (0, 1) and
-
+ (1-0)
afco =y2 Module
es 5
Fuclidian distance between (0, 0) and (1, 0) centr
[0-07 +(1-0y =I
Maximum Euclidian distance is between (1, 1), (0, 0) centres and (0, 1), (1, 0)
We will select the two centres as t, = (1, 1) and t, = (0, 0)
-m(Ilx-t I" )
G(Ilx-t,l) = exp
di. (max)
2
2 -2(ilx-e 10
G(llx-t.1r = ex -2(lx=4N")
) p 2

Wa
‘New Syllabus w.e.f academic year 18-19) (M6-14) tal Tech-Neo Publications...A SACHIN SHAH Venture

aaa a 8,

Scanned with CamScanner


e no (5-54
To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - us
nd Cl terin ...Pag
@crystal1419
lo' n a icat
s!
Clas
in with
Leal

G(ilx-11P) = exp(-lx-4
I")
Now we will construct a G matrix (X-b y’) and third column is for
nd column. exp (-
For the first column we use, exp (~ (X - t,)° ) and for the seco
bias (1)

Now we will see how the values are calculated

For example for first row and first column X = (1, 1) andt, = (1, 1)
= exp(-((1-
1) +(1-1))) =exp (0) =!
For example for second row and first column X = (0, 1) and t, = (1,1)

= exp(-((0-1)' + (1 - 1))) = 0.3678


For example for third row and first column X = (0, 0) and t, = (1, 1)

= exp (-((0- 1)" +(0-1)°))


= 0.1353
For example for fourth row and first column X = (1, 0) and t = (1, 9)

= exp (-((1- 1)" + (0 1)*) = 0.3678


For example for first row and second column X = (1, 1) and t, = (0, 0)

= exp(-((1-0)'
+ (1 —0)°))
= 0.1353
For example for second row and second column X = (0, |) and t, = (0, 0)
4

= exp (-((0-0) + (1 -0))) = 0.3678


For example for third row and second column X = (0, 0) and t, = (0, 0)
= exp(-((0-0) + (0-0)))=1
For example for fourth row and second column X = (1, 0) and t, = (0, 0)

= exp (-((1-0) + (0-0))) = 0.3678


1 0.1353 1
0.3678 0.3678 |
0.1353 1 I
0.3678 0.3678 |

Now by multiplying G by G we will get


1.2889 0.5412 1.8709
G'G = | 0.5412 1.2889 1.8709
1.8709 1.8709 4

Now we will find the inverse of this matrix by dividing the adjacency matrix by determinant
Determinant = 1.2889(1.2889 * 4 - 1.8709 * 1.8709) — 0.5412 (0.5412 * 4 - 1.8709 * 1.8709)
+ 1.8709(0.5412 * 1.8709 — 1.2889 * 1.8709) = 0.2392
Now to find the adjacency matrix we find the values row wise and we consider the alternate + and — sign for the above
matrix
For first row first column = + (1.2889 * 4 - 1.8709 * 1.8709) = 1.6553
For first row second column = — (0.5412 * 4— 1.8709 * 1.8709) = 1.3355
For first row third column = + (0.5412 * 1.8709 — 1.2889 * 1.8709) = — 1.3989

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) [el Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
es are calculated and we will pe
1.6553 1.3355

13355 1.6553 1.3989


gajaceh? - 1.3989 1.3089 1 3694
Gia) = adjacency/determinant
1.82
Gi ' = TeaO'Glel=|
(GG 0.6727TX ~= 1.24 96
1 2496 Lie27 ~ 1.2496
0.67
09202 1.4202 0.938299
99 _),
1.2496
1.8273 ~ 1.2496 0.6727 1 1.42
W = G'id= 0.6727 ~— 1.2496 1.8299 2496 0

0.9202 1.4202 _ 9.99


920293
Sete = 1,249
1a ST
|

— 2.5018 AIL Y
W = - 2.5018

2.8404

; aterpolation with regularisation


git
. ered , at it is defined as
centers are) selected. RBF cent
Wy) = lX-tlly]
expl-(

+s n Me matrix is constructed as
yatio
yorerP®
gd = (Wy OD, ¥209, 30), Wy (X))
w(x) represents the hidden function corresponding to centers8.
wh ere YN
n weight matrix is calculated as
ie Wels
W = Oo. ! d

—7a14: Design a XOR problem with given data set as (1,1) , (0,1 , ; ,
pample 8 interpolation matrix. (0,1) , (0,0) , (1,0) and also find Weight vector and

4 Solution :
As we are designing the problem for XOR the input and desired responses are as follows:
Input pattern | Desired output
(1) 0
(0,1) 1 Module
(0.0) 0 5
(1,0) !

Aswe have to select all the centers t, = (1, 1). t = (0, 1), t, = (0,0) and t, = (1, 0)
The Interpolation matrix is given as
I 0.3678 0.1353 0.3678
0.3678 1 0.3678 0.1353
> =
0.1353 0.3678 1 0.3678
= 0.3678 0.1353 0.3678 1

'MN "Syllabus w.e.f academic year 18-19) (M6-14) SACHIN SHAH Venture
eI Tech-Neo Publications...A
i

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning

do = 0.1809 -0.4918 1.3373 — O.4VIB


0.4918 0.1809 - 04918 1.3373

~ (9837
1.5182
Woe 6 =! _ 99337
1.5182 eer

1 5.7 UNIVERSITY QUESTIONS AND ANSWERS Oo

> May 2015 ion 5.6.1)


(5 Marks)
s rithm for clustering analysis. (ANS. - :Refer sect
Q.1 Describe the essential steps of K-mean algo
nt res.
initial cluster ce
Q.2 Apply Igori
pply K-m K-means algorithm 16) and¢ 4 (38) as
on given data for k == 3. Use ¢, (2), C (16) (10 Marks)
Examp le 5.6.5)
Data : 2, 4, 6,3, 31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30 (Ans. ; Refer
with
q hyperplane, margin and support vect(5orsMarks)
Q.3 > What is SVM ? Explain the following terms: hyperplane, separatin
suitable example. (Ans. : Refer section 5.5.1)

> May 2016


stat centres. aimed
C2 (16) and c, (38) as initial cluster
Q.4 Apply K-means algorithm on given data for k = 3. Use c; (2), arks
25, 30 (Ans. : Refer Example 5.6.6)
Data : 2, 4, 6, 3, 31, 12, 15, 16, 38, 35, 14, 21, 23,
(5 Marks)
? (Ans. : Refer section 5.5.1)
Q.5 — Whatare the key terminologies of Support Vector Machine
support vector
maximum margin separation in
Q.6 Write detail notes on: Quadratic Programming solution fo r finding (10 Marks)
machine. (Ans. : Refer sections 5.5.1 and 5.5.2)
ith its allocated
clustering algorithm on given data and draw dendrogram. Show three clusters wit
Q.7 Apply Agglomerative
points. Use single link method. (Ans. : Refer Example 5.6.9) (10 Marks)
a B c d E F

al o | v2 | vio |vi7 | v5 | v20


bi 2 | o | ve | 3 | ' [vie
c | {10 8 0 V5 v5 2

d|ji7 | | v5 0 2 3

e| V5 | 5 2 o | ¥13

1|y20 | vie | 2 3 |v |
> May 2017
(10 Marks)
the margin ? (Ans. : Refer sections 5.5.1 and 5.5.2)
Q.8 Whats Support Vector Machine 9 How to compute
using Agglomerative clustering.
y clusters using complete link and average link
Q.9 — For the gi ven set of points identif
(10 Marks)
(Ans, : Refer Example 5.6. 10)

Venture
(M6-14) [al Tech-Neo Publications...A SACHIN SHAH
(MU-New Syllabus w.ef academic year 18-19)

°:

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
machine Learning (MU-Sem 6-Com
Learning with Classification and Cluster
in g Pago no (8:52
A B
P, 1 1

P, 15 | 1.5
3 5 5
Ps | 3 4
Ps} 4 | 4
Po | 3 | 35
> May 2019
: . IDain! ‘Dry’. Transition
a. 10 Consider Markov chain model for ‘Rain’ and ‘Dry’ is shown in following
. orgs te fliet
figure Two states: ‘Rain and ‘Dry bilities :
probabilities: P(‘Rain'l'Rain’) = 0.2, P(‘Dry'Rain’) = 0.65, P(‘Rain'I'Dry’) = 0.3, airiec.t iti abilities «
P(‘Dry'I'Dry’) = 0.7, Initial prob
say P(‘Rain’) = 0.4, P(‘Dry’) = 0.6.Calculate a probability of a sequence of states {‘Dry’, 'Rain’, ‘Rain’, ‘Dry’).. io)
(Ans. : Refer Example 5.4.3) rks
coma

0.2 SL rain ) oss Coy YS 07


= 03 —
.
g.11 ‘
Explain following terms Initial hypothesis, Expectation step and Maximization step w.r.t E-M algorithm. in How
Explain
Initial hypothesis converges to optimal solution? (You may explain it with
an example)
(Ans. : Refer section 5.6.3)
(10 Marks)

g.12 Explain following terms w.r.t Bayes’ theorem with proper examples : (a) Independent probabilities (b) Dependent
Probabilities (c) Conditional Probability (d) Prior and Posterior probabilities Define Bays theorem based on these
Probabilities. (Ans. : Refer section 5.3.1)
(10 Marks)

Q.13 Draw and discuss the structure of Radial Basis Function Network. How RBFN can be used to solve non linearly
separable pattern ? (Ans. : Refer section 5.6.5)
(10 Marks)
Q.14 Illustrate Support Vector machine with neat labeled sketch and also show how to derive optimal hyper-Plane?

(Ans. : Refer sections 5.5.1 and 5.5.2) (10 Marks)


> Dec. 2019

Q.15 Why is SVM more accurate than logistic regression ? (Ans. : Refer section 5.5)
(5 Marks)
Module

ae
Q.16 Explain Radial Basis Function with example. (Ans. : Refer section 5.6.5)
(5 Marks)
Q.17 Explain various basic evaluation measures of supervised learning Algorithm for Classification.
(Ans. : Refer sections 5.1, 5.2 and 5.3)
(10 Marks)
Q.18 Define Support Vector Machine. Explain how margin is computed and
optimal hyper-plane is decided ?
(Ans. : Refer sections 5.5.1 and 5.5.2)
(10 Marks)
Q.19 Write short note on : Hidden Markov Model. (Ans. : Refer section 5.4)
(5 Marks)
Q.20 Write short note on : EM algorithm. (Ans. : Refer section 5. 6.3)
(5 Marks)

(MU-New Syllabus w.e.f academic year 18-19) (M6-14)


[el Tech-Neo Publications...A SACH
IN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Leaming (AU-Sem 6-Co yse naive Bayes classifier to fing


, Wind= Strong
Q.21) For 4 unknown tuple 1 »<Outliook =Sunny, Temperature = Cool (Ans.- Reter Example 5.3.1) (10 Marks)
whether tne class for Play Tennis is yes or t
no. The dataset is given bel
DY"ow ay
ne eee ee naeent Tennis
“Pl Ten nis _
__ Outlook _| Temperature | a Wind |
adi No
___ Sunny __Hot Weak ho
_ Sunny Hot Strong 7 |
_Ovarcast Hot Weak | —
n en toenansssg-sMildroomed
Raiit
ecb Se Weak a 03ae
Rain Cool Weak +1
0
Rain Cool Sing
Overcast Cool Strong |
__Sunny Mild Weak | No
Sunny Cool Weak | __—-¥eS_——_
Rain Mild Weak | ves__
Sunny Mild Strong | ¥eS
Overcast Mild Strong Yes
Overcast Hot Weak Yes
Rain Mild Strong Lc

. : . 9

Multiple Choice Questions Q,5.5 In Clustering which idea tvst ” output


Is notabou
(a) We do not have
Q.5.1 Which is not 4 desirable property of a logical rule based (by We group data in to different groups
system 7 (c) [tis unsupervised method y
(by Attachment (J) We have the idea about output Ans. : (d)
(a) Locality
(c) Truth Functionality (dy Global attribute Explanation : We do not know the expected output.
.
¥ Ans. :: (b) Q.5.6 Which of the following clusteri. ng algorithms suffers
Explanation : Remaining three are the properties of Rule from the problem of convergence al local optima ?
based system, (a) K-Means clustering algorithm
(by K-Means clustering algorithm and Expectation-
Q.5.2 A rule-based system consists of a bunch of IF-THEN
rules. Maximization clustering algonthm
(c) Agglomerative clustering algorithm and Diverse
(a) TRUE (by) NO
(c) MAYBE (dy) CANT SAY Y Ans. : (a) clustering algorithm
: 5 ing algorithm and Di
Explanation : Rule base classifier classifies the records (d) Agglomerative clustering alee & erse
a clustering algorithm and K- Means clustering
by matching i
y matching if and then conditions. algorithm Y Ans. : (b)

Q.5.3° SVM can be used to solve problems. Explanation : Out of the options given. only K-Means
(a) Classification (by Regression clustering algorithm and EM clustering algorithm has the
(c) Clustering drawback of converging at local minima.

(d) Both Classification and Regression Y Ans.:(d) | Q.5.7 | Which of the following algorithm is most sensitive to
Explanation : With the help of SVM we can get outliers ?
categorical as well as numerical output. (a) K-means clustering algorithm
(b) K-medians clusteri Igori
Q.5.4 SVMisa learning algorithm. ering algorithm
(c) K-modes clustering algorithm
(a) Supervised (b) Unsupervised
(d) K-medoids clustering algorithm Y Ans. : (a)
(c) Semisupervised — (d) Reinforcement ¥ Ans, : (a)
Explanation : Expected output is known.

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) al Tech-Neo Publications..A SACHIN SHAH Ventur
e

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
ie no 5-59

(MU-Sem 6-C omp) with Classification an J Clus


tering. .Pa
machine, Learning Learning
v Ans. : (d)
tion
r ‘xplanation
p nai : Out of all the options, K-Means clustering (d) Hypothetical connec values are
weight
algorithm
_ 1s Most sensitive to outliers aas it uses the ome : . hidde n layet
Explanation : In RBF ey are no
t user
of cluster data points to find the cluster center = gre «. Th
learning strate
calculated using the
Which algonthm is used for solving temporal defined.
Q. 5.8 thm 1"
probabilistic reasoning ? 1s 4 clustering algori
Q.5.14 Which of the following
(a) Hill-climbing search
machine learning ?
(b) Hidden markoy model
(a) CART
(c) Depth-first search zation
(b) Expectation Maximi
(d) Breadth-first search b
vAAns. : (b) Bayes
(c) Gaussian Naive 2 (D)
Y Ans.
Explanation : HMM: model us es the probability
— (d) Apron Naive ba yes
along with time dimension. rin g. C ART and
: EM is clu ste
Explanation or ty use
ri
d ub

How does the state of the process is described in HMM ? ri thms whereas Ap
Q. 5.9 are classifi cation algo
(a) Literal (b) Single random variable Association algorithm.
(c) Single discrete random variable ective when
Q.5.15 SVMs are less eff
random variable v Ans. : (c)
(d) me ly seperable
(a) The data is linear
Explanation : States are represented in HMM model and ready to use
(b) The data is clean
using Single discrete random variable. OV erlapping points
sy and contains
(c) The data is noi Ans. : (¢)
false statement related to Support vector dat a is cl ea n
Select the The
Q. 5.10 (d) bution od data is
is noisy and distri
machine. Explanation : If data to draw proper
(a) SVM can be used as binary classifier not proper then w ¢
will not be able
(b) SVM can be used as Multi-class Classifier decision boundary.
ity from given data TP = 30,
(c) SVM can not perform non-linear classification Calculate the Sensitiv
and non-linear Q. 5.16 = 10
(d) SVM can be used for linear TN = 930, FP = 30, FN
classification Y Ans. : (c) (b) |
(a) 0.75 v Ans. : (a)
SVM can perform non-linear (d) 0.99
Explanation (c) 0.86
kernels .
classification using the non liner Explanation :
EP 3040 L=
Sensitooivity = [7p + FN)
is 0.75
Q.5.11 In Regression trec output attribute
(b) Discrete
(a) Categorical data TP = 30,
Y Ans. : (c) from given
(c) Numerical (d) Range Q. 5.17 Calculate the Accuracy
. = 10
.
tree at the leaf node we get TN = 930, FP = 30, FN
Explanation : In regression (b) |
(a) 0.96
the numerical outpul. v Ans, : (a)
(c) 0.86 (d) 0.99
valid iterative strategy for (TP + TN)
Q.5.12 Which of the following is/are . soe _ + P+
TN +F
cy =TTp+| TN
)
EP +FNFN)
clustering analysis? Explanation : Accura
treating missing values before
(a) Nearest Neighbor assignm
ent 960 Module
(b) Imputation with mean
= To00 = 96 5
tion Maximization be used to answer any
(c) Imputation with Expecta Q. 5.18 How the bayesian network can
algorithm query?
v Ans. : (c) Joint distribution
lier (b)
(d) Imputation with out (a) Full distribution
n with EM
we perform imputatio (c) Partial distribution (d)
Zero distribution
Explanation : When ore applying
ues can be handled bef Y Ans. : (b)
algorithm missing val
clustering. first we have to find
Explanation : In Bayesian network
nection between from this we can
on network the con joint probability distribution then
Q. 5.13 In Radial basis functi
is called as
input and hidden layers answer any query.
n
(a) Weighted connectio
(b) Intra connection

SACHIN SHAH Venture


[| Tech-Neo Publications...

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

~Sem 6-Comp)

The basic purpose


Q.5.19 For Clustering, mate! does not require af clustering 18 tO
©. 5.24
(a) Unlabeled data (a) Combine the data pomts inte single proup
ib) Labeled data
(¢) Numencal data (hy Classify the data point into different Classes
(d} Categorical data
5 alues of inpat data points
fc} Preciet the output
¥ Ans stb)
m group ¥ Ans, . (b)
Explanation ; In clustenng labelled qd) Search data tre
data is not present on propertics
st ering based of comm
In this based on common properties data is grouped Explanation : Tn clu
erent clusters.
together, data is grouped in to difl
step
the two repeated steps, the
Q.5.20 In EM algonthm that finds maumum hkehhool Q. 5.25 In EM algonthm, Out of
esumates for a model with latent vanabl In
es. You are optimization step
Supposed to modify the algorithm so that it finds
MAP (a) minimization step (b)
maximization step
estimates instead. Which step or Meps
do you need to (¢) normalization step (d)
modify? Y Ans.
: (d)
of EM algonthm.
Explanation : As per working
(a) Expectation (>) Maximization
(c) No modification required to the separating
closest
(d) Sorting Q. 5.26 The training examples
¥ Ans. : (b) hyperplane are called as
Explanation : In maximization Mep we reiniti (b)_— Test vectors
alize the (a) Training vectors
parameters. ors (d) validation vectors
{c) Support vect
Q.5.21)
Y Ans.
: (¢)
The main use of “kernel trick” are
t to decision
(a) Used to perform binary classification Explanation : The points which are closes
vector s.
(b) Used to perform Multi-class classification boundary are called as support
(c) Used to perform linear classification the false statement regarding soft margin of
Q. 5.27 Identify
(d) Implicitly mapping the inputs into high-dimensional SVM.
allows for
feature spaces Y Ans. : (d) (a) Iris a modified maximum margin idea that
Explanation mislabeled examples.
: For non linear classification inputs are
the "yes"
mapped to high-dimensional feature spaces. (b) If there exists no hyperplane that can split
and “no” examples, the Soft Margi n metho d will
Q. 5.22 The goal of the SVM is to choose a hyperplane that splits the examp les as

(a) Find the optimal separating hyperplane which cleanly as possible


minimizes the margin of waining data (c) Ituses the concept of slack variables
(d) It does not use kernels. v Ans: .
(d)
(b) Find the optimal separating hyperplane which
maximizes the margin of training data Explanation : SVM uses the kemels.

(c) Finds hyperplane without any criteria Q. 5.28 If you are using Multinomial mixture models with the
(d) Does not find hyperplane for classification eXpectation-maximization algorithm for clustering a set
¥ Ans. : (b) of data points into: two. clusters, which of the
assumptions are important ?
Explanation : If margin is maximum then we can gain (a) All the data points follow two Gaussian distribution
more confidence in our classification. (b) All the data points follow n Gaussian distribution
Q. 5.23 Probabilities in Bayes theorem that are changed with the (n> 2)
help of new available information are classified as (c) All) the data points follow two multinomial
distribution
(a) Independent probabilities (d) All the data points follow n multinomial distribution
(b) Posterior probabilities
(n> 2) ¥ Ans. : (c)
Explanation : For Multinomial mixture models data
(c) Interior probabilities
v Ans. : (b) points should follow two multinomial distributions.
(d) Dependent probabilities

Explanation : Based on new evidences original


probabilities are updated.

Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


Machine Le arming (MU-Som
6 Comp
To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
Q. §.29 The incthod
of fing
ting Kidder stueture fre 1
token free
data is calleg M uniabeled (0) the probability of chasse C given a spe
populanon P
(a) Supervised lea
rning . . role Geken Fresit
(0) the probability of class © given a sary alien
(b) Unsupervis : fC witte im
ed | e aming of €
(©) Reinforcemen population P divided fy the probability
t leaming ¥ Ans. id
the entire population P
(d) Instructional Learni Explanation: Standard definiuion of Hit
ng
Explanation : In Un ¥ Ans. :(b)
supervised learming Q. 5.34 ima relatively
ised unlabeled data is Instead of representing knowledge
‘ ry
declarative, static way Gas a buneh of “lve
things mee
oritis«a ‘
Q. 5.30 Classification Irie), rule based system represent Knowledye tn tert
Problems are distinguished — trom that tell you what you should do or what you
estimation proble
ms in th al
(a) classification pro could conclude in different situations
ble Ms require
the output attribute to
be numeric. (a) Raw Text (b) A bunch of rules

(b) classification pro (c) Summarized Text


' blems require the outp
ut attribute to
he cate gorical, (d) Collection of various Texts Y Ans. : (b)
(c) classification Problems Explanation : Ino rule based system Knowledge ts
do not allow an Output
altnbute. represented using TP-THEN rules,
(d) classification problems are Q. 5.35
designed to predict future Autonomous Question / Answering systems are
outcome,
Y Ans. : (b) (a) Expert Systems
Explanation : In classification
Output attribute should be (b) Rule Based Expert Systems
discreet.
(c) Decision Tree Based Systems
Q. 5.31 Which statement is true about
Prediction problems? (d) All of the mentioned v Ans. : (d)
(a) The output attribute must Explanation : Above all) methods can be used to
be categorical,
(b) The output attribute must be numer implement autonomous Q & A system.
ic.
(c) The resultant model is designed Q. 5.36
to dete rmine future Where does the Hidden Markov Model ts used?
outcomes.
(a) Speech recognition
(d) The resultant model is designed to ¢ lassify current (b) Understanding of real world
behaviour, v Ans. : (c) (c) Both Speech recognition & Understanding of real
Explanation : In prediction problems based on historical
world
data model is trained to predict the output of future.
(d) None of the mentioned Y Ans. : (a)
Q. 5.32 Unlike traditional production rules, association rules Explanation : In speech recognition hidden signals are
present for this HMM is used,
(a) allow the same variable to be an input attribute in
one rule and an output attribute in another rule. Q. 5.37 Where does the baye’s rule can be used?

(b) allow more than one input attribute in a single rule (a) Solving queries
(b) Increasing complexity
(c) require input attributes to take on numeric values.
(d) require each (c) Decreasing complexity
rule to have exactly one categorical
(d) Answering probabilistic query Module
output attribute. : (a)
Y Ans. Y Ans. : (d)
Explanation ; As per working of association rules like Explanation : Based on probability concept queries are 5
apriori algorithm. answered in baye's system.

Q. 5.33 Given desired class C and population P, lift is defined as Q. 5.38 What does the bayesian network provides?
(a) Complete description of the domain
(a) the probability of class C given population P divided (b) Partial description of the domain
by the probability of C given a sample taken from (c) Complete description of the problem
the population. (d) None of the mentioned Y Ans. : (a)
(b) the probability of population P given a sample taken Explanation : Bayesian belief network provides the
from P. complete scenario of a particular event.

Tech-Neo Publications... SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
. ification an
Machine
i Learningi -
(MU-Sem -Com
6-Comp) Learni ing with Class! ast memories the data and finally
imeit i
i trai«soning t
Q. 5.39 What are the main components of the expert systems? during lesUng
distance during
(a) Inference Engine computes the Fing
th e fo ll ow in g 2 statements
(b) Knowledge Base
been gi ven se of k-NN?
(c) Inference Engine & Knowledge Base Q.5.45 You have are true in ca
tions j/
oul W hiIc.ch of thesA e op
(d) None of the mentioned Y Ans. : (c) of very large value of k, we may include
J]. In case
Explanation : Knowledge base is used to store the ints from ot
her classes into the neighborhood. ;
knowledge and inference engine is used to provide the points Il value of k, the algorithm 1,
In case of too sma
inference.

Nr
very sensitive (0 not
Q.5.40 Three components of Bayes decision rule are class prior.
d 2 is False
likelihood and (a) | is True an
d 2 is Truc
(a) Evidence (b) Instance (b) | is False an
(c) Confidence (d) Salience Y Ans.
: (a) -) Both are Truc
(c) 0 Y Ans. :(c)
Explanation : Based on new evidences rules are (d) Both are False
Explanation : Both the options afe tric and ie sell
updated.
Q. 5.41 In Bays theorem. unconditional probability is called as
explanatory. |
the statement is True/False > k-NN
(a) Evidence (b) Likelihood Q.5.46 State whether
(c) Prior (d) Posterior Y Ans. : (a) algorithm does more computation on test time rather than
Explanation : Basic definition train time.
Q. 5.42 In Bays theorem, class conditional probability is called Y Ans. : (a)
(a) True (b) False
as
: The training phase of the algorithm
Explanation
(a) Evidence (b) Likelihood feature vectors and class
only of storing the
consists
(c) Prior (d) Posterior Y Ans. : (b) In the t esting phase, a test
labels of the training samples.
the label which are most
Explanation : Basic definition point is classified by assigning
samples nearest to that
frequent among the k training
Q. 5.43 One person is tossing a coin inside a closed room and he utation.
query point — hence higher comp
tells only the output (Head or Tail) to the person who is x and
Q. 5.47 Suppose. you have given the following data where
standing outside the room .The type of coin which he is ent
y are the 2 input variables and Class is the depend
using whether fair coin or biased coin is not known. variable
Suppose you want to find out the type of coin which class
Uly

algorithm will help you?


(a) Hidden Markov Model 0 +

0
(b) Discrete Markov Model
1
(c) Prediction Model I +

(d) Classification Model v Ans. : (a) I +

Explanation : HMM, Since it has the ability to predict 2


2 +
the underlying model for which the output is known.
Suppose, you want to predict the class of new data point
Q. 5.44 K-Nearest Neighbor is a algorithm x=I and y=! using Eucledian distance in 3-NN. In which
(a) Non-parametric, eager (b) Parametric, eager
class this data point belong to?
(a) + Class (b) ~ Class
(c) Non-parametric, lazy (d) Parametric, lazy
(c) Can't Say (d) None of these
Y Ans. (ce)
Explanation : KNN is non-parametric because it does
~ Ans. : (a)
Explanation : All three nearest point are of + class so
not make any assumption regarding the underlying data
this point will be classified as + class
distribution, It is a lazy learning technique because

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) B) Tech-Neo Publications...A SACHIN SHAH
Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Whar
'S the joing Po
Ndivonay hability eleatyiby
classes are independent of each other
ove
PrObab UO 1 fortes cf
. ties?
») All the features of a class are independe Not each POPS DNS A
( NS '
other most prob PPPS AD,S wD
¢ mos able feature for
a class is the mo (ce) ND PD yep NY$a. 49; ‘ \
-) Th tant feature st 2 (SD YS 49
to be considered for WW)! PD) , PP ,6 {Dy
impo cla ssi fic ation PS DRS
1 the features of a class are
0 Dh dy
conditionally
(dé A
di ic pendent on each other. Y Ans, : (b) Explanation LF Y Ans. 4d)
‘on : Naive Bayes TMy the » figure,
Assumption js that dD w fea See that
ane 2 8TE5 NOL denende 1) and
pxplanatl of aclass are ind
ependent of eac
al] the
any
“Pendent on ANY
Variable as
they dow"
fealu! res *
h other which in
Homing direct
“ have
ase in real life. Bec 1s
from D i»
ed edg es S$) has an Weoming
ause of this assump i hence
tion, the
nol the ca § 1 depends on edye
Bayes Classis fier, edges from D,. S, has 3
ificr is called Naive
rooming
D, an dD.
class’ hence §
der the following datase Sh Wan incoming 2 depends on D, and
b,
le t. a, b, ¢ are the te ¢ dge from Dy
Const Sy depends on bd,
(5 49 and Vis
Kis the clasz1
s(1/,0): Alures Hence, dis the answer,
5 Q.5:5.51
What isi the Markoy bla
; ' nket of Vanable, S,
(a) D, ) D,
;
0 (c) D, and dD, () None ¥ Ans. stb
)
Explanation + Jp 4 Bay
esian Network, the Mark
blanket of Node, X is ov
the set consisting of N"s
parents,
1] = S children and
0 Parents of X's chi
ldren. hy the given
0] 0 diagram, Nanable, S, has
a Parent Dy and no chi
. Hence, the Correct answer ldren.
Classify the testst ins
instance given belo
tance given w into class 1/0 usi is (b),
be ng
a Nai ve Bayes Classifier. MS | Q.5.52 Suppose you are Usi. ng
‘ a Linear SVM classifie
Class classi r with 2
aj bic Ss fication Proble
blem. m. Con
Consid
sider
e the following dat
in which the points a
o;o};1]? circled red Fepresent
support vectors
0 (b) Will the decision bou
1 ndary change if any of the
(a Y Ans. : (b) Points are removed? red
Explanation:
P(K=Ila=0,b=0,c=1)=3
/6* 1/3*2/3*2/3 = 0.7407
344
P(K=0la=0,b=0,c=1)= 3/6* 1/3* 1/3*2/3 = 0.03703 © i
P(K=Ila=0, eO'c= =1la=0,.b=0,c=1)
b=0,c=1)> P(K=
50 A patient goes to a doctor with sy
0.5. The doctor sus mptoms S,. S and S,.
pects disease D, and D,
2p + @
and Syenies a -
J:
Bayesian network for the relation amon ;
g the disease and 1 Module
symptoms as the following : ©

D,
i i i i + 5
1 2 3 4 5
x
Fig. Q. 5.52
(a) Yes (b) No Y Ans,
: (a)
Explanation : These three examples are pos
8, 22 itioned such
S3 that removing any one of them int
roduces stack in the
ig. Q. 5.50 constraints. So the decision bou
Fig. Q.5 ndary would completely
change.

MUNew Syllabus w.ef academic. year 18-1


9) (M6-14) Hal techn-Neo Publiceat ations...A SACHIN SHAH Venture
ies ve

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

ransformed feature
poundary in the
(b) the cis
decision
linear sae 4 . :
Q, 5.53 Consider the data-points in the figure below, feature space is
Xo Hy
i dects«ion boundary in the original
(c) the
linear ve inl he original Y feature
space is
bou nda ry Ans. : (b), (d)
(d) the deci sion
non -linear As per working of RBF.
;
Explanation :
5 tatements is/are true about
ic h of th e following
Q. 5.57 Wh
?
kernel in SVM ional data to high
el fu nc ti on map low dimens
1, Kern
ace
dimensional sp
rity function
2. It’sa simila se but 2 is True
bu t2 is Fa lse (b) 1 is Fal
(a) 1 is True (d) Both are False
(c) Both are Truc v Ans.
: (c)
M
per working of SV
Fig. Q. 5.53 Explanation : As
?
Let us assume that the black-colored circles represent What is tr uc about
K-Mean Clustering
Q. 5.58 e to cluster center
ely s¢ nsitiv
positive class whereas the white colored 1, K-means is extrem
circles represent negative class. Which of the following initializations gence
lead to Poor conver
among H,, H, and H, is the 2. Bad initialization can
maximum-margin hyperplane? speed ring
Jead to bad overall cluste
(a) HH, = (b) Hy 3, Bad initialization can
(a) | and 3 (b) | and2
(c) H, = (d) None of the above. Y Ans. : (c) Y Ans. : (c)
(c) 2 and 3 (d) 1,2 and 3
Explanation : H, docs not separate the classes. en statements are true.
Explanatio nz All three of the giv er
H, does, but only with a small margin.
K-means_ is extremely sensitive lo cluster cent
a lization can lead to Poor
H, separates them with the maximal margin.
initialization. Also, bad initi
as bad overall clustering.
convergence speed as well
Q. 5.54 The soft’ margin SVM is more preferred than the a horizontal line on y
Q. 5.59 If in the following fi gure we draw
hard-margin svm when : clusters we will get?
axis for y = 2 how many number of
(a) The data is linearly separable
(b) The data is noisy and contains overlapping point
Y Ans. : (b) 2.54

Explanation : When the data has noise and overlapping


2.0
points, there is a problem in drawing a clear hyperplane
without misclassifying.
1.54
Q. 5.55 After training an SVM, we can discard all examples 1.04
which are not support vectors and can still classify new
examples? 0.54
(a) True (b) False Vv Ans. : (a)
D F E—E C A B
Explanation : Since the support vectors arc only

responsible for the change in decision boundary.


Fig.
Q. 5.59
(a) | (b) 2 (c) 3. (Wd) 4 % Ans.:(b)
Q. 5.56 Suppose that we use a RBF kernel with appropriate
Explanation : Since the number of vertical lines
parameters to perform classification on a particular two
intersecting the red horizontal line at y = 2 in the
class data set where the data is not linearly separable. In
this scenario dendrogram are 2, therefore, two clusters will be formed.

(a) the decision boundary in the transformed feature


space is non-linear

_ (MU-New Syllabus wef academic year 18-19) (M6-14)


Bl Tech-Neo Publications..A SACHIN SHAH Venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Leaming (M
U-Sem 6-Com
_Learing Clusterin .. Page
and Custering
Classification and
itn Classification ne (ESD5
want 1 Com)
Page no
Assume, you en Learning with
ing. Which
If two variables V, and V; are used for cluster
Q.5.60 i hi
clos - oO clu .
7 observations into 3 | @- 5-61
usters using K-\Mea Ns clustStet . of the following are true for
K means clustering with
iterationi the Clusters - C,.cer ae After first
esva
* Xx Cy has the following k=3?
lation of 1, the cluster
1c: Ht oe |. If V, and V, 2 has a corre
2 c. a 110.4 84.7.7)
ght line
mM centroids will be in a strai
cluster
3. {5.5 piac
€ e 2. If V, and V; has a correlation of 0, the
Wat Sati ).(9.9))
3
aight line
t centroids will be in str
nen on?
for second econd iterati Muster centroids if youy want to proceed
Choose the correct ans cr!
(a) C)544.0,:0
22.2
2- 2, ).¢ (7,(77)
7
bc: 12 (ay 1 Only
(2.2), Cc, > (0.0), Cy: (5.5)
= (by 2 Only
ccoc:116.6). C, 2 (4.4).0, 549.9)
(c) Both 1 and 2
(d) None of these : (a)
v Ans.
Vv! Ans. :s (a) (d) None of the above
Explanati ion : Finding
F V,
Explanation : If the correlation between the variables
centroid for data points in cluster
>
C,= (Stes
Findi
pes seey =i&.4}
and V,15 1. then all the data poin

ts will be in a
. .
stra
centroids will form
ight
a
Ing Centre . line. Hence. all the three cluster
nd for data points in cluster .
C= (0+ 4) (440) ) = (2,2) straight line as well.
7 a)

Find Ing centroidi for data points in cluster


C.= +915 +9)
3 2°: 2 ) =0.7)

Hence, C, > (4,4), Cy: (2.2),


C, : (7.7)
Chapter Ends...

Q00

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Chapter...
ensionality Reduction ©
2 University Prescribed Syllabus .

Dimensionality Reduction F echniques, Principal Component Analysis, Independent


Component Analysis, Single value
decomposition.

Dimensionality Reduction Techniques..........ssssesssssssesssserssnessseesvessssscesssssassesssssesesssnvesaessaessasesusessessvees

|
6.1.1 Dimension Reduction Techniques in ML..u........ccccsccsessscesssseesseescecssssesssesseessssescoceseneeseres

6.1.1(A) Feature Selection... ceesescsesssssesssesrsessssenessensessssseessnesesessenesnenssesesseseseessseeetes


6.1.1(B) Feature Extraction ........cccccsccsessecessssensseesssessssanssnsssecssanessennssaes

6.2 Principle Component Analysis

6.3 Independent Component AMAlySis ..ssessesssesssssesssseassennsssesseesesnseeteeerinsssterimanseeetiasenatinaseanttnnsceecituveneeeesennnneseeenett


nenantsuanesansanstaneenes ensnett
6.3.1 Preprocessing fOr ICA .sssssssssvsssssesssseeseesnsseetnsseesnsseniseersnertenseetssneaneanieenneest

6.3.2 The Fast ICA Algorithm ..ssetinienennneinenninnnninnistniitnnnsinnnninnnnianrninnnesetie


eeet
6.4 Single Value Decomposition ......ssssssssseeessenssseeersna
sseestnntsresseeescneerenanennaq ennannnaeater eesneeaa
neaseennee

eimnertia
inecnesante ienini tt
CES
6.5 University Questions and AnSWeIS ...vssssuter
sssscrssreeee erseaee
Multiple Choice QuestiOns....sss

nsnenees anaceransosneess
Chapter EmdS.ressesssseennrsnseesssnsensseen

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Y=
no (6-2)
Dimensionality Reduction ...Page
Machine Learning (MU-Sem 6-Comp) =)

‘6.1 DIMENSIONALITY REDUCTION TECHNIQUES —$—_


us
. . . i in arran: ement of information having tremendo
Dimension Reduction alludes to the way toward changing over a 8 These
: ; i : . . : . comparable data briefly,
i
guaranteeing thal t it passes on
dimensions into information with lesser dimensions
methods are normally utilized while tackling machine learning issues to acquire bette
r properties for classification <
regression problem.

We should take a gander at the picture demonstrated as


follows. It indicates 2 dimensions P, and P,, which are
given us a chance to state estimations of a few objects in
cm (P,) and inches (P,). Presently, if we somehow P2
(inches)
managed to utilize both the measurements in machine tt
rT
learning, they will pass on comparable data and present
a lot of noise in the system, so it is better to simply
utilizing one dimension. Dimension of information
present ishere
changed over from (2D
(from P, and.P,) to LD (Q,), to make the information
moderately less demanding to clarify.
There are number of methods using which, we can get k
dimensions by reducing n dimensions (k < n) of
informational index. These k dimensions can be
specifically distinguished (sifted) or can be a blend of
dimensions (weighted midpoints of dimensions) or new
dimension(s) that speak to existing numerous
dimensions well. A standout amongst
the most well-
known use of this procedure is Image preparing.
Now we will see
the significance of applying
Dimension Reduction method :
o It assists in information packing and diminishing
the storage room required
o It decreases the time which is required for doing 22 20 poi
[omen pomnnny
15 - ° o a
same calculations. If the dimensions are less then 9 08 09 0% ooo °
to L 8 |
processing will be less, added advantage of having
ccaes fize at eee
less dimensions is permission to use calculations
i Ot ee
° Bq % 0 ° % 0, @6°%0 |

unfit for countless. -5 r o°gpor 2 Po % eo 0%," 9 20 4


~10 = OA & geo
o It handles | 0% 3040 0% % 88
8%a ]|
with multi-collinearity that is used to 15 > 2 6 ° °
enhance the execution of the model, -20 | °°
It evacuates
excess highlights. For example: it 25 |
makes no sense -80 ~60 —40 20020406080
to put away an incentive in two distinct units 24
(meters and inches).
Fig. 6.1.2 ; Example of Dimension Reduction

(MU-New Syllabus w.e.f academic year


18-19) (M6-14)
Tech-Neo Publications...A SACHIN SHAH venture

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
tt

Dimensionality Re
duction
--Page no (6-3)

that point spok


It is helpful in noise eva
cuation additionally and
bi eCa
of this we can enhance the use
xecution of model
els,

performance
Classifier
o The classifier’s
performance Us
ually will degrad
e fi ora
larg
e number of features,
. ett tt Number of variables
oo.

jp
Fig. 6.1.3 : Classifier performance
@ and amount of data
6.1.1 Dimension Reduction Techniqu
es in ML
wm 6.1.1(A) Feature Selection

Given a set of features. F=


(X00. . X,}. The feature sele
ction problem is to fin d a subset F’ c
On, learner’s ability to classify the patterns. F that maximizes the
Finally F" should maximize some scoring functio
n.
Xy
HO Ready X, X;
Feature selection

Xn X;

Feature Selection Steps

a Feature selection is an optimization problem.

’ Step 1 : Search the space of possible feature subsets.


function.
:1 > Step 2 : Pick the subset that is optimal or near-optimal with respect to some objective

| Subset selection

- There are d initial features and 2". possible . subsets.


4

classifiers.
oe based“ *, a ne ho the lowest probability of_error of all such
“‘ tees
o Classifier 0!
so we need some heuristics.
~ — Itis not possible to go over all 2" possibilities,

i ’ — Here we select uncorrelated features.

i . ~ Forward search
a a © Start from empty set of features.

. Module
© Tryeach obremay™e aa for adding specific feature.
ry venes in validation error. 6)
Pop o Estimate classification/regression a
f ' © Select feature that gives maximum ” ‘eI Tech-Neo Publications..A SACHIN SHAH Venture
-19) (M6-14)
year 18
(MU-New Syllabus w.e.f academic

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Dimensionality Reduction ...Pa ge no (6-4)


Machine Learning (MU-Sem 6-Comp)
o . Stop when no significant improvement.
- Backward search
’ o Start with original set of size d.
o Drop features with smallest impact on error.

TS 6.1.1(B). Feature Extraction


is to map F to some feature set
Feature Ex' traction task
. Suppose a set of features F = (Xj, sssscsssseesesey Xy} is given. The
F” that will maximize the leamer’s ability to classify patterns.
oy v1 a
XQ Y2 X2
Feature extraction =f

Xn = , Yn = Xn

to M-dimensional vectors to achieve low error.


=! A projection matrix w is computed from N-dimensional
e extraction methods.
Independent component analysis are the featur
Z = w'x. Principle component analysis and

a _ Dt 6.2 PRINCIPLE COMPONENT ANALYSIS


It is
stage when dealing with high dimensional information.
=i". — Dimensional reduction is regularly the best beginning
ing
denoising, and in a wide range of uses, from signal process
utilized for an assortment of reasons, from perception to
‘ to bioinformatics.
n tools és Principal Component Analysis (PCA).
A standout amongst the most broadly utilized dimensional reductio
dispersed, furthermore, chooses the subspace which
- PCA verifiably accept that the dataset under thought is typically
the sample covariance matrix, at that
rs expands the anticipated difference. We consider a centered data set, and develop
ar point q-dimensional PCA is identical to anticipating onto the q-dimensional subspace spread over by the q eigenvectors
of S with biggest eigenvalues.
into another arrangement of variables, which are straight blend of unique
In this system, variables are changed
components. They are calculated so that first
variables. These new arrangement of variables are known as principle
which each
principle component S represents a large portion of the conceivable variety of unique information after
able variance.
succeeding component has the most noteworthy conceiv
| [4° principal com ponent
I. ;
— The second principal component should be symmetrical to the
-t jt 4 wt. | Lee
primary principal component. As it were, it does its best to
by the | || _ ; 3 eI “op onthe Cae
catch the difference in the information that isn't caught
|
7 - -| 0
biti Lt L|
{i a primary principal For two-dimensional dataset
component. |
en |. L
[nes there can be just two principal components. The following isa
depiction of the information and its first and second principal

component. You can sec that second principal component is | - “ee fof
aeAt ddPpp yr
symmetrical to first principal component. Fig. 6.2.1
y
wach’ NES pakicadcie A SCAT venture
(MU-New Syllabus w.e.f academic year 18-19) (M6-14) 7 - ublications....
i

Scanned with CamScanner


ee
To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

: Machine Learning (MU-Sem 6. Comp)

=. The he p principal
ine components are cones:
P NE SENSitive 16 the size Dimensionatity Reduction --Page no (6-5
institutionalize factors before pplying Pca “122 of estimation NOW {0 seatle this 4} issue we )
ili
event that interpretability
i PP ying PCA to your , informa ionScaleal colfee
of the Oulcomes is : Applyi We ought to dependably
Mal collection fosey
AS et
SS crj i
importance, In the
isn't the correct BY
Stem for your task,

11
1.6109
MW Solution:
First we will Sind the mean valucs
X
X,, = Ly tte eeaeeenesenees
” N = number of data
points = 10)
X, = 1.81
Y
Yn _ x N

Y, m = 1.91

Now we will find the covariance matrix, C = [eo a


[]*

yx Cyy
(X= X,)"
Cyx = LS = 0.6165
mation, {hy (X-X,) (¥-Y,) = 9.61544
Cyy = Cyx=“No
anal priest
(Y¥- Yq) 0.7165

|
Cy = y=
is (PCA) eas 0.61544)
= 10.61544 0.7165
ip #8
is used.
mat i Now to find the eigen values following equation

q sige ; IC-Al =0

[ois 0.61544] -a{ 1 0 || 6


0.61544 0.7165 Oo 1
solv ing that equation we will get two eigen values
t we wil l get quadratic equation and
By solving the above determ inan
of Aas 2,= 0.0489 and 2, = 1.284.
values.
corr esponding to eigen 2, M=0.0489.
Now we will find the eigen vectors © eig' n val u
ini g to fir st ige
First we wiil find the first eigen
vector correspond Vv
[¥] - 0.0489 [¥"]
0.6165 0.61544]
0.7165 J LY 12
CxVi = 1, xVi= [961544
jons as
. iI get two equation
From the above equation We will £ Module
89V
0.6165 V,, + 0.61544 Viz 00m
ul

0.0489V
0.6154 4 V,, + 0.7165 Vie
544Vi fal Tech-Neo Publications..A SACHIN SHAH Venture
W

r
w.e.f academic yea
(MU-New Syllabus
a4 lain Spee ee
neni renee

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Learning (MU-Sem 6-Comp) Dimensionality Reduction ...Page ng (6-6) :

- To find the eigen vector we can take either Equation (1) or (2), for both the equations answer will be the same, let’s take
the first equation ‘

= 0.6165 V,, + 0.61544 V,, = 0.0489 V,,

0.5676 V,, + 0.61544 V,,


Vip = - 0.61544 V,,
Vip = - 1.0842 V,,
— Now we will assume V,,= 1 then V,, =— 1.0842
— 1.0842
vie lM ]
- As we are assuming the value of V,, we have to normalized V, as follow
ae 1 [- 1.0842) _/- 0735]
IN Niosa+rb 1 tL 0.677
| - Similarly we will find the second eigen vector corresponding to second eigen value, 1, = 1.284
\ .
i _ _[ 0.6165 0.615441 [ye] - [|
: Cx = ox Va=[ esas 0.7165 | Lv,,) = 184Lv,.
i — From the above equation we will get two equations as

' 0.6165 V>, +0.61544V,, = 0.0489 V>,


i 0.61544 V2, +0.7165Vz, = 0.0489 Vy.
— To find the eigen vector we can take either Equation (1) or (2), for both the equations answer will be the same, let’s take
| iP" the first equation
‘ = 0.6165 V,, + 0.61544 V,, = 1.284 V2, — 0.6675 V3, =— 0.61544V,

Vy, = 0.922 Vip


\ — Now we will assume V,)= | then V2, = 0.922
\ [°%”"|
} V2 = 1 J.

‘ee — As we are assuming the value of V.. we have to normalized V, as follows :


| 1 [97] _poary)
Le Von = Mog2e+rt 1 J ~ 10.735

Now we have to find the principle component, it is equal to the eigen vector corresponding to maximum Eigen value, in
this 4, is maximum, hence principle component is Eigen vector Vy.
i 0.677
ore Principle component = | 9735

|. }jp6.3" INDEPENDENT COMPONENT ANALYSIS.


een . a . : ik Pave e
p ; — Independent component analysis breaks a multivariate signal in to autonomous non-Gaussian signals. For instance, ¥
f : . . $ . 4 a te $
take a example of sound which is a typical a signal consist of the numerical expansion. in sound at each time, signal
from some sources are included.

(MU-New Syllabus w.ef academic year 18-19) (M6-14)

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Machine Leaming (MU-Sem 6-Comp)


— = =P)
— The inquiry at that point is whether We can
ino oe Dimensionality Reduction --Pa
al c these Contri
‘al ge no (6-7)
At the point when the factual ibu
buti
ting sources from
freedom Presur Mpti the w atched @gercg
results. This is addition ally utilized for ;
on is right, blin
d Ic A partition
of a blended si gnal
ate signal or not.
Sign als ¢ hat leads in great
£ al shouldn't he cre ate
1 d by t lendir
- Astraightforward use of ICA is the g for Investig, ati
on pur pos
“Get-togett Mer part i.
es,
example information
.
mat : Consisting of j Party issue + In which the basic disc
ndividu als talking ourse Signals are Sep
at th arated from
and the dela y in time, the issue is im me time in a roo m. When We do not
proved “es consider the echoes
- A separated and def
erred Signal repres
ee oe ents ad uplicate 6 i
supposition isn't violated, i. Part. and accordingly the meas
F8 reliang urable autonomy

ignals. This comprises the situation


. = D, here D indicate s the dimension of the
7in ut
asurementand
instances of underdetermined fi.c.} >Dya i i :
nd overdetermined (i.e, J < D)Oy have
en nen easar ement i J. Different
been examined.
— Independent Component wuuysis
separates blended si gnals that leads
to quality outcomes depending on two beliefs
three effects of blend ing source Signals. and ‘

- Two beliefs :
'
1. The source signal
signalss are autonomous of one another.
§‘:
2. Each source signal values have non-Gaussian conv eyances,
t
- Three effects of blending source signals :
a

1. Autonomy : Considering to belicf 1, ICA comprises of the autonomous source signals; their signal blends are most Jl
va

certainly not, This is based on the fact that the ugnal blends share a sumilar source signals. t;

2. Ordinariness : “Considering the Central Lamit Theorem, the dispersion of an aggregate of {tee integulat factors
having limited Muctuation approaches towards a Gaussian circulanion”, Freely, an aggregate of two autonomous f

irregular factors as a rule has 4 dissemination nearer to Gaussian than any of the two unique variables. At this point y

we think about the calculation of each signal as the random variable,


i
3. . Mulltifaceted
ultifaceted nature .: The transient complexity of any of the signal blend is more noteworthy than that of ite least
n

difficult constituent source signal.


ls we happen to extricate from
- tial foundation of ICA. On the off chance that the signa
Those standards add to t he essentia and have non-Gausaia n histograms or have how
like sources signals,
an arrangement of blends are a ulonomous
ignals,als, ali thatal p< pon they
should be source signals.
comple xit y like sour ce sign
ents We may pack
factual autonomy of the evaluated compon
1s by amplifying the
for autonomy, and this decision overes the type of the
* < oo

si an inte rmed iary


ICA finds the indepeny a es toee characten-
"
=. one of numerous ap pro
ach gs of autonomy for ICA are :
¢ on. The two general meanin
t ana lys ts alcu lati
Independent componen 2 Non-Gaussianity marimi
zation
zation
1. Mutual informatio n minimi : doing the calculations that utilizes mea
o fic A by
sures euch ae
groupsSm
yal data aded by as
Minimiza tion
jev ed for Mut
is achiev naianity group of ICA calculations, peru
le iv
Dive
i er
jbl er
rg pe
e nc e.
Hback-1e tersis.
maximum
i entropy PY or Ku ses measures sucl pas negentropy ad hus
:
far as possible hypothes is, U! Tech-Neo Publications..A SACHIN SHAH Venture
. = -14)
118-19) (M6
w Sy ll ab us w. e- f academic ye a
(MU-Ne

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Dimensionality Reduction ...Page no (6-8)


— ——=—
Machine Learning (MU-Sem 6-Comp) ———
dim ens ion red uction as
is utilize cen tering, whitening, and
pendent compone nt analys
OrdinaryoO caiculations for . Inde comple xity Y of the issue for the real iteratj
ventures with the end goal to strea' mlin e and diminish the Crative
preproces sing
calculation. i
‘ : main: part examina inati i
tion or solitary esteem
ne sintegration,
disi
and whiteni ng can be accompl ished with
Dimension reduction eh imi i ec: ion i
ements are dealt with similarl y from the earlier before ¢he calculation is run,
Whitening guarantees that complet e measur
component : analysis , JADE,
ICA, infomax, kernel-independent
Surely understood calculations for ICA incorporate Fast
tingly right requesting of the source
and among others. ICA can't recognize the real number of source signals,an interes
signals.
_ signals, nor the correct scaling (counting sign) of the source
to reah time applications,
t comp onen t analy sis is vital to dazzl e signal detachment and has numerous down
Independen
e of) the scan for a factorial code of the information,
_ It is firmly identified with (or even an exceptional instanc the
to such an extent that it gets interestingly encoded by
i.e., another vector-esteemed portrayal of every datum vector
the code segments are factually autonomous.
subsequent code vector (misfortune free coding), yet
“latent factors” demonstrate. Expect that we watch n direct
To thoroughly characterize ICA, we can utilize a measurable
blends x,... x, of n independent segments
Xj = jy Spt Ajy Sy teeeeeeeeee aj,8y» for allj
the time record't, we expect that every blend x,
‘In Independent component analysis demonstrate, we have now dropped
te time signal. The observed values x((#), e.g., the
and also every free part s, is a random variable, rather than a legitima
of this arbitrary variable. -Without loss of
: mouthpiece motions in the get-together party issue, are then an example
s have zero mean: If this isn't
sweeping statement, we can expect that both the blend variables and the free segment
mean, which makes the
valid, at that point the noticeable variables x, can simply be focused by subtracting the example
model zero-mean.
by
It is helpful to utilize vector-matrix documentation rather than the aggregates like in the past condition. Lets indicate
manner by s the random vector with
x the arbitrary vector whose components are the blends x,... x,, and in like
components 5, ...", 5, Lets represent mean by A the matrix with components a;. All vectors are comprehended as
documentation,
column vectors; subsequently x", or the transpose of x, is a column vector. Utilizing this vector-matrix
the above blending model is composed as x= As
the matrix A columns; :
Some of the time we require
q' ! mns; represent columns by a, and the model is composed as

The factual model x = As is known as independent component analysis. The ICA is a generative model, which implies
that it depicts how the observed information are created by a procedure of blending the part The independent
parts s;.
matrix is
components are inactive variables, implying that they can't be specifically observed. Additionally the mixing
thought to be obscure. All we observe is the arbitr. ‘ary vector x, and we should
est it. This
must bé doné'under as general suppositions as could Teasonably be expected. inal
e ae
The beginning stage for ICA is the plain straightforward Suspicion that the parts s, are factually aut ous. It will be
e factually autonom'
seen undemeath that we s hould likewise expect that the autonomo “Gaussian
appropriations. us component must have non

(MU-New Syllabus w.e.f academic year 18-19) (M6-14)


lal Tech-Neo Publications...A SACHIN SHAH Venture

Scanned with CamScanner


Ls
To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419
Machine Learning (MU-Sem
——
6-Comp)
! —

- Nonetheless, in the basic ICA model we do not assume these distributions to be known (if they were known, the problem would be considerably simplified). For simplicity, we also assume that the unknown mixing matrix is square, although this assumption can sometimes be relaxed. Once the matrix A has been estimated, we can compute its inverse, say W, and obtain the independent components simply as

s = Wx

- ICA is very closely related to the technique called blind source separation (BSS) or blind signal separation. A "source" means here an original signal, i.e., an independent component, like the speaker in the cocktail-party problem. "Blind" means that we know very little, if anything, about the mixing matrix, and make only weak assumptions about the source signals. ICA is one method, perhaps the most widely used, for performing blind source separation.
- In numerous applications it would be more realistic to assume that there is some noise in the measurements, which would mean including a noise term in the model. For simplicity, we omit any noise terms here, since the estimation of the noise-free model is difficult enough in itself and appears to be adequate for many applications.
6.3.1 Preprocessing for ICA lications,

Preprocessing is very helpful before actually applying an ICA algorithm to the data. In this section we look at some preprocessing techniques that make the ICA estimation simpler and better conditioned.
Centering
- The most basic and necessary preprocessing step is to center x. Centering is done by subtracting the mean vector m = E{x} from x, so that x becomes a zero-mean variable. This implies that s is zero-mean as well.
- This preprocessing is done solely to simplify the ICA algorithms; it does not mean that the mean could not be estimated. After estimating the mixing matrix A from the centered data, we can complete the estimation by adding the mean vector of s back to the centered estimates of s. The mean vector of s is given by A^(-1) m, where m is the mean that was subtracted in the preprocessing.
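A minimal NumPy sketch of this centering step (assuming the data matrix X holds one observation per row; the function name is illustrative):

import numpy as np

def center(X):
    """Subtract the sample mean from each column of X.

    X has shape (n_samples, n_features), one observation per row.
    The mean vector m is returned as well, so the mean of s can later be
    restored as A^(-1) m once the mixing matrix A has been estimated.
    """
    m = X.mean(axis=0)
    return X - m, m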
Whitening
- Another important preprocessing step in ICA is to whiten the observed variables. This means that before applying the ICA algorithm (and after centering), we transform the observed vector x linearly so that we obtain a new vector x̃ which is white, i.e., its components are uncorrelated and their variances equal unity. In other words, the covariance matrix of x̃ equals the identity matrix :

E{x̃ x̃^T} = I

- The whitening transformation is always possible. One popular technique of whitening uses the eigenvalue decomposition of the covariance matrix, E{x x^T} = E D E^T, where E is the orthogonal matrix of eigenvectors of E{x x^T} and D is the diagonal matrix of its eigenvalues, D = diag(d_1, ..., d_n). Note that E{x x^T} can be estimated in the standard way from the available sample x(1), ..., x(T). Whitening can now be done by

x̃ = E D^(-1/2) E^T x

where the matrix D^(-1/2) is computed by a simple component-wise operation as D^(-1/2) = diag(d_1^(-1/2), ..., d_n^(-1/2)). It is easy to check that now E{x̃ x̃^T} = I.
- Whitening transforms the mixing matrix into a new one, Ã. We have

x̃ = E D^(-1/2) E^T A s = Ã s

- The utility of whitening resides in the fact that the new mixing matrix Ã is orthogonal. This can be seen from

E{x̃ x̃^T} = Ã E{s s^T} Ã^T = Ã Ã^T = I

- Whitening reduces the number of parameters to be estimated. Instead of having to estimate the n^2 parameters that are the elements of the original matrix A, we only need to estimate the new, orthogonal mixing matrix Ã. An orthogonal matrix has n(n-1)/2 degrees of freedom; for example, in two dimensions an orthogonal transformation is determined by a single angle parameter. In larger dimensions, an orthogonal matrix contains only about half the number of parameters of an arbitrary matrix. Hence we can say that whitening solves half of the problem of ICA. Because whitening is an extremely simple and standard procedure, much simpler than any ICA algorithm, it is a good idea to reduce the complexity of the problem in this way.
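The orthogonality claim above can be checked numerically. The following sketch (illustrative names only) builds a random mixing, whitens the observations with E D^(-1/2) E^T estimated from the sample covariance, and verifies that the transformed mixing matrix Ã is approximately orthogonal; unit-variance sources are assumed, as in the model.

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))                              # an arbitrary square mixing matrix
S = rng.laplace(scale=1 / np.sqrt(2), size=(20000, 3))   # unit-variance, non-Gaussian sources
X = S @ A.T                                              # observations x = As, one per row

cov = np.cov(X, rowvar=False)                            # sample estimate of E{x x^T}
d, E = np.linalg.eigh(cov)
V = E @ np.diag(d ** -0.5) @ E.T                         # whitening matrix E D^(-1/2) E^T
A_tilde = V @ A                                          # the transformed mixing matrix

print(np.allclose(A_tilde @ A_tilde.T, np.eye(3), atol=0.1))   # True, up to sampling error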
- It is often useful to reduce the dimension of the data at the same time as we do the whitening. We then look at the eigenvalues d_j of E{x x^T} and discard those that are too small, as is routinely done in the statistical technique of principal component analysis. This frequently has the effect of reducing noise. Moreover, dimension reduction prevents overlearning, which can sometimes be observed in ICA.

Fig. 6.3.1 : The joint distribution of the whitened mixtures

- In the remainder of this discussion we assume that the data has been preprocessed by centering and whitening. For simplicity of notation, we denote the preprocessed data simply by x and the transformed mixing matrix by A, discarding the tildes.
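A compact NumPy sketch of whitening by eigenvalue decomposition, under the assumption that X has already been centered (one observation per row); the function and variable names are illustrative:

import numpy as np

def whiten(X):
    """Whiten centered data X of shape (n_samples, n_features) via the EVD of its covariance.

    Returns the whitened data Xw, whose covariance is (approximately) the identity,
    together with the whitening matrix E D^(-1/2) E^T.
    """
    cov = np.cov(X, rowvar=False)            # estimate of E{x x^T}
    d, E = np.linalg.eigh(cov)               # eigenvalues d and orthogonal eigenvectors E
    # component-wise d_i^(-1/2); dropping very small d_i here would also
    # implement the dimension reduction described above
    W = E @ np.diag(1.0 / np.sqrt(d)) @ E.T
    return X @ W.T, W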

6.3.2 The Fast ICA Algorithm

In the earlier sections we introduced different measures of non-Gaussianity, i.e., objective functions for ICA estimation. In practice one also needs an algorithm for maximizing such a contrast function. In this section we introduce a very efficient method of maximization. Here it is assumed that the data has been preprocessed by centering and whitening as discussed in the previous section.
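For intuition, a commonly used approximation of the negentropy objective for a zero-mean, unit-variance projection y = w^T x is J(y) proportional to [E{G(y)} - E{G(v)}]^2, where v is a standard Gaussian variable and G is a non-quadratic function such as G(u) = log cosh(u). A small sketch of this estimate follows (illustrative names; the proportionality constant is ignored):

import numpy as np

def negentropy_approx(y, n_ref=100000, seed=0):
    """Approximate negentropy of y via (E[G(y)] - E[G(nu)])^2 with G(u) = log cosh(u)."""
    rng = np.random.default_rng(seed)
    nu = rng.normal(size=n_ref)          # reference sample from a standard Gaussian
    G = lambda u: np.log(np.cosh(u))
    return (G(y).mean() - G(nu).mean()) ** 2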

Fast ICA for one unit

- To begin with, we show the one-unit version of FastICA. By a "unit" we refer to a computational unit, eventually an artificial neuron, having a weight vector w that the neuron is able to update by a learning rule. The FastICA learning rule finds a direction, i.e., a unit vector w, such that the projection w^T x maximizes non-Gaussianity. Non-Gaussianity is here measured by the approximation of negentropy J(w^T x). Recall that the variance of w^T x must be constrained to unity; for whitened data this is equivalent to constraining the norm of w to be unity.
- FastICA is based on a fixed-point iteration scheme for finding a maximum of the non-Gaussianity of w^T x. It can also be derived as an approximative Newton iteration. Denote by g the derivative of the non-quadratic function G; commonly used choices are

g_1(u) = tanh(a_1 u),
g_2(u) = u exp(-u^2 / 2)

where 1 <= a_1 <= 2 is a suitable constant, often taken as a_1 = 1. The basic form of the FastICA algorithm is as follows :
1. Select an initial (e.g., random) weight vector w.
2. Calculate w+ = E{x g(w^T x)} - E{g'(w^T x)} w
3. Calculate w = w+ / ||w+||
4. If not converged, go back to Step 2.

- Note that convergence means that the old and new values of w point in the same direction, i.e., their dot product is (nearly) equal to 1. It is not necessary that the vector converges to a single point, since w and -w define the same direction. This is again because the independent components can be defined only up to a multiplicative sign. Note also that it is here assumed that the data is prewhitened.
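The fixed-point iteration above can be sketched directly in NumPy. The snippet below is a minimal illustration, assuming the data Xw has already been centered and whitened and using g(u) = tanh(u); the function name and arguments are illustrative, not a library API.

import numpy as np

def fastica_one_unit(Xw, max_iter=200, tol=1e-6, seed=0):
    """One-unit FastICA on whitened data Xw of shape (n_samples, n_features).

    Uses g(u) = tanh(u) and g'(u) = 1 - tanh(u)^2. Returns a unit vector w such
    that the projection w^T x is (locally) maximally non-Gaussian.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=Xw.shape[1])
    w /= np.linalg.norm(w)

    for _ in range(max_iter):
        wx = Xw @ w                                   # projections w^T x for all samples
        g = np.tanh(wx)
        g_prime = 1.0 - g ** 2
        w_new = (Xw * g[:, None]).mean(axis=0) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)
        if abs(np.dot(w_new, w)) > 1.0 - tol:         # old and new w point the same way
            return w_new
        w = w_new
    return w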
- The derivation of FastICA is as follows. First note that the maxima of the approximation of the negentropy of w^T x are obtained at certain optima of E{G(w^T x)}. According to the Kuhn-Tucker conditions, the optima of E{G(w^T x)} under the constraint E{(w^T x)^2} = ||w||^2 = 1 are obtained at points where

E{x g(w^T x)} - βw = 0

- Let us try to solve this equation by Newton's method. Denoting the left-hand side by F(w), its Jacobian matrix is

JF(w) = E{x x^T g'(w^T x)} - βI

- To make the inversion of this matrix easy, we approximate the first term. Since the data is sphered (whitened), a reasonable approximation is

E{x x^T g'(w^T x)} ≈ E{x x^T} E{g'(w^T x)} = E{g'(w^T x)} I

- Thus the Jacobian matrix becomes diagonal and can easily be inverted. We obtain the following approximative Newton iteration :

w+ = w - [E{x g(w^T x)} - βw] / [E{g'(w^T x)} - β]
6.4 SINGULAR VALUE DECOMPOSITION

- In the singular value decomposition method, a matrix is decomposed into a product of three other matrices :

A = U S V^T

- Here A represents an m x n matrix, U represents an m x n orthogonal matrix, S is an n x n diagonal matrix and V is an n x n orthogonal matrix.
- The matrix U has the left singular vectors as its columns, S is a diagonal matrix which contains the singular values, and V^T has the right singular vectors as its rows. In singular value decomposition the original data is expressed in a coordinate system in which the covariance matrix is diagonal.
- To calculate the singular value decomposition we need to find the eigenvalues and eigenvectors of AA^T and A^T A. The columns of V consist of the eigenvectors of A^T A, and the columns of U consist of the eigenvectors of AA^T.


- The square roots of the eigenvalues of AA^T or A^T A give the singular values in S.
- The singular values are arranged in descending order and stored as the diagonal entries of the S matrix. The singular values are always real numbers. If the matrix A is a real matrix, then U and V are also real.
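These properties are easy to verify numerically. The NumPy sketch below computes the SVD of a small 2 x 2 matrix and cross-checks the singular values against the eigenvalues of A^T A; the matrix and variable names are illustrative only.

import numpy as np

A = np.array([[2.0, 2.0],
              [-1.0, 1.0]])

# Full SVD: A = U @ diag(s) @ Vt, with s returned in descending order
U, s, Vt = np.linalg.svd(A)
print(s)                                        # real, non-negative, descending
print(np.allclose(A, U @ np.diag(s) @ Vt))      # True: the factorization reproduces A

# The same values are the square roots of the eigenvalues of A^T A
eigvals = np.linalg.eigvalsh(A.T @ A)           # ascending order
print(np.allclose(np.sqrt(eigvals[::-1]), s))   # True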
Example 1 : Find the SVD of

A = [ 2   2 ]
    [ -1  1 ]

First we calculate

A^T A = [ 5  3 ]
        [ 3  5 ]

Now we calculate the eigenvectors V_1 and V_2 using the method that we have seen in PCA. The eigenvalues of A^T A are 8 and 2, and the corresponding unit eigenvectors are

V_1 = [ 1/√2 ]        V_2 = [  1/√2 ]
      [ 1/√2 ]              [ -1/√2 ]

Next we calculate AV_1 and AV_2 :

AV_1 = [ 2√2 ]        AV_2 = [  0  ]
       [ 0   ]               [ -√2 ]

so the singular values are σ_1 = 2√2 and σ_2 = √2. Next we calculate U_1 and U_2 :

U_1 = (1/σ_1) AV_1 = [ 1 ]        U_2 = (1/σ_2) AV_2 = [  0 ]
                     [ 0 ]                             [ -1 ]

The SVD is written as

A = U S V^T = [ 1   0 ] [ 2√2  0  ] [ 1/√2   1/√2 ]
              [ 0  -1 ] [ 0    √2 ] [ 1/√2  -1/√2 ]

6.5 UNIVERSITY QUESTIONS AND ANSWERS

> May 2015

Qa.1 Explain in detail Principal Component Analysis for Dimension Reduction. (Ans. : Refer section 6.2) (10 Marks)

> May 2016


Qa.2 Explain in detail Principal Component Analysis for Dimension Reduction. (Ans. : Refer section 6.2) (10 Marks)

> May 2017

a.3 Use Principal Component analysis (PCA) to arrive at the transformed matrix for the given data.
As [ 5 7 osl (Ans. : Refer section 6.2) (10 Marks)

> May 2019 . :


@.4 Why Dimensionality Reduction is very Important step in Machine Learning ? (Ans. : Refer section 6.1) (8 Marks)
Q. 5 Why is Dimensionality reduction an important issue ? Describe the steps to reduce dimensionality using the Principal Component Analysis method by clearly stating the mathematical formulas used.
(Ans. : Refer sections 6.1 and 6.2) (10 Marks)
Q. 6 Write a short note on ICA and compare it with PCA. (Ans. : Refer sections 6.2 and 6.3) (5 Marks)
> Dec. 2019

Q. 7 What is Dimensionality reduction ? Describe how Principal Component Analysis is carried out to reduce the dimensionality of data sets. (Ans. : Refer sections 6.1 and 6.2) (10 Marks)

Q. 8 Find the singular value decomposition of

A = [ 2   2 ]
    [ -1  1 ]

(Ans. : Refer section 6.4) (10 Marks)
Multiple Choice Questions Q.6.5 In PCA principal component is a eigen vector for which
eigen value is_
Q. 6.1 Which of the following method is a dimensionality (a) maximum (b) minimum
reduction method?
(c) zero (d) one Y Ans. :(a)
(a) Principal Component Analysis
Explanation : Property of eigenvalue and eigenvector.
(b) Regression
(c) Classification Feature selection is a process of ___
(d) Clustering (a) selecting best k features from original d features


Y Ans. : (a)
Explanation : PCA is (b) extracting any k features from original d features
a DR technique whereas
regression, (c) selecting any k features from original d features
classification are supervised learning
examples and clustering is example of unsupervised (d) extracting best k features from original d features
learning. Y Ans. : (a)
Explanation : Out of all features best features are
Q. 6.2 Which of the following property is true for PCA
selected.
Algorithm ?
(a) Data used for PCA is having Less variance Q. 6.7 Out of following which method is used for dimension
teduction?
(b) Maximum number of principal components are
greater than number of features (a) Classification (b) Clustering
(c) All principal components are orthogonal to each (c) Regression
other (d) Independent component analysis(ICA)
(d) PCA is a Supervised learning method Ans. : (c) Y Ans. : (d)
Explanation : Property of PCA algorithm. Explanation ICA is a DR technique whereas
:
regression, classification are supervised learning
Q. 6.3 Single Value Decomposition(SVD) is which type of examples and clustering is example of unsupervised
method? learning.
(a) Regression (b) Classification
Q. 6.8 Dimensionality reduction algorithm ___
(c) Clustering Dimension Reduction ¥ Ans.
: (d)
(a) Reduce Time complexity
Explanation : SVD is used to decompose matrix in to its
(b) Increase Memory complexity
components.
(c) Increase Time Complexity
Q. 6.4 If eigenvalues are roughly equal then ___ (d) Increase overfitting problem ¥ Ans. : (a)
(a) PCA will perform outstandingly Explanation : Since data dimension is reduccd, time
(b) PCA will perform badly required to process the data will also be reduced.
(c) LDA will perform outstandingly Q. 6.9 Which following is an example of Supervised
v Ans.“ : (b)

alt
(d) LDA will perform badly dimensionality reduction algorithm ?
vectors are same in such
Explanation : When all eigen (a) Naive Bayes (b) SVM
case you won’t be able to select the principal compone nts Modul
are equal. (c) PCA (d) LDA Y Ans. : (d)
that case all principal compone nts
because in
mh

ean
(c) PCA minimize the variance of the data, wheg
Explanation ; Naive bayes and SVM are used for
classification. PCA is unsupervised whereas LDA is LDA mimimize the separation between different
classes
supervised, (4) Both LD AA and PCA are linear transformation
Q.6.10 Which of the following algorithms cannot be used for techniques
reducing the dimensionality of data ? and LDA
7 Ans: ()
transform data to a new
Explanation : PCA
(a) -SNE (b) PCA coordinate system.
(c) LDA (d) Random Forest Y Ans. : (d)
What happens when you get features in lower
Explanation : Random Forest algorithm is used for
classification. dimensions wsing PCA?
(a) The features will still have interpretability
Q.6.11 Which following is ‘an cxample of Unsupervised
(b) The features will lose interpretability
dimensionality reduction algorithm ?
(c) The features must carry all information present in
(a) Naive Bayes (b) SVM
data
(c) PCA (d) LDA Y Ans.
: (c)
(d) The features carry all information present in data
Explanation : Naive bayes and SYM are used for : (b)
Y Ans.
classification. PCA is unsupervised whereas LDA is
supervised. Explanation : When you get the features in lower
dimension then you will lose some information of data
Q.6.12 The key difference between feature selection and most of the times and you won’t be able to interpret the
extraction is lower dimension data.
(a) feature selection keeps a subset of the original
PCA is a deterministic algorithm because, __
features while feature extraction creates brand new
(a) it doesn’t have local minima problem
ones.
(b) feature selection keeps original features while (b) You need to initialize parameters in PCA
feature extraction creates brand new ones. (c) PCA can be trapped into local minima problem
(c) feature selection keeps original features while (d) PCA can be trapped into global minima problem
feature extraction keeps subset of original ones. ¥ Ans, : (a)
(d) feature selection creates new brand features while Explanation : PCA is a deterministic algorithm which
feature extraction keeps original ones. does not have parameters to initialize and it does not
Y Ans. : (a) have local minima problem.
Explanation : Out of all features best features are
When you are applying PCA on a image dataset
selected in feature selection whereas new features are
(a) It can be used to effectively detect deformable
calculated from original features in feature extraction.
objects
Q.6.13 Which of the following statement is correct in (b) It is invariant to affine transforms.
Dimensionality reduction? (c) It can be used for lossy image compression
(a) Removing columns with dissimilar data trends (d) It is invariant to shadows. v Ans. : (c)
(b) Removing columns which have high variance in data
Explanation : When we apply PCA on image some
(c) Removing columns with dissimilar data trends information may be lost.
(d) Removing columns which have too many missing
To produce the same projection result in SVD and PCA
values Vv Ans, : (d)
which following condition has to satisfy ?
Explanation : Coloumns which have too many missing
values are not important for processing hence they are (a) When data has zero median
removed. (b) When data has zero mean
(c) Both are always same
Q.6.14 Which of the following Property of LDA and PCA is
true? (d) Both are not always same ¥ Ans. : (0)
Explanation : Mean of zero is need
(a) PCA maximize the variance of the data, whereas ed for finding bey
LDA maximize the separation between different that minimizes the mean square error of approximation 2
class data in both SVD and PCA.
(b) LDA is Unsupervised whereas PCA is supervised

__ Q.6.23 Suppose we are using dimensionality reduction as
Q 6.19 Which of the following statement is truc? preprocessing technique, i.c., instead of using all the
(a) LDA explicitly attempts to model the difference features, we reduce the data to k dimensions with PCA.
between the classes of data. And then use these PCA projections as our features,
(b) Both attempt to model the difference between the Which of the following statements is correct ? Choose
classes of data. which of the options is correct?
(c) PCA explicitly attempts to model the difference (a) Higher value of ‘k’ means more regularization
between the classes of data. (b) Higher vafue of *k’ means less regularization
(d) LDA on the other hand does not take into account (c) Both of above
any difference in class. Y Ans. : (a) Y Ans, : (b)
(d) None of above
Explanation : Property of LDA. Explanation : Higher k would lead to less smoothcnin
g
preserve more characterist ics in
do we consider in PCA? as we would be able to
Q.6.20 Which of the following offset,
data, hence less regulariz ation.
(a) Vertical « (vw) Perpendicular offset
, which of
(c) Horizontal offset (d) Parallel offsct Q. 6.24 When performing regression or classification
way to prepro cess the data?
Y Ans, : (b) the following is the correct
training PCA
PCA are (a) Normalize the data + PCA —
Explanation : Eigen vectors generated in
—>+ normalize PCA output — training
orthogonal to each other. —» normalize PCA
(b) Normalize the dala > PCA
on in
Which of the following necessitates feature reducti
output — training
Q. 6.21
machine leaming? (c) None of the above
Ans. : (a)
es (d) Al above
of l
(a) Irrelevant and redundant featur data then we apply
Explanation : First we normalize the
(b) Limited training data reduc tion metho d. After that we train PCA
dimension
Final ly we go for
(c) Limited computational resou
rces. and normalize the cigen vectors.
Y Ans. : (d) training.
(d) All of the above
example of feature
: If we limited training data,
are having Q.6.25 Which of the following is an
Explanation
presence of imelevant and extraction?
computational resources and from an email
we have to apply dimensio
n (a) Constructing bag of words vector
redundant features then large high-dimensional
(b) Applying PCA projects to a
reduction techniques. data
s
ber of principal component (c) Removing stopwords in a sent
ence
Q. 6.22 What are the optimum num
? v Ans. : (d)
in the below Fig. Q. 6.22 rit
(d) All of the above
; are used for feature
Explanation : All of above methods
n oo9| 1
+4-
40¢--}+4-4
een
‘ pe
e t
je extraction.
S Co
§3-- ale Q. 6.26 What is PCA components in
Sklearn?
BS 08st pot “
(a) Set of all eigen vect ors for the projection space
onen ts
(b) Matrix of principal comp
oe
3
aw 0.674
matrix
(c) Result of the multiplication
><8-
@i 045 (d) None of the above options
Y Ans. : (a)
hon) PCA components are
8 07]
ES
Explanation : In Sklearn (pyt
vectors for the projection
ele represented by set of all eigen
0
mponent space.
Principal Co
reasonable way to select the
Q. 6.27 Which of the following is a "K9
Fig. Q- 6.22 s
ponent
bana sto number of principal com M.
least 99%
20 smallest value so that at odule
(b) (a) Choose k to be the
(a) 10
of the variance is retained
.
6.
es,

(a) 40 ter 30 k
c) 30 figure that af
can see in the .
Retn, a

na ti on = We e co ns ta nt
c la remains th
onen ts the curve
ie cipal comp
prin ...A SACHIN SHAH Venture
[el Tech-Neo Publications
a

. : -14
(M6-14)
w Syll abus wef aca demic year 18-19)
(MU-Ne

Scanned with CamScanner


To Learn The Art Of Cyber Security & Ethical Hacking Contact Telegram - @crystal1419

Dimensionality Reduction ...Page no(6-16)


Machine Learning (MU-Sem 6-Comp)
t be Used for
_{b) Choose k to be 99% of m (k = 0.99*m, rounded to Q.6.33 Which of the following algorithms canno
ty of data ?
reducing the dimensionali
the nearest integer).
(b) PCA
(c) Choose k to be the largest value so that 99% of the (a) t-SNE
* (d) None of these Y Ans,: @
variance is retained. (c) LDA
(d) Use the elbow method ¥ Ans. : (a) Explanation : All of the algorithms are the example of
-
Explanation : Standard procedure to select the number dimensionality reduction algorithm.
* of principal components. Q. 6.34 PCA can be used for projecting and visualizing data in
Q. 6.28 Imagine, you have 1000 input'features and 1 target lower dimensions.
(b) FALSE Ans. : (a)
‘ feature in a machine leaming problem. You have to (a) TRUE
ul to plot the
select 100 most important features based on the Explanation : Sometimes it is very usef
relationship between input features and the target ‘data in lower dimensions. We can take the first 2
features. Do you think, this is an example of principal components and then visualize the data using
dimensionality reduction? ‘
scatter plot.
(a) Yes (b) No Y Ans.
: (a) Q.6.35 The most popularly used dimensionality reduction
Explanation : Since we have to select 100 important algorithm is Principal Component Analysis (PCA),
features out of 1000 features this is a example of feature Which of the following is/are true about PCA?
selection. . 1. PCA isan unsupervised method
Q. 6.29 It is not necessary to have a target variable for applying 2. ‘It searches for the directions that data have the
dimensionality reduction algorithms. largest variance
(a) TRUE (b) FALSE Y Ans. : (b) 3. Maximum number of principal components
Explanation : LDA is an example of supervised <= number of features ‘
dimensionality reduction algorithm. 4. All principal components are orthogonal to each
Q.6.30 [have 4 variables in the dataset such as - A, B,C & D. I other
' have performed the following actions : (a) land 2 (b) Land3
Step 1: Using the above variables, I have created two (c) 2 and 3 (d) 1,2and3
. more variables, namely E = A + 3 * B and
(e) l.2and4 (f) Allof the above Y Ans: (f)
| / . FaB+5*C4D.
Explanation : All options are self explanatory.
Step 2: Then using only the variables E and F I have
‘ buiit a Random Forest model. Q. 6.36 Which of the following statement is correct for t-SNE
Could: the steps performed above represent a and PCA?
dimensionality reduction method? (a) -SNE is linear whereas PCA is non-linear
" (a) Tue (b) False v Ans. ; (a) (b) t-SNE and PCA both are linear
Explanation : Yes, Because Step | could be used to (c) t-SNE and PCA both are nonlinear
‘represent the data into 2 lower dimensions.
(d) t-SNE is nonlinear whereas PCA is linear
| : Q. 6.31 Which of the following techniques would perform better Y Ans. : (d)
. for reducing dimensions of a data set? Explanation : (-SNE applies non finear transformation
: (a) Removing columns which have too many missing whereas PCA applies linear transformation.
values
Q. 6.37 What is of the following statement is true about t-SNE in
(b) Removing columns which have high variance in data
comparison to PCA?
(c) Removing columns with dissimilar data trends
(a) When the data is huge (in size), t-SNE may
(d) None of these , Y Ans, : (a) fail 0
produce better results,
’ Explanation : [f columns have too many missing values,
(say 99%) then we can remove such columns. (b) T-NSE always produces better result regardless o
Q. 6.32 Dimensionality reduction algorithms are one of the
the size of the data
, possible ways to reduce the computation time required to (©) PCA always performs better than t-SNE for smaller
build a model. size data.
(a) TRUE (b) FALSE Y Ans. : (a) (d) None of these ‘ ~ v Ans.? (a)
Explanation ; Reducing the dimension of data will take Explanation : As per working of method. “| -
less time to train a model.

Q.6.38 Xi and Xj are two distinct points . in the higher dimension | Q. 6.41 Imagine, 7 you are gi
given the following
representation, where as Yi & Yj are the representations scatter plot
between height and weight
of Xi and Xj in a lower dimension, 9] T 7 ih
1. The similarity of datapoint Xi to
datapoint Xj is the
conditional probability p (li)
7
2. The similarity of datapoint Yi to datapoint
Yj is the a”
conditional probability q (li)
3 5
Which of the following must
for perfect be truc “ ‘
representation of xi and xj in lower
dimensional space?
;
(a) p Gli
= 0)
and q (jli)= 1
2
(b) p (jl) <q Gli) 1

©) pald=aaGin . " oH, 50 100 150 200 250 300


|
(d) p Gli) > q Gli) Y Ans.: (0) Weight
Explanation : The conditional probabilities for similarity Fig. Q. 6.41
of two points must be equal because similarity between Select the angle which will capture maximum variability
the points must remain unchanged in both higher and along
a single axis?
lower dimension for them to be perfect representations. (2) ~ Odegree (6) ~ 45 degree
(c) ~ 60 degree (d) ~90degree Ans. : (b)
Q. 6.39 Which of the following comparison(s) are true about Explanation : Option b has largest possible variance in
PCA and LDA? data.
Both LDA and PCA are linear transformation techniques | @-6.42 The below snapshot shows the scatter plot of two
features (XI and X2) with the class information (Red,
tod iyt
PCA and LDA.
Blue). You can also see the direction of soa
LDA is supervised whereas PCA is unsuper vised.
lana 10
PCA maximize the variance of the data, whereas LDA © Class 2 |
quest iret t classes,
maximize the separation between differen
|
(b) 2and3 5
(a) | and 2
aint (c) | and 3 (d) Only 3
j
¥ Ans. : (e) at
(e) 1,2 and 3 ss
are correct
Explanation : All of the options
—5L a , suLDA’ PCA
Q.6.40 PCA works better ifi there iis? 07 5
-5
i
1. A linear structure 1m the data
flat Fig. 0.642
i surface and not on a
2. If the data lies on a curved
result into better
Which of the following method would
sues in the same unit wa
3. If variables are scaled class prediction?
alg
algorithm with PCA
2 and 3 (a) Building a classification
and 2 (b)
tion of PCA)
(A principal component in direc
1,2 and 3 Y Ans. : (¢) LDA
a (b) Building a classification algorithm with
fs) Cant say
ana H ®i ct.
© c 1si corre
Explanation : Option Y Ans. : (b)
(d) None of these
" to classify these point PCA s,pcA Module
Explanation : If our goal is
harm than good. The majority
projection does only more
d land overlapped on the first
of blue and red points woul
SACHIN SHAH Venture
ee Bl Tech-Neo Publications..A
n(M6-1:
er9)
emic year 18-1
U- Ne w Sy ll abus wef acad
(M

ee



Stes ay Dimensionality Reduction ...Page no (6-19
Machine Learning (MU-Sem 6-Comp)
a 7 a
. fuse the principal component should be normalized to haye Unit
Principal component. Hence PCA would con! Pk), ,
classifier. length. (The negation v = [ 2°23 1S also
Q. 6.43 Which of the following options are correct, when you are correct.)
applying PCA on a image dataset? 0.6.46 If we project the original data points into the La
1. r on . be used i
to effectively detect deform able | subspace by the principal component 2B *

2. &tis invariant to‘affine transforms. What are their coordinates in the 1-d subspace?
3. It can be used for lossy image compression. (a) f-2. (0), ¥2 (6) ¥2,0,V2 |
4. Itis not invariant to shadows. ) ¥2,0) J-2 @ ¥-2,(0),V-2
(a) land 2 (b) 2and3 \ sO Y Ans. s (a)
(c) 3 and4 (d) land4 Y Ans. : (c) Points after
Explanation : The coordinates of three
Explanation : Option c is correct projection should be
Q.6.44 Under which condition SVD and PCA produce the same 2 =xT, v=[-1,-H [22 2) T=V-2, Zy
projection result?
V2
(a) When data has zero median ‘ =xT,v=0,2,=xT;v=V 2
(b) When data has zero mean Q.6.47 For the projected data you just obtained projections
(c) Both are always same V-2, (0), fz. Now if we represent them in the
(d) None of these Y Ans. : (b) original 2-d space and consider them as the
Explanation : When the data has a zero mean vector, reconstruction of the original data points, what is the
otherwise you have to center the data first before taking reconstruction error ? (Context : 45-47)
SVD. (a) 0% (b) 10% (c) 30% (d) 40% ¥ Ans. : (a)
: inte i ion : The reconstruction error is 0, since all
@. 6.45 Ob, tder'S data points in the 2d space : © 11). (0) tice potas are perfectly located on the direction of the
ee first principal component. Or, you can actually calculate
the reconstruction : z, - v.

x̂_1 = -√2 · [√2/2, √2/2]^T = [-1, -1]^T
x̂_2 = 0 · [√2/2, √2/2]^T = [0, 0]^T
x̂_3 = √2 · [√2/2, √2/2]^T = [1, 1]^T
which are exactly x_1, x_2, x_3.

Q. 6.48 Which above graph shows better performance of PCA?


Where M is first M principal components and D is total
number of features.
nope

Fig. Q. 6.45
What will be the first principal component for this data?
+].
iCEY) > [eal3 |

(a)
(SAR
land2 (b) 3and4
[dal a Fig. Q. 6.48

(c) land 3 (d) 2and4 Y Ans, : (c) me in


Explanation : The first principal component is (b) Right
[2 x2) (c) Any of A andB
T (you shouldn’t really need to solve any
v=L 2° 2 (d) None of these
SVD or eigen problem to see this). Note that the

(MU-New Syllabus w.e.f academic year 18-19) (M6-14) Tech-Néd Pubtiestieng neh




Machine Learning (MU-Sem 6-Comp) Reduction . _.Page


no (6-19)
Dimensionality
d jo we consider in PCA?
eninsuaueu : PCA is good
if {(M) asymptotes rapidly to @.6.50 Which of the folalowing offset,
; 1.
t
aly | |
1. This happens if the first Cigcnvalucs are big ye
and the
emaind
re if 1
erPr are
orl
small.
« .
PCA 7is bad if all the cigenvalues are
roughly equal. Sec examples of both cases in
Fig. Q. 6.48.

Q. 6.49 Which of the following can be the first 2 principal


components after applying PCA?
Perpendicular offsets
Vertical offsets
(0.5, 0.5, 0.5, 0.5) and (0.71, 0.71, 0, 0)
Fig. Q. 6.50
(0.5, 0.5, 0.5, 0.5) and (0, 0, - 0.71, = 0.71)
(0.5, 0.5, 0.5, 0.5) and (0.5, 0.5, — 0.5, - 0.5) (a) Vertical offset

(0.5, 0.5, 0.5, 0.5) and (- 0.5, -0.5, 0.5, 0.5) (b) Perpendicular offset
Land 2 (b) { awd 3
(a) (c) Both
(c) 2and 4 (dl) 3 and 4 Y Ans. : (d) v Ans. : (b)
(d) None of these
two loading
Explanation : For the first two choices, the Explanation : We always consider residual as vertical
l.
veelors are not orthogona offsets. Perpendicular offset are useful in case of PCA.

Chapter Ends...

