▶ Introduction
▶ The Best Strategy
People Involved
▶ Instructors
  ▶ Ambedkar Dukkipati
  ▶ Chiranjib Bhattacharyya
▶ TAs
  ▶ Shubham Gupta
  ▶ Nabanita Paul
  ▶ Tony Gracious
  ▶ Shaarad A R
  ▶ Mariyamma Antony
  ▶ Chaitanya Murti
▶ Web Support: Aakash Patel
Introduction
Machine Learning 101 - Building the hype!
▶ Neural FaceApp
▶ Chatbots
▶ Speech Generation
▶ Generating Music
Machine Learning 101 - Building the hype!
▶ Information Extraction
  ▶ Web search
  ▶ Question answering
  ▶ Knowledge graph mining, etc.
▶ Economics and Finance
  ▶ Algorithmic trading
  ▶ Analyzing purchase patterns and market analysis
  ▶ E-commerce applications like product recommendations, etc.
▶ Others
  ▶ Automated theorem proving
  ▶ Robotics
  ▶ Advertising
  ▶ And many more...
Data and Models
Basics: Data and Models
▶ A model is a representation of the real world
Basics: Data and Models - Example 1
▶ Model: A probability distribution over all possible sentences of length ≤ 50, with a Markov assumption
Basics: Data and Models - Example 2
▶ Data: An observed friendship network involving students from grade one and grade two in a school
Different types of data
▶ Sequential Data
  ▶ Sentences
  ▶ Time series data
▶ Relational Data
  ▶ Tabular data collected during surveys
  ▶ Graph-structured data
▶ Multimodal Data
  ▶ Videos
  ▶ Medical records
Basics: Models
Basics: Models - Examples
▶ etc.
Machine Learning Workflow
Machine Learning Workflow
Machine Learning Workflow - Data Pre-processing
▶ Data Cleaning
  ▶ Removing outliers
  ▶ Filling in missing values
  ▶ Denoising the data
▶ Normalization
  ▶ Making data zero mean
  ▶ Scaling the values
▶ Integration
  ▶ Combine data from different sources
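The normalization step above (zero mean, unit scale per feature) can be sketched in NumPy; the data matrix here is purely illustrative:

```python
import numpy as np

# Illustrative data matrix: 4 samples, 2 features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0],
              [4.0, 800.0]])

# Make the data zero mean (per feature)
X_centered = X - X.mean(axis=0)

# Scale the values to unit standard deviation (per feature)
X_normalized = X_centered / X_centered.std(axis=0)
```

After this step both features contribute on comparable scales, which matters for distance-based methods discussed later.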
Machine Learning Workflow - Feature Extraction
Machine Learning Workflow - Dimensionality Reduction
Machine Learning Workflow - Other Components
▶ Training
  ▶ Choose a model
  ▶ Use observed data to learn the parameters of the model
  ▶ e.g., learning the weights of a neural network
▶ Validation
  ▶ Use validation strategies to fine-tune model hyperparameters
  ▶ Perform model selection
  ▶ e.g., using K-fold cross-validation to select a value of the regularization parameter
▶ Testing
  ▶ Compute the performance on unseen data
  ▶ Diagnose the problems
  ▶ Deploy the model
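As one example of a validation strategy, K-fold cross-validation can be sketched as follows; the `fit`/`score` callables and toy data are illustrative placeholders, not a specific library API:

```python
import numpy as np

def k_fold_score(X, y, fit, score, k=5, seed=0):
    """Average validation score over K train/validation splits."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])              # train on K-1 folds
        scores.append(score(model, X[val], y[val]))  # validate on the held-out fold
    return float(np.mean(scores))

# Toy usage: a "model" that just predicts the training mean,
# scored by negative mean squared error
s = k_fold_score(np.arange(6, dtype=float).reshape(-1, 1), np.ones(6),
                 fit=lambda X, y: y.mean(),
                 score=lambda m, X, y: -np.mean((y - m) ** 2), k=3)
```

Running this for each candidate hyperparameter value and keeping the best average score is the model-selection recipe mentioned above.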
Distance Based Classifiers
Data Representation
▶ Output yn can be a class label, e.g., yn ∈ {−1, +1}
Data: In what form?
Data in vector space representation
Data in vector space representation (contd...)
Data in vector space representation (contd...)
▶ ℓ1 distance between xn, xm ∈ ℝ^D:

  ℓ1(xn, xm) = ‖xn − xm‖_1 = Σ_{d=1}^{D} |x_{nd} − x_{md}|
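In code, the ℓ1 distance is just a sum of absolute coordinate differences; a minimal NumPy sketch with illustrative vectors:

```python
import numpy as np

def l1_distance(xn, xm):
    """l1 (Manhattan) distance between two vectors in R^D."""
    return float(np.sum(np.abs(xn - xm)))

d = l1_distance(np.array([1.0, 2.0, 3.0]),
                np.array([2.0, 0.0, 3.0]))  # |1-2| + |2-0| + |3-3| = 3
```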
Data Matrix
A Simple Decision Rule based on Means
  µ_− = (1/N_−) Σ_{n: yn = −1} xn

  µ_+ = (1/N_+) Σ_{n: yn = +1} xn

▶ Note: can we just store the two means and throw away the data?
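A minimal sketch of this mean-based rule: keep only the two class means and assign a new point to the class whose mean is closer. The toy data and labels are illustrative:

```python
import numpy as np

# Illustrative training data with labels in {-1, +1}
X = np.array([[0.0, 0.0], [1.0, 0.0],   # class -1
              [5.0, 5.0], [6.0, 5.0]])  # class +1
y = np.array([-1, -1, +1, +1])

# Class means: all we need to keep from the training data
mu_neg = X[y == -1].mean(axis=0)
mu_pos = X[y == +1].mean(axis=0)

def predict(x):
    """Assign x to the class whose mean is closer (squared Euclidean distance)."""
    d_neg = np.sum((x - mu_neg) ** 2)
    d_pos = np.sum((x - mu_pos) ** 2)
    return -1 if d_neg < d_pos else +1

label = predict(np.array([5.5, 4.5]))
```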
A Simple Decision Rule based on Means (contd...)
▶ Here,
  ▶ ‖a − b‖² denotes the squared Euclidean distance between a and b.
  ▶ ⟨a, b⟩ denotes the inner product of two vectors a and b.
  ▶ ‖a‖² = ⟨a, a⟩ denotes the squared ℓ2 norm of a.
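With this notation, the mean-based decision rule can be expanded into a linear rule (a standard derivation, sketched here for completeness): assign x to class +1 when ‖x − µ_−‖² > ‖x − µ_+‖², and

  ‖x − µ_−‖² − ‖x − µ_+‖² = 2⟨µ_+ − µ_−, x⟩ + ‖µ_−‖² − ‖µ_+‖²

so the rule depends on x only through an inner product with µ_+ − µ_−, i.e., it is linear in x.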
The Decision Rule
Decision Rule Based on Means: Some Comments (contd...)
Bayesian Decision Theory
Bayesian Decision Making in Real Life
Decision Rule: Based on Prior Knowledge
Decision Rule: Based on Prior Knowledge (contd...)
▶ It looks fine, but with this rule the class label is going to be the same for every catch.
Decision Rule: Based on class conditional probabilities
The aim here is to extract the features of the fish and feed them to our model.
Decision Rule: Based on class conditional probabilities (contd...)
Bayesian way....
Bayesian Decision Rule as Risk Minimization
Aim: Find the decision rule that minimizes the overall risk R.
▶ Then

  R* = ∫_{x ∈ ℝ^D} R(α*(x) | x) P(x) dx

  is the minimum risk.
Two Class Classification and Likelihood Ratio
  ⟹ ψ(x) > c, where ψ(x) = P(x | ω1) / P(x | ω2)
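A sketch of the likelihood-ratio rule with two assumed univariate Gaussian class conditionals; the densities, means, and threshold c are illustrative choices, not part of the slides:

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def classify(x, c=1.0):
    """Decide omega_1 if psi(x) = P(x|omega_1)/P(x|omega_2) exceeds the threshold c."""
    psi = normal_pdf(x, mu=0.0, sigma=1.0) / normal_pdf(x, mu=3.0, sigma=1.0)
    return "omega_1" if psi > c else "omega_2"
```

With c = 1 this reduces to picking the class with the higher likelihood; other values of c encode priors and losses.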
Two Class Classification and Likelihood Ratio: Summary
Classification with 0-1 loss
Bayes rule in action
Minimax Criterion (Two class case)
Implies: design the classifier so that the worst-case risk over all values of the prior is as small as possible. That is,
Minimize the Maximum Possible Risk
Minimax Criterion (contd...)
Minimax Criterion (contd...)
Minimax Classification
Discriminant Functions
  g : ℝ^D → {ω1, . . . , ωc}

  g(x) = ωi if gi(x) > gj(x) ∀ j = 1, 2, . . . , c and j ≠ i
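The rule above, assigning x to the class with the largest discriminant value, can be sketched generically; the three discriminant functions here are illustrative placeholders:

```python
import numpy as np

def classify(x, discriminants, classes):
    """Assign x to the class omega_i whose discriminant g_i(x) is largest."""
    scores = [g(x) for g in discriminants]
    return classes[int(np.argmax(scores))]

# Illustrative discriminants for a 3-class problem on the real line
gs = [lambda x: -abs(x - 0.0),
      lambda x: -abs(x - 5.0),
      lambda x: -abs(x - 10.0)]
label = classify(4.0, gs, ["omega_1", "omega_2", "omega_3"])
```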
Bayes Classifier as Discriminant Functions
Bayes Classifier as Discriminant Functions (contd...)
Now,

  gi^(1)(x) = P(ωi | x) = P(x | ωi) P(ωi) / Σ_{j=1}^{c} P(x | ωj) P(ωj)

  gi^(2)(x) = P(x | ωi) P(ωi)

  gi^(3)(x) = ln P(x | ωi) + ln P(ωi)

All the discriminant functions g^(1), g^(2), and g^(3) are equivalent: they induce the same classification.
Discriminant Functions for Normal densities
▶ Normal distribution (univariate):

  P(x) = (1 / (√(2π) σ)) exp[−(1/2) ((x − µ)/σ)²]

▶ Multivariate normal distribution (x ∈ ℝ^d):

  P(x) = (1 / ((2π)^{d/2} |Σ|^{1/2})) exp[−(1/2) (x − µ)^T Σ^{−1} (x − µ)]
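The multivariate density can be evaluated directly; a sketch following the formula above, with an illustrative standard bivariate normal:

```python
import numpy as np

def multivariate_normal_pdf(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, following the multivariate formula above."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return float(norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff))

# Standard bivariate normal evaluated at its mode: 1 / (2*pi)
p = multivariate_normal_pdf(np.array([0.0, 0.0]),
                            mu=np.array([0.0, 0.0]),
                            Sigma=np.eye(2))
```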
Discriminant Functions for Normal densities
Discriminant Functions for Normal densities: Σi = σ²I

  gi(x) = −‖x − µi‖² / (2σ²) + ln P(ωi)

  where ‖x − µi‖² = (x − µi)^T (x − µi)

  gi(x) = −(1/(2σ²)) [x^T x − 2µi^T x + µi^T µi] + ln P(ωi)

By ignoring x^T x (since it is the same for all the discriminant functions), we get

  gi(x) = wi^T x + wi0, which is linear in x,

  where wi = (1/σ²) µi and wi0 = −(1/(2σ²)) µi^T µi + ln P(ωi)
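A sketch of the resulting linear discriminant for Σi = σ²I; the means, shared variance, and priors are illustrative values:

```python
import numpy as np

# Illustrative parameters: two classes with shared covariance sigma^2 * I
sigma2 = 1.0
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
priors = [0.5, 0.5]

def g(i, x):
    """Linear discriminant g_i(x) = w_i^T x + w_i0 for Sigma_i = sigma^2 I."""
    w = mus[i] / sigma2                                        # w_i = mu_i / sigma^2
    w0 = -(mus[i] @ mus[i]) / (2 * sigma2) + np.log(priors[i])  # w_i0
    return w @ x + w0

x = np.array([1.0, 1.0])
label = 0 if g(0, x) > g(1, x) else 1  # pick the class with the larger discriminant
```

With equal priors the decision boundary is the perpendicular bisector of the two means, as the equal-score midpoint check below illustrates.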
Discriminant Functions for Normal densities: Σi = Σ

  gi(x) = −(1/2) (x − µi)^T Σ^{−1} (x − µi) + ln P(ωi)
        = wi^T x + wi0  (after dropping the class-independent term x^T Σ^{−1} x)
Discriminant Functions for Normal densities: Σi arbitrary

  gi(x) = x^T Wi x + wi^T x + wi0

where

  Wi = −(1/2) Σi^{−1},

  wi = Σi^{−1} µi,

  wi0 = −(1/2) µi^T Σi^{−1} µi − (1/2) ln |Σi| + ln P(ωi)
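For arbitrary Σi the discriminant is quadratic in x; a sketch that assembles Wi, wi, and wi0 as defined above (the parameter values are illustrative):

```python
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """g_i(x) = x^T W_i x + w_i^T x + w_i0 with W_i, w_i, w_i0 as above."""
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv                     # W_i
    w = Sigma_inv @ mu                       # w_i
    w0 = (-0.5 * mu @ Sigma_inv @ mu         # w_i0
          - 0.5 * np.log(np.linalg.det(Sigma))
          + np.log(prior))
    return float(x @ W @ x + w @ x + w0)

g_val = quadratic_discriminant(np.array([1.0, 2.0]), np.array([0.0, 1.0]),
                               np.array([[2.0, 0.3], [0.3, 1.0]]), prior=0.4)
```

By construction this equals −(1/2)(x − µi)^T Σi^{−1}(x − µi) − (1/2) ln |Σi| + ln P(ωi), the log of the class posterior up to a class-independent constant.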
Summary
Acknowledgements
These notes have been prepared over a period of time and I have
referred to many sources. Where ever it is possible, I
acknowledged the sources; if I have missed any, my apologies and
acknowledgments can be implicitly assumed. Several of my
students also contributed to drawing figures etc., and I am very
thankful to them
75