▶ Introduction
About the course
On ML
The Best Strategy
Agenda
Introduction
Introduction
Machine Learning 101 - Building the hype!
Neural FaceApp
Chatbots
Speech Generation
Generating Music
Machine Learning 101 - Building the hype!
▶ Information Extraction
▶ Web search
▶ Question answering
▶ Knowledge graph mining, etc.
▶ Economics and Finance
  ▶ Algorithmic Trading
  ▶ Analyzing purchase patterns and market analysis
  ▶ e-commerce applications like product recommendations, etc.
▶ Others
  ▶ Automated theorem proving
  ▶ Robotics
  ▶ Advertising
▶ And many more...
Data and Models
Basics: Data and Models
▶ A model is a representation of the real world
Basics: Data and Models - Example 1
▶ Real World: Sentences used by people in conversations about machine learning
▶ Model: A probability distribution over all possible sentences of length ≤ 50, with a Markov assumption (a toy sketch follows below)
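As a toy illustration of such a model, here is a minimal sketch of a first-order (bigram) Markov distribution over sentences; the corpus and boundary tokens are hypothetical, not from the notes:

```python
# A toy first-order (bigram) Markov model over sentences.
# The corpus and the <s>/</s> boundary tokens are illustrative.
from collections import Counter, defaultdict

corpus = [["machine", "learning", "is", "fun"],
          ["machine", "learning", "needs", "data"]]

# Count bigram transitions, with <s> and </s> marking sentence boundaries.
counts = defaultdict(Counter)
for sentence in corpus:
    for prev, word in zip(["<s>"] + sentence, sentence + ["</s>"]):
        counts[prev][word] += 1

def prob(sentence):
    """P(sentence) = prod_t P(w_t | w_{t-1}) under the Markov assumption."""
    p = 1.0
    for prev, word in zip(["<s>"] + sentence, sentence + ["</s>"]):
        total = sum(counts[prev].values())
        p *= counts[prev][word] / total if total else 0.0
    return p

print(prob(["machine", "learning", "is", "fun"]))  # 0.5 on this toy corpus
```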
Basics: Data and Models - Example 2
▶ Real World: Students and friendships among them
▶ Data: An observed friendship network involving students from grade one and grade two in a school
Different types of data
▶ Sequential Data
  ▶ Sentences
  ▶ Time series data
▶ Relational Data
  ▶ Tabular data collected during surveys
  ▶ Graph-structured data
▶ Multimodal Data
  ▶ Videos
  ▶ Medical records
Basics: Models
Basics: Models - Examples
Machine Learning Workflow
Machine Learning Workflow - Data Pre-processing
▶ Data Cleaning
  ▶ Removing outliers
  ▶ Filling in missing values
  ▶ Denoising the data
▶ Normalization (see the sketch below)
  ▶ Making data zero mean
  ▶ Scaling the values
▶ Integration
  ▶ Combining data from different sources
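A minimal sketch of the normalization step; the data matrix here is hypothetical, and in practice the statistics are fit on training data only:

```python
# Zero-mean, unit-scale normalization of an illustrative (N, D) data matrix.
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 180.0],
              [3.0, 220.0]])   # N samples, D features

mu = X.mean(axis=0)            # per-feature mean
sigma = X.std(axis=0)          # per-feature standard deviation
X_norm = (X - mu) / sigma      # each column now has zero mean and unit variance
```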
Machine Learning Workflow - Feature Extraction
Machine Learning Workflow - Dimensionality Reduction
Machine Learning Workflow - Other Components
▶ Training
  ▶ Choose a model
  ▶ Use observed data to learn the parameters of the model
  ▶ e.g., learning the weights of a neural network
▶ Validation
  ▶ Use validation strategies to fine-tune model hyperparameters
  ▶ Perform model selection
  ▶ e.g., using K-fold cross-validation to select a value of the regularization parameter (see the sketch below)
▶ Testing
  ▶ Compute the performance on unseen data
  ▶ Diagnose problems
▶ Deploy the model
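A minimal sketch of K-fold cross-validation for choosing a regularization parameter; the closed-form ridge fit and the synthetic data are illustrative, not from the notes:

```python
# Pick a ridge regularization parameter by K-fold cross-validation.
import numpy as np

def ridge_fit(X, y, lam):
    # w = (X^T X + lam I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def kfold_mse(X, y, lam, K=5):
    folds = np.array_split(np.arange(len(y)), K)
    errs = []
    for k in range(K):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(K) if j != k])
        w = ridge_fit(X[trn], y[trn], lam)
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    return np.mean(errs)   # average validation error across the K folds

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
best_lam = min([0.01, 0.1, 1.0, 10.0], key=lambda lam: kfold_mse(X, y, lam))
```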
Distance Based Classifiers
Data Representation
▶ Output yn can be
Data: In what form?
Data in vector space representation
Data in vector space representation (contd...)
Data in vector space representation (contd...)
▶ The $\ell_1$ distance between $x_n, x_m \in \mathbb{R}^D$:
$$\ell_1(x_n, x_m) = \|x_n - x_m\|_1 = \sum_{d=1}^{D} |x_{nd} - x_{md}|$$
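As a quick illustration (the vectors are hypothetical):

```python
# The l1 distance above, computed with numpy.
import numpy as np

xn = np.array([1.0, 2.0, 3.0])
xm = np.array([2.0, 0.0, 3.5])
l1 = np.abs(xn - xm).sum()   # sum_d |x_nd - x_md| = 1 + 2 + 0.5 = 3.5
```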
Data Matrix
A Simple Decision Rule based on Means
$$\mu_- = \frac{1}{N_-} \sum_{n:\, y_n = -1} x_n, \qquad \mu_+ = \frac{1}{N_+} \sum_{n:\, y_n = +1} x_n$$
▶ Note: can we just store the two means and throw away the data? (See the sketch below.)
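A minimal sketch of this rule under the usual reading, predicting the class whose mean is closer in Euclidean distance; function names and shapes are illustrative:

```python
# Mean-based rule: keep only the two class means, then predict by proximity.
import numpy as np

def fit_means(X, y):
    # X: (N, D) data matrix; y: labels in {-1, +1}
    mu_neg = X[y == -1].mean(axis=0)   # mu_-
    mu_pos = X[y == +1].mean(axis=0)   # mu_+
    return mu_neg, mu_pos              # the raw data can now be discarded

def predict(x, mu_neg, mu_pos):
    # decide +1 iff x is closer (in Euclidean distance) to mu_+
    d_neg = np.linalg.norm(x - mu_neg)
    d_pos = np.linalg.norm(x - mu_pos)
    return 1 if d_pos < d_neg else -1
```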
A Simple Decision Rule based on Means (contd...)
▶ Here
The Decision Rule
Decision Rule Based on Means: Some Comments (contd...)
Bayesian Decision Theory
Bayesian Decision Making in Real Life
Decision Rule: Based on Prior Knowledge
Decision Rule: Based on Prior Knowledge (contd...)
▶ It looks fine, but the predicted class label is going to be the same for every catch.
Decision Rule: Based on class conditional probabilities
The aim here is to extract features of the fish and feed them to our model.
Decision Rule: Based on class conditional probabilities (contd...)
The Bayesian way...
Bayesian Decision Rule as Risk Minimization
Aim: Find the decision rule that minimizes the overall risk R.
▶ Then
$$R^* = \int_{x \in \mathbb{R}^D} R(\alpha^*(x) \mid x)\, P(x)\, dx$$
is the minimum risk.
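For reference, the pointwise quantity being minimized can be written in the standard Bayesian decision-theory notation; the loss function $\lambda$ and the action set are assumed here, not taken from the notes:

```latex
% Conditional risk of taking action \alpha_i at x, and the Bayes decision rule
R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x),
\qquad
\alpha^*(x) = \arg\min_{i} R(\alpha_i \mid x)
```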
Two Class Classification and Likelihood Ratio
$$\implies \psi(x) > c, \quad \text{where } \psi(x) = \frac{P(x \mid \omega_1)}{P(x \mid \omega_2)}$$
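Under the standard two-class derivation with losses $\lambda_{ij} = \lambda(\alpha_i \mid \omega_j)$ (notation assumed, not from the notes), deciding $\omega_1$ whenever $R(\alpha_1 \mid x) < R(\alpha_2 \mid x)$ yields the threshold:

```latex
% Decide \omega_1 when the likelihood ratio \psi(x) exceeds c
c = \frac{(\lambda_{12} - \lambda_{22})\, P(\omega_2)}{(\lambda_{21} - \lambda_{11})\, P(\omega_1)}
```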
Two Class Classification and Likelihood Ratio: Summary
Classification with 0-1 loss
Bayes rule in action
Minimax Criterion (Two class case)
Implies: design the classifier so that the worst-case risk over all possible values of the prior is as small as possible. That is:
Minimize the maximum possible risk.
Minimax Criterion (contd...)
Minimax Criterion (contd...)
Minimax Classification
$$g : \mathbb{R}^D \to \{\omega_1, \ldots, \omega_c\}$$
$$g(x) = \omega_i \quad \text{if} \quad g_i(x) > g_j(x) \;\; \forall\, j = 1, 2, \ldots, c,\; j \neq i$$
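A minimal sketch of this decision rule; the list of discriminant callables is illustrative:

```python
# Multi-class decision with discriminant functions:
# assign x to the class with the largest g_i(x).
import numpy as np

def classify(x, discriminants):
    # discriminants: a list of callables [g_1, ..., g_c]
    scores = [g(x) for g in discriminants]
    return int(np.argmax(scores))   # 0-based index of the largest discriminant
```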
Bayes Classifier as Discriminant Functions
Bayes Classifier as Discriminant Functions (contd...)
Now,
$$g_i^{(1)}(x) = P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\, P(\omega_i)}{\sum_{j=1}^{c} P(x \mid \omega_j)\, P(\omega_j)}$$
$$g_i^{(2)}(x) = P(x \mid \omega_i)\, P(\omega_i)$$
$$g_i^{(3)}(x) = \ln P(x \mid \omega_i) + \ln P(\omega_i)$$
All the discriminant functions $g^{(1)}$, $g^{(2)}$, and $g^{(3)}$ are equivalent: each is a monotonically increasing transformation of the others, so all three induce the same decision regions.
Discriminant Functions for Normal densities
▶ Univariate normal distribution:
$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$$
▶ Multivariate normal distribution:
$$P(x) = \frac{1}{(2\pi)^{d/2}\, |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right]$$
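As a quick check, a sketch of evaluating the multivariate density numerically; $\mu$ and $\Sigma$ here are hypothetical:

```python
# Evaluate the multivariate normal density at a point using scipy.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
p = multivariate_normal.pdf(np.array([0.5, -1.0]), mean=mu, cov=Sigma)
```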
Discriminant Functions for Normal densities
Discriminant Functions for Normal densities: $\Sigma_i = \sigma^2 I$
$$g_i(x) = -\frac{\|x - \mu_i\|^2}{2\sigma^2} + \ln P(\omega_i),$$
where $\|x - \mu_i\|^2 = (x - \mu_i)^T (x - \mu_i)$. Expanding,
$$g_i(x) = -\frac{1}{2\sigma^2}\left[x^T x - 2\mu_i^T x + \mu_i^T \mu_i\right] + \ln P(\omega_i).$$
Ignoring $x^T x$ (since it is the same for all the discriminant functions), we get
$$g_i(x) = w_i^T x + w_{i0}, \quad \text{which is linear in } x,$$
where
$$w_i = \frac{1}{\sigma^2}\, \mu_i, \qquad w_{i0} = -\frac{1}{2\sigma^2}\, \mu_i^T \mu_i + \ln P(\omega_i).$$
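A minimal sketch of this linear discriminant; parameter values are assumed to be given:

```python
# Linear discriminant for the Sigma_i = sigma^2 * I case.
import numpy as np

def linear_discriminant(x, mu_i, sigma2, prior_i):
    w_i = mu_i / sigma2                                     # w_i = mu_i / sigma^2
    w_i0 = -(mu_i @ mu_i) / (2 * sigma2) + np.log(prior_i)  # bias term
    return w_i @ x + w_i0
```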
Discriminant Functions for Normal densities: $\Sigma_i = \Sigma$
$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \ln P(\omega_i) = w_i^T x + w_{i0},$$
after dropping the $x^T \Sigma^{-1} x$ term (the same for all classes), with $w_i = \Sigma^{-1} \mu_i$ and $w_{i0} = -\frac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \ln P(\omega_i)$. The discriminant is again linear.
Discriminant Functions for Normal densities: $\Sigma_i$ arbitrary
In this case each discriminant is quadratic in $x$:
$$g_i(x) = x^T W_i x + w_i^T x + w_{i0},$$
where
$$W_i = -\frac{1}{2}\Sigma_i^{-1}, \qquad w_i = \Sigma_i^{-1} \mu_i, \qquad w_{i0} = -\frac{1}{2}\mu_i^T \Sigma_i^{-1} \mu_i - \frac{1}{2}\ln |\Sigma_i| + \ln P(\omega_i).$$
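A minimal sketch of the quadratic discriminant; all inputs are illustrative:

```python
# Quadratic discriminant for arbitrary per-class covariance Sigma_i.
import numpy as np

def quadratic_discriminant(x, mu_i, Sigma_i, prior_i):
    Sigma_inv = np.linalg.inv(Sigma_i)
    W_i = -0.5 * Sigma_inv
    w_i = Sigma_inv @ mu_i
    w_i0 = (-0.5 * mu_i @ Sigma_inv @ mu_i
            - 0.5 * np.log(np.linalg.det(Sigma_i))
            + np.log(prior_i))
    return x @ W_i @ x + w_i @ x + w_i0
```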
Summary
Acknowledgements
These notes have been prepared over a period of time, and I have referred to many sources. Wherever possible, I have acknowledged the sources; if I have missed any, my apologies, and acknowledgment can be implicitly assumed. Several of my students also contributed by drawing figures, etc., and I am very thankful to them.