
Fundamentals of Machine Learning

Unit 2: Probability Theory and Bayes Rule

Faculty Name : Dr. Dipti Jadhav


Lecture 1

Parameter Smoothing
Introduction

Smoothing is a very powerful technique used all across data analysis.

Other names given to this technique are curve fitting and low-pass filtering.

It is designed to detect trends in the presence of noisy data, in cases in which the shape of the trend is unknown.



• The name smoothing comes from the fact that, to accomplish this feat, we assume that the trend is smooth, as in a smooth surface.

• In contrast, the noise, or deviation from the trend, is unpredictably wobbly.

IMPORTANT

Part of what we try to understand here are the assumptions that permit us to
extract the trend from the noise.

The concepts behind smoothing techniques are extremely useful in machine learning because conditional expectations/probabilities can be thought of as trends of unknown shapes that we need to estimate in the presence of uncertainty.

We will focus first on a problem with just one predictor.

Specifically, we try to estimate the time trend in the 2008 US popular vote poll margin (the difference between Obama and McCain).

For the purposes of this example, do not think of it as a forecasting problem.

Instead, we are simply interested in learning the shape of the trend after the election
is over.

We assume that for any given day x, there is a true preference among the electorate f(x), but due to the uncertainty introduced by polling, each data point comes with an error ε. A mathematical model for the observed poll margin Yi is:

Yi = f(xi) + εi

To think of this as a machine learning problem, consider that we want to predict Y given a day x.

If we knew the conditional expectation f(x)=E(Y∣X=x), we would use it.

But since we don’t know this conditional expectation, we have to estimate it.

Let’s use regression for that!!
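A minimal sketch of this first attempt in Python (the data frame polls_2008 with columns day and margin is assumed here as a stand-in for the 2008 poll-margin data described above):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Assumed data: one row per poll, "day" = day relative to the election
    # (negative before election day), "margin" = observed Obama - McCain margin.
    polls_2008 = pd.read_csv("polls_2008.csv")   # hypothetical file name

    X = polls_2008[["day"]]      # single predictor: the day x
    y = polls_2008["margin"]     # outcome: the poll margin Y

    # First attempt: estimate f(x) = E(Y | X = x) with a straight line
    fit = LinearRegression().fit(X, y)
    f_hat = fit.predict(X)       # fitted values along the regression line
    residuals = y - f_hat        # positive = above the line, negative = below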

The line we see does not appear to describe the trend very well.

For example, on September 4 (day -62), the Republican Convention was held and the
data suggest that it gave John McCain a boost in the polls.

However, the regression line does not capture this potential trend. To see the lack of
fit more clearly, we note that points above the fitted line (blue) and those below (red)
are not evenly distributed across days.

We therefore need an alternative, more flexible approach.


Two methods for Parameter Smoothing

Bin Smoothing

Kernels
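A minimal sketch of both ideas, assuming the same day/margin arrays as in the regression sketch above: bin smoothing averages the points that fall inside a window centred at each day, while a kernel smoother replaces the flat window with weights that decay smoothly with distance (the window size and bandwidth below are illustrative choices).

    import numpy as np

    def bin_smooth(x, y, x0, half_width=3.5):
        # Bin smoothing: plain average of y for points with |x - x0| <= half_width
        in_window = np.abs(x - x0) <= half_width
        return y[in_window].mean()

    def kernel_smooth(x, y, x0, bandwidth=3.5):
        # Kernel smoothing: weighted average with Gaussian weights centred at x0
        w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
        return np.sum(w * y) / np.sum(w)

    x = polls_2008["day"].to_numpy()
    y = polls_2008["margin"].to_numpy()
    grid = np.arange(x.min(), x.max() + 1)          # days on which to estimate f(x)
    f_hat_bin    = np.array([bin_smooth(x, y, d) for d in grid])
    f_hat_kernel = np.array([kernel_smooth(x, y, d) for d in grid])

The kernel version gives nearby days more influence than days at the edge of the window, which produces a smoother estimate of the trend.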

Bayesian Belief Networks

Why Bayesian Belief Networks:

To represent probabilistic relationships between random variables.

To model dependencies between attribute values through a joint conditional probability distribution, relaxing the independence assumption.

In the Naïve Bayes classifier, by contrast, attributes are assumed to be conditionally independent.

Bayesian Belief Networks Definition

It is also known as a Bayesian Network (BN), Belief Network, or Probabilistic Network.

A BN is defined by two parts:

DAG – Directed Acyclic Graph

CPT – Conditional Probability Table

Nodes → Variables

Links → Dependency

BN Formal Definition

• A BN is a graph with the following properties:

• Nodes – a set of random variables.

• Directed links – a link from node X to node Y means that X has a direct influence on Y.

• Each node has a CPT that quantifies the effect that its parents have on it.

• The graph has no directed cycles.

If an arc is drawn from X to Y, then X is an immediate predecessor (parent) of Y, and Y is a descendant of X.

Each variable is conditionally independent of its non-descendants in the graph, given its parents.
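As an illustration of how the DAG and the CPTs work together, here is a minimal hand-coded sketch in Python. The three-node network Rain → Sprinkler, Rain → WetGrass, Sprinkler → WetGrass and all probability values are assumptions chosen purely for illustration; the conditional-independence property above is what lets the joint probability factor into one CPT entry per node.

    # Hypothetical 3-node Bayesian network:
    #   Rain -> Sprinkler, Rain -> WetGrass, Sprinkler -> WetGrass
    # Each node stores P(node | parents) as a conditional probability table (CPT).
    P_rain = {True: 0.2, False: 0.8}                       # P(Rain)

    P_sprinkler = {                                        # P(Sprinkler | Rain)
        True:  {True: 0.01, False: 0.99},
        False: {True: 0.40, False: 0.60},
    }

    P_wet = {                                              # P(WetGrass | Sprinkler, Rain)
        (True,  True):  {True: 0.99, False: 0.01},
        (True,  False): {True: 0.90, False: 0.10},
        (False, True):  {True: 0.80, False: 0.20},
        (False, False): {True: 0.00, False: 1.00},
    }

    def joint(rain, sprinkler, wet):
        # Chain rule over the DAG: one CPT lookup per node, given its parents
        return P_rain[rain] * P_sprinkler[rain][sprinkler] * P_wet[(sprinkler, rain)][wet]

    print(joint(rain=True, sprinkler=False, wet=True))     # 0.2 * 0.99 * 0.80 = 0.1584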

Dimensionality Reduction

Reason:
Sometimes the data can contain a huge number of features, some of which are not even required.

Such redundant information makes modeling complicated.

Furthermore, interpreting and understanding the data by visualization gets difficult because of the high dimensionality.

This is where dimensionality reduction comes into play.

What is dimensionality reduction?

Dimensionality reduction is the task of reducing the number of features in a dataset.

If there are many features in the dataset:
- it is difficult to model them
- they add redundancy
- it makes little sense to keep all of them in the training data.

Thus the feature space / data space needs to be reduced.

Definition:
The process of dimensionality reduction essentially transforms data from high-
dimensional feature space to a low-dimensional feature space.

CAUTION:
It is also important that meaningful properties present in the data are not lost during
the transformation.

This is where the CURSE OF DIMENSIONALITY comes into the picture!

Curse of Dimensionality

In order to estimate an arbitrary function with a certain accuracy, the amount of data required grows exponentially with the number of features, i.e. the dimensionality.

This is especially true for big data, which tends to be more sparse.

Now, what is sparse data?

Sparsity in data usually refers to features having a value of zero; this does not mean that the value is missing.

If the data has a lot of sparse features, then the space and computational complexity increase, and a model trained on sparse data tends to perform poorly on the test dataset.

In other words, during training the model learns noise and is not able to generalize well. Hence it overfits.

Issues that arise with high-dimensional data are:

- Running a risk of overfitting the machine learning model.
- Difficulty in clustering similar features.
- Increased space and computational time complexity.

Non-sparse data, or dense data, on the other hand is data that has non-zero features. Apart from containing non-zero features, it also contains information that is both meaningful and non-redundant.

Solution:
To tackle the curse of dimensionality, methods like dimensionality reduction are used.
Dimensionality reduction techniques are very useful for transforming sparse features into dense features.
Furthermore, dimensionality reduction is also used for data cleaning and feature extraction.

Techniques for Dimensionality Reduction

1. Decomposition algorithms
Principal Component Analysis
Kernel Principal Component Analysis
Non-Negative Matrix Factorization
Singular Value Decomposition
2. Manifold learning algorithms
t-Distributed Stochastic Neighbor Embedding
Spectral Embedding
Locally Linear Embedding
3. Discriminant Analysis
Linear Discriminant Analysis
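As a sketch of the decomposition approach, the snippet below applies Principal Component Analysis with scikit-learn to a hypothetical high-dimensional feature matrix, projecting it onto two components (the random data and the choice of two components are illustrative assumptions):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Hypothetical dataset: 200 samples described by 50 features
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))

    X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X_scaled)        # low-dimensional representation

    print(X_reduced.shape)                         # (200, 2)
    print(pca.explained_variance_ratio_)           # variance captured by each component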

Components of Dimensionality Reduction

There are two components of dimensionality reduction:


1. Feature selection: Here we try to find a subset of the original set of variables, or features, to get a smaller subset which can be used to model the problem. It usually involves three approaches (a minimal filter-method sketch follows after this list):
Filter
Wrapper
Embedded
2. Feature extraction: This transforms the data in a high-dimensional space to a lower-dimensional space, i.e. a space with fewer dimensions.
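A minimal sketch of the filter approach to feature selection, scoring each feature independently and keeping the top k original columns (the synthetic data and the choice k = 10 are assumptions for illustration):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif

    # Synthetic data: 300 samples, 25 features, only 5 of which are informative
    X, y = make_classification(n_samples=300, n_features=25, n_informative=5,
                               random_state=0)

    # Filter method: score each feature with a univariate ANOVA F-test, keep the top 10
    selector = SelectKBest(score_func=f_classif, k=10)
    X_selected = selector.fit_transform(X, y)

    print(X_selected.shape)                       # (300, 10): a subset of the original features
    print(selector.get_support(indices=True))     # indices of the retained columns

Wrapper methods instead evaluate candidate feature subsets by training the model itself, while embedded methods (e.g. L1-regularized models) perform the selection during training.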

Collaborative Filtering Based Recommendation System

In Collaborative Filtering, we tend to find similar users and recommend what similar users like.

In this type of recommendation system, we don't use the features of the item to recommend it.

Rather, we classify the users into clusters of similar types and recommend items to each user according to the preferences of its cluster.

Example: Movie Recommendation System

Method 1: Measuring Similarity

In this scenario, we can see that User 1 and User 2 give nearly similar ratings to the movies, so we can conclude that Movie 3 is also going to be moderately liked by User 1, while Movie 4 will be a good recommendation for User 2. We can likewise see that there are users with different tastes: User 1 and User 3 are opposite to each other.

One can also see that User 3 and User 4 have a common interest in the movies; on that basis we can say that Movie 4 is also going to be disliked by User 4. This is Collaborative Filtering: we recommend to users the items that are liked by users with a similar interest domain.

Method 2: Cosine Distance

We can also use the cosine distance between users to find users with similar interests: a larger cosine implies a smaller angle between two users, and hence more similar interests.

We can apply the cosine measure between two users in the utility matrix, and we can assign the value zero to all the unfilled entries to make the calculation easy. If the cosine is smaller, there is a larger angle (greater distance) between the users; if the cosine is larger, the angle between the users is small and we can recommend them similar things.
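A minimal sketch of the cosine calculation on a small utility matrix (the ratings below are made up, and unrated entries are set to zero as described above):

    import numpy as np

    # Hypothetical utility matrix: rows = users, columns = movies, 0 = not rated
    ratings = np.array([
        [5, 4, 0, 1],   # User 1
        [5, 5, 4, 0],   # User 2
        [1, 0, 2, 5],   # User 3
        [0, 1, 0, 4],   # User 4
    ], dtype=float)

    def cosine_similarity(u, v):
        # Cosine of the angle between two rating vectors (closer to 1 = more similar)
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

    print(cosine_similarity(ratings[0], ratings[1]))   # ~0.85: similar tastes
    print(cosine_similarity(ratings[0], ratings[2]))   # ~0.28: very different tastes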

Method 3: Rounding the Data

In collaborative filtering we can also round off the data to compare it more easily: for example, we can assign ratings below 3 as 0 and ratings above it as 1. This helps us compare the data more easily, as in the short sketch below.

We can then see that User 1 and User 2 are more similar, and User 3 and User 4 are more alike.
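Continuing the same hypothetical matrix from the cosine sketch, the rounding step is a simple threshold (treating a rating of exactly 3 as "liked" is an illustrative choice):

    # Ratings below 3 become 0; 3 and above become 1 (threshold chosen for illustration)
    binary = (ratings >= 3).astype(int)
    print(binary)   # users can now be compared on these 0/1 preference vectors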





Thank You
