n : number of samples
d: number of features
• Introduction
• Regression Analysis
• Regression Diagnostics
• Logistic and Poisson Regression
• Naive Bayes and Bayesian Networks
• Decision Tree Classifiers
• Data Preparation and Causal Inference
• Dimensionality Reduction
• Association Rules and Recommenders
• Convex Optimization
• Neural Networks
[Figure: taxonomy of supervised learning methods, with branches for Regression and Classification covering Linear Regression, Logistic Regression, Poisson Regression, Lasso, PCR, Naive Bayes, Ensemble Methods, and Neural Networks]
2
Recommended Literature
• Data Mining: Practical Machine Learning Tools and
Techniques
Ian H. Witten, Eibe Frank, Mark A. Hall, Christopher Pal
http://www.cs.waikato.ac.nz/ml/weka/book.html
Section: 4.1, 4.2, 9.1, 9.2
Alternative literature:
• Machine Learning
Tom M. Mitchell, 1997
3
Formal Definition of Classification
Classification:
Given a database 𝐷 = {𝑥1, 𝑥2, … , 𝑥𝑛} of tuples (items, records) and a set of classes
𝐶 = {𝐶1, 𝐶2, … , 𝐶𝑚}, the classification problem is to define a mapping 𝑓: 𝐷 → 𝐶 where
each 𝑥𝑖 is assigned to one class. A class 𝐶𝑗 contains precisely those tuples
mapped to it; that is,
𝐶𝑗 = {𝑥𝑖 | 𝑓(𝑥𝑖) = 𝐶𝑗, 1 ≤ 𝑖 ≤ 𝑛, and 𝑥𝑖 ∈ 𝐷}.
4
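As an illustration of the definition, here is a minimal Python sketch (not from the slides): a hypothetical hand-made rule plays the role of the mapping f, and each class C_j collects exactly the tuples mapped to it.

```python
# A toy database D of tuples and a set of classes C (all names and thresholds are hypothetical).
D = [{"income": 15_000}, {"income": 48_000}, {"income": 120_000}]
C = ["high risk", "medium risk", "low risk"]

def f(x):
    """A hand-made stand-in for a learned classifier: maps a tuple to exactly one class."""
    if x["income"] < 20_000:
        return "high risk"
    return "medium risk" if x["income"] < 60_000 else "low risk"

# C_j = { x_i in D : f(x_i) = C_j }
classes = {c: [x for x in D if f(x) == c] for c in C}
print(classes)
```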
Example Applications
• determine whether a bank customer applying for a loan is a low, medium, or high credit risk
• churn prediction (typically a classification task)
• determine if a sequence of credit card purchases indicates questionable behavior
• identify customers that may be willing to purchase particular insurance policies
• identify patterns of gene expression that indicate the patient has cancer
• identify spam mail
• …
5
Algorithms for Classification
• Logistic Regression
• Statistical Modeling (e.g., Naïve Bayes)
• Decision Trees: Divide and Conquer
• Classification Rules (e.g. PRISM)
• Instance-Based Learning (e.g. kNN)
• Support Vector Machines
• …
6
Naive Bayes is a generative model; logistic regression is a discriminative model.
Naive Bayes models the joint distribution of the features X and the target Y, and then derives the posterior probability P(y|x).
Logistic regression models the posterior probability P(y|x) directly, learning the input-to-output mapping by minimising the error.
Naïve Bayes Classifier
In the discriminative model, we assume some functional form for p(C_k | x) and
estimate its parameters directly from the training data.
Logistic regression does not assume that the features are independent or
that they follow a particular distribution.
In contrast, Naive Bayes is a probabilistic model that assumes the features are conditionally independent given the class.
There are several types of Naive Bayes algorithms, including Gaussian Naive Bayes, Bernoulli Naive Bayes,
and Multinomial Naive Bayes, which make different assumptions about the distribution of the features.
For example, Gaussian Naive Bayes assumes that the features are normally distributed,
while Bernoulli Naive Bayes assumes that the features are binary.
7
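A minimal sketch (not from the slides), assuming scikit-learn is available, contrasting the two model families on synthetic data: the generative model estimates Pr[x|y] and Pr[y] and applies Bayes' rule, while the discriminative model fits Pr[y|x] directly.

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

generative = GaussianNB().fit(X, y)               # models Pr[x | y] and Pr[y], then applies Bayes' rule
discriminative = LogisticRegression().fit(X, y)   # models Pr[y | x] directly

print(generative.predict_proba(X[:3]))            # both expose the posterior Pr[y | x]
print(discriminative.predict_proba(X[:3]))
```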
When can things go wrong with Naive Bayes? When independent variables that were generated "randomly" are added:
they will obviously not be correlated with the dependent variable, and then things go wrong. So whenever you want to apply Naive Bayes, make
sure that the individual attributes (independent variables) are correlated with the dependent variable. If you use linear regression, for example,
this is not a problem, since the coefficient of such a variable can end up close to zero. However, in Naive Bayes all attributes are treated the same, so
adding an attribute with very low correlation to the dependent variable (y) can be a problem.
So, feature selection is really important in Naive Bayes.
Bayes Theorem: Some Notation
Let Pr[𝑒] represent the prior or unconditional probability that proposition 𝑒 is true.
• Example: Let 𝑒 represent that a customer is a high credit risk.
Pr[𝑒] = 0.1 means that there is a 10% chance a customer is a high credit risk.
The notation Pr[𝐸] is used to represent the probability distribution of all possible
values of a random variable 𝐸.
• e.g.: Pr[𝑅𝑖𝑠𝑘] = < 0.7,0.2,0.1 >
8
Conditional Probability
Imagine that 5% of people of a given population own at least one TV.
2% of people own at least one TV and at least one computer.
What is the probability that someone will own a computer, given that they also have
a TV?
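A worked answer (not spelled out on the slide), using the product rule from the next slide: Pr[computer | TV] = Pr[computer ∩ TV] / Pr[TV] = 0.02 / 0.05 = 0.4, i.e. 40%.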
9
Conditional Probability and Bayes Rule
The product rule for conditional probabilities
• Pr[𝑒|ℎ] = Pr[𝑒 ∩ ℎ]/Pr[ℎ]
• Pr[𝑒 ∩ ℎ] = Pr[𝑒|ℎ]Pr[ℎ] = Pr[ℎ|𝑒]Pr[𝑒] (product rule)
• Pr[𝑒 ∩ ℎ] = Pr[𝑒]Pr[ℎ] (for independent random variables)
Bayes' rule:
Pr[ℎ|𝑒] = Pr[𝑒|ℎ]Pr[ℎ] / Pr[𝑒]
10
Bayes' Theorem: An Example
Pr[A] and Pr[B] are known (out of 20 equally likely cases: 5 are A, 15 are ¬A, 8 are B, 2 are both A and B, 6 are both ¬A and B):

Pr[A|B] = Pr[A ∩ B] / Pr[B] = 2/8 = 1/4

Pr[B|A] = 2/5,  Pr[B|¬A] = 6/15

Pr[A|B] = Pr[B|A] Pr[A] / Pr[B] = (2/5 ⋅ 5/20) / (8/20) = 1/4

Pr[¬A|B] = Pr[B|¬A] Pr[¬A] / Pr[B] = (6/15 ⋅ 15/20) / (8/20) = (6/20) / (8/20) = 3/4

If Pr[B] is unknown, use the law of total probability:

Pr[A|B] = Pr[B|A] Pr[A] / (Pr[B|A] Pr[A] + Pr[B|¬A] Pr[¬A]) = (2/5 ⋅ 5/20) / (2/5 ⋅ 5/20 + 6/15 ⋅ 15/20) = 1/4
Source: https://en.wikipedia.org/wiki/Bayes_theorem
11
Does a patient have Corona or not after a positive test?
(Solution discussed over the phone...)
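A sketch of the calculation with assumed example numbers (not from the slides): prevalence Pr[Corona] = 1%, sensitivity Pr[+|Corona] = 95%, false-positive rate Pr[+|¬Corona] = 5%. Then
Pr[Corona | +] = (0.95 ⋅ 0.01) / (0.95 ⋅ 0.01 + 0.05 ⋅ 0.99) = 0.0095 / 0.059 ≈ 0.16,
so even after a positive test the patient has Corona with only about 16% probability.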
12
Bayes Theorem

Pr[ℎ|𝑒] = Pr[𝑒|ℎ]Pr[ℎ] / Pr[𝑒]

• ℎ: hypothesis, e.g. play
• Pr[ℎ|𝑒]: posterior probability (the conditional probability of ℎ given 𝑒)
• Pr[ℎ]: prior
14
Dataset
Outlook Temp Humidity Windy Play
Sunny Hot High False No
Sunny Hot High True No
Overcast Hot High False Yes
Rainy Mild High False Yes
Rainy Cool Normal False Yes
Rainy Cool Normal True No
Overcast Cool Normal True Yes
Sunny Mild High False No
Sunny Cool Normal False Yes
Rainy Mild Normal False Yes
Sunny Mild Normal True Yes
Overcast Mild High True Yes
Overcast Hot Normal False Yes
Rainy Mild High True No
15
Frequency Tables
Sorting the dataset by the Play column gives 5 instances of No and 9 instances of Yes:
Pr[No] = 5 / 14 = 0.36
Pr[Yes] = 9 / 14 = 0.64
16
Frequency Tables (counts per attribute value, split by Play; derived from the dataset on the previous slide)

Outlook  | No | Yes
Sunny    | 3  | 2
Overcast | 0  | 4
Rainy    | 2  | 3

Temp | No | Yes
Hot  | 2  | 2
Mild | 2  | 4
Cool | 1  | 3

Humidity | No | Yes
High     | 4  | 3
Normal   | 1  | 6

Windy | No | Yes
False | 2  | 6
True  | 3  | 3
17
Naive Bayes – Probabilities

Likelihood Tables (the frequency tables converted to relative frequencies per class; columns No, Yes)

Outlook  | No  | Yes
Sunny    | 3/5 | 2/9
Overcast | 0/5 | 4/9
Rainy    | 2/5 | 3/9

Temp | No  | Yes
Hot  | 2/5 | 2/9
Mild | 2/5 | 4/9
Cool | 1/5 | 3/9

Humidity | No  | Yes
High     | 4/5 | 3/9
Normal   | 1/5 | 6/9

Windy | No  | Yes
False | 2/5 | 6/9
True  | 3/5 | 3/9
18
This multiplication is only valid if the attributes are conditionally independent given the class:
P(sunny, cool, high, true | yes) = P(sunny|yes) ⋅ P(cool|yes) ⋅ P(high|yes) ⋅ P(true|yes)
Note: The probability Pr[sunny, cool, high, true] is unknown. We use the law of total probability.
19
Predicting a New Day - Formulas
Again, given a new instance with
• outlook=sunny
• temperature=cool
• humidity=high
• windy=true
If all explanatory attributes are conditionally independent (given the class) and equally important, their probabilities can be multiplied:
Pr[Play = yes] ⋅ ∏ᵢ Pr[𝑒ᵢ | Play = yes] = 9/14 ⋅ 2/9 ⋅ 3/9 ⋅ 3/9 ⋅ 3/9 = 0.0053
20
Formulas Continued …
Similarly
Pr[Play = no] ⋅ ∏ᵢ Pr[𝑒ᵢ | Play = no] = 5/14 ⋅ 3/5 ⋅ 1/5 ⋅ 4/5 ⋅ 3/5 = 0.0206
Thus, since 0.0206 > 0.0053, the prediction for the new day is Play = no.
21
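A minimal Python sketch (not from the slides) that reproduces the hand calculation above from the raw weather dataset; the normalized output matches the probabilities on the next slide (about 0.205 for yes and 0.795 for no).

```python
from collections import Counter, defaultdict

# (outlook, temp, humidity, windy, play): the 14 instances of the weather dataset
data = [
    ("sunny", "hot", "high", False, "no"),     ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"), ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"), ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"), ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),  ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]

class_counts = Counter(row[-1] for row in data)   # Play: yes 9, no 5
value_counts = defaultdict(Counter)               # (attribute index, class) -> counts of values
for *features, label in data:
    for i, value in enumerate(features):
        value_counts[(i, label)][value] += 1

def score(instance, label):
    """Unnormalized posterior: Pr[label] * prod_i Pr[e_i | label]."""
    p = class_counts[label] / len(data)
    for i, value in enumerate(instance):
        p *= value_counts[(i, label)][value] / class_counts[label]
    return p

new_day = ("sunny", "cool", "high", True)
scores = {label: score(new_day, label) for label in class_counts}   # ~0.0053 (yes), ~0.0206 (no)
total = sum(scores.values())
print({label: round(s / total, 3) for label, s in scores.items()})  # normalized posteriors
```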
Normalization
Note that we can normalize to get the probabilities:
Pr[ℎ | 𝑒1, 𝑒2, …, 𝑒𝑛] = Pr[𝑒1, 𝑒2, …, 𝑒𝑛 | ℎ] ⋅ Pr[ℎ] / Pr[𝑒1, 𝑒2, …, 𝑒𝑛]

Pr[Play = yes | 𝑒] = 0.0053 / (0.0053 + 0.0206) = 0.205
Pr[Play = no | 𝑒] = 0.0206 / (0.0053 + 0.0206) = 0.795
22
Naive Bayes - Summary
Want to classify a new instance (𝑒1, 𝑒2, …, 𝑒𝑛) into one of a finite number of categories from the set 𝐻.
• Choose the most likely classification using Bayes' theorem
• MAP (maximum a posteriori) classification:
Assign the most probable category ℎ_MAP given (𝑒1, 𝑒2, …, 𝑒𝑛), i.e. ℎ_MAP = argmax_ℎ Pr[ℎ] ⋅ ∏ᵢ Pr[𝑒ᵢ | ℎ].
"Naive Bayes", since the attributes are treated as conditionally independent given the class; only then can the probabilities be multiplied:
Pr[𝑒1, 𝑒2, …, 𝑒𝑛 | ℎ] = Pr[𝑒1 | ℎ] ⋅ Pr[𝑒2 | ℎ] ⋯ Pr[𝑒𝑛 | ℎ]
23
The Weather Data (yet again)
Outlook Temperature Humidity Windy Play
Yes No Yes No Yes No Yes No Yes No
Sunny 2 3 Hot 2 2 High 3 4 FALSE 6 2 9 5
Overcast 4 0 Mild 4 2 Normal 6 1 TRUE 3 3
Rainy 3 2 Cool 3 1
The zero-frequency problem: Outlook = Overcast never occurs together with Play = no, so Pr[Overcast | no] = 0 would wipe out the whole product.
Remedy (Laplace smoothing): add 1 to the count of every attribute value-class combination (and the number of attribute values to the denominator); then no probability can ever be zero.
For a new day with outlook = overcast, temperature = cool, humidity = high, windy = true:
Pr[no | 𝑒] ∝ 5/14 ⋅ 1/8 ⋅ 2/8 ⋅ 5/7 ⋅ 4/7 = 0.00456 ⟹ 27.84%
Pr[yes | 𝑒] ∝ 9/14 ⋅ 5/12 ⋅ 4/12 ⋅ 4/11 ⋅ 4/11 = 0.01181 ⟹ 72.16%
25
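A minimal sketch (not from the slides) of the add-one remedy for a single conditional probability; the example calls reproduce the factors 1/8 and 5/12 used above.

```python
def smoothed_prob(count_value_class, count_class, n_attribute_values):
    """Pr[value | class] with one pseudo-count per attribute value (Laplace smoothing)."""
    return (count_value_class + 1) / (count_class + n_attribute_values)

# Outlook has 3 values; Overcast never occurs among the 5 Play=no instances:
print(smoothed_prob(0, 5, 3))   # 1/8 instead of 0
print(smoothed_prob(4, 9, 3))   # Pr[overcast | yes] = 5/12
```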
Modified Probability Estimates
In some cases adding a constant different from 1 might be more appropriate.
For example, for Outlook given Play = no (counts 3, 0, 2 out of 5), a constant μ split evenly over the three values gives:
Pr[Sunny | no] = (3 + μ/3) / (5 + μ),  Pr[Overcast | no] = (0 + μ/3) / (5 + μ),  Pr[Rainy | no] = (2 + μ/3) / (5 + μ)
27
At training time, do we also ignore the missing value? E.g. say there are 5 no-play instances, and in the Outlook attribute
2 of them are sunny, 1 rainy, 1 overcast, and one is missing. When calculating the conditional probabilities,
do we compute, for example, p(sunny|no) as 2/4 or 2/5?
??? I think it should be 2/5.
Missing Values
• Training: instance is not included in frequency count for attribute value-class
combination.
• Classification: attribute will be omitted from calculation.
• Example: Outlook Temp. Humidity Windy Play
? Cool High True ?
28
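A minimal sketch (not from the slides) of the classification rule for missing values: the factor for the missing attribute is simply omitted from the product. The conditional probabilities are the ones from the likelihood tables above.

```python
priors = {"yes": 9/14, "no": 5/14}
cond = {  # (attribute, value, class) -> Pr[value | class], only the entries needed here
    ("temp", "cool", "yes"): 3/9,   ("temp", "cool", "no"): 1/5,
    ("humidity", "high", "yes"): 3/9, ("humidity", "high", "no"): 4/5,
    ("windy", True, "yes"): 3/9,    ("windy", True, "no"): 3/5,
}

def score(instance, label):
    """Unnormalized posterior; attributes with a missing value '?' are skipped."""
    p = priors[label]
    for attribute, value in instance.items():
        if value == "?":          # Outlook is missing: omit this factor
            continue
        p *= cond[(attribute, value, label)]
    return p

new_day = {"outlook": "?", "temp": "cool", "humidity": "high", "windy": True}
print({c: score(new_day, c) for c in priors})
```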
Dealing with Numeric Attributes
Usual assumption: attributes have a normal or Gaussian probability distribution (given the class).
Note: the Gaussian assumption is often justified informally via the central limit theorem: when an attribute is the aggregate of many small independent effects, its distribution tends towards a normal (Gaussian) distribution. This does not hold for every attribute.
The probability density function for the normal distribution is defined by two parameters:
• The sample mean: μ = (1/n) Σᵢ₌₁ⁿ xᵢ
• The standard deviation: σ = sqrt( (1/(n−1)) Σᵢ₌₁ⁿ (xᵢ − μ)² )
The density function 𝑓(𝑥):
f(x) = 1 / (√(2π) σ) ⋅ e^( −(x − μ)² / (2σ²) )
29
Statistics for the Weather Data
Per class (Yes | No):
• Outlook: Sunny 2/9 | 3/5, Overcast 4/9 | 0/5, Rainy 3/9 | 2/5
• Temperature: mean 73 | 74.6, std dev 6.2 | 7.9 (raw values 83, 70, 68, … | 85, 80, 65, …)
• Humidity: mean 79.1 | 86.2, std dev 10.2 | 9.7 (raw values 86, 96, 80, … | 85, 90, 70, …)
• Windy: False 6/9 | 2/5, True 3/9 | 3/5
• Play: Pr[yes] = 9/14, Pr[no] = 5/14

Likelihood of "yes" = 2/9 ⋅ 0.0340 ⋅ 0.0221 ⋅ 3/9 ⋅ 9/14 = 0.000036
Likelihood of "no" = 3/5 ⋅ 0.0291 ⋅ 0.0380 ⋅ 3/5 ⋅ 5/14 = 0.000136
You should consider different types of distributions that reflect the data you have.
31
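A minimal sketch (not from the slides) of the normal density taking the place of a frequency-table probability; plugging in the slide's means and standard deviations for a new day with temperature = 66 and humidity = 90 (values assumed here) reproduces the factors 0.0340 and 0.0221 used in the likelihood above.

```python
from math import exp, pi, sqrt

def normal_density(x, mean, std):
    """Gaussian probability density f(x) with the given mean and standard deviation."""
    return 1.0 / (sqrt(2 * pi) * std) * exp(-(x - mean) ** 2 / (2 * std ** 2))

print(normal_density(66, 73, 6.2))     # temperature = 66 under "yes" (mean 73, std dev 6.2)
print(normal_density(90, 79.1, 10.2))  # humidity = 90 under "yes" (mean 79.1, std dev 10.2)
```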
Numeric Data: Unknown Distribution
What if the data distribution does not follow a known distribution?
32
Kernel Density Estimation
We want to derive a function 𝑓(𝑥) such that
(1) 𝑓(𝑥) is a probability density function, i.e.
∫ 𝑓 𝑥 𝑑𝑥 = 1
Kernel density estimation puts a little Gaussian on top of each and every observation, and then adds all the kernel heights
to obtain the final density function, which is specific to our data.
33
- A small bandwidth leads to undersmoothing:
the density plot looks like a collection of individual peaks (one peak per sample element).
- A huge bandwidth leads to oversmoothing:
the density plot looks like a unimodal distribution and hides all non-unimodal properties
(e.g., if the distribution is multimodal, we will not see it in the plot).
The estimate places one kernel on top of every observation xᵢ and averages them:
f(x) = (1/n) Σᵢ₌₁ⁿ K(x − xᵢ, h), where
K(t, h) = 1 / (√(2π) h) ⋅ e^( −t² / (2h²) )
Kernel (K) selection is another broad topic; here we use the Gaussian density function as our kernel.
Adjust "h" (aka bandwidth) to fit the data as a parameter; h = 1 is generally a good choice.
Here h essentially plays the role of the variance: the smaller h is, the smaller the variance of the little Gaussians created for each sample,
i.e. they become narrower and taller. This means that when a test sample arrives, the density is very small if it is far from those means and very large
if it is close to them. As a result, the kernel density function takes on a more zigzagged, spiky, multi-peaked shape.
34
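A minimal sketch (not from the slides) of a Gaussian kernel density estimate on made-up sample values; varying h shows the under-/oversmoothing behaviour described above.

```python
import numpy as np

def kde(x, samples, h=1.0):
    """Density estimate f(x) = (1/n) * sum_i K(x - x_i, h) with a Gaussian kernel."""
    t = x - samples
    return np.mean(np.exp(-0.5 * (t / h) ** 2) / (np.sqrt(2 * np.pi) * h))

samples = np.array([64, 65, 68, 69, 70, 71, 72, 72, 75, 80, 83, 85], dtype=float)  # hypothetical data
print(kde(66.0, samples, h=1.0))   # small h: spiky, multi-peaked estimate
print(kde(66.0, samples, h=10.0))  # large h: oversmoothed, close to unimodal
```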
Discussion of Naïve Bayes
Naïve Bayes works surprisingly well (even if independence assumption is clearly
violated).
• Domingos, Pazzani: On the Optimality of the Simple Bayesian Classifier under
Zero-One-Loss, Machine Learning (1997) 29.
feature selection is really important in naive bayes
However: adding too many redundant attributes will cause problems
(e.g., identical attributes).
• Note also: many numeric attributes are not normally distributed.
Time complexity
• Calculating conditional probabilities: Time 𝑂(𝑛) where 𝑛 is the number of
instances.
• Calculating the class: Time 𝑂(𝑐𝑝) where 𝑐 is the number of classes, 𝑝 the
attributes.
35
What would we do, if the independence assumption was violated?
36
In Naive Bayes, all attributes are assumed to be conditionally independent of each other (given the class).
In Bayesian belief networks, however, only some attributes are conditionally independent of each other.
Graphical representation: directed acyclic graph (DAG), one node for each attribute.
• Overall probability distribution factorized into component distributions
• Graph’s nodes hold component distributions (conditional distributions)
37
We love independencies in probability, since they make calculations much easier: all we need to do is
multiply the probabilities of the events involved. In most cases attributes/events are not outright independent, but they are conditionally
independent; in that case we can still multiply the probabilities of the events occurring, but we also take
the given conditions into account.
Probability Laws
p(a,b) = p(b|a) . p(a)
p(a,b,c) = p(c|a,b) . p(a,b)
p(a,b,c,d) = p(d|a,b,c) . p(a,b,c)
Chain rule p(a,b,c,d,e) = p(e|a,b,c,d) . p(a,b,c,d)
• Pr[𝑒1, 𝑒2, …, 𝑒𝑛] = ∏ᵢ₌₁,…,𝑛 Pr[𝑒ᵢ | 𝑒ᵢ₋₁, …, 𝑒1]
• Pr[𝐴, 𝐵, 𝐶, 𝐷, 𝐸] = Pr[𝐴]Pr[𝐵|𝐴]Pr[𝐶|A, B]Pr[𝐷|𝐴, 𝐵, 𝐶]Pr[E|A,B,C,D]
• The joint distribution is independent of the ordering
Conditional independence
• ℎ is conditionally independent of 𝑒1 given 𝑒2 if Pr[ℎ | 𝑒1, 𝑒2] = Pr[ℎ | 𝑒2]
• Example:
Rain causes people to use an umbrella and traffic to slow down
Umbrella is conditionally independent of traffic given rain
o 𝑈𝑚𝑏𝑟𝑒𝑙𝑙𝑎 ⫫ 𝑇𝑟𝑎𝑓𝑓𝑖𝑐 | 𝑅𝑎𝑖𝑛
38
The Full Joint Distribution
Pr[e1, …, en]
= Pr[en | en−1, …, e1] ⋅ Pr[en−1, …, e1]
= Pr[en | en−1, …, e1] ⋅ Pr[en−1 | en−2, …, e1] ⋅ Pr[en−2, …, e1]
= Pr[en | en−1, …, e1] ⋅ Pr[en−1 | en−2, …, e1] ⋯ Pr[e2 | e1] ⋅ Pr[e1]
= ∏ᵢ₌₁ⁿ Pr[ei | ei−1, …, e1]   (Chain Rule)

From the chain rule to Bayesian networks:
= ∏ᵢ₌₁ⁿ Pr[ei | parents(ei)]
where parents(ei) are the variables that ei directly depends on.
39
Bayesian Network Assumptions
Pr 𝑒1 , 𝑒2 , … , 𝑒𝑛 = Pr[E1 = e1 … En = en]
Pr 𝐴, 𝐵, 𝐶, 𝐷, 𝐸 = Pr 𝐴 Pr 𝐵 𝐴 Pr 𝐶 𝐴, 𝐵 Pr 𝐷 𝐴, 𝐵, 𝐶 Pr 𝐸 𝐴, 𝐵, 𝐶, 𝐷
= Pr 𝐴 Pr 𝐵 𝐴 Pr 𝐶 𝐴 Pr 𝐷 𝐴, 𝐶 Pr[𝐸|𝐶]
How do we get such a Bayesian network? How do we know which variables are dependent or independent?
- This is called structure learning. There are many possible topologies for such a network; learning the structure from data
is computationally very expensive.
- We also need the conditional probabilities, and these have to be learned from the data as well.
40
Example: Alarm
Your house has an alarm system against burglary. The house is located in a seismically active area,
and the alarm system can occasionally be set off by an earthquake. You have two neighbors, Mary and
John, who do not know each other. If they hear the alarm they call you, but this is not guaranteed.
They also call you from time to time just to chat.

Five random variables:
• B: Burglary
• E: Earthquake
• A: Alarm
• J: JohnCalls
• M: MaryCalls

Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls
[illustration by Kevin Murphy]
41
Example
[Figure: the alarm network, with Burglary and Earthquake pointing to Alarm, and Alarm pointing to JohnCalls and MaryCalls]
42
A Simple Bayes Net
[Figure: Burglary and Earthquake as causes at the top, JohnCalls and MaryCalls below]
43
Assigning Probabilities to Roots
Burglary: Pr[B] = 0.001
Earthquake: Pr[E] = 0.002
[Figure: the alarm network with these probabilities attached to the root nodes]
44
Conditional Probability Tables
Burglary: Pr[B] = 0.001
Earthquake: Pr[E] = 0.002

Alarm:
B E | Pr[A|B,E]
T T | 0.95
T F | 0.94
F T | 0.29
F F | 0.001
45
Conditional Probability Tables
Burglary: Pr[B] = 0.001
Earthquake: Pr[E] = 0.002

Alarm:
B E | Pr[A|B,E]
T T | 0.95
T F | 0.94
F T | 0.29
F F | 0.001

JohnCalls:
A | Pr[J|A]
T | 0.90
F | 0.05

MaryCalls:
A | Pr[M|A]
T | 0.70
F | 0.01
The full joint distribution factorizes over the network:
Pr[x1, x2, …, xn] = ∏ᵢ₌₁,…,ₙ Pr[xᵢ | Parents(xᵢ)]
47
Calculation of Joint Probability
Using the CPTs above:
Pr[J, M, A, ¬B, ¬E]
= Pr[J|A] ⋅ Pr[M|A] ⋅ Pr[A|¬B,¬E] ⋅ Pr[¬B] ⋅ Pr[¬E]
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998
≈ 0.00063
48
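A minimal sketch (not from the slides) of the alarm network in Python: the joint probability via the network factorization, reproducing the value above, plus brute-force inference by enumerating the full joint distribution.

```python
from itertools import product

P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,    # Pr[A=T | B, E]
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                    # Pr[J=T | A]
P_M = {True: 0.70, False: 0.01}                    # Pr[M=T | A]

def joint(b, e, a, j, m):
    """Pr[B=b, E=e, A=a, J=j, M=m] = product of Pr[x_i | Parents(x_i)]."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# The example above: both call, alarm went off, no burglary, no earthquake.
print(joint(b=False, e=False, a=True, j=True, m=True))   # ~0.00063

# Inference by enumeration: Pr[Burglary | JohnCalls, MaryCalls]
num = sum(joint(True, e, a, True, True) for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True, True) for b, e, a in product([True, False], repeat=3))
print(num / den)   # posterior probability of a burglary given both calls
```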
Inference in Bayesian Networks
Other than the joint probability of specific events, we may want to infer the probability of an event
given observations about a subset of the other variables (e.g. the "explaining away" effect).
[Figure by N. Friedman: example networks with Radio, Alarm, and Call nodes]
49
I did not understand this slide???
50
Learning Bayes Nets
Parameter learning: learning the conditional probabilities (the conditional distributions need to be estimated from the data).
Method for evaluating the goodness of a given network:
• Measures maximize the joint probability of the training data given the network (or the sum of the logarithms thereof),
i.e. the log-likelihood of the training data under the network.
• Example:
Akaike Information Criterion (AIC): −2𝐿𝐿 + 2𝐾
Minimize AIC, with 𝐾 = number of parameters
Structure learning: learning or creating the Bayesian network structure itself.
Method for searching through the space of possible networks:
• Amounts to searching through sets of edges because the nodes are fixed
• Examples: K2, Tree Augmented Naive Bayes (TAN)
51
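A minimal sketch (not from the slides) of the AIC score used to compare candidate networks; the log-likelihood values and parameter counts below are hypothetical.

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion: -2*LL + 2*K; lower is better."""
    return -2.0 * log_likelihood + 2.0 * k

# Two hypothetical candidate structures: prefer the one with the lower AIC.
print(aic(log_likelihood=-1250.0, k=18))
print(aic(log_likelihood=-1238.0, k=40))
```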
Bayes Nets Summary
Bayes Nets can handle dependencies among attributes.
52
Note: Bayesian vs. Frequentist Statistics
Frequentists: Models are fixed, but data varies
Bayesians: Data is fixed, but model parameters vary
53