
CS 464 Summary

Ch 2 Probability Review

Bayes Rule
P(X | Y) = P(Y | X) P(X) / P(Y)

Chain Rule

P(X_1, ..., X_n) = P(X_1) P(X_2 | X_1) ... P(X_n | X_1, ..., X_n-1)

especially useful when there is conditional independence across the variables

picking the right order can often make evaluating the probability much
easier

Conditional independence

X ⊥ Y | Z   ⇔   P(X, Y | Z) = P(X | Z) P(Y | Z)

Ch 3 Estimation

Maximum Likelihood Estimator (MLE): chooses θ that maximizes the probability of the observed data (the likelihood of the data).

For the coin/thumbtack example with α_H heads and α_T tails:

P(D | θ) = θ^(α_H) (1 − θ)^(α_T)

MLE estimate of θ:

θ̂_MLE = argmax_θ P(D | θ)
A Bound From Hoeffding's Inequality

Let N = α_H + α_T and θ̂_MLE = α_H / (α_H + α_T).
Let θ* be the true parameter. For any ε > 0:

P(|θ̂ − θ*| ≥ ε) ≤ 2e^(−2Nε²)

Probably Approximately Correct (PAC)

I want to know the thumbtack parameter θ within ε, with probability at least 1 − δ.
How many flips, i.e., how big do I set N?

P(|θ̂ − θ*| ≥ ε) ≤ 2e^(−2Nε²) ≤ δ = 0.05

2Nε² ≥ −ln(0.05 / 2)

N ≥ −ln(0.05 / 2) / (2ε²) ≈ 3.8 / (2ε²)
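A small Python sketch of this sample-size calculation (the function name and the example values ε = 0.1, δ = 0.05 are illustrative):

```python
import math

def required_flips(eps, delta):
    """Smallest N with 2*exp(-2*N*eps**2) <= delta, from the Hoeffding bound above."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

# Estimate theta within eps = 0.1 with probability at least 1 - 0.05
print(required_flips(0.1, 0.05))  # 185, since ln(2/0.05) ~ 3.69 and 3.69 / 0.02 ~ 184.4
```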
MLE of Gaussian Parameters

Probability of i.i.d. samples D = {x_1, ..., x_N}:

P(D | μ, σ) = ∏_i (1 / (σ√(2π))) e^(−(x_i − μ)² / (2σ²))

(μ̂_MLE, σ̂_MLE) = argmax_{μ,σ} P(D | μ, σ)

μ̂_MLE = (1/N) Σ_i x_i

σ̂²_MLE = (1/N) Σ_i (x_i − μ̂_MLE)²

The MLE for the variance of a Gaussian is biased: the expected value of the estimator is not equal to the true parameter. An unbiased variance estimator:

σ̂²_unbiased = (1/(N−1)) Σ_i (x_i − μ̂)²
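A minimal numpy sketch of these estimators (the sample data, seed, and true parameters are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=10)   # small sample, so the bias is noticeable

mu_mle = x.mean()                                          # (1/N) sum x_i
var_mle = np.mean((x - mu_mle) ** 2)                       # divides by N   -> biased
var_unbiased = np.sum((x - mu_mle) ** 2) / (len(x) - 1)    # divides by N-1 -> unbiased

print(mu_mle, var_mle, var_unbiased)
```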
Maximum A Posteriori (MAP) Estimation

Our prior could be in the form of a probability distribution.

posterior: p(θ | D) = P(D | θ) p(θ) / P(D)

(P(D | θ) is the likelihood, p(θ) is the prior, and P(D) is a normalization constant)

Beta Distribution

p(θ) ∝ θ^(α−1) (1 − θ)^(β−1),   0 ≤ θ ≤ 1,   α, β > 0

When the data is sparse, this allows us to fall back to the prior and avoid the issues faced by MLE.
When the data is abundant, the likelihood will dominate the prior, and the prior will not have much of an effect on the posterior distribution.
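A small sketch contrasting MLE with MAP for the coin/thumbtack setting under a Beta(α, β) prior, assuming the standard Beta-Bernoulli posterior mode (α_H + α − 1) / (N + α + β − 2); the counts and prior values are made up:

```python
def mle(heads, tails):
    return heads / (heads + tails)

def map_estimate(heads, tails, alpha, beta):
    # Mode of the Beta(heads + alpha, tails + beta) posterior
    return (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

# Sparse data: 1 head, 0 tails
print(mle(1, 0))                 # 1.0  -- MLE overfits the tiny sample
print(map_estimate(1, 0, 3, 3))  # 0.6  -- the prior pulls the estimate back
```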

Chapter 4 Naive Bayes

Naive Bayes assumes

P(X_1, ..., X_n | Y) = ∏_i P(X_i | Y)

The random variable features X_i and X_j are conditionally independent of each other given the class label Y, for all i ≠ j.
y_new = argmax_{y_k} P(Y = y_k) ∏_i P(X_i^new | Y = y_k)

To avoid underflow:

y_new = argmax_{y_k} [ log P(Y = y_k) + Σ_i log P(X_i^new | Y = y_k) ]

y_new = argmax_{y_k} [ log π_k + Σ_i log θ_ijk ]

(with π_k = P(Y = y_k) and θ_ijk = P(X_i = j | Y = y_k))
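A minimal sketch of the log-space prediction rule above; the container layout (log_prior, log_likelihood) is an assumed representation of the estimated π_k and θ_ijk, not a prescribed API:

```python
import math
import numpy as np

def nb_predict(x_new, log_prior, log_likelihood):
    """
    x_new[i]                = observed value (index) of feature i
    log_prior[k]            = log P(Y = y_k)            (log pi_k)
    log_likelihood[k][i][v] = log P(X_i = v | Y = y_k)  (log theta_ivk)
    """
    scores = [
        log_prior[k] + sum(log_likelihood[k][i][v] for i, v in enumerate(x_new))
        for k in range(len(log_prior))
    ]
    return int(np.argmax(scores))

# Toy model: 2 classes, 2 binary features (all probabilities made up)
log_prior = [math.log(0.6), math.log(0.4)]
log_likelihood = [
    [[math.log(0.8), math.log(0.2)], [math.log(0.7), math.log(0.3)]],  # class 0
    [[math.log(0.3), math.log(0.7)], [math.log(0.4), math.log(0.6)]],  # class 1
]
print(nb_predict([1, 1], log_prior, log_likelihood))  # predicts class 1
```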
Chapter 5 Feature Selection

The objective in classification/regression is to learn a function that relates values of features to the value of the outcome variable.

Often we are presented with many features


Not all of these features are relevant

Feature selection is the task of identifying an optimal set of features that are useful for accurately predicting the outcome variable.

Motivation For Feature Selection

Accuracy / Generalizability
Interpretability
Efficiency

Three Main Approaches

1. Treat feature selection as a separate task
   Filtering-based feature selection
   Wrapper-based feature selection

2. Embed feature selection into the task of learning a model


Regularization

3. Do not select features; instead construct new features that effectively represent combinations of the original features
   Dimensionality reduction

Feature Selection as a Separate Task

Filtering-based feature selection: all features → feature selection → subset of features → learning method → model.

Wrapper-based feature selection: the feature selection step calls the learning method many times and uses it to help select features before producing the final model.
Scoring Features For Filtering

Mutual information, statistical tests (t-statistic, chi-square statistic), variance, frequency.

Statistical tests compare the frequencies of a term between different classes.
Information
Reduction in uncertainty / amount of surprise in the outcome:

I(x) = −log₂ p(x)

If the probability of an event is small and it happens, the information is large.

Observing that the outcome of a coin flip is heads: I = −log₂(1/2) = 1
The outcome of a die roll is 6: I = −log₂(1/6) ≈ 2.58

Entropy

The entropy of a random variable is the sum of the information provided by its possible values, weighted by the probability of each value.

Entropy is a measure of uncertainty:

H(X) = −Σ_x p(x) log₂ p(x)
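A small sketch computing the information and entropy values above (math.log2 from the standard library):

```python
import math

def information(p):
    """I(x) = -log2 p(x): surprise of an outcome with probability p."""
    return -math.log2(p)

def entropy(probs):
    """H(X) = -sum p(x) log2 p(x), ignoring zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(information(1 / 2))        # 1.0 bit   (coin flip comes up heads)
print(information(1 / 6))        # ~2.585    (die roll comes up 6)
print(entropy([0.5, 0.5]))       # 1.0 bit   (fair coin: maximum uncertainty)
print(entropy([0.99, 0.01]))     # ~0.081    (nearly certain outcome)
```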

Mutual information
I(X; Y) is the reduction of uncertainty in one variable upon observation of the other variable.
A measure of statistical dependency between two random variables:

I(X; Y) = H(X) − H(X | Y) = H(Y) − H(Y | X) = H(X) + H(Y) − H(X, Y)

The mutual information between the feature vector and the class label measures the amount by which the uncertainty in the class is decreased by knowledge of the feature.

Definition:

I(X; Y) = Σ_x Σ_y p(x, y) log₂ [ p(x, y) / (p(x) p(y)) ]
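A minimal numpy sketch of this definition, estimating I(X; Y) from a joint-count table; the example counts are made up:

```python
import numpy as np

def mutual_information(joint_counts):
    """I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) ) from a joint-count table."""
    p_xy = joint_counts / joint_counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)        # marginal over rows (feature values)
    p_y = p_xy.sum(axis=0, keepdims=True)        # marginal over columns (class labels)
    mask = p_xy > 0                              # skip zero cells to avoid log(0)
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

# Rows: feature present / absent, columns: class 0 / class 1 (hypothetical counts)
counts = np.array([[30.0, 5.0],
                   [10.0, 55.0]])
print(mutual_information(counts))
```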

t-Statistic

Compares the mean value of a feature between two classes. The distribution of t approaches the normal distribution as the number of samples grows.

Forward Selection vs Backward Elimination

Both use a hill-climbing search.

Forward selection: efficient for choosing a small number of features; misses features whose usefulness requires other features.

Backward elimination: efficient for discarding a small subset of the features; preserves features whose usefulness requires other features.

Embedded Methods: Regularization

Ridge Regression:

E(w) = Σ_d ( y^(d) − (w_0 + Σ_i x_i^(d) w_i) )² + λ Σ_i w_i²     (L2 penalty)

LASSO:

E(w) = Σ_d ( y^(d) − (w_0 + Σ_i x_i^(d) w_i) )² + λ Σ_i |w_i|     (L1 penalty)

L1 and L2 penalties can be used with other learning methods (logistic regression, neural nets, SVMs, etc.).
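A small numpy sketch that just evaluates the two penalized objectives above for given weights (it does not perform the actual minimization, e.g., coordinate descent for LASSO):

```python
import numpy as np

def ridge_objective(w0, w, X, y, lam):
    """Sum of squared errors + lambda * L2 penalty (penalty excludes the intercept w0)."""
    residuals = y - (w0 + X @ w)
    return np.sum(residuals ** 2) + lam * np.sum(w ** 2)

def lasso_objective(w0, w, X, y, lam):
    """Sum of squared errors + lambda * L1 penalty (penalty excludes the intercept w0)."""
    residuals = y - (w0 + X @ w)
    return np.sum(residuals ** 2) + lam * np.sum(np.abs(w))
```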

Chapter 6 Feature Extraction

PCA Applications
Noise reduction, data visualization, data compression

How could we find the smallest subspace that keeps the most information about the original data?
A solution: Principal Component Analysis.
Feature Extraction

Rather than picking a subset of features x_1, x_2, ..., x_n, create new features from the existing ones:

z_1 = w_10 + Σ_i w_1i x_i
...
z_k = w_k0 + Σ_i w_ki x_i

Principal Component Analysis

PCA vectors originate from the center of mass

Principal component 1 points in the direction of the largest variance.

Each subsequent principal component is orthogonal to the previous ones and points in the direction of the largest variance of the residual subspace.
PCA Algorithm

Cov(X, Y) = (1/N) Σ_i (x_i − μ_X)(y_i − μ_Y)

Compute the covariance matrix Σ.

PCA basis vectors = the eigenvectors of Σ.

Larger eigenvalue → more important eigenvector.

Steps in PCA
Mean-center the data.

Compute the covariance matrix (or the scatter matrix).

Calculate eigenvalues and eigenvectors of the covariance matrix.

Eigenvector with the largest eigenvalue λ_1 is the 1st principal component (PC).

Eigenvector with the k-th largest eigenvalue λ_k is the k-th PC.

Proportion of variance captured by the k-th PC = λ_k / Σ_i λ_i
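A minimal numpy sketch of these steps (mean-center, covariance, eigendecomposition, sort by eigenvalue); the random data is only for illustration:

```python
import numpy as np

def pca(X, k):
    """Return the top-k principal components and the variance proportion of each."""
    Xc = X - X.mean(axis=0)                     # 1. mean-center the data
    cov = np.cov(Xc, rowvar=False)              # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # 3. eigendecomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]           # 4. sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    proportion = eigvals / eigvals.sum()        # variance captured by each PC
    return eigvecs[:, :k], proportion[:k]

X = np.random.default_rng(0).normal(size=(100, 5))
components, explained = pca(X, 2)
Z = (X - X.mean(axis=0)) @ components           # project onto the first 2 PCs
print(explained)
```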

Scaling Up
The covariance matrix can be really big.
Use singular value decomposition (SVD): takes input X and finds the k eigenvectors.

The SVD of an m×n matrix A is given by the formula A = U S Vᵀ, where:

U: m×m matrix of the orthonormal eigenvectors of AAᵀ

Vᵀ: transpose of an n×n matrix containing the orthonormal eigenvectors of AᵀA

S: diagonal matrix with r elements equal to the square roots of the positive eigenvalues of AAᵀ or AᵀA (both matrices have the same positive eigenvalues anyway)
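A small sketch using numpy's SVD to illustrate the A = U S Vᵀ factorization (the matrix is random, and the dense S is built only to check the reconstruction):

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# U: 6x6 eigenvectors of A A^T, Vt: transposed 4x4 eigenvectors of A^T A,
# s: singular values = square roots of the positive eigenvalues of A^T A
S = np.zeros_like(A)
np.fill_diagonal(S, s)
assert np.allclose(A, U @ S @ Vt)   # the factorization reproduces A
```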
PCA Summary

PCA

sorts dimensions in order of importance

Applications
get a compact description
remove noise
improve classification (hopefully)

Not magic:
does not know class labels
can only capture linear variations

One of many tricks to reduce dimensionality.

Chapter 7 Performance Metrics

Model Selection and Validation

Learning typically involves trying out different models (algorithms, parameters, feature sets, etc.).

How do we select the best model among different models?

Types of Errors Confusion Matrix

                     Actual positive                         Actual negative
Predicted positive   True Positive (TP)                      False Positive (FP), Type I error
Predicted negative   False Negative (FN), Type II error      True Negative (TN)

Accuracy = (TP + TN) / (TP + FP + FN + TN)
Precision, Recall, F-Measure

Precision: fraction of true positives among all positive predictions.

P = TP / (TP + FP)

Recall (Sensitivity, TPR): fraction of true positive predictions among all positive samples.

R = TP / (TP + FN)

Specificity (TNR): fraction of true negative predictions among all negative samples.

F-Measure: harmonic mean of precision and recall.

F1 = 2PR / (P + R)

The F-Measure can be weighted to favor precision or recall:

F_β = (1 + β²) P R / (β² P + R)     (β > 1 favors recall, β < 1 favors precision)
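A small sketch of these formulas from raw confusion-matrix counts (the counts and β value are arbitrary):

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall, and the weighted F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f_beta

print(precision_recall_f(tp=40, fp=10, fn=20))            # F1 (beta = 1)
print(precision_recall_f(tp=40, fp=10, fn=20, beta=2.0))  # beta > 1 favors recall
```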

Micro vs Macro Averaging

A macro average just averages the individually calculated scores of each class.
Weights each class equally.

A micro average calculates the metric by first pooling all instances of each class.
Weights each instance equally.
Macro average can be dominated by minority classes.

Micro average is not sensitive to class imbalance.

If micro avg << macro avg: high misclassification rate for the majority class.

If macro avg << micro avg: high misclassification rate for the minority classes.
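A small sketch illustrating the macro vs micro gap with made-up per-class confusion counts (precision only, for brevity):

```python
import numpy as np

# Hypothetical per-class counts: (tp, fp, fn) for one majority and one minority class
counts = {"majority": (900, 50, 50), "minority": (5, 20, 45)}

precisions = [tp / (tp + fp) for tp, fp, fn in counts.values()]
macro_precision = np.mean(precisions)                 # each class weighted equally

tp_all = sum(tp for tp, fp, fn in counts.values())
fp_all = sum(fp for tp, fp, fn in counts.values())
micro_precision = tp_all / (tp_all + fp_all)          # each instance weighted equally

# macro << micro -> the minority class is misclassified often
print(macro_precision, micro_precision)
```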
Chapter 8 Model Selection Validation

Generalization

Definition: the model does a good job of correctly predicting class labels of previously unseen samples.

Evaluating Generalization Requires

An unseen dataset with known correct labels

A quantitative measure (metric) of the tendency of the model to predict correct labels

Optimizing Model Complexity

Most learning algorithms have knobs that adjust the model complexity:

Regression: order of the polynomial

Naive Bayes: number of features

Decision trees: number of nodes in the tree

KNN: number of nearest neighbors

SVM: kernel type, cost parameter

Model Selection
Extra-sample error estimates:

Train/test split (most used method)
Cross-validation
Bootstrap
Chapter 9 Linear Regression
Measure of Error

We can measure the prediction loss in terms of squared error. Loss on one example:

Loss(y, ŷ) = (y − ŷ)²

Loss on n training examples:

J_n(w) = (1/n) Σ_{i=1}^n (y_i − f(x_i; w))²

f: R^d → R,   f(x; w) = w_0 + w_1 x_1 + ... + w_d x_d
Minimize the Squared Loss

J_n(w) = (1/n) Σ_i (y_i − f(x_i; w))² = (1/n) Σ_i (y_i − w_0 − Σ_j w_j x_ij)²

To get the optimal parameter values, take the derivative and set it to 0:

∂J_n(w) / ∂w = 0

ŵ = (XᵀX)⁻¹ Xᵀ y
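A minimal numpy sketch of this closed-form solution (using np.linalg.solve rather than an explicit inverse; the synthetic data and true weights are made up):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Closed-form least squares: w = (X^T X)^(-1) X^T y, with a bias column prepended."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # add intercept term w0
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)     # solve avoids forming the inverse explicitly

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 + X @ np.array([1.0, -3.0, 0.5]) + 0.1 * rng.normal(size=50)
print(fit_linear_regression(X, y))   # approximately [2.0, 1.0, -3.0, 0.5]
```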

Numerical Solution

Matrix inversion is computationally very expensive.

Using the analytical form to compute the optimal solution may not be feasible even for moderate values of n.

Also only possible if XᵀX is not singular (multicollinearity problem):
Determinant is zero
Not full rank

Gradient Descent

General algorithm for optimization:

1. Initialization: initialize θ
2. Do: θ ← θ − η ∇J(θ)     (η is the step size; it can change as a function of the iteration)
3. While ||∇J(θ)|| ≥ ε     (stopping condition)

The update moves in the negative gradient direction.

Gradient vector: ∇J = [∂J/∂θ_1, ..., ∂J/∂θ_d]ᵀ
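A small generic gradient-descent sketch following these steps (the step size, tolerance, and quadratic example are arbitrary choices):

```python
import numpy as np

def gradient_descent(grad, theta0, step_size=0.1, tol=1e-6, max_iter=10_000):
    """Repeat theta <- theta - eta * grad(theta) until ||grad(theta)|| < tol."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:        # stopping condition
            break
        theta = theta - step_size * g      # move against the gradient
    return theta

# Example: minimize J(theta) = ||theta - [1, 2]||^2, whose gradient is 2*(theta - [1, 2])
print(gradient_descent(lambda t: 2 * (t - np.array([1.0, 2.0])), theta0=[0.0, 0.0]))
```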
