W5-6


Parameterized Model

• A very common case is where the structure of the model is a parameterized mathematical function or equation of a set of numeric attributes.
• The attributes used in the model could be chosen:
  • based on domain knowledge regarding which attributes ought to be informative in predicting the target variable, or
  • based on other data mining techniques, such as the attribute selection procedures introduced previously.

2

Decision Boundaries

3

Instance Space

4

Linear Classifier

5

Linear Discriminant Functions

• Our goal is going to be to fit our model to the data.
• Recall y = mx + b, where m is the slope of the line and b is the y-intercept (the y value when x = 0). The line in the previous figure can be expressed in this form (with Balance in thousands) as:

  Age = (−1.5) × Balance + 60

• We would classify an instance x as a + if it is above the line, and as a ● if it is below the line:

  class(x) = + if 1.0 × Age − 1.5 × Balance + 60 > 0
             ● if 1.0 × Age − 1.5 × Balance + 60 ≤ 0

• This is called a linear discriminant because it discriminates between the classes, and the function of the decision boundary is a linear combination (a weighted sum) of the attributes.

6

Linear Discriminant Functions

• In the two dimensions of our example, the linear combination corresponds to a line.
• In three dimensions, the decision boundary is a plane, and in higher dimensions it is a hyperplane.
• Thus, this linear model is a different sort of multivariate supervised segmentation.
• A linear discriminant function is a numeric classification model.
• For example, consider our feature vector x, with the individual component features being xi. A linear model then can be written as follows:

  f(x) = w0 + w1x1 + w2x2 + …

• The concrete example from above can be written in this form as:

  f(x) = 60 + 1.0 × Age − 1.5 × Balance

• To use this model as a linear discriminant, for a given feature vector x, we check whether f(x) is positive or negative.
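As a concrete illustration, here is a minimal Python sketch of using this model as a discriminant, with the weights from the example above (the function names are ours, not part of the example):

  import numpy as np

  w = np.array([1.0, -1.5])   # weights on (Age, Balance in thousands), from the slide
  w0 = 60.0                   # intercept term

  def f(x):
      # Linear combination f(x) = w0 + w . x
      return w0 + np.dot(w, x)

  def classify(x):
      # '+' if the instance is above the line, '●' if below
      return "+" if f(x) > 0 else "●"

  # Example: a 40-year-old with a balance of 50 (thousand)
  print(f(np.array([40, 50])), classify(np.array([40, 50])))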

7

Linear Discriminant Functions

• The weights of the linear function (wi) are the parameters.
• The data mining is going to “fit” this parameterized model to a particular dataset, meaning specifically, to find a good set of weights on the features.
• Unfortunately, it’s not trivial to choose the “best” line to separate the classes. See next slide …

8

Choosing the “best” line

9

Optimizing an Objective Function

• What should be our goal or objective in choosing the parameters?
• This would allow us to answer the question: what weights should we choose?
• Our general procedure will be to define an objective function that represents our goal and can be calculated for a particular set of weights and a particular set of data.
• We will then find the optimal value for the weights by maximizing or minimizing the objective function.

10

Objective Functions

• The “best” line depends on the objective (loss) function.
• The objective function should represent our goal.
• A loss function determines how much penalty should be assigned to an instance based on the error in the model’s predicted value.
• Examples of objective (or loss) functions:
  • λ(y, x) = |y − f(x)|  [absolute error]
  • λ(y, x) = (y − f(x))²  [convenient mathematically – linear regression]
  • λ(y, x) = I(y ≠ f(x))  [zero-one loss]
• Linear regression, logistic regression, and support vector machines are all very similar instances of our basic fundamental technique:
  • The key difference is that each uses a different objective function.
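A small Python sketch of these three loss functions for a single instance, where y is the true value and fx = f(x) is the model's prediction (function names are illustrative):

  def absolute_loss(y, fx):
      return abs(y - fx)        # |y - f(x)|

  def squared_loss(y, fx):
      return (y - fx) ** 2      # (y - f(x))^2, used by linear regression

  def zero_one_loss(y, fx):
      return int(y != fx)       # I(y != f(x))

  print(absolute_loss(3.0, 2.5), squared_loss(3.0, 2.5), zero_one_loss(1, 0))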

11

Logistic regression is a misnomer

• The distinction between classification and regression is whether the value for the target variable is categorical or numeric.
• For logistic regression, the model produces a numeric estimate.
• However, the values of the target variable in the data are categorical.
• Logistic regression is estimating the probability of class membership (a numeric quantity) over a categorical class.
• Logistic regression is a class probability estimation model and not a regression model.

12

Example 1 – Iris Dataset

• This is an old and fairly simple dataset representing various types of iris, a genus of flowering plant.
• The original dataset includes three species of irises represented with four attributes, and the data mining problem is to classify each instance as belonging to one of the three species based on the attributes.
• We’ll use just two species of irises, Iris Setosa and Iris Versicolor.
• The dataset describes a collection of flowers of these two species, each described with two measurements: the petal width and the sepal width.
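A hedged sketch of this two-species setup using scikit-learn's copy of the iris data (assuming scikit-learn is available; column indices follow its documented feature order):

  from sklearn.datasets import load_iris
  from sklearn.linear_model import LogisticRegression

  iris = load_iris()
  # Feature columns: 0 = sepal length, 1 = sepal width, 2 = petal length, 3 = petal width
  mask = iris.target < 2                  # keep Setosa (0) and Versicolor (1) only
  X = iris.data[mask][:, [3, 1]]          # petal width and sepal width
  y = iris.target[mask]

  model = LogisticRegression(max_iter=1000).fit(X, y)
  print(model.coef_, model.intercept_)    # learned weights define the linear boundary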

13

Example 1 – Iris Dataset

• Each instance is one flower and corresponds to one dot on the graph.
• The filled dots are of the species Iris Setosa and the circles are instances of the species Iris Versicolor.

14

Ranking Instances and Class Probability Estimation

• Sometimes we do not just want a prediction of whether an instance belongs to the class, but we want some notion of which examples are more or less likely to belong to the class.
  • Which consumers are most likely to respond to this offer?
  • Which customers are most likely to leave when their contracts expire?
• Ranking:
  • Tree induction
  • Linear discriminant functions (e.g., linear regression, logistic regression, SVMs)
  • Ranking is free (see the sketch after this list)
• Class probability estimation:
  • Tree induction
  • Logistic regression
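Since a linear discriminant's score f(x) orders instances directly, ranking really is free. A minimal sketch with hypothetical customers and the weights from the earlier example:

  def score(x, w, w0):
      return w0 + sum(wi * xi for wi, xi in zip(w, x))

  customers = {"ann": (40, 70), "bob": (30, 10), "cai": (55, 45)}  # hypothetical (Age, Balance)
  w, w0 = (1.0, -1.5), 60.0

  # Sort from most to least likely '+' according to the model
  ranked = sorted(customers, key=lambda name: score(customers[name], w, w0), reverse=True)
  print(ranked)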

15

The many faces of classification:
Classification / Probability Estimation / Ranking (in increasing order of difficulty)

• Ranking:
  • Business context determines the number of actions (“how far down the list”)
• Probability:
  • You can always rank / classify if you have probabilities!

16

Ranking: Examples

• Search engines
  • Whether a document is relevant to a topic / query
• In machine translation
  • For ranking a set of hypothesized translations
• In computational biology
  • For ranking candidate 3-D structures in the protein structure prediction problem
• In recommender systems
  • For identifying a ranked list of related news articles to recommend to a user after he or she has read a current news article

17

Class Probability Estimation:

Examples

• MegaTelCo
  • Ranking vs. class probability estimation
• Identify accounts or transactions as likely to have been defrauded.
  • The director of the fraud control operation may want the analysts to focus not simply on the cases most likely to be fraud, but on accounts where the expected monetary loss is higher.
  • We need to estimate the actual probability of fraud.

18

Support Vector Machines

• In short, support vector machines are linear discriminants.
• As with linear discriminants generally, SVMs classify instances based on a linear function of the features:

  f(x) = w0 + w1x1 + w2x2 + …

• You may have heard of nonlinear support vector machines.
  • Oversimplifying slightly, a nonlinear SVM uses different features (that are functions of the original features), so that the linear discriminant with the new features is a nonlinear discriminant with the original features.

21

Support Vector Machines

• The crucial question:
  • What is the objective function that is used to fit an SVM to data?
• We skip the mathematical details in favor of an intuitive understanding.
• There are two main ideas.
• Instead of thinking about separating with a line, first fit the fattest bar between the classes.
  • See the figure in the next slide.

22

Support Vector Machines

• First idea: a wider bar is better.
• Once the widest bar is found, the linear discriminant will be the center line through the bar (the solid middle line in Figure 4-8).
• The distance between the dashed parallel lines is called the margin around the linear discriminant, and thus the objective is to maximize the margin.

23

Support Vector Machines

• The second idea:
  • How do they handle points falling on the wrong side of the discrimination boundary?
• Typically, a single line cannot perfectly separate the data into classes.
• This is true of most data from complex real-world applications.

24

Support Vector Machines

• In the objective function that measures how well a particular model fits the training points, we simply penalize a training point for being on the wrong side of the decision boundary.
• In the case where the data indeed are linearly separable, we incur no penalty and simply maximize the margin.
• If the data are not linearly separable, the best fit is some balance between a fat margin and a low total error penalty.
• The penalty for a misclassified point is proportional to its distance from the decision boundary, so if possible the SVM will make only “small” errors. Technically, this error function is known as hinge loss.
• See the next slide.

25

Loss Functions

26

Loss Functions

• The term “loss” is used for error penalty across data science.
• A loss function determines how much penalty should be assigned to an instance based on the error in the model’s predicted value; in our present context, based on its distance from the separation boundary.
• Several loss functions are commonly used.
• Support vector machines use hinge loss.
  • Hinge loss incurs no penalty for an example that is not on the wrong side of the margin.
  • The hinge loss only becomes positive when an example is on the wrong side of the boundary and beyond the margin.
  • Loss then increases linearly with the example’s distance from the margin, thereby penalizing points more the farther they are from the separating boundary.
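A minimal sketch of hinge loss under the common convention y ∈ {−1, +1} with score fx = f(x), where the loss is max(0, 1 − y·fx):

  def hinge_loss(y, fx):
      return max(0.0, 1.0 - y * fx)

  print(hinge_loss(+1, 2.0))   # 0.0 -> correct side, outside the margin: no penalty
  print(hinge_loss(+1, 0.5))   # 0.5 -> violates the margin: small penalty
  print(hinge_loss(+1, -1.0))  # 2.0 -> wrong side of the boundary: penalty grows linearly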

27

Loss Functions

• Zero-one loss, as its name implies, assigns a loss of zero for a correct decision and one for an incorrect decision.
• For contrast, consider a different sort of loss function: squared error.
  • Squared error specifies a loss proportional to the square of the distance from the boundary.
  • Squared error loss usually is used for numeric value prediction (regression), rather than classification.
  • The squaring of the error has the effect of greatly penalizing predictions that are grossly wrong.
• For classification, this would apply large penalties to points far over on the “wrong side” of the separating boundary.
• Unfortunately, using squared error for classification also penalizes points far on the correct side of the decision boundary.
• For most business problems, choosing squared-error loss for classification or class probability estimation thus would violate our principle of thinking carefully about whether the loss function is aligned with the business goal.

28

Linear Regression

• We need to decide on the objective function we will use to optimize the model’s fit to the data.
• An intuitive notion of the fit of the model is:
  • How far away are the estimated values from the true values on the training data?
  • In other words, how big is the error of the fitted model?
• Presumably we’d like to minimize this error.
• For a particular training dataset, we could compute this error for each individual data point and sum up the results.
• Then the model that fits the data best would be the model with the minimum sum of errors on the training data.
• And that is exactly what regression procedures do.

29

Linear Regression

• The most natural method is simply to subtract one from the other (and take the absolute value).
• This is called absolute error, and we could then minimize the sum of absolute errors, or equivalently the mean of the absolute errors, across the training data.
• This makes a lot of sense, but it is not what standard linear regression procedures do.
  • Standard linear regression procedures instead minimize the sum or mean of the squares of these errors.
  • Analysts often claim to prefer squared error because it strongly penalizes very large errors.
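A minimal least-squares sketch on hypothetical data, using NumPy's built-in solver to find the weights minimizing the sum of squared errors:

  import numpy as np

  X = np.array([[1.0], [2.0], [3.0], [4.0]])   # one attribute (hypothetical data)
  y = np.array([2.1, 3.9, 6.2, 7.8])           # target values

  A = np.hstack([np.ones((len(X), 1)), X])     # prepend a column of 1s for the intercept
  w, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
  print(w)                                     # [intercept, slope] minimizing squared error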

30

Linear Regression

• Importantly, any choice of objective function has both advantages and drawbacks.
• For least-squares regression a serious drawback is that it is very sensitive to the data:
  • Erroneous or otherwise outlying data points can severely skew the resultant linear function.
• An important thing to remember:
  • Once we see linear regression simply as an instance of fitting a (linear) model to data, we see that we have to choose the objective function to optimize, and we should do so with the ultimate business application in mind.

31

Class Probability Estimation and Logistic Regression

• For many applications we would like to estimate the probability that a new instance belongs to the class of interest.
• In many cases, we would like to use the estimated probability in a decision-making context that includes other factors such as costs and benefits.
• Fortunately, within this same framework for fitting linear models to data, by choosing a different objective function we can produce a model designed to give accurate estimates of class probability.
• The most common procedure by which we do this is called logistic regression.

32

Logistic Regression

• An instance being farther from the separating boundary intuitively ought to lead to a higher probability of being in one class or the other, and the output of the linear function, f(x), gives the distance from the separating boundary.
• However, this also shows the problem: f(x) ranges from −∞ to ∞, whereas a probability should range from zero to one.
• To keep the estimate within [0, 1], we can use the logistic function:

  p(x) = 1 / (1 + e^(−f(x)))
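A short Python version of the logistic function, showing how scores from anywhere on the real line are squashed into (0, 1):

  import math

  def logistic(fx):
      # Maps f(x) in (-inf, inf) to a probability in (0, 1)
      return 1.0 / (1.0 + math.exp(-fx))

  print(logistic(-5), logistic(0), logistic(5))  # ~0.007, 0.5, ~0.993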

33

Logistic regression (“sigmoid”) curve

34

Application of Logistic Regression

• The Wisconsin Breast Cancer Dataset
  • This is another popular dataset from the machine learning dataset repository at the University of California at Irvine.
• Each example describes characteristics of a cell nuclei image, which has been labeled as either benign or malignant (cancerous), based on an expert’s diagnosis of the cells.

35

Wisconsin Breast Cancer Dataset

36

Wisconsin Breast Cancer dataset

• For each fundamental characteristic, three values were computed: the mean (_mean), the standard error (_SE), and the “worst” or largest value (_worst).

37

Wisconsin Breast Cancer dataset

38

Wisconsin Breast Cancer dataset

• The performance of this model is quite good: it makes only six mistakes on the entire dataset, yielding an accuracy of about 98.9% (the percentage of the instances that the model classifies correctly).
• For comparison, a classification tree was learned from the same dataset (using Weka’s J48 implementation).
• The tree has 25 nodes altogether, with 13 leaf nodes. Recall that this means the tree model partitions the instances into 13 segments.
• The classification tree’s accuracy is 99.1%, slightly higher than that of logistic regression.
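A hedged sketch of this comparison using scikit-learn's copy of the (diagnostic) Wisconsin data, with scikit-learn's tree learner standing in for Weka's J48; the exact accuracies will differ from those on the slide:

  from sklearn.datasets import load_breast_cancer
  from sklearn.linear_model import LogisticRegression
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_breast_cancer(return_X_y=True)

  logit = LogisticRegression(max_iter=5000).fit(X, y)
  tree = DecisionTreeClassifier().fit(X, y)

  # Training-set accuracy, as on the slide (not a generalization estimate)
  print("logistic regression:", logit.score(X, y))
  print("classification tree:", tree.score(X, y))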

39

Non-linear Functions

• Linear functions can actually represent nonlinear models, if we include more complex features in the functions.

40

Non-linear Functions

• Using “higher-order” features is just a “trick”.
• Common techniques based on fitting the parameters of complex, nonlinear functions:
  • Nonlinear support vector machines and neural networks
• A nonlinear support vector machine with a “polynomial kernel” considers “higher-order” combinations of the original features:
  • Squared features, products of features, etc. (see the sketch after this list)
• Think of a neural network as a “stack” of models:
  • On the bottom of the stack are the original features.
  • Each layer in the stack applies a simple model to the outputs of the previous layer.
• Might fit the data too well (..to be covered)
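A sketch of the higher-order-features trick: a model that is linear in the expanded features is nonlinear in the original ones (the weights below are hypothetical, not learned):

  import numpy as np

  def expand(x1, x2):
      # Expanded feature vector: (x1, x2, x1^2, x2^2, x1*x2)
      return np.array([x1, x2, x1**2, x2**2, x1 * x2])

  w = np.array([0.5, -1.0, 0.2, 0.3, -0.4])   # hypothetical learned weights
  w0 = 1.0

  def f(x1, x2):
      # Linear in the expanded features, nonlinear in (x1, x2)
      return w0 + np.dot(w, expand(x1, x2))

  print(f(2.0, 1.0))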

41

Simple Neural Network

42

Linear Models versus Tree Induction

• What is more comprehensible to the stakeholders?
  • Rules or a numeric function?
• How “smooth” is the underlying phenomenon being modeled?
  • Trees need a lot of data to approximate curved boundaries.
• How “nonlinear” is the underlying phenomenon being modeled?
  • If very nonlinear, much “data engineering” is needed to apply linear models.
• How much data do you have?!
  • There is a key tradeoff between the complexity that can be modeled and the amount of training data available.
• What are the characteristics of the data: missing values, types of variables, relationships between them, how many are irrelevant, etc.?
  • Trees are fairly robust to these complications.

43
