
UNIT II: DATA ANALYSIS

Regression Modeling
Multivariate Analysis
Bayesian Methods
Bayesian Paradigm
Bayesian Modeling
Inference and Bayesian Networks
Support Vector and Kernel Methods
Analysis of Time Series: Linear Systems Analysis
Nonlinear Dynamics
Rule Induction
Fuzzy Logic: Extracting Fuzzy Models from Data
Fuzzy Decision Trees

Regression Analysis

Regression analysis is a tool for building statistical models that characterize relationships between a dependent variable and one or more independent variables, all of which are numerical.
Simple linear regression involves a single independent variable.
Multiple regression involves two or more independent variables.

Regression: Introduction
What is regression?
Regression is a statistical technique for determining the linear relationship between two or more variables. Regression is primarily used for prediction and causal inference.
In its simplest (bivariate) form, regression shows the relationship between one independent variable (X) and a dependent variable (Y), as in the formula below:
Y = β0 + β1X + ε

Basic idea:
Use data to identify relationships among variables and use these relationships to make predictions.
Regression Model
The variable we are trying to predict (Y) is called the dependent (or response) variable.
The variable X is called the independent (or predictor, or explanatory) variable.
Our model assumes that
E(Y | X = x) = β0 + β1x (the population line) (1)
The interpretation is as follows:
When X (house size) is fixed at a level x, we assume the mean of Y (selling price) to be linear around the level x, where β0 is the (unknown) intercept and β1 is the (unknown) slope, or incremental change in Y per unit change in X.
β0 and β1 are not known exactly, but are estimated from sample data. Their estimates are denoted b0 and b1.
A simple regression model: a model with only one independent variable.
A multiple regression model: a model with multiple independent variables.
Purpose of Regression Analysis
The purpose of regression analysis is to analyze relationships among variables.
The analysis is carried out through the estimation of a relationship, and the results serve the following two purposes:
1. Answer the question of how much y (the dependent variable) changes with changes in each of the x's (x1, x2, ..., xk).
2. Forecast or predict the value of y based on the values of the x's (the independent variables).
Simple Linear Regression
Finds a linear relationship between:
one independent variable X and
one dependent variable Y
First prepare a scatter plot to verify that the data has a linear trend.
Use alternative approaches if the data is not linear.

Figure 9.1

Simple Linear Regression
Example 9.1: Home Market Value Data
The size of a house is typically related to its market value.
Figure 9.2
X = square footage
Y = market value ($)
The scatter plot of the full data set (42 homes) indicates a linear trend.
Figure 9.3

Simple Linear Regression
Finding the Best-Fitting Regression Line
Two possible lines are shown below. Line A is clearly a better fit to the data.
We want to determine the best regression line:
Ŷ = b0 + b1X
where
b0 is the intercept
b1 is the slope
Figure 9.4

Simple Linear Regression
Using Excel to Find the Best Regression Line
Market value = 32673 + 35.036 (square feet)
The regression model explains variation in market value due to the size of the home. It provides better estimates of market value than simply using the average.

Figure 9.5

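As a sketch, the same least-squares line can be fit outside Excel with a few lines of Python. The square-footage and market-value arrays below are illustrative stand-ins, not the actual 42-home dataset.

import numpy as np

# Hypothetical data standing in for the home market value example
sqft = np.array([1500.0, 1700.0, 1900.0, 2100.0, 2400.0])           # X: square footage
value = np.array([85000.0, 92000.0, 99500.0, 106000.0, 117000.0])   # Y: market value ($)

# Least-squares estimates of the slope b1 and intercept b0 for Y-hat = b0 + b1*X
b1, b0 = np.polyfit(sqft, value, deg=1)
print(f"Market value = {b0:.0f} + {b1:.3f} * (square feet)")

# Predicted market value of a 1,750 sq. ft. home
print(b0 + b1 * 1750)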
Building Good Regression Models
Predicting Average Bank Balance
using Regression
Home Value and Education
are not significant.

Figure 9.17

Building Good Regression Models
A Systematic Approach to Building Good Multiple Regression Models
1. Construct a model with all available independent variables and check the significance of each.
2. Identify the largest p-value that is greater than 0.05.
3. Remove that variable and evaluate adjusted R².
4. Continue until all variables are significant.
Find the model with the highest adjusted R². (Do not use unadjusted R², since it always increases when variables are added.)

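The backward-elimination procedure above can be sketched in a few lines of Python with statsmodels; the DataFrame df, the response column "Balance", and the predictor names in the example call are hypothetical placeholders, not the actual bank data.

import statsmodels.api as sm

def backward_eliminate(df, response, predictors, alpha=0.05):
    # Repeatedly drop the predictor with the largest p-value above alpha
    predictors = list(predictors)
    while predictors:
        X = sm.add_constant(df[predictors])
        model = sm.OLS(df[response], X).fit()
        pvals = model.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= alpha:
            return model, predictors       # every remaining predictor is significant
        predictors.remove(worst)           # drop the weakest predictor, refit, re-check adjusted R^2
    raise ValueError("no significant predictors remain")

# Example call (df and the column names are hypothetical):
# model, kept = backward_eliminate(df, "Balance", ["Age", "Education", "HomeValue", "Income"])
# print(model.rsquared_adj)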
Building Good Regression Models
Identifying the Best Regression Model
Bank regression after removing Home Value: Adjusted R² improves slightly.

Figure 9.18
Regression Modeling Steps
Define the problem or question
Specify the model
Collect data
Do descriptive data analysis
Estimate the unknown parameters
Evaluate the model
Use the model for prediction
The 13 Steps for Statistical Modeling in any Regression
Part 1: Define and Design
In the first 4 steps, the objective is clarity. You want to make everything as clear as possible to yourself.
1. Write out research questions in theoretical and operational terms
Often, when researchers are confused about the right statistical method to use, the real problem is that they haven't defined their research questions. They have a general idea of the relationship they want to test, but it's a bit vague. You need to be very specific.
2. Design the study or define the design
Depending on whether you are collecting your own data or doing secondary data analysis, you need a clear idea of the design. Design issues are about randomization and sampling.
3. Choose the variables for answering the research questions and determine their level of measurement
Every model has to take into account both the design and the level of measurement of the variables.
Level of measurement, remember, is whether a variable is nominal, ordinal, or interval. Within interval, you also need to know whether variables are discrete counts or continuous.
4. Write an analysis plan
Write your best guess for the statistical method that will answer the research question, taking into account the design and the type of data.
It does not have to be final at this point; it just needs to be a reasonable approximation.
5. Calculate sample size estimations
This is the point at which you should calculate your sample sizes: before you collect data and after you have an analysis plan. You need to know which statistical tests you will use as a basis for the estimates.
Part 3: Refine the model
10. Refine predictors and check model fit
If you are doing a truly exploratory analysis, or if the point of the model is pure prediction, you can use some sort of stepwise approach to determine the best predictors.
If the analysis is to test hypotheses or answer theoretical research questions, this part will be more about refinement. You can:
Test, and possibly drop, interactions and quadratic terms, or explore other types of nonlinearity
Test the best specification of random effects
11. Test assumptions
Because you already investigated the right family of models in Part 1, thoroughly investigated your variables in Step 8, and correctly specified your model in Step 10, you should not have big surprises here. Rather, this step will be about confirming, checking, and refining. But what you learn here can send you back to any of those steps for further refinement.
12. Check for and resolve data issues
Steps 11 and 12 are often done together. Data issues are about the data, not the model, but occur within the context of the model.
13. Interpret results
Now, finally, interpret the results.
You may not notice data issues or misspecified predictors until you interpret the coefficients. Then you find something like a very high standard error or a coefficient with a sign opposite to what you expected, sending you back to previous steps.
Multivariate Analysis
What is MVA?
Multivariate analysis (MVA) is defined as the simultaneous analysis of more than five variables. Some people use the term megavariate analysis to denote cases where there are more than a hundred variables.
MVA uses ALL available data to capture the most information possible. The basic principle is to boil hundreds of variables down to a mere handful.

MVA
Graphical representation of MVA
[Figure: a raw data table, with columns for treatment, the factors X1, X4, X5, the replicate, and the responses Y, running to hundreds of columns and thousands of rows, is impossible to interpret directly. A statistical model (internal to the software) extracts the trends in Y and the trends in X.]
2-D Visual Outputs
Example: Apples and Oranges
A good example of these ideas is apples versus oranges.
Clever scientists could easily come up with hundreds of different things to measure on apples and oranges to tell them apart:
Colour, shape, firmness, reflectivity, ...
Skin: smoothness, thickness, morphology, ...
Juice: water content, pH, composition, ...
Seeds: colour, weight, size distribution, ...
etc.
However, there will never be more than one difference: is it an apple or an orange? In MVA parlance, we would say that there is only one latent attribute.
Graphical Representation of MVA
The main element of MVA is the reduction in dimensionality. Taken to its extreme, this can mean going from hundreds of dimensions (variables) down to just two, allowing us to create a 2-dimensional graph.
Using these graphs, which our eyes and brains can easily handle, we are able to peer into the database and identify trends and correlations.
This is illustrated on the next page.
Peering into the data
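As a rough sketch of how such a 2-D view is produced in practice, the snippet below projects a wide data matrix onto its first two principal components with scikit-learn; the random matrix X is a stand-in for real process data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))                     # hypothetical data: 1000 rows, 100 variables

X_std = StandardScaler().fit_transform(X)            # put all variables on a common scale
scores = PCA(n_components=2).fit_transform(X_std)    # boil 100 variables down to 2

plt.scatter(scores[:, 0], scores[:, 1], s=5)
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.show()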


Multivariate Analysis
Many statistical techniques focus on just one or two variables.
Multivariate analysis (MVA) techniques allow more than two variables to be analysed at once.
Multiple regression is not typically included under this heading, but can be thought of as a multivariate analysis.
Data-Rich but Knowledge-Poor
As a result, we have become data-rich but knowledge-poor.
The biggest problem is that interesting, useful patterns and relationships which are not intuitively obvious lie hidden inside enormous, unwieldy databases. Also, many variables are correlated.
This has led to the creation of data-mining techniques, aimed at extracting this useful knowledge. Some examples are:
Neural Networks
Multiple Regression
Decision Trees
Genetic Algorithms
Clustering
MVA (the subject of this module)
Mining data
Multivariate Analysis Methods
Two general types of MVA technique:
Analysis of dependence
Where one (or more) variables are dependent variables, to be explained or predicted by others
E.g. multiple regression, PLS, MDA
Analysis of interdependence
No variables are thought of as dependent
Look at the relationships among variables, objects or cases
E.g. cluster analysis, factor analysis
Multivariate Analysis: Benefits
What is the point of doing MVA?
The first potential benefit is to explore the inter-relationships between different process variables. It is well known that simply creating a model can provide insight into the process itself ("learn by modelling").
Once a representative model has been created, the engineer can perform "what if?" exercises without affecting the real process. This is a low-cost way to investigate options.
Some important parameters, like final product quality, cannot be measured in real time. They can, however, be inferred from other variables that are measured on-line. When incorporated in the process control system, this "inferential controller" or "soft sensor" can greatly improve process performance.
Bayesian Methods
Bayesian Paradigm
Bayesian Modeling
Inference and Bayesian Networks

Refer to the PDF.
Support Vector and Kernel Methods
SVMs are currently of great interest to theoretical researchers and applied scientists.
By means of the new technology of kernel methods, SVMs have been very successful in building highly nonlinear classifiers.
SVMs have also been successful in dealing with situations in which there are many more variables than observations, and with complexly structured data.
Wide applications in machine learning, natural language processing, and bioinformatics.
Kernel methods: key idea
[Figure: points x1, ..., xn in the input space X are mapped by Φ into a feature space F; the kernel function k: X × X → R, with k(xi, xj) = Φ(xi)·Φ(xj), gives the kernel matrix K (n × n), on which a kernel-based algorithm operates (computation on the kernel matrix); results can be brought back through the inverse map Φ⁻¹.]
Example mapping: Φ: R² → R³, (x1, x2) ↦ (x1², √2 x1x2, x2²).
Kernel PCA
Using a kernel function, the linear operations of PCA are carried out in a reproducing kernel Hilbert space into which the data are (implicitly) mapped; this corresponds to a nonlinear form of PCA in the original input space.
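A brief scikit-learn sketch of this idea, assuming the standard concentric-circles toy data: linear PCA cannot separate the two rings, while an RBF-kernel PCA can (the gamma value is an illustrative choice, not a tuned one).

from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear_scores = PCA(n_components=2).fit_transform(X)                                 # ordinary PCA
kernel_scores = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)   # kernel PCA

# In kernel_scores the first component alone separates the inner and outer circle,
# which no linear projection of the raw coordinates can do.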
Kernel methods: math background
k(xi, xj) = Φ(xi)·Φ(xj)
Kernel function k: X × X → R; kernel matrix K (n × n); kernel-based algorithm on K (computation on the kernel matrix).
Background: linear algebra, probability/statistics, functional analysis, optimization.
Mercer theorem: any positive definite function can be written as an inner product in some feature space.
Kernel trick: use the kernel matrix instead of inner products in the feature space.
Representer theorem (Wahba): every minimizer over f in H of C(f, {xi, yi}) + Ω(‖f‖_H) admits a representation of the form f(·) = Σ_{i=1..m} αi K(·, xi).
Multiclass support vector machines
Multiclass SVM as a series of binary problems
One-versus-rest: divide the K-class problem into K binary classification subproblems of the type "kth class vs. not kth class", k = 1, 2, ..., K.
One-versus-one: divide the K-class problem into K(K−1)/2 binary subproblems, one for each pair of classes.
Multiclass support vector machines
A true multiclass SVM
To construct a true multiclass SVM classifier, we need to consider all K classes, 1, 2, ..., K, simultaneously, and the classifier has to reduce to the binary SVM classifier when K = 2.
One construction is due to Lee, Lin, and Wahba (2004). It provides a unifying framework for multicategory SVMs with either equal or unequal misclassification costs.
Which Separating Hyperplane to Use?
[Figure: two-class data in the (Var1, Var2) plane that can be separated by many different hyperplanes.]
Maximizing the Margin
IDEA 1: Select the separating hyperplane that maximizes the margin!
[Figure: two candidate separating hyperplanes with their margin widths; the preferred one has the larger margin.]
Support Vectors
[Figure: the support vectors are the training points lying on the margin boundaries; they alone determine the margin width.]
Setting Up the Optimization Problem
The separating hyperplane is w·x + b = 0, with margin boundaries w·x + b = k and w·x + b = −k.
The width of the margin is 2k / ‖w‖.
So the problem is:
  max 2k / ‖w‖
  s.t. w·x + b ≥ k for x of class 1
       w·x + b ≤ −k for x of class 2
Setting Up the Optimization Problem
There is a scale and unit for the data such that k = 1. Then the problem becomes:
  max 2 / ‖w‖
  s.t. w·x + b ≥ 1 for x of class 1
       w·x + b ≤ −1 for x of class 2
Setting Up the Optimization Problem
If class 1 corresponds to yi = 1 and class 2 corresponds to yi = −1, we can rewrite
  w·xi + b ≥ 1 for xi with yi = 1
  w·xi + b ≤ −1 for xi with yi = −1
as
  yi (w·xi + b) ≥ 1 for all xi
So the problem becomes:
  max 2 / ‖w‖,  s.t. yi (w·xi + b) ≥ 1 for all xi
or, equivalently,
  min ½ ‖w‖²,  s.t. yi (w·xi + b) ≥ 1 for all xi
Linear, Hard-Margin SVM Formulation
Find w, b that solve
  min ½ ‖w‖²
  s.t. yi (w·xi + b) ≥ 1 for all xi
The problem is convex, so there is a unique global minimum value (when feasible).
There is also a unique minimizer, i.e. the weight vector w and offset b that provide that minimum.
It is not solvable if the data is not linearly separable.
Quadratic programming: very efficient computationally with modern constraint optimization engines (handles thousands of constraints and training instances).
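As a rough sketch, the hard-margin quadratic program above can be handed directly to a general-purpose convex solver; the snippet below uses cvxpy on a tiny, hypothetical linearly separable dataset.

import numpy as np
import cvxpy as cp

# Toy, linearly separable data (hypothetical)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))       # min (1/2)||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]          # y_i (w . x_i + b) >= 1
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)                   # the maximum-margin hyperplane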
Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (in a way
that can be identified in a computationally
efficient way): maximize margin
2. Extend the above definition for non-linearly
separable problems: have a penalty term for
misclassifications
3. Map data to high dimensional space where it
is easier to classify with linear decision
surfaces: reformulate problem so that data is
mapped implicitly to this space

Non-Linearly Separable Data
Introduce slack variables ξi.
Allow some instances to fall within the margin, but penalize them.
[Figure: the margin boundaries w·x + b = 1 and w·x + b = −1 around the hyperplane w·x + b = 0, with a slack ξi measuring how far a violating point falls on the wrong side of its margin boundary.]
Formulating the Optimization Problem
The constraint becomes:
  yi (w·xi + b) ≥ 1 − ξi for all xi
  ξi ≥ 0
The objective function penalizes misclassified instances and those within the margin:
  min ½ ‖w‖² + C Σi ξi
C trades off margin width and misclassifications.
Linear, Soft-Margin SVMs
  min ½ ‖w‖² + C Σi ξi
  s.t. yi (w·xi + b) ≥ 1 − ξi, ξi ≥ 0 for all xi
The algorithm tries to keep the ξi at zero while maximizing the margin.
Notice: the algorithm does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes.
Other formulations use ξi² instead.
As C → ∞, we get closer to the hard-margin solution.
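A minimal scikit-learn sketch of the role of C, assuming two overlapping synthetic blobs: a small C tolerates margin violations, while a very large C approaches the hard-margin behaviour.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

loose = SVC(kernel="linear", C=0.01).fit(X, y)      # small C: wide margin, many violations allowed
strict = SVC(kernel="linear", C=1000.0).fit(X, y)   # large C: close to the hard-margin solution

print("support vectors with C=0.01:", len(loose.support_))
print("support vectors with C=1000:", len(strict.support_))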
Robustness of Soft vs Hard Margin SVMs
[Figure: the same data fit by a soft-margin SVM and by a hard-margin SVM; the hard-margin hyperplane is dictated entirely by the most extreme points, while the soft-margin hyperplane is less affected by them.]
Soft vs Hard Margin SVM
Soft-margin always has a solution
Soft-margin is more robust to outliers
  Smoother surfaces (in the non-linear case)
Hard-margin does not require guessing the cost parameter (requires no parameters at all)
Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (in a way
that can be identified in a computationally
efficient way): maximize margin
2. Extend the above definition for non-linearly
separable problems: have a penalty term for
misclassifications
3. Map data to high dimensional space where it
is easier to classify with linear decision
surfaces: reformulate problem so that data is
mapped implicitly to this space

Disadvantages of Linear Decision Surfaces
[Figure: data for which no straight line separates the two classes well.]
Advantages of Non-Linear Surfaces
[Figure: the same data separated cleanly by a curved decision surface.]
Linear Classifiers in High-Dimensional Spaces
[Figure: data that is not linearly separable in the original (Var1, Var2) space becomes linearly separable in the space of constructed features (Constructed Feature 1, Constructed Feature 2).]
Find a function Φ(x) to map to a different space.
Mapping Data to a High-Dimensional Space
Find a function Φ(x) to map to a different space; the SVM formulation then becomes:
  min ½ ‖w‖² + C Σi ξi
  s.t. yi (w·Φ(xi) + b) ≥ 1 − ξi, ξi ≥ 0 for all xi
The data appear only as Φ(x), and the weights w are now weights in the new space.
Explicit mapping is expensive if Φ(x) is very high dimensional.
Solving the problem without explicitly mapping the data is desirable.
The Dual of the SVM Formulation
Original SVM formulation (n inequality constraints, n positivity constraints, n slack variables among the unknowns):
  min over w, b of ½ ‖w‖² + C Σi ξi
  s.t. yi (w·Φ(xi) + b) ≥ 1 − ξi, ξi ≥ 0 for all xi
The (Wolfe) dual of this problem (one equality constraint, n positivity constraints, n variables — the Lagrange multipliers αi; the objective function is more complicated):
  min over the αi of ½ Σ_{i,j} αi αj yi yj (Φ(xi)·Φ(xj)) − Σi αi
  s.t. C ≥ αi ≥ 0 for all xi, and Σi αi yi = 0
NOTICE: the data only appear as Φ(xi)·Φ(xj).
The Kernel Trick
Φ(xi)·Φ(xj) means: map the data into the new space, then take the inner product of the new vectors.
We can find a function such that K(xi, xj) = Φ(xi)·Φ(xj), i.e., the image of the inner product of the data is the inner product of the images of the data.
Then we do not need to explicitly map the data into the high-dimensional space to solve the optimization problem (for training).
How do we classify without explicitly mapping the new instances? It turns out that
  sgn(w·Φ(x) + b) = sgn(Σi αi yi K(xi, x) + b),
where b solves αj (yj Σi αi yi K(xi, xj) + b − 1) = 0 for any j with αj ≠ 0.
Examples of Kernels
Assume we measure two quantities, e.g. the expression levels of the genes TrkC and SonicHedgehog (SH), and we use the mapping:
  Φ: (xTrkC, xSH) ↦ (xTrkC², xSH², √2 xTrkC xSH, √2 xTrkC, √2 xSH, 1)
Consider the function:
  K(x, z) = (x·z + 1)²
We can verify that:
  Φ(x)·Φ(z) = xTrkC² zTrkC² + xSH² zSH² + 2 xTrkC xSH zTrkC zSH + 2 xTrkC zTrkC + 2 xSH zSH + 1
            = (xTrkC zTrkC + xSH zSH + 1)² = (x·z + 1)² = K(x, z)
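A quick numerical check of this identity (the two feature values below are arbitrary stand-ins for the TrkC and SH expression levels):

import numpy as np

def phi(v):
    # The explicit 6-dimensional feature map from the slide
    x1, x2 = v
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x = np.array([0.7, 1.3])
z = np.array([2.1, 0.4])

lhs = phi(x) @ phi(z)          # inner product in the feature space
rhs = (x @ z + 1.0) ** 2       # kernel evaluated in the original space
print(np.isclose(lhs, rhs))    # True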
Polynomial and Gaussian Kernels
K(x, z) = (x·z + 1)^p is called the polynomial kernel of degree p.
For p = 2, if we measure 7,000 genes, using the kernel once means calculating a summation product with 7,000 terms and then taking the square of that number.
Mapping explicitly to the high-dimensional space means calculating approximately 50,000,000 new features for both training instances, then taking the inner product of those (another 50,000,000 terms to sum).
In general, using the kernel trick provides huge computational savings over explicit mapping!
Another commonly used kernel is the Gaussian (which maps to a space with a number of dimensions equal to the number of training cases):
  K(x, z) = exp(−‖x − z‖² / 2σ²)
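A hedged scikit-learn sketch of using these two kernels in practice; the two-moons data, the degree, and the gamma value are illustrative choices (in scikit-learn the RBF kernel is parameterized as exp(−gamma·‖x − z‖²), i.e. gamma = 1/(2σ²)).

from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

poly_svm = SVC(kernel="poly", degree=2, coef0=1.0)   # polynomial kernel of degree 2
rbf_svm = SVC(kernel="rbf", gamma=1.0)               # Gaussian (RBF) kernel

for name, clf in [("poly", poly_svm), ("rbf", rbf_svm)]:
    print(name, cross_val_score(clf, X, y, cv=5).mean())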
The Mercer Condition
Is there a mapping Φ(x) for any symmetric function K(x, z)? No.
The SVM dual formulation requires calculating K(xi, xj) for each pair of training instances. The matrix Gij = K(xi, xj) is called the Gram matrix.
There is a feature space Φ(x) when the kernel is such that G is always positive semi-definite (the Mercer condition).
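A small numerical illustration of this condition, assuming the RBF kernel and random points: the eigenvalues of the Gram matrix are non-negative (up to floating-point error).

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))        # 50 hypothetical training instances

G = rbf_kernel(X, X, gamma=0.5)     # Gram matrix G_ij = K(x_i, x_j)
eigvals = np.linalg.eigvalsh(G)
print(eigvals.min() >= -1e-10)      # True: G is positive semi-definite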
Support Vector Machines
Three main ideas:
1. Define what an optimal hyperplane is (in a way
that can be identified in a computationally
efficient way): maximize margin
2. Extend the above definition for non-linearly
separable problems: have a penalty term for
misclassifications
3. Map data to high dimensional space where it
is easier to classify with linear decision
surfaces: reformulate problem so that data is
mapped implicitly to this space

Other Types of Kernel Methods
SVMs that perform regression
SVMs that perform clustering
ν-Support Vector Machines: maximize the margin while bounding the number of margin errors
Leave-One-Out Machines: minimize the bound on the leave-one-out error
SVM formulations that take into consideration differences in the cost of misclassification for the different classes
Kernels suitable for sequences of strings, or other specialized kernels
Variable Selection with SVMs
Recursive Feature Elimination:
  Train a linear SVM
  Remove the variables with the lowest weights (those variables affect classification the least), e.g., remove the lowest 50% of variables
  Retrain the SVM with the remaining variables and repeat until classification performance degrades
Very successful
Other formulations exist where minimizing the number of variables is folded into the optimization problem
Similar algorithms exist for non-linear SVMs
These are among the best and most efficient variable selection methods
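A minimal sketch of SVM-based recursive feature elimination using scikit-learn's RFE wrapper around a linear SVM; the synthetic dataset and the settings (keep 5 features, drop half of the remaining features each round) are illustrative.

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=40, n_informative=5, random_state=0)

selector = RFE(estimator=LinearSVC(C=1.0, max_iter=10000),
               n_features_to_select=5, step=0.5)   # step=0.5: drop the lowest-weighted half each round
selector.fit(X, y)

print(selector.support_)    # boolean mask of the retained variables
print(selector.ranking_)    # 1 = kept; higher numbers were eliminated in earlier rounds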
Comparison with Neural Networks
Neural Networks:
  Hidden layers map to lower dimensional spaces
  Search space has multiple local minima
  Training is expensive
  Classification is extremely efficient
  Require choosing the number of hidden units and layers
  Very good accuracy in typical domains
SVMs:
  Kernel maps to a very-high-dimensional space
  Search space has a unique minimum
  Training is extremely efficient
  Classification is extremely efficient
  Kernel and cost are the two parameters to select
  Very good accuracy in typical domains
  Extremely robust
Why do SVMs Generalize?
Even though they map to a very high-dimensional space:
  They have a very strong bias in that space
  The solution has to be a linear combination of the training instances
There is a large theory on Structural Risk Minimization providing bounds on the error of an SVM.
  Typically the error bounds are too loose to be of practical use.
MultiClass SVMs
One-versus-all
  Train n binary classifiers, one for each class against all other classes.
  The predicted class is the class of the most confident classifier.
One-versus-one
  Train n(n-1)/2 classifiers, each discriminating between a pair of classes.
  Several strategies exist for selecting the final classification based on the output of the binary SVMs.
Truly multiclass SVMs
  Generalize the SVM formulation to multiple categories.
  More on that in the paper nominated for the student paper award: "Methods for Multi-Category Cancer Diagnosis from Gene Expression Data: A Comprehensive Evaluation to Inform Decision Support System Development", Alexander Statnikov, Constantin F. Aliferis, Ioannis Tsamardinos.
(A short sketch of the first two strategies follows below.)
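A hedged sketch of one-versus-rest and one-versus-one with scikit-learn's wrappers around a binary SVC; the iris data is just a convenient 3-class example.

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)   # K binary classifiers
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)    # K(K-1)/2 binary classifiers

print(len(ovr.estimators_), len(ovo.estimators_))           # 3 and 3 for the K = 3 iris classes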
Conclusions
SVMs express learning as a mathematical program, taking advantage of the rich theory in optimization.
SVMs use the kernel trick to map indirectly to extremely high dimensional spaces.
SVMs are extremely successful, robust, efficient, and versatile, and there are good theoretical indications as to why they generalize well.
Fuzzy Logic:
Extracting Fuzzy Models from Data
Fuzzy Decision Trees
Brief History
Fuzzy logic can be defined as a superset of conventional (Boolean) logic that has been extended to handle the concept of partial truth: truth values between completely true and completely false.
Introduced by Lotfi Zadeh in the 1960s, a professor at the University of California at Berkeley.
How it Works
Basics of Fuzzy Logic (Rules)
Operates similarly to humans: humans base their decisions on conditions.
Operates on a collection of IF-THEN statements.
An example is "if A then B, if C then D", where B and D are sets associated with A and C.
Step-by-Step Approach
Step One
Define the control objectives and criteria. Consider questions like:
  What is trying to be controlled?
  What has to be done to control the system?
  What kind of response is needed?
  What are the possible (probable) system failure modes?
Step Two
Determine the input and output relationships.
Determine the least number of variables for input to the fuzzy logic system.
Step-by-Step Approach
Step Three
Break down the control problem into a series of IF X AND Y, THEN Z rules based on fuzzy logic.
These IF X AND Y, THEN Z rules should define the desired system output response for the given system input conditions.

Step Four
Create the fuzzy logic membership functions that define the meaning or values of the input and output terms used in the rules.
Step-by-Step Approach
Step Five
After the membership functions are created, program everything into the fuzzy logic system.

Step Six
Finally, test the system, evaluate the results, and make the necessary adjustments until the desired result is obtained.
Step-by-Step Approach
The above steps can be summarized into three main stages:
Fuzzification
  Membership functions are used to graphically describe a situation
Evaluation of Rules
  Application of the fuzzy logic rules
Defuzzification
  Obtaining the crisp results
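A minimal Python sketch of these three stages for a single-input, single-output controller; the triangular membership functions and the two rules below are illustrative inventions, not taken from the slides.

import numpy as np

def tri(x, a, b, c):
    # Triangular membership function with feet at a and c and peak at b
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def infer(error):
    # 1. Fuzzification: degree to which the input is "negative" or "positive"
    mu_neg = tri(error, -2.0, -1.0, 0.0)
    mu_pos = tri(error, 0.0, 1.0, 2.0)

    # 2. Rule evaluation: IF error is negative THEN output is low;
    #    IF error is positive THEN output is high (each output set clipped by its rule strength)
    out = np.linspace(-1.0, 1.0, 201)
    low = np.minimum(mu_neg, tri(out, -1.5, -1.0, 0.0))
    high = np.minimum(mu_pos, tri(out, 0.0, 1.0, 1.5))
    aggregated = np.maximum(low, high)

    # 3. Defuzzification: center of gravity of the aggregated output set
    if aggregated.sum() == 0.0:
        return 0.0
    return float((out * aggregated).sum() / aggregated.sum())

print(infer(0.8))   # a positive error yields a positive crisp output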
Step-by-Step Approach
[Figure: example input membership functions, a sample fuzzy rule base, and an output membership function.]
Inverted Pendulum
Task:
To balance a pole on a mobile platform
that can move in only two directions,
either to the left or to the right.
Inverted Pendulum
The input and output relationships of
the variables of the fuzzy system are
then determined.
Inputs:
Angle between the platform and the
pendulum
Angular velocity of this angle.
Outputs:
Speed of platform
Inverted Pendulum
Use membership functions to graphically describe the situation (fuzzification).
The output, which is speed, can be high speed, medium speed, low speed, etc. These different levels of output of the platform are defined by specifying the membership functions for the fuzzy sets.
Inverted Pendulum
Define Fuzzy Rules
Examples
If angle is zero and angular velocity is zero,
then speed is also zero
If angle is zero and angular velocity is negative low, then speed is negative low
If angle is positive low and angular velocity
is zero, then speed is positive low
If angle is positive low and angular velocity
is negative low, then speed is zero
Inverted Pendulum
Finally, the defuzzification stage is implemented.
Two ways of defuzzifying are finding the center of gravity and finding the average mean.
Inverted Pendulum
Example Application

http://www.aptronix.com/fuzzynet/java/pend/pendjava.htm
Other Applications
Coal Power Plant
Refuse Incineration Plant
Water Treatment Systems
AC Induction Motor
Fraud Detection
Customer Targeting
Quality Control
Speech Recognition
Nuclear Fusion
Truck Speed Limiter
Sonar Systems
Toasters
Photocopiers
Creditworthiness Assessment
Stock Prognosis
Mortgage Application
Hi-Fi Systems
Humidifiers
Domestic Goods - Washing Machines/Dryers
Microwave Ovens
Consumer Electronics - Television
Still and Video Cameras - Auto focus, Exposure and Anti-Shake
Vacuum Cleaners
Fuzzy Decision Trees
A decision tree is a classifier expressed as a recursive partition of the instance space.
The decision tree consists of nodes that form a rooted tree, meaning it is a directed tree with a node called the root that has no incoming edges.
All other nodes have exactly one incoming edge. A node with outgoing edges is called an internal or test node. All other nodes are called leaves (also known as terminal or decision nodes).
In a decision tree, each internal node splits the instance space into two or more sub-spaces according to a certain discrete function of the input attribute values.
Decision Trees (1). Introduction
A decision tree is a flow-chart-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class label.
[Figure: example tree. Outlook splits into sunny, overcast and rain; the sunny branch tests Humidity (normal → yes, high → no); overcast → yes; the rain branch tests Windy (false → yes, true → no).]
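A short sketch of inducing a (crisp) tree of this kind with scikit-learn; the weather-style dataset below is a hypothetical encoding (outlook: 0 = sunny, 1 = overcast, 2 = rain; windy: 0 = false, 1 = true), not data from the slides.

from sklearn.tree import DecisionTreeClassifier, export_text

# columns: outlook, humidity (%), windy
X = [[0, 85, 0], [0, 90, 1], [1, 78, 0], [2, 96, 0], [2, 80, 0],
     [2, 70, 1], [1, 65, 1], [0, 95, 0], [0, 70, 0], [2, 80, 1]]
y = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "no"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["outlook", "humidity", "windy"]))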
Unordered Fuzzy Decision Trees
[Figure: an unordered fuzzy decision tree. At the root, the class memberships are B1 = 0.275, B2 = 0.275, B3 = 0.450 (f = 1.000) with entropy H(B) = 24.684, for the thresholds 0.16 and 0.75. The attribute that maximizes I(B; Ai1,j1, ..., Aiq-1,jq-1, Aiq) / Cost(Aiq) is selected; here A2 is chosen first (H(B | A2) = 20.932, I(B; A2) = 3.752) and its branches are split further on A1 and A4. Every node stores the fuzzy class proportions B1, B2, B3 and the degree of truth f of the path leading to it.]
Fuzzy Decision Rules (1). A priori
A fuzzy decision rule is a path from the root to a leaf:
  If (A2 is A2,3) then B (with degrees of truth [0.169, 0.049, 0.788])
  If (A2 is A2,2) and (A4 is A4,1) then B is B3 (with degree of truth 0.754)
Input attribute 3 has no influence on attribute B (for the given thresholds 0.16 and 0.75).
[Figure: the corresponding paths highlighted in the fuzzy decision tree.]
Fuzzy Decision Rules (2). A posteriori
A fuzzy decision rule is a path from the root to a leaf, and one example is described by several fuzzy decision rules.
Example: an instance with memberships A1 = (0.9, 0.1, 0.0), A2 = (1.0, 0.0, 0.0), A3 = (0.8, 0.2), A4 = (0.4, 0.6) activates three paths of the tree with weights
  W1 = 0.36, W2 = (A2,1 and A1,1 and A4,2) = (1.0 · 0.9 · 0.6) = 0.54, W3 = (A2,1 and A1,2) = (1.0 · 0.1) = 0.10.
The resulting class distribution is the weighted combination of the leaf distributions:
  (14.2%, 67.8%, 18.0%) · 0.36 + (33.2%, 62.3%, 4.5%) · 0.54 + (37.6%, 50.4%, 12.0%) · 0.10 = (26.8%, 63.1%, 10.1%) for (B1, B2, B3).
Decision Tables
[Figure: the fuzzy decision tree converted into a decision table; after splitting on A2 and then on A1 and A4, each leaf is assigned its dominant class (B1, B2 or B3).]
Truth table vector column:
  [1111 2020 2222 1111 2020 2222 2222 2222 2222]ᵀ
Basics of Knowledge Representation
Initial data → Fuzzy Decision Tree → Fuzzy Decision Rules → Multiple-Valued Logic → Decision Tables → Truth table vector column → Sensitivity Analysis, Testability Analysis, Reliability Analysis
Fuzzy Decision Making Support System
[Diagram: numeric variables are fuzzified into linguistic variables, the fuzzy decision rules (FDT, DT) are applied, a fuzzy analysis is carried out, and the result is de-fuzzified back into numeric variables.]
SOFTWARE FOR EXPERIMENTAL INVESTIGATIONS
We created the software application Multiprognos in C++ (ver. 5.02).
Block 1. Data preparation
  Read and write data; file initialization
  Separate the data (1000) into two parts: learning data (70%) and testing data (30%)
Block 2. Initial data fuzzification
  Transformation from numeric values of the input attributes into linguistic values
Block 3. Design of models
  Induction of fuzzy decision trees: YS-FDT, nFDT, oFDT, sFDT
  Induction of decision trees: C4.5, C4.5p, CART, CARTp
  Statistical methods and algorithms: Bayes, kNN
Block 4. Analysis of the results
  Calculation of the decision errors
  Writing data with incorrect decisions
  Saving of the models
Algorithmic Framework for Decision Tree Induction
Illustration of a Decision Tree with Replication
Advantages and Disadvantages of Decision Trees
Advantages
1. Decision trees are self-explanatory and, when compacted, they are also easy to follow. In other words, if the decision tree has a reasonable number of leaves, it can be grasped by non-professional users. Furthermore, decision trees can be converted to a set of rules. Thus, this representation is considered comprehensible.
2. Decision trees can handle both nominal and numeric input attributes.
3. Decision tree representation is rich enough to represent any discrete-value classifier.
4. Decision trees are capable of handling datasets that may have errors.
5. Decision trees are capable of handling datasets that may have missing values.
6. Decision trees are considered to be a non-parametric method. This means that decision trees make no assumptions about the space distribution and the classifier structure.
Disadvantages
End of UNIT II