Each training sample is assumed to belong to a predefined class, as determined by the class label attribute. The
given data set is divided into training and test sets, with the training set
used to build the model and the test set used to validate it.
[Figure: illustrating the classification task. A training set of tuples (Tid, Attrib1, Attrib2, Attrib3, Class) is used to induce a model ("Induction"), and the model is then applied ("Apply Model") to a test set whose Class values are unknown ("?").]
Numeric prediction models continuous-valued functions, i.e., it predicts unknown or missing values (e.g., by non-linear regression).
Typical applications
Credit approval
Target marketing
Medical diagnosis
Fraud detection
The model is represented as classification rules, decision trees, or mathematical formulae.
Model usage: for classifying future or unknown objects
Estimate the accuracy of the model
The known label of each test sample is compared with the classified result from
the model
The accuracy rate is the percentage of test set samples that are correctly classified by the model
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
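A minimal sketch of this holdout procedure, assuming scikit-learn; the synthetic data set stands in for any labeled data.

```python
# Holdout evaluation: induce the model on a training set,
# then estimate its accuracy on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=200, random_state=0)

# Divide the given data set into training (70%) and test (30%) sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Compare the known test labels with the classified results.
acc = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy rate: {acc:.1%}")
```

If the measured accuracy is acceptable, the same fitted model would then be applied to tuples whose class labels are unknown.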
[Figure: model construction and use. Training Data are fed to a Classification Algorithm to build a Classifier, which is then applied to Testing Data and to Unseen Data.]

Training data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen tuple: (Jeff, Professor, 4), Tenured = ?
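The classifier traditionally shown with this example is the rule IF rank = "Professor" OR years > 6 THEN tenured = "yes". Treating that rule as an assumption, a toy sketch of applying it to the unseen tuple:

```python
# Hypothetical hand-written classifier for the tenure example; the rule
# (rank = "Professor" OR years > 6 -> tenured = "yes") is assumed here,
# and it need not fit every training tuple (Merlisa is an exception).
def tenured(rank: str, years: int) -> str:
    return "yes" if rank == "Professor" or years > 6 else "no"

print(tenured("Professor", 4))  # Jeff -> "yes"
```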
Issues: Data Preparation
Data cleaning
Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data
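A minimal sketch of these three preparation steps, assuming pandas; the column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "income":   [70_000, 120_000, None, 60_000],
    "age":      [25, 38, 45, 52],
    "age_copy": [25, 38, 45, 52],  # redundant attribute
})

# Data cleaning: handle missing values (here, median imputation).
df["income"] = df["income"].fillna(df["income"].median())

# Relevance analysis: remove the redundant attribute.
df = df.drop(columns=["age_copy"])

# Data transformation: min-max normalize income to [0, 1].
lo, hi = df["income"].min(), df["income"].max()
df["income"] = (df["income"] - lo) / (hi - lo)
print(df)
```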
Issues: Evaluating Classification Methods
Speed: time to construct the model (training time) and time to use the model (classification time)
Decision Tree
An internal node is a test on an attribute.
A branch represents an outcome of the test, e.g., Color=red.
A leaf node represents a class label or class distribution.
At each node, one attribute is chosen to split the training examples into classes that are as pure as possible.
A new instance is classified by following a matching path from the root of the decision tree to a leaf node.
[Figure: decision tree whose root tests age?, with branches <=30, 31..40, and >40 leading to subtrees and to leaf nodes labeled yes or no.]
Attribute selection measures include information gain, gain ratio, and the Gini index.
Heuristic: choose the attribute that produces the "purest" nodes.
Popular impurity criterion: information gain. Information gain increases with the average purity of the subsets an attribute produces.
For a node with p positive and n negative examples, the expected information needed to classify a sample is

$$ I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n} $$
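A short sketch computing I(p, n) as defined above; the (9, 5) class distribution used in the check is only illustrative.

```python
from math import log2

def info(p: int, n: int) -> float:
    """Expected information I(p, n) for p positive and n negative examples."""
    total = p + n
    bits = 0.0
    for count in (p, n):
        if count:  # treat 0 * log2(0) as 0
            frac = count / total
            bits -= frac * log2(frac)
    return bits

print(f"{info(9, 5):.3f}")  # 0.940
```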
Examples of the Gini index for two-class node distributions:

C1  C2  Gini
0   6   0.000
1   5   0.278
2   4   0.444
3   3   0.500
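These values follow from Gini = 1 - sum of squared class proportions; a small sketch reproduces the table.

```python
def gini(c1: int, c2: int) -> float:
    """Gini index of a node holding c1 examples of C1 and c2 of C2."""
    total = c1 + c2
    return 1.0 - (c1 / total) ** 2 - (c2 / total) ** 2

for c1, c2 in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(f"C1={c1} C2={c2}  Gini={gini(c1, c2):.3f}")
# prints Gini = 0.000, 0.278, 0.444, 0.500
```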
Overfitting: an induced tree may fit noise or outliers in the training data; the result is poor accuracy for unseen samples.
Extracting classification rules from trees:
One rule is created for each path from the root to a leaf.
Rules are easier to understand than large trees.
Example: IF age <= 30 AND student = no THEN buys_computer = no (see the sketch below).
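A minimal rule-extraction sketch, assuming scikit-learn's export_text; the tiny training set and feature names are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: [age, student(0/1)] -> buys_computer(0/1); values are made up.
X = [[25, 0], [28, 1], [35, 0], [45, 1], [50, 0]]
y = [0, 1, 1, 1, 0]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each printed path from the root to a leaf corresponds to one IF-THEN rule.
print(export_text(tree, feature_names=["age", "student"]))
```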
Instance-based learning:
Store training examples and delay the processing (“lazy evaluation”)
until a new instance must be classified
Typical approaches
k-nearest neighbor approach
Instances represented as points in a Euclidean space.
Locally weighted regression
Constructs local approximation
Case-based reasoning
Uses symbolic representations and knowledge-based inference
[Figure: a query point xq plotted among positive (+) and negative (-) training examples; xq is classified by a majority vote of its k nearest neighbors.]
Discussion on the k-NN Algorithm
k-NN can also approximate continuous-valued target functions: return the mean value of the k nearest training examples rather than a majority vote.
Multiple retrieved cases may be combined to form the answer.
Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes in a high-dimensional instance space.
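A minimal k-NN sketch under Euclidean distance with majority voting; the training points and the query xq are made up.

```python
from math import dist
from collections import Counter

def knn_classify(query, examples, k=3):
    """Majority vote among the k training examples nearest to `query`."""
    nearest = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((1.1, 1.3), "+"),
         ((3.0, 3.2), "-"), ((2.9, 3.0), "-")]
xq = (1.5, 1.0)
print(knn_classify(xq, train))  # -> "+"
```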
Other classification approaches include Bayesian classification, classification by backpropagation, genetic algorithms, the rough set approach, and fuzzy set approaches.

Bayesian Classification
Bayes' theorem:

$$ P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)} $$

Informally, this can be written as: posterior = likelihood × prior / evidence.
The classifier predicts that X belongs to class Ci iff the probability P(Ci|X) is the highest among all the
P(Ck|X) for the k classes.
Practical difficulty: this requires initial knowledge of many probabilities and carries significant
computational cost.
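A tiny worked application of the theorem; the prior and likelihood values are made-up numbers.

```python
# Assumed for illustration: P(H) = 0.01, P(X|H) = 0.9, P(X|not H) = 0.1.
p_h, p_x_given_h, p_x_given_not_h = 0.01, 0.9, 0.1

# Evidence by total probability: P(X) = P(X|H)P(H) + P(X|not H)P(not H).
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

# Posterior = likelihood * prior / evidence.
print(p_x_given_h * p_h / p_x)  # ~0.083
```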
Towards Naïve Bayesian Classifier
Let D be a training set of tuples and their associated class labels, and each
tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum a posteriori class, i.e., the class Ci with maximal P(Ci|X).
This can be derived from Bayes' theorem:

$$ P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)} $$

Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.
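A minimal naïve Bayesian classifier sketch for categorical attributes: it maximizes P(X|Ci)P(Ci), factoring P(X|Ci) into per-attribute probabilities under the class-conditional independence assumption. The weather-style tuples are hypothetical and no smoothing is applied.

```python
from collections import Counter, defaultdict

def train_nb(data):
    """data: list of (attribute_tuple, class_label) pairs."""
    priors = Counter(label for _, label in data)  # class counts
    cond = defaultdict(Counter)                   # (attr_i, class) -> value counts
    for x, label in data:
        for i, value in enumerate(x):
            cond[(i, label)][value] += 1
    return priors, cond, len(data)

def classify(x, priors, cond, n):
    best, best_score = None, -1.0
    for label, count in priors.items():
        score = count / n                              # prior P(Ci)
        for i, value in enumerate(x):
            score *= cond[(i, label)][value] / count   # P(xi | Ci), unsmoothed
        if score > best_score:
            best, best_score = label, score
    return best

data = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
        (("rain", "mild"), "yes"), (("overcast", "hot"), "yes"),
        (("rain", "cool"), "yes")]
print(classify(("rain", "hot"), *train_nb(data)))  # -> "yes"
```

In practice, Laplace smoothing is added to the per-attribute estimates so that a single unseen attribute value does not zero out the whole product.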