
CLASSIFICATION & PREDICTION

Introduction:
• Classification and Prediction are forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
• Classification describes data classes and predicts categorical (discrete) labels.
• For example, a classification model may be built to categorize bank loan applications as either safe or risky.
• Prediction models continuous-valued functions and predicts future data trends.
• For example, a prediction model may be built to predict the expenditures of potential customers on computer equipment given their income and occupation.
• Many classification and prediction methods have been proposed by researchers in machine learning, expert systems, statistics, and neurobiology.
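The prediction example above (expenditure as a continuous-valued function of income) can be sketched as a simple least-squares regression. The income and spending figures below are made up purely for illustration; a real model would be trained on actual customer data.

```python
# A minimal sketch of prediction as regression: fit a line mapping
# income to computer-equipment expenditure (all figures hypothetical).

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

incomes = [30, 50, 70, 90]        # thousands of dollars (hypothetical)
spending = [0.8, 1.4, 2.0, 2.6]   # thousands of dollars (hypothetical)
a, b = fit_line(incomes, spending)
predicted = a * 60 + b            # a continuous-valued prediction for income 60
```

Unlike a classifier, the model outputs a number (an expected expenditure), not a class label.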
Classification:
• Data classification is a two-step process.
• In the first step, a model is built describing a predetermined set of data classes or concepts.
• The model is constructed by analyzing database tuples described by attributes.
• Each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label attribute.
• In the context of classification, data tuples are also referred to as samples, examples, or objects.
• The data tuples analyzed to build the model collectively form the training data set.
• The individual tuples making up the training set are referred to as training samples and are randomly selected from the sample population.
• In the second step, the model is used for classification.
• First, the predictive accuracy of the model is estimated.
• The accuracy of a model on a given test set is the percentage of test set samples that are correctly classified by the model.
• If the accuracy of the model is considered acceptable, the model can be used to classify future data tuples for which the class label is not known.
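The accuracy estimate described above can be sketched as a short function: the percentage of test-set samples the model classifies correctly. The model and test set here are hypothetical, chosen only to illustrate the computation.

```python
# Accuracy = percentage of test-set samples correctly classified (sketch).

def accuracy(model, test_set):
    correct = sum(model(sample) == label for sample, label in test_set)
    return 100.0 * correct / len(test_set)

def model(sample):
    # Hypothetical hand-made loan classifier for illustration only.
    return "safe" if sample["income"] > 50 else "risky"

tests = [({"income": 80}, "safe"),
         ({"income": 30}, "risky"),
         ({"income": 60}, "risky")]
acc = accuracy(model, tests)   # 2 of 3 correct
```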
Classification Process (1): Model Construction

Training data is fed to a classification algorithm, which outputs a classifier (model):

  NAME  RANK            YEARS  TENURED
  Mike  Assistant Prof  3      no
  Mary  Assistant Prof  7      yes
  Bill  Professor       2      yes
  Jim   Associate Prof  7      yes
  Dave  Assistant Prof  6      no
  Anne  Associate Prof  3      no

Learned classifier (model):
  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
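The learned rule can be written out as a plain Python function, a hand-coded sketch of the slide's classifier rather than the output of a real learner. On the training data it reproduces every class label.

```python
# The classifier from the model-construction step, as a function (sketch).

def tenured(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]
# The rule reproduces every training label.
all_match = all(tenured(rank, years) == label
                for _, rank, years, label in training)
```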
Classification Process (2): Use the Model in Prediction

The classifier is first evaluated on testing data, then applied to unseen data:

  NAME     RANK            YEARS  TENURED
  Tom      Assistant Prof  2      no
  Merlisa  Associate Prof  7      no
  George   Professor       5      yes
  Joseph   Assistant Prof  7      yes

Unseen data: (Jeff, Professor, 4) → Tenured?
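Step two of the process can be sketched on the slide's data: estimate accuracy on the testing tuples, then classify the unseen tuple. Note that Merlisa satisfies the rule's `years > 6` condition but is labeled "no", so the rule misclassifies her.

```python
# Using the learned rule (sketch) on the test set and an unseen tuple.

def tenured(rank, years):
    # Rule learned in the model-construction step.
    return "yes" if rank == "Professor" or years > 6 else "no"

testing = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]
correct = sum(tenured(r, y) == label for _, r, y, label in testing)
accuracy = 100.0 * correct / len(testing)   # Merlisa is misclassified

# If the accuracy is acceptable, classify the unseen tuple (Jeff, Professor, 4):
prediction = tenured("Professor", 4)
```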
Supervised Learning VS Unsupervised Learning:
• Since the class label of each training sample is provided, this step is also known as supervised learning, i.e., the learning of the model is 'supervised' in that it is told to which class each training sample belongs.
• In unsupervised learning or clustering, the class labels of the training
samples are not known, and the number or set of classes to be
learned may not be known in advance.
Prediction:
• Prediction can be viewed as the construction and use of a model to
assess the class of an unlabeled object, or to assess the value or value
ranges of an attribute that a given object is likely to have.
•In this view, classification and regression are the two major types of
prediction problems where classification is used to predict discrete or
nominal values, while regression is used to predict continuous or
ordered values.

• Classification and prediction have numerous applications including credit approval, medical diagnosis, performance prediction, and selective marketing.
All Electronics Store:
• Example (Classification): to classify customers for promotional advertisements.
• Example (Prediction): to predict a customer's major purchases in the coming year.
Issues Regarding Classification and Prediction:
Preparing the data for Classification and Prediction:
1. Data Cleaning:
• This refers to the preprocessing of data in order to remove or reduce
noise by applying smoothing techniques.
• It also includes the treatment of missing values, e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics.
• Although most classification algorithms have some mechanisms for
handling noisy or missing data, this step can help reduce confusion
during learning.
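The missing-value treatment just described can be sketched in a few lines: fill each missing entry of an attribute with that attribute's most commonly occurring value. The credit-rating values below are illustrative.

```python
from collections import Counter

# Data-cleaning sketch: impute missing values (None) with the most
# commonly occurring value of the attribute.

def impute_with_mode(values):
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]  # most frequent value
    return [mode if v is None else v for v in values]

credit = ["fair", "excellent", None, "fair", None, "fair"]
cleaned = impute_with_mode(credit)   # None entries become "fair"
```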
2. Relevance Analysis:
• Many of the attributes in the data may be irrelevant to the classification or
prediction task.
• For example, data recording the day of the week on which a bank loan application was filed is unlikely to be relevant to the success of the application.
• Furthermore, other attributes may be redundant.

• Hence, relevance analysis may be performed on the data with the aim of
removing any irrelevant or redundant attributes from the learning process.
•In machine learning, this step is known as feature selection.

•Including such attributes may otherwise slow down, and possibly mislead, the
learning step.
•Such analysis can help improve classification efficiency and scalability.
3. Data Transformation:
• The data can be generalized to higher-level concepts.
•Concept hierarchies may be used for this purpose.
•For example, numeric values for the attribute income may be
generalized to discrete ranges such as low, medium, and high.
• Similarly, nominal-valued attributes, like street, can be generalized
to higher-level concepts, like city.
•The data may also be normalized. Normalization involves scaling all
values for a given attribute so that they fall within a small specified
range, such as -1.0 to 1.0, or 0 to 1.0.
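Both transformations described above can be sketched briefly: min-max normalization scales an attribute into a small range such as [0, 1], and concept-hierarchy generalization maps numeric income to low / medium / high. The cut points 40 and 80 (thousands) are arbitrary, chosen only for illustration.

```python
# Data-transformation sketches: min-max normalization and generalizing
# numeric income to discrete ranges (cut points are illustrative).

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def generalize_income(value):
    if value < 40:
        return "low"
    if value < 80:
        return "medium"
    return "high"

incomes = [25, 40, 60, 100]
scaled = min_max_normalize(incomes)                # values in [0, 1]
levels = [generalize_income(v) for v in incomes]   # low / medium / high
```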
Comparing Classification Methods:
1. Predictive accuracy: the ability of the model to correctly predict the class label of new or previously unseen data.
2. Speed: the computation costs involved in generating and using the model.
3. Robustness: the ability of the model to make correct predictions given noisy data or data with missing values.
4. Scalability: the ability to construct the model efficiently given large amounts of data.
5. Interpretability: the level of understanding and insight that is provided by the learned model.
Classification by Decision Tree Induction:
What is a decision tree?
• A decision tree is a flow-chart-like tree structure, where each
internal node denotes a test on an attribute, each branch represents
an outcome of the test, and leaf nodes represent classes or class
distributions.
• The topmost node in a tree is the root node.
Output: A Decision Tree for "buys_computer"

  age?
    <=30    → student?        (no → no; yes → yes)
    31…40   → yes
    >40     → credit_rating?  (excellent → yes; fair → no)

(Figure from Data Mining: Concepts and Techniques.)


• A typical decision tree is shown in the figure above.
• It represents the concept buys computer, that is, it predicts whether
or not a customer at All Electronics is likely to purchase a computer.
• Internal nodes are denoted by rectangles, and leaf nodes are
denoted by ovals.
• In order to classify an unknown sample, the attribute values of the
sample are tested against the decision tree.
• A path is traced from the root to a leaf node which holds the class
prediction for that sample.
• Decision trees can easily be converted to classification rules.
Decision Tree Induction:
• The basic algorithm for decision tree induction is a greedy
algorithm which constructs decision trees in a top-down
recursive divide-and-conquer manner.
Basic algorithm (a greedy algorithm):
• Tree is constructed in a top-down recursive divide-and-conquer
manner.
• At start, all the training examples are at the root.
• Attributes are categorical (if continuous-valued, they are discretized in advance).
• Examples are partitioned recursively based on selected attributes.
• Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain).
Conditions for stopping partitioning:
• All samples for a given node belong to the same class.
• There are no remaining attributes for further partitioning – majority
voting is employed for classifying the leaf.
• There are no samples left.
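The greedy top-down algorithm above can be sketched as a compact ID3-style learner. The data encodes the tenured example from earlier slides, with the continuous `years` attribute discretized in advance (as the algorithm requires); attribute names and encoding are my own illustration. Branches are only created for attribute values present in the data, so the empty-partition stopping case does not arise in this sketch.

```python
from collections import Counter
from math import log2

# A compact sketch of greedy, top-down, divide-and-conquer decision tree
# induction using information gain (ID3-style; data is illustrative).

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    n = len(rows)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:            # stop: all samples in one class
        return labels[0]
    if not attrs:                        # stop: no attributes -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    branches = {}
    for value in set(row[best] for row in rows):
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        branches[value] = build_tree([r for r, _ in sub],
                                     [l for _, l in sub],
                                     [a for a in attrs if a != best])
    return (best, branches)

def classify(tree, row):
    while isinstance(tree, tuple):       # descend until a leaf is reached
        attr, branches = tree
        tree = branches[row[attr]]
    return tree

rows = [{"rank": "assistant", "years>6": "no"},
        {"rank": "assistant", "years>6": "yes"},
        {"rank": "professor", "years>6": "no"},
        {"rank": "associate", "years>6": "yes"},
        {"rank": "assistant", "years>6": "no"},
        {"rank": "associate", "years>6": "no"}]
labels = ["no", "yes", "yes", "yes", "no", "no"]
tree = build_tree(rows, labels, ["rank", "years>6"])
```

On this data the root test is `years>6` (the higher information gain), with `rank` used to split the remaining mixed partition.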
Attribute Selection Measure:
• The information gain measure is used to select the test attribute at
each node in the tree.
• Such a measure is referred to as an attribute selection measure or a
measure of the goodness of split.
• The attribute with the highest information gain (greatest entropy reduction) is chosen as the test attribute for the current node.
• This attribute minimizes the information needed to classify the samples in the resulting partitions and reflects the least randomness or "impurity" in these partitions.
Tree pruning:
• When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers.
• Tree pruning methods address this problem of overfitting the data.
•Such methods typically use statistical measures to remove the least
reliable branches, generally resulting in faster classification and an
improvement in the ability of the tree to correctly classify
independent test data.
• There are two common approaches to tree pruning.
1. Prepruning Approach:
• In the prepruning approach, a tree is "pruned" by halting its construction early, e.g., by deciding not to further split or partition the subset of training samples at a given node.
• Upon halting, the node becomes a leaf.
• When constructing a tree, measures such as statistical significance, chi-square (χ²), information gain, etc., can be used to assess the goodness of a split.
2. Postpruning Approach:
• The postpruning approach removes branches from a "fully grown" tree.
• A tree node is pruned by removing its branches; the lowest unpruned node becomes a leaf.
Extracting Classification Rules from Trees:
• The knowledge represented in decision trees can be extracted
and represented in the form of IF-THEN rules.
• One rule is created for each path from the root to a leaf node.
• Each attribute-value pair along a path forms a conjunction in the
rule antecedent (“IF” part).
• The leaf node holds the class prediction forming the rule
consequent (“Then” part).



Example:
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "no"
