
MODULE-04: CLASSIFICATION & PREDICTION

WHAT IS CLASSIFICATION & PREDICTION?

❖ There are two forms of data analysis that can be used for extracting models describing important classes or to predict future data trends. These two forms are as follows −
➢ Classification
➢ Prediction
❖ Classification models predict categorical class labels, while prediction models predict continuous-valued functions.
❖ For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.

WHAT IS CLASSIFICATION?

❖ Following are examples of cases where the data analysis task is Classification −
➢ A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
➢ A marketing manager at a company needs to analyze a customer with a given profile, to determine whether that customer will buy a new computer.
❖ In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are risky or safe for the loan application data and yes or no for the marketing data.
WHAT IS PREDICTION?

❖ Following are examples of cases where the data analysis task is Prediction −
➢ Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company.
➢ In this example we are interested in predicting a numeric value. Therefore the data analysis task is an example of numeric prediction. In this case, a model or a predictor will be constructed that predicts a continuous-valued function or ordered value.
➢ Note − Regression analysis is a statistical methodology that is most often used for numeric prediction.

HOW DOES CLASSIFICATION WORK?

❖ With the help of the bank loan application that we have discussed above, let us understand the working of classification. The data classification process includes two steps −
➢ Building the Classifier or Model
➢ Using the Classifier for Classification
❖ Building the Classifier or Model
➢ This step is the learning step or the learning phase. In this step the classification algorithms build the classifier.
➢ The classifier is built from a training set made up of database tuples and their associated class labels.
➢ Each tuple that constitutes the training set belongs to a predefined category or class. These tuples can also be referred to as samples, objects or data points.
USING THE CLASSIFIER FOR CLASSIFICATION

❖ In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of the classification rules. The classification rules can be applied to the new data tuples if the accuracy is considered acceptable.

ISSUES REGARDING CLASSIFICATION & PREDICTION

❖ The major issue is preparing the data for Classification and Prediction. Preparing the data involves the following activities −
➢ Data Cleaning − Data cleaning involves removing the noise and treatment of missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute (a small sketch of this follows below).
➢ Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to know whether any two given attributes are related.
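A minimal sketch of the missing-value treatment described above, assuming pandas is available; the attribute column is invented for illustration, and the mode-based fill mirrors the "most commonly occurring value" rule.

```python
import pandas as pd

# Hypothetical attribute column with missing values (for illustration only).
credit_rating = pd.Series(["fair", "excellent", None, "fair", "fair", None])

# Replace each missing value with the most commonly occurring value (the mode).
most_common = credit_rating.mode()[0]          # "fair"
cleaned = credit_rating.fillna(most_common)

print(cleaned.tolist())  # ['fair', 'excellent', 'fair', 'fair', 'fair', 'fair']
```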

CLASSIFICATION & PREDICTION METHOD


ISSUES REGARDING CLASSIFICATION & PREDICTION

➢ Data Transformation and Reduction − The data can be transformed by any of the following methods.
■ Normalization − The data is transformed using normalization. Normalization involves scaling all values for a given attribute in order to make them fall within a small specified range (a minimal sketch follows below). Normalization is used when, in the learning step, neural networks or methods involving measurements are used.
■ Generalization − The data can also be transformed by generalizing it to a higher-level concept. For this purpose we can use concept hierarchies.
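A minimal sketch of min-max normalization, the scaling idea described above; the attribute values and the target range [0, 1] are assumptions made here for illustration.

```python
# Min-max normalization: rescale attribute values into a small specified range,
# here [0, 1], so no attribute dominates distance-based or neural-network learning.
values = [12000, 35000, 58000, 73000, 98000]   # hypothetical income values

v_min, v_max = min(values), max(values)
normalized = [(v - v_min) / (v_max - v_min) for v in values]

print(normalized)  # 0.0 ... 1.0
```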
CLASSIFICATION METHODS

DECISION TREE

❖ Decision Tree Mining is a type of data mining technique that is used to build classification models. It builds classification models in the form of a tree-like structure, just like its name. This type of mining belongs to supervised learning.
❖ In supervised learning, the target result is already known. Decision trees can be used for both categorical and numerical data. Categorical data represent gender, marital status, etc., while numerical data represent age, temperature, etc.

EXAMPLE OF DECISION TREE

WHAT IS THE USE OF DECISION TREE?


❖ Decision Tree is used to build classification and regression models. It is used to create data models that will predict class labels or values for the decision-making process. The models are built from the training dataset fed to the system (supervised learning).

❖ Using a decision tree, we can visualize the decisions, which makes them easy to understand; this is why it is a popular data mining technique.
CLASSIFICATION ANALYSIS

❖ A two-step process is followed to build a classification model.
➢ In the first step, i.e. learning, a classification model based on training data is built.
➢ In the second step, i.e. classification, the accuracy of the model is checked and then the model is used to classify new data.
➢ The class labels presented here are in the form of discrete values such as “yes” or “no”, “safe” or “risky”.

GENERAL APPROACH TO BUILD A CLASSIFICATION TREE

REGRESSION ANALYSIS

❖ Regression analysis is used for the prediction of numeric attributes.
❖ Numeric attributes are also called continuous values. A model built to predict continuous values instead of class labels is called a regression model. The output of regression analysis is the “mean” of all observed values of the node.

HOW DOES A DECISION TREE WORK?

❖ A decision tree is a supervised learning algorithm that works for both discrete and continuous variables. It splits the dataset into subsets on the basis of the most significant attribute in the dataset.
❖ How the decision tree identifies this attribute and how this splitting is done is decided by the algorithms.
❖ The most significant predictor is designated as the root node, splitting is done to form sub-nodes called decision nodes, and the nodes which do not split further are terminal or leaf nodes.
❖ In the decision tree, the dataset is divided into homogeneous and non-overlapping regions. It follows a top-down approach, as the top region presents all the observations at a single place which then splits into two or more branches that further split. This approach is also called a greedy approach, as it considers only the current node being worked on without focusing on the future nodes.
❖ The decision tree algorithms will continue running until a stop criterion, such as a minimum number of observations, is reached.
❖ Once a decision tree is built, many nodes may represent outliers or noisy data. The tree pruning method is applied to remove unwanted data. This improves the accuracy of the classification model.
❖ To find the accuracy of the model, a test set consisting of test tuples and class labels is used. The percentage of test set tuples correctly classified by the model identifies the accuracy of the model. If the model is found to be accurate, then it is used to classify the data tuples for which the class labels are not known.
❖ Some of the decision tree algorithms include Hunt’s Algorithm, ID3, C4.5, and CART.

EXAMPLE OF CREATING A DECISION TREE

#1) Learning Step: The training data is fed into the system to be analyzed by a classification algorithm. In this example, the class label is the attribute “loan decision”. The model built from this training data is represented in the form of decision rules.

#2) Classification: The test dataset is fed to the model to check the accuracy of the classification rule. If the model gives acceptable results, then it is applied to a new dataset with unknown class variables.
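A minimal sketch of the two-step process above (learn a classifier, then use it on new tuples), assuming scikit-learn is available; the tiny loan-style feature matrix and labels are invented purely for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training tuples: [age_code, income_code] with class labels (loan decision).
X_train = [[0, 2], [1, 1], [2, 0], [0, 0], [2, 2], [1, 0]]
y_train = ["safe", "safe", "risky", "risky", "safe", "risky"]

# Step 1 (learning): build the classifier from the training set.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X_train, y_train)

# Step 2 (classification): apply the classifier to new tuples with unknown labels.
X_new = [[1, 2], [2, 1]]
print(clf.predict(X_new))
```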

DECISION TREE INDUCTION ALGORITHM


DECISION TREE INDUCTION
❖ Decision tree induction is the method of learning the decision trees from the training
set. The training set consists of attributes and class labels. Applications of decision tree
induction include astronomy, financial analysis, medical diagnosis, manufacturing, and
production.
❖ A decision tree is a flowchart-like tree structure that is made from training set tuples. The dataset is broken down into smaller subsets and is represented in the form of the nodes of a tree. The tree structure has a root node, internal nodes or decision nodes, leaf nodes, and branches.
❖ The root node is the topmost node. It represents the best attribute selected for classification. Internal nodes, or decision nodes, represent a test of an attribute of the dataset; a leaf node, or terminal node, represents the classification or decision label. The branches show the outcome of the test performed.
❖ Some decision trees only have binary nodes, meaning exactly two branches per node, while other decision trees are non-binary.

The accompanying figure shows a decision tree for preventing a heart attack.

SPLITTING CRITERIA

ATTRIBUTE SELECTION MEASURES


Attribute selection measures are also known as splitting rules because they determine how the tuples at a given node are to be split.

An attribute selection measure is a heuristic for selecting the splitting criterion that “best” separates a given data partition, D, of class-labeled training tuples into individual classes.

If we were to split D into smaller partitions according to the outcomes of the splitting criterion, ideally each partition would be pure (i.e., all the tuples that fall into a given partition would belong to the same class).

Conceptually, the “best” splitting criterion is the one that most closely results in such a scenario.
ATTRIBUTE SELECTION MEASURES

The following are three popular attribute selection measures — INFORMATION GAIN, GAIN RATIO, AND GINI INDEX.

The notation used herein is as follows.
• Let D, the data partition, be a training set of class-labeled tuples.
• Suppose the class label attribute has m distinct values defining m distinct classes, Ci (for i = 1, . . . , m).
• Let Ci,D be the set of tuples of class Ci in D.
• Let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively.

INFORMATION GAIN

❖ This method is the main method used to build decision trees. It reduces the information that is required to classify the tuples and the number of tests that are needed to classify a given tuple. The attribute with the highest information gain is selected.
❖ ID3 uses Information Gain as its attribute selection measure.
❖ The expected information needed to classify a tuple in dataset D is given by:

Info(D) = − Σ (i = 1 to m) pi log2(pi)

INFORMATION GAIN

where pi is the non-zero probability that an arbitrary tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|.

A log function to the base 2 is used, because the information is encoded in bits.

Info(D) is just the average amount of information needed to identify the class label of a tuple in D. Info(D) is also known as the entropy of D.

Now, suppose we were to partition the tuples in D on some attribute A having v distinct values, {a1, a2, . . . , av}, as observed from the training data.

How much more information would we still need (after the partitioning) to arrive at an exact classification? InfoA(D) is the expected information required to classify a tuple from D based on the partitioning by A:

InfoA(D) = Σ (j = 1 to v) (|Dj| / |D|) × Info(Dj)
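A minimal sketch of these two formulas in Python; the function names and the list-of-class-counts representation are choices made here for illustration, not from the slides.

```python
from math import log2

def info(class_counts):
    """Info(D): entropy of a partition, given its per-class tuple counts."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

def info_a(partitions):
    """InfoA(D): expected information after partitioning D on attribute A.
    `partitions` is a list of per-class count lists, one per attribute value."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * info(p) for p in partitions)

def gain(class_counts, partitions):
    """Gain(A) = Info(D) - InfoA(D)."""
    return info(class_counts) - info_a(partitions)
```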


INFORMATION GAIN

Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A). That is,

Gain(A) = Info(D) − InfoA(D).

Problem − information gain calculation

(The slide shows the training data table, with attributes age, income, student, and credit rating, and the class label buys computer, as a figure.)

Problem − information gain calculation

In this example, each attribute is discrete-valued. (Continuous-valued attributes have been generalized.)

The class label attribute, buys computer, has two distinct values (namely, {yes, no}); therefore, there are two distinct classes (i.e., m = 2). Let class C1 = yes and C2 = no.

There are nine tuples of class yes and five tuples of class no.

A (root) node N is created for the tuples in D. To find the splitting criterion for these tuples, we must compute the information gain of each attribute. We first compute the expected information needed to classify a tuple in D:

Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits.

Next, we need to compute the expected information requirement for each attribute. Let’s start with the attribute age.

Next, find the yes and no tuples for each category of age, i.e. for “youth”, “middle aged”, and “senior”:

For the age category “youth,” there are 2 yes tuples and 3 no tuples.
For the category “middle aged,” there are 4 yes tuples and 0 no tuples.
For the category “senior,” there are 3 yes tuples and 2 no tuples.

The expected information needed to classify a tuple in D if the tuples are partitioned according to age is:

Infoage(D) = (5/14) × Info(2, 3) + (4/14) × Info(4, 0) + (5/14) × Info(3, 2) = 0.694 bits.
Problem − information gain calculation
Hence, the gain in information from such a partitioning would be
Gain(age) = Info(D) − Infoage (D) = 0.940 − 0.694 = 0.246 bits.

Similarly, we can compute


Gain(income) = 0.029 bits,
Gain(student) = 0.151 bits,
and Gain(credit rating) = 0.048 bits.
Because age has the highest information gain among the attributes, it is selected as
the splitting attribute.
For age = “middle aged”, all tuples belong to the same class. Because they all belong to class “yes,” a leaf should therefore be created at the end of this branch and labeled “yes.”
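A small check of the worked example above, reusing the counts from the slides (9 yes / 5 no overall; age splits of 2/3, 4/0, 3/2); the helper name is the one sketched earlier and is not from the slides.

```python
from math import log2

def info(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Info(D) for 9 "yes" and 5 "no" tuples.
info_d = info([9, 5])                                    # ≈ 0.940 bits

# Info_age(D): weighted entropy of the youth / middle aged / senior partitions.
splits = [[2, 3], [4, 0], [3, 2]]
info_age = sum((sum(s) / 14) * info(s) for s in splits)  # ≈ 0.694 bits

# Prints 0.94 0.694 0.247 (the slides round the gain to 0.246).
print(round(info_d, 3), round(info_age, 3), round(info_d - info_age, 3))
```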

Problem − information gain calculation

The final decision tree returned by the algorithm is shown on the slide as a figure.

WHAT IS TREE PRUNING?

❖ Pruning is the method of removing the unused branches from the decision tree. Some branches of the decision tree might represent outliers or noisy data.
❖ Tree pruning is the method to reduce the unwanted branches of the tree. This will reduce the complexity of the tree and help in effective predictive analysis. It reduces overfitting, as it removes the unimportant branches from the tree.

#1) Pre-Pruning: In this approach, the construction of the decision tree is stopped early. It means it is decided not to further partition the branches. The last node constructed becomes the leaf node, and this leaf node may hold the most frequent class among the tuples.

The attribute selection measures are used to find out the weightage of the split. Threshold values are prescribed to decide which splits are regarded as useful. If the partitioning of the node results in a split that falls below the threshold, then the process is halted.

#2) Post-Pruning: This method removes the outlier branches from a fully grown tree. The unwanted branches are removed and replaced by a leaf node denoting the most frequent class label. This technique requires more computation than pre-pruning; however, it is more reliable.

The pruned trees are more precise and compact when compared to unpruned trees, but they carry a disadvantage of replication and repetition.

Repetition occurs when the same attribute is tested again and again along a branch of a tree. Replication occurs when duplicate subtrees are present within the tree. These issues can be solved by multivariate splits.

The image on the slide shows an unpruned and a pruned tree.
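A minimal sketch of both pruning styles using scikit-learn, which is an assumption here (the slides do not name a library): pre-pruning via stopping thresholds such as max_depth and min_samples_leaf, and post-pruning via cost-complexity pruning (ccp_alpha) applied to a fully grown tree. The dataset is a stand-in.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growing early using threshold-style parameters.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

# Post-pruning: grow the full tree, then prune it back with cost-complexity pruning.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

print(pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())
```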

EXAMPLE OF DECISION TREE

Constructing a Decision Tree

Let us take an example of the last 10 days’ weather dataset with attributes outlook, temperature, wind, and humidity. The outcome variable will be playing cricket or not. We will use the ID3 algorithm to build the decision tree (a small sketch follows the advantages and disadvantages below).

https://jcsites.juniata.edu/faculty/rhodes/ida/decisionTrees.html

Advantages Of Decision Tree Classification

Enlisted below are the various merits of Decision Tree Classification:

1. Decision tree classification does not require any domain knowledge; hence, it is appropriate for the knowledge discovery process.
2. The representation of data in the form of a tree is easily understood by humans and is intuitive.
3. It can handle multidimensional data.
4. It is a quick process with great accuracy.

Disadvantages Of Decision Tree Classification

Given below are the various demerits of Decision Tree Classification:

1. Sometimes decision trees become very complex; these are called overfitted trees.
2. The decision tree algorithm may not produce an optimal solution.
3. The decision trees may return a biased solution if some class label dominates the data.
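A minimal sketch of building a tree on a weather-style dataset as described above; the data values are invented, and scikit-learn’s DecisionTreeClassifier with the entropy criterion is used as a stand-in for ID3 (it is CART-based, not exactly ID3), with a simple ordinal encoding of the categorical attributes.

```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented 10-day weather dataset: outlook, temperature, wind, humidity -> play cricket?
rows = [
    ["sunny", "hot", "weak", "high", "no"],
    ["sunny", "hot", "strong", "high", "no"],
    ["overcast", "hot", "weak", "high", "yes"],
    ["rain", "mild", "weak", "high", "yes"],
    ["rain", "cool", "weak", "normal", "yes"],
    ["rain", "cool", "strong", "normal", "no"],
    ["overcast", "cool", "strong", "normal", "yes"],
    ["sunny", "mild", "weak", "high", "no"],
    ["sunny", "cool", "weak", "normal", "yes"],
    ["rain", "mild", "weak", "normal", "yes"],
]
X_raw = [r[:4] for r in rows]
y = [r[4] for r in rows]

# Encode the categorical attributes as integers, then fit an entropy-based tree.
encoder = OrdinalEncoder()
X = encoder.fit_transform(X_raw)
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

print(export_text(tree, feature_names=["outlook", "temperature", "wind", "humidity"]))
```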
BAYESIAN CLASSIFICATION

❖ The Bayes theorem is named after Thomas Bayes, who proposed it.
❖ It is a statistical method and a supervised learning method for classification.
❖ It can solve problems involving both categorical and continuous valued attributes.
❖ Bayesian classification is used to find conditional probabilities.
❖ “What are Bayesian classifiers?” Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
❖ Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class-conditional independence. It is made to simplify the computations involved and, in this sense, is considered “naive.”

BAYES THEOREM

In Bayesian terms, X is considered “evidence.” As usual, it is described by measurements made on a set of n attributes.

Let H be some hypothesis, such as that the data tuple X belongs to a specified class C.

For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the “evidence” or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X.

P(H|X) is the posterior probability of H conditioned on X.

For example, suppose our world of data tuples is confined to customers described by the attributes age and income, and that X is a 35-year-old customer with an income of $40,000. Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer’s age and income.

P(H) is the prior probability of H. For our example, this is the probability that any given customer will buy a computer, regardless of age, income, or any other information, for that matter. The prior probability, P(H), is independent of X.
BAYES THEOREM

P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer.

P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000.

Bayes theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X). Bayes theorem is:

P(H|X) = P(X|H) P(H) / P(X)

NAÏVE BAYES THEOREM

The naive Bayesian classifier is also called the simple Bayesian classifier.

Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci).

To reduce computation in evaluating P(X|Ci), the naive assumption of class-conditional independence is made. This presumes that the attributes’ values are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes).

Naive Bayes is called naive because it assumes that each input variable is independent. This is a strong assumption and unrealistic for real data; however, the technique is very effective on a large range of complex problems.
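A minimal sketch of the naive assumption in code: P(X|Ci) is taken as the product of the individual attribute probabilities, and Bayes theorem then scores each class. The tiny categorical dataset and attribute names are invented for illustration.

```python
from collections import Counter, defaultdict

# Invented training tuples: (age, student) -> buys_computer
data = [
    (("youth", "yes"), "yes"), (("youth", "no"), "no"),
    (("senior", "no"), "no"),  (("middle", "yes"), "yes"),
    (("middle", "no"), "yes"), (("senior", "yes"), "yes"),
]

priors = Counter(label for _, label in data)     # counts used for P(Ci)
cond = defaultdict(Counter)                      # counts used for P(attribute value | Ci)
for attrs, label in data:
    for i, value in enumerate(attrs):
        cond[label][(i, value)] += 1

def score(x, label):
    """P(Ci) * product of P(x_i | Ci), using class-conditional independence."""
    p = priors[label] / len(data)
    for i, value in enumerate(x):
        p *= cond[label][(i, value)] / priors[label]
    return p

x_new = ("youth", "yes")
print(max(priors, key=lambda c: score(x_new, c)))   # predicted class for the new tuple
```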
NAIVE BAYES CLASSIFIER EXAMPLE

(A worked example is shown on the slide as a figure.)

RULE-BASED CLASSIFICATION

It is characterized by building rules based on object attributes.
A rule-based classifier makes use of a set of IF-THEN rules for classification.
We can express a rule in the following form:
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes THEN buy_computer = yes
The IF part of the rule is called the rule antecedent or precondition.
The THEN part of the rule is called the rule consequent (conclusion).
The antecedent (IF) part, the condition, consists of one or more attribute tests, and these tests are logically ANDed.
The consequent (THEN) part consists of the class prediction.
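A minimal sketch of applying IF-THEN rules like R1; the rule representation, the second rule, and the default class are choices made here for illustration.

```python
# Each rule: (antecedent as attribute -> required value, consequent class).
rules = [
    ({"age": "youth", "student": "yes"}, "buy_computer=yes"),          # R1 from the slides
    ({"age": "senior", "credit_rating": "fair"}, "buy_computer=no"),   # hypothetical R2
]

def classify(tuple_, rules, default="buy_computer=no"):
    for antecedent, consequent in rules:
        # All attribute tests in the antecedent are logically ANDed.
        if all(tuple_.get(attr) == value for attr, value in antecedent.items()):
            return consequent
    return default

print(classify({"age": "youth", "student": "yes", "credit_rating": "fair"}, rules))
```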
NEURAL NETWORK
The Artificial Neural Network (ANN) bases its assimilation of data on the
way that the human brain processes information. The brain has billions of
cells called neurons that process information in the form of electric
signals. External information, or stimuli, is received, after which the brain
processes it, and then produces an output.

PREDICTION METHODS
LINEAR AND NONLINEAR REGRESSION
❖ It is the simplest form of regression. Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.

❖ Linear regression attempts to find the mathematical relationship between variables.

❖ If the outcome is a straight line, then it is considered a linear model; if it is a curved line, then it is a non-linear model.

❖ The relationship between the dependent variable and the single independent variable is given by a straight line:

Y = a + bX

❖ The model 'Y' is a linear function of 'X'.

❖ The value of 'Y' increases or decreases in a linear manner as the value of 'X' changes.
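A minimal sketch of fitting Y = a + bX by least squares with NumPy; the x and y observations are invented for illustration.

```python
import numpy as np

# Invented observations of an independent variable X and a dependent variable Y.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit Y = a + bX by least squares (degree-1 polynomial returns slope, then intercept).
b, a = np.polyfit(x, y, 1)

print(f"Y = {a:.2f} + {b:.2f}X")   # intercept a and slope b
print(a + b * 6.0)                 # predict Y for a new X value
```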

MULTIPLE LINEAR REGRESSION

❖ Multiple linear regression is an extension of linear regression analysis.
❖ It uses two or more independent variables to predict an outcome, with a single continuous dependent variable.

Y = a0 + a1X1 + a2X2 + ... + akXk + e

where,
'Y' is the response variable.
X1, X2, ..., Xk are the independent predictors.
'e' is the random error.
a0, a1, a2, ..., ak are the regression coefficients.

LOGISTIC REGRESSION

❖ Logistic Regression was used in the biological sciences in the early twentieth century. It was then used in many social science applications. Logistic Regression is used when the dependent variable (target) is categorical.

❖ For example,
➢ To predict whether an email is spam (1) or not (0)
➢ Whether a tumor is malignant (1) or not (0)
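A minimal sketch of a categorical-target model as described above, using scikit-learn’s LogisticRegression (an assumption; the slides do not name a library) on invented spam-style features.

```python
from sklearn.linear_model import LogisticRegression

# Invented features per email: [number of links, number of "free"/"win" words].
X = [[0, 0], [1, 0], [7, 5], [6, 4], [0, 1], [8, 6]]
y = [0, 0, 1, 1, 0, 1]          # 1 = spam, 0 = not spam

model = LogisticRegression().fit(X, y)

print(model.predict([[5, 3]]))          # predicted class for a new email
print(model.predict_proba([[5, 3]]))    # class membership probabilities
```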
ACCURACY & ERROR MEASURES
CONFUSION MATRIX

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.

The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing.

There are two possible predicted classes: "yes" and "no". If we were predicting the presence of a disease, for example, "yes" would mean they have the disease, and "no" would mean they don't have the disease.

The classifier made a total of 165 predictions (e.g., 165 patients were being tested for the presence of that disease).

Out of those 165 cases, the classifier predicted "yes" 110 times, and "no" 55 times.

In reality, 105 patients in the sample have the disease, and 60 patients do not.

CONFUSION MATRIX


true positives (TP): These are cases in which we predicted yes (they have the disease),
and they do have the disease.

true negatives (TN): We predicted no, and they don't have the disease.

false positives (FP): We predicted yes, but they don't actually have the disease.
(Also known as a "Type I error.")

false negatives (FN): We predicted no, but they actually do have the disease.
(Also known as a "Type II error.")
CONFUSION MATRIX

Accuracy: Overall, how often is the classifier correct?
(TP+TN)/total = (100+50)/165 = 0.91

Misclassification Rate: Overall, how often is it wrong?
(FP+FN)/total = (10+5)/165 = 0.09
Equivalent to 1 minus Accuracy; also known as "Error Rate".

True Positive Rate: When it's actually yes, how often does it predict yes?
TP/actual yes = 100/105 = 0.95
Also known as "Sensitivity" or "Recall".

False Positive Rate: When it's actually no, how often does it predict yes?
FP/actual no = 10/60 = 0.17

True Negative Rate: When it's actually no, how often does it predict no?
TN/actual no = 50/60 = 0.83
Equivalent to 1 minus False Positive Rate; also known as "Specificity".

CONFUSION MATRIX

Precision: When it predicts yes, how often is it correct?
TP/predicted yes = 100/110 = 0.91

Prevalence: How often does the yes condition actually occur in our sample?
actual yes/total = 105/165 = 0.64
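A small recomputation of the metrics above from the counts given in the example (TP = 100, TN = 50, FP = 10, FN = 5), sketched in plain Python.

```python
# Counts from the worked example: 165 patients in total.
TP, TN, FP, FN = 100, 50, 10, 5
total = TP + TN + FP + FN                     # 165
actual_yes, actual_no = TP + FN, TN + FP      # 105, 60

print("accuracy           ", (TP + TN) / total)   # ≈ 0.91
print("misclassification  ", (FP + FN) / total)   # ≈ 0.09
print("true positive rate ", TP / actual_yes)     # ≈ 0.95 (sensitivity / recall)
print("false positive rate", FP / actual_no)      # ≈ 0.17
print("true negative rate ", TN / actual_no)      # ≈ 0.83 (specificity)
print("precision          ", TP / (TP + FP))      # ≈ 0.91
print("prevalence         ", actual_yes / total)  # ≈ 0.64
```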
