Data Mining - Classification & Prediction
2 Forms of Data Analysis in Extracting Models
• CLASSIFICATION
• PREDICTION
• Classification models predict categorical class labels; and prediction
models predict continuous valued functions.
• For example, we can build a classification model to categorize bank
loan applications as either safe or risky, or a prediction model to
predict the expenditures in dollars of potential customers on
computer equipment given their income and occupation.
What is classification?
• A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
• A marketing manager at a company needs to analyze a customer with a given profile to determine whether that customer will buy a new computer.
• In both of the above examples, a model or classifier is constructed to predict the categorical labels. These labels are risky or safe for loan application data and yes or no for marketing data.
What is prediction?
• Suppose the marketing manager needs to predict how much a given customer will spend during a sale at his company. In this example we need to predict a numeric value. Therefore the data analysis task is an example of numeric prediction. In this case, a model or a predictor will be constructed that predicts a continuous-valued function or ordered value.
• Note − Regression analysis is a statistical methodology that is most often used for numeric prediction.
Typical applications
• Credit approval
• Target marketing
• Medical diagnosis
• Fraud detection
How Does Classification Work?
The Data Classification process includes two steps −
• Building the Classifier or Model
• Using Classifier for Classification
Building the Classifier or Model
Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of the classification rules. The classification rules can be applied to new data tuples if the accuracy is considered acceptable.
Classification and Prediction Issues
• Data Cleaning − Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.
• Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are related.
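The missing-value treatment described above can be sketched in a few lines of Python. This is only an illustration: the function name `fill_missing_with_mode` and the sample `occupation` values are hypothetical, and `None` is assumed to mark a missing value.

```python
from collections import Counter

def fill_missing_with_mode(values, missing=None):
    """Replace missing entries with the most commonly occurring observed value."""
    observed = [v for v in values if v is not missing]
    if not observed:
        return list(values)  # nothing observed, so there is no mode to use
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is missing else v for v in values]

# Hypothetical 'occupation' attribute with two missing entries (None).
occupation = ["clerk", None, "engineer", "clerk", None, "teacher"]
cleaned = fill_missing_with_mode(occupation)
print(cleaned)
```

Here "clerk" is the most frequent observed value, so both missing entries are replaced by it.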
• Data Transformation and reduction − The data can be transformed by
any of the following methods.
• Normalization − The data is transformed using normalization. Normalization involves scaling all values of a given attribute so that they fall within a small specified range. Normalization is used when the learning step involves neural networks or methods based on distance measurements.
• Generalization − The data can also be transformed by generalizing it to a higher-level concept. For this purpose we can use concept hierarchies.
• Note − Data can also be reduced by some other methods such as
wavelet transformation, binning, histogram analysis, and clustering.
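One common way to scale an attribute into a small range is min-max normalization. The slide does not name a specific method, so the sketch below is an assumption, and the `incomes` values are hypothetical.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Scale values linearly so they fall within [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to the lower bound.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

# Hypothetical income attribute.
incomes = [12000, 35000, 58000, 98000]
scaled = min_max_normalize(incomes)
print(scaled)
```

The smallest income maps to 0.0 and the largest to 1.0, with the others in between.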
Comparison of Classification and Prediction Methods
• Accuracy − The accuracy of a classifier refers to its ability to predict the class label correctly. The accuracy of a predictor refers to how well it can guess the value of the predicted attribute for new data.
• Speed − This refers to the computational cost of generating and using the classifier or predictor.
• Robustness − This refers to the ability of the classifier or predictor to make correct predictions from noisy data.
• Scalability − This refers to the ability to construct the classifier or predictor efficiently, given a large amount of data.
• Interpretability − This refers to the extent to which the classifier or predictor can be understood.
Data Mining - Decision Tree Induction
• A decision tree is a structure that includes a root node, branches, and
leaf nodes.
• Each internal node denotes a test on an attribute, each branch
denotes the outcome of a test, and each leaf node holds a class label.
• The topmost node in the tree is the root node.
Decision Tree Induction
The following decision tree is for the concept buy_computer, which indicates whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a class.
Problem
• Katara is undecided about whether she will visit her parents and join them to watch a movie at the cinema. If not, she plans to play tennis if the weather is sunny. However, if it is a windy day and she has enough money, she will go to the grocery store; otherwise she will go to the game arcade. She will stay at home if it is a rainy day.
• Draw a decision tree
PARENTS VISITING
• yes: cinema
• no: WEATHER
  • sunny: play tennis
  • windy: MONEY
    • rich: shop for groceries
    • poor: play in game arcade
  • rainy: stay in
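The tree for this problem can also be read as nested attribute tests. The following Python sketch is one way to write it; the function name, argument names, and return strings are illustrative choices, not part of the original problem.

```python
def katara_activity(parents_visiting, weather, rich=False):
    """Walk the decision tree from the root test down to a leaf."""
    if parents_visiting:              # root node: PARENTS VISITING
        return "cinema"
    if weather == "sunny":            # second test: WEATHER
        return "play tennis"
    if weather == "windy":            # third test: MONEY
        return "shop for groceries" if rich else "play in game arcade"
    if weather == "rainy":
        return "stay in"
    raise ValueError(f"unknown weather: {weather!r}")

print(katara_activity(False, "windy", rich=True))
```

Each `return` corresponds to one leaf node, and each `if` corresponds to a branch of an internal node.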


The benefits of having a decision tree are as follows −
• It does not require any domain knowledge.
• It is easy to comprehend.
• The learning and classification steps of a decision tree are simple and
fast.
Decision Tree Induction Algorithm
• In 1980, a machine learning researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser). Later, he presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking, and the trees are constructed in a top-down, recursive, divide-and-conquer manner.
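The slide does not state how the greedy step picks an attribute. ID3 is usually described as choosing the attribute with the highest information gain, so the sketch below assumes that measure. The four-row dataset and the names `entropy` and `information_gain` are hypothetical, and only the selection step is shown, not the full recursive algorithm.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, label_key):
    """Entropy reduction obtained by splitting rows on one attribute."""
    base = entropy([r[label_key] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[label_key] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training tuples.
rows = [
    {"age": "youth", "student": "yes", "buy_computer": "yes"},
    {"age": "youth", "student": "no", "buy_computer": "no"},
    {"age": "senior", "student": "yes", "buy_computer": "yes"},
    {"age": "senior", "student": "no", "buy_computer": "no"},
]
gains = {a: information_gain(rows, a, "buy_computer") for a in ("age", "student")}
best_attribute = max(gains, key=gains.get)
print(best_attribute)
```

In this toy data `student` separates the classes perfectly, so it would be chosen for the root test. The algorithm would then recurse on each branch.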
Tree Pruning
• Tree pruning is performed in order to remove anomalies in the
training data due to noise or outliers. The pruned trees are smaller
and less complex.
• Tree Pruning Approaches
• Pre-pruning − The tree is pruned by halting its construction early.
• Post-pruning − This approach removes a sub-tree from a fully grown tree.
Problem: Juana Magiting
Juana Magiting is a writer of romance novels. A movie company and a TV network both want exclusive rights to one of her more popular works. If she signs with the network, she will receive a single lump sum, but if she signs with the movie company, the amount she will receive depends on the market response to her movie. What should she do?
Payouts and Probabilities
• Movie company Payouts
• Small box office – P 200,000
• Medium box office – P 1,000,000
• Large box office – P 3,000,000
• TV Network Payout
• Flat rate – P 900,000
• Probabilities
• P(Small Box Office) = 0.3
• P(Medium Box Office) = 0.6
• P(Large Box Office) = 0.1
Juana Magiting - Payoff Table
                                    States of Nature
Decisions                  Small Box Office   Medium Box Office   Large Box Office
Sign with Movie Company    P 200,000          P 1,000,000         P 3,000,000
Sign with TV Network       P 900,000          P 900,000           P 900,000
Prior Probabilities        0.3                0.6                 0.1
Juana Magiting - How to Decide?
• What would be her decision based on:
• Maximax?
• Maximin?
• Expected Return?
Using Expected Return Criteria
EVmovie = 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000)
        = P 960,000 = EVUII or EVBest
EVtv    = 0.3(900,000) + 0.6(900,000) + 0.1(900,000)
        = P 900,000
Therefore, using this criterion, Juana should select the movie contract.
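The three criteria can be checked with a short Python sketch built from the payoff table and prior probabilities. The dictionary layout and variable names are illustrative assumptions.

```python
# Payoffs per decision, ordered as (small, medium, large) box office.
payoffs = {
    "movie": [200_000, 1_000_000, 3_000_000],
    "tv": [900_000, 900_000, 900_000],
}
probs = [0.3, 0.6, 0.1]  # prior probabilities of the three states of nature

# Maximax: pick the decision whose best-case payoff is largest.
maximax = max(payoffs, key=lambda d: max(payoffs[d]))
# Maximin: pick the decision whose worst-case payoff is largest.
maximin = max(payoffs, key=lambda d: min(payoffs[d]))
# Expected return: probability-weighted average payoff of each decision.
expected = {d: sum(p * x for p, x in zip(probs, xs)) for d, xs in payoffs.items()}
best_ev = max(expected, key=expected.get)

print(maximax, maximin, best_ev)
print({d: round(v) for d, v in expected.items()})
```

Maximax favors the movie contract (best case P 3,000,000), maximin favors the TV network (worst case P 900,000 versus P 200,000), and expected return favors the movie contract.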
Something to Remember
Juana’s decision is only going to be made one time, and she will earn either P200,000, P1,000,000 or P3,000,000 if she signs the movie contract, not the calculated EV of P960,000!
Nevertheless, this amount is useful for decision-making, as it will maximize Juana’s expected returns in the long run if she continues to use this approach.
Expected Value of Perfect Information (EVPI)
What is the most that Juana should be willing to pay to learn what the size of the box office will be before she decides with whom to sign?
EVPI Calculation
EVwPI (or EVc)
= 0.3(900,000) + 0.6(1,000,000) + 0.1(3,000,000) = P 1,170,000
EVBest (calculated to be EVmovie from the previous page)
= 0.3(200,000) + 0.6(1,000,000) + 0.1(3,000,000) = P 960,000
EVPI = P 1,170,000 - P 960,000 = P 210,000
Therefore, Juana would be willing to spend up to P 210,000 to learn additional information before making a decision.
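The same calculation can be written as a small Python sketch: with perfect information Juana picks the best payoff in each state, and without it she picks the decision with the best expected value. Variable names are illustrative.

```python
# Payoffs per decision, ordered as (small, medium, large) box office.
payoffs = {
    "movie": [200_000, 1_000_000, 3_000_000],
    "tv": [900_000, 900_000, 900_000],
}
probs = [0.3, 0.6, 0.1]

# Expected value with perfect information: best payoff in each state, weighted.
ev_with_pi = sum(p * max(xs[i] for xs in payoffs.values()) for i, p in enumerate(probs))
# Best expected value without extra information.
ev_best = max(sum(p * x for p, x in zip(probs, xs)) for xs in payoffs.values())
evpi = ev_with_pi - ev_best

print(round(ev_with_pi), round(ev_best), round(evpi))
```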
Using Decision Trees
• Can be used as visual aids to structure and solve sequential decision problems
• Especially beneficial when the complexity of the problem grows
Decision Trees
• Three types of “nodes”
• Decision nodes - represented by squares (□)
• Chance nodes - represented by circles (Ο)
• Terminal nodes - represented by triangles (optional)
• Solving the tree involves pruning all but the best decisions
at decision nodes, and finding expected values of all
possible states of nature at chance nodes
• Create the tree from left to right
• Solve the tree from right to left
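Solving from right to left can be sketched as a recursive rollback: terminal nodes return their payoff, chance nodes return an expected value, and decision nodes keep only the best option. The dictionary-based node layout below is an assumption made for illustration, and the numbers are those of the Juana Magiting problem.

```python
def rollback(node):
    """Return the value of a node, solving its subtree from right to left."""
    kind = node["type"]
    if kind == "terminal":
        return node["value"]
    if kind == "chance":
        # Expected value over all possible states of nature.
        return sum(p * rollback(child) for p, child in node["branches"])
    if kind == "decision":
        # Prune all but the best option and remember which one it was.
        values = {name: rollback(child) for name, child in node["options"].items()}
        node["best"] = max(values, key=values.get)
        return values[node["best"]]
    raise ValueError(f"unknown node type: {kind!r}")

def chance(payoff_small, payoff_medium, payoff_large):
    """Build a chance node with the prior probabilities 0.3, 0.6, 0.1."""
    payoffs = (payoff_small, payoff_medium, payoff_large)
    return {"type": "chance",
            "branches": [(p, {"type": "terminal", "value": v})
                         for p, v in zip((0.3, 0.6, 0.1), payoffs)]}

tree = {"type": "decision", "options": {
    "Sign with Movie Co.": chance(200_000, 1_000_000, 3_000_000),
    "Sign with TV Network": chance(900_000, 900_000, 900_000),
}}

value = rollback(tree)
print(tree["best"], round(value))
```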
Example Decision Tree
[Figure: a decision node leading to a chance node with three branches: Event 1, Event 2, Event 3]
Juana Magiting Decision Tree
• Sign with Movie Co.
  • Small Box Office: P 200,000
  • Medium Box Office: P 1,000,000
  • Large Box Office: P 3,000,000
• Sign with TV Network
  • Small Box Office: P 900,000
  • Medium Box Office: P 900,000
  • Large Box Office: P 900,000
Juana Magiting Decision Tree
• Sign with Movie Co. (ER = ?)
  • Small Box Office (0.3): P 200,000
  • Medium Box Office (0.6): P 1,000,000
  • Large Box Office (0.1): P 3,000,000
• Sign with TV Network (ER = ?)
  • Small Box Office (0.3): P 900,000
  • Medium Box Office (0.6): P 900,000
  • Large Box Office (0.1): P 900,000
Juana Magiting Decision Tree - Solved
• Sign with Movie Co. (ER = 960,000)
  • Small Box Office (0.3): P 200,000
  • Medium Box Office (0.6): P 1,000,000
  • Large Box Office (0.1): P 3,000,000
• Sign with TV Network (ER = 900,000)
  • Small Box Office (0.3): P 900,000
  • Medium Box Office (0.6): P 900,000
  • Large Box Office (0.1): P 900,000
Decision Tree Problem
Dr. No has a patient who is very sick. Without further treatment, this patient will die in about 3 months. The only treatment alternative is a risky operation. The patient is expected to live about 1 year if he survives the operation; however, the probability that the patient will not survive the operation is 0.3.
Draw a decision tree for this simple decision problem. Show all the probabilities and outcome values.
Solution:
• No operation: live 3 months
• Operation
  • Survives (0.7): live 12 months
  • Does not survive (0.3): live 0 months
Let U(x) denote the patient’s utility function, where x is the number of months to live. Assuming that U(12) = 1.0 and U(0) = 0, how low can the patient’s utility for living 3 months be and still have the operation be preferred?
Solution:
The expected utility of the operation is 0.7 U(12) + 0.3 U(0) = 0.7(1.0) + 0.3(0) = 0.7, so the operation would be preferred as long as U(3) < 0.7.
For the rest of the problem, assume that U(3) = 0.8.
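The threshold can be verified with a short Python sketch. The names `eu_operation` and `prefers_operation` are illustrative; the probabilities and utilities come from the problem statement.

```python
p_not_survive = 0.3
utility = {12: 1.0, 0: 0.0}  # U(12) and U(0) as given in the problem

# Expected utility of operating: survive with probability 0.7, otherwise 0 months.
eu_operation = (1 - p_not_survive) * utility[12] + p_not_survive * utility[0]

def prefers_operation(u3):
    """The operation is preferred when its expected utility exceeds U(3)."""
    return eu_operation > u3

print(round(eu_operation, 2), prefers_operation(0.6), prefers_operation(0.8))
```

With the assumed U(3) = 0.8, not operating has the higher utility, since 0.8 is greater than 0.7.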
Cost Complexity
• The cost complexity is measured by the following two parameters −
• Number of leaves in the tree, and
• Error rate of the tree.
Data Mining - Bayesian Classification
• Bayesian classification is based on Bayes' Theorem. Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −
• Posterior Probability [P(H|X)]
• Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis.
• According to Bayes' Theorem,
P(H|X) = P(X|H) P(H) / P(X)
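A small numeric sketch may help show how the prior is updated into the posterior. All probabilities below are hypothetical values chosen for illustration. H stands for "the customer buys a computer" and X for an observed customer profile.

```python
def posterior(prior_h, likelihood_x_given_h, likelihood_x_given_not_h):
    """Bayes' Theorem: P(H|X) = P(X|H) P(H) / P(X)."""
    # P(X) by the law of total probability over H and not-H.
    p_x = likelihood_x_given_h * prior_h + likelihood_x_given_not_h * (1 - prior_h)
    return likelihood_x_given_h * prior_h / p_x

# Hypothetical numbers: P(H) = 0.3, P(X|H) = 0.6, P(X|not H) = 0.2.
p_h_given_x = posterior(0.3, 0.6, 0.2)
print(round(p_h_given_x, 4))
```

With these assumed inputs, observing X raises the probability of H from the prior 0.3 to a posterior of about 0.56.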
Bayesian Belief Network
• Bayesian Belief Networks specify joint conditional probability distributions. They are also known as Belief Networks, Bayesian Networks, or Probabilistic Networks.
• A Belief Network allows class conditional independencies to be defined between subsets of variables.
• It provides a graphical model of causal relationships, on which learning can be performed.
• We can use a trained Bayesian Network for classification.
• There are two components that define a Bayesian Belief Network −
• Directed acyclic graph
• A set of conditional probability tables
Directed Acyclic Graph
• Each node in a directed acyclic graph represents a random variable.
• These variables may be discrete or continuous valued.
• These variables may correspond to the actual attribute given in the
data.
Directed Acyclic Graph Representation
• The following diagram shows a directed acyclic graph for six Boolean
variables.
The arcs in the diagram allow representation of causal knowledge.
For example, lung cancer is influenced by a person's family history of lung cancer, as well as whether or not the person is a smoker.
It is worth noting that the variable PositiveXray is independent of whether the patient has a family history of lung cancer or is a smoker, given that we know the patient has lung cancer.
Conditional Probability Table
• The conditional probability table for the values of the variable
LungCancer (LC) showing each possible combination of the values of
its parent nodes, FamilyHistory (FH), and Smoker (S) is as follows −
Data Mining - Rule Based Classification
IF-THEN Rules
A rule-based classifier makes use of a set of IF-THEN rules for classification. We can express a rule in the following form −
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes THEN buy_computer = yes
Points to remember −
• The IF part of the rule is called the rule antecedent or precondition.
• The THEN part of the rule is called the rule consequent.
• The antecedent part (the condition) consists of one or more attribute tests, and these tests are logically ANDed.
• The consequent part consists of the class prediction.
Note − We can also write rule R1 as follows −
• R1: (age = youth) ∧ (student = yes) ⇒ (buy_computer = yes)
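Rule R1 can be sketched in Python as an antecedent (a set of attribute tests that are ANDed) plus a consequent (the class prediction). The `Rule` class and its method names are illustrative assumptions, not an established API.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    antecedent: dict   # attribute -> required value; all tests are ANDed
    consequent: tuple  # (class attribute, predicted value)

    def covers(self, tuple_):
        """True when every attribute test in the antecedent is satisfied."""
        return all(tuple_.get(attr) == value for attr, value in self.antecedent.items())

    def predict(self, tuple_):
        """Return the predicted class value, or None if the rule does not fire."""
        return self.consequent[1] if self.covers(tuple_) else None

r1 = Rule({"age": "youth", "student": "yes"}, ("buy_computer", "yes"))

print(r1.predict({"age": "youth", "student": "yes"}))
print(r1.predict({"age": "senior", "student": "yes"}))
```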
Rule Extraction
• Here we will learn how to build a rule-based classifier by extracting IF-
THEN rules from a decision tree.
Points to remember −
• To extract a rule from a decision tree −
• One rule is created for each path from the root to the leaf node.
• To form a rule antecedent, each splitting criterion is logically ANDed.
• The leaf node holds the class prediction, forming the rule consequent.
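The extraction steps above can be sketched as a recursive walk that collects the splitting criteria along each root-to-leaf path. The nested-dictionary tree format is an assumption for illustration, and the example reuses the Katara decision tree from earlier in these slides.

```python
def extract_rules(node, conditions=()):
    """Yield (antecedent, consequent) pairs, one per root-to-leaf path."""
    if not isinstance(node, dict):
        # Leaf node: it holds the class prediction (the rule consequent).
        yield list(conditions), node
        return
    attribute = node["attribute"]
    for value, child in node["branches"].items():
        # Each splitting criterion on the path is ANDed into the antecedent.
        yield from extract_rules(child, conditions + ((attribute, value),))

tree = {"attribute": "parents_visiting", "branches": {
    "yes": "cinema",
    "no": {"attribute": "weather", "branches": {
        "sunny": "play tennis",
        "windy": {"attribute": "money", "branches": {
            "rich": "shop for groceries",
            "poor": "play in game arcade"}},
        "rainy": "stay in"}}}}

rules = list(extract_rules(tree))
for antecedent, consequent in rules:
    tests = " AND ".join(f"{attr} = {value}" for attr, value in antecedent)
    print(f"IF {tests} THEN activity = {consequent}")
```

The tree has five leaves, so five rules are produced.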
Rule Induction Using Sequential Covering Algorithm
• The Sequential Covering Algorithm can be used to extract IF-THEN rules from the training data. We do not need to generate a decision tree first. In this algorithm, each rule for a given class covers many of the tuples of that class.
• Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed, and the process continues on the remaining tuples.
• Note − Decision tree induction can be considered as learning a set of rules simultaneously, because the path to each leaf in a decision tree corresponds to a rule.
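The general strategy can be sketched as follows. This is a simplified illustration and not AQ, CN2, or RIPPER: the greedy `learn_one_rule` helper, its accuracy-based scoring, and the six training tuples are all assumptions made for the example.

```python
LABEL = "buy_computer"

def covers(rule, row):
    """A rule (attribute -> value) covers a row when all its tests hold."""
    return all(row[attr] == value for attr, value in rule.items())

def learn_one_rule(rows, target):
    """Greedily add the attribute test with the best accuracy on covered rows."""
    rule = {}
    attributes = [a for a in rows[0] if a != LABEL]
    while True:
        covered = [r for r in rows if covers(rule, r)]
        if all(r[LABEL] == target for r in covered):
            return rule
        best, best_score = None, (-1.0, -1)
        for attr in attributes:
            if attr in rule:
                continue
            for value in sorted({r[attr] for r in covered}):
                subset = [r for r in covered if r[attr] == value]
                pos = sum(r[LABEL] == target for r in subset)
                score = (pos / len(subset), pos)  # accuracy, then coverage
                if score > best_score:
                    best, best_score = (attr, value), score
        if best is None:
            return rule  # no attribute left to test
        rule[best[0]] = best[1]

def sequential_covering(rows, target):
    """Learn rules one at a time, removing covered tuples after each rule."""
    rules, remaining = [], list(rows)
    while any(r[LABEL] == target for r in remaining):
        rule = learn_one_rule(remaining, target)
        rules.append(rule)
        remaining = [r for r in remaining if not covers(rule, r)]
    return rules

# Hypothetical training tuples.
data = [
    {"age": "youth", "student": "yes", LABEL: "yes"},
    {"age": "youth", "student": "no", LABEL: "no"},
    {"age": "middle", "student": "no", LABEL: "yes"},
    {"age": "middle", "student": "yes", LABEL: "yes"},
    {"age": "senior", "student": "no", LABEL: "no"},
    {"age": "senior", "student": "yes", LABEL: "yes"},
]
learned = sequential_covering(data, "yes")
print(learned)
```

On this toy data the first rule learned is `student = yes`. After its covered tuples are removed, a second rule `age = middle` covers the remaining positive tuple.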
Rule Pruning
A rule is pruned for the following reason −
• The assessment of quality is made on the original set of training data. The rule may perform well on the training data but less well on subsequent data. That is why rule pruning is required.
• The rule is pruned by removing a conjunct. The rule R is pruned if the pruned version of R has greater quality, as assessed on an independent set of tuples.
FOIL is one of the simple and effective methods for rule pruning. For a given rule R,

FOIL_Prune(R) = (pos - neg) / (pos + neg)

where pos and neg are the numbers of positive and negative tuples covered by R, respectively.
Note − This value will increase with the accuracy of R on the pruning set. Hence, if the FOIL_Prune value is higher for the pruned version of R, then we prune R.
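The pruning test can be sketched in Python as below. The coverage counts are hypothetical and chosen only to show the comparison between a rule and its pruned version on a pruning set.

```python
def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg) for tuples covered by R."""
    if pos + neg == 0:
        raise ValueError("rule covers no tuples in the pruning set")
    return (pos - neg) / (pos + neg)

# Hypothetical coverage counts on an independent pruning set.
original = foil_prune(pos=40, neg=10)  # rule R
pruned = foil_prune(pos=50, neg=10)    # R with one conjunct removed

should_prune = pruned > original
print(round(original, 3), round(pruned, 3), should_prune)
```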
Miscellaneous Classification Methods
• Genetic Algorithms
• The idea of the genetic algorithm is derived from natural evolution. First, an initial population is created, consisting of randomly generated rules. Each rule can be represented by a string of bits.
For example, the samples in a given training set may be described by two Boolean attributes, A1 and A2, and the training set may contain two classes, C1 and C2.
• Rough Set Approach
• We can use the rough set approach to discover structural relationships within imprecise and noisy data.
• Note − This approach can only be applied to discrete-valued attributes. Therefore, continuous-valued attributes must be discretized before use.
• Rough Set Theory is based on the establishment of equivalence classes within the given training data. The tuples that form an equivalence class are indiscernible, meaning that the samples are identical with respect to the attributes describing the data.
Rule Based Reasoning (RBR)
• Rule Based Reasoning (RBR) requires us to elicit an explicit model of the domain. As we all know and have experienced, knowledge acquisition has a set of associated problems.
• Maintenance of a rule-based system may be a nightmare. If the rules are not written clearly, it can lead to many sleepless nights of debugging.
• When rules are added to or deleted from a rule-based system, the system has to be checked for conflicting and redundant rules. An addition or deletion of a case from a case base does not require any further checking or debugging. It has to be noted, however, that while this does not affect the system’s functioning, it may have an impact on the outcome of the system.
Case Based Reasoning (CBR)
• Case Based Reasoning (CBR) does not require an explicit model.
Cases that identify the significant features are gathered and added to
the case base during development and after deployment.
• This is easier than creating an explicit model, as it is possible to
develop case bases without passing through the knowledge-
acquisition bottleneck.
• Maintenance of case-based systems is much easier and more straightforward.
Useful applications of data mining
• Future Healthcare
• Data mining holds great potential to improve health systems. It uses
data and analytics to identify best practices that improve care and
reduce costs. Researchers use data mining approaches like multi-
dimensional databases, machine learning, soft computing, data
visualization and statistics. Mining can be used to predict the volume
of patients in every category. Processes are developed that make sure
that the patients receive appropriate care at the right place and at the
right time. Data mining can also help healthcare insurers to detect
fraud and abuse.
• Market Basket Analysis
• Market basket analysis is a modelling technique based upon a theory
that if you buy a certain group of items you are more likely to buy
another group of items.
• This technique may allow the retailer to understand the purchase
behaviour of a buyer. This information may help the retailer to know
the buyer’s needs and change the store’s layout accordingly.
• Using differential analysis, results can be compared between different stores and between customers in different demographic groups.
• Education
• A new emerging field, called Educational Data Mining (EDM), is concerned with developing methods that discover knowledge from data originating from educational environments. The goals of EDM are identified as predicting students’ future learning behaviour, studying the effects of educational support, and advancing scientific knowledge about learning. Data mining can be used by an institution to make accurate decisions and also to predict students' results. With the results, the institution can focus on what to teach and how to teach. The learning patterns of students can be captured and used to develop techniques to teach them.
• Manufacturing Engineering
• Knowledge is the best asset a manufacturing enterprise can possess. Data mining tools can be very useful for discovering patterns in complex manufacturing processes. Data mining can be used in system-level design to extract the relationships between product architecture, product portfolio, and customer needs data. It can also be used to predict the product development time span, cost, and dependencies among other tasks.
• CRM
• Customer Relationship Management is all about acquiring and retaining customers, improving customers’ loyalty, and implementing customer-focused strategies. To maintain a proper relationship with a customer, a business needs to collect data and analyse the information. This is where data mining plays its part. With data mining technologies, the collected data can be used for analysis. Instead of being unsure where to focus in order to retain customers, those seeking a solution get filtered results.
• Fraud Detection
• Billions of dollars have been lost to fraud. Traditional methods of fraud detection are time consuming and complex. Data mining aids in providing meaningful patterns and turning data into information. Any information that is valid and useful is knowledge. A perfect fraud detection system should protect the information of all users. A supervised method includes the collection of sample records. These records are classified as fraudulent or non-fraudulent. A model is built using this data, and the algorithm is made to identify whether a record is fraudulent or not.
• Intrusion Detection
• Any action that will compromise the integrity and confidentiality of a resource is an intrusion. The defensive measures to avoid an intrusion include user authentication, avoiding programming errors, and information protection. Data mining can help improve intrusion detection by adding a level of focus to anomaly detection. It helps an analyst to distinguish unusual activity from common everyday network activity. Data mining also helps extract the data that is most relevant to the problem.
• Lie Detection
• Apprehending a criminal is easy, whereas bringing out the truth from him is difficult. Law enforcement can use mining techniques to investigate crimes and monitor the communications of suspected terrorists. This field also includes text mining, which seeks to find meaningful patterns in data that is usually unstructured text. The data samples collected from previous investigations are compared, and a model for lie detection is created. With this model, processes can be created as needed.
• Customer Segmentation
• Traditional market research may help us to segment customers, but data mining goes deeper and increases market effectiveness. Data mining helps group customers into distinct segments so that offerings can be tailored to each segment's needs. Marketing is always about retaining customers. Data mining makes it possible to find a segment of customers based on vulnerability, so that the business can make them special offers and enhance satisfaction.
• Financial Banking
• With computerized banking everywhere, a huge amount of data is generated with each new transaction. Data mining can contribute to solving business problems in banking and finance by finding patterns, causalities, and correlations in business information and market prices that are not immediately apparent to managers, because the volume of data is too large or is generated too quickly for experts to screen. Managers may use this information for better segmenting, targeting, acquiring, retaining and maintaining profitable customers.
• Corporate Surveillance
• Corporate surveillance is the monitoring of a person or group’s
behaviour by a corporation. The data collected is most often used for
marketing purposes or sold to other corporations, but is also regularly
shared with government agencies. It can be used by the business to
tailor its products to what its customers want. The data can be
used for direct marketing purposes, such as the targeted
advertisements on Google and Yahoo, where ads are targeted to the
user of the search engine by analyzing their search history and emails.
• Research Analysis
• History shows that we have witnessed revolutionary changes in
research. Data mining is helpful in data cleaning, data pre-processing
and integration of databases. Researchers can find similar data in
the database that might bring about a change in the research.
Co-occurring sequences and correlations between activities can be
identified. Data visualisation and visual
data mining provide us with a clear view of the data.
• Criminal Investigation
• Criminology is a process that aims to identify crime characteristics. Crime analysis includes exploring and detecting crimes and their relationships with criminals. The high volume of crime datasets and the complexity of the relationships between these kinds of data have made criminology an appropriate field for applying data mining techniques. Text-based crime reports can be converted into word processing files. This information can be used to perform the crime matching process.
• Bioinformatics
• Data Mining approaches seem ideally suited for Bioinformatics, since
it is data-rich. Mining biological data helps to extract useful
knowledge from massive datasets gathered in biology, and in other
related life sciences areas such as medicine and neuroscience.
Applications of data mining to bioinformatics include gene finding,
protein function inference, disease diagnosis, disease prognosis,
disease treatment optimization, protein and gene interaction
network reconstruction, data cleansing, and protein sub-cellular
location prediction.