
Lecture 04

Post Processing Phase


Post-processing Phase

 This phase is concerned with the filtering, evaluation, visualization and interpretation of patterns generated during the data mining phase.
 Patterns are local structures that make statements only about restricted regions of the space spanned by the variables.

Post-processing flow: Filtering Patterns → Visualization → Pattern Interpretation → Knowledge

Evaluation of patterns

 Pattern evaluation involves assessing the interestingness of patterns using expert opinion and experimental analysis.
 A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
 Only the approved patterns are retained, and the entire process is revisited to identify which alternative actions could have been taken to improve the results.
Pattern evaluation: Interesting patterns

Finding interesting patterns
A pattern is interesting if it has the following properties:
1. Easily understood by humans
2. Valid on new or test data with some degree of certainty
3. Potentially useful
4. Novel
5. Validates some hypothesis that a user seeks to confirm
Pattern evaluation

Finding interesting patterns


 Data mining may generate thousands of patterns, and not all of them are interesting.
 Suggested approach: human-centered, query-based, focused mining.



Pattern Evaluation: Measurements of Interestingness

1. Novel patterns: new or unique patterns, not previously known, surprising (novelty is also used to remove redundant rules).
   Novelty metric: uniqueness
2. Valid patterns (accurate patterns): the discovered patterns should be suitable for, or applicable to, new data with some degree of accuracy.
Pattern Evaluation: Measurements of Interestingness

3. Potentially useful (utility): the patterns should potentially lead to some useful actions, as measured by some utility function.
   Utility metric: support
4. Ultimately understandable (simplicity): patterns should be comprehensible to humans in order to facilitate a better understanding of the underlying data.
   Simplicity metrics: rule length, (decision) tree size, etc.



Pattern Evaluation: Statistical Evaluation measures

 The Kappa statistic is used to measure how strongly data items in the same class resemble each other, that is, the level of intra-class correlation.
 It is similar to a correlation coefficient:
 0.0 = complete disagreement (items do not resemble each other)
 0.40 to 0.59 = moderate agreement
 0.60 to 0.79 = substantial agreement
 above 0.80 = outstanding agreement
 1.0 = complete agreement (items strongly resemble each other)
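 As a quick illustration, the sketch below computes Cohen's kappa with scikit-learn; the two raters' labels are made up for the example.

from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators over the same ten items
rater_a = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
rater_b = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "no", "yes"]

# Kappa corrects raw agreement for the agreement expected by chance
print(cohen_kappa_score(rater_a, rater_b))  # read the result off the scale above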
Pattern Evaluation: Statistical Evaluation measures
 An absolute error is the magnitude of the difference between the exact value and the approximation (prediction).
 Mean Absolute Error (MAE) is defined as the sum of absolute errors divided by the number of predictions.
 MAE measures how close the predicted values are to the actual values, i.e. how close the predicted model is to the actual model.
 A small MAE value means better prediction by the model.
 Root Mean Square Error (RMSE) is defined as the square root of the sum of squared errors divided by the number of predictions.
 RMSE measures the differences between values predicted by a model and the values actually observed. A small RMSE value means better accuracy of the model.
 So, minimizing RMSE and MAE gives better prediction and accuracy.
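 A minimal sketch of both measures, assuming made-up actual and predicted values:

import numpy as np

# Hypothetical actual and predicted values for five test items
y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0, 4.5])

mae = np.mean(np.abs(y_true - y_pred))           # sum of absolute errors / n
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # root of the mean squared error
print(mae, rmse)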
Pattern Evaluation: Statistical Evaluation measures
 Confusion matrix
 Also known as a contingency table.
 It shows the number of correctly classified instances and the number of incorrectly classified instances.
 The number of correctly classified instances is obtained by summing the diagonal elements of the matrix;
 all other entries are incorrectly classified instances.
Pattern Evaluation: Statistical Evaluation measures
 Confusion matrix Example

a b <-- classified as
7 2 | a = yes
3 2 | b = no

 class "a" gets misclassified as "b" exactly twice, and class "b"
gets misclassified as "a" three times).
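 The same matrix can be reproduced with scikit-learn, assuming label vectors that match the counts above:

from sklearn.metrics import confusion_matrix

# Hypothetical labels matching the counts in the example
y_true = ["yes"] * 9 + ["no"] * 5
y_pred = ["yes"] * 7 + ["no"] * 2 + ["yes"] * 3 + ["no"] * 2

# Rows are the true classes, columns the predicted classes
print(confusion_matrix(y_true, y_pred, labels=["yes", "no"]))
# [[7 2]
#  [3 2]]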
Pattern Evaluation: Statistical Evaluation measures

True positive (TP): the number of items correctly labelled as belonging to the positive class.
True negative (TN): the number of items correctly labelled as belonging to the negative class.
False positive (FP): an item incorrectly labelled as belonging to the positive class (a retrieved item that is not relevant). Also known as a Type I error.
False negative (FN): an item that was not labelled as belonging to the positive class but should have been (a relevant item not retrieved). Also known as a Type II error.
Pattern Evaluation: Statistical Evaluation measures

 Accuracy is the number of correctly classified items divided by the total number of items.
 Accuracy (A) = (TP+TN)/total number of samples
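 For the running example matrix, Accuracy = (7+2)/(7+2+3+2) = 9/14 ≈ 0.64.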
Pattern Evaluation: Statistical Evaluation measures
 Precision
 Also called positive predictive value.
 It is the fraction of retrieved instances that are relevant.
 Precision = TP/(TP+FP)
 That is, the proportion of instances which truly have class x among all those which were classified as class x.
 In the confusion matrix, it is the diagonal element divided by the sum over the relevant column.
Pattern Evaluation: Statistical Evaluation measures

 Precision Example
a b <-- classified as
7 2 | a = yes
3 2 | b = no
 Precision for class yes = 7/(7+3) = 0.7
 Precision for class no = 2/(2+2) = 0.5
 A perfect precision score of 1.0 means that every item classified into the positive class was relevant.
 However, it says nothing about whether all relevant items were classified into the positive class.
Pattern Evaluation: Statistical Evaluation measures
 Recall
 The number of true positives (TP) divided by the total number of elements that actually belong to the positive class (TP+FN).
 Also called the true positive rate.
 Recall (R) = TP/(TP+FN)
 It is the fraction of relevant instances that are retrieved.
 A perfect recall score of 1.0 means that all relevant items were classified into the positive class.
 However, it says nothing about how many irrelevant items were also included in the positive class.
Pattern Evaluation: Statistical Evaluation measures

 Recall example
a b <-- classified as
7 2 | a = yes
3 2 | b = no

 Recall (R) = TP/(TP+FN)
 = 7/(7+2) = 7/9 ≈ 0.778 for class yes
 = 2/(3+2) = 2/5 = 0.4 for class no
Pattern Evaluation: Statistical Evaluation measures
 The True Positive (TP) rate is the proportion of examples which were classified as class x, among all examples which truly have class x,
 i.e. how much of the class was captured. It is equivalent to recall.
 TP rate Example:
a b <-- classified as
7 2 | a = yes
3 2 | b = no
 In the above confusion matrix,
 TP rate is the diagonal element divided by the sum over the relevant row,
 i.e. 7/(7+2) = 0.778 for class yes and 2/(3+2) = 0.4 for class no in our example.
Pattern Evaluation: Statistical Evaluation measures
 The False Positive (FP) rate
 This is the proportion of examples which were classified as class x, but belong to a different class, among all examples which are not of class x.
 In the matrix, this is the column sum of class x minus the diagonal element, divided by the row sums of all other classes.
 Example
a b <-- classified as
7 2 | a = yes
3 2 | b = no
 FPR for class yes = 3/5 = 0.6
 FPR for class no = 2/9 ≈ 0.222
Pattern Evaluation: Statistical Evaluation measures

 Precision vs Recall
 A precision score of 1.0 for a class C means that every item labelled as belonging to class C does indeed belong to class C (but says nothing about the number of items from class C that were not labelled correctly).
 A recall of 1.0 means that every item from class C was labelled as belonging to class C (but says nothing about how many other items were incorrectly also labelled as belonging to class C).
Pattern Evaluation: Statistical Evaluation measures
 F-Measure
 This is a combined measure of precision and recall.
 The two measures are combined in the F-measure to provide a single measurement for a system.
 It is computed as follows:

 F-Measure = 2 × Precision × Recall / (Precision + Recall)
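 For the running example: F(yes) = 2 × 0.7 × 0.778/(0.7 + 0.778) ≈ 0.737 and F(no) = 2 × 0.5 × 0.4/(0.5 + 0.4) ≈ 0.444. A minimal scikit-learn sketch of precision, recall and F-measure, using label vectors matching the example matrix:

from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels matching the example confusion matrix
y_true = ["yes"] * 9 + ["no"] * 5
y_pred = ["yes"] * 7 + ["no"] * 2 + ["yes"] * 3 + ["no"] * 2

# Treat "yes" as the positive class
print(precision_score(y_true, y_pred, pos_label="yes"))  # 0.700
print(recall_score(y_true, y_pred, pos_label="yes"))     # 0.778
print(f1_score(y_true, y_pred, pos_label="yes"))         # 0.737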
Pattern Evaluation: Statistical Evaluation measures

 Receiver Operating Characteristic (ROC) curve
 A ROC curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied.
 The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
 For instance, an area under the curve near 0.5 indicates the lack of any statistical dependence.
Pattern Evaluation: Statistical Evaluation measures

 ROC Area
 The area under a ROC curve quantifies the overall ability of the test to discriminate between those individuals with the condition (TP) and those without the condition (TN).
 A truly useless test has an area of 0.5,
 meaning that it is no better at identifying true positives than false positives.
 A perfect test (zero false positives and zero false negatives) has an area of 1.00,
 meaning that it reliably identifies true positives.
 A real test has an area between these two values.
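 A minimal sketch computing the ROC points and the area, assuming made-up classifier scores:

from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels (1 = positive) and classifier scores
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.75, 0.7, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1]

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) point per threshold
print(roc_auc_score(y_true, scores))              # area under the ROC curve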

Pattern Evaluation: Statistical Evaluation measures
 ROC area example: [graph omitted]
Pattern Evaluation: Statistical Evaluation measures
 ROC Area
 The graph shows three ROC curves representing excellent,
good, and worthless tests plotted on the same graph.
 The accuracy of the test depends on how well the test
separates the group being tested into those with and without
the condition in question.
 Accuracy is measured by the area under the ROC curve.
 An area of 1 represents a perfect test;
 an area of 0.5 represents a worthless test.
Pattern Evaluation: Statistical Evaluation measures

 ROC area
A rough guide for classifying the accuracy of a diagnostic test is
the traditional academic point system:
0.90-1.00 = excellent (A)
0.80-0.90 = good (B)
0.70-0.80 = fair (C)
0.60-0.70 = poor (D)
0.50-0.60 = fail (F)
A value near 0.5 means the lack of any statistical dependence.
Pattern Evaluation: Statistical Evaluation measures
 ROC area example: [graph omitted]
Pattern Evaluation: Statistical Evaluation measures

 ROC area
 ROC curves can also be constructed from clinical prediction rules.
 The graphs show how well clinical findings predict strep throat. The study compared patients in Virginia (VA) and Nebraska (NE).
 It was observed that the rule performed more accurately in Virginia (VA), with an area under the curve of 0.78, compared to Nebraska, with an area under the curve of 0.73.
Pattern Evaluation: Statistical Evaluation measures

 Matthews correlation coefficient (MCC)
 A coefficient of +1 represents a perfect prediction, while 0 is no better than random prediction.
 A coefficient of −1 indicates total disagreement between prediction and observation.
 The statistic is also known as the phi coefficient. MCC is related to the chi-square statistic for a 2×2 contingency table.
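 In terms of the confusion matrix, MCC = (TP×TN − FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). A minimal sketch for the running example:

from sklearn.metrics import matthews_corrcoef

# Hypothetical labels matching the example confusion matrix
y_true = ["yes"] * 9 + ["no"] * 5
y_pred = ["yes"] * 7 + ["no"] * 2 + ["yes"] * 3 + ["no"] * 2

print(matthews_corrcoef(y_true, y_pred))  # ≈ 0.19 here; ranges from −1 to +1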
Pattern Evaluation: Statistical Evaluation measures
 Precision-Recall curve (PRC) area
 The PRC plot shows the relationship between precision and sensitivity (recall).
 It has been cited as an alternative to ROC curves for tasks with a large skew in the class distribution.
 ROC space and PR space differ in the visual representation of the curves:
 in PR space the goal is to be in the upper-right-hand corner, while in ROC space the goal is to be in the upper-left-hand corner.
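 A minimal sketch of the precision-recall pairs and the PRC area, assuming the same made-up scores as in the ROC sketch:

from sklearn.metrics import precision_recall_curve, auc

# Hypothetical true labels (1 = positive) and classifier scores
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.75, 0.7, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1]

precision, recall, thresholds = precision_recall_curve(y_true, scores)
print(auc(recall, precision))  # area under the precision-recall curve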
Pattern Visualization

 Visualization is the process by which data are converted into meaningful 3-D images or some other graphical representation.
 Visualization uses good interfaces and graphics to present the results of data mining.
 Visualization techniques are important for making the results useful.
Examples:
 1. tables
 2. cross tabs
 3. pie/bar charts
 4. rooted trees
Main purpose for visualization

Visualization techniques provide graphical results of data mining in order to support the decision-making process.

[Diagram: a pyramid of increasing potential to support decision making — data sources (paper, files, information providers, database systems) at the base, maintained by the DBA; data warehouses / data marts built by pre-processing; data mining and knowledge discovery carried out by the analyst; and post-processing / visualization techniques at the top, where the end user decides how to use the knowledge.]
Benefits of pattern Visualization

1. Provide insight into an information space by mapping data onto graphics.
2. Help find interesting regions and suitable parameters for further quantitative analysis.
3. Provide visual proof of relationships within groups and between groups at the same time.
Benefits of Pattern Visualization

4. Conveys information easily: "a picture is worth a thousand words".
Disadvantages of Visualization

1. It requires human eyes; some people have impaired vision (e.g. colour blindness), so they may not notice differences in colours.
Disadvantages of Visualization

2. It can be misleading
Example of a misleading visualization:

Year  Sales
1999  2,110
2000  2,105
2001  2,120
2002  2,121
2003  2,124

[Chart: sales plotted with the y-axis running only from 2,095 to 2,130]

The truncated y-axis scale gives the WRONG impression of a big change.
Disadvantages of Visualization

Better visualization example:

Year  Sales
1999  2,110
2000  2,105
2001  2,120
2002  2,121
2003  2,124

[Chart: the same sales plotted with the y-axis running from 0 to 3,000]

Starting the y-axis scale at 0 gives the correct impression of a small change.
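 A minimal matplotlib sketch reproducing the two charts; the only difference is the y-axis limits:

import matplotlib.pyplot as plt

years = [1999, 2000, 2001, 2002, 2003]
sales = [2110, 2105, 2120, 2121, 2124]

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.plot(years, sales)
ax1.set_ylim(2095, 2130)  # truncated axis: exaggerates the change
ax2.plot(years, sales)
ax2.set_ylim(0, 3000)     # axis from zero: shows the change in proportion
plt.show()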
Tufte’s Principles of Graphical Excellence
 Give the viewer:
1. the greatest number of ideas
2. in the shortest time
3. with the least ink, in the smallest space.
4. Tell the truth about the data!
Visualization: Example

 Example: Consider the following decision tree. [tree figure omitted]

Visualization: Example
 The data in the previous slide can be visualized as follows:

During visualization, C4.5 creates a threshold and then splits the instances into those whose attribute value is above the threshold and those whose value is less than or equal to it.
Interpretation of Patterns
 This involves explaining data mining results by describing the patterns produced during mining.
 This requires interaction with domain experts.
 Visualization makes interpretation easier.

 Example:
The tree in the previous slide shows that:
There are 50 setosa in the original dataset without any misclassification, so this split was successful.
46 samples reached the virginica leaf and 45 of them were virginica, but 1 of the samples was not a virginica.

Knowledge usage

 The understandable patterns are used to:
 Make predictions or classifications about new data
 Summarize the contents of a large database to support decision making
 Fund new research
 Explain existing data
 Provide graphical data visualization to aid humans in discovering deeper patterns
Summarized Process of Knowledge Discovery (KDD)

[Diagram: raw data → cleaning and integration → data warehouse → selection and transformation → target (transformed) data → data mining → patterns and rules → interpretation and evaluation → knowledge and understanding.]
Required Effort for Each KDD Step

• Arrows indicate the direction in which we hope the effort will go. [chart omitted]


The Structure of a Typical Data Mining / Knowledge Discovery System

[Diagram, top to bottom: Graphical User Interface → Pattern Evaluation → Data Mining Engine (supported by a Knowledge Base) → Database or Data Warehouse Server → data cleaning, integration, and selection → data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories.]



Lab Exercise

 Use decision tree learning algorithms in the Weka machine learning software to mine the weather data.
 Visualize the output.
 Interpret the results.

 NB
 If ID3 is disabled in the Explorer, it is because your data contains numeric attributes.
 ID3 only operates on nominal attributes.
 J48 operates on both nominal and numeric attributes.
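 For reference, a rough scikit-learn analogue of the lab (the lab itself uses Weka's Explorer; the file and column names below are assumptions based on the classic weather dataset, and sklearn grows a CART tree rather than J48's C4.5):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed weather.csv with the classic nominal weather attributes
df = pd.read_csv("weather.csv")
X = pd.get_dummies(df[["outlook", "temperature", "humidity", "windy"]])
y = df["play"]

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # inspect the splits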
End

Thank you

Questions
