CS5550 Data Management and Business Intelligence

Session 7: Data Mining (II) Advanced DM

Session Learning Outcomes
The learning outcomes for this session are that you can:

Understand how to use state-of-the-art DM techniques
Discuss the strengths and weaknesses of such methods
Discuss the use of key DM tools for business intelligence

CS5550 Session 05

Slide 2

Recap on Data Mining
What it is
  Definition
  IDA, AI, KDD, etc.
  Data to knowledge

Some typical tools
  Correlation
  Regression
  Clustering
  Visualisation


More Advanced Techniques
Classifiers (e.g. Decision Trees)
Association Rules
Time-Series Models
Bayesian Networks
Principal Components Analysis to plot multidimensional data
Graph Based Methods to explore multiple relationships
Optimisation

Classification
“What sort of data is this?”
Similar to Clustering but Supervised Learning – we have sample classes to learn from:
  Fraudulent Financial Reporting: Y = {fraudulent, truthful}
  Predicting Delayed Flights: Y = {delayed, on time}

Classification
Supervised method, unlike clustering
[Figure: sample data values (x)]

Decision Trees
Established method for classifying data
Originated from biology
Easy to understand
Commonly used for
  Fraud detection
  Credit Rating
[Figure: scatter plot of Salary against Age showing Class 0, Class 1 and an unclassified point (?)]

Decision Trees
Credit Rating example

Decision Trees
Advantages
  Transparent & Interpretable
  Can also perform Feature Selection
  Can model complex relationships (non-linear)
Disadvantages
  Risk over-fitting data (but can prune trees)
  Require lots of data
  Cannot model “diagonal” relationships (splits always on one predictor, not combinations)
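As a rough sketch of why a tree's splits are "always on one predictor", the snippet below searches for the single axis-aligned threshold with the lowest weighted Gini impurity. Gini impurity is an assumed split criterion (the slides do not name one), and the data, function names and classes are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold, impurity) of the best single-predictor split."""
    best = (None, None, float("inf"))
    n = len(rows)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue  # a split must put something on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (f, threshold, score)
    return best

# Hypothetical (age, salary in thousands) observations with made-up classes.
rows = [(25, 20), (30, 45), (35, 25), (50, 50), (60, 30), (65, 55)]
labels = [0, 1, 0, 1, 0, 1]
feature, threshold, impurity = best_split(rows, labels)
print(feature, threshold, impurity)
```

A full tree would apply the same search recursively to each side of the chosen split, which is also where over-fitting and the need for pruning come from.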

K Nearest Neighbour
Find K observations in the data that are similar to the new observation we wish to classify
Requires:
  Distance Metric
  Voting Mechanism
  Weighting Function

K-Nearest Neighbour
Distance Metric, e.g. Euclidean
Weighting function
Neighbour voting mechanism, e.g. Maximum
[Figure: a new observation (o) among + and - neighbours, classified with k=1, k=3 and k=5]
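The three ingredients can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance, a uniform weighting function (every neighbour counts equally) and a maximum-vote rule; the training points and the function names are invented.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k):
    """Classify `query` by an unweighted majority vote of its k nearest neighbours.

    train: list of (point, label) pairs.
    """
    neighbours = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Made-up training data: three "+" points, three "-" points.
train = [((1.0, 1.0), "+"), ((1.5, 2.0), "+"), ((2.0, 1.0), "+"),
         ((3.2, 3.0), "-"), ((5.0, 5.5), "-"), ((6.0, 5.5), "-")]
query = (3.0, 3.0)
for k in (1, 3, 5):
    print(k, knn_classify(train, query, k))
```

With this data the single nearest neighbour is a "-" point, but the vote goes to "+" once three or five neighbours are consulted, so the choice of k changes the answer.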

K-Nearest Neighbour
Advantages
  Simplicity
  Few assumptions about data
Disadvantages
  Slow when large number of datapoints
  Need lots of data when lots of predictors

Other Classifiers
Linear Classifiers
Artificial Neural Network Classifiers
Support Vector Machines
Bayesian Classifiers

Association Rules
“What data goes with what”
Large amount of “basket data”
  e.g. Supermarket purchases
Looks for associations between items
Builds “If – Then” Rules
  e.g. Nappies => Beer
Uses notion of support and confidence
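Support and confidence can be computed directly from basket data. The sketch below uses the standard definitions (support as the fraction of baskets containing an itemset, confidence as support of the whole rule divided by support of the "If" part); the baskets and function names are invented for illustration.

```python
def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """Estimate of P(consequent | antecedent) for the rule antecedent => consequent.

    Assumes the antecedent occurs in at least one basket.
    """
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

# Made-up supermarket baskets.
baskets = [
    {"nappies", "beer", "crisps"},
    {"nappies", "beer"},
    {"nappies", "milk"},
    {"beer", "crisps"},
    {"milk", "bread"},
]
print(support(baskets, {"nappies", "beer"}))       # 2 of the 5 baskets
print(confidence(baskets, {"nappies"}, {"beer"}))  # 2 of the 3 baskets with nappies
```

Rule miners typically keep only rules whose support and confidence exceed chosen thresholds, which is how rare combinations end up being ignored.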

Association Rules
Example:
Rule 1. If the quality of the management is medium, then the company may have a profit or a loss (C3, C4).
Rule 2. If the quality of the management is (at least) high and the number of employees is similar to 700, then the company makes a profit (C1).
Rule 3. If the quality of the management is (at most) low, then the company has a loss (C5, C6).
Rule 4. If the number of employees is similar to 420 and the localization is B, then the company has a loss (C2).

Association Rules
Advantages
Disadvantages
  Profusion of rules generated
  Ignores rare (but potentially interesting) combinations

Neural Networks
Map inputs to outputs using weights
Back-propagation algorithm to learn weights
[Figure: network with Input Layer (Ii), Hidden Layer (θi) and Output Layer (θk)]
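To make "map inputs to outputs using weights" concrete, here is a minimal forward pass through one hidden layer. It is only a sketch: the sigmoid activation, the layer sizes and all weight values are assumptions, and the back-propagation step that would learn the weights is not shown.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: each unit outputs sigmoid(w . inputs + b)."""
    return [sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
            for unit_weights, b in zip(weights, biases)]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)

# Made-up weights: 2 inputs -> 2 hidden units -> 1 output.
hidden_w = [[0.5, -0.5], [1.0, 1.0]]
hidden_b = [0.0, -1.0]
output_w = [[1.0, -1.0]]
output_b = [0.0]
print(forward([1.0, 1.0], hidden_w, hidden_b, output_w, output_b))
```

Training would repeatedly compare this output with the known target and adjust every weight in the direction that reduces the error.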

Neural Networks
Forecasting Markets
Predicting Stock collapse
Classifying exceptional behaviour in customers
  Fraudulent credit card usage
  Online monitoring

Neural Networks
Advantages
  Can model complex relationships
  Versatile
Disadvantages
  Suffers very badly from “Over-fitting”
  Need lots of data
  Do not select features automatically
  “Black Box” model

Time-Series Models
“Predicting the Stock-Market”
Long been the goal of:
  mathematicians
  statisticians
  computer scientists
  philosophers
[Image: “Pi” – Harvest Filmworks 1999]

Time-Series Models
Statistical Models
“AI” Models such as Neural Networks

Bayesian Networks
Overcome “black box” nature of NNs
Model data using probabilities and graphs
No hidden “layers” or “weights”
Models a joint distribution – probability of any event is calculable
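The claim that the probability of any event is calculable can be illustrated with a tiny network. In this sketch the joint distribution is the product of each node's probability given its parents, and any event's probability is found by summing the joint over matching assignments. The Rain / Sprinkler / WetGrass structure and every probability value are made up for illustration.

```python
from itertools import product

# Hypothetical network: Rain -> WetGrass <- Sprinkler. All numbers are invented.
P_RAIN = 0.2
P_SPRINKLER = 0.1
P_WET = {  # P(wet = True | rain, sprinkler)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.0,
}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet): product of each node's probability given its parents."""
    p = P_RAIN if rain else 1 - P_RAIN
    p *= P_SPRINKLER if sprinkler else 1 - P_SPRINKLER
    p_wet = P_WET[(rain, sprinkler)]
    return p * (p_wet if wet else 1 - p_wet)

def probability(event):
    """Sum the joint over every assignment for which event(rain, sprinkler, wet) holds."""
    return sum(joint(r, s, w)
               for r, s, w in product([True, False], repeat=3)
               if event(r, s, w))

p_wet = probability(lambda r, s, w: w)
p_rain_given_wet = probability(lambda r, s, w: r and w) / p_wet
print(p_wet, p_rain_given_wet)
```

Enumerating every assignment is only practical for very small networks; it is used here because it keeps the calculation visible.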

Bayesian Networks

Bayesian Networks
Advantages?
Disadvantages?

Optimisation
For searching through huge numbers of possible solutions:
Scheduling processes
  Manufacturing
  Deliveries
Efficient loading of crates prior to shipping
  Bin Packing of objects
Routing for efficient delivery

Optimisation
Well known Techniques:
  Greedy Searches
  Hill Climb
  Simulated Annealing
  Genetic Algorithms
  Gradient Descent

Optimisation
For example: Travelling Salesman Problem
Famous NP Hard Problem

Travelling Salesman Problem
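As a small illustration of a greedy search on the Travelling Salesman Problem, the sketch below builds a tour by always moving to the nearest unvisited city and compares it with an exhaustive search, which is only feasible because the instance is tiny. The city coordinates and function names are made up, and `math.dist` assumes Python 3.8 or later.

```python
import math
from itertools import permutations

def tour_length(cities, order):
    """Total length of the closed tour that visits cities in the given order."""
    return sum(math.dist(cities[order[i]], cities[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def greedy_tour(cities, start=0):
    """Nearest-neighbour heuristic: always move to the closest unvisited city."""
    unvisited = set(range(len(cities))) - {start}
    order = [start]
    while unvisited:
        here = cities[order[-1]]
        nearest = min(unvisited, key=lambda c: math.dist(here, cities[c]))
        order.append(nearest)
        unvisited.remove(nearest)
    return order

def exhaustive_tour(cities):
    """Try every ordering that starts at city 0; the count grows factorially."""
    rest = range(1, len(cities))
    return min(([0] + list(p) for p in permutations(rest)),
               key=lambda order: tour_length(cities, order))

cities = [(0, 0), (1, 5), (5, 4), (6, 1), (3, 2), (2, 7)]  # made-up coordinates
greedy = greedy_tour(cities)
best = exhaustive_tour(cities)
print(greedy, tour_length(cities, greedy))
print(best, tour_length(cities, best))
```

The greedy tour is quick to build but carries no guarantee of being the shortest, which is why techniques such as hill climbing or simulated annealing are often used to improve an initial tour.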

Optimisation: Bin Packing
Trucks with capacity of 10
How few required to store objects of size: {3, 6, 2, 1, 5, 7, 2, 4, 1, 9}?

Optimisation: Bin Packing
Search techniques for finding the best allocation for objects within fixed size containers
Potential “Heuristic” Approaches:
  First Fit
  Next Fit
  Best Fit
  Worst Fit
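The four heuristics differ only in which open bin they choose for the next object, so they can share one packing loop. The sketch below is one possible reading of them, applied to trucks of capacity 10; the ordering of the object sizes is an assumption, the function names are invented, and First Fit Decreasing is an extra variant that is not on the slide.

```python
def pack(sizes, capacity, choose):
    """Place objects one at a time; `choose` returns a bin index, or None to open a new bin."""
    bins = []  # each bin is a list of object sizes
    for size in sizes:
        free = [capacity - sum(b) for b in bins]
        index = choose(free, size)
        if index is None:
            bins.append([size])
        else:
            bins[index].append(size)
    return bins

def first_fit(free, size):
    """Lowest-numbered bin with enough room."""
    return next((i for i, f in enumerate(free) if f >= size), None)

def next_fit(free, size):
    """Only the most recently opened bin is considered."""
    return len(free) - 1 if free and free[-1] >= size else None

def best_fit(free, size):
    """Bin that would be left with the least spare room (ties go to the lowest index)."""
    fits = [(f, i) for i, f in enumerate(free) if f >= size]
    return min(fits)[1] if fits else None

def worst_fit(free, size):
    """Bin with the most spare room (ties go to the lowest index)."""
    fits = [(f, -i) for i, f in enumerate(free) if f >= size]
    return -max(fits)[1] if fits else None

sizes = [3, 6, 2, 1, 5, 7, 2, 4, 1, 9]  # object sizes; this ordering is an assumption
capacity = 10
rules = [("first fit", first_fit), ("next fit", next_fit),
         ("best fit", best_fit), ("worst fit", worst_fit)]
results = {name: pack(sizes, capacity, rule) for name, rule in rules}
# First Fit Decreasing (not on the slide): sort largest-first, then apply First Fit.
results["first fit decreasing"] = pack(sorted(sizes, reverse=True), capacity, first_fit)
for name, bins in results.items():
    print(name, len(bins), bins)
```

The sizes total 40, so at least four trucks are needed. On this ordering each of the four listed heuristics opens five trucks, while sorting the objects largest-first before applying First Fit reaches four.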

Optimisation: Bin Packing
Also 2D and 3D approaches

Business Intelligence
Data Integration + Data Mining + Human Expertise => Business Intelligence:
  Improved Decision Making
  Quicker Response Times
  Better Broadcasting / Marketing

Weaknesses of Data Mining
Data Quality
Spurious Correlations
Over-fitting
“Black Box” Modelling
Over-reliance – slave to the data
“Can’t see the wood for the trees”

Session Summary
This session has examined:
  Advanced Data Mining Techniques with examples
  Advantages and Disadvantages

Next Session: Guest Lecture
Case Study of the application of BI
