DATA MINING WITH CLUSTERING AND CLASSIFICATION

DATA MINING
• Data mining is the process of discovering new correlations, patterns, and trends by digging into (mining) large amounts of data stored in warehouses, using artificial intelligence, statistical, and mathematical techniques.
• It is currently used in a wide range of profiling practices, such as marketing, fraud detection, and scientific discovery.

Why Data Mining
From a managerial perspective:
• Analyzing trends
• Wealth generation
• Security
• Strategic decision making

MODELS OF DATA MINING
• Predictive Model: Predictive models can be used to forecast explicit values, based on patterns determined from known results. For example, from a database of customers who have already responded to a particular offer, a model can be built that predicts which prospects are likeliest to respond to the same offer.
• Predictive data mining is further categorized into classification and regression.

CONT.
• Descriptive Model: Descriptive models describe patterns in existing data, and are generally used to create meaningful subgroups such as demographic clusters.
• Descriptive data mining is further classified into clustering, association, and sequential analysis.

CLUSTERING
Clustering can be considered the most important unsupervised learning technique; so, as every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. Clustering is the process of organizing objects into groups whose members are similar in some way. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters.
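To make the idea of "groups whose members are similar" concrete, here is a minimal sketch of k-means, one common partitioning method. The slides do not name k-means, so the algorithm choice, the function name kmeans, the sample points, and the initial centroids are all assumptions made for illustration.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Group 2-D points around caller-supplied initial centroids (illustrative sketch)."""
    centroids = list(centroids)
    assignment = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # empty cluster: leave centroid in place
        if new_centroids == centroids:
            break  # centroids stopped moving, so the grouping is stable
        centroids = new_centroids
    return assignment, centroids

# Hypothetical unlabeled data: two visibly separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers = kmeans(points, centroids=[points[0], points[3]])
print(labels)   # cluster index assigned to each point
print(centers)  # final centroid of each cluster
```

No labels are supplied anywhere: the grouping comes only from the distances between the points, which is what makes this unsupervised.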


Where to use clustering?
• Data mining
• Information retrieval
• Text mining
• Web analysis
• Marketing
• Medical diagnostics

Major clustering methods
• Distance-based
• Hierarchical
• Partitioning
• Probabilistic

CLASSIFICATION
• Predicts categorical class labels
• Classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data

Classification: A Two-Step Process
• Model construction: describing a set of predetermined classes
  - Each tuple is assumed to belong to a predefined class, as determined by the class label attribute (supervised learning)
  - The set of tuples used for model construction: training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: for classifying previously unseen objects
  - Estimate accuracy of the model using a test set
    - The known label of each test sample is compared with the classified result from the model
    - Accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set is independent of the training set; otherwise over-fitting goes undetected and the accuracy estimate is overly optimistic
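The accuracy rate described above can be written as a single ratio. The symbols N_correct (test samples whose predicted label matches the known label) and N_test (total test samples) are notation introduced here for illustration; they do not appear in the slides.

```latex
\[
\text{accuracy rate} = \frac{N_{\text{correct}}}{N_{\text{test}}} \times 100\%
\]
```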

Classification Process: Model Construction

Training Data → Classification Algorithms → Classifier (Model)

Training Data
NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classifier (Model)
IF rank = professor OR years > 6 THEN tenured = yes

Classification Process: Model Usage in Prediction

The Classifier is applied to the Testing Data and to Unseen Data.

Testing Data
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4)
Tenured?
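The two steps can be traced in a few lines of code. The sketch below hard-codes the slide's rule (IF rank = professor OR years > 6 THEN tenured = yes) instead of learning it from the training data, then scores it on the slide's testing data and applies it to the unseen tuple for Jeff. The function name predict_tenured and the tuple layout are assumptions made for illustration.

```python
def predict_tenured(rank, years):
    """Apply the slide's rule: professor OR more than 6 years means tenured."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, known tenured label).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Model usage, part 1: compare each known label with the classified result.
correct = sum(
    predict_tenured(rank, years) == label
    for _, rank, years, label in testing_data
)
accuracy = 100.0 * correct / len(testing_data)
print(f"accuracy rate on the test set: {accuracy:.0f}%")  # 75%

# Model usage, part 2: classify the previously unseen tuple (Jeff, Professor, 4).
jeff = predict_tenured("Professor", 4)
print(f"Jeff tenured? {jeff}")  # yes
```

The rule misclassifies Merlisa (7 years, not tenured), so it scores 3 out of 4 on the test set even though it fits all six training tuples. This is why accuracy is estimated on data that is independent of the training set.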

Classification Techniques
• Classification by Decision Tree
• Bayesian Classification
• Classification by Backpropagation
• Classification based on Association Rule Mining

Classification vs Clustering
• Supervised learning (classification)
  - Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
• Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

THANK YOU
