
Data Mining: Functionalities, Classification and Task Primitives

Presented by: K. Aravind (10mx03), M. Boobalan (10mx05), V. Boopathiraj (10mx06), S. Kadhiresan (10mx18), L. RoshanAli (10mx41), A. Selvaraj (10mx46)

Data Mining Functionalities


It includes:
• Characterization and Discrimination
• Mining Frequent Patterns, Associations, and Correlations
• Classification and Regression
• Cluster Analysis
• Outlier Analysis

Characterization and Discrimination


• Class/Concept Description: Characterization and Discrimination
  • Data entries can be associated with classes or concepts.
  • Individual classes and concepts can be described in summarized, concise, and precise terms. Such descriptions of a class or a concept are called class/concept descriptions.

Characterization and Discrimination


• Data characterization is a summarization of the general characteristics or features of a target class of data.
• Data discrimination is a comparison of the general features of the target class data objects against the general features of objects from one or multiple contrasting classes.

Mining Frequent Patterns, Associations, and Correlations


• Frequent patterns are patterns that occur frequently in the data.
  • They include frequent itemsets, frequent subsequences (sequential patterns), and frequent substructures.
• Association rules
  • Single-dimensional association rules vs. multi-dimensional association rules

Association rules are discarded as uninteresting if they do not satisfy both a minimum support threshold and a minimum confidence threshold.
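
The thresholding described above can be sketched as follows. This is a minimal illustration, not a full rule miner: the transactions, item names, and threshold values are hypothetical, and the helper names (`count`, `support`, `confidence`, `is_interesting`) are made up for this example.

```python
# Hypothetical toy transactions; item names are illustrative only.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"computer", "software", "printer"},
    {"software"},
    {"computer", "software"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from transaction counts."""
    base = count(antecedent, transactions)
    if base == 0:
        return 0.0
    both = count(set(antecedent) | set(consequent), transactions)
    return both / base

def is_interesting(antecedent, consequent, transactions,
                   min_support, min_confidence):
    """Keep a rule only if it meets BOTH minimum thresholds."""
    s = support(set(antecedent) | set(consequent), transactions)
    c = confidence(antecedent, consequent, transactions)
    return s >= min_support and c >= min_confidence

# Rule: computer -> software (support 3/5, confidence 3/4)
print(is_interesting({"computer"}, {"software"}, transactions, 0.5, 0.7))
```

Raising the hypothetical confidence threshold to 0.8 would discard the same rule, since its confidence is 3/4.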

Classification and Regression


• Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts, so that the model can be used for future prediction.
• The class label is known.
• E.g., classify countries based on climate, or classify cars based on gas mileage.
• Presentation: decision tree, classification rules, neural network.
• Prediction (regression): predict some unknown or missing numerical values.
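
As a rough illustration of "finding a model" from labeled data, the sketch below learns a one-threshold rule (a decision stump, the simplest possible decision tree) for the cars-by-gas-mileage example. The mileage values, the labels "low" and "high", and the function names are assumptions made for this example only.

```python
def fit_stump(values, labels, low_label, high_label):
    """Pick the threshold t with the fewest training errors for the rule:
    value <= t -> low_label, otherwise high_label."""
    best_t, best_errors = None, len(values) + 1
    for t in sorted(set(values)):
        errors = sum(
            1 for v, y in zip(values, labels)
            if (low_label if v <= t else high_label) != y
        )
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors

def predict(value, threshold, low_label, high_label):
    """Apply the learned rule to a new, unlabeled value."""
    return low_label if value <= threshold else high_label

# Hypothetical training data: miles per gallon with known class labels.
mpg = [12, 15, 18, 22, 30, 35, 40]
labels = ["low", "low", "low", "low", "high", "high", "high"]

threshold, errors = fit_stump(mpg, labels, "low", "high")
print(threshold, errors)                      # 22 0
print(predict(25, threshold, "low", "high"))  # high
```

The learned threshold plays the role of the model; real classifiers such as full decision trees or neural networks follow the same pattern of fitting on labeled data and then predicting labels for new objects.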

Cluster Analysis
• Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns.
• Clustering is based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
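
One way to check this principle on a given grouping is sketched below. It assumes Euclidean distance as the measure of dissimilarity (small distance means high similarity); the points, the two candidate groupings, and the function names are hypothetical.

```python
from itertools import combinations, product
from math import dist

def mean_intra_distance(clusters):
    """Average distance between pairs of points in the SAME cluster.
    Assumes at least one cluster has two or more points."""
    pairs = [dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(pairs) / len(pairs)

def mean_inter_distance(clusters):
    """Average distance between pairs of points in DIFFERENT clusters.
    Assumes there are at least two non-empty clusters."""
    pairs = [
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p, q in product(a, b)
    ]
    return sum(pairs) / len(pairs)

# Two hypothetical groupings of the same six points.
good = [[(1, 1), (1, 2), (2, 1)], [(8, 8), (9, 8), (8, 9)]]
bad = [[(1, 1), (8, 8), (2, 1)], [(1, 2), (9, 8), (8, 9)]]

# A good clustering has small intra-cluster and large inter-cluster distance.
print(mean_intra_distance(good) < mean_inter_distance(good))
print(mean_intra_distance(good) < mean_intra_distance(bad))
```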

Outlier Analysis
• Outlier: a data object that does not comply with the general behavior of the data.
• It can be considered noise or an exception, but it is quite useful in fraud detection and rare-event analysis.
• The analysis of outlier data is referred to as outlier analysis or anomaly mining.
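
The slides do not name a specific detection method. As one simple possibility, the sketch below flags values whose z-score (distance from the mean in standard deviations) exceeds a threshold. The transaction amounts, the threshold of 2.0, and the function name are assumptions for illustration.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population standard
    deviations away from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts; 500 departs from the general behavior.
amounts = [20, 22, 19, 21, 23, 20, 18, 500]
print(zscore_outliers(amounts))  # [500]
```

In a fraud-detection setting the flagged value is the interesting one, which is why outliers are not always discarded as noise.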

Are all Patterns Interesting?


• What makes a pattern interesting?
• Can a data mining system generate all of the interesting patterns?
• Can a data mining system generate only interesting patterns?

Are all Patterns Interesting? (Cont.)
• A data mining system/query may generate thousands of patterns; not all of them are interesting.
  • Suggested approach: human-centered, query-based, focused mining
• Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm.
• Objective vs. subjective interestingness measures:
  • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  • Subjective: based on the user's beliefs about the data, e.g., unexpectedness, novelty, actionability, etc.

Can We Find All and Only Interesting Patterns?


• Find all the interesting patterns: completeness
  • Can a data mining system find all the interesting patterns?
  • Association vs. classification vs. clustering
• Search for only interesting patterns: optimization
  • Can a data mining system find only the interesting patterns?
  • Approaches
    • First generate all the patterns and then filter out the uninteresting ones.
    • Generate only the interesting patterns (mining query optimization).

Data Mining: Classification Schemes

• General functionality
  • Descriptive data mining
  • Predictive data mining
• Different views, different classifications
  • Kinds of databases to be mined
  • Kinds of knowledge to be discovered
  • Kinds of techniques utilized
  • Kinds of applications adapted

A Multi-Dimensional View of Data Mining Classification


• Databases to be mined
  • Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.
• Knowledge to be mined
  • Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
  • Multiple/integrated functions and mining at multiple levels

A Multi-Dimensional View of Data Mining Classification (Cont.)
• Techniques utilized
  • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
• Applications adapted
  • Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

Data Mining: Task Primitives

• Data mining without user interaction is usually not helpful.
• Users may request a few data mining primitives to be performed on data:
  • specification of the data to be mined
  • set of data in which the user is interested
  • kinds of knowledge to be mined
  • background knowledge useful in guiding the discovery process
  • specification of how knowledge should be visualized

Pieces of a Data Mining Task


• What data to mine
  • list of relevant attributes
• Kinds of knowledge to be mined
  • characterization
  • discrimination
  • association
  • classification
  • clustering
  • evolution analysis
• Background knowledge
  • concept hierarchies
• Interestingness measures
  • separate uninteresting patterns from knowledge
• Presentation and visualization of patterns
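
The pieces listed above can be pictured as the fields of a single task specification. The sketch below is purely illustrative: `MiningTask`, its field names, and the sample values are hypothetical and do not correspond to any particular data mining system or query language.

```python
from dataclasses import dataclass, field

# Kinds of knowledge named on this slide.
ALLOWED_KINDS = {
    "characterization", "discrimination", "association",
    "classification", "clustering", "evolution analysis",
}

@dataclass
class MiningTask:
    """Hypothetical container for the pieces of a data mining task."""
    relevant_attributes: list                                # what data to mine
    knowledge_kind: str                                      # kind of knowledge
    concept_hierarchies: dict = field(default_factory=dict)  # background knowledge
    interestingness: dict = field(default_factory=dict)      # e.g. thresholds
    presentation: str = "rules"                              # how to show patterns

    def __post_init__(self):
        # Reject kinds of knowledge that the slide does not list.
        if self.knowledge_kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown kind of knowledge: {self.knowledge_kind}")

task = MiningTask(
    relevant_attributes=["name", "price"],
    knowledge_kind="association",
    interestingness={"min_support": 0.022, "min_confidence": 0.6},
)
print(task.presentation)  # rules
```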

Task Relevant Data


• Minable view of the data
  • name of database or warehouse
  • name of tables or cubes
  • conditions for selecting useful data
    • type = home entertainment
    • type = fruit
  • attributes or dimensions (e.g., name and price)

Kind of Knowledge to be Mined


• Templates or metapatterns may be used to specify the form of the results:
  • P(X: customer, W) AND Q(X, Y) -> buys(X, Z)
  • age(X, 30..39) AND income(X, 40K..49K) -> buys(X, VCR) [2.2%, 60%]
  • A task might instead specify classifying an input file of customers as "likely to buy" or "not likely to buy".
  • [2.2%, 60%] indicates that 60% confidence is to be used and that such cases should represent 2.2% of all transactions (the support).

Background Knowledge: Concept Hierarchies


• Concept hierarchy
  • defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts
  • examples: location, time, product
• Types of hierarchies
  • schema hierarchy
  • set-grouping hierarchy
  • operation-derived hierarchy
  • rule-based hierarchy

Concept Hierarchies
• Schema
  • total or partial order among attributes, usually a warehouse dimension (time, location, etc.)
• Set-grouping
  • values for a given attribute are lumped into groups of constants or range values
• Operation-derived
  • automatically derived by clustering, extraction, etc.
• Rule-based
  • hierarchy may be defined by a set of rules
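
A small sketch of two of these hierarchy types follows: a schema hierarchy on location (city -> province -> country) used to roll data up to a higher level, and a set-grouping hierarchy that lumps ages into ranges. The city mappings, age boundaries, group names, and sales figures are assumptions chosen for illustration.

```python
from collections import Counter

# Schema hierarchy for the location dimension: city -> province -> country.
CITY_TO_PROVINCE = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "Toronto": "Ontario",
}
PROVINCE_TO_COUNTRY = {"British Columbia": "Canada", "Ontario": "Canada"}

def roll_up(city, level):
    """Map a low-level concept (city) to the requested higher level."""
    if level == "city":
        return city
    province = CITY_TO_PROVINCE[city]
    if level == "province":
        return province
    if level == "country":
        return PROVINCE_TO_COUNTRY[province]
    raise ValueError(f"unknown level: {level}")

def age_group(age):
    """Set-grouping hierarchy: ranges of values lumped into named groups.
    The boundaries here are hypothetical."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle_aged"
    return "senior"

def aggregate(sales, level):
    """Sum units sold after generalizing each city to the given level."""
    totals = Counter()
    for city, units in sales:
        totals[roll_up(city, level)] += units
    return dict(totals)

sales = [("Vancouver", 3), ("Victoria", 2), ("Toronto", 5)]
print(aggregate(sales, "province"))  # {'British Columbia': 5, 'Ontario': 5}
print(aggregate(sales, "country"))   # {'Canada': 10}
```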

Interestingness Measures
• Simplicity: e.g., (association) rule length, (decision) tree size
• Certainty: e.g., confidence, P(A|B) = n(A and B) / n(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
• Utility: potential usefulness, e.g., support (association), noise threshold (description)
• Novelty: not previously known, surprising (used to remove redundant rules, e.g., Canada vs. Vancouver rule implication support ratio)

Presentation and visualization of patterns


• Different backgrounds/usages may require different forms of representation
  • E.g., rules, tables, crosstabs, pie/bar charts, etc.
• Concept hierarchy is also important
  • Discovered knowledge might be more understandable when represented at a high level of abstraction
  • Interactive drill up/down, pivoting, slicing and dicing provide different perspectives on the data
• Different kinds of knowledge require different representations: association, classification, clustering, etc.

Thank you!!!
