This action might not be possible to undo. Are you sure you want to continue?

Introduction

Motivation: Why data mining? What is data mining? Data Mining: On what kind of data? Data mining functionality Classification of data mining systems

March 3, 2010

Data Mining: Concepts and Techniques

1

**Why Data Mining?
**

**The Explosive Growth of Data: from terabytes to petabytes
**

**Data collection and data availability
**

Automated data collection tools, database systems, Web, computerized society

**Major sources of abundant data
**

Business: Web, e-commerce, transactions, stocks, … Science: Remote sensing, bioinformatics, scientific simulation, … Society and everyone: news, digital cameras, YouTube

We are drowning in data, but starving for knowledge! “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets

Data Mining: Concepts and Techniques

March 3, 2010

2

March 3, 2010

Data Mining: Concepts and Techniques

3

**Evolution of Database Technology
**

1960s:

Data collection, database creation, IMS and network DBMS Relational data model, relational DBMS implementation RDBMS, advanced data models (extended-relational, OO, deductive, etc.) Application-oriented DBMS (spatial, scientific, engineering, etc.) Data mining, data warehousing, multimedia databases, and Web databases Stream data management and mining Data mining and its applications Web technology (XML, data integration) and global information systems

1970s:

1980s:

1990s:

2000s

March 3, 2010

Data Mining: Concepts and Techniques

4

**What Is Data Mining?
**

**Data mining (knowledge discovery from data)
**

Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data

Data mining: a misnomer? Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. Simple search and query processing (Deductive) expert systems

Data Mining: Concepts and Techniques

Alternative names

**Watch out: Is everything “data mining”?
**

March 3, 2010

5

**Why Data Mining?—Potential Applications
**

**Data analysis and decision support
**

**Market analysis and management
**

Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation Forecasting, customer retention, improved underwriting, quality control, competitive analysis

**Risk analysis and management
**

Fraud detection and detection of unusual patterns (outliers) Text mining (news group, email, documents) and Web mining Stream data mining Bioinformatics and bio-data analysis

Other Applications

March 3, 2010

Data Mining: Concepts and Techniques

6

**Ex. 1: Market Analysis and Management
**

Where does the data come from?—Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies Target marketing

Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc. Determine customer purchasing patterns over time

Cross-market analysis—Find associations/co-relations between product sales, & predict based on such association Customer profiling—What types of customers buy what products (clustering or classification) Customer requirement analysis

Identify the best products for different groups of customers Predict what factors will attract new customers Multidimensional summary reports Statistical summary information (data central tendency and variation)

**Provision of summary information
**

March 3, 2010

Data Mining: Concepts and Techniques

7

**Ex. 2: Corporate Analysis & Risk Management
**

**Finance planning and asset evaluation
**

cash flow analysis and prediction contingent claim analysis to evaluate assets cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)

Resource planning

summarize and compare the resources and spending monitor competitors and market directions group customers into classes and a class-based pricing procedure set pricing strategy in a highly competitive market

Data Mining: Concepts and Techniques

Competition

March 3, 2010

8

**Ex. 3: Fraud Detection & Mining Unusual Patterns
**

Approaches: Clustering & model construction for frauds, outlier analysis Applications: Health care, retail, credit card service, telecomm.

**Auto insurance: ring of collisions Money laundering: suspicious monetary transactions Medical insurance
**

Professional patients, ring of doctors, and ring of references Unnecessary or correlated screening tests Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm Analysts estimate that 38% of retail shrink is due to dishonest employees

**Telecommunications: phone-call fraud
**

Retail industry

Anti-terrorism

March 3, 2010

Data Mining: Concepts and Techniques

9

**Knowledge Discovery (KDD) Process
**

Data mining—core of knowledge discovery process

Pattern Evaluation Data Mining

**Task-relevant Data Data Warehouse Data Cleaning Data Integration Databases
**

March 3, 2010

Selection

Data Mining: Concepts and Techniques

10

**KDD Process: Several Key Steps
**

**Learning the application domain
**

relevant prior knowledge and goals of application

Creating a target data set: data selection Data cleaning and preprocessing: (may take 60% of effort!) Data reduction and transformation

Find useful features, dimensionality/variable reduction, invariant representation summarization, classification, regression, association, clustering

**Choosing functions of data mining
**

Choosing the mining algorithm(s) Data mining: search for patterns of interest Pattern evaluation and knowledge presentation

**visualization, transformation, removing redundant patterns, etc.
**

Data Mining: Concepts and Techniques

**Use of discovered knowledge
**

11

March 3, 2010

**Data Mining and Business Intelligence
**

Increasing potential to support business decisions End User

Decisio n Making Data Presentation Visualization Techniques Data Mining Information Discovery

Business Analyst Data Analyst

Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems

March 3, 2010 Data Mining: Concepts and Techniques

DBA

12

**Architecture: Typical Data Mining System
**

Graphical User Interface Pattern Evaluation Data Mining Engine Database or Data Warehouse Server

data cleaning, integration, and selection Know ledge -Base

Database

March 3, 2010

**Data World-Wide Other Info Repositories Warehouse Web
**

Data Mining: Concepts and Techniques

13

**Data Mining: On What Kinds of Data?
**

**Database-oriented data sets and applications
**

Relational database, data warehouse, transactional database Data streams and sensor data Time-series data, temporal data, sequence data (incl. bio-sequences) Structure data, graphs, social networks and multi-linked data Object-relational databases Heterogeneous databases and legacy databases Spatial data and spatiotemporal data Multimedia database Text databases The World-Wide Web

Data Mining: Concepts and Techniques

**Advanced data sets and advanced applications
**

March 3, 2010

14

**Data Mining Tasks
**

Classification- Airline Passengers(terrorists) Regression- Investment Strategy Prediction- Flood Prediction Time Series Analysis- stocks Summarization Clustering Association Rules Sequence Discovery

March 3, 2010

Data Mining: Concepts and Techniques

15

**Data Mining Functionalities (1)
**

**Multidimensional concept description: Characterization and discrimination
**

Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions Diaper Beer [0.5%, 75%] (Correlation or causality?) Construct models (functions) that describe and distinguish classes or concepts for future prediction

**Frequent patterns, association, correlation vs. causality
**

**Classification and prediction
**

E.g., classify countries based on (climate), or classify cars based on (gas mileage)

**Predict some unknown or missing numerical values
**

Data Mining: Concepts and Techniques

16

**Data Mining Functionalities (2)
**

Cluster analysis Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns Maximizing intra-class similarity & minimizing interclass similarity Outlier analysis Outlier: Data object that does not comply with the general behavior of the data Noise or exception? Useful in fraud detection, rare events analysis Trend and evolution analysis Trend and deviation: e.g., regression analysis Sequential pattern mining: e.g., digital camera large SD memory Periodicity analysis Similarity-based analysis Other pattern-directed or statistical analyses

March 3, 2010

Data Mining: Concepts and Techniques

17

**Are All the “Discovered” Patterns Interesting?
**

**Data mining may generate thousands of patterns: Not all of them are interesting
**

Suggested approach: Human-centered, query-based, focused mining

Interestingness measures

A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm

**Objective vs. subjective interestingness measures
**

Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

March 3, 2010

Data Mining: Concepts and Techniques

18

**Can We Find All and Only Interesting Patterns?
**

**Find all the interesting patterns: Completeness
**

Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns? Heuristic vs. exhaustive search Association vs. classification vs. clustering Can a data mining system find only the interesting patterns? Approaches

**Search for only interesting patterns: An optimization problem
**

First general all the patterns and then filter out the uninteresting ones Generate only the interesting patterns—mining query optimization

Data Mining: Concepts and Techniques

March 3, 2010

19

**Data Mining: Confluence of Multiple Disciplines
**

Database Technology Statistics

Machine Learning Pattern Recognition

Data Mining

Visualization

Algorithm

Data Mining: Concepts and Techniques

Other Disciplines

March 3, 2010

20

**Why Not Traditional Data Analysis?
**

**Tremendous amount of data
**

Algorithms must be highly scalable to handle such as terabytes of data Micro-array may have tens of thousands of dimensions Data streams and sensor data Time-series data, temporal data, sequence data Structure data, graphs, social networks and multi-linked data Heterogeneous databases and legacy databases Spatial, spatiotemporal, multimedia, text and Web data Software programs, scientific simulations

**High-dimensionality of data
**

**High complexity of data
**

**New and sophisticated applications
**

Data Mining: Concepts and Techniques

March 3, 2010

21

**Data Mining: Classification Schemes
**

General functionality

Descriptive data mining Predictive data mining Data view: Kinds of data to be mined Knowledge view: Kinds of knowledge to be discovered Method view: Kinds of techniques utilized Application view: Kinds of applications adapted

Data Mining: Concepts and Techniques

**Different views lead to different classifications
**

March 3, 2010

22

Data Modeling & Techniques

March 3, 2010

Data Mining: Concepts and Techniques

23

**Multi-Dimensional View of Data Mining
**

Data to be mined

Relational, data warehouse, transactional, stream, objectoriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc. Multiple/integrated functions and mining at multiple levels Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc. Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Data Mining: Concepts and Techniques

Knowledge to be mined

Techniques utilized

Applications adapted

March 3, 2010

24

**Data Mining issues
**

Human Interaction Overfitting Outliers Interpretation of Results Visualization of Results Large Datasets High Dimensionality

March 3, 2010

Data Mining: Concepts and Techniques

25

**Data Mining issues
**

Multimedia Data Missing Data Irrelevant Data Noisy Data Changing Data Interpretation Application

March 3, 2010

Data Mining: Concepts and Techniques

26

**Major Issues in Data Mining
**

Mining methodology

Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web Performance: efficiency, effectiveness, and scalability Pattern evaluation: the interestingness problem Incorporation of background knowledge Handling noise and incomplete data Parallel, distributed and incremental mining methods Integration of the discovered knowledge with existing one: knowledge fusion Data mining query languages and ad-hoc mining Expression and visualization of data mining results Interactive mining of knowledge at multiple levels of abstraction Domain-specific data mining & invisible data mining Protection of data security, integrity, and privacy

Data Mining: Concepts and Techniques

User interaction

**Applications and social impacts
**

March 3, 2010

27

Related Concepts

**Database/OLTP Systems Schema
**

(ID,Name,Address,Salary,JobNo) ER Relational

Data Model

ID

Name Emp Has JOB

Job No.

Job Desc Job

Transaction Query:

SELECT Name FROM T WHERE Salary > 100000

**DM: Only imprecise queries
**

March 3, 2010 Data Mining: Concepts and Techniques

28

**Fuzzy sets and Fuzzy Logic
**

Fuzzy Set: Set membership function is a real valued function with output in the range [0,1]. f(x): Probability x is in F. 1-f(x): Probability x is not in F. EX: T = {x | x is a person and x is tall} Let f(x) be the probability that x is tall Here f is the membership function

DM: Prediction and classification are fuzzy.

March 3, 2010

Data Mining: Concepts and Techniques

29

Fuzzy Sets

March 3, 2010

Data Mining: Concepts and Techniques

30

Fuzzy Classification

Loan Amnt

Reject Accept Simple

**Reject Accept Fuzzy
**

Data Mining: Concepts and Techniques

March 3, 2010

31

Information Retrieval

Information Retrieval (IR): retrieving desired information from textual data. Library Science Digital Libraries Web Search Engines Traditionally keyword based Sample query:

Find all documents about “data mining”.

**DM: Similarity measures; Mine text/Web data.
**

March 3, 2010 Data Mining: Concepts and Techniques

32

Information Retrieval

Similarity: measure of how close a query is to a document. Documents which are “close enough” are retrieved. Metrics:

**Precision = |Relevant and Retrieved| |Retrieved| Recall = |Relevant and Retrieved| |Relevant|
**

March 3, 2010 Data Mining: Concepts and Techniques

33

**IR Query Result Measures and Classification
**

30-tall, 70-not all

IR

March 3, 2010

Classification

Data Mining: Concepts and Techniques

34

Dimensional Modeling

View data in a hierarchical manner more as business executives might Useful in decision support systems and mining Dimension: collection of logically related attributes; axis for modeling data. Facts: data stored Ex: Dimensions – products, locations, date Facts – quantity, unit price DM: May view Data Mining: Conceptsdimensional. data as and

Techniques

March 3, 2010

35

Relational View of Data

ProdID 123 123 150

**LocID Dallas Houston Dallas
**

36

Cube view of Data

37

**Dimensional Modeling Queries
**

Roll Up: more general dimension Drill Down: more specific dimension Dimension (Aggregation) Hierarchy SQL uses aggregation Decision Support Systems (DSS): Computer systems and tools to assist managers in making decisions and solving problems.

38

Aggregation Hierarchies

Product<Type<Company

39

Star Schema Star Schema

40

Data Warehousing

“Subject-oriented, integrated, time-variant,

nonvolatile” Operational Data: Data used in day to day needs of company. Informational Data: Supports other functions such as planning and forecasting. Data mining tools often access data warehouses rather than operational data.

DM: May access data in warehouse.

41

Operational vs. Informational

Operational Data Application Use Temporal Modification Orientation Data Size Level Access Response Data Schema OLTP Precise Queries Snapshot Dynamic Application Operational Values Gigabits Detailed Often Few Seconds Relational Data Warehouse OLAP Ad Hoc Historical Static Business Integrated Terabits Summarized Less Often Minutes Star/Snowflake

42

OLAP

Online Analytic Processing (OLAP): provides more complex queries than OLTP. OnLine Transaction Processing (OLTP): traditional database/transaction processing. Dimensional data; cube view Visualization of operations:

Slice: examine sub-cube. Dice: rotate cube to look at another dimension. Roll Up/Drill Down

**DM: May use OLAP queries.
**

43

OLAP Operations

Roll Up

Drill Down

Single Cell Multiple Cells

Slice

Dice

44

Statistics

Simple descriptive models-data distribution, mean Statistical inference: generalizing a model created from a sample of the data to the entire dataset. Exploratory Data Analysis:

Data can actually drive the creation of the model Opposite of traditional statistical view.

Data mining targeted to business user

**DM: Many data mining methods come from statistical techniques.
**

45

Machine Learning

Machine Learning: area of AI that examines how to write programs that can learn. Often used in classification and prediction Supervised Learning: learns by example. Unsupervised Learning: learns without knowledge of correct answers. Machine learning often deals with small static datasets. DM: Uses many machine learning techniques.

46

**Pattern Matching (Recognition)
**

Pattern Matching: finds occurrences of a predefined pattern in the data. Applications include speech recognition, information retrieval, time series analysis.

DM: Type of classification.

47

DM vs. Related Topics

Area Query DB/OLTP Precise D IR

48

Precise D

**Data Mining Techniques
**

Statistical Point Estimation Models Based on Summarization Bayes Theorem Hypothesis Testing Regression and Correlation Similarity Measures Decision Trees Neural Networks Activation Functions Genetic Algorithms

49

Point Estimation

Point Estimate: estimate a population parameter. May be made by calculating the parameter for a sample. May be used to predict value for missing data. Ex:

R contains 100 employees 99 have salary information Mean salary of these is $50,000 Use $50,000 as value of remaining employee’s salary. Is this a good idea?

Data Mining: Concepts and Techniques

March 3, 2010

50

Estimation Error

Bias: Difference between expected value and actual value. Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value: Why square? Root Mean Square Error (RMSE)

Data Mining: Concepts and Techniques

March 3, 2010

51

**Maximum Likelihood Estimate (MLE)
**

Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model. Joint probability for observing the sample data by multiplying the individual probabilities. Likelihood function:

Maximize L.

March 3, 2010

Data Mining: Concepts and Techniques

52

MLE Example

Coin toss five times: {H,H,H,H,T} Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is:

**However if the probability of a H is 0.8 then:
**

Data Mining: Concepts and Techniques

March 3, 2010

53

Bayes Theorem

Posterior Probability: P(h1|xi) Prior Probability: P(h1) Bayes Theorem:

**Assign probabilities of hypotheses given a data value.
**

54

**Bayes Example(cont’d)
**

Training Data:

55

ID Inco 1 4 2 3 3 2

**Bayes Theorem Example
**

Credit authorizations (hypotheses): h1=authorize purchase, h2 = authorize after further identification, h3=do not authorize, h4= do not authorize but contact police From training data: P(h1) = 60%; P(h2)=20%; P(h3)=10%; P(h4)=10%. Assign twelve data values for all 1 2 4 combinations of credit and3income:

Excellen t Good Bad X1 X5 X9 X2 X3 X4 X6 X10

56

X7 X11

X8 X12

**Bayes Example(cont’d)
**

**Calculate P(xi|hj) and P(xi) Ex: P(x7|h1)=2/6; P(x4|h1)=1/6; P(x2|h1)=2/6; P(x8|h1)=1/6; P(xi|h1)=0 for all other xi. Predict the class for x4:
**

Calculate P(hj|x4) for all hj. Place x4 in class with largest value. Ex:

P(h1|x4)=(P(x4|h1)(P(h1))/P(x4) =(1/6)(0.6)/0.1=1. x in class h . 4 1

57

Similarity Measures

Determine similarity between two objects. The Similarity between two tuples ti and tj, sim(ti, tj) in a database D is a mapping from D×D to thr range [0,1]. Thus sim(ti, tj)ε [0,1] Similarity characteristics:

58

Similarity Measures

Data Mining: Concepts and Techniques

59

Similarity Measures

Alternatively, distance measure measure how unlike or dissimilar objects are.

Data Mining: Concepts and Techniques

60

Decision Trees

Decision Tree (DT):

Tree where the root and each internal node is labeled with a question. The arcs represent each possible answer to the associated question. Each leaf node represents a prediction of a solution to the problem.

Popular technique for classification; Leaf node indicates class to which the corresponding tuple belongs.

61

Decision Tree Example

62

Neural Networks

**Neural Network (NN) is a directed graph F=<V,A> with vertices V={1,2, …,n} and arcs A={<i,j>|1<=i,j<=n}, with the following restrictions:
**

V is partitioned into a set of input nodes, VI, hidden nodes, VH, and output nodes, VO. The vertices are also partitioned into layers Any arc <i,j> must have node i in layer h-1 and node j in layer h. Arc <i,j> is labeled with a numeric value wij . Node i is labeled with a function fi.

63

Neural Network Example

© Prentice Hall

64

Summary

Data mining: Discovering interesting patterns from large amounts of data A natural evolution of database technology, in great demand, with wide applications A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation Mining can be performed in a variety of information repositories Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. Data mining systems and architectures Major issues in data mining

Data Mining: Concepts and Techniques

March 3, 2010

65

Are you sure?

This action might not be possible to undo. Are you sure you want to continue?

We've moved you to where you read on your other device.

Get the full title to continue

Get the full title to continue reading from where you left off, or restart the preview.

scribd