
Data Mining

Definition of Data Mining

Data Mining and Business Intelligence
Increasing potential to support business decisions

Layers, from top to bottom:
„ Making Decisions
„ Data Presentation: Visualization Techniques
„ Data Mining: Information Discovery
„ Data Exploration: Statistical Analysis, Querying and Reporting
„ Data Warehouses / Data Marts: OLAP
„ Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Roles shown alongside the layers: End User, Business Analyst, Data Analyst


Data pyramid

„ Wisdom = Knowledge + experience
„ Knowledge = Information + rules
„ Information = Data + context
„ Data

Related Fields
[Figure: Data Mining and Knowledge Discovery shown alongside related fields, including Machine Learning]



Knowledge Discovery Process
[Figure: Raw Data → Integration → Data Warehouse → Selection & Cleaning → Target Data → Transformation → Transformed Data → Data Mining → Patterns and Rules → Interpretation & Evaluation]

The Evolution of Data Analysis
Each evolutionary step is listed with its business question, enabling technologies, product providers, and characteristics.

Data Collection (1960s)
ƒ Business Question: "What was my total revenue in the last five years?"
ƒ Enabling Technologies: Computers, tapes, disks
ƒ Product Providers: IBM, CDC
ƒ Characteristics: Retrospective, static data delivery

Data Access (1980s)
ƒ Business Question: "What were unit sales in New England last March?"
ƒ Enabling Technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
ƒ Product Providers: Oracle, Sybase, Informix, IBM, Microsoft
ƒ Characteristics: Retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support (1990s)
ƒ Business Question: "What were unit sales in New England last March? Drill down to Boston."
ƒ Enabling Technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
ƒ Product Providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
ƒ Characteristics: Retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today)
ƒ Business Question: "What’s likely to happen to Boston unit sales next month? Why?"
ƒ Enabling Technologies: Advanced algorithms, multiprocessor computers, massive databases
ƒ Product Providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
ƒ Characteristics: Prospective, proactive information delivery

Need for Data Mining
„ Data accumulates and doubles every 9 months
„ There is a big gap between stored data and knowledge, and the transition won’t occur automatically
„ Manual data analysis is not new, but it is a bottleneck
„ Fast-developing computer science and engineering generate new demands
„ Seeking knowledge from massive data

When is DM useful
„ Data rich world
„ Large data
„ Little knowledge about data (exploratory data analysis)

Data mining is not
„ OLAP
„ Data Warehousing
„ Data Visualization
„ SQL
„ Ad Hoc Queries
„ Reporting

Data Mining is…
„ Predictive Modeling
„ Linear/Logistic Regression
„ Neural Networks
„ Decision Trees
„ Clustering

Data Mining is
„ Segmentation
„ Decision Trees
„ Neural Networks
„ Predictive Modeling
„ Affinity Analysis
„ Association Rule
„ Sequence Generators

„ Increasing data dimensionality and data size
„ Various data forms
„ New data types
ƒ Streaming data, multimedia data
„ Efficient search and access to data/knowledge
„ Intelligent update and integration

Data Mining Survey
Industry Pioneers
„ 23% Manufacturing
„ 19% Financial Services
„ 17% Tele/Data communication
„ 13% Media
„ 12% Retail/Wholesaler

„ 21.4% Understanding Customer Segments and Preferences
„ 19.5% Identifying Profitable Customers and Acquiring New Ones
„ 14.1% Increasing Revenue From Customers

Results of Data Mining Include:
„ Forecasting what may happen in the future
„ Classifying people or things into groups by recognizing patterns
„ Clustering people or things into groups based on their attributes
„ Associating what events are likely to occur together
„ Sequencing what events are likely to lead to later events

Data Mining versus OLAP
„ OLAP (On-line Analytical Processing)
ƒ Provides a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening

Data Mining Versus Statistical Analysis
„ Data Mining
ƒ Originally developed to act as expert systems to solve problems
ƒ Less interested in the mechanics of the technique
ƒ If it makes sense, then let’s use it
ƒ Does not require assumptions to be made about data
ƒ Can find patterns in very large amounts of data
ƒ Requires understanding of data and business problem

„ Data Analysis
ƒ Tests for statistical correctness of models
‚ Are statistical assumptions of models correct?
• E.g., is the R-squared good?
ƒ Hypothesis testing
‚ Is the relationship significant?
• Use a t-test to validate significance
ƒ Tends to rely on sampling
ƒ Techniques are not optimised for large amounts of data
ƒ Requires strong statistical skills
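
The t-test mentioned above can be illustrated with a minimal sketch. It computes only Welch's t statistic for two samples using the Python standard library; the sample values and the function name `welch_t_statistic` are hypothetical, and turning the statistic into a p-value would additionally need the t distribution, which is not shown here.

```python
import math
import statistics

def welch_t_statistic(sample_a, sample_b):
    """Return Welch's t statistic for the difference in means of two samples."""
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # statistics.variance is the sample variance (divides by n - 1).
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical measurements for two customer groups.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 3, 4, 5, 6]
t_value = welch_t_statistic(group_a, group_b)
print(t_value)
```

A large absolute value of the statistic suggests the difference in means is unlikely to be due to sampling noise alone.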

Data Mining Tasks...
„ Classification
„ Clustering
„ Association Rule Discovery
„ Sequential Pattern Discovery
„ Deviation Detection

Classification Application
„ Direct Marketing
„ Fraud Detection
„ Customer Attrition/Churn
„ Sky Survey Cataloging

Data Mining Tasks: Clustering
„ Goal is to identify categories
„ Natural grouping of customers by processing all the available data about them.
„ Other applications
ƒ Market segmentation, discovering affinity groups, and defect analysis
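
One way to picture "natural grouping" is a small k-means sketch. The slides do not name a clustering algorithm, so k-means is an assumption here, and the customer points, the starting centroids, and the function name `k_means` are all hypothetical.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            distances = [math.dist(point, c) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Move each centroid to the mean of its cluster; keep it in place if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical customers described by two attributes (e.g. visits, spend).
customers = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(customers, centroids=[(0, 0), (10, 10)])
print(clusters)
```

With these inputs the six customers fall into two groups of three, one near the origin and one near (8, 8).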

Data Mining Tasks: Association Rule Discovery
„ Given a set of records, each of which contains some number of items from a given collection:
ƒ Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
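
The two rules can be checked against the five transactions with a short sketch. Support and confidence are the usual measures for association rules; the slides do not state thresholds, so none are assumed, and the helper names `count`, `support`, and `confidence` are illustrative.

```python
# The five transactions from the table, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

def count(itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for items in transactions.values() if itemset <= items)

def support(itemset):
    """Fraction of all transactions that contain itemset."""
    return count(itemset) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs), support(lhs | rhs), confidence(lhs, rhs))
```

For {Milk} --> {Coke}, three of the four transactions containing Milk also contain Coke. For {Diaper, Milk} --> {Beer}, two of the three transactions containing both Diaper and Milk also contain Beer.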

Association Rule Discovery Application
„ Marketing and Sales Promotion
„ Supermarket Shelf Management
„ Inventory Management

Deviation Detection & Pattern Discovery
Deviation Detection: …discovering most significant changes in data from previously measured or normative values…
Sequential Pattern Discovery: …process of looking for patterns and rules that predict strong sequential dependencies among different events…
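
As a rough illustration of deviation detection, the sketch below flags new values that lie far from a baseline of previously measured values. The three-standard-deviation rule, the readings, and the function name `find_deviations` are assumptions chosen for illustration; the slides do not prescribe a method.

```python
import statistics

def find_deviations(baseline, new_values, threshold=3.0):
    """Return the new values lying more than `threshold` standard deviations from the baseline mean."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    return [value for value in new_values if abs(value - mean) > threshold * stdev]

# Hypothetical previously measured values and newly observed values.
baseline = [100, 102, 98, 101, 99]
observed = [100, 97, 130]
deviations = find_deviations(baseline, observed)
print(deviations)
```

Here only 130 is reported, because 97 is still within three standard deviations of the baseline mean of 100.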

Sequential Patterns

„ Identify frequently occurring sequences from given records
„ 40 percent of female customers buy a gray skirt six months after buying a red jacket

Data Mining Methodology:
„ Sample
ƒ Extract a portion of the dataset for data mining

„ Explore
„ Modify
ƒ Create, select and transform variables with the intention of building a model

„ Model
ƒ Specify a relationship of variables that reliably predicts a desired goal

„ Assess
ƒ Evaluate the practical value of the findings and the model resulting from the data mining effort

Data Mining Methodology:
„ Data understanding
„ Data preparation
„ Modeling
„ Evaluation
„ Deployment

DM Phases

Phases and Tasks
Each phase is listed with its tasks, followed by each task's outputs.

Business Understanding
„ Determine Business Objectives: Background; Business Objectives; Business Success Criteria
„ Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
„ Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
„ Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques

Data Understanding
„ Collect Initial Data: Initial Data Collection Report
„ Describe Data: Data Description Report
„ Explore Data: Data Exploration Report
„ Verify Data Quality: Data Quality Report

Data Preparation
„ Data Set: Data Set Description
„ Select Data: Rationale for Inclusion / Exclusion
„ Clean Data: Data Cleaning Report
„ Construct Data: Derived Attributes; Generated Records
„ Integrate Data: Merged Data
„ Format Data: Reformatted Data

Modeling
„ Select Modeling Technique: Modeling Technique; Modeling Assumptions
„ Generate Test Design: Test Design
„ Build Model: Parameter Settings; Models; Model Description
„ Assess Model: Model Assessment; Revised Parameter Settings

Evaluation
„ Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
„ Review Process: Review of Process
„ Determine Next Steps: List of Possible Actions; Decision

Deployment
„ Plan Deployment: Deployment Plan
„ Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
„ Produce Final Report: Final Report; Final Presentation
„ Review Project: Experience Documentation

Major Application Areas for Data Mining Solutions
„ Fraud/Non-Compliance Anomaly detection
ƒ Isolate the factors that lead to fraud, waste and abuse
ƒ Target auditing and investigative efforts more effectively
„ Recruiting/Attracting customers
„ Maximizing profitability (cross selling, identifying profitable customers)
„ Service Delivery and Customer Retention
ƒ Build profiles of customers likely to use which services
„ Credit/Risk Scoring
„ Intrusion detection
„ Parts failure prediction
„ Web Mining
„ Health Care

Case Study: Search Engines
„ Early search engines relied mainly on the keywords on a page and were subject to manipulation
„ Google's success is due to its algorithm, which relies mainly on links to the page
„ Google founders Sergey Brin and Larry Page were students at Stanford doing research in databases and data mining in 1998, which led to Google

Case Study: Direct Marketing and CRM
„ Most major direct marketing companies are using modeling and data mining
„ Most financial companies are using customer modeling
„ Modeling is easier than changing customer behaviour

Final Remarks
„ Data Mining can be utilized in any field that needs to find patterns or relationships in its data.

Special Data Types
„ Spatial Data
„ Streamed Data
„ Multimedia data

Spatial Mining
„ Spatial data is about instances located in a physical space
„ Spatial data has location or geo-referenced features
„ Some of these features are:
ƒ Address, latitude/longitude (explicit)
ƒ Location-based partitions in databases (implicit)

Applications and Problems
„ Geographic information systems (GIS) store information related to geographic locations on Earth
ƒ Weather, community infrastructure needs, disaster management, and hazardous waste
„ Homeland security issues such as prediction of unexpected events and planning of evacuation
„ Remote sensing and image classification
„ Biomedical applications include medical imaging and illness diagnosis

Use of Spatial Data
„ Map overlay – merging disparate data
ƒ Different views of the same area: (Level 1) streets, power lines, phone lines, sewer lines, (Level 2) actual elevations, building locations, and rivers
„ Spatial selection – find all houses near WSU
„ Spatial join – nearest for points, intersection for areas
„ Other basic spatial operations
ƒ Region/range query for objects intersecting a region
ƒ Nearest neighbor query for objects closest to a given place
ƒ Distance scan asking for objects within a certain radius
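
The three basic operations above can be sketched on point objects with a linear scan. This is a minimal illustration, not an indexed implementation: the `houses` data and the function names are hypothetical, and a real system would normally answer these queries through a spatial index.

```python
import math

# Hypothetical point objects: name -> (x, y) coordinates.
houses = {"A": (1.0, 1.0), "B": (2.0, 5.0), "C": (6.0, 2.0), "D": (3.0, 3.0)}

def range_query(objects, x_min, y_min, x_max, y_max):
    """Names of objects that fall inside the given rectangular region."""
    return sorted(
        name for name, (x, y) in objects.items()
        if x_min <= x <= x_max and y_min <= y <= y_max
    )

def nearest_neighbor(objects, place):
    """Name of the object closest to the given place."""
    return min(objects, key=lambda name: math.dist(objects[name], place))

def distance_scan(objects, place, radius):
    """Names of objects within `radius` of the given place."""
    return sorted(name for name, pos in objects.items() if math.dist(pos, place) <= radius)

print(range_query(houses, 0, 0, 3, 3))
print(nearest_neighbor(houses, (2.5, 3.5)))
print(distance_scan(houses, (2.5, 3.5), 2.0))
```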

Spatial Data Structures
„ Minimum bounding rectangles (MBR)
„ Different tree structures
ƒ Quad tree
ƒ R-Tree
ƒ kd-Tree
„ Image databases

„ An MBR represents a spatial object by the smallest rectangle [(x1,y1), (x2,y2)] that encloses it; a complex object may be represented by several rectangles
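
A minimal sketch of the idea, assuming each object is given as a list of (x, y) vertices: the MBR is the pair of corners [(x1,y1), (x2,y2)], and two MBRs can be compared cheaply before any exact geometry test. The shapes and function names below are hypothetical.

```python
def minimum_bounding_rectangle(points):
    """Return ((x1, y1), (x2, y2)): the lower-left and upper-right corners enclosing all points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys)), (max(xs), max(ys))

def mbrs_intersect(a, b):
    """True if two MBRs overlap or touch."""
    (ax1, ay1), (ax2, ay2) = a
    (bx1, by1), (bx2, by2) = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

# Hypothetical objects described by their vertices.
lake = [(2, 1), (4, 3), (3, 5), (1, 4)]
park = [(3, 4), (6, 4), (6, 7), (3, 7)]
road = [(7, 0), (9, 0), (9, 1)]

lake_mbr = minimum_bounding_rectangle(lake)
park_mbr = minimum_bounding_rectangle(park)
road_mbr = minimum_bounding_rectangle(road)
print(lake_mbr, mbrs_intersect(lake_mbr, park_mbr), mbrs_intersect(lake_mbr, road_mbr))
```

If two MBRs do not intersect, the objects they enclose cannot intersect either, which is why MBRs are useful as a cheap filter.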



„ Indexing MBRs in a tree
ƒ An R-tree of order m has at most m entries in one node
ƒ An example (order of 3)
[Figure: example R-tree of order 3 over the rectangles R1 to R8]


Common Tasks dealing with Spatial Data
„ Data focusing
ƒ Spatial queries
ƒ Identifying interesting parts in spatial data
ƒ Progressive refinement can be applied in a tree structure

„ Feature extraction
ƒ Extracting important/relevant features for an application

„ Classification or others
ƒ Using training data to create classifiers
ƒ Many mining algorithms can be used
‚ Classification, clustering, associations

Spatial Mining Tasks
„ Spatial classification
„ Spatial clustering
„ Spatial association rules

„ Spatial data can contain both spatial and non-spatial features.
„ When spatial information becomes the dominant interest, spatial data mining should be applied.
„ Spatial data structures can facilitate spatial mining.
„ Standard data mining algorithms can be modified for spatial data mining, with a substantial amount of preprocessing to take spatial information into account.

The Stream Model
„ Data enters at a rapid rate from one or more input ports.
„ The system cannot store the entire stream.
„ How do you make critical calculations about the stream using a limited amount of (secondary) memory?


[Figure: streams entering a processor over time, e.g. ". . . 1, 5, 2, 7, 0, 9, 3", ". . . a, r, v, t, y, h, b", ". . . 0, 0, 1, 0, 1, 1, 0"; the processor has limited storage]
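
One simple way to compute over a stream without storing it is to keep a constant-size running summary that is updated once per element. The sketch below is an illustration under that assumption; the class name `StreamSummary` is hypothetical, and the sample values are the first stream shown in the figure.

```python
class StreamSummary:
    """Constant-memory summary of a numeric stream: count, mean, minimum, maximum."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.minimum = None
        self.maximum = None

    def update(self, value):
        self.count += 1
        # Incremental mean: no need to keep earlier elements.
        self.mean += (value - self.mean) / self.count
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)

summary = StreamSummary()
for value in [1, 5, 2, 7, 0, 9, 3]:  # each element is seen once, then discarded
    summary.update(value)
print(summary.count, summary.mean, summary.minimum, summary.maximum)
```

Only four numbers are kept no matter how long the stream runs, which is the kind of trade-off the stream model forces.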



Applications --- (1)
„ In general, stream processing is important for applications where
ƒ New data arrives frequently.
ƒ Important queries tend to ask about the most recent data, or summaries of data.


Applications --- (2)
„ Mining query streams.
ƒ Google wants to know what queries are more frequent today than yesterday.

„ Mining click streams.
ƒ Yahoo wants to know which of its pages are getting an unusual number of hits in the past hour.
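
A question such as "hits in the past hour" can be sketched with a sliding window over the click stream. This is a minimal illustration, not how any particular site does it: the class name `SlidingWindowHits`, the timestamps, and the page names are hypothetical, and memory grows with the number of events inside the window rather than with the whole stream.

```python
from collections import Counter, deque

class SlidingWindowHits:
    """Per-page hit counts over the most recent `window_seconds` of a click stream."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self.events = deque()   # (timestamp, page) pairs in arrival order
        self.hits = Counter()   # page -> hits inside the current window

    def record(self, timestamp, page):
        self.events.append((timestamp, page))
        self.hits[page] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window and undo their counts.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_page = self.events.popleft()
            self.hits[old_page] -= 1
            if self.hits[old_page] == 0:
                del self.hits[old_page]

window = SlidingWindowHits(window_seconds=3600)
for timestamp, page in [(0, "/home"), (10, "/news"), (3500, "/home"), (3700, "/news")]:
    window.record(timestamp, page)
print(dict(window.hits))
```

After the last click at time 3700, the two clicks from times 0 and 10 have expired, so each page shows one hit in the window.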


Applications --- (3)
„ Sensors of all kinds need monitoring, especially when there are many sensors of the same type, feeding into a central controller, most of which are not sensing anything important at the moment.
„ Telephone call records summarized into customer bills.

Applications --- (4)
„ Intelligence-gathering.
‚ Who calls whom?
‚ Who accesses which Web pages?
‚ Who buys what where?


Characteristics of Data Streams
„ Data Streams
ƒ Data streams: continuous, ordered, changing, fast, huge amount
ƒ Traditional DBMS: data stored in finite, persistent data sets

„ Characteristics
ƒ Huge volumes of continuous data, possibly infinite
ƒ Fast changing and requires fast, real-time response
ƒ Data streams capture nicely our data processing needs of today
ƒ Random access is expensive, so algorithms make a single linear scan (can only have one look)
ƒ Store only the summary of the data seen thus far
ƒ Most stream data are at a fairly low level or multi-dimensional in nature, and need multi-level and multi-dimensional processing

Stream Data Applications
„ Telecommunication calling records
„ Business: credit card transaction flows
„ Network monitoring and traffic engineering
„ Financial market: stock exchange
„ Engineering & industrial processes: power supply & manufacturing
„ Sensor, monitoring & surveillance: video streams
„ Security monitoring
„ Web logs and Web page click streams
„ Massive data sets (even saved but random access is too expensive)

Challenges of Stream Data Processing
„ Multiple, continuous, rapid, time-varying, ordered streams
„ Main memory computations
„ Queries are often continuous
ƒ Evaluated continuously as stream data arrives
„ Queries are often complex
ƒ Beyond element-at-a-time processing
ƒ Beyond stream-at-a-time processing
ƒ Beyond relational queries (scientific, data mining, OLAP)
„ Multi-level/multi-dimensional processing and data mining
ƒ Most stream data are at a fairly low level or multi-dimensional in nature

Multi-Dimensional Stream Analysis: Examples
„ Analysis of Web click streams
ƒ Raw data at low levels: seconds, web page addresses, user IP addresses, …
ƒ Analysts want: changes, trends, unusual patterns, at reasonable levels of detail
ƒ E.g., “Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours.”

„ Analysis of power consumption streams
ƒ Raw data: power consumption flow for every household, every minute
ƒ Patterns one may find: average hourly power consumption for manufacturing companies in Chicago in the last 2 hours today is 30% higher than on the same day a week ago
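
Statements like the click-stream example compare the average over a short recent window with the average over a longer reference window. The sketch below shows only that arithmetic; the function name `percent_change` and the aggregated counts are hypothetical, and aggregating the raw stream to these levels is assumed to have happened already.

```python
def percent_change(recent, reference):
    """Percentage by which the average of `recent` exceeds the average of `reference`."""
    recent_avg = sum(recent) / len(recent)
    reference_avg = sum(reference) / len(reference)
    return 100.0 * (recent_avg - reference_avg) / reference_avg

# Hypothetical clicks per minute, already aggregated to one region and one topic.
last_24_hours = [90, 100, 110]    # samples from the long reference window
last_15_minutes = [130, 140, 150]  # samples from the short recent window
change = percent_change(last_15_minutes, last_24_hours)
print(change)
```

With these numbers the recent average is 140 against a reference average of 100, a 40% increase.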

Data Warehouse
A data warehouse stores data that have been extracted from the various operational, external, and other databases of an organization. It is a central source of data that have been cleaned, transformed, and cataloged so that they can be used by managers and other business professionals. The acquisition process includes consolidating data from several sources, filtering out unwanted data, correcting incorrect data, converting data to new data elements, and aggregating data into new data subsets.
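
The acquisition steps named above (consolidate, filter, correct, convert, aggregate) can be pictured with a small sketch. The record layout, the source names `operational` and `external`, and the cleaning rules are hypothetical and chosen only to make each step visible.

```python
from collections import defaultdict

# Hypothetical extracts from two source systems.
operational = [
    {"region": "east", "amount": "120.50"},
    {"region": "East ", "amount": "80.00"},
    {"region": "west", "amount": "-1"},   # unwanted record: negative amount
]
external = [{"region": "WEST", "amount": "200"}]

def build_warehouse_totals(*sources):
    """Return total amount per region after cleaning records from all sources."""
    totals = defaultdict(float)
    for source in sources:                              # consolidate several sources
        for record in source:
            amount = float(record["amount"])            # convert text to a number
            if amount < 0:                              # filter out unwanted data
                continue
            region = record["region"].strip().lower()   # correct inconsistent spellings
            totals[region] += amount                    # aggregate into a new subset
    return dict(totals)

totals = build_warehouse_totals(operational, external)
print(totals)
```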

Where DW is used
„ Data mining: data in a warehouse are analysed to reveal hidden patterns and trends in historical business activity
„ OLAP
„ Business analysis
„ Market research
„ Decision support

Components of DW
„ Analytical data store: holds data in a more useful form for certain analyses
„ Metadata: data that defines the data in the data warehouse

„ Data warehouses may be subdivided into data marts
„ Data marts hold subsets of the warehouse's data that focus on specific aspects of a company, such as a department or a business process

Data Warehouse Architecture

Data Warehouse Options