Chapter 1. Introduction

- Motivation: Why data mining?
- What is data mining?
- Data mining: On what kind of data?
- Data mining functionality
- Are all the patterns interesting?
- Classification of data mining systems
- Major issues in data mining

February 2, 2012

Data Mining: Concepts and Techniques


Necessity Is the Mother of Invention 

- Data explosion problem
  - Automated data collection tools and mature database technology lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, and other information repositories
- We are drowning in data, but starving for knowledge!
- Solution: Data warehousing and data mining
  - Data warehousing and on-line analytical processing
  - Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases


Evolution of Database Technology

- 1960s:
  - Data collection, database creation, IMS and network DBMS
- 1970s:
  - Relational data model, relational DBMS implementation
- 1980s:
  - RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  - Application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s:
  - Data mining, data warehousing, multimedia databases, and Web databases
- 2000s:
  - Stream data management and mining
  - Data mining with a variety of applications
  - Web technology and global information systems

What Is Data Mining?

- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  - Data mining: a misnomer?
- Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: Is everything "data mining"?
  - (Deductive) query processing
  - Expert systems or small ML/statistical programs

Why Data Mining? Potential Applications

- Data analysis and decision support
  - Market analysis and management
    - Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  - Risk analysis and management
    - Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and detection of unusual patterns (outliers)
- Other Applications
  - Text mining (news group, email, documents) and Web mining
  - Stream data mining
  - DNA and bio-data analysis

Market Analysis and Management

- Where does the data come from?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
  - Determine customer purchasing patterns over time
- Cross-market analysis
  - Associations/co-relations between product sales, & prediction based on such association
- Customer profiling
  - What types of customers buy what products (clustering or classification)
- Customer requirement analysis
  - Identifying the best products for different customers
  - Predict what factors will attract new customers
- Provision of summary information
  - Multidimensional summary reports
  - Statistical summary information (data central tendency and variation)

Corporate Analysis & Risk Management

- Finance planning and asset evaluation
  - Cash flow analysis and prediction
  - Contingent claim analysis to evaluate assets
  - Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
- Resource planning
  - Summarize and compare the resources and spending
- Competition
  - Monitor competitors and market directions
  - Group customers into classes and a class-based pricing procedure
  - Set pricing strategy in a highly competitive market

Fraud Detection & Mining Unusual Patterns

- Approaches: Clustering & model construction for frauds, outlier analysis
- Applications: Health care, retail, credit card service, telecomm.
  - Auto insurance: ring of collisions
  - Money laundering: suspicious monetary transactions
  - Medical insurance
    - Professional patients, ring of doctors, and ring of references
    - Unnecessary or correlated screening tests
  - Telecommunications: phone-call fraud
    - Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
  - Retail industry
    - Analysts estimate that 38% of retail shrink is due to dishonest employees
  - Anti-terrorism

Other Applications

- Sports
  - IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
- Astronomy
  - JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
- Internet Web Surf-Aid
  - IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

Data Mining: A KDD Process

- Data mining: core of knowledge discovery process

[Figure: KDD process flow. Databases -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Steps of a KDD Process

- Learning the application domain
  - Relevant prior knowledge and goals of application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of effort!)
- Data reduction and transformation
  - Find useful features, dimensionality/variable reduction, invariant representation
- Choosing functions of data mining
  - Summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
  - Visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge
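
The steps above can be read as a pipeline in which each stage consumes the output of the previous one. The following is a minimal Python sketch under stated assumptions: the record fields (`income`, `segment`), the helper names, and the toy mining step (a per-segment mean) are all hypothetical and only stand in for whatever the application domain actually requires.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"age": 25, "income": 30_000, "segment": "A"},
    {"age": 41, "income": None, "segment": "B"},
    {"age": 37, "income": 52_000, "segment": "B"},
    {"age": 29, "income": 34_000, "segment": "A"},
]

def select(records, fields):
    """Creating a target data set: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records that contain a missing value."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records):
    """Data transformation: rescale income to thousands."""
    return [{**r, "income": r["income"] / 1000} for r in records]

def mine(records):
    """Toy mining step (summarization): mean income per segment."""
    groups = {}
    for r in records:
        groups.setdefault(r["segment"], []).append(r["income"])
    return {seg: mean(vals) for seg, vals in groups.items()}

def evaluate(patterns, min_value):
    """Pattern evaluation: keep only summaries at or above a threshold."""
    return {k: v for k, v in patterns.items() if v >= min_value}

def kdd(records):
    """Chain the stages: select -> clean -> transform -> mine -> evaluate."""
    target = select(records, ["income", "segment"])
    return evaluate(mine(transform(clean(target))), min_value=40)
```

Calling `kdd(RAW)` runs all stages in order. In practice each stage is far more involved, and the process is usually iterative rather than a single pass.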

Data Mining and Business Intelligence

[Figure: pyramid of layers; the potential to support business decisions increases from bottom to top. Layers, top to bottom:]

- Making Decisions
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Analysis, Querying and Reporting
- Data Warehouses / Data Marts: OLAP, MDA
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP

Roles shown alongside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA

Architecture: Typical Data Mining System

[Figure: layered architecture, top to bottom:]

- Graphical user interface
- Pattern evaluation
- Data mining engine
- Database or data warehouse server
- Data cleaning & data integration, Filtering
- Databases, Data Warehouse

A Knowledge-base is shown alongside the layers.

Data Mining: On What Kinds of Data?

- Relational database
- Data warehouse
- Transactional database
- Advanced database and information repository
  - Object-relational database
  - Spatial and temporal data
  - Time-series data
  - Stream data
  - Multimedia database
  - Heterogeneous and legacy database
  - Text databases & WWW

Data Mining Functionalities

- Concept description: Characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Association (correlation and causality)
  - Diaper → Beer [0.5%, 75%] (support, confidence)
- Classification and Prediction
  - Construct models (functions) that describe and distinguish classes or concepts for future prediction
    - E.g., classify countries based on climate, or classify cars based on gas mileage
  - Presentation: decision-tree, classification rule, neural network
  - Predict some unknown or missing numerical values
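
The bracketed pair on the association rule is conventionally read as support (the fraction of all transactions that contain both items) and confidence (the fraction of transactions containing the antecedent that also contain the consequent). The sketch below computes both in Python; the function names and the five toy baskets are invented for illustration and are not meant to reproduce the figures on the slide.

```python
def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))

def support(transactions, itemset):
    """Fraction of all transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of antecedent transactions that also contain the consequent."""
    n_ante = count(transactions, antecedent)
    n_both = count(transactions, set(antecedent) | set(consequent))
    return n_both / n_ante if n_ante else 0.0

# Hypothetical market baskets.
BASKETS = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]

rule_support = support(BASKETS, {"diaper", "beer"})          # 3 of 5 baskets
rule_confidence = confidence(BASKETS, {"diaper"}, {"beer"})  # 3 of 4 diaper baskets
```

Note that confidence is directional: the rule Beer → Diaper would be scored against the baskets that contain beer, which generally gives a different value.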

Data Mining Functionalities (2)

- Cluster analysis
  - Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
  - Maximizing intra-class similarity & minimizing interclass similarity
- Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - Noise or exception? No! Useful in fraud detection, rare events analysis
- Trend and evolution analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining, periodicity analysis
  - Similarity-based analysis
- Other pattern-directed or statistical analyses
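
One simple way to make "does not comply with the general behavior of the data" concrete is a z-score test: flag values that lie more than a chosen number of standard deviations from the mean. This is only one of many outlier-detection techniques and is not prescribed by the slides; the function name, the threshold of 2.0, and the sample call durations below are assumptions made for the sketch.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.0):
    """Return the values whose distance from the mean exceeds
    `threshold` population standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical phone-call durations in minutes; one call is unusually long.
DURATIONS = [3, 4, 5, 4, 3, 5, 4, 60]

suspicious = zscore_outliers(DURATIONS)
```

Because the mean and standard deviation are themselves inflated by extreme values, small samples can mask outliers, so a threshold that works on one data set may need tuning on another.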

Are All the "Discovered" Patterns Interesting?

- Data mining may generate thousands of patterns: Not all of them are interesting
  - Suggested approach: Human-centered, query-based, focused mining
- Interestingness measures
  - A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Can We Find All and Only Interesting Patterns?

- Find all the interesting patterns: Completeness
  - Can a data mining system find all the interesting patterns?
  - Heuristic vs. exhaustive search
  - Association vs. classification vs. clustering
- Search for only interesting patterns: An optimization problem
  - Can a data mining system find only the interesting patterns?
  - Approaches
    - First generate all the patterns and then filter out the uninteresting ones
    - Generate only the interesting patterns: mining query optimization
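
The two approaches can be contrasted on frequent-itemset mining, taking "interesting" to mean "appears in at least `min_count` transactions". The Python sketch below is an illustration under that assumption, not the slides' own method: the first function enumerates every itemset and filters afterwards, while the second uses Apriori-style level-wise pruning (a technique swapped in here as an example of generating only promising candidates). All names and the toy baskets are hypothetical.

```python
from itertools import combinations

def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def generate_then_filter(transactions, min_count):
    """Approach 1: enumerate every possible itemset, then filter."""
    items = sorted(set().union(*transactions))
    frequent = set()
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            if count(transactions, s) >= min_count:
                frequent.add(s)
    return frequent

def levelwise(transactions, min_count):
    """Approach 2: build size-k candidates only from frequent
    size-(k-1) itemsets, since a superset of an infrequent
    itemset cannot be frequent."""
    items = sorted(set().union(*transactions))
    level = {frozenset([i]) for i in items
             if count(transactions, frozenset([i])) >= min_count}
    frequent = set(level)
    while level:
        k = len(next(iter(level))) + 1
        # Join step: unions of two frequent itemsets that have size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        level = {c for c in candidates if count(transactions, c) >= min_count}
        frequent |= level
    return frequent

# Hypothetical market baskets.
BASKETS = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"milk", "bread"},
    {"diaper", "beer", "bread"},
]
```

Both functions return the same set of itemsets; the difference is how many candidates each one has to count. The exhaustive version examines every subset of the item universe, which grows exponentially with the number of items.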

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Database Systems, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]

Data Mining: Classification Schemes

- General functionality
  - Descriptive data mining
  - Predictive data mining
- Different views, different classifications
  - Kinds of data to be mined
  - Kinds of knowledge to be discovered
  - Kinds of techniques utilized
  - Kinds of applications adapted

Multi-Dimensional View of Data Mining

- Data to be mined
  - Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
- Knowledge to be mined
  - Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple levels
- Techniques utilized
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
- Applications adapted
  - Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.

OLAP Mining: Integration of Data Mining and Data Warehousing

- Data mining systems, DBMS, Data warehouse systems coupling
  - No coupling, loose-coupling, semi-tight-coupling, tight-coupling
- On-line analytical mining data
  - Integration of mining and OLAP technologies
- Interactive mining multi-level knowledge
  - Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
- Integration of multiple mining functions
  - Characterized classification, first clustering and then association

An OLAM Architecture

[Figure: four-layer OLAM architecture, top to bottom:]

- Layer 4, User Interface: User GUI API; mining query in, mining result out
- Layer 3, OLAP/OLAM: OLAM Engine and OLAP Engine, over the Data Cube API
- Layer 2, MDDB: MDDB and Meta Data, with Filtering & Integration over the Database API
- Layer 1, Data Repository: Databases and Data Warehouse, with data cleaning, data integration, and filtering

Major Issues in Data Mining

- Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed and incremental mining methods
  - Integration of the discovered knowledge with existing one: knowledge fusion
- User interaction
  - Data mining query languages and ad-hoc mining
  - Expression and visualization of data mining results
  - Interactive mining of knowledge at multiple levels of abstraction
- Applications and social impacts
  - Domain-specific data mining & invisible data mining
  - Protection of data security, integrity, and privacy

Summary

- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Data mining systems and architectures
- Major issues in data mining

A Brief History of Data Mining Society

- 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
  - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
- 1991-1994 Workshops on Knowledge Discovery in Databases
  - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
- 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  - Journal of Data Mining and Knowledge Discovery (1997)
- 1998 ACM SIGKDD, SIGKDD'1999-2001 conferences, and SIGKDD Explorations
- More conferences on data mining
  - PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
