Chapter 1.

Introduction
      

- Why Data Mining?
- What Is Data Mining?
- A Multi-Dimensional View of Data Mining
- What Kinds of Data Can Be Mined?
- What Kinds of Patterns Can Be Mined?
- What Technologies Are Used?
- Major Issues in Data Mining

Google Flu Trends: US Flu Activity

Blue: Google Flu Trends estimate; Orange: US data

http://www.google.org/flutrends/

March 30, 2014

Data Mining: Concepts and Techniques


"Influenza-like Illness" (ILI) percentages estimated by model (black) and provided by the CDC (red) in the mid-Atlantic region, showing data available at four points in the 2007-2008 influenza season.

J Ginsberg et al. Nature 000, 1-3 (2008) doi:10.1038/nature07634

Why Data Mining?

- The Explosive Growth of Data: from terabytes to petabytes
  - Data collection and data availability
    - Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    - Business: Web, e-commerce, transactions, stocks, …
    - Science: Remote sensing, bioinformatics, scientific simulation, …
    - Society and everyone: news, digital cameras, YouTube
- We are drowning in data, but starving for knowledge!
- "Necessity is the mother of invention": data mining, the automated analysis of massive data sets

Evolution of Database Technology

- 1960s:
  - Data collection, database creation, IMS and network DBMS
- 1970s:
  - Relational data model, relational DBMS implementation
- 1980s:
  - RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  - Application-oriented DBMS (spatial, scientific, engineering, etc.)
- 1990s:
  - Data mining, data warehousing, multimedia databases, and Web databases
- 2000s:
  - Stream data management and mining
  - Data mining and its applications
  - Web technology (XML, data integration) and global information systems

What Is Data Mining?

- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  - Data mining: a misnomer?
- Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: Is everything "data mining"?
  - Simple search and query processing
  - (Deductive) expert systems

Knowledge Discovery (KDD) Process

- This is a view from typical database systems and data warehousing communities
- Data mining plays an essential role in the knowledge discovery process

[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

Data Mining in Business Intelligence

Layers of the pyramid, from top to bottom (potential to support business decisions increases toward the top):

- Decision Making
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Summary, Querying, and Reporting
- Data Preprocessing/Integration, Data Warehouses
- Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA

KDD Process: A Typical View from ML and Statistics

[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]

- Data pre-processing: Data integration, Normalization, Feature selection, Dimension reduction
- Data mining: Pattern discovery, Association & correlation, Classification, Clustering, Outlier analysis, …
- Post-processing: Pattern evaluation, Pattern selection, Pattern interpretation, Pattern visualization
- This is a view from typical machine learning and statistics communities

Example: Medical Data Mining

- Health care & medical data mining often adopts such a view from statistics and machine learning
  - Preprocessing of the data (including feature extraction and dimension reduction)
  - Classification and/or clustering processes
  - Post-processing for presentation

Multi-Dimensional View of Data Mining

- Data to be mined
  - Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and web, multi-media, graphs & social and information networks
- Knowledge to be mined (or: Data mining functions)
  - Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  - Descriptive vs. predictive data mining
  - Multiple/integrated functions and mining at multiple levels
- Techniques utilized
  - Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
- Applications adapted
  - Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Data Mining: On What Kinds of Data?

- Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
- Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structure data, graphs, social networks and multi-linked data
  - Object-relational databases
  - Heterogeneous databases and legacy databases
  - Spatial data and spatiotemporal data
  - Multimedia database
  - Text databases
  - The World-Wide Web

Data Mining Function: (1) Generalization

- Information integration and data warehouse construction
  - Data cleaning, transformation, integration, and multidimensional data model
- Data cube technology
  - Scalable methods for computing (i.e., materializing) multidimensional aggregates
  - OLAP (online analytical processing)
- Multidimensional concept description: Characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
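As a rough illustration of what "materializing multidimensional aggregates" means, the sketch below sums a measure for every combination of dimensions in a toy fact table. The rows, the dimension names (`region`, `season`), and the `rainfall` measure are made up for this example, and the brute-force loop is only a teaching aid, not one of the scalable methods the slide refers to.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: two dimensions (region, season), one measure (rainfall).
ROWS = [
    {"region": "north", "season": "summer", "rainfall": 10},
    {"region": "north", "season": "winter", "rainfall": 30},
    {"region": "south", "season": "summer", "rainfall": 5},
    {"region": "south", "season": "winter", "rainfall": 15},
]

def data_cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (every cuboid of the cube)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(dimensions, k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                totals[key] += row[measure]
            cube[dims] = dict(totals)
    return cube

cube = data_cube(ROWS, ("region", "season"), "rainfall")
print(cube[()])            # grand total over all rows
print(cube[("region",)])   # totals rolled up to region
```

Each key of `cube` names one group-by; an OLAP roll-up from (region, season) to (region) corresponds to reading `cube[("region",)]` instead of `cube[("region", "season")]`.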

Data Mining Function: (2) Association and Correlation Analysis

- Frequent patterns (or frequent itemsets)
  - What items are frequently purchased together in your Walmart?
- Association, correlation vs. causality
  - A typical association rule
    - Diaper → Beer [0.5%, 75%] (support, confidence)
  - Are strongly associated items also strongly correlated?
- How to mine such patterns and rules efficiently in large datasets?
- How to use such patterns for classification, clustering, and other applications?
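The support and confidence figures attached to a rule such as Diaper → Beer can be computed directly from a list of transactions. The sketch below does so on a made-up set of five baskets. It also computes lift, which is one common correlation measure brought in here as an assumption, since the slide asks about correlation without naming a measure.

```python
# Hypothetical market-basket transactions (made-up data).
TRANSACTIONS = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent): support(A and C) / support(A)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by support(consequent); values above 1 suggest positive correlation."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

rule_support = support({"diaper", "beer"}, TRANSACTIONS)
rule_confidence = confidence({"diaper"}, {"beer"}, TRANSACTIONS)
rule_lift = lift({"diaper"}, {"beer"}, TRANSACTIONS)
print(rule_support, rule_confidence, rule_lift)
```

A rule can have high confidence and still have lift near 1 when the consequent is popular on its own, which is why association alone does not imply correlation.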

Data Mining Function: (3) Classification

- Classification and label prediction
  - Construct models (functions) based on some training examples
  - Describe and distinguish classes or concepts for future prediction
    - E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  - Predict some unknown class labels
- Typical methods
  - Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
- Typical applications:
  - Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …
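To make "construct a model from training examples, then predict unknown labels" concrete, here is a minimal sketch of a one-level decision tree (a decision stump) that classifies cars by gas mileage. The training data, the labels `"low"` and `"high"`, and the function names are hypothetical. Real decision-tree learners choose splits with measures such as information gain rather than raw error counts.

```python
# Hypothetical training examples: (miles per gallon, class label).
TRAINING = [
    (15, "low"), (18, "low"), (22, "low"),
    (30, "high"), (35, "high"), (40, "high"),
]

def predict(threshold, mpg):
    """Predict "high" when mpg exceeds the learned threshold, else "low"."""
    return "high" if mpg > threshold else "low"

def train_stump(examples):
    """Return the split threshold that misclassifies the fewest training examples."""
    values = sorted(v for v, _ in examples)
    # Candidate thresholds: midpoints between consecutive sorted values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(predict(threshold, v) != label for v, label in examples)

    return min(candidates, key=errors)

threshold = train_stump(TRAINING)
print(threshold)
print(predict(threshold, 28), predict(threshold, 20))
```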

Data Mining Function: (4) Cluster Analysis

- Unsupervised learning (i.e., Class label is unknown)
- Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
- Principle: Maximizing intra-class similarity & minimizing interclass similarity
- Many methods and applications
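One way to see the clustering principle in action is k-means, used here only as a well-known example method; the slide does not prescribe one. The sketch below runs Lloyd's iteration on made-up one-dimensional house prices, with hypothetical starting centers passed in so the result is deterministic.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D points, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        # Assignment step: put each point in the cluster of its nearest center.
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to its cluster mean (unchanged if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical house prices (in thousands) and starting centers.
PRICES = [100, 110, 120, 300, 310, 320]
centers, clusters = kmeans_1d(PRICES, centers=[100, 320])
print(centers)
print(clusters)
```

Points within a cluster end up close to their own center (high intra-class similarity) and far from the other center (low inter-class similarity).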

Data Mining Function: (5) Outlier Analysis

- Outlier analysis
  - Outlier: A data object that does not comply with the general behavior of the data
  - Noise or exception? One person's garbage could be another person's treasure
  - Methods: by-product of clustering or regression analysis, …
  - Useful in fraud detection, rare events analysis
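As a sketch of outlier detection as a by-product of regression analysis, the code below fits a least-squares line and flags points whose residual is unusually large. The data and the cutoff of two residual standard deviations (`k=2.0`) are assumptions made for illustration. Note that the outlier itself pulls the fitted line toward it, so robust fitting is often preferred in practice.

```python
from math import sqrt

def regression_outliers(xs, ys, k=2.0):
    """Fit y = a + b*x by least squares; return points whose |residual| exceeds k residual std devs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = sqrt(sum(r * r for r in residuals) / n)
    return [(x, y) for x, y, r in zip(xs, ys, residuals) if abs(r) > k * spread]

# Hypothetical data: y is roughly 2*x, except for one suspicious value at x = 5.
XS = [0, 1, 2, 3, 4, 5, 6, 7]
YS = [0, 2, 4, 6, 8, 40, 12, 14]
outliers = regression_outliers(XS, YS)
print(outliers)
```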

Time and Ordering: Sequential Pattern, Trend and Evolution Analysis

- Sequence, trend and evolution analysis
  - Trend, time-series, and deviation analysis: e.g., regression and value prediction
  - Sequential pattern mining
    - e.g., first buy digital camera, then buy large SD memory cards
  - Periodicity analysis
  - Motifs and biological sequence analysis
    - Approximate and consecutive motifs
  - Similarity-based analysis
- Mining data streams
  - Ordered, time-varying, potentially infinite, data streams
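The camera-then-memory-card example can be checked with a small support count over customer purchase sequences. The sketch below tests whether a pattern occurs in order (gaps allowed) and reports the fraction of sequences that contain it. The purchase histories and item names are made up, and this is only the counting step, not a full sequential pattern mining algorithm.

```python
def contains_in_order(sequence, pattern):
    """True if the items of `pattern` appear in `sequence` in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` advances the iterator, so later items must occur after earlier ones.
    return all(item in remaining for item in pattern)

def sequence_support(sequences, pattern):
    """Fraction of sequences that contain `pattern` as an ordered subsequence."""
    return sum(contains_in_order(s, pattern) for s in sequences) / len(sequences)

# Hypothetical purchase histories, one list per customer, in time order.
HISTORIES = [
    ["camera", "sd_card", "tripod"],
    ["camera", "bag", "sd_card"],
    ["sd_card", "camera"],
    ["phone", "case"],
]

camera_then_card = sequence_support(HISTORIES, ["camera", "sd_card"])
card_then_camera = sequence_support(HISTORIES, ["sd_card", "camera"])
print(camera_then_card, card_then_camera)
```

Unlike the itemset support used for association rules, order matters here: the third customer bought both items but does not support the camera-then-card pattern.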

Evaluation of Knowledge

- Is all mined knowledge interesting?
  - One can mine a tremendous amount of "patterns" and knowledge
  - Some may fit only certain dimension space (time, location, …)
  - Some may not be representative, may be transient, …
- Evaluation of mined knowledge → directly mine only interesting knowledge?
  - Descriptive vs. predictive
  - Coverage
  - Typicality vs. novelty
  - Accuracy
  - Timeliness
  - …

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, surrounded by Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, and High-Performance Computing]

Why Confluence of Multiple Disciplines?

- Tremendous amount of data
  - Algorithms must be highly scalable to handle potentially tera-bytes of data
- High-dimensionality of data
  - Micro-array may have tens of thousands of dimensions
- High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structure data, graphs, social networks and multi-linked data
  - Heterogeneous databases and legacy databases
  - Spatial, spatiotemporal, multimedia, text and Web data
  - Software programs, scientific simulations
- New and sophisticated applications

Major Issues in Data Mining (1)

- Mining Methodology
  - Mining various and new kinds of knowledge
  - Mining knowledge in multi-dimensional space
  - Data mining: An interdisciplinary effort
  - Boosting the power of discovery in a networked environment
  - Handling noise, uncertainty, and incompleteness of data
  - Pattern evaluation and pattern- or constraint-guided mining
- User Interaction
  - Interactive mining
  - Incorporation of background knowledge
  - Presentation and visualization of data mining results

Major Issues in Data Mining (2)

- Efficiency and Scalability
  - Efficiency and scalability of data mining algorithms
  - Parallel, distributed, stream, and incremental mining methods
- Diversity of data types
  - Handling complex types of data
  - Mining dynamic, networked, and global data repositories
- Data mining and society
  - Social impacts of data mining
  - Privacy-preserving data mining
  - Invisible data mining