Professional Documents
Culture Documents
Data Mining:: Concepts and Techniques
Data Mining:: Concepts and Techniques
1 4
2 5
The Explosive Growth of Data: from terabytes to petabytes Data mining (knowledge discovery from data)
Data collection and data availability Extraction of interesting (non-trivial, implicit, previously
Automated data collection tools, database systems, Web, unknown and potentially useful) patterns or knowledge from
computerized society huge amount of data
Major sources of abundant data Data mining: a misnomer?
3 6
1
Knowledge Discovery (KDD) Process Example: Mining vs. Data Exploration
This is a view from typical
database systems and data
Pattern Evaluation Business intelligence view
warehousing communities
Data mining plays an essential
Warehouse, data cube, reporting but not much mining
role in the knowledge discovery
Data Mining Business objects vs. data mining tools
process
Supply chain example: tools
Task-relevant Data
Data presentation
Data Warehouse Selection Exploration
Data Cleaning
Data Integration
Databases
7 10
8 11
Visualization Techniques
Analyst Preprocessing of the data (including feature
Data Mining Data extraction and dimension reduction)
Information Discovery Analyst
Classification or/and clustering processes
Data Exploration
Statistical Summary, Querying, and Reporting Post-processing for presentation
Data Preprocessing/Integration, Data Warehouses
DBA
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems
9 12
2
Data Mining Function: (2) Association and
Multi-Dimensional View of Data Mining
Correlation Analysis
Data to be mined Frequent patterns (or frequent itemsets)
Database data (extended-relational, object-oriented, heterogeneous,
legacy), data warehouse, transactional data, stream, spatiotemporal, What items are frequently purchased together in your
time-series, sequence, text and web, multi-media, graphs & social Walmart?
and information networks
Association, correlation vs. causality
Knowledge to be mined (or: Data mining functions)
Characterization, discrimination, association, classification, A typical association rule
clustering, trend/deviation, outlier analysis, etc. Diaper Beer [0.5%, 75%] (support, confidence)
Descriptive vs. predictive data mining
Data Mining: On What Kinds of Data? Data Mining Function: (3) Classification
Heterogeneous databases and legacy databases Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
Spatial data and spatiotemporal data
based classification, logistic regression, …
Multimedia database
Typical applications:
Text databases
Credit card fraud detection, direct marketing, classifying stars,
The World-Wide Web diseases, web-pages, …
14 17
Data Mining Function: (1) Generalization Data Mining Function: (4) Cluster Analysis
Information integration and data warehouse construction Unsupervised learning (i.e., Class label is unknown)
Data cleaning, transformation, integration, and Group data to form new categories (i.e., clusters), e.g.,
multidimensional data model cluster houses to find distribution patterns
Data cube technology Principle: Maximizing intra-class similarity & minimizing
Scalable methods for computing (i.e., materializing) interclass similarity
multidimensional aggregates Many methods and applications
OLAP (online analytical processing)
Multidimensional concept description: Characterization
and discrimination
Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet region
15 18
3
Data Mining Function: (5) Outlier Analysis Evaluation of Knowledge
Outlier analysis Are all mined knowledge interesting?
Outlier: A data object that does not comply with the general
One can mine tremendous amount of “patterns” and knowledge
behavior of the data
Some may fit only certain dimension space (time, location, …)
Noise or exception? ― One person’s garbage could be another
person’s treasure Some may not be representative, may be transient, …
Methods: by product of clustering or regression analysis, … Evaluation of mined knowledge → directly mine only
Useful in fraud detection, rare events analysis interesting knowledge?
Descriptive vs. predictive
Coverage
Typicality vs. novelty
Accuracy
Timeliness
…
19 22
20 23
4
Applications of Data Mining A Brief History of Data Mining Society
Web page analysis: from web page classification, clustering to 1989 IJCAI Workshop on Knowledge Discovery in Databases
PageRank & HITS algorithms Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,
1991)
Collaborative analysis & recommender systems
1991-1994 Workshops on Knowledge Discovery in Databases
Basket data analysis to targeted marketing
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Biological and medical data analysis: classification, cluster analysis Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
(microarray data analysis), biological sequence analysis, biological 1995-1998 International Conferences on Knowledge Discovery in Databases
network analysis and Data Mining (KDD’95-98)
Data mining and software engineering (e.g., IEEE Computer, Aug. Journal of Data Mining and Knowledge Discovery (1997)
2009 issue) ACM SIGKDD conferences since 1998 and SIGKDD Explorations
From major dedicated data mining systems/tools (e.g., SAS, MS SQL- More conferences on data mining
Server Analysis Manager, Oracle Data Mining Tools) to invisible data PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
mining (2001), etc.
ACM Transactions on KDD starting in 2007
25 28
Major Issues in Data Mining (1) Conferences and Journals on Data Mining
26 29
Major Issues in Data Mining (2) Where to Find References? DBLP, CiteSeer, Google
Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
Diversity of data types AI & Machine Learning
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
Handling complex types of data
Visualization
Conference proceedings: CHI, ACM-SIGGraph, etc.
Journals: IEEE Trans. visualization and computer graphics, etc.
27 30
5
Summary
Data mining: Discovering interesting patterns and knowledge from
massive amount of data
A natural evolution of database technology, in great demand, with
wide applications
A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
Mining can be performed in a variety of data
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
Data mining technologies and applications
Major issues in data mining
31
32