
Data Mining:

Concepts and Techniques


(3rd ed.)

Chapter 1
Jiawei Han, Micheline Kamber, and Jian Pei
University of Illinois at Urbana-Champaign &
Simon Fraser University
© 2013 Han, Kamber & Pei. All rights reserved.

Introduction
- Why Data Mining?
- What Is Data Mining?
- A Multi-Dimensional View of Data Mining
- What Kinds of Data Can Be Mined?
- What Kinds of Patterns Can Be Mined?
- What Kinds of Technologies Are Used?
- What Kinds of Applications Are Targeted?
- Major Issues in Data Mining
- A Brief History of Data Mining and Data Mining Society

Why Data Mining?


- The Explosive Growth of Data: from terabytes to petabytes
  - Automated data collection using database systems, sensors, and APIs
  - Data availability on the Web and through APIs
- Major sources of abundant data
  - Business: Web, e-commerce, transactions, stocks
  - Science: remote sensing, bioinformatics, scientific simulation
  - Society and everyone: news, digital cameras, YouTube
- We are drowning in data, but starving for knowledge!
- "Necessity is the mother of invention": data mining, the automated analysis of massive data sets

Data Mining Example

What Is Data Mining?


- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
- Alternative names
  - Knowledge discovery (mining) in databases, knowledge extraction, data/pattern analysis, information harvesting, business intelligence, etc.
- Watch out: Is everything data mining? The following are not:
  - Simple search and query processing
  - (Deductive) expert systems

Knowledge Discovery (KDD) Process


- This is a view from typical database systems and data warehousing communities
- Data mining plays an essential role in the knowledge discovery process

[Figure: the KDD process. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
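
The stages of the KDD process can be sketched as a chain of small functions. This is only an illustrative sketch: the record layout, the two source lists (`store_a`, `store_b`), and the function names are hypothetical, and the "mining" step is deliberately trivial (counting purchase records per item).

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [{"id": 1, "item": "milk", "qty": 2}, {"id": 2, "item": None, "qty": 1}]
store_b = [{"id": 3, "item": "milk", "qty": 1}, {"id": 4, "item": "bread", "qty": 0}]

def clean(records):
    """Data cleaning: drop records that have a missing field."""
    return [r for r in records if all(v is not None for v in r.values())]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select(records):
    """Selection: keep only task-relevant data (here, actual purchases)."""
    return [r for r in records if r["qty"] > 0]

def mine(records):
    """Data mining: a trivial pattern, the number of purchase records per item."""
    return Counter(r["item"] for r in records)

def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep only patterns that occur often enough."""
    return {item: n for item, n in patterns.items() if n >= min_count}

warehouse = integrate(clean(store_a), clean(store_b))
knowledge = evaluate(mine(select(warehouse)))
print(knowledge)
```

Each function corresponds to one box in the process; in practice every stage is far more involved, but the data flow has the same shape.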

Example: A Web Mining Framework


- Web mining usually involves
  - Data cleaning and integration from multiple sources
  - Warehousing the data
  - Data cube construction
  - Data selection for data mining
  - Data mining
  - Presentation of the mining results
  - Patterns and knowledge to be used or stored in a knowledge base

Data Mining in Business Intelligence


[Figure: layered pyramid. The potential to support business decisions increases from the bottom layer to the top layer.]

Layers, from top to bottom:
- Decision Making
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Summary, Querying, and Reporting
- Data Preprocessing/Integration, Data Warehouses
- Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Typical users, from top to bottom: End User, Business Analyst, Data Analyst, DBA

KDD Process: A Typical View from ML and Statistics

[Figure: Input Data -> Data Pre-Processing -> Data Mining -> Post-Processing]

- Data pre-processing: data integration, normalization, feature selection, dimension reduction
- Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis
- Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

This is a view from typical machine learning and statistics communities
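
Normalization is one of the pre-processing steps named in this view. As a minimal sketch, the two common variants (min-max scaling and z-score standardization) can be written with the standard library only. The function names and the `incomes` sample are hypothetical.

```python
from statistics import mean, pstdev

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center values on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

incomes = [20000, 40000, 60000, 100000]  # hypothetical attribute values
print(min_max(incomes))
print(z_score(incomes))
```

Min-max scaling keeps the relative spacing of values inside a fixed range, while z-scores are easier to compare across attributes with different units.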

Which View Do You Prefer?


- Which view do you prefer?
  - KDD vs. ML/Stat. vs. Business Intelligence
  - Depending on the data, applications, and your focus
- Data Mining vs. Data Exploration
  - Business intelligence view
    - Warehouse, data cube, reporting but not much mining
  - Business objects vs. data mining tools
  - Supply chain example: mining vs. OLAP vs. presentation tools
  - Data presentation vs. data exploration

Multi-Dimensional View of Data Mining


- Data to be mined
  - Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and Web, multimedia, graphs & social and information networks
- Knowledge to be mined (or: data mining functions)
  - Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  - Descriptive vs. predictive data mining
  - Multiple/integrated functions and mining at multiple levels
- Techniques utilized
  - Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
- Applications adapted
  - Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

On What Kinds of Data?


- Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
  - Object-relational databases, heterogeneous databases, and legacy databases
- Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structure data, graphs, social networks and information networks
  - Spatial data and spatiotemporal data
  - Multimedia database
  - Text databases
  - The World-Wide Web

Data Mining Function: (1) Generalization


- Information integration and data warehouse construction
  - Data cleaning, transformation, integration, and multidimensional data model
- Data cube technology
  - Scalable methods for computing (i.e., materializing) multidimensional aggregates
  - OLAP (online analytical processing)
- Multidimensional concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
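
To make "materializing multidimensional aggregates" concrete, the sketch below computes the sum of a measure for every subset of the dimensions, which is the full data cube for a tiny table. The fact table, the dimension names, and `build_cube` are hypothetical, and this brute-force version is not one of the scalable methods the slide refers to.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (region, season, rainfall in mm).
facts = [
    ("north", "summer", 10),
    ("north", "winter", 30),
    ("south", "summer", 5),
    ("south", "winter", 15),
]
dimensions = ("region", "season")

def build_cube(rows, dims):
    """Materialize the sum of the measure for every subset of dimensions."""
    cube = {}
    for k in range(len(dims) + 1):
        for group in combinations(range(len(dims)), k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in group)  # group-by key for this cuboid
                totals[key] += row[-1]              # last column is the measure
            cube[tuple(dims[i] for i in group)] = dict(totals)
    return cube

cube = build_cube(facts, dimensions)
print(cube[()])            # grand total (apex cuboid)
print(cube[("region",)])   # rolled up over season
```

Comparing `cube[("region",)]` entries is a small instance of contrasting data characteristics, such as a drier region against a wetter one.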

Data Mining Function: (2) Association and Correlation Analysis

- Frequent patterns (or frequent itemsets)
  - What items are frequently purchased together in your Walmart?
- Association, correlation vs. causality
  - A typical association rule
    - Diaper -> Milk [0.5%, 95%] (support, confidence)
  - Are strongly associated items also strongly correlated?
- How to mine such patterns and rules efficiently in large datasets?
- How to use such patterns for classification, clustering, and other applications?
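
The support and confidence of a rule such as Diaper -> Milk can be computed directly from a transaction list. The sketch below uses a small hypothetical set of transactions and also computes lift, one common way to check whether associated items are correlated. The function names are illustrative, not taken from any library.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"diaper", "milk", "bread"},
    {"diaper", "milk"},
    {"milk", "bread"},
    {"diaper", "milk", "beer"},
    {"bread"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(lhs, rhs, db):
    """Estimate of P(rhs | lhs): support of both sides over support of lhs."""
    return support(lhs | rhs, db) / support(lhs, db)

def lift(lhs, rhs, db):
    """Confidence relative to the baseline frequency of rhs (1.0 means independent)."""
    return confidence(lhs, rhs, db) / support(rhs, db)

lhs, rhs = {"diaper"}, {"milk"}
print(support(lhs | rhs, transactions))
print(confidence(lhs, rhs, transactions))
print(lift(lhs, rhs, transactions))
```

A rule can have high confidence simply because its right-hand side is common everywhere; a lift near 1 signals that the two sides are close to independent even when the confidence looks strong.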

Data Mining Function: (3) Classification


- Classification and label prediction
  - Construct models (functions) based on some training examples
  - Describe and distinguish classes or concepts for future prediction
    - E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  - Predict some unknown class labels
- Typical methods
  - Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, ...
- Typical applications:
  - Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, ...
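
As a minimal sketch of constructing a model from training examples, the code below fits a one-level decision tree (a decision stump) that classifies cars by gas mileage. The training data, the "low"/"high" labels, and the assumption that "high" lies above the threshold are all hypothetical.

```python
# Hypothetical training examples: (miles per gallon, class label).
training = [(15, "low"), (18, "low"), (22, "low"),
            (30, "high"), (35, "high"), (40, "high")]

def predict(threshold, x):
    """Apply the stump: values above the threshold are labeled 'high'."""
    return "high" if x > threshold else "low"

def train_stump(examples):
    """Choose the split point that misclassifies the fewest training examples."""
    values = sorted(x for x, _ in examples)
    # Candidate thresholds are midpoints between consecutive feature values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(predict(threshold, x) != label for x, label in examples)

    return min(candidates, key=errors)

threshold = train_stump(training)
print(threshold)
print(predict(threshold, 28))  # label for a previously unseen car
```

Full decision-tree learners repeat this kind of split search recursively and over many attributes, usually with an impurity measure rather than raw error counts.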

Data Mining Function: (4) Cluster Analysis


- Unsupervised learning (i.e., class label is unknown)
- Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
- Principle: maximizing intra-class similarity & minimizing interclass similarity
- Many methods and applications
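
One widely used clustering method is k-means, shown here as a minimal one-dimensional sketch. The slide does not single out any method, so k-means is an illustrative choice, and the house prices and initial centers are hypothetical.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

prices = [100, 110, 120, 300, 310, 320]  # hypothetical house prices, in thousands
centers, clusters = kmeans_1d(prices, centers=[100, 320])
print(centers)
print(clusters)
```

Points inside a cluster end up close to their own center (high intra-class similarity) and far from the other center (low interclass similarity), which is the principle stated above.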

Data Mining Function: (5) Outlier Analysis


- Outlier: A data object that does not comply with the general behavior of the data
- Noise or exception? One person's garbage could be another person's treasure
- Methods: by-product of clustering or regression analysis, ...
- Useful in fraud detection, rare events analysis
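
The slide names clustering and regression by-products as methods. The sketch below instead uses a simpler statistical rule, flagging values that lie more than a chosen number of standard deviations from the mean. The threshold of 2.0, the function name, and the transaction amounts are assumptions made for illustration.

```python
from statistics import mean, pstdev

def find_outliers(values, threshold=2.0):
    """Return values whose distance from the mean exceeds threshold standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]

amounts = [20, 22, 19, 21, 23, 20, 500]  # hypothetical card transactions
outliers = find_outliers(amounts)
print(outliers)
```

Whether the flagged value is noise to discard or the fraud case worth investigating depends on the application, which is the point of the "garbage or treasure" remark.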

Time and Ordering: Sequential Pattern, Trend and Evolution Analysis

- Sequence, trend and evolution analysis
  - Trend, time-series, and deviation analysis: e.g., regression and value prediction
  - Sequential pattern mining
    - e.g., first buy digital camera, then buy large SD memory cards
  - Periodicity analysis
  - Motifs and biological sequence analysis
    - Approximate and consecutive motifs
  - Similarity-based analysis
- Mining data streams
  - Ordered, time-varying, potentially infinite data streams
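
The "camera first, then SD card" example can be checked by measuring how many purchase histories contain the pattern in order. This is a minimal sketch that only scores one given pattern; it does not search for patterns. The histories and function names are hypothetical.

```python
def contains_in_order(sequence, pattern):
    """True if the pattern's items occur in the sequence in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the match, enforcing order.
    return all(item in remaining for item in pattern)

def sequence_support(db, pattern):
    """Fraction of sequences in db that contain the pattern as a subsequence."""
    return sum(contains_in_order(s, pattern) for s in db) / len(db)

# Hypothetical purchase histories, one list per customer, in time order.
histories = [
    ["camera", "tripod", "sd_card"],
    ["sd_card", "camera"],
    ["camera", "sd_card"],
    ["laptop"],
]
print(sequence_support(histories, ["camera", "sd_card"]))
print(sequence_support(histories, ["sd_card", "camera"]))
```

Unlike an itemset, a sequential pattern is sensitive to order, so the two calls give different support values on the same data.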

Structure and Network Analysis


- Graph mining
  - Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
- Information network analysis
  - Social networks: actors (objects, nodes) and relationships (edges)
    - e.g., author networks in CS, terrorist networks
  - Multiple heterogeneous networks
    - A person could be in multiple information networks: friends, family, classmates, ...
  - Links carry a lot of semantic information: link mining
- Web mining
  - Web is a big information network: from PageRank to Google
  - Analysis of Web information networks
    - Web community discovery, opinion mining, usage mining, ...
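
PageRank illustrates how links alone carry information about importance. Below is a minimal power-iteration sketch on a four-page graph. The graph, the damping factor of 0.85, and the fixed iteration count are assumptions, and every link target is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeatedly redistributing rank along outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the random-jump term).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

Page "c" collects links from three pages and ends up with the highest rank, while "d", which nothing links to, keeps only the baseline share.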

Evaluation of Knowledge
- Is all mined knowledge interesting?
  - One can mine a tremendous number of patterns
  - Some may fit only a certain dimension space (time, location, ...)
  - Some may not be representative, may be transient, ...
- Evaluation of mined knowledge: can we directly mine only interesting knowledge?
  - Descriptive vs. predictive
  - Coverage
  - Typicality vs. novelty
  - Accuracy
  - Timeliness

Introduction
- Why Data Mining?
- What Is Data Mining?
- A Multi-Dimensional View of Data Mining
- What Kinds of Data Can Be Mined?
- What Kinds of Patterns Can Be Mined?
- What Kinds of Technologies Are Used?
- What Kinds of Applications Are Targeted?
- Major Issues in Data Mining
- A Brief History of Data Mining and Data Mining Society
- Summary

Data Mining: Confluence of Multiple Disciplines


[Figure: Data Mining at the center, drawing on Machine Learning, Pattern Recognition, Statistics, Visualization, High-Performance Computing, Database Technology, Algorithm, and Applications]

Why Confluence of Multiple Disciplines?


- Tremendous amount of data
  - Algorithms must be scalable to handle big data
- High-dimensionality of data
  - Micro-array may have tens of thousands of dimensions
- High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structure data, graphs, social and information networks
  - Spatial, spatiotemporal, multimedia, text and Web data
  - Software programs, scientific simulations
- New and sophisticated applications

Applications of Data Mining


- Web page analysis: from web page classification, clustering to PageRank & HITS algorithms
- Collaborative analysis & recommender systems
- Basket data analysis to targeted marketing
- Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
- Data mining and software engineering
- From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining

Major Issues in Data Mining (1)


- Mining Methodology
  - Mining various and new kinds of knowledge
  - Mining knowledge in multi-dimensional space
  - Data mining: An interdisciplinary effort
  - Boosting the power of discovery in a networked environment
  - Handling noise, uncertainty, and incompleteness of data
  - Pattern evaluation and pattern- or constraint-guided mining
- User Interaction
  - Interactive mining
  - Incorporation of background knowledge
  - Presentation and visualization of data mining results

Major Issues in Data Mining (2)


- Efficiency and Scalability
  - Efficiency and scalability of data mining algorithms
  - Parallel, distributed, stream, and incremental mining methods
- Diversity of data types
  - Handling complex types of data
  - Mining dynamic, networked, and global data repositories
- Data mining and society
  - Social impacts of data mining
  - Privacy-preserving data mining
  - Invisible data mining

A Brief History of Data Mining Society


- 1989 IJCAI Workshop on Knowledge Discovery in Databases
  - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
- 1991-1994 Workshops on Knowledge Discovery in Databases
  - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
- 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  - Journal of Data Mining and Knowledge Discovery (1997)
- ACM SIGKDD conferences since 1998 and SIGKDD Explorations
- More conferences on data mining
  - PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), WSDM (2008), etc.
- ACM Transactions on KDD (2007)

Conferences and Journals on Data Mining


- KDD Conferences
  - ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  - SIAM Data Mining Conf. (SDM)
  - (IEEE) Int. Conf. on Data Mining (ICDM)
  - European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
  - Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
  - Int. Conf. on Web Search and Data Mining (WSDM)
- Other related conferences
  - DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, ...
  - Web and IR conferences: WWW, SIGIR, WSDM
  - ML conferences: ICML, NIPS
  - PR conferences: CVPR, ...
- Journals
  - Data Mining and Knowledge Discovery (DAMI or DMKD)
  - IEEE Trans. on Knowledge and Data Eng. (TKDE)
  - KDD Explorations
  - ACM Trans. on KDD

Where to Find References? DBLP, CiteSeer, Google


- Data mining and KDD (SIGKDD: CDROM)
  - Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  - Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
- Database systems (SIGMOD: ACM SIGMOD Anthology CD-ROM)
  - Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  - Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
- AI & Machine Learning
  - Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  - Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
- Web and IR
  - Conferences: SIGIR, WWW, CIKM, etc.
  - Journals: WWW: Internet and Web Information Systems, ...
- Statistics
  - Conferences: Joint Stat. Meeting, etc.
  - Journals: Annals of Statistics, etc.
- Visualization
  - Conference proceedings: CHI, ACM-SIGGraph, etc.
  - Journals: IEEE Trans. Visualization and Computer Graphics, etc.

Recommended Reference Books


- E. Alpaydin. Introduction to Machine Learning, 2nd ed. MIT Press, 2011
- S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
- T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
- U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
- J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques, 3rd ed. Morgan Kaufmann, 2011
- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, 2009
- B. Liu. Web Data Mining. Springer, 2006
- T. M. Mitchell. Machine Learning. McGraw Hill, 1997
- Y. Sun and J. Han. Mining Heterogeneous Information Networks. Morgan & Claypool, 2012
- P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005
- S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
- I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005

Summary
- Data mining: Discovering interesting patterns and knowledge from massive amounts of data
- A natural evolution of science and information technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed on a variety of data
- Data mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc.
- Data mining technologies and applications
- Major issues in data mining