What Is Data Mining?
Data Mining: A KDD Process
[Figure: KDD process diagram. Recoverable labels: databases, data cleaning, data integration, task-relevant data]
Data Mining: On What Kinds of Data?
Relational database
Data warehouse
Transactional database
Advanced database and information repository
Spatial and temporal data
Time-series data
Stream data
Multimedia database
Why Data Mining?—Potential Applications
Market Analysis and Management
Cross-market analysis
Associations/correlations between product sales, and
prediction based on such associations
Customer profiling
What types of customers buy what products
Customer requirement analysis
Identifying the best products for different customers
Predict what factors will attract new customers
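As a rough illustration of the cross-market analysis idea above, the sketch below measures how often products are bought together and how strongly one product's sale predicts another's. The baskets are the five example transactions from the transaction-data table in these slides; the function names `support` and `confidence` are illustrative choices, not part of the original deck.

```python
from collections import Counter
from itertools import combinations

# Example baskets: the five transactions from the transaction-data table in these slides.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Milk"},
    {"Beer", "Bread", "Milk"},
    {"Coke", "Milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) as support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

# Count how often each pair of products appears in the same basket.
pair_counts = Counter(
    pair for basket in transactions for pair in combinations(sorted(basket), 2)
)

print(pair_counts.most_common(1))               # the most frequent product pair
print(support({"Coke", "Milk"}, transactions))  # Coke and Milk co-occur in 3 of 5 baskets
print(confidence({"Coke"}, {"Milk"}, transactions))
```

In this small sample every basket containing Coke also contains Milk, so a rule such as "Coke implies Milk" could serve as the basis for a prediction; with real sales data the thresholds for support and confidence would have to be chosen by the analyst.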
Fraud Detection & Mining Unusual Patterns
Architecture: Typical Data Mining System
[Figure: typical data mining system architecture. Recoverable labels: pattern evaluation, databases, data warehouse]
Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the confluence of database systems, statistics, machine learning, visualization, algorithms, and other disciplines]
Data Mining: Classification Schemes
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream,
object-oriented/relational, active, spatial, time-series,
text, multi-media, heterogeneous, WWW
Knowledge to be mined
Characterization, discrimination, association,
classification, clustering, trend/deviation, outlier
analysis, etc.
Multiple/integrated functions and mining at multiple
levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine
learning, statistics, visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis,
bio-data mining, stock market analysis, Web mining,
etc.
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Getting to Know Your Data
Data Visualization
Summary
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix, crosstabs
Document data: text documents represented as term-frequency vectors

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Milk
4    Beer, Bread, Milk
5    Coke, Milk

Graph and network
World Wide Web
Social or information networks
Molecular structures
Ordered
Video data: sequence of images
Temporal data: time-series
Sequential data: transaction sequences
Genetic sequence data
Spatial, image and multimedia
Spatial data: maps
Image data
Video data
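To make the term-frequency representation of document data concrete, here is a minimal sketch that turns a text into a vector of term counts over a fixed vocabulary. The vocabulary is a subset of the terms shown in the slide; the sample sentence and the helper name `term_frequency_vector` are made up for illustration.

```python
import re
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in `text` (case-insensitive)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # Counter returns 0 for terms that never occur, so every slot is filled.
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary (a subset of the slide's terms) and a made-up document.
vocabulary = ["team", "coach", "play", "ball", "score", "game"]
doc = "The coach told the team to play the ball; the team won the game."

vector = term_frequency_vector(doc, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Each document becomes one row of a data matrix whose columns are the vocabulary terms, which is how the document table in this slide is laid out.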
Data Objects
Attribute types:
Nominal
Binary
Ordinal
Numeric: quantitative
Interval-scaled
Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {black, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Ordinal
Values have a meaningful order (ranking) but magnitude between
successive values is not known.
Size = {small, medium, large}, grades, rankings
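The practical difference between these attribute types is which operations make sense on them. The sketch below, with made-up sample values, uses the mode for a nominal attribute, counts only the positive outcome for an asymmetric binary attribute, and takes a median for an ordinal attribute by way of an explicit rank order. The variable names are illustrative.

```python
from collections import Counter

# Nominal: only equality is meaningful, so the mode is a sensible summary.
hair_color = ["black", "brown", "black", "red", "black"]
mode_hair = Counter(hair_color).most_common(1)[0][0]

# Asymmetric binary: the rare outcome (1 = positive test) is the one of interest.
test_results = [0, 0, 1, 0, 1, 0]
positives = sum(test_results)

# Ordinal: values can be ranked, but the gap between ranks is not known,
# so a median is meaningful while a mean is not.
SIZE_ORDER = ["small", "medium", "large"]
rank = {level: i for i, level in enumerate(SIZE_ORDER)}
sizes = ["large", "small", "medium", "small", "large"]
sorted_sizes = sorted(sizes, key=rank.__getitem__)
median_size = sorted_sizes[len(sorted_sizes) // 2]

print(mode_hair, positives, median_size)
```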
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order and can be positive or negative
e.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
A numeric attribute with an inherent zero-point
We can speak of a value as being a multiple (or ratio) of another value
(10 K is twice as high as 5 K)
e.g., temperature in Kelvin, length, counts
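A short worked example may help separate the two numeric types. Differences are meaningful on both scales, but ratios are meaningful only when the scale has a true zero. The temperatures below are illustrative, and `celsius_to_kelvin` is a helper written for this sketch.

```python
def celsius_to_kelvin(celsius):
    """Convert an interval-scaled Celsius reading to the ratio-scaled Kelvin scale."""
    return celsius + 273.15

c_low, c_high = 5.0, 10.0

# Differences agree on both scales: a 5-degree gap is a 5-kelvin gap.
difference_c = c_high - c_low
difference_k = celsius_to_kelvin(c_high) - celsius_to_kelvin(c_low)

# Ratios do not: 10 °C is not "twice as hot" as 5 °C.
naive_ratio = c_high / c_low
kelvin_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)

print(difference_c, round(difference_k, 6))
print(naive_ratio, round(kelvin_ratio, 4))
```

The Celsius ratio of 2 is an artifact of where that scale places its zero; on the Kelvin scale the same two temperatures differ by less than two percent.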
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
e.g., the set of words in a collection of documents
Sometimes represented as integer variables
Binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
Typically represented as floating-point variables
Basic Statistical Descriptions of Data
Motivation
To better understand the data: central tendency,
variation and spread
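\[
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2
    = \frac{1}{n-1}\left[\sum_{i=1}^{n}x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n}x_i\right)^2\right]
\]
\[
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2
         = \frac{1}{N}\sum_{i=1}^{N}x_i^2 - \mu^2
\]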
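The sample and population variance can each be computed in two algebraically equivalent ways: from squared deviations around the mean, or from the sum of squares and the sum of values. The sketch below checks both forms against Python's standard `statistics` module. The data values and function names are illustrative.

```python
import math
import statistics

def sample_variance_deviations(xs):
    """s^2 as the sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_sums(xs):
    """s^2 from the sum of squares and the squared sum, divided by n - 1."""
    n = len(xs)
    total = sum(xs)
    total_sq = sum(x * x for x in xs)
    return (total_sq - total * total / n) / (n - 1)

def population_variance_sums(xs):
    """sigma^2 as the mean of squares minus the squared mean."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu * mu

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values

print(math.isclose(sample_variance_deviations(data), sample_variance_sums(data)))
print(math.isclose(sample_variance_sums(data), statistics.variance(data)))
print(math.isclose(population_variance_sums(data), statistics.pvariance(data)))
```

The sum-based forms need only one pass over the data, but they subtract two large, nearly equal numbers and can lose precision when the mean is large relative to the spread; the deviation-based form is usually the safer choice numerically.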