Professional Documents
Culture Documents
Data Mining and Knowledge Discovery in Databases (KDD) State of The Art
Data Mining and Knowledge Discovery in Databases (KDD) State of The Art
Data Mining and Knowledge Discovery in Databases (KDD) State of The Art
1
Conference overview
User
Shell System-
Knowledge Engineer
Base
User Interface
Data Mining
Inference
Data/Expert
What is KDD?
Why is KDD necessary
The KDD process
KDD operations and methods
Prediction
– What? Opaque
Description
– Why? Transparent
Verification driven
– Validating hypothesis
– Querying and reporting (spreadsheets, pivot
tables)
– Multidimensional analysis (dimensional
summaries); On Line Analytical Processing
– Statistical analysis
Discovery driven
– Exploratory data analysis
– Predictive modeling
– Database segmentation
– Link analysis
– Deviation detection
Data Mining
Transformation
Preprocessing Knowledge
Selection
Patterns
Transformed
Preprocessed Data
Target Data
Original
Data
Data
AI
Machine learning
Statistics
Databases and data warehousing
High performance computing
Visualization
T. Nouri Data Mining 15
Need for data mining tools
Scalability
– Efficient and sufficient sampling
– In-memory vs. disk-based processing
– High performance computing
Automation
– Ease of use
– Using prior knowledge
T. Nouri Data Mining 23
Types of data mining tasks
Association rules
Sequence mining
Classification(decision tree etc.)
Clustering
Deviation detection
K-nearest neighbors
701 38 London
701 38 Cairo
11 531 Cairo
2 43 Sports High
High
3 68 Family Low
4 32 Truck Low
5 20 Family High
High Low
Distance-based
– Numeric
• Euclidean distance (root of sum of squared
differences along each dimension)
• Angle between two vectors
– Categorical
• Number of common features (categorical)
Partition-based
– Enumerate partitions and score each
T. Nouri Data Mining 44
K-means algorithm
Initial seeds
New centers
Final centers
outlier
Neighborhood
5 of class
3 of class
=