Introduction
Why Data Mining?
data selection (where data relevant to the analysis task are retrieved
from the database)
[Figure: data flow from Databases through Data Cleaning and Data
Integration to the Data Warehouse, then Data Selection to Task-relevant
Data]
June 30, 2020 Data Mining: Concepts and Techniques 8
Architecture: Typical Data Mining System
[Figure: components of a typical data mining system: Pattern Evaluation,
Data Mining Engine, Knowledge-Base, Database or Data Warehouse Server]
[Figure: Data Mining at the center of related disciplines: Database
Technology, Statistics, Machine Learning, Visualization, Pattern
Recognition, Algorithm, Other Disciplines]
Example: The characteristics of customers who spend more than $1000 a year
at AllElectronics (a store). The result can be a general profile of these
customers, such as age, employment status, or credit rating.
Example: The user may want to compare the general features of software
products whose sales increased by 10% in the last year with those whose
sales decreased by about 30% over the same period.
Performance Issues
Issues
Fraud Detection
Apart from these, data mining can also be used in the areas of
production control, customer retention, science exploration, sports,
astronomy, and Internet Web Surf-Aid.
Data mining is also used in credit card services and telecommunications
to detect fraud. For fraudulent telephone calls, it helps identify the
destination of the call, its duration, the time of day or week, and so
on. It also analyzes patterns that deviate from expected norms.
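As a rough illustration of "patterns that deviate from expected norms",
the sketch below flags call durations that lie far from the mean. The
z-score rule, the threshold of 2.0, the function name flag_deviations,
and the sample durations are all assumptions made for illustration; they
are not taken from the slides, and real fraud detection uses richer
features.

```python
from statistics import mean, stdev

def flag_deviations(durations, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold.

    Hypothetical helper: the z-score rule is one simple way to define
    "deviates from the norm", chosen here only for illustration.
    """
    mu = mean(durations)
    sigma = stdev(durations)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [(i, d) for i, d in enumerate(durations)
            if abs(d - mu) / sigma > threshold]

# Made-up call durations in minutes; the last call is unusually long.
calls = [3, 4, 5, 4, 3, 5, 4, 60]
suspicious = flag_deviations(calls)
print(suspicious)
```

A single extreme value inflates both the mean and the standard
deviation, so in practice a median-based rule may be more robust.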
will help
and healthy)
Identify gene sequence patterns that play roles in various diseases
separately
Visualization tools and genetic data analysis
Completeness
Consistency
Timeliness
Believability
Value added
Interpretability
Accessibility
Broad categories:
Intrinsic, contextual, representational, and accessibility
Motivation
To better understand the data: central tendency, variation
and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of
precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
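The dispersion measures listed above (median, max, min, quantiles,
outliers, variance) can be sketched with Python's standard library. The
1.5 x IQR fence for outliers is the usual boxplot convention and is an
assumption here, as are the function names and the sample data.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the basis of a boxplot."""
    data = sorted(values)
    q1, med, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, med, q3, max(data)

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed boxplot rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

# Made-up sample with one extreme value.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(five_number_summary(sample))
print(iqr_outliers(sample))
print(statistics.variance(sample))  # sample variance
```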
Data Cleaning
Importance
“Data cleaning is one of the three biggest problems in data
warehousing” (DCI survey)
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Correct inconsistent data
Resolve redundancy caused by data integration
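As a minimal sketch of the first task, filling in missing values, the
code below replaces missing entries with the mean of the observed ones.
Mean imputation is only one possible strategy and is an assumption here;
the function name and the use of None to mark a missing value are
likewise illustrative choices.

```python
from statistics import mean

def fill_missing_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to estimate the mean from
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Made-up attribute column with two missing entries.
incomes = [10, None, 20, 30, None]
print(fill_missing_with_mean(incomes))
```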
Missing Data
technology limitation
incomplete data
inconsistent data
width) bins
then one can smooth by bin means, smooth by bin
regression functions
Clustering
detect and remove outliers
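Smoothing by bin means can be sketched as follows. The text describing
how the bins are formed is truncated in the source, so the equal-width
partitioning used here is an assumption, as are the function names and
the sample values.

```python
from statistics import mean

def bin_index(value, lo, width, num_bins):
    """Index of the equal-width bin holding value (max goes in last bin)."""
    return min(int((value - lo) / width), num_bins - 1)

def smooth_by_bin_means(values, num_bins):
    """Replace each value with the mean of its equal-width bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # guard against zero width
    bins = {}
    for v in values:
        bins.setdefault(bin_index(v, lo, width, num_bins), []).append(v)
    bin_mean = {idx: mean(members) for idx, members in bins.items()}
    return [bin_mean[bin_index(v, lo, width, num_bins)] for v in values]

# Made-up sorted values split into three bins of width 10.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(smooth_by_bin_means(prices, 3))
```

Equal-width bins can hold very different numbers of values; partitioning
into bins of equal size instead would only change how bin_index assigns
values.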
[Figure: Regression. A data point (X1, Y1) is smoothed to Y1' on the
fitted line y = x + 1; axes x and y]
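Smoothing with a regression function can be sketched by fitting a
straight line and replacing each observed y with the value on the line.
Ordinary least squares and a single predictor are assumptions made for
this sketch; the function names and data points are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct x values
    return slope, mean_y - slope * mean_x

def smooth_by_regression(xs, ys):
    """Replace each y with its fitted value on the regression line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Made-up noisy points scattered around the line y = x + 1.
xs = [0, 1, 2, 3]
ys = [1.2, 1.8, 3.2, 3.8]
print(fit_line(xs, ys))
print(smooth_by_regression(xs, ys))
```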