Introduction
Data Mining
o Data cleaning
o Data Integration
o Data Selection
o Data Transformation
o Data Mining
o Pattern Evaluation
o Knowledge Presentation
Figure – 1: Data mining as a step in the process of
knowledge discovery
• The first 4 steps are different forms of data
preprocessing, in which the data are prepared
for mining
• The data mining step may interact with the user or
a knowledge base
• Interesting patterns are presented to the user and
can be stored as new knowledge in a
separate knowledge base
Kinds of data
• Classification
Construct models (functions) that describe and
distinguish classes or concepts to predict the class of
objects whose class label is unknown
Example:
• In the weather problem, the play or don’t play
judgment
• In the contact lenses problem, the lens
recommendation
• The derived model can be represented in various forms:
Decision tree
Neural network
If-Then rules
Clustering:
• Grouping similar instances into clusters
2) Redundancy
• Redundant data often occur when multiple databases
are integrated
• Redundant attributes can often be detected by
correlation analysis
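
As a rough illustration of the correlation analysis mentioned above, the sketch below computes the Pearson correlation coefficient between two numeric attributes. The attribute names and values are made up for the example, and the choice of Pearson's coefficient is an assumption; the notes do not say which correlation measure is meant.

```python
import math

def pearson_correlation(a, b):
    """Pearson correlation coefficient of two equal-length numeric attributes."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("attributes must have the same length (at least 2)")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sums of cross-products and squared deviations from the means
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / math.sqrt(var_a * var_b)

# Hypothetical attributes from two integrated sources (illustrative values only)
price_usd = [10.0, 20.0, 30.0, 40.0]
price_cents = [1000.0, 2000.0, 3000.0, 4000.0]  # same information, different unit
units_sold = [3.0, 1.0, 4.0, 1.0]

r_redundant = pearson_correlation(price_usd, price_cents)
r_other = pearson_correlation(price_usd, units_sold)
```

A coefficient close to +1 or -1, as for the two price attributes here, suggests that one attribute may be redundant; the cut-off used to decide this is a judgment call.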
• Normalization by decimal scaling: $v' = \dfrac{v}{10^{j}}$
where $j$ is the smallest integer such that $\max(|v'|) < 1$
5) Attribute construction: New attributes are constructed from
the given ones
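
The decimal scaling normalization listed above can be sketched as follows. This is a minimal sketch that assumes j is non-negative, so values already below 1 in magnitude are left unscaled; the function name and sample values are illustrative.

```python
def decimal_scaling(values):
    """Scale values by 10**j, with j the smallest non-negative integer
    such that every scaled value has absolute value below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    # Increase j until the largest magnitude drops below 1
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Illustrative values: the largest magnitude is 986, so j = 3
scaled, j = decimal_scaling([-986, 917])
```

With these inputs the values are divided by 1,000, giving results close to -0.986 and 0.917.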
Data Reduction:
Data reduction techniques can be applied to obtain a reduced
representation of the dataset that is much smaller in volume,
yet closely maintains the integrity of the original data
• Various strategies used for data reduction are as follows:
1) Data cube aggregation
2) Attribute subset selection: it includes the following
techniques:
• Stepwise forward selection
• Stepwise backward elimination
• Combination of forward selection and backward
elimination
• Decision tree induction
3) Dimensionality reduction: two effective
methods are as follows:
• Wavelet Transforms
– Length, L, must be an integer power of 2 (padding with 0’s,
when necessary)
– Each transform has 2 functions: smoothing and difference
– The functions are applied to pairs of data, resulting in two sets of data of length L/2
– The two functions are applied recursively until the desired
length is reached
• Principal Component Analysis
- Given N data vectors in n dimensions, find k ≤ n orthogonal
vectors (principal components) that can best be used to represent the data
- Works for numeric data only
- Used when the number of dimensions is large
4) Numerosity reduction: some of the techniques in numerosity reduction are
as follows:
• Regression and Log-Linear Models
In linear regression, the data are modeled to fit a straight line: y = wx + b
Multiple linear regression allows a response variable, y, to be modeled as a
linear function of two or more predictor variables
Log-linear models approximate discrete multidimensional probability
distributions.
• Histograms
Divide data into buckets and store average (sum) for each bucket
Partitioning rules:
– Equal-width: equal bucket range
– Equal-frequency (or equal-depth)
– V-optimal: the histogram with the least variance (histogram variance is a
weighted sum of the original values that each bucket represents)
– MaxDiff: a bucket boundary is set between each pair of adjacent values for
the pairs having the β–1 largest differences, where β is the number of buckets
• Clustering
- Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
- Can be very effective if data is clustered but not if data is “smeared”
• Sampling
Obtaining a small sample s to represent the whole data set N
- Simple random sample without replacement (SRSWOR) of size s
- Simple random sample with replacement (SRSWR) of size s
- Cluster sample
- Stratified sample
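
Three of the sampling schemes listed above can be sketched with Python's standard `random` module. This is only an illustration: the helper names, the toy records and the proportional allocation used for the stratified sample (with at least one row per stratum) are assumptions, and cluster sampling is omitted.

```python
import random
from collections import defaultdict

def srswor(data, s, rng):
    """Simple random sample without replacement: each row is drawn at most once."""
    return rng.sample(data, s)

def srswr(data, s, rng):
    """Simple random sample with replacement: a row may be drawn more than once."""
    return rng.choices(data, k=s)

def stratified_sample(data, stratum_of, fraction, rng):
    """Draw an SRSWOR from every stratum, in proportion to the stratum size."""
    strata = defaultdict(list)
    for row in data:
        strata[stratum_of(row)].append(row)
    sample = []
    for rows in strata.values():
        k = max(1, round(len(rows) * fraction))  # keep small strata represented
        sample.extend(rng.sample(rows, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Toy data set: 8 "young" and 2 "senior" records (illustrative only)
records = [(i, "young" if i < 8 else "senior") for i in range(10)]

wor = srswor(records, 4, rng)
wr = srswr(records, 4, rng)
strat = stratified_sample(records, lambda row: row[1], 0.25, rng)
```

The stratified sample keeps the small "senior" group represented, which a plain random sample of the same size might miss.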
Discretization and concept hierarchy generation
[Figure: stepwise partitioning of a value range; surviving labels: count, (-$1,000 - $2,000), Step 3, (-$400 - $5,000), Step 4]