Professional Documents
Culture Documents
• e.g., occupation=“ ”
• e.g., Salary=“-10”
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data transformation
• Normalization and aggregation
• Data reduction
• Obtains reduced representation in volume but produces the same or
similar analytical results
• Data discretization
• Part of data reduction but with particular importance, especially for
numerical data
• Importance
• “Data cleaning is one of the three biggest problems in data
warehousing”
• Data Cleaning tasks
• Fill in missing values
• Identify outliers and smooth out noisy data
• Correct inconsistent data
• Resolve redundancy caused by data integration
• Binning
• first sort data and partition into (equal-frequency) bins
• then one can smooth by bin means, smooth by bin median,
smooth by bin boundaries, etc.
• Regression
• smooth by fitting the data into regression functions
• Clustering
• detect and remove outliers
5; 10; 11; 13; 15; 35; 50; 55; 72; 92; 204; 215;
bin 1 5,10,11,13
bin 2 15,35,50,55
bin 3 72,92,204,215
bin 1 5,10,11,13,15
bin 2 35,50,55,72,92
bin 3 204,215
( ex. 5; 10; 11; 13; 15; 35; 50; 55; 72; 92; 204; 215;)
5/14/22 Data Mining: Concepts and Techniques 16
Data Integration
• Data integration:
• Combines data from multiple sources into a coherent store
• Schema integration: e.g., A.cust-id B.cust-#
• Integrate metadata from different sources
• Entity identification problem:
• Identify real world entities from multiple data sources, e.g., Bill
Clinton = William Clinton
• Detecting and resolving data value conflicts
• For the same real world entity, attribute values from different
sources are different
• Possible reasons: different representations, different scales,
e.g., metric vs. British units
• Such analysis can measure how strongly one attributes implies the
other based on the available data
• The larger the Χ2 value, the more likely the variables are
related
=90
2 2 2 2
( 250 90) (50 210) ( 200 360) (1000 840)
2 507.93
90 210 360 840
So,
• If rA,B > 0,
A and B are positively correlated (A’s values increase
as B’s). The higher, the stronger correlation.
• rA,B = 0: independent;
Positive covariance: If CovA,B > 0, then A and B both tend to be larger than
their expected values
Some pairs of random variables may have a covariance of 0 but are not
independent. Only under some additional assumptions (e.g., the data follow
multivariate normal distributions) does a covariance of 0 imply independence
5/14/22 Data Mining: Concepts and Techniques 25
Co-Variance: An Example
• It can be simplified in computation as
Example:
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8),
(5, 10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices
rise or fall together? Days Stock Stock
A B
Monday 2 5
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
Tuesday 3 8
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
Wednesday 5 10
• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4 Thursday 4 11
Friday 6 14
• Thus, A and B rise together since Cov(A, B) > 0.
Data Transformation
• A function that maps the entire set of values of a given attribute to a new set
of replacement values so that each old value can be identified with one of
the new values
• Methods
• Smoothing: Remove noise from data
• Techniques include binning, regression, clustering
• Attribute/feature construction
• New attributes constructed from the given ones
27
Data Transformation
28
Normalization
• Ex. Let min and max value for the attribute income are
$12,000 and $98,000 resp. Now map income to the range
[0.0, 1.0]. Then the value $73,000 for income is transformed
as,
73,600 12,000
(1.0 0) 0 0.716
98,000 12,000
29
Normalization
31
Example
Use the two methods below to normalize the following group of
data:
(a) Use min-max normalization to transform the value 35 for age onto the range
[0:0; 1:0]
(b) Use z-score normalization to transform the value 35 for age, where the
standard deviation of age is 12.94 years.
(c) Use normalization by decimal scaling to transform the value 35 for age.
$0…$1000
36
Automatic Concept Hierarchy Generation
• Histogram analysis
• Top-down split, unsupervised
38
Data Reduction
• Data reduction
• Obtain a reduced representation of the data set that is much
smaller in volume but yet produce the same (or almost the same)
analytical results
39
Data reduction strategies
y=wx + b
• Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only
• Can be very effective if data is clustered but not if data is “smeared”
W O R
SRS le random
i m p h ou t
( s e wi t
l
samp ment)
pl a ce
re
SRSW
R
Raw Data
5/14/22 Data Mining: Concepts and Techniques 46
Sampling: Cluster or Stratified Sampling