Professional Documents
Culture Documents
Fall 2010
W O R
SRS le random
i m p h ou t
( s e wi t
l
samp ment)
pl a c e
re
SRSW
R
Raw Data
December 8, 2021 Data Mining: Concepts and Techniques 8
Raw Data Cluster/Stratified Sample
(-$1,000 - $2,000)
Step 3:
(-$400 -$5,000)
Step 4:
(-$400 - 0) (0 -
($1,000($1,000
- - $2, 000)
(-$400 - $200) (0 - $1,000)
$1,200) ($2,000 -
-$300) (0 -
($200 - ($1,000 - $3,000)
(-$400 - $200) ($1,200 -
-$300) $400) $1,200)
(-$300 - $1,400)
-$200) ($200 - ($3,000 -
($1,200 -
(-$300 - $400)
($400 - ($1,400 - $4,000)
$1,400)
(-$200 -
-$200) $600) $1,600) ($4,000 -
-$100) ($400 - ($600 - ($1,400 - $5,000)
($1,600 -
(-$200 - $600) $800) $1,600) ($1,800 -
($800 - $1,800)
-$100)(-$100 - $2,000)
($600 - $1,000) ($1,600 -
0) $800) ($800 - ($1,800 -
$1,800)
(-$100 - $1,000) $2,000)
0)
December 8, 2021 Data Mining: Concepts and Techniques 17
Specification of a partial/total ordering of attributes
explicitly at the schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by
explicit data grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute
levels) by the analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state,
country}
December 8, 2021 Data Mining: Concepts and Techniques 18
Some hierarchies can be automatically
generated based on the analysis of the number
of distinct values per attribute in the data set
The attribute with the most distinct values is
placed at the lowest level of the hierarchy
Exceptions, e.g., weekday, month, quarter,
year
country 15 distinct
15 distinct values
values
street 674,339
674,339 distinct
distinct values
values
December 8, 2021 Data Mining: Concepts and Techniques 19
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy
generation
Summary