Professional Documents
Culture Documents
1
W HAT IS D ATA ?
● An attribute is a property or
characteristic of an object
– Examples: eye color of a
person, temperature, etc.
– Attribute is also known as
variable, field, characteristic,
or feature
● A collection of attributes Objects
describe an object
– Object is also known as record,
point, case, sample, entity, or
instance
TYPES OF ATTRIBUTES
3
DISCRETE AND CONTINUOUS ATTRIBUTES
● Discrete Attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection
of documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete
attributes
● Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-
point variables.
4
T Y P E S O F DATA S E T S
● Record
– Data Matrix
– Document Data
– Transaction Data
● Graph
– World Wide Web
– Molecular Structures
● Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
5
I M P O RTA N T C H A R A C T E R I S T I C S OF
S T R U C T U R E D D ATA
– Dimensionality
Curse of Dimensionality
– Sparsity
Only presence counts
– Resolution
Patterns depend on the scale
6
R E C O R D D ATA
7
D ATA M AT R I X
8
D O C U M E N T D ATA
9
T R A N S AC T I O N D ATA
item
transaction
10
G R A P H D ATA
11
C H E M I C A L D ATA
● Benzene Molecule: C 6 H 6
12
O R D E R E D D ATA
● Sequences of transactions
Items/Events
An element of
the 13
sequence
O R D E R E D D ATA
14
O R D E R E D D ATA
● Spatio-Temporal Data
Average Monthly
Temperature of
land and ocean
Trajectories of
Moving Objects
15
D ATA Q UA L I T Y
16
NOISE
17
Two Sine Waves Two Sine Waves + Noise
O U TLI E R S
18
D EVIATION /A NOMALY D E T E C T I O N
⚫ Network Intrusion
Detection
19
day
MISSING VALUES
20
D U P L I C AT E D ATA
● Examples:
– Same person with multiple email addresses
● Data cleaning
– Process of dealing with duplicate data issues
21
D ATA P R E P R O C E S S I N G
● Aggregation
● Sampling
● Dimensionality Reduction
● Feature subset selection
● Feature creation
● Discretization and Binarization
● Attribute Transformation
A G G R E G A TIO N
● Purpose
– Data reduction
Reduce the number of attributes or objects
– Change of scale
Cities aggregated into regions, states, countries, etc
– More “stable” data
Aggregated data tends to have less variability
23
S AM P L I N G
● Sampling is the main technique employed for data
selection.
– It is often used for both the
preliminary investigation of the data and the final
data analysis.
25
SAMPLE SIZE
26
CURSE OF D I M E N S I O NA L I T Y
● Purpose:
– Avoid curse of dimensionality
– Reduce amount of time and memory required by data
mining algorithms
– Allow data to be more easily visualized
– May help to eliminate irrelevant features or reduce
noise
28
F E AT U R E S U B S E T S E L E C T I O N
Remove:
● Redundant features
– duplicate much or all of the information contained in one or
more other attributes
– Example: purchase price of a product and the amount of
sales tax paid
● Irrelevant features
– contain no information that is useful for the data mining
task at hand
– Example: students' ID is often irrelevant to the task of
predicting students' G PA 29
F E AT U R E S U B S E T S E L E C T I O N
● Techniques:
– Brute-force approch:
Try all possible feature subsets as input to data mining algorithm
– Embedded approaches:
Featureselection occurs naturally as part of the data mining
algorithm
– Filter approaches:
Features are selected before data mining algorithm is run
30