Professional Documents
Culture Documents
03b Data PDF
03b Data PDF
Data
z Visualization
Nominal The values of a nominal attribute are zip codes, employee mode, entropy,
just different names, i.e., nominal ID numbers, eye color, contingency
attributes provide only enough sex: {male, female} correlation, χ2 test
information to distinguish one object
f
from another.
h (=,( ≠))
Ratio For ratio variables, both differences temperature in Kelvin, geometric mean,
and ratios are meaningful. (*, /) monetary quantities, harmonic mean,
counts, age, mass, percent variation
length, electrical
current
Attribute Allowed Transformations Comments
Level
R ti
Ratio new_value
l = a * old_value
ld l Length
L th can be
b measuredd in
i
meters or feet.
Discrete and continuous attributes
z Discrete attribute
– Has only a finite or countably infinite set of values
– Examples: zip codes, counts, or the set of words in a collection of
documents
– Often represented as integer variables.
– Note: binary attributes are a special case of discrete attributes
z Continuous attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically,
Practically real values can only be measured and represented
using a finite number of digits.
– Continuous attributes are typically represented as floating-point
variables.
z Record
– Data matrix
– Document data
– Transaction data
z G h
Graph
– World Wide Web
– Molecular structures
z Ordered
– Spatial data
– Temporal (time series) data
– Sequential data
– Genetic
G ti sequence data
d t
timeou
seaso
coach
game
score
team
ball
lost
pla
wi
n
y
m
on
e
e
h
ut
Jeff Howbert Introduction to Machine Learning Winter 2012 13
Transaction data
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer,
ee , Coke,
Co e, Diaper,
pe , Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
2 <a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
5 1 <a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
2 <a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
5
z Sequences of transactions
It
Items/Events
/E t
An element of
the sequence
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
z Spatio-temporal data
Average monthly
temperature of
land and ocean
Interpretation /
evaluation
Knowledge
Machine
learning
Transformation Patterns
Preprocessing
Transformed
data
Selection Preprocessed
d t
data
Data Target
data
z Handling
g missing
g values
– Eliminate data objects
– Estimate missing values
– Ignore the missing value during analysis
– Replace with all possible values (weighted by their
probabilities)
z Example:
– Same person with multiple email addresses
z Data cleaning
– Includes process of dealing with duplicate data issues
z Aggregation
z Sampling
z Discretization and binarization
z Attribute transformation
z Feature creation
z Feature
F t selection
l ti
– Choose subset of existing features
z Di
Dimensionality
i lit reduction
d ti
– Create smaller number of new features through linear
or nonlinear combination of existing features
z Purpose
– Data reduction
Reduce the number of attributes or objects
– Change of scale
Cities aggregated into regions, states, countries, etc
– More “stable” data
Aggregated data tends to have less variability
Standard
S a da d de
deviation
a o o of a
average
e age Standard
S a da d de
deviation
a o o of a
average
e age
monthly precipitation yearly precipitation
Definition:
A function that maps the entire set of
values of a given attribute to a new set
of replacement values, such that each
old value can be identified with one of
the new values.
z Simple functions
– Examples of transform functions:
xk log( x ) ex |x|
– Often used to make the data more like some standard distribution,
to better satisfy assumptions of a particular algorithm.
Example: discriminant analysis explicitly models each class distribution as a
multivariate Gaussian
log( x )
z Standardization or normalization
– Usually involves making attribute:
mean =0
standard deviation =1
in MATLAB, use zscore() function
– Important when working in Euclidean space and attributes have
very different numeric scales.
– Also necessaryy to satisfy
y assumptions
p of certain algorithms.
g
Example: principal component analysis (PCA) requires each attribute to be
mean-centered (i.e. have mean subtracted from each value)
z Fourier transform
– Eliminates noise present in time domain
Let’s
Let s use a tool that’s
that s good at those things …
PowerPoint isn
isn’tt it
petal width
Iris virginica. Robert H. Mohlenbrock. USDA
petal length NRCS. 1995. Northeast wetland flora: Field office
guide to plant species
species. Northeast National
– Species is class label Technical Center, Chester, PA. Courtesy of USDA
NRCS Wetland Science Institute.
Jeff Howbert Introduction to Machine Learning Winter 2012 43