Source: Books by Tan, Steinbach, Kumar; Han, Kamber & Pei; Evans; Dinesh Kumar + Experiential Knowledge
Big Data
Big data refers to data generated in high volume, at high velocity,
and in a large variety of forms.
Nominal
– Description: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ≠)
– Examples: zip codes, employee ID numbers, eye color, sex: {male, female}
– Operations: mode, entropy, contingency correlation, χ² test
Ratio
– Description: For ratio variables, both differences and ratios are meaningful. (*, /)
– Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current
– Operations: geometric mean, harmonic mean, percent variation
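The operations listed for each attribute type can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the sample values (`eye_color`, `mass_kg`) are made up, and the helper names are hypothetical. Mode and entropy need only equality tests, so they suit nominal data; the geometric mean relies on meaningful ratios, so it suits ratio data with positive values.

```python
import math
from collections import Counter

def mode(values):
    """Most frequent value; needs only equality tests, so it works for nominal data."""
    return Counter(values).most_common(1)[0][0]

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of the values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def geometric_mean(values):
    """Geometric mean; meaningful only for ratio data with strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

eye_color = ["brown", "blue", "brown", "green"]  # nominal, hypothetical sample
mass_kg = [2.0, 8.0]                             # ratio, hypothetical sample

print(mode(eye_color))
print(entropy(eye_color))
print(geometric_mean(mass_kg))
```

Taking an arithmetic or geometric mean of `eye_color` would be meaningless, which is the practical reason for tracking attribute types.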
Example: Classifying Data Elements
[Figure: table of data elements with columns labeled Interval, Ratio, Categorical, and Ordinal]
Properties of Attribute Values (Cont)
Continuous Attribute
– Has real numbers as attribute values
– Examples: temperature, height, or weight.
– Practically, real values can only be measured and represented using a
finite number of digits (limited precision).
– Continuous attributes are typically represented as floating-point variables.
Typically, nominal and ordinal attributes are discrete, while interval and ratio
attributes are continuous. (However, count attributes, which are discrete,
are also ratio attributes.)
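Both points above can be checked directly. The sketch below assumes standard IEEE 754 double-precision floats (Python's `float`); the count values are hypothetical.

```python
import math

# Limited precision: 0.1 and 0.2 have no exact binary representation,
# so their floating-point sum is not exactly 0.3.
total = 0.1 + 0.2
exactly_equal = (total == 0.3)           # False
close_enough = math.isclose(total, 0.3)  # True

# Counts are discrete, yet ratios between them are meaningful.
visits_a, visits_b = 6, 3                # hypothetical counts
ratio = visits_a / visits_b              # 2.0: "twice as many visits"

print(exactly_equal, close_enough, ratio)
```

When comparing continuous attribute values, a tolerance-based test such as `math.isclose` is usually safer than `==`.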
Asymmetric Attributes
Graph
– World Wide Web
– Molecular Structures
Ordered
– Spatial Data
– Temporal Data
– Sequential Data
– Genetic Sequence Data
Important Characteristics of Structured Data
– Dimensionality
Number of attributes that the objects in the data set possess.
Curse of Dimensionality
Dimensionality reduction
– Sparsity
For some data sets, such as those with asymmetric features, most
attributes of an object have 0 values. In practical terms this is an
advantage, because only the non-zero values need to be stored and processed.
– Resolution
Properties of data differ at different resolutions (e.g., the Earth's
surface looks uneven at a fine resolution and smooth at a coarse one)
Patterns depend on the scale of resolution (e.g., atmospheric
pressure measured hourly reveals the movement of storms)
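The practical benefit of sparsity can be sketched as follows. This is an illustrative sketch only: the two dense rows are made up, and the helper names `to_sparse` and `sparse_dot` are hypothetical. Only non-zero entries are stored, and a dot product touches only the stored entries.

```python
def to_sparse(row):
    """Keep only non-zero entries as {column_index: value}."""
    return {i: v for i, v in enumerate(row) if v != 0}

def sparse_dot(a, b):
    """Dot product of two sparse rows; iterates over the smaller one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)

dense_a = [0, 0, 3, 0, 0, 0, 1, 0]  # hypothetical data
dense_b = [0, 2, 4, 0, 0, 0, 0, 0]  # hypothetical data

sparse_a = to_sparse(dense_a)
sparse_b = to_sparse(dense_b)

print(sparse_a, sparse_b, sparse_dot(sparse_a, sparse_b))
```

Each row here stores two entries instead of eight; for real data sets with thousands of mostly-zero attributes the saving is proportionally larger.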
Record Data
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
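A table like the one above, with one row per document and one column per term, can be built by counting term occurrences. The sketch below is a simplified assumption of how that might be done: the two sample texts and the three-term vocabulary are invented, and tokenization is just lowercasing and whitespace splitting.

```python
from collections import Counter

def document_term_matrix(documents, vocabulary):
    """Return one row of term counts per document, in vocabulary order."""
    rows = []
    for text in documents:
        counts = Counter(text.lower().split())
        # Counter returns 0 for terms that do not occur in the document.
        rows.append([counts[term] for term in vocabulary])
    return rows

vocabulary = ["team", "coach", "play"]        # hypothetical vocabulary
documents = [
    "Team play beats solo play",              # hypothetical document 1
    "The coach praised the team",             # hypothetical document 2
]

matrix = document_term_matrix(documents, vocabulary)
for row in matrix:
    print(row)
```

Real pipelines typically add punctuation handling, stop-word removal, and stemming before counting.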
Graph Data
[Figure: example graph with numeric labels]
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers </a>
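Linked web pages form a graph: pages are nodes and hyperlinks are directed edges. The following sketch uses Python's standard `html.parser` to collect the `href` targets from markup like the fragment above and store them as an adjacency list. The page name `index.html` is a hypothetical label for the page containing the links, and `LinkCollector` and `outgoing_links` are illustrative names.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def outgoing_links(html):
    """Return the link targets found in an HTML fragment, in order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links

page = """<li><a href="papers/papers.html#bbbb">Data Mining</a>
<li><a href="papers/papers.html#aaaa">Graph Partitioning</a>"""

# Adjacency list: source page -> pages it links to.
graph = {"index.html": outgoing_links(page)}  # "index.html" is hypothetical
print(graph)
```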
Chemical Data
Genetic Sequence Data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
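Sequence data has no natural attribute columns, so a common first step is to derive counts from it. As a small sketch (the helper name `kmers` is illustrative), the code below takes the first line of the sequence above and computes nucleotide frequencies and overlapping substrings of length k.

```python
from collections import Counter

def kmers(sequence, k):
    """Return all overlapping substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

sequence = "GGTTCCGCCTTCAGCCCCGCGCC"  # first line of the sequence above

base_counts = Counter(sequence)
gc_fraction = (base_counts["G"] + base_counts["C"]) / len(sequence)
trimers = kmers(sequence, 3)

print(dict(base_counts))
print(round(gc_fraction, 3))
print(trimers[:3])
```

Counting the k-mers with `Counter(trimers)` would turn the sequence into a fixed-length record that standard algorithms can consume.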
Time Series Data
A special type of sequential data: a series of measurements taken over time.
– Average monthly temperature of Delhi from 1980 to 2017
– Daily prices of various stocks
[Figure: average monthly temperature of land and ocean, by latitude and longitude]
Spatial autocorrelation: measurements taken at nearby locations tend to be similar.
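The temporal counterpart of this idea, that measurements close in time tend to be similar, can be quantified with the sample autocorrelation at a given lag. The sketch below is an illustration under stated assumptions: the series is synthetic (a sine wave with a 12-step period standing in for a seasonal monthly signal, not real temperature data), and `autocorrelation` is an illustrative helper that uses the common estimator normalized by the full-series variance.

```python
import math

def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag (lag 0 returns 1.0)."""
    n = len(series)
    mean = sum(series) / n
    variance_sum = sum((x - mean) ** 2 for x in series)
    covariance_sum = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance_sum / variance_sum

# Synthetic seasonal signal: two full 12-step cycles (not real data).
series = [20 + 10 * math.sin(2 * math.pi * month / 12) for month in range(24)]

lag_6 = autocorrelation(series, 6)    # half a cycle apart: values move oppositely
lag_12 = autocorrelation(series, 12)  # a full cycle apart: values move together

print(lag_6, lag_12)
```

A positive value at lag 12 and a negative value at lag 6 reflect the yearly cycle; spatial autocorrelation measures apply the same idea using distance between locations instead of time lag.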
Handling Non Record Data
Most data analysis (or data mining) algorithms are
designed for record data or its variations (e.g., transaction
data and data matrices).