Overview
• Data – types and formats
• Descriptive Analytics
Data – types and formats
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data
objects can be thought of as points in a multi-dimensional space, where
each dimension represents a distinct attribute
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
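To make the "points in a multi-dimensional space" view concrete, here is a minimal sketch that treats each of the three document rows above as a point with 10 coordinates and measures how far apart the points are. The use of Euclidean distance and the helper name `euclidean_distance` are illustrative assumptions, not something the slides prescribe.

```python
import math

# Term counts copied from the three example documents above.
# Each list is one data object; each position is one attribute (term).
doc1 = [3, 0, 5, 0, 2, 6, 0, 2, 0, 2]
doc2 = [0, 7, 0, 2, 1, 0, 0, 3, 0, 0]
doc3 = [0, 1, 0, 0, 1, 2, 2, 0, 3, 0]


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same dimensions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


d12 = euclidean_distance(doc1, doc2)
d13 = euclidean_distance(doc1, doc3)
d23 = euclidean_distance(doc2, doc3)

print(f"d(doc1, doc2) = {d12:.3f}")
print(f"d(doc1, doc3) = {d13:.3f}")
print(f"d(doc2, doc3) = {d23:.3f}")
```

Any other distance or similarity measure could be substituted; the point is only that a fixed set of numeric attributes lets every object be handled as a vector.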
Transaction Data
• A special type of record data, where
• each record (transaction) involves a set of items.
• For example, consider a grocery store. The set of products purchased
by a customer during one shopping trip constitutes a transaction,
while the individual products that were purchased are the items.
TID Items
1 Bread, Coke, Milk
2 Butter, Bread
3 Butter, Coke, Donut, Milk
4 Butter, Bread, Donut, Milk
5 Coke, Donut, Milk
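A minimal sketch of how the transaction table above could be held in memory: each transaction ID maps to a set of items, which makes "does this transaction contain these items?" a subset test. The helper name `count_containing` is hypothetical and chosen only for illustration.

```python
from collections import Counter

# The five transactions from the table above, keyed by TID.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Butter", "Bread"},
    3: {"Butter", "Coke", "Donut", "Milk"},
    4: {"Butter", "Bread", "Donut", "Milk"},
    5: {"Coke", "Donut", "Milk"},
}

# How many transactions each individual item appears in.
item_counts = Counter(item for items in transactions.values() for item in items)


def count_containing(itemset):
    """Number of transactions that contain every item in `itemset`."""
    wanted = set(itemset)
    return sum(1 for items in transactions.values() if wanted <= items)


print(item_counts["Milk"])
print(count_containing({"Coke", "Milk"}))
```

Sets are used because a transaction records which items were bought, not their order.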
Graph Data
• Examples: Generic graph and HTML Links
[Figure: generic graph with numbered nodes]
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
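One way to see HTML links as graph data is to treat the page as a node and every `href` as a directed edge to another node. The sketch below uses the standard library's `html.parser` on the snippet above; the node name `"this_page"` and the class name `LinkCollector` are hypothetical, introduced only for this example.

```python
from html.parser import HTMLParser

# The HTML fragment from the slide, copied as-is (the last link is unclosed).
PAGE = """<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
"""


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.targets = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.targets.append(value)


collector = LinkCollector()
collector.feed(PAGE)
collector.close()

# Directed edges: (source node, target node). "this_page" is a placeholder name.
edges = [("this_page", target) for target in collector.targets]
for edge in edges:
    print(edge)
```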
Chemical Data
• Benzene Molecule: C6H6
Ordered Data
• Sequences of transactions
[Figure: a sequence of transactions, with labels "Items/Events" and "An element of the sequence"]
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
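As a small illustration of handling sequence data, the sketch below counts the bases in the first row of the genomic sequence above. Only that one row is used; the variable names are arbitrary.

```python
from collections import Counter

# First row of the genomic sequence shown above.
first_row = "GGTTCCGCCTTCAGCCCCGCGCC"

# Frequency of each base, ignoring position.
base_counts = Counter(first_row)

# Position still matters for ordered data: the same bases in a different
# order would be a different sequence, so we also keep the raw string.
gc_fraction = (base_counts["G"] + base_counts["C"]) / len(first_row)

print(dict(base_counts))
print(f"GC fraction: {gc_fraction:.3f}")
```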
Ordered Data
• Spatio-Temporal Data
[Figure: average monthly temperature of land and ocean]
Important Characteristics of Structured Data
• Dimensionality
• Curse of dimensionality
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale
• Distribution
• Centrality and dispersion
Data Objects
• For instance, the 50th percentile is the value $x_{50\%}$ such
that 50% of all values of x are less than $x_{50\%}$.
The mean is very sensitive to outliers.
Thus, the median or a trimmed mean is also commonly used.
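The effect of a single outlier on the mean, the median, and a trimmed mean can be checked with a short sketch. The data values are made up for illustration, and `trimmed_mean` with its `trim_fraction` parameter is a hypothetical helper that drops the same share of values from each end before averaging.

```python
import statistics


def trimmed_mean(values, trim_fraction):
    """Mean after dropping `trim_fraction` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # values to drop per end
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)


# Made-up sample, and the same sample with its largest value replaced
# by an extreme outlier.
clean = [4, 5, 5, 6, 6, 7, 7, 8, 8, 9]
with_outlier = clean[:-1] + [900]

mean_clean = statistics.mean(clean)
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)
trimmed_outlier = trimmed_mean(with_outlier, 0.10)

print(mean_clean, mean_outlier)          # the mean moves a lot
print(median_outlier, trimmed_outlier)   # median and trimmed mean barely move
```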
Skewness
• The first thing you usually notice about a distribution’s shape is
whether it has one mode (peak) or more than one.
• If it’s unimodal (has just one peak), like most data sets, the next thing
you notice is whether it’s symmetric or skewed to one side.
• If the bulk of the data is at the left and the right tail is longer, we say
that the distribution is skewed right or positively skewed;
• If the peak is toward the right and the left tail is longer, we say that
the distribution is skewed left or negatively skewed.
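The direction of skew can also be read off numerically. The sketch below uses the moment coefficient of skewness (third central moment divided by the 1.5 power of the second); this particular formula is an assumption, since the slides describe skewness only qualitatively, and the data sets are made up.

```python
import statistics


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 6, 12]   # long right tail
left_skewed = [-v for v in right_skewed]          # mirror image: long left tail

print(skewness(symmetric))      # zero for a symmetric sample
print(skewness(right_skewed))   # positive
print(skewness(left_skewed))    # negative

# In the right-skewed sample the long tail pulls the mean above the median.
print(statistics.mean(right_skewed), statistics.median(right_skewed))
```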
Symmetric vs. Skewed Data
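• Median, mean and mode of symmetric, positively skewed, and negatively skewed data
[Figure: distributions compared by kurtosis, including mesokurtic (kurtosis = 3) and platykurtic (kurtosis < 3)]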
Measuring the Dispersion of Data
• Covariance Matrix
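• The variance–covariance information for the two attributes $X_1$ and $X_2$ can be
summarized in the square 2×2 covariance matrix, given as
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{pmatrix}$$
where $\sigma_1^2$ and $\sigma_2^2$ are the variances of $X_1$ and $X_2$, and $\sigma_{12}$ is their covariance.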
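A minimal sketch of building the 2×2 covariance matrix for two attributes. It divides by n, matching the variance example later in these slides; the data values and the function name `covariance_matrix_2x2` are hypothetical.

```python
def covariance_matrix_2x2(x1, x2):
    """Covariance matrix of two equally long attribute columns (divides by n)."""
    n = len(x1)
    mu1 = sum(x1) / n
    mu2 = sum(x2) / n
    var1 = sum((a - mu1) ** 2 for a in x1) / n
    var2 = sum((b - mu2) ** 2 for b in x2) / n
    cov12 = sum((a - mu1) * (b - mu2) for a, b in zip(x1, x2)) / n
    # Variances on the diagonal, the (symmetric) covariance off the diagonal.
    return [[var1, cov12],
            [cov12, var2]]


# Made-up attribute values for illustration.
x1 = [1, 2, 3, 4]
x2 = [2, 4, 6, 8]

sigma = covariance_matrix_2x2(x1, x2)
for row in sigma:
    print(row)
```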
Median
Because n = 150 is even, the sample median is the average of the values at
positions n/2 = 75 and n/2 + 1 = 76 in sorted order. For sepal length both of
these values are 5.8; thus the sample median is 5.8.
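The even and odd cases of the median rule can be sketched as follows. The six values are made up (they are not the 150 sepal-length measurements), and `sample_median` is a hypothetical helper.

```python
def sample_median(values):
    """Middle value for odd n; average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # 1-based positions n/2 and n/2 + 1 are 0-based indices mid - 1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2


# Made-up sample with an even number of values.
sample = [5.1, 4.9, 5.8, 5.8, 6.3, 7.0]
print(sample_median(sample))

# Odd-length case: the single middle value is returned as-is.
print(sample_median([3, 1, 2]))
```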
Mode
The sample mode for sepal length is 5.
Range
Variance
$$\sigma^2 = \frac{(5.9 - 5.843)^2 + (6.9 - 5.843)^2 + (6.6 - 5.843)^2 + (4.6 - 5.843)^2 + \cdots}{150} = 0.681$$
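The same computation, sketched in code: subtract the mean, square, sum, and divide by n. Only the four values visible in the formula above are used, so the result will not equal 0.681, which comes from all 150 values; `variance_n` is a hypothetical helper name.

```python
import math


def variance_n(values):
    """Variance with divisor n, as in the formula above."""
    n = len(values)
    mu = sum(values) / n
    return sum((v - mu) ** 2 for v in values) / n


# Only the four sepal-length values that are written out in the formula;
# the remaining values are elided in the slide and are NOT reproduced here.
visible_values = [5.9, 6.9, 6.6, 4.6]

var = variance_n(visible_values)
std = math.sqrt(var)  # the standard deviation is the square root of the variance

print(f"variance = {var:.3f}")
print(f"standard deviation = {std:.3f}")
```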
Standard Deviation
Example…
• Sample Mean and Covariance
The angle is close to 90°, that is, the two attribute vectors are almost orthogonal, indicating weak correlation.
Further, the angle being greater than 90° indicates negative correlation.
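The link between angle and correlation can be sketched as follows, assuming the angle is measured between the mean-centered attribute vectors, in which case its cosine equals the correlation. The data values are made up, and `centered` and `angle_degrees` are hypothetical helpers.

```python
import math


def centered(values):
    """Subtract the mean from every value."""
    mu = sum(values) / len(values)
    return [v - mu for v in values]


def angle_degrees(x1, x2):
    """Angle between the mean-centered versions of two attribute vectors."""
    z1, z2 = centered(x1), centered(x2)
    dot = sum(a * b for a, b in zip(z1, z2))
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    cos_theta = dot / (norm1 * norm2)  # equals the correlation under this assumption
    return math.degrees(math.acos(cos_theta)), cos_theta


# Made-up attribute values with a weak negative relationship.
x1 = [1, 2, 3, 4, 5]
x2 = [3, 2, 4, 1, 3]

theta, corr = angle_degrees(x1, x2)
print(f"correlation = {corr:.3f}")
print(f"angle = {theta:.1f} degrees")
```

An angle slightly above 90° corresponds to a small negative cosine, which is why it reads as weak negative correlation.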