
Descriptive Analytics

Overview
• Data – types and formats
• Descriptive Analytics
Data – types and formats
Data Matrix
• If data objects have the same fixed set of numeric attributes, then the data
objects can be thought of as points in a multi-dimensional space, where
each dimension represents a distinct attribute

• Such a data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Projection of x Load   Projection of y Load   Distance   Load   Thickness
10.23                  5.27                   15.22      2.7    1.2
12.65                  6.25                   16.22      2.2    1.1
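As a minimal sketch, the two objects in the table can be stored as an m by n list of rows. The short attribute names are hypothetical labels chosen for this example only.

```python
# Hypothetical short names for the five attributes in the table.
attributes = ["proj_x_load", "proj_y_load", "distance", "load", "thickness"]

# Each row is one data object; each column is one attribute.
data_matrix = [
    [10.23, 5.27, 15.22, 2.7, 1.2],
    [12.65, 6.25, 16.22, 2.2, 1.1],
]

m = len(data_matrix)        # number of objects (rows)
n = len(data_matrix[0])     # number of attributes (columns)

# Reading one attribute means reading one column of the matrix.
col = attributes.index("distance")
distance_column = [row[col] for row in data_matrix]
```

Each row can also be read as a point in a 5-dimensional space, one dimension per attribute.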
Document Data
• Each document becomes a `term' vector,
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the
corresponding term occurs in the document.

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0
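A term vector can be built by counting how often each vocabulary term occurs in a document. The sketch below assumes simple whitespace tokenization, and the sample text is made up for illustration.

```python
from collections import Counter

# The ten terms used in the example (fixed order = vector components).
vocabulary = ["team", "coach", "play", "ball", "score",
              "game", "win", "lost", "timeout", "season"]

def term_vector(text, vocab):
    """Count occurrences of each vocabulary term in the text."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for terms that never occur.
    return [counts[term] for term in vocab]

# Hypothetical document, not taken from the slides.
doc = "team play play score game game win"
vector = term_vector(doc, vocabulary)
```

Most components are zero for a typical document, which is why document data is usually sparse.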
Transaction Data
• A special type of record data, where
• each record (transaction) involves a set of items.
• For example, consider a grocery store. The set of products purchased
by a customer during one shopping trip constitute a transaction,
while the individual products that were purchased are the items.

TID Items
1 Bread, Coke, Milk
2 Butter, Bread
3 Butter, Coke, Donut, Milk
4 Butter, Bread, Donut, Milk
5 Coke, Donut, Milk
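One possible representation of the table above maps each transaction ID to a set of items. The fraction computed at the end is an illustrative summary, not something the slides define.

```python
from collections import Counter

# Transactions from the table: TID -> set of purchased items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Butter", "Bread"},
    3: {"Butter", "Coke", "Donut", "Milk"},
    4: {"Butter", "Bread", "Donut", "Milk"},
    5: {"Coke", "Donut", "Milk"},
}

# Count in how many transactions each item appears.
item_counts = Counter(item for items in transactions.values() for item in items)

# Fraction of transactions that contain each item.
fraction = {item: count / len(transactions) for item, count in item_counts.items()}
```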
Graph Data
• Examples: Generic graph and HTML Links

[Figure: generic graph with numbered nodes]
<a href="papers/papers.html#bbbb">
Data Mining </a>
<li>
<a href="papers/papers.html#aaaa">
Graph Partitioning </a>
<li>
<a href="papers/papers.html#aaaa">
Parallel Solution of Sparse Linear System of Equations </a>
<li>
<a href="papers/papers.html#ffff">
N-Body Computation and Dense Linear System Solvers
Chemical Data
• Benzene Molecule: C6H6
Ordered Data
• Sequences of transactions
[Figure: a sequence of transactions; each element of the sequence is a set of items/events]
Ordered Data
• Genomic sequence data
GGTTCCGCCTTCAGCCCCGCGCC
CGCAGGGCCCGCCCCGCGCCGTC
GAGAAGGGCCCGCCTGGCGGGCG
GGGGGAGGCGGGGCCGCCCGAGC
CCAACCGAGTCCGACCAGGTGCC
CCCTCTGCTCGGCCTAGACCTGA
GCTCATTAGGCGGCAGCGGACAG
GCCAAGTAGAACACGCGAAGCGC
TGGGCTGCCTGCTGCGACCAGGG
Ordered Data
• Spatio-Temporal Data

[Figure: average monthly temperature of land and ocean]
Important Characteristics of Structured Data

• Dimensionality
• Curse of dimensionality
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale

• Distribution
• Centrality and dispersion
Data Objects

• Data sets are made up of data objects.


• A data object represents an entity.
• Examples:
• sales database: customers, store items, sales
• medical database: patients, treatments
• university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.
Attributes
• Attribute (or dimensions, features, variables): a data
field, representing a characteristic or feature of a data
object.
• E.g., customer_ID, name, address
• Types:
• Nominal
• Binary
• Ordinal
• Numeric: quantitative
• Interval-scaled
• Ratio-scaled
Attribute Types
• Nominal or categorical: categories, states, or “names of things”
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Classifications must be mutually exclusive (every element should belong to one category with no
ambiguity).
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV positive)
• Ordinal
• Values have a meaningful order (ranking) but magnitude between successive values is not known.
• Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
• Measured on a scale of equal-sized units
• Values have order
• E.g., temperature in °C or °F, calendar dates
• No true zero-point
• Ratio
• Inherent zero-point
• We can speak of one value as being a multiple of another (e.g., 10 K is twice as high as 5 K).
• E.g., temperature in Kelvin, length, counts, monetary quantities
Discrete vs. Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of
documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete
attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented
using a finite number of digits
• Continuous attributes are typically represented as floating-
point variables
Ordinal Variables

• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
• replace $x_{if}$ by their rank $r_{if} \in \{1, \ldots, M_f\}$
• map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$z_{if} = \dfrac{r_{if} - 1}{M_f - 1}$
• compute the dissimilarity using methods for interval-scaled variables
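The rank-then-rescale steps can be sketched as follows. The function name and the sample values are illustrative, and the sketch assumes at least two ordered levels so that M_f − 1 is nonzero.

```python
def normalize_ordinal(values, ordered_levels):
    """Map ordinal values onto [0, 1] via z = (r - 1) / (M - 1)."""
    M = len(ordered_levels)                      # number of ordered states
    if M < 2:
        raise ValueError("need at least two ordered levels")
    # Rank r in {1, ..., M} according to the given order.
    rank = {level: r for r, level in enumerate(ordered_levels, start=1)}
    return [(rank[v] - 1) / (M - 1) for v in values]

# Hypothetical observations of a Size attribute.
sizes = ["small", "large", "medium", "small"]
z = normalize_ordinal(sizes, ["small", "medium", "large"])
```

After this mapping, the values in `z` can be compared with any distance used for interval-scaled variables.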
Attributes of Mixed Type

• A database may contain all attribute types
• Nominal, symmetric binary, asymmetric binary, numeric, ordinal
• One may use a weighted formula to combine their effects
$d(i, j) = \dfrac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
• f is binary or nominal:
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
• f is numeric: use the normalized distance
• f is ordinal
• Compute ranks $r_{if}$ and $z_{if} = \dfrac{r_{if} - 1}{M_f - 1}$
• Treat $z_{if}$ as interval-scaled
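A rough sketch of the weighted formula follows. The slides do not define the indicator δ or the normalization, so two assumptions are made here: δ is 0 when a value is missing (None) or when both values of an asymmetric binary attribute are 0, and the normalized numeric distance is |x_if − x_jf| divided by the attribute's range. All names are hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Weighted average of per-attribute dissimilarities d_ij^(f)."""
    num = 0.0   # sum of delta * d over attributes
    den = 0.0   # sum of delta over attributes
    for f, kind in enumerate(kinds):
        a, b = x[f], y[f]
        # Assumed indicator: skip missing values (delta = 0).
        if a is None or b is None:
            continue
        # Assumed indicator: skip 0/0 matches on asymmetric binary attributes.
        if kind == "asymmetric_binary" and a == 0 and b == 0:
            continue
        if kind in ("nominal", "binary", "asymmetric_binary"):
            d = 0.0 if a == b else 1.0
        else:
            # Numeric, or ordinal already mapped to z in [0, 1] (range 1.0).
            d = abs(a - b) / ranges[f]
        num += d
        den += 1.0
    return num / den if den else 0.0

# Hypothetical objects: (color, flag, age, size mapped to z).
kinds = ("nominal", "binary", "numeric", "ordinal")
ranges = (None, None, 50.0, 1.0)
d_ij = mixed_dissimilarity(("red", 1, 30.0, 0.0), ("blue", 1, 40.0, 0.5), kinds, ranges)
```

Here the four per-attribute dissimilarities are 1, 0, 0.2 and 0.5, so the combined value is 1.7 / 4.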
Descriptive Analytics
Summary Statistics
• Summary statistics are numbers that summarize
properties of the data

• Summarized properties include frequency, location and spread
• Examples: location - mean; spread - standard deviation
• Most summary statistics can be calculated in a single pass through the data
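To illustrate the single-pass claim, the sketch below computes the count, mean, standard deviation, minimum and maximum in one loop. It uses Welford's running update, which is one common choice and not something the slides prescribe, and it divides by n (population form).

```python
import math

def one_pass_summary(values):
    """Count, mean, population std, min and max from a single pass."""
    n = 0
    mean = 0.0
    m2 = 0.0                 # running sum of squared deviations from the mean
    lo, hi = math.inf, -math.inf
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n            # update the running mean
        m2 += delta * (x - mean)     # uses both the old and the new mean
        lo = min(lo, x)
        hi = max(hi, x)
    if n == 0:
        raise ValueError("need at least one value")
    return {"n": n, "mean": mean, "std": math.sqrt(m2 / n), "min": lo, "max": hi}

# Hypothetical data chosen so the results are round numbers.
summary = one_pass_summary([2, 4, 4, 4, 5, 5, 7, 9])
```

Statistics that depend on sorted order, such as the median, do not fit this pattern without extra work.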
Frequency and Mode
• The frequency of an attribute value is the
percentage of time the value occurs in the
data set
• For example, given the attribute ‘gender’ and a
representative population of people, the gender ‘female’
occurs about 50% of the time.
• The mode of an attribute is the most frequent
attribute value
• The notions of frequency and mode are typically used
with categorical data
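Frequency and mode for a categorical attribute can be sketched with a counter. The sample values are made up for illustration.

```python
from collections import Counter

# Hypothetical observations of a categorical attribute.
values = ["female", "male", "female", "female", "male"]

counts = Counter(values)

# Frequency: fraction of observations taking each value.
frequency = {value: count / len(values) for value, count in counts.items()}

# Mode: the most frequent value (ties resolve to the first one encountered).
mode = counts.most_common(1)[0][0]
```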
Percentiles
• For continuous data, the notion of a percentile is more useful.
• Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value $x_p$ of x such that p% of the observed values of x are less than $x_p$.
• For instance, the 50th percentile is the value $x_{50\%}$ such that 50% of all values of x are less than $x_{50\%}$.


The mean is very sensitive to outliers.
Thus, the median or a trimmed mean is also commonly used.
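The effect of a single outlier on the mean, compared with the median and a trimmed mean, can be seen in a small sketch. The percentile function uses the nearest-rank convention, which is only one of several definitions in use; the helper names and data are hypothetical.

```python
import math
import statistics

def percentile_nearest_rank(values, p):
    """p-th percentile (0 < p <= 100) using the nearest-rank convention."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def trimmed_mean(values, proportion):
    """Mean after dropping the given proportion of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

mean = statistics.mean(data)          # pulled upward by the outlier
median = statistics.median(data)      # unaffected by the outlier's size
trimmed = trimmed_mean(data, 0.10)    # drops the smallest and largest value
p50 = percentile_nearest_rank(data, 50)
```

For this data the mean is 14.5, while the median and the 10% trimmed mean are both 5.5.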
Skewness
• The first thing you usually notice about a distribution’s shape is
whether it has one mode (peak) or more than one.
• If it’s unimodal (has just one peak), like most data sets, the next thing
you notice is whether it’s symmetric or skewed to one side.
• If the bulk of the data is at the left and the right tail is longer, we say
that the distribution is skewed right or positively skewed;
• If the peak is toward the right and the left tail is longer, we say that
the distribution is skewed left or negatively skewed.
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively skewed, and negatively skewed data
[Figure: three distributions: symmetric, positively skewed, negatively skewed]
Interpreting Skewness
• If skewness is positive, the data are positively skewed or skewed right, meaning that the right
tail of the distribution is longer than the left.
• If skewness is negative, the data are negatively skewed or skewed left, meaning that the left tail
is longer.
• If skewness = 0, the data are perfectly symmetrical.
• But a skewness of exactly zero is quite unlikely for real-world data, so how can you interpret
the skewness number?
• There’s no one agreed interpretation, but for what it’s worth Bulmer (1979) — a classic —
suggests this rule of thumb:
• If skewness is less than −1 or greater than +1, the distribution can be called highly skewed.
• If skewness is between −1 and −½ or between +½ and +1, the distribution can be called moderately
skewed.
• If skewness is between −½ and +½, the distribution can be called approximately symmetric.
• With a skewness of −0.1098, the sample data for student heights are approximately symmetric.
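The rule of thumb can be written as a small classifier. The skewness function below uses the moment-based form m3 / m2^(3/2) with division by n, which is an assumption; the slides do not say which estimator produced −0.1098, so the sketch does not attempt to reproduce that figure from raw data.

```python
def skewness(values):
    """Moment-based skewness: third central moment over m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def bulmer_label(g):
    """Classify a skewness value using the rule of thumb quoted above."""
    size = abs(g)
    if size > 1:
        return "highly skewed"
    if size >= 0.5:
        return "moderately skewed"
    return "approximately symmetric"

label_heights = bulmer_label(-0.1098)        # value quoted in the slides
right_tailed = skewness([1, 1, 1, 2, 10])    # hypothetical right-skewed data
```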
Kurtosis
• The other common measure of shape is called the kurtosis.
• As skewness involves the third moment of the distribution, kurtosis involves the fourth moment.
• The outliers in a sample, therefore, have even more effect on the kurtosis than they do on the
skewness.
• Traditionally, kurtosis has been explained in terms of the central peak.
• Higher values indicate a higher, sharper peak; lower values indicate a lower, less distinct peak.
• Leptokurtic: kurtosis > 3
• Mesokurtic: kurtosis = 3
• Platykurtic: kurtosis < 3
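A moment-based kurtosis can be sketched in the same style, using the fourth central moment divided by the squared second moment (division by n is an assumption). The labels follow the usual naming for values above, at, and below 3; the tolerance parameter is a hypothetical convenience.

```python
def kurtosis(values):
    """Moment-based kurtosis: fourth central moment over m2 ** 2."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / m2 ** 2

def kurtosis_label(k, tol=1e-9):
    """Name the shape relative to the reference value 3."""
    if abs(k - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if k > 3 else "platykurtic"

# Hypothetical evenly spread data: flat shape, so kurtosis is below 3.
flat = kurtosis([1, 2, 3, 4, 5])
```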
Measuring the Dispersion of Data

• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 − Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
• Variance: (algebraic, scalable computation)
• Standard deviation s (or σ) is the square root of variance s² (or σ²)
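A five-number summary and the 1.5 × IQR outlier check can be sketched with the standard library. Quartile values depend on the interpolation convention; the "inclusive" method used here is an arbitrary choice, and the data is hypothetical.

```python
import statistics

# Hypothetical data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

# Three cut points splitting the data into four groups: Q1, median, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

five_number = (min(data), q1, q2, q3, max(data))

# Fences: values beyond these are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
```

With this convention Q1 = 3.25 and Q3 = 7.75, so the fences are −3.5 and 14.5 and only the value 100 is flagged.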
Boxplot Analysis
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to
Minimum and Maximum
• Outliers: points beyond a specified outlier
threshold, plotted individually
Measures of Association
Measures of Association…

• Covariance Matrix
• The variance–covariance information for the two attributes $X_1$ and $X_2$ can be summarized in the square 2×2 covariance matrix, given as
$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{21} & \sigma_2^2 \end{pmatrix}$
• Because $\sigma_{12} = \sigma_{21}$, the matrix is symmetric.
• The covariance matrix records the attribute-specific variances on the main diagonal, and the covariance information on the off-diagonal elements.
• The total variance of the two attributes is given as the sum of the diagonal elements
Total variance: $\mathrm{var}(D) = \sigma_1^2 + \sigma_2^2$
Measures of Association…
• Correlation
• The correlation between variables $X_1$ and $X_2$ is the standardized covariance, obtained by normalizing the covariance with the standard deviation of each variable, given as:
$\rho_{12} = \dfrac{\sigma_{12}}{\sigma_1 \sigma_2}$
• The correlation is then the cosine of the angle between the two centered attribute vectors
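The link between covariance, correlation, and the cosine of the angle between centered attribute vectors can be checked numerically. The sketch below divides by n (population form) and uses made-up data; the function names are hypothetical.

```python
import math

def centered(values):
    """Subtract the mean from every value."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]

def covariance_matrix(x1, x2):
    """2x2 covariance matrix (dividing by n) for two attributes."""
    c1, c2 = centered(x1), centered(x2)
    n = len(x1)
    var1 = sum(a * a for a in c1) / n
    var2 = sum(b * b for b in c2) / n
    cov = sum(a * b for a, b in zip(c1, c2)) / n
    return [[var1, cov], [cov, var2]]

# Hypothetical attribute values.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 5, 4, 5]

sigma = covariance_matrix(x1, x2)
total_variance = sigma[0][0] + sigma[1][1]          # sum of the diagonal
rho = sigma[0][1] / math.sqrt(sigma[0][0] * sigma[1][1])

# Cosine of the angle between the centered vectors.
c1, c2 = centered(x1), centered(x2)
cosine = sum(a * b for a, b in zip(c1, c2)) / (
    math.sqrt(sum(a * a for a in c1)) * math.sqrt(sum(b * b for b in c2)))
```

The values of `rho` and `cosine` agree up to floating-point error because the factor 1/n cancels in the ratio.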


Example – Iris Data set
Sample Mean

Median
Because n = 150 is even, the sample median is the average of the values at positions
n/2 = 75 and n/2 + 1 = 76 in sorted order. For sepal length both these
values are 5.8; thus the sample median is 5.8

Mode
The sample mode for sepal length is 5

Range

Variance
$\sigma^2 = [(5.9 - 5.843)^2 + (6.9 - 5.843)^2 + (6.6 - 5.843)^2 + (4.6 - 5.843)^2 + \cdots]/150 = 0.681$

Standard Deviation
$\sigma = \sqrt{0.681} \approx 0.825$
Example…
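• Sample Mean and Covariance
• Sample covariance matrix: $\Sigma = \begin{pmatrix} 0.681 & -0.039 \\ -0.039 & 0.187 \end{pmatrix}$
• The variance for sepal length is $\sigma_1^2 = 0.681$, and that for sepal width is $\sigma_2^2 = 0.187$.
• Covariance
• The covariance between the two attributes is $\sigma_{12} = -0.039$
• Correlation
• $\rho_{12} = \dfrac{\sigma_{12}}{\sigma_1 \sigma_2} = \dfrac{-0.039}{\sqrt{0.681 \times 0.187}} \approx -0.109$
• The angle is close to 90°; that is, the two attribute vectors are almost orthogonal, indicating weak correlation.
• Further, the angle being greater than 90° indicates negative correlation.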
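The correlation and the corresponding angle can be recomputed from the three values quoted above. This is only a numerical check on the rounded figures, not a computation on the full Iris data.

```python
import math

# Values quoted in the example (rounded to three decimals).
var_sepal_length = 0.681
var_sepal_width = 0.187
cov_length_width = -0.039

# Correlation = covariance divided by the product of standard deviations.
rho = cov_length_width / math.sqrt(var_sepal_length * var_sepal_width)

# Angle between the centered attribute vectors, in degrees.
angle_deg = math.degrees(math.acos(rho))
```

The result is a correlation of about −0.109 and an angle a little over 96°, slightly past orthogonal.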
