— Chapter 2 —
Descriptive Data Summarization
Types of Data Sets
Record
Relational records
Data matrix, e.g., numerical matrix, crosstabs
Document data: text documents: term-frequency vector

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Graph and network
World Wide Web
Social or information networks
Molecular Structures
Ordered
Video data: sequence of images
Temporal data: time-series
Sequential Data: transaction sequences
Genetic sequence data
Spatial, image and multimedia:
Spatial data: maps
Image data:
Video data:
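As a rough illustration of the term-frequency vector mentioned for document data, the sketch below counts how often each vocabulary term occurs in a text. The function name `term_frequency_vector`, the four-term vocabulary, and the sample sentence are all made up for illustration, and the whitespace tokenizer is a simplifying assumption.

```python
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in `text`."""
    # Assumption: lowercase and split on whitespace; real tokenizers
    # also handle punctuation, stemming, and so on.
    counts = Counter(text.lower().split())
    # One entry per vocabulary term, in vocabulary order.
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["team", "coach", "play", "ball"]
document = "The coach told the team to play and the team kept the ball"

vector = term_frequency_vector(document, vocabulary)
print(vector)  # [2, 1, 1, 1]
```

Each document becomes one row of counts, which is how a collection of texts turns into a record-style data matrix.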
Data Objects
Types:
Nominal
Binary
Ordinal
Numeric: quantitative
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
Other examples: marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV
positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between
successive values is not known.
Size = {small, medium, large}, grades, army rankings
Numeric: quantitative; that is, a measurable quantity represented as integer
or real values.
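The attribute types above differ mainly in which operations make sense on their values. The following sketch shows one possible encoding of each; the names `SIZE_ORDER`, `size_rank`, and `encode_test_result` are hypothetical helpers introduced only for this example.

```python
# Nominal: values are just labels, so only equality/membership is meaningful.
hair_color = {"auburn", "black", "blond", "brown", "grey", "red", "white"}

# Ordinal: values have an order, encoded here by position in a list.
# The gap between ranks carries no magnitude information.
SIZE_ORDER = ["small", "medium", "large"]

def size_rank(size):
    """Return the position of `size` in the ordering (0 = smallest)."""
    return SIZE_ORDER.index(size)

# Asymmetric binary: by convention, 1 marks the more important outcome.
def encode_test_result(result):
    return 1 if result == "positive" else 0

print("black" in hair_color)                    # True
print(size_rank("small") < size_rank("large"))  # True
print([encode_test_result(r) for r in ["negative", "positive", "negative"]])
# [0, 1, 0]
```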
Data Quality: Why Preprocess the Data?
DESCRIPTIVE DATA
SUMMARISATION
Descriptive data summarization
It is essential to have an overall picture of your data
Data summarization techniques are used to identify typical
properties of the data
Basic Statistical Descriptions of Data
Motivation
To better understand the data: central tendency, variation and spread
Measures of central tendency: mean, median, mode, etc.
Measures of data dispersion: quartiles, interquartile range (IQR), percentiles
Distributive measure
A measure that can be computed by partitioning the data into smaller
subsets, computing the measure for each subset, and then merging
the results. E.g., sum, count
Algebraic measure
A measure that can be computed by applying an algebraic function to one or
more distributive measures. E.g., average = sum / count
Holistic measure
A measure that must be computed on the entire data set and cannot be
computed by partitioning the data. E.g., median
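The difference between the three kinds of measure can be seen on a small example. The sketch below splits an illustrative list (values and partitioning chosen arbitrarily) into two partitions, merges partial sums and counts, derives the average from them, and shows that combining partition medians does not recover the true median.

```python
from statistics import median

# Illustrative data, split into two arbitrary partitions.
data = [4, 8, 15, 16, 23, 42, 7]
partitions = [data[:3], data[3:]]  # [4, 8, 15] and [16, 23, 42, 7]

# Distributive: compute per partition, then merge the partial results.
partial_sums = [sum(p) for p in partitions]
partial_counts = [len(p) for p in partitions]
total_sum = sum(partial_sums)      # same as sum(data)
total_count = sum(partial_counts)  # same as len(data)

# Algebraic: an algebraic function of distributive measures.
average = total_sum / total_count

# Holistic: the median needs the whole data set at once.
true_median = median(data)
median_of_medians = median([median(p) for p in partitions])

print(total_sum, total_count)          # 115 7
print(true_median, median_of_medians)  # 15 13.75
```

Here the partition medians are 8 and 19.5, and their median (13.75) differs from the median of the full data set (15).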
Measuring the Central Tendency
Mean (algebraic measure): $$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
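A minimal sketch of the mean as "sum divided by count", checked against the standard library's `statistics.mean`. The helper name `arithmetic_mean` and the sample values are illustrative and not taken from the slides.

```python
from statistics import mean

def arithmetic_mean(values):
    """Sum the values and divide by how many there are."""
    n = len(values)
    if n == 0:
        raise ValueError("mean of an empty data set is undefined")
    return sum(values) / n

# Illustrative values (e.g., salaries in thousands).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(arithmetic_mean(salaries))  # 58.0
print(arithmetic_mean(salaries) == mean(salaries))  # True
```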
Symmetric vs. Skewed Data
Median, mean, and mode of symmetric,
positively skewed, and negatively skewed data
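To make the comparison concrete, the sketch below computes mean, median, and mode for three small hand-made data sets. The values are invented for illustration. The pattern they show (mean pulled toward the long tail) is typical but is not guaranteed for every skewed distribution.

```python
from statistics import mean, median, mode

def summarize(values):
    """Return (mean, median, mode) for a list of numbers."""
    return mean(values), median(values), mode(values)

# Hand-made examples, for illustration only.
symmetric = [1, 2, 3, 3, 3, 4, 5]
positively_skewed = [1, 2, 2, 2, 3, 4, 14]      # long tail to the right
negatively_skewed = [1, 11, 12, 13, 13, 13, 14]  # long tail to the left

print(summarize(symmetric))          # (3, 3, 3)
print(summarize(positively_skewed))  # (4, 2, 2)
print(summarize(negatively_skewed))  # (11, 13, 13)
```

In the symmetric set all three measures coincide; in the positively skewed set the mean exceeds the median, and in the negatively skewed set it falls below it.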
Population variance: $$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$$
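The population variance can be computed either from its definition (mean squared deviation from the mean) or from the equivalent form "mean of the squares minus the square of the mean". The sketch below implements both and compares them with `statistics.pvariance`; the function names and the data are illustrative assumptions.

```python
from statistics import pvariance

def variance_by_definition(values):
    """Mean of squared deviations from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def variance_by_shortcut(values):
    """Mean of squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2

# Illustrative data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(variance_by_definition(data))  # 4.0
print(variance_by_shortcut(data))    # 4.0
print(pvariance(data))
```

The shortcut form needs only the running sum and sum of squares, but with floating-point data of large magnitude it can lose precision compared with the definition.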
Properties of Normal Distribution Curve
Boxplot Analysis
Five-number summary of a distribution
Minimum, Q1, Median, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
The median is marked by a line within the
box
Whiskers: two lines outside the box extended
to Minimum and Maximum
Outliers: points beyond a specified outlier
threshold, plotted individually
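The five-number summary and an outlier check can be sketched with the standard library. Two assumptions are made here that the slide does not specify: quartiles use the "inclusive" interpolation method of `statistics.quantiles` (other conventions give slightly different Q1 and Q3), and the outlier threshold is the common 1.5 × IQR rule. The function names and data are illustrative.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # Assumption: 'inclusive' quartile interpolation; conventions vary.
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

def outliers(values, k=1.5):
    """Values farther than k * IQR below Q1 or above Q3."""
    # Assumption: k = 1.5 is a common convention, not mandated by the slide.
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]

# Illustrative data with one extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]

print(five_number_summary(data))  # (1, 3.25, 5.5, 7.75, 40)
print(outliers(data))             # [40]
```

Here the IQR is 4.5, so the thresholds are -3.5 and 14.5, and only 40 would be plotted individually.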
Visualization of Data Dispersion: 3-D Boxplots