
Data Mining:

Concepts and Techniques

— Chapter 2 —
Descriptive Data Summarization

1
Types of Data Sets
 Record
   Relational records
   Data matrix, e.g., numerical matrix, crosstabs
   Document data: text documents represented as term-frequency vectors
   Transaction data
 Graph and network
   World Wide Web
   Social or information networks
   Molecular structures
 Ordered
   Video data: sequence of images
   Temporal data: time-series
   Sequential data: transaction sequences
   Genetic sequence data
 Spatial, image and multimedia
   Spatial data: maps
   Image data
   Video data

Example term-frequency vectors (document data):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Example transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
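
A term-frequency vector of the kind listed under document data can be built by counting how often each vocabulary term occurs in a text. The sketch below is only an illustration: the function name, the five-term vocabulary, and the sample sentence are assumptions, and real pipelines usually add tokenization and stemming.

```python
from collections import Counter


def term_frequency_vector(text, vocabulary):
    """Return the count of each vocabulary term in the text, in vocabulary order."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score"]
document = "team play team score play team play play play"

vector = term_frequency_vector(document, vocabulary)
print(vector)  # one count per vocabulary term
```

Terms that never occur in the document get a count of 0, which is why such vectors are typically sparse.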

2
Data Objects
 Data sets are made up of data objects.
 A data object represents an entity.
 Examples:
   sales database: customers, store items, sales
   medical database: patients, treatments
   university database: students, professors, courses
 Also called samples, examples, instances, data points, objects, tuples.
 Data objects are described by attributes.
 Database rows -> data objects; columns -> attributes.
3
Attributes
 Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.
   E.g., customer_ID, name, address
 Types:
   Nominal
   Binary
   Ordinal
   Numeric: quantitative

4
Attribute Types
 Nominal: categories, states, or “names of things”
   Hair_color = {auburn, black, blond, brown, grey, red, white}
   marital status, occupation, ID numbers, zip codes
 Binary
   Nominal attribute with only 2 states (0 and 1)
   Symmetric binary: both outcomes equally important
     e.g., gender
   Asymmetric binary: outcomes not equally important
     e.g., medical test (positive vs. negative)
     Convention: assign 1 to the most important outcome (e.g., HIV positive)
 Ordinal
   Values have a meaningful order (ranking), but the magnitude between successive values is not known.
   Size = {small, medium, large}, grades, army rankings
 Numeric
   Quantitative: a measurable quantity, represented in integer or real values.
5
Data Quality: Why Preprocess the Data?
 Measures for data quality: a multidimensional view
   Accuracy: correct or wrong, accurate or not
   Completeness: not recorded, unavailable, …
   Consistency: some modified but some not, dangling, …
   Timeliness: is the data updated in a timely way?
   Believability: how far can the data be trusted to be correct?
   Interpretability: how easily can the data be understood?

6
DESCRIPTIVE DATA SUMMARIZATION

7
Descriptive Data Summarization
 It is essential to have an overall picture of your data.
 Data summarization techniques are used to identify the typical properties of the data.

8
Basic Statistical Descriptions of Data
 Motivation
   To better understand the data: central tendency, variation and spread
   Measures of central tendency: mean, median, mode, etc.
   Measures of data dispersion: quartiles, inter-quartile range, percentiles
 Distributive measure
   A measure that can be computed by partitioning the data into smaller subsets, computing the measure for each subset, and then merging the results. E.g., sum, count
 Algebraic measure
   A measure that can be computed by applying an algebraic function to one or more distributive measures. E.g., average = sum / count
 Holistic measure
   A measure that must be computed on the entire data set and cannot be computed by partitioning the data. E.g., median
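
The difference between the three kinds of measure can be checked on a small partitioned data set. This is a minimal sketch with made-up partitions: sum and count merge cleanly across partitions, the average follows from them, and a median computed from per-partition medians generally does not match the true median.

```python
from statistics import median

# Hypothetical data, already split into three partitions.
partitions = [[1, 2, 3], [10, 20], [4]]

# Distributive: compute sum and count per partition, then merge the partial results.
partials = [(sum(p), len(p)) for p in partitions]
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)

# Algebraic: the average is a function of two distributive measures.
average = total_sum / total_count

# Holistic: the median needs the whole data set at once.
all_values = [x for p in partitions for x in p]
true_median = median(all_values)
median_of_medians = median([median(p) for p in partitions])

print(total_sum, total_count, average)
print(true_median, median_of_medians)  # the two values differ for this data
```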

9
Measuring the Central Tendency
 Mean (algebraic measure): $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
   Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
   Trimmed mean: chopping extreme values
 Median
   Middle value if odd number of values, or average of the middle two values otherwise
   Estimated by interpolation (for grouped data):
    $\text{median} = L_1 + \left( \frac{n/2 - (\sum \text{freq})_l}{\text{freq}_{\text{median}}} \right) \times \text{width}$
    where
    $L_1$: lower boundary of the median interval
    $n$: total number of values in the data set
    $(\sum \text{freq})_l$: sum of the frequencies of all intervals lower than the median interval
    $\text{freq}_{\text{median}}$: frequency of the median interval
    width: width of the median interval
 Mode
   Value that occurs most frequently in the data
   Unimodal, bimodal, trimodal
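
The measures on this slide can be tried out with Python's standard `statistics` module. The following is a sketch under stated assumptions: the helper names, the sample values, and the choice to trim a fixed number `k` of values from each end are illustrative, and `grouped_median` assumes equal-style intervals given as (lower boundary, width, frequency) in ascending order.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


def grouped_median(intervals):
    """Interpolated median for grouped data.

    intervals: list of (lower_boundary, width, freq), sorted by lower boundary.
    """
    n = sum(freq for _, _, freq in intervals)
    if n == 0:
        raise ValueError("grouped data has no observations")
    cumulative = 0  # sum of frequencies below the current interval
    for lower, width, freq in intervals:
        if cumulative + freq >= n / 2:
            return lower + (n / 2 - cumulative) / freq * width
        cumulative += freq


# Illustrative values only.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(mean(data), median(data), multimode(data))
print(trimmed_mean(data, 1))
print(weighted_mean([1, 2, 3], [1, 1, 2]))
print(grouped_median([(0, 10, 2), (10, 10, 5), (20, 10, 3)]))
```

The sample data has two modes, so it is bimodal; `multimode` returns every most-frequent value rather than only one.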

10
Symmetric vs. Skewed Data
 Median, mean and mode of symmetric, positively skewed and negatively skewed data

[Figure: three distribution curves: symmetric, positively skewed, negatively skewed]

January 16, 2021 Data Mining: Concepts and Techniques 11


Measuring the Dispersion of Data
 Quartiles, outliers and boxplots
   Quartiles: Q1 (25th percentile), Q3 (75th percentile)
   Inter-quartile range: IQR = Q3 – Q1
   Five-number summary: min, Q1, median, Q3, max
   Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers individually
   Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
 Variance and standard deviation (sample: s, population: σ)
   Variance (algebraic, scalable computation):
    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
    $\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2$
   Standard deviation s (or σ) is the square root of variance s² (or σ²)
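
The variance is called scalable because it can be computed from running totals of the values and their squares in a single pass. The sketch below compares the two-pass definition with the one-pass form of the sample variance; the function names and data are illustrative, and the one-pass form can lose precision through cancellation when the mean is large relative to the spread.

```python
import math


def sample_variance(values):
    """Two-pass definition: average squared deviation from the mean, over n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_variance_one_pass(values):
    """One-pass form using only n, the sum of x and the sum of x squared."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in values:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total ** 2 / n) / (n - 1)


def population_variance(values):
    """Mean of the squares minus the square of the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(x * x for x in values) / n - mu ** 2


# Illustrative values only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)
sigma2 = population_variance(data)
print(s2, sample_variance_one_pass(data))
print(sigma2, math.sqrt(sigma2))
```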

12
13
Properties of Normal Distribution Curve
 The normal (distribution) curve
   From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
   From μ–2σ to μ+2σ: contains about 95% of the measurements
   From μ–3σ to μ+3σ: contains about 99.7% of the measurements
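
These percentages follow from the normal cumulative distribution: the fraction of measurements within k standard deviations of the mean equals erf(k / √2). A short check using the standard library (the function name `coverage` is an assumption made for this sketch):

```python
import math


def coverage(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")
```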

14
Boxplot Analysis
 Five-number summary of a distribution
 Minimum, Q1, Median, Q3, Maximum
 Boxplot
 Data is represented with a box
 The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
 The median is marked by a line within the
box
 Whiskers: two lines outside the box extended
to Minimum and Maximum
 Outliers: points beyond a specified outlier
threshold, plotted individually
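
The quantities a boxplot draws can be computed directly. This sketch assumes the common 1.5 × IQR outlier threshold and the "inclusive" quartile method of `statistics.quantiles`; other quartile conventions give slightly different values, and the function names and data are illustrative.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


def find_outliers(values, factor=1.5):
    """Values more than factor * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [x for x in values if x < low or x > high]


# Illustrative values only.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

summary = five_number_summary(data)
outliers = find_outliers(data)
print(summary)
print(outliers)
```

When outliers are plotted individually, many plotting tools end the whiskers at the most extreme values that are not outliers rather than at the overall minimum and maximum.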

15
Visualization of Data Dispersion: 3-D Boxplots
