Data Mining and

Business Intelligence
3rd Lecture
Data preprocessing

Iraklis Varlamis
Data preprocessing

• The need for data preprocessing


• Data cleaning
• Data integration and transformations
• Data reduction
• Data types
• Discretisation and values hierarchy
• Similarity/Distance metrics

2
The need for preprocessing

• Bad data 🡺 bad knowledge is extracted
– e.g. duplicates or inconsistent data may lead to wrong
statistics
– Data consistency must be guaranteed for the whole
collection process
• The data we collect are:
– incomplete: missing values from specific attributes,
missing attributes, aggregate instead of analytic data
– noisy: wrong values, out of limits, e.g. age="240”
– inconsistent: use different values/codings/names for the
same things
• e.g., duplicate ids with different values

3
Reasons for “bad data”

• Incomplete data: A missing value for an attribute of an
instance
– The value has been accidentally omitted during data entry
– The value was not considered important during data collection
– A hardware/software failure in the data collection mechanism
• Noisy data
– Noise at collection
– Noise at data entry
– Noise at transmission
• Inconsistent data
– Different data sources use different representation schemas (e.g.,
Age=“42”, Birthday=“03/07/1997”)
– Data representation changed over time (e.g., rating with values
“A,B,C”, changed to a 5-scale rating “1-5”)
– Functional dependencies among data have been violated (e.g.
entering a second person with the same id number violates
IDNumber🡺 Name, Surname)

4
Data quality

• A multidimensional concept
• May refer to … data
– Accurate
– Complete
– Consistent
– Up to date
– Trustworthy
– Easy to understand
– Accessible
– Added value

5
Basic operations

• Data cleaning
– Fill missing values, remove noise, find and remove
extreme/wrong values
• Data integration
– Integrate multiple databases, use data from tables or files
• Data transformation
– Value normalization, aggregation etc
• Data reduction
– Reduce the dataset size without decreasing the overall
performance
• Discretisation
– Part of data reduction, important for numeric data

6
Data preprocessing

• The need for data preprocessing


• Data cleaning
• Data integration and transformations
• Data reduction
• Data types
• Discretisation and values hierarchy
• Similarity/Distance metrics

7
Data cleaning tasks (1/2)

• We have missing values because


– Data are not always available
• e.g. customers may not reveal their address
– Data are not always recorded
• e.g. due to a sensor malfunction
– Data have been deleted
• e.g. because we decided that they were not necessary or
obsolete
• We treat missing values
– Either by assuming a value that can replace them: e.g.
the mean, median or a default value, decided for the
whole dataset or part of it (see the sketch below)
– Or by ignoring the whole record

8
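A minimal pandas sketch of the two treatments above; the DataFrame and its column names ("age", "income") are hypothetical examples, not data from the lecture.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 41, 37],
                   "income": [1200, 1500, np.nan, 900]})

# Option 1: replace missing values with a value decided for the whole column
df_mean = df.fillna(df.mean())        # column mean
df_median = df.fillna(df.median())    # column median
df_default = df.fillna({"age": 30})   # a default value per column

# Option 2: ignore (drop) every record that contains a missing value
df_dropped = df.dropna()
```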
Data cleaning tasks (2/2)

• We have outliers or noise because


– Errors or random variance in a measured variable, e.g.
transmission errors, recording device limitations
• We treat noise with:
• Binning: values are sorted and partitioned into bins of the
same size; the values of each bin are then smoothed using the
bin average or the bin boundary values
• We treat outliers with:
• Regression: we fit the data to a known function and correct the
values that deviate strongly from it
• Clustering: we group similar values together and remove the
values that remain outside the groups
• Manually: by plotting the values and inspecting them

9
Data binning

• Equi-width split:
– Select N sub-ranges of the same width
– The width of each sub-range is W = (max – min)/N
– Quality is affected by outliers; the final result is biased
towards them
• Equi-depth split:
– Select N sub-ranges that contain the same number of
instances
– Better distribution of the data
– Hard to apply to nominal/ordinal data (categorical, non-continuous
values)
(a code sketch of both splits follows this slide)

10
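A minimal sketch contrasting the two splits with pandas; the values reuse the temperature example of the next slide and the choice of N = 3 bins is an assumption.

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equi-width: N sub-ranges of width (max - min) / N
equi_width = pd.cut(values, bins=3)

# Equi-depth (equal frequency): N sub-ranges with the same number of instances
equi_depth = pd.qcut(values, q=3)

print(equi_width.value_counts())   # counts per bin differ
print(equi_depth.value_counts())   # four instances in each bin
```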
Smoothing

⚫ We record the daily temperatures and sort them:


4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
⚫ We then split into equi-depth bins of four values each (see also the sketch after this slide):
◦ Bin 1: 4, 8, 9, 15
◦ Bin 2: 21, 21, 24, 25
◦ Bin 3: 26, 28, 29, 34
⚫ .. and smooth
⚫ Using bin’s average:
◦ Bin 1: 9, 9, 9, 9
◦ Bin 2: 23, 23, 23, 23
◦ Bin 3: 29, 29, 29, 29
⚫ Using bin’s boundary values
◦ Bin 1: 4, 4, 4, 15
◦ Bin 2: 21, 21, 25, 25
◦ Bin 3: 26, 26, 26, 34

11
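A minimal plain-Python sketch that reproduces the smoothing example above: equi-depth bins of four values, then smoothing by bin means and by bin boundaries.

```python
temps = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = [temps[i:i + 4] for i in range(0, len(temps), 4)]   # equi-depth, 4 per bin

# Smoothing by bin means: every value becomes the (rounded) mean of its bin
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
# -> [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

# Smoothing by bin boundaries: every value moves to its closest bin boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
# -> [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```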
Regression – Curve fitting

[Figure: a curve fitted to the data – points far from the curve are outliers, small deviations around it are noise]

12
Clustering

13
Data preprocessing

• The need for data preprocessing


• Data cleaning
• Data integration and transformations
• Data reduction
• Data types
• Discretisation and values hierarchy
• Similarity/Distance metrics

14
Data integration

• May result in
– inconsistent data
– redundant data
• Because of
– Different metrics
– Different representation models
– Different ids
• Example:
– Year_of_birth= 1980
– Age = 30
which is correct?
15
Transformations

• Data aggregation
• Data discretisation and generalization: from
values to categories and hierarchies
• Range conversion: map continuous values to a
different range
– Using x^k, log(x), e^x, |x|, etc.
• Normalization: scale all values to the same range
usually 0..1 or -1..1
• Create additional features by
processing/combining existing features (e.g.
create age from date_of_birth, create
day_of_week from date – see the sketch below)

16
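A minimal pandas sketch of the last bullet (deriving age and day_of_week from dates); the column names and the reference date are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"date_of_birth": ["1980-05-17", "1995-11-02"],
                   "purchase_date": ["2024-03-08", "2024-03-10"]})
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"])
df["purchase_date"] = pd.to_datetime(df["purchase_date"])

# approximate age in whole years, relative to an assumed reference date
ref = pd.Timestamp("2024-12-31")
df["age"] = (ref - df["date_of_birth"]).dt.days // 365

# day of the week of a date (0 = Monday, ..., 6 = Sunday)
df["day_of_week"] = df["purchase_date"].dt.dayofweek
```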
Normalization

• Min-max normalization (maps values to 0…1):
    v' = (v – min_A) / (max_A – min_A)
⚫ z-score normalization (mean 0, standard deviation 1):
    v' = (v – mean_A) / σ_A
⚫ Decimal scaling (maps values to -1…1):
    v' = v / 10^j
  where j is the smallest integer so that Max(|v'|) < 1
  e.g. if oldMin = -56775 and oldMax = 4646 then j must be 5
(a code sketch of the three normalizations follows this slide)
17
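A minimal numpy sketch of the three normalizations, applied to a single attribute; the sample values are illustrative and only reuse the min/max of the decimal-scaling example.

```python
import numpy as np

v = np.array([-56775.0, -3200.0, 0.0, 1250.0, 4646.0])

# Min-max normalization to the range 0..1
minmax = (v - v.min()) / (v.max() - v.min())

# z-score normalization: mean 0, standard deviation 1
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the smallest integer so max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max())))   # here j = 5
decimal = v / 10 ** j                         # values end up in -1..1
```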
Data preprocessing

• The need for data preprocessing


• Data cleaning
• Data integration and transformations
• Data reduction
• Data types
• Discretisation and values hierarchy
• Similarity/Distance metrics

18
Data reduction

• Data selection
– Compress
– Fit to functions
– Sample
• Dimension reduction
– Select the most representative/useful
dimensions (features)
– Create new composite dimensions that merge
multiple old dimensions
– Avoid repetitions (or highly correlated
dimensions)

19
Data compression

• String compression
– Usually lossless
– It is difficult to process without decompression
• Image/sound compression
– Usually lossy, with gradual/progressive quality refinement
– A part of the initial information is enough to rebuild an
approximation of the whole
– Wavelet transformations can be employed
• Time-frequency analysis

20
Data reduction

• Parametric methods
– They assume that the data fit a model; they compute the
parameters of the model and store them instead of the
data
– Log-linear models: find associations between features
which are significantly different from zero (important
subspaces), then replace the initial vector
representation (in the original m-dimensional space) with
a product of probabilities over these sub-spaces
– Regression
• Non-parametric methods
– They do not assume a model
– Histograms, clusters, sampling

21
Histograms

• A popular data reduction technique
• Data are split into ranges and only the average or
sum is kept for each range
• An optimal histogram can be constructed for a
single dimension

22
Sampling

• A linear-complexity technique that selects a representative
subset of the original data (in a single scan)
• Random sampling may not be effective if there is skew in
the data
• Stratified sampling
– We first compute the ratio of each class in the original dataset
and keep the same ratio in the selected sample (see the sketch below)

[Figure: raw data vs. a cluster/stratified sample]

23
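A minimal sketch of stratified sampling with pandas: the same fraction is drawn from every class, so the class ratios of the original dataset are preserved. The label column name and the 10% fraction are assumptions.

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, label: str, frac: float,
                      seed: int = 0) -> pd.DataFrame:
    # draw the same fraction of rows from every class separately
    return df.groupby(label).sample(frac=frac, random_state=seed)

# usage (hypothetical DataFrame with a "class" column):
# sample = stratified_sample(data, label="class", frac=0.10)
```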
Dimensionality reduction

• Feature selection
– We select the minimum set of features that gives the same
class distribution as the original set of features (or one
as similar as possible)
– We reduce the dimensionality of the resulting models and
make them easier to understand
• Heuristics
– Step-wise forward selection: we select the best attribute
and keep adding one attribute at a time (see the sketch below)
– Step-wise backward elimination: the reverse process
– Combination of selection and elimination
– Decision tree induction

24
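A minimal sketch of step-wise forward selection, assuming scikit-learn is available; the classifier (a decision tree) and the 5-fold accuracy scoring are assumptions, not the lecture's prescribed setup.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_selection(X, y, n_features):
    """Greedy step-wise forward selection on a numpy feature matrix X."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        # score every candidate feature added to the current selection
        scores = {f: cross_val_score(DecisionTreeClassifier(random_state=0),
                                     X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)   # keep the best-scoring attribute
        selected.append(best)
        remaining.remove(best)
    return selected
```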
Example

• The initial set of features is {height, waist, arm_length, weight,
age, gender}

[Figure: the induced decision tree – the root splits on gender, the
next levels split on height and weight, and the leaves are the
classes Thin / Normal / Fat]

• We select gender first, then height, and so on
• The reduced dataset has 3 features: {gender, height, weight}

25
Principal Component Analysis

• We have N vectors of size k, and search for c vectors,
which are uncorrelated (orthogonal) to each other and can
represent the initial data
– The initial dataset is reduced to a new dataset with N vectors
of length c
• Every vector is a linear combination of the c principal
component vectors
• It is applicable to numeric data only
• It is used when the initial dataset
has a large number of features (see the sketch below)
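A minimal sketch of PCA with scikit-learn; the random data and the choice of c = 2 components are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)          # N=100 vectors of size k=10 (numeric only)

pca = PCA(n_components=2)            # keep c=2 principal components
X_reduced = pca.fit_transform(X)     # N vectors of length c

# each principal component is a linear combination of the original features
print(pca.components_.shape)          # (2, 10)
print(pca.explained_variance_ratio_)  # variance retained by each component
```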
Data reduction through hierarchies

• We define different levels of detail


• Hierarchical aggregation
– A hierarchy divides the dataset into subgroups of
instances with values in the same range for a certain
feature
– A hierarchical histogram stores in each node of the
hierarchy the aggregated value of the values in the
respective group
• We can use clustering algorithms for creating
groups of instances

27
Data preprocessing

• The need for data preprocessing


• Data cleaning
• Data integration and transformations
• Data reduction
• Data types
• Discretisation and values hierarchy
• Similarity/Distance metrics

28
Attribute types

⚫ Categorical/nominal attributes: two or more
values without order (e.g. department, gender,
colour etc.)
⚫ Ordinal attributes: two or more values with an
ordering but unequal distances between levels
(e.g. Degree < MSc < PhD, or a 5-point Likert scale)
⚫ Interval attributes: the values are ordered and
the distances between them are equal (e.g.
temperature, pressure, pH etc.)
⚫ Ratio attributes: they have all the properties of
interval attributes and also:
⚫ 0.0 has an absolute meaning, a true zero (e.g. height, weight etc.)
⚫ double the value means double the intensity
29
Allowed transformations

• Depending on the type of the attributes, different
statistics and transformations are meaningful

                                  Nominal   Ordinal   Interval   Ratio
Frequency distribution             Yes       Yes       Yes        Yes
Median, percentiles                No        Yes       Yes        Yes
Addition, subtraction              No        No        Yes        Yes
Mean, standard deviation,
standard error of the mean         No        No        Yes        Yes
Ratio, coefficient of variation    No        No        No         Yes

• Allowed transformations: Nominal – any mapping; Ordinal – any
mapping that preserves the ordering; Interval – add or multiply
with a value; Ratio – multiply with a constant

30
Data preprocessing

• The need for data preprocessing


• Data cleaning
• Data integration and transformations
• Data reduction
• Data types
• Discretisation and values hierarchy
• Similarity/Distance metrics

31
Discretize

• Convert an attribute with continuous values to a
nominal or ordinal attribute
• Results in data reduction: range labels replace
individual values
• Discretization methods
– Binning
– Histogram analysis
– Clustering analysis
– Entropy-based discretization
– Natural grouping

32
Entropy-Based Discretization
• For a set S of instances, split into ranges S1 and S2 at limit
T, the entropy after splitting is
    E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)
• We select the limit that minimizes the resulting entropy after the
split (see the sketch below)
• We then split at another dimension or at a different split point,
until a stopping criterion is met (e.g. the information gain of the
best split falls below a threshold)
33
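A minimal sketch of the split-point search: for every candidate limit T it computes the weighted entropy of S1 and S2 and keeps the minimum. The helper names (entropy, best_split), the toy values and labels are illustrative, and the stopping criterion is omitted.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of the class labels in a range
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(values, labels):
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    candidates = (values[1:] + values[:-1]) / 2          # midpoints as limits T
    return min(candidates,
               key=lambda t: ((values <= t).sum() * entropy(labels[values <= t]) +
                              (values > t).sum() * entropy(labels[values > t]))
                             / len(values))

# best_split([4, 8, 9, 21, 24, 29],
#            ["low", "low", "low", "high", "high", "high"])   # -> 15.0
```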
Value hierarchies

• Define a partial ordering of the features in the
schema
– street < city < state < country
• Automatically group values using the hierarchy
• Explicit grouping of distinct values
– {Athens, Piraeus} < Attika
• Selective grouping
– street < city

[Figure: the location hierarchy with the number of distinct
values per level – country: 15, state: 50, city: 2,000,
street: 700,000]
34
Data preprocessing

• The need for data preprocessing


• Data cleaning
• Data integration and transformations
• Data reduction
• Data types
• Discretisation and values hierarchy
• Similarity/Distance metrics

35
Similarity and dissimilarity

• Similarity
– A measure of how two instances resemble each other
– higher resemblance 🡺 higher similarity
– Values usually range in [0,1]
• Dissimilarity
– A measure of how much two instances differ from each other
– Higher resemblance 🡺 lower dissimilarity
– The lower limit is usually 0 and the upper limit varies
• Proximity
– A general term that refers to either the similarity or the
dissimilarity between two instances

36
Instance similarity

• Say that p and q are the attribute values of two instances

37
Distance

• It is usually related to dissimilarity
• Under certain conditions, it is a metric:
– d(x, y) ≥ 0
– d(x, y) = 0 ⬄ x = y
– d(x, y) = d(y, x)
– d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)

[Figure: two objects – what is their distance?]
Euclidean distance

⚫ Notation: we are given n instances (i = 1..n), each with d
features; instance i is represented by the vector
x(i) = (x1(i), x2(i), …, xd(i))
⚫ The Euclidean distance is a metric, defined as
    d(x(i), x(j)) = √( Σ_k (xk(i) – xk(j))² )
  (a sketch computing it, together with the full distance matrix,
  follows the next slide)
⚫ It is valid when all attributes are measured on the
same scale (commensurate variables)
⚫ It is valid when all attributes are independent of
each other (orthogonal space)
Distance matrix

40
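A minimal numpy/scipy sketch for the two previous slides: the Euclidean distance between two instances and the full pairwise distance matrix. The four 2-dimensional points are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 2.0], [2.0, 0.0], [3.0, 1.0], [5.0, 1.0]])   # n=4, d=2

# d(x, y) = sqrt(sum_k (x_k - y_k)^2) for the first two instances
d01 = np.sqrt(np.sum((X[0] - X[1]) ** 2))        # 2.828...

# distance matrix: d(i, j) for every pair of instances
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 3))
```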
Other distance norms

• p-norm: Minkowski distance Lp (p ≥ 1):
    Lp(x, y) = ( Σ_k |xk – yk|^p )^(1/p)
• p = 2: Euclidean distance (L2 norm)
• The order p of the norm is unrelated to the number of
features (the dimensionality of the space)
Manhattan distance

• p = 1: Manhattan (city-block) distance:
    L1(x, y) = Σ_k |xk – yk|

42
Example

43
Scaling and weights

• When attributes use different scales we can
remove the bias by dividing each attribute by its standard
deviation:
    xk' = xk / σk
  where σk is the standard deviation of the k-th attribute
• When the importance of the attributes differs, we
use weights, e.g. a weighted Euclidean distance
    d(x, y) = √( Σ_k wk (xk – yk)² )
  (see the sketch below)

44
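A minimal numpy sketch of both corrections above (scaling by the standard deviation, then a weighted Euclidean distance); the data and the weights are illustrative assumptions.

```python
import numpy as np

X = np.array([[170.0, 70000.0],
              [180.0, 30000.0],
              [165.0, 45000.0]])      # e.g. height (cm) and salary (euros)

sigma = X.std(axis=0)                  # standard deviation of each attribute
X_scaled = X / sigma                   # removes the bias of the large-scale attribute

w = np.array([0.7, 0.3])               # importance weights for the two attributes

def weighted_euclidean(x, y, w):
    return np.sqrt(np.sum(w * (x - y) ** 2))

print(weighted_euclidean(X_scaled[0], X_scaled[1], w))
```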
Attribute relation

⚫ We examine the relation between attributes using
covariance or correlation
⚫ Covariance: examines whether two attributes vary together
⚫ Given two attributes X and Y and n objects with values x(1),
…, x(n) and y(1), …, y(n), the covariance of X and Y is
    cov(X, Y) = (1 / (n – 1)) Σ_i (x(i) – mean(X)) (y(i) – mean(Y))
⚫ The covariance depends on the value ranges of X and Y

45
Mahalanobis distance

• It accounts for the scaling of each attribute
• It computes the covariance matrix, which contains the
covariance of every pair of attributes, and uses it to correct
the computation of the distance (in each dimension)
• It assumes that the relations between the attributes are
approximately linear
• For two instances x, y (vectors of size d):
    dM(x, y) = √( (x – y)ᵀ Σ⁻¹ (x – y) )
  where (x – y) is the vector difference in the d-dimensional
  space and Σ⁻¹ is the inverse covariance matrix
  (see the sketch below)

46
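A minimal numpy sketch of the Mahalanobis distance; the small 2-dimensional dataset is an illustrative assumption.

```python
import numpy as np

X = np.array([[2.0, 2.0], [2.0, 5.0], [6.0, 5.0],
              [7.0, 3.0], [4.0, 7.0], [6.0, 4.0]])

S = np.cov(X, rowvar=False)        # covariance matrix of the attributes
S_inv = np.linalg.inv(S)           # inverse covariance matrix

def mahalanobis(x, y, S_inv):
    diff = x - y                   # vector difference in d-dimensional space
    return np.sqrt(diff @ S_inv @ diff)

print(mahalanobis(X[0], X[1], S_inv))
```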
Example

[Figure: a worked example with attributes of different
scales/impact – the Mahalanobis distance gives dAB = 5,
whereas dAC = 4]

47
Variations

⚫ When
◦ the covariance matrix is diagonal and
isotropic, i.e.
◦ all dimensions are uncorrelated and have the
same variance
◦ Mahalanobis becomes Euclidean
⚫ When
◦ the covariance matrix is diagonal and
non-isotropic, i.e.
◦ the dimensions have different variances
◦ Mahalanobis becomes a weighted (standardized)
Euclidean distance

48
For binary vectors

• A = 1000000000
• B = 0000001001

• Hamming distance: the number of bits that differ, i.e. the
number of 1s in A XOR B (here 3)
• Similarities (Mxy = number of positions where A has bit x
and B has bit y):
– Simple matching coefficient:
SMC = (M11 + M00) / (M00 + M01 + M10 + M11) = (0 + 7) / 10 = 0.7
– Jaccard similarity coefficient:
J = M11 / (M01 + M10 + M11) = 0 / 3 = 0
(see the sketch below)

49
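A minimal numpy sketch computing the three measures for the vectors A and B above.

```python
import numpy as np

A = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
B = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

hamming = int(np.sum(A != B))                   # bits that differ -> 3

m11 = int(np.sum((A == 1) & (B == 1)))          # 1-1 matches -> 0
m00 = int(np.sum((A == 0) & (B == 0)))          # 0-0 matches -> 7

smc = (m11 + m00) / len(A)                      # simple matching -> 0.7
jaccard = m11 / (len(A) - m00)                  # Jaccard -> 0.0
```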
Other distance/similarity measures

⚫ Similarity for nominal attributes


⚫ The number of features that have matching values divided by
the total number of features
⚫ Or each nominal attribute is converted to an n-sized binary tuple
(where n is the number of distinct values) and a binary
similarity measure is used
⚫ Similarity between images, waves
⚫ Must remain unaffected by transformations (e.g. scaling,
rotation etc.)
⚫ Similarity between strings
⚫ Semantic similarity
⚫ Similarity in character level
⚫ Similarity between texts
⚫ Map texts to bag-of-words
⚫ Map texts to n-gram sets

50
Cosine similarity

• Usually in high dimensional spaces


• Each instance is a high dimensional vector
• For instances d1 and d2 the cosine similarity is defined as
cos(d1, d2) = (d1 · d2) / (||d1|| · ||d2||)
• Example: documents d1 and d2 use the words of a lexicon of 10
words (a 10-dimensional space); the number of occurrences of each
word is the value in the corresponding dimension
– d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
– d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
– d1 · d2 = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
– ||d1|| = (3² + 2² + 0² + 5² + 0² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 = 6.481
– ||d2|| = (1² + 0² + 0² + 0² + 0² + 0² + 0² + 1² + 0² + 2²)^0.5 = 6^0.5 = 2.449
• cos(d1, d2) = 5 / (6.481 · 2.449) = 0.315
(see the sketch below)

51
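A minimal numpy sketch reproducing the example above.

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))    # 0.315
```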
Combined similarity

• When the features are of different types then we
must combine partial similarities
– For the k-th feature we compute a similarity sk in [0,1]
– We then multiply every similarity by a factor δk
• δk = 0 if the k-th feature is binary and is 0 for both instances, or has a
missing value in one of the instances
• otherwise δk = 1
• Each attribute has an associated weight wk in
[0,1], with all wk summing to 1; the combined similarity is the
weighted average of the partial similarities sk over the
attributes with δk = 1 (see the sketch below)
52
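A minimal sketch of one way to combine per-attribute similarities for mixed-type instances; the per-attribute similarity rules, the function name and the exact weighting scheme are assumptions, not the lecture's definitive formula.

```python
import numpy as np

def combined_similarity(x, y, kinds, weights):
    """x, y: attribute tuples; kinds: 'binary', 'nominal' or 'numeric' per
    attribute; weights: non-negative values summing to 1. Assumes numeric
    attributes are already normalized to [0, 1] and at least one delta_k = 1."""
    s, d = [], []
    for xk, yk, kind in zip(x, y, kinds):
        if xk is None or yk is None or (kind == "binary" and xk == 0 and yk == 0):
            s.append(0.0); d.append(0.0)          # delta_k = 0: attribute ignored
        elif kind == "numeric":
            s.append(1.0 - abs(xk - yk)); d.append(1.0)
        else:                                      # binary / nominal
            s.append(1.0 if xk == yk else 0.0); d.append(1.0)
    w, s, d = np.asarray(weights), np.asarray(s), np.asarray(d)
    return float(np.sum(w * d * s) / np.sum(w * d))   # weighted average over delta_k = 1

# combined_similarity((1, "red", 0.4), (0, "red", 0.9),
#                     kinds=("binary", "nominal", "numeric"),
#                     weights=(0.5, 0.3, 0.2))       # -> 0.4
```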
