Data Mining and

Business Intelligence
3rd Lecture
Data preprocessing

Iraklis Varlamis
Data preprocessing

• The need for data preprocessing


• Data cleaning
• Data integration and transformations
• Data reduction
• Data types
• Discretisation and values hierarchy
• Similarity/Distance metrics

2
The need for preprocessing

• Bad data 🡺 bad knowledge is extracted
– e.g. duplicates or inconsistent data may lead to wrong
statistics
– Data consistency must be guaranteed for the whole
collection process
• The data we collect are:
– incomplete: missing values from specific attributes,
missing attributes, aggregate instead of analytic data
– noisy: wrong values, out of limits, e.g. age="240”
– inconsistent: use different values/codings/names for the
same things
• e.g., duplicate ids with different values

3
Reasons for “bad data”

• Incomplete data: A missing value for an attribute of an
instance
– The value has been accidentally omitted during data entry
– The value was not considered important during data collection
– A hardware/software failure in the data collection mechanism
• Noisy data
– Noise at collection
– Noise at data entry
– Noise at transmission
• Inconsistent data
– Different data sources use different representation schemas (e.g.,
Age=“42”, Birthday=“03/07/1997”)
– Data representation changed over time (e.g., rating with values
“A,B,C”, changed to a 5-scale rating “1-5”)
– Functional dependencies among data have been violated (e.g.
entering a second person with the same id number violates
IDNumber🡺 Name, Surname)

4
Data quality

• A multidimensional concept
• May refer to … data
– Accurate
– Complete
– Consistent
– Up to date
– Trustworthy
– Easy to understand
– Accessible
– Added value

5
Basic operations

• Data cleaning
– Fill missing values, remove noise, find and remove
extreme/wrong values
• Data integration
– Integrate multiple databases, use data from tables or files
• Data transformation
– Value normalization, aggregation etc
• Data reduction
– Reduce the dataset size without decreasing the overall
performance
• Discretisation
– Part of data reduction, important for numeric data

6
Data preprocessing

• The need for data preprocessing


• Data cleaning
• Data integration and transformations
• Data reduction
• Data types
• Discretisation and values hierarchy
• Similarity/Distance metrics

7
Data cleaning tasks (1/2)

• We have missing values because


– Data are not always available
• e.g. customers may not reveal their address
– Data are not always recorded
• e.g. due to a sensor malfunction
– Data have been deleted
• e.g. because we decided that they were not necessary or
obsolete
• We treat missing values
– Either by assuming a value that can replace them: e.g.
the mean, median or a default value, decided for the
whole dataset or part of it (see the sketch below)
– Or by ignoring the whole record

8
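A minimal pandas sketch of the two treatments above; the DataFrame and its column names ("age", "income") are hypothetical examples, not data from the lecture.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 41, 37],
                   "income": [1200, 1500, np.nan, 900]})

# Option 1: replace missing values with a value decided for the whole column
df_mean = df.fillna(df.mean())        # column mean
df_median = df.fillna(df.median())    # column median
df_default = df.fillna({"age": 30})   # a default value per column

# Option 2: ignore (drop) every record that contains a missing value
df_dropped = df.dropna()
```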
Data cleaning tasks (2/2)

• We have outliers or noise because


– Errors or random variance in a measured variable, e.g.
transmission errors, recording device limitations
• We treat noise with:
• Binning: values are sorted and partitioned into bins of the
same size; the values of each bin are then smoothed using the
bin average or the bin boundary values
• We treat outliers with:
• Regression: we fit the data to a known function and correct the
values that deviate strongly from it
• Clustering: we group similar values together and remove the
values that remain outside the groups
• Manually: by plotting the values and inspecting them

9
Data binning

• Equi-width split:
– Select N sub-ranges of the same width
– The width of each sub-range is W = (max – min)/N
– Quality is affected by outliers; the final result is biased
towards them
• Equi-depth split:
– Select N sub-ranges that contain the same number of
instances
– Better distribution of the data
– Hard to apply to nominal/ordinal data (categorical, non-continuous
values)
(a code sketch of both splits follows this slide)

10
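A minimal sketch contrasting the two splits with pandas; the values reuse the temperature example of the next slide and the choice of N = 3 bins is an assumption.

```python
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

# Equi-width: N sub-ranges of width (max - min) / N
equi_width = pd.cut(values, bins=3)

# Equi-depth (equal frequency): N sub-ranges with the same number of instances
equi_depth = pd.qcut(values, q=3)

print(equi_width.value_counts())   # counts per bin differ
print(equi_depth.value_counts())   # four instances in each bin
```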
Smoothing

⚫ We record the daily temperatures and sort them:


4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
⚫ We then split into equi-depth bins of four values each (see also the sketch after this slide):
◦ Bin 1: 4, 8, 9, 15
◦ Bin 2: 21, 21, 24, 25
◦ Bin 3: 26, 28, 29, 34
⚫ .. and smooth
⚫ Using bin’s average:
◦ Bin 1: 9, 9, 9, 9
◦ Bin 2: 23, 23, 23, 23
◦ Bin 3: 29, 29, 29, 29
⚫ Using bin’s boundary values
◦ Bin 1: 4, 4, 4, 15
◦ Bin 2: 21, 21, 25, 25
◦ Bin 3: 26, 26, 26, 34

11
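A minimal plain-Python sketch that reproduces the smoothing example above: equi-depth bins of four values, then smoothing by bin means and by bin boundaries.

```python
temps = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = [temps[i:i + 4] for i in range(0, len(temps), 4)]   # equi-depth, 4 per bin

# Smoothing by bin means: every value becomes the (rounded) mean of its bin
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]
# -> [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]

# Smoothing by bin boundaries: every value moves to its closest bin boundary
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
# -> [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```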
Regression – Curve fitting

[Figure: a curve fitted to the data – points far from the curve are outliers, small deviations around it are noise]

12
Clustering

13
Data preprocessing

• The need for data preprocessing


• Data cleaning
• Data integration and transformations
• Data reduction
• Data types
• Discretisation and values hierarchy
• Similarity/Distance metrics

14
Data integration

• May result in
– inconsistent data
– redundant data
• Because of
– Different metrics
– Different representation models
– Different ids
• Example:
– Year_of_birth= 1980
– Age = 30
which is correct?
15
Transformations

• Data aggregation
• Data discretisation and generalization: from
values to categories and hierarchies
• Range conversion: map continuous values to a
different range
– Using x^k, log(x), e^x, |x|, etc.
• Normalization: scale all values to the same range
usually 0..1 or -1..1
• Create additional features by
processing/combining existing features (e.g.
create age from date_of_birth, create
day_of_week from date – see the sketch below)

16
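A minimal pandas sketch of the last bullet (deriving age and day_of_week from dates); the column names and the reference date are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"date_of_birth": ["1980-05-17", "1995-11-02"],
                   "purchase_date": ["2024-03-08", "2024-03-10"]})
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"])
df["purchase_date"] = pd.to_datetime(df["purchase_date"])

# approximate age in whole years, relative to an assumed reference date
ref = pd.Timestamp("2024-12-31")
df["age"] = (ref - df["date_of_birth"]).dt.days // 365

# day of the week of a date (0 = Monday, ..., 6 = Sunday)
df["day_of_week"] = df["purchase_date"].dt.dayofweek
```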
Normalization

• Min-max normalization (maps values to 0…1):
    v' = (v – min_A) / (max_A – min_A)
⚫ z-score normalization (mean 0, standard deviation 1):
    v' = (v – mean_A) / σ_A
⚫ Decimal scaling (maps values to -1…1):
    v' = v / 10^j
  where j is the smallest integer so that Max(|v'|) < 1
  e.g. if oldMin = -56775 and oldMax = 4646 then j must be 5
(a code sketch of the three normalizations follows this slide)
17
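A minimal numpy sketch of the three normalizations, applied to a single attribute; the sample values are illustrative and only reuse the min/max of the decimal-scaling example.

```python
import numpy as np

v = np.array([-56775.0, -3200.0, 0.0, 1250.0, 4646.0])

# Min-max normalization to the range 0..1
minmax = (v - v.min()) / (v.max() - v.min())

# z-score normalization: mean 0, standard deviation 1
zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10^j, with j the smallest integer so max(|v'|) < 1
j = int(np.ceil(np.log10(np.abs(v).max())))   # here j = 5
decimal = v / 10 ** j                         # values end up in -1..1
```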
Data preprocessing

• The need for data preprocessing


• Data cleaning
• Data integration and transformations
• Data reduction
• Data types
• Discretisation and values hierarchy
• Similarity/Distance metrics

18
Data reduction

• Data selection
– Compress
– Fit to functions
– Sample
• Dimension reduction
– Select the most representative/useful
dimensions (features)
– Create new composite dimensions that merge
multiple old dimensions
– Avoid repetitions (or highly correlated
dimensions)

19
Data compression

• String compression
– Usually lossless
– It is difficult to process without decompression
• Image/sound compression
– Usually lossy, with gradual/progressive quality refinement
– A part of the initial information is enough to rebuild an
approximation of the whole
– Wavelet transformations can be employed
• Time-frequency analysis

20
Data reduction

• Parametric methods
– They assume that the data fit a model; they compute the
parameters of the model and store them instead of the
data
– Log-linear models: find associations between features
which are significantly different from zero (important
subspaces), then replace the initial vector
representation (in the original m-dimensional space) with
a product of probabilities over these sub-spaces
– Regression
• Non-parametric methods
– They do not assume a model
– Histograms, clusters, sampling

21
Histograms

• A popular data reduction technique
• Data are split into ranges and only the average or
sum is kept for each range
• An optimal histogram can be constructed for a
single dimension

22
Sampling

• A linear-complexity technique that selects a representative
subset of the original data (in a single scan)
• Random sampling may not be effective if there is skew in
the data
• Stratified sampling
– We first compute the ratio of each class in the original dataset
and keep the same ratio in the selected sample (see the sketch below)

[Figure: raw data vs. a cluster/stratified sample]

23
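A minimal sketch of stratified sampling with pandas: the same fraction is drawn from every class, so the class ratios of the original dataset are preserved. The label column name and the 10% fraction are assumptions.

```python
import pandas as pd

def stratified_sample(df: pd.DataFrame, label: str, frac: float,
                      seed: int = 0) -> pd.DataFrame:
    # draw the same fraction of rows from every class separately
    return df.groupby(label).sample(frac=frac, random_state=seed)

# usage (hypothetical DataFrame with a "class" column):
# sample = stratified_sample(data, label="class", frac=0.10)
```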
Dimensionality reduction

• Feature selection
– We select the minimum set of features that gives the same
class distribution as the original set of features (or one
as similar as possible)
– We reduce the dimensionality of the resulting models and
make them easier to understand
• Heuristics
– Step-wise forward selection: we select the best attribute
and keep adding one attribute at a time (see the sketch below)
– Step-wise backward elimination: the reverse process
– Combination of selection and elimination
– Decision tree induction

24
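A minimal sketch of step-wise forward selection, assuming scikit-learn is available; the classifier (a decision tree) and the 5-fold accuracy scoring are assumptions, not the lecture's prescribed setup.

```python
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_selection(X, y, n_features):
    """Greedy step-wise forward selection on a numpy feature matrix X."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        # score every candidate feature added to the current selection
        scores = {f: cross_val_score(DecisionTreeClassifier(random_state=0),
                                     X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)   # keep the best-scoring attribute
        selected.append(best)
        remaining.remove(best)
    return selected
```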
Example

• The initial set of features is {height, waist, arm_length, weight,
age, gender}

[Figure: the induced decision tree – the root splits on gender, the
next levels split on height and weight, and the leaves are the
classes Thin / Normal / Fat]

• We select gender first, then height, and so on
• The reduced dataset has 3 features: {gender, height, weight}

25
Principal Component Analysis

• We have N vectors of size k, and search for c vectors,
which are uncorrelated (orthogonal) to each other and can
represent the initial data
– The initial dataset is reduced to a new dataset with N vectors
of length c
• Every vector is a linear combination of the c principal
component vectors
• It is applicable to numeric data only
• It is used when the initial dataset
has a large number of features (see the sketch below)
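A minimal sketch of PCA with scikit-learn; the random data and the choice of c = 2 components are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)          # N=100 vectors of size k=10 (numeric only)

pca = PCA(n_components=2)            # keep c=2 principal components
X_reduced = pca.fit_transform(X)     # N vectors of length c

# each principal component is a linear combination of the original features
print(pca.components_.shape)          # (2, 10)
print(pca.explained_variance_ratio_)  # variance retained by each component
```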
Data reduction through hierarchies

• We define different levels of detail


• Hierarchical aggregation
– A hierarchy divides the dataset into subgroups of
instances with values in the same range for a certain
feature
– A hierarchical histogram stores in each node of the
hierarchy the aggregated value of the values in the
respective group
• We can use clustering algorithms for creating
groups of instances

27
Data preprocessing

• The need for data preprocessing


• Data cleaning
• Data integration and transformations
• Data reduction
• Data types
• Discretisation and values hierarchy
• Similarity/Distance metrics

28
Attribute types

⚫ Categorical/nominal attributes: two or more
values without order (e.g. department, gender,
colour etc.)
⚫ Ordinal attributes: two or more values with an
ordering but unequal distances between levels
(e.g. Degree < MSc < PhD, or a 5-point Likert scale)
⚫ Interval attributes: the values are ordered and
the distances between them are equal (e.g.
temperature, pressure, pH etc.)
⚫ Ratio attributes: they have all the properties of
interval attributes and also:
⚫ 0.0 has an absolute meaning, a true zero (e.g. height, weight etc.)
⚫ double the value means double the intensity
29
Allowed transformations

• Depending on the type of the attributes, different
statistics and transformations are meaningful

                                  Nominal   Ordinal   Interval   Ratio
Frequency distribution             Yes       Yes       Yes        Yes
Median, percentiles                No        Yes       Yes        Yes
Addition, subtraction              No        No        Yes        Yes
Mean, standard deviation,
standard error of the mean         No        No        Yes        Yes
Ratio, coefficient of variation    No        No        No         Yes

• Allowed transformations: Nominal – any mapping; Ordinal – any
mapping that preserves the ordering; Interval – add or multiply
with a value; Ratio – multiply with a constant

30
Data preprocessing

• The need for data preprocessing


• Data cleaning
• Data integration and transformations
• Data reduction
• Data types
• Discretisation and values hierarchy
• Similarity/Distance metrics

31
Discretize

• Convert an attribute with continuous values to a
nominal or ordinal attribute
• Results in data reduction: range labels replace
individual values
• Discretization methods
– Binning
– Histogram analysis
– Clustering analysis
– Entropy-based discretization
– Natural grouping

32
Entropy-Based Discretization
• For a set S of instances, split into ranges S1 and S2 at limit
T, the entropy after splitting is
    E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)
• We select the limit that minimizes the resulting entropy after the
split (see the sketch below)
• We then split at another dimension or at a different split point,
until a stopping criterion is met (e.g. the information gain of the
best split falls below a threshold)
33
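A minimal sketch of the split-point search: for every candidate limit T it computes the weighted entropy of S1 and S2 and keeps the minimum. The helper names (entropy, best_split), the toy values and labels are illustrative, and the stopping criterion is omitted.

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of the class labels in a range
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(values, labels):
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    candidates = (values[1:] + values[:-1]) / 2          # midpoints as limits T
    return min(candidates,
               key=lambda t: ((values <= t).sum() * entropy(labels[values <= t]) +
                              (values > t).sum() * entropy(labels[values > t]))
                             / len(values))

# best_split([4, 8, 9, 21, 24, 29],
#            ["low", "low", "low", "high", "high", "high"])   # -> 15.0
```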
Value hierarchies

• Define a partial ordering of the features in the
schema
– street < city < state < country
• Automatically group values using the hierarchy
• Explicit grouping of distinct values
– {Athens, Piraeus} < Attika
• Selective grouping
– street < city

[Figure: the location hierarchy with the number of distinct
values per level – country: 15, state: 50, city: 2,000,
street: 700,000]
34
Data preprocessing

• The need for data preprocessing


• Data cleaning
• Data integration and transformations
• Data reduction
• Data types
• Discretisation and values hierarchy
• Similarity/Distance metrics

35
Similarity and dissimilarity

• Similarity
– A measure of how two instances resemble each other
– higher resemblance 🡺 higher similarity
– Values usually range in [0,1]
• Dissimilarity
– A measure of how much two instances differ from each other
– Higher resemblance 🡺 lower dissimilarity
– The lower limit is usually 0 and the upper limit varies
• Proximity
– A general term that refers to either the similarity or the
dissimilarity between two instances

36
Instance similarity

• Say that p and q are the attribute values of two instances

37
Distance

• It is usually related to dissimilarity
• Under certain conditions, it is a metric:
– d(x, y) ≥ 0
– d(x, y) = 0 ⬄ x = y
– d(x, y) = d(y, x)
– d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)

[Figure: two objects – what is their distance?]
Euclidean distance

⚫ Notation: we are given n instances (i = 1..n), each with d
features; instance i is represented by the vector
x(i) = (x1(i), x2(i), …, xd(i))
⚫ The Euclidean distance is a metric, defined as
    d(x(i), x(j)) = √( Σ_k (xk(i) – xk(j))² )
  (a sketch computing it, together with the full distance matrix,
  follows the next slide)
⚫ It is valid when all attributes are measured on the
same scale (commensurate variables)
⚫ It is valid when all attributes are independent of
each other (orthogonal space)
Distance matrix

40
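A minimal numpy/scipy sketch for the two previous slides: the Euclidean distance between two instances and the full pairwise distance matrix. The four 2-dimensional points are illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 2.0], [2.0, 0.0], [3.0, 1.0], [5.0, 1.0]])   # n=4, d=2

# d(x, y) = sqrt(sum_k (x_k - y_k)^2) for the first two instances
d01 = np.sqrt(np.sum((X[0] - X[1]) ** 2))        # 2.828...

# distance matrix: d(i, j) for every pair of instances
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 3))
```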
Other distance norms

• p-norm: Minkowski distance Lp (p ≥ 1):
    Lp(x, y) = ( Σ_k |xk – yk|^p )^(1/p)
• p = 2: Euclidean distance (L2 norm)
• The order p of the norm is unrelated to the number of
features (the dimensionality of the space)
Manhattan distance

• p = 1: Manhattan (city-block) distance:
    L1(x, y) = Σ_k |xk – yk|

42
Example

43
Scaling and weights

• When attributes use different scales we can
remove the bias by dividing each attribute by its standard
deviation:
    xk' = xk / σk
  where σk is the standard deviation of the k-th attribute
• When the importance of the attributes differs, we
use weights, e.g. a weighted Euclidean distance
    d(x, y) = √( Σ_k wk (xk – yk)² )
  (see the sketch below)

44
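A minimal numpy sketch of both corrections above (scaling by the standard deviation, then a weighted Euclidean distance); the data and the weights are illustrative assumptions.

```python
import numpy as np

X = np.array([[170.0, 70000.0],
              [180.0, 30000.0],
              [165.0, 45000.0]])      # e.g. height (cm) and salary (euros)

sigma = X.std(axis=0)                  # standard deviation of each attribute
X_scaled = X / sigma                   # removes the bias of the large-scale attribute

w = np.array([0.7, 0.3])               # importance weights for the two attributes

def weighted_euclidean(x, y, w):
    return np.sqrt(np.sum(w * (x - y) ** 2))

print(weighted_euclidean(X_scaled[0], X_scaled[1], w))
```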
Attribute relation

⚫ We examine the relation between attributes using
covariance or correlation
⚫ Covariance: examines whether two attributes vary together
⚫ Given two attributes X and Y and n objects with values x(1),
…, x(n) and y(1), …, y(n), the covariance of X and Y is
    cov(X, Y) = (1 / (n – 1)) Σ_i (x(i) – mean(X)) (y(i) – mean(Y))
⚫ The covariance depends on the value ranges of X and Y

45
Mahalanobis distance

• It accounts for the scaling of each attribute
• It computes the covariance matrix, which contains the
covariance of every pair of attributes, and uses it to correct
the computation of the distance (in each dimension)
• It assumes that the relations between the attributes are
approximately linear
• For two instances x, y (vectors of size d):
    dM(x, y) = √( (x – y)ᵀ Σ⁻¹ (x – y) )
  where (x – y) is the vector difference in the d-dimensional
  space and Σ⁻¹ is the inverse covariance matrix
  (see the sketch below)

46
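A minimal numpy sketch of the Mahalanobis distance; the small 2-dimensional dataset is an illustrative assumption.

```python
import numpy as np

X = np.array([[2.0, 2.0], [2.0, 5.0], [6.0, 5.0],
              [7.0, 3.0], [4.0, 7.0], [6.0, 4.0]])

S = np.cov(X, rowvar=False)        # covariance matrix of the attributes
S_inv = np.linalg.inv(S)           # inverse covariance matrix

def mahalanobis(x, y, S_inv):
    diff = x - y                   # vector difference in d-dimensional space
    return np.sqrt(diff @ S_inv @ diff)

print(mahalanobis(X[0], X[1], S_inv))
```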
Example

[Figure: a worked example with attributes of different
scales/impact – the Mahalanobis distance gives dAB = 5,
whereas dAC = 4]

47
Variations

⚫ When
◦ the covariance matrix is diagonal and
isotropic, i.e.
◦ all dimensions are uncorrelated and have the
same variance
◦ Mahalanobis becomes Euclidean
⚫ When
◦ the covariance matrix is diagonal and
non-isotropic, i.e.
◦ the dimensions have different variances
◦ Mahalanobis becomes a weighted (standardized)
Euclidean distance

48
For binary vectors

• A = 1000000000
• B = 0000001001

• Hamming distance: the number of bits that differ, i.e. the
number of 1s in A XOR B (here 3)
• Similarities (Mxy = number of positions where A has bit x
and B has bit y):
– Simple matching coefficient:
SMC = (M11 + M00) / (M00 + M01 + M10 + M11) = (0 + 7) / 10 = 0.7
– Jaccard similarity coefficient:
J = M11 / (M01 + M10 + M11) = 0 / 3 = 0
(see the sketch below)

49
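A minimal numpy sketch computing the three measures for the vectors A and B above.

```python
import numpy as np

A = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
B = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

hamming = int(np.sum(A != B))                   # bits that differ -> 3

m11 = int(np.sum((A == 1) & (B == 1)))          # 1-1 matches -> 0
m00 = int(np.sum((A == 0) & (B == 0)))          # 0-0 matches -> 7

smc = (m11 + m00) / len(A)                      # simple matching -> 0.7
jaccard = m11 / (len(A) - m00)                  # Jaccard -> 0.0
```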
Other distance/similarity measures

⚫ Similarity for nominal attributes


⚫ The number of features that have matching values divided by
the total number of features
⚫ Or each nominal attribute is converted to an n-sized binary tuple
(where n is the number of distinct values) and a binary
similarity measure is used
⚫ Similarity between images, waves
⚫ Must remain unaffected by transformations (e.g. scaling,
rotation etc.)
⚫ Similarity between strings
⚫ Semantic similarity
⚫ Similarity in character level
⚫ Similarity between texts
⚫ Map texts to bag-of-words
⚫ Map texts to n-gram sets

50
Cosine similarity

• Usually in high dimensional spaces


• Each instance is a high dimensional vector
• For instances d1 and d2 the cosine similarity is defined as
cos(d1, d2) = (d1 · d2) / (||d1|| · ||d2||)
• Example: documents d1 and d2 use the words of a lexicon of 10
words (a 10-dimensional space); the number of occurrences of each
word is the value in the corresponding dimension
– d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
– d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
– d1 · d2 = 3·1 + 2·0 + 0·0 + 5·0 + 0·0 + 0·0 + 0·0 + 2·1 + 0·0 + 0·2 = 5
– ||d1|| = (3² + 2² + 0² + 5² + 0² + 0² + 0² + 2² + 0² + 0²)^0.5 = 42^0.5 = 6.481
– ||d2|| = (1² + 0² + 0² + 0² + 0² + 0² + 0² + 1² + 0² + 2²)^0.5 = 6^0.5 = 2.449
• cos(d1, d2) = 5 / (6.481 · 2.449) = 0.315
(see the sketch below)

51
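A minimal numpy sketch reproducing the example above.

```python
import numpy as np

d1 = np.array([3, 2, 0, 5, 0, 0, 0, 2, 0, 0])
d2 = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 2])

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 4))    # 0.315
```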
Combined similarity

• When the features are of different types then we
must combine partial similarities
– For the k-th feature we compute a similarity sk in [0,1]
– We then multiply every similarity by a factor δk
• δk = 0 if the k-th feature is binary and is 0 for both instances, or has a
missing value in one of the instances
• otherwise δk = 1
• Each attribute has an associated weight wk in
[0,1], with all wk summing to 1; the combined similarity is the
weighted average of the partial similarities sk over the
attributes with δk = 1 (see the sketch below)
52
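A minimal sketch of one way to combine per-attribute similarities for mixed-type instances; the per-attribute similarity rules, the function name and the exact weighting scheme are assumptions, not the lecture's definitive formula.

```python
import numpy as np

def combined_similarity(x, y, kinds, weights):
    """x, y: attribute tuples; kinds: 'binary', 'nominal' or 'numeric' per
    attribute; weights: non-negative values summing to 1. Assumes numeric
    attributes are already normalized to [0, 1] and at least one delta_k = 1."""
    s, d = [], []
    for xk, yk, kind in zip(x, y, kinds):
        if xk is None or yk is None or (kind == "binary" and xk == 0 and yk == 0):
            s.append(0.0); d.append(0.0)          # delta_k = 0: attribute ignored
        elif kind == "numeric":
            s.append(1.0 - abs(xk - yk)); d.append(1.0)
        else:                                      # binary / nominal
            s.append(1.0 if xk == yk else 0.0); d.append(1.0)
    w, s, d = np.asarray(weights), np.asarray(s), np.asarray(d)
    return float(np.sum(w * d * s) / np.sum(w * d))   # weighted average over delta_k = 1

# combined_similarity((1, "red", 0.4), (0, "red", 0.9),
#                     kinds=("binary", "nominal", "numeric"),
#                     weights=(0.5, 0.3, 0.2))       # -> 0.4
```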
