
TOPIC 4: DATA PRE-PROCESSING

Han and Kamber: Chapter 2


WHAT IS IT & WHY IS IT REQUIRED?
• Data used in Data Mining has origins from multiple heterogeneous
sources that must be integrated
• Collected for reasons other than learning
• Missing Data
• Not considered important (relevant) during data collection
• Equipment Malfunction
• Inconsistent Data
• Different data formats
• Noise
• Duplicate tuples
• Data entry errors
• Technology limitations
• Rubbish In, Rubbish Out

• How can we improve the quality of the inputs?


• Data Cleaning: Missing values and Noise reduction
• Data Transformation: Normalization, Standardization, Rotation
• Data Reduction: Sampling, Dimensionality Reduction, Aggregation
What to expect in this part…
• Descriptive Data Summarization: Simple statistical methods for
– Visualizing data
– Measuring variance in data
– Measuring interrelationships between pairs of variables
• Techniques for transforming
– Numeric attributes into categorical ones
– Categorical attributes into numeric ones
– Attribute Normalization
• Methods for smoothing outliers and filling missing data
• Dimensionality Reduction techniques
• Data Sampling
Objective of the Data Audit
• To discover issues/inconsistency within the data
– Is the range of values reasonable?
– Is the distribution of values reasonable?
– Should the variable be removed or kept?
• Rationale for the attribute
– Is the level of missing data reasonable?

• What you do once the inconsistencies are found is domain specific

• From the high level descriptive statistics
– Drill into an attribute that shows some inconsistency, for example, calendar_month containing 14 unique values
Descriptive Data Summarization
• Assessing the usefulness of each variable for modelling
• Deciding on the usage/interpretation of the variables
– Simple descriptive statistics
• Statistics used depend on the data type
– Distributions
• Noting skew in the distributions
• Types of Statistics
– Distributive: can be computed by partitioning data, computing the measure
and aggregating values across partitions e.g. min, max, sum, count
– Algebraic: applying an algebraic function to one or more distributive
measures e.g. average
– Holistic: measured on the entire data set e.g. median
• Discovering interactions between variables using Cross Tabulation
and Correlation
Data Types
• Categorical
– A finite set of values
– Comparison: Values are either equal or distinct
– Examples: Campaign Type, Product Purchased, Line Purchased

• Ordinal
– An ordered set of distinct values
– Measure of distance between values not necessarily defined
– Examples: AgeBracket, Dates

• Numeric
– An infinite set of values
– Defined by a valid range
– Arithmetic operations can be defined
– Measure of distance between two values is defined
– Examples: Length of Relationship, Time to Response
Descriptive Statistics (Categorical)
• Modal Value (Mode)
– The most frequent value
– Multi-modal distributions have
multiple values with high
frequency

• Frequency Distribution of all values (Bar Chart)
– As percentage and/or count

• Set of Unique Values

Descriptive Statistics (Ordinal)
• Minimum and Maximum Values
• Median Value
• Modal Value
• Frequency distribution
– Bar chart
– Histogram

• The example above shows the distribution of Age bracket
• Each value, while coded as a number, is not necessarily equidistant
• Calculating a mean does not necessarily make sense
Measures of Central Tendency
• Arithmetic Mean or Average
– Sensitive to extreme values

  $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

• Weighted Average
– When not all values are equally important, each value $x_i$ may have an associated weight $w_i$

  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

• Trimmed Mean
– Mean obtained by removing (giving a weight of 0 to) the values in the top and bottom s% (the value of s is normally small)
• Median
– Less sensitive to skew (outliers)
– If n is odd, the median is the middle value of the ordered set
– If n is even then the median is the average of the two middle values
– Approximate value may be calculated from grouped data (see next slide)
• Mode
– The value that occurs the most frequently
– Data may be multi-modal (have multiple values with maximum frequency)
• For a unimodal frequency curve with moderate skew, mean − mode ≈ 3 × (mean − median)
Approximating median
• Bin the data and calculate the frequency of each bin
• Find the bin that contains the median, say bin k defined as [l_k, u_k) with count c_k
• Assume that the data in the bin containing the median is uniformly distributed

  Bin      Count
  [0,10)    18
  [10,20)   35
  [20,30)   15
  [30,40)   32
  Total    100

  $median \approx l_k + \frac{u_k - l_k}{c_k}\left(\frac{N}{2} - \sum_{i=1}^{k-1} c_i\right)$

  Example: $median \approx 10 + \frac{20 - 10}{35}(50 - 18) = 19.14$
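A minimal Python sketch of this approximation, using the bin edges and counts from the table above (NumPy only):

import numpy as np

# Bin edges and counts from the example above
edges = np.array([0, 10, 20, 30, 40])       # [0,10), [10,20), [20,30), [30,40)
counts = np.array([18, 35, 15, 32])
N = counts.sum()                             # 100

cum = np.cumsum(counts)                      # 18, 53, 68, 100
k = np.searchsorted(cum, N / 2)              # index of the bin holding the median -> 1
lk, uk = edges[k], edges[k + 1]              # 10, 20
below = cum[k - 1] if k > 0 else 0           # 18

median_approx = lk + (uk - lk) / counts[k] * (N / 2 - below)
print(median_approx)                         # ~19.14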
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data
Measures of Variability
• An attribute has a range

• The distribution of values of the attribute within a sample is an estimate of the behaviour of the variable in the population
– The quality of the estimate is only as good as the sample is a reflection of the population

• Visual representations of variance such as the histogram are not very accurate measures of variability

• Statistics provides a number of metrics for measuring variability with respect to the deviation of the attribute values around the mean
Measures of Variability
• Deviation
– Calculate the distance of each data point x from the mean value of the attribute
  $d = x - \bar{x}$
• For a sample, some form of averaging of the deviations of the data points could provide a useful statistic of variability
– Averaged over the sample, the raw deviations sum to zero!
– Common measures used
• Mean Absolute Deviation: average of the absolute value of the deviation
• Variance
– Does not have the same unit as the variable
– Does not have an intuitive interpretation
• Standard Deviation
– Square root of variance; same unit as the variable
Measures of Dispersion
• Degree to which numeric data is spread
– Range: max-min values
– k-th percentile is the value x such that k% of the data is less than or
equal to x

• Quartiles: Q1 (25th percentile), Q3 (75th percentile)


• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, M, Q3, max
• Outlier: usually, a value higher/lower than 1.5 x IQR
• Variance

  $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$
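A short NumPy sketch computing the five-number summary, the IQR and the usual 1.5 × IQR outlier rule on synthetic data:

import numpy as np

x = np.random.randn(1000)

q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1
five_num = (x.min(), q1, median, q3, x.max())

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(five_num, len(outliers))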
Boxplot Analysis
• Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum

• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extend to
Minimum and Maximum
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="ticks")

x = np.random.randn(100)

f, (ax_box, ax_hist) = plt.subplots(2, sharex=True,gridspec_kw={"height_ratios": (.15, .85)})

sns.boxplot(x, ax=ax_box)
sns.distplot(x, ax=ax_hist)

ax_box.set(yticks=[])
sns.despine(ax=ax_hist)
sns.despine(ax=ax_box, left=True)
Histograms
• A univariate graphical method
– Summarizes the distribution of a single
attribute
– Partitions data into n disjoint partitions
(intervals) or buckets
• Width of bucket is typically uniform
(equal-width histogram)
• Displayed as a set of rectangles (one per bucket)
– Height reflects the count or frequency of the interval in the given data
• Columns represent counts (y-axis)
• X-axis represents the range of the attribute
– Generally partitioned into smaller ranges of equal size (buckets)
• A visual representation of the variability of a numeric attribute
Scatter Plot
• Plots two numeric variables
– One on the X-axis
– Other on Y-axis

• Visualise covariance or correlation


between the two attributes
– Viewing the joint distribution of the
values

• Provides a view of the density of the


values in different parts of the plane
defined by the two attributes

• Overlaying a categorical attribute on top


of these points is a useful method for
adding a third dimension to the plot
Positively and Negatively Correlated Data
Bi-Variate Investigation
• Investigating the relationship between two variables
– For example, does the Campaign_Type affect the Product bought?

• Generally involves a test of statistical significance that lets you know the degree of confidence you can have in accepting or rejecting a hypothesis
– The hypothesis in this case is that the two variables have no interaction between them
– If the test returns a very small probability, < 0.05, then generally the conclusion is that there is an interaction between the variables, i.e. the hypothesis is rejected

• Note that what is being measured is association, not cause
Correlation
• Measures the interaction between two numeric variables

  $r_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,\sigma_x \sigma_y}$

• Bivariate measure of association (strength) of the relationship between two variables
– Does not provide causal information
• Ranges from –1 to 1
– 0 implies no linear relationship
– –1 implies a perfect negative linear relationship
– 1 implies a perfect positive linear relationship
• Measures linear correlation only
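A small NumPy sketch computing the sample correlation coefficient as defined above on synthetic data, and checking it against np.corrcoef:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)   # positively correlated with x

# Sample correlation coefficient, as defined above
r = ((x - x.mean()) * (y - y.mean())).sum() / ((len(x) - 1) * x.std(ddof=1) * y.std(ddof=1))

# Should agree with the library implementation
print(r, np.corrcoef(x, y)[0, 1])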


Covariance
• Measures the tendency for two attributes to co-vary
– Variance of an attribute is defined as the average squared deviation from the mean
– Covariance of two attributes X and Y is defined as

  $c(X, Y) = \sum_{i=1}^{n}\frac{(x_i - \bar{x})(y_i - \bar{y})}{n-1}$

• Properties of c(X,Y)
– If attributes X and Y tend to increase together, then c(X,Y) > 0
– If attribute X tends to decrease when attribute Y increases, then c(X,Y) < 0
– If attributes X and Y are independent then c(X,Y) = 0

• Notice that
– c(X,X) = Variance of X
Standardisation
• The covariances of different pairs of variables cannot be compared due to differences in scale and variance
• Resolved by normalising all variables prior to computing the covariance
– The normalisation used is defined as $z_x = \frac{x - \mu}{\sigma}$ and results in a variable with $\mu = 0$ and $\sigma = 1$; it is often referred to as the z-score
• The correlation coefficient (r) is the slope of the regression line drawn in the state space defined by X and Y when both X and Y are standardised

  $r = \frac{\sum_{i=1}^{n} z_x z_y}{n-1}$

• The coefficient of determination is the square of the correlation coefficient
– represents the percentage of the variance of X explained by Y
Cross Tabulation
Chi Square Test (χ²)
• Works by comparing the actual frequencies in each cell of the cross-tabulation with the expected frequency if there were no interaction between the two variables in the cross-tabulation
– How is the expected frequency calculated?

• Steps to performing the Chi Square test
– Cross tabulate the data
– Decide on a significance threshold, generally 0.05 or 0.01
– Measure the expected frequency for each cell
– Calculate the chi-square value, measuring the difference between the Observed (O) and Expected (E) frequencies, as

  $\chi^2 = \sum_{i,j}^{r,c}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}$

– Calculate the degrees of freedom (c-1)*(r-1)
– Get a probability of significance of the difference between O and E
Expected and Observed Frequencies
Observed frequencies (with marginal sums Ri, Cj and total frequency T):

        X1    X2
  Y1    20    15   | 35
  Y2    30    15   | 45
        50    30   | 80

Expected frequencies:  $e_{ij} = \frac{R_i}{T}\times\frac{C_j}{T}\times T = \frac{R_i C_j}{T}$

        X1       X2
  Y1    21.875   13.125   | 35
  Y2    28.125   16.875   | 45
        50       30       | 80
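As an illustration, the observed table above can be fed to scipy.stats.chi2_contingency, which returns the χ² statistic, the p-value, the degrees of freedom and the expected frequencies (a sketch; the Yates continuity correction is switched off so the values match the formula above):

import numpy as np
from scipy.stats import chi2_contingency

# Observed frequencies from the cross-tabulation above
observed = np.array([[20, 15],
                     [30, 15]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)      # [[21.875, 13.125], [28.125, 16.875]], as computed above
print(chi2, dof, p)  # compare p against the chosen significance threshold

# Cramer's phi (see the next slide): strength of the association
T = observed.sum()
k = min(observed.shape)
phi = np.sqrt(chi2 / (T * (k - 1)))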
Measuring the Strength of a Relationship
• The chi-square statistic tells you if the relationship is statistically significant
– For 1 degree of freedom and a 95% confidence level, χ² > 3.84 is required to reject the hypothesis that the two variables are independent of each other

• Cramer's phi presents a measure of the strength of the relationship and is calculated from

  $\phi^2 = \frac{\chi^2}{T(k-1)}$

– k is the smaller of the number of rows and columns in the cross-tabulation
– T is the total frequency

• Interpretation
– φ² × 100 is the percentage of variation within one variable attributed to the other variable
Relationship between Categorical and Numeric Attributes
• Figure shows the distribution of shoe size in the population (Total) as well as that partitioned by Gender

• There are a number of statistical techniques, like the t-test, to compare two distributions
– Provide a measure of confidence that the two distributions are/are not from the same population

• If these tests suggest that there is no statistically significant difference between the two distributions, it may be said that the numeric attribute is independent of the categorical attribute
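A minimal sketch of such a comparison using scipy.stats.ttest_ind on two hypothetical (synthetic) groups:

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
# Hypothetical shoe sizes for two gender groups
group_a = rng.normal(loc=42, scale=2, size=200)
group_b = rng.normal(loc=38, scale=2, size=200)

t_stat, p_value = ttest_ind(group_a, group_b)
# A small p-value suggests the numeric attribute differs across the categories
print(t_stat, p_value)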
DATA TRANSFORMATION
TRANSFORMING DATA TYPES
• Some modelling techniques can only handle categorical attributes
– Example, some decision tree implementations

• Others can only handle numeric attributes
– Example, Neural Networks

• Yet others can handle only one data type at a time (without introducing a bias), all categorical or all numeric
– Example, distance based techniques such as Clustering and Nearest Neighbour
BINNING
• Mapping a numeric attribute to a categorical attribute
• Divide the range of the attribute into sub-ranges
• Use sub-range labels as substitutes for the actual value
• Common methods for binning (illustrated in the sketch below)
  • Domain Knowledge
  • Equal width
    • Sensitive to outlier values
    • Reflects the distribution of the underlying attribute
  • Equal frequency
• In supervised learning problems, binning can be carried out to maximise the information content of the binned attribute
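A small pandas sketch of the three binning approaches on a hypothetical age attribute (column names and bin edges are illustrative only):

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
age = pd.Series(rng.normal(40, 12, size=1000).clip(18, 90))

equal_width = pd.cut(age, bins=5)           # equal-width bins, sensitive to outliers
equal_freq = pd.qcut(age, q=5)              # equal-frequency (quantile) bins
domain = pd.cut(age, bins=[18, 25, 40, 65, 90],
                labels=["18-25", "26-40", "41-65", "65+"],
                include_lowest=True)        # domain-knowledge driven bins

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())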
EQUAL LENGTH VS FREQUENCY BINS
MEASURING INFORMATION CONTENT
• When a target attribute is defined in the data, i.e. supervised learning is being performed
• Binning can be carried out to increase the information content of the attribute with respect to the target attribute
• Requires the measurement of information content
• Requires a method for measuring the gain in information due to the availability of new information (the attribute to be binned)
  • The optimal binning regime is one that maximises the gain in information
ENTROPY AND INFORMATION GAIN
• Entropy is the level of uncertainty within an attribute and is measured using the formula

  $e = -\sum_i p_i \log_2 p_i$

  • pᵢ is the probability of a random selection choosing value vᵢ
  • The summation is defined over all unique values of the attribute
• Entropy exhibits the following useful characteristics
  • Ranges from 0 to log₂ n
  • If the data only contains one value, entropy is 0
  • If the data contains two values with equal probability, entropy is 1, signifying complete uncertainty
• Key issue
  • How should an attribute be binned so as to maximise the amount of information it provides about the dependent attribute, measured by the reduction in its entropy (Information Gain)?
BINNING TO MAXIMISE IG
[Diagram: sort the data by the attribute to be binned; the values of the dependent attribute determine the ideal bin boundary that maximises IG]
MEASURING INFORMATION GAIN
• After binning, the independent attribute partitions the data into k partitions, Sⱼ, where k is the cardinality of the attribute
• Information Gain is calculated as

  $IG(PP, L) = e_{PP} - \sum_j \frac{|S_j|}{|S|} e_{S_j}$

  • $e_{PP}$ is the entropy of the full data set
  • $e_{S_j}$ is the entropy of the partition Sⱼ
  • $|S_j|$ is the number of instances in the partition Sⱼ
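A minimal sketch of entropy and information gain in Python (pandas/NumPy), applied to a hypothetical attribute binned two different ways:

import numpy as np
import pandas as pd

def entropy(labels):
    # Shannon entropy (base 2) of a categorical target
    p = pd.Series(labels).value_counts(normalize=True)
    return -(p * np.log2(p)).sum()

def information_gain(binned_attribute, target):
    # IG of the target given a binned (categorical) attribute
    df = pd.DataFrame({"x": binned_attribute, "y": target})
    weighted = sum(len(g) / len(df) * entropy(g["y"])
                   for _, g in df.groupby("x", observed=True))
    return entropy(df["y"]) - weighted

# Hypothetical example: compare two binning schemes of a numeric attribute
rng = np.random.default_rng(3)
x = rng.uniform(0, 100, size=500)
y = (x > 60).astype(int)                              # target depends on x
ig_2bins = information_gain(pd.cut(x, bins=[0, 60, 100]), y)
ig_4bins = information_gain(pd.cut(x, bins=4), y)
print(ig_2bins, ig_4bins)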
Data Granularity and Information Gain

[Figure: binning LENGTH_OF_RELATIONSHIP at increasing granularity yields IG = 0, IG = 0.000017 and IG = 0.0028 respectively]

• Usual approach is to build a binary hierarchy and select a level in the hierarchy that provides the optimal split
MAPPING CATEGORICAL VALUES TO NUMERIC SCALE
• Some modelling techniques cannot handle categorical attributes e.g. neural networks

• Prior to discovery, categorical attributes must be mapped onto a numeric scale
– Note that there must be a rationale for any mapping

• Domain based Mapping
– Zip Code can be mapped to latitude and longitude
– Colour can be mapped to RGB values
1-HOT ENCODING
• For each category value, create a new binary attribute that takes the value 1 if the categorical attribute has that value set, else it takes the value 0

• Dimensionality can increase dramatically

• Density of some values can be very low
  • a modelling technique may not be able to use an attribute effectively with such a skewed distribution
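A minimal pandas sketch using get_dummies on a hypothetical campaign_type attribute:

import pandas as pd

df = pd.DataFrame({"campaign_type": ["email", "phone", "email", "mail"]})

# One new binary column per category value
one_hot = pd.get_dummies(df["campaign_type"], prefix="campaign")
print(one_hot)
# Dimensionality grows with the number of distinct values;
# rare categories produce very sparse (mostly zero) columns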
NORMALISATION
• Attribute distribution is the way values of the attribute are spread across a range

• Some distributions can cause problems for modelling tools

• Normalisation is the process of transforming the attribute distribution to make it more useable by the modelling tool

• Normalisation can be of two types
  • Range
  • Distribution

• Range normalisation is the mapping of all attributes onto a new range with a pre-defined minimum and maximum value
  • Example, mapping all attributes to the [0,1] scale
DISTRIBUTION NORMALISATION
• Additional constraints may also be added to the mapping
  • Example, the mean of the new attribute must be 0 and the standard deviation must be 1
  • Appropriate if the spread in values is due to normal random variation but not if it is characteristic of the domain

  $x' = \frac{x - \mu}{\sigma}$

  • Referred to as z-score normalization

• Such constraints affect the shape of the distribution, and the mapping is therefore referred to as distribution normalisation
• Normalisation does introduce biases and distortions into the data and hence is often referred to as "massaging"
  • The aim is to better expose the information content to the data mining/modelling tool
  • Care needs to be taken not to do the opposite
MIN-MAX NORMALIZATION
• Linear transformation

  $f(x) = \frac{x - \min}{\max - \min}$

• f(x) is the new normalised value, belonging to the range [0,1]
• min and max are the minimum and maximum values for the attribute being normalised
– More generally, if the new range for the attribute is [nmin, nmax]

  $f(x) = \frac{x - \min}{\max - \min}(n_{max} - n_{min}) + n_{min}$

• A range normalisation, as it does not affect the distribution of the values within the range, i.e. the shape of the distribution remains unaltered
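A small NumPy sketch of min-max (range) normalisation onto an arbitrary [nmin, nmax]:

import numpy as np

def min_max(x, new_min=0.0, new_max=1.0):
    # Linear range normalisation onto [new_min, new_max]
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min

x = np.array([3.0, 7.5, 12.0, 30.0])
print(min_max(x))            # mapped onto [0, 1]
print(min_max(x, -1, 1))     # mapped onto [-1, 1]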
EFFECT OF SAMPLING ON RANGE
NORMALISATION
• The minimum and maximum values within the training data will be estimates of the minimum and maximum values of the population

• If the sample is not representative of the population, more out-of-range values will be encountered by the model during scoring/deployment

• Generating the training sample to some degree of confidence that it is representative of the population gives some insight into
  • the likelihood of encountering an out-of-range value
  • the magnitude of the error between the expected (sample) and actual range of the population

• This allows the sample minimum and maximum values to be adjusted prior to use within scaling
DEALING WITH OUT-OF-RANGE VALUES
• Leave as is
  • Implies the state space is not a unit state space
  • A number of mining techniques can handle non unit state spaces

• Ignore the whole record

• Clip the values
  • If the value is larger than the maximum, set it equal to the maximum
  • If the value is smaller than the minimum, set it equal to the minimum

• If the out-of-range values are indeed valid, these techniques will result in the loss of information and introduce a bias in the data

• What would be a more robust way of dealing with possible out-of-range values?
  • What additional information is required to do this?
PLANNING FOR OUT-OF-RANGE VALUES
• Reduce the part of the transformed range that holds in-range values
  • How much of the range should be kept for linear transformation and how much for out-of-range values?
  • Depends on the distribution of the attribute, as it allows inferences about the population limits
  • Use the confidence measure that the sample is representative of the population
    • 95% confidence would suggest keeping the linear transformation within the range [0.025, 0.975]

• For out-of-range values
  • the larger the value, the lower the chance that it will appear
  • a good scaling should
    • reflect this fact, as the more likely out-of-range values will be closer to the linear scaling
    • never reach its limit, as this would result in a larger value getting a transformed value greater than 1
  • No two values should map to the same transformed value
ALTERNATIVE SCALING
• Softmax Scaling
– Takes one parameter that defines the extent of the linear scaling
– Never reaches the value of 0 or 1
– Uses the Logistic function for scaling

  $f(x) = \frac{1}{1 + e^{-s(x)}}$

– To factor in the extent of the linear scaling, the softmax function is defined as

  $s(x) = \frac{x - \bar{x}}{r\,\sigma_x}$

  • where r is a user defined parameter that defines the range of the variable to be linearly scaled in terms of the number of standard deviations around the mean
SOFTMAX SCALING
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit   # logistic function 1 / (1 + exp(-z))

def softmax(r, x):
    # scale deviations from the mean by r standard deviations,
    # then squash through the logistic function
    mu = np.mean(x)
    sig = np.std(x)
    y = expit((x - mu) / (r * sig))
    return y

# x is the attribute to be scaled (e.g. the array defined on the earlier slide)
y1 = softmax(0.25, x)
y2 = softmax(0.5, x)
y3 = softmax(1, x)
plt.scatter(x, y1, c="r", label="r=0.25")
plt.scatter(x, y2, c="b", label="r=0.5")
plt.scatter(x, y3, c="c", label="r=1")
plt.legend()
plt.xlabel("x")
plt.ylabel("soft_x")
MISSING VALUES AND NOISE
MISSING VALUES AND NOISE
• Reasons for Missing Values
– Data not available
  • A data survey may show some default values input during data collection
– Data value not appropriate

• Why do these need to be handled explicitly?
– Some modelling techniques cannot handle missing values
  • If they do, they may use undesirable ways to deal with them
– The modeller has more control over how they are handled
– There may be a domain meaning to the missing value

• Understanding missing value patterns is similar to any classification data mining task
– Significant patterns may allow you to predict whether the attribute value is missing or not
DEALING WITH MISSING VALUES
• Unbiased Estimators
– These are estimates for the data that do not change important data characteristics
– What are the important characteristics?
  • Mean
  • Standard Deviation
  • Correlation with other variables

• Mean commonly used to fill numeric attributes
• Mode used to fill categorical attributes

• These generate bias within the distribution and are not necessarily valid values
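A minimal pandas sketch of mean and mode imputation on a hypothetical data frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [52000, np.nan, 61000, 47000, np.nan],
    "region": ["N", "S", None, "S", "S"],
})

# Mean for numeric attributes, mode for categorical ones
df["income"] = df["income"].fillna(df["income"].mean())
df["region"] = df["region"].fillna(df["region"].mode()[0])
print(df)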
USING MODELS TO FILL MISSING VALUES
• Preserving the standard deviation is more accurate as it takes the mean and the variation around it as the basis for the prediction

• Most data mining tools can be used to predict missing values by training on instances that have no missing values
  • Less bias is introduced into the data as correlation among variables is taken into account when predicting missing values
  • Can introduce bias if the missing values show a strong pattern of their own, as the training data used to build the model is then biased

• Use Linear Regression (or any other supervised learning algorithm) to fill missing values
  • Takes interaction between attributes into account
  • Assumes a linear relationship
  • Does fairly well unless the relationship is very non-linear
DIMENSIONALITY REDUCTION
J. Shlens. A Tutorial on Principal Component Analysis.
https://www.cs.cmu.edu/~elaw/papers/pca.pdf
MATRIX FACTORIZATION
• The ratings are a reflection of the user's taste, which is driven by certain domain dependent latent factors e.g. plot, actor, cinematography etc.
• Assuming k latent factors, we can approximate the rating matrix

  $R \approx PQ^T$

  where P and Q are n x k and m x k matrices respectively
  • P is a representation of users based on their k interests
  • Q is a representation of movies based on their ability to fulfil the user's preferences
• Typically an iterative optimization algorithm is used to learn P and Q such that they minimize

  $\sum_{ij}\left(r_{ij} - \hat{r}_{ij}\right)^2 + \frac{\beta}{2}\left(\|P\|^2 + \|Q\|^2\right)$

  where $\beta = 0.02$ and $\hat{r}_{ij} = p_i \cdot q_j^T$

• Multiplying P and Q to obtain an approximation of R provides predicted ratings for items that a user has not previously consumed (or at least not rated)

Number of Parameters?
THE DATA: USER ITEM RATING MATRIX (R)
[Figure: example latent factors such as Cinematography, Animation, Special Effects, Disney, Dharma, Pixar, Action/Thriller, Action Heroes, Amir Khan, Comedy]

  $f: U \times I \rightarrow [1, r]$
  $f: U \times I \times C \rightarrow [1, r]$
Rating Prediction using Matrix Factorization

i1 i2 i3 i4 i5 i6
Tim 2 4 3
Tom 3 4 3 1
Peter 1 3 4
Paul 4 4 1
Kathy 3 2 4

[Figure: R is decomposed into a user-factor matrix (users × {h1, h2}) and an item-factor matrix (items × {h1, h2}); the learned values are shown on the next slide]
Rating Prediction using Matrix Factorization

Observed ratings (as on the previous slide) and the learned factor matrices:

User factors (h1, h2):        Item factors (h1, h2):
  Tim    1.38  1.22             i1  0.69   0.04
  Tom    0.51  1.70             i2  1.13   0.43
  Peter  1.39  0.78             i3  1.46   1.43
  Paul   0.80  1.88             i4  0.41   2.10
  Kathy  2.41  0.46             i5  2.09   1.20
                                i6  1.66  -0.06

Predicted ratings (P · Qᵀ):
         i1    i2    i3    i4    i5    i6
  Tim    1.01  2.10  3.78  3.14  4.37  2.21
  Tom    0.42  1.32  3.18  3.78  3.11  0.73
  Peter  0.99  1.92  3.16  2.22  3.85  2.26
  Paul   0.64  1.74  3.88  4.29  3.95  1.21
  Kathy  1.69  2.93  4.20  1.98  5.61  3.97
import numpy as np
import scipy.sparse

def MF(X, num_dims, step_size, epochs, thres, lam_da):
    # X is a scipy.sparse CSR rating matrix; P and Q are initialised randomly
    P = scipy.sparse.rand(X.shape[0], num_dims, 1, format='csr')
    P = scipy.sparse.csr_matrix(P / scipy.sparse.csr_matrix.sum(P, axis=1))
    Q = scipy.sparse.rand(num_dims, X.shape[1], 1, format='csr')
    Q = scipy.sparse.csr_matrix(Q / scipy.sparse.csr_matrix.sum(Q, axis=0))

    prev_error = 0
    for iterat in range(epochs):
        # make_sparse (helper defined elsewhere) keeps only the entries of P.dot(Q)
        # at the positions of the observed ratings in X
        errors = X - make_sparse(P.dot(Q), X.indices, X.indptr)
        mse = np.sum(errors.multiply(errors)) / len(X.indices)
        if abs(mse - prev_error) > thres:
            # gradient steps with L2 regularisation (lam_da)
            P_new = P + step_size * 2 * (errors.dot(Q.T) - lam_da * P)
            Q_new = Q + step_size * 2 * (P.T.dot(errors) - lam_da * Q)
            P = P_new
            Q = Q_new
            prev_error = mse
        else:
            break
        if iterat % 1 == 0:
            print(iterat, mse)
    print(P.dot(Q).todense())
    return P, Q
Movielens

• Number of Users: 610
• Number of movies: 9724
• Number of Ratings: 100,836

Output (iteration, MSE):
0 12.658056670291035
1 12.3679403784389
2 12.054535441993796
3 11.71142099647553
4 11.333958784038623
5 10.919755000620956
……
95 1.224751090129586
96 1.215996873239419
97 1.2074389361582032
98 1.1990709981876604
99 1.1908870413582395
Stochastic Gradient Descent
● Stochastic Gradient Descent
  ● Prediction error for each (user, item) pair

    $e_{ui} = r_{ui} - q_i^T p_u$

  ● Update P and Q

    $q_i \leftarrow q_i + \gamma(e_{ui} \cdot p_u - \lambda \cdot q_i)$
    $p_u \leftarrow p_u + \gamma(e_{ui} \cdot q_i - \lambda \cdot p_u)$

  ● Loop through all (user, item) pairs

● Alternating Least Squares (ALS)
  ● qi and pu are both unknown, hence the cost function is not convex
  ● Alternately fix qi and pu so as to make the cost function quadratic and hence convex
Incorporating Biases

● Decomposing the rating

  $r_{ui} = \mu + r_i + r_u + \hat{r}_{ui}$

  ● μ is the average rating, rᵢ the item bias, rᵤ the user bias, and r̂ᵤᵢ the interaction rating capturing how the user deviates from the average on this item

● Example
  ● Everyone loves Star Wars (rᵢ = 0.8)
  ● Derek is a harsh rater (rᵤ = -0.4)
  ● Average rating μ = 3.4
  ● Derek's interaction with Star Wars contributed 0.5
  ● rui = 3.4 + 0.8 - 0.4 + 0.5 = 4.3

● How do we learn the biases?
Learning the biases

● Decouple the computation of bᵢ and bᵤ

  $b_i = \frac{\sum_{u \in R(i)}(r_{ui} - \mu)}{\lambda_1 + |R(i)|}$

  $b_u = \frac{\sum_{i \in R(u)}(r_{ui} - \mu - b_i)}{\lambda_2 + |R(u)|}$

● λ₁ and λ₂ are regularization parameters that shrink the biases towards zero when the number of ratings for an item or ratings by a user is small
● Alternatively, compute bᵢ and bᵤ symmetrically by minimizing the cost function

  $b^* = \underset{(b_i, b_u)}{\arg\min} \sum_{(u,i)\in\mathcal{K}}(r_{ui} - \mu - b_i - b_u)^2 + \lambda\left(\sum_u b_u^2 + \sum_i b_i^2\right)$

Number of Parameters?

Number of Parameters?
Adding in the user item interaction term

● The predicted rating is

  $\hat{r}_{ui} = \mu + b_i + b_u + p_u q_i^T$

● Assuming k dimensions are used in decomposing the rating matrix, the number of unknown parameters in this model is (n+m)(k+1)
● The cost function is defined as

  $b^* = \underset{(b_i, b_u)}{\arg\min} \sum_{(u,i)\in\mathcal{K}}(r_{ui} - \mu - b_i - b_u - p_u q_i^T)^2 + \lambda\left(\sum_u b_u^2 + \sum_i b_i^2 + \|P\|^2 + \|Q\|^2\right)$

● Update rules, with $e_{ui} = r_{ui} - \hat{r}_{ui}$

  $b_i \leftarrow b_i + \gamma(e_{ui} - \lambda \cdot b_i)$
  $b_u \leftarrow b_u + \gamma(e_{ui} - \lambda \cdot b_u)$
  $q_i \leftarrow q_i + \gamma(e_{ui} \cdot p_u - \lambda \cdot q_i)$
  $p_u \leftarrow p_u + \gamma(e_{ui} \cdot q_i - \lambda \cdot p_u)$

  γ is the learning rate (step size) and λ is the regularization constant
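A minimal NumPy sketch of one SGD epoch applying the update rules above; the toy ratings, matrix sizes and hyper-parameter values are illustrative only:

import numpy as np

def sgd_epoch(ratings, P, Q, b_u, b_i, mu, gamma=0.005, lam=0.02):
    # One SGD pass over observed (user, item, rating) triples
    for u, i, r in ratings:
        pred = mu + b_i[i] + b_u[u] + P[u].dot(Q[i])
        e = r - pred
        b_i[i] += gamma * (e - lam * b_i[i])
        b_u[u] += gamma * (e - lam * b_u[u])
        qi_old = Q[i].copy()
        Q[i] += gamma * (e * P[u] - lam * Q[i])
        P[u] += gamma * (e * qi_old - lam * P[u])
    return P, Q, b_u, b_i

# Toy usage: 3 users, 4 items, k = 2 latent factors
rng = np.random.default_rng(0)
ratings = [(0, 1, 4.0), (1, 2, 3.0), (2, 0, 5.0), (0, 3, 2.0)]
P, Q = rng.normal(scale=0.1, size=(3, 2)), rng.normal(scale=0.1, size=(4, 2))
b_u, b_i = np.zeros(3), np.zeros(4)
mu = np.mean([r for _, _, r in ratings])
for _ in range(100):
    sgd_epoch(ratings, P, Q, b_u, b_i, mu)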
Reducing the number of Parameters

● Typically the number of users is much larger than the number of items, e.g. the Netflix Prize data contained 480,189 users and 17,770 movies with a total of 100,480,507 ratings
● Paterek proposed modelling the user as a linear combination of item vectors

  $p_u = \frac{1}{1 + |I(u)|}\sum_{j \in I(u)} y_j$

● The number of parameters for the matrix factorization part is now O(mk) instead of O((n+m)k)
Not Missing At Random: The Bias in Rating Data

• Missing at Random means that the rating of an item has no bearing on the probability that it is going to be missing

• The complete rating matrix is hidden and some process has hidden a large proportion of the ratings

• If the missing ratings are not missing at random, it is important to understand the process that caused the rating to be missing
RECOMMENDATION AS SUBSET SELECTION

• While accurately predicting ratings is one way to achieve improved recommendation

• It is more important to get the ranking of items correct
– There is only limited real estate to push recommendations to
– Being accurate in rating a large number of items that are not interesting to the user does not impact the perceived value of the recommender

• An alternative problem statement
– Given a set of n items, and a user ua who has consumed m (<< n) items, select N items such that the ratings of these items by the user will be high

• Also referred to as Top-N recommendation
EVALUATION METRICS FOR TOP-N
• Withhold one rating at a time and predict it
– If the test item exists in the recommendation list L(u), it is referred to as a hit and the Top-k hit rate is defined as

  $HR = \frac{\#hits}{|L(u)|}$

– All hits are equal (no credit given for the rank achieved by the item); the Average Reciprocal Hit Rank is defined as

  $ARHR = \frac{1}{|U|}\sum_{u \in U}\frac{1}{|T(u)|}\sum_{i_j \in T(u)}\frac{1}{rank(i_j, L(u))}$

• Alternatively, a set of items per user is withheld from training and precision and recall are computed given the recommendations L(u) of size k for each user (note: k = |L(u)|)

  $precision@k = \frac{1}{|U|}\sum_{u \in U}\frac{|L(u) \cap T(u)|}{|L(u)|}$

  $recall@k = \frac{1}{|U|}\sum_{u \in U}\frac{|L(u) \cap T(u)|}{|T(u)|}$
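A minimal Python sketch of precision@k and recall@k given hypothetical recommendation lists L(u) and withheld test sets T(u):

def precision_recall_at_k(recommended, relevant):
    # recommended: dict user -> list of k recommended items (L(u))
    # relevant: dict user -> set of withheld test items (T(u))
    precisions, recalls = [], []
    for u, L_u in recommended.items():
        T_u = relevant[u]
        hits = len(set(L_u) & T_u)
        precisions.append(hits / len(L_u))
        recalls.append(hits / len(T_u))
    return sum(precisions) / len(precisions), sum(recalls) / len(recalls)

# Hypothetical example with k = 3
recommended = {"u1": ["i1", "i4", "i7"], "u2": ["i2", "i3", "i9"]}
relevant = {"u1": {"i4"}, "u2": {"i3", "i5"}}
print(precision_recall_at_k(recommended, relevant))   # (~0.33, 0.75)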
Adapting Matrix Factorization to TOP-N Recommendation
• Cost function adapted to account for the fact that most missing values are uninteresting
– Missing ratings are imputed (a parameter) and set to rm
– A weight (parameter) is assigned to each rating to account for sparsity and the fact that most ratings are missing

  $w_{ui} = \begin{cases} 1 & \text{if observed rating} \\ w_m & \text{if missing rating} \end{cases}$

– Cost function defined over all items (not just the observed)

  $\sum_{u \in U}\sum_{i \in I} w_{ui}\left(r_{ui} - (r_m + p_u q_i^T)\right)^2 + \lambda\sum_d\left(p_{ud}^2 + q_{id}^2\right)$

– Hyper-parameters rm, wui, and λ are set using cross validation to maximize recall@k (Top-k HR)
STATE SPACE
• Each record in the data can be viewed as a state of the real-world process being modelled
  • For example, a state of a high-value customer

• If each attribute is considered to be a dimension defining a multi-dimensional space, then each record can be represented as a point in that space

• This space is referred to as State Space

• A state space with all attributes normalised to the [0,1] range is called a unit state space

• The plot on the right shows a two-dimensional state space with attributes income and age normalised to the unit scale [0,1]
THE CURSE OF DIMENSIONALITY
• In a two dimensional unit state space, the maximum distance between any two points, using Pythagoras' theorem, is √2

• As we increase the dimensionality of the state space, unless the points have the same value along the additional dimensions, the distance between them increases, to a maximum of √n (where n is the number of dimensions)

• Hence adding dimensions increases the sparsity of the state space as the points in the space are pushed further apart

• Learning from sparse data is less effective, so adding dimensions makes the learning less robust – a fact referred to as the curse of dimensionality
VECTOR ALGEBRA BASICS (1)
• A Vector Space is a collection of objects called vectors that may be scaled and added (conforming to some axioms)
• Vector
  • An entity that has a magnitude (length) and a direction
• Basis
  • A set of vectors {b1,…,bn} such that every vector v in the space can be defined as a unique linear combination of the basis vectors, v = a1b1 + … + anbn, where the ai's are reals
  • No vector in the set can be defined as a linear combination of the other basis vectors
  • Also referred to as a linearly independent spanning set
  • Given a basis B, any n-dimensional vector can be expressed as a column vector (an n x 1 matrix of ai's)
• Length
  • The length of a vector v is the square root of the dot product of the vector with itself
  • A unit vector is a vector of length 1
• Orthogonal
  • A pair of vectors u and v are said to be orthogonal if u.v = 0
• Orthonormal
  • A pair of vectors are said to be orthonormal if they are orthogonal unit vectors
VECTOR ALGEBRA BASICS (2)
• Given a non-negative integer n, we are interested in the space of all n-tuples of real
numbers which forms an n-dimensional Vector space, Rn, called the real coordinate
space
– The basis {e1,…en} is called the standard basis consists of n vectors where ei has a
1 in the i-th coordinate and a 0 elsewhere
– Data sets encountered in data mining can be expressed as a set of vectors within the real coordinate space
• The n*m matrix, X, where each column is a single example represents a data
set of size m in the n-dimensional real coordinate space
• Transformation
– Given
• A data set represented as points in the real coordinate space defined using
some basis, and
• a new basis {p1,p2,…pn} that is a linear combination of the original basis
– The matrix P, where the ith row is pi, is said to transform a set of coordinates to
another set of coordinates. In matrix notation we write Y = PX where,
• X is the original data set, xi’s are the columns of X
• Y is the re-representation of the data using the new basis
TRANSFORMATION EXAMPLE
• Consider the example on the right of a 2 dimensional vector space defined using the standard basis {(1,0),(0,1)}

• Suppose our data set, X, consists of two points (1,0) and (-1,1). Using matrix notation we can write X as the matrix whose columns are these two points

• Suppose now that we want to change the basis to a new pair of basis vectors (shown in red)

• Now let Y be the new coordinates for the tuples in the data set given the new basis, Y = PX
TRANSFORMATION
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# sample 1000 correlated 2-d points and plot them
rv = multivariate_normal.rvs([0.5, -0.2], [[2.0, 0.3], [0.3, 0.125]], 1000).T
plt.scatter(rv[0, :], rv[1, :])
plt.xlabel('X')
plt.ylabel('Y')

# change of basis: rotate the data by 45 degrees and plot again
trans = np.array([[1/np.sqrt(2), 1/np.sqrt(2)], [-1/np.sqrt(2), 1/np.sqrt(2)]])
t_data = trans.dot(rv)
plt.scatter(t_data[0, :], t_data[1, :])
plt.xlabel('X')
plt.ylabel('Y')
PRINCIPAL COMPONENT ANALYSIS
• Given the data set defined by two attributes, a subset of R²
• Is there another basis that is a linear combination of the original basis that best re-expresses our data set?

[Figure: scatter plot of attribute 1 vs attribute 2 showing strongly correlated values]

• Noise
  • Measurement noise must be low for information to be extracted
  • Hence assume that the directions with large variances contain the dynamics of interest

• Redundancy
  • Was it really necessary to record all variables?
  • To what extent does one variable vary with the other (covariance)?
  • Covariance high => high redundancy

• Find the basis which has no covariance between the variables and captures the directions of maximum variance
PRINCIPAL COMPONENT ANALYSIS
• A mathematical procedure that transforms a number of (possibly) correlated attributes into a (smaller) number of uncorrelated attributes called principal components

• Each principal component is a linear combination of the underlying attributes

• The first principal component accounts for as much of the variability in the data as possible
  • Each succeeding component accounts for as much of the remaining variability as possible

• The principal components correspond to the eigenvectors of the covariance matrix
  • The covariance matrix of the new set of variables is a diagonal matrix
EIGEN VECTORS AND PRINCIPAL COMPONENTS
• Let Y be the new coordinates of the data points and let P be the transformation on the original data X that results in Y, i.e. Y = PX
• Covariance of Y:

  $\Sigma_Y = \tfrac{1}{n-1}YY^T = \tfrac{1}{n-1}PX(PX)^T = \tfrac{1}{n-1}PXX^TP^T = P\Sigma_X P^T$

• Let
  – $E_X$ be the matrix of eigenvectors of $\Sigma_X$, and
  – $\Lambda_X$ be the diagonal matrix of eigenvalues
  – As $E_X$ is orthonormal, $E_X^{-1} = E_X^T$

  $\Sigma_Y = P\Sigma_X P^T = PE_X\Lambda_X E_X^T P^T = (PE_X)\Lambda_X(PE_X)^T$

  – Hence if $P = E_X^T$ then $\Sigma_Y = \Lambda_X$
• The rows of P are the eigenvectors of $\Sigma_X$
PRINCIPAL COMPONENT ANALYSIS
From k original variables x1, x2, ..., xk produce k new variables y1, y2, ..., yk:
  y1 = a11x1 + a12x2 + ... + a1kxk
  y2 = a21x1 + a22x2 + ... + a2kxk
  ...
  yk = ak1x1 + ak2x2 + ... + akkxk

The yk's are the Principal Components, such that:
  the yk's are uncorrelated (orthogonal)
  y1 explains as much as possible of the original variance in the data set
  y2 explains as much as possible of the remaining variance
  etc.
PRINCIPAL COMPONENTS
• The number of principal components is the same as the original dimensionality of the state space
• The ranking of the principal components provides a method of choosing a smaller number of PCs than the original dimensionality
  • The eigenvalue associated with a PC is a measure of the amount of variation explained by the PC
  • If the variability explained by a PC is relatively small, it may be noise in the "signal" and can be ignored

[Figure: data in the (Attribute 1, Attribute 2) plane with principal components PC1 and PC2 and associated eigenvalues λ1 > λ2]
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.datasets import make_blobs


data=make_blobs(n_samples = 1000, n_features=3,centers = 2,random_state = 101)

from sklearn.preprocessing import MinMaxScaler


scaler=MinMaxScaler()
scaled_data = scaler.fit_transform(data[0])

x_data = scaled_data[:,0]
y_data = scaled_data[:,1]
z_data = scaled_data[:,2]

from mpl_toolkits.mplot3d import Axes3D


fig = plt.figure()
ax = fig.add_subplot(111,projection='3d')
ax.scatter(x_data,y_data,z_data,c=data[1])
AFTER PCA
mean_vector = np.average(scaled_data, axis=0)
diff = scaled_data - mean_vector
cov_matrix = diff.T.dot(diff)/(scaled_data.shape[0]-1)   # equivalently: np.cov(scaled_data.T)
eig_val_cov, eig_vec_cov = np.linalg.eig(cov_matrix)

Eigen Values:
[0.22489886, 0.00479848, 0.01377635]

Eigen Vectors (columns):
[[-0.38685896,  0.88423682, -0.26165893],
 [ 0.13081827, -0.22825677, -0.96477221],
 [ 0.91281253,  0.40746055,  0.02737116]]

Explained variance:
eig_val_cov/sum(eig_val_cov)
[0.923, 0.019, 0.056]

# project onto the two components with the largest eigenvalues (indices 0 and 2)
t_data = eig_vec_cov.T[[0, 2], :].dot(scaled_data.T)
USING AUTOENCODERS
• A neural network with the same number of
nodes in the input and output layer
• Objective: encode the input using a set of hidden layers in a way that can reproduce the input (output an approximation of the input)
• No restriction on the number of hidden
layers or the number of nodes within each
hidden layer
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected
num_inputs = 3
num_hidden = 2
num_outputs = num_inputs
learning_rate = 0.01

X=tf.placeholder(tf.float32,shape=[None,num_inputs])
hidden = fully_connected(X,num_hidden,activation_fn=None)
outputs = fully_connected(hidden,num_outputs,activation_fn=None)

loss = tf.reduce_mean(tf.square(outputs-X))
optimizer = tf.train.AdamOptimizer(learning_rate)
train = optimizer.minimize(loss)
init = tf.global_variables_initializer()
[Figure: 3-2-3 autoencoder network]

Parameters to be learned:
bias1 and bias2
W: 3x2 matrix of weights connecting input to hidden layer
W': 2x3 matrix of weights connecting hidden layer to output
num_steps = 1000
tvars = tf.trainable_variables()

with tf.Session() as sess:
    sess.run(init)
    for iter in range(num_steps):
        sess.run(train, feed_dict={X: scaled_data})
    fc1_var_np = sess.run([tvars])
    output_2d = hidden.eval(feed_dict={X: scaled_data})
    print(fc1_var_np)

var_names = []
for var in tvars:
    var_names.append(var.name)
Bias and Weight Vectors:


[-0.28791779 -0.27566475]
[[-0.77610379 -0.6351704]
[ 0.8446877 0.18886252]
[ 0.33508259 0.74900287]]
SINGULAR VALUE DECOMPOSITION
• Consider the set {v1, v2, …, vr} of orthonormal eigenvectors of XᵀX
  • r is the rank of XᵀX
  • Then, by definition, $(X^TX)v_i = \lambda_i v_i$

• Define the set of vectors {u1, u2, …, ur} such that

  $u_i = \frac{1}{\sigma_i}Xv_i$, where $\sigma_i = \sqrt{\lambda_i}$

• It turns out that the uᵢ's are also orthonormal
• Hence $XV = U\Sigma$, or $X = U\Sigma V^T$
  • Σ is a diagonal matrix with diagonal elements $\sigma_i$
  • Note also that $XX^T = (U\Sigma V^T)(U\Sigma V^T)^T = U\Sigma^2 U^T$

• So SVD decomposes a rectangular matrix X (n x m) into the product of three matrices U (n x r), Σ (r x r) and Vᵀ (r x m)
  • U defines the rows in terms of r orthonormal vectors
  • V defines the columns in terms of r orthonormal vectors
  • Σ is a diagonal matrix containing the scaling (singular) values
• The maximum number of vectors (dimensions of the new space, r) is min(m, n)
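A short NumPy sketch: np.linalg.svd decomposes a matrix into U, the singular values and Vᵀ, and keeping the r largest singular values gives a rank-r approximation:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))              # n x m data matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
# X is recovered as U @ diag(s) @ Vt
print(np.allclose(X, U @ np.diag(s) @ Vt))

# Rank-2 approximation: keep the two largest singular values
r = 2
X2 = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]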
FEATURE EXTRACTION: LATENT SEMANTIC ANALYSIS
• Assumes some latent semantic space obscured by the randomness of word choice e.g. human vs user
– Lexical level ("what was said") vs Semantic level ("what was meant")
– Method for removing "noise" in the "semantic signal"
  • Estimating the hidden concept space (associates syntactically different but semantically equivalent terms/documents)

• Given an m x n matrix X consisting of the word frequency counts for m words in n documents
• A singular value decomposition of X results in three matrices U, S and Vᵀ
– The singular values in S provide a basis for reducing the dimensionality of U and V by choosing the r largest values
– Each row, uᵢ, in U is a representation of a word in r-dimensional space
  • Columns of U (left singular vectors) are the eigenvectors of XXᵀ
– Each row, vᵢ, in V is a representation of a document in r-dimensional space
  • Columns of V (right singular vectors) are the eigenvectors of XᵀX
SVD EXAMPLE
S (singular values on the diagonal):
  3.34  2.54  2.35  1.64  1.5  1.31  .85  .56  .36

U (one row per term):
  .22 -.11  .29 -.41 -.11 -.34  .52 -.06 -.41
  .2  -.07  .14 -.55  .28  .5  -.07 -.01 -.11
  .24  .04 -.16 -.59 -.11 -.25 -.3   .06  .49
  .4   .06 -.34  .1   .33  .38  0    0    .01
  .64 -.17  .36  .33 -.16 -.21 -.17  .03  .27
  .27  .11 -.43  .07  .08 -.17  .28 -.02 -.05
  .27  .11 -.43  .07  .08 -.17  .28 -.02 -.05
  .30 -.14  .33  .19  .11  .27  .03 -.02 -.17
  .21  .27 -.18 -.03 -.54  .08 -.47 -.04 -.58
  .01  .49  .23  .03  .59 -.39 -.29  .25 -.23
  .04  .62  .22  0   -.07  .11  .16 -.68  .23
  .03  .45  .14 -.01 -.3   .28  .34  .68  .18

VT (one column per document):
  .2   .61  .46  .54  .28  0    .01  .02  .08
 -.06  .17 -.13 -.23  .11  .19  .44  .62  .53
  .11 -.5   .21  .57 -.51  .1   .19  .25  .08
 -.95 -.03  .04  .27  .15  .02  .02  .01 -.03
  .05 -.21  .38 -.21  .33  .39  .35  .15 -.6
 -.08 -.26  .72 -.37  .03 -.3  -.21  0    .36
  .18 -.43 -.24  .26  .67 -.34 -.15  .25  .04
 -.01  .05  .01 -.02 -.06  .45 -.76  .45 -.07
 -.06  .24  .02 -.08 -.26 -.62  .02  .52 -.45
LSA EXAMPLE (RECONSTRUCTION WITH 2 DIMENSIONS)

Original term-document matrix X:
             1  2  3  4  5  6  7  8  9
  Human      1  0  0  1  0  0  0  0  0
  Interface  1  0  1  0  0  0  0  0  0
  Computer   1  1  0  0  0  0  0  0  0
  User       0  1  1  0  1  0  0  0  0
  System     0  1  1  2  0  0  0  0  0
  response   0  1  0  0  1  0  0  0  0
  Time       0  1  0  0  1  0  0  0  0
  EPS        0  0  1  1  0  0  0  0  0
  Survey     0  1  0  0  0  0  0  0  1
  Trees      0  0  0  0  0  1  1  1  0
  Graph      0  0  0  0  0  0  1  1  1
  minors     0  0  0  0  0  0  0  1  1

Reconstruction using only the 2 largest singular values:
             1     2     3     4     5     6     7     8     9
  Human      .16   .4    .38   .47   .18  -.05  -.12  -.16  -.09
  Interface  .14   .37   .33   .4    .16  -.03  -.07  -.1   -.04
  Computer   .15   .51   .36   .41   .24   .02   .06   .09   .12
  User       .26   .84   .61   .7    .39   .03   .08   .12   .19
  System     .45  1.23  1.05  1.27   .56  -.07  -.15  -.21  -.05
  response   .16   .58   .38   .42   .28   .06   .13   .19   .22
  Time       .16   .58   .38   .42   .28   .06   .13   .19   .22
  EPS        .22   .55   .51   .63   .24  -.07  -.14  -.2    .11
  Survey     .1    .53   .23   .21   .27   .14   .31   .44   .42
  Trees     -.06   .23  -.14  -.27   .14   .24   .55   .77   .66
  Graph     -.06   .34  -.15  -.3    .2    .31   .69   .98   .85
  minors    -.04   .25  -.1   -.21   .15   .22   .5    .71   .62

The matrix resulting from the reduced number of dimensions is the optimal
approximation (with respect to least squares error) of the original matrix X
WORD SIMILARITY
[Scatter plot: each term plotted using its first two coordinates in U (the term matrix shown in the SVD example above); semantically related terms appear close together]
DOCUMENT SIMILARITY
[Scatter plot: each document plotted using its first two coordinates from Vᵀ (the document matrix shown in the SVD example above); related documents appear close together]
MULTI-DIMENSIONAL SCALING
• Method for reducing the dimensionality of data

• Aims to map data onto a lower dimensionality while minimising the loss of information

• May be used for
  • Reducing the number of attributes in the data
  • Reducing the number of numeric attributes that a categorical attribute is mapped onto

• Uses orientation (rotation) as a mechanism for reducing the overall loss in precision due to mapping onto a space of a lower dimension
MDS EXAMPLE
• Consider three points (a, b, c) in two dimensional space, represented by the vertices of a triangle with sides a-b = 4, b-c = 3 and a-c = 5
  • The sides of the triangle represent the distances between the points in Euclidean space

• Projecting these points onto a single dimension can be carried out directly, but will lead to some loss of information
  • In particular, the distances between the three points will change

• The distances of these points from each other are invariant to rotation
  • However, as can be seen from the example, projecting the points after a rotation can reduce the loss in information

  Pair           Distance   Project   Rotate & Project
  a b            4          4         2.86
  b c            3          0         2.14
  a c            5          4         5
  Total Error               4         2
from sklearn import manifold
from sklearn.metrics import euclidean_distances
similarities = euclidean_distances(scaled_data)
from sklearn.manifold import MDS
model = MDS(n_components=2,
dissimilarity='precomputed', random_state=1)
out = model.fit_transform(similarities)
plt.scatter(out[:, 0], out[:, 1],c=data[1])
plt.axis('equal')
SCREE PLOTS AND STRESS
• The loss in information caused by projecting multi-dimensional data into a lower dimensionality space is referred to as stress
• Stress is the percentage change in total pairwise distance between points in the two spaces
• For the example of three 2-dimensional points projected onto a single dimension, the stress values are 0.33 (4/12) and 0.16 (2/12) respectively

• A Scree plot plots the stress against the number of dimensions of the projected space
• The point at which the relative change in stress per dimension increases indicates the optimal number of dimensions for the data

[Figure: scree plot of Stress (y-axis) against Dimensions (x-axis, 1 to 22)]
Sampling
• Why sample data in the context of data mining?
– Isn't data mining all about using all the data to discover patterns?

• What is a sample?
– A sample is a subset of the total data (population)
– The data that is available in a database is generally a subset of the total data in any case
  • A database in a bank only contains its customers' data, not data on ALL mortgage customers
  • knowledge discovered from the data needs to be valid for the population for it to be useful

• Also, when data mining we often require three samples of the available data
– Training
– Validation
– Test
Types of Sampling
• Random Sampling
– By far the most commonly used sampling technique
– Any instance in the population has an equal chance to be a
member of the sample

• Stratified Sampling
– Used when certain attributes within the data have very skewed
distributions, for example, fraudulent Vs non-fraudulent
transactions
– Sample taken from each partition of the data based on the
skewed attribute
• x% random sample of the fraudulent transactions
• y% random sample of the non-fraudulent transactions
– Some modelling techniques may not be able to handle the skew in the distribution, requiring a larger % of the fraudulent transactions to be sampled than of the non-fraudulent ones
Topic 4: Wrap-Up
• Methods for auditing data
– Gaining insights into the characteristics of data attributes
– Measuring relationships between two attributes
• Transforming attributes
– from categorical to numeric
  • Domain based mapping
  • 1-to-n mapping
  • Using associations with numeric attributes
– Numeric to categorical
  • Binning
    – Simple range and frequency based methods
    – Entropy based methods
• Normalizing distributions
• Handling Missing Values
• Methods for dimensionality reduction
– Multi-dimensional Scaling
– Principal Component Analysis
• Sampling
