• Ordinal
– An ordered set of distinct values
– Measure of distance between values not necessarily defined
– Examples: AgeBracket, Dates
• Numeric
– An infinite set of values
– Defined by a valid range
– Arithmetic operations can be defined
– Measure of distance between two values is defined
– Examples: Length of Relationship, Time to Response
Descriptive Statistics (Categorical)
• Modal Value (Mode)
– The most frequent value
– Multi-modal distributions have multiple values with high frequency
• Median Value
(The example figure shows the distribution of the Age bracket attribute, with its modal value highlighted.)
Descriptive Statistics (Numeric)
• Mean
– $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
– Sensitive to extreme values
• Weighted Average
– When not all values are equal, each value may have an associated weight, $w_i$
– $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
• Trimmed Mean
– Mean obtained by removing (giving a weight of 0 to) the values in the top and bottom s% (the value of s is normally small)
• Median
– Less sensitive to skew (outliers)
– If n is odd, the median is the middle value of the ordered set
– If n is even then the median is the average of the two middle values
– Approximate value may be calculated from grouped data (see next slide)
• Mode
– The value that occurs the most frequently
– Data may be multi-modal (have multiple values with maximum frequency)
• For a unimodal frequency curve with moderate skew: mean − mode ≈ 3 × (mean − median)
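These measures can be sketched with numpy/scipy; a minimal illustration (the sample and weights below are made up):

import numpy as np
from scipy import stats

data = np.array([2, 3, 3, 4, 5, 5, 5, 6, 7, 40])   # 40 is an extreme value

print(np.mean(data))                   # mean: pulled upward by the outlier
print(np.average(data, weights=np.arange(1, 11)))  # weighted mean (illustrative weights)
print(stats.trim_mean(data, 0.1))      # trimmed mean: drops top and bottom 10%
print(np.median(data))                 # median: robust to the outlier
vals, counts = np.unique(data, return_counts=True)
print(vals[counts.argmax()])           # mode: the most frequent value (5)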
Approximating median
• Bin data and calculate frequency
• Approximate median from grouped data: $median \approx L_1 + \frac{N/2 - (\sum f)_{below}}{f_{median}} \times width$, where $L_1$ is the lower boundary of the bin containing the middle observation, $(\sum f)_{below}$ is the total frequency of the bins below it, and $f_{median}$ is its frequency
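A minimal sketch of this interpolation, assuming values are spread evenly within each bin (function name and example bins are illustrative):

import numpy as np

def grouped_median(counts, edges):
    # counts[k] = frequency of bin k; edges = bin boundaries (len = len(counts)+1)
    N = np.sum(counts)
    cum = np.cumsum(counts)
    k = np.searchsorted(cum, N / 2)        # bin containing the middle observation
    below = cum[k - 1] if k > 0 else 0     # total frequency below the median bin
    width = edges[k + 1] - edges[k]
    return edges[k] + (N / 2 - below) / counts[k] * width

# e.g. ages binned as [0,10), [10,20), [20,30), [30,40)
print(grouped_median(np.array([5, 20, 40, 15]), np.array([0, 10, 20, 30, 40])))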
Boxplot Analysis
• Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum
• Boxplot
– Data is represented with a box
– The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
– The median is marked by a line within the box
– Whiskers: two lines outside the box extend to the Minimum and Maximum
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(style="ticks")
x = np.random.randn(100)

# A narrow boxplot stacked above a histogram sharing the same x-axis
f, (ax_box, ax_hist) = plt.subplots(2, sharex=True,
                                    gridspec_kw={"height_ratios": (.15, .85)})
sns.boxplot(x=x, ax=ax_box)
sns.histplot(x, kde=True, ax=ax_hist)  # distplot is deprecated in newer seaborn
ax_box.set(yticks=[])
sns.despine(ax=ax_hist)
sns.despine(ax=ax_box, left=True)
plt.show()
Histograms
• A univariate graphical method
– Summarizes the distribution of a single attribute
– Partitions the data into n disjoint partitions (intervals) or buckets
• Width of bucket is typically uniform (equal-width histogram)
• Displayed as a set of rectangles (one per bucket)
– Height reflects the count or frequency of the interval in the given data
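A minimal equal-width histogram in matplotlib (illustrative data):

import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(1000)    # illustrative data
plt.hist(x, bins=10)         # 10 equal-width buckets; bar height = frequency
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()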
• Correlation coefficient: $r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,\sigma_x \sigma_y}$
• Properties of the covariance, $c(X,Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$
– If attributes X and Y tend to increase together, then c(X,Y) > 0
– If attribute X tends to decrease when attribute Y increases, then c(X,Y) < 0
– If attributes X and Y are independent, then c(X,Y) = 0
• Notice that
– c(X,X) = Variance of X
Standardisation
• The covariance between two pairs of variables cannot be compared directly due to differences in scale and variance
• The correlation coefficient (r) is the slope of the regression line drawn in the state space defined by X and Y when both X and Y are standardised:
$r = \frac{\sum_{i=1}^{n} z_{x_i} z_{y_i}}{n - 1}$
• The coefficient of determination (r²) is the square of the correlation coefficient
– represents the proportion of the variance of X explained by Y
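A quick numerical check on made-up data: the mean product of z-scores reproduces numpy's correlation coefficient.

import numpy as np

x = np.random.randn(200)
y = 0.7 * x + 0.3 * np.random.randn(200)      # y is correlated with x

print(np.cov(x, y)[0, 1])                     # c(X,Y) > 0: they increase together
zx = (x - x.mean()) / x.std(ddof=1)           # standardise both attributes
zy = (y - y.mean()) / y.std(ddof=1)
r = np.sum(zx * zy) / (len(x) - 1)            # r from the z-scores
print(r, np.corrcoef(x, y)[0, 1])             # matches numpy's correlation
print(r ** 2)                                 # coefficient of determination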
Cross Tabulation
Chi Square Test (χ²)
• Works by comparing the actual frequencies in each cell of the cross-tabulation with the expected frequency if there were no interaction between the two variables in the cross-tabulation
– How is the expected frequency calculated? $e_{ij} = \frac{R_i \times C_j}{T}$, where $R_i$ and $C_j$ are the row and column totals and $T$ is the grand total
Expected frequencies:
        X1      X2      Total
Y1      21.875  13.125  35
Y2      28.125  16.875  45
Total   50      30      80
Measuring the Strength of a Relationship
• The Chi-square statistic tells you if the relationship is statistically significant
– For 1 degree of freedom and 95% confidence, χ² > 3.84 is required to reject the hypothesis that the two variables are independent of each other
• Interpretation
– φ² × 100 (where φ² = χ²/T) is the percentage of variation within one variable attributed to the other variable
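scipy reproduces both the expected frequencies and the statistic; the observed counts below are hypothetical but consistent with the slide's margins (row totals 35 and 45, column totals 50 and 30):

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[25, 10],     # hypothetical observed counts
                     [25, 20]])
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(expected)          # [[21.875 13.125] [28.125 16.875]], i.e. eij = Ri*Cj/T
print(chi2, chi2 > 3.84) # compare against the 95% critical value for 1 d.o.f.
print(chi2 / observed.sum() * 100)   # phi-squared * 100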
Relationship between Categorical and Numeric Attributes
• Figure shows the distribution of shoe size in the population (Total) as well as that partitioned by Gender
(Figure: entropy-based binning; records are sorted by the attribute being binned, and candidate bin boundaries up to the maximum value are scored by information gain, IG, to find the ideal bin boundary.)
MEASURING INFORMATION GAIN
• After binning, the independent attribute partitions the data into k partitions, Sj, where k is the cardinality of the attribute
• $IG = H(S) - \sum_{j=1}^{k} \frac{|S_j|}{|S|} H(S_j)$, the reduction in entropy achieved by the partitioning (a sketch follows)
(Example: candidate binnings of LENGTH_OF_RELATIONSHIP scoring IG = 0, IG = 0.000017, and IG = 0.0028.)
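A sketch of the information-gain computation for one candidate binning, assuming a class label as the target (data and names are illustrative):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(bins, labels):
    # bins: bin id of each record; labels: target class of each record
    H = entropy(labels)
    for b in np.unique(bins):
        mask = bins == b
        H -= mask.mean() * entropy(labels[mask])  # weighted entropy of partition S_b
    return H

bins = np.array([0, 0, 0, 1, 1, 1, 2, 2])
labels = np.array(['y', 'y', 'n', 'n', 'n', 'n', 'y', 'y'])
print(information_gain(bins, labels))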
RANGE NORMALISATION
• min and max are the minimum and maximum values for the attribute being normalised
– More generally, if the new range for the attribute is [nmin, nmax]:
$f(x) = \frac{x - min}{max - min} (nmax - nmin) + nmin$
• Range normalisation does not affect the distribution of the values within the range, i.e. the shape of the distribution remains unaltered
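A direct translation of f(x); a minimal sketch where nmin and nmax default to [0, 1]:

import numpy as np

def range_normalise(x, nmin=0.0, nmax=1.0):
    # map x linearly from [min(x), max(x)] onto [nmin, nmax]
    lo, hi = np.min(x), np.max(x)
    return (x - lo) / (hi - lo) * (nmax - nmin) + nmin

x = np.array([3.0, 7.0, 10.0, 15.0])
print(range_normalise(x))   # [0.0, 0.333..., 0.583..., 1.0]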
EFFECT OF SAMPLING ON RANGE NORMALISATION
• The minimum and maximum values within the training data will be estimates of the minimum and maximum values of the population, so new data may fall outside the observed range
• If the out-of-range values are indeed valid, techniques that force them back into the range will result in the loss of information and introduce a bias in the data
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit  # the logistic sigmoid

def softmax(r, x):
    # Softmax scaling: squash x into (0, 1); r controls the width
    # (in standard deviations) of the roughly linear central region
    mu = np.mean(x)
    sig = np.std(x)
    y = expit((x - mu) / (r * sig))
    return y

x = np.random.randn(200)  # illustrative data
y1 = softmax(0.25, x)
y2 = softmax(0.5, x)
y3 = softmax(1, x)
plt.scatter(x, y1, c="r", label="r=0.25")
plt.scatter(x, y2, c="b", label="r=0.5")
plt.scatter(x, y3, c="c", label="r=1")
plt.legend()
plt.xlabel("x")
plt.ylabel("soft_x")
plt.show()
MISSING VALUES AND NOISE
• Reasons for Missing Values
– Data not available
• Data Survey may show some default values input during data collection
– Data value not appropriate
Example user–item ratings matrix (blank = missing rating):
        i1  i2  i3  i4  i5  i6
Tim         2   4   3
Tom             3   4   3   1
Peter   1       3       4
Paul            4   4       1
Kathy       3       2       4
(Figure: the ratings matrix is factorised into a user-factor matrix, users × {h1, h2}, and an item-factor matrix, {h1, h2} × items i1–i6.)
Rating Prediction using Matrix Factorization
User factors P:
        h1    h2
Tim     1.38  1.22
Tom     0.51  1.7
Peter   1.39  0.78
Paul    0.80  1.88
Kathy   2.41  0.46
Item factors Q:
      h1    h2
i1    0.69  0.04
i2    1.13  0.43
i3    1.46  1.43
i4    0.41  2.10
i5    2.09  1.2
i6    1.66  -0.06
Reconstructed ratings P·Qᵀ:
        i1    i2    i3    i4    i5    i6
Tim     1.01  2.10  3.78  3.14  4.37  2.21
Tom     0.42  1.32  3.18  3.78  3.11  0.73
Peter   0.99  1.92  3.16  2.22  3.85  2.26
Paul    0.64  1.74  3.88  4.29  3.95  1.21
Kathy   1.69  2.93  4.2   1.98  5.61  3.97
import numpy as np
import scipy.sparse

def make_sparse(dense, indices, indptr):
    # Assumed helper (not shown in the original slides): project the dense
    # prediction matrix onto the sparsity pattern of X, so that errors are
    # only computed on the observed ratings
    if scipy.sparse.issparse(dense):
        dense = dense.toarray()
    data = np.concatenate([dense[i, indices[indptr[i]:indptr[i + 1]]]
                           for i in range(len(indptr) - 1)])
    return scipy.sparse.csr_matrix((data, indices, indptr), shape=dense.shape)

def MF(X, num_dims, step_size, epochs, thres, lam_da):
    # Random initial factor matrices, normalised so rows/columns sum to 1
    P = scipy.sparse.rand(X.shape[0], num_dims, 1, format='csr')
    P = scipy.sparse.csr_matrix(P / scipy.sparse.csr_matrix.sum(P, axis=1))
    Q = scipy.sparse.rand(num_dims, X.shape[1], 1, format='csr')
    Q = scipy.sparse.csr_matrix(Q / scipy.sparse.csr_matrix.sum(Q, axis=0))
    prev_error = 0
    for iterat in range(epochs):
        # Error on the observed entries only
        errors = X - make_sparse(P.dot(Q), X.indices, X.indptr)
        mse = np.sum(errors.multiply(errors)) / len(X.indices)
        if abs(mse - prev_error) > thres:
            # Gradient-descent step with L2 regularisation (lam_da)
            P_new = P + step_size * 2 * (errors.dot(Q.T) - lam_da * P)
            Q_new = Q + step_size * 2 * (P.T.dot(errors) - lam_da * Q)
            P = P_new
            Q = Q_new
            prev_error = mse
        else:
            break
        if iterat % 1 == 0:
            print(iterat, mse)
    print(P.dot(Q).todense())
    return P, Q
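A usage sketch on the ratings matrix above (the hyperparameter values are illustrative):

# The slide's ratings matrix; zeros denote missing ratings
X = scipy.sparse.csr_matrix(np.array([
    [0, 2, 4, 3, 0, 0],   # Tim
    [0, 0, 3, 4, 3, 1],   # Tom
    [1, 0, 3, 0, 4, 0],   # Peter
    [0, 0, 4, 4, 0, 1],   # Paul
    [0, 3, 0, 2, 0, 4],   # Kathy
], dtype=float))

P, Q = MF(X, num_dims=2, step_size=0.01, epochs=500, thres=1e-6, lam_da=0.1)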
Movielens
Stochastic gradient descent updates:
$q_i \leftarrow q_i + \gamma (e_{ui} \cdot p_u - \lambda \cdot q_i)$
$p_u \leftarrow p_u + \gamma (e_{ui} \cdot q_i - \lambda \cdot p_u)$
Baseline prediction:
$\hat{r}_{ui} = \mu + b_i + b_u + \text{interaction}$
– $\mu$ is the average rating, $b_i$ the item bias, $b_u$ the user bias (how does the user deviate from the average?), and the final term captures the user–item interaction
Damped estimates of the biases:
$b_i = \frac{\sum_{u \in R(i)} (r_{ui} - \mu)}{\lambda_1 + |R(i)|}$
$b_u = \frac{\sum_{i \in R(u)} (r_{ui} - \mu - b_i)}{\lambda_2 + |R(u)|}$
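A sketch of the damped bias estimates, assuming `R` is a dict of observed ratings keyed by (user, item); the λ values are illustrative:

import numpy as np
from collections import defaultdict

def baseline_biases(R, lam1=25.0, lam2=10.0):
    mu = np.mean(list(R.values()))                 # global average rating
    by_item, by_user = defaultdict(list), defaultdict(list)
    for (u, i), r in R.items():
        by_item[i].append(r)
    b_i = {i: sum(r - mu for r in rs) / (lam1 + len(rs))
           for i, rs in by_item.items()}
    for (u, i), r in R.items():
        by_user[u].append(r - mu - b_i[i])
    b_u = {u: sum(devs) / (lam2 + len(devs)) for u, devs in by_user.items()}
    return mu, b_i, b_u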
Number of Parameters?
Adding in the user–item interaction term:
$\hat{r}_{ui} = \mu + b_i + b_u + p_u^T q_i$
– All hits are equal (no credit given for the rank achieved by the item); the Average Reciprocal Hit Rank addresses this and is defined as
$ARHR = \frac{1}{|U|} \sum_{u \in U} \frac{1}{|T(u)|} \sum_{i \in T(u)} \frac{1}{rank(i, L(u))}$
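A minimal sketch of ARHR, assuming `L` maps each user to a ranked recommendation list and `T` to the held-out test items (names are illustrative):

def arhr(L, T):
    # L: dict user -> ranked list of recommended items
    # T: dict user -> set of held-out (test) items
    total = 0.0
    for u, test_items in T.items():
        ranks = {item: pos + 1 for pos, item in enumerate(L[u])}
        # items that were never recommended contribute nothing
        total += sum(1.0 / ranks[i] for i in test_items if i in ranks) / len(test_items)
    return total / len(T)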
• A weight is attached to each cell of the ratings matrix:
$w_{ui} = \begin{cases} 1 & \text{if observed rating} \\ w_m & \text{if missing rating} \end{cases}$
– Cost function defined over all items (not just the observed):
$\sum_{u \in U} \sum_{i \in I} w_{ui} \left( r_{ui} - (r_m + p_u q_i^T) \right)^2 + \lambda \sum_d \left( p_{ud}^2 + q_{id}^2 \right)$
• Suppose our data set, X, consists of two points (1,0) and (-1,1). Using matrix notation we can write
$X = \begin{pmatrix} 1 & 0 \\ -1 & 1 \end{pmatrix}$
• Attributes with large variances contain the dynamics of interest
• Redundancy
– Was it really necessary to record all variables?
(Figure: scatter plot of two strongly correlated attributes, attribute 1 against attribute 2.)
• Find the basis which has no covariance between the variables and captures the directions of maximum variance
PRINCIPAL COMPONENT ANALYSIS
• A mathematical procedure that transforms a number of (possibly) correlated attributes into a (smaller) number of uncorrelated attributes called principal components
• Covariance of Y, where Y = PX and $r = \frac{1}{n-1}$:
$\Sigma_Y = rYY^T = rPX(PX)^T = rPXX^TP^T = P\Sigma_X P^T$
• Let
– $E_X$ be the matrix of eigenvectors of $\Sigma_X$, and
– $\Lambda_X$ be the diagonal matrix of eigenvalues
– As $E_X$ is orthonormal, $E_X^{-1} = E_X^T$
$\Sigma_Y = P\Sigma_X P^T = PE_X\Lambda_X E_X^T P^T = PE_X\Lambda_X (PE_X)^T$
– Hence if $P = E_X^T$ then $\Sigma_Y = \Lambda_X$
• The rows of P are the eigenvectors of $\Sigma_X$
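A numerical check of the derivation on illustrative data: rotating zero-mean data into the eigenvector basis yields a diagonal covariance matrix whose entries are the eigenvalues.

import numpy as np

X = np.random.randn(2, 200)                   # rows = attributes
X[1] += 0.8 * X[0]                            # introduce correlation
X -= X.mean(axis=1, keepdims=True)            # zero-mean each attribute

cov_X = X @ X.T / (X.shape[1] - 1)            # Sigma_X
eig_vals, E_X = np.linalg.eigh(cov_X)         # columns of E_X are eigenvectors
P = E_X.T                                     # rows of P are the eigenvectors
Y = P @ X                                     # rotate the data
print(np.round(Y @ Y.T / (Y.shape[1] - 1), 6))  # diagonal matrix of eigenvalues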
PRINCIPAL COMPONENT ANALYSIS
From k original variables x1, x2, ..., xk, produce k new variables y1, y2, ..., yk (the Principal Components):
y1 = a11x1 + a12x2 + ... + a1kxk
y2 = a21x1 + a22x2 + ... + a2kxk
...
yk = ak1x1 + ak2x2 + ... + akkxk
such that:
• the yk's are uncorrelated (orthogonal)
• y1 explains as much as possible of the original variance in the data set
• y2 explains as much as possible of the remaining variance
• etc.
PRINCIPAL COMPONENTS
• The number of principal components is the same as the original dimensionality of the state space
• The ranking of the principal components provides a method of choosing a smaller number of PCs than the original dimensionality
• The eigenvalue associated with a PC is a measure of the amount of variation explained by that PC
• If the variability explained by a PC is relatively small, it may be noise in the "signal" and be ignored
(Figure: axes PC1 and PC2, with associated eigenvalues λ1 and λ2, overlaid on a scatter of Attribute 1 against Attribute 2.)
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# scaled_data is assumed to be an (n x 3) standardised data matrix
x_data = scaled_data[:, 0]
y_data = scaled_data[:, 1]
z_data = scaled_data[:, 2]

# Eigen-decomposition of the covariance matrix of the scaled data
eig_val_sc, eig_vec_sc = np.linalg.eig(np.cov(scaled_data.T))

# Explained variance per component; the slides report [0.923, 0.019, 0.056]
print(eig_val_sc / sum(eig_val_sc))

# Project a data point onto the 1st and 3rd principal components
t_data = eig_vec_sc.T[[0, 2], :].dot(data[0].T)
USING AUTOENCODERS
• A neural network with the same number of nodes in the input and output layer
• Objective: encode the input using a set of hidden layers in a way that can reproduce the input (output an approximation of the input)
• No restriction on the number of hidden layers or the number of nodes within each hidden layer
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected  # TensorFlow 1.x API

num_inputs = 3
num_hidden = 2            # bottleneck: 3 inputs encoded into 2 dimensions
num_outputs = num_inputs  # the autoencoder reproduces its input
learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=[None, num_inputs])
hidden = fully_connected(X, num_hidden, activation_fn=None)          # linear encoder
outputs = fully_connected(hidden, num_outputs, activation_fn=None)   # linear decoder
loss = tf.reduce_mean(tf.square(outputs - X))  # reconstruction error (MSE)
optimizer = tf.train.AdamOptimizer(learning_rate)
train = optimizer.minimize(loss)
init = tf.global_variables_initializer()
(Figure: 3–2–3 autoencoder. Parameters to be learned: bias1 and bias2; W, the 3x2 matrix of weights connecting the input to the hidden layer; W', the 2x3 matrix of weights connecting the hidden layer to the output.)
num_steps = 1000
tvars = tf.trainable_variables()  # W, W', bias1 and bias2

# A minimal training loop, assuming X_train holds the scaled input data
with tf.Session() as sess:
    sess.run(init)
    for step in range(num_steps):
        sess.run(train, feed_dict={X: X_train})
    codings = sess.run(hidden, feed_dict={X: X_train})  # the 2-D encodings
• Hence XV = US, or X = USVᵀ
• S is a diagonal matrix whose diagonal elements are the singular values $s_i$
• Note also that XXᵀ = (USVᵀ)(USVᵀ)ᵀ = US²Uᵀ
• So SVD decomposes a rectangular matrix X (n×m) into the product of three matrices U (n×r), S (r×r) and Vᵀ (r×m)
• U defines the rows in terms of r orthonormal vectors
• V defines the columns in terms of r orthonormal vectors
• S is a diagonal matrix containing scaling (singular) values
• Maximum number of vectors (dimensions of the new space, r) is min(m, n)
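A quick numpy check of these identities on an illustrative random matrix:

import numpy as np

X = np.random.randn(5, 3)                          # illustrative n x m matrix
U, s, Vt = np.linalg.svd(X, full_matrices=False)   # r = min(n, m) = 3
print(np.allclose(X, U @ np.diag(s) @ Vt))         # X = U S V^T
print(np.allclose(X @ Vt.T, U @ np.diag(s)))       # XV = US
print(np.round(s ** 2, 6))                         # nonzero eigenvalues of X X^T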
FEATURE EXTRACTION: LATENT SEMANTIC ANALYSIS
• Assumes some latent semantic space obscured by randomness of word
choice e.g. human vs user
– Lexical level (“what was said”) Vs Semantic level (“what was meant”)
– Method for removing “noise” in the “semantic signal”
• Estimating the hidden concept space (associates syntactically different but semantically
equivalent terms/ documents)
• Given an mxn matrix X consisting of the word frequency counts for m words
in n documents
• A singular value decomposition of X results in three matrices U, S and VT
– The singular values in S provide a basis for reducing the dimensionality of U
and V by choosing the r largest values
– Each row, ui, in U is a representation of a word in r-dimensional space
• The columns of U (the left singular vectors) are the eigenvectors of XXᵀ
– Each row, vi, in V is a representation of a document in r-dimensional space
• The columns of V (the right singular vectors) are the eigenvectors of XᵀX
SVD EXAMPLE
U (one row per term: human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors):
 .22 -.11  .29 -.41 -.11 -.34  .52 -.06 -.41
 .20 -.07  .14 -.55  .28  .50 -.07 -.01 -.11
 .24  .04 -.16 -.59 -.11 -.25 -.30  .06  .49
 .40  .06 -.34  .10  .33  .38  .00  .00  .01
 .64 -.17  .36  .33 -.16 -.21 -.17  .03  .27
 .27  .11 -.43  .07  .08 -.17  .28 -.02 -.05
 .27  .11 -.43  .07  .08 -.17  .28 -.02 -.05
 .30 -.14  .33  .19  .11  .27  .03 -.02 -.17
 .21  .27 -.18 -.03 -.54  .08 -.47 -.04 -.58
 .01  .49  .23  .03  .59 -.39 -.29  .25 -.23
 .04  .62  .22  .00 -.07  .11  .16 -.68  .23
 .03  .45  .14 -.01 -.30  .28  .34  .68  .18
S (diagonal of singular values):
 3.34  2.54  2.35  1.64  1.50  1.31  .85  .56  .36
VT (one column per document):
 .20  .61  .46  .54  .28  .00  .01  .02  .08
-.06  .17 -.13 -.23  .11  .19  .44  .62  .53
 .11 -.50  .21  .57 -.51  .10  .19  .25  .08
-.95 -.03  .04  .27  .15  .02  .02  .01 -.03
 .05 -.21  .38 -.21  .33  .39  .35  .15 -.60
-.08 -.26  .72 -.37  .03 -.30 -.21  .00  .36
 .18 -.43 -.24  .26  .67 -.34 -.15  .25  .04
-.01  .05  .01 -.02 -.06  .45 -.76  .45 -.07
-.06  .24  .02 -.08 -.26 -.62  .02  .52 -.45
LSA EXAMPLE (RECONSTRUCTION WITH 2 DIMENSIONS)
Original term–document matrix X:
            1  2  3  4  5  6  7  8  9
Human       1  0  0  1  0  0  0  0  0
Interface   1  0  1  0  0  0  0  0  0
Computer    1  1  0  0  0  0  0  0  0
User        0  1  1  0  1  0  0  0  0
System      0  1  1  2  0  0  0  0  0
response    0  1  0  0  1  0  0  0  0
Time        0  1  0  0  1  0  0  0  0
EPS         0  0  1  1  0  0  0  0  0
Survey      0  1  0  0  0  0  0  0  1
Trees       0  0  0  0  0  1  1  1  0
Graph       0  0  0  0  0  0  1  1  1
minors      0  0  0  0  0  0  0  1  1
Rank-2 reconstruction:
            1     2     3     4     5     6     7     8     9
Human       .16   .40   .38   .47   .18   -.05  -.12  -.16  -.09
Interface   .14   .37   .33   .40   .16   -.03  -.07  -.10  -.04
Computer    .15   .51   .36   .41   .24   .02   .06   .09   .12
User        .26   .84   .61   .70   .39   .03   .08   .12   .19
System      .45   1.23  1.05  1.27  .56   -.07  -.15  -.21  -.05
response    .16   .58   .38   .42   .28   .06   .13   .19   .22
Time        .16   .58   .38   .42   .28   .06   .13   .19   .22
EPS         .22   .55   .51   .63   .24   -.07  -.14  -.20  .11
Survey      .10   .53   .23   .21   .27   .14   .31   .44   .42
Trees       -.06  .23   -.14  -.27  .14   .24   .55   .77   .66
Graph       -.06  .34   -.15  -.30  .20   .31   .69   .98   .85
minors      -.04  .25   -.10  -.21  .15   .22   .50   .71   .62
The matrix resulting from the reduced number of dimensions is the optimal
approximation (with respect to least squares error) of the original matrix X
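The rank-2 reconstruction can be reproduced directly from the term-document matrix above by keeping only the two largest singular values:

import numpy as np

X = np.array([
    [1,0,0,1,0,0,0,0,0],   # Human
    [1,0,1,0,0,0,0,0,0],   # Interface
    [1,1,0,0,0,0,0,0,0],   # Computer
    [0,1,1,0,1,0,0,0,0],   # User
    [0,1,1,2,0,0,0,0,0],   # System
    [0,1,0,0,1,0,0,0,0],   # response
    [0,1,0,0,1,0,0,0,0],   # Time
    [0,0,1,1,0,0,0,0,0],   # EPS
    [0,1,0,0,0,0,0,0,1],   # Survey
    [0,0,0,0,0,1,1,1,0],   # Trees
    [0,0,0,0,0,0,1,1,1],   # Graph
    [0,0,0,0,0,0,0,1,1],   # minors
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]   # best rank-2 approximation
print(np.round(X2, 2))                       # matches the reconstructed table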
WORD SIMILARITY
(Figure: the 12 terms plotted using their first two coordinates in U, i.e. the first two columns of the U matrix from the SVD example above. Semantically related terms such as "human" and "user" appear close together even though they never co-occur in a document.)
DOCUMENT SIMILARITY
(Figure: the 9 documents plotted using their first two coordinates, i.e. the first two rows of the VT matrix from the SVD example above. Documents about similar topics appear close together.)
MULTI-DIMENSIONAL SCALING
• Method for reducing the dimensionality of data
(Figure: stress plotted against the number of dimensions, 1 to 22; stress decreases as the number of dimensions increases.)
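A hedged sklearn sketch of the stress-versus-dimensions analysis (the data is illustrative; stress_ is sklearn's raw stress value):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import MDS

X = np.random.randn(50, 10)          # illustrative high-dimensional data
dims = range(1, 10)
stress = []
for d in dims:
    mds = MDS(n_components=d, random_state=0)
    mds.fit(X)                        # fits a d-dimensional configuration
    stress.append(mds.stress_)        # residual stress for this dimensionality

plt.plot(list(dims), stress, marker="o")
plt.xlabel("Dimensions")
plt.ylabel("Stress")
plt.show()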
Sampling
• Why sample data in the context of data mining?
– Isn't data mining all about using all the data to discover patterns?
• What is a sample?
– A sample is a subset of the total data (population)
– The data that is available in a database is generally a subset of the total data in any case
• A bank's database only contains data on its own customers, not on ALL mortgage customers
• Knowledge discovered from the data needs to be valid for the population for it to be useful
• Stratified Sampling
– Used when certain attributes within the data have very skewed distributions, for example, fraudulent vs non-fraudulent transactions
– A sample is taken from each partition of the data based on the skewed attribute
• x% random sample of the fraudulent transactions
• y% random sample of the non-fraudulent transactions
– Some modelling techniques may not be able to handle the skew in the distribution, requiring a larger % of the fraudulent transactions being sampled than the non-fraudulent ones
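A minimal pandas sketch of stratified sampling with different rates per stratum (the column name, rates, and data are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fraud": np.random.rand(10000) < 0.02,            # heavily skewed attribute
    "amount": np.random.exponential(100, size=10000),
})

rates = {True: 0.50, False: 0.05}   # x% of fraudulent, y% of non-fraudulent
sample = (df.groupby("fraud", group_keys=False)
            .apply(lambda g: g.sample(frac=rates[g.name], random_state=0)))
print(sample["fraud"].value_counts())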
Topic 2: Wrap-Up
• Methods for auditing data
– Gaining insights into the characteristics of data attributes
– Measuring relationships between two attributes
• Transforming attributes
– From categorical to numeric
• Domain based mapping
• 1-to-n mapping
• Using associations with numeric attributes
– From numeric to categorical
• Binning
– Simple range and frequency based methods
– Entropy based methods
• Normalizing distributions
• Handling Missing Values
• Methods for dimensionality reduction
– Multi-dimensional Scaling
– Principal Component Analysis
• Sampling