
Data Mining and Machine Learning

Sarah Constantin April 16, 2012

Lecture 1

Textbook: Hastie's Elements of Statistical Learning. Grade: 60 percent bi-weekly homework, 40 percent final project. Class demonstrations are in R; work may be in R or MATLAB. There are two basic types of problems: classification and regression. Regression relates input variables to a numerical variable, trying to predict the response variable from the input variables. Classification is the same thing, except the output variable is discrete. Regression example: predict house price from things like size, school district, number of bedrooms, etc. Classification example: distinguishing spam from ham emails based on the text of an email. We'll start with traditional linear models for classification and regression, and from there try to make linear models more flexible. The other main issue to consider is high-dimensional data: many possible features. For example: a grayscale image with many pixels. Example: Autism. One of the characteristics of autism is impaired social interaction. When watching a movie, autistic subjects pay less attention to social scenes than neurotypical subjects do. Eye-tracking data records where the subject is looking on the screen. Subjects watched Who's Afraid of Virginia Woolf. Subjects have classification labels (autistic or neurotypical) and each frame has a data point indicating where the subjects are looking. Can we use this data to build a binary classifier? Some frames have little discriminatory power, but some frames show a significant difference between autistic and neurotypical subjects. So part of this is a variable selection problem. Additionally, we need to take into account the time ordering of the frames.


Least Squares and Nearest Neighbors

Toy example: simple prediction method. Two input variables, $x_1$ and $x_2$, and a class label, red or green. Linear function of the input: $\hat{Y} = \hat\beta_0 + \sum_j X_j \hat\beta_j$. Or, in other words, $\hat{Y} = X^T \hat\beta$. The residual sum of squares is given by $RSS(\beta) = \sum_i (y_i - x_i^T \beta)^2 = (y - X\beta)^T (y - X\beta)$.

This is a measure of goodness of fit of the linear model. K-nearest neighbors algorithm: for each data point, rank the distances to all other points and identify the k nearest neighboring points. For each grid point, calculate the k nearest neighbors in the data set. The majority vote will assign the classification. This creates a classification boundary which is not necessarily linear. Note that not every data point will have an effect on the classification boundary. It's more of a local method, trying to use local features for the classification rule. The linear method is more of a global method.

Lecture 2

Simulation example: red points and green points. We assume we don't know the underlying probability distribution. Two input variables, $X_1$ and $X_2$, and classification labels. Can we find a classification rule that predicts the label of a new data point? Procedures: least squares and nearest neighbors. Least squares treats Y as a linear function of $X_1$ and $X_2$. If the fitted value exceeds 0.5, predict red; otherwise predict green. K-nearest-neighbors says that if two data points are close in terms of their input variables, we expect the labels to be similar. Prediction is based on neighboring points. For any data point, look at its k nearest neighbors, and give the new point the label of the majority class. The smaller k is, the more seriously it takes outliers. How to choose k is a very important question. As the size of the neighborhood increases, are you using more degrees of freedom or fewer? You're using fewer. At the extreme where the neighborhood is all the data, there would be only one possible prediction. Define N/k, roughly the number of neighborhoods, as the approximate degrees of freedom. Linear method vs. k-nearest-neighbor procedure: plotting degrees of freedom against error, there's a minimum error point for test data (for training data, of course, more fitting always gives lower error).
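The majority-vote rule described above can be sketched in a few lines. This is only an illustration under assumptions: the function name `knn_predict`, the Euclidean distance, and the toy points are made up here and are not from the lecture.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of ((x1, x2), label) pairs.
    """
    # Sort training points by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two loose clusters.
train = [((0.0, 0.0), "green"), ((0.2, 0.1), "green"), ((0.1, 0.3), "green"),
         ((1.0, 1.0), "red"), ((0.9, 1.2), "red"), ((1.1, 0.8), "red")]

print(knn_predict(train, (0.1, 0.1), k=3))
print(knn_predict(train, (1.0, 0.9), k=3))
```

With k = 1 the rule follows every single point (many degrees of freedom); with k equal to the full sample size it always predicts the overall majority class.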


Statistical Decision Theory

Choose $f(x)$ to minimize expected squared error loss $(Y - f(X))^2$. Expected Prediction Error: $EPE(f) = \int\int (y - f(x))^2 p(y|x) p(x) \, dx \, dy = \int [\int (y - f(x))^2 p(y|x) \, dy] p(x) \, dx = E_X E_{Y|X}([Y - f(X)]^2 | X)$. So it's sufficient to minimize $E_{Y|X}([Y - f(X)]^2 | X)$ pointwise. The least squares solution is the regression function, $f(x) = E(Y | X = x)$. K-nearest neighbors is $\hat{f}(x) = \mathrm{Ave}(y_i | x_i \in N_k(x))$. Linear regression: $f(x) = x^T \beta$. EPE is minimized by $\beta = [E(XX^T)]^{-1} E[XY]$. K-nearest-neighbors directly approximates the Bayes classifier; the conditional probability at a point is relaxed to the conditional probability within a neighborhood of the point, and probabilities are estimated by training sample proportions.

Lecture 3
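Linear regression: assume

$y_i = x_i^T \beta + \epsilon_i$

where $E[\epsilon_i] = 0$, $Var[\epsilon_i] = \sigma^2$, $Cov(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$. We minimize the sum of squared errors

$S(\beta) = \sum_i (y_i - x_i^T \beta)^2 = (y - X\beta)^T (y - X\beta)$

by setting the derivative to zero:

$0 = dS/d\beta = d/d\beta \, (y^T y - \beta^T X^T y - y^T X \beta + \beta^T X^T X \beta) = -2 X^T y + 2 X^T X \beta$

so $\hat\beta = (X^T X)^{-1} X^T y$. If we use the least-squares estimator $\hat\beta = (X^T X)^{-1} X^T y$, it's an unbiased estimate: $E[\hat\beta] = \beta$. Indeed, $E[\hat\beta] = E[(X^T X)^{-1} X^T (X\beta + \epsilon)] = \beta + E[(X^T X)^{-1} X^T \epsilon] = \beta + E[E[(X^T X)^{-1} X^T \epsilon | X]] = \beta$ because $E[\epsilon | X] = 0$. $Cov(\hat\beta) = (X^T X)^{-1} \sigma^2$. Typical estimate of $\sigma^2$: $\hat\sigma^2 = \frac{1}{N - p - 1} \sum_i (y_i - \hat{y}_i)^2$, an unbiased estimator of $\sigma^2$. If we further assume that the errors $\epsilon \sim N(0, \sigma^2)$, then $\hat\beta$ follows a multivariate normal distribution, $N(\beta, (X^T X)^{-1} \sigma^2)$. So we can do statistical tests on whether $\beta_j = 0$ or not.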
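As a small worked instance of the normal equations and the variance estimate, here is a sketch for a single predictor plus intercept. The helper names `ols_fit` and `sigma2_hat` and the data are illustrative assumptions; the 2 by 2 system is solved by hand rather than with a matrix library.

```python
def ols_fit(x, y):
    """Least-squares intercept and slope for y ~ b0 + b1 * x.

    Solves the 2x2 normal equations (X^T X) beta = X^T y by hand,
    where X has a column of ones and a column of x values.
    """
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx          # determinant of X^T X
    b1 = (n * sxy - sx * sy) / det
    b0 = (sy - b1 * sx) / n
    return b0, b1

def sigma2_hat(x, y, b0, b1):
    """Noise-variance estimate RSS / (N - p - 1), here with p = 1."""
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return rss / (len(x) - 2)

# Hypothetical data lying exactly on y = 1 + 2x, so the residuals vanish.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = ols_fit(x, y)
print(b0, b1, sigma2_hat(x, y, b0, b1))
```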

If we simulate data sets from an underlying distribution, with a randomly generated error term, we can look at the coefficient estimate for each sample so generated. The expectation, the center of the sampling distribution, will be the underlying truth. Gauss-Markov Theorem: the least squares estimate has the smallest variance among all linear unbiased estimates. $MSE(\hat\theta) = E[(\hat\theta - \theta)^2] = Bias^2 + Variance$, so if bias is 0, minimum MSE is minimum variance. There may, however, exist a biased estimator with smaller variance and smaller MSE. We can trade a little bias for a large reduction in variance. If variables are highly correlated, the residual vector (of one input regressed on the others) will be very close to zero and the coefficient $\hat\beta_p$ will be unstable. Think of this as the coefficient in a simple linear regression.

Take inputs $X_j$, $j = 1 \ldots p$. If these are all independent, then in the regression $y \sim x_1 \ldots x_p$ the effect of each $x_i$ will be the same as if you did a simple linear regression for each of them. But in practice very often the x's have some dependence. Then the coefficient obtained from multiple linear regression would be different from simple linear regression. How different? The coefficient $\hat\beta_j$ represents the additional contribution of $x_j$ on y after $x_j$ has been adjusted for all the other x's. When we do the linear regression of y on x, the residual is orthogonal to the input space, therefore orthogonal to all the input variables. You can regress $x_j$ on all the other x's and then regress y on the resulting residual; this gives the correct coefficient for $x_j$. It's a way of finding the pure effect of a variable. Collinearity can lead to unstable estimates because the residual of $x_j$ regressed on $x_i$ has small variance.

Problems with least squares estimates: prediction accuracy. Least squares estimates often have low bias but large variance. Can we select variables (subset selection) so that the estimator is biased but has much lower variance? Best subset regression finds for each $k < p$ the subset of size k that gives the smallest residual sum of squares. Unfortunately the number of subsets is $2^p$, so it's hard to search through all of subset space. There are two commonly used search procedures: forward stepwise selection and backward stepwise selection. Forward: start with no variables; among all predictors not in the model, add the one that most improves a variable selection criterion such as AIC or BIC; continue until no new predictor improves the criterion. Backward is the same, but reversed: start with all the variables, and remove one at a time until you can't improve the variable selection criterion any more. Akaike Information Criterion: $AIC = -2 \log(\text{likelihood}) + 2k$ where k is the number of parameters. Bayesian Information Criterion: $BIC = -2 \log(\text{likelihood}) + (\log N) k$ where N is the number of observations.
Consistency of model selection: if the true f is among the candidate families of regression functions, the probability of selecting the true model by BIC approaches 1 as $n \to \infty$.

Lecture 4

Now we look at coefficient shrinkage by ridge regression. This is a more stable estimator: shrink the regression coefficients by imposing a penalty on their size. The ridge coefficients minimize a penalized residual sum of squares.

$\hat\beta^{ridge} = \arg\min_\beta \sum_i (y_i - \beta_0 - \sum_j x_{ij} \beta_j)^2 + \lambda \sum_j \beta_j^2$

As $\lambda$ increases, you shrink the coefficients harder. For a well-chosen $\lambda$ this can make mean squared error smaller than least squares. For the prostate cancer example (Hastie, Chapter 3) the coefficients of some of the input variables fall as $\lambda$ increases, while others grow. The OLS estimate corresponds to $\lambda = 0$. If $\lambda = \infty$ then all coefficients are forced to be zero. If the OLS estimate of a coefficient is nonzero, then the ridge coefficient is generally nonzero too. Basically everything has nonzero coefficients. If you want to reduce the number of input variables, this is not ideal. Lasso instead penalizes the $L_1$ norm.

$\hat\beta^{lasso} = \arg\min_\beta \sum_i (y_i - \beta_0 - \sum_j x_{ij} \beta_j)^2 + \lambda \sum_j |\beta_j|$

The effect of raising $\lambda$ on the lasso coefficients is quite different; the number of nonzero coefficients gets smaller and smaller. Geometric intuition: an ellipse (a contour of the RSS) approaching the $L_1$ ball will tend to hit it near a corner such as (1, 0), where one coefficient is exactly zero, and if it approaches the $L_2$ ball it'll hit away from the axes. Cross-validation: build your model using your training set, evaluate using your test set. This avoids overfitting. You can minimize training error without being good at predicting new data. If you have a lot of data, you can split it into different subsets, train a model on one of them and test it on the others. K-fold cross-validation: the error on the kth part is

$E_k(\lambda) = \sum_{i \in k\text{th part}} (y_i - x_i \hat\beta^{-k}(\lambda))^2$

where $\hat\beta^{-k}(\lambda)$ is fit with the kth part left out.

Ridge regression can show that multiple variables converge to each other; this can separate and reconstruct variables. Shrinkage methods standardize the input variables so they have the same standard deviation, so the coefficients will be comparable to each other. You can have a prediction using each training set, for each of several choices of $\lambda$. Least Angle Regression: a commonly used algorithm for implementing Lasso. So you can see the range of cross-validated MSE and how it changes with the $L_1$ norm, so that you can be more confident about choosing the minimum.
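To make the cross-validation recipe concrete, here is a minimal sketch that picks the ridge penalty by K-fold cross-validation. It assumes a single centered predictor with no intercept, so the ridge coefficient has the closed form sum(x*y) / (sum(x^2) + lambda); the function names, the fold assignment (every n_folds-th point) and the data are illustrative choices, not the course's code.

```python
def ridge_slope(x, y, lam):
    """Ridge coefficient for one predictor without intercept:
    minimizes sum (y_i - b x_i)^2 + lam * b^2."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def cv_error(x, y, lam, n_folds):
    """Total squared prediction error over the held-out folds."""
    total = 0.0
    for k in range(n_folds):
        test_idx = set(range(k, len(x), n_folds))   # every n_folds-th point
        x_tr = [x[i] for i in range(len(x)) if i not in test_idx]
        y_tr = [y[i] for i in range(len(x)) if i not in test_idx]
        b = ridge_slope(x_tr, y_tr, lam)            # fit without fold k
        total += sum((y[i] - b * x[i]) ** 2 for i in test_idx)
    return total

# Hypothetical data, roughly y = 2x plus small noise.
x = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
y = [-4.1, -2.8, -2.2, -0.9, 1.2, 1.9, 3.1, 3.9]
lambdas = [0.0, 0.1, 1.0, 10.0]
errors = {lam: cv_error(x, y, lam, n_folds=4) for lam in lambdas}
best = min(errors, key=errors.get)
print(errors, best)
```

On data this clean, heavy shrinkage only adds bias, so a small penalty should win; with noisier or collinear inputs the minimum typically moves to a larger penalty.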

Lecture 5

There are still other regression techniques: group Lasso, principal components regression, etc. Fused lasso: the variables $x_1 \ldots x_p$ have some time ordering or spatial structure. The penalty is

$\lambda_1 \sum_t |\beta_t| + \lambda_2 \sum_t |\beta_t - \beta_{t-1}|$

which penalizes differences between adjacent coefficients.

Today is all about classification problems. You're interested in predicting categories. Linear methods: this means the decision boundaries are linear. Decision boundary of the form $\beta_0 + \sum_j \beta_j x_j = 0$. Two regions separated by a hyperplane. Expected prediction error: $E[L(G, \hat{G}(X))]$, where the loss is 0 if you assign a point to the right category, otherwise 1. This is called 0-1 loss. Bayes classifier means classify to the most probable class, using the conditional distribution: $\hat{G}(x) = \arg\max_g P(g | X = x)$.

Linear regression of an indicator matrix. K class indicators $Y_k$; each $Y_k$ is 1 if $G = k$, else 0. So you can treat them as numeric values in a regression. $\hat\beta_k = (X^T X)^{-1} X^T y_k$, $\hat{y}_k = X \hat\beta_k$. Training data is of the form $(x_i, g_i)$, data point and classification. Compare the fitted values from the regressions, $\hat{Y}_1$, $\hat{Y}_2$, etc. Pick the class with the highest $\hat{Y}$. The fitted $\hat{Y}$ can be negative or above one, because of the nature of the linear regression line. This is one problem with linear regression for classification. You are not guaranteed that your predicted value is in the appropriate range.

Alternative: mixture of Gaussians distribution: sum of Gaussians with different scales and means. Each class density is

$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_k|^{1/2}} e^{-\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}$

Linear discriminant analysis: assuming a common covariance matrix $\Sigma$, we see that the linear discriminant functions

$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$

are an equivalent description of the decision rule, and you choose the k that maximizes $\delta_k(x)$, where the $\pi_k$ are the priors. You only retain the terms that are related to k.

Lecture 6

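Masking problem: if there are more than two classes, linear regression on the class indicators can completely hide (mask) a class, so that it is never predicted; in effect the fit behaves as if there were too few classes. So we need other options. For instance, linear discriminant analysis. The linear discriminant functions are

$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log \pi_k$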

Maximizing this is equivalent to maximizing the posterior probability of the class if we model each class density as a multivariate Gaussian with a common covariance matrix:

$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} e^{-\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}$

$P(G = k, X = x) = P(X = x | G = k) P(G = k) = f_k(x) \pi_k$. The class densities are $f_1(x) \ldots f_K(x)$, and $P(X = x) = \sum_k P(X = x | G = k) P(G = k) = \sum_k P(X = x, G = k)$.

To maximize the posterior, maximize the log of the joint probability, $\log f_k(x) + \log \pi_k = \delta_k(x)$ modulo constant terms. The classification boundary between classes k and l is where $\delta_k = \delta_l$, so that will wind up a linear classification boundary:

$\log \frac{\pi_k}{\pi_l} - \frac{1}{2} (\mu_k + \mu_l)^T \Sigma^{-1} (\mu_k - \mu_l) + x^T \Sigma^{-1} (\mu_k - \mu_l) = 0$

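What's the estimate of the Gaussian distribution? Quite simple. $\hat\pi_k = N_k / N$: the proportion in the kth class. $\hat\mu_k = \sum_{g_i = k} x_i / N_k$: the mean of each class. $\hat\Sigma = \sum_k \sum_{g_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T / (N - K)$.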
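The plug-in estimates and the discriminant rule can be sketched for one-dimensional inputs, where the common covariance matrix reduces to a single pooled variance and no matrix inverse is needed. The function names and the toy data are assumptions made for illustration.

```python
from math import log

def fit_lda_1d(xs, labels):
    """Estimate priors, class means and the pooled variance for 1-D LDA."""
    classes = sorted(set(labels))
    n, k = len(xs), len(classes)
    prior, mean = {}, {}
    pooled = 0.0
    for c in classes:
        pts = [x for x, g in zip(xs, labels) if g == c]
        prior[c] = len(pts) / n                    # pi_k = N_k / N
        mean[c] = sum(pts) / len(pts)              # mu_k
        pooled += sum((x - mean[c]) ** 2 for x in pts)
    return prior, mean, pooled / (n - k)           # divide by N - K

def lda_predict(x, prior, mean, var):
    """Pick the class maximizing
    delta_k(x) = x * mu_k / var - mu_k^2 / (2 * var) + log(pi_k)."""
    def delta(c):
        return x * mean[c] / var - mean[c] ** 2 / (2 * var) + log(prior[c])
    return max(prior, key=delta)

# Hypothetical data: two classes of equal size centered at 0.5 and 3.5.
xs = [0.0, 0.5, 1.0, 3.0, 3.5, 4.0]
labels = ["a", "a", "a", "b", "b", "b"]
prior, mean, var = fit_lda_1d(xs, labels)
print(lda_predict(1.9, prior, mean, var), lda_predict(2.1, prior, mean, var))
```

With equal priors the boundary sits at the midpoint of the two class means, here x = 2.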

Each data point gets equal weight. This is reasonable; it estimates the variance-covariance structure for each Gaussian and assumes they are equal. LDA and linear regression are equivalent when $N_1 = N_2$. Quadratic discriminant analysis is what happens if you don't assume all $\Sigma_k$ to be equal. The discriminant functions are quadratic:

$\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) + \log \pi_k$

So the decision boundary between each pair of classes is a quadratic curve. How is it computed? Diagonalize $\Sigma_k = V_k D_k V_k^T$. $V_k$ is a p by p orthonormal matrix and $D_k$ is a diagonal matrix of positive eigenvalues $d_{kl}$. Then

$(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) = [D_k^{-1/2} V_k^T (x - \mu_k)]^T [D_k^{-1/2} V_k^T (x - \mu_k)]$

and

$\log |\Sigma_k| = \sum_l \log d_{kl}$

What's going on: the within-class variance is $W = \hat\Sigma$. Between-class variance: $B = \sum_k \pi_k (\mu_k - \bar\mu)(\mu_k - \bar\mu)^T$, i.e. how much the classes differ from the center of all the classes. The Fisher method is to spread out the between-class variance as much as possible with respect to the within-class variance. Maximize $\frac{a^T B a}{a^T W a}$ where $Z = a^T X$.

Lecture 7

One approach to compromise between linear and quadratic discriminant analysis is to allow the variance-covariance matrix of each class to be a weighted sum of the pooled covariance and the individual class covariance.

At one extreme it's QDA (different for each class) and at the other extreme it'll be LDA (pooled between all the classes). The tuning parameter can be decided by the data. Dimensionality reduction perspective: project the data onto a sequence of directions so the data are as well separated as possible. Two-dimensional data with two classes, concentrated around two overlapping ellipses. How can we project the data onto a one-dimensional direction so we can separate the classes as well as possible? Finding this direction is a generalized eigendecomposition problem. Both individual and pooled variance-covariance structure play a role. Two goals: separate the classes, and be orthogonal to the previous directions. (Uncorrelated.) Logistic regression: a generalized linear model. It's still a linear model, in the sense that you're modeling a linear function of your input variables. But the generalization part is due to the fact that now you have a classification problem, and you have a transformation:

$\log \frac{p_i}{1 - p_i} = \beta_0 + \beta^T x_i$

The 0-1 response $y_i$ is generated from a Bernoulli distribution with probability $p_i$. The link function links the underlying parameter to the 0-1 response. The generalization with K classes:

$\log \frac{Pr(G = k | X = x)}{Pr(G = K | X = x)} = \beta_{k0} + \beta_k^T x$

The coefficients are calculated by maximum likelihood. The likelihood is $L(\beta) = \prod_i p_{g_i}(x_i; \beta)$, and for a 0-1 response

$P(Y = y_i) = p_i^{y_i} (1 - p_i)^{1 - y_i}$

The log-likelihood for the two class case is

$l(\beta) = \sum_i \log p_{g_i}(x_i; \beta) = \sum_i [y_i \log p(x_i; \beta) + (1 - y_i) \log(1 - p(x_i; \beta))]$

Maximizing this usually can't be solved explicitly; you have to use Newton's Method to find the roots of the derivative. The $p_i$ should be getting close to 1/2 as we get close to the classification boundary. Far away from the classification boundary they should be getting closer to 0 or 1. More emphasis on the more difficult cases: the weight $p(1 - p)$ is large when p is close to 1/2 and small when p is close to 0 or 1.
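A minimal sketch of the Newton iteration for the two-class case with one input and an intercept follows. The 2 by 2 Newton system is solved by hand, the weights p(1 - p) appear in the Hessian, and the data are a made-up, non-separable example (on perfectly separable data the maximum likelihood estimate does not exist and the iteration would diverge). The function names are illustrative.

```python
from math import exp

def sigmoid(t):
    return 1.0 / (1.0 + exp(-t))

def fit_logistic(x, y, n_iter=25):
    """Newton's method for log(p / (1 - p)) = b0 + b1 * x with 0-1 labels y."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Negative Hessian, with weights p(1 - p).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (negative Hessian) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical overlapping classes, symmetric around x = 0.
x = [-2.0, -1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 2.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(x, y)
print(b0, b1)
```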

Logistic regression or LDA? In LDA, the log-posterior odds between class k and class K are linear functions of x:

$\log \frac{Pr(G = k | X = x)}{Pr(G = K | X = x)} = \log \frac{\pi_k}{\pi_K} - \frac{1}{2} (\mu_k + \mu_K)^T \Sigma^{-1} (\mu_k - \mu_K) + x^T \Sigma^{-1} (\mu_k - \mu_K) = \alpha_{k0} + \alpha_k^T x$

This linearity comes from the Gaussian assumption and the assumption of a common covariance matrix.

Lecture 9

(missed lecture 8) Input variable: a univariate variable; there is an unknown true relationship between y and x, and adding some noise gives the observed data points. We want to fit a nonlinear trend to the data. How can we do this systematically? What basis functions? Piecewise constant; piecewise linear; continuous piecewise linear; knots are the points where the continuity constraints apply. (You can also force the first derivative and second derivative to be continuous at the knots for more smoothness.) Piecewise cubic polynomials: discontinuous, continuous, continuous first derivative, or continuous second derivative. This is the idea of the so-called regression spline. In each local region, to which degree do you want to fit a polynomial term? And where are the knots? Then you will have a set of basis functions and you can just treat the problem as a linear regression problem. Another method has to do with smoothing splines. This avoids the knot selection problem by using a maximal set of knots. It minimizes $RSS(f, \lambda) = \sum_i (y_i - f(x_i))^2 + \lambda \int f''(t)^2 \, dt$

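Penalize curvature. Or wiggling. We assume f has continuous second derivatives. We get the least squares line fit if $\lambda = \infty$, on the one hand, and if $\lambda = 0$ f can be any function that interpolates the data: no penalty on wiggling. The solution has the form $f(x) = \sum_j N_j(x) \theta_j$

where $N_1(x) \ldots N_N(x)$ are the basis functions for a natural spline basis. For a specific choice of knots, you are fitting up to a third degree polynomial, but if you go beyond the last boundary knot you are only allowed to fit a linear function rather than a cubic. The data becomes sparse there and you don't want to overfit the data. The criterion reduces to $RSS(\theta, \lambda) = (y - N\theta)^T (y - N\theta) + \lambda \theta^T \Omega_N \theta$ where $\{\Omega_N\}_{jk} = \int N_j''(t) N_k''(t) \, dt$. Compare to ridge regression, $(y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta$. Here instead of $\beta^T \beta$ on the end, the penalty is a quadratic form in the matrix $\Omega_N$, which plays the role that the p by p identity matrix plays in ridge regression. Effective degrees of freedom: the trace of $S_\lambda$ where $S_\lambda = N (N^T N + \lambda \Omega_N)^{-1} N^T$ in the equation $\hat{f} = N (N^T N + \lambda \Omega_N)^{-1} N^T y$. Recall $y = f(x) + \epsilon$. Analogous to ridge regression. Expected prediction error combines bias and variance. Writing T for the training set,

$EPE(\hat{f}_\lambda) = E_{X,Y} E_{T|X,Y} (Y - \hat{f}_\lambda(X))^2 = E_{X,Y} E_T (Y - f(X) + f(X) - E_T \hat{f}_\lambda(X) + E_T \hat{f}_\lambda(X) - \hat{f}_\lambda(X))^2 = E_{X,Y} (Y - f(X))^2 + E_X E_T [(f(X) - E_T \hat{f}_\lambda(X))^2 + (E_T \hat{f}_\lambda(X) - \hat{f}_\lambda(X))^2]$

that is, noise plus squared bias plus variance.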

Lecture 10

Nonparametric Logistic Regression. Smoothing splines for classification: two class logistic regression with an input X.

$\log \frac{Pr(Y = 1 | X = x)}{Pr(Y = 0 | X = x)} = f(x)$

which implies the probability that $Y = 1$ given $X = x$ is $p(x) = \frac{e^{f(x)}}{1 + e^{f(x)}}$.

Fit by the log-likelihood criterion penalized with curvature:

$l(f; \lambda) = \sum_i [y_i \log p(x_i) + (1 - y_i) \log(1 - p(x_i))] - \frac{1}{2} \lambda \int (f''(t))^2 \, dt = \sum_i [y_i f(x_i) - \log(1 + e^{f(x_i)})] - \frac{1}{2} \lambda \int (f''(t))^2 \, dt$

The optimal f is a finite-dimensional natural spline with knots at the values of $x_i$. Suppose you have more than one input. Want to model $Y \sim f(x_1, x_2)$. Nonlinear basis expansion of $x_1$ and $x_2$ separately, and also interaction terms between $x_1$ and $x_2$. This is called the tensor product of the two sets of basis functions. $f(x_1, x_2) = f_1(x_1) + f_2(x_2)$ is the additive model without interaction term. Let $h_j(x_1)$ be the basis functions for $x_1$ and $g_l(x_2)$ be the basis functions for $x_2$. The tensor product is the set of all possible product pairs of h's and g's. Generalized additive model: $E(Y | X_1, X_2 \ldots X_p) = \alpha + f_1(X_1) + f_2(X_2) + \ldots + f_p(X_p)$, where the $f_j$'s are smooth functions. Each is fit with, say, a cubic smoothing spline. A penalized residual sum of squares can be specified as a criterion to minimize:

$\sum_i (y_i - \alpha - \sum_j f_j(x_{ij}))^2 + \sum_j \lambda_j \int f_j''(t_j)^2 \, dt_j$

with different $\lambda$'s for each component. Iterative approach: fit the smooth functions one at a time. Let $\hat\alpha$ be the mean of y, and fit $f_j$ by smoothing the partial residuals

$y_i - \hat\alpha - \sum_{k \neq j} f_k(x_{ik})$

Generalized cross validation is used to choose each value of $\lambda$.


Lecture 11

Tree based methods. So far: extending linear methods to nonlinear models via basis expansion. Apply a nonlinear transformation to the input variables and then fit a linear model in the transformed variables. Multivariate case: tensor product space. High dimensionality of the problem. Generalized additive model: ignore the interaction terms for pairs of variables. The size of the problem now grows linearly instead of exponentially. Alternative: tree based method. 2d case: split the input space into small subregions. Rectangles. Fit the simplest possible function in each subregion. Constant predictor within each region. Not a smooth model, but it takes account of interactions. If Y does not vary very much within each subregion, it's not that bad. Identify smaller regions if you notice higher values of Y within a region. Divide the input space into smaller regions so that within each region you see a smaller region with a different value. Criterion for partition: homogeneous subregions. Binary tree: the first splitting point is at $X_1 \leq t_1$. Then two subsets. Growing the tree: regression tree. Look at the marginal distributions of all input variables. Note spikes: this suggests things about how the data was collected. For some pairs of variables, a linear relationship is not enough. The curve looks like the slope is changing. How do we choose the splitting value? Fit one constant in one subset and another in the other. Search for the optimal splitting value. Look at the residual sum of squared error, $\sum (y_i - \bar{y})^2$, for all splitting values. If the split is at the ith sorted value, minimize

$\sum_{R_1} (y_i - c_1)^2 + \sum_{R_2} (y_i - c_2)^2$

Fewer data points in a subset; easier to fit with a constant. If the decrease in squared error from a further split is close to zero, stop. But this greedy rule only finds local minima. So instead, grow a large tree and prune. Tree pruning: take a subtree. Collapse some nodes together.
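The search for a single split on one variable can be sketched as below. This is an illustrative brute-force version: it tries every gap between consecutive sorted values, fits the mean on each side, and keeps the split with the smallest two-region sum of squares. Placing the threshold at the midpoint between neighbors is an assumption of this sketch, as are the function name and data.

```python
def best_split(x, y):
    """Search all split points on one variable for the lowest two-region RSS.

    Returns (threshold, rss); points with x <= threshold go to region R1.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]

    def rss(values):
        c = sum(values) / len(values)      # best constant is the region mean
        return sum((v - c) ** 2 for v in values)

    best = (None, float("inf"))
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue                        # cannot split between equal values
        total = rss(ys[:i]) + rss(ys[i:])
        if total < best[1]:
            best = ((xs[i - 1] + xs[i]) / 2, total)
    return best

# Hypothetical data with a jump between x = 3 and x = 4.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 0.9, 1.0, 5.0, 5.2, 4.8]
threshold, error = best_split(x, y)
print(threshold, error)
```

Growing a tree repeats this search over every input variable and then recurses into each of the two resulting regions.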


Lecture 12

This is from Chapter 8 of Hastie. Draw 200 bootstrap samples from the data, and fit classification trees to all of them; if the trees vary a lot then the model has high variance. In our test data, we have two classes, 5 features, each Gaussian distributed with pairwise correlation of 0.95, and the response Y only depends on the first input. Bootstrap estimation: give the observed values equal probability based on your data. $(x_i, y_i)$ is the training data. Give equal probability to all of these. $(x^*, y^*)$ is drawn from the empirical distribution, putting equal probability 1/N on each of the training data points. Bootstrapped sample of the original data: draw the same number of data points, but with replacement. You may see some repeated points. From the bootstrapped samples you can compute parameter estimates and see how variable your parameter estimate can be.

$\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$

where $\hat{f}^{*b}$ is the fitted model based on bootstrap sample $Z^{*b}$. For K-class classification, you can take the majority vote of the trees, or average the class probabilities of the B trees. This gives the data a better chance of being repeatedly used in the training model. Averaging multiple trees. Why does bagging work? Assume the ideal case, where the $(x_i, y_i)$ are drawn from the population distribution P. The ideal aggregation is $f_{ag}(x) = E_P \hat{f}^*(x)$, where $\hat{f}^*$ is a fitted model based on a sample drawn from P. Prediction error: $E_P[(Y - f_{ag}(x))^2] \leq E_P[(Y - \hat{f}^*(x))^2]$. Why? Bias-variance decomposition:

$E_P (Y - \hat{f}^*(x))^2 = E_P (Y - f_{ag}(x) + f_{ag}(x) - \hat{f}^*(x))^2 = E_P (Y - f_{ag}(x))^2 + E_P (f_{ag}(x) - \hat{f}^*(x))^2 \geq E_P (Y - f_{ag}(x))^2$


Lecture 13

Bagging, or bootstrap aggregation, is best for unstable methods. The tree method is an example of an unstable method that responds very well to bagging. Construct a tree for each bootstrap sample, and base the prediction on the average of the trees. For classification there are two ways of combining the trees: majority vote, or average of the classification probabilities. Random forests are an improved version of the bagging idea. Reduce the correlation between the trees grown on the bootstrapped samples. You reduce variance by averaging independent identically distributed random variables: if $Var(x_i) = \sigma^2$, then

$Var(\frac{1}{B} \sum_i x_i) = \sigma^2 / B$

Different trees are drawn from the same distribution, so they probably have some positive correlation, $cor(x_i, x_j) = \rho$. So we can't use the independence assumption. Instead,

$Var(\frac{1}{B} \sum_i x_i) = \frac{1}{B^2} (\sum_i Var(x_i) + \sum_{i \neq j} cov(x_i, x_j)) = \frac{1}{B^2} (B \sigma^2 + B(B - 1) \rho \sigma^2) = \sigma^2 (\frac{1}{B} + \rho (1 - \frac{1}{B})) = \rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2$
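As a quick numerical check of this formula, the sketch below sums every entry of the covariance matrix of B equally correlated variables and compares the result with the closed form. The function names are illustrative.

```python
def var_of_average(sigma2, rho, b):
    """Variance of the mean of b equicorrelated variables, computed by
    summing every entry of their covariance matrix and dividing by b^2."""
    total = 0.0
    for i in range(b):
        for j in range(b):
            total += sigma2 if i == j else rho * sigma2
    return total / b ** 2

def closed_form(sigma2, rho, b):
    """rho * sigma^2 + (1 - rho) * sigma^2 / B."""
    return rho * sigma2 + (1 - rho) * sigma2 / b

for b in (1, 10, 100):
    print(b, var_of_average(1.0, 0.5, b), closed_form(1.0, 0.5, b))
```

The second term shrinks as B grows, but the first term does not: with correlation 0.5 the variance never drops below half of the single-tree variance, which is why random forests try to lower the correlation itself.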

Random forest algorithm (see Ch. 15): Draw a bootstrap sample from the training data. Grow a random forest tree on the bootstrapped data, by recursively repeating the following steps for each terminal node of the tree: 1. Select m variables at random from the p variables. 2. Pick the best variable/split-point among the m predictor variables. 3. Split the node into two daughter nodes. The smaller m, the lower the chance that this tree actually picks the best option. But the tradeoff is that smaller m gives the option of greater sparsity. You have to tune the parameter. In a bootstrap sample, some observations are repeated and some are missing. In a random forest, what do you do about the observations missing from a bootstrap sample (the out-of-bag, or OOB, observations)? You evaluate the prediction accuracy on (x, y) based on the predictions from those trees where (x, y) didn't show up. You're getting an out-of-sample performance evaluation from that data point automatically. Error drops sharply after only a fairly small number of trees. Variable importance: measures the prediction strength of each variable. The prediction accuracy on the OOB samples is recorded. Then the values for the jth variable are randomly permuted in the OOB samples, and the accuracy is again computed. That is: we have a matrix of out-of-bag x values and a vector of out-of-bag y values. Permute the $x_j$'s, so we take out the predicting effect of $x_j$, since the relationship between $x_j$ and y will be canceled. If $x_j$ was irrelevant, then this won't change the prediction much; but if it is relevant, the prediction accuracy will decrease.


Lecture 14

The idea of boosting is to have a classifier

$G(x) = \mathrm{sign}[\sum_m \alpha_m G_m(x)]$

This is a weighted sum rather than a plain average. Also, the $G_m$ are not generated by a bootstrap sample; each of them is fit to a modified sample, reweighted in some intelligent way. The samples are not independent, as in bootstrapping. Here, they depend on the performance of the previous classifier. There is a sequential order. AdaBoost defines how to generate the weights and samples. Generate your classifier $G_m$ to fit the training data using weights $w_i$. Compute the error rate:

$err_m = \frac{\sum_i w_i I(y_i \neq G_m(x_i))}{\sum_i w_i}$

How many mistakes did you make? Weighted by the $w_i$. Compute $\alpha_m = \log((1 - err_m) / err_m)$. The negative logit of the weighted error rate. Set $w_i \leftarrow w_i \exp(\alpha_m I(y_i \neq G_m(x_i)))$. Upweight the misclassified points. Repeat for $m = 1 : M$. Then output a weighted sum of classifiers

$G(x) = \mathrm{sign}[\sum_m \alpha_m G_m(x)]$
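The AdaBoost loop above can be sketched with decision stumps on a single input as the weak classifiers. Everything here is an illustrative assumption: the stump form (predict s if x > t, else -s), the function names, the clipping of the error rate to keep the logarithm finite, and the toy data, which no single stump can classify correctly.

```python
from math import exp, log

def fit_stump(x, y, w):
    """Weighted-error-minimizing stump: predicts s if x > t else -s."""
    best = None
    for t in sorted(set(x)):
        for s in (1, -1):
            err = sum(wi for xi, yi, wi in zip(x, y, w)
                      if (s if xi > t else -s) != yi)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

def adaboost(x, y, n_rounds):
    """AdaBoost with labels in {-1, +1}; returns a list of (alpha, t, s)."""
    w = [1.0 / len(x)] * len(x)
    model = []
    for _ in range(n_rounds):
        err, t, s = fit_stump(x, y, w)
        err = min(max(err / sum(w), 1e-10), 1 - 1e-10)   # keep the log finite
        alpha = log((1 - err) / err)
        model.append((alpha, t, s))
        # Upweight the points this stump got wrong.
        w = [wi * exp(alpha) if (s if xi > t else -s) != yi else wi
             for xi, yi, wi in zip(x, y, w)]
    return model

def predict(model, xi):
    """Sign of the weighted sum of stump votes."""
    score = sum(a * (s if xi > t else -s) for a, t, s in model)
    return 1 if score >= 0 else -1

# Hypothetical data: positive, then negative, then positive again.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
model = adaboost(x, y, n_rounds=3)
print([predict(model, xi) for xi in x])
```

Each individual stump misclassifies at least three points here, but the weighted vote of three stumps reproduces all ten labels.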

Boosting fits an additive model. $f(x) = \sum_m \beta_m b(x; \gamma_m)$

Tree-based methods are examples of this; the basis functions are step functions and $\gamma$ parametrizes the split variables and split points. Fit these functions by minimizing a loss function averaged over the training data: $\min \sum_i L(y_i, \sum_m \beta_m b(x_i; \gamma_m))$

Forward stagewise additive modeling, in general, computes the best $\beta$ and $\gamma$ by

$(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_i L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma))$

Set $f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$. AdaBoost is an example of this. It's equivalent to forward stagewise additive modeling using the loss function $L(y, f(x)) = e^{-y f(x)}$. But there are other possible monotone decreasing functions of the margin $y f(x)$.

$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_i w_i^{(m)} \exp(-\beta y_i G(x_i))$

If you fix $\beta$ and optimize with respect to G, you're minimizing weighted classification error: $G_m = \arg\min_G \sum_i w_i^{(m)} I(y_i \neq G(x_i))$. Plugging this $G_m$ in and solving for $\beta$, one obtains

$\beta_m = \frac{1}{2} \log \frac{1 - err_m}{err_m}$

The exponential loss is more sensitive to changes in the estimated class probabilities compared to the 0-1 loss. The misclassification error rate will suggest you stop sooner than the exponential loss.


Lecture 15

Generalize the idea of AdaBoost to a model of boosted trees.


$f_M(x) = \sum_{m=1}^{M} T(x; \Theta_m)$

A sum of individual trees, parametrized by $\Theta_m$, which indicates tree structure, like the splitting values and splitting variables. Training loss:

$\sum_i L(y_i, f_M(x_i))$

For example, the loss function can be $(y_i - f_m(x_i))^2$, the regression loss or $L_2$ loss. The exponential loss is $e^{-y_i f_m(x_i)}$. In practice it should be differentiable. Solving

$\hat\Theta_m = \arg\min_{\Theta_m} \sum_i L(y_i, f_{m-1}(x_i) + T(x_i; \Theta_m))$

gives you the best choice of next tree, given all the trees you had so far. To find the fitted value for each subregion, you just need to optimize the loss function over the data points falling in that subregion to choose the optimal constants. We will be fitting the tree model with gradient descent. Go in the direction of steepest descent: take the step $-\rho_m g_m$ where $\rho_m$ is a scalar and $g_m$ is the gradient of $L(f)$ evaluated at $f = f_{m-1}$. Then update the solution: $f_m = f_{m-1} - \rho_m g_m$. Gradient tree boosting (MART): at each step, compute the derivative of the loss function at point i. Fit a regression tree to targets $r_{im}$, the negative gradients, giving terminal regions $R_{jm}$. Choose the optimal coefficients $\gamma_{jm}$ such that $\gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} L(y_i, f_{m-1}(x_i) + \gamma)$. Update $f_m$ to include the new tree: $f_m = f_{m-1} + \sum_{j=1}^{J_m} \gamma_{jm} I(x \in R_{jm})$.

Shrinkage: shrink the contribution of each tree by a factor $\nu$, $0 < \nu < 1$, when it is added to the current approximation. $f_m = f_{m-1} + \nu \sum_j \gamma_{jm} I(x \in R_{jm})$. This is analogous to penalized least squares (like ridge regression or lasso). J is a meta-parameter, the tree size, which determines how much overfitting happens.


Lecture 16

Non-parametric, unsupervised learning techniques: the training data have no labels. PCA: a linear approximation of the data which captures the most variation of the data. Projection onto a smaller number of dimensions, a lower-dimensional space. Useful in high-dimensional situations. Also useful for visualization and compression. The first linear component: $z_1 = a_{11} x_1 + a_{12} x_2 + \ldots + a_{1p} x_p$. The sample variance of the projection $z_1$ is greatest among all such linear combinations with $||a_1|| = 1$. Second linear component: $z_2 = a_{21} x_1 + a_{22} x_2 + \ldots + a_{2p} x_p$ such that $a_2^T a_1 = 0$, orthogonal to the first projection, and $||a_2|| = 1$, and the variance is maximized. $\Sigma_X$ is the variance-covariance matrix of the original data; the jth principal component $z_j$ is the linear combination $z_j = a_j^T X$ which has the greatest variance subject to the conditions that $||a_j|| = 1$ and $a_j$ is orthogonal to all previous components. Note $var(z_1) = var(X a_1) = a_1^T S a_1$ where $S = \frac{1}{N - 1} X^T X$, the sample variance-covariance matrix (for centered X), and

$||a_1||^2 = \sum_l a_{1l}^2$

Your optimization problem: maximize the variance of the first linear combination subject to unit norm. This can be solved as an eigendecomposition problem.

Intuitively: you have a data cloud; the direction of greatest eccentricity, the greatest radius, is the first principal direction. There is a theorem that any symmetric matrix has a decomposition $A = \Gamma \Lambda \Gamma^T$ where $\Lambda$ is diagonal and $\Gamma$ is orthogonal. Why is PCA optimal? Consider a rank-q linear model for representing observations $x_i$ as $\mu + V_q \lambda_i$. Fitting this by least squares means minimizing the reconstruction error $\sum_i ||x_i - \mu - V_q \lambda_i||^2$, or $\sum_i ||(x_i - \bar{x}) - V_q V_q^T (x_i - \bar{x})||^2$. If we assume $\bar{x} = 0$, then this yields the projection $V_q V_q^T x_i$. The singular value decomposition $X = U D V^T$ gives an optimal choice for $V_q$: the first q columns of V.
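To illustrate the eigendecomposition view, the sketch below builds the sample covariance matrix S and finds its leading eigenvector by power iteration. Power iteration is swapped in here only because it fits in a few lines of plain Python; it assumes the largest eigenvalue is strictly larger than the rest and that the starting vector is not orthogonal to the answer. Function names and data are made up.

```python
from math import sqrt

def covariance(data):
    """Sample variance-covariance matrix S = X^T X / (N - 1) after centering."""
    n, p = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - mean[j] for j in range(p)] for row in data]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
            for a in range(p)]

def first_component(s, n_iter=200):
    """Leading unit eigenvector of S by power iteration, with its variance."""
    a = [1.0] * len(s)
    for _ in range(n_iter):
        a = [sum(s[i][j] * a[j] for j in range(len(s))) for i in range(len(s))]
        norm = sqrt(sum(v * v for v in a))
        a = [v / norm for v in a]
    # Variance of the projection: a^T S a.
    variance = sum(a[i] * s[i][j] * a[j]
                   for i in range(len(s)) for j in range(len(s)))
    return a, variance

# Hypothetical data cloud stretched along the diagonal.
data = [[2.0, 1.9], [1.0, 1.1], [0.0, 0.1], [-1.0, -1.2], [-2.0, -1.9]]
s = covariance(data)
a1, var1 = first_component(s)
print(a1, var1)
```

The variance along the first direction is at least as large as the variance of either original coordinate, which is the defining property of the first principal component.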


Lecture 17

PCA is a linear method: project onto a linear combination of eigenvectors, which are linear combinations of basis vectors. Kernel PCA: map the input domain to a feature space, $\Phi : X \to H$. This transformation is generally nonlinear. Look at the data in the new feature space. Look at the covariance matrix in the feature space, $S = \frac{1}{n} \sum_{j=1}^{n} \Phi(x_j) \Phi(x_j)^T$. Find the eigendecomposition of this, $Sv = \lambda v$, so that

$\lambda \langle \Phi(x_i), v \rangle = \langle \Phi(x_i), Sv \rangle$

The eigenvector lies in the span of the mapped data, $v = \sum_i a_i \Phi(x_i)$, where the $a_i$ are unknown. Putting these together, if $K_{ij} = \langle \Phi(x_i), \Phi(x_j) \rangle$, then $n \lambda K a = K^2 a$. Kernel trick: formulate PCA as an eigendecomposition of the kernel matrix. Centering K: subtract the column mean and the row mean. Compute the projections on the eigenvectors: $\langle v^j, \Phi(x) \rangle = \sum_i a_i^j k(x_i, x)$ where the $a^j$ are the eigenvector expansion coefficients. Examples of positive definite kernels: linear kernel $K(x, x') = x^T x'$; polynomial kernels $K(x, x') = (c + x^T x')^d$; Gaussian kernels $K(x, x') = e^{-||x - x'||^2 / 2\sigma^2}$. Mercer's Theorem: for a positive definite kernel, $\Phi(x) = (\sqrt{\lambda_1} \phi_1(x), \sqrt{\lambda_2} \phi_2(x), \ldots)$


Lecture 18

Sparse PCA. Formulate PCA as a regression-type optimization problem, impose the lasso constraint, and solve the penalized optimization problem. $\|Y - X\beta\|^2 + \lambda\|\beta\|_1$: that's the lasso penalty. It imposes sparseness. Or the ridge penalty: $\|Y - X\beta\|^2 + \lambda\|\beta\|_2^2$. If

$(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \sum_i \|x_i - \alpha\beta^T x_i\|^2 + \lambda\|\beta\|^2$ with $\|\alpha\|^2 = 1$,

then $\hat{\beta}$ is proportional to $V_1$, the first principal component direction. This is the same thing, with a slackness condition. Suppose we're looking at the first $k$ principal components, $A_{p \times k} = [\alpha_1 \dots \alpha_k]$, $B_{p \times k} = [\beta_1 \dots \beta_k]$, and

$(\hat{A}, \hat{B}) = \arg\min_{A, B} \sum_i \|x_i - AB^T x_i\|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2$

subject to $A^T A = I_{k \times k}$. Then $\hat{\beta}_j$ is proportional to $V_j$. Alternative: elastic net. You have variables $x_1, x_2, \dots$ and you have a subset which captures most of the data, but they may be highly collinear. Instead of picking just one to include in the model, pick them together as a group. Create a matrix whose $i$th column is the $i$th principal component: $XV = UD$, where $X = UDV^T$ and $U^T U = I$. Ridge regression of the $i$th principal component $XV_i$ on $X$ gives

$\hat{\beta}_{ridge} = (X^T X + \lambda I)^{-1} X^T (XV_i) = V D^2 (D^2 + \lambda I)^{-1} V^T V_i = V_i \frac{D_i^2}{D_i^2 + \lambda}$,

which is proportional to $V_i$.
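To see why the lasso penalty imposes sparseness while the ridge penalty only shrinks, it helps to look at the one-dimensional case with a single predictor fixed at $x = 1$. This is a simplification assumed here, not the lecture's setting, and the function names are hypothetical. The closed forms follow from setting the derivative of each penalized criterion to zero.

```python
def lasso_1d(y, lam):
    """argmin over b of (y - b)^2 + lam * |b|: soft thresholding at lam / 2."""
    shrunk = max(abs(y) - lam / 2.0, 0.0)
    return shrunk if y >= 0 else -shrunk

def ridge_1d(y, lam):
    """argmin over b of (y - b)^2 + lam * b^2: proportional shrinkage."""
    return y / (1.0 + lam)

small, large, lam = 0.3, 2.0, 1.0
lasso_small, lasso_large = lasso_1d(small, lam), lasso_1d(large, lam)
ridge_small, ridge_large = ridge_1d(small, lam), ridge_1d(large, lam)
```

With `lam = 1.0`, the lasso sets the small coefficient exactly to zero and moves the large one from 2.0 to 1.5, while ridge halves both and never produces an exact zero. The ridge factor $1/(1 + \lambda)$ is the scalar analogue of $D_i^2/(D_i^2 + \lambda)$ with $D_i = 1$.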


Lecture 19

Clustering: partition objects into homogeneous groups, so that objects are more similar within each cluster than across clusters. Obvious measure of dissimilarity: distance. Euclidean distance, sum of absolute differences ($L_1$ distance). In practice you might have categorical data, binary data, etc. Categorical variables: use a cost matrix, zero along the diagonal, where otherwise $d(A, B)$ is a cost value based on your domain knowledge: how bad a wrong choice is that? K-means algorithm: (1) choose $K$ centroids, one for each cluster; then iterate the following: (2) assign each object to the cluster with the closest centroid; (3) update the centroid of each cluster to the mean of all objects in the cluster. This minimizes (at least locally) the squared error criterion

$W(S, c) = \sum_{k=1}^{K} \sum_{i \in S_k} d(i, c_k)$,

where $S_k$ is the set of objects in cluster $k$, $c_k$ is its centroid, and $d$ is the squared Euclidean distance.
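The K-means iteration described above can be sketched as follows. This is a minimal illustration, assuming squared Euclidean distance and starting centroids supplied by the caller. The function names, the empty-cluster rule, and the sample points are choices made here, not part of the lecture.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update; returns (labels, centroids)."""
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [min(range(len(centroids)),
                          key=lambda k: sq_dist(p, centroids[k]))
                      for p in points]
        if new_labels == labels:
            break  # no assignment changed, so the centroids would not move
        labels = new_labels
        # Update step: each centroid becomes the mean of its cluster.
        for k in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def within_ss(points, labels, centroids):
    """W(S, c): sum of squared distances to the assigned centroids."""
    return sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))

# Two well-separated made-up groups; start from one point in each.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.0], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
W = within_ss(points, labels, centroids)
```

On this toy data the loop stops on the second pass, once the assignments stop changing. With a poor choice of starting centroids the result can be a local rather than global minimum of $W$.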

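Step 2 (assignment) minimizes the above given a choice of centroids $c$, and Step 3 (centroid update) minimizes it given a set of clusters $S$. This converges in a finite number of steps. Let

$T(Y) = \sum_{i=1}^{N} (Y_i - \bar{Y})^2$

be the total sum of squares (the variance, generally, up to a constant factor), and let

$W(S, c) = \sum_{k=1}^{K} \sum_{i \in S_k} (Y_i - \bar{Y}_k)^2$

$B(S, c) = \sum_{k=1}^{K} N_k (\bar{Y}_k - \bar{Y})^2$

where $\bar{Y}_k$ is the mean of cluster $k$ and $N_k$ is its size.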

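These are the within-cluster variance and the between-cluster variance, and $T(Y) = W(S, c) + B(S, c)$. Bias-variance tradeoff strikes again. The cross term equals 0:

$\sum_{k} \sum_{i \in S_k} (Y_i - \bar{Y}_k)(\bar{Y}_k - \bar{Y}) = \sum_{k} (\bar{Y}_k - \bar{Y}) \sum_{i \in S_k} (Y_i - \bar{Y}_k)$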

The inner sum is 0, so the total is 0. How do we choose K? Plot the within-cluster variance against K and look for a kink, where the (K-1)-cluster solution doesn't resemble the K-cluster solution at all. Consistency: how much does cluster assignment change over different random subsets of the data? (Cross validation.)
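The decomposition of the total sum of squares into within-cluster and between-cluster parts can be checked numerically. The sketch below uses made-up one-dimensional data and a hypothetical helper `decompose`. It takes the cluster labels as given rather than running K-means.

```python
def mean(xs):
    return sum(xs) / len(xs)

def decompose(values, labels):
    """Return (T, W, B) for one-dimensional data with given cluster labels."""
    grand = mean(values)
    clusters = {}
    for y, k in zip(values, labels):
        clusters.setdefault(k, []).append(y)
    T = sum((y - grand) ** 2 for y in values)
    W = sum((y - mean(ys)) ** 2 for ys in clusters.values() for y in ys)
    B = sum(len(ys) * (mean(ys) - grand) ** 2 for ys in clusters.values())
    return T, W, B

values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
T, W, B = decompose(values, [0, 0, 0, 1, 1, 1])   # two natural clusters
T1, W1, B1 = decompose(values, [0] * 6)           # everything in one cluster
```

With two clusters nearly all of the total sum of squares is between-cluster. With a single cluster `B1` is zero and `W1` equals `T1`, so the drop in the within-cluster term from K = 1 to K = 2 is the kind of kink to look for when choosing K.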


Lecture 20

Last time: K-means. The between-cluster sum of squares should be large: it measures how separated the data are. The within-cluster sum of squares should be as small as possible. Model-based clustering: assume there's an underlying model, with underlying probability density $f(y) = \sum_{k=1}^{K} p_k\, \phi(y; \mu_k, \Sigma_k)$, where the $p_k$ are mixture probabilities and $\phi(\cdot\,; \mu_k, \Sigma_k)$ is the density of component $k$ (Gaussian with mean $\mu_k$ and covariance $\Sigma_k$). Estimate $p_k, \mu_k, \Sigma_k$ by maximum likelihood. This is done by an iterative procedure, the EM algorithm. The latent variable is the class membership. If you know that, the estimate for $\mu$ and $\Sigma$ is straightforward. In the E-step, estimate the expectations of the latent variable given the current parameters; in the M-step, re-estimate the parameters given the expectation of the latent variable. At each E-step, estimate the posterior probability for each data point:

$g_{ik} = \frac{p_k\, \phi(y_i; \mu_k, \Sigma_k)}{\sum_{j=1}^{K} p_j\, \phi(y_i; \mu_j, \Sigma_j)}$

At each M-step, given $g_{ik}$, find the $p_k, \mu_k, \Sigma_k$ maximizing the log-likelihood. With $g_k = \sum_i g_{ik}$:

$\hat{p}_k = g_k / N$

$\hat{\mu}_k = \sum_i g_{ik}\, y_i / g_k$

$\hat{\Sigma}_k = \sum_i g_{ik} (y_i - \hat{\mu}_k)(y_i - \hat{\mu}_k)^T / g_k$
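The E-step and M-step above can be sketched for the simplest case, a one-dimensional Gaussian mixture, where each $\Sigma_k$ reduces to a scalar variance. This is an illustrative sketch under that assumption. The function names, starting values, fixed iteration count, and data are made up, and no safeguards against degenerate components are included.

```python
import math

def normal_pdf(y, mu, var):
    """Density of a one-dimensional Gaussian with mean mu and variance var."""
    return math.exp(-(y - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_step(ys, p, mu, var):
    """One EM iteration for a K-component one-dimensional Gaussian mixture."""
    K, N = len(p), len(ys)
    # E-step: posterior probabilities g[i][k] under the current parameters.
    g = []
    for y in ys:
        w = [p[k] * normal_pdf(y, mu[k], var[k]) for k in range(K)]
        total = sum(w)
        g.append([wk / total for wk in w])
    # M-step: weighted means, variances, and mixture probabilities.
    gk = [sum(g[i][k] for i in range(N)) for k in range(K)]
    mu = [sum(g[i][k] * ys[i] for i in range(N)) / gk[k] for k in range(K)]
    var = [sum(g[i][k] * (ys[i] - mu[k]) ** 2 for i in range(N)) / gk[k]
           for k in range(K)]
    p = [gk[k] / N for k in range(K)]
    return p, mu, var, g

# Made-up data with two groups, near 1 and near 5.
ys = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.0, 5.2, 5.1, 4.8]
p, mu, var = [0.5, 0.5], [0.0, 6.0], [1.0, 1.0]
for _ in range(50):
    p, mu, var, g = em_step(ys, p, mu, var)

# Hard assignment: the component with the largest posterior probability.
labels = [max(range(2), key=lambda k: gi[k]) for gi in g]
```

A fixed number of iterations is used here for brevity. In practice one would monitor the log-likelihood and stop when it no longer increases.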

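Each data point $y_i$ will be assigned to the cluster $k$ for which $g_{ik}$ is maximal over all $k$. Connection with K-means: assume $\Sigma_k = \sigma^2 I$ is diagonal, with no correlation between the coordinates of the multivariate data, and assign each point to a single cluster $S_k$. Then to maximize the log-likelihood, you're minimizing

$l(\mu_k, \sigma^2, S_k) = \sum_{k} \sum_{i \in S_k} (y_i - \mu_k)^T (y_i - \mu_k) / 2\sigma^2$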

This is the squared error criterion in the K-means algorithm. In the extreme case, if I observe two extreme clusters, what would you expect to see if you fit a regression line? A line that follows the axis between the clusters. But the relationship within each cluster would be two separate regression lines.