Machine Learning Handbook

Predrag Radivojac and Martha White

October 10, 2016

Table of Contents

1 Introduction to Probabilistic Modeling 3
1.1 Probability Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Axioms of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.2 Probability mass functions . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.1.3 Probability density functions . . . . . . . . . . . . . . . . . . . . . . . 9
1.1.4 Multidimensional distributions . . . . . . . . . . . . . . . . . . . . . . 13
1.1.5 Conditional probabilities . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1.6 Independence of events . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.1.7 Interpretation of probability . . . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Random Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.2.1 Formal definition of random variable . . . . . . . . . . . . . . . . . . . 17
1.2.2 Joint and marginal distributions . . . . . . . . . . . . . . . . . . . . . 19
1.2.3 Conditional distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.2.4 Independence of random variables . . . . . . . . . . . . . . . . . . . . 22
1.2.5 Expectations and moments . . . . . . . . . . . . . . . . . . . . . . . . 22
1.2.6 Mixtures of distributions . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.2.7 Graphical representation of probability distributions** . . . . . . . . . 29

2 Basic Principles of Parameter Estimation 33
2.1 Maximum a posteriori and maximum likelihood . . . . . . . . . . . . . . . . . 33
2.2 The relationship with KL divergence** . . . . . . . . . . . . . . . . . . . . . . 39

3 Introduction to Prediction Problems 40
3.1 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Useful notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Optimal classification and regression models . . . . . . . . . . . . . . . . . . . 42
3.4 Bayes Optimal Models** . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4 Linear Regression 48
4.1 Maximum likelihood formulation . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Ordinary Least-Squares (OLS) Regression . . . . . . . . . . . . . . . . . . . . 51
4.2.1 Weighted error function . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2.2 Predicting multiple outputs simultaneously . . . . . . . . . . . . . . . 54
4.2.3 Expectation and variance for the solution vector . . . . . . . . . . . . 54
4.3 Linear regression for non-linear problems . . . . . . . . . . . . . . . . . . . . . 55
4.3.1 Polynomial curve fitting . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Practical considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.1 Stability and sensitivity of the solutions . . . . . . . . . . . . . . . . . 57
4.4.2 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.3 Handling big data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . 59


5 Generalized Linear Models 61
5.1 Loglinear link and Poisson distribution . . . . . . . . . . . . . . . . . . . . . . 61
5.2 Exponential family of distributions . . . . . . . . . . . . . . . . . . . . . . . . 63
5.3 Formalizing generalized linear models . . . . . . . . . . . . . . . . . . . . . . . 65
5.4 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.5 Connection to Bregman divergences** . . . . . . . . . . . . . . . . . . . . . . 66

6 Linear Classifiers 67
6.1 Logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.1.1 Predicting class labels . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.1.2 Maximum conditional likelihood estimation . . . . . . . . . . . . . . . 69
6.1.3 Stochastic optimization for logistic regression . . . . . . . . . . . . . . 71
6.1.4 Issues with minimizing Euclidean distance . . . . . . . . . . . . . . . . 72
6.2 Naive Bayes Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2.1 Binary features and linear classification . . . . . . . . . . . . . . . . . 74
6.2.2 Continuous naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . 76
6.3 Multinomial logistic regression . . . . . . . . . . . . . . . . . . . . . . . . . . 77

7 Representations for machine learning 79
7.1 Radial basis function networks and kernel representations . . . . . . . . . . . 79
7.2 Learning representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
7.2.1 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.2.2 Unsupervised learning and matrix factorization . . . . . . . . . . . . . 85

8 Empirical Evaluation 91
8.1 Comparison of Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . 91
8.2 Performance of Classification Models . . . . . . . . . . . . . . . . . . . . . . . 92

9 Advanced topics 93
9.1 Bayesian estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
9.2 Parameter estimation for mixtures of distributions . . . . . . . . . . . . . . . 95
9.3 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

Bibliography 101

A Optimization background 102
A.1 First-order gradient descent . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
A.2 Second-order optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
A.3 Quasi-second-order gradient descent . . . . . . . . . . . . . . . . . . . . . . . 103
A.4 Constrained optimization and Lagrange multipliers . . . . . . . . . . . . . . . 103

B Linear algebra background 104
B.1 An Algebraic Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
B.1.1 The four fundamental subspaces . . . . . . . . . . . . . . . . . . . . . 104
B.1.2 Minimizing ‖Ax − b‖2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 105


Chapter 1

Introduction to Probabilistic Modeling

Modeling the world around us and making predictions about the occurrence of events is a multidisciplinary endeavor standing on the solid foundations of probability theory, statistics, and computer science. Probability theory brings the mathematical infrastructure, firmly grounded in its axioms, for manipulating probabilities and equips us with a broad range of models with well-understood theoretical properties. Statistics contributes frameworks to formulate inference and the process of narrowing down the model space based on the observed data and our experience, in order to find, and then analyze, solutions. Computer science provides us with theories, algorithms, and software to manage the data, compute the solutions, and study the relationship between solutions and available resources (time, space, computer architecture, etc.). Although intertwined in the process of modeling, these fields have relatively discernible roles and can be, to a degree, studied individually. Interestingly, various other disciplines have also contributed to the core of probabilistic modeling; concepts such as a Boltzmann distribution, a genetic algorithm, or a neural network illustrate the influence of physics, biology, and psychology. As such, these three disciplines form the core quantitative framework for all of empirical science and beyond.

Probability theory and statistics have a relatively long history; the formal origins of both can be traced to the 17th century. Probability theory was developed out of efforts to understand games of chance and gambling. The correspondence between Blaise Pascal and Pierre de Fermat in 1654 serves as the oldest record of modern probability theory. Statistics, on the other hand, originated from data collection initiatives and attempts to understand trends in society (e.g., mortality causes, value of land) and political affairs (e.g., public revenues, taxation, armies). The two disciplines started to merge in the 18th century with the use of data for inferential purposes in astronomy, geography, and social sciences. The increased complexity of models and availability of data in the 19th century emphasized the importance of computing machines. This contributed to establishing the foundations of the field of computer science in the 20th century, which is generally attributed to the introduction of the von Neumann architecture and the formalization of the concept of an algorithm. The convergence of the three disciplines has now reached the status of a principled theory of probabilistic inference, with widespread applications in science, business, medicine, manufacturing, the military, political campaigns, and engineering.

We will refer to the process of modeling, inference, and decision making based on probabilistic models as probabilistic reasoning or reasoning under uncertainty. Some form of reasoning under uncertainty is a necessary component of everyday life. When driving, for example, we often make decisions based on our expectations about which way would be best to take. While these situations do not usually involve an explicit use of probabilities and probabilistic models, an intelligent driverless car such as Google Chauffeur must make use of them.

And so must a spam detection software in an email client, a credit card fraud detection system, or an algorithm that infers whether a particular genetic mutation will result in disease.

At a basic level, probabilities are used to quantify the chance of the occurrence of events, often in the presence of evidence, about events influenced by factors that we either do not fully understand or have no control of. As Jacob Bernoulli brilliantly put it in his work The Art of Conjecturing (1713), "[p]robability [...] is the degree of certainty, and it differs from the latter as a part differs from the whole". He later adds, "To make a conjecture [prediction] about something is the same as to measure its probability. Therefore, we define the art of conjecturing [science of prediction] or stochastics, as the art of measuring probabilities of things as accurately as possible, to the end that, in judgements and actions, we may always choose or follow that which has been found to be better, more satisfactory, safer, or more carefully considered."

The techniques of probabilistic modeling formalize many intuitive concepts. To provide a quick insight into the concept of uncertainty and modeling, consider rolling a fair six-sided die. We could, or so we think, accurately predict the outcome of a roll if we carefully incorporated the initial position, force, friction, shape defects, and other physical factors and then executed the experiment. But the physical laws may not be known, they can be difficult to incorporate, or such actions may not even be allowed by the rules of the experiment. Therefore, it is unrealistic to expect to have full information, and it is practically useful to simply assume that each outcome is equally likely. Assigning an equal chance (probability) to each outcome of the roll of a die provides an efficient and elegant way of modeling the uncertainties inherent to the experiment; in the absence of any other information, if we rolled the die many times, we would indeed observe that each number is observed roughly equally.

Another, more realistic example in which collecting data provides a basis for simple probabilistic modeling is a situation of driving to work every day and predicting how long it will take us to reach the destination tomorrow. If we recorded the "time to work" for a few months, we would observe that trips generally took different times depending on many internal (e.g., preferred speed for the day) and external factors (e.g., weather, road works, encountering a slow driver). While these events, if known, could in principle be used to predict the exact duration of the commute, it is useful to instead aggregate external factors by collecting data over a period of time and providing the distribution of the commute time. Such a distribution would then facilitate reasoning about events such as making it on time to an important meeting at 9 am. One way of using the recorded data is to create histograms and calculate percentiles; another would be to estimate the parameters of some mathematical function that fits the data well. Both approaches are illustrated in Figure 1.1.

As the examples above suggest, the techniques of probabilistic modeling provide a formalism for dealing with repetitive "experiments" influenced by a number of external factors over which we have little control or knowledge. However, we shall see later that probabilities need not be assigned only to events that repeat, but to any event in general, including subjective assessments (beliefs) about the occurrence of non-repetitive events. As long as they are assigned according to the formal axioms of probability, we can make inferences because the mathematical formalism does not depend on how the probabilities are assigned. Therefore, we first need to understand the concept of probability and then introduce a formal theory to incorporate evidence (e.g., data collected from instruments) in order to make good decisions in a range of situations.
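To make the two approaches concrete, here is a minimal sketch in Python (assuming NumPy and SciPy are available). The commute times below are a small hypothetical stand-in, not the data set from Figure 1.1.

```python
import numpy as np
from scipy import stats

# Hypothetical commute times in minutes (stand-in for real recordings).
times = np.array([10.5, 12.1, 9.8, 11.4, 13.0, 10.9, 12.7, 11.8, 10.2, 14.3])

# Approach 1: empirical summaries from the raw data (histograms, percentiles).
p90_empirical = np.percentile(times, 90)     # "90% of trips take at most this long"

# Approach 2: fit a parametric (gamma) model and query it instead.
shape, loc, scale = stats.gamma.fit(times, floc=0)     # maximum likelihood fit
p90_model = stats.gamma.ppf(0.90, shape, loc=loc, scale=scale)

# Probability of being late if only 15 minutes remain before the meeting.
p_late = 1 - stats.gamma.cdf(15, shape, loc=loc, scale=scale)
print(p90_empirical, p90_model, p_late)
```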

Figure 1.1: A histogram of recordings of the commute time (in minutes) to work, for a distance of roughly 3.1 miles. The data set contains 340 measurements collected over one year. The data was modeled using a gamma family of probability distributions, with the particular location and scale parameters estimated from the raw data. The values of the gamma distribution are shown as dark circles.

Although it might seem that fitting the data using a gamma distribution brings little value in this one-dimensional situation, this approach is far superior on high-dimensional data, where the number of bins in a multidimensional histogram can be orders of magnitude larger than the data set size. It also provides us an opportunity to incorporate our assumptions and existing knowledge into modeling. In many ways, the main goal of probabilistic modeling is to formulate a particular question or a hypothesis pertaining to the physical world as an experiment, collect the data, and then construct a model. Once a model is created, we can compute quantitative measures of sets of outcomes we are interested in and assess the confidence we should have in these measures. But let us start from the beginning.

1.1 Probability Theory

Probability theory can be seen as a branch of mathematics that deals with set functions. At the heart of probability theory is the concept of an experiment. An experiment can be the process of tossing a coin, rolling a die, checking the temperature tomorrow, or figuring out the location of one's keys. When carried out, each experiment has an outcome, which is an element "drawn" from a set of predefined options, potentially infinite in size. The outcome of a roll of a die is a number between one and six; the outcome of checking the temperature tomorrow is typically an integer or sometimes a real number; the outcome of the location of one's keys can be a discrete set of places such as a kitchen table, under a couch, in the office, etc.

1.1.1 Axioms of probability

We start by introducing the axioms of probability. Let the sample space (Ω) be a non-empty set of outcomes of the experiment and the event space (F) be a non-empty set of subsets of Ω such that

1. A ∈ F ⇒ Ac ∈ F
2. A1, A2, . . . ∈ F ⇒ ∪∞i=1 Ai ∈ F

where A and all Ai's are events, Ø is the empty set, and Ac is the complement of A, i.e., Ac = Ω − A. If both conditions hold, F is called a sigma field, or sigma algebra, and is a set of so-called measurable events.¹ The tuple (Ω, F) is then called a measurable space. It is important to emphasize that the definition of a sigma field requires that F be closed under both finite and countably infinite numbers of basic set operations (union, intersection, complementation, and set difference). The operations union and complementation are in the definition. For intersection, we can use De Morgan's laws: ∪Ai = (∩Aci)c and ∩Ai = (∪Aci)c, so any intersection of sets in F must again be in F because F is closed under union and complementation. Similarly for set difference, we can write A1 − A2 = (A1 ∩ A2)c ∩ A1, which then implies A1 − A2 ∈ F. Therefore, a sigma field is also closed under intersection and set difference. Finally, because F is non-empty, all the above conditions imply that Ω ∈ F and Ø ∈ F.

Let (Ω, F) be a measurable space. Any function P : F → [0, 1] where

1. P(Ω) = 1
2. ∀A1, A2, . . . ∈ F with Ai ∩ Aj = Ø for i ≠ j ⇒ P(∪∞i=1 Ai) = Σ∞i=1 P(Ai)

is called a probability measure or a probability distribution and is said to satisfy the axioms of probability.² The tuple (Ω, F, P) is called the probability space.

The beauty of these axioms lies in their compactness and elegance. Many useful expressions can be derived from them. For example, it is obvious that P(Ø) = 0 or P(Ac) = 1 − P(A). Another important expression, shown here without a derivation, is that

P(A ∪ B) = P(A) + P(B) − P(A ∩ B).

Another formula that is particularly important can be derived by considering a partition of the sample space, i.e., a set of d non-overlapping sets {Bi}di=1 such that Ω = ∪di=1 Bi. That is, if A is any set in Ω and if {Bi}di=1 is a partition of Ω, it follows that

P(A) = P(A ∩ Ω)
     = P(A ∩ (∪di=1 Bi))
     = P(∪di=1 (A ∩ Bi))
     = Σdi=1 P(A ∩ Bi),                                        (1.1)

where the last line followed from the axioms of probability. We will refer to this expression as the sum rule.

It is convenient to separately consider discrete (countable) and continuous (uncountable) sample spaces. A roll of a die draws numbers from a finite space Ω = {1, 2, 3, 4, 5, 6}. For finite and other countable sample spaces (e.g., the set of integers Z), F is usually the power set P(Ω).

¹ This terminology is due to historical reasons; though it sounds complex, one can simply think of a sigma field as the set of events to which we can assign probabilities.
² It seems intuitive that the second condition could be replaced with a union of finite sets (the simpler requirement of additivity rather than σ-additivity). However, closure under finite unions may not result in closure under infinite unions. On the other hand, closure under infinite unions of disjoint sets (σ-additivity) implies finite closure (additivity), because the remaining sets can be set to the empty set Ø: for A1, A2 ∈ F with A1 ∩ A2 = Ø, set Ai = Ø for i > 2 to get P(A1 ∪ A2) = P(∪∞i=1 Ai) = Σ∞i=1 P(Ai) = P(A1) + P(A2).

An example of a continuous sample space is the set of real numbers R. Technically, sample spaces can also be mixed, e.g., Ω = [0, 1] ∪ {2} or Ω = [0, 1] × {0, 1}, but we will generally avoid such spaces here. Discrete and continuous sample spaces give rise to discrete and continuous probability distributions, respectively. As we shall see later, for uncountable spaces F must be a proper subset of P(Ω), i.e., F ⊂ P(Ω), because there exist sets over which one cannot integrate.

Owing to the many constraints in defining the distribution function P, it is clear that it cannot be chosen arbitrarily; for example, if Ω = [0, 1], we cannot assign P([0, 1/2)) = 1/2 and P([1/2, 1]) = 1/3, because probabilities of complement sets must sum to one. In practice, therefore, we rarely define (Ω, F, P) directly; it is easier to define P indirectly, i.e., by selecting a probability mass function or a probability density function. These functions are defined directly on the sample space, where we have fewer restrictions to be concerned with compared to the event space. We address these two ways of defining probability distributions next.

1.1.2 Probability mass functions

Let Ω be a discrete (finite or countably infinite) sample space and F = P(Ω). A function p : Ω → [0, 1] is called a probability mass function (pmf) if

Σω∈Ω p(ω) = 1.

The probability of any event A ∈ F is defined as

P(A) = Σω∈A p(ω).

It is straightforward to verify that P satisfies the axioms of probability and, thus, is a probability distribution. In this case we say that the probability measure P is induced by a pmf p. In particular, P({ω}) = p(ω) for every ω ∈ Ω, and the probability of a set is always equal to the sum of probabilities of individual elements; e.g., P({1}) = p(1), P({2}) = p(2), P({1, 2}) = p(1) + p(2), etc. It is important to note that P is defined on the elements of F, whereas p is defined on the elements of Ω. In discrete cases, we can thus define a discrete probability space by providing the tuple (Ω, F, P), but it is often much simpler to define P indirectly by assuming that F = P(Ω) and providing a probability mass function p.

Example 1: Consider a roll of a fair six-sided die. What is the probability that the outcome is a number greater than 4? Here, Ω = {1, 2, 3, 4, 5, 6} and the event space is F = P(Ω). Because the die is fair, we know that p(ω) = 1/6 for ∀ω ∈ Ω. Now, let A be an event in F that the outcome is greater than 4, i.e., A = {5, 6}. Then

P(A) = Σω∈A p(ω) = 1/3. □
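Example 1 can be verified with a few lines of Python; this is just a direct translation of the definitions above, using the sample space, pmf, and event from the example.

```python
# Probability mass function for a fair six-sided die and the event A = {5, 6}.
omega = {1, 2, 3, 4, 5, 6}
p = {w: 1 / 6 for w in omega}          # pmf p(w)

A = {w for w in omega if w > 4}        # event "outcome greater than 4"
P_A = sum(p[w] for w in A)             # P(A) = sum of p(w) over w in A
print(P_A)                             # 0.333...
```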

A few useful pmfs

Let us now look at some families of functions that are often used to induce discrete probability distributions. This is by no means a comprehensive review of the subject; we shall simply focus on a few basic concepts and will later introduce other distributions as needed. For simplicity, we will often refer to both pmfs and the probability distributions they induce as distribution functions.

The Bernoulli distribution derives from the concept of a Bernoulli trial, an experiment that has two possible outcomes: success and failure. In a Bernoulli trial, a success occurs with probability α and, thus, a failure occurs with probability 1 − α. A toss of a coin (heads/tails), a basketball game (win/loss), or a roll of a die (even/odd) can all be seen as Bernoulli trials. We model this distribution by setting the sample space to two elements and defining the probability of one of them as α. More formally, Ω = {success, failure} and

p(ω) = α if ω = success,  p(ω) = 1 − α if ω = failure,

where α ∈ (0, 1) is the parameter indicating the probability of success in a single trial. If we take instead that Ω = {0, 1}, we can compactly write the Bernoulli distribution as p(k) = α^k · (1 − α)^(1−k) for k ∈ Ω. To be precise, the Bernoulli distribution as presented above is actually a family of discrete probability distributions, one for each α; we shall refer to each such distribution as Bernoulli(α). However, we do not need to concern ourselves with semantics because the correct interpretation of a family vs. individual distributions will usually be clear from the context.

The binomial distribution is used to describe a sequence of n independent and identically distributed (i.i.d.) Bernoulli trials. At each value k in the sample space, the distribution gives the probability that the success happened exactly k times out of n trials. More specifically, given Ω = {0, 1, . . . , n}, for ∀k ∈ Ω the binomial pmf is defined as

p(k) = (n choose k) α^k (1 − α)^(n−k),

where α ∈ (0, 1) is, as before, the parameter indicating the probability of success in a single trial. Here we replaced ω with k, which is a more common notation when the sample space is comprised of integers, and the binomial coefficient

(n choose k) = n! / (k!(n − k)!)

enumerates all ways in which one can pick k elements from a list of n elements (e.g., there are 3 different ways in which one can pick k = 2 elements from a group of n = 3 elements). We will refer to a binomial distribution with parameters n and α as Binomial(n, α). The experiment leading to a binomial distribution can be generalized to a situation with more than two possible outcomes; this results in a multidimensional probability mass function (one dimension per possible outcome) called the multinomial distribution, which we shall see later.
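A small sketch of the Bernoulli and binomial pmfs in Python, using only the standard library; the parameter values are arbitrary choices for illustration.

```python
from math import comb

def bernoulli_pmf(k, alpha):
    """p(k) = alpha^k (1 - alpha)^(1 - k) for k in {0, 1}."""
    return alpha ** k * (1 - alpha) ** (1 - k)

def binomial_pmf(k, n, alpha):
    """Probability of exactly k successes in n Bernoulli(alpha) trials."""
    return comb(n, k) * alpha ** k * (1 - alpha) ** (n - k)

# Sanity check: the Binomial(n, alpha) pmf sums to 1 over k = 0..n.
n, alpha = 9, 0.4
assert abs(sum(binomial_pmf(k, n, alpha) for k in range(n + 1)) - 1) < 1e-12
```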

The Poisson distribution can be derived as a limit of the binomial distribution as n → ∞ with a fixed expected number of successes λ (the relationship with the binomial distribution can be obtained by taking α = λ/n). Unlike the previous two distributions, Poisson(λ) is defined over an infinite sample space, Ω = {0, 1, 2, . . .}, and for ∀k ∈ Ω

p(k) = λ^k e^(−λ) / k!,

where λ ∈ (0, ∞) is a parameter. The Poisson distribution is often used to model counts of events occurring sequentially and independently but with a fixed average (λ) in a particular time interval.

The geometric distribution is also used to model a sequence of independent Bernoulli trials with the probability of success α. At each point k ∈ Ω, it gives the probability that the first success occurs exactly in the k-th trial. The geometric distribution, Geometric(α), is likewise defined over an infinite, but still countable, sample space. Here, Ω = N = {1, 2, . . .} and for ∀k ∈ Ω

p(k) = (1 − α)^(k−1) α,

where α ∈ (0, 1) is a parameter.

The uniform distribution for discrete sample spaces is defined over a finite set of outcomes, each of which is equally likely to occur. Here we can set Ω = {1, . . . , n}; then for ∀k ∈ Ω

p(k) = 1/n.

The uniform distribution does not contain parameters; it is defined by the size of the sample space. We refer to this distribution as Uniform(n). We will see later that the uniform distribution can also be defined over finite intervals in continuous spaces.

For the hypergeometric distribution, consider a finite population of N elements of two types (e.g., success and failure), K of which are of one type (e.g., success). The experiment consists of drawing n elements, without replacement, from this population, such that the elements remaining in the population are equiprobable in terms of being selected in the next draw. The probability of drawing k successes out of n trials can be described as

p(k) = (K choose k)(N − K choose n − k) / (N choose n),

where 0 ≤ n ≤ N and k ≤ n. We will refer to the hypergeometric distribution as Hypergeometric(n, N, K). The hypergeometric distribution is intimately related to the binomial distribution, in which the elements are drawn with replacement (α = K/N); there, unlike in the hypergeometric case, the probability of drawing a success does not change in subsequent trials.

All of the functions above satisfy the definition of a probability mass function, which we can verify by summing over all possible outcomes in the sample space. Four examples are shown in Figure 1.2.
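Similarly, the Poisson and geometric pmfs can be written directly from their definitions. The last lines illustrate, numerically, the limit relationship between Binomial(n, λ/n) and Poisson(λ) mentioned above; the specific values of λ, n, and k are arbitrary.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    return lam ** k * exp(-lam) / factorial(k)

def geometric_pmf(k, alpha):          # k = 1, 2, ... (trial of first success)
    return (1 - alpha) ** (k - 1) * alpha

# Poisson(lambda) as a limit of Binomial(n, lambda/n) for large n.
lam, n, k = 4.0, 10_000, 3
binom = comb(n, k) * (lam / n) ** k * (1 - lam / n) ** (n - k)
print(binom, poisson_pmf(k, lam))     # the two values are close
```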

Figure 1.2: Four discrete probability mass functions: Binomial(9, 0.4), Geometric(0.25), Uniform(9), and Poisson(4).

1.1.3 Probability density functions

We shall see soon that the treatment of continuous probability spaces is analogous to that of discrete spaces, with probability density functions replacing probability mass functions and integrals replacing sums. However, there are fundamental differences between the two situations which we should keep in mind whenever working with continuous spaces. For discrete spaces, we can assign probabilities to outcomes quite freely, as long as the probabilities sum to 1; for example, one could have a table of 365 values between zero and one for the probability of a birthday falling on each day. We will see that for continuous spaces it is more difficult to define consistent probabilities, and we will often restrict ourselves to a limited set of known distributions.

The main obstacle in generalizing the theory to uncountable sample spaces lies in addressing mathematical nuances involving infinitesimal calculus, the countable nature of a sigma field, and limitations of the definition of integrals. There exist sets over which we cannot integrate, and thus the set of events F cannot be the power set of any uncountable set (e.g., R); interestingly, the use of P(R) as the event space would lead to a flawed theory. It is therefore necessary to define an adequate event space which would be applicable to a vast majority of practically important situations. To operate with continuous sample spaces we proceed by taking Ω = R and defining the Borel field, denoted by B(R). The Borel field on R is a set that contains all points in R, all open, semi-open and closed intervals in R, as well as sets that can be obtained by a countable number of basic set operations on them. By definition, B(R) is a sigma field. B(R) is an uncountably infinite set, but still smaller than P(R); the construction of subsets of R that are not in B(R) is difficult and only of theoretical importance (e.g., Vitali sets). Therefore, when discussing probability distributions over continuous sample spaces, we will usually take Ω = R to be the sample space and implicitly consider F = B(R) to be the event space.

Let now Ω be a continuous sample space and F = B(Ω). A function p : Ω → [0, ∞) is called a probability density function (pdf) if

∫Ω p(ω)dω = 1.

The probability of an event A ∈ B(Ω) is defined as

P(A) = ∫A p(ω)dω.

There are a few mathematical nuances associated with this definition.

First, the standard Riemann integration does not work for some sets in the Borel field (e.g., how would you integrate over the set of rational or irrational numbers within [0, 1] for any pdf?). Luckily, probability density functions are formally defined using Lebesgue integration, which allows us to integrate over all sets in B(Ω); Riemann integration, when possible, provides identical results as Lebesgue's integrals. If you have not heard of this distinction before, you can safely ignore these two terms and just go forward with the definition of integration that you are used to; it will suffice to use Riemann integration in all situations of our interest. Second, we mentioned before for pmfs that the probability of a singleton event {ω} is the value of the pmf at the sample point ω, i.e., P({ω}) = p(ω). In contrast, the value of a pdf at point ω is not a probability; it can actually be greater than 1. The probability at any single point, but also over any finite or countably infinite set, is 0 (i.e., they constitute a set of measure zero). One way to think about the probabilities in continuous spaces is to look at small intervals A = [x, x + Δx] as

P(A) = ∫x^(x+Δx) p(ω)dω ≈ p(x)Δx.

Thus, a potentially large value of the density function is compensated by the small interval Δx to result in a number between 0 and 1.

A few useful pdfs

Some important probability density functions are reviewed below. As before, the sample space will be defined for each distribution and the Borel field will be implicitly assumed as the event space. When possible, we will refer to the subset of R where p(ω) > 0 as the support of the density function.

The uniform distribution is defined by an equal value of a probability density function over a finite interval in R. Thus, for Ω = [a, b], the uniform probability density function ∀ω ∈ [a, b] is defined as

p(ω) = 1 / (b − a).

Note that one can also define Uniform(a, b) by taking Ω = R and setting p(ω) = 0 whenever ω is outside of [a, b]. This form is convenient because Ω = R can then be used consistently for all one-dimensional probability distributions.

The exponential distribution is defined over a set of non-negative numbers, Ω = [0, ∞). Its probability density function is

p(ω) = λe^(−λω),

where λ > 0 is a parameter. As before, the sample space can be extended to all real numbers, in which case we would set p(ω) = 0 for ω < 0.
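The small-interval approximation P([x, x + Δx]) ≈ p(x)Δx, and the fact that a density value is not itself a probability, can be checked numerically. The sketch below (assuming NumPy and SciPy) uses the exponential and uniform densities defined above with arbitrary parameter values.

```python
import numpy as np
from scipy.integrate import quad

lam = 2.0
p = lambda w: lam * np.exp(-lam * w)     # Exponential(lambda) density on [0, inf)

# Probability of a small interval is approximately p(x) * dx.
x, dx = 0.5, 0.01
exact, _ = quad(p, x, x + dx)
print(exact, p(x) * dx)                  # nearly identical values

# A density can exceed 1: Uniform(0, 1/2) has p(w) = 2 on its support,
# yet it still integrates to 1 over [0, 1/2].
print(quad(lambda w: 2.0, 0, 0.5)[0])    # 1.0
```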

The Gaussian distribution or normal distribution is one of the most frequently used probability distributions. It is defined over Ω = R as

p(ω) = (1 / √(2πσ²)) e^(−(ω−µ)² / (2σ²)),

where µ ∈ R and σ > 0 are parameters. We will refer to this distribution as Gaussian(µ, σ²) or N(µ, σ²).

The lognormal distribution is a modification of the normal distribution. For Ω = (0, ∞), the lognormal density can be expressed as

p(ω) = (1 / (ω√(2πσ²))) e^(−(ln ω − µ)² / (2σ²)),

where µ ∈ R and σ > 0 are parameters. We will refer to this distribution as Lognormal(µ, σ²) or ln N(µ, σ²). Both the Gaussian and the exponential distribution are members of a broader family of distributions called the exponential family; we will see a general definition of this family later.

The Gumbel distribution belongs to the class of extreme value distributions. Its probability density function is defined on Ω = R as

p(ω) = (1/β) e^(−(ω−α)/β) e^(−e^(−(ω−α)/β)),

where α ∈ R is the location parameter and β > 0 is the scale parameter. We will refer to this distribution as Gumbel(α, β).

The Pareto distribution is useful for modeling events with rare occurrences of extreme values. Its probability density function is defined on Ω = [ωmin, ∞) as

p(ω) = α ωmin^α / ω^(α+1),

where α > 0 is a parameter and ωmin > 0 is the minimum allowed value for ω. It leads to a scale-free property when α ∈ (0, 2]. We will refer to the Pareto distribution as Pareto(α, ωmin).

Example 2: Consider selecting a number x between 0 and 1 uniformly randomly (Figure 1.3). What is the probability that the number is greater than 3/4 or lower than 1/4? We know that Ω = [0, 1] and p(ω) = 1/(b − a) = 1. We define the event of interest as A = [0, 1/4) ∪ (3/4, 1] and calculate its probability as²

P(A) = ∫0^(1/4) p(ω)dω + ∫(3/4)^1 p(ω)dω = 1/2. □

Figure 1.3: Selection of a random number x from the unit interval [0, 1].

² Because the probability of any individual event in a continuous case is 0, there is no difference in integration if we consider open or closed intervals.
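Example 2, and the normalization of the Gaussian density, can both be checked with SciPy; a sketch under the assumption that SciPy is available (the Gaussian parameters are arbitrary).

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Example 2: X uniform on [0, 1]; P(X < 1/4 or X > 3/4) = 1/4 + 1/4 = 1/2.
U = stats.uniform(loc=0, scale=1)
print(U.cdf(0.25) + (1 - U.cdf(0.75)))          # 0.5

# The Gaussian(mu, sigma^2) density integrates to 1 over the real line.
mu, sigma = 1.0, 0.5
total, _ = quad(lambda w: stats.norm.pdf(w, mu, sigma), -np.inf, np.inf)
print(total)                                     # 1.0 (up to numerical error)
```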

1.1.4 Multidimensional distributions

It is often convenient to think of the sample space as a multidimensional space, i.e., Ω = Ω1 × Ω2 × . . . × Ωd, where Ωi can be seen as the sample space along dimension i. In the discrete case, one can think of the sample space Ω as a multidimensional array or as a d-dimensional tensor (note that a matrix is a 2D tensor). Any function p : Ω1 × Ω2 × . . . × Ωd → [0, 1] is called a multidimensional probability mass function if

Σω1∈Ω1 · · · Σωd∈Ωd p(ω1, . . . , ωd) = 1.

One example of a multidimensional pmf is the multinomial distribution, which generalizes the binomial distribution to the case when the number of outcomes in any trial is a positive integer d ≥ 2. The multinomial distribution is used to model a sequence of n independent and identically distributed (i.i.d.) trials with d outcomes, where each coefficient αi gives the probability of outcome i in any trial. At each point (k1, k2, . . . , kd) in the sample space, the multinomial pmf provides the probability that outcome 1 occurred k1 times, outcome 2 occurred k2 times, etc. More formally, given the sample space Ω = {0, 1, . . . , n}^d, the multinomial pmf is defined as

p(k1, k2, . . . , kd) = (n! / (k1!k2! · · · kd!)) α1^k1 α2^k2 · · · αd^kd    if k1 + k2 + · · · + kd = n,
p(k1, k2, . . . , kd) = 0                                                   otherwise,

where the αi's are positive coefficients such that Σdi=1 αi = 1. Clearly, 0 ≤ ki ≤ n for ∀i and Σdi=1 ki = n. The multinomial coefficient

n! / (k1!k2! · · · kd!)

generalizes the binomial coefficient by enumerating all ways in which one can distribute n balls into d boxes such that the first box contains k1 balls, the second box k2 balls, etc. An experiment consisting of n tosses of a fair six-sided die and counting the number of occurrences of each number i ∈ {1, 2, 3, 4, 5, 6} can be described by a multinomial distribution; in this case αi = 1/6.

In the continuous case, we can think of the sample space as the d-dimensional Euclidean space, Ω = R^d, with an event space F = B(R)^d. The d-dimensional probability density function can then be defined as any function p : R^d → [0, ∞) such that

∫−∞^∞ · · · ∫−∞^∞ p(ω1, . . . , ωd) dω1 · · · dωd = 1.

The multivariate Gaussian distribution is a generalization of the Gaussian or normal distribution to the d-dimensional case, with Ω = R^d. It is defined as

p(ω) = (1 / √((2π)^d |Σ|)) exp(−½ (ω − µ)ᵀ Σ⁻¹ (ω − µ)),

with parameters µ ∈ R^d and a positive definite d-by-d matrix Σ (|Σ| is the determinant of Σ). We will refer to this distribution as Gaussian(µ, Σ) or N(µ, Σ).
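Both multidimensional distributions are available in SciPy. The sketch below evaluates the multinomial pmf for the die-rolling experiment above and a bivariate Gaussian density at a point; the mean and covariance values are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import multinomial, multivariate_normal

# Multinomial: n = 12 rolls of a fair die, probability of seeing each face twice.
print(multinomial.pmf([2, 2, 2, 2, 2, 2], n=12, p=[1 / 6] * 6))

# Multivariate Gaussian density at a point, for a given mean and covariance.
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])        # must be positive definite
print(multivariate_normal.pdf([0.5, 0.5], mean=mu, cov=Sigma))
```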

1.1.5 Conditional probabilities

Let (Ω, F, P) be a probability space and B an event that already occurred. We are interested in the probability that event A also occurred, i.e., in the conditional probability P(A|B). The conditional probability is defined as

P(A|B) = P(A ∩ B) / P(B),                                        (1.2)

where P(B) > 0. One way to think about conditional probabilities is to consider that the experiment has already been conducted, but that we do not know the outcome yet. For example, a fair die has been rolled and we are interested in the event that the outcome was 4, i.e., A = {4}. The prior probability of event A is P(A) = 1/6. But imagine that someone had observed the experiment and told us that the number was even (B = {2, 4, 6}). The probability after hearing this news becomes P(A|B) = 1/3.

In some situations we refer to the probability P(A) as the prior probability because it quantifies the likelihood of occurrence of event A in the absence of any other information or evidence. The probability P(A|B) is referred to as the posterior probability because it quantifies the uncertainty about A in the presence of additional information (event B). Proper estimation of posterior probabilities from data is central to statistical inference.

From Equation (1.2), which is sometimes referred to as the product rule, we can now derive two important formulas. The first one is Bayes' rule,

P(A|B) = P(B|A)P(A) / P(B),

where the probability P(B) is also an unconditional (prior) probability, but in this context can be thought of as the probability of observing the evidence B. The product rule from Equation (1.2) has a long history; it was first derived by Abraham de Moivre in 1718. The second formula, referred to as the chain rule, applies to a collection of d events {Ai}di=1 and can be derived by recursively applying the product rule:

P(A1 ∩ A2 ∩ . . . ∩ Ad) = P(A1)P(A2|A1) . . . P(Ad|A1 ∩ A2 ∩ . . . ∩ Ad−1).
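The die example above, and Bayes' rule, can be reproduced with the same finite-sample-space machinery used earlier; a minimal sketch.

```python
# Fair die: A = {4} ("outcome is 4"), B = {2, 4, 6} ("outcome is even").
omega = {1, 2, 3, 4, 5, 6}
p = {w: 1 / 6 for w in omega}

def prob(event):
    return sum(p[w] for w in event)

A, B = {4}, {2, 4, 6}
prior = prob(A)                              # P(A) = 1/6
posterior = prob(A & B) / prob(B)            # P(A|B) = (1/6)/(1/2) = 1/3

# Bayes' rule gives the same answer: P(A|B) = P(B|A) P(A) / P(B).
p_B_given_A = prob(A & B) / prob(A)
print(prior, posterior, p_B_given_A * prob(A) / prob(B))
```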

1.1.6 Independence of events

Let (Ω, F, P) be a probability space. Two events A and B from F are defined as independent if P(A ∩ B) = P(A) · P(B) or, alternatively, if P(A|B) = P(A) or P(B|A) = P(B). More broadly, two or more events are (mutually or jointly) independent if the probability of the intersection of any group of events (of size two, three, etc.) can be expressed as the product of probabilities of the individual events. For d events, there are 2^d − d − 1 independence tests, one for each subset excluding the empty set and singletons.

It is important to distinguish between mutually exclusive events and independent events. Mutually exclusive events are in fact never independent, because the knowledge that the outcome of the experiment belongs to event A excludes the possibility that it is in B (Figure 1.4). It is often difficult, and quite non-intuitive, to simply look at events and conclude whether they are independent or not. Sometimes there may exist deep physical reasons why particular events are independent or assumed to be independent; on other occasions it may just be a numerical coincidence. One should (almost) always calculate P(A ∩ B) and P(A) · P(B) and numerically verify independence.

Figure 1.4: Visualization of dependent and independent events. Events A and B on the left are dependent because the occurrence of one excludes the occurrence of the other. Each event occupies 1/4 of the sample space Ω. Events C and D on the right are independent; each occupies 1/4 of the sample space, while their intersection occupies 1/16 of the sample space.

Let (Ω, F, P) be a probability space and A, B, and C some events from F. Events A and B are defined as conditionally independent given C if

P(A ∩ B|C) = P(A|C) · P(B|C)

or, alternatively, if P(A|B ∩ C) = P(A|C). Interestingly, the two forms of independence are unrelated: independence between events does not imply conditional independence and, likewise, conditional independence between events does not imply their independence; neither one implies the other. We shall see an example later.

1.1.7 Interpretation of probability

There are two opposing philosophical views of probability: an objectivist and a subjectivist one. Objectivists see probability as a concept rooted in reality. Their scientific method is based on the existence of an underlying true probability for an experiment or a hypothesis in question; this underlying probability then needs to be estimated from data. An objectivist is restricted by the known facts about reality (assuming these facts are agreed upon) and derives from them to estimate probabilities. On the other end of the spectrum is a purely subjectivist view, in which probabilities represent an observer's degree of belief or conviction about the outcome of the experiment. A subjectivist is unrestricted by the agreed-upon facts and can express any views about an experiment, because probabilities are inherently related to one's perception.

The good news is that this long-standing philosophical debate has almost no bearing on the use of probability theory in practice. No matter how probabilities are assigned, as long as the assignments adhere to the axioms of probability, the mechanics of probabilistic manipulations are the same and valid. In practice, probabilities are mostly assigned through a combination of subjective and objective steps.
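The advice above, to verify independence numerically, is easy to follow on a finite sample space. The sketch below mirrors Figure 1.4 using a 16-outcome space; the particular sets are illustrative choices, not taken from the figure.

```python
# Numerical independence check on a finite sample space of 16 equally likely outcomes.
omega = set(range(16))
p = 1 / len(omega)

C = {0, 1, 2, 3}                         # P(C) = 1/4
D = {0, 4, 8, 12}                        # P(D) = 1/4, |C ∩ D| = 1 so P(C ∩ D) = 1/16
print(len(C & D) * p == (len(C) * p) * (len(D) * p))    # True: independent

A = {0, 1, 2, 3}
B = {4, 5, 6, 7}                         # disjoint from A (mutually exclusive)
print(len(A & B) * p == (len(A) * p) * (len(B) * p))    # False: not independent
```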

1.2 Random Variables

Until now we operated on relatively simple sample spaces and produced measure functions over sets of outcomes. In many situations, however, we would like to use probabilistic modeling on sets (e.g., a group of people) where elements can be associated with various descriptors. For example, a person may be associated with his/her age, height, citizenship, IQ, or marital status, and we may be interested in events related to such descriptors. In other situations, we may be interested in transformations of sample spaces, such as those corresponding to digitizing an analog signal from a microphone into a set of integers based on some set of voltage thresholds. The mechanism of a random variable facilitates addressing all such situations in a simple, rigorous and unified manner. A random variable is a variable that, from the observer's point of view, takes values non-deterministically, with generally different preferences for different outcomes. Mathematically, however, it is defined as a function that maps one sample space into another, with a few technical caveats we will introduce later.

Let us motivate the need for random variables. Consider a probability space (Ω, F, P), where Ω is a set of people, and let us investigate the probability that a randomly selected person ω ∈ Ω has a cold (we may assume we have a diagnostic method and tools at hand). We start by defining an event A as A = {ω ∈ Ω : Disease(ω) = cold} and simply calculate the probability of this event. This is a perfectly legitimate approach, but it can be much simplified using the random variable mechanism. Our diagnostic method corresponds to a function Disease : Ω → {cold, not cold} that maps the sample space Ω to a new binary sample space ΩDisease = {cold, not cold}, where Disease is a "random variable". Even more interestingly, our approach also maps the probability distribution P to a new probability distribution PDisease defined on ΩDisease; for example, we can calculate PDisease({cold}) from the probability of the aforementioned event A, i.e., PDisease({cold}) = P(A). Generally, we write such probabilities as P(X = x), which is a notational relaxation from the cluttered P({ω : X(ω) = x}); here we may write P(Disease = cold). We will use capital letters X, Y, . . . to denote random variables (such as Disease) and lowercase letters x, y, . . . to indicate elements (such as "cold") of ΩX, ΩY, etc.

Before we proceed to formally define random variables, we shall look at two illustrative examples.

Example 3: Consecutive tosses of a fair coin. Consider a process of three coin tosses and two random variables, X and Y, defined on the sample space. We define X as the number of heads in the first toss and Y as the number of heads over all three tosses. Our goal is to find the probability spaces that are created after the transformations. Clearly, Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT} and

ω       HHH  HHT  HTH  HTT  THH  THT  TTH  TTT
X(ω)     1    1    1    1    0    0    0    0
Y(ω)     3    2    2    1    2    1    1    0

Let us only focus on variable Y. First, we have changed the sample space to ΩY = {0, 1, 2, 3}, but we also need to find FY and PY. We note that, technically, we took F = P(Ω) and FY = P(ΩY). To calculate PY, a simple approach is to find its pmf pY. Let us calculate pY(2) = PY({2}) as

PY({2}) = P(Y = 2) = P({ω : Y(ω) = 2}) = P({HHT, HTH, THH}) = 3/8,

because of the uniform distribution in the original space (Ω, F, P). In a similar way, we can calculate that P(Y = 0) = P(Y = 3) = 1/8 and that P(Y = 1) = 3/8. Thus, we have transformed the probability space (Ω, F, P) into (ΩY, FY, PY). As a final note, we mention that all the randomness is defined in the original probability space (Ω, F, P); the new probability space (ΩY, FY, PY) simply inherits it through a deterministic transformation. □

Example 4: Quantization. Consider (Ω, F, P) where Ω = [0, 1], F = B(Ω), and P is induced by a uniform pdf. Define X : Ω → {0, 1} as

X(ω) = 0 if ω ≤ 0.5,  X(ω) = 1 if ω > 0.5,

and find the transformed probability space. In this example, we have changed the sample space to ΩX = {0, 1}. For the event space FX = P(ΩX) = {Ø, {0}, {1}, {0, 1}} we would like to understand the new probability distribution PX. We have

pX(0) = PX({0}) = P(X = 0) = P({ω : ω ∈ [0, 0.5]}) = 1/2

and

pX(1) = PX({1}) = P(X = 1) = P({ω : ω ∈ (0.5, 1]}) = 1/2.

From here we can easily see that PX({0, 1}) = 1 and PX(Ø) = 0, and so PX is indeed a probability distribution. Again, PX is naturally defined using P, and we have transformed the probability space (Ω, F, P) into (ΩX, FX, PX). □
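Example 3 can also be reproduced by brute force: enumerate the original sample space, apply the transformation Y, and accumulate the induced pmf. A small sketch using only the standard library.

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Example 3: three tosses of a fair coin; Y = number of heads over all tosses.
outcomes = list(product("HT", repeat=3))      # the sample space, 8 outcomes
P = Fraction(1, len(outcomes))                # each outcome has probability 1/8

pY = Counter()
for w in outcomes:
    y = w.count("H")                          # Y(w)
    pY[y] += P                                # induced pmf p_Y(y)

print(dict(pY))    # {3: 1/8, 2: 3/8, 1: 3/8, 0: 1/8}
```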

1.2.1 Formal definition of random variable

We now formally define a random variable. Given a probability space (Ω, F, P), a random variable X is a function X : Ω → ΩX such that for every A ∈ B(ΩX) it holds that {ω : X(ω) ∈ A} ∈ F. It follows that

PX(A) = P({ω : X(ω) ∈ A}).

As we can see from the previous examples, we defined the event space of a random variable to be the Borel field of ΩX. This is convenient because a Borel field of a countable set is its power set; thus, by default, we are working with the largest possible event spaces for both discrete and continuous random variables.

Consider now a discrete random variable X defined on (Ω, F, P). As before, the probability distribution for X can be found as

pX(x) = PX({x}) = P({ω : X(ω) = x}) for ∀x ∈ ΩX,

and the probability of an event A can be found as

PX(A) = P({ω : X(ω) ∈ A}) = Σx∈A pX(x)

for ∀A ⊆ ΩX.

The case of continuous random variables is more complicated, but reduces to an approach that is similar to that of discrete random variables. Here we first define a cumulative distribution function (cdf) as

FX(t) = PX({x : x ≤ t}) = P({ω : X(ω) ≤ t}) = P(X ≤ t)

for each t ∈ R, where P(X ≤ t), as before, presents a minor abuse of notation. If the cumulative distribution function is differentiable, the probability density function of a continuous random variable is defined as

pX(x) = dFX(t)/dt |t=x.

Alternatively, if pX exists, then

FX(t) = ∫−∞^t pX(x) dx.

Our focus will be exclusively on random variables that have probability density functions; however, for a more general view, we should always keep in mind "if one exists" when referring to pdfs. The probability that a random variable will take a value from an interval (a, b] can now be calculated as

PX((a, b]) = PX(a < X ≤ b) = ∫a^b pX(x) dx = FX(b) − FX(a),

which follows from the properties of integration.

Suppose now that the random variable X transforms a probability space (Ω, F, P) into (ΩX, FX, PX). To describe the resulting probability space, we commonly use the probability mass and density functions inducing PX, with ΩX = R and FX = B(R). For example, if PX is induced by a Gaussian distribution with parameters µ and σ², we write X : N(µ, σ²) or X ∼ N(µ, σ²). Both notations indicate that the probability density function for the random variable X is

pX(x) = (1 / √(2πσ²)) e^(−(x−µ)² / (2σ²)).

This distribution in turn provides all the necessary information about the probability space.

1.2.2 Joint and marginal distributions

Let us first look at two discrete random variables X and Y defined on the same probability space (Ω, F, P). We define the joint probability distribution pXY(x, y) of X and Y as

pXY(x, y) = P(X = x, Y = y) = P({ω : X(ω) = x} ∩ {ω : Y(ω) = y}).

A group of d random variables {Xi}di=1 defined on the same probability space (Ω, F, P) is called a random vector or a multivariate (multidimensional) random variable. We have already seen an example of a random vector provided by the random variables (X, Y) in Example 3. We can extend this to a d-dimensional random variable X = (X1, X2, . . . , Xd) and define a multidimensional probability mass function pX(x), where x = (x1, x2, . . . , xd) is a vector of values such that each xi is chosen from some ΩXi. A generalization of a random vector to infinite sets is referred to as a random process or stochastic process, {Xi : i ∈ T}, where T is an index set usually interpreted as a set of time indices. In the case of discrete time indices (e.g., T = N) the random process is called a discrete-time random process; otherwise (e.g., T = R) it is called a continuous-time random process. There are many models in machine learning that deal with temporally-connected random variables (e.g., Markov chains, autoregressive models for time series, hidden Markov models), and the language of random variables, through stochastic processes, nicely enables formalization of these models. Most of these notes, however, will deal with simpler settings only requiring (i.i.d.) multivariate random variables.

A marginal distribution is defined for a subset of X = (X1, X2, . . . , Xd) by summing or integrating over the remaining variables. In the discrete case, a marginal distribution pXi(xi) is defined as

pXi(xi) = Σx1 · · · Σxi−1 Σxi+1 · · · Σxd pX(x1, . . . , xd),

where the variable in the j-th sum takes values from ΩXj. The previous equation directly follows from Equation (1.1) and is also referred to as the sum rule.

In the continuous case, we define a multidimensional cdf as

FX(t) = PX({x : xi ≤ ti, i = 1 . . . d}) = P(X1 ≤ t1, X2 ≤ t2, . . . , Xd ≤ td),

and the probability density function, if it exists, is defined as

pX(x) = ∂^d FX(t1, . . . , td) / (∂t1 · · · ∂td) |t=x.
The marginal density pXi(xi) is defined as

pXi(xi) = ∫x1 · · · ∫xi−1 ∫xi+1 · · · ∫xd pX(x) dx1 · · · dxi−1 dxi+1 · · · dxd.

If the sample space for each discrete random variable is seen as a countable subset of R, then
the probability space for any d-dimensional random variable X (discrete or continuous) can
be defined as Rd , B(R)d , PX .
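For a continuous random variable, the relation PX((a, b]) = FX(b) − FX(a) and the cdf/pdf relationship above can be checked numerically; the sketch below uses a standard normal as an arbitrary example (assuming SciPy).

```python
from scipy import stats

# P(a < X <= b) = F_X(b) - F_X(a) for a continuous random variable.
X = stats.norm(loc=0, scale=1)
a, b = -1.0, 1.0
print(X.cdf(b) - X.cdf(a))              # ~0.6827 for a standard normal

# The pdf is the derivative of the cdf; a crude finite-difference check.
t, h = 0.3, 1e-6
print((X.cdf(t + h) - X.cdf(t - h)) / (2 * h), X.pdf(t))
```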
Example 5: Three tosses of a fair coin (again). Consider two random variables from
Example 3 and calculate their probability spaces, joint and marginal distributions. Recall
X is the number of heads in the first toss and Y is the number of heads over all three tosses.
A joint probability mass function p(x, y) = P (X = x, Y = y) is shown below
            Y
            0      1      2      3
   X   0   1/8    1/4    1/8     0
       1    0     1/8    1/4    1/8

but let us step back for a moment and show how we can calculate it. Let us consider two
sets A = { HHH, HHT, HTH, HTT } and B = { HHT, HTH, THH }, corresponding to the
events that the first toss was heads and that there were exactly two heads over the three
tosses, respectively. Now, let us look at the probability of the intersection of A and B

P(A ∩ B) = P({HHT, HTH}) = 1/4.
We can represent the probability of the logical statement X = 1 ∧ Y = 2 as

pXY(1, 2) = P(X = 1, Y = 2) = P(A ∩ B) = P({HHT, HTH}) = 1/4.
The marginal probability distribution can be found in a straightforward way as

pX(x) = Σy∈ΩY pXY(x, y),

where ΩY = {0, 1, 2, 3}. Thus,

pX(0) = Σy∈ΩY pXY(0, y) = 1/2.

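The joint table of Example 5 and its marginals can be manipulated directly as an array; a small sketch assuming NumPy.

```python
import numpy as np

# Joint pmf p(x, y) from Example 5: rows index X in {0, 1}, columns index Y in {0, 1, 2, 3}.
p_xy = np.array([[1/8, 1/4, 1/8, 0],
                 [0,   1/8, 1/4, 1/8]])

p_x = p_xy.sum(axis=1)        # marginal over Y: [1/2, 1/2]
p_y = p_xy.sum(axis=0)        # marginal over X: [1/8, 3/8, 3/8, 1/8]
print(p_x, p_y, p_xy.sum())   # the joint sums to 1
```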

We note for the end that we needed |ΩX| · |ΩY| − 1 numbers (because the sum must equal 1) to fully describe the joint distribution pXY(x, y). Asymptotically, for discrete spaces, this corresponds to an exponential growth of the number of entries in the table with the number of random variables (d); for example, if |ΩXi| = 2 for ∀Xi, there are 2^d − 1 free elements in the joint probability distribution. Estimating such distributions from data is intractable and is one form of the curse of dimensionality. □

Notation remark: From this point onward, we will no longer be discussing the underlying outcome space for the random variable X, and will only be using the new outcomes and distributions. Therefore, for simplicity of notation, we will no longer use subscripts, e.g., we write p(x, y) for the joint density rather than pXY(x, y). These distributions pXY are functions, and it would be odd to use f(x) and f(y) to mean two different functions, with different inputs, rather than some shared distribution p. However, in probability theory the common approach is to use p instead of pXY, which enables simplified notation with little lost in terms of clarity: when we write p(x) and p(y), it is clear from context that we mean the respective distributions for each random variable. For cases where there could be ambiguity, we will make this explicit using p(X = x), as is typically done for probabilities P(X = x). For correctness, and in other settings, it is better to be more explicit about the random variables for a distribution.

1.2.3 Conditional distributions

The conditional probability distribution for two random variables X and Y is defined as

p(y|x) = p(x, y) / p(x),                                        (1.3)

where p(x) > 0. For discrete random variables X, Y, we know that p(x, y) and p(x) are probabilities, which gives the interpretation that p(y|x) = P(Y = y|X = x) as a direct consequence of the product rule from Equation (1.2). For continuous spaces, on the other hand, we shall consider this formula as a definition for mathematical convenience.³ Equation (1.3) now allows us to calculate the posterior probability of an event A as

P(Y ∈ A|X = x) = Σy∈A p(y|x)      if Y is discrete, and
P(Y ∈ A|X = x) = ∫A p(y|x)dy      if Y is continuous.

The extension to more than two variables is straightforward. We can write

p(x1, . . . , xd) = p(xd|x1, . . . , xd−1) p(x1, . . . , xd−1).

By a recursive application of the product rule, we obtain

p(x1, . . . , xd) = p(x1) ∏di=2 p(xi|x1, . . . , xi−1),            (1.4)

which is referred to as the chain rule.

³ It is straightforward to verify that p(y|x) sums (integrates) to 1, and thus satisfies the conditions of a probability mass (density) function.
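Continuing with the joint table from Example 5, the conditional distribution p(y|x) is obtained by dividing a row of the table by the corresponding marginal; a small NumPy sketch.

```python
import numpy as np

p_xy = np.array([[1/8, 1/4, 1/8, 0],
                 [0,   1/8, 1/4, 1/8]])   # joint pmf from Example 5
p_x = p_xy.sum(axis=1)

# p(y | X = 0) = p(0, y) / p_X(0); each conditional distribution sums to 1.
p_y_given_x0 = p_xy[0] / p_x[0]
print(p_y_given_x0)                        # [0.25, 0.5, 0.25, 0.]
print(p_y_given_x0.sum())                  # 1.0
```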

1.2.4 Independence of random variables

Two random variables are independent if their joint probability distribution can be expressed as

p(x, y) = p(x) · p(y).

More broadly, d random variables are (mutually, jointly) independent if a joint probability distribution of any subset of variables can be expressed as a product of the individual (marginal) probability distributions of its components.

Another, different, form of independence can be found even more frequently in probabilistic calculations. It represents independence between variables in the presence of some other random variable (evidence),

p(x, y|z) = p(x|z) · p(y|z),

and is referred to as conditional independence. Interestingly, the two forms of independence are unrelated: neither one implies the other. We show this in two simple examples from Figure 1.5.

1.2.5 Expectations and moments

Expectations of functions are defined as sums (or integrals) of function values weighted according to the probability distribution function. Given a probability space (ΩX, B(ΩX), PX), we consider a function f : ΩX → C and define its expectation function as

E[f(X)] = Σx∈ΩX f(x)p(x)    if X is discrete, and
E[f(X)] = ∫ΩX f(x)p(x)dx    if X is continuous.

Note that we use a capital X for the random variable (with f(X) the random variable transformed by f) and a lower-case x when it is an instance (e.g., p(x) is the probability of a specific outcome). For f(x) = x, we have the standard expectation E[X] = Σ xp(x), or the mean value of X. Using f(x) = x^k results in the k-th moment, and f(x) = (x − E[X])² provides the variance of a random variable X, denoted by V[X]. Similarly, f(x) = log 1/p(x) gives the well-known entropy function H(X), or differential entropy for continuous random variables. It can happen for continuous distributions that E[f(X)] = ±∞; in such cases we say that the expectation does not exist or is not well-defined.⁴ Interestingly, the probability of some event A ⊆ ΩX can also be expressed in the form of an expectation,

PX(A) = E[1A(X)],

where 1A(x) = 1 if x ∈ A and 1A(x) = 0 if x ∉ A is an indicator function.

⁴ There is sometimes disagreement on terminology: some definitions allow the expected value to be infinite, in which case an expectation is not well-defined only if both the left and right improper integrals are infinite. For our purposes, this is splitting hairs, and the convention adopted here still allows the strong law of large numbers.

[Figure 1.5: Independence vs. conditional independence using probability distributions involving three binary random variables. Probability distributions are presented using the factorization p(x, y, z) = p(x)p(y|x)p(z|x, y), where all constants a, b, c, d, e ∈ [0, 1]. (A) Variables X and Y are independent, but not conditionally independent given Z; for example, with Z = X ⊕ Y, where ⊕ is an "exclusive or" operator, P(Y = y|X = x) = P(Y = y) while P(Y = y|X = x, Z = z) ≠ P(Y = y|Z = z). (B) Variables X and Z are conditionally independent given Y, but are not independent: P(Z = z|X = x, Y = y) = P(Z = z|Y = y) while P(Z = z|X = x) ≠ P(Z = z). The panels list the conditional probability tables for each factorization.]

Several expectation functions are summarized in Table 1.1. Function f(x) inside the expectation can also be complex-valued. For example, ϕX(t) = E[e^{itX}], where i is the imaginary unit, defines the characteristic function of X. The characteristic function is closely related to the inverse Fourier transform of p(x) and is useful in many forms of statistical inference. Note that the moment generating function may not exist for some distributions and all values of t; the characteristic function, however, always exists, even when the density function does not. With this, it is also possible to express the cdf as FX(t) = E[1_{(−∞,t]}(X)]. Function q(x) in the definition of the Kullback-Leibler divergence is non-negative and must sum (integrate) to 1, i.e., it is a probability distribution itself. The Fisher information is defined for a family of probability distributions specified by a parameter θ.

    f(x)                           Symbol               Name
    x                              E[X]                 Mean
    (x − E[X])²                    V[X]                 Variance
    x^k                            E[X^k]               k-th moment, k ∈ N
    (x − E[X])^k                   E[(X − E[X])^k]      k-th central moment, k ∈ N
    e^{tx}                         MX(t)                Moment generating function
    e^{itx}                        ϕX(t)                Characteristic function
    log 1/p(x)                     H(X)                 (Differential) entropy
    log p(x)/q(x)                  D(p||q)              Kullback-Leibler divergence
    (∂/∂θ log p(x|θ))²             I(θ)                 Fisher information

Table 1.1: Some important expectation functions E[f(X)] for a random variable X described by its distribution p(x).

Given two random variables X and Y and a specific value x assigned to X, we define the conditional expectation as

    E[f(Y)|x] = Σ_{y∈ΩY} f(y) p(y|x)           (Y discrete)
    E[f(Y)|x] = ∫_{ΩY} f(y) p(y|x) dy          (Y continuous)

where f : ΩY → C is some function. Again, using f(y) = y results in E[Y|x] = Σ y p(y|x) or E[Y|x] = ∫ y p(y|x) dy. We shall see later that, under some conditions, E[Y|x] is referred to as the regression function. These types of integrals are often seen and evaluated in Bayesian statistics.

For two random variables X and Y we also define

    E[f(X, Y)] = Σ_{x∈ΩX} Σ_{y∈ΩY} f(x, y) p(x, y)          (X, Y discrete)
    E[f(X, Y)] = ∫_{ΩX} ∫_{ΩY} f(x, y) p(x, y) dx dy        (X, Y continuous)

Expectations can also be defined over a single variable,

    E[f(X, y)] = Σ_{x∈ΩX} f(x, y) p(x)          (X discrete)
    E[f(X, y)] = ∫_{ΩX} f(x, y) p(x) dx         (X continuous)

where E[f(X, y)] is now a function of y. We define the covariance function as

    cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY] − E[X] E[Y],

with cov(X, X) = V[X] being the variance of the random variable X. Similarly, we define the correlation function as

    corr(X, Y) = cov(X, Y) / sqrt(V[X] · V[Y]),

which is simply the covariance function normalized by the product of standard deviations. Both covariance and correlation functions have wide applicability in statistics, machine learning, signal processing and many other disciplines. Several important expectations for two random variables are listed in Table 1.2. Mutual information is sometimes referred to as average mutual information.

    f(x, y)                                        Symbol        Name
    (x − E[X])(y − E[Y])                           cov(X, Y)     Covariance
    (x − E[X])(y − E[Y]) / sqrt(V[X] V[Y])         corr(X, Y)    Correlation
    log p(x, y) / (p(x) pY(y))                     I(X, Y)       Mutual information
    log 1/p(x, y)                                  H(X, Y)       Joint entropy
    log 1/p(x|y)                                   H(X|Y)        Conditional entropy

Table 1.2: Some important expectation functions E[f(X, Y)] for two random variables, X and Y, described by their joint distribution p(x, y).

Example 6: Three tosses of a fair coin (yet again). Consider the two random variables X and Y from Examples 3 and 5 and calculate the expectation and variance for both X and Y. Then calculate E[Y|X = 0].

We start by calculating E[X] = 0 · p(0) + 1 · p(1) = 1/2. Similarly,

    E[Y] = Σ_{y=0}^{3} y · pY(y) = pY(1) + 2 pY(2) + 3 pY(3) = 3/2.

The conditional expectation can be found as

    E[Y|X = 0] = Σ_{y=0}^{3} y · p(y|0) = p(1|0) + 2 p(2|0) + 3 p(3|0) = 1,

where p(y|x) = p(x, y)/p(x).

In many situations we need to analyze more than two random variables. A simple two-dimensional summary of all pairwise covariance values involving d random variables X1, X2, ..., Xd is called the covariance matrix. More formally, the covariance matrix is defined as Σ = [Σij], i, j = 1, ..., d, where

    Σij = cov(Xi, Xj) = E[(Xi − E[Xi])(Xj − E[Xj])],

with the full matrix written as Σ = E[(X − E[X])(X − E[X])ᵀ]. Here, the diagonal elements of the d × d covariance matrix are the individual variance values for each variable Xi and the off-diagonal elements are the covariance values between pairs of variables. The covariance matrix is symmetric. It is sometimes called a variance-covariance matrix.

Properties of expectations

Here we review, without proofs, some useful properties of expectations. For any two random variables X and Y, and constant c ∈ R, it holds that:

1. E[cX] = c E[X]
2. E[X + Y] = E[X] + E[Y]
3. V[c] = 0
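A small enumeration can confirm the numbers above. The sketch below assumes, consistent with the values just computed, that Examples 3 and 5 define X as the indicator of heads on the first of three fair coin tosses and Y as the total number of heads; this reading is a reconstruction for illustration, not quoted from the earlier examples.

    from itertools import product

    # Three tosses of a fair coin: X = 1 if the first toss is heads, Y = total heads
    outcomes = list(product([0, 1], repeat=3))   # 8 equally likely outcomes
    p = 1 / len(outcomes)

    E_X = sum(o[0] * p for o in outcomes)                      # 0.5
    E_Y = sum(sum(o) * p for o in outcomes)                    # 1.5

    # Conditional expectation E[Y | X = 0]
    cond = [o for o in outcomes if o[0] == 0]
    E_Y_given_X0 = sum(sum(o) for o in cond) / len(cond)       # 1.0

    print(E_X, E_Y, E_Y_given_X0)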

4. V[cX] = c² V[X]
5. V[X] ≥ 0

In addition, if X and Y are independent random variables, it holds that:

1. E[XY] = E[X] · E[Y]
2. V[X + Y] = V[X] + V[Y]
3. cov(X, Y) = 0

1.2.6 Mixtures of distributions

In previous sections we saw that random variables are often described using particular families of probability distributions. This approach can be generalized by considering mixtures of distributions, i.e., linear combinations of other probability distributions. As before, we shall only consider random variables that have probability mass or density functions.

Given a set of m probability distributions, {pi(x)} for i = 1, ..., m, a finite mixture distribution function, or mixture model, p(x) is defined as

    p(x) = Σ_{i=1}^{m} wi pi(x),                                               (1.5)

where w = (w1, w2, ..., wm) is a set of non-negative real numbers such that Σ_{i=1}^{m} wi = 1. A linear combination with such coefficients is called a convex combination. We refer to w as mixing coefficients or, sometimes, as mixing probabilities. It is straightforward to verify that a function defined in this manner is indeed a probability distribution.

Here we will briefly look into the basic expectation functions of a mixture distribution. Suppose {Xi} for i = 1, ..., m is a set of m random variables described by their respective probability distribution functions {pXi(x)}, and suppose that a random variable X is described by a mixture distribution with coefficients w and probability distributions {pXi(x)}. Then, assuming continuous random variables defined on R, the expectation function is given as

    E[f(X)] = ∫ f(x) p(x) dx
            = ∫ f(x) Σ_{i=1}^{m} wi pXi(x) dx
            = Σ_{i=1}^{m} wi ∫ f(x) pXi(x) dx
            = Σ_{i=1}^{m} wi E[f(Xi)].

We can now apply this formula to obtain the mean of the random variable X, when f(x) = x, as

    E[X] = Σ_{i=1}^{m} wi E[Xi],

and the variance, when f(x) = (x − E[X])², as

    V[X] = Σ_{i=1}^{m} wi V[Xi] + Σ_{i=1}^{m} wi (E[Xi] − E[X])².

A mixture distribution can also be defined for countably and uncountably infinite numbers of components. Such distributions, however, are rare in practice.

Example 7: Signal communications. Consider transmission of a single binary digital signal (bit) over a noisy communication channel, shown in Figure 1.6. The magnitude of the signal X emitted by the source is equally likely to be 0 or 1 Volt. The signal is sent over a transmission line (e.g., radio communication, optical fiber, magnetic tape) in which a zero-mean normally distributed noise component Y is added to X. We will consider a slightly more general situation where X : Bernoulli(α) and Y : Gaussian(µ, σ²). Derive the probability distribution of the signal Z = X + Y that enters the receiver.

[Figure 1.6: A digital signal communication system with additive noise. A source emits X : Bernoulli(α), the channel adds noise Y : Gaussian(µ, σ²), and the receiver observes Z = X + Y.]

To find pZ(z) we will use the characteristic functions of the random variables X, Y and Z, written as ϕX(t) = E[e^{itX}], ϕY(t) = E[e^{itY}] and ϕZ(t) = E[e^{itZ}]. Without derivation we write

    ϕX(t) = 1 − α + α e^{it}
    ϕY(t) = e^{itµ − σ²t²/2}

and subsequently

    ϕZ(t) = ϕ_{X+Y}(t) = ϕX(t) · ϕY(t)
          = (1 − α + α e^{it}) · e^{itµ − σ²t²/2}
          = α e^{it(µ+1) − σ²t²/2} + (1 − α) e^{itµ − σ²t²/2}.

By performing integration on ϕZ(t) we can easily verify that

    pZ(z) = α · (1/√(2πσ²)) e^{−(z−µ−1)²/(2σ²)} + (1 − α) · (1/√(2πσ²)) e^{−(z−µ)²/(2σ²)},

which is a mixture of two normal distributions, N(µ + 1, σ²) and N(µ, σ²), with coefficients w1 = α and w2 = 1 − α. Observe that a convex combination of random variables, Z = w1 X + w2 Y, does not imply that pZ is the corresponding mixture of pX and pY.
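A short simulation can confirm that the distribution of Z = X + Y matches the two-component Gaussian mixture derived above; the parameter values α, µ, σ below are illustrative choices, not taken from the text.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, mu, sigma = 0.5, 0.0, 0.25
    n = 200_000

    # Simulate the channel: Z = X + Y with X ~ Bernoulli(alpha), Y ~ N(mu, sigma^2)
    x = rng.binomial(1, alpha, size=n)
    y = rng.normal(mu, sigma, size=n)
    z = x + y

    # Two-component Gaussian mixture density derived in Example 7
    def p_z(v):
        g = lambda m: np.exp(-(v - m) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
        return alpha * g(mu + 1) + (1 - alpha) * g(mu)

    # Compare a histogram of simulated Z against the analytical mixture density
    hist, edges = np.histogram(z, bins=60, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    print(np.max(np.abs(hist - p_z(centers))))   # small, up to Monte Carlo error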

1.2.7 [Advanced] Graphical representation of probability distributions

We saw earlier that a joint probability distribution can be factorized using the chain rule from Equation (1.4). Such factorizations can be visualized using a directed graph representation, where nodes represent random variables and edges depict dependence. Graphical representations of probability distributions using directed acyclic graphs, together with conditional probability distributions, are called Bayesian networks or belief networks. They facilitate interpretation as well as effective statistical inference.

For example, the factorization p(x, y, z) = p(x)p(y|x)p(z|x, y) is shown in Figure 1.7A, while Figure 1.7B shows the same joint distribution factorized so that variable Z is independent of X given Y. It is important to mention that there are multiple (how many?) ways of factorizing a distribution. For example, by reversing the order of variables, p(x, y, z) can also be factorized as p(x, y, z) = p(z)p(y|z)p(x|y, z), which has a different graphical representation and its own conditional probability distributions, yet the same joint probability distribution as the earlier factorization.

Belief networks have a simple, formal definition. Given a set of d random variables X = (X1, ..., Xd), belief networks factorize the joint probability distribution of X as

    p(x) = ∏_{i=1}^{d} p(xi | xParents(Xi)),

where Parents(Xi) denotes the immediate ancestors of node Xi in the graph. For example, in Figure 1.7B, node Y is a parent of Z, but node X is not a parent of Z. Selecting a proper factorization and estimating the conditional probability distributions from data will be discussed in detail later.

Visualizing relationships between variables becomes particularly convenient when we want to understand and analyze conditional independence properties of variables. Though often the relationships are intuitive, dependence properties can sometimes get more complicated due to multiple relationships between nodes. For example, in Figure 1.8A two nodes do not have an edge, yet they are conditionally dependent through another node; in Figure 1.8B, on the other hand, the absence of an edge does imply conditional independence. To carefully determine conditional independence and dependence properties, one usually uses the d-separation rules for belief networks. We will not further examine d-separation rules at this time; they can easily be found in any standard textbook on graphical models.

[Figure 1.7: Bayesian network: graphical representation of two joint probability distributions for three discrete (binary) random variables (X, Y, Z) using directed acyclic graphs. The probability mass function p(x, y, z) is defined over {0, 1}³. Each node is associated with a conditional probability distribution; in discrete cases, these conditional distributions are referred to as conditional probability tables. (A) Full factorization, P(X = x, Y = y, Z = z) = P(X = x)P(Y = y|X = x)P(Z = z|X = x, Y = y). (B) Factorization that shows and ensures conditional independence between Z and X given Y, P(X = x, Y = y, Z = z) = P(X = x)P(Y = y|X = x)P(Z = z|Y = y).]
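The factorization in Figure 1.7B can be checked numerically. The sketch below uses hypothetical conditional probability tables (the numbers are not those of the figure) and verifies that Z is conditionally independent of X given Y under p(x)p(y|x)p(z|y), while X and Z remain marginally dependent.

    import numpy as np

    # Hypothetical conditional probability tables for p(x, y, z) = p(x) p(y|x) p(z|y)
    p_x = np.array([0.4, 0.6])                      # p(X = x)
    p_y_x = np.array([[0.7, 0.3],                   # p(Y = y | X = x), rows indexed by x
                      [0.2, 0.8]])
    p_z_y = np.array([[0.9, 0.1],                   # p(Z = z | Y = y), rows indexed by y
                      [0.3, 0.7]])

    # Joint distribution from the factorization, indexed [x, y, z]
    joint = p_x[:, None, None] * p_y_x[:, :, None] * p_z_y[None, :, :]
    assert np.isclose(joint.sum(), 1.0)

    # Conditional independence: p(z | x, y) does not depend on x
    p_z_given_xy = joint / joint.sum(axis=2, keepdims=True)
    print(np.allclose(p_z_given_xy[0], p_z_given_xy[1]))    # True

    # X and Z are, however, marginally dependent
    p_xz = joint.sum(axis=1)
    indep = p_xz.sum(axis=1, keepdims=True) * p_xz.sum(axis=0, keepdims=True)
    print(np.allclose(p_xz, indep))                          # False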

[Figure 1.8: Two examples of Bayesian networks. (A) X and Y are independent, but not conditionally independent given Z: P(X = x|Y = y) = P(X = x), yet P(X = x|Y = y, Z = z) ≠ P(X = x|Z = z). Here the lack of an edge between nodes does not indicate conditional independence; given information about Z, X and Y are dependent through Z. (B) X and Y are dependent, but conditionally independent given Z: P(X = x|Y = y, Z = z) = P(X = x|Z = z) and P(Y = y|X = x, Z = z) = P(Y = y|Z = z). Here the lack of an edge does indicate (conditional) independence.]

We will see this directed representation again later, under naive Bayes models.

Undirected graphs can also be used to factorize probability distributions. The main idea here is to decompose the graph into maximal cliques C (the smallest set of cliques that covers the graph) and express the distribution in the following form

    p(x) = (1/Z) ∏_{C∈C} ψC(xC),

where each ψC(xC) ≥ 0 is called the clique potential function and

    Z = ∫ ∏_{C∈C} ψC(xC) dx

is called the partition function, used strictly for normalization purposes. In contrast to the conditional probability distributions in directed acyclic graphs, the clique potentials usually do not have conditional probability interpretations and, thus, normalization is necessary. One example of a maximum clique decomposition is shown in Figure 1.9. The potential functions are typically taken to be strictly positive, ψC(xC) > 0, and expressed as

    ψC(xC) = exp(−E(xC)),

where E(xC) is a user-specified energy function on the clique of random variables XC. This leads to a probability distribution of the form

    p(x) = (1/Z) exp( Σ_{C∈C} log ψC(xC) ).

Here, this probability distribution is called the Boltzmann distribution or the Gibbs distribution. The energy function E(x) must be lower for values of x that are more likely. It may also involve parameters that are then estimated from the available training data.

Consider now any probability distribution over all possible configurations of the random vector X with its underlying graphical representation. If the following property

    p(xi | x_{−Xi}) = p(xi | x_{N(Xi)})                                        (1.6)

holds, the probability distribution is referred to as a Markov network or a Markov random field. In the equation above, X_{−Xi} = (X1, ..., X_{i−1}, X_{i+1}, ..., Xd) and N(X) is the set of random variables neighboring X in the graph, i.e., there exists an edge between X and every node in N(X). The set of random variables in N(X) is also called the Markov blanket of X.

It can be shown that every Gibbs distribution satisfies the property from Equation (1.6) and, conversely, that every probability distribution for which Equation (1.6) holds can be represented as a Gibbs distribution with some choice of parameters. This equivalence of Gibbs distributions and Markov networks was established by the Hammersley-Clifford theorem.

As formulated, in a prediction problem an undirected graph must be created to also involve the target variables, which were here considered to be a subset of X.

[Figure 1.9: Markov network: graphical representation of a probability distribution using maximum clique decomposition. Shown is a set of eight random variables X1, ..., X8 with their interdependency structure and maximum clique decomposition (a clique is a fully connected subgraph of a given graph), here into four maximal cliques C = {C1, C2, C3, C4}. A decomposition into maximum cliques covers all vertices and edges in a graph with the minimum number of cliques.]

Chapter 2

Basic Principles of Parameter Estimation

In probabilistic modeling, we are typically presented with a set of observations and the objective is to find a model, or function, M̂ that shows good agreement with the data and respects certain additional requirements. We shall roughly categorize these requirements into three groups: (i) the ability to generalize well; (ii) the ability to incorporate prior knowledge and assumptions into modeling; and (iii) scalability. First, the model should be able to stand the test of time; that is, its performance on previously unseen data should not deteriorate once this new data is presented. Models with such performance are said to generalize well. Second, M̂ must be able to incorporate information about the model space M from which it is selected, and the process of selecting a model should be able to accept training "advice" from an analyst, e.g., our experience with modeling real-life phenomena. Finally, when large amounts of data are available, learning algorithms must be able to provide solutions in reasonable time given resources such as memory or CPU power. In summary, the choice of a model ultimately depends on the observations at hand, our experience with modeling real-life phenomena, and the ability of algorithms to find good solutions given limited resources.

An easy way to think about finding the "best" model is through learning the parameters of a distribution. Suppose we are given a set of observations D = {xi} for i = 1, ..., n, where xi ∈ R, and have knowledge that M is the family of all univariate Gaussian distributions, e.g., M = Gaussian(µ, σ²) with µ ∈ R and σ ∈ R+. In this case, the problem of finding the best model (by which we mean function) can be seen as finding the best parameters µ* and σ*; that is, the problem can be seen as parameter estimation. We call this process estimation because the typical assumption is that the data was generated by an unknown model from M whose parameters we are trying to recover from data. We will formalize parameter estimation using probabilistic techniques and will subsequently find solutions through optimization, occasionally with constraints in the parameter space.

The main assumption throughout this part will be that the set of observations D was generated (or collected) independently and according to the same distribution pX(x). The statistical framework for model inference is shown in Figure 2.1.

2.1 Maximum a posteriori and maximum likelihood

The idea behind maximum a posteriori (MAP) estimation is to find the most probable model for the observed data. Given the data set D, we formalize the MAP solution as

    MMAP = arg max_{M∈M} {p(M|D)},

where p(M|D) is called the posterior distribution of the model given the data. In discrete model spaces, p(M|D) is a probability mass function and the MAP estimate is exactly the most probable model.

[Figure 2.1: Statistical framework for model inference. A data generator produces the data set D = {xi} for i = 1, ..., n; knowledge and assumptions enter in the form of the model space M, the prior distribution p(M), or specific starting solutions in the optimization step. Parameter estimation is then carried out through optimization, e.g., MMAP = arg max_{M∈M} {p(D|M) p(M)} or MB = E[M | D = D]. Model inference: observations + knowledge and assumptions + optimization. The estimates of the parameters are made using a set of observations D as well as experience.]

Its counterpart in continuous spaces is the model with the largest value of the posterior density function. To calculate the posterior distribution we start by applying the Bayes rule as

    p(M|D) = p(D|M) · p(M) / p(D),                                             (2.1)

where p(D|M) is called the likelihood function, p(M) is the prior distribution of the model, and p(D) is the marginal distribution of the data. The posterior distribution is sometimes referred to as inverse probability. Notice that we use D for the observed data set, but that we usually think of it as a realization of a multidimensional random variable D drawn according to some distribution p(D). Using the formula of total probability, we can express p(D) as

    p(D) = Σ_{M∈M} p(D|M) p(M)           (M discrete)
    p(D) = ∫_M p(D|M) p(M) dM            (M continuous)

The field of research and practice involving ways to determine this distribution and optimal models is referred to as inferential statistics.

Finding MMAP can be greatly simplified because p(D) in the denominator does not affect the solution. We shall re-write Equation (2.1) as

    p(M|D) = p(D|M) · p(M) / p(D) ∝ p(D|M) · p(M),

where ∝ is the proportionality symbol; thus, the posterior distribution can be fully described using the likelihood and the prior. Therefore, we can find the MAP solution by solving the following optimization problem

    MMAP = arg max_{M∈M} {p(D|M) p(M)}.

In some situations we may not have a reason to prefer one model over another and can think of p(M) as a constant over the model space M. Then, maximum a posteriori estimation reduces to the maximization of the likelihood function,

    MML = arg max_{M∈M} {p(D|M)}.

We will refer to this solution as the maximum likelihood solution. Formally speaking, the assumption that p(M) is constant is problematic because a uniform distribution cannot always be defined (say, over R), though there are some solutions to this issue using improper priors. Thus, it may be useful to think of the maximum likelihood approach as a separate technique, rather than a special case of MAP estimation; even if only for pedantic reasons, we should keep the difference in mind, but also keep this connection in mind. Note, finally, that we use the words model, which is a function, and its parameters, which are the coefficients of that function, somewhat interchangeably.

Example 8: Suppose the data set D = {2, 5, 9, 5, 4, 8} is an i.i.d. sample from a Poisson distribution with an unknown parameter λt. Find the maximum likelihood estimate of λt.

The probability mass function of a Poisson distribution with parameter λ ∈ R+ is expressed as p(x|λ) = λ^x e^{−λ} / x!. We will estimate this parameter as

    λML = arg max_{λ∈(0,∞)} {p(D|λ)}.                                          (2.2)

We can write the likelihood function as

    p(D|λ) = p({xi} for i = 1, ..., n | λ) = ∏_{i=1}^{n} p(xi|λ) = λ^{Σ_{i=1}^{n} xi} e^{−nλ} / ∏_{i=1}^{n} xi!.

To find the λ that maximizes the likelihood, we will first take a logarithm (a monotonic function) to simplify the calculation, then find its first derivative with respect to λ, and finally equate it with zero to find the maximum. Specifically, we express the log-likelihood ll(D, λ) = ln p(D|λ) as

    ll(D, λ) = ln λ · Σ_{i=1}^{n} xi − nλ − Σ_{i=1}^{n} ln(xi!)

and proceed with the first derivative as

    ∂ll(D, λ)/∂λ = (1/λ) Σ_{i=1}^{n} xi − n = 0,

which gives

    λML = (1/n) Σ_{i=1}^{n} xi,

which is simply the sample mean. The second derivative of the log-likelihood function is always negative because λ must be positive; thus, the previous expression indeed maximizes the likelihood. By substituting n = 6 and the values from D, we can compute the solution as λML = 5.5.

From a different point of view, observe that MAP and ML approaches report solutions corresponding to the mode of the posterior distribution and of the likelihood function, respectively. MAP and ML estimates are called point estimates, as opposed to estimates that report confidence intervals for a particular group of parameters. We shall later contrast this estimation technique with the view of Bayesian statistics, in which the goal is to minimize the posterior risk; such estimation typically results in calculating conditional expectations, which can be complex integration problems.

Note that to properly maximize this objective, we also need to ensure the constraint λ ∈ (0, ∞) is enforced. Because the solution above is in the constraint set, we know we have the correct solution to Equation (2.2).
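The closed-form result is easy to verify numerically; a minimal sketch computes the sample mean for the data of Example 8 and confirms on a grid that it maximizes the log-likelihood (up to the additive constant that does not depend on λ).

    import numpy as np

    # Data from Example 8: an i.i.d. sample assumed to come from Poisson(lambda_t)
    D = np.array([2, 5, 9, 5, 4, 8])

    # Maximum likelihood estimate: the sample mean
    lam_ml = D.mean()
    print(lam_ml)                         # 5.5

    # Numerical check: the log-likelihood (dropping -sum(log(x_i!)), constant in lambda)
    # is maximized at the sample mean
    lam_grid = np.linspace(0.1, 12, 1200)
    loglik = np.log(lam_grid) * D.sum() - len(D) * lam_grid
    print(lam_grid[np.argmax(loglik)])    # approximately 5.5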

In other situations, however, we will have to explicitly enforce constraints in the optimization.

Example 9: Let D = {2, 5, 9, 5, 4, 8} again be an i.i.d. sample from Poisson(λt), but now we are also given additional information. Suppose the prior knowledge about λt can be expressed using a gamma distribution Γ(x|k, θ) with parameters k = 3 and θ = 1. Find the maximum a posteriori estimate of λt.

First, we write the probability density function of the gamma family as

    Γ(x|k, θ) = x^{k−1} e^{−x/θ} / (θ^k Γ(k)),

where x > 0, k > 0, and θ > 0. Γ(k) is the gamma function, which generalizes the factorial; when k is an integer, we have Γ(k) = (k − 1)!. The MAP estimate of the parameter can be found as

    λMAP = arg max_{λ∈(0,∞)} {p(D|λ) p(λ)}.

As before, we can write the likelihood function as

    p(D|λ) = λ^{Σ_{i=1}^{n} xi} e^{−nλ} / ∏_{i=1}^{n} xi!

and the prior distribution as

    p(λ) = λ^{k−1} e^{−λ/θ} / (θ^k Γ(k)).

Now, we can maximize the logarithm of the posterior distribution p(λ|D) using

    ln p(λ|D) ∝ ln p(D|λ) + ln p(λ)
              = ln λ · (k − 1 + Σ_{i=1}^{n} xi) − λ (n + 1/θ) − Σ_{i=1}^{n} ln xi! − k ln θ − ln Γ(k)

to obtain

    λMAP = (k − 1 + Σ_{i=1}^{n} xi) / (n + 1/θ) = 5

after incorporating all the data.

A quick look at λMAP and λML suggests that as n grows, both numerators and denominators in the expressions above become increasingly more similar. This result shows that the MAP estimate approaches the ML solution for large data sets; in other words, large data diminishes the importance of prior knowledge. In fact, it is a well-known result that, in the limit of infinite samples, both the MAP and ML estimates converge to the same model, M, as long as the prior does not have zero probability on M. This is an important conclusion because it simplifies the mathematical apparatus necessary for practical inference, as we will discuss later.
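The following sketch reproduces the MAP estimate for Example 9 and illustrates the convergence of MAP toward ML as the sample grows (simulated here, purely for illustration, by replicating the data set).

    import numpy as np

    D = np.array([2, 5, 9, 5, 4, 8])
    k, theta = 3.0, 1.0                      # gamma prior parameters from Example 9

    lam_map = (k - 1 + D.sum()) / (len(D) + 1 / theta)
    lam_ml = D.mean()
    print(lam_map, lam_ml)                   # 5.0 vs 5.5

    # As n grows, MAP and ML converge: replicate the data set and re-estimate
    for reps in (1, 10, 100, 1000):
        Dn = np.tile(D, reps)
        lam_map_n = (k - 1 + Dn.sum()) / (len(Dn) + 1 / theta)
        print(reps, abs(lam_map_n - Dn.mean()))   # difference shrinks toward 0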

.e. sn does not grow faster than n ). To get some intuition for this result. then . If limn→∞ sn /n = 0 2 (i. which is a sample from the random variable S n = i=1 Xi . we will show that the MAP and ML estimates converge Pn to the same solution for the above example with a P Poisson distribution. Let sn = n 2 i=1 x i .

.

.

k − 1 + sn sn .

|λMAP − λML | = .

.

− .

.

n + 1/θ n .

.

.

k−1 s n .

= .

.

− .

n + 1/θ n(n + 1/θ) .

σ) = − + . Consistency theorems for ML and MAP estimation state that convergence to the true parameters occurs “almost surely” or “with probability 1” to indicate that these unbounded sequences constitute a set of measure- zero. n i=1  38 .13]). |k − 1| sn ≤ + −−−→ 0 n + /θ n(n + 1/θ) 1 n→∞ Note that if limn→∞ sn /n2 6= 0.  Example 10: Let D = {xi }ni=1 be an i. such a sequence of values has an essentially zero probability of occurring. 2π σ 2σ 2 We now compute the partial derivatives of the log-likelihood with respect to all parameters as Pn ∂ (xi − µ) ln p(D|µ. under certain reasonable conditions (for more. sample from a univariate Gaussian distribution. then both estimators go to ∞. σ) = i=1 2 ∂µ σ and Pn ∂ n i=1 (xi − µ)2 ln p(D|µ. Find the maximum likelihood estimates of the parameters. σ) i=1 Pn 1 1 i=1 (xi − µ)2 = n ln √ + n ln − . We start by forming the log-likelihood function n Y ln p(D|µ. Theorem 9. we can proceed to derive that n 1X µML = xi n i=1 and n 1X 2 σML = (xi − µML )2 . σ) = ln p(xi |µ. ∂σ σ σ3 From here.i. however. see [10.d.
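For completeness, a numerical check of Example 10 on synthetic data (the Gaussian parameters below are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(1)
    D = rng.normal(loc=2.0, scale=3.0, size=10_000)   # synthetic Gaussian sample

    mu_ml = D.mean()
    sigma2_ml = ((D - mu_ml) ** 2).mean()    # note: divides by n, not n - 1
    print(mu_ml, sigma2_ml)                  # close to 2.0 and 9.0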

2.2 [Advanced] The relationship with KL divergence

We now investigate the relationship between maximum likelihood estimation and Kullback-Leibler divergence. The Kullback-Leibler divergence between two probability distributions p(x) and q(x) is defined as

    DKL(p||q) = ∫_{−∞}^{∞} p(x) log ( p(x) / q(x) ) dx.

In information theory, Kullback-Leibler divergence has a natural interpretation as the inefficiency of signal compression when the code is constructed using a suboptimal distribution q(x) instead of the correct (but unknown) distribution p(x) according to which the data has been generated. More often than not, however, Kullback-Leibler divergence is simply considered to be a measure of divergence between two probability distributions. Although this divergence is not a metric (it is not symmetric and does not satisfy the triangle inequality), it has important theoretical properties in that (i) it is always non-negative and (ii) it is equal to zero if and only if p(x) = q(x).

Consider now the divergence between an estimated probability distribution p(x|θ) and an underlying (true) distribution p(x|θt) according to which the data set D = {xi} for i = 1, ..., n was generated, assuming continuous random variables defined on R. The Kullback-Leibler divergence between p(x|θt) and p(x|θ) is

    DKL(p(x|θt) || p(x|θ)) = ∫ p(x|θt) log ( p(x|θt) / p(x|θ) ) dx
                           = ∫ p(x|θt) log ( 1 / p(x|θ) ) dx − ∫ p(x|θt) log ( 1 / p(x|θt) ) dx.

The second term in the above equation is simply the entropy of the true distribution and is not influenced by our choice of the model θ. The first term, on the other hand, can be expressed as

    ∫ p(x|θt) log ( 1 / p(x|θ) ) dx = −E[log p(x|θ)].

Therefore, maximizing E[log p(x|θ)] minimizes the Kullback-Leibler divergence between p(x|θ) and p(x|θt). Using the strong law of large numbers, we know that

    (1/n) Σ_{i=1}^{n} log p(xi|θ)  →  E[log p(x|θ)]   almost surely,

when n → ∞. Thus, when the data set is sufficiently large, maximizing the likelihood function minimizes the Kullback-Leibler divergence and, if the underlying assumptions are satisfied, leads to the conclusion that p(x|θML) = p(x|θt). Under reasonable conditions, we can infer from it that θML = θt. This will hold for families of distributions for which a set of parameters uniquely determines the probability distribution; it will not generally hold for mixtures of distributions, but we will discuss this situation later. This result is only one of the many connections between statistics and information theory.
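The argument can be visualized numerically. Assuming SciPy is available, the sketch below evaluates the expected log-likelihood E[log p(x|λ)] under a true Poisson(5) distribution over a grid of candidate λ (truncating the support, which is numerically negligible here) and confirms that it is maximized at the true parameter, i.e., exactly where the Kullback-Leibler divergence is minimized.

    import numpy as np
    from scipy.stats import poisson

    lam_true = 5.0
    grid = np.linspace(1.0, 10.0, 901)
    xs = np.arange(0, 60)
    p_true = poisson.pmf(xs, lam_true)

    # Expected log-likelihood E[log p(x|lam)] under the true distribution;
    # maximizing it is equivalent to minimizing KL(p(.|lam_true) || p(.|lam)).
    exp_ll = [(p_true * poisson.logpmf(xs, lam)).sum() for lam in grid]
    print(grid[int(np.argmax(exp_ll))])   # ~5.0, i.e., the true parameter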

Chapter 3

Introduction to Prediction Problems

3.1 Problem Statement

We start by defining a data set D = {(x1, y1), (x2, y2), ..., (xn, yn)}, where xi ∈ X is the i-th object and yi ∈ Y is the corresponding target designation. We usually assume that X = R^d, in which case xi = (xi1, xi2, ..., xid) is a d-dimensional vector called a data point (or example, or sample). Each dimension of xi is typically called a feature or an attribute. Generally speaking, we have a classification problem if Y is discrete and a regression problem when Y is continuous.

In both prediction scenarios, we assume that the features are easy to collect for each object (e.g., by measuring the height of a person or the square footage of a house), while the target variable is difficult to observe or expensive to collect, e.g., in medicine. Such situations usually benefit from the construction of a computational model that predicts targets from a set of input values. The model is trained using a set of input objects for which target values have already been collected, and is typically optimized to minimize or maximize some objective (or fitness) function.

The classification problem refers to constructing a function that for a previously unseen data point x infers (predicts) its class label y. A particular function or algorithm that infers class labels is referred to as a classifier or a classification model. The cardinality of Y in classification problems is usually small, e.g., Y = {healthy, disease}. Problems in which |Y| = 2 are referred to as binary classification problems, whereas problems in which |Y| > 2 are referred to as multi-class classification problems. An example of a data set for classification with n = 3 data points and d = 5 features is shown in Table 3.1.

This can be more complex as, for instance, in classification of text documents into categories such as {sports, medicine, travel, ...}: a single document may be related to more than one value in the set, e.g., an article on sports medicine. Here, it is often easier to think that more than one value of the output space can be associated with any particular input. We refer to this learning task as multi-label classification and set Y = {sports, medicine, travel, ...}. To account for this, we can certainly say that Y = P({sports, medicine, travel, ...}) and treat the problem as multi-class classification, albeit with a very large output space. Finally, Y can be a set of structured outputs, e.g., strings, trees, or graphs. This classification scenario is usually referred to as structured-output learning; the cardinality of the output space in structured-output learning problems is often very high.

The regression problem, on the other hand, refers to constructing a model that for a previously unseen data point x approximates the target value y as closely as possible, where often Y = R. An example of a regression problem is shown in Table 3.2.

Observe that there does not exist a strict distinction between classification and regression.

For example, if the output space is Y = {0, 1, 2}, we need not treat this problem as classification. This is because there exists a relationship of order among the elements of Y that can simplify model development: we can take Y = [0, 2] and simply develop a regression model, from which we can recover the original discrete values by rounding the raw prediction outputs. The selection of a particular way of modeling depends on the analyst and their knowledge of the domain, as well as on technical aspects of learning.

         wt [kg]   ht [m]   T [°C]   sbp [mmHg]   dbp [mmHg]    y
    x1      91      1.85     36.6        121           75       −1
    x2      75      1.80     37.4        128           85       +1
    x3      54      1.56     36.6        110           62       −1

Table 3.1: An example of a binary classification problem: prediction of a disease state for a patient. Here, features indicate weight (wt), height (ht), temperature (T), systolic blood pressure (sbp), and diastolic blood pressure (dbp). The class labels indicate the presence of a particular disease, e.g., diabetes; yi = +1 indicates presence while yi = −1 indicates absence of disease. This data set contains one positive data point (x2) and two negative data points (x1, x3).

[Table 3.2: An example of a regression problem: prediction of the price of a house in a particular region. Features indicate the size of the house (size) in square feet, the age of the house (age) in years, the distance from the city center (dist) in miles, the average income in a one square mile radius (inc), and the population density in the same area (dens). The target y indicates the price the house is sold at, in hundreds of thousands of dollars. The table lists three example houses, x1–x3, with sizes 1250, 3200, and 825 square feet and ages 5, 9, and 12 years, together with their values of dist, inc, dens, and y.]

3.2 Useful notation

In the machine learning literature we use d-tuples x = (x1, x2, ..., xd) to denote data points. However, often we can benefit from algebraic notation, where each data point x is a column vector in the d-dimensional Euclidean space: x = [x1 x2 ... xd]ᵀ ∈ R^d, where ᵀ is the transpose operator. For example, a linear combination of features and some set of coefficients w = [w1 w2 ... wd]ᵀ ∈ R^d,

    Σ_{i=1}^{d} wi xi = w1 x1 + w2 x2 + ... + wd xd,

can be expressed using an inner (dot) product of column vectors, wᵀx. A linear combination wᵀx results in a single number. Another useful notation for such linear combinations will be ⟨w, x⟩.

Finally, we will also use an n-by-d matrix X = (x1ᵀ, x2ᵀ, ..., xnᵀ)ᵀ to represent the entire set of data points, and y to represent a column vector of targets. That is, the i-th row of X represents data point xiᵀ, whereas y is an n-by-1 vector of targets. Also, the j-th column of X, denoted fj, is an n-by-1 vector which contains the values of feature j for all data points. The notation is further presented in Figure 3.1.

[Figure 3.1: Data set representation and notation. X is an n-by-d matrix representing features and data points; the i-th row is data point xiᵀ with entries xij, the j-th column is feature fj, and y is an n-by-1 vector of targets yi.]

3.3 Optimal classification and regression models

Our goal now is to establish the performance criteria that will be used to evaluate predictors f : X → Y and subsequently define optimal classification and regression models. The criterion for optimality will be probabilistic. We start with classification and consider a situation where the joint probability distribution p(x, y) is either known or can be learned from data. In addition, we will consider that we are given a cost function c : Y × Y → [0, ∞), where for each prediction ŷ and true target value y the classification cost can be expressed as a constant c(ŷ, y), regardless of the input x ∈ X given to the classifier. This cost function can simply be stored as a |Y| × |Y| cost matrix. In particular, we are interested in minimizing the expected cost

    E[C] = ∫_X Σ_y c(ŷ, y) p(x, y) dx
         = ∫_X p(x) Σ_y c(ŷ, y) p(y|x) dx,

where the integration is over the entire input space X = R^d. From this equation, we can see that the optimal classifier can be expressed as

    fBR(x) = arg min_{ŷ∈Y} { Σ_y c(ŷ, y) p(y|x) }

for any x ∈ X. We will refer to this classifier as the Bayes risk classifier.
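A minimal sketch of the Bayes risk classifier for a binary problem follows; the cost matrix, label encoding, and posterior values are hypothetical numbers chosen only to illustrate how asymmetric costs change the decision relative to the MAP rule.

    import numpy as np

    # Hypothetical cost matrix c(y_hat, y); labels: 0 = no lab test, 1 = lab test.
    # Failing to order a needed test is far more expensive than an unnecessary test.
    cost = np.array([[0.0, 10.0],
                     [1.0,  0.0]])

    def f_BR(posterior):
        """Return the prediction minimizing the expected cost sum_y c(y_hat, y) p(y|x)."""
        expected_cost = cost @ posterior
        return int(np.argmin(expected_cost))

    p_y_given_x = np.array([0.85, 0.15])   # even a modest probability of needing the test...
    print(f_BR(p_y_given_x))               # ...triggers prediction 1 (order the lab test)

    # With the 0-1 cost, the Bayes risk classifier reduces to the MAP classifier
    cost01 = 1.0 - np.eye(2)
    print(int(np.argmin(cost01 @ p_y_given_x)))   # 0, the most probable label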

One example where the Bayes risk classifier can be useful is the medical domain. Suppose our goal is to decide whether a patient with a particular set of symptoms (x) should be sent for an additional lab test (y = 1 if yes and y = −1 if not), with cost clab, in order to improve diagnosis. However, if we do not perform the lab test and the patient is later found to have needed it for proper treatment, we may incur a significant penalty, say clawsuit. If clawsuit ≫ clab, as is expected to be the case, then the classifier needs to appropriately adjust its outputs to account for the cost disparity in the different forms of incorrect prediction.

In many practical situations, however, it may not be possible to define a meaningful cost matrix; thus, a reasonable criterion would be to minimize the probability of the classifier's error, P(f(x) ≠ y). This corresponds to the situation where the cost function is defined as

    c(ŷ, y) = 0 when y = ŷ, and 1 when y ≠ ŷ.

After plugging these values into the definition of fBR(x), the Bayes risk classifier simply becomes the maximum a posteriori (MAP) classifier,

    fMAP(x) = arg max_{y∈Y} {p(y|x)}.

Therefore, if p(y|x) is known or can be accurately learned, we are fully equipped to make the prediction that minimizes the total cost. In other words, we have converted the problem of minimizing the expected classification cost or probability of error into the problem of learning functions, more specifically learning probability distributions.

The analysis for regression is a natural extension of that for classification. Here too, we are interested in minimizing the expected cost of predicting the true target y when a predictor f(x) is used. For simplicity, we will consider c(f(x), y) = (f(x) − y)², where c : R × R → [0, ∞) is again some cost function between the predicted value f(x) and the true value y. The expected cost can be expressed as

    E[C] = ∫_X ∫_Y c(f(x), y) p(x, y) dy dx,

which results in

    E[C] = ∫_X ∫_Y (f(x) − y)² p(x, y) dy dx
         = ∫_X p(x) ∫_Y (f(x) − y)² p(y|x) dy dx.

Assuming f(x) is flexible enough to be separately optimized for each unit volume dx, we see that minimizing E[C] leads us to the problem of minimizing

    g(u) = ∫_Y (u − y)² p(y|x) dy,

where we used the substitution u = f(x). We can now differentiate g with respect to u as

    ∂g(u)/∂u = 2 ∫_Y (u − y) p(y|x) dy = 0
    ⟹ u ∫_Y p(y|x) dy = ∫_Y y p(y|x) dy,

and since ∫_Y p(y|x) dy = 1 this results in the optimal solution

    f*(x) = ∫_Y y p(y|x) dy = E[Y|x].

Therefore, the optimal regression model in the sense of minimizing the squared error between the prediction and the true target is the conditional expectation E[Y|x].

It may appear that in the above equations setting f(x) = y would always lead to E[C] = 0. Unfortunately, this would be an invalid operation because for a single input x there may be multiple possible outputs y, and they can certainly appear in the same data set. To be a well-defined function, f(x) must always have the same output for the same input; E[C] = 0 can only be achieved if p(y|x) is a delta function for every x.

Having found the optimal regression model, we can now write the expected cost in the cases of both optimal and suboptimal models f(x). That is, we are interested in expressing E[C] when

1. f(x) = E[Y|x]
2. f(x) ≠ E[Y|x].

We have already found E[C] when f(x) = E[Y|x]. The expected cost can be simply expressed as

    E[C] = ∫_X ∫_Y (E[Y|x] − y)² p(x, y) dy dx,

which corresponds to the (weighted) squared error between the optimal prediction and the target everywhere in the feature space. This is the best scenario in regression; that is, we cannot achieve a lower cost on average for this squared cost.

The next situation is when f(x) ≠ E[Y|x]. Here, we will proceed by decomposing the squared error as

    (f(x) − y)² = (f(x) − E[Y|x] + E[Y|x] − y)²
                = (f(x) − E[Y|x])² + 2 (f(x) − E[Y|x])(E[Y|x] − y) + (E[Y|x] − y)².

We will now take a more detailed look at the cross term g(x, y) = (f(x) − E[Y|x])(E[Y|x] − y) when placed under the expectation over

p(x, y):

    E[g(X, Y)] = ∫_X ∫_Y (f(x) − E[Y|x])(E[Y|x] − y) p(x, y) dy dx
               = ∫∫ f(x) E[Y|x] p(x, y) dy dx − ∫∫ f(x) y p(x, y) dy dx
                 + ∫∫ E[Y|x] y p(x, y) dy dx − ∫∫ E[Y|x] E[Y|x] p(x, y) dy dx
               = ∫ f(x) E[Y|x] p(x) dx − ∫ f(x) p(x) ∫ y p(y|x) dy dx
                 + ∫ E[Y|x] E[Y|x] p(x) dx − ∫ E[Y|x] E[Y|x] p(x) dx
               = 0.

Therefore, we can now express the expected cost as

    E[C] = ∫_X ∫_Y (f(x) − y)² p(x, y) dy dx
         = ∫_X (f(x) − E[Y|x])² p(x) dx + ∫_X ∫_Y (E[Y|x] − y)² p(x, y) dy dx,

where the first term is the "distance" between the trained model f(x) and the optimal model E[Y|x], and the second term is the "distance" between the optimal model E[Y|x] and the correct target value y.

To sum up, we argued here that optimal classification and regression models critically depend on knowing or accurately learning the posterior distribution p(y|x). This task can be solved in different ways, but a straightforward approach is to assume a functional form for p(y|x), say p(y|x, θ), where θ is a set of weights or parameters that are to be learned from the data. Alternatively, we can learn the class-conditional and prior distributions, p(x|y) and p(y), respectively. Models obtained by directly estimating p(y|x) are called discriminative models, and models obtained by directly estimating p(x|y) and p(y) are called generative models. Using

    p(y|x) = p(x, y) / p(x) = p(x|y) p(y) / Σ_y p(x, y) = p(x|y) p(y) / Σ_y p(x|y) p(y),

we can see that these two learning approaches are equivalent in theory; the choice depends on our prior knowledge and/or preferences. Direct estimation of the joint distribution p(x, y) is relatively rare, as it may be more difficult to hypothesize its parametric or non-parametric form from which the parameters are to be found. Finally, in some situations we can train a model without an explicit probabilistic approach

in mind. In these cases, we typically aim to show good performance of an algorithm in practice, say on a large number of data sets, according to a performance measure relevant to the problems at hand.

3.4 [Advanced] Bayes Optimal Models

We saw earlier that optimal prediction models reduce to learning the posterior distribution p(y|x), which is then used to minimize the expected cost (risk, loss). However, in practice, the probability distribution p(y|x) must be modeled using a particular functional form and a set of tunable coefficients. In a typical learning problem, we are given a data set D = {(xi, yi)} for i = 1, ..., n and are asked to model p(y|x); thus, our task is to express p(y|x, D). For this purpose, we will think of D as a realization of a random variable D and will assume that D was drawn according to the true underlying distribution p(x, y).

One such example is used in logistic regression, when X = R^d and Y = {0, 1}, where

    p(1|x) = 1 / (1 + e^{−(w0 + Σ_{j=1}^{d} wj xj)})

and p(0|x) = 1 − p(1|x). Here, (w0, w1, ..., wd) ∈ R^{d+1} is a set of weights that are to be inferred from a given data set D and x ∈ R^d is an input data point. A number of other types of functional relationships can be used as well, providing a vast set of possibilities for modeling distributions.

We will adjust our notation to denote the posterior distribution more precisely as p(y|x) = p(y|x, f), where f is a particular function from some function (hypothesis) space F. We can think of F as the set of all functions from a specified class, say for all (w0, w1, ..., wd) ∈ R^{d+1} in the example above, but we can also extend the functional class beyond simple parameter variation to incorporate non-linear decision surfaces. Using the sum and product rules, we will rewrite our original task of estimating p(y|x) as

    p(y|x, D) = ∫_F p(y|x, f, D) p(f|x, D) df = ∫_F p(y|x, f) p(f|x, D) df.

Here we used conditional independence between the output Y and the data set D once a particular model f was selected based on D. This equation gives us a sense that the optimal decision can be made through a mixture of distributions p(y|x, f), where the weights are given by the posterior densities p(f|x, D). In finite hypothesis spaces F we have that

    p(y|x, D) = Σ_{f∈F} p(y|x, f) p(f|x, D),

where the p(f|x, D) are posterior probabilities. We may further assume that p(f|x, D) = p(f|D), in which case the weights can be precomputed based on the given data set D. This leads to more efficient calculations of the posterior probabilities.

In classification, we can rewrite our original MAP classifier as

    fMAP(x, D) = arg max_{y∈Y} {p(y|x, D)},

which readily leads to the following formulation:

    fMAP(x, D) = arg max_{y∈Y} { ∫_F p(y|x, f, D) p(f|D) df }
               = arg max_{y∈Y} { ∫_F p(y|x, f) p(D|f) p(f) df }.

It can be shown that no classifier can outperform the Bayes optimal classifier. Interestingly, the Bayes optimal model also hints that better prediction performance can be achieved by combining multiple models and averaging their outputs. This provides theoretical support for ensemble learning and methods such as bagging and boosting.

One problem in Bayes optimal classification is the efficient calculation of fMAP(x, D), given that the function (hypothesis) space F is generally uncountable. One approach to this is sampling of functions from F according to p(f) and then calculating p(f|D) or p(D|f). This can be computed until p(y|x, D) converges.

Chapter 4

Linear Regression

Given a data set D = {(xi, yi)} for i = 1, ..., n, the objective is to learn the relationship between features and the target. We usually start by hypothesizing the functional form of this relationship. For example,

    f(x) = w0 + w1 x1 + w2 x2,

where w = (w0, w1, w2) is a set of parameters that need to be determined (learned) and x = (x1, x2). Alternatively, we may hypothesize that f(x) = α + β x1 x2, where θ = (α, β) is another set of parameters to be learned. In the former case, the target function is modeled as a linear combination of features and parameters, i.e.,

    f(x) = Σ_{j=0}^{d} wj xj,

where we extended x to (x0 = 1, x1, ..., xd). Finding the best parameters w is then referred to as the linear regression problem, whereas all other types of relationship between the features and the target fall into the category of non-linear regression. In either situation, the regression problem can be presented as a probabilistic modeling approach that reduces to parameter estimation, i.e., to an optimization problem with the goal of maximizing or minimizing some performance criterion between the target values {yi} and the predictions {f(xi)}. We can think of a particular optimization algorithm as the learning or training algorithm.

4.1 Maximum likelihood formulation

We now consider a statistical formulation of linear regression. We shall first lay out the assumptions behind this process and subsequently formulate the problem through maximization of the conditional likelihood function. In the following section, we will show how to solve the optimization and analyze the solution and its basic statistical properties.

Let us assume that the observed data set D is the product of a data generating process in which n data points were drawn independently and according to the same distribution p(x). Assume also that the target variable Y has an underlying linear relationship with the features (X1, X2, ..., Xd), modified by some error term ε that follows a zero-mean Gaussian distribution, i.e., ε : N(0, σ²). That is, for a given input x, the target y is a realization of a random variable Y defined as

    Y = Σ_{j=0}^{d} ωj Xj + ε,

where ω = (ω0, ω1, ..., ωd) is a set of unknown coefficients we seek to recover through estimation. Generally, the assumption of normality for the error term is reasonable (recall the central limit theorem!), although the independence between ε and x may not hold in practice. Using a few simple properties of expectations, we can see that Y also follows a Gaussian distribution, i.e., its conditional density is p(y|x, ω) = N(ωᵀx, σ²).

In linear regression, we seek to approximate the target as f(x) = wᵀx, where the weights w are to be determined. We first write the conditional likelihood function for a single pair (x, y) as

    p(y|x, w) = (1/√(2πσ²)) exp( −(y − Σ_{j=0}^{d} wj xj)² / (2σ²) )

and observe that the only change from the conditional density function of Y is that the coefficients w are used instead of ω. Incorporating the entire data set D = {(xi, yi)} for i = 1, ..., n, we can now write the conditional likelihood function as p(y|X, w) and find the weights as

    wML = arg max_w {p(y|X, w)}.

Since the examples are independent and identically distributed (i.i.d.), we have

    p(y|X, w) = ∏_{i=1}^{n} p(yi|xi, w) = ∏_{i=1}^{n} (1/√(2πσ²)) exp( −(yi − Σ_{j=0}^{d} wj xij)² / (2σ²) ).

For reasons of mathematical convenience, we will look at the logarithm (a monotonic function) of the likelihood function and express the log-likelihood as

    ln p(y|X, w) = −n ln √(2πσ²) − (1/(2σ²)) Σ_{i=1}^{n} ( yi − Σ_{j=0}^{d} wj xij )².

Given that the first term on the right-hand side is independent of w, maximizing the likelihood function corresponds exactly to minimizing the sum of squared errors

    Err(w) = Σ_{i=1}^{n} (f(xi) − yi)² = Σ_{i=1}^{n} ei²,   where f(xi) = Σ_{j=0}^{d} wj xij.

Geometrically, this error is the square of the Euclidean distance between the vector of predictions ŷ = (f(x1), f(x2), ..., f(xn)) and the vector of observed target values y = (y1, y2, ..., yn). A simple example illustrating the linear regression problem is shown in Figure 4.1.

To see more explicitly why the maximum likelihood solution corresponds to minimizing Err(w), notice that maximizing the likelihood is equivalent to maximizing the log-likelihood

However. (4. the assumptions include that the data D was drawn i. and that there is an absence of noise in the collection of features. the maximum likelihood wML corresponds to wML = argmin − ln(p(y|X.. that the noise (error term) is zero-mean Gaussian and independent of the features. there is an underlying linear relationship between features and the target. In particular. Note that we could have simply started with some (expert-defined) error function. Therefore. 50 . we will discuss how to solve this optimization and the properties of the solution. y1) x Figure 4. (2. as was originally done for OLS and using Err(w). 2.i. the statistical framework provides insights into the assumptions behind OLS regression.3) . (because log is monotonic) which is equivalent to minimizing the negative log-likelihood. y f(x) (x2.d.3) . w)) w∈Rd  2 n √ n d X  1 X X = argmin log 2πσ 2 + 2 yi − wj xij  w∈Rd 2σ i=1 i=1 j=0  2 n X d X = argmin yi − wj xij  w∈Rd i=1 j=0 = argmin Err(w) w∈Rd In the next sections. 3.1: An example of a linear regression fitting on data set D = {(1. y2) e1 = f(x1) { y1 (x1. (3. 2. 1.3)}.2) . The task of the optimization process is to find the best linear function f (x) = w0 + w1 x so that the sum of squared errors e21 + e22 + e23 + e24 is minimized.

4.2 Ordinary Least-Squares (OLS) Regression

To minimize the sum of squared errors, we shall first re-write Err(w) as

    Err(w) = Σ_{i=1}^{n} (f(xi) − yi)² = Σ_{i=1}^{n} ( Σ_{j=0}^{d} wj xij − yi )²,

where, again, we expanded each data point xi by xi0 = 1 to simplify the expression. Now, we set the partial derivatives to 0 and solve the equations for each weight wj:

    ∂Err/∂w0 = 2 Σ_{i=1}^{n} ( Σ_{j=0}^{d} wj xij − yi ) xi0 = 0
    ∂Err/∂w1 = 2 Σ_{i=1}^{n} ( Σ_{j=0}^{d} wj xij − yi ) xi1 = 0
    ...
    ∂Err/∂wd = 2 Σ_{i=1}^{n} ( Σ_{j=0}^{d} wj xij − yi ) xid = 0

This results in a system of d + 1 linear equations with d + 1 unknowns that can be routinely solved (e.g., by using Gaussian elimination). While this formulation is useful, it does not allow us to obtain a closed-form solution for w or discuss the existence or multiplicity of solutions. To address the first point we will exercise some matrix calculus, while the remaining points will be discussed later.

We first write the sum of squared errors using matrix notation as

    Err(w) = (Xw − y)ᵀ(Xw − y) = ‖Xw − y‖₂²,

where ‖v‖₂ = √(vᵀv) = √(v1² + v2² + ... + vn²) is the length of vector v; it is also called the L2 norm. We can now formalize the ordinary least-squares (OLS) linear regression problem as

    wML = arg min_w ‖Xw − y‖₂².

We proceed by finding the gradient ∇Err(w) and the Hessian matrix H_Err(w). Finding weights for which ∇Err(w) = 0 will result in an extremum, while a positive semi-definite Hessian will ensure that the extremum is a minimum. The gradient function ∇Err(w) is a derivative of a scalar with respect to a vector. The intermediate steps of calculating the

gradient require derivatives of vectors with respect to vectors; some of the rules for such derivatives are shown in Table 4.1. The derivative of a vector y (say m-dimensional) with respect to a vector x (say n-dimensional) is an n×m matrix M with components Mij = ∂yj/∂xi. A derivative of a scalar with respect to a vector is a special case of this situation that results in an n × 1 column vector.

    y         ∂y/∂x
    Ax        Aᵀ
    xᵀA       A
    xᵀx       2x
    xᵀAx      Ax + Aᵀx

Table 4.1: Useful derivative formulas of vectors with respect to vectors.

Application of the rules from Table 4.1 results in

    ∇Err(w) = 2XᵀXw − 2Xᵀy

and, therefore, from ∇Err(w) = 0 we find that

    wML = (XᵀX)⁻¹Xᵀy.                                                          (4.1)

The next step is to find the second derivative in order to ensure that we have not found a maximum (or a saddle point). This results in

    H_Err(w) = 2XᵀX,

which is a positive semi-definite matrix (why? consider that for any vector x ≠ 0, xᵀAᵀAx = (Ax)ᵀAx = ‖Ax‖₂² ≥ 0, with equality only if the columns of A are linearly dependent). Thus, we have indeed found a minimum. This is the global minimum because a positive semi-definite Hessian implies convexity of Err(w). Furthermore, if the columns of X are linearly independent, the Hessian is positive definite, which implies that the global minimum is unique. We can now express the predicted target values as

    ŷ = X wML = X(XᵀX)⁻¹Xᵀy.

The matrix X(XᵀX)⁻¹Xᵀ is called the projection matrix; see Section B.1 to understand how it projects y onto the column space of X.

Example 11: Consider again the data set D = {(1, 1.2), (2, 2.3), (3, 2.3), (4, 3.3)} from Figure 4.1. We want to find the optimal coefficients of the least-squares fit for f(x) = w0 + w1 x and then calculate the sum of squared errors on D after the fit. The OLS fitting can be performed using

    X = [ 1 1 ; 1 2 ; 1 3 ; 1 4 ],   w = [w0, w1]ᵀ,   y = [1.2, 2.3, 2.3, 3.3]ᵀ,

where a column of ones was added to X to allow for a non-zero intercept (y = w0 when x = 0). Substituting X and y into Eq. (4.1) results in w = (0.7, 0.63), and the sum of squared errors after the fit is Err(w) = 0.223.
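The computation in Example 11 can be reproduced directly from Eq. (4.1):

    import numpy as np

    # Example 11: D = {(1, 1.2), (2, 2.3), (3, 2.3), (4, 3.3)}
    X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)   # column of ones + feature
    y = np.array([1.2, 2.3, 2.3, 3.3])

    # Closed-form OLS solution w = (X^T X)^{-1} X^T y
    w = np.linalg.solve(X.T @ X, X.T @ y)
    print(w)                                  # [0.7, 0.63]

    # Sum of squared errors after the fit
    residuals = X @ w - y
    print(residuals @ residuals)              # 0.223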

As seen in the example above, it is standard practice to add a column of ones to the data matrix X in order to ensure that the fitted line, or generally a hyperplane, does not have to pass through the origin of the coordinate system. This effect, however, can be achieved in other ways. Consider the first component of the gradient vector,

    ∂Err/∂w0 = 2 Σ_{i=1}^{n} ( Σ_{j=0}^{d} wj xij − yi ) xi0 = 0,

from which we obtain that

    w0 = (1/n) ( Σ_{i=1}^{n} yi − Σ_{i=1}^{n} Σ_{j=1}^{d} wj xij ).

When all features (columns of X) are normalized to have zero mean, i.e., when Σ_{i=1}^{n} xij = 0 for every column j, it follows that

    w0 = (1/n) Σ_{i=1}^{n} yi.

We see now that if the target variable is normalized to zero mean as well, it follows that w0 = 0 and the column of ones is not needed.

4.2.1 Weighted error function

In some applications it is useful to consider minimizing the weighted error function

    Err(w) = Σ_{i=1}^{n} ci ( Σ_{j=0}^{d} wj xij − yi )²,

where ci > 0 is a cost for data point i. Expressed in matrix form, the goal is to minimize (Xw − y)ᵀC(Xw − y), where C = diag(c1, c2, ..., cn). Using a similar approach as above, it can be shown that the weighted least-squares solution wC can be expressed as

    wC = (XᵀCX)⁻¹XᵀCy.

In addition, it can be derived that

    wC = wML + (XᵀCX)⁻¹Xᵀ(I − C)(XwML − y),

where wML is provided by Eq. (4.1). We can see that the solutions are identical when C = I, but also when XwML = y.
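A sketch of the weighted solution, reusing the data of Example 11 with hypothetical costs ci that up-weight the last two points:

    import numpy as np

    X = np.array([[1, 1], [1, 2], [1, 3], [1, 4]], dtype=float)
    y = np.array([1.2, 2.3, 2.3, 3.3])
    C = np.diag([1.0, 1.0, 4.0, 4.0])        # hypothetical per-point costs

    w_ols = np.linalg.solve(X.T @ X, X.T @ y)           # ordinary least squares
    w_c = np.linalg.solve(X.T @ C @ X, X.T @ C @ y)     # weighted least squares
    print(w_ols)   # approximately [0.7, 0.63]
    print(w_c)     # shifted toward fitting the up-weighted points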

A good resource for matrix gradients is the matrix cookbook. we have h > i ω + X† ε ω + X† ε − ωω >  Cov[wML (D)] = E = ωω > + E X† εε> X†> − ωω >   because E [ε] = 0. random variables each drawn according to N (0. rather than a scalar.4. we shall now treat the solution vector (estimator) wML as a random variable and investigate its statistical properties. . . ε2 . with error n X E(W) = kXW − Yk2F = kXi. Let us now look at the expected value (with respect to training data set D) for the weight vector wML : h −1 > i E[wML (D)] = E X> X X (Xω + ε) = ω.: W − Yi. making E X† εω >  = E X  † ω > = 0. and use ε = (ε1 .. Frobenius norm i=1 = trace (XW − Y)> (XW − Y)  and solution WML = (X> X)−1 X> Y.e.2. by taking partial derivatives or. σ 2 ). . An estimator whose expected value is the true value of the parameter is called an unbiased estimator. E εε> |X = E εε> = σ 2 I. Correspondingly. preferably.d. . we have Yi = dj=0 ωj Xij + εi . since E[ε] = 0. the weights W ∈ Rd×m to give xW ∈ Rm . by using gradient rules for matrix variables. y ∈ Rm .: k22 .3 Expectation and variance for the solution vector Under the data generating model presented above. Now because the noise terms   E [ε] are independent of the inputs. where now the target is an m-dimensional vector. For each of the data points. we can use the law of total   probability (also called the tower rule). 4. εn ) to denote a P vector of i.2 Predicting multiple outputs simultaneously The extension to multiple outputs is straightforward.2. giving target matrix Y ∈ Rn×m . Exercise: Derive this solution. to get E X† εε> X†> = E E X† εε> X†> |X      = E X† E εε> |X X†>     = σ 2 E X† X†> . i.i. The covariance matrix for the optimal set of parameters can be expressed as h i > Cov[wML (D)] = E (wML (D) − ω) (wML (D) − ω) = E wML (D)wML (D)> − ωω >   −1 Taking X† = X> X X> .   54 .
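A minimal sketch (ours) of the multiple-output solution WML = (X>X)−1 X>Y, assuming Y is an n × m matrix of targets:

import numpy as np

def multi_output_ols(X, Y):
    # Each column of the returned W is the single-output OLS solution
    # for the corresponding column of Y.
    return np.linalg.solve(X.T @ X, X.T @ Y)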

4. φ1 (x).1. and connects the linear regression optimization to solving systems. we will modify the previous expression into p X f (x) = wj x j .3 Linear regression for non-linear problems At first. It can be shown that estimator wML (D) = X† y is the one with the smallest variance among all unbiased estimators (Gauss-Markov theorem). 55 . . .2. we would look for the fit in the following form f (x) = w0 + w1 x. Here. In OLS regression. where φj (x) = xj and φ = (φ0 (x). Obtaining such a useful feature representation is a central problem in machine learning. These results are important not only because the estimated coefficients are expected to match the true coefficients of the underlying linear relationship. where x is the data point and w = (w0 . Fortunately. we will first examine a simpler expanded representation that enables non-linear learning: polynomial curve fitting. To gain further insight into the ordinary linear regression solution. Applying this transformation to every data point in x results in a new data matrix Φ. which then enables a non-linear fit. After all. To achieve a polynomial fit of degree p. φp (x)). and the space of possible solutions. see an algebraic perspective in the appendix B. It provides more insights into uniqueness of the solution. the applicability of linear regression is broader than originally thought. The main idea is to apply a non-linear transformation to the data matrix X prior to the fitting step. as shown in Figure 4.3. but also because the matrix inversion in the covariance matrix suggests stability problems in cases of singular and nearly singular matrices.Thus. Numerical stability and sensitivity to perturbations will be discussed later.1 Polynomial curve fitting We start with one-dimensional data. w1 ) is the weight vector. we will discuss this in detail in Chapter 7. . . We will rewrite this expression using a set of basis functions as p X f (x) = wj φj (x) j=0 = w> φ. it might seem that the applicability of linear regression and classification to real-life problems is greatly limited. we have −1 Cov[wML (D)] = σ 2 E X> X  where we exploited the facts that E εε> = σ 2 I and that an inverse of a symmetric matrix   is also a symmetric matrix. it is not clear whether it is realistic (most of the time) to assume that the target variable is a linear combination of features. j=0 where p is the degree of the polynomial. 4.
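A short numpy sketch of polynomial curve fitting as described above (ours, not from the text): build the basis expansion φj(x) = x^j and solve the resulting least-squares problem.

import numpy as np

def polynomial_design(x, p):
    # n x (p+1) matrix with columns 1, x, x^2, ..., x^p
    return np.vander(x, N=p + 1, increasing=True)

def fit_polynomial(x, y, p):
    Phi = polynomial_design(x, p)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w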

. ...9. φp (xn ) Figure 4. . . x2 ... the sum of squared errors on the outside data set reveal a poor generalization ability of the cubic model because we obtain E(w) = 26.. We will discuss regularization in Section 4.2.1.7. respectively.3). the best fit is achieved with the cubic polynomial..6. 0. the optimal set of weights is now calculated as −1 > wML = Φ> Φ Φ y.025) and w3 = (−3. whereas the size of the data set remained small..1. In this case. One signature of overfitting is an increase in the magnitude of the coefficients.. x4 }. It turned out that the optimal coefficients wML = (0.63) were close to the true coefficients ω = (1.65.3. We will now attempt to estimate the coefficients of a polynomial fit with degrees p = 2 and p = 3. non-linear basis functions that are commonly used are the sigmoid function 1 φj (x) = x−µj − sj 1+e 56 . 0. .. the values of the coefficients in w3 became significantly larger with alternating signs (suggesting overcompensation). n xn φ0 (xn ) .. 1..2. the overfitting occurred because the complexity of the model was increased considerably.35).3. .9. 0. For example. 0. . . given a set {x1 . −0.3). . overfitting is indicated by a significant difference in fit between the data set on which the model was trained and the outside data set on which the model is expected to be applied (Figure 4. Among others. Broadly speaking.2 as an approach to prevent this effect. We will also calculate the sum of squared errors on D after the fit as well as on a large discrete set of values x ∈ {0.. Example 12: In Figure 4. 0.. j = 0. 10} where the target values will be generated using the true function 1 + x2 . E(w2 ) = 3. 0. and E(w3 ) = 22018..1 we presented an example of a data set with four data points. even though the error terms were relatively significant. This effect is called overfitting. Thus. while the absolute values of all coefficients in w and w2 were less than one. The sum of squared errors on D equals E(w2 ) = 0. 0. .5).2: Transformation of an n × 1 data matrix x into an n × (p + 1) matrix Φ using a set of basis functions φj . p .2..575. Following the discussion from Section 4. φp (x1 ) x2 → . . .. Using a polynomial fit with degrees p = 2 and p = 3 results in w2 = (0.  Polynomial curve fitting is only one way of non-linear fitting because the choice of basis functions need not be limited to powers of x. −2. 0. −0. .5. What we did not mention was that. 6. x Φ 1 x1 φ0 (x1 ) . x3 ..4.221 and E(w3 ) ≈ 0. the targets were generated by using function 1 + x2 and then adding a measurement error e = (−0.. However. .755.

or a Gaussian-style exponential function (x−µj )2 − 2σ 2 φj (x) = e j . the number of non-zero singular values constitutes the rank of X. see Section 7. this approach works only for a one-dimensional input x. we will look at the singular value decomposition of X. The linear fit. 4.4. this approach needs to be generalized using radial basis functions. f3 (x). V ∈ Rd×d and non-negative diagonal matrix Σ ∈ Rn×d . The diagonal entries in Σ are the singular values. 4. whereas the cubic polynomial fit. and σj are constants to be determined. because any 1 > > An orthonormal matrix U is square matrix that satisfies U U = I and UU = I 57 .1 Stability and sensitivity of the solutions In practice. is shown as a solid blue line. 5 f3(x) 4 f1(x) 3 2 1 x 0 1 2 3 4 5 Figure 4. Any matrix X ∈ Rn×d can be decomposed into its singular value decomposition. we will discuss a few important considerations when implementing and using regression algorithms in practice. For higher dimensions. To see why. data sets include large numbers of features.3: Example of a linear vs. is shown as a solid green line. The singular value decomposition of X = UΣV> for orthonormal matrices1 U ∈ Rn×n .4 Practical considerations In this section. As with the previous linear algebra constructs.1 for more details. or nearly linearly dependent. f1 (x). This causes X> X to no longer be invertible and produces solutions that are sensitive to perturbations in y and X. similar. which are sometimes identical. it allows us to easily examine properties of X. The dotted red line indicates the target linear concept. where µj . However. sj . polynomial fit on a data set shown in Figure 4.1.
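To make the role of small singular values concrete, here is a sketch (ours) that solves the least-squares problem through the singular value decomposition and drops singular values below a tolerance, in the spirit of the truncation strategy discussed here; the tolerance is a user choice.

import numpy as np

def truncated_svd_solve(X, y, tol=1e-10):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s > tol                      # discard directions with tiny singular values
    return Vt.T[:, keep] @ ((U[:, keep].T @ y) / s[keep])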

Instead of specifying no prior over w. and now combine the negative log-likelihood and the negative log prior to get the following ridge regression problem: Err(w) = (Xw − y)> (Xw − y) + λw> w . Σd has no zeros on the diagonal. The idea is to penalize weight coefficients that are too large. Σ† . This is a form of regularization. kwk22 = w> w where λ is a user-selected parameter that is called the regularization parameter. This means that. If we solve this equation in a similar manner as before. In the general case. (4.2 Regularization So far.. But. because (X> X)−1 = VΣ−2 > d V . 58 . λ−1 I). we can also propose a MAP objective. 4.4. Σd = Σ(1 : d. even for X that is not full rank. shown in Figure 4. 1 : d) because U is orthonormal. σi−1 is large and amplifies any changes in y. we can drop the first constant which does not affect the selection of w. we obtain wMAP = (X> X + λI)−1 X> y.2) σi i=1 Notice that for full-rank X. the larger the λ. i. the pseudo-inverse is still defined as the inverse of the non-zero singular values. followed by a scaling (multiplication by Σ). as before. 2λ 2 because |λ−1 I| = λ−d . which we discuss next. Notice that X> X = VΣU> UΣV> = VΣ2d V> . and for different numbers of samples.2) makes it clear why the solution can be sensitive to pertur- bations. The solution in Equation (4. Now we can discuss the pseudo-inverse of X in terms of the singular value decomposition. we get w> w λ − ln p(w) = ln(2π|λ−1 I|) + −1 = ln(2π) − d ln(λ) + w> w.e. due to being nearly linearly dependent. N (0. followed again by a rotation (multiplication by U). we will see different small singular values.4.linear transformation can be decomposed into a rotation (multiplication by V> ). For small singular values. the more large weights are penalized. We will discuss two common priors (regularizers): the Gaussian prior (`2 norm) and the Laplace prior (`1 norm). A common strategy to deal with this instability is to drop or truncate small singular values. (X> X)−1 X> = VΣ−2 > > −2 > −1 > d V VΣU = VΣd ΣU = VΣd U which matches the definition of the pseudo-inverse using the singular value decomposition for non-full rank matrices. we have discussed linear regression in terms of maximum likelihood. we can select a prior to help regularize overfitting to the observed data. for different samples in X. however. Taking the log of the zero-mean Gaussian prior. which will again strongly affect the solution. we obtain the solution rank of X X v i u> w = X† y = VΣ† U> y = i y. Moreover. The inverse of X> X exists if X is full rank. As before.
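A minimal sketch of the ridge regression solution wMAP = (X>X + λI)−1 X>y (ours; λ must be supplied by the user, for example chosen on a validation set):

import numpy as np

def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)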

59 . For a large number of samples n. For specialized scenarios. however. However. as in Algorithm 2. This has the effect of feature selection. there is a long theoretical and empirical history indicating its effectiveness (see for example [4. removing stability issues with dividing by small singular values. This has the nice effect of shifting the squared singular values in Σ2d by λ. for many other error functions. that for the Lasso we no longer have a closed form solution. Notice. as long as λ is itself large enough. where samples are processed incrementally. solving for ∇Err(w) = 0 in a closed form way is not possible. 3]). where entries in w are zero. this regularizer penalized large values in w. The Gaussian prior prefers the values to be near zero. In stochastic approximation.4: A comparison between Gaussian and Laplace priors. An alternative is to approximate the gradient less accurately with fewer samples. ∇Err(w). To see how this would be done. 4. As with the `2 regularizer for ridge regression. see [8]. as X> (Xw − y) grows with the number of samples and makes it more difficult to select the stepsize. we typically approximate the gradient with one sample2 . and then step in the direction of the negative of the gradient until we reach a local minimum.3 Handling big data sets One common approach to handling big datasets is to use stochastic approximation. the generality of stochastic approximation makes it arguably the modern approach to dealing with big data. we get an `1 penalized loss Err(w) = (Xw − y)> (Xw − y) + λkwk1 which is often called the Lasso. For one example. however. there are of course other approaches.Figure 4. Though this approach may appear to be too much of an approximation. if we choose a Laplace distribution. let us revisit the gradient of the error function. Instead. computing the gradient across all samples can be expensive or infeasible. This approach is called gradient descent and is summarized in Algorithm 1. We obtained a closed form solution for ∇Err(w) = 0. however. whereas the Laplace prior more strongly prefers the values to equal zero. because zeroing entries in w is equivalent to removing the corresponding feature. Notice that here the gradient is normalized by the number of samples n.4. we start at some initial w0 (typically random). 2 Mini-batches are a way to obtain a better approximation but remain efficient. it also produces more sparse solutions. With ever increasing data-set size for many scenarios. Similarly.

. we need step-size ηt to decrease with time 7: // For example. ∇Err(w) = n1 X> (Xw − y) 8: // The step-size η could be chosen by line-search 9: η ← line search(w. for linear regression. . n do 5: gt ← ∇Err(w) . X. y. it is common to pick a fixed. α) 1: w ← 0 ∈ Rd 2: err ← ∞ 3: tolerance ← 10e−4 4: XX ← n1 X> X 5: Xy ← n1 X> y 6: η ← 1/(2kXXk2 ) 7: while |Err(w) − err| > tolerance do 8: err ← Err(w) 9: // The proximal operator projects back into the space of sparse solutions given by `1 10: w ← proxηα`1 (w − ηXXw + ηXy) 11: return w 60 . small stepsize. . 10: w ← w − η t gt 11: return w Algorithm 3: Batch gradient descent for `1 regularized linear regression (X. . y) 1: // A non-optimized. X. . . such as η0 = 1. ∇Errt (w) = (x> t w − yt )xt 6: // For convergence. a common choice is ηt = η0 t−1 or 8: // ηt = η0 t−1/2 for some initial η0 . number of epochs do 3: Shuffle data points from 1. y) 1: w ← random vector in Rd 2: for i = 1. g. basic implementation of batch gradient descent 2: w ← random vector in Rd 3: err ← ∞ 4: tolerance ← 10e−4 5: while |Err(w) − err| > tolerance do 6: err ← Err(w) 7: g ← ∇Err(w) . . . 9: // In practice. .Algorithm 1: Batch Gradient Descent(Err.0. for linear regression. . Err) 10: w ← w − ηg 11: return w Algorithm 2: Stochastic Gradient Descent(Err. . n 4: for t = 1.

These assumptions were necessary to rigorously estimate parameters of the model. We also assumed that an underlying relationship between the features and the target was linear. we introduce generalized linear models (GLMs) which extend ordinary least-squares regression beyond Gaussian probability distributions and linear dependencies between the features and the target. data points with their targets D = {(xi . we assumed that a set of i. 5. we will slightly reformulate this model.Chapter 5 Generalized Linear Models In previous sections. That is. In order to simplify generalization. To establish the GLM model. y). σ 2 ) with µ = ω T x connecting the two expressions. This way of formulating linear regression will allow us (i) to generalize the framework to non-linear relationships between the features and the target as well as (ii) to use the error distributions other than Gaussian. There. j=0 where ω was a set of unknown weights and ε was a zero-mean normally distributed random variable with variance σ 2 . we will assume (1) a loglinear link between the expectation of the target and linear combination of features. which could then be subsequently used for prediction on previously unseen data points. We summarize these assumptions as follows 61 .1 Loglinear link and Poisson distribution Let us start with an example. it will be useful to separate the underlying linear relationship between the features and the target from the fact that Y was normally distributed. we will write that 1. d X Y = ωj Xj + ε. yi )}ni=1 were drawn according to some distribution p(x.e. and (2) the Poisson distribution for the target variable. Assume that data points correspond to cities in the world (described by some numerical features) and that the target variable is the number of sunny days observed in a particular year. We shall first revisit the main points of the ordinary least-squares regression. especially with respect to explicitly stating most of the assumptions in the system (we will see the full picture only when Bayesian formulation is used). E[y|x] = ω T x 2.d. we saw that the statistical framework provided valuable insights into linear regression. p(y|x) = N (µ. In particular. called Bregman divergences.i. i. In this section. This generalization will also introduce you to a broader range of loss functions.

and thus. Tx where fjT is the j-th column of x and c is a vector with elements ci = ew i . Hence. Exploiting T the fact that E [y|x] = λ. because λ ∈ R+ and ω T x ∈ R. We express this as T xy ωT x eω · e−e p(y|x) = y! for any y ∈ N. the likelihood function has the form of the probability distribution. The gradient of the likelihood function can now be expressed as ∇ll(w) = X> · (y − c) .e. p(y|x) = Poisson(eω x ). We will now use the maximum likelihood estimation to find the parameters of the regression model.1) Note that c stores a set of predictions for each of the data points. E[y|x] = ω T x). mean) of the probability distribution. 62 . log(E[y|x]) = ω T x 2.e. i. 1. y − c is an error vector. Therefore. we connect the two formulas using λ = eω x . it is not appropriate to use a linear link between E[y|x] and ω T x (i. we will use the Newton-Raphson method in which we must first analytically find the gradient vector ∇ll(w) and the Hessian matrix Hll(w) . As in previous sections. The link function adjusts the range of the linear combination of features (so-called systematic component) to the domain of the parameters (here. where the data set is observed and the parameters are unknown. In fact. We provide a compact summary of the above assumptions via a probability distribution T for the target. The second partial derivative of the likelihood function can be derived as n ∂ 2 ll(w) X Tx = − xij · ew i · xik ∂wj ∂wd i=1 = −fjT · C · fd . (5. p(y|x) = Poisson(λ) where λ > 0 is the parameter (mean and variance) of the Poisson distribution. the log-likelihood function has the form n n n Tx X X X ll(w) = wT xi yi − ew i − yi ! i=1 i=1 i=1 It is easy to show that ∇ll(w) = 0 does not have a closed-form solution. We start by deriving the j-th element of the gradient n n ∂ll(w) X X Tx = xij yi − ew i xij ∂wj i=1 i=1 n   T X = xij · yi − ew xi i=1 = fjT · (y − c) .

This leads to the following (canonical) form of the probability distribution m ! X p(x|θ) = exp θi ti (x) − a(θ) + b(x) i=1 T  = exp θ t(x) − a(θ) + b(x) . Many of the often encountered (families of) distributions are members of the exponential family. . Gaussian. . . parameters θ1 . When qi (θ) = θi for ∀i. 5. . θ2 . . using t. However. T where C is an n-by-n diagonal matrix with cii = ew xi . t(x) = x. . Example 13: The Poisson distribution can be expressed as p(x|λ) = exp (x log λ − λ − log x!) . 2σ σ 2σ 2 63 . . and b(x) = − log x!. . . (5. t2 (x). θ = log λ. Gamma.  Example 14: Typically we have considered regressors that only learned the mean. θ2 . θm are called natural parameters. Poisson.1) and Eq. Substituting Eq. (5. . where c(t) and C(t) are calculated using the weight vector w(t) . The initial set of weights w(0) can be set randomly.g. Thus. if m = 1 and t(x) = x. then the distribu- tion is of the subclass called natural exponential family distribution or regular exponential family distributions. we can more generally consider other parameters in the distribution. The Gaussian distribution with mean µ and variance σ 2 can be written as x2 µ2   µ 1 2 p(x|µ. exponential.2) which is a negative semi-definite matrix. θm ) is a set of parameters. . it is useful to generically study the exponential family to better understand commonalities and differences between individual member functions. σ) = exp − 2 + x 2 − 2 − log(2πσ ) . Therefore. (5. The Hessian matrix can now be calculated as Hll(w) = −X> · C · X. we used a specific example to illustrate how to generalize beyond Gaussian distributions. The approach more generally extends to any exponential family distribution. or the binomial distributions. i=1 where θ = (θ1 . .2) into the Newton-Raphson formula results in the following weight update rule  −1   w(t+1) = w(t) + X> · C(t) · X · X> · y − c(t) .2 Exponential family of distributions In the previous section. The exponential family is a class of probability distributions with the following form m ! X p(x|θ) = c(θ)h(x) exp qi (θ)ti (x) . where t(x) = (t1 (x). tm (x)). Further. such as for linear regression or Poisson regression. a(θ) = eθ . e. where λ ∈ R+ and X = N0 .

for many common GLMs. i=1 Setting the gradient to zero results in n 1X ∇a(θ) = t(xi ). i=1 i=1 Maximizing the likelihood involves calculating the gradient function Xn ∇ll(θ) = t(xi ) − n∇a(θ). This result is important because it provides a general expression for estimating the parameters of all distributions in the exponential family.− 2) σ 2σ t(x) = (x. It can be derived that ∂a(θ) = E [t(x)] ∂θ ∂ 2 a(θ) = cov [t(x)] ∂θ∂θ T These properties are very useful for estimating parameters of the distribution. because all information about the parameters θ can be inferred by calculating t(x). The properties of this log-normalizer are also key for estimation of generalized linear models. the derivative of a cor- responds to the inverse of the link function. For example. Importantly. It is called this because a(θ) = log´ X exp θ T t(x) + b(x) dx. The function a(θ) is typically called the ´log-partitioning function  or simply a log-normalizer. and so plays the role of ensuring that we have a valid density: X p(x)dx = 1. σ ∈ R+ and X = R. n i=1 By combining the previous expressions we see that the likelihood is maximized when the gradient function of the log-normalizer equals the sample mean of t(x). we see that µ 1 θ = ( 2 . Therefore. x2 ) θ12 1 θ2 a(θ) = + log( ) 4θ2 2 π b(x) = 0. as we discuss below in Section 5.where µ ∈ R. 64 . for Poisson regression. which is the inverse of g. the log-normalizer for an exponential family informs what link g should be used (or correspondingly the transfer f = g −1 ). the link function g(θ) = log(θ). For example. Function t(x) is called a sufficient statistic (a statistic is simply a function of the data). and the derivative of a is eθ . let us consider a data set of observations D = {xi }ni=1 and look at the log-likelihood function n T t(x Y ll(θ) = log eθ i )−a(θ)+b(xi ) i=1 n X n X = θ T t(xi ) − n · a(θ) + b(xi ).5. Therefore.  Now let us get some further insight into the properties of the exponential family pa- rameters and why this class is convenient for estimation.

p(y|x) ∈ Exponential Family Here. While the nature of this non-linear relationship is limited (the features enter the system via a linear combination with the parameters) there is still an important flexibility they provide for modeling. The choice of the link function and the probability distribution is data dependent. the link function adjusts the range of ω T x to the domain of Y (because of this relationship. This framework is frequently en- countered and is called logistic regression. We summarize the logistic regression model as follows 65 . a single mechanism can be used for a wide range of link functions and probability distributions. The two key components of GLMs can be expressed as 1. which is useful for classification. Hence. Interestingly.4 Logistic regression One of the most popular uses of GLMs is a combination of a Bernoulli distribution with a logit link function. there is no guarantee of a closed-form solution for w. allows us to model a much wider range of target functions. g(E[y|x]) = ω T x or E[y|x] = f (ω T x) where g = f −1 2. it also provides a mechanism for a non-linear relationship between the features and the target. Similarly. the generalization to the exponential family from the Gaussian distribution used in ordinary least-squares regression. Instead. link functions are usually not selected independently of the distribution for Y ). 5. Generally. Let us write the log-likelihood n T t(y )−a(θ)+b(y ) Y ll(w) = log eθ i i i=1 XX X = θm tm (yi ) − n · a(θ) + b(yi ) i m i X = lli (w) i and also find the elements of its gradient ∂lli (w) X ∂θm ∂a(θ) = tm (yi ) − ∂wj m ∂wj ∂wj which can be used to easily calculate the update rules of the optimization. Gauss-Newton and other types of solutions are considered and are generally called iteratively reweighted least-squares (IRLS) algorithms in the statistical literature. GLM for- mulations usually resort to iterative techniques derived from the Taylor approximation of the log-likelihood. On the one hand.5. g(·) is called the link function between the linear combination of the features and parameters of the distribution. the standard versions of the GLM from the literature do not use the full version of the Newton- Raphson algorithm with the Hessian matrix. Therefore.3 Formalizing generalized linear models We shall now formalize the generalized linear models. On the other hand.

this does not mean that any other link will necessarily result in an undesirable loss function. the generic log-likelihood for the exponential family is now ∂lli (w) ∂θ ∂a(θ) = yi − ∂wj ∂wj ∂wj = (yi − f (θ))xj The same result about the mean minimizer being optimal Pn has been shown for Bregman 1 Pn divergences [2]. 5. However. when optimizing the log-likelihood for a natural exponential family. 1 + e−ωT x 1 + e−ωT x Given a data set D = {(xi .2] for more details about this relationship. argminxˆ i=1 Da (xi ||ˆ x) = n i=1 xi . we obtain a corresponding Bregman divergence and the minimizer g −1 (θ) corresponds to the mean of the data. y ∈ {0. we would like to ensure that the g provides a smooth. For example. yi )}ni=1 . the choice should have other properties as well. the optimal solution is the sample mean. 1}.e. 1. Usefully. It follows that 1 E[y|x] = 1 + e−ωT x and  y  1−y 1 1 p(y|x) = 1 − . for the common setting of m = 1 and tm (yi ) = yi . logit(E[y|x]) = ω T x 2. to simplify optimization. this provides a mechanism for ensuring a nice loss function (since Bregman divergences have nice properties). This choice in fact provides a Bregman divergence that corresponds to the negative log-likelihood of a natural exponential family: Da (x||g(θ)) = − ln p(x|θ). and α ∈ (0. where xi ∈ Rd and yi ∈ {0. In particular. the parameter a of the exponential family distribution provides us with just such a choice: f = ∇a. p(y|x) = Bernoulli(α) x where logit(x) = ln 1−x . However. convex negative log-likelihood.. 1) is the parameter (mean) of the Bernoulli distribution. 1}. Rather.5 Connection to Bregman divergences We have discussed that the link g or transfer function f is chosen to reflect the range of the output variable y. 66 . with θ = x> w. See [11. It is important to note that the chosen link does not necessarily have to correspond to the derivative of a. i. Section 2.5 [Advanced] 5. For any Bregman divergence. the parameters of the model w can be found by maximizing the likelihood function. Therefore.

. we should remember that the actual inputs are d-dimensional. x1 . w1 . . a plane or a hyperplane) that splits Rd into two half-spaces. it may try to minimize the fraction of examples on the incorrect side of the decision surface. . wd ) is a set of weights and x = (x0 = 1. y i ) + + + + w0 + w1 x1 + w2 x2 = 0 + (x1. For example. This extends the input space to X = Rd+1 but. respectively. e. . in which case the algorithm is more likely to rely on the formal parameter estimation principles. The two half-spaces act as decision regions for the positive and negative examples. Alternatively. Given a data set D = {(xi .Chapter 6 Linear Classifiers Suppose we are interested in building a linear classifer f : Rd → {−1. An example of a classifier with a linear decision surface is shown in Figure 6. The gray line represents a linear decision surface in R2 . .g. it also leads us to a simplified notation in which the decision boundary in Rd can be written as w> x = 0. 67 . .1. there are many ways in which linear classifiers can be constructed. . a training algorithm may explicitly work to position the decision surface in order to separate positive and negative examples according to some problem-relevant criteria. fortunately.1: A data set in R2 consisting of nine positive and nine negative examples. the goal of the training algorithm may be to directly estimate the posterior distribution p(y|x). The decision surface does not perfectly separate positives from negatives. y2) ( xi . +1}. we will add a component x0 = 1 to each input (x1 . In the case of x2 (x2. y1) + + + + x1 Figure 6. . a line. To simplify the formalism in the following sections. Neverthless. Linear classifiers try to find the relationship between inputs and outputs by constructing a linear function (a point. . where w = (w0 . . e. xd ). yi )}ni=1 consisting of positive and negative examples. Earlier in the introductory remarks. we presented a classifier as a function f : X → Y and have transformed the learning problem into approximating p(y|x). . . xd ) is any element of the input space. it may maximize the likelihood.g.

1) 1 + e−w> x −1 which is a monotonic function of w> x.5. e.7 0. can be achieved if p(y|x) is modeled as a monotonic function of w> x. This.1 Logistic regression Let us consider binary classification in Rd . however. tanh(w> x) or > (1 + e−w x )−1 . 68 .5. 5] interval. Then. this relationship can be expressed as 1 P (Y = 1|x.2: Sigmoid function in [−5. 1+ e−w∗T ·x If P (Y = 1|x. if P (Y = 1|x. Function σ(t) = 1 + e−t is called the sigmoid function or the logistic function and is plotted in Figure 6. the conversion from g to f is a straightforward application of the maximum a posteriori principle: the predicted output is positive if g(x) ≥ 0. (6. Of course.2. (6.1 0 −5 0 5 t Figure 6.g. where X = Rd+1 and Y = {0. Thus. we label the data point as negative (ˆ y = 0).3 0.2 0. a model trained to learn posterior probabilities p(y|x) can be seen as a function g : X → [0.5 we conclude that data point x should be labeled as positive (ˆ y = 1).1. In logistic regression.4 0.5 and negative if g(x) < 0.8 0. w∗ ) ≥ 0. (6.11). w) = . w∗ ) = .1 Predicting class labels For a previously unseen data point x and a set of coefficients w∗ found from Eq. w∗ ) < 0. our flexibility is restricted because our method must learn the posterior probabilities p(y|x) and at the same time have a linear decision surface in Rd . 6. The basic idea for many classification approaches is to hypothesize a closed-form representation for the posterior probability that the class label is positive and learn parameters w from data. 1}.4) or Eq.9 1 f (t) = 1 + e−t 0.5 0. we simply calculate the posterior probability as 1 P (Y = 1|x. Otherwise. 1 0. 1]. linear classifiers.6 0. 6.

w) (6. The expression w> x = 0 represents equation of a hyperplane that separates positive and negative examples. we can estimate ω by maximizing the conditional likelihood of the observed class labels y = (y1 .5) is equivalent to maximizing the log-likelihood function ll(w) = log(l(w)) n      X 1 1 ll(w) = yi · log + (1 − yi ) · log 1 − . X1 . x1 .2 Maximum conditional likelihood estimation To frame the learning problem as parameter estimation. (6. Based on the principles of parameter estimation.2)  1− 1 1+e −ω >x for y = 0 = σ(x> w)y (1 − σ(x> w))1−y where ω = (ω0 . xn ).d. . w). . As with the previous maximum likelihood estimation. or simply l(w). . Even more specifically. Note that P (Y = 1|x. ω1 . .the predictor maps a (d + 1)-dimensional vector x = (x0 = 1. . The parameter vector that maximizes the likelihood is wML = arg max {l(w)} w ( n ) Y = arg max p(yi |xi . . sample from a fixed but unknown probability distribution p(x. Thus.1. . Xd ). . (6.i. according to p(x) and then sets its class label Y according to the Bernoulli distribution   y 1 for y = 1 −ω > x  1+e p(y|x) =  1−y (6. maximizing Eq. cross-entropy minimization is equivalent to the maximum likelihood method. 6. ωd ) is a set of unknown coefficients we want to recover (or learn) from the observed data D. . yn ) given the inputs X = (x> > > 1 . . w∗ ) ≥ 0. . (6. y). we shall first write the conditional likelihood function p(y|X. (6. .5) i=1 1 + e−w> xi 1 + e−w> xi As before. we will assume that the data set D = {(xi . . x2 . To make 69 .6) i=1 1 + e−w> xi 1 + e−w> xi The negative of the log-likelihood from Eq. w). .6) is sometimes referred to as cross-entropy. as n Y l(w) = p(yi |xi .3) i=1 This function can be thought of as the probability of observing a set of labels y given the set of data points X and the particular set of coefficients w. because log is a monotonic function. . . xd ) into a zero or one. we will assume that the data generating process randomly draws a data point x.4) w i=1 n   yi  1−yi Y 1 1 = · 1− . the logistic regression model is a linear classifier. . thus. . yi )}ni=1 is an i. (6. . . y2 .5 only when w> x ≥ 0. a realization of the random vector (X0 = 1.

70 . we have ∇ll(w) = X> (y − p) . Notice first that > e−w xi   1 1− = 1 + e−w> xi 1 + e−w> xi giving n      > > X ll(w) = − yi · log 1 + e−w xi + (1 − yi ) · log e−w xi i=1  >  − (1 − yi ) · log 1 + e−w xi n    X > 1 = (yi − 1) w xi + log . for i = 1.8) The second partial derivative of the log-likelihood function can be found as n >x ∂ 2 ll(w) X e−w i = xij · 2 · (−xik ) ∂wj ∂wd 1 + e−w> xi i=1 n   X 1 1 = − xij · · 1− · xik i=1 1 + e−w> xi 1 + e−w> xi = −fj> P (I − P) fd .everything more suitable for further steps.. (6. both written as a function of inputs X. but not here). and the current parameter vector. n. we have to proceed with iterative optimization methods.6). y is an n-dimensional column vector of class labels and p is an n-dimensional column vector of (estimated) posterior probabilities pi = P (Yi = 1|xi . we will slightly rearrange Eq.. We will calculate ∇ll(w) and Hll(w) to enable either method (first-order with only the gradient or second-order with the gradient and Hessian).7) i=1 1 + e−w> xi It does not take much effort to realize that there is no closed-form solution to ∇ll(w) = 0 (we did have this luxury in linear regression. (6. We can calculate the first and second partial derivatives of ll(w) as follows n   ∂ll(w) X 1 −w> xi = (yi − 1) · xij − > · e · (−x ij ) ∂wj 1 + e−w xi i=1 n > ! X e−w xi = xij · yi − 1 + i=1 1 + e w > xi n   X 1 = xij · yi − > i=1 1 + e−w xi = fj> (y − p) . where fj is the j-th column (feature) of data matrix X. (6. . w). Thus. Considering partial derivatives for every component of w. class labels y..

Negative semi-definite Hessian corresponds to a concave likelihood function and has a global maximum (unique if negative definite). w) and I is an n × n identity matrix. If this is infeasible for the given dataset. such as if the number of features d is large. cn ) we can now express the gradient of the log-likelihood as ∇ll(w) = X> C (y − p) and the Hessian as Hll(w) = −X> CP (I − P) X. 71 . Note that the second term on the right-hand side of Eq. (6.g. which guarantees that the optimum we find will be a global maximum.10) is calculated in each iteration t. to indicate that w(t) was used to calculate them.9) This is a negative semi-definite matrix. The computational complexity of this procedure is O(d3 +d2 n) in each iteration. (6. assuming O(d3 ) time for finding matrix inverses. (6. (6. we can also use stochastic gradient descent to optimize the logistic regression model. 6. . (6. It is interesting to observe that the Hessian remains negative semi-definite.8-6. This modifies the conditional likelihood function from Eq. it may be justified to allow for unequal importance of each data point.10) where the initial weights w(0) can be calculated using the ordinary least squares regression −1 as w(0) = X> X X> y. c2 . Y l(w) = i=1 where 0 ≤ ci ≤ 1 is a cost for data point i.where P is an n × n diagonal matrix with Pii = pi = P (Yi = 1|xi . Weighted conditional likelihood function In certain situations. Taking that C = diag (c1 . . . Thus. LBFGS). thus we wrote P and p as P(t) and p(t) . respectively.5) to n pci i yi · (1 − pi )ci (1−yi ) . . yi )}ni=1 and the predictions on all training examples using the weight vector from the current step. The Hessian matrix Hll(w) can now be calculated as Hll(w) = −X> P (I − P) X.9) into Newton-Raphson’s method results in the following weight update rule    −1   w(t+1) = w(t) + X> P(t) I − P(t) X X> y − p(t) . As in Algorithm 2. the update rule is expected to converge to a global maximum. Substituting Eqs. we can use a first-order gradient descent update (which only requires O(dn) to compute the matrix vector product) or quasi-Newton methods that approximate the Hessian (e.. We will first rewrite the update rule of the (first-order) gradient descent method as w(t+1) = w(t) + ∆w(t) .3 Stochastic optimization for logistic regression We have derived that the weight update rule depends on the data set D = {(xi .1.

if data point xt with its class label yt is presented to the learner. For interest about this alternate route. this more haphazard problem specification results in a non-convex optimization. Typically.i. pi = σ(x> i w). This ´ P batch gradient step >is an unbiased estimate of the gradient of the true function X y∈Y p(x. this entails iterating one or more times over the dataset in order (assuming it is random. w) is a Bernoulli distribution and computing the maximum likelihood solution for σ(x> w) = E[Y |x] = P (Y = 1|x. there is a result that using the Euclidean error for the sigmoid transfer gives exponentially many local minima in the number of features [1]. 1} and then tried to minimize their difference. 6.1.where   ∆w(t) = ηX> y − p(t) n X =η xi (yi − pi ). we will show that this direction leads to a non-convex optimization. That is. with the decreasing step-size progressively smoothing out these oscillations. which comes from a wider class of methods in the area of stochastic optimization. rather than computing the batch update for all samples. The batch method is often also called sample average approximation. As with batch gradient descent. because we have i. Instead of explicitly assuming P (Y = 1|x. Let the error function with Euclidean distance now be written as n X Err(w) = (yi − pi )2 . we could have simply decided to use σ(x> w) to predict targets y ∈ {0.i. requiring them to decrease over time.4 Issues with minimizing Euclidean distance A natural question is why we went down this route for linear classification. w). The conditions for convergence typically include conditions on the step- sizes. When the training data is large. the stochastic gradient descent weight update is ∆w(t) = ηxt (yt − pt ). it may be beneficial to update the weight vector after each data point is presented to the learning algorithm. though with more oscillation around the true weight vector. i=1 We can see that the weight update is simply a linear combination of training data points. with i. y)x(y − σ(x w))dxdy. ei = yi − pi i=1 72 .d. The training algorithm can now be revised to randomly draw one data point at a time from D and then update the current weights using the previous equation.d. using one sample to compute the gradient is also an unbiased estimate and has been shown to converge under certain step- size conditions. samples). samples and so this sum con- verges to this integral over all possible values. Unfortunately. Similarly. using our favorite loss (the squared loss). these stochastic gradient descent updates will converge. In fact. in prac- tice. This stochastic gradient descent is a form of stochastic approximation.

. 73 .     ∂en ∂en ∂w1 ∂wd The second partial derivative of the error function can be found as n ∂ 2 Err(w) X ∂ei ∂ei ∂ 2 ei = 2· · + ei · ∂wj ∂wd ∂wd ∂wj ∂wj ∂wd i=1 n X xij · p2i (1 − pi )2 + pi · (1 − pi ) · (2pi − 1) · (yi − pi ) · xik .The minimization of Err(w) is formally expressed as w∗ = arg min {Err(w)} w ( n ) X 2 = arg min (yi − pi ) . This provides the gradient vector in the following form ∇Err(w) = −2X> P (I − P) (y − p) .11) w i=1 Similar to the maximum likelihood process. In general. . The partial derivatives of the error function can be calculated as follows n ∂Err(w) ∂ X 2 = ei ∂wj ∂wj i=1 n X ∂ei = 2 · ei · ∂wj i=1 n   X 1 1 >x = 2· yi − > · > · e−w i · (−xij ) i=1 1 + e−w xi (1 + e−w xi )2 n     X 1 1 1 = −2 · xij · > · 1− > · yi − > i=1 1 + e−w xi 1 + e−w xi 1 + e−w xi = −2fj> P (I − P) (y − p) . Matrix J = P (I − P) X is referred to as Jacobian. our goal will be to calculate the gradient vector and the Hessian of the error function.  = 2· i=1 Thus. (6.. .  JErr(w) =  . Jacobian is an n × d matrix calculated as  ∂e1 ∂e1 ∂e1  ∂w1 ∂w2 · · · ∂wd  . the Hessian can be computed as HErr(w) = 2X> (I − P)> P> P (I − P) X + 2X> (I − P)> P> E(2P − I)X = 2J> J + 2J> E(2P − I)X..

we have discussed discriminative approaches (linear regression. however. with three features. As you can imagine. the decision boundary is according to some linear combination of features. So far.3: Naive Bayes graphical model. it is not necessarily a linear classifier. we will assume a simpler setting with binary features. this can be a more difficult undertaking. E = diag {e} is a diagonal matrix containing elements Eii = ei = yi −pi and I is an identity matrix. 1}d and Y = {0. however. i. we 74 . Note that a linear classifier is one in which the two classes are separated by a linear plane. which attempt to learn p(y|x). we learn p(x. Under the naive Bayes assumption. and then address continuous features. As with logistic regression. Note that naive Bayes is a linear classifier for binary features. Finding a global optimum depends on how favorable the initial solution w(0) is and how well the weight update step can escape local minima to find better ones. This assumption is demonstrated by the graphical model in Figure 6. will be much more problematic than the convex cross-entropy. as we also need to learn the distribution over the features themselves. even though we learn a generative model. 1}. For naive Bayes. Therefore. This means that Err(w) is not convex. we significantly simply learning this joint distribution by making a strong assumption: the features are conditionally independent given the label.2 Naive Bayes Classifier Naive Bayes classification is a generative approach to prediction. 6.e. y}ni=1 be an input data set. y = 1) is p(y = 1|x) ≥ p(y = 0|x) where now p(y = 1|x) = p(x|y = 1)p(y = 1).3. y x1 x2 x3 Figure 6. the features are independent given the class label.. We can now see that the Hessian is not guaranteed to be positive semi-definite.2. logistic regression).1 Binary features and linear classification Let D = {(xi . 6. the decision rule for labeling a point as class 1 (i. To start. it must have multiple minima with different values of the objective function. where P = diag {p}.e. y) = p(x|y)p(y). Minimization of this non-convex function. i. where X = {0..e. more generally. For a generative setting.

since each xj is binary. To see why this is the case. and comes from similarly deriving the maximum likelihood solution.c for each class value c and for each feature. notice that the classifier will make a positive decision when p(y = 1|x) ≥ p(y = 0|x) that is. Linear classification boundary with binary features and targets Interestingly. total number of datapoints Notice that this approach could also be accomplished for more classes than just two.c = p(xj = 1|y = c).c )1−xj The parameters for the Bernoulli distributions are pj. naive Bayes classifier with binary features and two classes is a linear classifier.cj (1 − pj.can write d Y p(x|y) = p(xi |y). with a different parameter pj. i=1 A suitable choice for this simpler univariate probabilities is a Bernoulli distribution. use the log of the likelihood. The prediction on a new point x is then max p(y = c|x) = max p(x|y = c)p(y = c) c∈Y c∈Y d Y = max p(xj |y = c)p(y = c) c∈Y j=1 d Y = max pj.c pc c∈Y j=1 Exercise: The solution above is intuitive. . . 1. This is somewhat surprising. We can easily learn this parameter from data by calculating number of times xj = 1 for class c pj. we can learn the prior pc = p(y = c) using number of datapoint labeled as class c pc = p(y = c) = . number of datapoints labeled as class c Similarly. . To make things simpler. . c = 0. when p(x|y = 1)p(y = 1) ≥ p(x|y = 0)p(y = 0) 75 . pc for i = 1.c = . as this generative approach looks very different from what we did before. Assuming that you have n datapoints and the chosen distribution p(xi |y) is Bernoulli as described above. giving x p(xj |y = c) = pj. d. derive the maximum likelihood parameters pj.c .

we now have d Y d Y p(1) p(xj |1) ≥ p(0) p(xj |0) j=1 j=1 which. d} (1 − pj. i. x p(xj |1) = pj.c ! 2 −1/2 (xj − µj.0 ) wj = log j ∈ {1. . naive Bayes is a linear classifier. Recall that each feature is Bernoulli distributed.0j (1 − pj.c are estimated from the training set. µj. after applying a logarithm.2 Continuous naive Bayes For continuous features.1 )1−xj and x p(xj |0) = pj.1 (1 − pj.0 p0 j=1 j=1 We can write the previous expression as d X w0 + wj x j ≥ 0 j=1 where d p1 X 1 − pj.0 Therefore.1 p1 xj log + log + log ≥0 (1 − pj.1 (1 − pj.1j (1 − pj. 2. . we have d d X pj. σj.0 j=1 pj.1 )pj. . then becomes d X d X log p(1) + log p(xj |1) ≥ log p(0) + log p(xj |0) j=1 j=1 Let us now investigate class-conditional probabilities p(xj |y). etc.0 ) X 1 − pj. when y ∈ {0. 2σj. Using the naive Bayes assumption.1 )pj. now with a different mean and variance for each feature and class.e.2.1 w0 = log + log p0 1 − pj. 6. 1}. a Bernoulli distribution is no longer appropriate for p(xj |y) and we need to choose a different conditional distribution p(x|y).0 1 − pj.We will shorten this notation using p(y = 0) = p(0). A common choice is a Gaussian 2 : distribution.c ) exp − 2 . Taking p(y = c) = pc .c 76 . in the case of binary features.0 )1−xj where parameters pj. . p(x|y = 0) = p(x|0).c .c )2 p(xj |y = c) = (2πσj.

. wk ] is composed of k weight vectors with wk = 0. . . . because 77 .12) y1 ! . yk ! where the usual numerator n! = 1 because n = kj=1 yj = 1 since we can only have one class P value. wk−1 . we can approximate p(y) using counts as before. . .3 Multinomial logistic regression Now let us consider discriminative multiclass classification. because they are tied by the probability for the last class. This involves computing the mean and variance of feature j across the datapoints labeled with class c: Pn i=1 1(yi = c)xj µj. . The maximum likelihood mean and variance parameters correspond to the sample mean and sample co- variance for each given class separately. Here we discuss multiclass classification where we only want to label a datapoint with one class out of k. . The multinomial distribution is a member of the exponential family. p(yk = 1|x)yk (6. This setting arises naturally in machine learning. Note that these models are not learned independently.Since y is still discrete. The parameters can be represented as a matrix W ∈ Rd×k where W = [w1 . as above. In other settings. For example. where X = Rd and Y = {1. As with logistic regression. . Pk > > j=1 exp(x wj ) j=1 exp(x wj ) exp(x> w1 ) exp(x> wk )   = > . . . To do so. We will see why we fix wk = 0. where there is often more than two cate- gories. B. then we have four classes. . .c = number of datapoints labeled as class c Exercise: Derive the maximum likelihood formulation for a Gaussian naive Bayes model. . We can nicely generalize to this setting using the idea of multinomials and the cor- responding link function. we must also ensure that kj=1 p(yj = 1|x) = 1. as with the other generalized linear models. we can parametrize p(yj = 1|x) = σ(x> wj ). . . 1]k .c = number of datapoints labeled as class c Pn 2 2 i=1 1(yi = c)(xj − µj. . . We can write 1 p(y|x) = p(y1 = 1|x)y1 . Note that this model encompasses the binary setting for logistic regression. AB and O) of an individual. . . and check that the solution does in fact match the sample mean and variance for each feature and class separately. . one might want to label a datapoint with multiple classes. P p(yk = 1|x) = 1 − k−1 P j=1 p(yj = 1|x) and only explicitly learn w1 . 1 exp(x> W) 1> exp(x> W) and the prediction is softmax(x) = yˆ ∈ [0.c ) σj. 2. The transfer (inverse of the link) for this setting is the softmax transfer " # > exp(x> w1 ) exp(x> wk ) softmax(x W) = Pk . . . where y 1. we “pivot" around the final class. . 6. which gives the probability in each entry ˆ > 1 = 1 signifying that the probabilities sum to of being labeled as that class. . this is briefly mentioned at the end of this section. k}. However. if we want to predict the blood type (A.

exp(x w) >
σ(x> w) = (1+exp(−x> w))−1 = 1+exp(x > w) . The weights for multinomial logistic regression

with two classes are then W = [w, 0] giving

exp(x> w)
p(y = 0|x) =
exp(x> w)
=
exp(x> w) + exp(x> 0)
exp(x> w)
=
exp(x> w) + 1
= σ(x> w).

Similarly, for k > 2, by fixing wk = 0, the other weights w1 , . . . , wk−1 are learned to ensure
>w )
that p(y = k|x) = 1>exp(x = Pk−1 1 > and that kj=1 p(y = j|x) = 1.
k
P
exp(x> W) 1+ j=1 exp(x wj )
With the parameters of the model parameterized by W and the softmax transfer, we
can determine the maximum likelihood formulation. By plugging in the parameterization
into Equation (6.12), taking the negative log of that likelihood and dropping constants, we
arrive at the following loss for samples (x1 , y1 ), . . . , (xn , yn )
n
X  
min log 1> exp(x>
i W) − x>
i Wyi
W∈Rd×k :W:k =0
i=1

with gradient
n  n
X    X exp(x> > >
i W) xi
∇ log 1> exp(x>
i W) − x >
i Wy i = >
> exp(x W)
− xi yi> .
i=1 i=1
1 i

As before, we do not have a closed form solution for this gradient, and will use iterative
methods to solve for W. Note that here, unlike previous methods, we have a constraint on
part of the variable. However, this was solely written this way for convenience. We do not
optimize W:k , as it is fixed at zero; one can rewrite this minimization and gradient to only
apply to the W:(1:k−1) . This corresponds to initializing W:k = 0, and then only using the
first k − 1 columns of the gradient in the update to W:(1:k−1) .
The final prediction softmax(x> W) ∈ [0, 1] gives the probabilities of being in a class.
As with logistic regression, to pick one class, the highest probability value is chosen. For
example, with k = 4, we might predict [0.1 0.2 0.6 0.1] and so decide to classify the point
into class 3.
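A short numpy sketch (ours) of the softmax probabilities and the gradient above; Y is assumed to be a one-hot n × k matrix of labels, and the last column of W is kept at zero by convention.

import numpy as np

def softmax_probs(X, W):
    Z = X @ W
    Z = Z - Z.max(axis=1, keepdims=True)    # subtract the row maximum for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def multinomial_gradient(X, Y, W):
    # sum_i x_i (softmax(x_i W) - y_i^T), stacked over all samples
    return X.T @ (softmax_probs(X, W) - Y)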

Remark about overlapping classes: If you want to predict multiple classes for a data-
point x, then a common strategy is to learn separate binary predictors for each class. Each
predictor is queried separately, and a datapoint will label each class as 0 or 1, with poten-
tially more than one class having a 1. Above, we examined the case where the datapoint
was exclusively in one of the provided classes, by setting n = 1 in the multinomial.


Chapter 7

Representations for machine learning

At first, it might seem that the applicability of linear regression and classification to real-life
problems is greatly limited. After all, it is not clear whether it is realistic (most of the
time) to assume that the target variable is a linear combination of features. Fortunately,
the applicability of linear regression is broader than originally thought. The main idea is to
apply a non-linear transformation to the data matrix x prior to the fitting step, which then
enables a non-linear fit. Obtaining such a useful feature representation is a central problem
in machine learning.
We will first examine fixed representations for linear regression: polynomial curve fitting
and radial basis function (RBF) networks. Then, we will discuss learning representations.

7.1 Radial basis function networks and kernel representations
The idea of radial basis function (RBF) networks is a natural generalization of the polynomial
curve fitting and approaches from the previous Section. Given data set D = {(xi , yi )}ni=1 ,
we start by picking p points to serve as the “centers” in the input space X . We denote those
centers as c1 , c2 , . . . , cp . Usually, these can be selected from D or computed using some
clustering technique (e.g. the EM algorithm, K-means).
When the clusters are determined using a Gaussian mixture model, the basis functions
can be selected as

φj (x) = exp( −(1/2) (x − cj )> Σj^{−1} (x − cj ) ),

where the cluster centers and the covariance matrix are found during clustering. When
K-means or other clustering is used, we can use

φj (x) = exp( −‖x − cj ‖^2 / (2σj^2 ) ),

where σj ’s can be separately optimized; e.g. using a validation set. In the context of
multidimensional transformations from x to Φ, the basis functions can also be referred to
as kernel functions, i.e. φj (x) = kj (x, cj ). The matrix

      | φ0 (x1 )   φ1 (x1 )   · · ·   φp (x1 ) |
  Φ = | φ0 (x2 )   φ1 (x2 )   · · ·   φp (x2 ) |
      |    ...        ...                ...   |
      | φ0 (xn )   φ1 (xn )   · · ·   φp (xn ) |


Figure 7.1: Radial basis function network.

is now used as a new data matrix. For a given input x, the prediction of the target y will
be calculated as

f (x) = w0 + Σ_{j=1}^{p} wj φj (x) = Σ_{j=0}^{p} wj φj (x)

where φ0 (x) = 1 and w is to be found. It can be shown that, with a sufficiently large
number of radial basis functions, such a model can approximate any continuous function on a
bounded region arbitrarily well. As seen in Figure 7.1, we can think of RBF networks as neural networks.
RBF networks and kernel representations are highly related. The main distinction is that
kernel representations use any kernel function for the similarity measure k(x, cj ) = φj (x),
where radial basis functions are one example of a kernel. In addition, if an RBF kernel is
chosen, for kernel representations the centers are typically selected from the training dataset.
For RBF networks, the selection of the centers is generally left as an important design step, where
they can be selected from the training set but can also be selected in other ways.
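As a concrete sketch (ours) of the transformation and fit described in this section: build the matrix Φ from Gaussian radial basis functions around the chosen centers and solve the resulting linear regression. The function names and the single shared sigma argument are illustrative assumptions.

import numpy as np

def rbf_features(X, centers, sigma):
    # phi_j(x) = exp(-||x - c_j||^2 / (2 sigma_j^2)); a column of ones plays phi_0.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-d2 / (2.0 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), Phi])

def fit_rbf_network(X, y, centers, sigma):
    Phi = rbf_features(X, centers, sigma)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w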

7.2 Learning representations
There are many approaches to learning representations. Two dominant approaches are (semi-
supervised) matrix factorization techniques and neural networks. Neural networks build
on the generalized linear models we have discussed, stacking multiple generalized linear
models together. Matrix factorization techniques (e.g., dimensionality reduction, sparse
coding) typically factorize the input data into a dictionary and a new representation (a
basis). We will first discuss neural networks, and then discuss the many unsupervised and
semisupervised learning techniques that are encompassed by matrix factorizations.


For logistic regression we estimated W ∈ Rd×m . Figure 7.2. With the probabilistic model and parameter specified. To give some intuition. we simply take a step in the direction of the negative of the gradient. but we can still take the gradient w. All the first hidden layers constitute representation layer. as there are two layers of weights.3 shows a neural network with one hidden-layer. the gradient 81 . Input Hidden Output Input Output layer layer layer layer layer x1 x1 x2 y1 x2 y1 x3 y2 x3 y2 x4 x4 Figure 7. Figure 7. for computational efficiency. As before.t. 7. When we add a hidden layer.2: Generalized linear model. where k is the dimension of the hidden layer (2)   σ(xW:1 ) (2)   σ(xW:2 )   (2)  ∈ Rk h = σ(W x) =   . Figure 7. We simply have more parameters now: W(2) ∈ Rk×d .    (2) σ(xW:k ) where the sigmoid function is applied to each entry in xW(2) and hW(1) . such as logistic regression. the output learning is the supervised prediction part. each parameter ma- trix. This hidden layer is the new set of features and again you will do the regular logistic regression optimization to learn weights on h: p(y = 1|x) = σ(hW(1) ) = σ(σ(xW(2) )W(1) ). Neural networks stack multiple generalized linear models.r. as usual. we now need to derive an algo- rithm to obtain those parameters. we will begin with one hidden layer.3: Standard two-layer neural network. with f (xW) = σ(xW) ≈ y. . our parameters. we have two parameter matrices W(2) ∈ Rd×k and W(1) ∈ Rk×m . this is called a two-layer neural network.1 Neural networks Neural networks are a form of supervised representation learning. W(1) ∈ R1×k .2 shows the graphical model for the generalized linear models we discussed in the previous chapters. where the weights and corresponding transfer can be thought of as being on the arrows (as they are not random variables).t. Once we have the gradient w. This composition of transfers seems to complicate matters. The gradients for these parameters share information.r. for the sigmoid transfer function and cross-entropy output loss. we take a maximum likelihood approach and derive gradient descent updates..
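A minimal sketch (ours) of the forward pass of the two-layer network described above, assuming sigmoid transfers at both layers and shapes W(2) ∈ Rd×k, W(1) ∈ Rk×m:

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward_two_layer(x, W2, W1):
    # x is a 1 x d row vector; h is the hidden representation, yhat the prediction.
    h = sigmoid(x @ W2)
    yhat = sigmoid(h @ W1)
    return h, yhat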

. . y). we take the partial derivative w. For example. which we describe next. y) (1) = (1) ∂Wjk ∂Wjk   ∂L(ˆ yk . . This is the reason for the term back-propagation. the number of hidden nodes and the loss for the last layer. just as with generalized linear models. Backpropagation algorithm We will start by deriving back propagation for two layers. The back-propagation algorithm is simply gradient descent on a non-convex objective. . We then compute the error between our prediction y ˆ and the true label. Then the output from the neural network is         f1 f2 . . . ordered with f1 as the output transfer. . If p(y = 1|x) is a Bernoulli distribution. . y) depends on the chosen p(y|x) and corresponding transfer function for the last layer of the neural network. The choices then involve picking the transfers at each layer. W(2) ∈ Rk2 ×k1 . with a careful ordering of computation to avoid repeating computation. First. W(2) ) ∂L(f1 (f2 (xW(2) )W(1) ).is computed first for W(1) . and k1 . We take the gradient of this error (loss) w. one first propagates forward and computes variable h = f2 (xW(2) ) ∈ R1×k and then y ˆ = f1 (f2 (xW(2) )W(1) ) = f1 (hW(1) ). .t. . and duplicate gradient information sent back to compute the gradient for W(2) . we can compute the gradient for any number of hidden layers. we will compute gradients of the loss w. to our parameters. . we will first compute this gradient assuming we only have one sample (x. . the parameters W(1) (assuming W(2) is fixed). . the extension to multiple layers will be more clear given this derivation. and then W(2) .t.y ∂yˆk ∂W (1) jk 82 . The matching convex loss L(·. . we will often learn with stochastic gradient descent.r. In particular. we define this error function as m (1) X Err(W(1) . the best ordering is to compute the gradient w. ∂Err(W(1) . . fH .t.r. W(H) ∈ Rd×kH−1 .r. W(1) where W(1) ∈ Rk1 ×m . yk ) k=1 for one sample (x. Therefore. we get Err(W(1) . . fH−1 fH xW(H) W(H−1) . W(2) ) = L(f1 (f2 (W(2) x)W:k ). for efficient computation. since the error is propagated backward from the last layer first. in this case. As before. Due to the size of the network. our parameters. to the last parameter W(1) first.t. kH−1 as the hidden dimensions with H − 1 hidden layers. for p(y = 1|x) Gaussian and identity transfer f2 . This algorithm is typically called back propagation.r. yk ) ˆk ∂y (1) = ˆ k = f1 (hW:k ) . For ease of notation. Denote each differentiable transfer function f1 . W(2) ) = (f2 (xW(2) )W(1) − y)2 . then we would chose the logistic regression loss (the cross entropy). In general. y).

Continuing with the chain rule,

∂Err(W(1), W(2)) / ∂W(1)_{jk} = ( ∂L(ŷ_k, y_k) / ∂ŷ_k ) ( ∂f1(θ(1)_k) / ∂θ(1)_k ) ( ∂θ(1)_k / ∂W(1)_{jk} ),    where θ(1)_k = hW(1)_{:k}
= ( ∂L(ŷ_k, y_k) / ∂ŷ_k ) ( ∂f1(θ(1)_k) / ∂θ(1)_k ) h_j,

where only ŷ_k is affected by W(1)_{jk} in the loss, and so the gradient for the other outputs is zero. For example, for L(ŷ_k, y_k) = ½ (ŷ_k − y_k)^2 and f1 the identity transfer, we get

∂L(ŷ_k, y_k) / ∂ŷ_k = (ŷ_k − y_k),    ∂f1(θ(1)_k) / ∂θ(1)_k = 1,

giving

∂Err(W(1), W(2)) / ∂W(1)_{jk} = (ŷ_k − y_k) h_j.

The gradient update is then as usual, W(1) ← W(1) − α h^T (ŷ − y), for some step-size α. At this point these equations are abstract, but they are simple to compute for the losses and transfers we have examined.

Next, we compute the partial gradient with respect to W(2). Now, however, the entire output variable ŷ ∈ R^{1×m} is affected by the choice of W(2)_{ij}, for all i ∈ {1, ..., d} and j ∈ {1, ..., k}. Therefore, we need to take the partial derivative w.r.t. all of ŷ:

∂Err(W(1), W(2)) / ∂W(2)_{ij} = ∂ ( Σ_{k=1}^m L( f1( f2(xW(2)) W(1)_{:k} ), y_k ) ) / ∂W(2)_{ij}
= Σ_{k=1}^m ( ∂L(ŷ_k, y_k) / ∂ŷ_k ) ( ∂ŷ_k / ∂W(2)_{ij} )
= Σ_{k=1}^m ( ∂L(ŷ_k, y_k) / ∂ŷ_k ) ( ∂f1(θ(1)_k) / ∂θ(1)_k ) ( ∂θ(1)_k / ∂W(2)_{ij} ).

Continuing,

∂θ(1)_k / ∂W(2)_{ij} = ∂( hW(1)_{:k} ) / ∂W(2)_{ij} = ∂( Σ_{l=1}^k h_l W(1)_{lk} ) / ∂W(2)_{ij}
= ∂( Σ_{l=1}^k f2(xW(2)_{:l}) W(1)_{lk} ) / ∂W(2)_{ij}
= Σ_{l=1}^k W(1)_{lk} ∂f2(xW(2)_{:l}) / ∂W(2)_{ij}
= W(1)_{jk} ∂f2(xW(2)_{:j}) / ∂W(2)_{ij},

because ∂f2(xW(2)_{:l}) / ∂W(2)_{ij} = 0 for l ≠ j.

Now continuing the chain rule,

∂f2(xW(2)_{:j}) / ∂W(2)_{ij} = ( ∂f2(θ(2)_j) / ∂θ(2)_j ) ( ∂θ(2)_j / ∂W(2)_{ij} ),    where θ(2)_j = xW(2)_{:j}
= ( ∂f2(θ(2)_j) / ∂θ(2)_j ) x_i.

Putting this back together, we get

∂Err(W(1), W(2)) / ∂W(2)_{ij} = Σ_{k=1}^m ( ∂L(ŷ_k, y_k) / ∂ŷ_k ) ( ∂f1(θ(1)_k) / ∂θ(1)_k ) W(1)_{jk} ( ∂f2(θ(2)_j) / ∂θ(2)_j ) x_i
= Σ_{k=1}^m δ(1)_k W(1)_{jk} ( ∂f2(θ(2)_j) / ∂θ(2)_j ) x_i
= ( W(1)_{j:} δ(1) ) ( ∂f2(θ(2)_j) / ∂θ(2)_j ) x_i,

where

δ(1)_k = ( ∂L(ŷ_k, y_k) / ∂ŷ_k ) ( ∂f1(θ(1)_k) / ∂θ(1)_k ).

Notice that some of the gradient is the same as for W(1); the difference is in the term ∂θ(1)_k / ∂W(2). Computing the components δ(1) only needs to be done once, for W(1), and this information is propagated back to get the gradient for W(2). This is possible because, for W(1), h = f2(x_i W(2)) is a constant, i.e., it does not depend on W(1) and so does not affect the gradient for W(1); for W(2), on the other hand, h relies on W(2).

If another layer is added before W(2), then the information propagated backward is

δ(2)_j = ( W(1)_{j:} δ(1) ) ( ∂f2(θ(2)_j) / ∂θ(2)_j ),

and x_i in the expression above is replaced with the hidden variable h(3)_i of that added layer. The gradient for W(3)_{ij} is

( W(2)_{j:} δ(2) ) ( ∂f3(θ(3)_j) / ∂θ(3)_j ) x_i.

Example 15: Let p(y = 1|x) be a Bernoulli distribution, with f1 and f2 both sigmoid functions. The loss is the cross-entropy,

L(ŷ, y) = −y log(ŷ) − (1 − y) log(1 − ŷ),

with

∂L(ŷ, y)/∂ŷ = −y/ŷ + (1 − y)/(1 − ŷ),
f2(xW(2)_{:j}) = σ(xW(2)_{:j}) = 1 / (1 + exp(−xW(2)_{:j})),
f1(hW(1)_{:k}) = σ(hW(1)_{:k}) = 1 / (1 + exp(−hW(1)_{:k})),
∂σ(θ)/∂θ = σ(θ)(1 − σ(θ)).

We can derive the two-layer update rule with these settings by plugging in above. We compute the back-propagation update by first propagating forward,

h = σ(xW(2))
ŷ = σ(hW(1)),

and then propagating the gradient back,

δ(1)_k = ( ∂L(ŷ_k, y_k)/∂ŷ_k ) ( ∂f1(θ(1)_k)/∂θ(1)_k )
= ( −y_k/ŷ_k + (1 − y_k)/(1 − ŷ_k) ) ŷ_k (1 − ŷ_k)
= −y_k (1 − ŷ_k) + (1 − y_k) ŷ_k,

∂Err/∂W(1)_{jk} = δ(1)_k h_j,

δ(2)_j = ( W(1)_{j:} δ(1) ) h_j (1 − h_j),

∂Err/∂W(2)_{ij} = δ(2)_j x_i.

The update simply consists of stepping in the direction of these gradients, as is usual for gradient descent. We start with some initial W(1) and W(2) (say filled with random values), and then apply the gradient descent rules with these gradients.
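The following sketch performs one stochastic gradient step of Example 15 (both transfers sigmoid, cross-entropy loss); the dimensions, data, and step-size are made up for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
d, k, m, alpha = 5, 4, 1, 0.1
W2 = 0.1 * rng.standard_normal((d, k))
W1 = 0.1 * rng.standard_normal((k, m))

x = rng.standard_normal((1, d))          # one sample, as a row vector
y = np.array([[1.0]])                    # its binary label

# Forward propagation.
h = sigmoid(x @ W2)                      # 1 x k hidden representation
y_hat = sigmoid(h @ W1)                  # 1 x m prediction

# Backward propagation, using the deltas from Example 15.
delta1 = y_hat - y                       # equals -y(1 - y_hat) + (1 - y) y_hat
grad_W1 = h.T @ delta1                   # k x m
delta2 = (delta1 @ W1.T) * h * (1 - h)   # 1 x k
grad_W2 = x.T @ delta2                   # d x k

# Gradient descent step.
W1 -= alpha * grad_W1
W2 -= alpha * grad_W2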

7.2.2 Unsupervised learning and matrix factorization

Another strategy for obtaining a new representation is through matrix factorization. The data matrix X ∈ R^{n×d} is factorized into a dictionary D and a basis, or new representation, H (see Figure 7.4). This general approach to obtaining a new representation using factorization is called dictionary learning. In fact, many unsupervised learning algorithms (e.g., dimensionality reduction, sparse coding) and semi-supervised learning algorithms (e.g., supervised dictionary learning) can actually be formulated as matrix factorizations. We will look at k-means clustering and principal components analysis as examples; the remaining algorithms are simply summarized in Figure 7.6 below.

Figure 7.4: Matrix factorization of the data matrix X ∈ R^{n×d} into H ∈ R^{n×k} and D ∈ R^{k×d}, so that X ≈ HD.

K-means clustering is an unsupervised learning problem that groups data points into k clusters by minimizing the distances to the mean of each cluster. This problem is not usually thought of as a representation learning approach, because the cluster number is not typically used as a representation; we nonetheless start with k-means because it is an intuitive example of how these unsupervised learning algorithms can be thought of as matrix factorizations. The clustering approach can, however, be seen as a representation learning approach, because it is a learned discretization of the space; we will discuss this view of k-means after discussing it as a matrix factorization.

Imagine that you have two clusters (k = 2). Let d1 be the mean for cluster 1 and d2 the mean for cluster 2. The goal is to minimize the squared ℓ2 distance of each data point x to its cluster center,

Σ_{i=1}^2 || x − 1(x in cluster i) d_i ||_2^2 = || x − hD ||_2^2,

where h = [1 0] or h = [0 1] and D has rows d1 and d2. The overall minimization is defined across all the samples, giving the loss

min_{H ∈ {0,1}^{n×k}, H1 = 1, D ∈ R^{k×d}} || X − HD ||_F^2.

A different cluster vector h is learned for each x, but the dictionary of means is shared amongst all the data points. The specified optimization should pick the dictionary D of means that provides the smallest distances to the points in the training dataset. An example is depicted in Figure 7.5, with data dimension d = 3: for the first sample, h = [1 0], meaning it is placed in cluster 1 with mean d1; it would incur more error to place this point in cluster 2, whose mean d2 is more dissimilar to it.

Figure 7.5: K-means clustering as a matrix factorization for data matrix X ∈ R^{n×d}; the rows of D are the cluster means and each row of H selects the cluster for the corresponding sample.
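A small sketch of this view of k-means (on synthetic data, not the figure's values): alternate between assigning each row of X to its closest dictionary row, which builds the indicator matrix H, and recomputing D as the cluster means:

import numpy as np

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (50, 3)), rng.normal(2, 0.3, (50, 3))])  # synthetic data
n, d, k = X.shape[0], X.shape[1], 2

D = X[rng.choice(n, size=k, replace=False)]          # initialize the means from the data
for _ in range(20):
    # Assignment step: H is the 0/1 indicator matrix with a single 1 per row.
    dists = ((X[:, None, :] - D[None, :, :]) ** 2).sum(axis=2)
    H = np.eye(k)[dists.argmin(axis=1)]
    # Update step: each dictionary row becomes the mean of its assigned points.
    counts = H.sum(axis=0, keepdims=True).T          # k x 1
    D = (H.T @ X) / np.maximum(counts, 1)
print("k-means loss ||X - HD||_F^2:", ((X - H @ D) ** 2).sum())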

Principal components analysis (PCA) is a standard dimensionality reduction technique, where the input data x ∈ R^{1×d} is projected into a lower-dimensional h ∈ R^{1×k} spanned by the space of principal components. These principal components are the directions of maximal variance in the data. This dimensionality reduction technique can also be formulated as a matrix factorization; the corresponding optimization has been shown to be

min_{D ∈ R^{k×d}, H ∈ R^{n×k}} || X − HD ||_F^2.

One simple way to see why is to recall the well-known Eckart-Young-Mirsky theorem: the rank-k matrix X̂ that best approximates X, in terms of minimal Frobenius norm, is

X̂ = U_k Σ_k V_k^T,

where Σ_k ∈ R^{k×k} consists of the k largest singular values (in descending order) and U_k = U_{:,1:k} ∈ R^{n×k} and V_k = V_{:,1:k} ∈ R^{d×k} are the corresponding singular vectors. As with k-means clustering, the common solution technique is therefore to obtain the singular value decomposition of the data matrix, X = UΣV^T ∈ R^{n×d}, giving

D = V_k^T ∈ R^{k×d}
H = U_k Σ_k ∈ R^{n×k}.

The new representation for X (using PCA) is this H. As an exercise, see if you can explain why these D and H solve the optimization above.

Note that PCA does not subselect features, but rather creates new features: the generated h is not a subset of the original x. In general, it may be hard to immediately see why the h generated by PCA could be useful as a representation. The projection to lower dimensions has the property that it removes noise and maintains only the most meaningful directions, which helps speed learning by reducing the number of features and promotes generalization by preventing overfitting to the noise. PCA is also often used for visualization; for visualization, however, the projection is often aggressive, to two or three dimensions, and so is not always useful for representation learning.
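A sketch of the PCA factorization via the SVD described above, on synthetic data; centering of X is assumed here, which the text leaves implicit:

import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 5))
Xc = X - X.mean(axis=0)                       # assume centered data

k = 2
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
D = Vt[:k, :]                                 # D = V_k^T, k x d (principal directions)
H = U[:, :k] * s[:k]                          # H = U_k Sigma_k, n x k (new representation)

# Rank-k reconstruction error, per the Eckart-Young-Mirsky theorem.
print("||X - HD||_F^2 =", ((Xc - H @ D) ** 2).sum())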

Sparse coding takes a different approach, where the input data is expanded into a sparse representation. Sparse coding is biologically motivated [7], based on sparse activations for memory in the mammalian brain. Another interpretation is that sparse coding effectively discretizes the space, like k-means clustering, but with overlapping clusters and an associated magnitude for how much a point belongs to each cluster. A common strategy to obtain sparse representations is to use a sparse regularizer on the learned representation h: as discussed in Section 4.4.2, the ℓ1 regularizer promotes zeroed entries, and so prefers an H with as many zeros as possible. A regularizer is also added to D, to ensure that D does not become too large; otherwise, all the weight in HD would be shifted to D, which also relates to the identifiability of the factorization. This corresponds to the optimization

min_{D ∈ R^{k×d}, H ∈ R^{n×k}} || X − HD ||_F^2 + α Σ_{i=1}^k || H_{:i} ||_1 + α Σ_{i=1}^k || D_{i:} ||_2^2.

Algorithm: loss and constraints
CCA ≡ orthonormal PLS:  || [ X(X^T X)^{-1}   Y(Y^T Y)^{-1} ] − HD ||_F^2
Isomap:  || K − HD ||_F^2, with K = −(1/2)(I − ee^T) S (I − ee^T) and S_{i,j} = || X_{i:} − X_{j:} ||
K-means clustering:  || X − HD ||_F^2, with H ∈ {0,1}^{n×k}, H1 = 1
K-medians clustering:  || X − HD ||_{1,1}, with H ∈ {0,1}^{n×k}, H1 = 1
Partial least squares:  || XY^T − DH ||_F^2
PCA:  || X − HD ||_F^2
Kernel PCA:  || K − HD ||_F^2
Ratio cut:  || K − HD ||_F^2, for K = L†
Laplacian eigenmaps ≡ kernel LPP:  || K − HD ||_F^2, for K = L†
Metric multi-dimensional scaling:  || K − HD ||_F^2, for an isotropic kernel K
Normalized cut:  || (Λ^{-1} X − HD) Λ^{1/2} ||_F^2, with H ∈ {0,1}^{n×k}, H1 = 1

Figure 7.6: Unsupervised learning algorithms that correspond to a matrix factorization.

In general, there are many variants of unsupervised learning algorithms that actually correspond to factorizing the data matrix, summarized in Figure 7.6; if an entry is empty, this specifies no regularization and no constraints. In many cases, they simply correspond to defining an interesting kernel on X, and then factorizing that kernel (e.g., kernel PCA). For a more complete list, see [9] and [11]. As with the regression setting, we can generalize this Euclidean loss to any convex loss,

L_x(H, D, X) = Σ_{i=1}^n L_x(H_{i:} D, X_{i:}),

where above we used

L_x(H, D, X) = Σ_{i=1}^n || H_{i:} D − X_{i:} ||_2^2 = || HD − X ||_F^2.

Algorithms to learn dictionaries

The most common strategy for learning these dictionary models is an alternating minimization over the variables. The strategy is to fix one variable, say D, and descend in the other, say H, and then switch, fixing H and descending in D. This alternating minimization continues until convergence. We summarize this procedure in Algorithm 4. The optimization over D and H is not jointly convex; however, it is convex in each variable separately. Though this is a nonconvex optimization, there is recent evidence that this procedure actually returns the global minimum (see, e.g., [5]).

Our focus remains on prediction, and so we would like to use these representations for supervised (or semi-supervised) learning. We use a two-stage approach, where first the new representation is learned in an unsupervised way and then used with supervised learning algorithms. Once the dictionary D and new representation H have been learned, we can learn the supervised weights W ∈ R^{k×m} to obtain HW ≈ Y. This can be done with any of the linear regression or classification approaches we have learned so far. The two stages could also be combined into one step with supervised dictionary learning; see [6] for a discussion of this more advanced approach.

Finally, we need to know how to use these learned models for out-of-sample prediction (i.e., for new samples). For a new sample x_new, the representation can be obtained using

h_new = argmin_{h ∈ R^k} L_x(hD, x_new).

With the representation for this sample, we can then predict f(h_new W). The matrices D and W contain all the necessary information to perform out-of-sample prediction; H does not need to be stored, because it was the representation specific to the training data.

Algorithm 4: Alternating minimization for dictionary learning
Input:
  inner dimension k
  loss L
  R_D, the regularizer on D
  R_H, the regularizer on H
  the regularization weight α
  convergence tolerance
  fixed positive step-sizes η_D, η_H
  dataset {x1, ..., xn}
Initialization:
  D, H ← full-rank random matrices with inner dimension k
  prevobj ← ∞
Loop until convergence within tolerance or until the maximum number of iterations is reached:
  Update D using one step of gradient descent
  Update H using one step of gradient descent
  currentobj ← L(HD) + α R_D(D) + α R_H(H)
  If |currentobj − prevobj| < tolerance, then break
  prevobj ← currentobj
Output: D, H
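A compact sketch of Algorithm 4 for the squared loss with ℓ2 regularizers on both D and H; the gradients below are specific to this choice, and the step-size and tolerance are arbitrary:

import numpy as np

def dictionary_learning(X, k, alpha=0.1, eta=1e-3, tol=1e-6, max_iter=5000):
    """Alternating gradient descent on ||HD - X||_F^2 + alpha(||D||_F^2 + ||H||_F^2)."""
    rng = np.random.default_rng(5)
    n, d = X.shape
    D = rng.standard_normal((k, d))
    H = rng.standard_normal((n, k))
    prevobj = np.inf
    for _ in range(max_iter):
        R = H @ D - X
        D -= eta * (2 * H.T @ R + 2 * alpha * D)   # one gradient step in D
        R = H @ D - X
        H -= eta * (2 * R @ D.T + 2 * alpha * H)   # one gradient step in H
        obj = ((H @ D - X) ** 2).sum() + alpha * ((D ** 2).sum() + (H ** 2).sum())
        if abs(obj - prevobj) < tol:
            break
        prevobj = obj
    return D, H

X = np.random.default_rng(6).standard_normal((60, 8))
D, H = dictionary_learning(X, k=3)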

Chapter 8

Empirical Evaluation

Empirical evaluation of models, and of the algorithms that construct them, lies at the core of machine learning. Although the field is rife with theoretical models and an extremely powerful apparatus for their analysis, machine learning is, at the end of the day, a field with real-life data, and its goal is the development of methods that work well on such data. Thus, evaluation of methods on real-life data cannot be avoided.

We will distinguish between evaluation of prediction models and evaluation of the algorithms that construct them. Given a data set D = {(x1, y1), (x2, y2), ..., (xn, yn)}, where xi ∈ X is the i-th object and yi ∈ Y is the corresponding target designation, a prediction model is a function f : X → Y constructed from the data set D using an algorithm a : (X × Y)^n → F, where F is called a function or hypothesis space and f ∈ F. We can also say that f = a(D). In short, we can think of a as being a particular learning algorithm, say logistic regression or neural networks, whereas f is a particular function with a fixed set of parameters.

8.1 Comparison of Learning Algorithms

To compare learning algorithms, it is necessary to collect a list of data sets such that the performance can be compared on each such data set. Suppose we have a set of learning problems D1, D2, ..., Dm and wish to compare learning algorithms a1 and a2. We can carry out such a comparison using a counting test as follows: for each data set, both algorithms are evaluated in terms of a relevant performance measure, and the algorithm with the higher performance is awarded a win, while the other one is given a loss (in case of exactly the same performance, we can assign the win/loss randomly). That is, an algorithm with a better performance on a particular data set collects a win (1), whereas the other algorithm collects a loss (0), as shown in Table 8.1. Suppose a1 has k wins out of m and algorithm a2 has m − k wins. We are now interested in providing statistical evidence that, say, algorithm a1 is better than algorithm a2. We would like to evaluate the null hypothesis H0 that algorithms a1 and a2 have the same performance against an alternative hypothesis H1 that algorithm a1 is better than a2.

      D1  D2  D3  D4  ...  Dm−1  Dm
a1     1   0   1   1  ...     0   1
a2     0   1   0   0  ...     1   0

Table 8.1: A counting test where learning algorithms a1 and a2 are compared on a set of m independent data sets.
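The counting test is straightforward to carry out in code. The sketch below tallies wins for a hypothetical win/loss vector and computes the binomial tail probability under p = 1/2, which is exactly the P-value derived next:

from math import comb

def counting_test_p_value(k, m, p=0.5):
    """P(at least k wins out of m) under the null hypothesis."""
    return sum(comb(m, i) * p**i * (1 - p)**(m - i) for i in range(k, m + 1))

wins_a1 = [1, 0, 1, 1, 1, 0, 1, 1, 1, 1]       # hypothetical outcomes on m = 10 data sets
k, m = sum(wins_a1), len(wins_a1)
P = counting_test_p_value(k, m)
print(f"k = {k}, m = {m}, P-value = {P:.4f}")  # reject H0 if P <= alpha (e.g., 0.05)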

H0: quality(a1) = quality(a2)
H1: quality(a1) > quality(a2)

If the null hypothesis is true, the win/loss on each data set will be equally likely and determined by minor variation; that is, the probability of a win on any data set will be roughly equal to p = 1/2. Therefore, we can express the probability that algorithm a1 collected k wins or more under the null hypothesis using the binomial distribution,

P = Σ_{i=k}^m ( m choose i ) p^i (1 − p)^{m−i},

and refer to it as the P-value. A typical approach in these cases is to establish a significance value, say α = 0.05, and reject the null hypothesis if P ≤ α. For sufficiently low P-values, we may conclude that there is sufficient evidence that algorithm a1 is better than algorithm a2. If the P-value is greater than α, we say that there is insufficient evidence for rejecting H0. The choice of the significance threshold α is somewhat arbitrary; typically, 5% is a reasonable value, but lower values indicate that the particular situation of k wins out of m was so unlikely that we can consider the evidence for rejecting H0 very strong.

8.2 Performance of Classification Models

Classification error, error:  error = (fp + fn) / (tp + fp + tn + fn)
Classification accuracy, accuracy:  accuracy = 1 − error
True positive rate, tpr:  tpr = tp / (tp + fn)
False negative rate, fnr:  fnr = fn / (tp + fn)
True negative rate, tnr:  tnr = tn / (tn + fp)
False positive rate, fpr:  fpr = fp / (tn + fp)
Precision, pr:  pr = tp / (tp + fp)
Recall, rc:  rc = tp / (tp + fn)
F-measure, F:

Table 8.2: Some performance measures for classification models.

Chapter 9

Advanced topics

9.1 Bayesian estimation

Maximum a posteriori and maximum likelihood approaches report the solution that corresponds to the mode of the posterior distribution and of the likelihood function, respectively. This approach, however, does not consider the possibility of skewed distributions, multimodal distributions, or simply large regions with similar values of p(M|D). Bayesian estimation addresses those concerns.

The main idea in Bayesian statistics is minimization of the posterior risk,

R = ∫_M ℓ(M, M̂) p(M|D) dM,

where M̂ is our estimate and ℓ(M, M̂) is some loss function between two models. When ℓ(M, M̂) = (M − M̂)^2 (ignoring the abuse of notation), we can minimize the posterior risk as follows:

∂R/∂M̂ = 2M̂ − 2 ∫_M M p(M|D) dM = 0,

from which it can be derived that the minimizer of the posterior risk is the posterior mean,

M̂_B = ∫_M M p(M|D) dM = E_M[M|D].

We shall refer to M̂_B as the Bayes estimator. It is important to mention that computing the posterior mean usually involves solving complex integrals. In some situations, these integrals can be solved analytically; in others, numerical integration is necessary.

Example 11. Let D = {2, 5, 9, 5, 4, 8} yet again be an i.i.d. sample from Poisson(λt). Suppose the prior knowledge about the parameter of the Poisson distribution can be expressed using a gamma distribution with parameters k = 3 and θ = 1. Find the Bayesian estimate of λt.

We want to find E[λ|D]. Let us first write the posterior distribution as

p(λ|D) = p(D|λ) p(λ) / p(D) = p(D|λ) p(λ) / ∫_0^∞ p(D|λ) p(λ) dλ,

where, as shown in previous examples,

p(D|λ) = λ^{Σ_{i=1}^n xi} e^{−nλ} / Π_{i=1}^n xi!

and

p(λ) = λ^{k−1} e^{−λ/θ} / ( θ^k Γ(k) ).

Before calculating p(D), let us first note that

∫_0^∞ x^{α−1} e^{−βx} dx = Γ(α) / β^α.

Now, we can derive that

p(D) = ∫_0^∞ p(D|λ) p(λ) dλ
= ∫_0^∞ ( λ^{Σ xi} e^{−nλ} / Π xi! ) ( λ^{k−1} e^{−λ/θ} / ( θ^k Γ(k) ) ) dλ
= Γ( k + Σ_{i=1}^n xi ) / ( θ^k Γ(k) Π_{i=1}^n xi! ( n + 1/θ )^{Σ xi + k} ),

and subsequently that

p(λ|D) = p(D|λ) p(λ) / p(D)
= λ^{k−1+Σ xi} e^{−λ(n + 1/θ)} ( n + 1/θ )^{Σ xi + k} / Γ( k + Σ_{i=1}^n xi ).

Finally,

E[λ|D] = ∫_0^∞ λ p(λ|D) dλ = ( k + Σ_{i=1}^n xi ) / ( n + 1/θ ) = 5.14,

which is nearly the same solution as the MAP estimate found in Example 9.
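The closed-form posterior mean above is easy to check numerically. The sketch below compares the formula (k + Σ xi)/(n + 1/θ) with a direct numerical integration of λ p(λ|D); the counts used are illustrative:

import numpy as np

def poisson_gamma_posterior_mean(data, k=3.0, theta=1.0):
    """Posterior mean E[lambda | D] for a Poisson likelihood with a Gamma(k, theta) prior."""
    n, s = len(data), sum(data)
    return (k + s) / (n + 1.0 / theta)

data = [2, 5, 9, 5, 4, 8]               # illustrative counts
analytic = poisson_gamma_posterior_mean(data)

# Numerical check: p(lambda|D) is proportional to lambda^(k-1+sum x) exp(-lambda (n + 1/theta)).
k, theta, n, s = 3.0, 1.0, len(data), sum(data)
lam = np.linspace(1e-6, 30, 200000)
log_post = (k - 1 + s) * np.log(lam) - lam * (n + 1 / theta)
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, lam)
numeric = np.trapz(lam * post, lam)
print(analytic, numeric)                # both approximately 5.14 for these counts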

It is evident from the previous example that the selection of the prior distribution has important implications for the calculation of the posterior mean. We have not picked the gamma distribution by chance: when the likelihood was multiplied by the prior, the resulting distribution remained in the same class of functions as the prior. We shall refer to such prior distributions as conjugate priors. Interestingly, in addition to the Poisson distribution, the gamma distribution is a conjugate prior to the exponential distribution as well as to the gamma distribution itself. Conjugate priors also simplify the mathematics; this is, in fact, a major reason for their consideration.

9.2 Parameter estimation for mixtures of distributions

We now investigate parameter estimation for mixture models, which is most commonly carried out using the expectation-maximization (EM) algorithm. We are given a set of i.i.d. observations D = {xi}_{i=1}^n, with the goal of estimating the parameters of the mixture distribution

p(x|θ) = Σ_{j=1}^m wj p(x|θj).

In the equation above, we used θ = (w1, w2, ..., wm, θ1, θ2, ..., θm) to combine all parameters. Just to be more concrete, we shall assume that each p(xi|θj) is an exponential distribution with parameter λj; that is,

p(x|θj) = λj e^{−λj x},

where λj > 0. As before, we shall assume that m is given and will address simultaneous estimation of θ and m later.

Let us attempt to find the maximum likelihood solution first. By plugging the formula for p(x|θ) into the likelihood function we obtain

p(D|θ) = Π_{i=1}^n p(xi|θ) = Π_{i=1}^n ( Σ_{j=1}^m wj p(xi|θj) ),    (9.1)

which, unfortunately, is difficult to maximize using differential calculus (why?). Note that although p(D|θ) has O(m^n) terms when expanded, it can be calculated in O(mn) time as a log-likelihood.

Before introducing the EM algorithm, let us for a moment present two hypothetical scenarios that will help us understand the algorithm. First, suppose that D = {(xi, yi)}_{i=1}^n is an i.i.d. sample from some distribution pXY(x, y), where y ∈ Y = {1, 2, ..., m} specifies the mixing component; that is, suppose that information is available as to which mixing component generated which data point. How would the maximization be performed then? Let us write the likelihood function as

p(D|θ) = Π_{i=1}^n p(xi, yi|θ) = Π_{i=1}^n p(xi|yi, θ) p(yi|θ) = Π_{i=1}^n w_{yi} p(xi|θ_{yi}),    (9.2)

where wj = pY(j) = P(Y = j). The log-likelihood is

log p(D|θ) = Σ_{i=1}^n ( log w_{yi} + log p(xi|θ_{yi}) ) = Σ_{j=1}^m nj log wj + Σ_{i=1}^n log p(xi|θ_{yi}),

where nj is the number of data points in D generated by the j-th mixing component. It is useful to observe here that when y = (y1, y2, ..., yn) is known, the internal summation operator in Eq. (9.1) disappears; thus, it follows that Eq. (9.2) can be maximized in a relatively straightforward manner. Let us show how.

To find w = (w1, w2, ..., wm) we need to solve a constrained optimization problem, which we will do by using the method of Lagrange multipliers. We shall first form the Lagrangian function as

L(w, α) = Σ_{j=1}^m nj log wj + α ( Σ_{j=1}^m wj − 1 ),

where α is the Lagrange multiplier. By setting ∂L(w, α)/∂wk = 0 for every k ∈ Y and ∂L(w, α)/∂α = 0, we derive that wk = −nk/α and α = −n. Thus,

wk = (1/n) Σ_{i=1}^n I(yi = k),

where I(·) is the indicator function. To find all θj, we recall that we assumed a mixture of exponential distributions. Thus, we proceed by setting

∂/∂λk Σ_{i=1}^n log p(xi|λ_{yi}) = 0

for each k ∈ Y. We obtain that

λk = nk / Σ_{i=1}^n I(yi = k) xi,

which is simply the inverse of the mean over those data points generated by the k-th mixture component. In summary, we observe that if the mixing component designations y are known, the parameter estimation is greatly simplified. This was achieved by decoupling the estimation of the mixing proportions and the parameters of the mixing distributions.

In the second hypothetical scenario, suppose that the parameters θ are known, and that we would like to estimate the best configuration of the mixture designations y (one may be tempted to call them "class labels"). This task looks like clustering, in which cluster memberships need to be determined based on the known set of mixing distributions and mixing probabilities. To do this we can calculate the posterior distribution of y as

p(y|D, θ) = Π_{i=1}^n p(yi|xi, θ) = Π_{i=1}^n ( w_{yi} p(xi|θ_{yi}) / Σ_{j=1}^m wj p(xi|θj) )    (9.3)

and subsequently find the best configuration out of m^n possibilities. Fortunately, because of the i.i.d. assumption, each element yi can be estimated separately and, thus, this estimation can be completed in O(mn) time. The MAP estimate for yi can be found as

ŷi = argmax_{yi ∈ Y} { w_{yi} p(xi|θ_{yi}) / Σ_{j=1}^m wj p(xi|θj) }

for each i ∈ {1, 2, ..., n}.

In reality, neither the "class labels" y nor the parameters θ are known. However, we have just seen that the optimization step is relatively straightforward if one of them is known. Therefore, the intuition behind the EM algorithm is to form an iterative procedure by assuming that either y or θ is known and calculating the other. For example, we can initially pick some value for θ, say θ(0), and then estimate y by computing p(y|D, θ(0)) as in Eq. (9.3). We can refer to this estimate as y(0). Using y(0) we can now refine the estimate of θ to θ(1) using Eq. (9.2). We can then iterate these two steps until convergence. In the case of a mixture of exponential distributions, we arrive at the following algorithm:

1. Initialize λ_k^(0) and w_k^(0) for ∀k ∈ Y
2. Calculate y_i^(0) = argmax_{k ∈ Y} [ w_k^(0) p(xi|λ_k^(0)) / Σ_{j=1}^m w_j^(0) p(xi|λ_j^(0)) ] for ∀i ∈ {1, 2, ..., n}
3. Set t = 0
4. Repeat until convergence
   (a) w_k^(t+1) = (1/n) Σ_{i=1}^n I(y_i^(t) = k)
   (b) λ_k^(t+1) = Σ_{i=1}^n I(y_i^(t) = k) / Σ_{i=1}^n I(y_i^(t) = k) xi
   (c) t = t + 1
   (d) y_i^(t) = argmax_{k ∈ Y} [ w_k^(t) p(xi|λ_k^(t)) / Σ_{j=1}^m w_j^(t) p(xi|λ_j^(t)) ]
5. Report λ_k^(t) and w_k^(t) for ∀k ∈ Y

This procedure is not quite yet the EM algorithm; rather, it is a version of it referred to as the classification EM algorithm. In the next section we will introduce the EM algorithm.

The expectation-maximization algorithm

The previous procedure was designed to iteratively estimate class memberships and the parameters of the distribution. In reality, however, it is not necessary to compute y; after all, we only need to estimate θ. To accomplish this, we can use p(y|D, θ(t)) to maximize the expected log-likelihood of both D and y,

E_Y[ log p(D, y|θ) | θ(t) ] = Σ_y log p(D, y|θ) p(y|D, θ(t)),    (9.4)

which can be carried out by integrating the log-likelihood function of D and y over the posterior distribution for y, in which the current values of the parameters θ(t) are assumed to be known. We can now formulate the expression for the parameters in step t + 1 as

θ(t+1) = argmax_θ { E[ log p(D, y|θ) | θ(t) ] }.    (9.5)

The formula above is all that is necessary to create the update rule for the EM algorithm. Note that at each step t we always have to re-compute the E[log p(D, y|θ)|θ(t)] function, because the parameters θ(t) have been updated from the previous step; we then perform the maximization. Hence the name "expectation-maximization". It is perfectly valid to think of the EM algorithm as an iterative maximization of the expectation from Eq. (9.4).

We now proceed as follows:

E[ log p(D, y|θ) | θ(t) ] = Σ_{y1=1}^m · · · Σ_{yn=1}^m log p(D, y|θ) Π_{l=1}^n p(yl|xl, θ(t))
= Σ_{y1=1}^m · · · Σ_{yn=1}^m Σ_{i=1}^n log p(xi, yi|θ) Π_{l=1}^n p(yl|xl, θ(t))
= Σ_{y1=1}^m · · · Σ_{yn=1}^m Σ_{i=1}^n log ( w_{yi} p(xi|θ_{yi}) ) Π_{l=1}^n p(yl|xl, θ(t)).

After several simplification steps, which we omit for space reasons, the expectation of the log-likelihood can be written as

E[ log p(D, y|θ) | θ(t) ] = Σ_{i=1}^n Σ_{j=1}^m log ( wj p(xi|θj) ) p_{Yi}(j|xi, θ(t)),

from which we can see that w and {θj}_{j=1}^m can be found separately. To maximize E[log p(D, y|θ)|θ(t)] with respect to w, we observe that this is an instance of constrained optimization, because it must hold that Σ_{j=1}^m wj = 1. We will use the method of Lagrange multipliers. In the final two steps, we will first derive the update rule for the mixing probabilities and then, by assuming the mixing distributions are exponential, derive the update rules for their parameters. Thus, for each k ∈ Y we need to solve

∂/∂wk [ Σ_{j=1}^m Σ_{i=1}^n log(wj) p_{Yi}(j|xi, θ(t)) + α ( Σ_{j=1}^m wj − 1 ) ] = 0,

where α is the Lagrange multiplier. It is relatively straightforward to show that

w_k^(t+1) = (1/n) Σ_{i=1}^n p_{Yi}(k|xi, θ(t)).    (9.6)

Similarly, to find the solution for the parameters of the mixing distributions, separate derivatives have to be found. As previously shown, for the mixture of m exponential distributions we obtain

λ_k^(t+1) = Σ_{i=1}^n p_{Yi}(k|xi, θ(t)) / Σ_{i=1}^n xi p_{Yi}(k|xi, θ(t)),    (9.7)

where

p_{Yi}(k|xi, θ(t)) = w_k^(t) p(xi|λ_k^(t)) / Σ_{j=1}^m w_j^(t) p(xi|λ_j^(t)),    (9.8)

which can be computed and stored as an n × m matrix. In summary, we obtain the EM algorithm for the mixture of m exponential distributions by combining Eqs. (9.6)-(9.8) as follows:

1. Initialize λ_k^(0) and w_k^(0) for ∀k ∈ Y
2. Set t = 0
3. Repeat until convergence
   (a) p_{Yi}(k|xi, θ(t)) = w_k^(t) p(xi|λ_k^(t)) / Σ_{j=1}^m w_j^(t) p(xi|λ_j^(t)) for ∀(i, k)
   (b) w_k^(t+1) = (1/n) Σ_{i=1}^n p_{Yi}(k|xi, θ(t))
   (c) λ_k^(t+1) = Σ_{i=1}^n p_{Yi}(k|xi, θ(t)) / Σ_{i=1}^n xi p_{Yi}(k|xi, θ(t))
   (d) t = t + 1
4. Report λ_k^(t) and w_k^(t) for ∀k ∈ Y

Notice the difference between the CEM and the EM algorithms. Similar update rules can be obtained for different probability distributions.
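A sketch of the EM updates in Eqs. (9.6)-(9.8) for a mixture of two exponential distributions, on synthetic data with made-up true parameters:

import numpy as np

rng = np.random.default_rng(7)
# Synthetic data: a mixture of Exp(lambda = 0.5) and Exp(lambda = 4.0).
x = np.concatenate([rng.exponential(1 / 0.5, 300), rng.exponential(1 / 4.0, 700)])
n, m = x.size, 2

w = np.full(m, 1.0 / m)                     # mixing probabilities
lam = np.array([0.1, 1.0])                  # initial rate parameters
for _ in range(200):
    # E-step, Eq. (9.8): responsibilities p_{Y_i}(k | x_i, theta^(t)), an n x m matrix.
    dens = w * lam * np.exp(-np.outer(x, lam))
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step, Eqs. (9.6) and (9.7).
    w = resp.mean(axis=0)
    lam = resp.sum(axis=0) / (resp * x[:, None]).sum(axis=0)
print("w =", w, "lambda =", lam)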

Identifiability

When estimating the parameters of a mixture, it is possible that for some parametric families one obtains multiple solutions. Suppose a mixture is expressed as

p(x|θ) = Σ_{j=1}^m wj p(x|θj),

but can also be expressed as

p(x|θ′) = Σ_{j=1}^{m′} w′j p(x|θ′j).

The parameters are identifiable if

Σ_{j=1}^m wj p(x|θj) = Σ_{j=1}^{m′} w′j p(x|θ′j)

implies that m = m′ and that for each j ∈ {1, 2, ..., m} there exists some l ∈ {1, 2, ..., m′} such that wj = w′l and θj = θ′l.

9.3 Boosting

Algorithm 5: AdaBoost algorithm
Input:
  data set D = {(xi, yi)}_{i=1}^n, where xi ∈ X and yi ∈ Y
  weak learning algorithm a that maps X to Y
  positive integer T (typically, T = 100)
Initialization:
  initialize the sampling distribution p(1)(i) = 1/n for ∀i ∈ {1, 2, ..., n}
Loop: for t = 1 to T
  Sample data set D(t) from D according to p(t)(i)
  Learn model ft(x) from D(t)
  Calculate the error εt on the training data D as εt = Σ_{i : ft(xi) ≠ yi} p(t)(i)
  Set βt = εt / (1 − εt)
  Set wt = ln(1/βt)
  Set p(t+1)(i) = p(t)(i) βt / Z if ft(xi) = yi, and p(t+1)(i) = p(t)(i) / Z otherwise, where Z is a normalizer
end
Output:
  f(x) = argmax_{y ∈ Y} Σ_{t=1}^T wt I(ft(x) = y)
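A sketch of Algorithm 5 for the binary case with labels in {-1, +1}, using decision stumps as the weak learner; the stump learner, data, and T are illustrative, and the final arg-max vote reduces to the sign of the weighted sum for two classes:

import numpy as np

def stump_learn(X, y):
    """Tiny weak learner: best single-feature threshold classifier (labels in {-1, +1})."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.mean(pred != y)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]

def stump_predict(model, X):
    j, thr, sign = model
    return sign * np.where(X[:, j] > thr, 1, -1)

def adaboost(X, y, T=25, rng=np.random.default_rng(8)):
    n = X.shape[0]
    p = np.full(n, 1.0 / n)                     # sampling distribution p^(1)
    models, weights = [], []
    for _ in range(T):
        idx = rng.choice(n, size=n, p=p)        # sample D^(t) from D according to p^(t)
        model = stump_learn(X[idx], y[idx])
        pred = stump_predict(model, X)
        eps = p[pred != y].sum()                # weighted error on the full training set
        eps = min(max(eps, 1e-12), 1 - 1e-12)   # guard against division by zero
        beta = eps / (1 - eps)
        models.append(model)
        weights.append(np.log(1 / beta))
        p = np.where(pred == y, p * beta, p)    # down-weight correctly classified points
        p /= p.sum()                            # Z normalizes p^(t+1)
    return models, weights

def adaboost_predict(models, weights, X):
    votes = sum(w * stump_predict(m, X) for m, w in zip(models, weights))
    return np.sign(votes)                       # weighted vote over {-1, +1}

rng = np.random.default_rng(9)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)      # synthetic two-class problem
models, weights = adaboost(X, y)
print("training accuracy:", np.mean(adaboost_predict(models, weights, X) == y))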

Bibliography

[1] P. Auer, M. Herbster, and M. K. Warmuth. Exponentially many local minima for single neurons. In Advances in Neural Information Processing Systems, 1996.

[2] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 2005.

[3] Léon Bottou and Yann Le Cun. On-line learning for very large data sets. Applied Stochastic Models in Business and Industry, 2005.

[4] Léon Bottou. Online learning and stochastic approximations. In Online Learning and Neural Networks, 1998.

[5] Lei Le and Martha White. Global optimization of factor models using alternating minimization. arXiv.org, 2016.

[6] Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and Andrew Zisserman. Supervised dictionary learning. In Advances in Neural Information Processing Systems, 2009.

[7] Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research, 1997.

[8] Dinah Shender and John Lafferty. Computation-Risk Tradeoffs for Covariance-Thresholded Regression. In International Conference on Machine Learning, 2013.

[9] Ajit P. Singh and Geoffrey J. Gordon. A unified view of matrix factorization models. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2008.

[10] Larry Wasserman. All of Statistics: A Concise Course in Statistical Inference. Springer, 2004.

[11] M. White. Regularized factor models. PhD thesis, University of Alberta, 2014.

Appendix A

Optimization background

A.1 First-order gradient descent

First-order gradient descent involves using only information about the first derivative (the gradient) to find the minimum of a function. A key aspect of this algorithm is an effective method for selecting the step-size.

A.2 Second-order optimization

A function f(x) in the neighborhood of a point x0 can be approximated using the Taylor series as

f(x) = Σ_{n=0}^∞ ( f^(n)(x0) / n! ) (x − x0)^n,

where f^(n)(x0) is the n-th derivative of the function f(x) evaluated at the point x0. For practical reasons, f(x) is considered to be infinitely differentiable. We will approximate this function using the first three terms of the series as

f(x) ≈ f(x0) + (x − x0) f′(x0) + ½ (x − x0)^2 f″(x0).

The optimum of this function can be found by taking the first derivative and setting it to zero (technically, one should check the second derivative as well),

f′(x) ≈ f′(x0) + (x − x0) f″(x0) = 0.

Solving this equation for x gives us

x = x0 − f′(x0) / f″(x0).

Note that the approach assumes that a good enough solution x0 already exists. However, this equation also provides a basis for an iterative process for finding the optimum of the function f(x). For example, if x^(i) is the value of x in the i-th step, then the value in step i + 1 can be obtained as

x^(i+1) = x^(i) − f′(x^(i)) / f″(x^(i)).    (A.1)

This method is called the Newton-Raphson method of optimization.

We can generalize this approach to functions of vector variables x = (x1, x2, ..., xk). The Taylor approximation for a vector function can be written as

f(x) ≈ f(x0) + ∇f(x0)^T (x − x0) + ½ (x − x0)^T H_f(x0) (x − x0),

where

∇f(x) = ( ∂f/∂x1, ∂f/∂x2, ..., ∂f/∂xk )

is the gradient of the function f and

H_f(x) = [ ∂^2f/∂x1^2      ∂^2f/∂x1∂x2   ...   ∂^2f/∂x1∂xk
           ∂^2f/∂x2∂x1    ∂^2f/∂x2^2           ...
           ...
           ∂^2f/∂xk∂x1    ...                  ∂^2f/∂xk^2 ]

is the Hessian matrix of the function f. Here, the gradient of f and its Hessian are evaluated at the point x0. Consequently, Eq. (A.1) is modified into the following form:

x^(i+1) = x^(i) − H_f(x^(i))^{-1} ∇f(x^(i)).    (A.2)

In Eq. (A.2), both the gradient and the Hessian are evaluated at the point x^(i).
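A sketch of the Newton-Raphson iteration of Eq. (A.2) on a simple two-dimensional quadratic; the test function is made up for illustration:

import numpy as np

# Minimize f(x) = (x1 - 1)^2 + 2*(x2 + 3)^2 + x1*x2 with the update of Eq. (A.2).
def grad(x):
    return np.array([2 * (x[0] - 1) + x[1], 4 * (x[1] + 3) + x[0]])

def hessian(x):
    return np.array([[2.0, 1.0], [1.0, 4.0]])

x = np.zeros(2)                                   # starting point x^(0)
for i in range(20):
    step = np.linalg.solve(hessian(x), grad(x))   # H_f(x)^{-1} grad f(x)
    x = x - step
    if np.linalg.norm(step) < 1e-10:
        break
print("minimizer:", x, "gradient norm:", np.linalg.norm(grad(x)))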

A.3 Quasi-second-order gradient descent

A.4 Constrained optimization and Lagrange multipliers

Appendix B

Linear algebra background

B.1 An Algebraic Perspective

Another powerful tool for analyzing and understanding linear regression comes from linear and applied linear algebra. In this section we take a detour to address the fundamentals of linear algebra and then apply these concepts to deepen our understanding of regression.

B.1.1 The four fundamental subspaces

The objective of this section is to briefly review the four fundamental subspaces in linear algebra (column space, row space, nullspace, left nullspace) and their mutual relationship. First, we briefly review the basic concepts.

In linear algebra, we are frequently interested in solving the following set of equations, given below in matrix form:

Ax = b.    (B.1)

Here, A is an m × n matrix, b is an m × 1 vector, and x is an n × 1 vector that is to be found. All elements of A, x, and b are considered to be real numbers. We shall start with a simple scenario and assume A is a square, 2 × 2 matrix. This set of equations can then be expressed as

a11 x1 + a12 x2 = b1
a21 x1 + a22 x2 = b2.

For example, we may be interested in solving

x1 + 2 x2 = 3
x1 + 3 x2 = 5.

This is a convenient formulation when we want to solve the system, e.g., by Gaussian elimination. However, it is not a suitable formulation for understanding the question of the existence of solutions.

We shall start with our example from above and write the system of linear equations as

x1 [1 1]^T + x2 [2 3]^T = [3 5]^T.

We can see now that by solving Ax = b we are looking for the right amounts of the vectors (1, 1) and (2, 3) so that their linear combination produces (3, 5); these amounts are x1 = −1 and x2 = 2. Let us define a1 = (1, 1) and a2 = (2, 3) to be the column vectors of A, i.e., A = [a1 a2]. Thus, Ax = b will be solvable whenever b can be expressed as a linear combination of the column vectors a1 and a2. All linear combinations of the columns of matrix A constitute the column space of A, C(A). Both b and C(A) lie in the m-dimensional space R^m. Therefore, what Ax = b is saying is that b must lie in the column space of A for the equation to have solutions.
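A quick numerical illustration of this column-space view of the 2 × 2 example above:

import numpy as np

A = np.array([[1.0, 2.0],
              [1.0, 3.0]])          # columns a1 = (1, 1) and a2 = (2, 3)
b = np.array([3.0, 5.0])

x = np.linalg.solve(A, b)           # the "right amounts" of a1 and a2
print(x)                            # [-1.  2.]
print(np.allclose(x[0] * A[:, 0] + x[1] * A[:, 1], b))  # b is a combination of the columns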

There is a deep connection between the spaces generated by a set of vectors and the properties of the matrix A. In the example above, we had a two-dimensional column space with basis vectors a1 = (1, 1) and a2 = (2, 3). The basis of a space is the smallest set of vectors that span the space (this set of vectors is not unique); the size of the basis is also called the dimension of the space. For now, it suffices to say that if a1 and a2 are independent the matrix A is non-singular (singularity can be discussed only for square matrices); as a reminder, two vectors are independent if their linear combination is only zero when both coefficients are zero. If the columns of A are linearly independent, that is, if A is of full rank, then, because A is a square matrix, the solution is unique. Otherwise, there are either no solutions or there are infinitely many solutions. An example of such a situation is

[1 2; 1 2] [x1; x2] = [3; 5],

where a1 = (1, 1) and a2 = (2, 2). Here, a1 and a2 are (linearly) dependent, because 2a1 − a2 = 0. For a1 = (1, 1) and a2 = (2, 2) we have a one-dimensional column space, a line, fully determined by either of the basis vectors. If b does not lie on this line, the system has no solutions.

In an equivalent manner to the column space, all linear combinations of the rows of A constitute the row space, denoted by C(A^T). All solutions to Ax = 0 constitute the nullspace of the matrix, N(A), while all solutions of A^T y = 0 constitute the so-called left nullspace of A, N(A^T). C(A) and N(A^T) are embedded in R^m, whereas C(A^T) and N(A) are in R^n.

We mentioned above that the properties of the fundamental spaces are tightly connected with the properties of the matrix A. Orthogonality is a key property of the four subspaces: the pairs of subspaces are orthogonal (vectors u and v are orthogonal if u^T v = 0), i.e., any vector in C(A) is orthogonal to all vectors from N(A^T) and any vector in C(A^T) is orthogonal to all vectors from N(A). This is easy to see: if x ∈ N(A), then by definition Ax = 0, and thus each row of A is orthogonal to x; if each row is orthogonal to x, then so are all linear combinations of rows. Orthogonality is important, as it provides a useful decomposition of the vectors x and b from Eq. (B.1) with respect to A (we will exploit this in the next section). That is, any x ∈ R^n can be decomposed as x = xr + xn, such that ||x||_2^2 = ||xr||_2^2 + ||xn||_2^2, where xr ∈ C(A^T) and xn ∈ N(A). Similarly, every b ∈ R^m can be decomposed as b = bc + bl, such that ||b||_2^2 = ||bc||_2^2 + ||bl||_2^2, where bc ∈ C(A) and bl ∈ N(A^T).

To conclude this section, let us briefly discuss the rank of a matrix and its relationship with the dimensions of the fundamental subspaces. Unsurprisingly, the dimension of the space spanned by the column vectors equals the rank of the matrix A. One of the fundamental results in linear algebra is that the rank of A is identical to the dimension of C(A), which in turn is identical to the dimension of C(A^T).

B.1.2 Minimizing ||Ax − b||_2

Let us now look again at the solutions to Ax = b. Generally, there are three different outcomes: 1. there are no solutions to the system; 2. there is a unique solution to the system; and 3. there are infinitely many solutions.

These outcomes depend on the relationship between the rank (r) of A and the dimensions m and n. We already know that when r = m = n (square, invertible, full-rank matrix A) there is a unique solution to the system, but let us investigate the other situations. When r = n < m (full column rank), the system has either one solution or no solutions. When r = m < n (full row rank), the system has infinitely many solutions. Finally, in cases when r < m and r < n, there are either no solutions or there are infinitely many solutions.

Because Ax = b may not be solvable, we generalize solving Ax = b to minimizing ||Ax − b||_2. In such a way, all situations can be considered in a unified framework. Our goal is to try to find a point in C(A) that is closest to b. This happens to be the point where b is projected to C(A), as shown in Figure B.1. We will refer to the projection of b to C(A) as p, and to the error as e.

Figure B.1: Illustration of the projection of vector b to the column space of matrix A. Vectors p (bc) and e (bl) represent the projection point and the error, respectively.

Let us consider the following example,

A = [1 2; 1 3; 1 4],  x = [x1; x2],  b = [b1; b2; b3],

which illustrates an instance where we are unlikely to have a solution to Ax = b. Here, C(A) is a 2D plane in R^3 spanned by the column vectors a1 = (1, 1, 1) and a2 = (2, 3, 4). There is no solution unless there is some constraint on b1, b2, and b3; in this situation, the constraint is b3 = 2b2 − b1. If the constraint on the elements of b is not satisfied, we can only look for the point in C(A) that is closest to b.

Now, using the standard algebraic notation, we have the following equations:

b = p + e
p = Ax.

Since p and e are orthogonal, we know that p^T e = 0. Let us now solve for x:

(Ax)^T (b − Ax) = 0
x^T A^T b − x^T A^T A x = 0
x^T ( A^T b − A^T A x ) = 0,

and thus

x* = (A^T A)^{-1} A^T b.

This is exactly the same solution as the one that minimized the sum of squared errors and maximized the likelihood. The matrix

A† = (A^T A)^{-1} A^T

is called the Moore-Penrose pseudo-inverse, or simply a pseudo-inverse. This is an important matrix because it always exists and is unique, even in situations when the inverse of A^T A does not exist, as we will see momentarily.

The inverse of A^T A does not exist when A has dependent columns (technically, A and A^T A have the same nullspace, which then contains more than just the origin of the coordinate system, and thus the rank of A^T A is less than n).

Let us examine for a moment the existence and multiplicity of solutions to

argmin_x ||Ax − b||_2.    (B.2)

Clearly, the solution to this problem always exists. If the columns of A are independent, the solution is unique, because the nullspace contains only the origin. Otherwise, there are infinitely many solutions. Consider x to be one solution to Eq. (B.2). Recall that x = xr + xn and that it is multiplied by A; thus any vector x = xr + α xn, where α ∈ R, is also a solution. Observe that xr is common to all such solutions (if you cannot see it, assume there exists another vector from the row space and show that it is not possible). In these situations, the only question is what particular solution x will be found by the minimization procedure. Let us now also note that Ax = b has infinitely many solutions when b ∈ C(A); this usually arises when r ≤ m < n, and, if it is solvable, the minimum of ||Ax − b||_2 is zero because b is already in the column space of A.

Let us for a moment look at the projection vector p. We have

p = Ax = A (A^T A)^{-1} A^T b,

where A (A^T A)^{-1} A^T is the matrix that projects b to the column space of A. Now, what exactly is the solution found by projecting b to C(A)? Let us look at it:

x* = A† b = (A^T A)^{-1} A^T (p + e) = (A^T A)^{-1} A^T p = xr,

as p = A xr. Given that xr is unique, the solution found by the least-squares optimization is the one that simultaneously minimizes ||Ax − b||_2 and ||x||_2 (observe that ||x||_2 is minimized because the solution ignores any component from the nullspace). Thus, the outcome of the minimization process is the solution with the minimum L2 norm ||x||_2. For this reason, the OLS regression problem is sometimes referred to as the minimum-norm least-squares problem.

To summarize, the goal of the OLS regression problem is to solve Xw = y, if it is solvable. When k < n this is not a realistic scenario in practice, so we relaxed the requirement and tried to find the point in the column space C(X) that is closest to y. This turned out to be equivalent to minimizing the sum of square errors (or Euclidean distance) between the n-dimensional vectors Xw and y. It also turned out to be equivalent to the maximum likelihood solution presented in Section 4.1. When n < k, a usual situation in practice is that there are infinitely many solutions; in such cases, our optimization algorithm will find the one with the minimum L2 norm. While we arrived at the same result as in previous sections, the tools of linear algebra allow us to discuss OLS regression at a deeper level.
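A small numerical check of this discussion: for a wide matrix A (more columns than rows, so infinitely many solutions), the pseudo-inverse solution solves the system and has a smaller norm than any other solution obtained by adding a nullspace component; the matrix and right-hand side below are random and purely illustrative:

import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((3, 6))        # r = m < n: infinitely many solutions
b = rng.standard_normal(3)

x_star = np.linalg.pinv(A) @ b         # Moore-Penrose pseudo-inverse solution

# Any x_star plus a component in the nullspace of A also solves Ax = b,
# but has a larger norm than the pseudo-inverse solution.
_, _, Vt = np.linalg.svd(A)
x_other = x_star + Vt[-1]              # add a unit vector from N(A)
print(np.allclose(A @ x_star, b), np.allclose(A @ x_other, b))
print(np.linalg.norm(x_star), np.linalg.norm(x_other))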