SUPERVISED LEARNING
PART 1
We will consider two types of learning: supervised and unsupervised learning. Before explaining each of those terms, what do we mean by learning in our introductory artificial intelligence course? Here, learning refers to acquiring knowledge from data with the objective of making better decisions than without using that data. By decisions, we mean for example managerial decisions like “should I invest in this project?” but also automated platform decisions like “what is shown to you on the landing page of Netflix?”.

With our society, that is, our personal lives, businesses, and governments, being highly digitalized, unprecedented amounts of data are generated and stored in real time. These massive amounts of data, which we can indeed call Big Data, can be used for improved decision making.

Of course, having access to such data is interesting, but it becomes useful only if we can clearly identify the relevant data sources and then quickly learn from them to make decisions. Yes, you might already be thinking that this sounds complicated. In fact, massive datasets and swift decision making are possible only thanks to the processing power of computers. Highly scalable computational resources and the full digitalization of our society are the main reasons for the so-called artificial intelligence revolution. OK, now that we have computing power and data, how can you learn to make better decisions?

Lyn would like to understand what type of apartments in Paris are for sale on the market, and what explains the asking price differences between these apartments.

So the question is: what determines the asking price of an apartment in Paris? To answer this question, and to make an informed decision about buying an apartment, Lyn needs to build a dataset. What is a dataset? At this stage, we can imagine a matrix with N lines, or observations, and P columns, or variables. Yes, you can in fact think of an Excel or Google sheet.

The target variable

Supervised learning is characterized by the fact that one of the P variables in the dataset is a target variable. The target variable can also be called the response or dependent variable. In Lyn’s case, the target variable is the asking price of a specific apartment in Paris. In other cases, it could be the sales of a product, the buy/not-buy decision for a product or service, your face being recognized or not to unlock your smartphone, etc. As you can see, the target variable can be any variable of interest, as long as it is well defined, or labeled. In current applications, supervised learning problems are the most common ones.
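To make the matrix picture concrete, here is a minimal sketch of such a dataset in Python. The feature names are taken from Lyn's example, but every number below is invented for illustration, not a real Paris price.

```python
# A toy version of Lyn's dataset: N = 4 observations (apartments) and
# P = 4 variables. The last column, "price", is the target variable;
# the other columns are feature variables. All numbers are made up.
columns = ["rooms", "square_meters", "balcony", "price"]

dataset = [
    # rooms, m2, balcony (1 = yes), asking price in euros
    [1, 30, 1, 350_000],
    [2, 45, 0, 480_000],
    [3, 70, 1, 790_000],
    [1, 25, 0, 290_000],
]

n_observations = len(dataset)              # N: number of lines/observations
n_variables = len(columns)                 # P: number of columns/variables
features = [row[:-1] for row in dataset]   # everything except the target
target = [row[-1] for row in dataset]      # the target variable (price)

print(f"N = {n_observations}, P = {n_variables}")
```

In a spreadsheet this would simply be four filled rows under four column headers, with the price column being the one Lyn wants to explain.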
JEROEN ROMBOUTS - PROFESSOR OF ECONOMETRICS AND STATISTICS 2 CORE MODULE I - FOUNDATIONAL TECHNOLOGIES OF AI
With modern data science open source software packages this is trivial to do; just Google for it. The resulting dataset not only has the features we discussed before, but also contains the apartment floor (not all apartment buildings in Paris have elevators…), the heating system, whether there is a balcony, and the district in Paris.

The supervised learning algorithm will connect the feature variables to the target variable. We will omit the details of how this can be formalized mathematically, but it is important to know that there are many ways to do this and you cannot really tell in advance which method is best. The supervised learning algorithms you may already have heard about, and which are available in standard data science software packages, are: linear regression, logistic regression, decision trees, random forests, and neural networks.

Split the dataset

An interesting characteristic of supervised learning is that the original dataset can be split into training and test datasets. The training dataset is used for estimating, or training, the algorithm. This means that the parameters or decision rules that describe the algorithm are chosen such that a loss function is minimized. This loss function is directly related to the prediction error the algorithm makes on the training dataset. Optimization of this loss function is done by means of a data science software program such as Python.

So in practice, we split the dataset in two parts. The part of the original dataset that has not been used to train is called the test dataset. Using only the feature variables, predictions are made with the fitted algorithm. Doing this, we can contrast the predictions with the true target variable values from the test dataset and compute prediction errors on the test dataset. The best algorithm is then the one with the smallest prediction errors. Once the best algorithm is identified, it can be used for truly out-of-sample prediction, that is, on new observations for which only the features are observed.

Lyn decides to use 800 apartments to train the different candidate algorithms and then to test them on the remaining 200 apartments. It turns out that linear regression works best on this data, which she particularly appreciates since linear regressions are fairly simple to interpret. In terms of prediction, Lyn can now assess, for example, the asking price for an apartment in the 9th district of Paris with one room, 30 square meters, and a balcony. And she can do that even before visiting the apartment with a real estate agent, so that she can evaluate whether the current asking price is in line with the market.
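This train/test logic can be mimicked in a few lines of Python. The data below are simulated (roughly 10,000 euros per square meter plus noise), not real apartment prices, and the one-parameter pricing model is only a stand-in for the candidate algorithms Lyn would compare.

```python
import random

random.seed(0)

# Simulated data: (square_meters, asking_price) pairs where the price is
# roughly 10,000 euros per square meter plus noise. Invented numbers.
data = [(m2, 10_000 * m2 + random.gauss(0, 20_000)) for m2 in range(20, 120)]

# Split the original dataset into training (80%) and test (20%) parts.
random.shuffle(data)
cut = int(0.8 * len(data))
train, test = data[:cut], data[cut:]

# "Train" a one-parameter model price = rate * m2 by minimizing the
# squared-error loss on the training data (closed-form least squares).
rate = sum(m2 * p for m2, p in train) / sum(m2 * m2 for m2, _ in train)

# Contrast predictions with the true prices on the held-out test dataset.
mse = sum((p - rate * m2) ** 2 for m2, p in test) / len(test)
print(f"fitted rate: {rate:.0f} euros/m2, test mean squared error: {mse:.0f}")
```

Fitting several candidate models this way and comparing their test errors is exactly the selection step described above: the model with the smallest test error wins.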
SUPERVISED LEARNING
PART 2
Let me now give more precise details on the different supervised learning algorithms.

Linear regression

Linear regression can be applied when the target variable takes infinitely many possible values: the asking price for an apartment in Lyn’s case, but other examples are your electricity consumption sampled every 15 minutes with smart meters, or the sales of the different grocery stores of a supermarket chain. A linear regression explains the expected value of the target variable as a function of the feature variables. The parameters are extremely easy and fast to estimate, even in the case of large datasets, and they can typically be interpreted directly as the sensitivity of the target variable to each feature. So everything is easy to interpret, which is the reason why this model is extremely popular. Prediction with the linear regression model is equally easy given observed features: give me the characteristics of a new grocery store to be opened by a supermarket chain, and I will tell you the expected amount of sales. Are regression methods new? Not at all: they were invented in the 1800s by Gauss and Legendre, and in particular by Galton in 1877.
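To show what “sensitivity” means here, this minimal sketch fits a one-feature linear regression by least squares. The store sizes and sales figures are invented for illustration.

```python
# A one-feature linear regression fitted by ordinary least squares.
# Store sizes and weekly sales below are invented for illustration.
xs = [100, 150, 200, 250, 300]   # feature: store surface in square meters
ys = [210, 300, 410, 490, 610]   # target: weekly sales, in thousands of euros

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares estimates: slope b and intercept a.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# The slope is the sensitivity: expected extra sales per extra square meter.
new_store = 220
prediction = a + b * new_store
print(f"slope: {b:.2f}, prediction for a {new_store} m2 store: {prediction:.1f}")
```

The fitted slope answers the attribution question (how much does each extra square meter add to expected sales?), and plugging in a new store’s features answers the prediction question.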
Logistic regression

Logistic regression type models can be applied when the target variable takes a limited number of possible values, the most standard case being two values. For example, related to conversion rates, a product is bought or not on a website; or, related to churn, you stay a client of your telco company or not. The model parameters are relatively easy to estimate, and the importance of each feature for the probability of observing one of the outcomes can be computed. In terms of predictions, the logistic model allows classifying the target variable when only the features are observed. For example, given your socio-demographic characteristics, I will obtain the probability that you will buy a product online. If the probability is high, there is no need to push a promotion to you.
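To sketch the mechanics, here is a minimal example of the logistic formula turning one feature into a purchase probability. The feature (minutes spent on the site) and the coefficients are made up, as if they had already been estimated on past visitors.

```python
import math

# Logistic model: P(buy) = 1 / (1 + exp(-(a + b * x))).
# Coefficients are invented, as if already estimated on past visitors.
a, b = -4.0, 0.08   # intercept, sensitivity to minutes spent on the site

def buy_probability(minutes_on_site):
    """Probability that a visitor buys, given time spent on the website."""
    return 1.0 / (1.0 + math.exp(-(a + b * minutes_on_site)))

# Classify by thresholding the probability: push a promotion only when
# the predicted purchase probability is low.
for minutes in (10, 50, 90):
    p = buy_probability(minutes)
    action = "no promotion needed" if p >= 0.5 else "push a promotion"
    print(f"{minutes} min on site -> P(buy) = {p:.2f}, {action}")
```

The output is always a probability between 0 and 1, which is what makes the model suitable for two-valued targets where a straight line would not be.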
The previous two types of models allow identifying the significant feature variables that drive the target variable. This is what we call attribution: we can tell which features explain the target variable. On top of that, they allow us to quantify the expected change in the target variable when one of the features changes. Given their construction, the above models are called smooth.

The next methods that I will now explain are pure prediction models. This means that the impact of the features on the target is often not easy to assess, since these models are not smooth. Their only purpose is to maximize prediction performance.

Decision trees

Decision trees fall into the category of what we call nonparametric methods. There are no parameters to estimate, only cutoff thresholds. They apply to both quantitative and qualitative target variables. Decision trees are easy to interpret, since they can basically be pictured graphically as a tree with branches. In the simplest case of binary splits, each branch splits in two according to a value to be exceeded by one of the feature variables. As usual with easy-to-interpret algorithms, the machinery in the back is quite advanced, and I will spare you the details in this video. Just imagine that to construct the first branch of the tree you need to consider all the features and all possible cutoff values, quite a computationally heavy task. I can testify that you can start using your computer as home heating when the number of observations and features is large.

Random forests

Random forests are, as the term alludes to, the result of re-applying many decision trees to the same problem. Each of those trees is constructed using a bootstrap resampling technique and uses only a random subset of the features. In the case of a classification problem, the final prediction is determined by a majority vote over all trees. This method was proposed relatively recently, by Breiman in 2001.
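Both ideas, a single-split tree and a bootstrap-plus-majority-vote forest, can be sketched in a few lines of Python. The one-feature dataset and the forest size are invented for illustration, and note that with a single feature there is no feature subsampling here; a real random forest also draws a random subset of features for each tree.

```python
import random

random.seed(1)

# Toy classification data: one feature x in [0, 1), label 1 when x > 0.6.
data = [(x / 100, int(x / 100 > 0.6)) for x in range(100)]

def train_stump(sample):
    """A one-split decision tree: try every observed cutoff, keep the best."""
    best_cut, best_acc = 0.0, 0.0
    for cut, _ in sample:
        acc = sum(int(x > cut) == y for x, y in sample) / len(sample)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut

# The forest: each tree is trained on a bootstrap resample of the data,
# i.e. a sample of the same size drawn with replacement.
forest = [train_stump([random.choice(data) for _ in data]) for _ in range(25)]

def predict(x):
    """Final prediction by majority vote over all trees in the forest."""
    votes = sum(int(x > cut) for cut in forest)
    return int(votes > len(forest) / 2)

print(predict(0.3), predict(0.9))
```

The exhaustive loop over cutoffs inside `train_stump` is exactly the computationally heavy search described above, done here for a single split on a single feature.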
Neural networks

The last family of methods is deep neural networks and their extensions. They can be used in both regression and classification problems; that means AI needs a target variable in order to train the deep neural networks. The training of neural networks requires tuning often hundreds of thousands of parameters, and therefore a lot of data is necessary to obtain small prediction error rates. Think about image recognition, which is trained on massive sets of pictures and then performs amazingly well most of the time, but can at some point make embarrassing mistakes. In terms of the machinery in the back, you can imagine a sophisticated logistic regression algorithm.
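To make “sophisticated logistic regression” concrete, here is a minimal forward pass of a tiny network. The architecture and all the weights are invented; a real network learns its weights from data and has vastly more units.

```python
import math

def sigmoid(z):
    """The same squashing function used in logistic regression."""
    return 1.0 / (1.0 + math.exp(-z))

# A minimal network: 2 inputs -> 2 hidden units -> 1 output.
# Each unit is essentially a small logistic regression; stacking layers
# of them is what makes the network "deep". Weights are made up.
w_hidden = [[2.0, -1.0], [-1.5, 2.5]]   # one weight row per hidden unit
b_hidden = [0.5, -0.5]
w_out = [1.8, 1.2]
b_out = -1.0

def forward(x):
    """Forward pass: features in, probability out."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

p = forward([1.0, 0.0])
print(f"output probability: {p:.3f}")
```

Training would mean adjusting every weight to minimize a loss function on labeled data, which is why so many parameters require so much data.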