
SUPERVISED LEARNING

PART 1
We will consider two types of learning, supervised and unsupervised learning. Before even explaining each of those terms, what do we mean by learning in our introductory artificial intelligence course? Here, learning refers to acquiring knowledge from data with the objective of making better decisions than without using that data. By decisions, we can understand for example managerial decisions like “should I invest in this project?” but also platform-automated decisions like “What is shown to you on the landing page of Netflix?”. With our society, that is our personal lives, businesses and governments, being highly digitalized, unprecedented amounts of data are generated and stored in real time. These massive amounts of data, which we can call Big Data, can be used for improved decision making.

Of course, having access to such data is interesting, but it becomes useful only if we can clearly identify the relevant data sources and then quickly learn from them to make decisions. Yes, you might already be thinking that this sounds complicated. In fact, massive datasets and swift decision making are possible only thanks to the processing power of computers. Highly scalable computational resources and the full digitalization of our society are the main reasons for the so-called artificial intelligence revolution. Ok, now that we have computing power and data, how can we learn to make better decisions?

Example: Lyn & Paris real estate

The objective of this video is to understand the principles of supervised learning. The best way to start is with an example. Lyn has just been accepted to the prestigious master in management program at Essec Business School. Lyn currently lives in Brussels with her parents, and accommodation becomes a pressing issue. Since she plans to stay in France longer than the duration of her studies, and living and working in France is her childhood dream, she considers buying an apartment in Paris. Lyn knows that the Paris real estate market is different from the one in Brussels, but such a statement will not help her make an informed decision.

Luckily, Lyn completed a bachelor’s degree in computer science at Louvain University with a major in data science. Time to put her academic knowledge into practice. First, what is the question she would like to answer? Well, she would like to understand what type of apartments in Paris are for sale on the market, and what explains the asking price differences between these apartments.

So the question is: what determines the asking price of an apartment in Paris? To answer this question and make an informed decision about buying an apartment, Lyn needs to build a dataset. What is a dataset? At this stage, we can imagine a matrix with N lines, or observations, and P columns, or variables. Yes, you can think of an Excel or Google sheet in fact.

The target variable

Supervised learning is characterized by the fact that one of the P variables in the dataset is a target variable. The target variable can also be called the response or dependent variable. In Lyn’s case, the target variable is the asking price of a specific apartment in Paris. In other cases, it could be sales of a product, buy/not-buy of a product or service, your face being recognized or not to unlock your smartphone, etc. As you can see, the target variable can be any variable of interest, as long as it is well defined or labeled. In current applications, supervised learning problems are the most common ones.

What do we want to do with the target variable? We want to predict it with the remaining P-1 available variables in the dataset. The latter variables are often called feature or independent variables, or simply features. Unfortunately, not all feature variables are relevant to predict the target variable, and some important ones might even be missing from the dataset. We will need to search for those features which are related to the target variable. Note that we deliberately write that we wish to predict the target variable, and not to explain it. At this stage, we would like to find an algorithm that predicts well, irrespective of whether any causal relationship exists. In this sense, we can call supervised learning algorithms prediction machines.

Ok, Lyn has a supervised learning problem with the asking price as target variable. The features are the variables that potentially influence the apartment asking price: for example, the number of bedrooms and bathrooms, the size in square meters, the age of the building, etc. Lyn decides to activate her web scraping skills and extracts 1000 apartments for sale on a popular real estate website. With

JEROEN ROMBOUTS - PROFESSOR OF ECONOMETRICS AND STATISTICS 2 CORE MODULE I - FOUNDATIONAL TECHNOLOGIES OF AI
modern data science open-source software packages this is trivial to do. Just Google for it. The resulting dataset has not only the features we discussed before, but also contains the apartment floor (not all apartment buildings have elevators in Paris…), the heating system, whether there is a balcony, and the district in Paris.

The supervised learning algorithm will connect the feature variables to the target variable. We will omit the details of how this can be formalized mathematically, but it is important to know that there are many ways to do this and you cannot really tell in advance which method is best. The supervised learning algorithms you should have heard about, and which are available in standard data science software packages, are: linear regression, logistic regression, decision tree, random forest, and neural network.

Split the dataset

An interesting characteristic of supervised learning is that the original dataset can be split into training and test datasets. The training dataset is used for estimating, or training, the algorithm. This means that the parameters or decision rules that describe the algorithm are chosen such that a loss function is minimized. This loss function is directly related to the prediction error the algorithm makes on the training dataset. Optimization of this loss function is done by means of a data science software program such as Python.

What happens when we train the algorithm on the entire dataset without splitting it? Then we run the risk of overfitting, which means that the algorithm will produce very small prediction errors on the existing data but is bound to make large errors when new data is fed into the algorithm.

So in practice, we split the dataset in two parts. The part of the original dataset that has not been used for training is called the test dataset. Using only the feature variables, predictions are made with the fitted algorithm. Doing this, we can contrast the predictions with the true target variable values from the test dataset and compute prediction errors on the test dataset. The best algorithm is then the one with the smallest prediction errors. Once the best algorithm is identified, it can be used for truly out-of-sample prediction, that is, on new observations for which only the features are observed.

Lyn decides to use 800 apartments to train different candidate algorithms and then to test them on the remaining 200 apartments. It turns out that linear regression works best on this data, which she particularly appreciates since linear regressions are fairly simple to interpret. In terms of prediction, Lyn can now assess, for example, the asking price of an apartment in the 9th district of Paris with one room, 30 square meters, and a balcony. And she can do that even before visiting the apartment with a real estate agent, so that she can evaluate whether the current asking price is in line with the market.
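The train/test workflow described here can be sketched with scikit-learn. Everything below is invented for illustration: the feature names, the price formula, and the noise level are stand-ins for Lyn’s scraped dataset, not real Paris data.

```python
# Sketch of the split-and-compare workflow on a synthetic stand-in
# for Lyn's dataset of 1000 scraped apartments.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000  # 1000 apartments, as in the example

# Hypothetical features: size in square meters, bedrooms, balcony (0/1)
size = rng.uniform(15, 120, n)
bedrooms = rng.integers(0, 5, n)
balcony = rng.integers(0, 2, n)
X = np.column_stack([size, bedrooms, balcony])

# Hypothetical target: asking price in euros, a noisy linear rule
price = (50_000 + 10_000 * size + 20_000 * bedrooms
         + 15_000 * balcony + rng.normal(0, 30_000, n))

# 800 apartments to train, 200 to test, as in the text
X_train, X_test, y_train, y_test = train_test_split(
    X, price, train_size=800, random_state=0)

model = LinearRegression().fit(X_train, y_train)  # training step
test_error = mean_absolute_error(y_test, model.predict(X_test))
print(f"average prediction error on the 200 test apartments: {test_error:.0f} euros")
```

Any candidate algorithm can be fitted on the same 800 training rows and scored on the same 200 test rows; the one with the smallest test error wins.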

SUPERVISED LEARNING
PART 2
Let me now give more precise details on the different supervised learning algorithms:

Linear regression

Linear regression can be applied when the target variable takes infinitely many possible values: the asking price of an apartment in Lyn’s case, but other examples are your electricity consumption sampled every 15 minutes with smart meters, or the sales of the different grocery stores of a supermarket chain. A linear regression explains the expected value of the target variable as a function of the feature variables. The parameters are extremely easy and fast to estimate, even in the case of large datasets, and they can typically be directly interpreted as sensitivities of the target variable to each feature. So everything is easy to interpret, which is the reason why this model is extremely popular. In terms of predictions, the linear regression model makes it easy to predict the target given observed features. For example, give me the characteristics of a soon-to-open grocery store of a supermarket chain and I will tell you the expected amount of sales. Are regression methods new? Not at all: they were invented in the 1800s by Gauss and Legendre, and in particular Galton in 1877.

Logistic regression

Logistic regression type models can be applied when the target variable takes a limited number of possible values, the most standard case being two values. For example, related to conversion rates, a product is bought or not on a website. Or, related to churn, you stay a client of your telecom company or not. The model parameters are relatively easy to estimate, and the importance of each feature for the probability of observing one of the outcomes can be computed. In terms of predictions, the logistic model allows classifying the target variable when only the features are observed. For example, given your socio-demographic characteristics, I will obtain the probability that you will buy a product online. If the probability is high, there is no need to push a promotion to you.

The previous two types of models allow identifying the significant feature variables that drive the target variable. This is what we call attribution, that is, we can tell which features explain the target variable. On top of that, they allow us to quantify the expected change in the target variable when one of the features changes. Given their construction, the above models are called smooth.

The next methods, which I will explain now, are pure prediction models. This means that the impact of the features on the target is often not easy to assess, since these models are not smooth. Their only purpose is to maximize prediction performance.

Decision trees

Decision trees fall into the category of what we call nonparametric methods. There are no parameters to estimate, only cutoff thresholds. They apply both to quantitative and qualitative target variables. Decision trees are easy to interpret, since they can basically be represented graphically as a tree with branches. In the simplest case of binary splits, each branch splits in two according to whether one of the feature variables exceeds a cutoff value. As usual with easy-to-interpret algorithms, the machinery in the back is quite advanced. I will spare you the details in this video. Just imagine that to construct the first branch of the tree you need to consider all the features and all possible cutoff values, quite a computationally heavy task. I can testify that you can start using your computer as home heating when the number of observations and features is large.

Random forests

Random forests, as the term alludes to, re-apply many decision trees to the same problem. Each of those trees is constructed using a bootstrap resampling technique and uses only a random subset of the features. In the case of a classification problem, the final prediction is determined by the majority vote over all trees. This method was proposed relatively recently, by Breiman in 2001.

Where is artificial intelligence? In the narrow sense, AI is to be interpreted as intelligence reproduced by neural

networks or its extensions. They can be used in both regression and classification problems. That means AI needs a target variable in order to train the deep neural networks. The training of neural networks often requires tuning hundreds of thousands of parameters, and therefore a lot of data is necessary to obtain small prediction error rates. Think about image recognition that is trained on massive sets of pictures, and then performs amazingly well most of the time, but can at some point make embarrassing mistakes. In terms of the machinery in the back, you can imagine a sophisticated logistic regression algorithm.
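As a rough illustration of the algorithms discussed in this part, here is a minimal scikit-learn sketch that trains and compares them on the same classification task. The synthetic data-generating rule and all model settings are invented for illustration, not a recommendation.

```python
# Fit the classification algorithms from this part on synthetic data and
# score each one on a held-out test set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))  # five generic features
# Binary target with a mildly nonlinear rule plus noise
y = (X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=1000) > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=200, random_state=1)

models = {
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(max_depth=4),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=1),
    "neural network": MLPClassifier(hidden_layer_sizes=(32, 32),
                                    max_iter=2000, random_state=1),
}

for name, model in models.items():
    accuracy = model.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name}: test accuracy {accuracy:.2f}")
```

The smooth models (logistic regression) expose interpretable coefficients for attribution, while the tree-based models and the neural network trade that interpretability for flexibility, exactly as described above.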

To sum up, we have reviewed the most important supervised learning algorithms. The training and test datasets allow us to gauge performance among different algorithms. The best one will not necessarily remain superior over time and therefore needs continuous monitoring. In fact, we always need to watch out for concept drift. Concept drift means that the data generating mechanism changes over time.
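A crude way to picture such monitoring: train once, then track the prediction error on each new batch of data and raise a flag when it degrades. All numbers below (prices per square meter, batch sizes, the alarm threshold) are invented for illustration.

```python
# Sketch of concept-drift monitoring: a fixed model's error grows as the
# data-generating mechanism (price per square meter) shifts over time.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(2)

def simulate_batch(price_per_m2):
    """One month of synthetic listings: size in m2 -> asking price."""
    size = rng.uniform(15, 120, 200)
    price = price_per_m2 * size + rng.normal(0, 20_000, 200)
    return size.reshape(-1, 1), price

# Train once on pre-drift data (10,000 euros per m2, an invented figure)
X_train, y_train = simulate_batch(10_000)
model = LinearRegression().fit(X_train, y_train)
baseline = mean_absolute_error(y_train, model.predict(X_train))

# Demand falls: the price per m2 drops month after month
for month, price_per_m2 in enumerate([10_000, 9_500, 9_000, 8_000], start=1):
    X_new, y_new = simulate_batch(price_per_m2)
    error = mean_absolute_error(y_new, model.predict(X_new))
    drifted = error > 2 * baseline  # crude alarm rule, for illustration only
    print(f"month {month}: error {error:,.0f}{'  <- retrain!' if drifted else ''}")
```

Once the alarm fires, the remedy is the one described next: collect fresh data and redo the whole supervised learning exercise.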

Lyn scraped a Paris real estate website at the beginning of the COVID pandemic to get her dataset. However, many people working in Paris have since decided to live elsewhere in France to have more space while teleworking. Some workers come to Paris only one or two days a week and find it more flexible to stay in a hotel. The demand for apartments might therefore go down, and asking prices with it. In such a situation, Lyn’s prediction model risks systematically overestimating apartment prices. How can we solve such a problem? Exactly: get new data and do the supervised learning exercise again to determine which algorithm works best.

